A model that processes and understands unstructured human language, which is useful for answering questions, retrieving information, and topic modeling.
This is a large, cased BERT model for English. It is large because it was pretrained on a large corpus of English data, and it has been fine-tuned on CoNLL-2003, a dataset focused on named entity recognition (NER). Because the model is cased, it recognizes capitalization differences, e.g., english vs. English. The model is ready to use and achieves state-of-the-art performance on NER tasks.
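As a rough sketch of how the model can be used, the snippet below runs it through the Hugging Face transformers token-classification pipeline. The model identifier and the example sentence are illustrative assumptions, not details taken from this page.

    # Minimal usage sketch; assumes the transformers library is installed and
    # that the model is published under the id below.
    from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

    model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"  # assumed model id
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name)

    ner = pipeline("ner", model=model, tokenizer=tokenizer)
    print(ner("George Washington lived in Virginia."))
    # Each result item carries the token, its entity tag (e.g. I-PER or I-LOC),
    # a confidence score, and character offsets.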
NER is an application of Natural Language Processing (NLP) that processes and understands unstructured human language. It is also known as entity identification, entity chunking, and entity extraction. NER is a powerful technique for answering questions, retrieving information, and topic modeling.
The original German-language BERT-NER model was released in October 2020, and it consists of a German BERT language model trained collaboratively by the makers of the original German BERT (a.k.a. "bert-base-german-cased") and the dbmdz BERT (a.k.a. "bert-base-german-dbmdz-cased"). Since this release, BERT models for other languages have been built using the same process, and they have been used extensively in research.
Limitations
Note that this model is limited by its training dataset of entity-annotated news articles from a specific span of time, so it may not generalize well to all use cases in different domains. Furthermore, the model occasionally tags sub-word tokens as entities, and post-processing of the results may be necessary to handle those cases.
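One lightweight way to handle separately tagged sub-word tokens is to let the pipeline group them back into whole entities. The sketch below uses the aggregation_strategy option of the transformers token-classification pipeline; the model id is again an assumption.

    from transformers import pipeline

    # "simple" aggregation merges contiguous sub-word pieces that share an entity
    # type into a single span with one label and one score.
    ner = pipeline(
        "ner",
        model="dbmdz/bert-large-cased-finetuned-conll03-english",  # assumed model id
        aggregation_strategy="simple",
    )
    print(ner("George Washington lived in Virginia."))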
More Info
This BERT NER model was developed by the MDZ Digital Library team (dbmdz) at the Bayerische Staatsbibliothek (Bavarian State Library).
The CoNLL-2003 dataset concerns language-independent NER. It focuses on four types of named entities:
People
Locations
Organizations
Other miscellaneous entities that do not belong in the previous three groups.
The dataset is organized into four columns separated by a single space. Each word is on a separate line, and there is an empty line after each sentence. Each line contains four items: (1) the word or token, (2) its part-of-speech (POS) tag, (3) its syntactic chunk tag, and (4) its named entity tag.
The chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE. Only when two phrases of the same type immediately follow each other does the first word of the second phrase receive the tag B-TYPE, to show that it starts a new phrase. A word with the tag O is not part of any phrase. Note that this dataset uses the IOB2 tagging scheme (every entity begins with a B- tag), whereas the original dataset uses IOB1.
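As an illustration only (not an excerpt from the actual files), a sentence formatted in this four-column scheme might look like the following:

    U.N. NNP B-NP B-ORG
    official NN I-NP O
    Ekeus NNP B-NP B-PER
    heads VBZ B-VP O
    for IN B-PP O
    Baghdad NNP B-NP B-LOC
    . . O O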
The training dataset distinguishes between the beginning and the continuation of an entity so that, if there are back-to-back entities of the same type, the model can indicate where the second entity begins. Following the dataset, each token is classified as one of the following classes (a short inspection sketch follows the table):
Abbreviation  Description
O             Outside of a named entity
B-MIS         Beginning of a miscellaneous entity right after another miscellaneous entity
I-MIS         Miscellaneous entity
B-PER         Beginning of a person’s name right after another person’s name
I-PER         Person’s name
B-ORG         Beginning of an organization right after another organization
I-ORG         Organization
B-LOC         Beginning of a location right after another location
I-LOC         Location
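For reference, here is a small sketch of inspecting these classes with the Hugging Face datasets library, assuming the dataset is available under the id "conll2003":

    from datasets import load_dataset

    dataset = load_dataset("conll2003")
    # Each example carries the sentence tokens plus aligned POS, chunk, and NER tag ids.
    example = dataset["train"][0]
    print(example["tokens"])
    print(example["ner_tags"])
    # The integer tags map back to label names following the scheme in the table above.
    print(dataset["train"].features["ner_tags"].feature.names)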
The CoNLL-2003 English dataset was derived from the Reuters Corpus, which consists of Reuters news stories.
From the abstract of the CoNLL-2003 shared task paper: "We describe the CoNLL-2003 shared task: language-independent named entity recognition. We give background information on the data sets (English and German) and the evaluation method, present a general overview of the systems that have taken part in the task and discuss their performance."
Dataset Instances
conll2003
Size of downloaded dataset files: 4.63 MB
Size of the generated dataset: 9.78 MB
Total amount of disk used: 14.41 MB
The original data files contain -DOCSTART- lines: special lines that act as boundaries between two different documents.
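When the raw files are parsed directly, these boundary lines (and the blank lines between sentences) are typically skipped. A minimal parsing sketch, assuming a local file named "eng.train" in the original four-column format:

    # Read a raw CoNLL-2003 file into sentences of (token, NER tag) pairs.
    # "eng.train" is a placeholder path; the real files must be rebuilt from the Reuters Corpus.
    sentences, current = [], []
    with open("eng.train", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("-DOCSTART-"):
                continue  # document boundary, not part of any sentence
            if not line:
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, pos, chunk, ner = line.split()
            current.append((token, ner))
    if current:
        sentences.append(current)
    print(len(sentences), "sentences")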
"The English data is a collection of news wire articles from the Reuters Corpus. The annotation has been done by people of the University of Antwerp. Because of copyright reasons we only make available the annotations. In order to build the complete data sets you will need access to the Reuters Corpus. It can be obtained for research purposes without any charge from NIST."
The copyrights are defined below, from the Reuters Corpus page:
"The stories in the Reuters Corpus are under the copyright of Reuters Ltd and/or Thomson Reuters, and their use is governed by the following agreements:
This agreement must be signed by all researchers using the Reuters Corpus at your organization, and kept on file at your organization."
Citation Information
@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
title = "Introduction to the {C}o{NLL}-2003 Shared Task: Language-Independent Named Entity Recognition",
author = "Tjong Kim Sang, Erik F. and
De Meulder, Fien",
booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at {HLT}-{NAACL} 2003",
year = "2003",
url = "https://www.aclweb.org/anthology/W03-0419",
pages = "142--147",
}
This model was trained on a single NVIDIA V100 GPU with the hyper-parameters recommended in the original BERT paper, which also trained and evaluated its model on the CoNLL-2003 NER task. Training yielded the following results (a hypothetical hyper-parameter sketch follows the table):
metric      dev    test
f1          95.7   91.7
precision   95.3   91.2
recall      96.1   92.3
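For context, the fine-tuning ranges recommended in the original BERT paper are a batch size of 16 or 32, a learning rate between 2e-5 and 5e-5, and 2 to 4 epochs. A hypothetical configuration along those lines with the transformers Trainer API might look like the following; the exact values used for this model are not stated here.

    from transformers import TrainingArguments

    # Hypothetical settings drawn from the ranges recommended in the BERT paper,
    # not the exact configuration used to produce the results above.
    training_args = TrainingArguments(
        output_dir="bert-ner-conll2003",  # placeholder output directory
        per_device_train_batch_size=32,
        learning_rate=3e-5,
        num_train_epochs=3,
        evaluation_strategy="epoch",
    )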
The test metrics are a little lower than the official Google BERT results, which encoded document context and experimented with CRF. More on replicating the original results can be found in this GitHub issue.
Research
The original paper was accepted at COLING 2020, the 28th International Conference on Computational Linguistics.
Authors: Branden Chan, Stefan Schweter, Timo Möller
Abstract
In this work we present the experiments which led to the creation of our BERT and ELECTRA based German language models, GBERT and GELECTRA. By varying the input training data, model size, and the presence of Whole Word Masking (WWM) we were able to attain SoTA performance across a set of document classification and named entity recognition (NER) tasks for both models of base and large size. We adopt an evaluation-driven approach in training these models and our results indicate that both adding more data and utilizing WWM improve model performance. By benchmarking against existing German models, we show that these models are the best German models to date. Our trained models will be made publicly available to the research community.
Additional papers that have relied on one or more of these models may be found here: GitHub
Author
This BERT NER model was developed by the MDZ Digital Library team (dbmdz) at the Bayerische Staatsbibliothek (Bavarian State Library).