Named Entity Recognition (NER) Model – English
Introduction
This is a large, cased BERT model for English that has been fine-tuned on CoNLL-2003, a dataset focused on named entity recognition (NER). The model is large because it was pretrained on a large corpus of English data, and it is cased, meaning that it distinguishes capitalization, e.g., english vs. English. The model is ready to use and achieves state-of-the-art performance on NER tasks.
NER is an application of Natural Language Processing (NLP) that processes and understands unstructured human language. It is also known as entity identification, entity chunking, and entity extraction. NER is a powerful building block for answering questions, retrieving information, and topic modeling.
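The model can be loaded directly through the Hugging Face transformers library. The snippet below is a minimal usage sketch (the example sentence is illustrative; weights are downloaded on first use):

```python
from transformers import pipeline

# Load the fine-tuned NER model from the Hugging Face Hub.
ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

results = ner("George Washington lived in Virginia.")
for entity in results:
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```

Setting `aggregation_strategy="simple"` groups sub-word tokens back into whole words, which mitigates the sub-word tagging issue noted in the Limitations section below.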
The original German-language BERT-NER model was released in October 2020; it consists of a German BERT language model trained collaboratively by the makers of the original German BERT (aka "bert-base-german-cased") and the dbmdz BERT (aka "bert-base-german-dbmdz-cased"). Since that release, BERT models for other languages have been built following the same process, and they have been used extensively in research.
Limitations
Note that this model is limited by its training dataset of entity-annotated news articles from a specific span of time, so it may not generalize well to all use cases in different domains. Furthermore, the model occasionally tags sub-word tokens as entities, and post-processing of the results may be necessary to handle those cases.
More Info
This BERT NER model was developed by the MDZ Digital Library team (dbmdz) at the Bayerische Staatsbibliothek (Bavarian State Library).
The original Hugging Face model page is sparse on details, but it is included here for reference: dbmdz/bert-large-cased-finetuned-conll03-english
Since there is great overlap with other BERT-based NER models, we recommend relying on similar models for additional information:
Dataset
The data used to train this model is the CoNLL-2003 (Conference on Computational Natural Language Learning 2003) shared task dataset.
This dataset concerns language-independent NER. It focuses on four types of named entities:
- People
- Locations
- Organizations
- Other miscellaneous entities that do not belong in the previous three groups.
The dataset files contain four columns separated by a single space. Each word is on a separate line, and there is an empty line after each sentence. Each line contains four items: (1) the word or token, (2) its part-of-speech (POS) tag, (3) its syntactic chunk tag, and (4) its named entity tag.
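As an illustrative sketch (not the official loader), the four-column format described above can be parsed with a few lines of Python:

```python
def parse_conll(lines):
    """Parse CoNLL-2003 formatted lines into sentences.

    Each sentence is a list of (token, pos_tag, chunk_tag, ner_tag) tuples.
    Sentences are separated by blank lines; -DOCSTART- lines are skipped.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:  # blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("-DOCSTART-"):  # document boundary marker
            continue
        else:
            token, pos, chunk, ner = line.split(" ")
            current.append((token, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences

sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""
sentences = parse_conll(sample.splitlines())
print(sentences)
```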
The chunk tags and the named entity tags have the format I-TYPE, which means the word is inside a phrase of type TYPE. Only when two phrases of the same type immediately follow each other does the first word of the second phrase receive the tag B-TYPE, to show that it starts a new phrase. A word with tag O is not part of any phrase. Note that this (Hugging Face) version of the dataset uses the IOB2 tagging scheme, whereas the original dataset uses IOB1.
The training dataset distinguishes between the beginning and continuation of an entity so that if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token will be classified as one of the following classes:
Abbreviation | Description |
---|---|
O | Outside of a named entity |
B-MISC | Beginning of a miscellaneous entity right after another miscellaneous entity |
I-MISC | Miscellaneous entity |
B-PER | Beginning of a person’s name right after another person’s name |
I-PER | Person’s name |
B-ORG | Beginning of an organization right after another organization |
I-ORG | Organization |
B-LOC | Beginning of a location right after another location |
I-LOC | Location |
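Given the tagging scheme above, a small post-processing step can group consecutive tags into entity spans. The helper below is an illustrative sketch (not part of the model itself):

```python
def extract_entities(tokens, tags):
    """Group BIO tags into (entity_type, text) spans.

    A B- tag always starts a new span; an I- tag continues a span of the
    same type, or starts a new one if no span of that type is open.
    """
    entities, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
                current_type, current_tokens = None, []
        else:
            prefix, etype = tag.split("-", 1)
            if prefix == "B" or etype != current_type:
                if current_tokens:
                    entities.append((current_type, " ".join(current_tokens)))
                current_type, current_tokens = etype, [token]
            else:
                current_tokens.append(token)
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
print(extract_entities(tokens, tags))
# → [('ORG', 'EU'), ('MISC', 'German'), ('MISC', 'British')]
```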
The CoNLL-2003 English dataset was derived from the Reuters corpus which consists of Reuters news stories.
# of training examples per entity type
Dataset | LOC | MISC | ORG | PER |
---|---|---|---|---|
Train | 7140 | 3438 | 6321 | 6600 |
Dev | 1837 | 922 | 1341 | 1842 |
Test | 1668 | 702 | 1661 | 1617 |
# of articles/sentences/tokens per dataset
Dataset | Articles | Sentences | Tokens |
---|---|---|---|
Train | 946 | 14,987 | 203,621 |
Dev | 216 | 3,466 | 51,362 |
Test | 231 | 3,684 | 46,435 |
Paper
Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
Authors: Erik F. Tjong Kim Sang, Fien De Meulder
Abstract
We describe the CoNLL-2003 shared task: language-independent named entity recognition. We give background information on the data sets (English and German) and the evaluation method, present a general overview of the systems that have taken part in the task and discuss their performance.
Dataset Instances
conll2003
- Size of downloaded dataset files: 4.63 MB
- Size of the generated dataset: 9.78 MB
- Total amount of disk used: 14.41 MB
The original data files contain -DOCSTART- lines: special lines that act as boundaries between two different documents.
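For illustration, a raw file can be split into per-document chunks by scanning for that marker (a hypothetical helper, not the official loader):

```python
def split_documents(lines):
    """Split raw CoNLL-2003 lines into per-document chunks at -DOCSTART-."""
    documents, current = [], []
    for line in lines:
        if line.startswith("-DOCSTART-"):
            if current:
                documents.append(current)
            current = []  # start a fresh document
        else:
            current.append(line)
    if current:
        documents.append(current)
    return documents

raw = [
    "-DOCSTART- -X- -X- O",
    "EU NNP B-NP B-ORG",
    "-DOCSTART- -X- -X- O",
    "Peter NNP B-NP B-PER",
    "Blackburn NNP I-NP I-PER",
]
docs = split_documents(raw)
print(docs)
```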
Data Fields
id: a string feature.
tokens: a list of string features.
pos_tags: a list of classification labels (int).
- Full tagset with indices:
{'"': 0, "''": 1, '#': 2, '$': 3, '(': 4, ')': 5, ',': 6, '.': 7, ':': 8, '``': 9, 'CC': 10, 'CD': 11, 'DT': 12, 'EX': 13, 'FW': 14, 'IN': 15, 'JJ': 16, 'JJR': 17, 'JJS': 18, 'LS': 19, 'MD': 20, 'NN': 21, 'NNP': 22, 'NNPS': 23, 'NNS': 24, 'NN|SYM': 25, 'PDT': 26, 'POS': 27, 'PRP': 28, 'PRP$': 29, 'RB': 30, 'RBR': 31, 'RBS': 32, 'RP': 33, 'SYM': 34, 'TO': 35, 'UH': 36, 'VB': 37, 'VBD': 38, 'VBG': 39, 'VBN': 40, 'VBP': 41, 'VBZ': 42, 'WDT': 43, 'WP': 44, 'WP$': 45, 'WRB': 46}
chunk_tags: a list of classification labels (int).
- Full tagset with indices:
{'O': 0, 'B-ADJP': 1, 'I-ADJP': 2, 'B-ADVP': 3, 'I-ADVP': 4, 'B-CONJP': 5, 'I-CONJP': 6, 'B-INTJ': 7, 'I-INTJ': 8, 'B-LST': 9, 'I-LST': 10, 'B-NP': 11, 'I-NP': 12, 'B-PP': 13, 'I-PP': 14, 'B-PRT': 15, 'I-PRT': 16, 'B-SBAR': 17, 'I-SBAR': 18, 'B-UCP': 19, 'I-UCP': 20, 'B-VP': 21, 'I-VP': 22}
ner_tags: a list of classification labels (int).
- Full tagset with indices:
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
Data Splits
Name | Train | Validation | Test |
---|---|---|---|
conll2003 | 14041 | 3250 | 3453 |
Data Formatting Example
id (string) | tokens (json) | pos_tags (json) | chunk_tags (json) | ner_tags (json) |
---|---|---|---|---|
0 | [ "EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "." ] | [ 22, 42, 16, 21, 35, 37, 16, 21, 7 ] | [ 11, 21, 11, 12, 21, 22, 11, 12, 0 ] | [ 3, 0, 7, 0, 0, 0, 7, 0, 0 ] |
1 | [ "Peter", "Blackburn" ] | [ 22, 22 ] | [ 11, 12 ] | [ 1, 2 ] |
2 | [ "BRUSSELS", "1996-08-22" ] | [ 22, 11 ] | [ 11, 12 ] | [ 5, 0 ] |
3 | [ "The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German", "advice", "to", "consumers", "to", "shun", "British", "lamb", "until", "scientists", "determine", "whether", "mad", "cow", "disease", "can", "be", "transmitted", "to", "sheep", "." ] | [ 12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 35, 24, 35, 37, 16, 21, 15, 24, 41, 15, 16, 21, 21, 20, 37, 40, 35, 21, 7 ] | [ 11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0 ] | [ 0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ] |
Licensing Information
From the CoNLL-2003 shared task page:
"The English data is a collection of news wire articles from the Reuters Corpus. The annotation has been done by people of the University of Antwerp. Because of copyright reasons we only make available the annotations. In order to build the complete data sets you will need access to the Reuters Corpus. It can be obtained for research purposes without any charge from NIST."
The copyrights are defined below, from the Reuters Corpus page:
"The stories in the Reuters Corpus are under the copyright of Reuters Ltd and/or Thomson Reuters, and their use is governed by the following agreements:
This agreement must be signed by the person responsible for the data at your organization, and sent to NIST.
This agreement must be signed by all researchers using the Reuters Corpus at your organization, and kept on file at your organization."
Citation Information
@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
title = "Introduction to the {C}o{NLL}-2003 Shared Task: Language-Independent Named Entity Recognition",
author = "Tjong Kim Sang, Erik F. and
De Meulder, Fien",
booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at {HLT}-{NAACL} 2003",
year = "2003",
url = "https://www.aclweb.org/anthology/W03-0419",
pages = "142--147",
}
More Information
- Language-Independent Named Entity Recognition (II) documentation from the University of Antwerp
- Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
Benchmarks
This model was trained on a single NVIDIA V100 GPU with the hyper-parameters recommended in the original BERT paper, which also trained and evaluated the model on the CoNLL-2003 NER task, yielding the following results:
metric | dev | test |
---|---|---|
f1 | 95.7 | 91.7 |
precision | 95.3 | 91.2 |
recall | 96.1 | 92.3 |
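As a quick sanity check, the reported F1 scores are consistent with the precision/recall pairs via the harmonic mean F1 = 2PR/(P+R):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(95.3, 96.1), 1))  # dev  → 95.7
print(round(f1(91.2, 92.3), 1))  # test → 91.7
```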
The test metrics are slightly lower than the official Google BERT results, which encoded document context and experimented with a CRF. More on replicating the original results can be found in this GitHub issue.
Research
The original paper was accepted at COLING 2020, the 28th International Conference on Computational Linguistics.
Title: German’s Next Language Model
Authors: Branden Chan, Stefan Schweter, Timo Möller
Abstract
In this work we present the experiments which lead to the creation of our BERT and ELECTRA based German language models, GBERT and GELECTRA. By varying the input training data, model size, and the presence of Whole Word Masking (WWM) we were able to attain SoTA performance across a set of document classification and named entity recognition (NER) tasks for both models of base and large size. We adopt an evaluation driven approach in training these models and our results indicate that both adding more data and utilizing WWM improve model performance. By benchmarking against existing German models, we show that these models are the best German models to date. Our trained models will be made publicly available to the research community.
Additional papers that have relied on one or more of these models may be found here: GitHub
Author
This BERT NER model was developed by the MDZ Digital Library team (dbmdz) at the Bayerische Staatsbibliothek (Bavarian State Library).
- Name: ner_english_v2
- Model Type: Text Token Classifier
- Description: Model that processes and understands unstructured human language, which is useful for answering questions, retrieving information, and topic modeling.
- Last Updated: Sep 05, 2022
- Privacy: PUBLIC