
ner-english

Model that processes and understands unstructured human language, which is a useful process for answering questions, retrieving information, and topic modeling.

Notes

Named Entity Recognition (NER) Model – English

Introduction

This is a large, cased BERT model for English. It was pretrained on a large corpus of English data and then fine-tuned on CoNLL-2003, a dataset focused on named entity recognition (NER). The model is cased, meaning that it recognizes capitalization differences, e.g., english vs. English. The model is ready to use and achieves state-of-the-art performance on NER tasks.

NER is an application of Natural Language Processing (NLP) that processes and understands unstructured human language. It is also known as entity identification, entity chunking, and entity extraction. NER extraction is a powerful process for answering questions, retrieving information, and topic modeling.

The original German-language BERT-NER model was released in October 2020; it consists of a German BERT language model trained collaboratively by the makers of the original German BERT (a.k.a. "bert-base-german-cased") and the dbmdz BERT (a.k.a. "bert-base-german-dbmdz-cased"). Since this release, BERT models in other languages have been based on this process, and they have been used extensively in research.

Limitations

Note that this model is limited by its training dataset of entity-annotated news articles from a specific span of time. This may not generalize well for all use cases in different domains. Furthermore, the model occasionally tags sub-word tokens as entities and post-processing of results may be necessary to handle those cases.
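Such post-processing can be done in a few lines. Below is a minimal sketch, assuming the common WordPiece convention where continuation pieces are prefixed with "##" (the helper name and example inputs are illustrative, not part of the model's API):

```python
def merge_subwords(tokens, tags):
    """Merge WordPiece sub-word tokens (prefixed with '##') back into
    whole words, keeping the tag of the first piece of each word."""
    words, word_tags = [], []
    for token, tag in zip(tokens, tags):
        if token.startswith("##") and words:
            words[-1] += token[2:]      # glue continuation piece onto previous word
        else:
            words.append(token)
            word_tags.append(tag)       # first piece decides the word's tag
    return words, word_tags

words, tags = merge_subwords(
    ["Black", "##burn", "lives", "in", "Brussels"],
    ["I-PER", "I-PER", "O", "O", "I-LOC"],
)
# words == ["Blackburn", "lives", "in", "Brussels"]
```

Keeping the first piece's tag is one common heuristic; averaging the pieces' scores is another.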

More Info


Dataset

The data used to train this model is the CoNLL-2003 (Conference on Computational Natural Language Learning 2003) shared task dataset.

This dataset concerns language-independent NER. It focuses on four types of named entities:

  • People
  • Locations
  • Organizations
  • Other miscellaneous entities that do not belong to the previous three groups.

The contents of the dataset are four columns separated by a single space. Each word is on a separate line, and there is an empty line after each sentence. Each line contains four items: (1) the word or token, (2) its part-of-speech (POS) tag, (3) its syntactic chunk tag, and (4) its named entity tag.
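The layout above (four space-separated columns per line, blank lines between sentences) can be parsed directly. A minimal sketch, with an illustrative sample rather than real dataset contents:

```python
def parse_conll(text):
    """Parse CoNLL-2003-style text: one token per line with four
    space-separated columns (word, POS, chunk, NER); blank lines
    separate sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                # blank line marks a sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split(" ")
        current.append({"word": word, "pos": pos, "chunk": chunk, "ner": ner})
    if current:
        sentences.append(current)
    return sentences

sample = "EU NNP I-NP I-ORG\nrejects VBZ I-VP O\n\nPeter NNP I-NP I-PER\n"
parsed = parse_conll(sample)
# two sentences; the first token of the first sentence is tagged I-ORG
```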

The chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other does the first word of the second phrase receive the tag B-TYPE, showing that it starts a new phrase. A word with tag O is not part of a phrase. Note that this dataset uses the IOB2 tagging scheme, whereas the original dataset uses IOB1.
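The IOB1-to-IOB2 conversion implied above can be sketched as follows: in IOB2 every entity begins with B-TYPE, whereas IOB1 uses B-TYPE only when two same-type entities are adjacent (the function name is illustrative):

```python
def iob1_to_iob2(tags):
    """Convert IOB1 tags to IOB2: any I-TYPE tag that starts an entity
    (preceded by O or by a different entity type) becomes B-TYPE."""
    out = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            # entity start if the previous tag is O or has a different type
            if prev == "O" or prev[2:] != tag[2:]:
                tag = "B-" + tag[2:]
        out.append(tag)
        prev = tag
    return out

iob1_to_iob2(["I-ORG", "O", "I-MISC", "I-MISC", "B-MISC"])
# -> ["B-ORG", "O", "B-MISC", "I-MISC", "B-MISC"]
```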

The training dataset distinguishes between the beginning and continuation of an entity so that if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token will be classified as one of the following classes:

Abbreviation | Description
O            | Outside of a named entity
B-MISC       | Beginning of a miscellaneous entity right after another miscellaneous entity
I-MISC       | Miscellaneous entity
B-PER        | Beginning of a person's name right after another person's name
I-PER        | Person's name
B-ORG        | Beginning of an organization right after another organization
I-ORG        | Organization
B-LOC        | Beginning of a location right after another location
I-LOC        | Location
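Given these classes, entity spans can be recovered from a tag sequence: B-TYPE opens a span, I-TYPE continues it, and O closes it. A minimal decoding sketch (function name and example inputs are illustrative):

```python
def extract_entities(tokens, tags):
    """Group tokens into (entity_text, type) spans: B-TYPE starts a new
    span, I-TYPE continues the current one, O closes it."""
    entities, span, span_type = [], [], None
    for token, tag in zip(tokens, tags):
        # close the open span on O, on a new B- tag, or on a type change
        if tag == "O" or tag.startswith("B-") or (span_type and tag[2:] != span_type):
            if span:
                entities.append((" ".join(span), span_type))
                span, span_type = [], None
        if tag != "O":
            span.append(token)
            span_type = tag[2:]
    if span:
        entities.append((" ".join(span), span_type))
    return entities

extract_entities(
    ["Wolff", "lives", "in", "New", "York", "City"],
    ["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC"],
)
# -> [("Wolff", "PER"), ("New York City", "LOC")]
```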

The CoNLL-2003 English dataset was derived from the Reuters corpus which consists of Reuters news stories.

# of training examples per entity type

Dataset | LOC  | MISC | ORG  | PER
Train   | 7140 | 3438 | 6321 | 6600
Dev     | 1837 | 922  | 1341 | 1842
Test    | 1668 | 702  | 1661 | 1617

# of articles/sentences/tokens per dataset

Dataset | Articles | Sentences | Tokens
Train   | 946      | 14,987    | 203,621
Dev     | 216      | 3,466     | 51,362
Test    | 231      | 3,684     | 46,435

Paper

Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition

Authors: Erik F. Tjong Kim Sang, Fien De Meulder

Abstract

We describe the CoNLL-2003 shared task: language-independent named entity recognition. We give background information on the data sets (English and German) and the evaluation method, present a general overview of the systems that have taken part in the task and discuss their performance.

Dataset Instances

conll2003

  • Size of downloaded dataset files: 4.63 MB
  • Size of the generated dataset: 9.78 MB
  • Total amount of disk used: 14.41 MB

The original data files have -DOCSTART- lines used to separate documents. -DOCSTART- is a special line that acts as a boundary between two different documents.
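Splitting the raw file into documents on these boundary lines is straightforward; a sketch (the helper name is illustrative):

```python
def split_documents(lines):
    """Split raw CoNLL-2003 lines into documents using the special
    -DOCSTART- boundary lines; the boundary lines themselves are dropped."""
    docs, current = [], []
    for line in lines:
        if line.startswith("-DOCSTART-"):
            if current:
                docs.append(current)
            current = []
        else:
            current.append(line)
    if current:
        docs.append(current)
    return docs

split_documents([
    "-DOCSTART- -X- -X- O",
    "EU NNP I-NP I-ORG",
    "-DOCSTART- -X- -X- O",
    "Peter NNP I-NP I-PER",
])
# -> [["EU NNP I-NP I-ORG"], ["Peter NNP I-NP I-PER"]]
```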

Data Fields

  • id: a string feature.

  • tokens: a list of string features.

  • pos_tags: a list of classification labels (int).

    • Full tagset with indices:
      {'"': 0, "''": 1, '#': 2, '$': 3, '(': 4, ')': 5, ',': 6, '.': 7, ':': 8, '``': 9, 'CC': 10, 'CD': 11, 'DT': 12, 'EX': 13, 'FW': 14, 'IN': 15, 'JJ': 16, 'JJR': 17, 'JJS': 18, 'LS': 19, 'MD': 20, 'NN': 21, 'NNP': 22, 'NNPS': 23, 'NNS': 24, 'NN|SYM': 25, 'PDT': 26, 'POS': 27, 'PRP': 28, 'PRP$': 29, 'RB': 30, 'RBR': 31, 'RBS': 32, 'RP': 33, 'SYM': 34, 'TO': 35, 'UH': 36, 'VB': 37, 'VBD': 38, 'VBG': 39, 'VBN': 40, 'VBP': 41, 'VBZ': 42, 'WDT': 43, 'WP': 44, 'WP$': 45, 'WRB': 46}
      
  • chunk_tags: a list of classification labels (int).

    • Full tagset with indices:
{'O': 0, 'B-ADJP': 1, 'I-ADJP': 2, 'B-ADVP': 3, 'I-ADVP': 4, 'B-CONJP': 5, 'I-CONJP': 6, 'B-INTJ': 7, 'I-INTJ': 8, 'B-LST': 9, 'I-LST': 10, 'B-NP': 11, 'I-NP': 12, 'B-PP': 13, 'I-PP': 14, 'B-PRT': 15, 'I-PRT': 16, 'B-SBAR': 17, 'I-SBAR': 18, 'B-UCP': 19, 'I-UCP': 20, 'B-VP': 21, 'I-VP': 22}
      
  • ner_tags: a list of classification labels (int).

    • Full tagset with indices:
      {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
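The integer ner_tags can be mapped back to their string labels by inverting the tagset above; applied to the tokens and tags of the first example sentence from the dataset:

```python
# NER tagset exactly as published for the dataset
NER_TAGS = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4,
            'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
ID2LABEL = {i: label for label, i in NER_TAGS.items()}

tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tag_ids = [3, 0, 7, 0, 0, 0, 7, 0, 0]
labels = [ID2LABEL[i] for i in tag_ids]
# labels -> ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
# i.e., "EU" is an organization; "German" and "British" are miscellaneous entities
```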
      

Data Splits

Name      | Train  | Validation | Test
conll2003 | 14,041 | 3,250      | 3,453

Data Formatting Example

Each example pairs an id (string) with tokens, pos_tags, chunk_tags, and ner_tags (JSON lists):

id: 0
tokens: [ "EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "." ]
pos_tags: [ 22, 42, 16, 21, 35, 37, 16, 21, 7 ]
chunk_tags: [ 11, 21, 11, 12, 21, 22, 11, 12, 0 ]
ner_tags: [ 3, 0, 7, 0, 0, 0, 7, 0, 0 ]

id: 1
tokens: [ "Peter", "Blackburn" ]
pos_tags: [ 22, 22 ]
chunk_tags: [ 11, 12 ]
ner_tags: [ 1, 2 ]

id: 2
tokens: [ "BRUSSELS", "1996-08-22" ]
pos_tags: [ 22, 11 ]
chunk_tags: [ 11, 12 ]
ner_tags: [ 5, 0 ]

id: 3
tokens: [ "The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German", "advice", "to", "consumers", "to", "shun", "British", "lamb", "until", "scientists", "determine", "whether", "mad", "cow", "disease", "can", "be", "transmitted", "to", "sheep", "." ]
pos_tags: [ 12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 35, 24, 35, 37, 16, 21, 15, 24, 41, 15, 16, 21, 21, 20, 37, 40, 35, 21, 7 ]
chunk_tags: [ 11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0 ]
ner_tags: [ 0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]

Licensing Information

From the CoNLL-2003 shared task page:

"The English data is a collection of news wire articles from the Reuters Corpus. The annotation has been done by people of the University of Antwerp. Because of copyright reasons we only make available the annotations. In order to build the complete data sets you will need access to the Reuters Corpus. It can be obtained for research purposes without any charge from NIST."

The copyrights are defined below, from the Reuters Corpus page:

"The stories in the Reuters Corpus are under the copyright of Reuters Ltd and/or Thomson Reuters, and their use is governed by the following agreements:

Organizational agreement

This agreement must be signed by the person responsible for the data at your organization, and sent to NIST.

Individual agreement

This agreement must be signed by all researchers using the Reuters Corpus at your organization, and kept on file at your organization."

Citation Information

@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
    title = "Introduction to the {C}o{NLL}-2003 Shared Task: Language-Independent Named Entity Recognition",
    author = "Tjong Kim Sang, Erik F.  and
      De Meulder, Fien",
    booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at {HLT}-{NAACL} 2003",
    year = "2003",
    url = "https://www.aclweb.org/anthology/W03-0419",
    pages = "142--147",
}

More Information


Benchmarks

This model was trained on a single NVIDIA V100 GPU with the recommended hyperparameters from the original BERT paper, which trained and evaluated the model on the CoNLL-2003 NER task, yielding the following results:

metric    | dev  | test
f1        | 95.7 | 91.7
precision | 95.3 | 91.2
recall    | 96.1 | 92.3

The test metrics are slightly lower than the official Google BERT results, which encoded document context and experimented with CRF. More on replicating the original results can be found in this GitHub issue.


Research

The original paper was accepted for Coling-2020, the 28th International Conference on Computational Linguistics.

Title: German’s Next Language Model

Authors: Branden Chan, Stefan Schweter, Timo Möller

Abstract

In this work we present the experiments which lead to the creation of our BERT and ELECTRA based German language models, GBERT and GELECTRA. By varying the input training data, model size, and the presence of Whole Word Masking (WWM) we were able to attain SoTA performance across a set of document classification and named entity recognition (NER) tasks for both models of base and large size. We adopt an evaluation driven approach in training these models and our results indicate that both adding more data and utilizing WWM improve model performance. By benchmarking against existing German models, we show that these models are the best German models to date. Our trained models will be made publicly available to the research community.

Additional papers that have relied on one or more of these models may be found here: GitHub


Author

This BERT NER model was developed by the MDZ Digital Library team (dbmdz) at the Bayerische Staatsbibliothek (Bavarian State Library).

  • ID
  • Name
    ner_english_v2
  • Model Type ID
    Text Token Classifier
  • Description
    Model that processes and understands unstructured human language, which is a useful process for answering questions, retrieving information, and topic modeling.
  • Last Updated
    Sep 05, 2022
  • Privacy
    PUBLIC