Named Entity Recognition (NER) Model – English
Introduction
This is a large, cased BERT model for English that has been fine-tuned on CoNLL-2003, a dataset focused on named entity recognition (NER). The model is large because it was pretrained on a large corpus of English data, and it is cased, meaning that it distinguishes capitalization, e.g., english vs. English. The model is ready to use and achieves state-of-the-art performance on NER tasks.
NER is an application of Natural Language Processing (NLP) that processes and understands unstructured human language. It is also known as entity identification, entity chunking, and entity extraction. NER is a powerful building block for answering questions, retrieving information, and topic modeling.
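The model can be loaded directly through the Hugging Face transformers library. The snippet below is a minimal usage sketch (the example sentence is illustrative; weights are downloaded on first use):

```python
from transformers import pipeline

# Load the fine-tuned NER model from the Hugging Face Hub.
ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

results = ner("George Washington lived in Virginia.")
for entity in results:
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```

Setting `aggregation_strategy="simple"` groups sub-word tokens back into whole words, which mitigates the sub-word tagging issue noted in the Limitations section below.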
The original German-language BERT-NER model was released in October 2020; it consists of a German BERT language model trained collaboratively by the makers of the original German BERT (aka "bert-base-german-cased") and the dbmdz BERT (aka "bert-base-german-dbmdz-cased"). Since that release, BERT models for other languages have been built following the same process, and they have been used extensively in research.
Limitations
Note that this model is limited by its training dataset of entity-annotated news articles from a specific span of time, so it may not generalize well to all use cases in different domains. Furthermore, the model occasionally tags sub-word tokens as entities, and post-processing of the results may be necessary to handle those cases.
More Info
This BERT NER model was developed by the MDZ Digital Library team (dbmdz) at the Bayerische Staatsbibliothek (Bavarian State Library).
The original Hugging Face model page is sparse on details, but it is included here for reference: dbmdz/bert-large-cased-finetuned-conll03-english
Since there is great overlap with other BERT-based NER models, we recommend relying on similar models for additional information:
Dataset
The data used to train this model is the CoNLL-2003 (Conference on Computational Natural Language Learning 2003) shared task dataset.
This dataset concerns language-independent NER. It focuses on four types of named entities:
- People
- Locations
- Organizations
- Other miscellaneous entities that do not belong in the previous three groups.
The dataset files contain four columns separated by a single space. Each word is on a separate line, and there is an empty line after each sentence. Each line contains four items: (1) the word or token, (2) its part-of-speech (POS) tag, (3) its syntactic chunk tag, and (4) its named entity tag.
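As an illustrative sketch (not the official loader), the four-column format described above can be parsed with a few lines of Python:

```python
def parse_conll(lines):
    """Parse CoNLL-2003 formatted lines into sentences.

    Each sentence is a list of (token, pos_tag, chunk_tag, ner_tag) tuples.
    Sentences are separated by blank lines; -DOCSTART- lines are skipped.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:  # blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("-DOCSTART-"):  # document boundary marker
            continue
        else:
            token, pos, chunk, ner = line.split(" ")
            current.append((token, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences

sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""
sentences = parse_conll(sample.splitlines())
print(sentences)
```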
The chunk tags and the named entity tags have the format I-TYPE, which means the word is inside a phrase of type TYPE. Only when two phrases of the same type immediately follow each other does the first word of the second phrase receive the tag B-TYPE, to show that it starts a new phrase. A word with tag O is not part of any phrase. Note that this (Hugging Face) version of the dataset uses the IOB2 tagging scheme, whereas the original dataset uses IOB1.
The training dataset distinguishes between the beginning and continuation of an entity so that if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token will be classified as one of the following classes:
Abbreviation | Description |
---|---|
O | Outside of a named entity |
B-MISC | Beginning of a miscellaneous entity right after another miscellaneous entity |
I-MISC | Miscellaneous entity |
B-PER | Beginning of a person’s name right after another person’s name |
I-PER | Person’s name |
B-ORG | Beginning of an organization right after another organization |
I-ORG | Organization |
B-LOC | Beginning of a location right after another location |
I-LOC | Location |
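Given the tagging scheme above, a small post-processing step can group consecutive tags into entity spans. The helper below is an illustrative sketch (not part of the model itself):

```python
def extract_entities(tokens, tags):
    """Group BIO tags into (entity_type, text) spans.

    A B- tag always starts a new span; an I- tag continues a span of the
    same type, or starts a new one if no span of that type is open.
    """
    entities, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
                current_type, current_tokens = None, []
        else:
            prefix, etype = tag.split("-", 1)
            if prefix == "B" or etype != current_type:
                if current_tokens:
                    entities.append((current_type, " ".join(current_tokens)))
                current_type, current_tokens = etype, [token]
            else:
                current_tokens.append(token)
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
print(extract_entities(tokens, tags))
# → [('ORG', 'EU'), ('MISC', 'German'), ('MISC', 'British')]
```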
The CoNLL-2003 English dataset was derived from the Reuters corpus which consists of Reuters news stories.
# of training examples per entity type
Dataset | LOC | MISC | ORG | PER |
---|---|---|---|---|
Train | 7140 | 3438 | 6321 | 6600 |
Dev | 1837 | 922 | 1341 | 1842 |
Test | 1668 | 702 | 1661 | 1617 |
# of articles/sentences/tokens per dataset
Dataset | Articles | Sentences | Tokens |
---|---|---|---|
Train | 946 | 14,987 | 203,621 |
Dev | 216 | 3,466 | 51,362 |
Test | 231 | 3,684 | 46,435 |
Paper
Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
Authors: Erik F. Tjong Kim Sang, Fien De Meulder
Abstract
We describe the CoNLL-2003 shared task: language-independent named entity recognition. We give background information on the data sets (English and German) and the evaluation method, present a general overview of the systems that have taken part in the task and discuss their performance.
Dataset Instances
conll2003
- Size of downloaded dataset files: 4.63 MB
- Size of the generated dataset: 9.78 MB
- Total amount of disk used: 14.41 MB
The original data files contain -DOCSTART- lines: special lines that act as boundaries between two different documents.
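For illustration, a raw file can be split into per-document chunks by scanning for that marker (a hypothetical helper, not the official loader):

```python
def split_documents(lines):
    """Split raw CoNLL-2003 lines into per-document chunks at -DOCSTART-."""
    documents, current = [], []
    for line in lines:
        if line.startswith("-DOCSTART-"):
            if current:
                documents.append(current)
            current = []  # start a fresh document
        else:
            current.append(line)
    if current:
        documents.append(current)
    return documents

raw = [
    "-DOCSTART- -X- -X- O",
    "EU NNP B-NP B-ORG",
    "-DOCSTART- -X- -X- O",
    "Peter NNP B-NP B-PER",
    "Blackburn NNP I-NP I-PER",
]
docs = split_documents(raw)
print(docs)
```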
Data Fields
id: a string feature.
tokens: a list of string features.
pos_tags: a list of classification labels (int).
- Full tagset with indices:
{'"': 0, "''": 1, '#': 2, '$': 3, '(': 4, ')': 5, ',': 6, '.': 7, ':': 8, '``': 9, 'CC': 10, 'CD': 11, 'DT': 12, 'EX': 13, 'FW': 14, 'IN': 15, 'JJ': 16, 'JJR': 17, 'JJS': 18, 'LS': 19, 'MD': 20, 'NN': 21, 'NNP': 22, 'NNPS': 23, 'NNS': 24, 'NN|SYM': 25, 'PDT': 26, 'POS': 27, 'PRP': 28, 'PRP$': 29, 'RB': 30, 'RBR': 31, 'RBS': 32, 'RP': 33, 'SYM': 34, 'TO': 35, 'UH': 36, 'VB': 37, 'VBD': 38, 'VBG': 39, 'VBN': 40, 'VBP': 41, 'VBZ': 42, 'WDT': 43, 'WP': 44, 'WP$': 45, 'WRB': 46}
chunk_tags: a list of classification labels (int).
- Full tagset with indices:
{'O': 0, 'B-ADJP': 1, 'I-ADJP': 2, 'B-ADVP': 3, 'I-ADVP': 4, 'B-CONJP': 5, 'I-CONJP': 6, 'B-INTJ': 7, 'I-INTJ': 8, 'B-LST': 9, 'I-LST': 10, 'B-NP': 11, 'I-NP': 12, 'B-PP': 13, 'I-PP': 14, 'B-PRT': 15, 'I-PRT': 16, 'B-SBAR': 17, 'I-SBAR': 18, 'B-UCP': 19, 'I-UCP': 20, 'B-VP': 21, 'I-VP': 22}
ner_tags: a list of classification labels (int).
- Full tagset with indices:
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
Data Splits
Name | Train | Validation | Test |
---|---|---|---|
conll2003 | 14041 | 3250 | 3453 |
Data Formatting Example
id (string) | tokens (json) | pos_tags (json) | chunk_tags (json) | ner_tags (json) |
---|---|---|---|---|
0 | [ "EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "." ] | [ 22, 42, 16, 21, 35, 37, 16, 21, 7 ] | [ 11, 21, 11, 12, 21, 22, 11, 12, 0 ] | [ 3, 0, 7, 0, 0, 0, 7, 0, 0 ] |
1 | [ "Peter", "Blackburn" ] | [ 22, 22 ] | [ 11, 12 ] | [ 1, 2 ] |
2 | [ "BRUSSELS", "1996-08-22" ] | [ 22, 11 ] | [ 11, 12 ] | [ 5, 0 ] |
3 | [ "The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German", "advice", "to", "consumers", "to", "shun", "British", "lamb", "until", "scientists", "determine", "whether", "mad", "cow", "disease", "can", "be", "transmitted", "to", "sheep", "." ] | [ 12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 35, 24, 35, 37, 16, 21, 15, 24, 41, 15, 16, 21, 21, 20, 37, 40, 35, 21, 7 ] | [ 11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0 ] | [ 0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ] |
Licensing Information
From the CoNLL-2003 shared task page:
"The English data is a collection of news wire articles from the Reuters Corpus. The annotation has been done by people of the University of Antwerp. Because of copyright reasons we only make available the annotations. In order to build the complete data sets you will need access to the Reuters Corpus. It can be obtained for research purposes without any charge from NIST."
The copyrights are defined below, from the Reuters Corpus page:
"The stories in the Reuters Corpus are under the copyright of Reuters Ltd and/or Thomson Reuters, and their use is governed by the following agreements:
This agreement must be signed by the person responsible for the data at your organization, and sent to NIST.
This agreement must be signed by all researchers using the Reuters Corpus at your organization, and kept on file at your organization."
Citation Information
@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
title = "Introduction to the {C}o{NLL}-2003 Shared Task: Language-Independent Named Entity Recognition",
author = "Tjong Kim Sang, Erik F. and
De Meulder, Fien",
booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at {HLT}-{NAACL} 2003",
year = "2003",
url = "https://www.aclweb.org/anthology/W03-0419",
pages = "142--147",
}
More Information
- Language-Independent Named Entity Recognition (II) documentation from the University of Antwerp
- Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
Benchmarks
This model was trained on a single NVIDIA V100 GPU with the hyper-parameters recommended in the original BERT paper, which also trained and evaluated the model on the CoNLL-2003 NER task, yielding the following results:
metric | dev | test |
---|---|---|
f1 | 95.7 | 91.7 |
precision | 95.3 | 91.2 |
recall | 96.1 | 92.3 |
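As a quick sanity check, the reported F1 scores are consistent with the precision/recall pairs via the harmonic mean F1 = 2PR/(P+R):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(95.3, 96.1), 1))  # dev  → 95.7
print(round(f1(91.2, 92.3), 1))  # test → 91.7
```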
The test metrics are slightly lower than the official Google BERT results, which encoded document context and experimented with a CRF. More on replicating the original results can be found in this GitHub issue.
Research
The original paper was accepted at COLING 2020, the 28th International Conference on Computational Linguistics.
Title: German’s Next Language Model
Authors: Branden Chan, Stefan Schweter, Timo Möller
Abstract
In this work we present the experiments which lead to the creation of our BERT and ELECTRA based German language models, GBERT and GELECTRA. By varying the input training data, model size, and the presence of Whole Word Masking (WWM) we were able to attain SoTA performance across a set of document classification and named entity recognition (NER) tasks for both models of base and large size. We adopt an evaluation driven approach in training these models and our results indicate that both adding more data and utilizing WWM improve model performance. By benchmarking against existing German models, we show that these models are the best German models to date. Our trained models will be made publicly available to the research community.
Additional papers that have relied on one or more of these models may be found here: GitHub
Author
This BERT NER model was developed by the MDZ Digital Library team (dbmdz) at the Bayerische Staatsbibliothek (Bavarian State Library).
- Name: ner_english_v2
- Model Type: Text Token Classifier
- Description: Model that processes and understands unstructured human language, which is useful for answering questions, retrieving information, and topic modeling.
- Last Updated: Sep 05, 2022
- Privacy: PUBLIC