In our previous blog posts in the series, we have described traditional methods for few-shot named entity recognition (NER) and discussed how large language models (LLMs) are being used to solve the NER task. In this post, we close the gap between these two areas and apply an LLM-based method for few-shot NER.
As a reminder, NER is the task of finding and categorizing named entities in text, for example, names of people, organizations, locations, etc. In a few-shot scenario, there are only a handful of labeled examples available for training or adapting an NER system, in contrast to the vast amounts of data typically needed to train a deep learning model.
Example of a labeled NER sentence
While Transformer-based models, such as BERT, have long been used as a backbone for models fine-tuned for NER, there has recently been increasing interest in understanding how effective it is to prompt pre-trained decoder-only LLMs with few-shot examples for a variety of tasks.
GPT-NER is a method of prompting LLMs to perform NER proposed by Shuhe Wang et al. They prompt a language model to detect a class of named entities, showing a few input and output examples in the prompt, where in the output the entities are marked with special symbols (@@ marks the start and ## the end of a named entity).
A GPT-NER prompt. All event entities in the example outputs in the prompt are marked with “@@” (beginning of the named entity) and “##” (end of the named entity)
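To make the output format concrete, here is a minimal sketch of how the marked entities could be read back out of a model's output. The function names and the example sentence are our own illustrations, not part of the GPT-NER code, and the sketch assumes the output is well-formed (every "@@" is followed by a matching "##").

```python
import re
from typing import List

# Non-greedy match of the text between an opening "@@" and the next closing "##".
ENTITY_PATTERN = re.compile(r"@@(.+?)##")


def extract_entities(marked_output: str) -> List[str]:
    """Return the entity strings wrapped in @@...## in a marked sentence."""
    return [match.strip() for match in ENTITY_PATTERN.findall(marked_output)]


def strip_markers(marked_output: str) -> str:
    """Remove the markers to recover the plain sentence."""
    return marked_output.replace("@@", "").replace("##", "")


# Hypothetical model output for the "event" class.
example = "He competed at the @@1996 Summer Olympics## in Atlanta ."
print(extract_entities(example))  # ['1996 Summer Olympics']
print(strip_markers(example))     # He competed at the 1996 Summer Olympics in Atlanta .
```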
While Wang et al. evaluate their method in the low-resource setting, they imitate this scenario by selecting a random subset of a larger, general-purpose dataset (CoNLL-2003). They also put considerable emphasis on choosing the best possible few-shot examples to include in the prompt; however, in a truly few-shot scenario there is no wealth of examples to choose from.
To close this gap, we apply the prompting method in a true few-shot scenario, using a purposefully constructed dataset for few-shot NER, specifically, the Few-NERD dataset.
The task of few-shot NER has gained popularity in recent years, but there is not much benchmark data focused on this specific task. Often, data scarcity for the few-shot case is simulated by using a larger dataset and selecting a random subset of it to use for training. Few-NERD is one dataset that was designed specifically for the few-shot NER task.
The few-shot dataset is organized in episodes. Each episode consists of a support set containing several few-shot examples (labeled sentences), and a query set for which labels need to be predicted using the information of the support set. The dataset has training, development, and test splits; however, as we are using a pre-trained LLM without any fine-tuning, we only use the test split in our experiments. The support sets serve as the few-shot examples provided in the prompt, and we predict the labels for the query sets.
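The way a support set turns into a prompt can be sketched as follows. This is a simplified illustration, not the code from our repository: it assumes each support sentence is a pair of token and per-token label lists (the real episode files may be structured differently), and the instruction wording is a placeholder rather than the exact GPT-NER prompt.

```python
from typing import List, Tuple

# Assumed representation: (tokens, per-token coarse-grained labels), "O" for non-entities.
Sentence = Tuple[List[str], List[str]]


def mark_entities(tokens: List[str], labels: List[str], target: str) -> str:
    """Wrap each maximal run of tokens labeled `target` in @@ ... ##.

    Adjacent entities of the same type are merged into one span, because
    plain per-token labels do not mark where one entity ends and the next begins.
    """
    out: List[str] = []
    inside = False
    for token, label in zip(tokens, labels):
        if label == target and not inside:
            out.append("@@" + token)
            inside = True
        elif label == target:
            out.append(token)
        else:
            if inside:
                out[-1] += "##"
                inside = False
            out.append(token)
    if inside:  # entity runs to the end of the sentence
        out[-1] += "##"
    return " ".join(out)


def build_prompt(support: List[Sentence], query_tokens: List[str], target: str) -> str:
    """Build a few-shot prompt for one entity class from a support set."""
    lines = [f"The task is to label {target} entities in the given sentence. "
             "Below are some examples."]
    for tokens, labels in support:
        lines.append("Input: " + " ".join(tokens))
        lines.append("Output: " + mark_entities(tokens, labels, target))
    lines.append("Input: " + " ".join(query_tokens))
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical one-sentence support set and query.
support = [(["Visit", "New", "York", "today"], ["O", "location", "location", "O"])]
print(build_prompt(support, ["She", "lives", "in", "Paris"], "location"))
```

Because the prompt targets a single class, the same query sentence has to be sent once per entity type in the episode.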
Coarse- and fine-grained entity types in the Few-NERD dataset (Ding et al., 2021)
The types, or classes, of named entities in Few-NERD have two levels: coarse-grained (person, location, etc.) and fine-grained (e.g. actor is a subclass of person, island is a subclass of location, etc.). In our experiments described here, we only deal with the easier coarse-grained classification.
The full dataset includes several tasks. The supervised task is not few-shot and is not organized in episodes: the data is split into train (70% of all data), development (10%), and test (20%) sets. The few-shot task organizes data in episodes and comes in two variants, intra and inter. In the intra task, each coarse-grained entity type is labeled in only one of the train, development, and test splits and is completely unseen in the other two. We use the inter task, where the same coarse-grained entity type may appear in all data splits, but each fine-grained type is labeled in only one of them. Furthermore, the dataset includes variants where either 5 or 10 entity types are present in an episode, and where either 1-2 or 5-10 examples per class are included in the support set of an episode.
In our experiments, we aimed to evaluate the GPT-NER prompting setup, but a) do so in a truly few-shot scenario using the Few-NERD dataset, and b) use LLMs from the Llama 2 family, which are available on the Clarifai platform, instead of the closed models used by the GPT-NER authors. Our code can be found in this GitHub repository.
We aim to answer these questions:
We compare the results along two dimensions: first, we compare the performance of different Llama 2 model sizes on the same dataset; then, we also compare the behavior of the models when a different number of few-shot input-output examples are shown in the prompt.
We compared the three different-sized Llama-2-chat models available on the Clarifai platform. As an example, let us look at the scores of 7B, 13B, and 70B models on the inter 5-way 1-2-shot Few-NERD test set.
The largest, 70B model has the best F1 scores, but the 13B model is worse on this metric than the smallest 7B model.
F1 scores of Llama 2 7B (blue), 13B (cyan), and 70B (black) models on the “inter” 5-way, 1~2-shot test set of Few-NERD
However, if we look at the precision and recall metrics which contribute to F1, the situation becomes even more nuanced. The 13B model turns out to have the best precision scores out of all three model sizes, and the 70B model is, in fact, the worst on precision for all classes.
Precision scores of Llama 2 7B (blue), 13B (cyan), and 70B (black) models on the “inter” 5-way, 1~2-shot test set of Few-NERD
This is compensated by recall, which is much higher for the 70B model than for the smaller ones. Thus, it seems that the largest model detects more named entities than the others, but the 13B model needs to be more certain about named entities to detect them. From these results, we can expect the 13B model to have the fewest false positives, and the 70B the fewest false negatives, while the smallest, 7B model falls somewhere in between on both types of errors.
Recall scores of Llama 2 7B (blue), 13B (cyan), and 70B (black) models on the “inter” 5-way, 1~2-shot test set of Few-NERD
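The trade-off between precision and recall can be illustrated with a small worked example. The sketch below computes entity-level precision, recall, and F1 from sets of predicted and gold entities. It is a generic illustration with made-up data, not necessarily the exact scoring used in our experiments.

```python
from typing import Set, Tuple

Entity = Tuple[int, str]  # (sentence index, entity text); an assumed representation


def precision_recall_f1(predicted: Set[Entity], gold: Set[Entity]) -> Tuple[float, float, float]:
    """Entity-level scores: a prediction counts as correct only if it is in the gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up gold annotations: four entities across two sentences.
gold = {(0, "A"), (0, "B"), (1, "C"), (1, "D")}

# A cautious model predicts one entity and gets it right.
cautious = {(0, "A")}
# A liberal model predicts six entities, three of them correct.
liberal = {(0, "A"), (0, "B"), (1, "C"), (0, "X"), (1, "Y"), (1, "Z")}

print(precision_recall_f1(cautious, gold))  # precision 1.0, recall 0.25, F1 about 0.4
print(precision_recall_f1(liberal, gold))   # precision 0.5, recall 0.75, F1 about 0.6
```

In this toy case the liberal model has the lower precision but the higher F1, the same pattern we see when comparing the 70B model with the 13B model.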
We also compare differently sized Llama 2 models on datasets with different numbers of named entity examples in few-shot prompts: 1-2 or 5-10 examples per (fine-grained) class.
As expected, all models do better when there are more few-shot examples in the prompt. At the same time, we notice that the difference in scores is much smaller for the 70B model than for the smaller ones, which suggests that the larger model can do well with fewer examples. The trend is not entirely consistent with model size though: for the medium-sized 13B model, the difference between seeing 1-2 or 5-10 examples in the prompt is the most drastic.
F1 scores of Llama 2 7B (left), 13B (center), and 70B (right) models on the “inter” 5-way 1~2-shot (blue) and 5~10-shot (cyan) test sets of Few-NERD
A few issues need to be considered when we prompt LLMs to do NER in the GPT-NER style.
A single sentence often contains more than one entity type, which means the LLM needs to be prompted separately for each type
Sometimes, the model output is not well-formed: in output 1, there is the opening tag “@@”, but the closing tag “##” never appears; in output 2, the model used the opening tag instead of the closing one
After producing the output sentence, the LLM keeps inventing new input-output pairs
Sometimes the LLM may generate tokens which are different from those in the input, for example, translating foreign words into English
Each split of the Few-NERD episode data labels only some entity classes and removes the annotations for all others, so by the nature of the data the model does not see full information for the coarse-grained classes. Only the data for the supervised task contains full labels, and some extra processing is needed if we want to match those. For instance, in the example below only the character is labeled in the episode data, but the actors are not. This can cause issues for both prompting and evaluation, and it may be one of the reasons for the larger model’s low precision scores: if the LLM has enough prior knowledge to label all the person entities, some of its predictions may be counted as false positives even though they are reasonable.
The labels are not always obviously correct: for example, here the character Spider-Man is labeled as a painting, and a racehorse is labeled as a person
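Several of the output problems listed above (unbalanced markers, invented follow-up examples, altered tokens) can be caught with simple checks before scoring. The following is a minimal sketch of such checks under our own assumptions: the prompt ends with "Output:", so the generation starts directly with the marked sentence, and sentences are whitespace-tokenized. The function names are hypothetical.

```python
import re
from typing import Optional


def first_output_line(generation: str) -> str:
    """Keep only the first non-empty line, dropping any invented follow-up pairs."""
    for line in generation.splitlines():
        if line.strip():
            return line.strip()
    return ""


def is_well_formed(marked: str) -> bool:
    """True if markers strictly alternate @@, ##, @@, ## ... and end closed."""
    is_open = False
    for marker in re.findall(r"@@|##", marked):
        if marker == "@@":
            if is_open:       # a second "@@" before the closing "##"
                return False
            is_open = True
        else:
            if not is_open:   # a "##" with no opening "@@"
                return False
            is_open = False
    return not is_open


def matches_input(marked: str, input_sentence: str) -> bool:
    """True if removing the markers reproduces the input tokens exactly."""
    plain = marked.replace("@@", "").replace("##", "")
    return plain.split() == input_sentence.split()


def clean_output(generation: str, input_sentence: str) -> Optional[str]:
    """Return the usable marked sentence, or None if the output should be discarded."""
    line = first_output_line(generation)
    if is_well_formed(line) and matches_input(line, input_sentence):
        return line
    return None


# Hypothetical generation that continues with an invented input-output pair.
raw = "The @@Olympics## began .\nInput: Another sentence\nOutput: Another sentence"
print(clean_output(raw, "The Olympics began ."))  # The @@Olympics## began .
```

Discarding an output is the simplest policy. Retrying the prompt or repairing the markers are alternatives, and which one is used affects the scores.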
An important note is that in Few-NERD, the classes have two levels of granularity: for example, “person-actor”, where “person” is the coarse-grained, and “actor” the fine-grained class. For now, we only consider the broader coarse-grained classes, which are easier for the models to detect than the more specific fine-grained classes would be.
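Restricting the task to coarse-grained classes amounts to dropping the fine-grained part of each label. A minimal sketch, assuming labels follow the "coarse-fine" pattern shown above and non-entity tokens are tagged "O" (the helper name is our own):

```python
from typing import List


def coarse_type(label: str) -> str:
    """Map a fine-grained label such as "person-actor" to its coarse type "person".

    Labels without a hyphen (for example "O") are returned unchanged.
    """
    return label.split("-", 1)[0]


def to_coarse(labels: List[str]) -> List[str]:
    """Convert a sentence's per-token labels to coarse-grained labels."""
    return [coarse_type(label) for label in labels]


print(to_coarse(["O", "person-actor", "person-actor", "O", "location-island"]))
# ['O', 'person', 'person', 'O', 'location']
```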
In the GPT-NER pre-print, there is some emphasis placed on the self-verification technique. After finding a named entity, the model is then prompted to reconsider its decision: given the sentence and the entity that the model found in that sentence, it has to answer whether that entity does indeed belong to the class in question. While we have replicated the basic GPT-NER setup with Few-NERD and Llama 2, we have not yet explored the self-verification technique in detail.
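For readers who want to try it, the self-verification step described above could look roughly like the sketch below. We have not run or evaluated this: the prompt wording is a placeholder of our own rather than the text used in the pre-print, and both function names are hypothetical.

```python
from typing import Optional


def build_verification_prompt(sentence: str, entity: str, entity_type: str) -> str:
    """Ask the model whether an extracted entity really belongs to the class."""
    return (
        f"The task is to verify whether a phrase extracted from the given "
        f"sentence is a {entity_type} entity.\n"
        f"Sentence: {sentence}\n"
        f'Is "{entity}" in the sentence a {entity_type} entity? '
        "Please answer with yes or no.\n"
        "Answer:"
    )


def parse_verdict(answer: str) -> Optional[bool]:
    """Read a yes/no verdict from the first word of the answer; None if unclear."""
    words = answer.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,!:;")
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


print(build_verification_prompt("The Olympics began .", "Olympics", "event"))
print(parse_verdict("Yes, it is."))  # True
```

An entity would be kept only when the verdict is True. How to treat unclear answers is a design choice we leave open.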
We focus on recreating the main setup of GPT-NER and use the prompts as shown in the pre-print. However, we think that the results could be improved and some of the issues described above could be fixed with more sophisticated prompt engineering. This is also something we leave for future experiments.
Finally, there are other exciting LLMs to experiment with, including the recently released Llama 3 models available on the Clarifai platform.
We applied the prompting approach of GPT-NER to the task of few-shot NER using the Few-NERD dataset and the Llama 2 models hosted by Clarifai. While there are a few issues to be considered, we have found that, as would be expected, the models do better when there are more few-shot examples shown in the prompt, but, less expectedly, the trends related to model sizes are varied. There is still a lot to be explored as well: better prompt engineering, more advanced techniques such as self-verification, how the models perform when detecting fine-grained instead of coarse-grained classes, and much more.
Try out one of the LLMs on the Clarifai platform today. Can’t find what you need? Consult our docs page or send us a message in our Community Discord channel.
© 2023 Clarifai, Inc. Terms of Service Content TakedownPrivacy Policy