hkunlp_instructor-xl
An embedding model that generates text embeddings tailored to any task (e.g., classification, clustering, text evaluation) and domain (e.g., science, finance) simply by providing the task instruction, without any fine-tuning.
Notes
Overview
Instructor is a method for computing text embeddings given task instructions: every text input is embedded together with an instruction explaining the use case (e.g., a task and domain description). Unlike the more specialized encoders of prior work, Instructor is a single embedder that can generate text embeddings tailored to different downstream tasks and domains without any further training.
Generalizable T5-based dense Retriever (GTR) models are used as the backbone encoder. The Instructor embedding model family consists of three models:
- Instructor-base: built on GTR-Base.
- Instructor-large: built on GTR-Large, with 330M parameters.
- Instructor-xl: built on GTR-XL, with 1.5B parameters.
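As a quick illustration, here is a minimal sketch of computing an instruction-conditioned embedding with the InstructorEmbedding package; the instruction string follows the pattern from the project's README, and the exact wording is illustrative:

```python
# pip install InstructorEmbedding sentence-transformers
from InstructorEmbedding import INSTRUCTOR

# Load the XL checkpoint from the Hugging Face Hub.
model = INSTRUCTOR("hkunlp/instructor-xl")

# Each input is an [instruction, text] pair; the instruction describes
# the task and domain the embedding will be used for.
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"

embeddings = model.encode([[instruction, sentence]])
print(embeddings.shape)  # e.g., (1, 768)
```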
What does Instructor do?
- The model’s embeddings are the way it represents a document.
- The model is given instructions/descriptions of the task the embeddings will be used for, along with the document that it should embed.
- Since a single document can be represented differently for each task (instruction) provided, one model can serve a large variety of tasks.
Ultimately, Instructor is capable of generating task-specific embeddings for a given document.
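To make this concrete, the sketch below embeds one document under two different task instructions (both instructions are made up for the example) and compares the resulting vectors; their cosine similarity is below 1, showing that the representation shifts with the task:

```python
from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity

model = INSTRUCTOR("hkunlp/instructor-xl")

document = "Apple reported record quarterly revenue driven by iPhone sales."

# The same document, represented for two different downstream tasks.
pairs = [
    ["Represent the Finance statement for classification:", document],
    ["Represent the News article for clustering:", document],
]
emb = model.encode(pairs)

# The two embeddings differ because the instructions differ.
print(cosine_similarity(emb[0:1], emb[1:2]))
```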
Use cases
- Text Classification: Instructor can classify text into different categories or topics, such as news articles, customer reviews, or social media posts.
- Information Retrieval: It helps retrieve relevant documents or information based on user queries, improving search engines or recommendation systems (see the retrieval sketch after this list).
- Question Answering: The model can provide accurate answers to questions by understanding the context of the given document.
- Content Moderation: Instructor assists in identifying harmful or inappropriate content, enabling platforms to maintain a safe and clean environment.
- Multitask Learning: A single model generates task-specific and domain-specific embeddings for various text-related tasks, reducing the need for multiple models and saving resources.
- Efficient Deployment: Deploying Instructor is less resource-intensive than training and deploying numerous large models, making it more practical for companies with diverse text-related tasks.
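A minimal retrieval sketch, assuming the same package as above; the query, corpus, and instructions are invented for the example:

```python
from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity

model = INSTRUCTOR("hkunlp/instructor-xl")

query = [["Represent the question for retrieving supporting documents:",
          "How do vaccines train the immune system?"]]
corpus = [
    ["Represent the document for retrieval:",
     "Vaccines expose the body to a harmless form of a pathogen."],
    ["Represent the document for retrieval:",
     "The stock market closed higher on Friday."],
]

q_emb = model.encode(query)
c_emb = model.encode(corpus)

# Rank corpus documents by cosine similarity to the query.
scores = cosine_similarity(q_emb, c_emb)[0]
print(scores.argsort()[::-1])  # document indices, best match first
```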
Instructor Training
Model Architecture
Instructor uses Generalizable T5-based dense Retriever (GTR) models, which have shown strong performance on retrieval tasks. T5 models have an encoder-decoder architecture; GTR models use only the T5 encoder. Different GTR sizes (GTR-Base, GTR-Large, GTR-XL) back the corresponding Instructor variants (INSTRUCTOR-Base, INSTRUCTOR-Large, INSTRUCTOR-XL). The GTR models are initialized from T5 models pretrained on a web corpus and then fine-tuned on information search datasets.
Having different sizes of GTR models makes it possible to see how well instruction-based models perform as they get bigger. Given a text input x and an instruction I_x, the two are concatenated (I_x ⊕ x) and processed by the INSTRUCTOR model to produce a fixed-size embedding E_I(I_x, x), computed by mean pooling over the hidden representations of the tokens in the input text x.
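The pooling step can be sketched with plain transformers code. This is an illustrative approximation, not the released model's internal implementation: it concatenates instruction and text, runs a T5 encoder (t5-base stands in for the GTR backbone here), and mean-pools only over the tokens of x by skipping the instruction prefix:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")

instruction = "Represent the Science sentence: "
text = "Photosynthesis converts light into chemical energy."

# Encode the concatenation I_x ⊕ x.
inputs = tokenizer(instruction + text, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, d)

# Mean-pool only over the tokens of x: skip the instruction prefix
# (minus 1 to drop the </s> token the tokenizer appends).
n_instr = tokenizer(instruction, return_tensors="pt").input_ids.shape[1] - 1
embedding = hidden[0, n_instr:].mean(dim=0)  # fixed-size E_I(I_x, x)
print(embedding.shape)
```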
Embedding Visualization
The authors visualized embeddings using t-SNE to show how the distance between documents changes based on the instructions provided. When instructions are provided, documents that share the same sentiment move closer together, while those with different sentiments move further apart.
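A visualization along those lines can be reproduced with scikit-learn; a rough sketch with made-up review sentences (note that t-SNE's perplexity must stay below the number of points):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")

reviews = [
    "I absolutely loved this movie.",
    "A delightful, heartwarming story.",
    "Terrible acting and a boring plot.",
    "One of the worst films I have seen.",
]
pairs = [["Represent the Review sentence for classifying sentiment:", r]
         for r in reviews]
emb = model.encode(pairs)

# Project the embeddings to 2D; same-sentiment reviews should cluster.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(emb)
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), r in zip(coords, reviews):
    plt.annotate(r[:20], (x, y))
plt.show()
```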
Dataset
For training, the authors collect 330 different datasets and combine them to form a single dataset that they name Multitask Embeddings Data with Instructions (MEDI). The datasets span multiple task types and domains.
Evaluation
For evaluating the Instructor model, the authors leverage the Massive Text Embedding Benchmark (MTEB).
MTEB is a massive benchmark for measuring the performance of text embedding models on diverse embedding tasks; it includes 56 datasets across 8 task types.
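Running an MTEB task against the model can be sketched with the mteb package. The task name below is a real MTEB task, but the choice is illustrative; note that the paper's reported numbers additionally prepend a task-specific instruction to every text, which this plain call does not do:

```python
from mteb import MTEB
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")

# Evaluate on a single MTEB task; results are written to the output folder.
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/instructor-xl")
```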
Instructor-xl sits at the top of the MTEB leaderboard on average across all tasks.
Instructor achieves state-of-the-art results on 70 diverse embedding tasks. Notably, 66 of these 70 evaluation tasks were unseen during training, which means the model generalizes well to tasks it has not seen before. An unseen task can also correspond to a new instruction that was not seen during training, so the model generalizes the role instructions play as well.
INSTRUCTOR is strong because:
- Efficient: Fast adaptation! INSTRUCTOR can calculate domain-specific and task-aware embeddings without any further training.
- General: Any task! INSTRUCTOR can be applied to any task for computing fixed-length embeddings of texts.
- Performance: State-of-the-art! INSTRUCTOR achieves state-of-the-art performance on 70 datasets and surpasses models an order of magnitude larger.
Analysis and Conclusion
- Instruction complexity: the more detailed the instructions are, the better the model performs.
- Performance on unseen domains: Instructor embeddings generalize to domains of data that were not seen during training, so good performance can be expected without fine-tuning the model for new domains.
- Instruction robustness: Instructor performs well even when instructions are phrased in different ways; the authors ran experiments evaluating (not training) the model with five paraphrased versions of the instructions used at training time, as sketched below.
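That robustness check is easy to approximate: embed the same text under paraphrased instructions and confirm the embeddings stay close. The paraphrases below are illustrative, not the authors':

```python
from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity

model = INSTRUCTOR("hkunlp/instructor-xl")

text = "The central bank raised interest rates by 25 basis points."
paraphrased = [
    ["Represent the Finance sentence for classification:", text],
    ["Encode this finance sentence so it can be classified:", text],
]
emb = model.encode(paraphrased)

# A robust model yields highly similar embeddings across paraphrases.
print(cosine_similarity(emb[0:1], emb[1:2]))
```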
- ID: hkunlp_instructor-xl
- Name: hkunlp_instructor-xl
- Model Type ID: Text Embedder
- Description: An embedding model that generates text embeddings tailored to any task (e.g., classification, clustering, text evaluation) and domain (e.g., science, finance) simply by providing the task instruction, without any fine-tuning.
- Last Updated: Jul 17, 2023
- Privacy: PUBLIC