mxbai-embed-large-v1
mxbai-embed-large-v1 is a state-of-the-art, versatile sentence embedding model trained on a unique dataset for superior performance across a wide range of NLP tasks, and it sits at the top of the MTEB leaderboard.
Notes
Introduction
mxbai-embed-large-v1 is a state-of-the-art (SOTA) sentence embedding model designed to capture the nuanced semantics of text across a wide range of domains and tasks. Leveraging advanced training techniques and a vast, high-quality dataset, the model sets new performance benchmarks on natural language understanding tasks.
Mxbai-embed-large Model
- Model Name: mxbai-embed-large-v1
- Training Technique: The model is trained with AnglE loss, an angle-optimized objective that helps it learn high-quality, context-rich embeddings.
- Data Scale: Training was conducted on a large-scale, high-quality dataset comprising over 700 million pairs and fine-tuned on more than 30 million high-quality triplets.
- Performance Benchmark: Achieves SOTA performance among models at BERT-large scale, demonstrating its effectiveness in capturing complex text representations.
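In downstream use, the sentence embeddings this model produces are typically compared with cosine similarity. Below is a minimal sketch of that comparison, assuming the model's 1024-dimensional output and using random placeholder vectors rather than real model output.

```python
# Minimal sketch: cosine similarity is the usual way to compare sentence
# embeddings. The vectors below are random placeholders, not real output
# from mxbai-embed-large-v1.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = np.random.rand(1024)  # mxbai-embed-large-v1 embeddings are 1024-dimensional
emb_b = np.random.rand(1024)
print(f"similarity: {cosine_similarity(emb_a, emb_b):.4f}")
```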
Run Mxbai-embed-large with an API
Running the API with Clarifai's Python SDK
You can run the Mxbai-embed-large Model API using Clarifai’s Python SDK.
Export your PAT as an environment variable. Then, import and initialize the API Client.
Find your PAT in your security settings.
```bash
export CLARIFAI_PAT={your personal access token}
```
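If you prefer to stay in Python, you can set the same environment variable before initializing the client; this is a small sketch relying only on the CLARIFAI_PAT variable shown above.

```python
import os

# Alternative to the shell export above: set the PAT from Python before
# creating any Clarifai client objects. Replace the placeholder string with
# your actual personal access token.
os.environ["CLARIFAI_PAT"] = "your personal access token"
```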
```python
from clarifai.client.model import Model

text = '''In India Green Revolution commenced in the early 1960s that led to an increase in food grain production, especially in Punjab, Haryana, and Uttar Pradesh. Major milestones in this undertaking were the development of high-yielding varieties of wheat. The Green revolution is revolutionary in character due to the introduction of new technology, new ideas, the new application of inputs like HYV seeds, fertilizers, irrigation water, pesticides, etc. As all these were brought suddenly and spread quickly to attain dramatic results thus it is termed as a revolution in green agriculture.
'''

# Model Predict
model_prediction = Model("https://clarifai.com/mixedbread-ai/embed/models/mxbai-embed-large-v1").predict_by_bytes(text.encode(), "text")

# Extract the embedding vector and its dimensionality from the response
embeddings = model_prediction.outputs[0].data.embeddings[0].vector
num_dimensions = model_prediction.outputs[0].data.embeddings[0].num_dimensions
```
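Once the call returns, the embedding can be inspected or converted to a NumPy array for downstream use. A short sketch, continuing from the code above:

```python
import numpy as np

# Continuing from the example above: inspect the returned embedding.
vector = np.asarray(embeddings, dtype=np.float32)
print(num_dimensions)   # dimensionality reported by the API
print(vector.shape)     # should match num_dimensions
print(vector[:5])       # first few components of the embedding
```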
You can also run the Mxbai-embed-large API using other Clarifai client libraries such as Java, cURL, NodeJS, and PHP.
For Retrieval
For retrieval, wrap your query in the following prompt format:

```python
prompt = f'Represent this sentence for searching relevant passages: {text}'
```
```python
from clarifai.client.model import Model

text = '''In India Green Revolution commenced in the early 1960s that led to an increase in food grain production, especially in Punjab, Haryana, and Uttar Pradesh. Major milestones in this undertaking were the development of high-yielding varieties of wheat. The Green revolution is revolutionary in character due to the introduction of new technology, new ideas, the new application of inputs like HYV seeds, fertilizers, irrigation water, pesticides, etc. As all these were brought suddenly and spread quickly to attain dramatic results thus it is termed as a revolution in green agriculture.
'''

prompt = f'Represent this sentence for searching relevant passages: {text}'

# Model Predict
model_prediction = Model("https://clarifai.com/mixedbread-ai/embed/models/mxbai-embed-large-v1").predict_by_bytes(prompt.encode(), "text")

embeddings = model_prediction.outputs[0].data.embeddings[0].vector
num_dimensions = model_prediction.outputs[0].data.embeddings[0].num_dimensions
```
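Putting this together, a typical retrieval flow embeds the query with the prompt above, embeds the candidate passages without the prompt, and ranks the passages by cosine similarity. The sketch below is illustrative only: the embed helper, example query, and passages are hypothetical rather than part of the Clarifai SDK, and it reuses the same predict_by_bytes call shown above.

```python
import numpy as np
from clarifai.client.model import Model

# Illustrative retrieval sketch; helper names and sample data are hypothetical.
MODEL_URL = "https://clarifai.com/mixedbread-ai/embed/models/mxbai-embed-large-v1"
model = Model(MODEL_URL)

def embed(text: str) -> np.ndarray:
    """Embed a single piece of text via the Clarifai API (one call per text)."""
    prediction = model.predict_by_bytes(text.encode(), "text")
    return np.asarray(prediction.outputs[0].data.embeddings[0].vector)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The query is embedded WITH the retrieval prompt; passages are embedded as-is.
query = "When did the Green Revolution start in India?"
query_embedding = embed(f"Represent this sentence for searching relevant passages: {query}")

passages = [
    "The Green Revolution in India began in the early 1960s.",
    "Punjab and Haryana saw large increases in wheat production.",
    "The Mughal Empire was founded in 1526.",
]
scores = [cosine_similarity(query_embedding, embed(p)) for p in passages]

# Rank passages by similarity to the query, highest first.
for score, passage in sorted(zip(scores, passages), reverse=True):
    print(f"{score:.3f}  {passage}")
```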
Use Cases
The Mxbai-embed-large model is versatile, supporting a wide range of applications including, but not limited to:
- Text classification
- Retrieval-augmented generation (RAG)
- Semantic textual similarity (STS)
- Information Retrieval
- Content summarization
- Document clustering

Its robust performance across various domains and tasks makes it an ideal choice for both academic research and commercial NLP applications.
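As a concrete illustration of the document clustering use case, the sketch below groups a handful of embeddings with k-means. It assumes scikit-learn is available and uses a random placeholder embedding matrix in place of real model output.

```python
# Illustrative document-clustering sketch. In practice, each row of the
# matrix would be an mxbai-embed-large-v1 embedding for one document; here
# the rows are random placeholders.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
document_embeddings = rng.random((12, 1024))  # 12 documents, 1024-dim embeddings

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(document_embeddings)
print(labels)  # cluster assignment for each document
```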
Evaluation
The model has undergone comprehensive evaluation, particularly on the Massive Text Embedding Benchmark (MTEB), showcasing exceptional performance:
MTEB evaluates models on seven tasks across 56 datasets. The mxbai-embed-large-v1 model demonstrates superior performance, especially in classification, pair classification, and semantic textual similarity (STS) tasks.
- Overall Ranking: Ranked first among embedding models of similar size.
- Comparative Performance: Outperforms OpenAI's text-embedding-3-large model and matches the performance of models 20 times its size, such as echo-mistral-7b.
- Generalization: Demonstrates strong generalization across tasks and domains, attributed to the rigorous dataset selection and training process.
Below are the evaluation results, showing the performance of mxbai-embed-large-v1 compared to other notable models across the tasks measured in the Massive Text Embedding Benchmark (MTEB):
| Model | Avg (56 datasets) | Classification (12 datasets) | Clustering (11 datasets) | Pair Classification (3 datasets) | Re-ranking (4 datasets) | Retrieval (15 datasets) | STS (10 datasets) | Summarization (1 dataset) |
|---|---|---|---|---|---|---|---|---|
| mxbai-embed-large-v1 | 64.68 | 75.64 | 46.71 | 87.20 | 60.11 | 54.39 | 85.00 | 32.71 |
| bge-large-en-v1.5 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
| mxbai-embed-2d-large-v1 | 63.25 | 74.14 | 46.07 | 85.89 | 58.94 | 51.42 | 84.90 | 31.55 |
| nomic-embed-text-v1 | 62.39 | 74.12 | 43.91 | 85.15 | 55.69 | 52.81 | 82.06 | 30.08 |
| jina-embeddings-v2-base-en | 60.38 | 73.45 | 41.73 | 85.38 | 56.98 | 47.87 | 80.70 | 31.60 |
| OpenAI text-embedding-3-large | 64.58 | 75.45 | 49.01 | 85.72 | 59.16 | 55.44 | 81.73 | 29.92 |
| Cohere embed-english-v3.0 | 64.47 | 76.49 | 47.43 | 85.84 | 58.01 | 55.00 | 82.62 | 30.18 |
| OpenAI text-embedding-ada-002 | 60.99 | 70.93 | 45.90 | 84.89 | 56.32 | 49.25 | 80.97 | 30.80 |
Dataset
The model benefits from a carefully curated training dataset, constructed by:
- Scraping a large portion of the internet.
- Cleaning the data to ensure high quality.
- Ensuring zero overlap with MTEB test sets or any potential test candidates, maintaining the integrity of the evaluation process.

This extensive and meticulously prepared dataset underpins the model's SOTA performance and broad applicability.
Advantages
- High Performance: Achieves benchmark-setting performance on a wide range of NLP tasks.
- Generalization Capability: Performs well across different domains and text lengths, indicative of its robust training and dataset diversity.
- Ethical Training Practices: Avoids any overlap with MTEB test sets or training on potential test candidates, ensuring a fair and genuine evaluation of its capabilities.
Limitations
Despite its robust performance, the model has some limitations that will be addressed in version 2.
- Name: mxbai-embed-large-v1
- Model Type ID: Text Embedder
- Last Updated: Mar 13, 2024
- Privacy: PUBLIC