mxbai-embed-large-v1
mxbai-embed-large-v1 is a state-of-the-art, versatile sentence embedding model trained on a unique dataset for superior performance across a wide range of NLP tasks, and it sits at the top of the MTEB leaderboard.
Notes
Introduction
mxbai-embed-large-v1 is a state-of-the-art (SOTA) sentence embedding model designed to capture the nuanced semantics of text across a wide range of domains and tasks. Leveraging advanced training techniques and a vast, high-quality dataset, the model sets new performance benchmarks on natural language understanding tasks.
Mxbai-embed-large Model
- Model Name: mxbai-embed-large-v1
- Training Technique: The model is trained with AnglE loss, an angle-optimized objective that helps it learn high-quality, context-rich embeddings.
- Data Scale: Training was conducted on a large-scale, high-quality dataset comprising over 700 million pairs and fine-tuned on more than 30 million high-quality triplets.
- Performance Benchmark: Achieves SOTA performance among models at BERT-large scale, demonstrating its effectiveness in capturing complex text representations.
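In downstream use, the sentence embeddings this model produces are typically compared with cosine similarity. Below is a minimal sketch of that comparison, assuming the model's 1024-dimensional output and using random placeholder vectors rather than real model output.

```python
# Minimal sketch: cosine similarity is the usual way to compare sentence
# embeddings. The vectors below are random placeholders, not real output
# from mxbai-embed-large-v1.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = np.random.rand(1024)  # mxbai-embed-large-v1 embeddings are 1024-dimensional
emb_b = np.random.rand(1024)
print(f"similarity: {cosine_similarity(emb_a, emb_b):.4f}")
```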
Run Mxbai-embed-large with an API
Running the API with Clarifai's Python SDK
You can run the Mxbai-embed-large Model API using Clarifai’s Python SDK.
Export your PAT as an environment variable. Then, import and initialize the API Client.
Find your PAT in your security settings.
```bash
export CLARIFAI_PAT={your personal access token}
```
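If you prefer to stay in Python, you can set the same environment variable before initializing the client; this is a small sketch relying only on the CLARIFAI_PAT variable shown above.

```python
import os

# Alternative to the shell export above: set the PAT from Python before
# creating any Clarifai client objects. Replace the placeholder string with
# your actual personal access token.
os.environ["CLARIFAI_PAT"] = "your personal access token"
```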
```python
from clarifai.client.model import Model

text = '''In India Green Revolution commenced in the early 1960s that led to an increase in food grain production, especially in Punjab, Haryana, and Uttar Pradesh. Major milestones in this undertaking were the development of high-yielding varieties of wheat. The Green revolution is revolutionary in character due to the introduction of new technology, new ideas, the new application of inputs like HYV seeds, fertilizers, irrigation water, pesticides, etc. As all these were brought suddenly and spread quickly to attain dramatic results thus it is termed as a revolution in green agriculture.
'''

# Model Predict
model_prediction = Model("https://clarifai.com/mixedbread-ai/embed/models/mxbai-embed-large-v1").predict_by_bytes(text.encode(), "text")

# Extract the embedding vector and its dimensionality from the response
embeddings = model_prediction.outputs[0].data.embeddings[0].vector
num_dimensions = model_prediction.outputs[0].data.embeddings[0].num_dimensions
```
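Once the call returns, the embedding can be inspected or converted to a NumPy array for downstream use. A short sketch, continuing from the code above:

```python
import numpy as np

# Continuing from the example above: inspect the returned embedding.
vector = np.asarray(embeddings, dtype=np.float32)
print(num_dimensions)   # dimensionality reported by the API
print(vector.shape)     # should match num_dimensions
print(vector[:5])       # first few components of the embedding
```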
You can also run the Mxbai-embed-large API using other Clarifai client libraries such as Java, cURL, NodeJS, and PHP.
For Retrieval
For retrieval, wrap your query in the following prompt format:

```python
prompt = f'Represent this sentence for searching relevant passages: {text}'
```
```python
from clarifai.client.model import Model

text = '''In India Green Revolution commenced in the early 1960s that led to an increase in food grain production, especially in Punjab, Haryana, and Uttar Pradesh. Major milestones in this undertaking were the development of high-yielding varieties of wheat. The Green revolution is revolutionary in character due to the introduction of new technology, new ideas, the new application of inputs like HYV seeds, fertilizers, irrigation water, pesticides, etc. As all these were brought suddenly and spread quickly to attain dramatic results thus it is termed as a revolution in green agriculture.
'''

prompt = f'Represent this sentence for searching relevant passages: {text}'

# Model Predict
model_prediction = Model("https://clarifai.com/mixedbread-ai/embed/models/mxbai-embed-large-v1").predict_by_bytes(prompt.encode(), "text")

embeddings = model_prediction.outputs[0].data.embeddings[0].vector
num_dimensions = model_prediction.outputs[0].data.embeddings[0].num_dimensions
```
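Putting this together, a typical retrieval flow embeds the query with the prompt above, embeds the candidate passages without the prompt, and ranks the passages by cosine similarity. The sketch below is illustrative only: the embed helper, example query, and passages are hypothetical rather than part of the Clarifai SDK, and it reuses the same predict_by_bytes call shown above.

```python
import numpy as np
from clarifai.client.model import Model

# Illustrative retrieval sketch; helper names and sample data are hypothetical.
MODEL_URL = "https://clarifai.com/mixedbread-ai/embed/models/mxbai-embed-large-v1"
model = Model(MODEL_URL)

def embed(text: str) -> np.ndarray:
    """Embed a single piece of text via the Clarifai API (one call per text)."""
    prediction = model.predict_by_bytes(text.encode(), "text")
    return np.asarray(prediction.outputs[0].data.embeddings[0].vector)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The query is embedded WITH the retrieval prompt; passages are embedded as-is.
query = "When did the Green Revolution start in India?"
query_embedding = embed(f"Represent this sentence for searching relevant passages: {query}")

passages = [
    "The Green Revolution in India began in the early 1960s.",
    "Punjab and Haryana saw large increases in wheat production.",
    "The Mughal Empire was founded in 1526.",
]
scores = [cosine_similarity(query_embedding, embed(p)) for p in passages]

# Rank passages by similarity to the query, highest first.
for score, passage in sorted(zip(scores, passages), reverse=True):
    print(f"{score:.3f}  {passage}")
```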
Use Cases
The Mxbai-embed-large model is versatile, supporting a wide range of applications including, but not limited to:
- Text classification
- Retrieval-augmented generation (RAG)
- Semantic textual similarity (STS)
- Information Retrieval
- Content summarization
- Document clustering

Its robust performance across various domains and tasks makes it an ideal choice for both academic research and commercial NLP applications.
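As a concrete illustration of the document clustering use case, the sketch below groups a handful of embeddings with k-means. It assumes scikit-learn is available and uses a random placeholder embedding matrix in place of real model output.

```python
# Illustrative document-clustering sketch. In practice, each row of the
# matrix would be an mxbai-embed-large-v1 embedding for one document; here
# the rows are random placeholders.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
document_embeddings = rng.random((12, 1024))  # 12 documents, 1024-dim embeddings

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(document_embeddings)
print(labels)  # cluster assignment for each document
```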
Evaluation
The model has undergone comprehensive evaluation, particularly on the Massive Text Embedding Benchmark (MTEB), showcasing exceptional performance:
MTEB evaluates models on seven tasks across 56 datasets. The mxbai-embed-large-v1 model demonstrates superior performance, especially in classification, pair classification, and semantic textual similarity (STS) tasks.
- Overall Ranking: Ranked first among embedding models of similar size.
- Comparative Performance: Outperforms OpenAI's text-embedding-3-large model and matches the performance of models 20 times its size, such as echo-mistral-7b.
- Generalization: Demonstrates strong generalization across tasks and domains, attributed to the rigorous dataset selection and training process.
Below are the evaluation results, showing the performance of mxbai-embed-large-v1 compared to other notable models across the tasks measured in the Massive Text Embedding Benchmark (MTEB):
| Model | Avg (56 datasets) | Classification (12 datasets) | Clustering (11 datasets) | Pair Classification (3 datasets) | Re-ranking (4 datasets) | Retrieval (15 datasets) | STS (10 datasets) | Summarization (1 dataset) |
|---|---|---|---|---|---|---|---|---|
| mxbai-embed-large-v1 | 64.68 | 75.64 | 46.71 | 87.20 | 60.11 | 54.39 | 85.00 | 32.71 |
| bge-large-en-v1.5 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
| mxbai-embed-2d-large-v1 | 63.25 | 74.14 | 46.07 | 85.89 | 58.94 | 51.42 | 84.90 | 31.55 |
| nomic-embed-text-v1 | 62.39 | 74.12 | 43.91 | 85.15 | 55.69 | 52.81 | 82.06 | 30.08 |
| jina-embeddings-v2-base-en | 60.38 | 73.45 | 41.73 | 85.38 | 56.98 | 47.87 | 80.70 | 31.60 |
| OpenAI text-embedding-3-large | 64.58 | 75.45 | 49.01 | 85.72 | 59.16 | 55.44 | 81.73 | 29.92 |
| Cohere embed-english-v3.0 | 64.47 | 76.49 | 47.43 | 85.84 | 58.01 | 55.00 | 82.62 | 30.18 |
| OpenAI text-embedding-ada-002 | 60.99 | 70.93 | 45.90 | 84.89 | 56.32 | 49.25 | 80.97 | 30.80 |
Dataset
The model benefits from a carefully curated training dataset, constructed by:
- Scraping a large portion of the internet.
- Cleaning the data to ensure high quality.
- Ensuring zero overlap with MTEB test sets or any potential test candidates, maintaining the integrity of the evaluation process.

This extensive and meticulously prepared dataset underpins the model's SOTA performance and broad applicability.
Advantages
- High Performance: Achieves benchmark-setting performance on a wide range of NLP tasks.
- Generalization Capability: Performs well across different domains and text lengths, indicative of its robust training and dataset diversity.
- Ethical Training Practices: Avoids any overlap with MTEB test sets or training on potential test candidates, ensuring a fair and genuine evaluation of its capabilities.
Limitations
Despite its robust performance, the model has some limitations that will be addressed in version 2.
- Name: mxbai-embed-large-v1
- Model Type ID: Text Embedder
- Last Updated: Mar 13, 2024
- Privacy: PUBLIC