jina-embeddings-v2-base-en
Jina-embeddings-v2 is an English text embedding model by Jina AI, based on the BERT architecture, with a maximum sequence length of 8192 tokens. It outperforms OpenAI's text-embedding-ada-002 on several MTEB task averages.
Notes
Introduction
Jina AI introduces its second-generation text embedding model, jina-embeddings-v2. The model supports an 8K (8192-token) context length, and its performance on the Massive Text Embedding Benchmark (MTEB) leaderboard is comparable to that of OpenAI's text-embedding-ada-002.
Jina-embeddings-v2
Model Information
Jina-embeddings-v2-base-en is a monolingual English embedding model that supports sequences of up to 8192 tokens. It is based on a BERT architecture (JinaBert) that uses the symmetric bidirectional variant of ALiBi to handle longer sequences.
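To make the ALiBi idea more concrete, here is a minimal sketch of a symmetric, distance-based attention bias. It assumes the per-head slope schedule from the original ALiBi paper (a geometric sequence starting at 2^(-8/n), which holds when the number of heads n is a power of two) and a bias of -slope * |i - j|. The function names and the head count of 8 are hypothetical and chosen for illustration; this is not JinaBert's actual implementation or configuration.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: geometric sequence starting at 2^(-8/num_heads).

    Assumes num_heads is a power of two, as in the basic ALiBi schedule.
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def symmetric_alibi_bias(seq_len, slope):
    """Bias added to attention scores for query position i and key position j.

    The penalty grows linearly with the distance |i - j| in both directions,
    which is what makes this variant symmetric (bidirectional).
    """
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


# Illustrative values only: 8 heads and a 4-token sequence.
slopes = alibi_slopes(8)
bias = symmetric_alibi_bias(4, slopes[0])

for row in bias:
    print(row)
```

Because the bias depends only on relative distance and not on learned position embeddings, the same formula can be evaluated for any sequence length, which is the property that lets ALiBi-style models handle longer inputs.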
Training Data
The backbone, jina-bert-v2-base-en, is pre-trained on the C4 dataset and further trained on over 400 million sentence pairs and hard negatives collected by Jina AI. The pairs span various domains and were carefully cleaned.
Run Jina-embeddings-v2 with an API
Running the API with Clarifai's Python SDK
You can run the Jina-embeddings-v2 model using Clarifai's Python SDK.
Find your PAT (personal access token) in your security settings and export it as an environment variable. Then import and initialize the API client.
export CLARIFAI_PAT={your personal access token}
from clarifai.client.model import Model
text = '''In India Green Revolution commenced in the early 1960s that led to an increase in food grain production, especially in Punjab, Haryana, and Uttar Pradesh. Major milestones in this undertaking were the development of high-yielding varieties of wheat. The Green revolution is revolutionary in character due to the introduction of new technology, new ideas, the new application of inputs like HYV seeds, fertilizers, irrigation water, pesticides, etc. As all these were brought suddenly and spread quickly to attain dramatic results thus it is termed as a revolution in green agriculture.
'''
# Model predict: send the text as bytes and request a text-type prediction
model_url = "https://clarifai.com/jinaai/jina-embeddings/models/jina-embeddings-v2-base-en"
model_prediction = Model(model_url).predict_by_bytes(text.encode(), "text")
# The embedding vector and its dimensionality are in the first output
embeddings = model_prediction.outputs[0].data.embeddings[0].vector
num_dimensions = model_prediction.outputs[0].data.embeddings[0].num_dimensions
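Once you have embedding vectors, a common next step is to compare them. The sketch below computes cosine similarity in plain Python. The `cosine_similarity` helper and the toy 3-dimensional vectors are illustrative assumptions that stand in for real 768-dimensional embeddings returned by the model; they are not part of the Clarifai SDK.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of a query and two documents.
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # same direction as the query
doc_far = [0.0, 1.0, 0.0]    # orthogonal to the query

close_score = cosine_similarity(query, doc_close)
far_score = cosine_similarity(query, doc_far)
print(close_score > far_score)
```

In practice you would pass the `vector` values from two predictions to the same helper and rank documents by score.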
You can also run the Jina-embeddings-v2 API using other Clarifai client libraries, such as Java, cURL, NodeJS, and PHP.
Using cURL to Make a Direct HTTP Call
To make a direct HTTP call to the Jina-embeddings-v2 API using cURL, you can use the following command:
curl -X POST "https://api.clarifai.com/v2/users/jinaai/apps/jina-embeddings/models/jina-embeddings-v2-base-en/versions/bb64ef0192984bea83688b90f88572ff/outputs" \
-H "Authorization: Key YOUR_PAT_HERE" \
-H "Content-Type: application/json" \
-d '{
  "inputs": [
    {
      "data": {
        "text": {
          "raw": "You are a good boy."
        }
      }
    }
  ]
}'
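The same request can be assembled in Python with only the standard library. This is a rough sketch that mirrors the cURL command above: the URL, headers, and payload shape are copied from it, while the `build_embedding_request` helper name is hypothetical. The request is only built here, not sent.

```python
import json
import os
import urllib.request

# Endpoint copied from the cURL command above.
API_URL = (
    "https://api.clarifai.com/v2/users/jinaai/apps/jina-embeddings/models/"
    "jina-embeddings-v2-base-en/versions/bb64ef0192984bea83688b90f88572ff/outputs"
)


def build_embedding_request(text, pat):
    """Build (but do not send) a POST request matching the cURL example."""
    payload = {"inputs": [{"data": {"text": {"raw": text}}}]}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Key {pat}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Read the PAT from the environment, falling back to the placeholder.
pat = os.environ.get("CLARIFAI_PAT", "YOUR_PAT_HERE")
request = build_embedding_request("You are a good boy.", pat)
print(request.get_method(), request.full_url)

# To actually send it (requires a valid PAT and network access):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect the payload before spending an API call.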
OpenAI Embedding vs Jina Embedding
Compared with OpenAI's text-embedding-ada-002, jina-embeddings-v2-base-en scores higher on several MTEB task averages while using fewer embedding dimensions (768 versus 1536). The table below compares the two models:
Rank | Model | Model Size (GB) | Embedding Dimensions | Sequence Length | Average (56 datasets) | Classification Average (12 datasets) | Reranking Average (4 datasets) | Retrieval Average (15 datasets) | Summarization Average (1 dataset) |
---|---|---|---|---|---|---|---|---|---|
15 | text-embedding-ada-002 | Unknown | 1536 | 8191 | 60.99 | 70.93 | 84.89 | 56.32 | 30.8 |
17 | jina-embeddings-v2-base-en | 0.27 | 768 | 8192 | 60.38 | 73.45 | 85.38 | 56.98 | 31.6 |
Notably, jina-embeddings-v2-base-en outperforms its OpenAI counterpart in the Classification, Reranking, Retrieval, and Summarization averages, although its overall average across 56 datasets is slightly lower (60.38 versus 60.99).
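The per-metric differences can be checked directly from the table. The short sketch below uses only the scores listed above; the `metric_deltas` helper and the shortened metric names are illustrative.

```python
# Scores copied from the comparison table above.
scores = {
    "text-embedding-ada-002": {
        "Average": 60.99,
        "Classification": 70.93,
        "Reranking": 84.89,
        "Retrieval": 56.32,
        "Summarization": 30.8,
    },
    "jina-embeddings-v2-base-en": {
        "Average": 60.38,
        "Classification": 73.45,
        "Reranking": 85.38,
        "Retrieval": 56.98,
        "Summarization": 31.6,
    },
}


def metric_deltas(model_scores, baseline_scores):
    """Score difference (model minus baseline) per metric, rounded to 2 places."""
    return {m: round(model_scores[m] - baseline_scores[m], 2) for m in model_scores}


deltas = metric_deltas(
    scores["jina-embeddings-v2-base-en"], scores["text-embedding-ada-002"]
)
# Metrics where the Jina model scores higher than the OpenAI model.
jina_wins = [metric for metric, delta in deltas.items() if delta > 0]

for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.2f}")
```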
Use Cases
Jina-embeddings-v2's extended context capabilities unlock new possibilities for a range of industry applications including:
- Legal Document Analysis: Capture and analyze intricate details in extensive legal texts effectively.
- Medical Research: Holistically embed scientific papers for advanced analytics and discoveries in the field of medicine.
- Literary Analysis: Dive deep into long-form content, allowing the capture of nuanced thematic elements in literature.
- Financial Forecasting: Attain superior insights from detailed financial reports, making it a valuable tool for finance professionals.
- Conversational AI: Enhance chatbot responses to intricate user queries with a deeper understanding of context.
Evaluation
Benchmarking shows that on several datasets the extended context of jina-embeddings-v2 offers practical advantages, and the model outperforms other leading base embedding models on various tasks.
- Name: jina-embeddings-v2-base-en
- Model Type ID: Text Embedder
- Description: Jina-embeddings-v2 is an English text embedding model by Jina AI, based on Bert architecture, with an 8192-sequence length, outperforming OpenAI's embedding model in various metrics
- Last Updated: Nov 06, 2023
- Privacy: PUBLIC