jina-embeddings-v2-base-en model | Clarifai

jina-embeddings-v2-base-en

Jina-embeddings-v2 is an English text embedding model by Jina AI, based on the BERT architecture, with an 8192-token sequence length. It outperforms OpenAI's text-embedding-ada-002 on several MTEB metrics.

Notes

Introduction

Jina AI introduces its second-generation text embedding model, jina-embeddings-v2. The model supports an 8K (8192-token) context length and is comparable to OpenAI's text-embedding-ada-002 in both capabilities and performance on the Massive Text Embedding Benchmark (MTEB) leaderboard.

Jina-embeddings-v2

Model Information

Jina-embeddings-v2-base-en is a monolingual English embedding model that supports sequences of up to 8192 tokens. It is based on a BERT architecture (JinaBert) that uses the symmetric bidirectional variant of ALiBi to support longer sequence lengths.
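To make the idea of a symmetric bidirectional ALiBi bias more concrete, here is a minimal sketch. It assumes the bias added to each attention score is the negative head slope times the absolute distance between query and key positions, and it borrows the geometric slope schedule from the original ALiBi paper for a power-of-two head count. The function names and the head count of 8 are illustrative assumptions, not details taken from JinaBert itself.

```python
def alibi_slopes(num_heads):
    """Geometric slope schedule from the original ALiBi paper.

    Assumption: num_heads is a power of two; the real model may differ.
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def symmetric_alibi_bias(seq_len, slope):
    """Bias matrix with entry -slope * |i - j| for query i and key j.

    Using the absolute distance makes the penalty identical in both
    directions, which suits a bidirectional encoder.
    """
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(8)                     # illustrative head count
bias = symmetric_alibi_bias(4, slopes[0])    # bias for the first head
```

Because the bias depends only on relative distance and involves no learned position embeddings, the same rule can be applied to sequences longer than those seen during training.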

Training Data

The backbone, jina-bert-v2-base-en, is pre-trained on the C4 dataset and further trained on over 400 million sentence pairs and hard negatives collected by Jina AI, spanning various domains and subjected to meticulous cleaning.

Run Jina-embeddings-v2 with an API

Running the API with Clarifai's Python SDK

You can run the Jina-embeddings-v2 Model API using Clarifai’s Python SDK.

Find your personal access token (PAT) in your security settings and export it as an environment variable. Then import and initialize the API client.

export CLARIFAI_PAT={your personal access token}

from clarifai.client.model import Model

text = '''In India Green Revolution commenced in the early 1960s that led to an increase in food grain production, especially in Punjab, Haryana, and Uttar Pradesh. Major milestones in this undertaking were the development of high-yielding varieties of wheat. The Green revolution is revolutionary in character due to the introduction of new technology, new ideas, the new application of inputs like HYV seeds, fertilizers, irrigation water, pesticides, etc. As all these were brought suddenly and spread quickly to attain dramatic results thus it is termed as a revolution in green agriculture.
'''

# Run the model on the UTF-8 encoded text
model_url = "https://clarifai.com/jinaai/jina-embeddings/models/jina-embeddings-v2-base-en"
model_prediction = Model(model_url).predict_by_bytes(text.encode(), "text")

# Embedding vector and its dimensionality for the first (and only) input
embeddings = model_prediction.outputs[0].data.embeddings[0].vector
num_dimensions = model_prediction.outputs[0].data.embeddings[0].num_dimensions
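Once two texts have been embedded, a common next step is to compare their vectors. The sketch below computes cosine similarity in plain Python. The function name and the toy three-dimensional vectors are illustrative assumptions; in practice the inputs would be vectors such as `embeddings` obtained from separate predictions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings
vec_a = [1.0, 0.0, 1.0]
vec_b = [1.0, 0.0, 1.0]
vec_c = [0.0, 1.0, 0.0]

same = cosine_similarity(vec_a, vec_b)        # identical direction
orthogonal = cosine_similarity(vec_a, vec_c)  # no overlap
```

Values close to 1 indicate vectors pointing in nearly the same direction, while values near 0 indicate little relation.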

You can also call the Jina-embeddings-v2 API using other Clarifai client libraries, such as Java, cURL, NodeJS, and PHP.

Using cURL to Make a Direct HTTP Call

To make a direct HTTP call to the Jina-embeddings-v2 API using cURL, you can use the following command:

curl -X POST "https://api.clarifai.com/v2/users/jinaai/apps/jina-embeddings/models/jina-embeddings-v2-base-en/versions/bb64ef0192984bea83688b90f88572ff/outputs" \
    -H "Authorization: Key YOUR_PAT_HERE" \
    -H "Content-Type: application/json" \
    -d '{
    "inputs": [
        {
            "data": {
                "text": {
                    "raw": "You are a good boy."
                }
            }
        }
    ]
}'
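The same request can be assembled in Python with only the standard library. This is a rough sketch that mirrors the cURL command above: the URL, headers, and body are copied from it, while the helper names `build_request` and `extract_vector` are hypothetical. The response layout used in `extract_vector` is an assumption based on the field names in the SDK example (`outputs[0].data.embeddings[0].vector`), so check it against a real response before relying on it.

```python
import json
import urllib.request

API_URL = (
    "https://api.clarifai.com/v2/users/jinaai/apps/jina-embeddings/models/"
    "jina-embeddings-v2-base-en/versions/bb64ef0192984bea83688b90f88572ff/outputs"
)


def build_request(text, pat):
    """Build (but do not send) the POST request shown in the cURL command."""
    payload = {"inputs": [{"data": {"text": {"raw": text}}}]}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Key {pat}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_vector(response_json):
    """Pull the first embedding vector out of a decoded JSON response.

    Assumption: the JSON mirrors the SDK fields used earlier.
    """
    return response_json["outputs"][0]["data"]["embeddings"][0]["vector"]


request = build_request("You are a good boy.", "YOUR_PAT_HERE")
# To send it:
#   with urllib.request.urlopen(request) as resp:
#       vector = extract_vector(json.load(resp))
```

Keeping request construction separate from sending makes it easy to inspect the payload without making a network call.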

OpenAI Embedding vs Jina Embedding

Compared with OpenAI's text-embedding-ada-002, Jina-embeddings-v2 performs competitively on MTEB despite having half the embedding dimensions. Below is a performance comparison:

Rank	Model	Model Size (GB)	Embedding Dimensions	Sequence Length	Average (56 datasets)	Classification Average (12 datasets)	Reranking Average (4 datasets)	Retrieval Average (15 datasets)	Summarization Average (1 dataset)
15	text-embedding-ada-002	Unknown	1536	8191	60.99	70.93	84.89	56.32	30.8
17	jina-embeddings-v2-base-en	0.27	768	8192	60.38	73.45	85.38	56.98	31.6

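Notably, jina-embeddings-v2-base-en outperforms its OpenAI counterpart in Classification Average, Reranking Average, Retrieval Average, and Summarization Average, while its overall average across 56 datasets (60.38) is slightly below that of text-embedding-ada-002 (60.99).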
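The size of each gap can be read directly from the table. The short sketch below recomputes the per-metric differences; the scores are copied from the two table rows, and the variable names are illustrative.

```python
# (text-embedding-ada-002, jina-embeddings-v2-base-en) scores from the table
scores = {
    "Average (56 datasets)": (60.99, 60.38),
    "Classification Average (12 datasets)": (70.93, 73.45),
    "Reranking Average (4 datasets)": (84.89, 85.38),
    "Retrieval Average (15 datasets)": (56.32, 56.98),
    "Summarization Average (1 dataset)": (30.8, 31.6),
}

# Positive delta means the Jina model scores higher on that metric
deltas = {name: round(jina - ada, 2) for name, (ada, jina) in scores.items()}
jina_wins = [name for name, delta in deltas.items() if delta > 0]

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

The differences are small in every column, which is consistent with the two models sitting two ranks apart on the leaderboard.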

Use Cases

Jina-embeddings-v2's extended context capabilities unlock new possibilities for a range of industry applications including:

Legal Document Analysis: Capture and analyze intricate details in extensive legal texts effectively.
Medical Research: Holistically embed scientific papers for advanced analytics and discoveries in the field of medicine.
Literary Analysis: Dive deep into long-form content, allowing the capture of nuanced thematic elements in literature.
Financial Forecasting: Attain superior insights from detailed financial reports, making it a valuable tool for finance professionals.
Conversational AI: Enhance chatbot responses to intricate user queries with a deeper understanding of context.

Evaluation

Benchmarking shows that on several datasets, jina-embeddings-v2's extended context length offers practical advantages, outperforming other leading base-size embedding models on various tasks.

Model Type ID: Text Embedder
Input Type: text
Output Type: embeddings
Last Updated: Nov 06, 2023
Privacy: PUBLIC