jina-embeddings-v2-base-en

Jina-embeddings-v2 is an English text embedding model by Jina AI. It is based on the BERT architecture, supports a sequence length of 8192 tokens, and outperforms OpenAI's text-embedding-ada-002 on several MTEB metrics.

Notes

Introduction

Jina AI introduces its second-generation text embedding model, jina-embeddings-v2. The model supports an 8K (8192-token) context length and is comparable to OpenAI's text-embedding-ada-002 in both capabilities and performance on the Massive Text Embedding Benchmark (MTEB) leaderboard.

Jina-embeddings-v2

Model Information

Jina-embeddings-v2-base-en is a monolingual English embedding model that supports sequences of up to 8192 tokens. It is based on a BERT architecture (JinaBert) that uses the symmetric bidirectional variant of ALiBi to handle longer sequences.
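To make the ALiBi idea concrete, here is a minimal sketch of a symmetric bidirectional bias. Instead of positional embeddings, each attention head adds a penalty proportional to the distance |i - j| between the query and key positions. The slope schedule below follows the original ALiBi paper and assumes the number of heads is a power of two. The function names are hypothetical and this is not JinaBert's actual code.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8*h/num_heads) for h = 1..num_heads.

    Assumption: this is the schedule from the original ALiBi paper
    for a power-of-two number of heads.
    """
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def symmetric_alibi_bias(seq_len, slope):
    """Bias matrix with bias[i][j] = -slope * |i - j|.

    The absolute value makes the penalty identical in both directions,
    which is what makes the variant symmetric (bidirectional).
    """
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(8)                    # one slope per attention head
bias = symmetric_alibi_bias(4, slopes[0])   # bias added to the attention scores of head 0
```

Because the bias depends only on relative distance, the same rule applies to any sequence length, which is how ALiBi-style models can extend to longer inputs.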

Training Data

The backbone, jina-bert-v2-base-en, is pre-trained on the C4 dataset and further trained on over 400 million sentence pairs and hard negatives collected by Jina AI, spanning various domains and subjected to meticulous cleaning.

Run Jina-embeddings-v2 with an API

Running the API with Clarifai's Python SDK

You can run the Jina-embeddings-v2 Model API using Clarifai’s Python SDK.

Export your personal access token (PAT) as an environment variable; you can find your PAT in your security settings. Then import and initialize the API client.

# In your shell: make the PAT available to the SDK
export CLARIFAI_PAT={your personal access token}

# In Python: import the Model client
from clarifai.client.model import Model

text = '''In India Green Revolution commenced in the early 1960s that led to an increase in food grain production, especially in Punjab, Haryana, and Uttar Pradesh. Major milestones in this undertaking were the development of high-yielding varieties of wheat. The Green revolution is revolutionary in character due to the introduction of new technology, new ideas, the new application of inputs like HYV seeds, fertilizers, irrigation water, pesticides, etc. As all these were brought suddenly and spread quickly to attain dramatic results thus it is termed as a revolution in green agriculture.
'''

# Model Predict
model_prediction = Model("https://clarifai.com/jinaai/jina-embeddings/models/jina-embeddings-v2-base-en").predict_by_bytes(text.encode(), "text")

# The embedding vector returned for the input text
embeddings = model_prediction.outputs[0].data.embeddings[0].vector

# The number of dimensions of the embedding vector
num_dimensions = model_prediction.outputs[0].data.embeddings[0].num_dimensions
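
Once you have embedding vectors, a common next step is to compare them. The following is a minimal sketch of cosine similarity in plain Python. The short toy vectors stand in for real model outputs, and the function name is hypothetical rather than part of the Clarifai SDK.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors used as placeholders for real embeddings.
query = [0.1, 0.3, 0.5]
doc_a = [0.1, 0.29, 0.52]   # points in nearly the same direction as the query
doc_b = [-0.4, 0.2, -0.1]   # points in a different direction

score_a = cosine_similarity(query, doc_a)
score_b = cosine_similarity(query, doc_b)
```

A higher score means the two texts are closer in embedding space, so here doc_a would rank above doc_b for the query.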

You can also run the Jina-embeddings-v2 API using other Clarifai client options, such as Java, cURL, NodeJS, and PHP.

Using cURL to Make a Direct HTTP Call

To make a direct HTTP call to the Jina-embeddings-v2 API using cURL, you can use the following command:

curl -X POST "https://api.clarifai.com/v2/users/jinaai/apps/jina-embeddings/models/jina-embeddings-v2-base-en/versions/bb64ef0192984bea83688b90f88572ff/outputs" \
    -H "Authorization: Key YOUR_PAT_HERE" \
    -H "Content-Type: application/json" \
    -d '{
    "inputs": [
        {
            "data": {
                "text": {
                    "raw": "You are a good boy."
                }
            }
        }
    ]
}'
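
The same request can be assembled in Python with only the standard library. This sketch reuses the URL, headers, and JSON body from the cURL command above. The function name is hypothetical, and the response fields in the commented-out section are an assumption that the JSON mirrors the SDK object shown earlier.

```python
import json
import os
import urllib.request

API_URL = (
    "https://api.clarifai.com/v2/users/jinaai/apps/jina-embeddings/models/"
    "jina-embeddings-v2-base-en/versions/bb64ef0192984bea83688b90f88572ff/outputs"
)


def build_request(text, pat):
    """Build (but do not send) the POST request for one text input."""
    payload = {"inputs": [{"data": {"text": {"raw": text}}}]}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Key {pat}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Falls back to the placeholder when CLARIFAI_PAT is not set.
request = build_request(
    "You are a good boy.", os.environ.get("CLARIFAI_PAT", "YOUR_PAT_HERE")
)

# Sending is left commented out so the sketch runs without network access
# or a valid PAT. The response layout below is an assumption.
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
#     vector = result["outputs"][0]["data"]["embeddings"][0]["vector"]
```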

OpenAI Embedding vs Jina Embedding

Compared with OpenAI's text-embedding-ada-002, Jina-embeddings-v2 performs well on several MTEB metrics. Below is a performance comparison:

| Rank | Model | Model Size (GB) | Embedding Dimensions | Sequence Length | Average (56 datasets) | Classification Average (12 datasets) | Reranking Average (4 datasets) | Retrieval Average (15 datasets) | Summarization Average (1 dataset) |
|---|---|---|---|---|---|---|---|---|---|
| 15 | text-embedding-ada-002 | Unknown | 1536 | 8191 | 60.99 | 70.93 | 84.89 | 56.32 | 30.8 |
| 17 | jina-embeddings-v2-base-en | 0.27 | 768 | 8192 | 60.38 | 73.45 | 85.38 | 56.98 | 31.6 |

Notably, jina-embeddings-v2-base-en outperforms its OpenAI counterpart in Classification Average, Reranking Average, Retrieval Average, and Summarization Average, while text-embedding-ada-002 keeps a slightly higher overall average (60.99 vs. 60.38).
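
The per-metric gaps can be checked with a few lines of Python. This is a small sketch that copies the scores from the comparison table above; the variable names are illustrative only.

```python
# Scores copied from the comparison table above.
ada = {
    "Average": 60.99,
    "Classification": 70.93,
    "Reranking": 84.89,
    "Retrieval": 56.32,
    "Summarization": 30.8,
}
jina = {
    "Average": 60.38,
    "Classification": 73.45,
    "Reranking": 85.38,
    "Retrieval": 56.98,
    "Summarization": 31.6,
}

# Positive delta: jina-embeddings-v2-base-en scores higher on that metric.
deltas = {metric: round(jina[metric] - ada[metric], 2) for metric in ada}

# Metrics on which jina-embeddings-v2-base-en leads, in table order.
jina_leads = [metric for metric, delta in deltas.items() if delta > 0]

for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.2f}")
```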

Use Cases

Jina-embeddings-v2's extended context capabilities unlock new possibilities for a range of industry applications including:

  • Legal Document Analysis: Capture and analyze intricate details in extensive legal texts effectively.
  • Medical Research: Holistically embed scientific papers for advanced analytics and discoveries in the field of medicine.
  • Literary Analysis: Dive deep into long-form content, allowing the capture of nuanced thematic elements in literature.
  • Financial Forecasting: Attain superior insights from detailed financial reports, making it a valuable tool for finance professionals.
  • Conversational AI: Enhance chatbot responses to intricate user queries with a deeper understanding of context.

Evaluation

Benchmarking shows that, on several datasets, jina-embeddings-v2's extended context length offers practical advantages, and the model outperforms other leading base-size embedding models on various tasks.

  • Name
    jina-embeddings-v2-base-en
  • Model Type ID
    Text Embedder
  • Last Updated
    Nov 06, 2023
  • Privacy
    PUBLIC