llava-v1_6-mistral-7b

LLaVA-v1.6-Mistral-7B: A high-performance, efficient, and cross-lingual large multimodal model, boasting state-of-the-art capabilities in visual reasoning, OCR, and zero-shot Chinese multimodal understanding

Inference Parameters

  • max_tokens: The maximum number of tokens to generate. Shorter token lengths provide faster performance.
  • temperature: A decimal number that determines the degree of randomness in the response.
  • top_k: Limits the model's predictions to the k most probable tokens at each step of generation.
  • top_p: An alternative to sampling with temperature, where the model considers only the tokens whose cumulative probability mass reaches top_p.
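
These settings map onto the inference_params dictionary passed to the prediction call shown later on this page. A minimal sketch: temperature and max_tokens appear in the official example below, while the top_k and top_p key names are assumptions matched to the parameter list above.

# Example generation settings; the top_k / top_p key names are assumptions.
inference_params = dict(
    max_tokens=100,   # cap on generated tokens (shorter = faster)
    temperature=0.2,  # lower values make output more deterministic
    top_k=50,         # restrict sampling to the 50 most probable tokens
    top_p=0.9,        # nucleus sampling over 90% of probability mass
)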

Notes

  • Datatype: 4-bit bitsandbytes NF4 quantized model
  • Model Source

Introduction

LLaVA-NeXT-Mistral-7B, also known as LLaVA-1.6-Mistral-7B, is part of the latest suite of models in the LLaVA series, released to build upon the foundation of the successful LLaVA-1.5. The model is designed to integrate and enhance capabilities in visual reasoning, OCR, and world knowledge, and it brings significant improvements over its predecessor: enhanced image processing, better logical reasoning, and more efficient deployment.

LLaVA-NeXT-Mistral-7B LMM

LLaVA-NeXT-Mistral-7B is engineered to offer superior performance in handling high-resolution multimodal data. The model utilizes the Mistral-7B language model backbone, integrated with advancements in dynamic resolution and efficient processing capabilities. This version supports diverse aspect ratios and is optimized for both quality and computational efficiency, making it suitable for robust real-time applications.
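
To make the dynamic-resolution idea concrete, the sketch below illustrates a common "AnyRes"-style scheme rather than the model's exact implementation: choose the grid of fixed-size tiles whose aspect ratio best matches the input image, split the resized image into those tiles, and append a downscaled global view. The 336-pixel tile size, candidate grids, and helper names are illustrative assumptions.

from PIL import Image

TILE = 336  # illustrative tile size (a typical vision-encoder input resolution)
GRIDS = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]  # assumed candidate layouts

def best_grid(width: int, height: int) -> tuple[int, int]:
    # Choose the (cols, rows) layout whose aspect ratio best matches the image.
    return min(GRIDS, key=lambda g: abs(g[0] / g[1] - width / height))

def anyres_tiles(image: Image.Image) -> list[Image.Image]:
    # Resize to the chosen grid, cut into fixed-size tiles, and append a
    # downscaled copy of the whole image to preserve global context.
    cols, rows = best_grid(*image.size)
    resized = image.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]
    tiles.append(image.resize((TILE, TILE)))
    return tiles

In LLaVA-NeXT's published design, each tile is encoded separately by the vision encoder alongside the global view, which is what lets the model handle higher resolutions and varied aspect ratios efficiently.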

Run LLaVA-NeXT with an API

Running the API with Clarifai's Python SDK

You can run the LLaVA-NeXT Model API using Clarifai’s Python SDK.

Export your PAT as an environment variable. Then, import and initialize the API Client.

Find your PAT in your security settings.

export CLARIFAI_PAT={your personal access token}
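
The SDK picks up CLARIFAI_PAT from the environment automatically. If you prefer to pass the token explicitly, here is a minimal sketch, assuming the Model constructor accepts a pat keyword:

import os
from clarifai.client.model import Model

# Initialize the client; `pat` (assumed keyword) passes the token explicitly.
# Otherwise the SDK falls back to the CLARIFAI_PAT environment variable.
model = Model(
    "https://clarifai.com/liuhaotian/llava/models/llava-v1_6-mistral-7b",
    pat=os.environ["CLARIFAI_PAT"],
)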

Predict via Image URL

from clarifai.client.model import Model
from clarifai.client.input import Inputs

prompt = "What time of day is it?"
image_url = "https://samples.clarifai.com/metro-north.jpg"
inference_params = dict(temperature=0.2, max_tokens=100)

# Build a multimodal input pairing the image with the text prompt,
# then run the prediction against the hosted model.
model_prediction = Model(
    "https://clarifai.com/liuhaotian/llava/models/llava-v1_6-mistral-7b"
).predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_url=image_url, raw_text=prompt)],
    inference_params=inference_params,
)

# The generated answer is returned as raw text.
print(model_prediction.outputs[0].data.text.raw)
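
To run the same prediction on a local file instead of a hosted URL, here is a short sketch, assuming the input helper's image_bytes parameter mirrors image_url and that a local metro-north.jpg exists:

from clarifai.client.model import Model
from clarifai.client.input import Inputs

# Read a local image; the filename is a placeholder.
with open("metro-north.jpg", "rb") as f:
    image_bytes = f.read()

model_prediction = Model(
    "https://clarifai.com/liuhaotian/llava/models/llava-v1_6-mistral-7b"
).predict(
    inputs=[
        Inputs.get_multimodal_input(
            input_id="", image_bytes=image_bytes, raw_text="What time of day is it?"
        )
    ],
    inference_params=dict(temperature=0.2, max_tokens=100),
)

print(model_prediction.outputs[0].data.text.raw)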

You can also run the LLaVA-1.6 API using other Clarifai client libraries, such as Java, cURL, NodeJS, and PHP.

Use Cases

LLaVA-NeXT-Mistral-7B is particularly suitable for applications involving:

  • Content Moderation: Identifying harmful content in multimodal inputs.
  • Educational Tools: Assisting in learning platforms by providing detailed explanations of visual and textual content.
  • Visual Reasoning and OCR: Improved OCR capabilities enable reliable text extraction and interpretation from images, useful in document analysis and accessibility tools.
  • Multimodal Conversations: Enhanced to support more complex conversational scenarios involving visuals, beneficial for customer support and interactive systems.
  • Market Analysis: Extracting insights from mixed data sources like charts, graphs, and written reports.

Evaluation

The evaluation of LLaVA-NeXT-Mistral-7B focuses on its performance across a diverse range of benchmarks designed to test various aspects of multimodal understanding, including visual question answering, text-based image analysis, and general visual reasoning. This section provides a detailed comparison of the LLaVA-NeXT-Mistral-7B model with other leading models in the domain to highlight its strengths and performance metrics.

Evaluation Metrics and Benchmarks

The performance of each model is assessed on the following benchmarks:

  • VQAv2: Visual Question Answering, testing the model's ability to answer questions based on image content.
  • GQA: Visual reasoning on real-world images with compositional questions.
  • VizWiz: Assessing the model’s capability to aid visually impaired users by answering questions about their surroundings.
  • TextVQA: Evaluating OCR capabilities to answer questions based on text in images.
  • ScienceQA: Testing scientific diagram understanding and answering related questions.

Evaluation Table

The following table outlines the performance scores of LLaVA-NeXT-Mistral-7B compared to other notable models in the field:

Model                    VQAv2   GQA    VizWiz   TextVQA   ScienceQA
LLaVA-NeXT-Mistral-7B    82.2    64.8   60.0     65.7      72.8
LLaVA-NeXT-Vicuna-13B    82.8    65.4   60.5     67.1      73.6
LLaVA-NeXT-34B           83.7    67.1   63.8     69.5      81.8
Gemini Pro               71.2    -      -        74.6      -
Gemini Ultra             77.8    -      -        82.3      -
PaLI-X                   86.0    -      -        71.4      -
CogVLM-30B               83.4    65.2   76.4     68.1      92.7
LLaVA-1.5-13B            80.0    63.3   53.6     61.3      71.6
Qwen-VL-Plus             -       -      -        -         -

Observations:

  • VQAv2 Performance: LLaVA-NeXT-Mistral-7B achieves competitive scores, slightly below CogVLM-30B but outperforming Gemini Pro.
  • GQA Performance: LLaVA-NeXT-Mistral-7B demonstrates comparable performance to other models in this benchmark.
  • VizWiz Performance: LLaVA-NeXT-Mistral-7B performs similarly to other models like LLaVA-NeXT-Vicuna-13B.
  • TextVQA Performance: LLaVA-NeXT-Mistral-7B shows promising performance, surpassing Gemini Pro and LLaVA-1.5-13B.
  • ScienceQA Performance: LLaVA-NeXT-Mistral-7B exhibits strong performance, ranking among the top models in this benchmark.

Overall, LLaVA-NeXT-Mistral-7B demonstrates competitive performance across a range of benchmarks, indicating its efficacy in various multimodal tasks.

Dataset

The training dataset comprises 1.3 million samples, a mixture of high-quality multimodal data drawn from both proprietary and publicly available sources. Significant datasets include:

  • DocVQA and SynthDoG-EN for OCR capabilities.
  • ChartQA, DVQA, and AI2D for chart and diagram understanding.
  • A small, curated 15K visual instruction dataset for robust response generation in varied applications.

Advantages

  • High-resolution Image Processing: Capable of interpreting images with a higher pixel count, allowing for more detailed analysis.
  • Versatile: Supports multiple aspect ratios effectively, making it adaptable to various types of visual content.
  • State-of-the-Art Performance: Leads in several key benchmarks, particularly in zero-shot Chinese multimodal scenarios.

Limitations

  • Limited Language Support: While it excels in Chinese in a zero-shot manner, the performance in other non-English languages has not been explicitly reported.
  • Resource Intensity for High-Resolution: Although optimized, processing higher resolution images inherently demands more from computational resources, potentially limiting deployment in resource-constrained environments.

Model Metadata

  • ID: llava-v1_6-mistral-7b
  • Model Type ID: Multimodal To Text
  • Input Type: image
  • Output Type: text
  • Last Updated: Oct 17, 2024
  • Privacy: PUBLIC