Phi-3-Vision-128K-Instruct is a state-of-the-art, multimodal model from the Phi-3 family, developed by Microsoft. This model is designed to handle both language and vision tasks, making it suitable for various applications that require understanding and reasoning over text and images. With 4.2 billion parameters and a context length of 128K tokens, Phi-3-Vision-128K-Instruct is optimized for high-quality performance in memory-constrained and latency-sensitive environments.
Phi-3-vision LMM
Phi-3-Vision-128K-Instruct is part of the Phi-3 family of small language models (SLMs), known for their superior performance and cost-effectiveness compared to similarly sized models. These models are instruction-tuned and adhere to Microsoft’s responsible AI principles, ensuring readiness for immediate use. Phi-3-Vision is the first multimodal model in this family, integrating language and vision capabilities to enable robust reasoning over real-world images, text extraction, and analysis of charts and diagrams.
Run Phi-3-vision with an API
Running the API with Clarifai's Python SDK
You can run the Phi-3-vision Model API using Clarifai’s Python SDK.
Export your PAT as an environment variable. Then, import and initialize the API Client.
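For example, you can set the token from within Python before initializing the client (a minimal sketch; the value shown is a hypothetical placeholder):

import os

# The Clarifai SDK reads your Personal Access Token from the CLARIFAI_PAT
# environment variable; replace the placeholder below with your actual PAT.
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

Alternatively, run export CLARIFAI_PAT="YOUR_PAT_HERE" in your shell before starting Python.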
from clarifai.client.model import Model
from clarifai.client.input import Inputs

prompt = "What time of day is it?"
image_url = "https://samples.clarifai.com/metro-north.jpg"

# Optional parameters that control text generation
inference_params = dict(temperature=0.2, max_tokens=100, top_k=50, top_p=0.9)

# Predict with an image URL and a text prompt as a single multimodal input
model_prediction = Model(
    "https://clarifai.com/microsoft/text-generation/models/phi-3-vision-128k-instruct"
).predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_url=image_url, raw_text=prompt)],
    inference_params=inference_params,
)

print(model_prediction.outputs[0].data.text.raw)
You can also run the Phi-3-vision API using other Clarifai client libraries, such as Java, cURL, NodeJS, and PHP, available here.
Predict via Local Image
from clarifai.client.model import Model
from clarifai.client.input import Inputs

IMAGE_FILE_LOCATION = 'LOCAL IMAGE PATH'

# Read the image from disk as raw bytes
with open(IMAGE_FILE_LOCATION, "rb") as f:
    file_bytes = f.read()

prompt = "What time of day is it?"
inference_params = dict(temperature=0.2, max_tokens=100, top_k=50, top_p=0.9)

# Predict with image bytes and a text prompt as a single multimodal input
model_prediction = Model(
    "https://clarifai.com/microsoft/text-generation/models/phi-3-vision-128k-instruct"
).predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_bytes=file_bytes, raw_text=prompt)],
    inference_params=inference_params,
)

print(model_prediction.outputs[0].data.text.raw)
Use Cases
Phi-3-Vision-128K-Instruct is designed for broad commercial and research applications in English. Its primary use cases include:
General Image Understanding: Analyzing and interpreting visual content in images.
OCR (Optical Character Recognition): Extracting and understanding text from images.
Chart and Table Understanding: Generating insights and answering questions based on data presented in charts, tables, and diagrams (see the sketch after this list).
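As a sketch of the chart-understanding use case, the snippet below reuses the same prediction API shown earlier with a chart-oriented prompt. The image URL and prompt here are illustrative assumptions, not official examples:

from clarifai.client.model import Model
from clarifai.client.input import Inputs

# Hypothetical chart image URL; point this at a real chart you want to analyze
image_url = "https://example.com/quarterly-sales-chart.png"
prompt = "What is the overall trend in this chart, and which category grows fastest?"

model_prediction = Model(
    "https://clarifai.com/microsoft/text-generation/models/phi-3-vision-128k-instruct"
).predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_url=image_url, raw_text=prompt)],
    inference_params=dict(temperature=0.2, max_tokens=150),
)

print(model_prediction.outputs[0].data.text.raw)

The same pattern applies to OCR: swap the prompt for an instruction such as "Transcribe all text visible in this image."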
Evaluation and Benchmark Results
Phi-3-Vision-128K-Instruct has been rigorously evaluated across various benchmarks, demonstrating strong performance:
| Benchmark | Phi-3-Vision-128K-Instruct | LLaVA-1.6 Vicuna-7B | QWEN-VL Chat | Llama3-LLaVA-Next-8B | Claude-3 Haiku | Gemini 1.0 Pro V | GPT-4V-Turbo |
|-----------|----------------------------|---------------------|--------------|----------------------|----------------|------------------|--------------|
| MMMU      | 40.4 | 34.2 | 39.0 | 36.4 | 40.7 | 42.0 | 55.5 |
| MMBench   | 80.5 | 76.3 | 75.8 | 79.4 | 62.4 | 80.0 | 86.1 |
| ScienceQA | 90.8 | 70.6 | 67.2 | 73.7 | 72.0 | 79.7 | 75.7 |
| MathVista | 44.5 | 31.5 | 29.4 | 34.8 | 33.2 | 35.0 | 47.5 |
| InterGPS  | 38.1 | 20.5 | 22.3 | 24.6 | 32.1 | 28.6 | 41.0 |
| AI2D      | 76.7 | 63.1 | 59.8 | 66.9 | 60.3 | 62.8 | 74.7 |
| ChartQA   | 81.4 | 55.0 | 50.9 | 65.8 | 59.3 | 58.0 | 62.3 |
| TextVQA   | 70.9 | 64.6 | 59.4 | 55.7 | 62.7 | 64.7 | 68.1 |
| POPE      | 85.8 | 87.2 | 82.6 | 87.0 | 74.4 | 84.2 | 83.7 |
Dataset
The Phi-3-Vision-128K-Instruct model was trained on a comprehensive dataset comprising 500 billion vision and text tokens. The training data sources include:
Publicly Available Documents: Carefully filtered for quality to ensure the highest standards.
Educational Data and Code: Selected for its high quality and relevance.
Synthetic Data: Designed to teach math, coding, common sense reasoning, general knowledge, and more.
Supervised Data in Chat Format: Covering diverse topics and reflecting human preferences in instruction-following, truthfulness, honesty, and helpfulness.
The data collection process emphasized privacy and quality, filtering out any potentially personal data and undesirable documents.
Advantages
High-Quality Multimodal Understanding: Capable of robust reasoning over text and images.
Cost-Effective: Outperforms larger models at a lower computational cost.
Broad Applicability: Suitable for various commercial and research applications.
Optimized for Charts and Diagrams: Enhanced capabilities for understanding and generating insights from visual data representations.
Large Context Length: Supports up to 128K tokens, ideal for complex tasks requiring extensive context.
Limitations
Not Suitable for All Use Cases: While versatile, the model may not be suitable for high-risk scenarios without additional evaluation and mitigation.
Language Limitation: Primarily designed for use in English.
Compliance and Safety: Developers must ensure their use case complies with relevant laws and regulations, particularly regarding privacy and safety.
Instruction Adherence: While fine-tuned for following instructions, the model may still require human oversight in specific applications.
ID: phi-3-vision-128k-instruct
Model Type ID: Multimodal To Text
Input Type: image
Output Type: text
Description: The Phi-3-Vision-128K-Instruct is a high-performance, cost-effective multimodal model for advanced text and image understanding tasks.