Phi-3-Vision-128K-Instruct is a state-of-the-art, multimodal model from the Phi-3 family, developed by Microsoft. This model is designed to handle both language and vision tasks, making it suitable for various applications that require understanding and reasoning over text and images. With 4.2 billion parameters and a context length of 128K tokens, Phi-3-Vision-128K-Instruct is optimized for high-quality performance in memory-constrained and latency-sensitive environments.
Phi-3-vision LMM
Phi-3-Vision-128K-Instruct is part of the Phi-3 family of small language models (SLMs), known for their superior performance and cost-effectiveness compared to similarly sized models. These models are instruction-tuned and adhere to Microsoft’s responsible AI principles, ensuring readiness for immediate use. Phi-3-Vision is the first multimodal model in this family, integrating language and vision capabilities to enable robust reasoning over real-world images, text extraction, and analysis of charts and diagrams.
Run Phi-3-vision with an API
Running the API with Clarifai's Python SDK
You can run the Phi-3-vision Model API using Clarifai’s Python SDK.
Export your PAT as an environment variable. Then, import and initialize the API Client.
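For example, you can set the token from within Python before initializing the client (a minimal sketch; the value shown is a hypothetical placeholder):

import os

# The Clarifai SDK reads your Personal Access Token from the CLARIFAI_PAT
# environment variable; replace the placeholder below with your actual PAT.
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

Alternatively, run export CLARIFAI_PAT="YOUR_PAT_HERE" in your shell before starting Python.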
from clarifai.client.model import Model
from clarifai.client.input import Inputs

prompt = "What time of day is it?"
image_url = "https://samples.clarifai.com/metro-north.jpg"

# Optional parameters that control text generation
inference_params = dict(temperature=0.2, max_tokens=100, top_k=50, top_p=0.9)

# Predict with an image URL and a text prompt as a single multimodal input
model_prediction = Model(
    "https://clarifai.com/microsoft/text-generation/models/phi-3-vision-128k-instruct"
).predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_url=image_url, raw_text=prompt)],
    inference_params=inference_params,
)

print(model_prediction.outputs[0].data.text.raw)
You can also run the Phi-3-vision API using other Clarifai client libraries, such as Java, cURL, NodeJS, and PHP, available here.
Predict via Local Image
from clarifai.client.model import Model
from clarifai.client.input import Inputs

IMAGE_FILE_LOCATION = 'LOCAL IMAGE PATH'

# Read the image from disk as raw bytes
with open(IMAGE_FILE_LOCATION, "rb") as f:
    file_bytes = f.read()

prompt = "What time of day is it?"
inference_params = dict(temperature=0.2, max_tokens=100, top_k=50, top_p=0.9)

# Predict with image bytes and a text prompt as a single multimodal input
model_prediction = Model(
    "https://clarifai.com/microsoft/text-generation/models/phi-3-vision-128k-instruct"
).predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_bytes=file_bytes, raw_text=prompt)],
    inference_params=inference_params,
)

print(model_prediction.outputs[0].data.text.raw)
Use Cases
Phi-3-Vision-128K-Instruct is designed for broad commercial and research applications in English. Its primary use cases include:
General Image Understanding: Analyzing and interpreting visual content in images.
OCR (Optical Character Recognition): Extracting and understanding text from images.
Chart and Table Understanding: Generating insights and answering questions based on data presented in charts, tables, and diagrams (see the sketch after this list).
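As a sketch of the chart-understanding use case, the snippet below reuses the same prediction API shown earlier with a chart-oriented prompt. The image URL and prompt here are illustrative assumptions, not official examples:

from clarifai.client.model import Model
from clarifai.client.input import Inputs

# Hypothetical chart image URL; point this at a real chart you want to analyze
image_url = "https://example.com/quarterly-sales-chart.png"
prompt = "What is the overall trend in this chart, and which category grows fastest?"

model_prediction = Model(
    "https://clarifai.com/microsoft/text-generation/models/phi-3-vision-128k-instruct"
).predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_url=image_url, raw_text=prompt)],
    inference_params=dict(temperature=0.2, max_tokens=150),
)

print(model_prediction.outputs[0].data.text.raw)

The same pattern applies to OCR: swap the prompt for an instruction such as "Transcribe all text visible in this image."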
Evaluation and Benchmark Results
Phi-3-Vision-128K-Instruct has been rigorously evaluated across various benchmarks, demonstrating strong performance:
| Benchmark | Phi-3-Vision-128K-Instruct | LLaVA-1.6 Vicuna-7B | QWEN-VL Chat | Llama3-LLaVA-Next-8B | Claude-3 Haiku | Gemini 1.0 Pro V | GPT-4V-Turbo |
|-----------|----------------------------|---------------------|--------------|----------------------|----------------|------------------|--------------|
| MMMU      | 40.4 | 34.2 | 39.0 | 36.4 | 40.7 | 42.0 | 55.5 |
| MMBench   | 80.5 | 76.3 | 75.8 | 79.4 | 62.4 | 80.0 | 86.1 |
| ScienceQA | 90.8 | 70.6 | 67.2 | 73.7 | 72.0 | 79.7 | 75.7 |
| MathVista | 44.5 | 31.5 | 29.4 | 34.8 | 33.2 | 35.0 | 47.5 |
| InterGPS  | 38.1 | 20.5 | 22.3 | 24.6 | 32.1 | 28.6 | 41.0 |
| AI2D      | 76.7 | 63.1 | 59.8 | 66.9 | 60.3 | 62.8 | 74.7 |
| ChartQA   | 81.4 | 55.0 | 50.9 | 65.8 | 59.3 | 58.0 | 62.3 |
| TextVQA   | 70.9 | 64.6 | 59.4 | 55.7 | 62.7 | 64.7 | 68.1 |
| POPE      | 85.8 | 87.2 | 82.6 | 87.0 | 74.4 | 84.2 | 83.7 |
Dataset
The Phi-3-Vision-128K-Instruct model was trained on a comprehensive dataset comprising 500 billion vision and text tokens. The training data sources include:
Publicly Available Documents: Carefully filtered for quality to ensure the highest standards.
Educational Data and Code: Selected for its high quality and relevance.
Synthetic Data: Designed to teach math, coding, common sense reasoning, general knowledge, and more.
Supervised Data in Chat Format: Covering diverse topics and reflecting human preferences in instruction-following, truthfulness, honesty, and helpfulness.
The data collection process emphasized privacy and quality, filtering out any potentially personal data and undesirable documents.
Advantages
High-Quality Multimodal Understanding: Capable of robust reasoning over text and images.
Cost-Effective: Outperforms larger models at a lower computational cost.
Broad Applicability: Suitable for various commercial and research applications.
Optimized for Charts and Diagrams: Enhanced capabilities for understanding and generating insights from visual data representations.
Large Context Length: Supports up to 128K tokens, ideal for complex tasks requiring extensive context.
Limitations
Not Suitable for All Use Cases: While versatile, the model may not be suitable for high-risk scenarios without additional evaluation and mitigation.
Language Limitation: Primarily designed for use in English.
Compliance and Safety: Developers must ensure their use case complies with relevant laws and regulations, particularly regarding privacy and safety.
Instruction Adherence: While fine-tuned for following instructions, the model may still require human oversight in specific applications.
ID: phi-3-vision-128k-instruct
Model Type ID: Multimodal To Text
Input Type: image
Output Type: text
Description: The Phi-3-Vision-128K-Instruct is a high-performance, cost-effective multimodal model for advanced text and image understanding tasks.