GPT-4 Vision extends GPT-4 with the ability to understand and answer questions about images, expanding its capabilities beyond text-only processing.
max_tokens: The maximum number of tokens to generate. Shorter token limits return responses faster.
temperature: A decimal number that controls the degree of randomness in the response.
top_p: An alternative to sampling with temperature, in which the model considers only the tokens comprising the top_p probability mass.
Notes
GPT-4 Vision
Note: This model will be deprecated on December 6. Please use GPT-4o instead.
The GPT-4 Vision model extends the traditional GPT-4 model by incorporating image processing. This allows the model to analyze images and respond to questions about them, expanding its applicability beyond text-based inputs.
Notable points:
GPT-4 with vision maintains the same behavior as GPT-4, with the addition of image processing capabilities.
It is an augmentative set of capabilities, enhancing the model's versatility without compromising its performance on text-related tasks.
Run GPT-4 Vision with an API
Running the API with Clarifai's Python SDK
You can run the GPT-4 Vision Model API using Clarifai’s Python SDK.
Export your PAT as an environment variable. Then, import and initialize the API Client.
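A minimal sketch of the export step, assuming the SDK reads the `CLARIFAI_PAT` environment variable; `YOUR_PAT` is a placeholder for your actual Personal Access Token.

```shell
# Set your Clarifai Personal Access Token (PAT) so the Python SDK can
# authenticate. "YOUR_PAT" is a placeholder; replace it with your token.
export CLARIFAI_PAT="YOUR_PAT"
```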
from clarifai.client.model import Model
from clarifai.client.input import Inputs

prompt = "What time of day is it?"
image_url = "https://samples.clarifai.com/metro-north.jpg"
inference_params = dict(temperature=0.2, max_tokens=100)

model_prediction = Model("https://clarifai.com/openai/chat-completion/models/openai-gpt-4-vision").predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_url=image_url, raw_text=prompt)],
    inference_params=inference_params,
)

print(model_prediction.outputs[0].data.text.raw)
You can also run the GPT-4 Vision API using other Clarifai client libraries, such as Java, cURL, NodeJS, and PHP.
Predict via a Local Image
from clarifai.client.model import Model
from clarifai.client.input import Inputs

IMAGE_FILE_LOCATION = 'LOCAL IMAGE PATH'
with open(IMAGE_FILE_LOCATION, "rb") as f:
    file_bytes = f.read()

prompt = "What time of day is it?"
inference_params = dict(temperature=0.2, max_tokens=100)

model_prediction = Model("https://clarifai.com/openai/chat-completion/models/openai-gpt-4-vision").predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_bytes=file_bytes, raw_text=prompt)],
    inference_params=inference_params,
)

print(model_prediction.outputs[0].data.text.raw)
Using cURL to Make a Direct HTTP Call
To make a direct HTTP call to the GPT-4 Vision API using cURL, you can use the following command:
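The following is a sketch, assuming the standard Clarifai v2 `outputs` endpoint; the user, app, and model IDs in the path are taken from the model URL used in the Python examples above, and `YOUR_PAT` is a placeholder for your Personal Access Token.

```shell
# POST a text prompt plus an image URL to the GPT-4 Vision model.
# "YOUR_PAT" is a placeholder; the endpoint path mirrors the model URL
# (openai / chat-completion / openai-gpt-4-vision) used in the SDK examples.
curl -X POST "https://api.clarifai.com/v2/users/openai/apps/chat-completion/models/openai-gpt-4-vision/outputs" \
  -H "Authorization: Key YOUR_PAT" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "data": {
          "text": { "raw": "What time of day is it?" },
          "image": { "url": "https://samples.clarifai.com/metro-north.jpg" }
        }
      }
    ]
  }'
```

The generated text is returned under `outputs[0].data.text.raw` in the JSON response, matching the field accessed in the Python examples above.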
The introduction of vision capabilities in GPT-4 brings several enhancements to its utility. Key improvements include:
Multimodal Inputs: GPT-4 with Vision allows the model to handle both text and image inputs, expanding its range of applications.
Versatility: Developers can now create applications that involve questions about images, enabling innovative use cases.
Use Cases
GPT-4 with Vision can be applied in various scenarios, including but not limited to:
Image-based Question Answering: The model can answer questions related to the content of images.
Multimodal Applications: Developers can build applications that leverage both text and image inputs for a richer user experience.
Limitations
Despite its powerful capabilities, GPT-4 with Vision has some limitations that users should be aware of:
Medical Images: Not suitable for interpreting specialized medical images like CT scans.
Non-English Text: May not perform optimally with images containing non-Latin alphabets.
Rotation: May misinterpret rotated or upside-down text or images.
Visual Elements: Struggles with understanding complex visual elements like graphs.
Spatial Reasoning: Difficulty in tasks requiring precise spatial localization.
Accuracy: May generate incorrect descriptions or captions in certain scenarios.
Image Shape: Struggles with panoramic and fisheye images.
Metadata and Resizing: Doesn't process original file names or metadata; images are resized before analysis.
Counting: May provide approximate counts for objects in images.
CAPTCHAs: Submission of CAPTCHAs is blocked for safety reasons.
Disclaimer
Please be advised that this model utilizes wrapped Artificial Intelligence (AI) provided by OpenAI (the "Vendor"). These AI models may collect, process, and store data as part of their operations. By using our website and accessing these AI models, you hereby consent to the data practices of the Vendor. We do not have control over the data collection, processing, and storage practices of the Vendor. Therefore, we cannot be held responsible or liable for any data handling practices, data loss, or breaches that may occur. It is your responsibility to review the privacy policies and terms of service of the Vendor to understand their data practices. You can access the Vendor's privacy policy and terms of service at https://openai.com/policies/privacy-policy. We disclaim all liability with respect to the actions or omissions of the Vendor, and we encourage you to exercise caution and to ensure that you are comfortable with these practices before utilizing the AI models hosted on our site.
ID:
Model Type ID: Multimodal To Text
Input Type: image
Output Type: text
Description: GPT-4 Vision extends GPT-4 with the ability to understand and answer questions about images, expanding its capabilities beyond text-only processing.