GPT-4 Vision extends GPT-4 with the ability to understand and answer questions about images, expanding its capabilities beyond text-only processing.
max_tokens: The maximum number of tokens to generate. Shorter token limits return responses faster.
temperature: A decimal number that controls the degree of randomness in the response.
top_p: An alternative to sampling with temperature, in which the model considers only the tokens comprising the top_p probability mass.
Notes
GPT-4 Vision
Note: This model will be deprecated on December 6. Please use GPT-4o instead.
The GPT-4 Vision model extends the traditional GPT-4 model by incorporating image processing. This allows the model to analyze images and respond to questions about them, expanding its applicability beyond text-based inputs.
Notable points:
GPT-4 with vision maintains the same behavior as GPT-4, with the addition of image processing capabilities.
It is an augmentative set of capabilities, enhancing the model's versatility without compromising its performance on text-related tasks.
Run GPT-4 Vision with an API
Running the API with Clarifai's Python SDK
You can run the GPT-4 Vision Model API using Clarifai’s Python SDK.
Export your PAT as an environment variable. Then, import and initialize the API Client.
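A minimal sketch of the export step, assuming the SDK reads the `CLARIFAI_PAT` environment variable; `YOUR_PAT` is a placeholder for your actual Personal Access Token.

```shell
# Set your Clarifai Personal Access Token (PAT) so the Python SDK can
# authenticate. "YOUR_PAT" is a placeholder; replace it with your token.
export CLARIFAI_PAT="YOUR_PAT"
```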
from clarifai.client.model import Model
from clarifai.client.input import Inputs

prompt = "What time of day is it?"
image_url = "https://samples.clarifai.com/metro-north.jpg"
inference_params = dict(temperature=0.2, max_tokens=100)

model_prediction = Model("https://clarifai.com/openai/chat-completion/models/openai-gpt-4-vision").predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_url=image_url, raw_text=prompt)],
    inference_params=inference_params,
)

print(model_prediction.outputs[0].data.text.raw)
You can also run the GPT-4 Vision API using other Clarifai client libraries, such as Java, cURL, NodeJS, and PHP.
Predict via a Local Image
from clarifai.client.model import Model
from clarifai.client.input import Inputs

IMAGE_FILE_LOCATION = 'LOCAL IMAGE PATH'
with open(IMAGE_FILE_LOCATION, "rb") as f:
    file_bytes = f.read()

prompt = "What time of day is it?"
inference_params = dict(temperature=0.2, max_tokens=100)

model_prediction = Model("https://clarifai.com/openai/chat-completion/models/openai-gpt-4-vision").predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_bytes=file_bytes, raw_text=prompt)],
    inference_params=inference_params,
)

print(model_prediction.outputs[0].data.text.raw)
Using cURL to Make a Direct HTTP Call
To make a direct HTTP call to the GPT-4 Vision API using cURL, you can use the following command:
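The following is a sketch, assuming the standard Clarifai v2 `outputs` endpoint; the user, app, and model IDs in the path are taken from the model URL used in the Python examples above, and `YOUR_PAT` is a placeholder for your Personal Access Token.

```shell
# POST a text prompt plus an image URL to the GPT-4 Vision model.
# "YOUR_PAT" is a placeholder; the endpoint path mirrors the model URL
# (openai / chat-completion / openai-gpt-4-vision) used in the SDK examples.
curl -X POST "https://api.clarifai.com/v2/users/openai/apps/chat-completion/models/openai-gpt-4-vision/outputs" \
  -H "Authorization: Key YOUR_PAT" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "data": {
          "text": { "raw": "What time of day is it?" },
          "image": { "url": "https://samples.clarifai.com/metro-north.jpg" }
        }
      }
    ]
  }'
```

The generated text is returned under `outputs[0].data.text.raw` in the JSON response, matching the field accessed in the Python examples above.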
The introduction of vision capabilities in GPT-4 brings several enhancements to its utility. Key improvements include:
Multimodal Inputs: GPT-4 with Vision allows the model to handle both text and image inputs, expanding its range of applications.
Versatility: Developers can now create applications that involve questions about images, enabling innovative use cases.
Use Cases
GPT-4 with Vision can be applied in various scenarios, including but not limited to:
Image-based Question Answering: The model can answer questions related to the content of images.
Multimodal Applications: Developers can build applications that leverage both text and image inputs for a richer user experience.
Limitations
Despite its powerful capabilities, GPT-4 with Vision has some limitations that users should be aware of:
Medical Images: Not suitable for interpreting specialized medical images like CT scans.
Non-English Text: May not perform optimally with images containing non-Latin alphabets.
Rotation: May misinterpret rotated or upside-down text or images.
Visual Elements: Struggles with understanding complex visual elements like graphs.
Spatial Reasoning: Difficulty in tasks requiring precise spatial localization.
Accuracy: May generate incorrect descriptions or captions in certain scenarios.
Image Shape: Struggles with panoramic and fisheye images.
Metadata and Resizing: Doesn't process original file names or metadata; images are resized before analysis.
Counting: May provide approximate counts for objects in images.
CAPTCHAs: Submission of CAPTCHAs is blocked for safety reasons.
Disclaimer
Please be advised that this model utilizes wrapped Artificial Intelligence (AI) provided by OpenAI (the "Vendor"). These AI models may collect, process, and store data as part of their operations. By using our website and accessing these AI models, you hereby consent to the data practices of the Vendor. We do not have control over the data collection, processing, and storage practices of the Vendor. Therefore, we cannot be held responsible or liable for any data handling practices, data loss, or breaches that may occur. It is your responsibility to review the privacy policies and terms of service of the Vendor to understand their data practices. You can access the Vendor's privacy policy and terms of service at https://openai.com/policies/privacy-policy. We disclaim all liability with respect to the actions or omissions of the Vendor, and we encourage you to exercise caution and to ensure that you are comfortable with these practices before utilizing the AI models hosted on our site.
ID:
Model Type ID: Multimodal To Text
Input Type: image
Output Type: text
Description: GPT-4 Vision extends GPT-4 with the ability to understand and answer questions about images, expanding its capabilities beyond text-only processing.