gemini-pro-vision

Google Gemini Pro Vision was created from the ground up to be multimodal (text, images, videos) and to scale across a wide range of tasks.

Input

Prompt:

  • max_tokens: The maximum number of tokens to generate. Shorter token lengths provide faster performance.
  • temperature: A decimal number that determines the degree of randomness in the response.
  • top_k: Limits the model's predictions to the top k most probable tokens at each step of generation.
  • top_p: An alternative to sampling with temperature, where the model samples from the top p percentage of the most likely tokens.
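These four sampling parameters map directly onto the inference_params dict used in the SDK examples below. A minimal sketch with the same illustrative values those examples use:

```python
# Illustrative sampling settings; tune these for your own use case.
inference_params = dict(
    temperature=0.2,  # lower values give more deterministic output
    top_k=50,         # consider only the 50 most probable next tokens
    top_p=0.95,       # sample from the smallest set of tokens covering 95% probability
    max_tokens=100,   # cap the length of the generated response
)
print(inference_params["max_tokens"])
```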

Output

Submit a prompt for a response.

Notes


Gemini Pro Vision

Gemini Pro Vision is a Gemini large language vision model that understands input from text and visual modalities (image and video) and generates relevant text responses.

Gemini Pro Vision is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image and video. It's adept at processing visual and text inputs such as photographs, documents, infographics, and screenshots.

Run Gemini Pro Vision with an API

Running the API with Clarifai's Python SDK

You can run the Gemini Pro Vision Model API using Clarifai’s Python SDK.

Export your PAT as an environment variable. Then, import and initialize the API Client.

Find your PAT in your security settings.

export CLARIFAI_PAT={your personal access token}
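The SDK picks up the token from the CLARIFAI_PAT environment variable before the prediction calls below. A minimal sketch of verifying the variable is set (the placeholder value is illustrative; export your real PAT instead):

```python
import os

# The Clarifai SDK reads your Personal Access Token from CLARIFAI_PAT.
# Placeholder value for illustration only; use your real PAT in practice.
os.environ.setdefault("CLARIFAI_PAT", "your-personal-access-token")

pat = os.environ["CLARIFAI_PAT"]
assert pat, "CLARIFAI_PAT must be set before calling the API"
```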

Predict via Image URL

from clarifai.client.model import Model
from clarifai.client.input import Inputs

prompt = "What time of day is it?"
image_url = "https://samples.clarifai.com/metro-north.jpg"
inference_params = dict(temperature=0.2, top_k=50, top_p=0.95, max_tokens=100)

model_prediction = Model("https://clarifai.com/gcp/generate/models/gemini-pro-vision").predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_url=image_url, raw_text=prompt)],
    inference_params=inference_params,
)

print(model_prediction.outputs[0].data.text.raw)

Predict via local Image

from clarifai.client.model import Model
from clarifai.client.input import Inputs

IMAGE_FILE_LOCATION = 'LOCAL IMAGE PATH'
with open(IMAGE_FILE_LOCATION, "rb") as f:
    file_bytes = f.read()


prompt = "What time of day is it?"
inference_params = dict(temperature=0.2, top_k=50, top_p=0.95, max_tokens=100)

model_prediction = Model("https://clarifai.com/gcp/generate/models/gemini-pro-vision").predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_bytes=file_bytes, raw_text=prompt)],
    inference_params=inference_params,
)
print(model_prediction.outputs[0].data.text.raw)

You can also run the Gemini Pro Vision API using other Clarifai client libraries, such as Java, cURL, NodeJS, and PHP.

Use cases

  1. Visual information seeking: Use external knowledge combined with information extracted from the input image or video to answer questions.
  2. Object recognition: Answer questions related to fine-grained identification of the objects in images and videos.
  3. Digital content understanding: Answer questions and extract information from visual content like infographics, charts, figures, tables, and web pages.
  4. Structured content generation: Generate responses based on multimodal inputs in formats like HTML and JSON.
  5. Captioning and description: Generate descriptions of images and videos with varying levels of detail.
  6. Reasoning: Compositionally infer new information without memorization or retrieval.
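Structured content generation (use case 4) combines naturally with the prediction calls shown above: prompt the model for machine-readable output and parse the returned text. A hedged sketch, using a stand-in string where a live call would return model_prediction.outputs[0].data.text.raw:

```python
import json

# Illustrative prompt asking the model for machine-readable output.
prompt = (
    "List the vehicles visible in this image as a JSON array of objects "
    "with keys 'type' and 'count'."
)

# Stand-in for the raw text a live API call would return; actual model
# output varies and may need cleanup before parsing.
raw_response = '[{"type": "train", "count": 1}]'
vehicles = json.loads(raw_response)
print(vehicles[0]["type"])  # → train
```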

Disclaimer

Please be advised that this model utilizes wrapped Artificial Intelligence (AI) provided by GCP (the "Vendor"). These AI models may collect, process, and store data as part of their operations. By using our website and accessing these AI models, you hereby consent to the data practices of the Vendor. We do not have control over the data collection, processing, and storage practices of the Vendor. Therefore, we cannot be held responsible or liable for any data handling practices, data loss, or breaches that may occur. It is your responsibility to review the privacy policies and terms of service of the Vendor to understand their data practices. You can access the Vendor's privacy policy and terms of service at https://cloud.google.com/privacy.

We disclaim all liability with respect to the actions or omissions of the Vendor, and we encourage you to exercise caution and to ensure that you are comfortable with these practices before utilizing the AI models hosted on our site.

  • ID
    gemini-pro-vision
  • Model Type ID
    Multimodal To Text
  • Input Type
    image
  • Output Type
    text
  • Description
    Google Gemini Pro Vision was created from the ground up to be multimodal (text, images, videos) and to scale across a wide range of tasks.
  • Last Updated
    Oct 17, 2024
  • Privacy
    PUBLIC