max_tokens: The maximum number of tokens to generate. Lower limits return responses faster.
temperature: A decimal number that determines the degree of randomness in the response. Lower values make the output more deterministic.
top_k: Limits the model's predictions to the k most probable tokens at each step of generation.
top_p: An alternative to sampling with temperature, in which the model samples only from the most probable tokens whose cumulative probability reaches p.
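To make the interaction of these parameters concrete, here is a minimal sketch of a single sampling step in plain Python. It is an illustration only: the function name `sample_next_token`, the toy logits, and the order in which the filters are applied are assumptions, not a description of how Gemini Pro Vision implements sampling internally.

```python
import math
import random


def sample_next_token(logits, temperature=0.2, top_k=50, top_p=0.95, rng=random):
    """Pick one token from a {token: logit} mapping (hypothetical helper).

    temperature must be greater than zero.
    """
    # Temperature rescales the logits: lower values sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # top_k: keep only the k most probable tokens.
    ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Sample from the surviving tokens in proportion to their probabilities.
    tokens, probs = zip(*kept)
    return rng.choices(tokens, weights=probs, k=1)[0]


toy_logits = {"morning": 5.0, "noon": 1.0, "night": 0.0}
print(sample_next_token(toy_logits))
```

A full generation loop would repeat this step until an end-of-sequence token appears or the max_tokens limit is reached, which is why a lower limit shortens response time.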
Notes
Google Gemini Pro Vision was created from the ground up to be multimodal (text, images, videos) and to scale across a wide range of tasks.
Gemini Pro Vision
Gemini Pro Vision is a Gemini large language vision model that understands input from text and visual modalities (image and video) and generates relevant text responses.
Gemini Pro Vision is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image and video. It's adept at processing visual and text inputs such as photographs, documents, infographics, and screenshots.
Run Gemini Pro Vision with an API
Running the API with Clarifai's Python SDK
You can run the Gemini Pro Vision Model API using Clarifai’s Python SDK.
Export your PAT as an environment variable. Then, import and initialize the API Client.
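As a small sketch of this step, the snippet below checks that the token is present before any request is made. The variable name `CLARIFAI_PAT` is an assumption about what the SDK reads from the environment, and `require_pat` is a hypothetical helper, not part of the Clarifai SDK.

```python
import os


def require_pat(env_var="CLARIFAI_PAT"):
    """Return the personal access token from the environment, or fail early.

    The default variable name is an assumption; adjust it if your setup differs.
    """
    pat = os.environ.get(env_var, "").strip()
    if not pat:
        raise RuntimeError(
            f"{env_var} is not set. Export it first, e.g. export {env_var}=<your token>"
        )
    return pat
```

Calling `require_pat()` at the start of a script gives a clear error message instead of an authentication failure later in the prediction call.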
from clarifai.client.model import Model
from clarifai.client.input import Inputs

prompt = "What time of day is it?"
image_url = "https://samples.clarifai.com/metro-north.jpg"
inference_params = dict(temperature=0.2, top_k=50, top_p=0.95, max_tokens=100)

model_prediction = Model("https://clarifai.com/gcp/generate/models/gemini-pro-vision").predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_url=image_url, raw_text=prompt)],
    inference_params=inference_params,
)
print(model_prediction.outputs[0].data.text.raw)
Predict via local Image
from clarifai.client.model import Model
from clarifai.client.input import Inputs

IMAGE_FILE_LOCATION = 'LOCAL IMAGE PATH'
with open(IMAGE_FILE_LOCATION, "rb") as f:
    file_bytes = f.read()

prompt = "What time of day is it?"
inference_params = dict(temperature=0.2, top_k=50, top_p=0.95, max_tokens=100)

model_prediction = Model("https://clarifai.com/gcp/generate/models/gemini-pro-vision").predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_bytes=file_bytes, raw_text=prompt)],
    inference_params=inference_params,
)
print(model_prediction.outputs[0].data.text.raw)
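The two examples above differ only in whether the image is passed as `image_url` or `image_bytes`. A minimal sketch of one way to share the rest of the code is shown below; `build_input_kwargs` is a hypothetical helper, and the rule it uses to tell a URL from a local path (an `http://` or `https://` prefix) is an assumption.

```python
from pathlib import Path


def build_input_kwargs(image, prompt):
    """Build keyword arguments for Inputs.get_multimodal_input (hypothetical helper).

    `image` is either an http(s) URL or a path to a local file.
    """
    if image.startswith(("http://", "https://")):
        # Remote image: let the platform fetch it by URL.
        return {"input_id": "", "image_url": image, "raw_text": prompt}
    # Local image: read the file and send its raw bytes.
    return {"input_id": "", "image_bytes": Path(image).read_bytes(), "raw_text": prompt}


# Usage with the SDK calls shown above (requires the clarifai package and a PAT):
# from clarifai.client.input import Inputs
# multimodal_input = Inputs.get_multimodal_input(
#     **build_input_kwargs("https://samples.clarifai.com/metro-north.jpg",
#                          "What time of day is it?")
# )
```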
You can also run the Gemini Pro Vision API using other Clarifai client libraries, such as Java, cURL, NodeJS, and PHP.
Use cases
Visual information seeking: Use external knowledge combined with information extracted from the input image or video to answer questions.
Object recognition: Answer questions related to fine-grained identification of the objects in images and videos.
Digital content understanding: Answer questions and extract information from visual content like infographics, charts, figures, tables, and web pages.
Structured content generation: Generate responses based on multimodal inputs in formats like HTML and JSON.
Captioning and description: Generate descriptions of images and videos with varying levels of detail.
Reasoning: Compositionally infer new information without memorization or retrieval.
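For the structured content generation use case, a reply requested in JSON may arrive wrapped in explanatory prose or a code fence. The sketch below shows one way to pull the first JSON object out of such a reply using only the Python standard library. The helper name `extract_json` and the sample reply are hypothetical; actual model output will vary.

```python
import json


def extract_json(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and ignores
            # whatever text follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        return obj
    return None


# Hypothetical reply text, for illustration only.
reply = 'Here is the result:\n{"time_of_day": "evening", "objects": ["train", "platform"]}\nDone.'
print(extract_json(reply))
```

In practice the input would be the string printed in the prediction examples, `model_prediction.outputs[0].data.text.raw`.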
Disclaimer
Please be advised that this model utilizes wrapped Artificial Intelligence (AI) provided by GCP (the "Vendor"). These AI models may collect, process, and store data as part of their operations. By using our website and accessing these AI models, you hereby consent to the data practices of the Vendor.
We do not have control over the data collection, processing, and storage practices of the Vendor. Therefore, we cannot be held responsible or liable for any data handling practices, data loss, or breaches that may occur.
It is your responsibility to review the privacy policies and terms of service of the Vendor to understand their data practices. You can access the Vendor's privacy policy and terms of service at https://cloud.google.com/privacy.
We disclaim all liability with respect to the actions or omissions of the Vendor, and we encourage you to exercise caution and to ensure that you are comfortable with these practices before utilizing the AI models hosted on our site.
Model Type ID: Multimodal To Text
Input Type: image
Output Type: text