BLIP-2 is a scalable multimodal pre-training method that enables any Large Language Model (LLM) to ingest and understand images, unlocking zero-shot image-to-text generation. BLIP-2 is fast, efficient, and accurate.
from clarifai.client.model import Model

prompt = "What's the future of AI?"
model_url = "https://clarifai.com/salesforce/blip/models/multimodal-embedder-blip-2"

# Send the prompt to the model; replace YOUR_PAT_HERE with your Personal Access Token
model_prediction = Model(url=model_url, pat="YOUR_PAT_HERE").predict_by_bytes(prompt.encode())

# Print the generated text from the first output
print(model_prediction.outputs[0].data.text.raw)