LLaVA-1.5 is a state-of-the-art vision-language model that represents a significant advancement in the field of multimodal artificial intelligence. Building upon the original LLaVA model, LLaVA-1.5 integrates a vision encoder with Vicuna, creating a powerful tool for general-purpose visual and language understanding.
LLaVA-1.5 7B Model
LLaVA-1.5 is an auto-regressive language model based on the transformer architecture, which has been fine-tuned from LLaMA/Vicuna with GPT-generated multimodal instruction-following data. The model incorporates simple yet effective modifications from its predecessor, LLaVA, enabling it to achieve state-of-the-art performance on 11 benchmarks.
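At a high level, this design pairs a pretrained vision encoder with the Vicuna language model through a small projection module that maps image features into the language model's embedding space. The sketch below illustrates the idea in PyTorch; the class name, layer sizes, and token counts are illustrative assumptions, not LLaVA-1.5's exact implementation.

import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    # Illustrative two-layer MLP that maps vision-encoder features into the
    # language model's token-embedding space (dimensions are assumed for
    # illustration, not taken from LLaVA-1.5's released weights).
    def __init__(self, vision_dim=1024, hidden_dim=4096, lm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from a frozen vision encoder
        return self.proj(image_features)  # (batch, num_patches, lm_dim)

# Projected image tokens are concatenated with text-token embeddings and fed to
# the auto-regressive language model as one sequence for next-token prediction.
image_tokens = MultimodalProjector()(torch.randn(1, 576, 1024))
text_tokens = torch.randn(1, 32, 4096)  # placeholder text embeddings
lm_input = torch.cat([image_tokens, text_tokens], dim=1)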
Run LLaVA-1.5 with an API
Running the API with Clarifai's Python SDK
You can run the LLaVA-1.5 Model API using Clarifai’s Python SDK.
Export your PAT as an environment variable. Then, import and initialize the API Client.
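As a minimal sketch of that setup, the snippet below assumes the SDK reads the token from the CLARIFAI_PAT environment variable; the placeholder value is hypothetical and should be replaced with your own PAT.

import os

# Make your personal access token (PAT) available to the SDK before creating
# any client objects; you can also export CLARIFAI_PAT directly in your shell.
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"  # hypothetical placeholder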
from clarifai.client.model import Model
from clarifai.client.input import Inputs

prompt = "What time of day is it?"
image_url = "https://samples.clarifai.com/metro-north.jpg"
inference_params = dict(temperature=0.2, max_tokens=100)

# Send the image and the text prompt together as a single multimodal input.
model_prediction = Model("https://clarifai.com/liuhaotian/llava/models/llava-1_5-7b").predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_url=image_url, raw_text=prompt)],
    inference_params=inference_params,
)

# The generated answer is returned as raw text on the first output.
print(model_prediction.outputs[0].data.text.raw)
You can also run the LLaVA-1.5 API using other Clarifai client libraries, such as Java, cURL, NodeJS, and PHP.
Use Cases
LLaVA-1.5 is designed primarily for research purposes, focusing on the exploration and development of large multimodal models and advanced chatbots. Its capabilities make it a valuable tool for:
Enhancing visual question answering systems.
Improving the performance of chatbots with visual context understanding.
Facilitating research in machine learning, NLP, and computer vision by providing a robust model for experimentation.
Dataset
The training of LLaVA-1.5 involved a diverse and rich dataset comprising:
558K filtered image-text pairs sourced from LAION/CC/SBU, captioned by BLIP.
This eclectic mix of data sources ensures that LLaVA-1.5 is well-equipped to handle a wide range of visual and textual inputs, making it highly adaptable to various contexts and applications.
Evaluation
LLaVA-1.5 has been rigorously evaluated on a collection of 12 benchmarks, including 5 academic visual question answering (VQA) benchmarks and 7 recent benchmarks proposed specifically for instruction-following large multimodal models (LMMs). Across these evaluations, LLaVA-1.5 has demonstrated superior performance and versatility.
ID: llava-1_5-7b
Model Type ID: Multimodal To Text
Input Type: image
Output Type: text
Description: LLaVA-1.5 is a state-of-the-art vision-language model that represents a significant advancement in the field of multimodal artificial intelligence.