LLaVA-1.5 is a state-of-the-art vision-language model that represents a significant advancement in the field of multimodal artificial intelligence. Building upon the original LLaVA model, LLaVA-1.5 integrates a vision encoder with Vicuna, creating a powerful tool for general-purpose visual and language understanding.
LLaVA-1.5 7B Model
LLaVA-1.5 is an auto-regressive language model based on the transformer architecture, which has been fine-tuned from LLaMA/Vicuna with GPT-generated multimodal instruction-following data. The model incorporates simple yet effective modifications from its predecessor, LLaVA, enabling it to achieve state-of-the-art performance on 11 benchmarks.
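At a high level, this design pairs a pretrained vision encoder with the Vicuna language model through a small projection module that maps image features into the language model's embedding space. The sketch below illustrates the idea in PyTorch; the class name, layer sizes, and token counts are illustrative assumptions, not LLaVA-1.5's exact implementation.

import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    # Illustrative two-layer MLP that maps vision-encoder features into the
    # language model's token-embedding space (dimensions are assumed for
    # illustration, not taken from LLaVA-1.5's released weights).
    def __init__(self, vision_dim=1024, hidden_dim=4096, lm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from a frozen vision encoder
        return self.proj(image_features)  # (batch, num_patches, lm_dim)

# Projected image tokens are concatenated with text-token embeddings and fed to
# the auto-regressive language model as one sequence for next-token prediction.
image_tokens = MultimodalProjector()(torch.randn(1, 576, 1024))
text_tokens = torch.randn(1, 32, 4096)  # placeholder text embeddings
lm_input = torch.cat([image_tokens, text_tokens], dim=1)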
Run LLaVA-1.5 with an API
Running the API with Clarifai's Python SDK
You can run the LLaVA-1.5 Model API using Clarifai’s Python SDK.
Export your PAT as an environment variable. Then, import and initialize the API Client.
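As a minimal sketch of that setup, the snippet below assumes the SDK reads the token from the CLARIFAI_PAT environment variable; the placeholder value is hypothetical and should be replaced with your own PAT.

import os

# Make your personal access token (PAT) available to the SDK before creating
# any client objects; you can also export CLARIFAI_PAT directly in your shell.
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"  # hypothetical placeholder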
from clarifai.client.model import Model
from clarifai.client.input import Inputs

prompt = "What time of day is it?"
image_url = "https://samples.clarifai.com/metro-north.jpg"
inference_params = dict(temperature=0.2, max_tokens=100)

# Send the image and the text prompt together as a single multimodal input.
model_prediction = Model("https://clarifai.com/liuhaotian/llava/models/llava-1_5-7b").predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_url=image_url, raw_text=prompt)],
    inference_params=inference_params,
)

# The generated answer is returned as raw text on the first output.
print(model_prediction.outputs[0].data.text.raw)
You can also run the LLaVA-1.5 API using other Clarifai client libraries, such as Java, cURL, NodeJS, and PHP.
Use Cases
LLaVA-1.5 is designed primarily for research purposes, focusing on the exploration and development of large multimodal models and advanced chatbots. Its capabilities make it a valuable tool for:
Enhancing visual question answering systems.
Improving the performance of chatbots with visual context understanding.
Facilitating research in machine learning, NLP, and computer vision by providing a robust model for experimentation.
Dataset
The training of LLaVA-1.5 involved a diverse and rich dataset comprising:
558K filtered image-text pairs sourced from LAION/CC/SBU, captioned by BLIP.
This eclectic mix of data sources ensures that LLaVA-1.5 is well-equipped to handle a wide range of visual and textual inputs, making it highly adaptable to various contexts and applications.
Evaluation
LLaVA-1.5 has been rigorously evaluated on a collection of 12 benchmarks, including 5 academic visual question answering (VQA) benchmarks and 7 recent benchmarks proposed specifically for instruction-following large multimodal models (LMMs). Across these evaluations, LLaVA-1.5 has demonstrated superior performance and versatility.
ID: llava-1_5-7b
Model Type ID: Multimodal To Text
Input Type: image
Output Type: text
Description: LLaVA-1.5 is a state-of-the-art vision-language model that represents a significant advancement in the field of multimodal artificial intelligence.