Llama 3.2 (3B) is a multilingual, instruction-tuned LLM by Meta, optimized for dialogue, retrieval, and summarization. It uses an autoregressive transformer with SFT and RLHF for improved alignment and outperforms many industry models.

Model is trained and ready for deployment

Llama-3_2-3B-Instruct

prompt

images

audios

videos

chat_history

audio

video

image

tools

tool_choice

The system-level prompt used to define the assistant's behavior.

system_prompt

The maximum number of tokens to generate. Shorter token lengths will provide faster performance.

max_tokens

A decimal number that determines the degree of randomness in the response.

temperature

An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass.

top_p

predict

return

generate

The single model method to get the OpenAI-compatible request and send it to the OpenAI server
  then return its response.

Args:
    msg: JSON string containing the request parameters

Returns:
    JSON string containing the response or error

openai_transport

Process an OpenAI-compatible request and return a streaming response iterator.
This method is used when stream=True and returns an iterator of strings directly,
without converting to a list or JSON serializing.

Args:
    msg: The request as a JSON string.

Returns:
    Iterator[str]: An iterator yielding text chunks from the streaming response.

openai_stream_transport

Llama 3.2 (3B) is a multilingual, instruction-tuned LLM by Meta, optimized for dialogue, retrieval, and summarization. It uses an autoregressive transformer