Identifies a variety of concepts in images and video including objects, themes, and more. Trained with over 10,000 concepts and 20M images.
Gemma 3 (4B) is a multilingual, multimodal open model by Google, handling text and image inputs with a 128K context window. It excels in tasks like QA and summarization while being efficient for deployment on limited-resource devices.
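As a rough usage sketch (not the hosted deployment itself), the snippet below sends a combined image + text prompt to the 4B instruction-tuned checkpoint through the Hugging Face transformers image-text-to-text pipeline; the model id `google/gemma-3-4b-it`, the example image URL, and the device settings are assumptions.

```python
# Minimal sketch: multimodal (image + text) prompt to Gemma 3 4B via transformers.
# Assumes a recent transformers release with Gemma 3 support and the
# "google/gemma-3-4b-it" checkpoint; the image URL is a placeholder.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",
    device_map="auto",     # place weights on available GPU(s)/CPU
    torch_dtype="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder image
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # assistant reply appended to the chat
```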
MiniCPM-o-2_6-language is the latest end-side multimodal LLM (MLLM) in the MiniCPM-o series, upgraded from MiniCPM-V. The model takes images, video, text, and audio as inputs and provides high-quality text output in an end-to-end fashion.
Qwen2.5-VL is a vision-language model built for visual agents and for document understanding in domains such as finance and commerce. It excels in visual recognition, reasoning, long-video analysis, object localization, and structured data extraction.
Qwen2.5-Coder is a code-specific LLM series (0.5B–32B) with improved code generation, reasoning, and fixing. Trained on 5.5T tokens, the 32B model rivals GPT-4o in coding capabilities.
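As a rough sketch of how such a code model is typically prompted, the snippet below runs an instruct checkpoint through Hugging Face transformers with a chat template; the model id `Qwen/Qwen2.5-Coder-7B-Instruct` and the generation settings are assumptions, and the other sizes follow the same pattern.

```python
# Minimal sketch: code generation with a Qwen2.5-Coder instruct checkpoint.
# Model id and generation settings are assumptions; pick the size your hardware fits.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
]
# Build the prompt with the model's chat template and generate a completion.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(inputs, max_new_tokens=256)
# Strip the prompt tokens and decode only the newly generated code.
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```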
Phi-4 is a state-of-the-art open model trained on high-quality synthetic, public, and academic data for advanced reasoning. It uses fine-tuning and preference optimization for precise instruction adherence and safety.
MiniCPM3-4B is the 3rd generation of the MiniCPM series. Its overall performance surpasses Phi-3.5-mini-Instruct and GPT-3.5-Turbo-0125 and is comparable to many recent 7B-9B models.
QwQ is the reasoning model of the Qwen series, designed for enhanced problem-solving and downstream task performance. QwQ-32B competes with top reasoning models like DeepSeek-R1 and o1-mini.
Phi-4-mini-instruct is a lightweight open model from the Phi-4 family, optimized for reasoning with high-quality data. It supports a 128K context window and uses fine-tuning for precise instruction adherence and safety.
Llama 3.2 (1B) is a multilingual, instruction-tuned LLM by Meta, optimized for dialogue, retrieval, and summarization use cases. It uses an autoregressive transformer with SFT and RLHF for improved alignment and outperforms many open and closed chat models on common industry benchmarks.
DeepSeek-R1-Distill-Qwen-7B is a 7B-parameter dense model distilled from DeepSeek-R1, using Qwen2.5-Math-7B as the base model.
GPT-4o is a multimodal AI model that excels in processing and generating text, audio, and images, offering rapid response times and improved performance across languages and tasks while incorporating advanced safety features.
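As an illustrative sketch, the snippet below sends a combined text + image request through the OpenAI Python SDK; the image URL is a placeholder, and audio input/output uses separate endpoints not shown here.

```python
# Minimal sketch: text + image request to GPT-4o via the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set in the environment; the image URL is a placeholder.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
```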
Claude 3.5 Sonnet is a high-speed, advanced AI model excelling in reasoning, knowledge, coding, and visual tasks, ideal for complex applications.
Llama-3.2-11B-Vision-Instruct is an 11B-parameter multimodal LLM by Meta designed for visual reasoning, image captioning, and VQA tasks, supporting combined text and image inputs.
Pixtral 12B is a natively multimodal 12B-parameter model that excels in multimodal reasoning, instruction following, and text benchmarks. It supports variable image sizes and long-context inputs.
GOT (General OCR Theory), an OCR-2.0 model, is a versatile and efficient optical character recognition system designed to handle diverse tasks, including text, formulas, and charts, through a unified end-to-end architecture.
Multi-model workflow that detects, crops, and recognizes demographic characteristics of faces. Visually classifies age, gender, and multicultural characteristics.
Multi-model workflow that combines face detection with sentiment classification of faces into anger, disgust, fear, neutral, happiness, sadness, contempt, and surprise.
A general image workflow that combines detection, classification, and embedding to identify general concepts including objects, themes, moods, etc.
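The three workflows above all follow a detect → crop → classify (→ embed) composition. The sketch below illustrates that pattern with generic open-source Hugging Face pipelines standing in for the hosted components; the model choices, placeholder URL, and confidence threshold are assumptions, not the workflows' actual internals.

```python
# Illustrative sketch of a multi-model image workflow: detect objects, crop each
# detection, classify the crop, and embed the full image. Generic open-source models
# stand in for the hosted workflow's components.
from PIL import Image
import requests
from transformers import pipeline

detector = pipeline("object-detection", model="facebook/detr-resnet-50")
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
embedder = pipeline("image-feature-extraction", model="google/vit-base-patch16-224")

image = Image.open(requests.get("https://example.com/scene.jpg", stream=True).raw)  # placeholder URL

results = []
for det in detector(image):
    if det["score"] < 0.8:          # assumed confidence threshold
        continue
    box = det["box"]
    crop = image.crop((box["xmin"], box["ymin"], box["xmax"], box["ymax"]))
    labels = classifier(crop, top_k=3)              # concepts for this region
    results.append({"object": det["label"], "concepts": labels})

embedding = embedder(image)          # whole-image feature vector for search / indexing
print(results)
```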
RAG Agent uses the GPT-4 Turbo LLM with ReAct prompting, enabling dynamic reasoning and action planning.
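As a minimal illustration of the ReAct pattern (interleaved Thought / Action / Observation steps that feed retrieved context back to the LLM), the sketch below drives GPT-4 Turbo through the OpenAI Python SDK with a hypothetical `lookup` tool; the prompt wording and the tool are assumptions, not the hosted agent's actual configuration.

```python
# Minimal ReAct-style loop: the model alternates Thought/Action lines, we execute
# the action, feed back an Observation, and repeat until it emits a Final Answer.
# The `lookup` tool and the prompt wording are illustrative assumptions.
import re
from openai import OpenAI

client = OpenAI()

def lookup(query: str) -> str:
    """Hypothetical retrieval tool; replace with your vector-store or search call."""
    return "Paris is the capital and largest city of France."

SYSTEM = (
    "Answer the question by interleaving steps:\n"
    "Thought: your reasoning\n"
    "Action: lookup[<query>]\n"
    "Observation: (will be provided)\n"
    "Finish with 'Final Answer: <answer>'."
)

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": transcript}],
            stop=["Observation:"],   # pause before the model invents an observation
        ).choices[0].message.content
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[-1].strip()
        match = re.search(r"Action:\s*lookup\[(.*?)\]", reply)
        if match:
            transcript += f"Observation: {lookup(match.group(1))}\n"
    return transcript

print(react("What is the capital of France?"))
```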