Explore The World's AI™
Use open- and closed-source models and workflows from leading partners and the community.
🔥 Trending Models
general-image-recognition
Identifies a variety of concepts in images and video, including objects, themes, and more. Trained with over 10,000 concepts and 20M images.
Llama-3_2-3B-Instruct
Llama 3.2 (3B) is a multilingual, instruction-tuned LLM by Meta, optimized for dialogue, retrieval, and summarization. It uses an autoregressive transformer with SFT and RLHF for improved alignment and outperforms many available open-source and closed chat models on common industry benchmarks.
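For open chat models like this one, a minimal Hugging Face transformers sketch might look like the following; the repo id meta-llama/Llama-3.2-3B-Instruct is an assumption, and the repo is gated, so access must be approved first.

```python
# Minimal sketch: chat with Llama 3.2 3B via the transformers pipeline.
# Assumes transformers is installed and access to the gated repo is granted.
from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct")
messages = [{"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # the last message is the model's reply
```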
gemma-3-4b-it
Gemma 3 (4B) is a multilingual, multimodal open model by Google, handling text and image inputs with a 128K context window. It excels in tasks like QA and summarization while being efficient for deployment on limited-resource devices.
MiniCPM-o-2_6-language
MiniCPM-o-2_6-language is the latest series of end-side multimodal LLMs (MLLMs), upgraded from MiniCPM-V. The models can now take images, video, text, and audio as inputs and provide high-quality text output in an end-to-end fashion.
Qwen2_5-VL-7B-Instruct
Qwen2.5-VL is a vision-language model designed for AI agents, finance, and commerce. It excels in visual recognition, reasoning, long video analysis, object localization, and structured data extraction.
Qwen2_5-Coder-7B-Instruct-vllm
Qwen2.5-Coder is a code-specific LLM series (0.5B–32B) with improved code generation, reasoning, and fixing. Trained on 5.5T tokens, the 32B model rivals GPT-4o in coding capabilities.
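The "-vllm" suffix on this entry suggests the checkpoint is served with vLLM; a minimal offline-generation sketch follows. The Hugging Face repo id is an assumption, not taken from this listing.

```python
# Minimal sketch: offline code generation with vLLM.
# Assumes vLLM is installed and "Qwen/Qwen2.5-Coder-7B-Instruct" is the intended repo id.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct")
params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["# Write a Python function that checks whether a string is a palindrome\n"],
    params,
)
print(outputs[0].outputs[0].text)
```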
phi-4
Phi-4 is a state-of-the-art open model trained on high-quality synthetic, public, and academic data for advanced reasoning. It uses supervised fine-tuning and preference optimization for precise instruction adherence and safety.
MiniCPM3-4B
MiniCPM3-4B is the third generation of the MiniCPM series. Its overall performance surpasses Phi-3.5-mini-Instruct and GPT-3.5-Turbo-0125, and it is comparable with many recent 7B–9B models.
QwQ-32B-AWQ
QwQ is the reasoning model of the Qwen series, designed for enhanced problem-solving and downstream task performance. QwQ-32B competes with top reasoning models like DeepSeek-R1 and o1-mini.
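Since this checkpoint is AWQ-quantized, loading it with transformers might look like the sketch below; the repo id and the dependency on the autoawq package are assumptions.

```python
# Minimal sketch: loading an AWQ-quantized checkpoint with transformers.
# Assumes autoawq is installed and "Qwen/QwQ-32B-AWQ" is the intended repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B-AWQ"  # assumed repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
messages = [{"role": "user", "content": "How many prime numbers are below 30?"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```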
phi-4-mini-instruct
Phi-4-mini-instruct is a lightweight open model from the Phi-4 family, optimized for reasoning with high-quality data. It supports a 128K context window and uses fine-tuning for precise instruction adherence and safety.
DeepSeek-R1-Distill-Qwen-7B
DeepSeek-R1-Distill-Qwen-7B is a 7B-parameter dense model distilled from DeepSeek-R1 based on Qwen-7B.
gpt-4o
GPT-4o is a multimodal AI model that excels in processing and generating text, audio, and images, offering rapid response times and improved performance across languages and tasks, while incorporating advanced safety features.
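A minimal multimodal call with the official OpenAI Python SDK is sketched below; it assumes OPENAI_API_KEY is set in the environment, and the image URL is a placeholder.

```python
# Minimal sketch: a text + image GPT-4o request via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```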
claude-3_5-sonnet
Claude 3.5 Sonnet is a high-speed, advanced AI model excelling in reasoning, knowledge, coding, and visual tasks, ideal for complex applications.
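For comparison, a minimal call with the official anthropic SDK might look like this; it assumes ANTHROPIC_API_KEY is set, and the dated model string is an assumption to verify against current docs.

```python
# Minimal sketch: calling Claude 3.5 Sonnet with the anthropic Python SDK.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
msg = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed version string
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain tail-call optimization briefly."}],
)
print(msg.content[0].text)
```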
pixtral-12b
Pixtral 12B is a natively multimodal model excelling in multimodal reasoning, instruction following, and text benchmarks, with a 12B-parameter architecture supporting variable image sizes and long-context inputs.
got-ocr-2_0
The OCR-2.0 model (GOT) is a versatile and efficient optical character recognition system designed to handle diverse tasks, including text, formulas, and charts, through a unified end-to-end architecture.
o4-mini
o4-mini is a fast, cost-efficient reasoning model from OpenAI.
o3
OpenAI o3 is a multimodal reasoning model.
grok-3
--
Large Language Models
llama-3_2-11b-vision-instruct
Llama-3.2-11B-Vision-Instruct is a multimodal LLM by Meta designed for visual reasoning, image captioning, and VQA tasks, supporting text and image inputs with 11B parameters.
DeepSeek-R1-Distill-Qwen-32B
DeepSeek-R1-Distill-Qwen-32B is a 32B-parameter dense model distilled from DeepSeek-R1 based on Qwen-32B.
llama-3_3-70b-instruct
Llama 3.3 (70B) is a multilingual instruction-tuned LLM optimized for dialogue, trained on 15T+ tokens, supporting 8 languages, and incorporating strong safety measures.
deepseek-coder-33b-instruct
DeepSeek-Coder-33B-Instruct is a SOTA 33B-parameter code generation model, fine-tuned on 2B tokens of instruction data, offering superior performance in code completion and infilling across more than 80 programming languages.
minicpm-o-2_6
MiniCPM-o is the latest series of end-side multimodal LLMs (MLLMs), upgraded from MiniCPM-V. The models can now take images, video, text, and audio as inputs and provide high-quality text and speech outputs in an end-to-end fashion.
DeepSeek-R1-Distill-Qwen-1_5B
DeepSeek-R1-Distill-Qwen-1_5B is a 1.5B-parameter dense model distilled from DeepSeek-R1 based on Qwen-1.5B.
Vision Language Models
gemini-2_0-flash
Gemini 2.0 Flash is a fast, low-latency multimodal model with enhanced performance and new capabilities.
gemini-2_0-flash-lite
Gemini 2.0 Flash-Lite is our fastest and most cost-efficient Flash model. It's an upgrade path for 1.5 Flash users who want better quality for the same price and speed.
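A minimal sketch for calling either Gemini 2.0 Flash model with the google-generativeai package follows; it assumes a GOOGLE_API_KEY environment variable, and since Google's SDKs evolve, the current docs should be checked.

```python
# Minimal sketch: generating text with Gemini 2.0 Flash via google-generativeai.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")  # or "gemini-2.0-flash-lite"
resp = model.generate_content("Give three use cases for a low-latency LLM.")
print(resp.text)
```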
llama-3_2-11b-vision-instruct
Llama-3.2-11B-Vision-Instruct is a multimodal LLM by Meta designed for visual reasoning, image captioning, and VQA tasks, supporting text and image inputs with 11B parameters.
minicpm-o-2_6
MiniCPM-o is the latest series of end-side multimodal LLMs (MLLMs), upgraded from MiniCPM-V. The models can now take images, video, text, and audio as inputs and provide high-quality text and speech outputs in an end-to-end fashion.
llava-1_5-7b
LLaVA-1.5 is a state-of-the-art vision-language model and a significant advance in multimodal AI.
florence-2-large
Florence-2-large is a lightweight, versatile vision-language model by Microsoft that excels across many vision tasks using a unified representation and the extensive FLD-5B dataset.
Popular Workflows
General
A general image workflow that combines detection, classification, and embedding to identify general concepts including objects, themes, moods, etc.
rag-agent-gpt4-turbo-React-few-shot
This RAG agent pairs the GPT-4 Turbo LLM with ReAct prompting, combining dynamic reasoning with action planning; a sketch of the pattern follows.
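The sketch below shows the ReAct-style retrieve-then-reason loop in the abstract; the `llm` and `retrieve` functions are hypothetical stand-ins, not this workflow's actual components, where a real agent would call GPT-4 Turbo and a vector index.

```python
# Minimal ReAct-style RAG loop: the LLM alternates Thought/Action steps, and
# each Action triggers a retrieval whose Observation is fed back into the prompt.
def retrieve(query: str) -> str:
    # Stand-in retriever: a real workflow would query a vector index here.
    docs = {"pricing": "Plan A costs $10/mo.", "limits": "Rate limit is 60 rpm."}
    return docs.get(query.split()[0].lower(), "no match")

def llm(prompt: str) -> str:
    # Stand-in LLM: a real agent would call GPT-4 Turbo here.
    if "Observation" not in prompt:
        return "Thought: I need pricing info.\nAction: retrieve[pricing details]"
    return "Thought: I have enough context.\nFinal Answer: Plan A costs $10/mo."

def react_agent(question: str, max_steps: int = 3) -> str:
    prompt = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(prompt)
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        # Parse the Action: retrieve[...] line and append the observation.
        query = step.split("retrieve[")[1].rstrip("]")
        prompt += f"\n{step}\nObservation: {retrieve(query)}"
    return "gave up"

print(react_agent("How much does Plan A cost?"))
```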
Face-Sentiment
Multi-model workflow that combines face detection and sentiment classification of eight concepts: anger, disgust, fear, neutral, happiness, sadness, contempt, and surprise.
Demographics
Multi-model workflow that detects, crops, and recognizes demographic characteristics of faces, visually classifying age, gender, and multicultural appearance.
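The detect-crop-classify chaining these multi-model workflows describe can be sketched as below; `detect_faces` and `classify_face` are hypothetical stand-ins for the real models, returning dummy values so the pattern stays runnable.

```python
# Sketch of the detect -> crop -> classify pattern behind multi-model workflows.
from dataclasses import dataclass
from PIL import Image

@dataclass
class Box:
    left: int
    top: int
    right: int
    bottom: int

def detect_faces(image: Image.Image) -> list[Box]:
    # Stand-in detector: the real workflow runs a face-detection model here.
    return [Box(10, 10, 90, 90)]

def classify_face(face: Image.Image) -> dict:
    # Stand-in classifier: the real workflow predicts demographic concepts here.
    return {"age": "20-29", "gender": "female"}

def run_workflow(image: Image.Image) -> list[dict]:
    # Detect faces, crop each box from the image, classify each crop.
    results = []
    for box in detect_faces(image):
        crop = image.crop((box.left, box.top, box.right, box.bottom))
        results.append(classify_face(crop))
    return results

print(run_workflow(Image.new("RGB", (100, 100))))
```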