Identifies a variety of concepts in images and video including objects, themes, and more. Trained with over 10,000 concepts and 20M images.
Gemma 3 (4B) is a multilingual, multimodal open model by Google, handling text and image inputs with a 128K context window. It excels in tasks like QA and summarization while being efficient for deployment on limited-resource devices.
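As a rough usage sketch (not the hosted deployment itself), the snippet below sends a combined image + text prompt to the 4B instruction-tuned checkpoint through the Hugging Face transformers image-text-to-text pipeline; the model id `google/gemma-3-4b-it`, the example image URL, and the device settings are assumptions.

```python
# Minimal sketch: multimodal (image + text) prompt to Gemma 3 4B via transformers.
# Assumes a recent transformers release with Gemma 3 support and the
# "google/gemma-3-4b-it" checkpoint; the image URL is a placeholder.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",
    device_map="auto",     # place weights on available GPU(s)/CPU
    torch_dtype="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder image
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # assistant reply appended to the chat
```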
MiniCPM-o-2_6-language is the latest end-side multimodal LLM (MLLM) in the MiniCPM-o series, upgraded from MiniCPM-V. The model takes images, video, text, and audio as inputs and provides high-quality text output in an end-to-end fashion.
Qwen2.5-VL is a vision-language model built for visual agents and for document understanding in domains such as finance and commerce. It excels in visual recognition, reasoning, long-video analysis, object localization, and structured data extraction.
Qwen2.5-Coder is a code-specific LLM series (0.5B–32B) with improved code generation, reasoning, and fixing. Trained on 5.5T tokens, the 32B model rivals GPT-4o in coding capabilities.
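As a rough sketch of how such a code model is typically prompted, the snippet below runs an instruct checkpoint through Hugging Face transformers with a chat template; the model id `Qwen/Qwen2.5-Coder-7B-Instruct` and the generation settings are assumptions, and the other sizes follow the same pattern.

```python
# Minimal sketch: code generation with a Qwen2.5-Coder instruct checkpoint.
# Model id and generation settings are assumptions; pick the size your hardware fits.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
]
# Build the prompt with the model's chat template and generate a completion.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(inputs, max_new_tokens=256)
# Strip the prompt tokens and decode only the newly generated code.
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```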
Phi-4 is a state-of-the-art open model trained on high-quality synthetic, public, and academic data for advanced reasoning. It uses fine-tuning and preference optimization for precise instruction adherence and safety.
MiniCPM3-4B is the 3rd generation of the MiniCPM series. Its overall performance surpasses Phi-3.5-mini-Instruct and GPT-3.5-Turbo-0125 and is comparable to many recent 7B-9B models.
QwQ is the reasoning model of the Qwen series, designed for enhanced problem-solving and downstream task performance. QwQ-32B competes with top reasoning models like DeepSeek-R1 and o1-mini.
Phi-4-mini-instruct is a lightweight open model from the Phi-4 family, optimized for reasoning with high-quality data. It supports a 128K context window and uses fine-tuning for precise instruction adherence and safety.
Llama 3.2 (1B) is a multilingual, instruction-tuned LLM by Meta, optimized for dialogue, retrieval, and summarization use cases. It uses an autoregressive transformer with SFT and RLHF for improved alignment and outperforms many open and closed chat models on common industry benchmarks.
DeepSeek-R1-Distill-Qwen-7B is a 7B-parameter dense model distilled from DeepSeek-R1, using Qwen2.5-Math-7B as the base model.
GPT-4o is a multimodal AI model that excels in processing and generating text, audio, and images, offering rapid response times and improved performance across languages and tasks while incorporating advanced safety features.
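As an illustrative sketch, the snippet below sends a combined text + image request through the OpenAI Python SDK; the image URL is a placeholder, and audio input/output uses separate endpoints not shown here.

```python
# Minimal sketch: text + image request to GPT-4o via the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set in the environment; the image URL is a placeholder.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
```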
Claude 3.5 Sonnet is a high-speed, advanced AI model excelling in reasoning, knowledge, coding, and visual tasks, ideal for complex applications.
Llama-3.2-11B-Vision-Instruct is an 11B-parameter multimodal LLM by Meta designed for visual reasoning, image captioning, and VQA tasks, supporting combined text and image inputs.
Pixtral 12B is a natively multimodal 12B-parameter model that excels in multimodal reasoning, instruction following, and text benchmarks. It supports variable image sizes and long-context inputs.
GOT (General OCR Theory), an OCR-2.0 model, is a versatile and efficient optical character recognition system designed to handle diverse tasks, including text, formulas, and charts, through a unified end-to-end architecture.
Multi-model workflow that detects, crops, and recognizes demographic characteristics of faces. Visually classifies age, gender, and multicultural characteristics.
Multi-model workflow that combines face detection with sentiment classification of faces into anger, disgust, fear, neutral, happiness, sadness, contempt, and surprise.
A general image workflow that combines detection, classification, and embedding to identify general concepts including objects, themes, moods, etc.
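The three workflows above all follow a detect → crop → classify (→ embed) composition. The sketch below illustrates that pattern with generic open-source Hugging Face pipelines standing in for the hosted components; the model choices, placeholder URL, and confidence threshold are assumptions, not the workflows' actual internals.

```python
# Illustrative sketch of a multi-model image workflow: detect objects, crop each
# detection, classify the crop, and embed the full image. Generic open-source models
# stand in for the hosted workflow's components.
from PIL import Image
import requests
from transformers import pipeline

detector = pipeline("object-detection", model="facebook/detr-resnet-50")
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
embedder = pipeline("image-feature-extraction", model="google/vit-base-patch16-224")

image = Image.open(requests.get("https://example.com/scene.jpg", stream=True).raw)  # placeholder URL

results = []
for det in detector(image):
    if det["score"] < 0.8:          # assumed confidence threshold
        continue
    box = det["box"]
    crop = image.crop((box["xmin"], box["ymin"], box["xmax"], box["ymax"]))
    labels = classifier(crop, top_k=3)              # concepts for this region
    results.append({"object": det["label"], "concepts": labels})

embedding = embedder(image)          # whole-image feature vector for search / indexing
print(results)
```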
RAG Agent uses the GPT-4 Turbo LLM with ReAct prompting, enabling dynamic reasoning and action planning.
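As a minimal illustration of the ReAct pattern (interleaved Thought / Action / Observation steps that feed retrieved context back to the LLM), the sketch below drives GPT-4 Turbo through the OpenAI Python SDK with a hypothetical `lookup` tool; the prompt wording and the tool are assumptions, not the hosted agent's actual configuration.

```python
# Minimal ReAct-style loop: the model alternates Thought/Action lines, we execute
# the action, feed back an Observation, and repeat until it emits a Final Answer.
# The `lookup` tool and the prompt wording are illustrative assumptions.
import re
from openai import OpenAI

client = OpenAI()

def lookup(query: str) -> str:
    """Hypothetical retrieval tool; replace with your vector-store or search call."""
    return "Paris is the capital and largest city of France."

SYSTEM = (
    "Answer the question by interleaving steps:\n"
    "Thought: your reasoning\n"
    "Action: lookup[<query>]\n"
    "Observation: (will be provided)\n"
    "Finish with 'Final Answer: <answer>'."
)

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": transcript}],
            stop=["Observation:"],   # pause before the model invents an observation
        ).choices[0].message.content
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[-1].strip()
        match = re.search(r"Action:\s*lookup\[(.*?)\]", reply)
        if match:
            transcript += f"Observation: {lookup(match.group(1))}\n"
    return transcript

print(react("What is the capital of France?"))
```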