Identifies a variety of concepts in images and video including objects, themes, and more. Trained with over 10,000 concepts and 20M images.
GPT-4o is a multimodal AI model that excels in processing and generating text, audio, and images, offering rapid response times and improved performance across languages and tasks, while incorporating advanced safety features.
Claude 3.5 Sonnet is a high-speed, advanced AI model excelling in reasoning, knowledge, coding, and visual tasks, ideal for complex applications.
Llama-3.2-11B-Vision-Instruct is an 11B-parameter multimodal LLM by Meta designed for visual reasoning, image captioning, and visual question answering (VQA), supporting text and image inputs.
Pixtral 12B is a natively multimodal 12B-parameter model that excels in multimodal reasoning, instruction following, and text benchmarks, and supports variable image sizes and long-context inputs.
The OCR-2.0 model (GOT) is a versatile and efficient optical character recognition system designed to handle diverse tasks, including text, formulas, and charts, through a unified end-to-end architecture.
Multi-model workflow that detects and crops faces, then recognizes their demographic characteristics. Visually classifies age, gender, and multicultural characteristics.
Multi-model workflow that combines face detection and sentiment classification of 8 concepts: anger, disgust, fear, neutral, happiness, sadness, contempt, and surprise.
A general image workflow that combines detection, classification, and embedding to identify general concepts including objects, themes, moods, etc.
RAG Agent uses the GPT-4 Turbo LLM with ReAct prompting for dynamic reasoning and action planning.