florence2-large

Florence-2-large is a vision foundation model built for prompt-based computer vision tasks. This version is configured specifically for object detection in images and videos and is intended for use as an open-world visual detector.

Notes

Introduction

Florence-2-large is a lightweight vision-language model developed and open-sourced by Microsoft under the MIT License. Optimized for a wide range of vision and vision-language tasks, it utilizes a prompt-based framework for flexible application. With a compact architecture comprising 0.77 billion parameters, Florence-2-large delivers performance comparable to significantly larger models, such as Kosmos-2 (1.6 billion parameters). Its effectiveness is largely attributed to training on the extensive FLD-5B dataset, which includes 126 million images and 5.4 billion richly annotated visual elements.

Model Architecture

  • Vision Encoder: DaViT vision encoder to convert images into visual token embeddings.
  • Text Embeddings: BERT-generated text embeddings.
  • Multi-Modal Encoder-Decoder: Transformer-based architecture to process combined visual and text embeddings.
  • Location Tokens: Added for region-specific tasks to represent quantized coordinates.
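The location-token idea can be illustrated with a short sketch: each box coordinate is scaled by the image size and rounded down to one of a fixed number of bins, and each bin corresponds to one extra vocabulary token. The bin count of 1000 and the `<loc_N>` token spelling used here are assumptions for illustration, and `box_to_location_tokens` is a hypothetical helper, not part of any Florence-2 API.

```python
def box_to_location_tokens(box, image_width, image_height, num_bins=1000):
    """Map a pixel-space (x1, y1, x2, y2) box to quantized location-token strings.

    num_bins=1000 and the "<loc_N>" spelling are assumptions for illustration.
    """
    x1, y1, x2, y2 = box
    tokens = []
    for value, size in ((x1, image_width), (y1, image_height),
                        (x2, image_width), (y2, image_height)):
        # Scale the coordinate into [0, num_bins) with floor division,
        # then clamp so a coordinate on the far image edge stays in range.
        bin_index = int(value * num_bins // size)
        bin_index = min(num_bins - 1, max(0, bin_index))
        tokens.append(f"<loc_{bin_index}>")
    return tokens


tokens = box_to_location_tokens((64, 48, 320, 240), image_width=640, image_height=480)
print(tokens)
```

Quantizing this way lets the decoder emit box coordinates as ordinary tokens, at the cost of a small rounding error bounded by one bin width.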

Model Usage

The Florence-2-large model provided in this application is specifically configured for open-world object detection.

Object Detection prompt: <OD>

Object detection output format:

{
  "<OD>": {
    "bboxes": [[x1, y1, x2, y2], ...],
    "labels": ["label1", "label2", ...]
  }
}
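As a minimal sketch of how a result in this format could be consumed, the snippet below pairs each label with its box by position. The sample values are made up for illustration, and `pair_detections` is a hypothetical helper rather than part of any SDK.

```python
def pair_detections(result, task="<OD>"):
    """Pair each label with the bounding box at the same index."""
    detections = result[task]
    return [
        {"label": label, "box": tuple(box)}
        for label, box in zip(detections["labels"], detections["bboxes"])
    ]


# Illustrative values only; real output depends on the input image.
example = {
    "<OD>": {
        "bboxes": [[34.0, 160.5, 597.0, 371.0], [272.0, 241.0, 303.5, 247.0]],
        "labels": ["car", "door handle"],
    }
}

paired = pair_detections(example)
for detection in paired:
    print(detection["label"], detection["box"])
```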

This model is pre-prompted to perform object detection on the given input, streamlining its deployment for this task. The model's output is automatically parsed and formatted to align with the bounding box specifications required by the Clarifai platform.
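The parsing step is not documented here, but a conversion of this kind might look like the sketch below. It assumes the model's boxes are pixel-space `[x1, y1, x2, y2]` values and that the target is a normalized box with `top_row`, `left_col`, `bottom_row`, and `right_col` fields in the range 0 to 1. `to_normalized_box` is a hypothetical helper, not the platform's actual implementation.

```python
def to_normalized_box(box, image_width, image_height):
    """Convert a pixel-space (x1, y1, x2, y2) box to normalized row/column fields."""

    def clamp(value):
        # Keep the result inside [0, 1] even if the box slightly overshoots the image.
        return min(1.0, max(0.0, value))

    x1, y1, x2, y2 = box
    return {
        "top_row": clamp(y1 / image_height),
        "left_col": clamp(x1 / image_width),
        "bottom_row": clamp(y2 / image_height),
        "right_col": clamp(x2 / image_width),
    }


normalized = to_normalized_box((160, 120, 480, 360), image_width=640, image_height=480)
print(normalized)
```

Normalized coordinates keep a detection valid when the image is resized, which is why the row and column values are fractions rather than pixels.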

Running the API with Clarifai's Python SDK

You can call the Florence-2 model API using Clarifai’s Python SDK. First, find your personal access token (PAT) in your account's security settings and export it as an environment variable. Then import the SDK and initialize the model client.

export CLARIFAI_PAT={your personal access token}

Predict via Image URL

import os

from clarifai.client.model import Model

image_url = "https://s3.amazonaws.com/samples.clarifai.com/people_walking2.jpeg"
model_url = "https://clarifai.com/clarifai/open-world/models/florence2-large"

# Read the PAT exported above rather than passing an empty string.
model = Model(url=model_url, pat=os.environ["CLARIFAI_PAT"])
model_prediction = model.predict_by_url(image_url)

print(model_prediction.outputs[0].data.regions)
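Printing the raw regions is verbose, so a small helper can flatten them into rows. The sketch below assumes each region exposes `region_info.bounding_box` (with `top_row`, `left_col`, `bottom_row`, `right_col`) and `data.concepts` (with `name` and `value`). `summarize_regions` is a hypothetical helper, and the stand-in objects exist only so the snippet runs without a network call.

```python
from types import SimpleNamespace


def summarize_regions(regions):
    """Flatten detection regions into (label, confidence, box) tuples."""
    rows = []
    for region in regions:
        box = region.region_info.bounding_box
        corners = (box.top_row, box.left_col, box.bottom_row, box.right_col)
        for concept in region.data.concepts:
            rows.append((concept.name, round(concept.value, 3), corners))
    return rows


# Stand-in region with made-up values, mimicking the assumed response shape.
fake_region = SimpleNamespace(
    region_info=SimpleNamespace(
        bounding_box=SimpleNamespace(top_row=0.25, left_col=0.5, bottom_row=0.75, right_col=1.0)
    ),
    data=SimpleNamespace(concepts=[SimpleNamespace(name="person", value=0.97)]),
)

rows = summarize_regions([fake_region])
for label, confidence, corners in rows:
    print(label, confidence, corners)
```

With a real response, the same helper would be called as `summarize_regions(model_prediction.outputs[0].data.regions)`.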

Dataset

Florence-2-large was trained on the FLD-5B dataset, comprising 126 million images and approximately 5.4 billion annotations. This dataset includes a diverse range of annotations—such as bounding boxes, segmentation masks, and descriptive captions—spanning multiple levels of granularity. Unlike traditional manually labeled datasets, FLD-5B was built using automated annotation pipelines powered by specialized models. It aggregates images from a variety of established computer vision datasets, providing a unified and scalable foundation for training general-purpose vision models.

Advantages

  • Unified Representation: Capable of performing multiple vision tasks with a single model, reducing the need for separate specialized models.
  • Efficiency: Compact architecture enables deployment on resource-constrained devices.
  • Strong Zero-shot Performance: Performs well in zero-shot settings, with results comparable to, and in some reported cases better than, larger models such as Kosmos-2.
  • Versatility: Applicable to a wide range of vision and vision-language tasks, from captioning to segmentation.

Limitations

  • Dataset Availability: The FLD-5B dataset, crucial for training and fine-tuning, is not yet publicly available, potentially limiting reproducibility and further research.
  • Task-specific Optimization: While the unified representation is efficient, it may not always match the performance of specialized models optimized for single tasks.

  • ID
  • Model Type ID
    Visual Detector
  • Last Updated
    May 02, 2025
  • Privacy
    PUBLIC
  • License