detr-resnet-101-general-object-detection
DETR is a state-of-the-art object detection model that uses a set-based global loss and transformer encoder-decoder architecture for direct set prediction.
Notes
Overview
The DETR (DEtection TRansformer) model is a state-of-the-art object detection system that views detection as a direct set prediction problem. Unlike traditional object detectors that rely on hand-designed components such as non-maximum suppression and anchor generation, DETR combines a set-based global loss, which forces unique predictions via bipartite matching, with a transformer encoder-decoder architecture. The model reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel.
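As a rough illustration of the bipartite matching step, the sketch below pairs predicted and ground-truth boxes with the Hungarian algorithm via scipy.optimize.linear_sum_assignment. The cost used here (negative class probability plus an L1 box distance) is a simplified stand-in for the full matching cost in DETR, and the tensor names are placeholders.

```python
# Minimal sketch of DETR-style bipartite matching (simplified cost, not the exact one
# from the paper). Assumed inputs: pred_logits (num_queries, num_classes + 1),
# pred_boxes (num_queries, 4), gt_labels (num_targets,), gt_boxes (num_targets, 4).
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    probs = pred_logits.softmax(-1)                      # (num_queries, num_classes + 1)
    cost_class = -probs[:, gt_labels]                    # high prob for the GT class -> low cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # L1 distance between box coordinates
    cost = cost_bbox + cost_class                         # (num_queries, num_targets)
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)
```

Each ground-truth box is matched to exactly one query; the remaining queries are supervised to predict the "no object" class.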
DETR Model
The DETR model consists of three main components: a CNN backbone that extracts a compact feature representation, an encoder-decoder transformer, and a simple feed-forward network (FFN) that makes the final detection predictions. The CNN backbone generates a lower-resolution activation map from the input image, which is passed through the transformer encoder together with a spatial positional encoding. The decoder receives learned object queries, an output positional encoding, and the encoder memory, and produces the final set of predicted class labels and bounding boxes through multiple layers of multi-head self-attention and encoder-decoder attention. The model can be implemented in any deep learning framework that provides a common CNN backbone and a transformer implementation, in just a few hundred lines of code.
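To make the three components concrete, here is a minimal sketch in the spirit of the simplified PyTorch implementation described in the DETR paper. It uses a torchvision ResNet-101 backbone, a learned 2D positional encoding, and nn.Transformer, and it omits training details such as the Hungarian matching loss and auxiliary decoding losses; it is an illustration, not the exact implementation behind this model.

```python
# Toy DETR-style detector: CNN backbone -> transformer -> class/box heads.
import torch
from torch import nn
from torchvision.models import resnet101

class SimpleDETR(nn.Module):
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        # CNN backbone (ResNet-101 without average pooling and the classification head)
        backbone = resnet101(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.conv = nn.Conv2d(2048, hidden_dim, kernel_size=1)   # project features to model width
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        # prediction heads: class logits (+1 for "no object") and normalized boxes
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
        self.bbox_embed = nn.Linear(hidden_dim, 4)
        # learned object queries and a simple learned 2D positional encoding
        self.query_pos = nn.Parameter(torch.rand(num_queries, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, images):
        x = self.conv(self.backbone(images))           # (B, hidden_dim, H, W)
        B, _, H, W = x.shape
        # build the spatial positional encoding for the H x W feature map
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)           # (H*W, 1, hidden_dim)
        src = pos + x.flatten(2).permute(2, 0, 1)       # (H*W, B, hidden_dim)
        tgt = self.query_pos.unsqueeze(1).repeat(1, B, 1)
        h = self.transformer(src, tgt)                  # (num_queries, B, hidden_dim)
        return self.class_embed(h), self.bbox_embed(h).sigmoid()

# example usage (random input just to check shapes):
# model = SimpleDETR(num_classes=91)
# logits, boxes = model(torch.rand(1, 3, 800, 800))    # (100, 1, 92), (100, 1, 4)
```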
Use Cases
The DETR model can be used for various object detection tasks, including pedestrian detection, vehicle detection, and general object detection across 80 categories. The model can be fine-tuned on custom datasets to detect specific objects or classes.
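For reference, a common way to run a DETR ResNet-101 model for inference is through the Hugging Face transformers library. The sketch below assumes the public facebook/detr-resnet-101 checkpoint and a placeholder image path; the exact weights hosted here may differ.

```python
# Inference sketch using the public facebook/detr-resnet-101 checkpoint
# (assumption: this corresponds to the hosted model; "street.jpg" is a placeholder).
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-101")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-101")

image = Image.open("street.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# keep detections above a 0.9 confidence score, rescaled to the original image size
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```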
Dataset
The DETR model has been trained and evaluated on the COCO object detection dataset, which contains over 330,000 images with more than 2.5 million object instances labeled across 80 categories. The dataset is split into training, validation, and test sets, with the training set containing 118,000 images and the validation set containing 5,000 images. The test set is not publicly available, and evaluation on this set is done through the COCO evaluation server.
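If you want to inspect the COCO 2017 annotations directly, the pycocotools package provides a small API for browsing images, categories, and boxes; the annotation path below is a placeholder.

```python
# Sketch: browsing COCO 2017 annotations with pycocotools (file path is a placeholder).
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")   # the 5,000-image validation split
print(len(coco.getImgIds()), "images,", len(coco.getCatIds()), "categories")

# list the annotations (category name + box) for the first image
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
for ann in anns:
    print(coco.loadCats(ann["category_id"])[0]["name"], ann["bbox"])  # bbox is [x, y, w, h]
```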
Evaluation
The DETR model has been evaluated on the COCO object detection dataset using the standard COCO metrics, including Average Precision (AP) and Average Recall (AR) at different Intersection over Union (IoU) thresholds. The model achieves state-of-the-art performance on the COCO test-dev set, reporting an AP of 43.3 and an AR of 59.6 under the standard COCO evaluation protocol.
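The standard COCO metrics mentioned above can be reproduced with the pycocotools evaluation API, assuming the model's detections have been exported to a COCO-format results file; the file names below are placeholders.

```python
# Sketch: computing COCO AP/AR with pycocotools ("detections.json" is a placeholder
# COCO-format results file produced by the detector).
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")   # ground-truth annotations
coco_dt = coco_gt.loadRes("detections.json")           # model detections

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()   # prints AP/AR at IoU 0.50:0.95, 0.50, 0.75, and per object size
```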
- Name: detr-resnet-101
- Model Type ID: Visual Detector
- Description: DETR is a state-of-the-art object detection model that uses a set-based global loss and transformer encoder-decoder architecture for direct set prediction.
- Last Updated: Oct 17, 2024
- Privacy: PUBLIC