general-image-detector-detic_clipR50Caption-coco model by facebook

Notes

Detecting Twenty-thousand Classes using Image-level Supervision

Detic: A Detector with image classes that can use image-level labels to easily train detectors.

Detecting Twenty-thousand Classes using Image-level Supervision,
Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra,
ECCV 2022 (arXiv 2201.02605)

Features

Detects any class given class names (using CLIP).
The detector is trained on the ImageNet-21K dataset, which covers 21K classes.
Cross-dataset generalization to OpenImages and Objects365 without finetuning.
State-of-the-art results on Open-vocabulary LVIS and Open-vocabulary COCO.
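
As a rough illustration of the first feature, detecting a class from its name can be pictured as scoring a region feature against one text embedding per class name. The sketch below is only an assumption-laden toy: `embed_text` is a hypothetical stand-in that produces deterministic pseudo-random unit vectors (it is not CLIP), and `classify_region` is an illustrative helper, not part of Detic or any library.

```python
import hashlib
import math
import random


def embed_text(name, dim=8):
    """Hypothetical stand-in for a text encoder such as CLIP.

    Returns a deterministic pseudo-random unit vector per class name.
    These are NOT real CLIP embeddings.
    """
    seed = int(hashlib.sha256(name.encode("utf-8")).hexdigest(), 16) % (2 ** 32)
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def classify_region(region_feature, class_names):
    """Score one region feature against every class name by cosine similarity."""
    norm = math.sqrt(sum(v * v for v in region_feature))
    unit = [v / norm for v in region_feature]
    scores = {}
    for name in class_names:
        text_vec = embed_text(name, dim=len(unit))
        scores[name] = sum(a * b for a, b in zip(unit, text_vec))
    best = max(scores, key=scores.get)
    return best, scores


# The vocabulary is just a list of strings, so it can be changed at inference time.
vocabulary = ["cat", "umbrella", "skateboard"]

# Toy region feature: a scaled copy of the "umbrella" embedding,
# so its cosine similarity with that class is 1.
feature = [3.0 * v for v in embed_text("umbrella")]
label, scores = classify_region(feature, vocabulary)
```

Because the classifier is defined by the class names alone, swapping the vocabulary changes what can be detected without retraining; in a real system the region features and text embeddings would come from the trained detector and CLIP respectively.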

Detic_CLIP_R50_1x_caption-image Performance

Open-vocabulary COCO

Name	Training time	box mAP50	box mAP50_novel
BoxSup_CLIP_R50_1x	12h	39.3	1.3
Detic_CLIP_R50_1x_image	13h	44.7	24.1
Detic_CLIP_R50_1x_caption	16h	43.8	21.0
Detic_CLIP_R50_1x_caption-image	16h	45.0	27.8

Note

All models are trained with ResNet50-C4 without multi-scale augmentation. All models use CLIP embeddings as the classifier.
We extract class names from COCO-captions and use them as image-level labels. Detic_CLIP_R50_1x_image uses the max-size loss; Detic_CLIP_R50_1x_caption directly uses the CLIP caption embedding within each mini-batch for classification; Detic_CLIP_R50_1x_caption-image uses both losses.
We report box mAP50 under the "generalized" open-vocabulary setting.
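
The note above names the max-size loss without spelling it out. One reading, stated here as an assumption rather than as the authors' implementation, is that the image-level labels supervise only the largest region proposal through a binary cross-entropy over classes. The helper names `box_area` and `max_size_loss`, the box format `(x1, y1, x2, y2)`, and the toy numbers are all hypothetical.

```python
import math


def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def max_size_loss(proposals, logits, image_labels, num_classes):
    """Binary cross-entropy on the largest proposal only.

    proposals:    list of (x1, y1, x2, y2) boxes
    logits:       one list of per-class logits per proposal
    image_labels: set of class indices present in the image (image-level labels)
    """
    largest = max(range(len(proposals)), key=lambda i: box_area(proposals[i]))
    loss = 0.0
    for c in range(num_classes):
        prob = 1.0 / (1.0 + math.exp(-logits[largest][c]))
        target = 1.0 if c in image_labels else 0.0
        loss -= target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob)
    return largest, loss


# Toy example: three proposals, three classes, image labelled with class 0.
proposals = [(0, 0, 10, 10), (0, 0, 50, 40), (5, 5, 20, 20)]
logits = [[0.0, 0.0, 0.0], [2.0, -2.0, -2.0], [0.0, 0.0, 0.0]]
largest, loss = max_size_loss(proposals, logits, image_labels={0}, num_classes=3)
```

The appeal of this choice is that it needs no box annotation for the image-labelled data: only the proposal areas and the image-level class set enter the loss.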

Inference with LVIS Vocabulary

Model Type ID: Visual Detector
Input Type: image
Output Type: regions[...].data.concepts, regions[...].region_info.bounding_box
Last Updated: Aug 29, 2022
Privacy: PUBLIC
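
The output type listed above suggests that each detected region carries a list of concepts and a bounding box. The sketch below shows one way such a result might be flattened into labelled pixel boxes. The response is a hand-written sample, not real model output, and the field names inside `bounding_box` (`top_row`, `left_col`, `bottom_row`, `right_col`, taken as fractions of the image size) as well as the concept fields `name` and `value` are assumptions not stated in this page; `regions_to_boxes` is a hypothetical helper.

```python
def regions_to_boxes(regions, image_width, image_height):
    """Turn region dicts into (label, score, pixel box) records.

    Assumes each region has data.concepts (with 'name' and 'value') and
    region_info.bounding_box with coordinates normalized to the range 0..1.
    """
    results = []
    for region in regions:
        box = region["region_info"]["bounding_box"]
        # Keep only the highest-scoring concept for each region.
        top = max(region["data"]["concepts"], key=lambda c: c["value"])
        results.append({
            "name": top["name"],
            "score": top["value"],
            # (left, top, right, bottom) in pixels
            "box_px": (
                round(box["left_col"] * image_width),
                round(box["top_row"] * image_height),
                round(box["right_col"] * image_width),
                round(box["bottom_row"] * image_height),
            ),
        })
    return results


# Hand-written sample in the assumed shape (not real model output).
sample_regions = [
    {
        "region_info": {
            "bounding_box": {
                "top_row": 0.25, "left_col": 0.5,
                "bottom_row": 0.75, "right_col": 1.0,
            }
        },
        "data": {
            "concepts": [
                {"name": "dog", "value": 0.9},
                {"name": "cat", "value": 0.1},
            ]
        },
    }
]

detections = regions_to_boxes(sample_regions, image_width=640, image_height=480)
```

If the actual response uses different field names or absolute coordinates, only the lookups inside `regions_to_boxes` would need to change.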