- general-image-detector-detic_clipR50Caption-coco
Notes
Detecting Twenty-thousand Classes using Image-level Supervision
Detic: A Detector with image classes that can use image-level labels to easily train detectors.
Detecting Twenty-thousand Classes using Image-level Supervision,
Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra,
ECCV 2022 (arXiv 2201.02605)
Features
- Detects any class given its class name (using CLIP).
- We train the detector on the ImageNet-21K dataset with 21K classes.
- Generalizes across datasets to OpenImages and Objects365 without finetuning.
- State-of-the-art results on open-vocabulary LVIS and open-vocabulary COCO.
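To make the "any class given class names" idea concrete, here is a minimal sketch of how a classifier built from text embeddings can score a detected region. It is not the Detic implementation: the function names, the 3-dimensional vectors, and the class list are hypothetical, and in a real system the class vectors would come from a CLIP text encoder rather than being written by hand.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def classify_region(region_feature, class_embeddings):
    """Score one region feature against every class-name embedding.

    class_embeddings maps a class name to its embedding vector. Scores are
    cosine similarities, so the "classifier weights" are just the embeddings.
    """
    feat = l2_normalize(region_feature)
    scores = {}
    for name, emb in class_embeddings.items():
        unit_emb = l2_normalize(emb)
        scores[name] = sum(f * e for f, e in zip(feat, unit_emb))
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for CLIP text embeddings (hypothetical values).
toy_class_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "umbrella": [0.0, 0.0, 1.0],
}

# Toy stand-in for the visual feature of one detected region.
toy_region_feature = [0.1, 0.9, 0.2]

best_class, class_scores = classify_region(toy_region_feature, toy_class_embeddings)
print(best_class)
```

Under this scheme, supporting a new class only requires adding one more name-to-embedding entry; no classifier weights are retrained.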
Detic_CLIP_R50_1x_caption-image Performance
Open-vocabulary COCO
| Name | Training time | box mAP50 | box mAP50_novel |
|---|---|---|---|
| BoxSup_CLIP_R50_1x | 12h | 39.3 | 1.3 |
| Detic_CLIP_R50_1x_image | 13h | 44.7 | 24.1 |
| Detic_CLIP_R50_1x_caption | 16h | 43.8 | 21.0 |
| Detic_CLIP_R50_1x_caption-image | 16h | 45.0 | 27.8 |
Note
All models are trained with ResNet50-C4 without multi-scale augmentation. All models use CLIP embeddings as the classifier.
We use class names extracted from COCO captions as image-level labels. Detic_CLIP_R50_1x_image uses the max-size loss; Detic_CLIP_R50_1x_caption directly uses the CLIP caption embeddings within each mini-batch for classification; Detic_CLIP_R50_1x_caption-image uses both losses.
We report box mAP50 under the "generalized" open-vocabulary setting.
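The max-size loss mentioned in the notes can be illustrated with a short sketch. As described in the Detic paper, an image that has only image-level labels contributes a classification loss on its largest proposal. The code below assumes a binary cross-entropy summed over classes; the function names, boxes, and logits are hypothetical toy values, not the authors' code.

```python
import math


def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def max_size_loss(proposals, logits, image_labels):
    """Apply image-level labels to the largest proposal only.

    proposals:    list of (x1, y1, x2, y2) boxes
    logits:       one list of per-class logits for each proposal
    image_labels: list of 0/1 image-level labels, one per class
    Returns the index of the chosen proposal and the binary cross-entropy
    summed over classes (the reduction is an assumption of this sketch).
    """
    largest = max(range(len(proposals)), key=lambda i: box_area(proposals[i]))
    loss = 0.0
    for z, y in zip(logits[largest], image_labels):
        p = sigmoid(z)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return largest, loss


# Three toy proposals with areas 100, 2000, and 225.
toy_proposals = [(0, 0, 10, 10), (0, 0, 50, 40), (5, 5, 20, 20)]
# Two classes; the image-level label says class 0 is present, class 1 is not.
toy_logits = [[2.0, -1.0], [0.0, 0.0], [-3.0, 1.0]]
toy_labels = [1, 0]

largest_index, loss_value = max_size_loss(toy_proposals, toy_logits, toy_labels)
print(largest_index, round(loss_value, 4))
```

Only the second proposal (index 1) receives a gradient here; the other proposals are ignored for this image, which is what lets image-level labels train the classifier without box annotations.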
Inference with LVIS Vocabulary
- ID
- Name: detic-clip-r50-1x_caption-CPU
- Model Type ID: Visual Detector
- Description: --
- Last Updated: Aug 29, 2022
- Privacy: PUBLIC
- License