We train the detector on the ImageNet-21K dataset, which has 21K classes.
Cross-dataset generalization to OpenImages and Objects365 without finetuning.
State-of-the-art results on Open-vocabulary LVIS and Open-vocabulary COCO.
Detic_CLIP_R50_1x_caption-image Performance
Open-vocabulary COCO
| Name | Training time | box mAP50 | box mAP50_novel |
|---|---|---|---|
| BoxSup_CLIP_R50_1x | 12h | 39.3 | 1.3 |
| Detic_CLIP_R50_1x_image | 13h | 44.7 | 24.1 |
| Detic_CLIP_R50_1x_caption | 16h | 43.8 | 21.0 |
| Detic_CLIP_R50_1x_caption-image | 16h | 45.0 | 27.8 |
Note
All models are trained with ResNet50-C4 without multi-scale augmentation. All models use CLIP embeddings as the classifier.
We extract class names from COCO captions and use them as image labels. Detic_CLIP_R50_1x_image uses the max-size loss; Detic_CLIP_R50_1x_caption directly uses the CLIP caption embeddings within each mini-batch for classification; Detic_CLIP_R50_1x_caption-image uses both losses.
We report box mAP50 under the "generalized" open-vocabulary setting.
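
To make the max-size loss mentioned in the notes concrete, here is a minimal sketch in plain Python. It is not the repository's implementation. It assumes the image-level labels supervise only the largest proposal, that class scores are scaled cosine similarities between the region feature and the class-name embeddings, and that the loss is a per-class binary cross-entropy. The function names, the `temperature` parameter, and the toy data are all hypothetical.

```python
import math


def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bce_with_logit(logit, target):
    """Numerically stable binary cross-entropy on a single logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def max_size_loss(boxes, region_feats, class_embs, image_labels, temperature=5.0):
    """Apply image-level labels to the largest proposal only.

    boxes:        list of (x1, y1, x2, y2) proposals
    region_feats: one feature vector per proposal
    class_embs:   one text embedding per class (stand-ins for CLIP embeddings)
    image_labels: set of class indices present in the image
    temperature:  hypothetical scale applied to the cosine similarities
    """
    # Pick the proposal with the largest area.
    largest = max(range(len(boxes)), key=lambda i: box_area(boxes[i]))
    # Score that proposal against every class embedding.
    logits = [temperature * cosine(region_feats[largest], e) for e in class_embs]
    # Multi-label targets: 1 for classes named in the image labels, else 0.
    targets = [1.0 if c in image_labels else 0.0 for c in range(len(class_embs))]
    loss = sum(bce_with_logit(l, t) for l, t in zip(logits, targets)) / len(logits)
    return largest, loss


# Toy data: three proposals, two classes, 2-D embeddings.
boxes = [(0, 0, 10, 10), (0, 0, 100, 80), (5, 5, 20, 20)]
region_feats = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
class_embs = [[1.0, 0.0], [0.0, 1.0]]

idx, loss_match = max_size_loss(boxes, region_feats, class_embs, {0})
_, loss_mismatch = max_size_loss(boxes, region_feats, class_embs, {1})
```

In this toy setup the second proposal is the largest, and its feature aligns with class 0, so labelling the image with class 0 gives a lower loss than labelling it with class 1.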