- We train the detector on the ImageNet-21K dataset with 21K classes.
- Cross-dataset generalization to OpenImages and Objects365 without finetuning.
- State-of-the-art results on open-vocabulary LVIS and open-vocabulary COCO.
*Figure: Detic_C2_IN-L_SwinB_896_4x performance.*
## Open-vocabulary LVIS

| Name | Training time | mask mAP | mask mAP_novel |
|------|---------------|----------|----------------|
| Box-Supervised_C2_R50_640_4x | 17h | 30.2 | 16.4 |
| Detic_C2_IN-L_R50_640_4x | 22h | 32.4 | 24.9 |
| Detic_C2_CCimg_R50_640_4x | 22h | 31.0 | 19.8 |
| Detic_C2_CCcapimg_R50_640_4x | 22h | 31.0 | 21.3 |
| Box-Supervised_C2_SwinB_896_4x | 43h | 38.4 | 21.9 |
| Detic_C2_IN-L_SwinB_896_4x | 47h | 40.7 | 33.8 |
#### Note

- The open-vocabulary LVIS setup is LVIS without rare-class annotations during training; rare classes are evaluated as novel classes at test time.
- The models with C2 are trained with our improved LVIS baseline (Appendix D of the paper), which includes the CenterNet2 detector, Federated Loss, large-scale jittering, etc.
- All models use CLIP embeddings as classifiers. This is why the box-supervised models achieve non-zero mAP on novel classes.
- The models with IN-L use the classes shared between ImageNet-21K and LVIS as image-labeled data.
- The models with CC use Conceptual Captions. CCimg uses image labels extracted from the captions (using a naive text match) as image-labeled data; CCcapimg additionally uses the raw captions (Appendix C of the paper).
- The Detic models are finetuned from the corresponding Box-Supervised models above (indicated by MODEL.WEIGHTS in the config files). Please train or download the Box-Supervised models and place them under DETIC_ROOT/models/ before training the Detic models.
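The CCimg note above mentions extracting image labels from captions with a naive text match. A minimal sketch of what such a matcher could look like (the function name, vocabulary, and caption are illustrative, not the paper's actual implementation):

```python
def labels_from_caption(caption, vocabulary):
    """Return the vocabulary classes whose name occurs verbatim in the caption."""
    text = caption.lower()
    return [cls for cls in vocabulary if cls.lower() in text]

# Illustrative vocabulary and caption (not from Conceptual Captions).
vocab = ["dog", "frisbee", "car", "tree"]
print(labels_from_caption("A dog catches a frisbee in the park", vocab))
# -> ['dog', 'frisbee']
```

In practice a matcher like this is noisy (substring hits, missed synonyms), which is consistent with CCimg scoring below IN-L in the table above.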
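The checkpoint placement described in the last note can be sketched as shell commands; DETIC_ROOT defaulting to the current directory and the checkpoint filename are assumptions for illustration:

```shell
# Create the directory where MODEL.WEIGHTS expects the Box-Supervised checkpoint.
DETIC_ROOT=${DETIC_ROOT:-.}
mkdir -p "$DETIC_ROOT/models"
# After training or downloading a Box-Supervised checkpoint, copy it there,
# e.g. (filename is illustrative):
# cp Box-Supervised_C2_R50_640_4x.pth "$DETIC_ROOT/models/"
```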