- We train the detector on the ImageNet-21K dataset with 21K classes.
- Cross-dataset generalization to OpenImages and Objects365 without finetuning.
- State-of-the-art results on open-vocabulary LVIS and open-vocabulary COCO.
*Figure: Detic_C2_IN-L_SwinB_896_4x performance.*
## Open-vocabulary LVIS

| Name | Training time | mask mAP | mask mAP_novel |
|------|---------------|----------|----------------|
| Box-Supervised_C2_R50_640_4x | 17h | 30.2 | 16.4 |
| Detic_C2_IN-L_R50_640_4x | 22h | 32.4 | 24.9 |
| Detic_C2_CCimg_R50_640_4x | 22h | 31.0 | 19.8 |
| Detic_C2_CCcapimg_R50_640_4x | 22h | 31.0 | 21.3 |
| Box-Supervised_C2_SwinB_896_4x | 43h | 38.4 | 21.9 |
| Detic_C2_IN-L_SwinB_896_4x | 47h | 40.7 | 33.8 |
#### Note

- The open-vocabulary LVIS setup is LVIS without rare-class annotations during training; rare classes are evaluated as novel classes at test time.
- The models with C2 are trained with our improved LVIS baseline (Appendix D of the paper), which includes the CenterNet2 detector, Federated Loss, large-scale jittering, etc.
- All models use CLIP embeddings as classifiers. This is why the box-supervised models achieve non-zero mAP on novel classes.
- The models with IN-L use the classes shared between ImageNet-21K and LVIS as image-labeled data.
- The models with CC use Conceptual Captions. CCimg uses image labels extracted from the captions (using a naive text match) as image-labeled data; CCcapimg additionally uses the raw captions (Appendix C of the paper).
- The Detic models are finetuned from the corresponding Box-Supervised models above (indicated by MODEL.WEIGHTS in the config files). Please train or download the Box-Supervised models and place them under DETIC_ROOT/models/ before training the Detic models.
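The CCimg note above mentions extracting image labels from captions with a naive text match. A minimal sketch of what such a matcher could look like (the function name, vocabulary, and caption are illustrative, not the paper's actual implementation):

```python
def labels_from_caption(caption, vocabulary):
    """Return the vocabulary classes whose name occurs verbatim in the caption."""
    text = caption.lower()
    return [cls for cls in vocabulary if cls.lower() in text]

# Illustrative vocabulary and caption (not from Conceptual Captions).
vocab = ["dog", "frisbee", "car", "tree"]
print(labels_from_caption("A dog catches a frisbee in the park", vocab))
# -> ['dog', 'frisbee']
```

In practice a matcher like this is noisy (substring hits, missed synonyms), which is consistent with CCimg scoring below IN-L in the table above.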
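The checkpoint placement described in the last note can be sketched as shell commands; DETIC_ROOT defaulting to the current directory and the checkpoint filename are assumptions for illustration:

```shell
# Create the directory where MODEL.WEIGHTS expects the Box-Supervised checkpoint.
DETIC_ROOT=${DETIC_ROOT:-.}
mkdir -p "$DETIC_ROOT/models"
# After training or downloading a Box-Supervised checkpoint, copy it there,
# e.g. (filename is illustrative):
# cp Box-Supervised_C2_R50_640_4x.pth "$DETIC_ROOT/models/"
```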