general-image-detector-detic_C2_SwinB_896_lvis model by facebook

Detects any class given class names (using CLIP).
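As a rough illustration of how class names can drive detection, the sketch below scores region features against text embeddings of the class names and picks the closest name for each region. It is not the model's actual code: the function names, the toy 3-dimensional embeddings, the softmax scoring, and the temperature value are all assumptions made for illustration. In practice the class embeddings would come from a CLIP text encoder and the region features from the detector.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def classify_regions(region_features, class_embeddings, class_names, temperature=0.01):
    """Assign each region the class name whose embedding is most similar.

    region_features:  (num_regions, dim) array, hypothetical detector outputs.
    class_embeddings: (num_classes, dim) array, hypothetical text embeddings.
    temperature:      assumed scaling factor for the similarity scores.
    """
    regions = l2_normalize(np.asarray(region_features, dtype=float))
    classes = l2_normalize(np.asarray(class_embeddings, dtype=float))

    # Cosine similarity between every region and every class name.
    logits = regions @ classes.T / temperature

    # Numerically stable softmax over classes (an assumption of this sketch).
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    best = probs.argmax(axis=1)
    return [(class_names[i], float(probs[n, i])) for n, i in enumerate(best)]


# Toy data standing in for real embeddings (purely illustrative).
class_names = ["cat", "dog", "headphones"]
class_embeddings = np.eye(3)
region_features = np.array([[0.9, 0.1, 0.0],
                            [0.0, 0.2, 0.8]])

predictions = classify_regions(region_features, class_embeddings, class_names)
print(predictions)
```

Because the classifier is just a matrix of text embeddings, changing the vocabulary only requires embedding a new list of class names; the detector itself does not need retraining.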

We train the detector on the ImageNet-21K dataset, which has 21K classes.

Cross-dataset generalization to OpenImages and Objects365 without finetuning.

State-of-the-art results on Open-vocabulary LVIS and Open-vocabulary COCO.

Name	Training time	mask mAP	mask mAP_rare
Box-Supervised_C2_R50_640_4x	17h	31.5	25.6
Detic_C2_R50_640_4x	22h	33.2	29.7
Box-Supervised_C2_SwinB_896_4x	43h	40.7	35.9
Detic_C2_SwinB_896_4x	47h	41.7	41.7

Detecting Twenty-thousand Classes using Image-level Supervision