
general-image-detection

Detects a variety of common objects and their locations by generating regions of an image that may contain each object.

Notes

General Image Detector

Introduction

The general image detection model detects and labels a variety of common objects in images. To do this, it generates bounding box proposals. Note that these box proposals can be fed to downstream classifiers, extending the model's labeling functionality in situations where the dataset used for training (Open Images V4) does not yield satisfactory classification results.
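
As an illustration of this proposal-to-classifier workflow, the sketch below crops each proposed region and hands it to an arbitrary downstream classifier. The helper name, the pixel-coordinate box format, and the classifier callable are assumptions made for the example, not part of this model's actual API:

from PIL import Image

def classify_detected_regions(image_path, boxes, classifier):
    # Hypothetical helper: `boxes` are assumed to be absolute pixel coordinates
    # (x_min, y_min, x_max, y_max); `classifier` is any callable mapping a PIL
    # image crop to a label.
    image = Image.open(image_path)
    results = []
    for (x_min, y_min, x_max, y_max) in boxes:
        crop = image.crop((x_min, y_min, x_max, y_max))
        results.append(classifier(crop))
    return results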

This image detector uses a RetinaNet with an Inception-V4 backbone architecture.

Limitations

The model may have difficulty detecting small objects. This is because, unlike the MS COCO dataset, Open Images is composed of images containing relatively larger objects. However, it has the advantage of covering many more classes, resulting in more detectable object types.

Note that detectors can only recognize objects that belong to the dataset used for training, meaning they will be unable to detect out-of-sample classes (see the Taxonomy section below for the classes used for training). Note also that a few classes are not very well represented in the Open Images V4 dataset.

RetinaNet

From Papers with Code:

RetinaNet is a one-stage object detection model that utilizes a focal loss function to address class imbalance during training. Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard negative examples. RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs convolutional object classification on the backbone's output; the second subnet performs convolutional bounding box regression. The two subnetworks feature a simple design that the authors propose specifically for one-stage, dense detection.
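
To make the modulating term concrete, here is a minimal NumPy sketch of the per-example binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). It illustrates the formula only and is not the model's training code; alpha=0.25 and gamma=2 are the defaults reported in the RetinaNet paper:

import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # p is the predicted probability of the positive class, y is the label in {0, 1}.
    p_t = p if y == 1 else 1.0 - p            # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # With gamma = 0 this reduces to (alpha-weighted) cross entropy; larger gamma
    # down-weights well-classified examples so hard examples dominate the loss.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)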

Cross Entropy vs. Focal Loss

RetinaNet Architecture

More Info

  • Original Repository (Deprecated): GitHub

  • Updated Model Repository: GitHub

Paper

Focal Loss for Dense Object Detection

Authors: Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár

Abstract

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.

Author

Facebook/Meta Research

Inception V4

Inception V4 is a CNN architecture that builds on previous iterations of the Inception architecture. It differs from its predecessors in that it simplifies the architecture and uses more inception modules.

Architecture

Inception V4 Architecture

The original Inception architecture was heavily engineered to push performance in terms of speed and accuracy while keeping computational cost low. Prior CNN solutions improved performance by increasing the depth of the network, which increased computational cost. Very deep networks are also susceptible to overfitting and have difficulty propagating gradient updates through the early layers during backpropagation.

Inception networks rely heavily on 1x1 convolution layers, which are used as dimensionality reduction modules. This reduction allows for fewer operations and thus less computation. Using 1x1 convolutions to reduce the model size also mitigates overfitting.

Although it might seem counterintuitive to add a layer in order to reduce the number of operations, performing a 3x3 or a 5x5 convolution on an input with fewer channels drastically reduces the number of operations, greatly improving performance.
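
As a rough illustration of why the 1x1 reduction pays off, the arithmetic below compares multiply counts for a direct 5x5 convolution against the same convolution preceded by a 1x1 bottleneck. The tensor sizes are hypothetical and chosen only for the sake of the calculation, not taken from Inception V4 itself:

# 5x5 convolution producing 32 channels from a hypothetical 28x28x192 input,
# with and without a 1x1 bottleneck that first reduces the input to 16 channels.
direct_5x5  = 28 * 28 * 32 * (5 * 5 * 192)   # ~120.4M multiplies
bottleneck  = 28 * 28 * 16 * (1 * 1 * 192)   # ~2.4M multiplies for the 1x1 reduction
reduced_5x5 = 28 * 28 * 32 * (5 * 5 * 16)    # ~10.0M multiplies
print(direct_5x5, bottleneck + reduced_5x5)  # roughly a 10x reduction in multiplies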

The design of the inception module effectively causes the network to get wider rather than deeper. The "naive" inception module performs convolution on an input with 3 different sizes of filters (1x1, 3x3, 5x5) as well as 3x3 max pooling.

Naive Inception Module

This is where the dimensionality reduction comes in via the 1x1 convolution layer, reducing size prior to the 3x3 and 5x5 convolution layers, and after the 3x3 max pooling.

Inception Module with Dimension Reductions

Inception modules can be connected to each other (linear stacking) by concatenating the outputs of the filters (both convolution and max pool), and feeding them as the input of the next inception layer.
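
A minimal Keras-style sketch of such a module with dimension reductions is shown below. The filter counts are placeholders chosen for illustration, not the values used in GoogLeNet or Inception V4:

from tensorflow.keras import layers

def inception_module(x, f1=64, f3_reduce=96, f3=128, f5_reduce=16, f5=32, pool_proj=32):
    # 1x1 branch
    branch1 = layers.Conv2D(f1, 1, padding='same', activation='relu')(x)
    # 1x1 reduction followed by a 3x3 convolution
    branch3 = layers.Conv2D(f3_reduce, 1, padding='same', activation='relu')(x)
    branch3 = layers.Conv2D(f3, 3, padding='same', activation='relu')(branch3)
    # 1x1 reduction followed by a 5x5 convolution
    branch5 = layers.Conv2D(f5_reduce, 1, padding='same', activation='relu')(x)
    branch5 = layers.Conv2D(f5, 5, padding='same', activation='relu')(branch5)
    # 3x3 max pooling followed by a 1x1 projection
    pool = layers.MaxPooling2D(3, strides=1, padding='same')(x)
    pool = layers.Conv2D(pool_proj, 1, padding='same', activation='relu')(pool)
    # concatenate all branch outputs along the channel axis (the module gets wider, not deeper)
    return layers.Concatenate()([branch1, branch3, branch5, pool])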

By relying on these inception modules, the original inception network, GoogLeNet, was built. GoogLeNet is composed of 9 inception modules stacked linearly. It is 22 layers deep (27, including the pooling layers), and it uses global average pooling at the end of the last inception module.

A 22-layer network is a very deep classifier, and just like any other very deep network, it faces the vanishing gradient problem, which prevents the middle and early layers from learning effectively during backpropagation.

This problem was overcome by introducing two auxiliary classifiers (see the purple boxes in the figure below). These essentially apply softmax to the outputs of two intermediate inception modules and compute an auxiliary loss over the same labels. The total loss function is then computed as a weighted sum of the auxiliary losses and the real loss. The original paper uses a weight of 0.3 for each auxiliary loss:

total_loss = real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2

GoogLeNet Network

Inception V4 Paper

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

Authors: Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke

Abstract

Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there are any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual and one Inception-v4, we achieve 3.08% top-5 error on the test set of the ImageNet classification (CLS) challenge.

GoogLeNet Paper

Going Deeper with Convolutions

Authors: Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich

Abstract

We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

Author

Google Inc.

Dataset

The model was trained primarily on the Open Images V4 dataset.

Note that a few classes are not very well represented in the Open Images validation set. Compared to the MS COCO dataset, Open Images contains relatively larger objects, but it has the advantage of covering many more classes, resulting in more detectable object types.

More Info

Metrics

Validation mean Average Precision (mAP) at an Intersection over Union (IoU) threshold of 0.5 is 38.6%, using the standard Pascal VOC (Visual Object Classes) and COCO mAP definition.

mAP is calculated assuming fully annotated images, rather than with the modified mAP_OI metric from the Open Images V4 paper, which counts false positives only for images with human-annotated negative classes (i.e. annotated class absence).
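
For reference, a detection counts as a true positive at IoU 0.5 when it overlaps a ground-truth box by at least 50%. A minimal sketch of the IoU computation follows; the corner-coordinate box format is an assumption made for the example:

def iou(box_a, box_b):
    # Boxes given as (x_min, y_min, x_max, y_max) in the same coordinate system.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)          # intersection area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)               # intersection over union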

Open Images V4 Paper

The Open Images Dataset V4 - Unified image classification, object detection, and visual relationship detection at scale

Authors: Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, Vittorio Ferrari

Abstract

We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows to share and adapt the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15× more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth comprehensive statistics about the dataset, we validate the quality of the annotations, we study how the performance of several modern models evolves with increasing amounts of training data, and we demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.

Feature Structure

FeaturesDict({
    'bobjects': Sequence({
        'bbox': BBoxFeature(shape=(4,), dtype=tf.float32),
        'is_depiction': tf.int8,
        'is_group_of': tf.int8,
        'is_inside': tf.int8,
        'is_occluded': tf.int8,
        'is_truncated': tf.int8,
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=601),
        'source': ClassLabel(shape=(), dtype=tf.int64, num_classes=6),
    }),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string),
    'objects': Sequence({
        'confidence': tf.int32,
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=19995),
        'source': ClassLabel(shape=(), dtype=tf.int64, num_classes=6),
    }),
    'objects_trainable': Sequence({
        'confidence': tf.int32,
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=7186),
        'source': ClassLabel(shape=(), dtype=tf.int64, num_classes=6),
    }),
})
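
The structure above is the tensorflow_datasets (TFDS) feature specification for the dataset. As a hedged sketch (assuming the TFDS registration name 'open_images_v4' and a local tensorflow_datasets install; the download is very large), the boxable annotations can be inspected like this:

import tensorflow_datasets as tfds

ds = tfds.load('open_images_v4', split='validation')

for example in ds.take(1):
    image = example['image']                  # uint8 tensor of shape (height, width, 3)
    boxes = example['bobjects']['bbox']       # normalized box coordinates, shape (num_boxes, 4)
    labels = example['bobjects']['label']     # integer ids over the 601 boxable classes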

Feature Documentation

Feature                        Class          Shape             Dtype
                               FeaturesDict
bobjects                       Sequence
bobjects/bbox                  BBoxFeature    (4,)              tf.float32
bobjects/is_depiction          Tensor                           tf.int8
bobjects/is_group_of           Tensor                           tf.int8
bobjects/is_inside             Tensor                           tf.int8
bobjects/is_occluded           Tensor                           tf.int8
bobjects/is_truncated          Tensor                           tf.int8
bobjects/label                 ClassLabel                       tf.int64
bobjects/source                ClassLabel                       tf.int64
image                          Image          (None, None, 3)   tf.uint8
image/filename                 Text                             tf.string
objects                        Sequence
objects/confidence             Tensor                           tf.int32
objects/label                  ClassLabel                       tf.int64
objects/source                 ClassLabel                       tf.int64
objects_trainable              Sequence
objects_trainable/confidence   Tensor                           tf.int32
objects_trainable/label        ClassLabel                       tf.int64
objects_trainable/source       ClassLabel                       tf.int64

Data Splits

Name             Train        Validation   Test
Open Images V4   1,743,042    41,620       125,436

Citation Information

@article{OpenImages,
  author = {Alina Kuznetsova and
            Hassan Rom and
            Neil Alldrin and
            Jasper Uijlings and
            Ivan Krasin and
            Jordi Pont-Tuset and
            Shahab Kamali and
            Stefan Popov and
            Matteo Malloci and
            Tom Duerig and
            Vittorio Ferrari},
  title = {The Open Images Dataset V4: Unified image classification,
           object detection, and visual relationship detection at scale},
  year = {2018},
  journal = {arXiv:1811.00982}
}
@article{OpenImages2,
  author = {Krasin, Ivan and
            Duerig, Tom and
            Alldrin, Neil and
            Ferrari, Vittorio and
            Abu-El-Haija, Sami and
            Kuznetsova, Alina and
            Rom, Hassan and
            Uijlings, Jasper and
            Popov, Stefan and
            Kamali, Shahab and
            Malloci, Matteo and
            Pont-Tuset, Jordi and
            Veit, Andreas and
            Belongie, Serge and
            Gomes, Victor and
            Gupta, Abhinav and
            Sun, Chen and
            Chechik, Gal and
            Cai, David and
            Feng, Zheyun and
            Narayanan, Dhyanesh and
            Murphy, Kevin},
  title = {OpenImages: A public dataset for large-scale multi-label and
           multi-class image classification.},
  journal = {Dataset available from
             https://storage.googleapis.com/openimages/web/index.html},
  year={2017}
}

Author

Google Inc.

Taxonomy

All recognizable objects can be found below in two formats: a code box for easy copying, and a table that includes the Average Precision (AP) and the number of ground-truth bounding boxes (GT boxes) per object.

Accordion, Adhesive tape, Aircraft, Airplane, Alarm clock, Alpaca, Ambulance, Animal, Ant, Antelope, Apple, Armadillo, Artichoke, Asparagus, Auto part, Axe, Backpack, Bagel, Baked goods, Balance beam, Ball, Balloon, Banana, Band-aid, Banjo, Barge, Barrel, Baseball bat, Baseball glove, Bat, Bathroom accessory, Bathroom cabinet, Bathtub, Beaker, Bear, Bed, Bee, Beehive, Beer, Beetle, Bell pepper, Belt, Bench, Bicycle, Bicycle helmet, Bicycle wheel, Bidet, Billboard, Billiard table, Binoculars, Bird, Blender, Blue jay, Boat, Bomb, Book, Bookcase, Boot, Bottle, Bottle opener, Bow and arrow, Bowl, Bowling equipment, Box, Boy, Brassiere, Bread, Briefcase, Broccoli, Bronze sculpture, Brown bear, Building, Bull, Burrito, Bus, Bust, Butterfly, Cabbage, Cabinetry, Cake, Cake stand, Calculator, Camel, Camera, Can opener, Canary, Candle, Candy, Cannon, Canoe, Cantaloupe, Car, Carnivore, Carrot, Cart, Cassette deck, Castle, Cat, Cat furniture, Caterpillar, Cattle, Ceiling fan, Cello, Centipede, Chainsaw, Chair, Cheese, Cheetah, Chest of drawers, Chicken, Chime, Chisel, Chopsticks, Christmas tree, Clock, Closet, Clothing, Coat, Cocktail, Cocktail shaker, Coconut, Coffee, Coffee cup, Coffee table, Coffeemaker, Coin, Common fig, Computer keyboard, Computer monitor, Computer mouse, Container, Convenience store, Cookie, Cooking spray, Corded phone, Cosmetics, Couch, Countertop, Cowboy hat, Crab, Cream, Cricket ball, Crocodile, Croissant, Crown, Crutch, Cucumber, Cupboard, Curtain, Cutting board, Dagger, Dairy, Deer, Desk, Dessert, Diaper, Dice, Digital clock, Dinosaur, Dishwasher, Dog, Dog bed, Doll, Dolphin, Door, Door handle, Doughnut, Dragonfly, Drawer, Dress, Drill, Drink, Drinking straw, Drum, Duck, Dumbbell, Eagle, Earrings, Egg, Elephant, Envelope, Eraser, Face powder, Facial tissue holder, Falcon, Fashion accessory, Fast food, Fax, Fedora, Filing cabinet, Fire hydrant, Fireplace, Fish, Flag, Flashlight, Flower, Flowerpot, Flute, Flying disc, Food, Food processor, Football, Football helmet, Footwear, Fork, Fountain, Fox, French fries, Frog, Fruit, Frying pan, Furniture, Gas stove, Giraffe, Girl, Glasses, Glove, Goat, Goggles, Goldfish, Golf ball, Golf cart, Gondola, Goose, Grape, Grapefruit, Grinder, Guacamole, Guitar, Hair dryer, Hair spray, Hamburger, Hammer, Hamster, Hand dryer, Handbag, Handgun, Harbor seal, Harmonica, Harp, Harpsichord, Hat, Headphones, Heater, Hedgehog, Helicopter, Helmet, High heels, Hiking equipment, Hippopotamus, Home appliance, Honeycomb, Horizontal bar, Horn, Horse, Hot dog, House, Houseplant, Human arm, Human beard, Human body, Human ear, Human eye, Human face, Human foot, Human hair, Human hand, Human head, Human leg, Human mouth, Human nose, Humidifier, Ice cream, Indoor rower, Infant bed, Insect, Invertebrate, Ipod, Isopod, Jacket, Jacuzzi, Jaguar, Jeans, Jellyfish, Jet ski, Jug, Juice, Kangaroo, Kettle, Kitchen & dining room table, Kitchen appliance, Kitchen knife, Kitchen utensil, Kitchenware, Kite, Knife, Koala, Ladder, Ladle, Ladybug, Lamp, Land vehicle, Lantern, Laptop, Lavender, Lemon, Leopard, Lifejacket, Light bulb, Light switch, Lighthouse, Lily, Limousine, Lion, Lipstick, Lizard, Lobster, Loveseat, Luggage and bags, Lynx, Magpie, Mammal, Man, Mango, Maple, Maracas, Marine invertebrates, Marine mammal, Measuring cup, Mechanical fan, Medical equipment, Microphone, Microwave oven, Milk, Miniskirt, Mirror, Missile, Mixer, Mixing bowl, Mobile phone, Monkey, Moths and butterflies, Motorcycle, Mouse, Muffin, Mug, Mule, Mushroom, Musical instrument, Musical keyboard, 
Nail, Necklace, Nightstand, Oboe, Office building, Office supplies, Orange, Organ, Ostrich, Otter, Oven, Owl, Oyster, Paddle, Palm tree, Pancake, Panda, Paper cutter, Paper towel, Parachute, Parking meter, Parrot, Pasta, Pastry, Peach, Pear, Pen, Pencil case, Pencil sharpener, Penguin, Perfume, Person, Personal care, Piano, Picnic basket, Picture frame, Pig, Pillow, Pineapple, Pitcher, Pizza, Pizza cutter, Plant, Plastic bag, Plate, Platter, Plumbing fixture, Polar bear, Pomegranate, Popcorn, Porch, Porcupine, Poster, Potato, Power plugs and sockets, Pressure cooker, Pretzel, Printer, Pumpkin, Punching bag, Rabbit, Raccoon, Racket, Radish, Ratchet, Raven, Rays and skates, Red panda, Refrigerator, Remote control, Reptile, Rhinoceros, Rifle, Ring binder, Rocket, Roller skates, Rose, Rugby ball, Ruler, Salad, Salt and pepper shakers, Sandal, Sandwich, Saucer, Saxophone, Scale, Scarf, Scissors, Scoreboard, Scorpion, Screwdriver, Sculpture, Sea lion, Sea turtle, Seafood, Seahorse, Seat belt, Segway, Serving tray, Sewing machine, Shark, Sheep, Shelf, Shellfish, Shirt, Shorts, Shotgun, Shower, Shrimp, Sink, Skateboard, Ski, Skirt, Skull, Skunk, Skyscraper, Slow cooker, Snack, Snail, Snake, Snowboard, Snowman, Snowmobile, Snowplow, Soap dispenser, Sock, Sofa bed, Sombrero, Sparrow, Spatula, Spice rack, Spider, Spoon, Sports equipment, Sports uniform, Squash, Squid, Squirrel, Stairs, Stapler, Starfish, Stationary bicycle, Stethoscope, Stool, Stop sign, Strawberry, Street light, Stretcher, Studio couch, Submarine, Submarine sandwich, Suit, Suitcase, Sun hat, Sunflower, Sunglasses, Surfboard, Sushi, Swan, Swim cap, Swimming pool, Swimwear, Sword, Syringe, Table, Table tennis racket, Tablet computer, Tableware, Taco, Tank, Tap, Tart, Taxi, Tea, Teapot, Teddy bear, Telephone, Television, Tennis ball, Tennis racket, Tent, Tiara, Tick, Tie, Tiger, Tin can, Tire, Toaster, Toilet, Toilet paper, Tomato, Tool, Toothbrush, Torch, Tortoise, Towel, Tower, Toy, Traffic light, Traffic sign, Train, Training bench, Treadmill, Tree, Tree house, Tripod, Trombone, Trousers, Truck, Trumpet, Turkey, Turtle, Umbrella, Unicycle, Van, Vase, Vegetable, Vehicle, Vehicle registration plate, Violin, Volleyball, Waffle, Waffle iron, Wall clock, Wardrobe, Washing machine, Waste container, Watch, Watercraft, Watermelon, Weapon, Whale, Wheel, Wheelchair, Whisk, Whiteboard, Willow, Window, Window blind, Wine, Wine glass, Wine rack, Winter melon, Wok, Woman, Wood-burning stove, Woodpecker, Worm, Wrench, Zebra, Zucchini
  • Name
    Image Detection
  • Model Type ID
    Visual Detector
  • Description
    Detects a variety of common objects and their locations by generating regions of an image that may contain each object.
  • Last Updated
    Oct 25, 2024
  • Privacy
    PUBLIC