General Image Detector
Introduction
The general image detection model detects and labels a variety of common objects in images. To do this, it creates bounding box proposals. These box proposals can also be fed to downstream classifiers, which extends the labeling functionality of the model in situations where the dataset used for training (Open Images V4) does not yield satisfactory classification results.
This image detector uses a RetinaNet with an Inception-V4 backbone architecture.
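As a rough sketch of the downstream-classifier idea described above, the snippet below converts normalized box proposals to pixel coordinates and crops each region so that it could be handed to a second classifier. The detection format (`[ymin, xmin, ymax, xmax]` in the range 0 to 1 plus a score), the helper names, and the 0.5 score threshold are illustrative assumptions, not the model's documented output schema.

```python
from typing import List, Sequence, Tuple

# Hypothetical detection record: (normalized [ymin, xmin, ymax, xmax], score).
Detection = Tuple[Sequence[float], float]
Image = List[List[int]]  # toy grayscale image stored as rows of pixel values


def to_pixel_box(box: Sequence[float], height: int, width: int) -> Tuple[int, int, int, int]:
    """Scale a normalized box to integer pixel coordinates, clamped to the image."""
    ymin, xmin, ymax, xmax = box
    top = max(0, min(height, int(round(ymin * height))))
    left = max(0, min(width, int(round(xmin * width))))
    bottom = max(0, min(height, int(round(ymax * height))))
    right = max(0, min(width, int(round(xmax * width))))
    return top, left, bottom, right


def crop_proposals(image: Image, detections: List[Detection], min_score: float = 0.5) -> List[Image]:
    """Return one cropped sub-image per proposal whose score passes the threshold."""
    height, width = len(image), len(image[0])
    crops = []
    for box, score in detections:
        if score < min_score:
            continue  # skip low-confidence proposals
        top, left, bottom, right = to_pixel_box(box, height, width)
        if bottom <= top or right <= left:
            continue  # skip degenerate (empty) boxes
        crops.append([row[left:right] for row in image[top:bottom]])
    return crops


if __name__ == "__main__":
    toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
    proposals = [([0.0, 0.0, 0.5, 0.5], 0.9), ([0.5, 0.5, 1.0, 1.0], 0.2)]
    for crop in crop_proposals(toy_image, proposals):
        print(crop)  # each crop would be passed to the downstream classifier
```

Each returned crop can then be resized and classified by a model trained on labels that Open Images V4 does not cover well.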
Limitations
The model may have difficulty detecting small objects. Unlike the MS COCO dataset, Open Images consists mostly of images containing relatively large objects. It does, however, cover many more classes, which increases the number of detectable object types.
Detectors can only recognize objects whose classes appear in the dataset used for training, so they cannot detect out-of-sample classes (see the Taxonomy section below for the classes used for training). A few classes are also poorly represented in the Open Images V4 dataset.
RetinaNet
From Papers with Code:
RetinaNet is a one-stage object detection model that utilizes a focal loss function to address class imbalance during training. Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard negative examples. RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs convolutional object classification on the backbone's output; the second subnet performs convolutional bounding box regression. The two subnetworks feature a simple design that the authors propose specifically for one-stage, dense detection.
Cross Entropy vs. Focal Loss
RetinaNet Architecture
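To make the modulating term concrete, here is a minimal sketch of binary focal loss next to plain cross entropy. It follows the form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) from the Focal Loss paper; the defaults gamma = 2 and alpha = 0.25 are the values the paper reports as working well. The function names and the clamping constant are illustrative choices, not taken from this model's training code.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch


def cross_entropy(p: float, y: int) -> float:
    """Binary cross entropy for predicted foreground probability p and label y (0 or 1)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(max(p_t, EPS))


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss: cross entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, EPS))


if __name__ == "__main__":
    easy_negative = (0.01, 0)  # background predicted confidently as background
    hard_positive = (0.10, 1)  # object predicted with low confidence
    for name, (p, y) in (("easy negative", easy_negative), ("hard positive", hard_positive)):
        print(name, cross_entropy(p, y), focal_loss(p, y))
```

The easy negative keeps a small but nonzero cross entropy, while its focal loss is several orders of magnitude smaller, so the many easy background anchors contribute far less to the total loss.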
More Info
Paper
Focal Loss for Dense Object Detection
Authors: Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár
Abstract
The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.
Inception V4
Inception V4 is a CNN architecture that builds on previous iterations of the Inception architecture. It differs from its predecessors in that it simplifies the architecture and uses more inception modules.
Architecture
The original Inception architecture was heavily engineered to push performance in terms of speed and accuracy while reducing computational cost. Prior CNN solutions improved accuracy by increasing the depth of the network, which increased computational cost. Very deep networks are also susceptible to overfitting and have difficulty propagating gradient updates during backpropagation.
Inception networks heavily rely on 1x1 convolution layers, which are used as a dimensionality reduction module. This reduction allows for fewer operations, and thus, less computation. By using 1x1 convolution to reduce the model size, the overfitting problem is also reduced.
Although it might seem counterintuitive to add a layer in order to reduce the number of operations, performing a 3x3 or a 5x5 convolution on inputs with fewer channels drastically reduces the number of operations, greatly improving performance.
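The saving from the 1x1 reduction can be checked with simple arithmetic. The sketch below counts multiplications for a 5x5 convolution applied directly and with a 1x1 reduction first. The tensor sizes (a 28x28 input with 192 channels, reduced to 16 channels before a 5x5 convolution with 32 filters) are illustrative assumptions, and stride 1 with "same" padding is assumed so the spatial size is unchanged.

```python
def conv_multiplies(height: int, width: int, in_channels: int,
                    out_channels: int, kernel: int) -> int:
    """Multiplications for a stride-1, same-padded convolution layer."""
    return height * width * out_channels * kernel * kernel * in_channels


HEIGHT, WIDTH, IN_CHANNELS = 28, 28, 192  # assumed input feature map
REDUCED_CHANNELS, OUT_CHANNELS = 16, 32   # assumed 1x1 and 5x5 filter counts

# 5x5 convolution applied directly to all 192 input channels.
direct = conv_multiplies(HEIGHT, WIDTH, IN_CHANNELS, OUT_CHANNELS, 5)

# 1x1 reduction to 16 channels, followed by the 5x5 convolution.
reduced = (conv_multiplies(HEIGHT, WIDTH, IN_CHANNELS, REDUCED_CHANNELS, 1)
           + conv_multiplies(HEIGHT, WIDTH, REDUCED_CHANNELS, OUT_CHANNELS, 5))

print(direct)   # 120422400
print(reduced)  # 12443648
```

Under these assumptions the extra 1x1 layer cuts the multiplication count by nearly a factor of ten.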
The design of the inception module effectively causes the network to get wider rather than deeper. The "naive" inception module performs convolution on an input with 3 different sizes of filters (1x1, 3x3, 5x5) as well as 3x3 max pooling.
Naive Inception Module
This is where the dimensionality reduction comes in via the 1x1 convolution layer, reducing size prior to the 3x3 and 5x5 convolution layers, and after the 3x3 max pooling.
Inception Module with Dimension Reductions
Inception modules can be connected to each other (linear stacking) by concatenating the outputs of the filters (both convolution and max pool), and feeding them as the input of the next inception layer.
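The concatenation and stacking described above can be sketched as shape bookkeeping: each module's output channel count is the sum of its four branch outputs, and that concatenated tensor becomes the next module's input. The class name, field names, and the filter counts in the example are illustrative assumptions; stride 1 with "same" padding is assumed so height and width are preserved.

```python
from dataclasses import dataclass
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


@dataclass(frozen=True)
class InceptionModule:
    """Filter counts for each branch of a dimension-reduced inception module."""
    conv1x1: int    # 1x1 branch
    reduce3x3: int  # 1x1 reduction placed before the 3x3 convolution
    conv3x3: int    # 3x3 branch output
    reduce5x5: int  # 1x1 reduction placed before the 5x5 convolution
    conv5x5: int    # 5x5 branch output
    pool_proj: int  # 1x1 projection placed after the 3x3 max pooling

    def output_shape(self, input_shape: Shape) -> Shape:
        """Concatenate the four branch outputs along the channel axis."""
        height, width, _ = input_shape
        channels = self.conv1x1 + self.conv3x3 + self.conv5x5 + self.pool_proj
        return (height, width, channels)


def stack_shapes(input_shape: Shape, modules: List[InceptionModule]) -> List[Shape]:
    """Feed each module's concatenated output into the next module."""
    shapes = []
    current = input_shape
    for module in modules:
        current = module.output_shape(current)
        shapes.append(current)
    return shapes


if __name__ == "__main__":
    stack = [
        InceptionModule(64, 96, 128, 16, 32, 32),
        InceptionModule(128, 128, 192, 32, 96, 64),
    ]
    print(stack_shapes((28, 28, 192), stack))  # [(28, 28, 256), (28, 28, 480)]
```

Note that the reduction layers do not appear in the output channel count; they only shrink the input seen by the 3x3 and 5x5 convolutions.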
The original inception network, GoogLeNet, was built from these inception modules. GoogLeNet is composed of 9 inception modules stacked linearly. It is 22 layers deep (27, including the pooling layers), and it uses global average pooling at the end of the last inception module.
A 22-layer network is a very deep classifier, and like any other very deep network, it faces the vanishing gradient problem, which prevents middle and early layers from learning during backpropagation.
This problem was overcome by introducing two auxiliary classifiers (see purple boxes in the figure below). These essentially apply softmax to the outputs of two intermediate inception modules and compute an auxiliary loss over the same labels. The total loss function is a weighted sum of the auxiliary losses and the real loss. The original paper uses a weight of 0.3 for each auxiliary loss.
total_loss = real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2
GoogLeNet Network
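As a small illustration of the weighted sum above, the sketch below computes a softmax cross entropy for the main classifier and for two auxiliary classifiers against the same label, then combines them with the 0.3 weight. The function names and the toy logits are hypothetical; only the weighting scheme comes from the text.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits: Sequence[float], label: int) -> float:
    """Softmax cross entropy for a single example with an integer class label."""
    return -math.log(softmax(logits)[label])


def googlenet_total_loss(main_logits: Sequence[float],
                         aux_logits_1: Sequence[float],
                         aux_logits_2: Sequence[float],
                         label: int,
                         aux_weight: float = 0.3) -> float:
    """total_loss = real_loss + aux_weight * aux_loss_1 + aux_weight * aux_loss_2."""
    real_loss = cross_entropy_from_logits(main_logits, label)
    aux_loss_1 = cross_entropy_from_logits(aux_logits_1, label)
    aux_loss_2 = cross_entropy_from_logits(aux_logits_2, label)
    return real_loss + aux_weight * aux_loss_1 + aux_weight * aux_loss_2


if __name__ == "__main__":
    # Toy logits for a 4-class problem; the true class is index 2.
    print(googlenet_total_loss([0.1, 0.2, 2.0, 0.3],
                               [0.0, 0.5, 1.0, 0.2],
                               [0.3, 0.1, 0.8, 0.4],
                               label=2))
```

The auxiliary terms matter only during training, where they inject gradient into the middle of the network; at inference time only the main classifier output is used.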
Inception V4 Paper
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Authors: Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi
Abstract
Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there are any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual and one Inception-v4, we achieve 3.08% top-5 error on the test set of the ImageNet classification (CLS) challenge.
GoogLeNet Paper
Going Deeper with Convolutions
Authors: Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
Abstract
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
Dataset
The model was trained primarily on the Open Images V4 dataset.
Note that a few classes are not well represented in the Open Images validation set. Compared to the MS COCO dataset, Open Images contains a larger share of relatively large objects, but it has the advantage of covering many more classes, which increases the number of detectable object types.
More Info
Dataset Homepage: Google APIs
Source Code: tfds.object_detection.OpenImagesV4
Versions: 2.0.0 (default)
Download Size: 565.11 GiB
Metrics
Validation mean Average Precision (mAP) at an Intersection over Union (IoU) threshold of 0.5 is 38.6%, using the standard Pascal VOC (Visual Object Classes) and COCO mAP definition.
mAP is calculated assuming fully annotated images. It does not use the modified mAP_OI metric from the Open Images V4 paper, which counts false positives (FPs) only for images with human-annotated negative classes (i.e. annotated class absence).
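For readers unfamiliar with the metric, here is a minimal single-class sketch of IoU and average precision at an IoU threshold of 0.5, under the full-annotation assumption stated above (every unmatched detection counts as a false positive). The box format `(xmin, ymin, xmax, ymax)`, the function names, and the all-point interpolation are assumptions of this sketch, not the exact evaluation code behind the reported number. mAP would be the mean of this value over all classes.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections: List[Tuple[Box, float]],
                      ground_truth: List[Box],
                      iou_threshold: float = 0.5) -> float:
    """Average precision for one class, using all-point interpolation."""
    if not ground_truth:
        return 0.0
    matched = [False] * len(ground_truth)
    hits = []
    # Visit detections from highest to lowest score.
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        overlaps = [iou(box, gt) for gt in ground_truth]
        best = max(range(len(overlaps)), key=overlaps.__getitem__)
        # True positive only if the best-overlapping box is still unmatched.
        is_hit = overlaps[best] >= iou_threshold and not matched[best]
        if is_hit:
            matched[best] = True
        hits.append(is_hit)
    precisions, recalls, true_positives = [], [], 0
    for rank, is_hit in enumerate(hits, start=1):
        true_positives += int(is_hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(ground_truth))
    # Make precision monotonically non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the precision-recall curve.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap
```

Under mAP_OI, by contrast, a detection of a class that was never verified as absent in an image would be ignored rather than counted as a false positive.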
Open Images V4 Paper
Authors: Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, Vittorio Ferrari
Abstract
We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows to share and adapt the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15× more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth comprehensive statistics about the dataset, we validate the quality of the annotations, we study how the performance of several modern models evolves with increasing amounts of training data, and we demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.
Feature Structure
FeaturesDict({
    'bobjects': Sequence({
        'bbox': BBoxFeature(shape=(4,), dtype=tf.float32),
        'is_depiction': tf.int8,
        'is_group_of': tf.int8,
        'is_inside': tf.int8,
        'is_occluded': tf.int8,
        'is_truncated': tf.int8,
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=601),
        'source': ClassLabel(shape=(), dtype=tf.int64, num_classes=6),
    }),
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string),
    'objects': Sequence({
        'confidence': tf.int32,
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=19995),
        'source': ClassLabel(shape=(), dtype=tf.int64, num_classes=6),
    }),
    'objects_trainable': Sequence({
        'confidence': tf.int32,
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=7186),
        'source': ClassLabel(shape=(), dtype=tf.int64, num_classes=6),
    }),
})
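As an illustration of how the `bobjects` fields might be used, the sketch below drops boxes flagged as group-of or depiction before training or evaluation. It works on a hand-built plain dictionary that mimics the structure above (one list per field, aligned by index) rather than on real TFDS tensors; the helper name and the convention that a flag value of 1 means "true" are assumptions of this sketch.

```python
from typing import Dict, List


def select_plain_boxes(bobjects: Dict[str, List]) -> Dict[str, List]:
    """Keep only boxes that are neither group-of nor depiction annotations."""
    keep = [
        index
        for index, (group, depiction) in enumerate(
            zip(bobjects["is_group_of"], bobjects["is_depiction"]))
        if group != 1 and depiction != 1
    ]
    # Apply the same index selection to every aligned field.
    return {name: [values[i] for i in keep] for name, values in bobjects.items()}


if __name__ == "__main__":
    # Hypothetical record with three boxes; the label ids are made up.
    example = {
        "bbox": [[0.1, 0.1, 0.5, 0.5], [0.0, 0.0, 1.0, 1.0], [0.2, 0.3, 0.4, 0.6]],
        "is_group_of": [0, 1, 0],
        "is_depiction": [0, 0, 1],
        "label": [12, 40, 7],
    }
    print(select_plain_boxes(example))
```

Whether to filter on these flags is a design choice: group-of boxes cover several instances at once, and depictions are drawings or statues of an object rather than the object itself.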
Feature Documentation
| Feature | Class | Shape | Dtype |
|---|---|---|---|
| | FeaturesDict | | |
| bobjects | Sequence | | |
| bobjects/bbox | BBoxFeature | (4,) | tf.float32 |
| bobjects/is_depiction | Tensor | | tf.int8 |
| bobjects/is_group_of | Tensor | | tf.int8 |
| bobjects/is_inside | Tensor | | tf.int8 |
| bobjects/is_occluded | Tensor | | tf.int8 |
| bobjects/is_truncated | Tensor | | tf.int8 |
| bobjects/label | ClassLabel | | tf.int64 |
| bobjects/source | ClassLabel | | tf.int64 |
| image | Image | (None, None, 3) | tf.uint8 |
| image/filename | Text | | tf.string |
| objects | Sequence | | |
| objects/confidence | Tensor | | tf.int32 |
| objects/label | ClassLabel | | tf.int64 |
| objects/source | ClassLabel | | tf.int64 |
| objects_trainable | Sequence | | |
| objects_trainable/confidence | Tensor | | tf.int32 |
| objects_trainable/label | ClassLabel | | tf.int64 |
| objects_trainable/source | ClassLabel | | tf.int64 |
Data Splits
Name | Train | Validation | Test |
---|---|---|---|
Open Images V4 | 1,743,042 | 41,620 | 125,436 |
Citation Information
@article{OpenImages,
author = {Alina Kuznetsova and
Hassan Rom and
Neil Alldrin and
Jasper Uijlings and
Ivan Krasin and
Jordi Pont-Tuset and
Shahab Kamali and
Stefan Popov and
Matteo Malloci and
Alexander Kolesnikov and
Tom Duerig and
Vittorio Ferrari},
title = {The Open Images Dataset V4: Unified image classification,
object detection, and visual relationship detection at scale},
year = {2018},
journal = {arXiv:1811.00982}
}
@article{OpenImages2,
author = {Krasin, Ivan and
Duerig, Tom and
Alldrin, Neil and
Ferrari, Vittorio
and Abu-El-Haija, Sami and
Kuznetsova, Alina and
Rom, Hassan and
Uijlings, Jasper and
Popov, Stefan and
Kamali, Shahab and
Malloci, Matteo and
Pont-Tuset, Jordi and
Veit, Andreas and
Belongie, Serge and
Gomes, Victor and
Gupta, Abhinav and
Sun, Chen and
Chechik, Gal and
Cai, David and
Feng, Zheyun and
Narayanan, Dhyanesh and
Murphy, Kevin},
title = {OpenImages: A public dataset for large-scale multi-label and
multi-class image classification.},
journal = {Dataset available from
https://storage.googleapis.com/openimages/web/index.html},
year={2017}
}
Taxonomy
All recognizable objects can be found below in two formats: a code box for easy copying, and a table that includes the Average Precision (AP) and the number of ground-truth bounding boxes (GT boxes) per object.
Accordion, Adhesive tape, Aircraft, Airplane, Alarm clock, Alpaca, Ambulance, Animal, Ant, Antelope, Apple, Armadillo, Artichoke, Asparagus, Auto part, Axe, Backpack, Bagel, Baked goods, Balance beam, Ball, Balloon, Banana, Band-aid, Banjo, Barge, Barrel, Baseball bat, Baseball glove, Bat, Bathroom accessory, Bathroom cabinet, Bathtub, Beaker, Bear, Bed, Bee, Beehive, Beer, Beetle, Bell pepper, Belt, Bench, Bicycle, Bicycle helmet, Bicycle wheel, Bidet, Billboard, Billiard table, Binoculars, Bird, Blender, Blue jay, Boat, Bomb, Book, Bookcase, Boot, Bottle, Bottle opener, Bow and arrow, Bowl, Bowling equipment, Box, Boy, Brassiere, Bread, Briefcase, Broccoli, Bronze sculpture, Brown bear, Building, Bull, Burrito, Bus, Bust, Butterfly, Cabbage, Cabinetry, Cake, Cake stand, Calculator, Camel, Camera, Can opener, Canary, Candle, Candy, Cannon, Canoe, Cantaloupe, Car, Carnivore, Carrot, Cart, Cassette deck, Castle, Cat, Cat furniture, Caterpillar, Cattle, Ceiling fan, Cello, Centipede, Chainsaw, Chair, Cheese, Cheetah, Chest of drawers, Chicken, Chime, Chisel, Chopsticks, Christmas tree, Clock, Closet, Clothing, Coat, Cocktail, Cocktail shaker, Coconut, Coffee, Coffee cup, Coffee table, Coffeemaker, Coin, Common fig, Computer keyboard, Computer monitor, Computer mouse, Container, Convenience store, Cookie, Cooking spray, Corded phone, Cosmetics, Couch, Countertop, Cowboy hat, Crab, Cream, Cricket ball, Crocodile, Croissant, Crown, Crutch, Cucumber, Cupboard, Curtain, Cutting board, Dagger, Dairy, Deer, Desk, Dessert, Diaper, Dice, Digital clock, Dinosaur, Dishwasher, Dog, Dog bed, Doll, Dolphin, Door, Door handle, Doughnut, Dragonfly, Drawer, Dress, Drill, Drink, Drinking straw, Drum, Duck, Dumbbell, Eagle, Earrings, Egg, Elephant, Envelope, Eraser, Face powder, Facial tissue holder, Falcon, Fashion accessory, Fast food, Fax, Fedora, Filing cabinet, Fire hydrant, Fireplace, Fish, Flag, Flashlight, Flower, Flowerpot, Flute, Flying disc, Food, Food processor, Football, 
Football helmet, Footwear, Fork, Fountain, Fox, French fries, Frog, Fruit, Frying pan, Furniture, Gas stove, Giraffe, Girl, Glasses, Glove, Goat, Goggles, Goldfish, Golf ball, Golf cart, Gondola, Goose, Grape, Grapefruit, Grinder, Guacamole, Guitar, Hair dryer, Hair spray, Hamburger, Hammer, Hamster, Hand dryer, Handbag, Handgun, Harbor seal, Harmonica, Harp, Harpsichord, Hat, Headphones, Heater, Hedgehog, Helicopter, Helmet, High heels, Hiking equipment, Hippopotamus, Home appliance, Honeycomb, Horizontal bar, Horn, Horse, Hot dog, House, Houseplant, Human arm, Human beard, Human body, Human ear, Human eye, Human face, Human foot, Human hair, Human hand, Human head, Human leg, Human mouth, Human nose, Humidifier, Ice cream, Indoor rower, Infant bed, Insect, Invertebrate, Ipod, Isopod, Jacket, Jacuzzi, Jaguar, Jeans, Jellyfish, Jet ski, Jug, Juice, Kangaroo, Kettle, Kitchen & dining room table, Kitchen appliance, Kitchen knife, Kitchen utensil, Kitchenware, Kite, Knife, Koala, Ladder, Ladle, Ladybug, Lamp, Land vehicle, Lantern, Laptop, Lavender, Lemon, Leopard, Lifejacket, Light bulb, Light switch, Lighthouse, Lily, Limousine, Lion, Lipstick, Lizard, Lobster, Loveseat, Luggage and bags, Lynx, Magpie, Mammal, Man, Mango, Maple, Maracas, Marine invertebrates, Marine mammal, Measuring cup, Mechanical fan, Medical equipment, Microphone, Microwave oven, Milk, Miniskirt, Mirror, Missile, Mixer, Mixing bowl, Mobile phone, Monkey, Moths and butterflies, Motorcycle, Mouse, Muffin, Mug, Mule, Mushroom, Musical instrument, Musical keyboard, Nail, Necklace, Nightstand, Oboe, Office building, Office supplies, Orange, Organ, Ostrich, Otter, Oven, Owl, Oyster, Paddle, Palm tree, Pancake, Panda, Paper cutter, Paper towel, Parachute, Parking meter, Parrot, Pasta, Pastry, Peach, Pear, Pen, Pencil case, Pencil sharpener, Penguin, Perfume, Person, Personal care, Piano, Picnic basket, Picture frame, Pig, Pillow, Pineapple, Pitcher, Pizza, Pizza cutter, Plant, Plastic bag, Plate, 
Platter, Plumbing fixture, Polar bear, Pomegranate, Popcorn, Porch, Porcupine, Poster, Potato, Power plugs and sockets, Pressure cooker, Pretzel, Printer, Pumpkin, Punching bag, Rabbit, Raccoon, Racket, Radish, Ratchet, Raven, Rays and skates, Red panda, Refrigerator, Remote control, Reptile, Rhinoceros, Rifle, Ring binder, Rocket, Roller skates, Rose, Rugby ball, Ruler, Salad, Salt and pepper shakers, Sandal, Sandwich, Saucer, Saxophone, Scale, Scarf, Scissors, Scoreboard, Scorpion, Screwdriver, Sculpture, Sea lion, Sea turtle, Seafood, Seahorse, Seat belt, Segway, Serving tray, Sewing machine, Shark, Sheep, Shelf, Shellfish, Shirt, Shorts, Shotgun, Shower, Shrimp, Sink, Skateboard, Ski, Skirt, Skull, Skunk, Skyscraper, Slow cooker, Snack, Snail, Snake, Snowboard, Snowman, Snowmobile, Snowplow, Soap dispenser, Sock, Sofa bed, Sombrero, Sparrow, Spatula, Spice rack, Spider, Spoon, Sports equipment, Sports uniform, Squash, Squid, Squirrel, Stairs, Stapler, Starfish, Stationary bicycle, Stethoscope, Stool, Stop sign, Strawberry, Street light, Stretcher, Studio couch, Submarine, Submarine sandwich, Suit, Suitcase, Sun hat, Sunflower, Sunglasses, Surfboard, Sushi, Swan, Swim cap, Swimming pool, Swimwear, Sword, Syringe, Table, Table tennis racket, Tablet computer, Tableware, Taco, Tank, Tap, Tart, Taxi, Tea, Teapot, Teddy bear, Telephone, Television, Tennis ball, Tennis racket, Tent, Tiara, Tick, Tie, Tiger, Tin can, Tire, Toaster, Toilet, Toilet paper, Tomato, Tool, Toothbrush, Torch, Tortoise, Towel, Tower, Toy, Traffic light, Traffic sign, Train, Training bench, Treadmill, Tree, Tree house, Tripod, Trombone, Trousers, Truck, Trumpet, Turkey, Turtle, Umbrella, Unicycle, Van, Vase, Vegetable, Vehicle, Vehicle registration plate, Violin, Volleyball, Waffle, Waffle iron, Wall clock, Wardrobe, Washing machine, Waste container, Watch, Watercraft, Watermelon, Weapon, Whale, Wheel, Wheelchair, Whisk, Whiteboard, Willow, Window, Window blind, Wine, Wine glass, Wine rack, 
Winter melon, Wok, Woman, Wood-burning stove, Woodpecker, Worm, Wrench, Zebra, Zucchini
- Name: Image Detection
- Model Type ID: Visual Detector
- Description: Detects a variety of common objects and their locations, generating regions of an image that may contain each object.
- Last Updated: Dec 18, 2023
- Privacy: PUBLIC