General Visual Segmenter
Introduction
Image segmentation partitions an image into segments, typically by assigning a label to every pixel. It is an important step because it separates the objects in a complex visual scene. More capable segmentation models can also perform instance segmentation, in which each individual object of the same category is detected and delineated separately. Each detected segment is usually rendered in a distinct solid color, which gives a visual representation of the detected segments or instances.
Segmentation models are useful for extracting objects of interest for further processing and recognition. A common approach is to discard information that does not belong to the segment or instance of interest, so that subsequent models can devote their resources to the specified task.
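As a minimal sketch of the masking approach described above, the snippet below keeps only the pixels that belong to one label and blanks out everything else. The function name `keep_segment`, the nested-list image representation, and the `fill` value are illustrative assumptions, not part of this model's API.

```python
def keep_segment(image, labels, target_label, fill=0):
    """Return a copy of `image` where pixels outside `target_label` become `fill`.

    `image` and `labels` are nested lists of the same height and width
    (an assumed toy representation of an image and its per-pixel label map).
    """
    if len(image) != len(labels):
        raise ValueError("image and labels must have the same height")
    masked = []
    for img_row, lbl_row in zip(image, labels):
        if len(img_row) != len(lbl_row):
            raise ValueError("image and labels must have the same width")
        masked.append(
            [pixel if label == target_label else fill
             for pixel, label in zip(img_row, lbl_row)]
        )
    return masked


# Toy 2x2 grayscale image and its per-pixel label map.
image = [[10, 20], [30, 40]]
labels = [[1, 2], [1, 1]]
only_label_1 = keep_segment(image, labels, target_label=1)
```

A downstream model would then receive `only_label_1`, in which the single pixel labeled 2 has been replaced by the fill value.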
Our model is trained on the COCO-Stuff dataset, relying on a DeepLabv3 architecture with a ResNet-50 backbone.
DeepLabv3
The DeepLab family of models consists of semantic segmentation architectures. In particular, DeepLabv3 segments objects at multiple scales using modules that apply atrous convolution (also known as dilated convolution) in cascade or in parallel, with multiple atrous rates to capture context at various scales.
Beginning with DeepLabv2, Atrous Spatial Pyramid Pooling (ASPP) modules have been used to probe convolutional features at multiple scales and boost performance.
DeepLabv3 redesigns and improves the ASPP module by adding image-level features that encode global context. Global average pooling is applied to the last feature map of the model, the resulting features are fed to a 1x1 convolution layer with 256 filters and batch normalization, and the output is bilinearly upsampled to the target spatial dimensions.
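To make atrous convolution concrete, here is a minimal 1-D sketch in plain Python. It is not DeepLab code: the function name `atrous_conv1d` is hypothetical, there is no padding, and, as in most deep learning frameworks, the operation is computed as a cross-correlation. The atrous rate is the spacing between the input samples that the kernel taps read.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """Valid (unpadded) 1-D atrous convolution, computed as cross-correlation.

    With rate=1 this is an ordinary convolution. With rate=r the kernel taps
    are spaced r samples apart, which widens the field of view without adding
    any weights.
    """
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    span = (len(kernel) - 1) * rate + 1      # input samples covered by one output
    out_len = max(len(signal) - span + 1, 0)
    return [
        sum(w * signal[start + k * rate] for k, w in enumerate(kernel))
        for start in range(out_len)
    ]


signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]
dense = atrous_conv1d(signal, kernel, rate=1)   # taps read adjacent samples
atrous = atrous_conv1d(signal, kernel, rate=2)  # taps read every second sample
```

The same three weights cover three input samples at rate 1 and five input samples at rate 2.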
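The pooling and upsampling steps of this image-level branch can be sketched in plain Python. This is a simplified illustration under stated assumptions: the function name `image_level_features` is hypothetical, the feature map is a list of channels (each a nested list), and the 1x1 convolution and batch normalization steps are omitted. Bilinearly upsampling a 1x1 map yields a constant map, so the sketch simply repeats the pooled value.

```python
def image_level_features(feature_map, out_height, out_width):
    """Global-average-pool each channel, then expand it to out_height x out_width.

    `feature_map` is a list of channels, each a non-empty list of rows.
    The 1x1 convolution and batch normalization are left out of this sketch.
    """
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Upsampling a single value bilinearly gives a constant map.
    return [
        [[value] * out_width for _ in range(out_height)]
        for value in pooled
    ]


feature_map = [[[1.0, 3.0], [5.0, 7.0]]]        # one channel, 2x2
branch = image_level_features(feature_map, out_height=2, out_width=3)
```

Every spatial position of `branch` carries the same per-channel summary of the whole image, which is what lets this branch contribute global context.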
Figure: Cascaded modules without and with atrous convolution.
The resulting DeepLabv3 ASPP module consists of one 1x1 convolution layer and three 3x3 convolution layers with atrous rates of 6, 12, and 18 when the output stride is 16, all with 256 filters and batch normalization.
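To show what these rates mean in practice, the sketch below computes the effective kernel size of each 3x3 ASPP branch using the standard relation k_eff = k + (k - 1)(r - 1) for a kernel of size k and atrous rate r. The helper name `effective_kernel_size` is an illustrative assumption.

```python
def effective_kernel_size(kernel_size, rate):
    """Side length of the input window spanned by an atrous kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


ASPP_RATES = (6, 12, 18)  # rates for the 3x3 branches at output stride 16
effective_sizes = {rate: effective_kernel_size(3, rate) for rate in ASPP_RATES}

for rate, size in effective_sizes.items():
    print(f"rate {rate:2d}: 3x3 kernel spans a {size}x{size} window")
```

Each branch still has only nine weights per input-output channel pair, yet the three branches look at 13x13, 25x25, and 37x37 windows of the feature map.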
DeepLabv3 Paper
Rethinking Atrous Convolution for Semantic Image Segmentation
Authors: Liang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig Adam
Abstract
In this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter’s field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context and further boost performance. We also elaborate on implementation details and share our experience on training our system. The proposed ‘DeepLabv3’ system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
DeepLab Paper
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
Authors: Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille
Abstract
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed “DeepLab” system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
DeepLabv3 Author
ResNet
ResNet is a family of convolutional neural network (CNN) architectures. The original paper describes five variants of different depths, where depth counts only the layers with learnable weights:
- ResNet-18 is 18 layers deep: 17 convolution layers and 1 fully connected layer
- ResNet-34 is 34 layers deep: 33 convolution layers and 1 fully connected layer
- ResNet-50 is 50 layers deep: 49 convolution layers and 1 fully connected layer
- ResNet-101 is 101 layers deep: 100 convolution layers and 1 fully connected layer
- ResNet-152 is 152 layers deep: 150 convolution layers in residual blocks, 1 initial convolution layer, and 1 fully connected layer
Each variant also contains 1 max pool layer and 1 average pool layer, which have no weights and are not counted in the depth.
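The nominal depths can be checked with a little arithmetic. The sketch below counts only layers with learnable weights (convolutions and the final fully connected layer), which is the convention behind the names ResNet-18 through ResNet-152. The per-stage block counts come from the architecture table in the ResNet paper. The names `RESNET_BLOCKS` and `weighted_layers` are hypothetical, and the 1x1 projection convolutions on some shortcuts are excluded, as they are in the nominal depth.

```python
# (block type, number of residual blocks in each of the four stages)
RESNET_BLOCKS = {
    18: ("basic", (2, 2, 2, 2)),
    34: ("basic", (3, 4, 6, 3)),
    50: ("bottleneck", (3, 4, 6, 3)),
    101: ("bottleneck", (3, 4, 23, 3)),
    152: ("bottleneck", (3, 8, 36, 3)),
}
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}


def weighted_layers(depth):
    """Return (convolution layers, fully connected layers) for a ResNet variant."""
    block_type, blocks = RESNET_BLOCKS[depth]
    conv_layers = 1 + CONVS_PER_BLOCK[block_type] * sum(blocks)  # 1 = initial 7x7 conv
    return conv_layers, 1


for depth in RESNET_BLOCKS:
    convs, fcs = weighted_layers(depth)
    print(f"ResNet-{depth}: {convs} conv + {fcs} fc = {convs + fcs} layers")
```

The max pool and average pool layers are present in every variant but carry no weights, so they do not contribute to these totals.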
Although ResNet architectures were designed for image recognition, they have also produced good results in applications outside computer vision.
The Res in ResNet stands for residual learning, a technique applied to very deep neural networks. It was created to address the degradation problem, which prevented adding more layers in search of better model performance: as network depth increases, accuracy saturates and then degrades rapidly. The degradation shows up as higher training error, so it is not caused by overfitting.
After a series of unsuccessful approaches, the proposed solution was a residual learning network whose shortcut connections perform identity mappings.
An immediate benefit of identity shortcuts is that they add no parameters to the model and only a negligible amount of computation (an element-wise addition).
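A residual block with an identity shortcut computes y = F(x) + x, where F is the stack of layers being skipped. The plain-Python sketch below illustrates only that addition. The function `residual_block` and the stand-in residual functions are illustrative assumptions, not an actual ResNet layer.

```python
def residual_block(x, residual_fn):
    """Apply y = F(x) + x, where the identity shortcut adds no parameters."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("identity shortcut requires F(x) and x to match in size")
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """Stand-in for layers whose weights have been driven to zero."""
    return [0.0 for _ in x]


def halve(x):
    """Stand-in for an arbitrary learned residual function."""
    return [0.5 * xi for xi in x]


x = [1.0, -2.0, 4.0]
passthrough = residual_block(x, zero_residual)  # block reduces to the identity
adjusted = residual_block(x, halve)             # block output is x plus a correction
```

When the residual function outputs zeros, the block passes its input through unchanged, so extra blocks can fall back to an identity mapping instead of having to learn one from scratch.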
This approach allowed deeper networks to outperform shallow networks. The figure below demonstrates how the plain 18-layer shallow network performed better than the plain 34-layer deep network (left plot). On the other hand, when introducing residual learning, the 34-layer network was able to perform better than the 18-layer network (right plot).
Architecture of ResNet Networks
Note that ResNet architectures with 50 or more layers use a slight variation of the residual block displayed above. Instead of skipping 2 layers, each shortcut connection skips 3: a 1x1 convolution, a 3x3 convolution, and another 1x1 convolution. The first 1x1 layer reduces the number of channels and the last one restores it, so the 3x3 layer operates on fewer channels. Used this way, 1x1 layers act like channel-wise pooling, which reduces dimensionality and keeps the number of operations from exploding.
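The saving from the 1x1 layers can be estimated by counting convolution weights. The sketch below compares a three-layer bottleneck block with a two-layer block of 3x3 convolutions at the same width. The channel widths (256 in and out, 64 inside the bottleneck) follow the bottleneck example in the ResNet paper. The helper `conv_weights` is hypothetical, and biases and batch normalization parameters are ignored.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a kernel_size x kernel_size convolution (no bias)."""
    return kernel_size * kernel_size * in_channels * out_channels


# Bottleneck block: 1x1 reduces 256 -> 64, 3x3 works at 64, 1x1 restores 64 -> 256.
bottleneck_weights = (
    conv_weights(1, 256, 64)
    + conv_weights(3, 64, 64)
    + conv_weights(1, 64, 256)
)

# Two-layer block applied directly to 256 channels, for comparison.
two_layer_weights = 2 * conv_weights(3, 256, 256)

print(f"bottleneck block: {bottleneck_weights:,} weights")
print(f"two 3x3 layers:   {two_layer_weights:,} weights")
```

Under these assumptions the bottleneck block needs roughly one seventeenth of the weights of the two-layer block at 256 channels.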
ResNet Paper
Deep Residual Learning for Image Recognition
Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
ResNet Author
COCO-Stuff
The COCO-Stuff dataset was derived from the original COCO dataset, which lacked annotations for stuff classes. COCO-Stuff was constructed by adding those missing pixel-wise annotations.
The COCO-Stuff dataset is useful for scene understanding tasks, such as semantic segmentation, object detection, and image captioning.
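COCO-Stuff features 164k images annotated with 172 classes: 80 thing classes, 91 stuff classes, and 1 unlabeled class, which covers pixels that do not belong to any other class. Label IDs run higher than the class count, up to 183 values, because the original COCO numbering leaves some thing IDs unused.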
More Info
COCO-Stuff Paper
COCO-Stuff: Thing and Stuff Classes in Context
Authors: Holger Caesar, Jasper Uijlings, Vittorio Ferrari
Abstract
Semantic classes can be either things (objects with a well-defined shape, e.g. car, person) or stuff (amorphous background regions, e.g. grass, sky). While lots of classification and detection works focus on thing classes, less attention has been given to stuff classes. Nonetheless, stuff classes are important as they allow to explain important aspects of an image, including (1) scene type; (2) which thing classes are likely to be present and their location (through contextual reasoning); (3) physical attributes, material types and geometric properties of the scene. To understand stuff and things in context we introduce COCOStuff, which augments all 164K images of the COCO 2017 dataset with pixel-wise annotations for 91 stuff classes. We introduce an efficient stuff annotation protocol based on superpixels, which leverages the original thing annotations. We quantify the speed versus quality trade-off of our protocol and explore the relation between annotation time and boundary complexity. Furthermore, we use COCO-Stuff to analyze: (a) the importance of stuff and thing classes in terms of their surface cover and how frequently they are mentioned in image captions; (b) the spatial relations between stuff and things, highlighting the rich contextual relations that make our dataset unique; (c) the performance of a modern semantic segmentation method on stuff and thing classes, and whether stuff is easier to segment than things.
COCO-Stuff Author
University of Edinburgh and Google AI Perception
General Visual Segmenter Author
- ID
- Name: general
- Model Type ID: Visual Segmenter
- Description: AI model for identifying different objects in an image or video at pixel-level accuracy.
- Last Updated: Dec 18, 2023
- Privacy: PUBLIC
- Toolkit
- License