general-image-embedding

AI visual recognition model for returning 1024-dimensional numerical vectors that represent the items in images and video.

Notes

General Image Embedding

This model produces embedding representations of input images.

An embedding is a numerical vector that represents an input image in a 1024-dimensional space; it is computed using Clarifai’s General Model. Vectors of visually similar images will be close to each other in that space.
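
To make "close to each other" concrete, here is a minimal sketch that compares embedding vectors. The text does not say which metric is intended, so cosine similarity is an assumption here, and the 4-dimensional vectors are made-up stand-ins for real 1024-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 4-dimensional stand-ins for real 1024-dimensional embeddings (made-up values).
cat_photo_1 = [0.9, 0.1, 0.0, 0.2]
cat_photo_2 = [0.8, 0.2, 0.1, 0.3]
car_photo = [0.0, 0.9, 0.8, 0.1]

print(cosine_similarity(cat_photo_1, cat_photo_2))  # higher: visually similar
print(cosine_similarity(cat_photo_1, car_photo))    # lower: visually dissimilar
```

Euclidean distance would serve equally well for the illustration; which metric works best for this model's embeddings is not stated in this document.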

General embedding use cases include filtering, indexing, ranking, and organizing images according to visual similarity. Embeddings are also well suited to transfer learning tasks, because categorical embeddings learned on other tasks can be reused to improve model performance. Categorical embeddings tend to perform well because they carry a sense of similarity and dissimilarity among items, which allows models to generalize better.
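
As an illustration of the ranking use case, the sketch below orders a small gallery by distance to a query embedding. The image IDs, the 3-dimensional vectors, and the choice of Euclidean distance are all hypothetical; a real system would use the 1024-dimensional vectors returned by the model and, most likely, an approximate nearest-neighbor index.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_similarity(query, gallery):
    """Return (image_id, distance) pairs, nearest first."""
    scored = [(image_id, euclidean_distance(query, vec))
              for image_id, vec in gallery.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical gallery: image ID -> embedding (toy 3-dimensional values).
gallery = {
    "city_01": [0.1, 0.9, 0.3],
    "beach_02": [0.8, 0.2, 0.1],
    "beach_01": [0.9, 0.1, 0.0],
}
query = [0.9, 0.1, 0.05]

ranked = rank_by_similarity(query, gallery)
print([image_id for image_id, _ in ranked])
```

Filtering follows the same pattern: keep only the entries whose distance falls below a threshold chosen for the application.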

Architecture

The Clarifai general model is a multifunction computer vision model capable of classifying a variety of concepts and common objects. Its intended use is the identification and classification of image contents using tags, filtering, and cascade routines (i.e., sequential or concurrent analysis of image content using weak classifiers to avoid the usage of extensive computation resources and time).
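
The cascade idea can be sketched as a chain of cheap checks that stops at the first rejection, so expensive stages run only on images that survive the earlier ones. This is a generic illustration of the sequential variant, not Clarifai's implementation; the stage names, thresholds, and image fields are hypothetical.

```python
def run_cascade(image, stages):
    """Run checks in order and stop as soon as one stage rejects the image.

    `stages` is a list of (name, predicate) pairs ordered from cheapest to
    most expensive. Returns (accepted, names_of_stages_that_ran).
    """
    ran = []
    for name, predicate in stages:
        ran.append(name)
        if not predicate(image):
            return False, ran
    return True, ran

# Hypothetical weak classifiers reading precomputed scores from an image record.
stages = [
    ("not_blank", lambda img: img["mean_brightness"] > 0.05),
    ("has_object", lambda img: img["objectness"] > 0.5),
    ("is_dog", lambda img: img["dog_score"] > 0.8),
]

blank = {"mean_brightness": 0.01, "objectness": 0.0, "dog_score": 0.0}
dog = {"mean_brightness": 0.6, "objectness": 0.9, "dog_score": 0.95}

print(run_cascade(blank, stages))  # rejected after the first stage only
print(run_cascade(dog, stages))    # passes every stage
```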

We based our general classifier model on a ResNeXt architecture. Given that ResNeXt is based on the residual ResNet approach, we have provided an overview of ResNet architectures below.

Limitations

Due to the nature of the dataset, our general model performs best when the content to be recognized is prominent in the image.

ResNet

ResNet is a family of five ML CNN architectures with different depths:

ResNet-18 is 18 layers deep: 17 convolution layers and 1 fully connected layer
ResNet-34 is 34 layers deep: 33 convolution layers and 1 fully connected layer
ResNet-50 is 50 layers deep: 49 convolution layers and 1 fully connected layer
ResNet-101 is 101 layers deep: 100 convolution layers and 1 fully connected layer
ResNet-152 is 152 layers deep: 151 convolution layers and 1 fully connected layer

Each network also contains 1 max pool layer and 1 average pool layer; these have no weights and are not included in the depth counts.
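
The depth figures above can be reproduced by counting weighted layers: one initial convolution, the convolutions inside the residual blocks, and one final fully connected layer. The per-stage block counts below are taken from the architecture table of the ResNet paper and should be treated as an assumption to verify against it; the function name is illustrative.

```python
# Residual blocks in each of the four stages, keyed by network depth.
BLOCKS_PER_STAGE = {
    18: (2, 2, 2, 2),
    34: (3, 4, 6, 3),
    50: (3, 4, 6, 3),
    101: (3, 4, 23, 3),
    152: (3, 8, 36, 3),
}

def weighted_layers(depth):
    """Initial convolution + convolutions inside residual blocks + final FC layer."""
    convs_per_block = 2 if depth < 50 else 3  # basic block vs. bottleneck block
    block_convs = sum(BLOCKS_PER_STAGE[depth]) * convs_per_block
    return 1 + block_convs + 1

for depth in BLOCKS_PER_STAGE:
    print(f"ResNet-{depth}: {weighted_layers(depth)} weighted layers")
```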

Although ResNet architectures were designed for image recognition, they also exhibit successful results in non-computer vision applications.

The Res in ResNet stands for residual learning, a technique generally applied to very deep neural networks. It was created to address a problem that prevented adding a large number of layers in search of better model performance, and that is commonly discussed alongside vanishing and exploding gradients: as very deep networks approach convergence, accuracy saturates and then degrades, and this shows up as higher training error rather than as overfitting.

After a series of unsuccessful approaches, the proposed solution was a residual learning network whose shortcut connections perform identity mappings, so that each block outputs F(x) + x, where F is the function learned by the layers the shortcut skips.

Residual learning framework

An immediate benefit of identity shortcuts is that they add no parameters to the model, so computation time and resource usage stay essentially unchanged (the only extra work is an element-wise addition).
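
A minimal sketch of the idea, using small fully connected layers in place of convolutions to keep it short (that substitution, and all function names, are simplifications made for this example). Both blocks use the same two weight matrices; the residual version only adds the input back before the final activation.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def plain_block(x, w1, w2):
    """Two stacked layers, no shortcut: relu(W2 * relu(W1 * x))."""
    return relu(linear(w2, relu(linear(w1, x))))

def residual_block(x, w1, w2):
    """Same two layers plus an identity shortcut: relu(F(x) + x)."""
    fx = linear(w2, relu(linear(w1, x)))
    return relu([f + xi for f, xi in zip(fx, x)])

x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# With all-zero weights the plain block loses its input, while the residual
# block falls back to the identity mapping and passes the input through.
print(plain_block(x, zeros, zeros))     # [0.0, 0.0]
print(residual_block(x, zeros, zeros))  # [1.0, 2.0]
```

This is why extra residual layers are easy to make harmless: driving their weights toward zero reduces the block to an identity mapping.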

This approach allowed deeper networks to outperform shallow networks. The figure below demonstrates how the plain 18-layer shallow network performed better than the plain 34-layer deep network (left plot). On the other hand, when introducing residual learning, the 34-layer network was able to perform better than the 18-layer network (right plot).

Plain vs. ResNet plots

Architecture of ResNet Networks

Note that ResNet architectures with 50 or more layers use a slight variation of the residual block displayed above. Instead of skipping 2 layers, each shortcut connection skips 3: a 1x1 convolution, a 3x3 convolution, and another 1x1 convolution. As you may know, 1x1 layers can act like channel-wise pooling: the first one reduces the number of channels before the 3x3 convolution and the last one restores it, which keeps the number of operations from exploding.
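
The saving from 1x1 layers can be checked with simple arithmetic. The sketch below counts convolution weights (biases ignored) for a block operating on 256 channels; the 256 and 64 channel widths are illustrative values in the style of the ResNet paper's bottleneck example, not figures stated in this document.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels

# Two 3x3 convolutions applied directly to 256 channels.
basic_256 = 2 * conv_weights(3, 256, 256)

# Bottleneck: 1x1 reduces 256 -> 64, 3x3 works on 64, 1x1 restores 64 -> 256.
bottleneck_256 = (
    conv_weights(1, 256, 64)
    + conv_weights(3, 64, 64)
    + conv_weights(1, 64, 256)
)

print(basic_256)       # 1179648
print(bottleneck_256)  # 69632
```

Under these assumptions the three-layer bottleneck needs roughly 17 times fewer weights than two plain 3x3 layers at the same input width.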

ResNet Paper

Deep Residual Learning for Image Recognition

Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

ResNet Author

Microsoft Research

ResNeXt

ResNeXt is also based on the residual approach introduced by ResNet architectures. However, its building block introduces a new dimension, cardinality (C), which is the size of the set of transformations applied in parallel (a measure of width, as opposed to depth). This use of parallel stacked layers resembles inception modules, where the outputs of the parallel layers are aggregated before being fed to the next inception module in the chain.
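
A rough sketch of the aggregation: the block output is the input plus the sum of C parallel transformations. The toy transformations below are simple scalings chosen only to make the arithmetic easy to follow. The weight-count helper assumes a 256-channel block whose branches are 1x1, 3x3, 1x1 convolutions of width d, which is how the "32x4d" and "1x64d" labels are usually read; all names are hypothetical.

```python
def aggregated_residual(x, transforms):
    """y = x + sum of T_i(x); len(transforms) is the cardinality C."""
    total = [0.0] * len(x)
    for transform in transforms:
        total = [t + o for t, o in zip(total, transform(x))]
    return [xi + t for xi, t in zip(x, total)]

def make_scaler(k):
    """Toy transformation: multiply every element by k."""
    return lambda v: [k * a for a in v]

# Cardinality 2 with toy branches: x + 0.5x - 0.25x.
print(aggregated_residual([4.0, 8.0], [make_scaler(0.5), make_scaler(-0.25)]))

def block_weights(cardinality, width, channels=256):
    """Weights in C parallel branches of 1x1 -> 3x3 -> 1x1 convolutions."""
    per_branch = channels * width + 3 * 3 * width * width + width * channels
    return cardinality * per_branch

print(block_weights(1, 64))  # ResNet-style bottleneck (1x64d): 69632
print(block_weights(32, 4))  # ResNeXt-style block (32x4d): 70144
```

Under these assumptions the two configurations have nearly the same number of weights, which is the sense in which cardinality can be increased while holding complexity roughly constant.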

Original Repository: GitHub

ResNeXt Residual Block

The ResNeXt architecture demonstrated better results than its ResNet counterpart. The following table compares the Top-1 error of several ResNeXt architectures with that of their ResNet counterparts.

Single-crop (224x224) validation error rate

Network	GFLOPs	Top-1 Error (%)
ResNet-50 (1x64d)	~4.1	23.9
ResNeXt-50 (32x4d)	~4.1	22.2
ResNet-101 (1x64d)	~7.8	22.0
ResNeXt-101 (32x4d)	~7.8	21.2
ResNeXt-101 (64x4d)	~15.6	20.4

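Note that the GFLOPs column reports the number of floating-point operations, in billions, needed to process one 224x224 image; it measures the computational cost of the model. It should not be confused with GFLOPS as a unit of hardware speed, which is equal to one billion floating-point operations per second.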
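
As a rough illustration of how operation counts of this kind add up, the sketch below counts the multiply-add operations of a single convolution layer. Counting one multiply-add as one operation is an assumption about the convention, and the layer shape (a 7x7 convolution with 64 filters producing a 112x112 output from a 224x224 RGB image) is the first layer described in the ResNet paper, used here only as an example.

```python
def conv_multiply_adds(out_height, out_width, kernel, in_channels, out_channels):
    """Multiply-adds for one convolution: one kernel application per output value."""
    per_output_value = kernel * kernel * in_channels
    return out_height * out_width * out_channels * per_output_value

first_layer = conv_multiply_adds(112, 112, kernel=7, in_channels=3, out_channels=64)

print(first_layer)        # 118013952
print(first_layer / 1e9)  # about 0.118 billion operations
```

Summing this quantity over every layer of a network gives totals of the kind reported in the table.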

ResNeXt Paper

Aggregated Residual Transformations for Deep Neural Networks

Authors: Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He

Abstract

We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call “cardinality” (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online.

ResNeXt Author

Facebook AI Research

Dataset

Our model was trained on the stockcrawl_v8_ontology_labeler_freebase_corewn_imagenet11_filtered_v7_gfactor_25_method2_kmeans2 dataset, which includes 11,043 classes and about 10,200,000 images. It is worth noting, however, that the number of classes actually used is 9,758, because the remaining classes were not sampled due to their scarce representation in the dataset.