Image Moderation Classifier
Introduction
The Image Moderation Classifier model is designed to moderate nudity, sexually explicit material, and otherwise harmful or abusive user-generated content (UGC) imagery. It helps determine whether a given input image meets community and brand standards.
Our model uses an Inception V2 architecture and was trained on a proprietary dataset gathered by our data strategy team.
The model outputs a probability distribution among five different labels:
- Drug
- Explicit
- Gore
- Safe
- Suggestive
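For illustration, here is a minimal sketch of how a caller might act on this distribution. The output shape, the label handling, and the 0.5 threshold are assumptions for the example, not a documented API:

```python
def is_flagged(probs: dict, threshold: float = 0.5) -> bool:
    """Flag an image if any non-'Safe' label's probability exceeds the threshold."""
    return any(p > threshold for label, p in probs.items() if label != "Safe")

# Hypothetical model output for one image (probabilities sum to 1 across the five labels):
probs = {"Drug": 0.02, "Explicit": 0.85, "Gore": 0.01, "Safe": 0.07, "Suggestive": 0.05}
print(is_flagged(probs))  # True
```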
Limitations
Because our dataset relies heavily on real-life imagery, the image moderation classifier demonstrates weaker performance on harmful content in hentai/cartoon styles.
Inception V2
Inception V2 is a CNN architecture that builds on the original Inception architecture (GoogLeNet). Its purpose was to find ways to scale up GoogLeNet while utilizing the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.
Architecture
The original Inception architecture was heavily engineered to push performance in terms of both speed and accuracy while reducing computational cost. Prior CNN solutions improved performance by increasing the depth of the network, at the expense of computational cost. Very deep networks are also susceptible to overfitting and have difficulty propagating gradient updates through the whole network during backpropagation.
Inception networks rely heavily on 1x1 convolution layers, which serve as dimensionality reduction modules. This reduction allows for fewer operations and thus less computation, and using 1x1 convolutions to shrink the model also reduces the risk of overfitting.
Although it might seem counterintuitive to add a layer in order to reduce the number of operations, performing a 3x3 or 5x5 convolution on an input with fewer channels drastically reduces the operation count, greatly improving performance.
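The savings are easy to check with back-of-the-envelope multiply counts. The sketch below uses the well-known 28x28x192 GoogLeNet example with a 16-channel bottleneck; the exact shapes are illustrative:

```python
# Multiply counts for a 5x5 convolution with and without a 1x1 bottleneck.
H, W = 28, 28
c_in, c_mid, c_out = 192, 16, 32

direct = H * W * c_out * 5 * 5 * c_in            # 5x5 conv applied directly
bottleneck = (H * W * c_mid * 1 * 1 * c_in       # 1x1 reduction to 16 channels
              + H * W * c_out * 5 * 5 * c_mid)   # 5x5 conv on the reduced input

print(f"direct:     {direct:,}")      # ~120 million multiplies
print(f"bottleneck: {bottleneck:,}")  # ~12.4 million multiplies, roughly 10x fewer
```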
The design of the inception module effectively causes the network to get wider rather than deeper. The "naive" inception module performs convolution on an input with 3 different sizes of filters (1x1, 3x3, 5x5) as well as 3x3 max pooling.
Naive Inception Module
This is where the dimensionality reduction comes in: 1x1 convolution layers reduce the channel depth before the 3x3 and 5x5 convolution layers, and after the 3x3 max pooling.
Inception Module with Dimension Reductions
Inception modules can be connected to each other (linear stacking) by concatenating the outputs of the filters (both convolution and max pool), and feeding them as the input of the next inception layer.
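A minimal PyTorch sketch of such a module, with channel counts borrowed from GoogLeNet's inception (3a) block for concreteness (activations are abbreviated; this is an illustration, not the model's actual code):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception module with dimension reductions (sketch)."""

    def __init__(self, c_in, c1, c3_red, c3, c5_red, c5, c_pool):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, c1, kernel_size=1)                  # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(c_in, c3_red, 1), nn.ReLU(),
                                nn.Conv2d(c3_red, c3, 3, padding=1))  # 1x1 reduce, then 3x3
        self.b5 = nn.Sequential(nn.Conv2d(c_in, c5_red, 1), nn.ReLU(),
                                nn.Conv2d(c5_red, c5, 5, padding=2))  # 1x1 reduce, then 5x5
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(c_in, c_pool, 1))           # 3x3 pool, then 1x1

    def forward(self, x):
        # Concatenate branch outputs along the channel dimension; the result
        # feeds the next inception module when modules are stacked linearly.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

# 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels
x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192, 64, 96, 128, 16, 32, 32)(x).shape)  # [1, 256, 28, 28]
```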
By relying on these inception modules, the original inception network, GoogLeNet, was built. GoogLeNet comprises 9 inception modules stacked linearly. It is 22 layers deep (27, including the pooling layers), and it uses global average pooling after the last inception module.
A 22-layer network is a very deep classifier, and like any other very deep network, it faces the vanishing gradient problem, which prevents the early and middle layers from learning during backpropagation.
This problem was overcome by introducing two auxiliary classifiers (see the purple boxes in the figure below). These apply softmax to the outputs of two intermediate inception modules and compute an auxiliary loss over the same labels. The total loss is then a weighted sum of the auxiliary losses and the real loss; the original paper uses a weight of 0.3 for each auxiliary loss.
total_loss = real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2
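As a sketch, this objective might be implemented as below. The head names and the use of cross-entropy are assumptions consistent with the paper's setup:

```python
import torch.nn.functional as F

def googlenet_loss(main_logits, aux1_logits, aux2_logits, labels):
    """Weighted sum of the real loss and the two auxiliary losses (0.3 each)."""
    real_loss = F.cross_entropy(main_logits, labels)
    aux_loss_1 = F.cross_entropy(aux1_logits, labels)
    aux_loss_2 = F.cross_entropy(aux2_logits, labels)
    # The auxiliary heads are used only during training and are
    # discarded at inference time.
    return real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2
```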
GoogLeNet Network
Inception V2 improves upon its predecessor by replacing larger convolution kernels with sequences of smaller kernels. For instance, a 5x5 convolution filter is replaced by two sequential 3x3 convolution filters, and a 7x7 filter is replaced by three sequential 3x3 filters. This results in a 42-layer network that requires roughly 2.5x the computation of GoogLeNet (still considerably more efficient than VGG architectures) while achieving considerably better results.
Inception V2 module replacing a 5x5 convolution with two sequential 3x3 convolution filters
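The per-kernel weight counts make the savings easy to verify (this simple comparison ignores the extra non-linearities gained between the stacked 3x3 convolutions):

```python
# Weights (and multiplies per output position) per input/output channel pair.
cost_5x5 = 5 * 5              # 25
cost_two_3x3 = 2 * (3 * 3)    # 18 -> ~28% cheaper than one 5x5
cost_7x7 = 7 * 7              # 49
cost_three_3x3 = 3 * (3 * 3)  # 27 -> ~45% cheaper than one 7x7
print(cost_5x5, cost_two_3x3, cost_7x7, cost_three_3x3)
```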
Inception V2 Paper
Rethinking the Inception Architecture for Computer Vision
Abstract
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we are exploring ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and with using less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.
GoogLeNet Paper
Going Deeper with Convolutions
Authors: Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
Abstract
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
- Name: Image Moderation
- Model Type ID: Visual Classifier
- Description: Recognizes inappropriate content in images and video containing concepts: gore, drug, explicit, suggestive, and safe.
- Last Updated: Oct 25, 2024
- Privacy: PUBLIC