Text Moderation Template
Overview
The Text Moderation Template covers several text moderation use cases and comes with ready-to-use workflows and models for each, built on a range of NLP models and LLMs.
Text Moderation
Text moderation in machine learning involves leveraging algorithms and models to automatically monitor, detect, and act upon inappropriate or harmful text content in user-generated content. This can include anything from offensive language and hate speech to spam and misinformation. Machine learning models are trained on large datasets to recognize patterns, keywords, and sentiments that indicate whether a piece of text should be flagged for review or removed.
Use Cases
Text moderation has become increasingly crucial as online platforms grow in size and number, generating vast amounts of user-generated content. The automated moderation of this content helps maintain community standards, protect users, and comply with legal regulations. Here are some key use cases of text moderation in machine learning:
- Social Media Platforms: Social media platforms employ text moderation to filter out offensive, abusive, or spammy content from user posts, comments, and messages. This ensures a safer and more positive user experience.
- E-commerce Platforms: E-commerce websites often have user reviews and comments sections where text moderation is used to filter out fake reviews, spam, or inappropriate content, ensuring that product feedback is reliable and helpful to other shoppers.
- Chat Applications: Text moderation is used in chat applications to filter out spam messages, inappropriate content, and malicious links shared between users, maintaining the integrity and safety of communication channels.
- Educational Platforms: Educational websites and online learning platforms employ text moderation to ensure that discussions among students and instructors remain respectful and conducive to learning.
Text Moderation using Text-classifier
One approach to text moderation is to use a text-classifier model. Text-classifier models are trained on labeled datasets to classify text into categories such as "spam," "hate speech," "harassment," or "safe." The model learns patterns and features from the training data and applies them to new, unseen text, helping to automatically filter out content that violates guidelines.
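As a minimal sketch of this approach, the example below sends a piece of text to a moderation text classifier through the Clarifai Python SDK and prints the predicted concepts with their scores. The model URL is a placeholder rather than the name of a specific Clarifai model; substitute the moderation classifier you want to use and supply a valid personal access token.

```python
import os

from clarifai.client.model import Model

# Placeholder URL -- substitute the text-moderation classifier you want to use.
MODEL_URL = "https://clarifai.com/clarifai/main/models/your-text-moderation-classifier"

model = Model(url=MODEL_URL, pat=os.environ["CLARIFAI_PAT"])

# Classify a piece of user-generated text.
prediction = model.predict_by_bytes(
    b"You are a terrible person and everyone hates you.",
    input_type="text",
)

# Each concept (e.g. "toxic", "insult") is returned with a probability score.
for concept in prediction.outputs[0].data.concepts:
    print(f"{concept.name}: {concept.value:.3f}")
```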
Text Moderation Models
There are multiple text moderation classifier models available on Clarifai, each targeting a different use case. Let's look at each of them:
Multilingual Text Moderation Classifier
The multilingual text moderation model can analyze text in multiple languages and detect harmful content. It is especially useful for screening third-party content that may cause problems.
This model returns a list of concepts along with their corresponding probability scores indicating the likelihood that these concepts are present in the text. The list of concepts includes:
- toxic
- insult
- obscene
- identity_hate
- severe_toxic
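One way to act on these scores is to compare each concept's probability against a per-concept threshold and flag the text if any threshold is exceeded. Continuing from the classifier sketch above, the snippet below is illustrative only; the threshold values are assumptions and should be tuned for your own content and tolerance for false positives.

```python
# Illustrative per-concept thresholds -- tune these for your own platform.
FLAG_THRESHOLDS = {
    "toxic": 0.50,
    "insult": 0.50,
    "obscene": 0.50,
    "identity_hate": 0.30,
    "severe_toxic": 0.30,
}

def should_flag(concepts) -> bool:
    """Return True if any moderation concept exceeds its threshold."""
    return any(
        concept.value >= FLAG_THRESHOLDS.get(concept.name, 1.0)
        for concept in concepts
    )

# `prediction` is the response from the classifier sketch above.
if should_flag(prediction.outputs[0].data.concepts):
    print("Flag for human review")
else:
    print("Looks safe")
```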
Multilingual Text Moderation Workflow
Multilingual moderation Classifier Workflow: This workflow wraps the multilingual text moderation model and classifies text as toxic, insult, obscene, identity_hate, or severe_toxic.
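A minimal sketch of running such a workflow with the Clarifai Python SDK is shown below; the workflow URL is a placeholder for wherever the multilingual moderation workflow lives in your app or in the Clarifai Community.

```python
import os

from clarifai.client.workflow import Workflow

# Placeholder URL -- point this at the multilingual moderation workflow in your app.
WORKFLOW_URL = "https://clarifai.com/your-user-id/your-app-id/workflows/multilingual-moderation"

workflow = Workflow(url=WORKFLOW_URL, pat=os.environ["CLARIFAI_PAT"])

response = workflow.predict_by_bytes(
    "Ce commentaire est vraiment insultant.".encode("utf-8"),
    input_type="text",
)

# The classifier is the last node in the workflow; read its predicted concepts.
for concept in response.results[0].outputs[-1].data.concepts:
    print(f"{concept.name}: {concept.value:.3f}")
```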
English Text Moderation Classifier
The English text moderation classifier can analyze English text and detect harmful content. It is especially useful for screening third-party content that may cause problems.
This model returns a list of concepts along with their corresponding probability scores indicating the likelihood that these concepts are present in the text. The list of concepts includes: toxic, severe_toxic, obscene, threat, insult, identity_hate.
English Text Moderation Workflow
English text moderation Classifier Workflow: This workflow wraps the English text moderation model and classifies English text as toxic, insult, obscene, identity_hate, or severe_toxic.
Text Moderation using LLMs
Large Language Models (LLMs), such as GPT (Generative Pretrained Transformer) variants, have revolutionized the field of natural language processing, including the domain of text moderation.
Using LLMs for text moderation leverages their deep understanding of language nuances, context, and even the subtleties of human communication to identify and filter out inappropriate or harmful content from digital platforms. Here's an overview of how LLMs can be applied to text moderation.
How LLMs Are Used in Text Moderation
- Content Analysis: LLMs can understand and generate human-like text, making them adept at analyzing user-generated content for potentially harmful or inappropriate material. They can evaluate the context, sentiment, and subtle nuances of language to identify content that might violate specific guidelines.
- Contextual Understanding: Unlike simpler models, LLMs are better at grasping the context in which words or phrases are used. This capability is crucial for distinguishing between harmful content and benign content that might contain similar words.
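As a rough illustration of prompt-based moderation (and not the exact prompt templates used by the workflows listed in the next section), the sketch below wraps user text in a moderation instruction, sends it to an LLM hosted on Clarifai, and reads back a single verdict. The model URL and prompt wording are assumptions.

```python
import os

from clarifai.client.model import Model

# Placeholder URL -- any instruction-following LLM hosted on Clarifai could be used here.
LLM_URL = "https://clarifai.com/mistralai/completion/models/mistral-7B-Instruct"

MODERATION_PROMPT = (
    "You are a content moderator. Identify any hate speech, violent language, "
    "or explicit content in the text below. Respond with 'Inappropriate' if such "
    "content is present and 'Appropriate' otherwise.\n\nText: {text}"
)

llm = Model(url=LLM_URL, pat=os.environ["CLARIFAI_PAT"])

user_text = "I can't believe how rude the support team was today."
response = llm.predict_by_bytes(
    MODERATION_PROMPT.format(text=user_text).encode("utf-8"),
    input_type="text",
)

# LLM responses come back as raw text rather than concept scores.
print(response.outputs[0].data.text.raw)
```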
LLM Text Moderation Workflows
This template includes multiple workflows built on different LLMs, each optimized for a specific text moderation task:
- text-moderation-mistral-7b: Uses the Mistral-7b model with a prompt template that identifies hate speech, violent language, and explicit content, responding with 'Inappropriate' if such content is present and 'Appropriate' otherwise.
- text-moderation-toxicity-mistral-7b: Uses the Mistral-7b model with a prompt template for toxicity moderation, where 'Toxic' sentiment includes aggression, hostility, or undue negativity; it classifies the text as 'Toxic', 'Suspicious', or 'Safe' based on tone and content.
- text-moderation-hate-speech-gemma-7b: Uses the Gemma-7b-it model with a prompt template for hate speech moderation, identifying and filtering out content that targets people based on race, ethnicity, gender, sexual orientation, religion, or disability.
- text-moderation-misinformation-dbrx: Uses the DBRX model with a prompt template for misinformation moderation, identifying misinformation or unsubstantiated claims (especially related to health, science, or news events) and responding with 'Potential Misinformation' if the content seems questionable or 'Likely Reliable' if it appears credible.
LLM Moderation
It is also important to ensure the safety of both the input prompts sent to LLMs and the responses they generate. Clarifai provides the Llama Guard model to help achieve this goal.
Llama Guard
Llama Guard is an LLM-based input-output safeguard model for classifying the safety of prompts and responses in Human-AI conversations. Built on top of Llama2-7b, it incorporates a safety risk taxonomy and distinguishes between safety risks in user prompts (prompt classification) and generated responses (response classification), making it well suited to conversational AI agent use cases.
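For a sense of how this works in practice, the sketch below shows the general shape of a Llama Guard prompt and response: the prompt embeds a category taxonomy plus the conversation to assess, and the model replies with 'safe' or 'unsafe' along with the violated category codes. The categories and wording here are abridged and illustrative, not the exact template used by the workflow below.

```python
# Illustrative Llama Guard prompt structure (abridged taxonomy; not the exact template).
LLAMA_GUARD_PROMPT = """[INST] Task: Check if there is unsafe content in 'User' messages in the
conversation below according to our safety policy with the categories listed.

<BEGIN UNSAFE CONTENT CATEGORIES>
O1: Violence and Hate.
O2: Sexual Content.
O3: Criminal Planning.
O4: Guns and Illegal Weapons.
O5: Regulated or Controlled Substances.
O6: Self-Harm.
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>
User: {user_message}
<END CONVERSATION>

Provide your safety assessment for the conversation above:
- The first line must read 'safe' or 'unsafe'.
- If unsafe, a second line must list the violated categories. [/INST]"""

# Expected responses look like:
#   "safe"
# or, for a violating prompt:
#   "unsafe\nO1"
```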
Llama Guard Workflows
- llamaGuard-prompt-moderation: Uses the Llama Guard model with a prompt template that classifies the safety of input prompts before they reach an LLM.
Use Cases
As a language model that performs multi-class classification and generates binary decision scores, Llama Guard can be used for various Human-AI conversation use cases, such as classifying safety risks in prompts and responses. For example:
- Prompt Classification: Identifying safety risks in user prompts, such as hate speech, violence, or discrimination.
- Response Classification: Evaluating AI-generated responses for potential safety risks, ensuring appropriate and safe interactions.