Overview

This workflow uses the Llama Guard model with a specified prompt template to assess the safety of LLM input prompts.

Llama Guard

Llama Guard is an LLM-based input-output safeguard model for classifying the safety of prompts and responses in Human-AI conversations. It incorporates a safety risk taxonomy and is geared towards conversational AI agent use cases, distinguishing between safety risks in user prompts (prompt classification) and in generated responses (response classification).

Llama Guard is built on top of Llama2-7b and fine-tuned on a carefully curated dataset. It performs multi-class classification over the taxonomy categories and generates binary safe/unsafe decision scores.
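
The workflow's prompt template wraps the user text in an instruction that asks Llama Guard to assess it against the taxonomy and reply with a verdict. The sketch below shows the general layout described in the Llama Guard paper; the exact wording, category codes, and category set configured in this workflow's template may differ.

# Illustrative sketch of a Llama Guard-style prompt template (assumed layout,
# not necessarily the exact template used by this workflow).
LLAMA_GUARD_TEMPLATE = """[INST] Task: Check if there is unsafe content in 'User' messages in the conversation below according to our safety policy with these categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
O1: Violence and Hate.
O2: Sexual Content.
O3: Criminal Planning.
O4: Guns and Illegal Weapons.
O5: Regulated or Controlled Substances.
O6: Self-Harm.
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>
User: {user_prompt}
<END CONVERSATION>

Provide your safety assessment for 'User' in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""

prompt = LLAMA_GUARD_TEMPLATE.format(user_prompt="I love this movie and I would watch it again and again!")
# The model answers "safe", or "unsafe" followed by the violated category codes.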

How to use the Llama Guard Prompt Moderation workflow

Using Clarifai SDK

Export your PAT as an environment variable. Then, import and initialize the API Client.

Find your PAT in your security settings.

export CLARIFAI_PAT={your personal access token}

Prediction with the workflow

from clarifai.client.workflow import Workflow

# Replace {{user_id}} with the ID of the user or organization that owns the workflow
workflow_url = 'https://clarifai.com/{{user_id}}/text-moderation/workflows/llamaGuard-prompt-moderation'

text = 'I love this movie and i would watch it again and again!'

# Run the text through the workflow as raw bytes
prediction = Workflow(workflow_url).predict_by_bytes(text.encode(), input_type="text")

# Get workflow results
print(prediction.results[0].outputs[-1].data)
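
The last output of the workflow contains Llama Guard's verdict as text. A minimal sketch of reading it, assuming Clarifai's standard text output field (data.text.raw):

# Llama Guard answers 'safe', or 'unsafe' followed by the violated category
# codes on the next line (field access assumes Clarifai's standard text output).
verdict = prediction.results[0].outputs[-1].data.text.raw
if verdict.strip().lower().startswith("safe"):
    print("Prompt is safe")
else:
    print("Prompt flagged as unsafe:", verdict)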

Using Workflow

To use the Llama Guard Prompt Moderation workflow in the Clarifai portal, enter text through the blue plus Try your own Input button. The workflow classifies the input prompt as safe or unsafe, and if it is unsafe, it predicts which category of safety risk it falls into.

Dataset Information

Llama Guard is fine-tuned on a dataset of high quality, labeled according to the safety risk taxonomy. The dataset includes examples of prompts and responses, allowing Llama Guard to learn and classify safety risks effectively.

The dataset, comprising 13,997 prompts and responses, covers categories such as:

  • Violence & Hate encompasses statements that encourage or could help people plan or engage in violence. Similarly, statements that advocate discrimination, contain slurs, or voice hateful sentiments against people based on their sensitive personal characteristics (ex: race, color, religion, national origin, sexual orientation, gender, gender identity, or disability) would also be considered inappropriate under this category.
  • Sexual Content encompasses statements encouraging someone (who could be underage) to engage in specific sex acts. Similarly, sexually explicit (i.e., erotic) statements would also be considered inappropriate under this category.
  • Guns & Illegal Weapons encompasses statements that specifically encourage, condone, or could help people plan or execute specific crimes involving the illegal acquisition, creation, or use of guns or illegal weapons (ex: explosives, biological agents, or chemical weapons), including instructions on how to create or use them.
  • Regulated or Controlled Substances encompasses statements that specifically encourage or could help people to illegally produce, transfer, or consume regulated or controlled substances like illegal drugs, tobacco, alcohol, or cannabis.
  • Suicide & Self Harm encompasses statements that encourage, condone, or enable people to harm themselves (ex: by providing instructions or information on methods of self-harm). When people express or imply an intent to harm themselves, any response other than one that provides links to appropriate health resources would be considered inappropriate (even if entirely supportive).
  • Criminal Planning encompasses miscellaneous statements that encourage, condone, or could help people plan or execute specific criminal activities, like arson, kidnapping, or theft. Items should only be considered harmful under this category when they could not be read as violating any of the other harm types above (ex: statements that encourage violence should be considered violating under Violence & Hate rather than this category).
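
When the workflow flags a prompt as unsafe, Llama Guard returns the violated category code(s) rather than the names above. A small mapping like the sketch below can translate codes into readable labels; the code-to-category assignment shown follows the default Llama Guard template and is an assumption, so verify it against the prompt template actually configured in this workflow.

# Hypothetical mapping from default Llama Guard category codes to the
# taxonomy names listed above (verify against the workflow's own template).
CATEGORY_NAMES = {
    "O1": "Violence & Hate",
    "O2": "Sexual Content",
    "O3": "Criminal Planning",
    "O4": "Guns & Illegal Weapons",
    "O5": "Regulated or Controlled Substances",
    "O6": "Suicide & Self Harm",
}

# Example verdict: first line is the decision, second line lists violated codes
verdict = "unsafe\nO3"
lines = verdict.strip().splitlines()
if lines[0] == "unsafe":
    codes = [c.strip() for c in lines[1].split(",")]
    print([CATEGORY_NAMES.get(c, c) for c in codes])  # ['Criminal Planning']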

Advantages

  • Incorporates a safety risk taxonomy for prompt classification.
  • Strong performance on existing benchmarks.
  • Customizable for different use cases through instruction fine-tuning.

Workflow Details

  • Workflow ID: llamaGuard-prompt-moderation
  • Last Updated: Apr 09, 2024
  • Privacy: PUBLIC