• Community
  • Model
  • prompt-guard-86m

prompt-guard-86m

Prompt-Guard-86M is a multilingual classifier model designed to detect and prevent prompt injection and jailbreak attacks in LLM-powered applications.

Notes

Note

Introduction

Prompt-Guard-86M is an advanced classifier designed to safeguard LLM-powered applications against prompt attacks, including prompt injections and jailbreak attempts. Prompt attacks exploit vulnerabilities in LLMs by injecting harmful inputs or overriding safety protocols, potentially leading to unintended or malicious outcomes. Prompt-Guard-86M is trained on a comprehensive corpus of attack data to detect both explicitly malicious prompts and those containing subtle injected inputs. As a multi-label model, it categorizes inputs into three distinct classes: benign, injection, and jailbreak, thereby providing a robust mechanism to mitigate risks in LLM-powered applications.

Prompt-Guard-86M Prompt Classifier

The core functionality of Prompt-Guard-86M lies in its ability to accurately classify input strings into three categories:

  1. Benign: Normal inputs that do not pose any risk.
  2. Injection: Inputs that contain hidden or out-of-place commands aimed at exploiting the model’s behavior.
  3. Jailbreak: Inputs that attempt to bypass or disable the model’s safety and security features.

These classifications allow developers to implement tailored filtering mechanisms, ensuring that their applications maintain integrity and security against a wide range of prompt attacks.

Use Cases

Prompt-Guard-86M is versatile and can be adapted to various use cases:

  • Out-of-the-Box Filtering: Deploy Prompt-Guard-86M directly to filter high-risk prompts in applications where immediate protection is necessary.
  • Threat Detection and Mitigation: Use the model to identify and prioritize suspicious inputs for further investigation, aiding in the creation of annotated datasets for model fine-tuning.
  • Fine-Tuned Application: Fine-tune Prompt-Guard-86M on specific application data to enhance precision and recall, allowing for precise filtering of malicious prompts relevant to the application’s unique context.

Evaluation and Benchmark Results

Prompt-Guard-86M has been evaluated across several datasets to assess its performance in detecting malicious prompt attacks:

MetricEvaluation Set (Jailbreaks)Evaluation Set (Injections)OOD Jailbreak SetMultilingual Jailbreak SetCyberSecEval Indirect Injections Set
True Positive Rate (TPR)99.9%99.5%97.5%91.5%71.4%
False Positive Rate (FPR)0.4%0.8%3.9%5.3%1.0%
Area Under Curve (AUC)0.9971.0000.9750.9590.966
  • Evaluation Set (In-Distribution): Near-perfect performance with TPR of 99.9% (Jailbreaks) and 99.5% (Injections), and FPR of 0.4% and 0.8%, respectively.
  • Out-of-Distribution (OOD) Jailbreak Set: High generalization capability with a TPR of 97.5% and FPR of 3.9%.
  • Multilingual Jailbreak Set: Strong performance in detecting multilingual attacks with a TPR of 91.5% and FPR of 5.3%.
  • CyberSecEval Indirect Injections Set: Demonstrates robust performance in identifying embedded instructions with a TPR of 71.4% and FPR of 1.0%.

Area Under Curve (AUC) values across these sets range from 0.959 to 1.000, indicating high efficacy in distinguishing between benign and malicious inputs.

Dataset

Prompt-Guard-86M was trained on an extensive and diverse dataset comprising examples of both benign and malicious prompts. The dataset includes inputs from various domains and languages, ensuring that the model is well-equipped to handle a wide array of prompt attack scenarios. The multilingual capabilities of the model were bolstered by training on attacks translated into eight languages: English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai.

Advantages

  • High Accuracy: Achieves near-perfect classification on in-distribution data, with strong generalization to out-of-distribution scenarios.
  • Multilingual Support: Capable of detecting attacks across multiple languages, making it suitable for global applications.
  • Versatile Usage: Can be used as an out-of-the-box solution, for threat detection, or as a fine-tuned model tailored to specific applications.

Limitations

  • Adaptive Attack Vulnerability: As an open-source model, Prompt-Guard-86M is susceptible to adversarial attacks designed to evade its classifications.
  • Application-Specific Challenges: The model may not capture all application-specific attacks out-of-the-box, necessitating fine-tuning for optimal performance in different contexts.
  • False Positives: In some scenarios, the model’s false-positive rate may be too high for applications that require precise filtering. Fine-tuning or adjusting classification thresholds may be necessary to reduce false positives
  • ID
  • Name
    prompt-guard-86m
  • Model Type ID
    Text Classifier
  • Description
    Prompt-Guard-86M is a multilingual classifier model designed to detect and prevent prompt injection and jailbreak attacks in LLM-powered applications.
  • Last Updated
    Oct 17, 2024
  • Privacy
    PUBLIC
  • Use Case
  • Toolkit
  • License
  • Share
    • Badge
      prompt-guard-86m