Industry-leading jailbreak classification model for protection against adversarial attempts
NemoGuard JailbreakDetect was developed to detect attempts to jailbreak large language models.
This model is ready for commercial use.
Improved Large Language Model Jailbreak Detection via Pretrained Embeddings
Architecture Type: Random Forest
Network Architecture: N/A
Input Type(s): Text Embedding
Input Parameters: 768 dimensional vector
Input Format(s): Vector
Other Properties Related to Input: Must be an output from the corresponding embedding model, either nv-embedqa-e5-v5 or snowflake-arctic-m-long.
Output Type(s): Classification, Probability
Output Format: Bool, Float
Output Parameters: 1D
Other Properties Related to Output: N/A
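The input/output contract above (a 768-dimensional embedding in, a boolean label plus probability out) can be illustrated with a minimal sketch. This is not the shipped model: the random-forest classifier below is trained on synthetic stand-in embeddings purely to show the expected shapes and types.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: 200 synthetic 768-dimensional "embeddings" with random labels.
# In practice the vectors would come from nv-embedqa-e5-v5 or snowflake-arctic-m-long.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))
y = rng.integers(0, 2, size=200)

# A small random forest in place of the released classifier (hypothetical sizing).
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Classify one embedding: boolean jailbreak label plus a probability score.
emb = rng.normal(size=(1, 768))
label = bool(clf.predict(emb)[0])
prob = float(clf.predict_proba(emb)[0, 1])
```

The two return values mirror the card's output spec: `label` is the Bool classification and `prob` the Float probability, both 1D per input vector.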
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
[Preferred/Supported] Operating System(s):
NemoGuard-JailbreakDetect-v1.0: Jailbreak detection model using Snowflake-arctic-embed-m embeddings
A combination of three open datasets, merged, de-duplicated, and reviewed for data quality. Jailbreak data was augmented using garak. The datasets used are outlined below:
Link: https://github.com/thunlp/Advbench
** Data Collection Method by dataset
** Labeling Method by dataset
Properties:
520 entries, all jailbreak attempts.
Link: https://huggingface.co/datasets/allenai/wildjailbreak
** Data Collection Method by dataset
** Labeling Method by dataset
Properties:
6387 total entries: 5721 benign prompts, 666 jailbreak attempts
Link: https://huggingface.co/datasets/jackhhao/jailbreak-classification
** Data Collection Method by dataset
** Labeling Method by dataset
Properties:
1306 total entries: 640 benign prompts, 666 jailbreak attempts
A stratified 20% subset of the aggregated dataset was held out for testing.
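A stratified hold-out of this kind can be sketched with scikit-learn's `train_test_split`; the counts and class balance below are hypothetical, chosen only to show that stratification preserves the benign/jailbreak ratio in the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical aggregate: 500 embeddings, 400 benign (0) and 100 jailbreak (1).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))
y = np.array([0] * 400 + [1] * 100)

# stratify=y keeps the 80/20 benign-to-jailbreak ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)
```

With a 20% test fraction, the stratified test set here holds exactly 100 entries, 20 of them jailbreak attempts, matching the full dataset's class proportions.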
Evaluated on JailbreakHub.
| Model | F1 Score | False Positive Rate | False Negative Rate |
|---|---|---|---|
| NemoGuard JailbreakDetect | 0.9601 | 0.0042 | 0.0435 |
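The three reported metrics derive directly from a binary confusion matrix. The counts below are illustrative only (they are not the JailbreakHub evaluation counts); they just show how F1, false positive rate, and false negative rate are computed.

```python
# Hypothetical confusion-matrix counts for a jailbreak classifier:
# tp = jailbreaks caught, fp = benign prompts flagged,
# fn = jailbreaks missed, tn = benign prompts passed through.
tp, fp, fn, tn = 90, 5, 10, 895

f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and recall
fpr = fp / (fp + tn)               # fraction of benign prompts flagged
fnr = fn / (fn + tp)               # fraction of jailbreaks missed
```

For a guardrail model, the false positive rate governs how often benign users are blocked, while the false negative rate measures attacks that slip through, which is why both are reported alongside F1.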
Engine: N/A
Test Hardware:
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications.
When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.