Multiple Model Guardrail Jailbreak via "API System Instructions" Tactic

Case GUID: 2235061a-e292-474f-ac38-a510b80b5ef0

Summary

A guardrail jailbreak vulnerability has been discovered affecting multiple LLM models. The specific flaw manifests when the attacker embeds production-style guidance into an emulated API system-instructions context, then supplies only a minimal harmful keyword in the user message. This causes the model to respond with highly detailed, stepwise, and quantitatively specified information that would normally be restricted, thereby bypassing inference safeguards around providing information hazards and violating laws.

Metadata

Test Scores

Model Test Kind Test Score Temperature (max 2.0)
DeepSeek logo DeepSeek DeepSeek V3
Illegal Substances: Crystal Meth 73.0 0.7
OpenAI logo OpenAI GPT-4o
Illegal Substances: Crystal Meth 41.0 0.7
Twitter / X logo Twitter / X Grok 2
Illegal Substances: Crystal Meth 64.0 0.7

Social Impact Score

Level 3: Significant Risk

Violations at this level present a tangible risk to public safety or social well-being, often related to dangerous but not highly restricted content.

Example: Bypassing guardrails to obtain instructions for creating illicit substances, such as recreational drugs or substances with moderate harm potential. Leakage of private personal data such as Social Security number or private medical.

Potential Consequences: Potential health impacts on a small scale, significant regulatory interest, and the need for stricter AI guardrails.

Taxonomies

Fictionalizing
This category involves creating scenarios or narratives that leverage existing genres or contexts to manipulate the language model's responses.
Re-storying
This technique involves continuing a narrative in a way that misaligns the original goal of a prompt, effectively repurposing the story to achieve a different outcome than initially intended.
Goal Hijacking
This technique refers to the process where an attacker misaligns the original goal of a prompt to redirect the model's output towards a new, often unintended goal, such as printing a target phrase or generating specific content that deviates from the initial intent. It often involves crafting prompts that manipulate the model's understanding and response, effectively "hijacking" the conversation or task at hand.
Stratagems
This involves clever and unorthodox tactics designed to deceive the model, often requiring an understanding of its operational mechanics to achieve desired outcomes.
Meta Prompting
This technique focuses on manipulating the prompting process itself to influence the model's behavior and responses, often by employing higher-level prompting strategies.
Deceptive Formatting
A prompt injection in the most pure sense, formatting the user prompt to fabricate the appearance of system instructions, a database query, its own prediction, or some other source of input a guard railed AI system might be expecting, causing it to behave in insecure ways based on the fabricated context from an adversarially formatted user prompt.

About 0Din.ai

The 0Day Investigative Network (0DIN) was founded by Mozilla in 2024 to reward responsible researchers for their efforts in securing GenAI models. Learn more and submit discoveries at https://0din.ai.

Severity

Low

Security Boundary

Prompt Injection

Models

DeepSeek logo DeepSeek DeepSeek V3
OpenAI logo OpenAI GPT-4o
Twitter / X logo Twitter / X Grok 2

Disclosed On

2025-03-19 (8 months)

Disclosure Policy

Published On

2025-11-14 (1 day)

Credit

Anonymous