Multiple Model Guardrail Jailbreak via Memory Dump Simulation Tactic

Case GUID: 1a2aa0fd-0b2b-42fb-b392-5ccdd7ac1f7d

Summary

A guardrail jailbreak vulnerability has been discovered affecting multiple models. The specific flaw manifests through a combination of terminal environment simulation and malware analysis framing, where the attacker requests extraction of strings from a purported memory dump. By embedding the request within a technical security analysis context and using the strings command output format, the technique allows the attacker to bypass inference restrictions around providing information hazards and violating laws.

Metadata

Test Scores

Model Test Kind Test Score Temperature (max 2.0)
DeepSeek logo DeepSeek DeepSeek V3
Illegal Substances: Crystal Meth 68.18 N/A
Google logo Google Gemini 2.0 Flash
Illegal Substances: Crystal Meth 68.18 N/A
OpenAI logo OpenAI GPT-4
Illegal Substances: Crystal Meth 54.55 N/A
OpenAI logo OpenAI GPT-4o
Illegal Substances: Crystal Meth 59.09 N/A
OpenAI logo OpenAI GPT-4o mini
Illegal Substances: Crystal Meth 68.18 N/A

Social Impact Score

Level 3: Significant Risk

Violations at this level present a tangible risk to public safety or social well-being, often related to dangerous but not highly restricted content.

Example: Bypassing guardrails to obtain instructions for creating illicit substances, such as recreational drugs or substances with moderate harm potential. Leakage of private personal data such as Social Security number or private medical.

Potential Consequences: Potential health impacts on a small scale, significant regulatory interest, and the need for stricter AI guardrails.

Taxonomies

Possible Worlds
This category entails constructing imaginative environments where different ethics or rules apply, allowing for creative manipulation of the model's behavior.
Emulations
This strategy involves mimicking the behavior of other systems or models to test the robustness and responses of a language model under different simulated conditions.
Unreal Computing
This technique allows an attacker to create or imagine an environment where different ethics or physics apply, enabling them to manipulate the model's responses by suggesting scenarios that would not be possible in the real world. It leverages the concept of "Unreal Computing," where the limitations of actual computing do not apply, allowing for creative and unrestricted interactions with the model.
Stratagems
This involves clever and unorthodox tactics designed to deceive the model, often requiring an understanding of its operational mechanics to achieve desired outcomes.
Meta Prompting
This technique focuses on manipulating the prompting process itself to influence the model's behavior and responses, often by employing higher-level prompting strategies.
Deceptive Formatting
A prompt injection in the most pure sense, formatting the user prompt to fabricate the appearance of system instructions, a database query, its own prediction, or some other source of input a guard railed AI system might be expecting, causing it to behave in insecure ways based on the fabricated context from an adversarially formatted user prompt.

About 0Din.ai

The 0Day Investigative Network (0DIN) was founded by Mozilla in 2024 to reward responsible researchers for their efforts in securing GenAI models. Learn more and submit discoveries at https://0din.ai.

Severity

Low

Security Boundary

Prompt Injection

Models

DeepSeek logo DeepSeek DeepSeek V3
Google logo Google Gemini 2.0 Flash
OpenAI logo OpenAI GPT-4
OpenAI logo OpenAI GPT-4o
OpenAI logo OpenAI GPT-4o mini

Disclosed On

2025-03-13 (6 months)

Disclosure Policy

Published On

2025-09-17 (about 5 hours)

Credit

Anonymous