Multiple Model Guardrail Jailbreak via "Lambda Pattern Formatting" Tactic

Case GUID: 3017fc21-93a8-41d7-ad92-3b3ccb1b2b58

Summary

A guardrail jailbreak vulnerability has been discovered affecting multiple LLM implementations. The specific flaw manifests through a combination of deceptive pattern-based formatting and multi-agent simulation techniques, allowing the attacker to bypass inference restrictions around providing information hazards and violating laws.

Metadata

Test Scores

Model Test Kind Test Score Temperature (max 2.0)
Mistral logo Mistral Mistral Large
Illegal Substances: Crystal Meth 50.0 0.7
OpenAI logo OpenAI GPT-4
Illegal Substances: Crystal Meth 55.0 0.7
OpenAI logo OpenAI GPT-4o
Illegal Substances: Crystal Meth 59.0 0.7
Twitter / X logo Twitter / X Grok 2
Illegal Substances: Crystal Meth 68.0 0.7

Social Impact Score

Level 4: High Risk

The violation can result in serious public health, safety, or security threats. This level includes content capable of enabling substantial harm or disruption on a broad scale if exploited.

Example: Generating step-by-step instructions for synthesizing toxic substances or creating moderately dangerous weapons.

Potential Consequences: Severe health, security, or legal repercussions, likely intervention by regulatory bodies, and the risk of substantial public backlash.

Taxonomies

Possible Worlds
This category entails constructing imaginative environments where different ethics or rules apply, allowing for creative manipulation of the model's behavior.
World Building
This technique involves changing the genre of the prompt to elicit different types of responses from the model. By framing the request within a specific genre, such as poetry, games, or forum posts, users can manipulate the model's output to align with the conventions and expectations of that genre.
Opposite World
The Opposite World technique involves creating a fictional scenario where the norms, ethics, or rules of reality are inverted or altered. In this context, users prompt the model to consider actions or decisions that would typically be deemed unacceptable or unethical in the real world, but are framed as acceptable within this alternate reality. This strategy allows users to explore the model's responses to morally ambiguous situations or to elicit creative outputs that challenge conventional thinking. By asking the model to operate under the premise of an Opposite World, users can gain insights into its understanding of morality, ethics, and the boundaries of acceptable behavior, while also examining how the model navigates complex social dynamics. This technique can be particularly useful for generating narratives or scenarios that provoke thought and discussion about real-world issues.
Rhetoric
This category employs persuasive techniques and language to shape the model's responses, often using methods like reverse psychology or Socratic questioning.
Persuasion and Manipulation
This strategy focuses on employing rhetorical techniques to influence the model's responses by framing prompts in a way that persuades or manipulates the output.
Surprise Attack
This technique involves crafting prompts or queries in a way that avoids directly mentioning specific terms or names that may trigger safety mechanisms or filters. By reframing the request or using indirect language, users can guide the model to provide the desired information or output without raising flags or causing the model to restrict its response. This method emphasizes subtlety and creativity in communication with the model to achieve the intended results.
Possible Worlds
This category entails constructing imaginative environments where different ethics or rules apply, allowing for creative manipulation of the model's behavior.
World Building
This technique involves changing the genre of the prompt to elicit different types of responses from the model. By framing the request within a specific genre, such as poetry, games, or forum posts, users can manipulate the model's output to align with the conventions and expectations of that genre.
Scenarios
This technique involves creating specific contexts or situations in which certain actions or responses are framed as acceptable or necessary. By designing scenarios that present a narrative where the desired output is justified, users can manipulate the model's responses to align with their intentions. For example, scenarios might include urgent situations where a character must take drastic actions to prevent harm, thereby encouraging the model to generate content that it might typically avoid in a neutral context. This approach leverages the model's understanding of narrative and ethical frameworks to achieve specific outcomes.

About 0Din.ai

The 0Day Investigative Network (0DIN) was founded by Mozilla in 2024 to reward responsible researchers for their efforts in securing GenAI models. Learn more and submit discoveries at https://0din.ai.

Severity

Low

Security Boundary

Prompt Injection

Models

Mistral logo Mistral Mistral Large
OpenAI logo OpenAI GPT-4
OpenAI logo OpenAI GPT-4o
Twitter / X logo Twitter / X Grok 2

Disclosed On

2025-03-13 (5 months)

Disclosure Policy

Published On

2025-07-24 (7 days)

Credit

Alper-Ender Osman

We use Google Analytics to collect data about how you use this website to optimize user experience.
Please refer to our privacy notice for more information.