0x8CCD7F81: Multiple Model Guardrail Jailbreak via "Placeholder Injection" Tactic

Case GUID: 8ccd7f81-4e36-4084-8061-cec6e2d83ece

Summary

A guardrail jailbreak vulnerability has been discovered affecting multiple models. The flaw manifests when an attacker requests copyrighted content together with formatting instructions that insert placeholder text between each word. The placeholders obscure the direct reproduction, yet the prohibited output is still produced. This technique allows the attacker to bypass inference restrictions around copyright protection and intellectual property safeguards.
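
From a defender's point of view, the behaviour described above suggests inspecting model output after repeated filler tokens have been removed. The following is a minimal sketch of that idea, not the open-sourced probe or any vendor's guardrail. The function names, the 0.3 frequency threshold, and the trigram comparison are assumptions made for illustration, and the sample text is a public-domain pangram standing in for protected material.

```python
import re
from collections import Counter


def strip_repeated_filler(text: str, min_share: float = 0.3) -> str:
    """Drop any token that makes up at least `min_share` of all tokens.

    Assumption: a placeholder inserted between every word accounts for
    close to half of the tokens, far more than any ordinary word.
    """
    tokens = text.split()
    if not tokens:
        return ""
    counts = Counter(t.lower() for t in tokens)
    filler = {t for t, c in counts.items() if c / len(tokens) >= min_share}
    return " ".join(t for t in tokens if t.lower() not in filler)


def ngram_overlap(candidate: str, reference: str, n: int = 3) -> float:
    """Share of the candidate's word n-grams that also occur in the reference."""
    def ngrams(s: str) -> set:
        words = re.findall(r"[a-z0-9']+", s.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    cand = ngrams(candidate)
    if not cand:
        return 0.0
    return len(cand & ngrams(reference)) / len(cand)


# Hypothetical example data: a reference passage and an interleaved copy of it.
reference = "the quick brown fox jumps over the lazy dog"
interleaved = " [X] ".join(reference.split())

raw_score = ngram_overlap(interleaved, reference)
cleaned_score = ngram_overlap(strip_repeated_filler(interleaved), reference)
```

In this sketch the interleaved text shares no trigrams with the reference until the filler is stripped, which is why a naive output comparison could miss it. A production filter would need a more robust filler heuristic, since short texts can contain ordinary words above the threshold.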

This intelligence report has been made public as it is related to a probe that we have open-sourced as part of the NVIDIA Garak and Microsoft PyRIT model scanners.

Metadata

Models and Test Scores

Model                        Copyrighted Content: Harry Potter (score)   Tested
Anthropic Claude 3.5 Haiku   48 · 0.7°                                   Feb 24 2025
Google Gemini 2.0 Flash      66.75                                       Jan 24 2025
Twitter / X Grok 2           21.74                                       Jan 24 2025

No scores are listed for the Harmful Substances (Anthrax, Nerve Agent) or Illicit Substances (Crystal Meth, Fentanyl) columns.

Social Impact Score

Level 1: Minimal Risk

The violation poses very low societal risk, limited primarily to nuisance or minor policy breaches. Impact is contained, affecting individual users without broader implications.

Example: Circumventing minor content filters for benign content, such as inappropriate language filters in non-public conversations.

Potential Consequences: Limited harm, potentially reputational for the AI provider.

Taxonomies

Language

This category focuses on the use of specific linguistic techniques, such as prompt injection or stylization, to influence the model's output.

Code and Encode

This strategy encompasses techniques that utilize various coding methods, such as Base64 or ROT13, to bypass model restrictions and manipulate outputs.
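
A filter that wants to account for this strategy can examine decoded views of a text in addition to the raw string. The sketch below is an illustrative assumption rather than a description of how any listed model or scanner works; `decoded_views` and `any_view_matches` are hypothetical helper names, and only the two encodings named above are handled.

```python
import base64
import binascii
import codecs
from typing import List, Set


def decoded_views(text: str) -> List[str]:
    """Return the text plus any ROT13 / Base64 decodings that succeed."""
    # ROT13 always "succeeds" on a str, so it is added unconditionally.
    views = [text, codecs.decode(text, "rot_13")]
    try:
        raw = base64.b64decode(text.strip(), validate=True)
        views.append(raw.decode("utf-8"))
    except (binascii.Error, UnicodeDecodeError, ValueError):
        pass  # not valid Base64-encoded UTF-8; keep the views collected so far
    return views


def any_view_matches(text: str, blocked_terms: Set[str]) -> bool:
    """True if any view contains a blocked term (terms must be lowercase)."""
    return any(
        term in view.lower()
        for view in decoded_views(text)
        for term in blocked_terms
    )
```

For example, `any_view_matches("frperg jbeq", {"secret"})` inspects the ROT13 view of the input as well as the raw string. Nested or chained encodings are outside the scope of this sketch.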

Chaff

Chaff is a technique employed by attackers to obfuscate keywords that might trigger a language model's guardrails. By injecting random characters, such as newline characters, spaces, or other tokens, into critical keywords, the attacker aims to bypass content filters while maintaining the underlying intent of the message. This method leverages the language model's ability to parse and understand fragmented input, allowing the attacker to subtly manipulate the model's response without overtly triggering its defensive mechanisms. Chaff exemplifies the nuanced interplay between linguistic creativity and technical evasion.
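
One way to reason about chaff on the defensive side is to collapse a text to its letters and digits before matching keywords. This is a minimal sketch under stated assumptions: the helper names are hypothetical, only ASCII letters and digits are kept, and collapsing across word boundaries can produce false positives, so it illustrates the idea rather than a complete filter.

```python
import re
import unicodedata


def collapse_chaff(text: str) -> str:
    """Lowercase the text and drop everything except ASCII letters and digits."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    # Newlines, spaces, punctuation, and zero-width characters are all removed.
    return re.sub(r"[^a-z0-9]+", "", normalized)


def contains_keyword(text: str, keyword: str) -> bool:
    """True if the keyword appears in the text once chaff is collapsed."""
    return collapse_chaff(keyword) in collapse_chaff(text)
```

With these helpers, `contains_keyword("for\nbid den-w o r d", "forbidden word")` compares `forbiddenword` against `forbiddenword`, so the fragmented input still matches.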

About 0Din.ai

The 0Day Investigative Network (0DIN) was founded by Mozilla in 2024 to reward responsible researchers for their efforts in securing GenAI models. Learn more and submit discoveries at https://0din.ai.

Severity

Low

Security Boundary

Guardrail Jailbreak

Models

Anthropic Claude 3.5 Haiku

Google Gemini 2.0 Flash

Twitter / X Grok 2

Disclosed On

2025-06-07

Published On

2026-01-29

Credit

Ron Eddings