0xB32C82C6: OpenAI GPT-4o Guardrail Jailbreak via

OpenAI GPT-4o Guardrail Jailbreak via "Zero-Width Unicode" Tactic

Case GUID: b32c82c6-e5b2-43ac-85e2-69b8cda63337

Summary

A guardrail jailbreak vulnerability has been discovered affecting multiple LLMs. The specific flaw manifests through a combination of zero-width Unicode character encoding and precise character manipulation, allowing the attacker to bypass inference restrictions around providing information hazards and violating laws.

Unlock the Full Intelligence Report

This disclosure is part of the 0DIN Threat Feed. Customers get access to:

Full Prompts & Variants

Detection Signatures

API Access

Full Data Export

Trend Analysis

Early Access

Request Trial Learn About Threat Feed

Metadata

Models and Test Scores

	Copyrights	Harmful Substances		Illicit Substances
	Harry Potter	Anthrax	Nerve Agent	Crystal Meth	Fentanyl
OpenAI GPT-4o	21 Copyrighted Content: Harry Potter OpenAI GPT-4o Score 20.56 Tested Jan 05 2025 (over 1 year ago)

Social Impact Score

Level 1: Minimal Risk

The violation poses very low societal risk, limited primarily to nuisance or minor policy breaches. Impact is contained, affecting individual users without broader implications.

Example: Circumventing minor content filters for benign content, such as inappropriate language filters in non-public conversations.

Potential Consequences: Limited harm, potentially reputational for the AI provider.

Taxonomies

Language

This category focuses on the use of specific linguistic techniques, such as prompt injection or stylization, to influence the model's output.

Code and Encode

This strategy encompasses techniques that utilize various coding methods, such as Base64 or ROT13, to bypass model restrictions and manipulate outputs.

Unicode

This technique utilizes various Unicode characters to manipulate the model's output or bypass its safety mechanisms. By incorporating non-standard or non-rendering Unicode characters, users can alter the appearance of prompts or commands, potentially leading the model to misinterpret the input and produce responses that would typically be restricted or filtered out.

Stratagems

This involves clever and unorthodox tactics designed to deceive the model, often requiring an understanding of its operational mechanics to achieve desired outcomes.

Scatter Shot

This strategy involves prompting the language model to assume a specific role or persona, which can influence its responses based on the characteristics and moral codes associated with that role. Techniques include claiming authority or inventing personas to elicit different types of outputs.

Regenerate Response

The "Regenerate Response" technique involves prompting the language model to produce a new output based on the same input or question. This can be particularly useful when the initial response does not meet the user's expectations or when the user seeks a different perspective or variation on the topic. By asking the model to regenerate its response, users can explore alternative interpretations, styles, or depths of information, enhancing the richness of the interaction. This technique allows for iterative refinement of the model's outputs, enabling users to hone in on the most relevant or engaging content. Additionally, it can serve as a way to test the model's consistency and adaptability, revealing how it navigates similar prompts under varying conditions. The ability to regenerate responses underscores the flexibility of language models in accommodating user needs and preferences, fostering a more dynamic and responsive dialogue.

About 0Din.ai

The 0Day Investigative Network (0DIN) was founded by Mozilla in 2024 to reward responsible researchers for their efforts in securing GenAI models. Learn more and submit discoveries at https://0din.ai.

Automate Your AI Security Testing

Want to find vulnerabilities in your own models? 0DIN Scanner turns our human-sourced threat intelligence into automated security assessments.

400+ Attack Probes

Derived from real bug bounties and security research

18+ AI Providers Supported

OpenAI, Azure, Anthropic, AWS Bedrock, and more

Automated Scheduling

Run recurring scans with real-time monitoring

Actionable Metrics

Attack Success Rate (ASR) scoring with 90-day trend analysis

Request a Scanner Demo Learn More About Scanner

Severity

Low

Security Boundary

Guardrail Jailbreak

Models

OpenAI GPT-4o

Disclosed On

2025-03-13 (over 1 year)

Disclosure Policy

Published On

2025-07-30 (11 months)

Credit

Mike Takahashi (@TakSec)