OpenAI GPT-4o Guardrail Jailbreak via "Zero-Width Unicode" Tactic
Case GUID: b32c82c6-e5b2-43ac-85e2-69b8cda63337
Summary
A guardrail jailbreak vulnerability has been discovered affecting multiple LLMs. The flaw combines zero-width Unicode character encoding with precise character manipulation, allowing an attacker to bypass inference-time restrictions on providing information hazards and violating laws.
Metadata
Test Scores
Model | Test Kind | Test Score | Temperature (max 2.0)
OpenAI GPT-4o | Copyrighted Content: Harry Potter | 20.56 | N/A
Social Impact Score
The violation poses very low societal risk, limited primarily to nuisance or minor policy breaches. Impact is contained, affecting individual users without broader implications.
Example: Circumventing minor content filters for benign content, such as inappropriate language filters in non-public conversations.
Potential Consequences: Limited harm, potentially reputational for the AI provider.
Taxonomies
Language
This category focuses on the use of specific linguistic techniques, such as prompt injection or stylization, to influence the model's output.
Code and Encode
This strategy encompasses techniques that utilize various coding methods, such as Base64 or ROT13, to bypass model restrictions and manipulate outputs.
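As a rough illustration of why encoded text can slip past checks that only inspect raw input, the sketch below encodes a harmless string with Base64 and ROT13 and runs a simple substring check against each form. The helper names `encoded_forms` and `naive_filter_hits` are hypothetical and are not taken from this report or from any model provider's filtering code.

```python
import base64
import codecs


def encoded_forms(text: str) -> dict[str, str]:
    """Return Base64 and ROT13 encodings of text (illustrative helper)."""
    return {
        "base64": base64.b64encode(text.encode("utf-8")).decode("ascii"),
        "rot13": codecs.encode(text, "rot13"),
    }


def naive_filter_hits(text: str, keyword: str) -> bool:
    """Assumed stand-in for a filter that only does a case-insensitive substring match."""
    return keyword.lower() in text.lower()


if __name__ == "__main__":
    plain = "hello world"
    forms = encoded_forms(plain)
    print(forms["base64"])  # aGVsbG8gd29ybGQ=
    print(forms["rot13"])   # uryyb jbeyq
    # The keyword matches the plain text but neither encoded form.
    print(naive_filter_hits(plain, "hello"))            # True
    print(naive_filter_hits(forms["base64"], "hello"))  # False
    print(naive_filter_hits(forms["rot13"], "hello"))   # False
```

A check that decodes candidate encodings before matching would close this particular gap, at the cost of extra processing on every input.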
Unicode
This technique utilizes various Unicode characters to manipulate the model's output or bypass its safety mechanisms. By incorporating non-standard or non-rendering Unicode characters, users can alter the appearance of prompts or commands, potentially leading the model to misinterpret the input and produce responses that would typically be restricted or filtered out.
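On the defensive side, one possible mitigation is to detect or strip non-rendering characters before any content check runs. The following is a minimal sketch under stated assumptions: the report does not list which code points were used, so the `ZERO_WIDTH` set below is an assumed selection of common zero-width characters, and the function names are hypothetical.

```python
# Assumed set of zero-width code points; the report does not enumerate them.
ZERO_WIDTH = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
}


def count_zero_width(text: str) -> int:
    """Count characters in text that belong to the assumed zero-width set."""
    return sum(1 for ch in text if ch in ZERO_WIDTH)


def strip_zero_width(text: str) -> str:
    """Return text with the assumed zero-width characters removed."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


if __name__ == "__main__":
    sample = "he\u200bl\u200dlo"
    print(count_zero_width(sample))  # 2
    print(strip_zero_width(sample))  # hello
```

Unconditional stripping has a cost: the zero-width joiner is used in legitimate emoji sequences and the zero-width non-joiner in scripts such as Persian, so flagging inputs with an unusual count may be preferable to removing the characters outright.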
Stratagems
This involves clever and unorthodox tactics designed to deceive the model, often requiring an understanding of its operational mechanics to achieve desired outcomes.
Scatter Shot
This strategy involves prompting the language model to assume a specific role or persona, which can influence its responses based on the characteristics and moral codes associated with that role. Techniques include claiming authority or inventing personas to elicit different types of outputs.
Regenerate Response
The "Regenerate Response" technique involves prompting the language model to produce a new output from the same input or question. This is useful when the initial response does not meet the user's expectations or when the user wants a different perspective or variation on the topic. By regenerating the response, users can explore alternative interpretations, styles, or depths of information and iteratively home in on the most relevant content. The technique also tests the model's consistency, revealing how it handles the same prompt across repeated attempts.
About 0Din.ai
The 0Day Investigative Network (0DIN) was founded by Mozilla in 2024 to reward responsible researchers for their efforts in securing GenAI models. Learn more and submit discoveries at https://0din.ai.