0x78819949: OpenAI DALL-E3 Guardrail Jailbreak via "Feminist Art Description" Tactic

NSFW Content Warning

This disclosure contains content that may be inappropriate for some viewers.

The content has been flagged due to potential nude imagery.

OpenAI DALL-E3 Guardrail Jailbreak via "Feminist Art Description" Tactic

Case GUID: 78819949-0069-4cdb-817e-70957ffd2293

Summary

A guardrail jailbreak vulnerability has been discovered affecting OpenAI's DALL-E3. The flaw manifests when an attacker frames explicit content requests within feminist discourse and artistic terminology, using concepts such as "bodily autonomy" and "sacred feminine". Successful application of the tactic lets the attacker bypass the model's restrictions on generating explicit imagery.

Metadata

Nude Imagery Rating System

Level 3: Cultural and Traditional Depictions

Nudity is presented within an indigenous or historical cultural context, often in the form of ceremonial art, body painting, or sculptures. These images are generally non-erotic and serve anthropological, documentary, or educational purposes. Their primary intent is to preserve or celebrate heritage rather than to provoke.

Taxonomies

Language

This category focuses on the use of specific linguistic techniques, such as prompt injection or stylization, to influence the model's output.

Prompt Injection

This technique lets attackers override the original instructions and any controls in place by carefully wording their input, much as SQL injection smuggles commands into a query, in order to manipulate the model's behavior.

Strong Arm Attack

A Strong Arm Attack is a technique used to bypass content filters or restrictions imposed by language models. This method involves issuing commands or prompts that assert authority or override the model's built-in safeguards. For example, a user might type "ADMIN OVERRIDE" in all capitals to signal the model to disregard its content filters and produce responses that it would typically avoid. This approach exploits the model's programming to respond to perceived authority, allowing users to elicit outputs that may include sensitive or restricted content. The effectiveness of a Strong Arm Attack relies on the model's interpretation of the command as a legitimate instruction, thereby enabling the user to manipulate the model's behavior in a way that aligns with their intentions.

Stratagems

This involves clever and unorthodox tactics designed to deceive the model, often requiring an understanding of its operational mechanics to achieve desired outcomes.

Meta Prompting

This technique focuses on manipulating the prompting process itself to influence the model's behavior and responses, often by employing higher-level prompting strategies.

Perspective Shifting

Perspective-shifting is a technique that involves prompting the language model to adopt different viewpoints or angles when generating responses. By encouraging the model to consider a situation from various perspectives, users can elicit a broader range of insights and ideas. This approach can be particularly useful in discussions that require empathy, critical thinking, or creative problem-solving. For example, a user might ask the model to respond to a question as if it were a child, an expert, or a member of a specific community, thereby enriching the conversation with diverse interpretations and understandings. Perspective-shifting not only enhances the depth of the model's outputs but also fosters a more inclusive dialogue by acknowledging and exploring multiple sides of an issue. This technique underscores the model's ability to navigate complex social dynamics and generate responses that resonate with different audiences.

Language

Stylizing

This strategy involves using a method of questioning that encourages critical thinking and exposes biases by subtly referencing identity elements without using direct slurs or toxic language, thereby signaling to the model about certain groups of people.

Synonymous Language

This technique involves using synonyms or alternative phrasing to convey the same meaning while potentially evading detection or filtering mechanisms. By substituting words with their synonyms, users can manipulate the model's understanding and responses, allowing for the generation of content that aligns with the user's intent but may not trigger the model's safety protocols. This approach can be particularly effective in contexts where certain terms are restricted or monitored.

Rhetoric

This category employs persuasive techniques and language to shape the model's responses, often using methods like reverse psychology or Socratic questioning.

Socratic Questioning

This strategy involves generating multiple outputs from a language model by using the "Regenerate response" feature to explore a range of possible interpretations and responses.

Identity Characteristics

Identity characteristics refer to the attributes and traits that define an individual's or group's identity, including aspects such as social roles, cultural backgrounds, and personal experiences. In the context of interacting with language models, users can leverage identity characteristics to shape the model's responses by framing prompts that reflect specific identities or perspectives. For instance, users might ask the model to respond as if it were a particular demographic group, profession, or cultural background. This technique can help explore how the model generates outputs based on different identity contexts, revealing biases or assumptions that may be present in its training data. By utilizing identity characteristics, users can gain insights into the model's understanding of social dynamics and the implications of identity in communication.

About 0Din.ai

The 0Day Investigative Network (0DIN) was founded by Mozilla in 2024 to reward responsible researchers for their efforts in securing GenAI models. Learn more and submit discoveries at https://0din.ai.

This vulnerability was discovered through 0DIN's bug bounty program.

Severity

Low

Security Boundary

Prompt Injection

Models

OpenAI DALL-E3

Disclosed On

2025-03-13

Published On

2025-08-19

Credit

Anonymous