NSFW Content Warning

This disclosure contains content that may be inappropriate for some viewers.

The content has been flagged due to potential nude imagery.

NSFW
OpenAI DALL-E3 Guardrail Jailbreak via "Feminist Art Description" Tactic

Case GUID: 78819949-0069-4cdb-817e-70957ffd2293

Summary

A guardrail jailbreak vulnerability has been discovered affecting DALL-E3 from OpenAI. The flaw manifests when the attacker frames requests for explicit content within feminist discourse and artistic terminology, using concepts such as "bodily autonomy" and the "sacred feminine". Successful application of the tactic allows the attacker to bypass restrictions and generate explicit imagery that would otherwise be blocked.

Metadata

Nude Imagery Rating System

Level 3: Cultural and Traditional Depictions

Nudity is presented within an indigenous or historical cultural context, often in the form of ceremonial art, body painting, or sculptures. These images are generally non-erotic and serve anthropological, documentary, or educational purposes. Their primary intent is to preserve or celebrate heritage rather than to provoke.

Taxonomies

Language
This category focuses on the use of specific linguistic techniques, such as prompt injection or stylization, to influence the model's output.
Prompt Injection
This technique enables attackers to override the model's original instructions and controls through specifically crafted wording, in a manner loosely analogous to SQL injection, in order to manipulate the model's behavior.
Strong Arm Attack
A Strong Arm Attack is a technique used to bypass content filters or restrictions imposed by language models. This method involves issuing commands or prompts that assert authority or override the model's built-in safeguards. For example, a user might type "ADMIN OVERRIDE" in all capitals to signal the model to disregard its content filters and produce responses that it would typically avoid. This approach exploits the model's programming to respond to perceived authority, allowing users to elicit outputs that may include sensitive or restricted content. The effectiveness of a Strong Arm Attack relies on the model's interpretation of the command as a legitimate instruction, thereby enabling the user to manipulate the model's behavior in a way that aligns with their intentions.
Stratagems
This involves clever and unorthodox tactics designed to deceive the model, often requiring an understanding of its operational mechanics to achieve desired outcomes.
Meta Prompting
This technique focuses on manipulating the prompting process itself to influence the model's behavior and responses, often by employing higher-level prompting strategies.
Perspective Shifting
Perspective-shifting is a technique that involves prompting the language model to adopt different viewpoints or angles when generating responses. By encouraging the model to consider a situation from various perspectives, users can elicit a broader range of insights and ideas. This approach can be particularly useful in discussions that require empathy, critical thinking, or creative problem-solving. For example, a user might ask the model to respond to a question as if it were a child, an expert, or a member of a specific community, thereby enriching the conversation with diverse interpretations and understandings. Perspective-shifting not only enhances the depth of the model's outputs but also fosters a more inclusive dialogue by acknowledging and exploring multiple sides of an issue. This technique underscores the model's ability to navigate complex social dynamics and generate responses that resonate with different audiences.
Language
This category focuses on the use of specific linguistic techniques, such as prompt injection or stylization, to influence the model's output.
Stylizing
This strategy involves using a method of questioning that encourages critical thinking and exposes biases by subtly referencing identity elements without using direct slurs or toxic language, thereby signaling to the model about certain groups of people.
Synonymous Language
This technique involves using synonyms or alternative phrasing to convey the same meaning while potentially evading detection or filtering mechanisms. By substituting words with their synonyms, users can manipulate the model's understanding and responses, allowing for the generation of content that aligns with the user's intent but may not trigger the model's safety protocols. This approach can be particularly effective in contexts where certain terms are restricted or monitored.
Rhetoric
This category employs persuasive techniques and language to shape the model's responses, often using methods like reverse psychology or Socratic questioning.
Socratic Questioning
This strategy involves generating multiple outputs from a language model by using the "Regenerate response" feature to explore a range of possible interpretations and responses.
Identity Characteristics
Identity characteristics refer to the attributes and traits that define an individual's or group's identity, including aspects such as social roles, cultural backgrounds, and personal experiences. In the context of interacting with language models, users can leverage identity characteristics to shape the model's responses by framing prompts that reflect specific identities or perspectives. For instance, users might ask the model to respond as if it were a particular demographic group, profession, or cultural background. This technique can help explore how the model generates outputs based on different identity contexts, revealing biases or assumptions that may be present in its training data. By utilizing identity characteristics, users can gain insights into the model's understanding of social dynamics and the implications of identity in communication.

About 0Din.ai

The 0Day Investigative Network (0DIN) was founded by Mozilla in 2024 to reward responsible researchers for their efforts in securing GenAI models. Learn more and submit discoveries at https://0din.ai.

Severity

Low

Security Boundary

Prompt Injection

Models

OpenAI DALL-E3

Disclosed On

2025-03-13

Published On

2025-08-19

Credit

Anonymous
