Security boundary

Guardrail Jailbreak

Severity: Low

A Guardrail Jailbreak is a direct attack on the model's safety mechanisms — crafting prompts that cause the model to ignore, bypass, or override its built-in content guardrails and produce harmful or restricted content. These attacks operate purely at the inference level: no external system access, tool use, or data injection is required. The model itself is the target.

Example:

A researcher sends a series of carefully crafted messages that gradually reframe the model's role, eventually convincing it to provide instructions for dangerous activities that its safety training was designed to prevent.

References:

  • MITRE ATLAS: LLM Jailbreak (AML.T0054)
  • OWASP LLM 2025: LLM01:2025 Prompt Injection
  • avid-effect:security:S0400 (model bypass)
Previous Next