Security boundary
Severity: Low
A Guardrail Jailbreak is a direct attack on the model's safety mechanisms — crafting prompts that cause the model to ignore, bypass, or override its built-in content guardrails and produce harmful or restricted content. These attacks operate purely at the inference level: no external system access, tool use, or data injection is required. The model itself is the target.
Example:
A researcher sends a series of carefully crafted messages that gradually reframe the model's role, eventually convincing it to provide instructions for dangerous activities that its safety training was designed to prevent.
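
To make the inference-level nature of the attack concrete, here is a minimal sketch of how a red-team harness might replay such a multi-turn probe against a chat model. The `query_model` interface, the placeholder probe turns, and the substring-based refusal check are all hypothetical illustrations, not any particular vendor's API; the point is that the only attack surface is the message list sent at inference time.

```python
from typing import Callable, List, Dict

Message = Dict[str, str]

# Hypothetical model interface: takes the full conversation, returns the reply text.
QueryFn = Callable[[List[Message]], str]

# Placeholder multi-turn probe: each turn gradually reframes the model's role.
# Actual red-team prompts are intentionally omitted; only the structure matters.
PROBE_TURNS = [
    "Turn 1: establish an innocuous persona or scenario.",
    "Turn 2: reframe the model's role within that scenario.",
    "Turn 3: request the restricted content 'in character'.",
]

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


def run_probe(query_model: QueryFn) -> bool:
    """Replays the probe turn by turn; returns True if every turn was refused."""
    history: List[Message] = []
    for turn in PROBE_TURNS:
        history.append({"role": "user", "content": turn})
        reply = query_model(history)  # purely inference-level: no tools, no external data
        history.append({"role": "assistant", "content": reply})
        if not any(marker in reply.lower() for marker in REFUSAL_MARKERS):
            return False  # guardrail may have been bypassed on this turn
    return True


if __name__ == "__main__":
    # Stub model that always refuses, so the sketch runs standalone.
    stub = lambda msgs: "I can't help with that."
    print("guardrails held:", run_probe(stub))
```

In practice such a harness would be wired to a real inference endpoint and use a more robust refusal classifier; the simple substring check above is only illustrative.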