Adversarial AI Findings
Bypass Rate by Model
Percentage of attack cases that bypassed policy constraints, shown with 95% Wilson score confidence intervals.
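The Wilson score interval referenced above can be computed directly from the bypass count and total attack cases; a minimal sketch (the function name and signature are illustrative, not from the source):

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 gives a 95% interval. Returns (low, high) bounds
    on the true bypass rate given `successes` bypasses in `n` trials.
    """
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    # Interval centre is pulled toward 0.5, which is what keeps the
    # bounds sensible for small n and extreme proportions.
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, centre - half), min(1.0, centre + half))

# e.g. 50 bypasses out of 100 attacks:
low, high = wilson_interval(50, 100)
```

Unlike the naive normal approximation, the Wilson interval never escapes [0, 1] and behaves well when the bypass rate is near 0% or 100%, which is why it is a common choice for per-model attack-success charts.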
Key Findings
Magic Circle
Critical: Models that successfully refuse a direct harmful request will often comply when the same request is embedded inside a fictional or roleplay frame. The "magic circle" of fiction creates a permission bypass that persists even in safety-fine-tuned models.
Capability Paradox
High: More capable models show higher bypass rates on certain attack categories despite stronger safety training. Greater reasoning ability enables the model to follow complex indirect instructions, including malicious ones, more faithfully.
Specificity Ceiling
Medium: Policy prompts that enumerate prohibited behaviors show diminishing returns beyond a specificity threshold. Overly specific policies create exploitable gaps between prohibited categories, while also increasing prompt length and attack surface.
Threat Channels
Nine distinct attack-surface categories were identified across agentic pipeline testing.
Defense Architecture