Joe Krisciunas

Research

Adversarial AI Findings


246
Attack Cases
498
Tests Run
9
Threat Channels

Bypass Rate by Model


Percentage of attack cases that successfully bypassed policy constraints. Wilson confidence intervals at 95%.

GPT-5-mini: 64.9% bypass (95% CI 53.5–74.8%)  |  GPT-4o: 45.9% bypass (95% CI 35.1–57.2%)

Key Findings


Magic Circle

Critical

Models that successfully refuse a direct harmful request will often comply when the same request is embedded inside a fictional or roleplay frame. The "magic circle" of fiction creates a permission bypass that persists even in safety-fine-tuned models.

Fictional framing must be treated as an attack surface, not a safe zone.

Capability Paradox

High

More capable models show higher bypass rates on certain attack categories despite stronger safety training. Greater reasoning ability enables the model to follow complex indirect instructions — including malicious ones — more faithfully.

Scaling alone does not reduce attack surface; architectural mitigations are required.

Specificity Ceiling

Medium

Policy prompts that enumerate prohibited behaviors show diminishing returns beyond a specificity threshold. Overly specific policies create exploitable gaps between prohibited categories, while also increasing prompt length and attack surface.

Policy design must balance specificity with principle-level constraints.

Threat Channels


9 distinct attack surface categories identified across agentic pipeline testing.

Instruction InjectionData Retrieval PoisoningTool Call HijackingMemory PoisoningProtocol / MCP AbuseMulti-Agent EscalationSchema DeceptionSupply Chain PoisoningConfused Deputy

Defense Architecture


CapabilityOS — 4-Layer Policy Stack
L4
Tool Allowlist
Enumerated tools only — deny by default
Tool call hijacking, unauthorized capability invocation
L3
Approval Gates
Human-in-the-loop for irreversible actions
Supply chain attacks, multi-agent escalation
L2
Egress Allowlist
Scoped network/data egress, no open exfil
Data exfiltration, confused deputy, memory poisoning
L1
Path Restrictions
Filesystem and resource access bounded at deploy
Data retrieval poisoning, schema deception