May 2026 Preprint

Experimental Evaluation

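We evaluate TurnGate on the Multi-Turn Intent Dataset (MTID), measuring its ability to detect hidden malicious intent while preserving benign utility. On the MTID test split, TurnGate achieves the best overall safety-utility trade-off among prompt-based monitors, off-the-shelf guardrails, and trainable baselines: it has the highest F1 $\phi_2$ (0.699), Harm Score $\phi_2$ (0.602), and exact closure accuracy (0.414), at a Benign Score of 0.834.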

Main Offline Results

Comprehensive metrics on the MTID test split.

Method	Model	Benign Score $\uparrow$	Miss $\downarrow$	Early $\downarrow$	$\ell_1 \downarrow$	Acc. ($\phi_1$) $\uparrow$	Harm Score $\phi_2 \uparrow$	F1 $\phi_2 \uparrow$
Baselines
Vanilla LLM Monitor	Qwen3-4B	0.753	0.708	0.168	1.31	0.124	0.211	0.330
Sequential Monitor	GPT-5.2	0.648	0.412	0.296	1.02	0.292	0.428	0.516
Intention Analysis	GPT-OSS-120B	0.023	0.045	0.638	1.36	0.317	0.574	0.045
Guardrail-based baselines
Llama Guard	Llama-3-8B	0.998	0.998	0.001	0.33	0.002	0.002	0.005
Qwen Guard	Qwen3-8B	0.863	0.788	0.092	0.83	0.120	0.159	0.269
Trainable Methods
Naive-SFT	Qwen3-4B	0.930	0.738	0.098	0.76	0.163	0.224	0.361
Reweighted-SFT	Qwen3-4B	0.840	0.379	0.278	0.90	0.343	0.479	0.610
TurnGate (Ours)	Qwen3-4B	0.834	0.177	0.409	1.02	0.414	0.602	0.699
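
As a quick consistency check on the table, the TurnGate row can be recomputed from its own entries. This assumes the harmonic-mean definition of F1 $\phi_2$ given under Metric Definitions, and it assumes that Miss, Early, and Acc. ($\phi_1$) partition the harmful trajectories, which is an inference from the reported values rather than something the text states:

$$
\text{Miss} + \text{Early} + \text{Acc.} = 0.177 + 0.409 + 0.414 = 1.000
$$

$$
\text{F1 } \phi_2 = \frac{2 \cdot 0.834 \cdot 0.602}{0.834 + 0.602} = \frac{1.004}{1.436} \approx 0.699
$$

The other rows agree with their reported F1 $\phi_2$ to within about 0.001, which is consistent with rounding of the three-decimal inputs.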

Metric Definitions

Benign Score: Pass-through rate on benign traffic ($1 - \text{FPR}$). Measures utility preservation.

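Miss: Rate of harmful trajectories allowed to complete without intervention ($\eta_\pi > t^*$). Represents a safety breach.

Early: Rate of harmful trajectories blocked before the first sufficient turn ($\eta_\pi < t^*$). Represents premature intervention.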

Acc. ($\phi_1$): Exact closure accuracy. The fraction of harmful trajectories blocked exactly at the first sufficient turn ($\eta_\pi = t^*$).

Harm Score $\phi_2$: Time-sensitive safety metric; assigns $1.0$ for an exact block, $\eta_\pi/t^*$ for an early block, and $0.0$ for a miss.

F1 $\phi_2$: Overall trade-off summary; the harmonic mean of Benign Score and Harm Score $\phi_2$.
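
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code, and it makes several assumptions that the text does not state: turns are 1-indexed, a trajectory with no intervention is encoded as `eta = None`, the aggregate Harm Score is the mean of per-trajectory scores over harmful trajectories, and the Benign Score is computed separately and passed in. The names `phi2` and `summarize` are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

# One harmful trajectory: (eta, t_star).
# eta is the turn at which the monitor blocked, or None if it never did
# (an assumed encoding); t_star is the first sufficient turn.
Trajectory = Tuple[Optional[int], int]


def phi2(eta: Optional[int], t_star: int) -> float:
    """Per-trajectory Harm Score: 1.0 exact, eta/t_star early, 0.0 miss."""
    if eta is None or eta > t_star:  # miss: no block by the first sufficient turn
        return 0.0
    if eta == t_star:                # exact block
        return 1.0
    return eta / t_star              # early block, partial credit


def summarize(harmful: Sequence[Trajectory], benign_score: float) -> Dict[str, float]:
    """Aggregate Miss, Early, Acc, Harm Score and F1 over harmful trajectories."""
    n = len(harmful)
    miss = sum(1 for eta, t in harmful if eta is None or eta > t) / n
    early = sum(1 for eta, t in harmful if eta is not None and eta < t) / n
    acc = sum(1 for eta, t in harmful if eta == t) / n
    harm = sum(phi2(eta, t) for eta, t in harmful) / n
    denom = benign_score + harm
    # Harmonic mean of Benign Score and Harm Score; 0.0 if both are zero.
    f1 = 2 * benign_score * harm / denom if denom > 0 else 0.0
    return {"miss": miss, "early": early, "acc": acc, "harm": harm, "f1": f1}


if __name__ == "__main__":
    # Toy data (made up for illustration): exact, early, never blocked, late.
    toy = [(3, 3), (2, 4), (None, 5), (5, 3)]
    print(summarize(toy, benign_score=0.75))
```

On the toy data, the four trajectories are one exact block, one early block at half of $t^*$, and two misses, so Miss, Early, and Acc sum to one and the Harm Score falls between Acc and Acc plus Early.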

Cross-Domain Generalization

Zero-shot evaluation across different risk categories (Chemistry ↔ Cybersecurity).

Method	Source Domain	Target Domain	Benign Score $\uparrow$	Harm Score $\phi_2 \uparrow$	F1 $\phi_2 \uparrow$
Intention Analysis	Chemistry	Cybersecurity	0.808	0.297	0.435
TurnGate	Chemistry	Cybersecurity	0.863	0.397	0.543

Intention Analysis	Cybersecurity	Chemistry	0.894	0.329	0.481
TurnGate	Cybersecurity	Chemistry	0.739	0.474	0.578

Online Robustness

Attack Success Rate (ASR) against adaptive tree-search attackers.

Adaptive Attack Robustness

In closed-loop online interaction against strong adaptive attackers, TurnGate remains substantially more robust than the baselines. Its ASR stays below that of every baseline even as the attacker budget increases.

OOD Generalization

Evaluation at attacker iteration budget $i=5$ shows that TurnGate generalizes effectively to held-out risk categories, target models (Gemini-3.1-Pro), and different attacker pipelines.