May 2026 Preprint

Experimental Evaluation

We evaluate TurnGate on the Multi-Turn Intent Dataset (MTID), measuring its ability to detect hidden malicious intent while preserving benign utility. TurnGate achieves the best overall safety-utility trade-off among the compared methods, with the highest F1 $\phi_2$ (0.699) and Harm Score $\phi_2$ (0.602), ahead of prompt-based monitors, off-the-shelf guardrails, and the SFT baselines.

Main Offline Results

Comprehensive metrics on the MTID test split.

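| Method | Model | Benign Score $\uparrow$ | Miss $\downarrow$ | Early $\downarrow$ | $\ell_1$ $\downarrow$ | Acc. ($\phi_1$) $\uparrow$ | Harm Score $\phi_2$ $\uparrow$ | F1 $\phi_2$ $\uparrow$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baselines | | | | | | | | |
| Vanilla LLM Monitor | Qwen3-4B | 0.753 | 0.708 | 0.168 | 1.31 | 0.124 | 0.211 | 0.330 |
| Sequential Monitor | GPT-5.2 | 0.648 | 0.412 | 0.296 | 1.02 | 0.292 | 0.428 | 0.516 |
| Intention Analysis | GPT-OSS-120B | 0.023 | 0.045 | 0.638 | 1.36 | 0.317 | 0.574 | 0.045 |
| Guardrail-based baselines | | | | | | | | |
| Llama Guard | Llama-3-8B | 0.998 | 0.998 | 0.001 | 0.33 | 0.002 | 0.002 | 0.005 |
| Qwen Guard | Qwen3-8B | 0.863 | 0.788 | 0.092 | 0.83 | 0.120 | 0.159 | 0.269 |
| Trainable Methods | | | | | | | | |
| Naive-SFT | Qwen3-4B | 0.930 | 0.738 | 0.098 | 0.76 | 0.163 | 0.224 | 0.361 |
| Reweighted-SFT | Qwen3-4B | 0.840 | 0.379 | 0.278 | 0.90 | 0.343 | 0.479 | 0.610 |
| TurnGate (Ours) | Qwen3-4B | 0.834 | 0.177 | 0.409 | 1.02 | 0.414 | 0.602 | 0.699 |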

Metric Definitions

Benign Score: Pass-through rate on benign traffic ($1 - \text{FPR}$). Measures utility preservation.
Miss: Rate of harmful trajectories allowed to complete without intervention ($\eta_\pi > t^*$). Represents a safety breach.
Acc. ($\phi_1$): Exact closure accuracy. The fraction of trajectories blocked exactly at the first sufficient turn ($\eta_\pi = t^*$).
Harm Score $\phi_2$: Time-sensitive safety metric; assigns $1.0$ for an exact block, $\eta_\pi/t^*$ for an early block, and $0.0$ for a miss.
F1 $\phi_2$: Overall trade-off summary; the harmonic mean of Benign Score and Harm Score $\phi_2$.
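
To make the definitions above concrete, the following is a minimal sketch of how the per-trajectory Harm Score $\phi_2$ and the F1 $\phi_2$ summary could be computed. It is not the authors' evaluation code. The function names, the use of `None` for a trajectory that is never blocked, and the assumption that turns are indexed from 1 are all illustrative choices that do not come from the source.

```python
from typing import Optional


def harm_score(eta: Optional[int], t_star: int) -> float:
    """phi_2 for one harmful trajectory (hypothetical helper).

    eta:    turn at which the monitor blocked, or None if it never blocked
            (assumed convention, turns indexed from 1).
    t_star: first sufficient turn for that trajectory.
    """
    if eta is None or eta > t_star:
        return 0.0           # miss: the harmful trajectory got through
    if eta == t_star:
        return 1.0           # exact block at the first sufficient turn
    return eta / t_star      # early block: partial credit eta / t*


def f1_phi2(benign_score: float, mean_harm_score: float) -> float:
    """Harmonic mean of Benign Score and mean Harm Score phi_2."""
    total = benign_score + mean_harm_score
    if total == 0.0:
        return 0.0           # avoid division by zero when both are 0
    return 2.0 * benign_score * mean_harm_score / total


if __name__ == "__main__":
    # Per-trajectory examples.
    print(harm_score(3, 3))      # exact block
    print(harm_score(2, 4))      # early block
    print(harm_score(None, 3))   # miss
    # Aggregate check against the reported TurnGate row
    # (Benign Score 0.834, Harm Score 0.602).
    print(round(f1_phi2(0.834, 0.602), 3))
```

Plugging the reported TurnGate values (Benign Score 0.834, Harm Score $\phi_2$ 0.602) into the harmonic mean gives approximately 0.699, which matches the F1 $\phi_2$ column in the main results table.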

Cross-Domain Generalization

Zero-shot evaluation across different risk categories (Chemistry ↔ Cybersecurity).

| Method | Source Domain | Target Domain | Benign Score $\uparrow$ | Harm Score $\phi_2$ $\uparrow$ | F1 $\phi_2$ $\uparrow$ |
| --- | --- | --- | --- | --- | --- |
| Intention Analysis | Chemistry | Cybersecurity | 0.808 | 0.297 | 0.435 |
| TurnGate | Chemistry | Cybersecurity | 0.863 | 0.397 | 0.543 |
| Intention Analysis | Cybersecurity | Chemistry | 0.894 | 0.329 | 0.481 |
| TurnGate | Cybersecurity | Chemistry | 0.739 | 0.474 | 0.578 |

Online Robustness

Attack Success Rate (ASR) against adaptive tree-search attackers.

Adaptive Attack Robustness

TurnGate remains substantially more robust in closed-loop online interaction against strong adaptive attackers. Even as the attacker budget increases, TurnGate maintains a lower ASR than all baselines.

[Figure: robustness iteration plot]

OOD Generalization

Evaluation at attacker iteration budget $i=5$ shows that TurnGate generalizes effectively to held-out risk categories, a different target model (Gemini-3.1-Pro), and different attacker pipelines.

[Figure: OOD generalization plot]