TurnGate: Detecting Malicious Intent in Multi-turn Dialogue
Abstract
Sequential Intervention Logic
The defense challenge is no longer just to judge whether an individual turn is unsafe, but to determine when the dialogue as a whole becomes sufficient to enable harm. We formalize this as a cost-sensitive sequential stopping problem, where the goal is to identify the first turn $t^*$ at which delivering the candidate response would complete the information needed for misuse.
An ideal defender must identify the earliest turn $\eta_\pi$ that matches $t^*$. This interaction protocol is response-aware: the monitor inspects the model's generated response before delivery, as the assistant's own technical revelations contribute to whether the harm threshold is crossed.
The MTID Dataset: Construction & Annotation
To operationalize the defense, we construct the Multi-Turn Intent Dataset (MTID) as a collection of branching, multi-turn trajectories from adaptive attackers. Each node stores the full dialogue prefix plus the candidate response, and we annotate the earliest closure turn $t^*$ (or $\infty$ for benign trajectories).
The Target
Most safety datasets use static turn labels or coarse trajectory judgments. MTID focuses on:
- Closure Labels: Pinpoints the earliest turn $t^*$ where harm becomes enabling.
- Branching Attacks: Captures the hit-and-pivot behavior of adaptive attackers.
- Matched Benign Pairs: Includes hard negatives with overlapping technical vocabulary to stress-test over-refusal.
Implementation
We build MTID with a three-stage adaptive simulation pipeline:
- Adaptive Tree Search: A CKA Agent explores a branching attack tree. Nodes are dialogue states $h_{t-1}$, and edges are candidate assistant responses.
- Sufficiency Annotation: An evaluator applies $\mathrm{Suff}(x_t, g)$ over the full prefix (including the candidate response) to mark the minimal closure turn $t^*$.
- Hard-Negative Pairing: For each harmful rollout, we add a benign trajectory with matched topic and terminology using WildJailbreak.
Result: 16,000 trajectories across Chemistry and Cybersecurity, each with closure labels and matched benign pairs.
Example MTID Trajectories
MTID Generative Process
Experimental Results
TurnGate achieves the state-of-the-art trade-off between timely intervention and utility preservation.
Main Offline Results (MTID Test Split)
| Method | Backbone | Utility | Harmful (Detection Timing) | F1 Score (Summary) | |||||
|---|---|---|---|---|---|---|---|---|---|
| Benign $\uparrow$ | Miss $\downarrow$ | Early $\downarrow$ | Acc ($\phi_1$) $\uparrow$ | Harm ($\phi_2$) $\uparrow$ | F1 ($\phi_1$) $\uparrow$ | F1 ($\phi_2$) $\uparrow$ | F1 ($\phi_3$) $\uparrow$ | ||
| Prompt-based Monitors | |||||||||
| Vanilla LLM | Qwen3-4B | 0.753 | 0.708 | 0.168 | 0.124 | 0.211 | 0.211 | 0.330 | 0.284 |
| Vanilla LLM | GPT-5.2 | 0.826 | 0.787 | 0.073 | 0.139 | 0.175 | 0.238 | 0.288 | 0.266 |
| Vanilla LLM | GPT-OSS-120B | 0.809 | 0.719 | 0.114 | 0.167 | 0.220 | 0.277 | 0.345 | 0.313 |
| Sequential Mon. | Qwen3-4B | 0.898 | 0.831 | 0.086 | 0.083 | 0.123 | 0.153 | 0.216 | 0.187 |
| Sequential Mon. | GPT-5.2 | 0.648 | 0.412 | 0.296 | 0.292 | 0.428 | 0.402 | 0.516 | 0.465 |
| Intention Analysis | Qwen3-4B | 0.854 | 0.823 | 0.087 | 0.091 | 0.133 | 0.164 | 0.230 | 0.202 |
| Intention Analysis | GPT-OSS-120B | 0.023 | 0.045 | 0.638 | 0.317 | 0.574 | 0.043 | 0.045 | 0.044 |
| Guardrail-based Baselines | |||||||||
| Llama Guard | Llama-3-8B | 0.998 | 0.998 | 0.001 | 0.002 | 0.002 | 0.003 | 0.005 | 0.005 |
| Qwen Guard | Qwen3-8B | 0.863 | 0.788 | 0.092 | 0.120 | 0.159 | 0.211 | 0.269 | 0.239 |
| Synthesis-Llama | Llama-3-8B | 0.983 | 0.990 | 0.003 | 0.007 | 0.008 | 0.013 | 0.017 | 0.015 |
| Trainable Methods (Qwen3-4B) | |||||||||
| Naive-SFT (Traj) | Qwen3-4B | 0.609 | 0.089 | 0.629 | 0.282 | 0.532 | 0.386 | 0.568 | 0.477 |
| Naive-SFT (Turn) | Qwen3-4B | 0.930 | 0.738 | 0.098 | 0.163 | 0.224 | 0.278 | 0.361 | 0.334 |
| Reweighted-SFT | Qwen3-4B | 0.840 | 0.379 | 0.278 | 0.343 | 0.479 | 0.487 | 0.610 | 0.557 |
| TurnGate (Ours) | Qwen3-4B | 0.834 | 0.177 | 0.409 | 0.414 | 0.602 | 0.553 | 0.699 | 0.633 |
Metric Intuition: Proximity Score $\phi$
How do we measure "closeness" to the closure turn $t^*$? We define $\phi(\eta_\pi; t^*)$ as the credit assigned based on the turn $\eta_\pi$ at which the defender intervenes.
$\phi_1$ (Strict)
Only exact closure at $t^*$ is rewarded ($1.0$). Any early block or miss receives $0.0$. Most stringent safety metric.
Case A: 1.0; Case B: 0.0; Case C: 0.0.
$\phi_2$ (Linear Proximity)
Rewards "closeness" to $t^*$. Assigns partial credit $\eta_\pi/t^*$ for early blocks. Misses receive $0.0$. Used for main F1 results.
Case A: 1.0; Case B: $\eta_\pi/t^*$ (here 0.5); Case C: 0.0.
$\phi_3$ (Conservative)
Sharp penalty for premature blocks. Assigns $(\eta_\pi/t^*)^2$ for early intervention. Higher penalty for blocking near turn 1.
Case A: 1.0; Case B: $(\eta_\pi/t^*)^2$ (here 0.25); Case C: 0.0.
Online OOD Generalization
Setting: We stress-test TurnGate in a strict zero-shot setting against an adaptive tree-search attacker ($i=5$) across distribution shifts in benchmarks, target models, and attacker pipelines.
Why online is harder: The attacker interacts with the live target--defender loop, conditions each next query on delivered responses, and can backtrack or reroute after blocked or uninformative turns. This closed-loop adaptation (with iteration budget $i \in \{1, 3, 5\}$) is more realistic than offline fixed trajectories and makes defense substantially harder.
(a) Benchmark Shift: Zero-shot transfer to StrongReject, HarmBench, and JBB-Behaviors across 12+ unseen risk categories.
(b) Model Shift: Defending a Gemini-3.1-Pro model zero-shot using a monitor trained exclusively on GPT-5.2 interactions.
(c) Joint Shift: Simultaneous zero-shot transfer across both target model architecture and high-risk benchmarks.
(d) Pipeline Shift: Generalization to defend against multi-agent jailbreak (MAJ) attack pipelines.
Setting
Online battle against an adaptive tree-search attacker with iteration budget $i=5$.
Key Insight
Reduces ASR by up to 45% on Illegal Goods and 40% on Physical Harm categories zero-shot.
Takeaway
TurnGate learns transferable turn-level behavior rather than relying on domain-specific surface cues.
Methodology: From Sequential Logic to RL Optimization
TurnGate is designed to solve the challenge of distributed malicious intent, where an attacker spreads a harmful objective across multiple benign-looking turns.
1. Interaction Protocol and Response-Awareness
We model the interaction as a three-party dialogue among a user, a base assistant, and a defender. At each turn $t$, the defender observes the full context: $$x_t = ((q_1, r_1), \ldots, (q_{t-1}, r_{t-1}), q_t, \tilde{r}_t)$$ Crucially, the defender is response-aware: it inspects the assistant's candidate response $\tilde{r}_t$ before delivery. This is essential because the model's own technical revelations often determine whether the harm threshold is crossed.
2. The Harmful Closure Turn ($t^*$)
We define the closure turn $t^*$ as the first turn where the interaction becomes sufficient to realize a harmful objective $g$: $$t^*(\tau, g) = \min(\{t \in \{1, \ldots, T\} : \mathrm{Suff}(x_t, g) = 1\} \cup \{\infty\})$$ where $\mathrm{Suff}(x_t, g)$ is a binary sufficiency operator. The defender's goal is to match this stopping time: $\eta_\pi = t^*$.
3. Cost-Sensitive Stopping Objective
The quality of a defender is determined by the trajectory-level utility $J(\pi)$, which explicitly balances safety and utility:
4. Turn-Level Process Rewards
To optimize $J(\pi)$, we decompose the trajectory-level utility into turn-level process rewards $R_t$:
We optimize TurnGate using Reinforcement Learning with Generalized Advantage Estimate (GAE) to propagate these rewards across turns, training the monitor to recognize the latent synthesis of restricted capability.
Citation
@misc{shen2026turnlateresponseawaredefense,
title={One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue},
author={Xinjie Shen and Rongzhe Wei and Peizhi Niu and Haoyu Wang and Ruihan Wu and Eli Chien and Bo Li and Pin-Yu Chen and Pan Li},
year={2026},
eprint={2605.05630},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.05630},
}