May 2026 Preprint

TurnGate: Detecting Malicious Intent in Multi-turn Dialogue

Xinjie Shen1*, Rongzhe Wei1*, Peizhi Niu2*, Haoyu Wang1, Ruihan Wu3, Eli Chien4, Bo Li2,6, Pin-Yu Chen5, Pan Li1
1Georgia Institute of Technology, 2UIUC, 3OpenAI, 4NTU, 5IBM Research, 6Virtue AI
Warning: This page contains examples of adversarial attacks and harmful intent. All content is for research purposes to improve AI safety.

Abstract

Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID enables training a turn-level monitor, TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models.

Sequential Intervention Logic

The defense challenge is no longer just to judge whether an individual turn is unsafe, but to determine when the dialogue as a whole becomes sufficient to enable harm. We formalize this as a cost-sensitive sequential stopping problem, where the goal is to identify the first turn $t^*$ at which delivering the candidate response would complete the information needed for misuse.

An ideal defender intervenes at a turn $\eta_\pi$ that matches $t^*$: blocking earlier needlessly refuses benign exploration, while blocking later lets harm-enabling content through. The interaction protocol is response-aware: the monitor inspects the model's generated response before delivery, since the assistant's own technical revelations contribute to whether the harm threshold is crossed.
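This response-aware protocol can be sketched as a simple gating loop. The function names below (`assistant`, `monitor`) are illustrative stand-ins; in the paper the monitor is a trained model, not a Python callable.

```python
def run_dialogue(user_turns, assistant, monitor):
    """Generate each candidate response, let the monitor inspect the full
    context *before* delivery, and stop at the first BLOCK decision.

    Returns ("BLOCK", eta_pi) on intervention, ("PASS", None) otherwise.
    """
    history = []  # delivered turns [(q_1, r_1), ..., (q_{t-1}, r_{t-1})]
    for t, query in enumerate(user_turns, start=1):
        candidate = assistant(history, query)       # r~_t: generated, not yet delivered
        context = history + [(query, candidate)]    # x_t includes the candidate response
        if monitor(context) == "BLOCK":
            return ("BLOCK", t)                     # intervention turn eta_pi = t
        history.append((query, candidate))          # deliver r~_t and continue
    return ("PASS", None)
```

The key design point mirrored here is that the monitor sees `candidate` inside `context`, so a benign query that elicits a harm-enabling response can still be caught before delivery.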

The MTID Dataset: Construction & Annotation

To operationalize the defense, we construct the Multi-Turn Intent Dataset (MTID) as a collection of branching, multi-turn trajectories from adaptive attackers. Each node stores the full dialogue prefix plus the candidate response, and we annotate the earliest closure turn $t^*$ (or $\infty$ for benign trajectories).

The Target

Most safety datasets use static turn labels or coarse trajectory judgments. MTID focuses on:

  • Closure Labels: Pinpoints the earliest turn $t^*$ at which the accumulated dialogue becomes harm-enabling.
  • Branching Attacks: Captures the hit-and-pivot behavior of adaptive attackers.
  • Matched Benign Pairs: Includes hard negatives with overlapping technical vocabulary to stress-test over-refusal.

Implementation

We build MTID with a three-stage adaptive simulation pipeline:

  1. Adaptive Tree Search: A CKA Agent explores a branching attack tree. Nodes are dialogue states $h_{t-1}$, and edges are candidate assistant responses.
  2. Sufficiency Annotation: An evaluator applies $\mathrm{Suff}(x_t, g)$ over the full prefix (including the candidate response) to mark the minimal closure turn $t^*$.
  3. Hard-Negative Pairing: For each harmful rollout, we add a benign trajectory with matched topic and terminology using WildJailbreak.

Result: 16,000 trajectories across Chemistry and Cybersecurity, each with closure labels and matched benign pairs.
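A trajectory node in this construction can be represented roughly as follows. The field names are hypothetical illustrations of the structure described above (dialogue prefix, candidate response, closure label), not the released dataset schema.

```python
import math
from dataclasses import dataclass


@dataclass
class MTIDNode:
    """One node of a branching MTID trajectory (hypothetical schema)."""
    turns: list                       # dialogue prefix [(q_1, r_1), ..., (q_{t-1}, r_{t-1})]
    query: str                        # current user query q_t
    candidate: str                    # candidate assistant response r~_t (not yet delivered)
    closure_turn: float = math.inf    # earliest harm-enabling turn t*; inf = benign
```

Representing benign trajectories with `closure_turn = math.inf` matches the paper's convention that $t^* = \infty$ when no prefix is ever sufficient.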

Example MTID Trajectories

Benign example: a sociological discussion of drug trafficking that remains non-operational.
Harmful example: the conversation shifts to phishing intent; closure occurs at $t^*$ and triggers intervention.

MTID Generative Process

MTID Generative Process. We simulate multi-turn interactions using a three-stage adaptive pipeline. Stage 1 selects a harmful/benign intent. Stage 2 explores a branching dialogue tree using adaptive search. Stage 3 rolls out complete trajectories and annotates the earliest closure turn $t^*$. Here, $q_t$ and $r_t$ denote the user query and assistant response at turn $t$, respectively.

Experimental Results

TurnGate achieves the state-of-the-art trade-off between timely intervention and utility preservation.

Main Offline Results (MTID Test Split)

Columns group into Utility (Benign), Harmful detection timing (Miss, Early, Acc, Harm), and F1 summary.

| Method | Backbone | Benign $\uparrow$ | Miss $\downarrow$ | Early $\downarrow$ | Acc ($\phi_1$) $\uparrow$ | Harm ($\phi_2$) $\uparrow$ | F1 ($\phi_1$) $\uparrow$ | F1 ($\phi_2$) $\uparrow$ | F1 ($\phi_3$) $\uparrow$ |
|---|---|---|---|---|---|---|---|---|---|
| Prompt-based Monitors | | | | | | | | | |
| Vanilla LLM | Qwen3-4B | 0.753 | 0.708 | 0.168 | 0.124 | 0.211 | 0.211 | 0.330 | 0.284 |
| Vanilla LLM | GPT-5.2 | 0.826 | 0.787 | 0.073 | 0.139 | 0.175 | 0.238 | 0.288 | 0.266 |
| Vanilla LLM | GPT-OSS-120B | 0.809 | 0.719 | 0.114 | 0.167 | 0.220 | 0.277 | 0.345 | 0.313 |
| Sequential Mon. | Qwen3-4B | 0.898 | 0.831 | 0.086 | 0.083 | 0.123 | 0.153 | 0.216 | 0.187 |
| Sequential Mon. | GPT-5.2 | 0.648 | 0.412 | 0.296 | 0.292 | 0.428 | 0.402 | 0.516 | 0.465 |
| Intention Analysis | Qwen3-4B | 0.854 | 0.823 | 0.087 | 0.091 | 0.133 | 0.164 | 0.230 | 0.202 |
| Intention Analysis | GPT-OSS-120B | 0.023 | 0.045 | 0.638 | 0.317 | 0.574 | 0.043 | 0.045 | 0.044 |
| Guardrail-based Baselines | | | | | | | | | |
| Llama Guard | Llama-3-8B | 0.998 | 0.998 | 0.001 | 0.002 | 0.002 | 0.003 | 0.005 | 0.005 |
| Qwen Guard | Qwen3-8B | 0.863 | 0.788 | 0.092 | 0.120 | 0.159 | 0.211 | 0.269 | 0.239 |
| Synthesis-Llama | Llama-3-8B | 0.983 | 0.990 | 0.003 | 0.007 | 0.008 | 0.013 | 0.017 | 0.015 |
| Trainable Methods (Qwen3-4B) | | | | | | | | | |
| Naive-SFT (Traj) | Qwen3-4B | 0.609 | 0.089 | 0.629 | 0.282 | 0.532 | 0.386 | 0.568 | 0.477 |
| Naive-SFT (Turn) | Qwen3-4B | 0.930 | 0.738 | 0.098 | 0.163 | 0.224 | 0.278 | 0.361 | 0.334 |
| Reweighted-SFT | Qwen3-4B | 0.840 | 0.379 | 0.278 | 0.343 | 0.479 | 0.487 | 0.610 | 0.557 |
| TurnGate (Ours) | Qwen3-4B | 0.834 | 0.177 | 0.409 | 0.414 | 0.602 | 0.553 | 0.699 | 0.633 |

Metric Intuition: Proximity Score $\phi$

How do we measure "closeness" to the closure turn $t^*$? We define $\phi(\eta_\pi; t^*)$ as the credit assigned based on the turn $\eta_\pi$ at which the defender intervenes.

Dialogue timeline illustration (closure turn $t^*=4$, turns 1 through 5):

  • Case A, Exact ($\eta_\pi = t^* = 4$): score 1.0.
  • Case B, Early ($\eta_\pi = 2$): score 0.5 under $\phi_2$.
  • Case C, Miss ($\eta_\pi > t^*$): score 0.0.

$\phi_1$ (Strict)

Only exact closure at $t^*$ is rewarded ($1.0$). Any early block or miss receives $0.0$. Most stringent safety metric.

Case A: 1.0; Case B: 0.0; Case C: 0.0.

$\phi_2$ (Linear Proximity)

Rewards "closeness" to $t^*$. Assigns partial credit $\eta_\pi/t^*$ for early blocks. Misses receive $0.0$. Used for main F1 results.

Case A: 1.0; Case B: $\eta_\pi/t^*$ (here 0.5); Case C: 0.0.

$\phi_3$ (Conservative)

Sharp penalty for premature blocks. Assigns $(\eta_\pi/t^*)^2$ for early intervention. Higher penalty for blocking near turn 1.

Case A: 1.0; Case B: $(\eta_\pi/t^*)^2$ (here 0.25); Case C: 0.0.
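The three proximity variants can be written as one small function. This is a direct transcription of the scoring rules above; only the `variant` parameter name is our own.

```python
def phi(eta, t_star, variant=2):
    """Proximity credit for intervening at turn eta with closure turn t_star.

    variant 1: strict   -- only exact hits score.
    variant 2: linear   -- early blocks earn eta / t_star.
    variant 3: squared  -- early blocks earn (eta / t_star)**2.
    Misses (eta > t_star) score 0.0 under every variant.
    """
    if eta == t_star:
        return 1.0
    if eta > t_star or variant == 1:
        return 0.0
    ratio = eta / t_star              # early block: partial credit
    return ratio if variant == 2 else ratio ** 2
```

Running the three cases from the figure ($t^*=4$): Case A scores 1.0 everywhere, Case B scores 0.0 / 0.5 / 0.25 under $\phi_1$ / $\phi_2$ / $\phi_3$, and Case C scores 0.0 everywhere.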

Online OOD Generalization

Setting: We stress-test TurnGate in a strict zero-shot setting against an adaptive tree-search attacker ($i=5$) across distribution shifts in benchmarks, target models, and attacker pipelines.

Why online is harder: The attacker interacts with the live target-defender loop, conditions each next query on the delivered responses, and can backtrack or reroute after blocked or uninformative turns. This closed-loop adaptation (with iteration budget $i \in \{1, 3, 5\}$) is more realistic than fixed offline trajectories and makes defense substantially harder.

Legend
Benchmark Distribution Shift

(a) Benchmark Shift: Zero-shot transfer to StrongReject, HarmBench, and JBB-Behaviors across 12+ unseen risk categories.

Target-Model Distribution Shift

(b) Model Shift: Defending a Gemini-3.1-Pro model zero-shot using a monitor trained exclusively on GPT-5.2 interactions.

Joint Target + Benchmark Shift

(c) Joint Shift: Simultaneous zero-shot transfer across both target model architecture and high-risk benchmarks.

Attacker-Pipeline Distribution Shift

(d) Pipeline Shift: Generalization to defend against multi-agent jailbreak (MAJ) attack pipelines.

Setting

Online battle against an adaptive tree-search attacker with iteration budget $i=5$.

Key Insight

Reduces ASR by up to 45% on Illegal Goods and 40% on Physical Harm categories zero-shot.

Takeaway

TurnGate learns transferable turn-level behavior rather than relying on domain-specific surface cues.

Methodology: From Sequential Logic to RL Optimization

TurnGate is designed to solve the challenge of distributed malicious intent, where an attacker spreads a harmful objective across multiple benign-looking turns.

1. Interaction Protocol and Response-Awareness

We model the interaction as a three-party dialogue among a user, a base assistant, and a defender. At each turn $t$, the defender observes the full context: $$x_t = ((q_1, r_1), \ldots, (q_{t-1}, r_{t-1}), q_t, \tilde{r}_t)$$ Crucially, the defender is response-aware: it inspects the assistant's candidate response $\tilde{r}_t$ before delivery. This is essential because the model's own technical revelations often determine whether the harm threshold is crossed.

2. The Harmful Closure Turn ($t^*$)

We define the closure turn $t^*$ as the first turn where the interaction becomes sufficient to realize a harmful objective $g$: $$t^*(\tau, g) = \min(\{t \in \{1, \ldots, T\} : \mathrm{Suff}(x_t, g) = 1\} \cup \{\infty\})$$ where $\mathrm{Suff}(x_t, g)$ is a binary sufficiency operator. The defender's goal is to match this stopping time: $\eta_\pi = t^*$.
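Given per-turn sufficiency judgments, the closure turn follows directly from this definition. Here `suff` is a stand-in for the paper's LLM-based sufficiency evaluator, applied to each prefix $x_t$.

```python
import math


def closure_turn(prefixes, suff):
    """Earliest turn t* whose prefix x_t (including the candidate response)
    satisfies Suff(x_t, g) = 1; math.inf if no prefix is ever sufficient.

    `prefixes` is the sequence x_1, ..., x_T; `suff` is a binary predicate.
    """
    for t, x_t in enumerate(prefixes, start=1):
        if suff(x_t):
            return t          # min over the sufficient set
    return math.inf           # benign trajectory: t* = infinity
```

Returning `math.inf` for benign trajectories mirrors the $\cup \{\infty\}$ term in the definition, so downstream comparisons like $t < t^*$ remain well defined.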

3. Cost-Sensitive Stopping Objective

The quality of a defender is determined by the trajectory-level utility $J(\pi)$, which explicitly balances safety and utility:

$$J(\pi) = \mathbb{E}\Big[ u_{\mathrm{ben}}\mathbf{1}[\text{benign}] + u_{\mathrm{hit}}\mathbf{1}[\text{timely}] - c_{\mathrm{fp}}\mathbf{1}[\text{FP}] - c_{\mathrm{miss}}\mathbf{1}[\text{miss}] - c_{\mathrm{early}}(1-\phi)\mathbf{1}[\text{early}] \Big]$$
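A minimal transcription of the per-trajectory term inside the expectation. The cost weights here are illustrative placeholders, not the paper's tuned values; `eta` is the intervention turn (`None` if the defender never blocks) and `t_star = math.inf` marks a benign trajectory.

```python
import math


def trajectory_utility(eta, t_star, phi,
                       u_ben=1.0, u_hit=1.0,
                       c_fp=1.0, c_miss=2.0, c_early=1.0):
    """One trajectory's contribution to J(pi) under the indicator case split."""
    if math.isinf(t_star):                  # benign: reward completion, punish FP
        return u_ben if eta is None else -c_fp
    if eta is None or eta > t_star:         # harm-enabling content was delivered
        return -c_miss
    if eta == t_star:                       # timely interception
        return u_hit
    return -c_early * (1.0 - phi(eta, t_star))   # early block, softened by phi
```

Note how the early-block penalty shrinks as $\phi(\eta_\pi; t^*) \to 1$, so interventions just before $t^*$ cost far less than refusals at turn 1.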

4. Turn-Level Process Rewards

To optimize $J(\pi)$, we decompose the trajectory-level utility into turn-level process rewards $R_t$:

$$R_t = \begin{cases} u_{\mathrm{ben}}\mathbf{1}[t < t^\ast] - c_{\mathrm{miss}}\mathbf{1}[t=t^\ast], & \text{if } a_t=\text{PASS} \\ u_{\mathrm{hit}}\mathbf{1}[t=t^\ast] - c_{\mathrm{early}}(1-\phi(t; t^\ast))\mathbf{1}[t < t^\ast] - c_{\mathrm{fp}}\mathbf{1}[t^\ast=\infty], & \text{if } a_t=\text{BLOCK} \end{cases}$$
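The case split above can be transcribed directly. Weights are again illustrative placeholders; we read the early-block penalty as applying to harmful trajectories only, with benign blocks incurring the false-positive cost (an interpretation of the indicators when $t^* = \infty$).

```python
import math


def process_reward(action, t, t_star, phi,
                   u_ben=1.0, u_hit=1.0,
                   c_fp=1.0, c_miss=2.0, c_early=1.0):
    """Turn-level process reward R_t; t_star = math.inf marks a benign dialogue."""
    if action == "PASS":
        r = u_ben if t < t_star else 0.0        # safe pass before closure
        if t == t_star:
            r -= c_miss                          # passed the harm-enabling turn
        return r
    # action == "BLOCK"
    r = u_hit if t == t_star else 0.0            # exact interception
    if math.isinf(t_star):
        r -= c_fp                                # false positive on benign dialogue
    elif t < t_star:
        r -= c_early * (1.0 - phi(t, t_star))    # premature block, softened by phi
    return r
```

Summing `process_reward` over a trajectory recovers the corresponding term of $J(\pi)$, which is what lets GAE propagate credit across turns.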

We optimize TurnGate using reinforcement learning with Generalized Advantage Estimation (GAE) to propagate these rewards across turns, training the monitor to recognize the latent synthesis of restricted capability.

Citation

@misc{shen2026turnlateresponseawaredefense,
  title={One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue}, 
  author={Xinjie Shen and Rongzhe Wei and Peizhi Niu and Haoyu Wang and Ruihan Wu and Eli Chien and Bo Li and Pin-Yu Chen and Pan Li},
  year={2026},
  eprint={2605.05630},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.05630}, 
}