May 2026 Preprint

Benign vs. Harmful Interactions

TurnGate is designed to distinguish between safe technical exploration and adversarial information gathering. Below we analyze two trajectories from the MTID dataset to demonstrate TurnGate's sequential intervention logic.

Case 1: Benign Technical Exploration

A safe request for chemical properties and safety protocols.

Figure: Benign interaction. Despite the use of technical chemistry terminology, the conversation remains focused on safe, educational content. TurnGate correctly allows the full dialogue to pass.

Defender Decision: PASS (Correct)

The sequential state never reaches the threshold at which misuse would be enabled, so TurnGate avoids over-refusing safe technical traffic.

Case 2: Distributed Malicious Intent

An adaptive attack attempting to synthesize restricted capabilities.

Figure: Harmful interaction. The attacker uses a multi-turn strategy to extract sensitive technical details. TurnGate identifies the closure turn (Turn 4) where the response would become harm-sufficient.

Defender Decision: BLOCK at Turn 4 (Correct)

TurnGate detects that the cumulative information delivered in the previous turns, combined with the Turn 4 response, enables the harmful objective.

Trajectory Analysis

Traditional monitors often struggle with these cases because they either evaluate turns in isolation, which misses the cumulative risk, or rely on brittle heuristics, which leads to over-refusal of benign conversations that happen to contain technical terms.
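The contrast between these failure modes can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not TurnGate's implementation: each turn is modeled as a response text plus a set of abstract "facts" it reveals, the harmful objective is the hypothetical set `REQUIRED`, and `TECHNICAL_TERMS` stands in for a brittle keyword list.

```python
# Toy comparison of three monitoring strategies (illustrative assumptions only).

REQUIRED = frozenset({"a", "b", "c", "d"})           # hypothetical facts that jointly enable harm
TECHNICAL_TERMS = frozenset({"reagent", "solvent"})  # hypothetical keyword blocklist


def isolated_monitor(turns):
    """Flag only if a single turn reveals every required fact by itself."""
    return any(REQUIRED <= facts for _, facts in turns)


def keyword_monitor(turns):
    """Flag if any turn's text contains a blocklisted technical term."""
    return any(TECHNICAL_TERMS & set(text.lower().split()) for text, _ in turns)


def cumulative_monitor(turns):
    """Flag once the facts revealed across all turns so far cover REQUIRED."""
    seen = set()
    for _, facts in turns:
        seen |= facts
        if REQUIRED <= seen:
            return True
    return False


# Each turn is (response text, set of facts revealed by that response).
benign = [
    ("This solvent is flammable so store it safely", {"a"}),
    ("Wear gloves when handling any reagent", set()),
]
harmful = [
    ("General background one", {"a"}),
    ("General background two", {"b"}),
    ("General background three", {"c"}),
    ("Final missing detail", {"d"}),
]

for name, turns in [("benign", benign), ("harmful", harmful)]:
    print(name, isolated_monitor(turns), keyword_monitor(turns), cumulative_monitor(turns))
```

In this toy setting the isolated monitor misses the harmful trajectory, the keyword monitor flags the benign one, and only the cumulative monitor separates the two.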

Context Awareness

TurnGate maintains the full conversation state to track the synthesis of restricted information across turns.

Response Awareness

By inspecting the candidate response before delivery, TurnGate can judge whether that specific response would complete the harmful objective.
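A minimal sketch of how state tracking and pre-delivery inspection might fit together, under stated assumptions: the class `ToyGate`, its `review` method, and the `is_harm_sufficient` callback are hypothetical names, and `toy_judge` is a trivial stand-in for whatever model TurnGate actually uses. The gate keeps every delivered response as state and evaluates each candidate response together with that state before releasing it.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical judge signature: (delivered responses, candidate) -> True if
# delivering the candidate would make the conversation harm-sufficient.
Judge = Callable[[List[str], str], bool]


@dataclass
class ToyGate:
    is_harm_sufficient: Judge
    delivered: List[str] = field(default_factory=list)  # full conversation state

    def review(self, candidate: str) -> str:
        """Inspect a candidate response before delivery; return PASS or BLOCK."""
        if self.is_harm_sufficient(self.delivered, candidate):
            return "BLOCK"  # candidate is withheld and not added to the state
        self.delivered.append(candidate)
        return "PASS"


def toy_judge(delivered: List[str], candidate: str) -> bool:
    # Stand-in rule (assumption): harm needs all four placeholder parts together.
    parts = {"part-1", "part-2", "part-3", "part-4"}
    return parts <= set(delivered) | {candidate}


gate = ToyGate(toy_judge)
decisions = [gate.review(r) for r in ["part-1", "part-2", "part-3", "part-4"]]
print(decisions)
```

With this stand-in judge the first three candidates pass and the fourth is blocked, mirroring the closure-turn behavior described for Case 2. A blocked candidate is never appended to `delivered`, so the tracked state reflects only what the user has actually received.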