Benign vs. Harmful Interactions
TurnGate is designed to distinguish between safe technical exploration and adversarial information gathering. Below we analyze two trajectories from the MTID dataset to demonstrate TurnGate's sequential intervention logic.
Case 1: Benign Technical Exploration
A safe request for chemical properties and safety protocols.
Defender Decision: PASS (Correct)
The sequential state never reaches the threshold at which the conversation would enable misuse, so TurnGate lets every turn through and avoids over-refusing safe technical traffic.
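One way to picture the sequential state is as a running risk estimate that is updated after every turn and compared against a fixed threshold. The sketch below assumes a hypothetical per-turn risk score in [0, 1] and a hypothetical accumulation rule (a plain sum). Neither the scores nor the threshold come from TurnGate or MTID; they are placeholders chosen only to show why a benign trajectory ends in PASS.

```python
from typing import List

# Hypothetical threshold; TurnGate's actual decision rule is not specified here.
THRESHOLD = 1.0


def sequential_decision(turn_risks: List[float], threshold: float = THRESHOLD) -> str:
    """Accumulate per-turn risk and report the first decision that fires.

    turn_risks: hypothetical risk contributed by each turn, in [0, 1].
    Returns "PASS" if the running state never reaches the threshold,
    otherwise "BLOCK at Turn k" for the first (1-indexed) turn k that reaches it.
    """
    state = 0.0
    for turn, risk in enumerate(turn_risks, start=1):
        state += risk  # the state carries information from all earlier turns
        if state >= threshold:
            return f"BLOCK at Turn {turn}"
    return "PASS"


# Made-up scores for a benign chemistry-safety conversation.
benign = [0.05, 0.10, 0.05, 0.10]
print(sequential_decision(benign))  # PASS
```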
Case 2: Distributed Malicious Intent
An adaptive attack attempting to synthesize restricted capabilities.
Defender Decision: BLOCK at Turn 4 (Correct)
TurnGate detects that the cumulative information delivered in the previous turns, combined with the Turn 4 response, enables the harmful objective.
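The cumulative check in Case 2 can be illustrated with a toy model in which the harmful objective needs several distinct pieces of information, and a candidate response is blocked when the pieces already delivered plus those in the candidate cover all of them. The component labels, the `REQUIRED` set, and the per-turn contents below are hypothetical placeholders, not MTID data or TurnGate's internal representation.

```python
from typing import List, Optional, Set

# Hypothetical pieces of information the harmful objective depends on.
REQUIRED: Set[str] = {"component_A", "component_B", "component_C", "procedure_D"}


def first_blocking_turn(responses: List[Set[str]], required: Set[str] = REQUIRED) -> Optional[int]:
    """Return the 1-indexed turn whose candidate response would complete `required`.

    responses[i] is the set of information pieces in the candidate response
    at turn i + 1. Returns None if the objective is never completed.
    """
    delivered: Set[str] = set()
    for turn, candidate in enumerate(responses, start=1):
        # Judge the candidate together with everything already delivered.
        if required <= delivered | candidate:
            return turn  # block: this response would complete the objective
        delivered |= candidate  # safe to deliver; it joins the history
    return None


# Toy distributed attack: no single response is complete on its own.
trajectory = [{"component_A"}, {"component_B"}, {"component_C"}, {"procedure_D"}]
print(first_blocking_turn(trajectory))  # 4
```

Judged on its own, the turn-4 response contains only `procedure_D`; the block fires only because the check includes what earlier turns already delivered.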
Trajectory Analysis
Traditional monitors often struggle with these cases because they either judge each turn in isolation (missing the cumulative risk) or rely on brittle heuristics (over-refusing requests that merely contain technical terms).
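Both failure modes can be reproduced with deliberately simple baselines. In the sketch below, an isolated monitor scores each turn independently, a keyword monitor refuses any conversation containing a listed term, and a cumulative monitor sums risk over the whole conversation. All scores, keywords, and thresholds are hypothetical and exist only to make the contrast concrete; none of these baselines is claimed to be a specific published monitor.

```python
from typing import List, Tuple

THRESHOLD = 1.0  # hypothetical decision threshold
KEYWORDS = ("toxic", "hazardous")  # hypothetical brittle keyword list

# Each turn is (text, hypothetical risk score in [0, 1]).
Turn = Tuple[str, float]


def isolated_monitor(turns: List[Turn]) -> bool:
    """Flag only if some single turn crosses the threshold on its own."""
    return any(risk >= THRESHOLD for _, risk in turns)


def keyword_monitor(turns: List[Turn]) -> bool:
    """Flag if any turn mentions a listed keyword, regardless of intent."""
    return any(k in text.lower() for text, _ in turns for k in KEYWORDS)


def cumulative_monitor(turns: List[Turn]) -> bool:
    """Flag if risk accumulated over the whole conversation crosses the threshold."""
    return sum(risk for _, risk in turns) >= THRESHOLD


benign = [("Why is this solvent considered toxic?", 0.1),
          ("What gloves and ventilation are recommended?", 0.05)]
distributed = [("step 1 (placeholder)", 0.3), ("step 2 (placeholder)", 0.3),
               ("step 3 (placeholder)", 0.3), ("step 4 (placeholder)", 0.3)]

for name, monitor in [("isolated", isolated_monitor),
                      ("keyword", keyword_monitor),
                      ("cumulative", cumulative_monitor)]:
    print(name, monitor(benign), monitor(distributed))
# isolated False False
# keyword True False
# cumulative False True
```

The isolated monitor misses the distributed trajectory, the keyword monitor refuses the benign safety question, and only the cumulative view gets both cases right under these toy numbers.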
Context Awareness
TurnGate maintains the full conversation state to track the synthesis of restricted information across turns.
Response Awareness
By inspecting the candidate response before delivery, TurnGate can judge whether that specific response would complete the harmful intent.
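Taken together, context awareness and response awareness suggest a gate that sits between the model and the user: it keeps the whole transcript, asks a judge about the candidate response in that context, and only delivers the candidate if it passes. In this sketch `generate` and `judge` are hypothetical callables standing in for the underlying model and for TurnGate's decision, whose actual interfaces are not described here; `GatedConversation`, the toy stand-ins, and the refusal string are likewise placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs
# Hypothetical interfaces: a generator drafts a reply from the transcript, and a
# judge returns True if the candidate would complete a harmful objective.
Generate = Callable[[History], str]
Judge = Callable[[History, str], bool]

REFUSAL = "I can't help with that."  # placeholder refusal text


@dataclass
class GatedConversation:
    generate: Generate
    judge: Judge
    history: History = field(default_factory=list)

    def reply(self, user_msg: str) -> str:
        self.history.append(("user", user_msg))
        candidate = self.generate(self.history)
        # Context awareness: the judge sees the full history.
        # Response awareness: it also sees the candidate before delivery.
        answer = REFUSAL if self.judge(self.history, candidate) else candidate
        self.history.append(("assistant", answer))  # a blocked candidate is never stored
        return answer


# Toy stand-ins: echo the user, and block once earlier replies plus the
# candidate together mention both hypothetical pieces "part1" and "part2".
def echo(history: History) -> str:
    return "ack: " + history[-1][1]


def toy_judge(history: History, candidate: str) -> bool:
    seen = " ".join(text for role, text in history if role == "assistant") + " " + candidate
    return all(piece in seen for piece in ("part1", "part2"))


conv = GatedConversation(generate=echo, judge=toy_judge)
print([conv.reply(m) for m in ["part1", "hello", "part2"]])
# ['ack: part1', 'ack: hello', "I can't help with that."]
```

Recording the refusal rather than the blocked candidate keeps the stored history equal to what the user actually received, so later judgments are not skewed by content that was never delivered.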