
Why Single-Turn Safety Benchmarks Systematically Underestimate Agentic Risk

A field memo on silent failures, policy erosion, and delayed misuse in multi-turn systems


TL;DR

Single-turn safety benchmarks systematically underestimate real-world risk in agentic systems. In multi-turn, partially observable settings, models that appear aligned in isolation can quietly drift, erode policy adherence, or enable misuse only after several steps.

Drawing on concrete failure patterns observed while building multi-turn safety evaluations and safeguards prototypes, this memo outlines three mechanisms behind this underestimation and proposes lightweight evaluation design changes to surface these risks earlier.


1. A concrete failure mode that single-turn benchmarks miss

In one internal agentic prototype I tested, the model passed single-turn jailbreak and refusal benchmarks with near-perfect scores. When evaluated in isolation, it appeared robustly aligned and compliant with policy.

However, when embedded in a multi-step planning loop with partial observability, the system gradually reinterpreted user intent under pressure to be helpful.

The first two steps were benign clarifications. By step four, the model had accepted a reframed goal that violated policy constraints, even though no single message would have triggered a standalone safety classifier.

This was not a sudden jailbreak or explicit refusal failure. It was a quiet drift in how constraints were interpreted across turns.

From the perspective of standard benchmarks, nothing “failed.” From the perspective of deployed agentic behavior, the system crossed a safety boundary.
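To make the pattern concrete, here is a toy numerical sketch (all numbers and thresholds are hypothetical, chosen only to illustrate the mechanism): each message stays below a per-message classifier threshold, yet the cumulative drift crosses a trajectory-level boundary by the final turn.

```python
# Hypothetical thresholds: a standalone per-message classifier cutoff,
# and a cumulative drift boundary at the trajectory level.
PER_MESSAGE_THRESHOLD = 0.5
TRAJECTORY_THRESHOLD = 1.0

# Illustrative per-turn drift contributions from the reframing described above.
# No single value exceeds the per-message threshold.
turn_drift = [0.15, 0.25, 0.35, 0.45]

flagged_messages = [d for d in turn_drift if d > PER_MESSAGE_THRESHOLD]
cumulative_drift = sum(turn_drift)

print(flagged_messages)                          # [] -- no single message trips the classifier
print(cumulative_drift > TRAJECTORY_THRESHOLD)   # True -- the trajectory crossed the boundary
```

The point is not the specific numbers but the shape: a monitor that only ever sees one message at a time has no variable in which the boundary crossing can register.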


2. Why RLHF, policy design, and evaluation pipelines underestimate risk

(a) RLHF overfits to short-horizon, i.i.d. interactions

RLHF optimizes behavior under short-horizon interactions drawn from relatively stable distributions. In agentic systems, the distribution shifts as the model’s own outputs shape future inputs.

This creates a feedback loop: small epistemic errors early in a trajectory propagate forward and compound into policy violations over time.

The model is not merely responding to a user—it is responding to a world partially constructed by its own prior outputs.
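A minimal sketch of this compounding, under an assumed multiplicative error model (the dynamics and the 5% per-step bias are illustrative assumptions, not measurements): each step reinterprets the constraint slightly, conditioned on its own already-drifted prior interpretation.

```python
def run_trajectory(steps: int, per_step_bias: float = 0.05) -> float:
    """Return the fraction of the original constraint still respected
    after `steps` self-conditioned updates (toy multiplicative model)."""
    constraint_fidelity = 1.0
    for _ in range(steps):
        # Each step conditions on its own prior (already drifted) output,
        # so the small bias compounds rather than averaging out.
        constraint_fidelity *= (1.0 - per_step_bias)
    return constraint_fidelity

print(round(run_trajectory(1), 3))   # 0.95  -- looks fine in any single turn
print(round(run_trajectory(10), 3))  # 0.599 -- roughly 40% of the constraint eroded
```

A single-turn evaluation samples only the first line; the trajectory lives on the second.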

(b) Linguistic policy compliance is not behavioral constraint alignment

Safety policies are encoded linguistically. Violations often emerge behaviorally across trajectories.

A model can comply with the letter of a policy in isolation while eroding its spirit over multiple steps. This erosion rarely manifests as a single disallowed response. Instead, it appears as gradual reinterpretation of constraints under evolving context.

Single-turn evaluations are well suited to catching explicit refusal failures (a model answering a request it should decline). They are poorly suited to detecting slow degradation in how constraints are interpreted.

(c) Single-turn evaluation creates systematic blind spots

Benchmarks that score only final responses miss intermediate states where drift and policy reinterpretation occur. These intermediate states are often where safeguards could have intervened.

When evaluation is reduced to snapshots, failure modes that are temporal in nature become structurally invisible.


3. Why multi-turn + partial observability is the root of the problem

In multi-turn agentic settings, the model never observes the full state of the world. It operates on a partial, self-constructed belief state that is updated from its own prior outputs.

This creates two compounding effects:

(1) Errors compound forward. A small misinterpretation early in a trajectory enters the model's belief state and conditions every subsequent step.

(2) Drift is invisible from snapshots. Because no single turn looks anomalous in isolation, an evaluator scoring individual messages has no vantage point from which to see the trajectory bending.

Single-turn benchmarks implicitly assume static context and full observability. These assumptions do not hold in deployed agentic systems.

As a result, benchmarks systematically underestimate risk in exactly the regimes where agentic systems operate.


4. Three lightweight evaluation design changes

These are not research-heavy proposals. They are small shifts in how we structure evaluation that surface otherwise hidden risks.

(1) Trajectory-level scoring, not final-answer scoring

Track policy adherence, uncertainty, and intent drift over turns, not just end-state correctness. Plotting policy erosion curves often reveals degradation patterns that single-turn scores completely miss.
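A minimal sketch of what this can look like (the helper name and scores are illustrative, not an existing API): score adherence per turn and summarize the trajectory by its erosion relative to the first turn, rather than by the final answer alone.

```python
def erosion_curve(per_turn_adherence: list[float]) -> list[float]:
    """Drop in policy adherence relative to the first turn, per turn."""
    baseline = per_turn_adherence[0]
    return [baseline - score for score in per_turn_adherence]

# Hypothetical trajectory: a final-answer grader would see 0.80 ("mostly fine"),
# but the curve shows steady, monotonic erosion that the snapshot score hides.
adherence = [1.00, 0.97, 0.91, 0.86, 0.80]
curve = erosion_curve(adherence)

print([round(c, 2) for c in curve])  # [0.0, 0.03, 0.09, 0.14, 0.2]
```

Plotting these curves across many trajectories is what surfaces the degradation patterns described above.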

(2) Delayed failure detection

Explicitly measure how often violations occur only after k steps. Many safeguards appear robust in early turns and fail under gradual decomposition or reframing pressure.
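One way to operationalize this (the function and the sample data are hypothetical): record, for each evaluated trajectory, the first turn at which a violation occurs, then ask what fraction of violations a k-turn evaluation would never see.

```python
def delayed_failure_rate(first_violation_turns, k):
    """Fraction of trajectories whose first violation occurs only after turn k.

    `first_violation_turns` holds one entry per trajectory: the turn index of
    the first violation, or None if no violation was observed.
    """
    delayed = sum(1 for t in first_violation_turns if t is not None and t > k)
    return delayed / len(first_violation_turns)

# Hypothetical results from 10 trajectories (None = no violation observed).
first_violations = [None, 4, None, 7, 2, None, 5, None, 6, None]

print(delayed_failure_rate(first_violations, k=1))  # 0.5 -- single-turn misses half
print(delayed_failure_rate(first_violations, k=3))  # 0.4 -- even 3 turns miss these
```

Sweeping k produces a curve that directly answers "how many turns of evaluation do we need before delayed failures stop appearing?"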

(3) Red-teaming under partial observability

Stress-test agents in settings where critical context is hidden or revealed gradually, rather than fully specified upfront. This better approximates real deployment conditions and exposes failure modes that are invisible under fully specified prompts.
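A skeleton of such a harness (everything here is a hypothetical stand-in: `agent_respond`, the context fragments, and the policy check): reveal the scenario one fragment per turn and check adherence at every step, not just at the end.

```python
def run_partial_observability_probe(agent_respond, context_fragments, check_policy):
    """Feed context incrementally; return the first turn a policy check fails,
    or None if the agent holds the line for the whole trajectory."""
    revealed = []
    for turn, fragment in enumerate(context_fragments, start=1):
        revealed.append(fragment)          # the agent never sees future fragments
        response = agent_respond(revealed)
        if not check_policy(revealed, response):
            return turn                    # earliest turn a safeguard should fire
    return None

# Toy stand-ins so the harness is runnable end to end.
fragments = ["benign setup", "ambiguous detail", "disallowed goal revealed"]
agent = lambda ctx: "comply" if "disallowed" in ctx[-1] else "clarify"
policy_ok = lambda ctx, resp: resp != "comply"

print(run_partial_observability_probe(agent, fragments, policy_ok))  # 3
```

The same harness run with fully specified prompts (all fragments at once) is the control condition; the gap between the two is the partial-observability failure surface.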


5. Why this matters for safeguards

Safeguards that operate only at the input or final-output level will systematically miss these failure modes.

Effective safeguards need:

(1) Trajectory-level visibility: access to the full interaction history and intermediate states, not just the current input and final output.

(2) The ability to intervene mid-trajectory, at the intermediate steps where drift and reinterpretation actually occur.

(3) Drift-sensitive signals: detectors tuned to gradual reinterpretation of constraints, not only to explicitly disallowed content.

Without these, safeguards will appear robust in benchmarks while quietly failing in deployed agentic workflows.


6. Counter-Arguments: In Defense of Single-Turn Evaluation

A balanced view requires acknowledging the legitimate reasons single-turn benchmarks remain valuable.

(a) Cost efficiency

Multi-turn evaluation is typically 5-10x more expensive per sample. For budget-constrained teams, single-turn coverage may be the only feasible option, and a broad single-turn sweep can catch more issues than a narrow multi-turn deep-dive.

(b) Reproducibility

Single-turn benchmarks are easy to make deterministic and reproducible. Multi-turn trajectories compound sampling variance across turns, making results harder to compare across runs or teams.

(c) Most attacks are still single-turn

The majority of observed jailbreaks in the wild are single-turn prompt injections. Multi-turn attacks require more sophistication and are less common. Optimizing for the common case has merit.

(d) Baseline establishment

Single-turn benchmarks establish a necessary (if not sufficient) baseline: a model that fails single-turn safety tests will almost certainly fail multi-turn tests. Single-turn evaluation is a prerequisite, not a competitor.


7. Limitations of This Thesis

This memo has its own blind spots:

(a) Multi-turn evaluation introduces noise

Trajectory-level scoring is inherently noisier. Small perturbations in early turns can cascade into large outcome differences. This makes regression detection harder and increases false positive rates.

(b) Computational cost is real

A 10-turn evaluation costs roughly ten times the inference of a single-turn one. For models evaluated at scale, this may be prohibitive. The field needs cost-effective approximations to multi-turn evaluation.

(c) Defining “trajectory failure” is hard

Single-turn has clear success criteria. Multi-turn requires defining what constitutes a “failed trajectory” — a judgment that may vary by context, stakeholder, and harm domain.

(d) Adversarial multi-turn is still nascent

The attack surface for multi-turn is less well-mapped than single-turn. We may be optimizing for the wrong threats.


8. Open Questions for Future Work

This memo raises more questions than it answers:

  1. What’s the minimum trajectory length needed to detect most drift? Can we get 80% of multi-turn value with 3-turn evaluation?

  2. Can we distill trajectory-level signals into single-turn proxies? If so, we get the best of both worlds.

  3. How do we handle the cost-coverage tradeoff? Adaptive sampling? Importance weighting?

  4. When is multi-turn evaluation overkill? For non-agentic deployments, is single-turn sufficient?

  5. How do we standardize trajectory-level benchmarks? Without standardization, results are incomparable across teams.


9. Closing

If we continue to rely on single-turn safety benchmarks, we will keep being surprised by failures that were structurally invisible to our evaluations.

Agentic safety requires moving our evaluation lens from snapshots to trajectories.

But this shift must be pragmatic: acknowledging costs, tolerating noise, and recognizing that single-turn benchmarks remain a valuable (if incomplete) foundation.


This memo is based on empirical failure patterns observed while building multi-turn safety evaluations, misuse detection benchmarks, and safeguards simulators for agentic systems.