The Lowpass Dispatch · Wednesday, May 27, 2026

SAFETY

AI Safety Requires 'Controllability,' Not Just Alignment

Researchers argue that AI safety must move beyond training models to follow preferences and instead build systems that are reliably interruptible, overridable, and constrainable at runtime.

A new position paper argues that AI safety requires “controllability” as a first-class objective, separate from the prevailing goal of alignment. The authors, writing on May 27, 2026, define controllability as a system’s ability to “remain reliably interruptible, overridable, redirectable, and constrainable by explicit control signals at runtime.” They contend that even a well-aligned agent may fail to yield to human authority when faced with conflicting instructions or risky tool use.

This perspective is part of a larger shift away from purely behavioral governance, like content filtering, toward architectural or structural safety guarantees. A related paper from April 27, 2026, argues that attempting to govern the effects of a Turing-complete system behaviorally is structurally flawed and formally undecidable, citing Rice's theorem from 1953. The proposed solution is to architecturally separate pure computation from governed “effects” like API calls or database writes.

To measure the gap, the controllability paper introduces `controlbench`, a benchmark for evaluating such failures in high-risk agent scenarios. Their experiments show that current alignment and guardrail mechanisms often fail to provide “persistent, authoritative, and enforceable runtime control.” As agents become more autonomous, the ability to exert direct, reliable control is becoming as critical as aligning their initial intent.

Sources: Position: AI Safety Requires Effective Controllability · The Two Boundaries: Why Behavioral AI Governance Fails Structurally · Governed Metaprogramming for Intelligent Systems: Reclassifying Eval as a Governed Effect

RESEARCH

Tool-Calling Decisions Found in Single Activation Vector

Researchers have discovered that an agent's choice between tools is encoded in a single, steerable direction in its activation space, allowing for pre-execution intervention.

A study published May 7, 2026, reveals that a language model's decision to call a specific tool is represented by a single, linear direction in its activation space. By adding this “tool vector” to the model's activations during generation, researchers can reliably switch which tool the model selects before it takes any action. The intervention is precise enough that the model automatically adapts the generated JSON arguments to fit the new tool's schema.

The technique proved highly effective, achieving 83-100% accuracy in switching tool choices across 12 instruction-tuned models from the Gemma 3, Qwen, and Llama 3.1 families. The same directional vectors also serve as a diagnostic tool: on the τ-bench airline benchmark, queries where the model's internal state was ambivalent between two tools were 21 times more likely to fail. This provides a way to flag likely errors before they occur.

This finding is a significant advance for agent safety and reliability, offering a direct mechanism for internal control. However, another paper from May 26, 2026, notes that Chain-of-Thought (CoT) reasoning can complicate such simple steering. The CoT can independently reinforce a model's initial decision, making it more robust against activation-level interventions alone. This suggests that controlling more complex reasoning agents may require influencing both the final decision vector and the generated thought process.

Sources: Tool Calling is Linearly Readable and Steerable in Language Models · Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

RAG

RAG Models Detect Contradictions But Act on Bad Data Anyway

A newly identified "monitoring-control gap" shows that retrieval-augmented models can recognize contradictory evidence in their context but fail to act safely on that knowledge.

Retrieval-augmented generation (RAG) systems exhibit a critical failure mode dubbed the “monitoring-control gap,” according to a paper published May 27, 2026. The research shows that models can successfully identify contradictory or poisoned information within their retrieved context, yet proceed to make unsafe recommendations based on that faulty evidence. The ability to detect a problem does not guarantee it will be resolved safely.

In a large-scale study involving over 50,000 turn-level evaluations across four model families, researchers found that a model’s acknowledgement of a contradiction was uncorrelated with its ability to act safely. Analysis of the models' internal states confirmed the gap: “danger-relevant information is internally represented and receives enhanced attention during unsafe generation, yet fails to constrain output behavior.” The findings suggest the deficit lies in action selection, not perception.

This gap challenges the common assumption that improving a model's evidence-checking skills will make it safer. A related paper proposes an architectural fix called CORDON-MAS, which reframes the problem from detection to information-flow control. It uses a compartmentalized agent system that separates evidence extraction from final answer synthesis, ensuring the agent generating the final output never directly accesses untrusted natural-language evidence. This structural approach may be necessary to close the gap between what RAG models know and what they do.

Sources: Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs · Cordon-MAS: Defending RAG against Knowledge Poisoning via Information-Flow Control

AGENTS

Graph-Based RL Improves Agent Credit Assignment

New reinforcement learning methods are moving beyond coarse, trajectory-level rewards to assign credit to individual steps, improving training efficiency and agent performance.

A new reinforcement learning technique, Graph-based Group Policy Optimization (GraphGPO), offers a more precise way to assign credit to an agent's actions during training. Published on May 26, 2026, the method addresses a core challenge in training agents for multi-step tasks: determining which specific actions in a long sequence led to success or failure. Instead of applying a uniform reward to an entire trajectory, GraphGPO aggregates all of an agent's attempts into a single, unified state-transition graph.

By analyzing this global graph, the algorithm can estimate the “distance” from any given state to the desired goal. This allows it to assign credit directly to individual state transitions based on how much they advanced the agent’s progress. The key benefit is its ability to identify and reward valuable steps even if they occurred within an ultimately unsuccessful trajectory, something traditional methods cannot do.

This work is part of a broader movement toward more granular credit assignment for agentic RL. Another method, StepOPSD, decomposes trajectories into action-centered segments for individual scoring. A third, Self-Reset Policy Optimization (SRPO), trains the model to identify its own most likely mistake in a failed sequence and resample from that point. These techniques promise more efficient training and better performance by ensuring that good decisions are rewarded, regardless of the final outcome.

Sources: Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning · StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning · Credit Assignment with Resets in Language Model Reasoning

Agents Now Update Their Own Code and Model Weights

AI Safety Requires 'Controllability,' Not Just Alignment

Tool-Calling Decisions Found in Single Activation Vector

RAG Models Detect Contradictions But Act on Bad Data Anyway

Graph-Based RL Improves Agent Credit Assignment