The Lowpass Dispatch

Vol. I, No. 2026-05-26 · Tuesday, May 26, 2026 · 131 articles surveyed

AGENTS

Agents Learn to Learn: Skill Optimizers and Evolving Memory Take Center Stage

The new frontier for AI agents is not just better base models, but self-improving "cognitive architectures" that learn and evolve their own skills and memory systems.

Agent capabilities are increasingly defined by their surrounding architecture, not just the base LLM. A wave of new research treats agent "skills" and "memory" as components to be systematically optimized and evolved, much like model weights during training.

The SkillOpt framework introduces a "text-space optimizer" that turns scored agent rollouts into edits on a skill document, accepting a change only if it improves the validation score. On GPT-5.5, it boosted accuracy by up to 24.8 points. Similarly, the PANDO framework uses "online skill distillation" to learn from web navigation trajectories, making agents more efficient over time, while DRIVE disentangles abstract "reasoning skills" from website-specific "interaction skills" to improve generalization.
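The accept-only-if-better gate described for SkillOpt can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `optimize_skill`, `propose_edit`, and `validate` are hypothetical names, the proposer is a random stand-in for whatever model reads the scored rollouts, and the toy validator is not SkillOpt's actual scoring.

```python
import random
from typing import Callable

# Hypothetical toy setup: a "skill document" is newline-separated tips.
USEFUL = {"check the page title", "retry on timeout", "confirm before submit"}
POOL = sorted(USEFUL | {"click randomly", "ignore errors"})


def optimize_skill(
    skill: str,
    propose_edit: Callable[[str], str],
    validate: Callable[[str], float],
    steps: int = 20,
) -> str:
    """Greedy text-space search: keep an edit only if validation improves."""
    best_score = validate(skill)
    for _ in range(steps):
        candidate = propose_edit(skill)
        score = validate(candidate)
        if score > best_score:  # strict improvement gate
            skill, best_score = candidate, score
    return skill


def toy_validate(skill: str) -> float:
    """Assumed scorer: +1 per distinct useful tip, -1 per distinct harmful tip."""
    tips = {t for t in skill.split("\n") if t}
    return sum(1.0 if t in USEFUL else -1.0 for t in tips)


def make_toy_proposer(seed: int = 0) -> Callable[[str], str]:
    """Stand-in for an LLM editor: appends one random tip from the pool."""
    rng = random.Random(seed)

    def propose(skill: str) -> str:
        return skill + "\n" + rng.choice(POOL)

    return propose


final_skill = optimize_skill("", make_toy_proposer(), toy_validate, steps=50)
```

Because the gate is strict, harmful or redundant edits never enter the document in this toy setup, so the validation score can only stay flat or rise.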

This evolutionary pressure extends to memory. The M* framework automatically discovers task-optimized "memory harnesses" (Python programs defining data schemas and logic) through code evolution. MemSkill takes a similar approach, evolving a set of memory skills by analyzing and refining routines that failed on hard cases.

This marks a shift from hand-crafting agent control flows to creating systems that learn their own. The focus is moving up a level of abstraction: from prompting for a task to building optimizers that discover the best prompts and memory structures for a whole class of tasks.

Sources: SkillOpt: Executive Strategy for Self-Evolving Agent Skills · PANDO: Efficient Multimodal AI Agents via Online Skill Distillation · M*: Every Task Deserves Its Own Memory Harness · MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents · DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning

EVALUATION

Beyond Accuracy: The Push for Process-Level and Audited AI Benchmarks

As frontier models saturate leaderboards, researchers are shifting focus from final-answer correctness to evaluating the quality of the reasoning process and even auditing the benchmarks themselves.

Top AI models are acing benchmarks like MMLU, but these scores no longer reliably separate models by capability. A new paper argues leaderboards are suffering from "growing pains," forcing a move beyond simple accuracy metrics. The new frontier of evaluation focuses on how a model arrives at an answer, not just the answer itself.

Researchers are proposing multi-dimensional frameworks that measure reasoning along axes like consistency, robustness, and logical coherence. One study found that logical coherence was only weakly, and slightly negatively, correlated with correctness (r = -0.172), indicating that right answers can come from flawed reasoning. ProcCtrlBench evaluates agents on "process-level defects," standardizing execution logs to find errors that outcome-based tests miss.
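For readers who want to sanity-check a figure like r = -0.172, here is a small sketch of the Pearson correlation between a coherence rating and a 0/1 correctness flag. The six data points are invented for illustration and are not from the study. As a rule of thumb, r = -0.172 gives r² ≈ 0.03, so coherence would account for roughly 3% of the variance in correctness.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for six answers (not data from the study):
# a coherence rating in [0, 1] and whether the final answer was correct.
coherence = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
correct = [1, 0, 0, 1, 1, 0]

r = pearson_r(coherence, correct)
variance_explained = r ** 2  # share of variance in one score tracked by the other
```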

The most significant shift is the auditing of benchmarks. The Auto Benchmark Audit (ABA) framework uses an agent to systematically probe benchmark tasks for flaws. It found critical issues like ambiguous design and incorrect ground truths in over 25.7% of tasks across 168 benchmarks. Filtering these flawed tasks changed model rankings on SWE-bench and Terminal-Bench, boosting average performance by over 9%.
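To see why removing flawed tasks can reorder a leaderboard, consider this toy sketch. The models, task IDs, and pass/fail results are all hypothetical, and `pass_rate` and `rank` are illustrative helpers rather than part of ABA.

```python
from typing import AbstractSet, Dict, List


def pass_rate(results: Dict[str, bool], exclude: AbstractSet[str] = frozenset()) -> float:
    """Fraction of non-excluded tasks that a model passed."""
    kept = [ok for task, ok in results.items() if task not in exclude]
    return sum(kept) / len(kept)


def rank(models: Dict[str, Dict[str, bool]], exclude: AbstractSet[str] = frozenset()) -> List[str]:
    """Model names sorted from highest to lowest pass rate."""
    return sorted(models, key=lambda m: pass_rate(models[m], exclude), reverse=True)


# Hypothetical per-task results; t4 and t5 stand in for tasks an audit flagged
# (for example, an incorrect ground truth that penalizes a correct solution).
results = {
    "model_a": {"t1": True, "t2": True, "t3": True, "t4": False, "t5": False},
    "model_b": {"t1": True, "t2": False, "t3": True, "t4": True, "t5": True},
}
flagged = {"t4", "t5"}

before = rank(results)                  # model_b leads on the full task set
after = rank(results, exclude=flagged)  # model_a leads once flagged tasks are dropped
```

In this example, model_a's failures fall entirely on the flagged tasks, so filtering raises its score from 3/5 to 3/3 while model_b drops from 4/5 to 2/3.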

For engineers building with AI, this means benchmark scores should be treated with skepticism. The real measure of a model's utility lies in the reliability of its reasoning process. The next generation of evaluation will provide more nuanced, process-oriented report cards, giving developers a clearer picture of a model's true capabilities and failure modes.

TRAINING

RL Gets Practical: Agents Learn from Terminal Output and Self-Identified Errors

Reinforcement learning for agents is maturing from a research curiosity into a set of practical engineering techniques focused on creating dense, high-quality learning signals.

Reinforcement learning (RL) has long promised to improve AI agents, but has been hampered by sparse rewards and unstable training. A new wave of research is making RL more practical by finding clever ways to generate dense, informative training signals from agent interactions, moving beyond simple outcome-based rewards.

The ECHO framework demonstrates this shift for command-line interface (CLI) agents. It argues that standard RL ignores a rich, free source of supervision: the terminal's output stream. ECHO adds an auxiliary training objective that teaches the agent to predict the `stdout` and `stderr` from its own commands. This turns every rollout, even a failed one, into a dense learning opportunity. On Terminal-Bench 2.0, ECHO doubled the pass rate for Qwen3 models.
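One generic way to combine an RL objective with an output-prediction objective is a weighted sum, sketched below. This is an assumption about the general shape of such an auxiliary loss, not ECHO's published formulation: the function names, the per-token probability dictionaries, and the `aux_weight` value of 0.1 are all hypothetical.

```python
import math
from typing import Dict, List


def output_prediction_loss(predicted: List[Dict[str, float]], observed: List[str]) -> float:
    """Mean negative log-likelihood of the observed stdout/stderr tokens.

    predicted[i] maps candidate tokens to the probability the model assigned
    at position i; observed[i] is the token the terminal actually printed.
    """
    eps = 1e-12  # floor to avoid log(0) when the model gave a token no mass
    nll = [-math.log(max(dist.get(tok, 0.0), eps)) for dist, tok in zip(predicted, observed)]
    return sum(nll) / len(nll)


def combined_loss(policy_loss: float, aux_loss: float, aux_weight: float = 0.1) -> float:
    """Assumed form: RL policy loss plus a weighted auxiliary prediction loss."""
    return policy_loss + aux_weight * aux_loss


# Hypothetical failed rollout: the command errored, yet its stderr still
# supplies a training signal because the model must predict those tokens.
observed_stderr = ["ls:", "cannot", "access"]
model_predictions = [
    {"ls:": 0.5, "total": 0.5},
    {"cannot": 0.5, "no": 0.5},
    {"access": 0.5, "open": 0.5},
]

aux = output_prediction_loss(model_predictions, observed_stderr)
total = combined_loss(policy_loss=1.0, aux_loss=aux)
```

The point of the sketch is that the auxiliary term is nonzero even when the task reward is zero, which is what makes failed rollouts informative.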

Other methods focus on improving credit assignment. Self-Reset Policy Optimization (SRPO) trains a model to self-diagnose the first erroneous step in a failed reasoning chain, "reset" the trajectory to that point, and then learn from resampled continuations. This attributes failure more precisely than updating the entire trajectory. Another line of work, OASES, co-trains the agent's policy with an evaluator model that provides dense, outcome-aligned process rewards.
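The reset-and-resample idea described for SRPO can be sketched on a toy arithmetic chain. The helper names, the string step format, and the fixed "resampled" continuation are hypothetical; in particular, `toy_is_wrong` is a rule-based stand-in for the model's own self-diagnosis, and no policy update is shown.

```python
from typing import Callable, List, Optional

Steps = List[str]


def first_error_index(steps: Steps, is_wrong: Callable[[Steps, int], bool]) -> Optional[int]:
    """Index of the first step flagged as erroneous, or None if none is."""
    for i in range(len(steps)):
        if is_wrong(steps, i):
            return i
    return None


def reset_and_resample(
    steps: Steps,
    is_wrong: Callable[[Steps, int], bool],
    sample_continuation: Callable[[Steps], Steps],
    n_samples: int = 4,
) -> List[Steps]:
    """Truncate at the first bad step, then draw fresh continuations."""
    idx = first_error_index(steps, is_wrong)
    if idx is None:
        return []
    prefix = steps[:idx]  # keep only the steps before the first error
    return [prefix + sample_continuation(prefix) for _ in range(n_samples)]


def toy_is_wrong(steps: Steps, i: int) -> bool:
    """Stand-in diagnosis: a step 'a+b=c' is wrong if a + b != c."""
    lhs, rhs = steps[i].split("=")
    a, b = lhs.split("+")
    return int(a) + int(b) != int(rhs)


def toy_continue(prefix: Steps) -> Steps:
    """Stand-in for sampling from the policy; returns a fixed continuation."""
    return ["3+4=7", "7+5=12"]


# The second step is the first error; the third is locally valid but inherits it.
failed_chain = ["1+2=3", "3+4=8", "8+5=13"]
reset_point = first_error_index(failed_chain, toy_is_wrong)
rollouts = reset_and_resample(failed_chain, toy_is_wrong, toy_continue)
```

Only the tokens after the reset point differ between the failed chain and the resampled ones, which is what localizes the learning signal to the step that went wrong.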

These techniques make RL a more viable tool for agent developers. By extracting learning signals from environmental feedback and counterfactual rollouts, they address the credit assignment problem that has historically plagued RL for reasoning tasks. This enables more targeted and efficient agent improvement without requiring massive, hand-annotated datasets.

RESEARCH

Cracking the Black Box: Probes Reveal Internal Tool Graphs and Reasoning Phases

Researchers are beginning to decode the internal states of LLMs, finding that models build explicit representations of their own reasoning and undergo hidden "phase transitions" as they scale.

The inner workings of large language models are becoming less mysterious. New research using "structural probes" is demonstrating that LLMs build internal, decodable representations of complex structures, offering a glimpse into how they reason and plan.

A study on Qwen3-32B found that the dependency graph between an agent's tool calls is "linearly decodable" from the model's residual stream. A simple probe trained on the model's internal activations could reconstruct which tool outputs were used as inputs for later calls. This suggests the model isn't just chaining text, but representing an abstract execution plan internally.
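"Linearly decodable" means a linear classifier on the activations is enough to recover the property. The sketch below trains a logistic-regression probe on synthetic vectors rather than real Qwen3-32B activations. The 4-dimensional features, the hidden direction, and the label ("does call j consume call i's output?") are all assumptions chosen to keep the example self-contained.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(xs: List[Vector], ys: List[int], lr: float = 0.5, epochs: int = 200) -> Vector:
    """Fit a bias-free logistic-regression probe with full-batch gradient descent."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = sigmoid(dot(w, x)) - y
            for k, xk in enumerate(x):
                grad[k] += err * xk
        w = [wk - lr * gk / len(xs) for wk, gk in zip(w, grad)]
    return w


def accuracy(w: Vector, xs: List[Vector], ys: List[int]) -> float:
    hits = sum((dot(w, x) > 0) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)


def make_data(n: int, rng: random.Random) -> Tuple[List[Vector], List[int]]:
    """Synthetic 'activations' whose label depends on one hidden direction."""
    hidden = [1.0, -2.0, 0.5, 0.0]  # assumed direction encoding the dependency
    xs = [[rng.gauss(0.0, 1.0) for _ in hidden] for _ in range(n)]
    ys = [1 if dot(hidden, x) > 0 else 0 for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(100, rng)
probe = train_probe(train_x, train_y)
probe_accuracy = accuracy(probe, test_x, test_y)  # held-out accuracy of the probe
```

High held-out accuracy from such a simple model is the evidence that the information is present in a linearly accessible form; a probe that needs a deep network to succeed supports a much weaker claim.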

Another study reveals a hidden "phase transition" in how models learn. Across 63 base models, researchers found that below a critical scale of roughly 3.5 billion parameters, truthfulness and reasoning capabilities are anti-correlated. Above this threshold, they become positively correlated. This transition, invisible to standard loss curves, suggests a fundamental shift in the model's internal organization as it scales.

These findings mark a step toward mechanistic interpretability for AI agents. Understanding how models represent plans and how their internal dynamics change with scale could lead to more reliable debugging, better steering of model behavior, and more predictable performance. For engineers, it is an early step away from treating models as black boxes and toward understanding them as complex computational systems.

AGENTS

The Illusion of Agreement: Multi-Agent Systems Risk 'False Consensus'

The hype around multi-agent AI is meeting a harsh reality: new research shows that these systems can create a "false consensus" by amplifying errors and hiding uncertainty.

Assembling teams of AI agents to tackle complex problems is a popular new pattern, but it introduces subtle and dangerous failure modes. A new audit of medical multi-agent systems warns of "false consensus," where agents converge on a confident but incorrect answer, a risk that standard accuracy benchmarks completely miss.

The MedAgentAudit framework analyzed thousands of multi-agent execution logs and identified recurring failures. In discussion, agents repeated their initial views in 98.42% of cases rather than re-examining evidence. Unsupported observations made by one agent affected 16.63% of cases and were propagated downstream without verification. The agents created an illusion of peer review while actually reinforcing initial mistakes.
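A metric of this kind can be computed directly from discussion logs. The sketch below is one possible definition, not MedAgentAudit's: the `Turn` schema is hypothetical, and a case counts as "repeated" only if every agent ends the discussion with the same answer it started with.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    """One logged utterance (hypothetical schema)."""
    agent: str
    round: int
    answer: str


def repeated_initial_view(turns: List[Turn]) -> bool:
    """True if every agent's final answer equals its first answer."""
    first: Dict[str, str] = {}
    last: Dict[str, str] = {}
    for t in sorted(turns, key=lambda t: t.round):
        first.setdefault(t.agent, t.answer)
        last[t.agent] = t.answer
    return all(first[a] == last[a] for a in first)


def repetition_rate(cases: List[List[Turn]]) -> float:
    """Share of cases in which no agent changed its position."""
    return sum(repeated_initial_view(c) for c in cases) / len(cases)


# Two invented cases: in the first nobody updates; in the second one agent does.
case_static = [
    Turn("cardiology", 1, "diagnosis A"), Turn("radiology", 1, "diagnosis A"),
    Turn("cardiology", 2, "diagnosis A"), Turn("radiology", 2, "diagnosis A"),
]
case_revised = [
    Turn("cardiology", 1, "diagnosis A"), Turn("radiology", 1, "diagnosis B"),
    Turn("cardiology", 2, "diagnosis B"), Turn("radiology", 2, "diagnosis B"),
]

rate = repetition_rate([case_static, case_revised])
```

Note that both toy cases end in unanimous agreement, yet only the second involved any agent revising its view. Final-answer agreement alone cannot tell the two apart, which is why the audit has to look at the process.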

These collaborative failures are compounded by training challenges. A separate study on multi-agent reinforcement learning found that training is often unstable. Depending on the workflow structure and whether agents share a single policy, they can suffer "terminal accuracy cliffs," where performance suddenly collapses after a period of improvement.

For engineers building multi-agent systems, the takeaway is that collaboration is not an automatic capability. It must be explicitly designed for and audited. Simply connecting agents is not enough; developers need to build in mechanisms for genuine critique, evidence verification, and uncertainty tracking to avoid the trap of a confident, collective, and incorrect conclusion.