The Lowpass Dispatch

Vol. I, No. 2026-05-21 · Thursday, May 21, 2026 · 84 articles surveyed

EVALUATION

Frontier Model Leaderboards Are Saturating

As top models cluster on benchmarks like SWE-bench, researchers argue for new evaluation methods that resist contamination and measure capability trade-offs.

A new analysis finds that top AI models are reaching performance ceilings on popular benchmarks, making them harder to tell apart. A May 2026 paper, "The Growing Pains of Frontier Models," reports that coding and reasoning capabilities generally improve together (r = +0.72), though some labs show trade-offs between the two. The paper decomposes scores on SWE-bench and GPQA Diamond and concludes that SWE-bench is saturating and no longer separates frontier models effectively.
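For readers who want to see what a figure like r = +0.72 measures, here is a minimal sketch of a Pearson correlation between two sets of benchmark scores, plus a simple "top-k spread" as a rough saturation signal. The score lists are placeholder numbers invented for illustration, not data from the paper, and the helper names pearson_r and top_k_spread are our own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def top_k_spread(scores, k=3):
    """Gap between the best and the k-th best score.

    A small gap means the leaders are hard to tell apart on this benchmark.
    """
    ranked = sorted(scores, reverse=True)[:k]
    return ranked[0] - ranked[-1]


# Placeholder numbers for illustration only; NOT taken from the paper.
coding = [62.0, 70.5, 74.0, 78.5, 79.0, 79.5]
reasoning = [55.0, 60.0, 71.0, 69.0, 80.0, 77.0]

r = pearson_r(coding, reasoning)
spread = top_k_spread(coding, k=3)
print(f"r = {r:+.2f}, top-3 spread = {spread:.1f} points")
```

A high positive r says the two capabilities tend to rise together across models. A shrinking top-k spread is one simple way to notice that a benchmark has stopped separating the leaders.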

This saturation is compounded by data contamination. A separate May 2026 paper argues that many benchmarks are included in pretraining data, turning evaluations into tests of memorization, not generalization. The authors call for "contamination-resistant" datasets that are structured to be unlearnable during training but still support inference, leveraging architectural asymmetries in transformers.

In response, researchers are developing frameworks to automatically generate more robust, fine-grained benchmarks from reference materials like textbooks. One such system uses a multi-agent architecture to create problems with broad coverage and rich metadata, producing ground-truth solutions with a lower error rate than benchmarks like MMLU and GSM8K. These next-generation evaluations aim to restore a clear signal of model progress.

Sources: The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next · LLM Benchmark Datasets Should Be Contamination-Resistant · Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

AGENTS

Researchers Call for Systems-Level Agent Security

A new wave of research argues that treating the LLM as an untrusted component is the only way to build secure and reliable agentic systems.

Security researchers are advocating for a fundamental shift in how AI agents are built, urging developers to enforce security at the system level rather than relying on model robustness alone. A May 2026 position paper titled "Agent Security is a Systems Problem" argues that the AI model must be treated as an "untrusted component," with security invariants enforced by the surrounding architecture.

The authors apply principles from decades of systems security research, analyzing 11 real-world attacks on agents and showing how system-level controls could have prevented them. This approach moves beyond simply trying to make the model itself refuse harmful requests, which has proven insufficient as agent capabilities expand.
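To make the "untrusted component" idea concrete, here is a minimal sketch of a policy gate that sits between the model and its tools. It is an illustration under our own assumptions, not the paper's design: ToolCall, Policy, enforce, and the http_get tool are hypothetical names. Every action the model proposes is checked by ordinary code before anything runs.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class ToolCall:
    """An action proposed by the model. Treated as untrusted input."""
    tool: str
    args: dict = field(default_factory=dict)


@dataclass
class Policy:
    """Invariants enforced by the runtime, independent of the model."""
    allowed_tools: set
    allowed_hosts: set


class PolicyViolation(Exception):
    """Raised when a proposed action breaks a runtime invariant."""


def enforce(call: ToolCall, policy: Policy) -> ToolCall:
    """Reject any call that falls outside the policy; return it otherwise."""
    if call.tool not in policy.allowed_tools:
        raise PolicyViolation(f"tool not permitted: {call.tool}")
    if call.tool == "http_get":  # hypothetical tool name
        host = urlparse(call.args.get("url", "")).hostname
        if host not in policy.allowed_hosts:
            raise PolicyViolation(f"host not permitted: {host}")
    return call


def run_agent_step(call: ToolCall, policy: Policy, tools: dict):
    """Execute a model-proposed call only after it passes the gate."""
    checked = enforce(call, policy)
    return tools[checked.tool](**checked.args)


# Stub tool and policy for demonstration.
tools = {"http_get": lambda url: f"fetched {url}"}
policy = Policy(allowed_tools={"http_get"}, allowed_hosts={"docs.example.com"})

print(run_agent_step(ToolCall("http_get", {"url": "https://docs.example.com/a"}), policy, tools))
```

The point of the design is that the check does not depend on the model refusing anything. A prompt-injected request to an unlisted host fails at the gate even if the model was fully persuaded to make it.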

This systems-first approach is appearing in new tools. PrivScope, a "payload governor," runs on-device to control what sensitive information is sent to a cloud LLM, enforcing task-scoped disclosure. Another system, ESLD, speeds up prompt-injection defense by more than 3x by reading signals directly from a guard model's latent space, treating the guard as one component in a larger security architecture.

As agents gain more autonomy and access to tools and private data, the view emerging from this work is that security must be built into the agent's runtime environment, not bolted on as a prompt instruction.

RESEARCH

Agents Learn to Generate Their Own Memory

A new framework called Mem-π teaches agents to generate context-specific guidance on demand, outperforming traditional retrieval-based memory systems.

Researchers are moving beyond simple retrieval for agent memory, developing systems where agents learn to generate their own guidance. A May 2026 paper introduces Mem-π, a framework that uses a dedicated language model to produce context-specific advice for a separate downstream agent, rather than retrieving it from an external memory store.

Unlike systems that retrieve static entries from a memory bank, Mem-π's "memory model" is trained with reinforcement learning to decide both when to generate guidance and what guidance to produce. It learns to abstain if guidance would not be helpful and otherwise generates concise instructions tailored to the agent's current situation.
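The interface this implies can be sketched in a few lines: a memory model takes the agent's current context and returns either guidance or nothing. The sketch below is an assumption-laden illustration, not Mem-π's implementation. In particular, stub_memory_model is a hand-written rule standing in for the RL-trained model, and build_agent_prompt is a hypothetical helper.

```python
from typing import Callable, Optional

# A memory model maps the agent's current context to guidance,
# or returns None to abstain.
MemoryModel = Callable[[str], Optional[str]]


def stub_memory_model(context: str) -> Optional[str]:
    """Rule-based stand-in for a trained memory model (hypothetical).

    A learned model would make both decisions itself: whether to speak,
    and what to say.
    """
    if "login form" in context.lower():
        return "Fill the username field before the password field, then submit."
    return None  # abstain: no guidance is better than unhelpful guidance


def build_agent_prompt(task: str, observation: str, memory_model: MemoryModel) -> str:
    """Assemble the downstream agent's prompt, adding guidance only if offered."""
    context = f"Task: {task}\nObservation: {observation}"
    guidance = memory_model(context)
    if guidance is None:
        return context
    return f"{context}\nGuidance: {guidance}"


print(build_agent_prompt("Sign in to the site", "A login form is visible.", stub_memory_model))
print(build_agent_prompt("Sign in to the site", "The page is still loading.", stub_memory_model))
```

Note that nothing here is looked up in a store. The contrast with retrieval is that guidance is produced fresh for each context, and the abstain path keeps the agent's prompt uncluttered when there is nothing useful to add.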

According to the paper, this generative approach consistently outperforms retrieval-based methods. On complex web navigation tasks, Mem-π achieves a relative improvement of more than 30% over baselines, and it also performs strongly on terminal-based tool use and text-based embodied interaction benchmarks.

Generative memory offers a more dynamic and adaptive alternative to retrieval-augmented generation (RAG). Instead of being limited by the contents of an external database, agents can synthesize novel strategies and knowledge, potentially leading to more flexible and capable systems.

TOOLS

GraphRAG Gets Smarter With Retrieval-Time Reasoning

New frameworks like DotRAG are reformulating retrieval as a reasoning process, guiding graph exploration with query-specific logic instead of just fetching and then thinking.

The standard "retrieve-then-reason" paradigm for Retrieval-Augmented Generation is being challenged by new methods that integrate reasoning directly into the retrieval process. A May 2026 paper on DotRAG, a training-free framework, shows that making retrieval itself a reasoning task improves performance on complex, multi-hop questions by avoiding the accumulation of irrelevant context.

DotRAG treats retrieval as a process of reasoning along paths in a knowledge graph. It generates query-conditioned constraints to guide the search, pruning irrelevant sections of the graph and discovering relational paths iteratively. The framework achieved state-of-the-art performance on the MetaQA and UltraDomain benchmarks, with consistent gains on multi-hop tasks.
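A toy version of constraint-guided path search shows the general idea. This is a sketch under our own assumptions, not DotRAG's algorithm: the graph, the entity names, and the relation_plan are invented, and the plan is written by hand here where a real system would derive constraints from the query.

```python
from collections import defaultdict


def build_index(triples):
    """Index (head, relation, tail) triples by head entity."""
    index = defaultdict(list)
    for head, relation, tail in triples:
        index[head].append((relation, tail))
    return index


def constrained_paths(index, start, relation_plan):
    """Expand paths hop by hop, keeping only edges allowed at each hop.

    relation_plan is a list of sets: relation_plan[i] holds the relations
    permitted on hop i. Edges outside the set are pruned immediately, so
    irrelevant branches never enter the context.
    """
    frontier = [[start]]
    for allowed in relation_plan:
        next_frontier = []
        for path in frontier:
            for relation, tail in index.get(path[-1], []):
                if relation in allowed:
                    next_frontier.append(path + [relation, tail])
        frontier = next_frontier
    return frontier


# Invented toy graph for illustration.
triples = [
    ("Film A", "directed_by", "Director X"),
    ("Film A", "starred", "Actor Y"),
    ("Director X", "directed", "Film B"),
    ("Actor Y", "starred_in", "Film C"),
    ("Film B", "released_in", "1999"),
]
index = build_index(triples)

# Query: "Which other films did the director of Film A direct?"
# Hand-written plan: hop 1 follows directed_by, hop 2 follows directed.
plan = [{"directed_by"}, {"directed"}]
for path in constrained_paths(index, "Film A", plan):
    print(" -> ".join(path))
```

Without the plan, a two-hop expansion from Film A would also pull in the actor branch. With it, only the director path survives, which is the pruning effect the article describes.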

This is part of a broader trend toward more structured retrieval. Other new systems like ContextRAG build graph topologies without expensive LLM-based entity extraction, using fuzzy logic to induce relationships. Agentic GraphRAG applies collaborative agents to navigate complex financial data stored in knowledge graphs, combining structured queries with reflection loops.

By pushing reasoning into the retrieval step, these systems make RAG more efficient and accurate for questions that require connecting multiple pieces of information. This moves beyond simple semantic search to a more structured, logical exploration of knowledge.

BENCHMARKS

New Benchmarks Test Agents on Long, Realistic Workflows

A new generation of benchmarks like RoadmapBench and CHI-Bench is moving beyond single-file bug fixes to evaluate agents on complex, multi-file, policy-rich tasks.

AI agent evaluation is shifting from simple, single-issue tasks to long-horizon workflows that better reflect real-world complexity. New benchmarks introduced in May 2026, including RoadmapBench for coding and CHI-Bench for healthcare, show that even frontier models struggle with these challenges.

RoadmapBench evaluates agents on 115 tasks based on open-source version upgrades, requiring changes across a median of 51 files and 3,700 lines of code. The top-performing model, Claude-Opus-4.7, resolved only 39.1% of tasks, or about 45 of the 115. That is a stark contrast to high scores on simpler bug-fix benchmarks, and it suggests long-horizon software development remains largely unsolved.

The challenge is even greater in policy-rich domains. CHI-Bench tests agents on healthcare workflows like prior authorization, requiring them to navigate a 1,290-document handbook and interact with 20 simulated apps. The best agent solved only 28% of tasks, highlighting major gaps in handling complex rules and multi-step, multi-role interactions.

These benchmarks reveal that current agent capabilities are brittle when faced with the scale and interdependencies of real engineering and enterprise work. They provide a clearer roadmap for what needs to improve: long-term planning, context management, and reasoning over complex constraints.