The Lowpass Dispatch · Thursday, May 21, 2026

AGENTS

Researchers Call for Systems-Level Agent Security

A new wave of research argues that treating the LLM as an untrusted component is the only way to build secure and reliable agentic systems.

Security researchers are advocating for a fundamental shift in how AI agents are built, urging developers to enforce security at the system level rather than relying on model robustness alone. A May 2026 position paper titled "Agent Security is a Systems Problem" argues that the AI model must be treated as an "untrusted component," with security invariants enforced by the surrounding architecture.

The authors apply principles from decades of systems security research, analyzing 11 real-world attacks on agents and showing how system-level controls could have prevented them. This approach moves beyond simply trying to make the model itself refuse harmful requests, which has proven insufficient as agent capabilities expand.
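To make the idea concrete, here is a minimal sketch of a system-level gate that checks every tool call against a fixed policy before anything runs. It is an illustration, not the paper's design: the `ToolCall` type, the tool names, the `/workspace/` root, and the `example.com` allowlist are all hypothetical.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """A tool invocation proposed by the model (hypothetical shape)."""
    tool: str
    args: dict


def _path_in_workspace(args: dict) -> bool:
    # Normalize first so "/workspace/../etc/passwd" cannot slip through.
    path = posixpath.normpath(str(args.get("path", "")))
    return path.startswith("/workspace/")


def _recipient_allowed(args: dict) -> bool:
    return str(args.get("to", "")).endswith("@example.com")


# Hypothetical policy: tools not listed here are denied outright.
POLICY = {
    "read_file": _path_in_workspace,
    "send_email": _recipient_allowed,
}


def authorize(call: ToolCall) -> bool:
    """Return True only if the call satisfies the policy, whatever the model intended."""
    check = POLICY.get(call.tool)
    return check is not None and check(call.args)


def execute(call: ToolCall, handlers: dict) -> str:
    """Run the tool only after the policy check; the model output is never trusted."""
    if not authorize(call):
        return f"denied: {call.tool}"
    return handlers[call.tool](**call.args)
```

The point of the sketch is that the check lives outside the model: a successful prompt injection can change what the model asks for, but not what the runtime allows.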

This systems-first approach is appearing in new tools. PrivScope, a "payload governor," runs on-device to control what sensitive information is sent to a cloud LLM, enforcing task-scoped disclosure. Another system, ESLD, speeds up prompt-injection defense by more than 3x by reading signals directly from a guard model's latent space, treating the guard as one component in a larger security architecture.
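Task-scoped disclosure can be pictured as a filter that sits between local data and the cloud model. The sketch below is a simplified illustration and not PrivScope's actual mechanism; the task names, field names, and the `scope_payload` function are assumptions made for the example.

```python
# Hypothetical mapping from task to the fields that task is allowed to disclose.
TASK_SCOPES = {
    "book_flight": {"name", "passport_country", "travel_dates"},
    "summarize_inbox": {"email_subjects"},
}


def scope_payload(task: str, context: dict) -> dict:
    """Keep only the fields the current task needs; unknown tasks disclose nothing."""
    allowed = TASK_SCOPES.get(task, set())
    return {key: value for key, value in context.items() if key in allowed}


local_context = {
    "name": "A. Traveler",
    "passport_country": "NZ",
    "travel_dates": "2026-06-01 to 2026-06-08",
    "card_number": "0000-0000-0000-0000",
}

# Only the scoped fields would be sent on to the cloud LLM.
outbound = scope_payload("book_flight", local_context)
```

Defaulting to an empty payload for unrecognized tasks is a deliberate choice here: the filter fails closed rather than open.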

As agents gain more autonomy and access to tools and private data, this line of work argues that security must be built into the agent's runtime environment, not bolted on as a prompt instruction.

Sources: Agent Security is a Systems Problem · PrivScope: Task-scoped Disclosure Control for Hybrid Agentic Systems · ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection Defense

RESEARCH

Agents Learn to Generate Their Own Memory

A new framework called Mem-π teaches agents to generate context-specific guidance on demand, outperforming traditional retrieval-based memory systems.

Researchers are moving beyond simple retrieval for agent memory, developing systems where agents learn to generate their own guidance. A May 2026 paper introduces Mem-π, a framework that uses a dedicated language model to produce context-specific advice for a separate downstream agent, rather than retrieving it from an external memory store.

Unlike systems that retrieve static entries from a memory bank, Mem-π's "memory model" is trained with reinforcement learning to decide both when to generate guidance and what guidance to produce. It learns to abstain if guidance would not be helpful and otherwise generates concise instructions tailored to the agent's current situation.
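The "when and what" interface can be sketched in a few lines. In the paper the memory model is a language model trained with reinforcement learning; the hand-written rule below is only a stand-in to show the shape of the interaction, and every name in it is hypothetical.

```python
from typing import Callable, Optional

# A memory model maps the agent's current observation to guidance, or None to abstain.
MemoryModel = Callable[[str], Optional[str]]


def toy_memory_model(observation: str) -> Optional[str]:
    """Stand-in for a learned memory model (assumption: a simple keyword rule)."""
    if "login" in observation.lower():
        return "Check for a saved session before entering credentials."
    return None  # Abstain: no guidance is better than unhelpful guidance.


def build_agent_prompt(observation: str, memory: MemoryModel) -> str:
    """Prepend guidance to the downstream agent's prompt only when the memory model offers it."""
    guidance = memory(observation)
    if guidance is None:
        return f"Observation: {observation}"
    return f"Guidance: {guidance}\nObservation: {observation}"
```

When the memory model abstains, the downstream agent sees its usual prompt unchanged, which is what distinguishes this setup from a retriever that always returns its top match.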

In the paper's experiments, this generative approach consistently outperformed retrieval-based methods. On complex web navigation tasks, Mem-π achieved a relative improvement of over 30% compared to baselines. The technique also showed strong performance on terminal-based tool use and text-based embodied interaction benchmarks.

Generative memory offers a more dynamic and adaptive alternative to retrieval-augmented generation (RAG). Instead of being limited by the contents of an external database, agents can synthesize novel strategies and knowledge, potentially leading to more flexible and capable systems.

Sources: Mem-π: Adaptive Memory through Learning When and What to Generate

TOOLS

GraphRAG Gets Smarter With Retrieval-Time Reasoning

New frameworks like DotRAG are reformulating retrieval as a reasoning process, guiding graph exploration with query-specific logic instead of just fetching and then thinking.

The standard "retrieve-then-reason" paradigm for Retrieval-Augmented Generation is being challenged by new methods that integrate reasoning directly into the retrieval process. A May 2026 paper on DotRAG, a training-free framework, shows that making retrieval itself a reasoning task improves performance on complex, multi-hop questions by avoiding the accumulation of irrelevant context.

DotRAG treats retrieval as a process of reasoning along paths in a knowledge graph. It generates query-conditioned constraints to guide the search, pruning irrelevant sections of the graph and discovering relational paths iteratively. The framework achieved state-of-the-art performance on the MetaQA and UltraDomain benchmarks, with consistent gains on multi-hop tasks.
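A toy version of constraint-guided path search shows why pruning helps. This is a rough sketch and not DotRAG's algorithm: in the framework the constraints are generated from the query, while here they are hard-coded, and the triples, relation names, and `find_paths` function are assumptions for illustration.

```python
from collections import defaultdict


def find_paths(triples, start, hop_relations):
    """Expand paths hop by hop, keeping only edges whose relation is allowed at that hop."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))

    paths = [[start]]
    for allowed in hop_relations:
        next_paths = []
        for path in paths:
            for relation, tail in graph.get(path[-1], []):
                if relation in allowed:  # Prune edges the constraints rule out.
                    next_paths.append(path + [relation, tail])
        paths = next_paths
    return paths


# Tiny hand-built knowledge graph.
TRIPLES = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "starred", "Leonardo DiCaprio"),
    ("Christopher Nolan", "directed", "Interstellar"),
    ("Christopher Nolan", "born_in", "London"),
]

# Question: "What else did the director of Inception direct?"
# Assumed constraints: hop 1 follows "directed_by", hop 2 follows "directed".
paths = find_paths(TRIPLES, "Inception", [{"directed_by"}, {"directed"}])
```

With the constraints in place, the `starred` and `born_in` edges are never expanded, so the irrelevant context they would have pulled in never reaches the generator.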

This is part of a broader trend toward more structured retrieval. Other new systems like ContextRAG build graph topologies without expensive LLM-based entity extraction, using fuzzy logic to induce relationships. Agentic GraphRAG applies collaborative agents to navigate complex financial data stored in knowledge graphs, combining structured queries with reflection loops.

By pushing reasoning into the retrieval step, these systems make RAG more efficient and accurate for questions that require connecting multiple pieces of information. This moves beyond simple semantic search to a more structured, logical exploration of knowledge.

Sources: DOTRAG: Retrieval-Time Reasoning Along Paths · ContextRAG: Extraction-Free Hierarchical Graph Construction for Retrieval-Augmented Generation · Agentic GraphRAG: Navigating Unstructured Financial Data with Collaborative AI

BENCHMARKS

New Benchmarks Test Agents on Long, Realistic Workflows

A new generation of benchmarks like RoadmapBench and CHI-Bench is moving beyond single-file bug fixes to evaluate agents on complex, multi-file, policy-rich tasks.

AI agent evaluation is shifting from simple, single-issue tasks to long-horizon workflows that better reflect real-world complexity. New benchmarks introduced in May 2026, including RoadmapBench for coding and CHI-Bench for healthcare, show that even frontier models struggle with these challenges.

RoadmapBench evaluates agents on 115 tasks based on open-source version upgrades, requiring changes across a median of 51 files and 3,700 lines of code. The top-performing model, Claude-Opus-4.7, resolved only 39.1% of tasks. That is a stark contrast to high scores on simpler bug-fix benchmarks, and it suggests long-horizon software development remains a largely unsolved problem.

The challenge is even greater in policy-rich domains. CHI-Bench tests agents on healthcare workflows like prior authorization, requiring them to navigate a 1,290-document handbook and interact with 20 simulated apps. The best agent solved only 28% of tasks, highlighting major gaps in handling complex rules and multi-step, multi-role interactions.

These benchmarks reveal that current agent capabilities are brittle when faced with the scale and interdependencies of real engineering and enterprise work. They provide a clearer roadmap for what needs to improve: long-term planning, context management, and reasoning over complex constraints.

Sources: RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades · CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

Frontier Model Leaderboards Are Saturating
