The Lowpass Dispatch · Thursday, May 28, 2026

AGENTS

Agent Skills Move From Prose to Pseudocode and Optimization

A new wave of research is refactoring agent skill libraries from ambiguous markdown into structured, verifiable, and optimizable formats to improve reliability.

LLM agents are getting a skill upgrade, as researchers move to replace free-form prose libraries with structured, optimizable alternatives. Two new papers split the work: "Skill-as-Pseudocode" automatically refactors skill libraries into typed pseudocode, while "SkillGrad" optimizes them using a framework inspired by gradient descent.

Agents often fail when relying on markdown skill descriptions, entering what one paper calls a "confused -> re-retrieve -> still confused" loop. The Skill-as-Pseudocode (SaP) method addresses this by automatically converting prose into typed contracts and concrete action templates. On the ALFWorld benchmark, SaP won 82 paired games versus 47 for a baseline, using 22.8% fewer input tokens.
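To make "typed contracts and concrete action templates" less abstract, here is a minimal Python sketch of what a structured skill could look like. It is not SaP's actual output format: the `SkillContract` class, its fields, and the `heat_object` skill are hypothetical names chosen for illustration, and the action strings only imitate ALFWorld-style commands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillContract:
    """Hypothetical typed skill: explicit inputs, preconditions, action templates."""
    name: str
    params: dict[str, type]            # parameter name -> expected type
    preconditions: tuple[str, ...]     # facts that must hold before the skill runs
    action_templates: tuple[str, ...]  # concrete actions with {placeholders}

    def render(self, **args: object) -> list[str]:
        # Reject missing, extra, or wrongly typed arguments instead of guessing.
        if set(args) != set(self.params):
            raise TypeError(
                f"{self.name} expects {sorted(self.params)}, got {sorted(args)}"
            )
        for key, expected in self.params.items():
            if not isinstance(args[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        # Fill every template with the validated arguments.
        return [template.format(**args) for template in self.action_templates]


# Illustrative skill; names and action strings are assumptions, not from the paper.
heat_object = SkillContract(
    name="heat_object",
    params={"obj": str, "appliance": str},
    preconditions=("holding(obj)",),
    action_templates=("go to {appliance}", "heat {obj} with {appliance}"),
)

actions = heat_object.render(obj="egg 1", appliance="microwave 1")
```

The point of the contract is that a malformed call fails loudly at the boundary, which is one plausible way to cut short the re-retrieval loop described above.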

SkillGrad takes this further, treating the entire skill package as a parameter to be optimized. It uses task failures to generate text-based "gradients" that indicate correction directions, applying edits like a patch. On SpreadsheetBench, SkillGrad improved over the strongest training-based baseline by an average of 6.7 percentage points.
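The loop described here (run tasks, turn failures into textual correction directions, apply them as patches) can be sketched in a few lines of Python. This is a toy under heavy assumptions, not SkillGrad's implementation: `run_task` and `critique` are stand-ins for an agent rollout and an LLM critic, and `TextGradient`, the task fields, and the sample rules are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextGradient:
    """Hypothetical correction direction derived from one failed task."""
    target: str       # text in the skill that should change
    replacement: str  # text that should take its place


def run_task(skill: str, task: dict) -> bool:
    # Stand-in for an agent rollout: passes if the skill states the needed rule.
    return task["required_rule"] in skill


def critique(skill: str, task: dict) -> TextGradient:
    # Stand-in for an LLM critic that reads a failure and proposes a fix.
    return TextGradient(target=task["stale_rule"], replacement=task["required_rule"])


def apply_patch(skill: str, grad: TextGradient) -> str:
    # Replace the stale text if present, otherwise append the new rule.
    if grad.target in skill:
        return skill.replace(grad.target, grad.replacement, 1)
    return skill + "\n" + grad.replacement


def optimize(skill: str, tasks: list[dict], max_steps: int = 5) -> str:
    for _ in range(max_steps):
        failures = [t for t in tasks if not run_task(skill, t)]
        if not failures:
            break  # every task passes, nothing left to correct
        for task in failures:
            skill = apply_patch(skill, critique(skill, task))
    return skill


skill = "Write totals to column B.\nUse 0-based row indices."
tasks = [
    {"required_rule": "Use 1-based row indices.", "stale_rule": "Use 0-based row indices."},
    {"required_rule": "Skip the header row.", "stale_rule": "Include the header row."},
]
tuned = optimize(skill, tasks)
```

The analogy to gradient descent is loose by design: the "parameter" is the skill text, the "gradient" is a critique, and the "update" is a text patch.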

This trend marks a shift toward more rigorous software engineering for agent capabilities. By making skills structured and machine-optimizable, developers can build more reliable agents that learn from failure systematically, rather than relying on better prompting alone.

Sources: Skill-as-Pseudocode: Refactoring Skill Libraries to Pseudocode for LLM Agents · SkillGrad: Optimizing Agent Skills Like Gradient Descent

TOOLS

New JS-Native Query Engine Brings Analytics to Agent Traces

Hyperparam, a new set of open-source JavaScript libraries, enables developers to query agent-generated Parquet and Iceberg data directly from the browser or Node.js.

A new open-source query engine called Hyperparam aims to analyze the fastest-growing type of production data: unstructured text from agent traces, chat logs, and model outputs. Presented in the paper "A Query Engine for the Agents," the project provides a JavaScript-native toolchain for querying Parquet and Apache Iceberg data directly from object storage without a traditional data warehouse.

AI applications like Cursor or in-browser agents run client-side, hosting both a user and an LLM. Analyzing data in this environment is difficult because existing query engines like Spark or Trino don't fit in a JS runtime. Hyperparam is designed to be small enough (under 70 KB total) to ship inside a browser tab or a per-turn agent sandbox.

The toolchain consists of three libraries: Hyparquet, Squirreling, and Icebird. A key feature is its ability to interleave standard analytic operators with model-based interpretation of text. It uses per-cell, async-native SQL execution, meaning expensive cells containing LLM calls fire only when downstream operators demand them.
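The demand-driven idea can be illustrated with a small asyncio sketch. It is written in Python for brevity and does not use or imitate the API of Hyparquet, Squirreling, or Icebird, which are JavaScript libraries. `LazyCell`, `fake_llm_label`, and the sample rows are hypothetical; the only claim is that a cheap filter and a limit can keep most expensive cells from ever firing.

```python
import asyncio


class LazyCell:
    """A cell whose value is computed on first request, then cached."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument coroutine function
        self._task = None

    async def value(self):
        if self._task is None:
            self._task = asyncio.create_task(self._compute())
        return await self._task


llm_calls = 0  # counts how many expensive cells actually ran


async def fake_llm_label(text: str) -> str:
    # Stand-in for a model call that interprets a trace.
    global llm_calls
    llm_calls += 1
    await asyncio.sleep(0)
    return "error" if "Traceback" in text else "ok"


def make_row(trace_id: int, tokens: int, text: str) -> dict:
    return {
        "id": trace_id,
        "tokens": tokens,
        "label": LazyCell(lambda: fake_llm_label(text)),
    }


async def query(rows: list[dict], min_tokens: int, limit: int) -> list[tuple]:
    # In spirit: SELECT id, label FROM rows WHERE tokens >= min_tokens LIMIT limit
    out = []
    for row in rows:
        if row["tokens"] < min_tokens:
            continue  # cheap filter first; the label cell is never touched
        out.append((row["id"], await row["label"].value()))
        if len(out) == limit:
            break  # remaining rows never trigger a model call
    return out


rows = [
    make_row(1, 50, "hello"),
    make_row(2, 900, "Traceback ..."),
    make_row(3, 1200, "done"),
    make_row(4, 3000, "Traceback again"),
]
result = asyncio.run(query(rows, min_tokens=500, limit=2))
```

Four rows carry a model-backed cell, but only the two rows that survive the filter and the limit pay for a call.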

Hyperparam provides a missing piece of the AI application stack, enabling developers to build data-aware applications that can analyze their own behavior. This could power a new class of tools for debugging, observing, and improving agent performance directly within the application environment.

Sources: A Query Engine for the Agents

SAFETY

Study Finds 'Verified Misguidance' in RAG Systems

A large-scale audit of search-augmented LLMs reveals that 30.6% of citations are misleading, even when they point to real, accessible sources.

Search-augmented LLMs systematically mislead users by citing real sources that fail to support their claims, a pattern researchers have named "Verified Misguidance." A new study traced 761,495 citation pairs from 112,000 responses across ten models, finding that 30.6% of citations were flawed in their intent, suitability, or fidelity to the source.

The paper, "Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs," introduces the CITETRACE dataset to audit the full chain from query to retrieval to generated answer. Because users often treat citations as proof of grounding and rarely check the sources themselves, citation quality is a critical, unmeasured vector for misinformation.

The researchers identified a "fidelity-suitability trade-off" where models that were faithful to their sources often selected inappropriate sources, and models that selected suitable sources often failed to represent them faithfully. The three-dimensional evaluation framework scored citations on intent alignment, source suitability, and answer-source fidelity using a five-level rubric.
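A rubric of this shape is easy to picture as a small data structure. The Python sketch below is an assumption-laden illustration, not the paper's scoring code: the 1 to 5 integer scale, the passing level of 3, the rule that a citation is flawed if any one dimension falls below it, and the sample scores are all invented for the example.

```python
from dataclasses import dataclass

DIMENSIONS = ("intent", "suitability", "fidelity")


@dataclass(frozen=True)
class CitationScore:
    """One citation scored on three dimensions, each on a five-level scale."""
    intent: int
    suitability: int
    fidelity: int

    def __post_init__(self):
        for dim in DIMENSIONS:
            level = getattr(self, dim)
            if not 1 <= level <= 5:
                raise ValueError(f"{dim} must be between 1 and 5, got {level}")

    def failed_dimensions(self, passing_level: int = 3) -> list[str]:
        # Assumption: a dimension fails when it scores below the passing level.
        return [d for d in DIMENSIONS if getattr(self, d) < passing_level]


def flawed_rate(scores: list[CitationScore], passing_level: int = 3) -> float:
    # Assumption: a citation is flawed if at least one dimension fails.
    flawed = sum(1 for s in scores if s.failed_dimensions(passing_level))
    return flawed / len(scores)


# Made-up scores, not drawn from CITETRACE.
scores = [
    CitationScore(intent=5, suitability=5, fidelity=5),
    CitationScore(intent=4, suitability=2, fidelity=5),  # faithful, unsuitable source
    CitationScore(intent=5, suitability=4, fidelity=1),  # suitable, misrepresented
    CitationScore(intent=3, suitability=3, fidelity=3),
]
rate = flawed_rate(scores)
```

The second and third rows mirror the two sides of the trade-off: one is faithful to a poor source, the other misrepresents a good one.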

This study provides a quantitative measure for a common failure mode in production RAG systems. It shows that simply retrieving a relevant-looking document is not enough to ensure a trustworthy answer, highlighting the need for more rigorous verification steps in any system that presents citations to users.

Sources: Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

RESEARCH

LLMs Fundamentally Fail at Causal Discovery, Researchers Prove

A new paper proves that LLMs cannot tell apart causal graphs that generate similar observational data, a limitation intrinsic to the learning paradigm rather than to any specific model.

Researchers have formally proven that large language models are fundamentally incapable of reliable causal discovery from observational data. The paper, "Why LLMs Fail at Causal Discovery and How Interventional Agents Escape," establishes that this failure is not a flaw of any particular model or dataset, but an intrinsic limitation of learning paradigms like supervised fine-tuning and direct preference optimization.

The core issue, formalized in a "kernel obstruction theorem," is that these methods cannot distinguish between different causal graphs that produce similar observational data. Any attempt to force a model to do so would require its internal representations to grow unboundedly, violating the conditions under which the learning methods work.
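A textbook example shows why observational data can leave the graph undetermined. The Python sketch below is not the paper's kernel obstruction theorem; it is the standard two-variable, zero-mean linear-Gaussian case, with the coefficient `b` and noise variance chosen arbitrarily. Both causal directions yield the same variances and covariance, yet they disagree once X is set by intervention.

```python
import math


def cov_x_causes_y(b: float, noise_var: float) -> tuple[float, float, float]:
    """(Var X, Cov XY, Var Y) when X ~ N(0, 1) and Y = b*X + noise."""
    return (1.0, b, b * b + noise_var)


def cov_y_causes_x(b: float, noise_var: float) -> tuple[float, float, float]:
    """Same statistics from a reversed model: Y is exogenous, X = c*Y + noise."""
    var_y = b * b + noise_var
    c = b / var_y              # regression coefficient of X on Y
    var_u = noise_var / var_y  # residual variance of X given Y
    return (c * c * var_y + var_u, c * var_y, var_y)


def mean_y_under_do_x(graph: str, b: float, x: float) -> float:
    """E[Y | do(X = x)]: forcing X moves Y only if X is a cause of Y."""
    return b * x if graph == "x->y" else 0.0


forward = cov_x_causes_y(b=2.0, noise_var=1.0)
backward = cov_y_causes_x(b=2.0, noise_var=1.0)

# Zero-mean Gaussians are fully described by these three numbers,
# so matching them means matching the observational distribution.
same_observations = all(math.isclose(f, g) for f, g in zip(forward, backward))

# The two graphs separate only under intervention.
effects = (
    mean_y_under_do_x("x->y", b=2.0, x=1.0),
    mean_y_under_do_x("y->x", b=2.0, x=1.0),
)
```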

The paper proposes a way out: Agentic Causal Bayesian Optimization (A-CBO). This approach uses a frozen LLM as an "interventional oracle" that answers targeted "what if" queries about intervention effects. An external Bayesian loop then uses these answers to systematically narrow down the possibilities and identify the correct causal graph, provably converging on the right answer in logarithmically many rounds without any model training.
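The intuition behind a logarithmic round count can be shown with a simplified elimination loop. This Python sketch is not A-CBO: it assumes a noiseless yes/no oracle, restricts candidates to chain graphs over four variables, and replaces the Bayesian machinery with a greedy choice of the most evenly splitting query. The `affects` and `identify` functions and the hidden `truth` chain are hypothetical.

```python
from itertools import permutations


def affects(chain: tuple, a: str, b: str) -> bool:
    # In a chain graph, intervening on a changes b exactly when a precedes b.
    return chain.index(a) < chain.index(b)


def identify(variables: tuple, oracle) -> tuple:
    candidates = list(permutations(variables))  # every possible chain
    rounds = 0
    while len(candidates) > 1:
        def imbalance(pair):
            # How unevenly a yes/no answer would split the remaining candidates.
            yes = sum(affects(c, *pair) for c in candidates)
            return abs(2 * yes - len(candidates))

        query = min(permutations(variables, 2), key=imbalance)
        answer = oracle(*query)  # "does intervening on a change b?"
        candidates = [c for c in candidates if affects(c, *query) == answer]
        rounds += 1
    return candidates[0], rounds


truth = ("c", "a", "d", "b")  # hidden chain c -> a -> d -> b


def oracle(a: str, b: str) -> bool:
    # Stand-in for the frozen LLM answering an intervention query.
    return affects(truth, a, b)


found, rounds = identify(("a", "b", "c", "d"), oracle)
```

With four variables there are 24 candidate chains, so no strategy can guarantee fewer than 5 yes/no answers in the worst case (the ceiling of log2 24). Each well-chosen intervention removes roughly half of the remaining candidates, which is where the logarithmic scaling comes from.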

This work draws a hard theoretical boundary around the capabilities of current LLMs, showing that true causal reasoning requires more than just scaling up observational data. The proposed agentic, interventional approach points toward a new architecture for building AI systems that can reason about cause and effect.

Sources: Why LLMs Fail at Causal Discovery and How Interventional Agents Escape
