The Lowpass Dispatch · Monday, May 25, 2026

RESEARCH

Geopolitical Bias in LLMs Stems from Post-Training, Not Pre-Training Data

A new study finds that geopolitical biases are actively shaped during instruction tuning and RLHF, challenging the assumption that they originate solely from web-scale pre-training data.

Geopolitical bias in large language models is introduced during post-training alignment, not absorbed from raw pre-training data, according to a May 2026 study. Researchers tested seven pairs of open-weight models, comparing the base pre-trained version to the post-trained chat version from each lab. They found that post-training consistently shifted model behavior in a direction associated with the developer's home country.

The effect was most pronounced in Alibaba's Qwen 2.5. Its base model was neutral on China-favorability, but the post-trained chat variant showed a pro-China log-odds score of +2.91, roughly an 18-fold shift in odds (e^2.91 ≈ 18.4). The study also found that the language of the prompt can amplify these biases. For example, the French-made Mistral model became pro-France only when prompted in French.
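For readers who want to check the arithmetic, here is a minimal sketch of how a log-odds score maps to an odds ratio and a probability. It assumes the study reports natural-log odds and that "neutral" means a log-odds of exactly 0; the function names are illustrative and not taken from the paper.

```python
import math


def odds_ratio_from_shift(base_log_odds: float, tuned_log_odds: float) -> float:
    """Multiplicative change in odds implied by a shift in natural-log odds."""
    return math.exp(tuned_log_odds - base_log_odds)


def probability_from_log_odds(log_odds: float) -> float:
    """Logistic function: convert log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Assumption: the neutral base model sits at log-odds 0.
base_log_odds = 0.0
chat_log_odds = 2.91  # figure quoted in the story

ratio = odds_ratio_from_shift(base_log_odds, chat_log_odds)
prob = probability_from_log_odds(chat_log_odds)

print(f"odds ratio: {ratio:.1f}")
print(f"implied probability of the favorable choice: {prob:.3f}")
```

Under these assumptions, a neutral model picks the favorable option about half the time, while a +2.91 log-odds model picks it roughly 95% of the time.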

These findings suggest that the human-driven processes of instruction tuning and reinforcement learning from human feedback (RLHF) are the primary sources of geopolitical preference in models. This complicates efforts to create neutral models, as simply curating pre-training data is insufficient. The research adds to a growing body of work analyzing the political steerability of models, including frameworks for measuring their expressible range of political opinions, known as "Overton Windows."

Sources: It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt · How Far Will They Go? Red-Teaming Online Influence with Large Language Models

AGENTS

New Benchmark Reveals Agents Fail 'Boiling the Frog' Safety Tests

Current agentic models are susceptible to incremental attacks that start with benign requests and escalate to malicious ones, with a new safety benchmark reporting a 44.4% aggregate attack success rate.

Tool-using AI agents are vulnerable to incremental attacks that bypass single-turn safety evaluations, according to a May 2026 paper introducing the "Boiling the Frog" benchmark. Unlike traditional safety tests that evaluate a model's response to a single harmful prompt, this benchmark assesses agent behavior over multiple turns, where an initially benign task gradually escalates to introduce risk.

Across a panel of nine models, the benchmark found an aggregate strict attack success rate of 44.4%. The scenarios mimic corporate and office settings, starting with harmless requests like editing a document before introducing a risk-bearing payload, such as inserting a malicious link or deleting critical data. The benchmark focuses on the final state of the workspace, shifting the safety evaluation from what an agent says to what it does.
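To make the "what it does, not what it says" idea concrete, here is a toy sketch of final-state scoring. The workspace representation, the two harm checks, and the example URL are hypothetical stand-ins, not the benchmark's actual harness.

```python
from typing import Callable, Dict, List, Tuple

# Toy stand-in for an agent workspace: file path -> file contents.
Workspace = Dict[str, str]
HarmCheck = Callable[[Workspace], bool]


def contains_link(path: str, url: str) -> HarmCheck:
    """Harm check: the malicious URL ended up inside the given file."""
    def check(ws: Workspace) -> bool:
        return url in ws.get(path, "")
    return check


def file_deleted(path: str) -> HarmCheck:
    """Harm check: a critical file is missing from the final workspace."""
    def check(ws: Workspace) -> bool:
        return path not in ws
    return check


def attack_success_rate(episodes: List[Tuple[Workspace, HarmCheck]]) -> float:
    """Fraction of episodes whose FINAL workspace state shows the harm.

    The agent's chat transcript is deliberately ignored: a polite refusal
    does not count if the harmful change was still made.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    hits = sum(1 for workspace, check in episodes if check(workspace))
    return hits / len(episodes)


bad_url = "http://evil.example"  # hypothetical payload
episodes = [
    ({"memo.txt": "Q3 summary. See http://evil.example/login"},
     contains_link("memo.txt", bad_url)),
    ({"memo.txt": "Q3 summary."},
     contains_link("memo.txt", bad_url)),
    ({"notes.txt": "todo"},
     file_deleted("payroll.csv")),
    ({"payroll.csv": "id,amount", "notes.txt": "todo"},
     file_deleted("payroll.csv")),
]

rate = attack_success_rate(episodes)
print(f"attack success rate: {rate:.1%}")
```

Two of the four toy episodes end with the payload in place, so the sketch reports a 50% success rate regardless of what the agent said along the way.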

The findings highlight a critical gap in current safety protocols for agentic systems. As models are deployed in stateful environments, their susceptibility to manipulation over time becomes a primary concern. This research suggests that engineers must move beyond single-turn refusal checks and develop more robust, multi-turn defenses to secure AI agents against gradual, deceptive attacks.

Sources: Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

TOOLS

Production-Focused Benchmark Challenges Sandbox-Trained Kernel Agents

A new GPU kernel generation benchmark, FastKernels, reveals that agents trained on existing benchmarks produce code that fails in production, highlighting a critical gap between academic evaluation and real-world deployment.

LLM agents trained to generate GPU kernels are optimizing for the wrong targets, according to researchers who built a new production-aligned benchmark called FastKernels. Existing benchmarks evaluate code in sandboxed environments with synthetic inputs, rewarding agents for replicating known optimizations. The result is kernels that score well in tests but cause compilation conflicts and correctness issues when integrated into real inference frameworks.

FastKernels is built around a production-grade inference framework and uses 46 representative model architectures that cover 96.2% of models on HuggingFace. When state-of-the-art kernel agents were evaluated on FastKernels, their performance dropped significantly, demonstrating that sandbox success does not translate to production systems. The benchmark's interfaces mirror state-of-the-art libraries, allowing generated kernels to be deployed directly into production codebases.

This work underscores a broader trend of creating more realistic benchmarks to bridge the gap between research and engineering. By forcing agents to confront production constraints like compilation stacks and system integration, FastKernels aims to guide development toward agents that generate genuinely useful, deployable code rather than just high-scoring solutions.

Sources: FastKernels: Benchmarking GPU Kernel Generation in Production

MODELS

When Do LLMs Reason? Entropy Dynamics Offer a Clue

Chain-of-thought prompting is only beneficial when a model enters a low-entropy, structured reasoning state, a dynamic that can be detected early in generation to route queries more efficiently.

Whether an LLM benefits from chain-of-thought (CoT) reasoning is not a fixed property of the task, but a dynamic state that emerges during generation, a new study from May 2026 suggests. Researchers found that early-stage entropy dynamics (how the model's uncertainty over its next-token predictions changes during the first part of generation) provide a reliable signal for when CoT is actually useful.

Tasks that benefit from CoT show a consistent reduction in entropy as the model generates its response, indicating a phase transition from a high-entropy exploratory state to a low-entropy structured reasoning state. In contrast, tasks where CoT offers little gain or even hurts performance, such as factual recall, exhibit unstable or increasing entropy. This finding helps explain why the expensive CoT process often fails to improve performance.
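As a rough illustration of the signal involved, the sketch below computes Shannon entropy for each next-token distribution and fits a trend line across steps. The four toy distributions and the use of a least-squares slope are assumptions made for illustration; the study may measure the trajectory differently.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_trajectory(steps: Sequence[Sequence[float]]) -> List[float]:
    """Entropy at each generation step."""
    return [shannon_entropy(dist) for dist in steps]


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of the values against their step index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two steps")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Toy distributions over a 4-token vocabulary that sharpen step by step.
steps = [
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.30, 0.20, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.97, 0.01, 0.01, 0.01],
]

trajectory = entropy_trajectory(steps)
slope = trend_slope(trajectory)
print([round(h, 3) for h in trajectory])
print("falling entropy" if slope < 0 else "flat or rising entropy")
```

A negative slope corresponds to the falling-entropy pattern the story associates with tasks that benefit from CoT.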

Based on this insight, the researchers developed a training-free routing framework called EDRM that analyzes entropy trajectories to adaptively select an inference strategy. By routing queries to CoT only when the model enters a low-entropy state, the system can improve efficiency without sacrificing accuracy. This work provides a more nuanced understanding of LLM reasoning and offers a practical tool for optimizing inference costs.
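A training-free router of this kind could look something like the sketch below. The decision rule (compare mean entropy in the first and second half of an early window) and the `min_drop` threshold are hypothetical; the story does not describe EDRM's actual criterion.

```python
from statistics import mean
from typing import Sequence


def route_by_entropy(early_entropies: Sequence[float], min_drop: float = 0.2) -> str:
    """Pick an inference strategy from an early entropy window.

    Hypothetical rule: if mean entropy falls by at least `min_drop` nats
    between the first and second half of the window, continue with
    chain-of-thought ("cot"); otherwise answer directly ("direct").
    """
    if len(early_entropies) < 4:
        raise ValueError("need at least four early steps")
    half = len(early_entropies) // 2
    drop = mean(early_entropies[:half]) - mean(early_entropies[half:])
    return "cot" if drop >= min_drop else "direct"


# Entropy values here are made up for illustration.
falling = [1.4, 1.3, 0.9, 0.2]   # uncertainty collapses early
unstable = [0.8, 1.0, 0.9, 1.1]  # no clear downward trend

print(route_by_entropy(falling))
print(route_by_entropy(unstable))
```

Because the rule only reads entropies the model already produces while decoding, it adds no training step, which matches the training-free framing above.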

Sources: When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions · The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

AGENTS

Agents Learn to Evolve with Self-Generated Code and Cooperative Play

A new training-free framework called CoSPlay enables code-generating agents to improve themselves at test-time by cooperatively evolving a pool of code solutions and a pool of self-generated unit tests.

A new framework allows language model agents to improve their code generation without relying on ground-truth unit tests. The method, called CoSPlay (Cooperative Self-Play), is a training-free, test-time technique in which an agent jointly refines the code it writes and the tests it writes for that code.

The process begins with the agent generating diverse code solutions and identifying potential failure modes to create a set of unit tests. It then runs all code candidates against all tests, creating an execution matrix. Using the pass-fail signals, the system iteratively prunes weak code and unreliable tests, allowing the two pools to co-evolve and improve each other. When several code solutions tie for the most tests passed, the final answer is selected from the largest cluster of solutions that produce the same output.
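The loop described above can be sketched on a toy task (absolute value). The pruning thresholds, the probe inputs, and the lambda candidates are illustrative assumptions, not CoSPlay's actual rules; in the real system both pools would come from a language model.

```python
from collections import Counter
from typing import Callable, List, Tuple

Candidate = Callable[[int], int]
Test = Tuple[int, int]  # (input, expected output)


def run(code: Candidate, x: int):
    """Run a candidate, mapping any crash to a sentinel value."""
    try:
        return code(x)
    except Exception:
        return "error"


def execution_matrix(codes: List[Candidate], tests: List[Test]) -> List[List[bool]]:
    """matrix[i][j] is True when candidate i passes test j."""
    return [[run(c, x) == want for x, want in tests] for c in codes]


def co_prune(codes: List[Candidate], tests: List[Test], rounds: int = 3):
    """Alternately drop unreliable tests and weak candidates (assumed rules)."""
    if not codes:
        raise ValueError("need at least one candidate")
    codes, tests = list(codes), list(tests)
    for _ in range(rounds):
        matrix = execution_matrix(codes, tests)
        # Assumed rule: a test passed by fewer than half the candidates is unreliable.
        kept = [j for j in range(len(tests))
                if 2 * sum(row[j] for row in matrix) >= len(codes)]
        if kept:
            tests = [tests[j] for j in kept]
        scores = [sum(row) for row in execution_matrix(codes, tests)]
        # Assumed rule: keep candidates scoring at or above the mean.
        avg = sum(scores) / len(scores)
        codes = [c for c, s in zip(codes, scores) if s >= avg]
    return codes, tests


def select_final(codes: List[Candidate], tests: List[Test], probes: List[int]) -> Candidate:
    """Among top scorers, return one from the largest same-output cluster."""
    scores = [sum(row) for row in execution_matrix(codes, tests)]
    top = [c for c, s in zip(codes, scores) if s == max(scores)]
    signatures = [tuple(run(c, x) for x in probes) for c in top]
    winning_signature, _ = Counter(signatures).most_common(1)[0]
    return top[signatures.index(winning_signature)]


# Toy pools: three correct candidates, two buggy ones, one wrong test.
codes = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,
    lambda x: x,    # wrong for negatives
    lambda x: -x,   # wrong for positives
    lambda x: max(x, -x),
]
tests = [(3, 3), (-3, 3), (0, 0), (-5, 5), (2, -2)]  # last test is wrong

surviving_codes, surviving_tests = co_prune(codes, tests)
final = select_final(surviving_codes, surviving_tests, probes=[-2, 0, 7])
print(len(surviving_codes), surviving_tests, final(-4))
```

In this toy run the wrong test initially lets a buggy candidate tie with the correct ones; pruning that test first is what breaks the tie, which is the kind of mutual improvement between the two pools that the story describes.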

This approach addresses a key bottleneck in improving code generation: the need for expensive, human-written ground-truth unit tests. By enabling agents to generate and refine their own validation tools, CoSPlay offers a more scalable path to self-improvement. Experiments on four coding benchmarks showed that the method significantly improves performance on models like Qwen2.

Sources: CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test