Agent Evaluations

Agent evaluations are benchmarks and frameworks used to measure the performance, safety, and capabilities of AI agents. Effective evaluation is critical for tracking progress, identifying weaknesses, and ensuring agents are reliable for production use in engineering and product workflows.

Evaluation Methodologies and Frameworks

Observability and Failure Analysis

Clustered Self-Assessment (CSA) groups sampled generations into semantically distinct clusters, converts them into multiple-choice options, and uses the LLM's assigned probability as a confidence estimate Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models.
CSA consistently outperforms baseline uncertainty quantification approaches and achieves competitive performance with as few as two additional samples Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models.
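
The confidence step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the clustering itself is skipped, the per-option log-probabilities are supplied by hand rather than read from a model, renormalizing over the option letters is an assumption, and the function names are hypothetical.

```python
import math
import string


def build_multiple_choice(question, cluster_representatives):
    """Format one representative answer per cluster as a lettered option."""
    letters = string.ascii_uppercase[: len(cluster_representatives)]
    lines = [f"Question: {question}"]
    for letter, answer in zip(letters, cluster_representatives):
        lines.append(f"{letter}. {answer}")
    lines.append("Answer:")
    return "\n".join(lines), list(letters)


def option_confidences(option_logprobs):
    """Softmax-normalize log-probabilities over the option letters (assumed step)."""
    peak = max(option_logprobs.values())
    weights = {k: math.exp(v - peak) for k, v in option_logprobs.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


prompt, letters = build_multiple_choice(
    "What is the capital of Australia?", ["Canberra", "Sydney"]
)
# Hypothetical log-probabilities a model assigned to each option letter.
confidences = option_confidences({"A": math.log(0.6), "B": math.log(0.2)})
```

With these made-up inputs, the confidence for option A is the share of probability mass it holds among the listed options.
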
Contrastive Decoding Diffing (CDD) is a model diffing method that operates on output-level logit distributions only, without weight access, to recover implanted facts verbatim from finetuned models Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing.
CDD bypasses the chat template, seeds generation with vague pre-fills, and amplifies logit-space differences between finetuned and base models Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing.
CDD outperforms white-box baselines like Activation Difference Lens (ADL) while running significantly faster, and can surface unintended data pipeline artifacts Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing.
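
To make the idea of amplifying logit-space differences concrete, the sketch below applies one common form of contrastive amplification to toy logit vectors. The update rule `finetuned + alpha * (finetuned - base)`, the `alpha` parameter, and the function names are assumptions for illustration; the exact rule used by CDD is not stated here.

```python
def amplify_logit_difference(finetuned_logits, base_logits, alpha=1.0):
    """Push logits further in the direction the finetuned model moved them."""
    if len(finetuned_logits) != len(base_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    return [
        ft + alpha * (ft - base)
        for ft, base in zip(finetuned_logits, base_logits)
    ]


def greedy_token(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy 4-token vocabulary: finetuning raised token 2, but not enough to win.
base = [2.0, 1.0, 0.5, 0.0]
finetuned = [2.0, 1.0, 1.9, 0.0]

plain_choice = greedy_token(finetuned)
amplified_choice = greedy_token(
    amplify_logit_difference(finetuned, base, alpha=2.0)
)
```

In this toy case, plain greedy decoding from the finetuned logits still picks token 0, while the amplified logits surface token 2, the one finetuning shifted.
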
A scalable LRM benchmark of logic puzzles, together with a pipeline that converts unstructured traces into verifiable reasoning graphs, allows quantitative analysis of reasoning structure Reasoning Structure of Large Language Models.
A reasoning efficiency metric quantifies how concentrated a model's logical flow is, separating behaviors that token count and accuracy conflate Reasoning Structure of Large Language Models.

Process- vs. Outcome-Based Evaluation

PROVE (Programmatic Rewards On Verified Environments) is a framework for training LLMs to orchestrate multi-step tool calls using a multi-component programmatic reward Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments.
PROVE's programmatic reward includes graduated validity scoring, dependency-aware coverage, an adaptive efficiency penalty, a tool-name signal, and an argument-value matching bonus, requiring no external judge model Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments.
Multi$^2$ is a hierarchical multi-agent decision-making framework that decomposes agent behavior into a high-level agent for context-aware sub-goal generation and a low-level agent for atomic actions Multi$^2$: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments.
This separation in Multi$^2$ enables stable long-horizon control, mitigates objective drift, and allows efficient adaptation Multi$^2$: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments.

Safety, Alignment, and Robustness

BraveGuard is a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories BraveGuard: From Open-World Threats to Safer Computer-Use Agents.
BraveGuard mines research for emerging risks, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training BraveGuard: From Open-World Threats to Safer Computer-Use Agents.
Selective Abstraction (SA) is a framework that enables LLMs to trade specificity for reliability by selectively reducing the detail of uncertain content in long-form generation When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation.
Atom-wise Selective Abstraction decomposes responses into atomic claims and replaces uncertain atoms with higher-confidence, less specific abstractions When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation.
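
The atom-wise replacement step can be pictured with the short sketch below. It assumes the decomposition into atoms, the per-atom confidence scores, and the candidate abstractions are already available; the `Atom` type, the threshold rule, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    claim: str          # specific atomic claim from the response
    confidence: float   # estimated confidence in the specific claim
    abstraction: str    # less specific fallback wording


def selectively_abstract(atoms, threshold):
    """Keep confident atoms; swap uncertain ones for their abstraction."""
    return [
        atom.claim if atom.confidence >= threshold else atom.abstraction
        for atom in atoms
    ]


atoms = [
    Atom("The bridge opened in 1932.", 0.45, "The bridge opened in the 1930s."),
    Atom("The bridge spans the harbour.", 0.95, "The bridge crosses water."),
]
revised = selectively_abstract(atoms, threshold=0.7)
```

Raising the threshold trades more specificity for reliability, since more atoms fall back to their abstractions.
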
An evaluation harness for LLM security benchmarks addresses per-dataset threshold tuning and undisclosed operating points by using a single global operating point selected on held-out folds Gate AI: LLM Security Benchmark Evaluation Methodology and Results.
This harness examines generalization through a battery of diagnostics, including leave-one-dataset-out cross-validation and adversarial validation Gate AI: LLM Security Benchmark Evaluation Methodology and Results.
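
The contrast with per-dataset threshold tuning can be sketched as follows: one threshold is chosen on pooled held-out scores and then applied unchanged to every dataset. Pooled accuracy as the selection criterion, the dataset names, and the function names are assumptions for illustration, not details from the harness itself.

```python
def accuracy_at(threshold, scored):
    """Fraction of (score, label) pairs classified correctly at a threshold."""
    correct = sum((score >= threshold) == bool(label) for score, label in scored)
    return correct / len(scored)


def select_global_threshold(held_out_by_dataset):
    """Pick a single operating point on the pooled held-out data."""
    pooled = [pair for pairs in held_out_by_dataset.values() for pair in pairs]
    candidates = sorted({score for score, _ in pooled})
    return max(candidates, key=lambda t: accuracy_at(t, pooled))


# Hypothetical held-out detector scores with binary labels (1 = attack).
held_out = {
    "dataset_a": [(0.9, 1), (0.2, 0), (0.7, 1)],
    "dataset_b": [(0.6, 1), (0.4, 0), (0.3, 0)],
}
threshold = select_global_threshold(held_out)
per_dataset = {
    name: accuracy_at(threshold, pairs) for name, pairs in held_out.items()
}
```

Reporting `per_dataset` at the one shared `threshold` makes the operating point explicit instead of hiding a separately tuned value behind each dataset's number.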

LLM-as-a-Judge

LLM judges agree strongly with one another but only weakly with humans The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment.
On subjective rubrics, LLM judges use less than half the human score range and their evaluation axis is nearly orthogonal to the human one The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment.
Inter-LLM agreement exceeds LLM-human agreement; on verifiable factual rubrics, however, these diagnostics fall back into the human range The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment.
Post-hoc calibration on a small human-anchored set can improve LLM-as-judge alignment with human evaluations The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment.
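
One simple way to picture post-hoc calibration is a least-squares linear map from judge scores to human scores, fitted on a small anchor set. The linear form is an assumption made for this sketch, since the calibration method is not specified here, and the scores below are invented.

```python
def fit_linear_calibration(judge_scores, human_scores):
    """Ordinary least squares fit of human = slope * judge + intercept."""
    n = len(judge_scores)
    mean_j = sum(judge_scores) / n
    mean_h = sum(human_scores) / n
    var_j = sum((j - mean_j) ** 2 for j in judge_scores)
    if var_j == 0:
        raise ValueError("judge scores must vary to fit a slope")
    cov = sum(
        (j - mean_j) * (h - mean_h) for j, h in zip(judge_scores, human_scores)
    )
    slope = cov / var_j
    return slope, mean_h - slope * mean_j


def calibrate(score, slope, intercept):
    """Map a raw judge score onto the human scale."""
    return slope * score + intercept


# Hypothetical anchor set: the judge uses a narrow band of the scale.
slope, intercept = fit_linear_calibration([6, 7, 8], [2, 5, 8])
calibrated = calibrate(7.5, slope, intercept)
```

A slope well above 1, as in this toy fit, reflects a judge that compresses the score range relative to humans.
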
JudgmentBench is a benchmark of 30 legal tasks with paired rubric scores and pairwise preference judgments from practicing attorneys JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment.
Comparative judgments on JudgmentBench recover the intended quality ordering substantially better than rubrics and require less than half the annotation time, holding for both human annotators and LLM autograders JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment.
An automated framework quantifies and mitigates Self-Preference Bias (SPB) in LLM judges by constructing equal-quality response pairs to disentangle discriminability from bias propensity Quantifying and Mitigating Self-Preference Bias of LLM Judges.
A structured multi-dimensional evaluation strategy grounded in cognitive load decomposition reduces SPB by 31.5% on average Quantifying and Mitigating Self-Preference Bias of LLM Judges.
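
The logic behind equal-quality pairs can be illustrated with a small estimator: if the two responses in a pair are equally good, an unbiased judge should pick its own output about half the time, so the deviation from 0.5 reflects bias propensity rather than discriminability. This is an illustrative estimator under that assumption, not the metric defined in the paper, and the judgment data is invented.

```python
def self_preference_rate(judgments):
    """Share of equal-quality pairs where the judge picked its own output."""
    picks = list(judgments)
    if not picks:
        raise ValueError("need at least one judgment")
    return sum(pick == "self" for pick in picks) / len(picks)


def self_preference_bias(judgments):
    """Signed deviation from the 0.5 rate expected of an unbiased judge."""
    return self_preference_rate(judgments) - 0.5


# Hypothetical verdicts on ten equal-quality pairs.
verdicts = ["self"] * 7 + ["other"] * 3
bias = self_preference_bias(verdicts)
```

A positive value indicates the judge favors its own outputs and a negative value indicates it disfavors them.
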
A principled, distribution-calibrated aggregation scheme for thinking LLM judges models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts Distribution-Calibrated Inference Time Compute for Thinking LLM-as-a-Judge.
This aggregation scheme leverages both polarity and decisiveness to distinguish narrow margins from strong consensus, reducing MAE and increasing pairwise accuracy against human-consensus meta-labels Distribution-Calibrated Inference Time Compute for Thinking LLM-as-a-Judge.
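
For reference, the Davidson extension of Bradley-Terry assigns three-way probabilities from two strengths and a tie parameter. The sketch below computes those probabilities and, for a single pair with repeated judge samples, the closed-form fit to win/tie/loss counts. How the paper calibrates and aggregates beyond this standard model is not reproduced here; the function names and counts are hypothetical.

```python
import math


def davidson_probabilities(strength_a, strength_b, tie_parameter):
    """Win / tie / loss probabilities under the Bradley-Terry-Davidson model."""
    tie_mass = tie_parameter * math.sqrt(strength_a * strength_b)
    total = strength_a + strength_b + tie_mass
    return {
        "a_wins": strength_a / total,
        "tie": tie_mass / total,
        "b_wins": strength_b / total,
    }


def fit_single_pair(wins_a, ties, wins_b):
    """Maximum-likelihood fit for one pair, fixing B's strength at 1.

    Requires at least one win on each side so the ratio is defined.
    """
    if wins_a <= 0 or wins_b <= 0:
        raise ValueError("both sides need at least one win")
    strength_a = wins_a / wins_b
    tie_parameter = ties / math.sqrt(wins_a * wins_b)
    return strength_a, tie_parameter


# Hypothetical rating counts from 19 sampled judgments of one pair.
strength_a, tie_parameter = fit_single_pair(wins_a=9, ties=6, wins_b=4)
probabilities = davidson_probabilities(strength_a, 1.0, tie_parameter)
```

The strength ratio captures polarity (which side is preferred) while the tie parameter captures how often the judge declines to decide, which is one way to separate narrow margins from strong consensus.
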
SOCIAL CAPTION is a framework that uses MLLM judges to evaluate social understanding abilities of multimodal models along three dimensions: Social Inference, Holistic Social Analysis, and Directed Social Analysis Social Caption: Evaluating Social Understanding in Multimodal Models.

Cost and Efficiency

A compact ridge predictor built from prompt-level agreement spread, label-assisted first-correct-sample position, completion-length variance, and an entropy add-on can predict best-of-N inference scaling gains with Spearman $\rho = 0.90$ Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics.
This predictor is intended for labeled validation-set screening of candidate configurations before incurring the full reward-model scoring cost Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics.
ProEval is a proactive evaluation framework that uses pre-trained Gaussian Processes as surrogates for performance score functions to efficiently estimate performance and identify failure cases ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation.
ProEval requires 8-65x fewer samples to achieve estimates within 1% of ground truth while revealing more diverse failure cases under a stricter evaluation budget ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation.

Reliability and Stability

A holistic performance profile for AI agent reliability proposes twelve concrete metrics decomposing reliability along four dimensions: consistency, robustness, predictability, and safety Towards a Science of AI Agent Reliability.
Recent capability gains have yielded only small improvements in agent reliability Towards a Science of AI Agent Reliability.

Benchmark Design and Curation

SAGE (Social Agent Group Evolution) is an evaluation framework that compares social evolution (agents co-evolving with peer histories) against self-evolution (agents seeing only their own past) SAGE: A Quantitative Evaluation of Socialized Evolution in Agent EcosystemS.
SAGE found that while group history is not a universal amplifier, agents that plateau under self-improvement can achieve significant breakthroughs when peer experience is available SAGE: A Quantitative Evaluation of Socialized Evolution in Agent EcosystemS.
VeRO (Versioning, Rewards, and Observations) is an outer harness that provides versioned snapshots, budget-controlled evaluation, and structured execution traces for optimizing target agent harnesses VeRO: A Harness for Agents to Optimize Agents.
Pre-specified, gate-filtered augmentations of a small hand-authored base can substitute for additional human curation in RL from verifiable rewards (RLVR) training Trading Human Curation for Synthetic Augmentation in RLVR.
The cost-adjusted trade rate between gated synthetic and human-authored RLVR tasks stays in $[1.4\times, 11.6\times]$ across plausible cost ranges Trading Human Curation for Synthetic Augmentation in RLVR.

Benchmarks and Suites

Software Engineering & Code Generation

Self-play SWE-RL (SSR) is a training paradigm where a single LLM agent iteratively injects and repairs software bugs, specified by test patches, in sandboxed repositories Toward Training Superintelligent Software Agents through Self-Play SWE-RL.
SSR achieves self-improvement (+10.4 and +7.8 points on SWE-bench Verified and SWE-Bench Pro) and consistently outperforms human-data baselines Toward Training Superintelligent Software Agents through Self-Play SWE-RL.
RealClawBench is a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity, and real-world difficulty of deployed agent use RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions.
RealClawBench uses reconstructed execution environments and deterministic verifiable scorers to convert real sessions into reproducible, automatically scored tasks RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions.
The best system on RealClawBench solves only 65.8% of tasks, revealing substantial headroom on realistic developer-agent workloads RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions.
VeRO-Bench is a benchmark suite of target agents and tasks with reference evaluation procedures, designed to support research on agent optimization VeRO: A Harness for Agents to Optimize Agents.
A method for Fill-in-the-Middle (FIM) hallucination mitigation uses frontier code models to synthesize plausible-but-wrong completions as hard negatives, leveraging the contrast with ground-truth developer edits as a supervised fine-tuning signal Synthetic Hallucinations, Real Gains: Hard Negatives from Frontier Models for FIM Hallucination Mitigation.
Fine-tuning Qwen2.5-Coder-7B-Instruct with this method lifts Delulu exact match by +18.8 points and edit similarity by +0.22, while also improving HumanEval-Infilling and SAFIM subsets Synthetic Hallucinations, Real Gains: Hard Negatives from Frontier Models for FIM Hallucination Mitigation.
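
For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute them for a single completion: exact match after whitespace stripping, and edit similarity as one minus the length-normalized Levenshtein distance. These definitions are assumptions; the exact definitions used by Delulu are not given here.

```python
def levenshtein(a, b):
    """Character-level edit distance via the standard dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def edit_similarity(prediction, reference):
    """1 - normalized edit distance, in [0, 1]; 1.0 means identical strings."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest


def exact_match(prediction, reference):
    """True when the strings agree after stripping surrounding whitespace."""
    return prediction.strip() == reference.strip()


similarity = edit_similarity("return a + b", "return a - b")
```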

Web Interaction

Reasoning and Planning

$\Psi$-Bench is a benchmark for assessing LLMs' ability to influence realistic users through conversation in three real-world persuasion scenarios $\Psi$-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues.
Multi$^2$ releases three hierarchical benchmark datasets for training and evaluating hierarchical decision-making for LLM-based agents Multi$^2$: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments.

Scientific and Domain-Specific Applications

DAWN is a benchmark for automated numerical modeling, targeting distribution fitting and time series analysis in scientific domains like astronomy VESTA: Visual Exploration with Statistical Tool Agents.
EntSQL is an enterprise-oriented Text-to-SQL benchmark for evaluating long-context grounding over proprietary business documents EntSQL: A Benchmark for Grounding Text-to-SQL in Long-Context Enterprise Knowledge.
EntSQL contains 1,066 aligned Chinese-English semantic examples across five business domains, often requiring domain knowledge beyond the question and schema EntSQL: A Benchmark for Grounding Text-to-SQL in Long-Context Enterprise Knowledge.
The best evaluated system on EntSQL reaches only 15.9% on English inputs when long-form documents are provided, highlighting the difficulty of grounding SQL generation in enterprise knowledge EntSQL: A Benchmark for Grounding Text-to-SQL in Long-Context Enterprise Knowledge.
An efficient end-to-end smart contract security audit framework uses lightweight open-source LLMs (0.6B-4B parameters) and decouples audit tasks into vulnerability detection, explanation, severity classification, and remediation Decoupled Smart Contract Audits: Lightweight LLM Framework via Distillation and Aggregation.
This framework outperforms state-of-the-art dense code LLMs, achieving 98.25% accuracy in vulnerability detection and uncovering a novel severity centrality bias Decoupled Smart Contract Audits: Lightweight LLM Framework via Distillation and Aggregation.
WildRoadBench is a wild aerial road-damage grounding benchmark that couples direct visual grounding by VLMs with autonomous research-and-engineering by LLM-driven agents on a professionally annotated UAV corpus WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents.
WildRoadBench includes a VLM Track and an Agent Track, revealing that both routes remain far from reliable in this wild setting WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents.

Generative Search and Ranking

Safety, Privacy, and Bias

AGENTREDBENCH is a dynamic LLM-driven redteaming benchmark of 215 subtle underspecified authorization scenarios across 24 enterprise integrations and five attack types AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations.
On AGENTREDBENCH, no-guard ASR (attack success rate) ranges from 32% to 81% across an eight-model panel AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations.
AGENTREDGUARD, a guard trained on an integration-diverse corpus of adversarial tool-response content, cuts the panel ASR from 69.9% to 2.4% at a 0.37% false-positive rate AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations.
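
The two quantities reported above are simple ratios, sketched here for clarity. The input format (one boolean per scenario) and the example data are assumptions for illustration and do not come from the benchmark.

```python
def attack_success_rate(attack_succeeded):
    """Fraction of adversarial scenarios in which the attack succeeded."""
    outcomes = list(attack_succeeded)
    if not outcomes:
        raise ValueError("need at least one adversarial scenario")
    return sum(outcomes) / len(outcomes)


def false_positive_rate(benign_flagged):
    """Fraction of benign inputs that the guard wrongly flagged."""
    flags = list(benign_flagged)
    if not flags:
        raise ValueError("need at least one benign input")
    return sum(flags) / len(flags)


# Hypothetical outcomes: 3 of 4 attacks succeed, 1 of 200 benign inputs flagged.
asr = attack_success_rate([True, True, False, True])
fpr = false_positive_rate([False] * 199 + [True])
```

Reporting both together matters because a guard can trivially drive ASR to zero by flagging everything, at the cost of an unusable false-positive rate.
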
SVHalluc is the first comprehensive benchmark for evaluating speech-vision hallucination in audio-visual LLMs, diagnosing semantic and temporal aspects SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models.
State-of-the-art open-source audio-visual LLMs struggle with aligning speech content with corresponding visual signals on SVHalluc, with near-random accuracy on multiple tasks SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models.

Challenges in Agent Evaluation

Reasoning and State Management

Supervised fine-tuning (SFT) on valid action sequences enables LLMs to linearly encode action validity and some state predicates, but models may struggle to use output probabilities for classifying action validity despite learning internal representations A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners.
Broader state space coverage during fine-tuning, such as from random walk data, yields more accurate recovery of the underlying world model in LLM planners A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners.

Faithfulness and Tool Use

Safety, Robustness, and Bias

LLM judges can suffer from Self-Preference Bias (SPB), a directional evaluative deviation where LLMs systematically favor or disfavor their own generated outputs Quantifying and Mitigating Self-Preference Bias of LLM Judges.

Benchmark Limitations

Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining the validity of benchmark results Decomposing and Measuring Evaluation Awareness.
Evaluation awareness can be decomposed into an environment component (how recognizable the task is) and a model component that separates recognition from the propensity to act on it Decomposing and Measuring Evaluation Awareness.
EvalAwareBench is a factor-controlled benchmark of 100 paired safety-capability tasks where eight trigger factors can be independently toggled to study evaluation awareness Decomposing and Measuring Evaluation Awareness.
No single factor uniformly affects all models, but stacking factors progressively raises evaluation awareness across them Decomposing and Measuring Evaluation Awareness.
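
Independently toggling eight binary factors yields a grid of configurations, and "stacking" corresponds to moving through that grid by the number of active factors. The sketch below enumerates the grid; the factor names are placeholders, since the actual trigger factors are not listed here.

```python
from itertools import combinations

# Placeholder names; the benchmark's real trigger factors are not given here.
FACTORS = [f"factor_{i}" for i in range(1, 9)]


def configurations_with(active_count, factors=FACTORS):
    """All configurations with exactly `active_count` factors switched on."""
    return [frozenset(combo) for combo in combinations(factors, active_count)]


def all_configurations(factors=FACTORS):
    """Every on/off configuration, ordered by number of active factors."""
    return [
        config
        for active_count in range(len(factors) + 1)
        for config in configurations_with(active_count, factors)
    ]


grid = all_configurations()
single_factor = configurations_with(1)
```

Comparing results across `configurations_with(k)` for increasing `k` is one way to separate the effect of any single factor from the cumulative effect of stacking.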

Architectural and Workflow Challenges

Superintelligence developed with a "solipsistic" approach (treating the world as an exogenous and stationary source of feedback) is unlikely to be cooperative Solipsistic Superintelligence is Unlikely to be Cooperative.
Deploying AI systems induces endogenous non-stationarity, creating a train-test-deploy gap due to the self-undermining property of unilateral optimization Solipsistic Superintelligence is Unlikely to be Cooperative.
A non-solipsistic research paradigm is needed, which entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency Solipsistic Superintelligence is Unlikely to be Cooperative.