Agent evaluations are benchmarks and frameworks used to measure the performance, safety, and capabilities of AI agents. Effective evaluation is critical for tracking progress, identifying weaknesses, and ensuring agents are reliable for production use in engineering and product workflows.
Together, a scalable benchmark of logic puzzles for large reasoning models (LRMs) and a pipeline that converts unstructured traces into verifiable reasoning graphs allow quantitative analysis of reasoning structure (Reasoning Structure of Large Language Models).
A reasoning efficiency metric quantifies how concentrated a model's logical flow is, separating behaviors that token count and accuracy conflate (Reasoning Structure of Large Language Models).
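As a rough illustration of what "concentration" over a reasoning graph could mean, the sketch below scores a graph by the fraction of its nodes that the final answer actually depends on. This is an assumed stand-in, not the metric defined in the cited paper; the function name `reasoning_efficiency` and the edge-list representation are hypothetical.

```python
from collections import deque


def reasoning_efficiency(edges, answer):
    """Fraction of graph nodes that the final answer depends on.

    edges:  iterable of (premise, conclusion) pairs (hypothetical format).
    answer: the node holding the final answer.
    """
    parents = {}
    nodes = {answer}
    for premise, conclusion in edges:
        parents.setdefault(conclusion, set()).add(premise)
        nodes.update((premise, conclusion))

    # Walk backwards from the answer to collect every step it relies on.
    used = {answer}
    queue = deque([answer])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in used:
                used.add(parent)
                queue.append(parent)
    return len(used) / len(nodes)


# Two traces that reach the same answer; the second explores dead ends.
focused = [("a", "b"), ("b", "ans")]
wandering = focused + [("a", "x"), ("x", "y"), ("a", "z")]

print(reasoning_efficiency(focused, "ans"))    # 1.0
print(reasoning_efficiency(wandering, "ans"))  # 0.5
```

Both traces are equally accurate, and a token count would only say that the second is longer; a structural score of this kind indicates that half of its steps never fed into the answer.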
An automated framework quantifies and mitigates Self-Preference Bias (SPB) in LLM judges by constructing equal-quality response pairs to disentangle discriminability from bias propensity (Quantifying and Mitigating Self-Preference Bias of LLM Judges).
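The idea behind equal-quality pairs can be sketched as follows: if the two responses in a pair are equally good, an unbiased judge should pick its own response about half the time, so any deviation from 0.5 reflects preference rather than discriminability. The verdict labels and function names below are assumptions made for illustration, not the cited framework's interface, and the sketch ignores position-order effects.

```python
def self_preference_rate(verdicts):
    """Share of decided verdicts in which the judge picked its own response.

    verdicts: list of "self", "other", or "tie" (hypothetical labels),
              collected on pairs constructed to be of equal quality.
    """
    decided = [v for v in verdicts if v in ("self", "other")]
    if not decided:
        return None  # nothing to measure if every pair was a tie
    return sum(v == "self" for v in decided) / len(decided)


def bias_score(verdicts):
    """Deviation from the 0.5 rate expected of an unbiased judge."""
    rate = self_preference_rate(verdicts)
    return None if rate is None else rate - 0.5


verdicts = ["self", "self", "self", "other", "tie"]
print(self_preference_rate(verdicts))  # 0.75
print(bias_score(verdicts))            # 0.25
```

A positive score suggests the judge favors its own outputs; a score near zero on equal-quality pairs suggests it does not, regardless of how well it separates responses of unequal quality.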
SOCIAL CAPTION is a framework that uses MLLM judges to evaluate social understanding abilities of multimodal models along three dimensions: Social Inference, Holistic Social Analysis, and Directed Social Analysis (Social Caption: Evaluating Social Understanding in Multimodal Models).
A holistic performance profile for AI agent reliability proposes twelve concrete metrics decomposing reliability along four dimensions: consistency, robustness, predictability, and safety (Towards a Science of AI Agent Reliability).
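To show why a dimension such as consistency is worth reporting separately from accuracy, here is a minimal sketch of one possible consistency measure: the fraction of tasks on which repeated runs all produce the same outcome. It is an illustrative assumption and not one of the twelve metrics from the cited paper; the function names and data layout are hypothetical.

```python
def mean_success(runs_by_task):
    """Overall success rate across all tasks and repeated runs."""
    total = sum(len(outcomes) for outcomes in runs_by_task.values())
    return sum(sum(outcomes) for outcomes in runs_by_task.values()) / total


def outcome_consistency(runs_by_task):
    """Fraction of tasks whose repeated runs all agree (all pass or all fail).

    runs_by_task: dict mapping task id -> list of bool outcomes.
    """
    if not runs_by_task:
        raise ValueError("need at least one task")
    agree = sum(1 for o in runs_by_task.values() if len(set(o)) == 1)
    return agree / len(runs_by_task)


# Made-up outcomes for two hypothetical agents, four runs per task.
agent_a = {"t1": [True, True, True, True], "t2": [False, False, False, False]}
agent_b = {"t1": [True, False, True, False], "t2": [True, False, False, True]}

print(mean_success(agent_a), outcome_consistency(agent_a))  # 0.5 1.0
print(mean_success(agent_b), outcome_consistency(agent_b))  # 0.5 0.0
```

The two agents have the same success rate, yet the first behaves identically on every rerun while the second is unpredictable on every task, which is the kind of difference a single accuracy number hides.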
VeRO (Versioning, Rewards, and Observations) is an outer harness that provides versioned snapshots, budget-controlled evaluation, and structured execution traces for optimizing target agent harnesses (VeRO: A Harness for Agents to Optimize Agents).
VeRO-Bench is a benchmark suite of target agents and tasks with reference evaluation procedures, designed to support research on agent optimization (VeRO: A Harness for Agents to Optimize Agents).
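The three ingredients named above (versioned snapshots, a capped evaluation budget, and structured traces) can be sketched in a few lines. This is a toy illustration under stated assumptions, not VeRO's actual API: the class `BudgetedEvaluator`, its fields, and the scoring function are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BudgetedEvaluator:
    """Toy outer harness: every evaluation is snapshotted, counted, and traced."""
    score_fn: Callable[[dict], float]  # hypothetical scorer for a target-agent config
    budget: int                        # number of evaluations still allowed
    snapshots: list = field(default_factory=list)
    trace: list = field(default_factory=list)

    def evaluate(self, config):
        if self.budget <= 0:
            raise RuntimeError("evaluation budget exhausted")
        self.budget -= 1
        version = len(self.snapshots)
        self.snapshots.append(dict(config))  # copy so later edits cannot alter history
        score = self.score_fn(config)
        self.trace.append(
            {"version": version, "score": score, "budget_left": self.budget}
        )
        return version, score

    def best(self):
        """Return (version, config) of the highest-scoring snapshot so far."""
        record = max(self.trace, key=lambda r: r["score"])
        return record["version"], self.snapshots[record["version"]]


# Made-up objective: configs score higher the closer temperature is to 0.3.
harness = BudgetedEvaluator(lambda cfg: -abs(cfg["temperature"] - 0.3), budget=2)
harness.evaluate({"temperature": 0.9})
harness.evaluate({"temperature": 0.2})
print(harness.best())  # (1, {'temperature': 0.2})
```

Keeping the budget inside the harness rather than in the optimizer means an optimizing agent cannot overspend on evaluation, and the trace gives it a structured record to reason over instead of raw logs.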
Supervised fine-tuning (SFT) on valid action sequences enables LLMs to linearly encode action validity and some state predicates, but models may struggle to use output probabilities for classifying action validity despite learning internal representations (A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners).
Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining the validity of benchmark results (Decomposing and Measuring Evaluation Awareness).
Evaluation awareness can be decomposed into an environment component (how recognizable the task is) and a model component that separates recognition from the propensity to act on it (Decomposing and Measuring Evaluation Awareness).
EvalAwareBench is a factor-controlled benchmark of 100 paired safety-capability tasks where eight trigger factors can be independently toggled to study evaluation awareness (Decomposing and Measuring Evaluation Awareness).
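Independently toggling eight binary factors yields 2^8 = 256 possible task variants, of which the eight single-factor variants are the simplest way to attribute an effect to one trigger. The sketch below enumerates both; the factor names are placeholders, since the benchmark's actual factors are not listed here, and the helper functions are hypothetical.

```python
from itertools import product

# Placeholder names; the benchmark's real trigger factors are not assumed here.
FACTORS = [f"factor_{i}" for i in range(8)]


def all_configurations(factors):
    """Yield every on/off assignment of the given factors."""
    for bits in product([False, True], repeat=len(factors)):
        yield dict(zip(factors, bits))


def single_factor_variants(factors):
    """Configurations with exactly one factor switched on."""
    return [{f: (f == active) for f in factors} for active in factors]


configs = list(all_configurations(FACTORS))
singles = single_factor_variants(FACTORS)

print(len(configs))  # 256
print(len(singles))  # 8
```

Comparing a model's behavior on a single-factor variant against the all-off baseline isolates that factor's contribution, whereas the full grid is only needed when interactions between factors are of interest.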
A non-solipsistic research paradigm is needed: one that builds dynamic evaluation testbeds involving adaptive counterparties, treats institutions as design primitives, and preserves human agency (Solipsistic Superintelligence is Unlikely to be Cooperative).