Agent Architectures

Agent architectures define the structure and control flow of an autonomous agent. They range from simple loops to complex multi-agent systems and self-improving frameworks. Recent work emphasizes structured processes, where persistent artifacts, work contracts, and human review coordinate agents, moving beyond single, isolated prompts From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents.

System-Level Architectures

These concepts frame agentic computation at the level of an operating system or computer architecture, providing foundational abstractions for managing agent execution, communication, and safety.

Agent libOS: A library-OS-inspired runtime that treats an agent as a schedulable AgentProcess with identity, state, capabilities, and audit records.
OpenAgenet/OAN: An open infrastructure for trusted agent interconnection, providing a protocol-neutral trust layer for identity provenance, governance, discovery authorization, and signed invocation before agents interact OpenAgenet/OAN: Open Infrastructure for Trusted Agent Interconnection.
Agentic Redux: An architecture for provably auditable and safe agents in domains like healthcare billing, using human-authored ontologies and recording all decisions in an append-only ledger Provably Auditable and Safe LLM Agents from Human-Authored Ontologies.
Formal Protocol Semantics: Process calculus can be used to formalize agent-tool protocols like MCP, proving properties about their expressivity and establishing schema quality as a provable safety property Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach.
Lab Agent Protocol (LAP): A protocol for agent-to-instrument interaction in autonomous science, featuring primitives for capabilities, reservations, safety handshakes, and reproducible results.
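
The Agent libOS entry above describes an agent as a schedulable process carrying identity, state, capabilities, and audit records. A minimal sketch of that shape follows. The field names, the `request` method, and the `AuditRecord` type are illustrative assumptions, not the runtime's actual interface.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class AuditRecord:
    """One immutable entry describing a capability decision (hypothetical shape)."""
    agent_id: str
    action: str
    allowed: bool


@dataclass
class AgentProcess:
    """Illustrative agent process: identity, capabilities, state, audit records."""
    agent_id: str                      # identity
    capabilities: frozenset[str]       # actions this agent is permitted to perform
    state: dict[str, Any] = field(default_factory=dict)
    audit_log: list[AuditRecord] = field(default_factory=list)

    def request(self, action: str) -> bool:
        """Check the capability and record the decision whether or not it is allowed."""
        allowed = action in self.capabilities
        self.audit_log.append(AuditRecord(self.agent_id, action, allowed))
        return allowed


proc = AgentProcess("agent-1", frozenset({"read_file"}))
granted = proc.request("read_file")    # capability held
denied = proc.request("send_email")    # capability missing, still audited
```

Recording denied requests as well as granted ones is a design choice in this sketch: it keeps the audit trail complete even when nothing executes.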

Embodied Agent Architectures

These architectures focus on agents that interact with physical or simulated environments, often incorporating multimodal perception and control.

SCOPE: A modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed for edge deployment.
AgenticDiffusion: A multi-view UAV navigation framework that coordinates language-guided reasoning, open-vocabulary target grounding, vision-based diffusion planning, and NMPC in a unified pipeline AgenticDiffusion: Agentic Diffusion-based Path Planning for Vision-Based UAV Navigation.

Core Agent Loops and Control Flow

The fundamental agent loop involves cycles of observation, thought, and action. Modern architectures add sophisticated mechanisms for planning, memory management, and adaptive control.

ReAct: The ReAct (Reasoning and Acting) framework is a foundational pattern that interleaves reasoning traces with actions, so that each observation returned by the environment informs the next reasoning step.
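
The observe-think-act cycle can be sketched as a short loop. In this sketch a scripted function stands in for the language model, and the `TOOLS` registry, the `scripted_policy` function, and the trace format are all assumptions made for illustration.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical tool registry: tool name -> function from input string to observation.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}


def scripted_policy(question: str, trace: list[str]) -> tuple[str, str, str]:
    """Stand-in for an LLM call. Returns (thought, action, action_input)."""
    observations = [line for line in trace if line.startswith("Observation: ")]
    if not observations:
        return ("I should compute the sum with the add tool.", "add", question)
    return ("I have the result.", "finish", observations[-1].removeprefix("Observation: "))


def react(question: str, policy, max_steps: int = 5):
    """Interleave thoughts, actions, and observations until the policy finishes."""
    trace: list[str] = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, trace)
        trace.append(f"Thought: {thought}")
        if action == "finish":
            return arg, trace
        observation = TOOLS[action](arg)       # act, then observe
        trace.append(f"Action: {action}[{arg}]")
        trace.append(f"Observation: {observation}")
    return None, trace                         # step budget exhausted


answer, trace = react("2+3", scripted_policy)
```

The `max_steps` bound is an assumption of the sketch; some step budget is a common safeguard against loops that never reach a final answer.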

Agentic Harnesses

Harnesses wrap and augment existing models with structured execution, verification, and repair capabilities without requiring model retraining.

Microsoft Agent Framework (MAF) Agent Harness: A production-oriented harness providing first-class patterns like automatic context compaction, instruction merging, session-scoped file memory, plan vs. execute modes, skill discovery, and background agent delegation Microsoft Agent Framework at BUILD 2026: Agent Harness, Hosted Agents, CodeAct, and more.
Deontic Agentic Reasoning (DAR): An agentic setup where a model consults statutes on demand to perform deontic reasoning (applying rules to facts); agentic harnesses can improve this reasoning DAR: Deontic Reasoning with Agentic Harnesses.
MUSE: A multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, tool use, parsing, verification, and verifier-guided repair.
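
The idea common to these harnesses, wrapping an unchanged model with verification and verifier-guided repair, can be sketched as a retry loop. The function names, the JSON check, and the stub model below are assumptions for illustration and do not reproduce any of the listed systems.

```python
import json
from typing import Callable, Optional


def run_with_repair(
    model: Callable[[str, Optional[str]], str],
    task: str,
    verify: Callable[[str], Optional[str]],
    max_attempts: int = 3,
):
    """Call the model, verify its output, and feed verifier feedback into the next try."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        output = model(task, feedback)
        feedback = verify(output)          # None means the output passed verification
        if feedback is None:
            return output, attempt
    raise RuntimeError(f"no verified output after {max_attempts} attempts")


def verify_json(output: str) -> Optional[str]:
    """Hypothetical verifier: require a JSON object with an 'answer' key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return f"Output is not valid JSON: {exc.msg}"
    if not isinstance(parsed, dict) or "answer" not in parsed:
        return "JSON object must contain an 'answer' key."
    return None


def stub_model(task: str, feedback: Optional[str]) -> str:
    """Stand-in for a real model: wrong format first, corrected once it sees feedback."""
    return "answer: 4" if feedback is None else '{"answer": 4}'


output, attempts = run_with_repair(stub_model, "What is 2 + 2?", verify_json)
```

The model weights are never touched; all of the improvement comes from the loop around the model, which is the property the section attributes to harnesses.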

Reasoning and Decision-Making

Adaptive Reasoning: The AgentMob framework for mobility prediction uses a fast path for routine cases and triggers iterative tool use for ambiguous ones, improving accuracy by adaptively gathering evidence Towards Efficient and Evidence-grounded Mobility Prediction with LLM-Driven Agent.
Consequence-Aware Compute Allocation: A scheduler can route tasks with higher predicted real-world cost to larger compute tiers, reducing cost-weighted loss compared to difficulty-aware routing alone Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation.
Cognitive Memory Management: The SaliMory framework trains a single model to manage a cognitively structured memory (user facts, preferences, working memory) using a hierarchical, stage-wise process reward for distinct memory operations SaliMory: Orchestrating Cognitive Memory for Conversational Agents.
Preference Stability: Inclusion-of-Thoughts (IoT) is a progressive self-filtering strategy that reconstructs multiple-choice questions using only plausible options to mitigate the model's preference instability caused by distractors Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space.
Confidence Metrics: Current probabilistic confidence metrics used in Best-of-N selection primarily capture surface-level fluency rather than the logical structure of reasoning steps Reasoning or Fluency? Dissecting Probabilistic Confidence in Best-of-N Selection.
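
To make the consequence-aware allocation entry concrete, the sketch below routes a task to the compute tier that minimizes compute cost plus expected error cost. The tier names, costs, error rates, and the additive cost rule are all illustrative assumptions, not the scheduler from the cited work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    compute_cost: float   # cost of running the task on this tier (arbitrary units)
    error_rate: float     # assumed probability that this tier gets the task wrong


# Hypothetical tiers: larger tiers cost more but err less often.
TIERS = [
    Tier("small", 1.0, 0.20),
    Tier("medium", 4.0, 0.10),
    Tier("large", 16.0, 0.02),
]


def route(consequence_cost: float, tiers=TIERS) -> Tier:
    """Pick the tier with the lowest compute cost plus expected cost of an error."""
    return min(tiers, key=lambda t: t.compute_cost + t.error_rate * consequence_cost)


low_stakes = route(5.0).name       # an error costs little
mid_stakes = route(50.0).name
high_stakes = route(1000.0).name   # an error is very costly
```

Under these assumed numbers the same task difficulty leads to different tiers purely because the cost of being wrong differs, which is the contrast with difficulty-aware routing drawn above.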

Control Flow and Verification

Self-Reflective APIs: An API design pattern where validation failures return machine-readable recovery suggestions, enabling an agent to repair its request and retry without external reasoning Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery.
Cascading Hallucination Detection: The CHARM framework detects and mitigates cascading hallucinations in agentic RAG by using stage-level fact verification, cross-stage consistency tracking, and confidence monitoring to interrupt error propagation Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation.
Constraint Injection: A verification technique for LLM-generated optimization models that uses feasible and one-constraint-violating probes to reveal spurious over-constraint or silent constraint omission Beyond Objective Equivalence: Constraint Injection for LLM-Based Optimization Modeling on Vehicle Routing Problems.
Constitutional AI Verification: Glass Box is a runtime verification layer that intercepts AI policy actions and evaluates them against physics-grounded constitutional constraints and safety invariants before execution.
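
The Self-Reflective API pattern above can be sketched with a toy endpoint whose validation failures carry a machine-readable fix. The endpoint `create_booking`, the `recovery` field, and its `op`/`field`/`value` keys are hypothetical names chosen for this example.

```python
def create_booking(request: dict) -> dict:
    """Hypothetical endpoint that returns a structured recovery suggestion on failure."""
    size = request.get("party_size")
    if not isinstance(size, int):
        suggestion = None
        if isinstance(size, str) and size.isdigit():
            # The API knows exactly how to fix this input, so it says so in structured form.
            suggestion = {"op": "replace", "field": "party_size", "value": int(size)}
        return {"ok": False, "error": "party_size must be an integer", "recovery": suggestion}
    return {"ok": True, "booking": dict(request)}


def call_with_recovery(api, request: dict, max_retries: int = 2) -> dict:
    """Apply the API's own recovery suggestion and retry, with no extra reasoning step."""
    request = dict(request)                  # do not mutate the caller's request
    response = api(request)
    for _ in range(max_retries):
        if response["ok"]:
            break
        fix = response.get("recovery")
        if not fix or fix.get("op") != "replace":
            break                            # nothing machine-readable to apply
        request[fix["field"]] = fix["value"]
        response = api(request)
    return response


recovered = call_with_recovery(create_booking, {"party_size": "4"})
unrecoverable = call_with_recovery(create_booking, {"party_size": "four"})
```

When the API cannot propose a fix, the caller receives the original failure unchanged and can fall back to whatever slower recovery path it has.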

World Models

World models are internal simulators that learn the structure and dynamics of an environment, enabling agents to predict, plan, and reason within learned representations.

Planning and Execution Patterns

These patterns focus on how agents decompose tasks, generate steps, and manage resources.

Planner-Executor and Multi-Stage Workflows

Planner-Executor: A common pattern where a "planner" LLM decomposes a task into steps and an "executor" carries them out.
Decoupled Architectures:
- STREAMS: A dual-module framework for strategic counseling that decouples strategic reasoning (Thinker) from response generation (Presenter) When Clients Stop Following: A Cognitive Conceptualization Diagram-driven Framework for Strategic Counseling.
- Proactive Assistance: A decoupled planner-interaction architecture for proactive procedural assistance, specialized for tracking procedural state, reading visual cues, and injecting recovery steps when users deviate from the plan Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance.
Specialized Workflows:
- LEAP: An agentic framework for formal theorem proving that decomposes complex problems and bridges formal proof construction with informal blueprints through continuous interaction with a compiler LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks.
- MetaPoint: A planner agent can decompose a high-level user request into a structured sequence of spatial primitives (MetaPoint tokens) for a visual generator to achieve precise spatial control MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation.
- BioManus: A biomedical agent that uses graph-scaffolded planning over a structured ecosystem of bioinformatics tools that have been converted into standardized MCP servers, decoupling planning complexity from the raw number of tools Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System.
Reason, Retrieve, Re-rank (R3/R3G): A modular pipeline for retrieval-augmented tasks where an LLM first reasons about information needs, a retriever fetches candidates, and a re-ranker selects the best evidence R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation.
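
The planner-executor split at the top of this list can be sketched in a few lines. Here a string split stands in for the planner model, and the `EXECUTORS` table of step handlers is a hypothetical placeholder for real tools.

```python
from typing import Callable

# Hypothetical step handlers the executor knows how to run.
EXECUTORS: dict[str, Callable[[str], str]] = {
    "upper": str.upper,
    "reverse": lambda s: s[::-1],
}


def plan(task: str) -> list[str]:
    """Stand-in planner: a real system would ask an LLM to decompose the task."""
    return [step.strip() for step in task.split(" then ")]


def execute(steps: list[str], data: str):
    """Executor: run each planned step in order, keeping a record of intermediate results."""
    history = []
    for step in steps:
        data = EXECUTORS[step](data)
        history.append((step, data))
    return data, history


steps = plan("upper then reverse")
result, history = execute(steps, "abc")
```

Because the plan is an explicit list, it can be inspected, logged, or edited before any step runs, which is one reason the two roles are often kept separate.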

Multi-Agent Systems (MAS)

Multi-agent systems decompose complex tasks among multiple, often specialized, agents that coordinate to achieve a goal.

Communication and Coordination

Streaming Communication: The StreamMA system streams each reasoning step to downstream agents as it is generated, reducing latency and improving effectiveness by preventing error-prone late steps from misleading other agents Streaming Communication in Multi-Agent Reasoning.
Dynamic Ensembling: Dynamic Logit-Level Gating (DLLG) is a framework where a lightweight gating module learns to predict token-level fusion weights to ensemble multiple specialized LLM experts without retraining them DLLG: Dynamic Logit-Level Gating of LLM Experts.
Multi-Agent Debate: Deliberation in groups can significantly improve accuracy on complex reasoning tasks like chess puzzles DeliChess: A Multi-party Dialogue Dataset for Deliberation in Chess Puzzle Solving.
Disagreement as Signal: Reasoning-trace disagreement between agents can be abstracted into symbolic states (e.g., convergent agreement, divergent disagreement) to support strategic routing rules for value-laden tasks Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal.
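
As an illustration of logit-level ensembling, the sketch below fuses the next-token logits of several experts with softmax-normalized gate weights. In DLLG a learned module predicts the weights per token; here the gate scores are supplied by hand, and the weighted-sum fusion rule is an assumption of this sketch.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(expert_logits: list[list[float]], gate_scores: list[float]):
    """Combine per-expert logits for one token position using gate weights."""
    weights = softmax(gate_scores)
    vocab_size = len(expert_logits[0])
    fused = [
        sum(w * logits[i] for w, logits in zip(weights, expert_logits))
        for i in range(vocab_size)
    ]
    return fused, weights


# Two experts over a two-token vocabulary that prefer different tokens.
experts = [[2.0, 0.0], [0.0, 2.0]]
even_fused, even_weights = fuse_logits(experts, [0.0, 0.0])      # equal trust
skewed_fused, _ = fuse_logits(experts, [10.0, -10.0])            # trust expert 0
```

The experts themselves are left untouched; only the gate scores would be learned, which matches the "without retraining them" property described above.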

Hierarchical and Role-Based Patterns

Manager-Worker with DAGs: A manager model can decompose tasks into a directed acyclic graph (DAG), with parallel subagents executing nodes.
Specialized Agent Teams:
- DMAIC-IAD: A "Plan First, Judge Later" system for industrial anomaly detection where agents first distill references into standard operating procedures, and a pre-trained judge model ranks candidate strategies before costly execution Plan First, Judge Later, Run Better: A DMAIC-Inspired Agentic System for Industrial Anomaly Detection.
- Ptah: A multi-agent harness for verifiable multimodal deep research that orchestrates planning, research, and writing stages with specialized agents and a verifier agent to enforce grounding and consistency Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation.
- VulnAgent-R2: A budget-aware agentic auditing framework for repository-level vulnerability detection that combines graph triage, role-specialized agents, sceptic counter-evidence, and selective dynamic verification VulnAgent-R2: Evidence-Calibrated Multi-Agent Auditing for Repository-Level Vulnerability Detection.
- Multi-Scale O-RAN Control: A hierarchical framework for Open Radio Access Networks uses an LLM agent for high-level policy, SLM agents for near-real-time optimization, and Wireless Physical-layer Foundation Model agents for fast inference at the edge Toward Autonomous O-RAN: A Multi-Scale Agentic AI Framework for Real-Time Network Control and Management.
Hierarchical Reasoning:
- Speculative Thinking: A training-free framework where a large reasoning model guides a smaller one during inference by strategically delegating reflective steps to the more capable model Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time.
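
The manager-worker pattern with a DAG, listed at the top of this subsection, can be sketched with the standard library. The manager's decomposition is given here as a fixed dictionary, and the `work` function is a stand-in for a subagent call; both are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_dag(dag: dict, work, max_workers: int = 4) -> dict:
    """Run nodes in dependency order; nodes that are ready together run in parallel.

    `dag` maps each node to the set of nodes it depends on.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()                      # raises CycleError if the graph has a cycle
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = list(sorter.get_ready())
            outputs = pool.map(
                lambda node: work(node, {dep: results[dep] for dep in dag.get(node, ())}),
                ready,
            )
            for node, output in zip(ready, outputs):
                results[node] = output
                sorter.done(node)         # unblocks nodes that depended on this one
    return results


def work(node: str, inputs: dict) -> str:
    """Stand-in subagent: reports which upstream results it received."""
    return f"{node}({','.join(sorted(inputs))})"


dag = {
    "research": set(),
    "outline": set(),
    "draft": {"research", "outline"},
    "review": {"draft"},
}
results = run_dag(dag, work)
```

In this example `research` and `outline` have no dependencies and run in the same wave, while `draft` and `review` wait for their inputs.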

Training Dynamics and Stability

Swarm Training: AgentJet is a distributed swarm training framework with a decoupled architecture where server nodes host and optimize models while client nodes execute agents, enabling heterogeneous multi-model RL, fault-tolerance, and live code iteration AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning.
Policy Sharing Tradeoffs: In multi-agent RL, isolated-policy training (separate parameters per role) can reach higher peak accuracy but risks collapse, while shared-policy training can be "captured" by a dominant role.

System Stability and Safety

Safety Under Scaffolding: Agentic scaffolds like ReAct or multi-agent debate can alter a model's measured safety, with effects varying significantly by model, undermining the utility of a single composite safety score Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety.
Intervention Timing: Deciding when to interrupt a long-running agent is a low-reliability construct; affect-based triggers saturate, LLM judges are unreliable, and human annotators show low agreement on when to intervene The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents.

Self-Improving and Evolutionary Architectures

These architectures enable agents to learn from experience, evolve their capabilities, and improve over time.

Theoretical Foundations

Iterative Self-Improvement: Each round of self-improvement can be modeled as maximum-likelihood fine-tuning on a reward-filtered distribution.
Emergence of Reasoning: Transformers trained with outcome-based RL can spontaneously learn to generate reasoning steps, but this emergence requires the training distribution to contain a sufficient mass of "simple examples" that need fewer steps Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data.
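
One common way to write the iterative self-improvement view above as an equation is given below. The notation is an assumption of this sketch: $\pi_{\theta_t}$ is the policy after round $t$, $\mathcal{D}$ a prompt distribution, $r$ a reward function, and $\tau$ a filtering threshold.

```latex
\theta_{t+1}
  = \arg\max_{\theta}\;
    \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta_t}(\cdot \mid x)}
    \Big[ \mathbb{1}\{ r(x, y) \ge \tau \}\, \log \pi_{\theta}(y \mid x) \Big]
```

Equivalently, round $t+1$ performs maximum-likelihood estimation on samples from the filtered distribution $q_t(y \mid x) \propto \pi_{\theta_t}(y \mid x)\, \mathbb{1}\{ r(x, y) \ge \tau \}$.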

Search and Evolution for Algorithm Discovery

Bilevel Autoresearch: A framework where an outer autoresearch loop improves an inner autoresearch loop by reading its code and traces, identifying bottlenecks, and generating injectable search mechanisms at runtime Bilevel Autoresearch: Meta-Autoresearching Itself.
Self-Play: A prover-conjecturer system can enable self-improvement in theorem proving. SSR trains software agents by having a single agent iteratively inject and repair bugs of increasing complexity.

Skill and Prompt Evolution

Co-Evolutionary Frameworks:
- SCORE: A self-evolving co-evolutionary framework for deep research that tightly couples an evaluator and a solver in a shared-parameter learning process, allowing them to improve jointly Self-Evolving Deep Research via Joint Generation and Evaluation.
- LifeSkill: A two-stage RL framework for online lifelong learning agents that uses verifier-guided skill learning and continuously internalizes test-time feedback into model parameters Learning While Acting: A Skill-Enhanced Test-Time Co-Evolution Framework for Online Lifelong Learning Agents.
Human-Directed Evolution:
- Parthenon: A self-evolving legal-agent framework that converts scored failures into task-agnostic edits to skills, tools, and knowledge without changing model weights, mimicking how a firm refines its playbooks Parthenon Law: A Self-Evolving Legal-Agent Framework.
- Digital Apprentice: A framework where an agent earns autonomy through per-skill tiers, graduating only when empirical evidence gated by human approval justifies it The Digital Apprentice: A Framework for Human-Directed Agentic AI Development.
Skill Representation and Adaptation:
- Agent Instruction Protocol (AIP): Models a skill as a directed execution graph with nodes backed by deterministic scripts or natural language, improving reliability by giving the agent runnable units instead of prose to interpret AIP: A Graph Representation for Learning and Governing Agent Skills.
- State-Grounded Dynamic Retrieval (SGDR): An online skill learning method for web agents that retrieves reusable skills based on both the task goal and the current webpage state at each step Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval.
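
The state-grounded retrieval idea can be illustrated by scoring each stored skill against both the task goal and the current page state. The skill library, the token-overlap (Jaccard) similarity, and the `alpha` weighting below are assumptions for this sketch; the cited method's actual retriever may differ.

```python
def tokens(text: str) -> set:
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical skill library: each skill records the goal and page state it was learned in.
SKILLS = [
    {"name": "fill_login_form", "goal": "log in to account",
     "state": "login page username password"},
    {"name": "apply_price_filter", "goal": "find cheap product",
     "state": "search results filter sidebar price"},
    {"name": "submit_search", "goal": "find cheap product",
     "state": "home page search box"},
]


def retrieve(goal: str, page_state: str, skills=SKILLS, alpha: float = 0.5) -> str:
    """Return the skill that best matches the goal and the current page state."""
    def score(skill: dict) -> float:
        goal_sim = jaccard(tokens(goal), tokens(skill["goal"]))
        state_sim = jaccard(tokens(page_state), tokens(skill["state"]))
        return alpha * goal_sim + (1 - alpha) * state_sim
    return max(skills, key=score)["name"]


# Same goal, two different page states: the retrieved skill changes with the state.
first_step = retrieve("find cheap product", "home page search box")
later_step = retrieve("find cheap product", "search results filter sidebar price list")
```

Retrieving at every step rather than once per task is what lets the selected skill track the page the agent is actually on.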

Learning from Experience and Feedback

Continual Learning: Deployed RL agents should be continual learners to adapt to the four sources of non-stationarity they face after deployment Position: Deployed Reinforcement Learning should be Continual.
Learning from Rich Feedback:
- DistIL: A distributional variant of DAgger that uses a forward cross-entropy objective to learn from rich feedback (e.g., execution traces, expert corrections) by propagating future expert-student disagreement back to earlier decisions Reinforcement Learning from Rich Feedback with Distributional DAgger.
- Modality-Aware Credit Assignment (MoCA): An RL framework that improves multimodal synergy by decoupling generation into perception and reasoning steps, allowing it to reward perceptual fidelity independently of reasoning outcomes Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning.
Process Rewards:
- Sci-PRM: A process reward model for scientific reasoning that provides fine-grained supervision on tool selection, execution accuracy, and result interpretation, serving as a dense reward signal for RL SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification.
Long-Horizon Improvement: Success on long-horizon, closed-loop optimization tasks is predicted not by the quality of an agent's first attempt, but by its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?.