AI Agents#

Definition#

An AI agent is an LLM placed inside a loop with tools, memory, and an explicit goal — the model decides which tool to call next, observes the result, updates its plan, and repeats until the goal is satisfied or a termination condition fires. Anthropic’s “Building effective agents” draws a sharp line between workflows (LLMs orchestrated through code-defined paths) and agents (LLMs that dynamically direct their own processes and tool usage); a system is only “agentic” when control over the next step lives with the model rather than the developer. The minimal recipe is unchanged across vendors: one chat model + a tool schema + a runner that feeds results back as tool_result messages + a stop condition.

Why it matters#

Agents are the unit of composition for any task that can’t be solved by a single prompt — multi-step research, codebase refactors, ticket triage, data extraction across messy sources, browser or computer automation, anything that needs to react to intermediate observations. They turn a stateless completion API into a goal-driven worker, which is why almost every “AI feature” shipped since 2024 (coding assistants, customer-support copilots, browser agents like Operator, deep-research products) is some flavour of agent loop under the hood. Picking the right abstraction matters: Anthropic’s research finding — repeated across LangChain, OpenAI, and crewAI post-mortems — is that simpler, composable patterns beat heavyweight frameworks for most production use cases, and that complexity should only be added when measurable evaluation says it pays off.

How it works#

An agent loop is a small state machine that the model drives.

System prompt + goal. The developer seeds the conversation with a role, constraints, and the user request. Tool schemas (JSON Schema for OpenAI/Anthropic, function signatures for SDKs) are passed alongside the messages so the model knows what’s callable.
Plan / act. The model emits either a final answer or a tool_use block — name + arguments. The classic ReAct pattern (Yao et al., 2022) interleaves a visible “Thought:” before each action so the trace is auditable; Toolformer (Schick et al., 2023) showed models can learn when to call tools without explicit scaffolding.
Execute. The harness runs the tool — code interpreter, shell, HTTP call, vector search, sub-agent, MCP server — and returns a tool_result content block. MCP (Model Context Protocol) is the emerging interop standard: any MCP server is a drop-in tool surface for any MCP-aware agent.
Observe / update. The result is appended to the message list; the model re-reads the trajectory, updates its plan, and emits the next action. Long traces get compacted, summarised, or offloaded to memory — a key-value store, a vector index, or a scratchpad file.
Terminate. The loop ends when the model returns stop_reason: end_turn, when a max_turns budget is hit, when a guardrail trips, or when a human-in-the-loop step rejects an action.

Patterns layer on top: single-agent + tools (the default), router (one agent chooses among specialists), multi-agent debate / review (AutoGen’s signature pattern), role-based crews (crewAI’s planner → researcher → writer chain), graph-based stateful workflows (LangGraph’s directed graph with checkpoints and time-travel), and sub-agents (Claude Code’s Task tool, Codex’s /agent) for context-window isolation and parallelism.

Frameworks land on different points of the trade-off curve:

Claude Agent SDK — safety-first, MCP-native, ships computer use; locked to Claude models.
OpenAI Agents SDK — clean handoff model, built-in tracing and guardrails; locked to OpenAI models.
LangGraph — fully model-agnostic, stateful graphs with checkpointing and time-travel debugging via LangSmith.
AutoGen / crewAI / LlamaIndex / Haystack — opinionated higher-level surfaces for multi-agent, role-based, document-centric, or pipeline-DAG patterns respectively.

Evaluation has matured alongside the runtimes. SWE-bench Verified (500 real GitHub issues) is the de-facto coding-agent benchmark — Claude Sonnet 4.5 leads at ~77 % as of 2026, up from 4 % three years earlier. Adversarial variants like SWE-ABS show the headline numbers drop ~15 points under strengthened test suites, so always pair a public benchmark with task-specific evals before trusting an agent in production.

Common pitfalls#

No termination condition. Multi-agent runs without max_turns or an explicit termination_condition will loop until token budgets explode. Always cap the loop and alarm on runaway cost.
Reaching for a framework before the prompt works. A single well-scoped tool-use call often beats a 6-agent crew. Start with the model’s native tool-use API; promote to a framework only when evals justify it.
Vague tool descriptions. Tools are selected by the model from their description field. “Get weather” is worse than “Get current weather for a city; call this whenever the user asks about temperature, rain, or forecasts.” Write descriptions from the model’s perspective.
Overlapping agent roles. Two agents with near-identical role/goal produce contradictory output. Each agent in a crew needs a clearly differentiated responsibility, or collapse them into one.
Context-window poisoning. Long tool traces, retries, and verbose errors crowd out the task. Spawn a sub-agent (Claude Code Task, Codex sub-agent, LangGraph sub-graph) for sub-problems, and keep the parent’s context lean.
Skipping evaluation. Headline benchmark scores don’t predict your workload. Build a small, task-specific eval set early and re-run it on every prompt or tool change — agents regress silently.
Conflating agents with workflows. If every step is pre-determined, you don’t have an agent — you have a chain. That’s often better (cheaper, more predictable). Only adopt agency where dynamic decision-making genuinely helps.

Where to go next#

Sibling concepts, tool-specific cheat sheets, and external references for going deeper.

/concepts/rag — retrieval is the most common tool an agent calls; RAG and agents compose constantly.
/concepts/api — every tool an agent invokes is an API call under the hood.
/sections/claude-api/tool-use — the canonical tool-use loop on the Claude API.
/sections/claude-code/subagents — Claude Code’s Task tool for spawning isolated child agents.
/sections/claude-code/mcp — Model Context Protocol, the standard tool-surface for cross-vendor agents.
/sections/codex/subagents — Codex CLI’s /agent and recursive codex exec patterns.
/sections/frameworks/agent-frameworks-comparison — decision matrix across LangChain, LlamaIndex, AutoGen, crewAI, Haystack, Semantic Kernel.
/sections/ai/langchain, /sections/ai/autogen, /sections/ai/crewai — per-framework cheat sheets.

Sources#

References consulted while writing this concept page. Links open in a new tab.

Anthropic — Building Effective Agents — Source of the canonical workflows-vs-agents distinction and the “start simple, add complexity only when measured” principle used in the Definition and Why-it-matters sections.
Yao et al. — ReAct: Synergizing Reasoning and Acting in Language Models (arXiv 2210.03629) — Foundational paper for the interleaved Thought/Action/Observation loop described in How-it-works.
Schick et al. — Toolformer: Language Models Can Teach Themselves to Use Tools — Reference for self-supervised tool invocation, contrasted with explicit ReAct scaffolding.
SWE-bench Verified (Agentic Coding) Leaderboard — Source of the 2026 leaderboard standings (Claude Sonnet 4.5 ~77 %) cited in the evaluation discussion.
Morph — SWE-Bench Explained: Verified, Pro, and the 2026 Leaderboard — Background on the benchmark family, the 4 % → 80 % climb, and the SWE-ABS adversarial variant.
QubitTool — 2026 AI Agent Framework Showdown — Comparative source for Claude Agent SDK / OpenAI Agents SDK / LangGraph trade-offs (orchestration model, state, model-lock-in).
LangChain Docs — Comparison with Claude Agent SDK — Authoritative side-by-side on graph-based vs tool-chain agent architectures.
Model Context Protocol (MCP) — Spec home for the interop layer referenced as the emerging cross-vendor tool surface.

g h	home
g p	Programming section
g p	Python section
g j	JavaScript section
g t	TypeScript section
g o	OS section
g l	Linux section
g w	Windows section
g z	z/OS section
g o	macOS section
g a	AI section
g c	Claude Code section
g c	Codex CLI section
g c	Claude API section
g p	Prompting section
g f	Frameworks section
g p	Packages section
g p	Pip (Python) section
g p	npm (Node) section
g p	Cargo (Rust) section
g p	Go modules section
g g	graph view
g t	tags index

⌘K / /	open search palette
t	cycle theme (dark → light → system)
?	toggle this panel

[ / ]	previous / next sheet in section
j / k	scroll down / up