skip to content

LLM-driven systems that pursue a goal by interleaving reasoning, tool calls, and observations inside a loop — and that decide for themselves which step to take next.

AI Agents#

Definition#

An AI agent is an LLM placed inside a loop with tools, memory, and an explicit goal — the model decides which tool to call next, observes the result, updates its plan, and repeats until the goal is satisfied or a termination condition fires. Anthropic’s “Building effective agents” draws a sharp line between workflows (LLMs orchestrated through code-defined paths) and agents (LLMs that dynamically direct their own processes and tool usage); a system is only “agentic” when control over the next step lives with the model rather than the developer. The minimal recipe is unchanged across vendors: one chat model + a tool schema + a runner that feeds results back as tool_result messages + a stop condition.

Why it matters#

Agents are the unit of composition for any task that can’t be solved by a single prompt — multi-step research, codebase refactors, ticket triage, data extraction across messy sources, browser or computer automation, anything that needs to react to intermediate observations. They turn a stateless completion API into a goal-driven worker, which is why almost every “AI feature” shipped since 2024 (coding assistants, customer-support copilots, browser agents like Operator, deep-research products) is some flavour of agent loop under the hood. Picking the right abstraction matters: Anthropic’s research finding — repeated across LangChain, OpenAI, and crewAI post-mortems — is that simpler, composable patterns beat heavyweight frameworks for most production use cases, and that complexity should only be added when measurable evaluation says it pays off.

How it works#

An agent loop is a small state machine that the model drives.

  1. System prompt + goal. The developer seeds the conversation with a role, constraints, and the user request. Tool schemas (JSON Schema for OpenAI/Anthropic, function signatures for SDKs) are passed alongside the messages so the model knows what’s callable.
  2. Plan / act. The model emits either a final answer or a tool_use block — name + arguments. The classic ReAct pattern (Yao et al., 2022) interleaves a visible “Thought:” before each action so the trace is auditable; Toolformer (Schick et al., 2023) showed models can learn when to call tools without explicit scaffolding.
  3. Execute. The harness runs the tool — code interpreter, shell, HTTP call, vector search, sub-agent, MCP server — and returns a tool_result content block. MCP (Model Context Protocol) is the emerging interop standard: any MCP server is a drop-in tool surface for any MCP-aware agent.
  4. Observe / update. The result is appended to the message list; the model re-reads the trajectory, updates its plan, and emits the next action. Long traces get compacted, summarised, or offloaded to memory — a key-value store, a vector index, or a scratchpad file.
  5. Terminate. The loop ends when the model returns stop_reason: end_turn, when a max_turns budget is hit, when a guardrail trips, or when a human-in-the-loop step rejects an action.

Patterns layer on top: single-agent + tools (the default), router (one agent chooses among specialists), multi-agent debate / review (AutoGen’s signature pattern), role-based crews (crewAI’s planner → researcher → writer chain), graph-based stateful workflows (LangGraph’s directed graph with checkpoints and time-travel), and sub-agents (Claude Code’s Task tool, Codex’s /agent) for context-window isolation and parallelism.

Frameworks land on different points of the trade-off curve:

  • Claude Agent SDK — safety-first, MCP-native, ships computer use; locked to Claude models.
  • OpenAI Agents SDK — clean handoff model, built-in tracing and guardrails; locked to OpenAI models.
  • LangGraph — fully model-agnostic, stateful graphs with checkpointing and time-travel debugging via LangSmith.
  • AutoGen / crewAI / LlamaIndex / Haystack — opinionated higher-level surfaces for multi-agent, role-based, document-centric, or pipeline-DAG patterns respectively.

Evaluation has matured alongside the runtimes. SWE-bench Verified (500 real GitHub issues) is the de-facto coding-agent benchmark — Claude Sonnet 4.5 leads at ~77 % as of 2026, up from 4 % three years earlier. Adversarial variants like SWE-ABS show the headline numbers drop ~15 points under strengthened test suites, so always pair a public benchmark with task-specific evals before trusting an agent in production.

Common pitfalls#

  1. No termination condition. Multi-agent runs without max_turns or an explicit termination_condition will loop until token budgets explode. Always cap the loop and alarm on runaway cost.
  2. Reaching for a framework before the prompt works. A single well-scoped tool-use call often beats a 6-agent crew. Start with the model’s native tool-use API; promote to a framework only when evals justify it.
  3. Vague tool descriptions. Tools are selected by the model from their description field. “Get weather” is worse than “Get current weather for a city; call this whenever the user asks about temperature, rain, or forecasts.” Write descriptions from the model’s perspective.
  4. Overlapping agent roles. Two agents with near-identical role/goal produce contradictory output. Each agent in a crew needs a clearly differentiated responsibility, or collapse them into one.
  5. Context-window poisoning. Long tool traces, retries, and verbose errors crowd out the task. Spawn a sub-agent (Claude Code Task, Codex sub-agent, LangGraph sub-graph) for sub-problems, and keep the parent’s context lean.
  6. Skipping evaluation. Headline benchmark scores don’t predict your workload. Build a small, task-specific eval set early and re-run it on every prompt or tool change — agents regress silently.
  7. Conflating agents with workflows. If every step is pre-determined, you don’t have an agent — you have a chain. That’s often better (cheaper, more predictable). Only adopt agency where dynamic decision-making genuinely helps.

Where to go next#

Sibling concepts, tool-specific cheat sheets, and external references for going deeper.

Sources#

References consulted while writing this concept page. Links open in a new tab.

See also

Used in (7)