What is an AI agent (and what isn't)
An AI agent is a system where an LLM decides which actions to take and in what order — calling tools, writing to memory, deciding when it's done — instead of executing a fixed pipeline. The "agent" label is overused: a single LLM call that returns JSON is not an agent. A workflow with one tool call is not an agent. A real agent has at least: dynamic tool selection, multi-step planning, and the ability to detect and recover from its own errors.
If you're shipping something that fits the JSON-API-call pattern, build the workflow — it'll be cheaper, faster, and more reliable. Agents are the right answer when the task genuinely needs the LLM to plan dynamically.
The 2026 production stack
| Layer | What we use | Why |
|---|---|---|
| Orchestration | LangGraph, OpenAI Agents SDK, or hand-rolled Python state machine | Production-grade error recovery. Visual debugging. Resumable runs. |
| Reasoning model | Claude 3.5 Sonnet (default), GPT-4o (cheaper tools), Claude Opus 4 (hard reasoning) | Best quality-cost-latency at each tier. See our model benchmark. |
| Tool calling | Native function calling on OpenAI/Anthropic | Mature, fast, ~96% reliability. Don't roll your own. |
| Memory | Postgres + pgvector for long-term; Redis for working memory | Boring, fast, transactional. Skip the agent-memory startups. |
| RAG | pgvector or Qdrant + Cohere Rerank 3.5 | Reranking lifts quality more than any embedding-model upgrade. |
| Evals | LangSmith, Braintrust, or custom | Track quality regressions across deployments. Non-negotiable. |
| Observability | LangSmith, Helicone, or OpenTelemetry | You cannot debug agents without trace-level logs. |
| Hosting | Modal, Render, Railway, or AWS Fargate | Agents need long-running workers + queues. Don't deploy to Vercel. |
Core design patterns
- ReAct loop — reason, act, observe, repeat. The foundational pattern. Most agents are ReAct underneath.
- Planner / executor split — one LLM call plans the steps, another (or many) executes. Better for long-horizon tasks.
- Reflection — agent reviews its own output and decides whether to retry. Adds 30–50% to latency but reliably improves quality on hard tasks.
- Tool-use only — no free-form reasoning, just sequential tool calls. Surprisingly powerful for well-bounded tasks.
- Human-in-the-loop — agent pauses for approval before consequential actions. Mandatory for anything customer-affecting or money-moving.
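To make the ReAct loop and the human-in-the-loop gate concrete, here is a minimal sketch. Everything in it is illustrative: `scripted_model` is a hypothetical stand-in for a real LLM call, `Step`, `TOOLS`, and `NEEDS_APPROVAL` are invented names, and a production loop would add tracing, retries, and real tool schemas.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """One model decision: call a tool, or finish with an answer."""
    tool: Optional[str] = None
    args: Optional[dict] = None
    answer: Optional[str] = None


def scripted_model(history: list) -> Step:
    """Hypothetical stand-in for an LLM call (assumption, not a real API)."""
    if not any(h.startswith("observation:") for h in history):
        return Step(tool="search_inventory", args={"query": "blue widget"})
    return Step(answer="Blue widget is in stock (SKU-1234).")


def search_inventory(query: str) -> str:
    return f"1 match for {query!r}: SKU-1234, 12 in stock"


def issue_refund(order_id: str) -> str:
    return f"refund issued for {order_id}"


TOOLS: dict = {"search_inventory": search_inventory, "issue_refund": issue_refund}
NEEDS_APPROVAL = {"issue_refund"}  # consequential tools pause for a human


def run_agent(task: str,
              model: Callable[[list], Step] = scripted_model,
              approve: Callable[[str, Optional[dict]], bool] = lambda tool, args: False,
              max_steps: int = 5) -> str:
    history = [f"task: {task}"]
    for _ in range(max_steps):
        step = model(history)                            # reason
        if step.answer is not None:
            return step.answer
        if step.tool not in TOOLS:
            history.append(f"observation: unknown tool {step.tool!r}")
            continue
        if step.tool in NEEDS_APPROVAL and not approve(step.tool, step.args):
            history.append(f"observation: {step.tool} was not approved")
            continue
        result = TOOLS[step.tool](**(step.args or {}))   # act
        history.append(f"observation: {result}")         # observe
    raise RuntimeError("step budget exhausted without an answer")
```

The `max_steps` cap is the cheapest guardrail in the loop: a model that never produces an answer raises instead of running forever. The default `approve` callback denies everything, so a consequential tool only runs when the caller wires in a real approval step.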
Tool design — the part everyone gets wrong
The single biggest determinant of agent reliability is tool design. Most agents fail not because the model is dumb but because the tools are badly designed. Rules we live by:
- Tool names should read like a sentence: `search_inventory(query)` beats `db_q(s)`.
- Every parameter needs a clear description and example. The model uses docstrings, not just types.
- Tool outputs should be parseable and bounded. Returning 50K tokens of JSON wrecks context.
- Failures should return helpful errors. `"Item not found - did you mean SKU-1234?"` beats `"404"`.
- Idempotent by default. The agent will retry. Plan for it.
- Number of tools matters. Agents start degrading past ~30 tools. Bundle related ones into a single tool with an action parameter.
Memory & state
Three layers, in priority order:
- Working memory — the current conversation / task context. Keep it in Redis with a TTL.
- Episodic memory — past interactions, indexed for retrieval. pgvector + a "what is this conversation about" embedding works well.
- Semantic memory — distilled facts about the user or domain. Surfaced via a retrieval step at the start of each task.
Skip the "agent operating system" platforms unless you've already hit a real limitation. They solve problems most agents don't have.
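As a rough illustration of the first two layers, the sketch below uses an in-process dictionary as a stand-in for a Redis key with a TTL, and hand-written two-dimensional vectors in place of real embeddings stored in pgvector. The `WorkingMemory` class and the `recall` function are hypothetical names, not part of any library.

```python
import math
import time


class WorkingMemory:
    """In-process stand-in for a Redis key with a TTL (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._items: dict = {}  # task_id -> (expires_at, messages)

    def get(self, task_id: str) -> list:
        entry = self._items.get(task_id)
        if entry is None or entry[0] <= self.clock():
            self._items.pop(task_id, None)  # expired or missing
            return []
        return list(entry[1])

    def append(self, task_id: str, message: str) -> None:
        messages = self.get(task_id)
        messages.append(message)
        # Each write refreshes the expiry, like re-setting a TTL on the key.
        self._items[task_id] = (self.clock() + self.ttl, messages)


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall(query_embedding: list, episodes: list, k: int = 1) -> list:
    """Episodic recall: return summaries of the k most similar past conversations."""
    ranked = sorted(episodes,
                    key=lambda e: cosine(query_embedding, e["embedding"]),
                    reverse=True)
    return [e["summary"] for e in ranked[:k]]
```

In a real deployment the similarity ranking would be a database query rather than a Python sort. The shape of the data stays the same: one summary embedding per past conversation, looked up at the start of a task.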
Evals and guardrails
If you can't measure agent quality, you can't ship agents. Minimum eval setup:
- A golden dataset of 50–200 input → expected output pairs, curated by hand.
- An automated scorer (LLM-as-judge or rules-based) that runs on every PR.
- A regression dashboard that shows score deltas across deployments.
- Output guardrails: PII redaction, profanity filters, refusal patterns for out-of-scope topics.
- Cost/latency budgets per agent turn — fail loud when exceeded.
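A minimal sketch of that setup, assuming a rules-based scorer and a hypothetical `fake_agent` in place of a real agent. The dataset, the budget thresholds, and all class names here are invented for illustration; a real golden set would be far larger and curated by hand.

```python
from dataclasses import dataclass


@dataclass
class Case:
    prompt: str
    expected: str


@dataclass
class TurnResult:
    output: str
    cost_usd: float
    latency_s: float


class BudgetExceeded(Exception):
    """Raised when a single agent turn goes over its cost or latency budget."""


# Tiny illustrative golden dataset.
GOLDEN = [Case("What is 2 + 2?", "4"), Case("Capital of France?", "Paris")]


def fake_agent(prompt: str) -> TurnResult:
    """Hypothetical stand-in for a real agent run."""
    answers = {"What is 2 + 2?": "The answer is 4.",
               "Capital of France?": "Paris is the capital."}
    return TurnResult(answers.get(prompt, "I don't know"), cost_usd=0.002, latency_s=0.4)


def score(expected: str, output: str) -> float:
    """Rules-based scorer: 1.0 if the expected string appears in the output."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def run_evals(agent, cases, max_cost_usd: float = 0.01, max_latency_s: float = 5.0) -> float:
    if not cases:
        raise ValueError("golden dataset is empty")
    total = 0.0
    for case in cases:
        result = agent(case.prompt)
        # Fail loud: a budget breach stops the run instead of being averaged away.
        if result.cost_usd > max_cost_usd or result.latency_s > max_latency_s:
            raise BudgetExceeded(f"{case.prompt!r}: ${result.cost_usd}, {result.latency_s}s")
        total += score(case.expected, result.output)
    return total / len(cases)
```

In CI, something like `assert run_evals(agent, GOLDEN) >= 0.9` on every PR would turn a quality regression into a failed build. The threshold is a judgment call for each team.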
Deployment & observability
Three rules from running agents in production:
- Trace everything. Every LLM call, tool call, retry, and decision needs to be logged with input + output + duration + cost. LangSmith or OpenTelemetry.
- Queue agent runs. Synchronous HTTP requests die under load. Push work onto a queue (Redis, SQS, BullMQ) and stream results back.
- Have a kill switch. Every agent should be one environment variable away from being stopped if it goes sideways.
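The tracing and kill-switch rules can be sketched with a single decorator. This is an assumption-heavy illustration: `AGENT_DISABLED` is a made-up variable name, `TRACE_LOG` is an in-memory list standing in for a trace backend, and cost is left out because it would come from the model provider's token usage.

```python
import functools
import os
import time

TRACE_LOG: list = []  # stand-in for a trace backend; illustrative only


class AgentDisabled(RuntimeError):
    """Raised when the kill switch is on."""


def check_kill_switch() -> None:
    # AGENT_DISABLED is a hypothetical variable name.
    if os.environ.get("AGENT_DISABLED", "").lower() in {"1", "true", "yes"}:
        raise AgentDisabled("agent stopped via AGENT_DISABLED")


def traced(kind: str):
    """Record input, output or error, and duration for every wrapped call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            check_kill_switch()  # checked before any work is done
            record = {"kind": kind, "name": fn.__name__,
                      "input": {"args": args, "kwargs": kwargs}}
            start = time.perf_counter()
            try:
                record["output"] = fn(*args, **kwargs)
                return record["output"]
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                record["duration_s"] = time.perf_counter() - start
                TRACE_LOG.append(record)
        return wrapper
    return decorator


@traced("tool")
def search_inventory(query: str) -> str:
    return f"1 match for {query!r}"
```

Because the switch is checked on every wrapped call, setting the variable stops an agent at its next LLM or tool call without a redeploy. Failed calls are traced as well, since the `finally` block runs on both paths.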
Top 7 mistakes we see
- Building an agent when a workflow would do — adds 5x cost and 3x failure modes.
- Using GPT-4o-mini for the orchestration model — false economy, agent runs cost more in retries.
- Skipping evals — you'll regress quality without knowing it.
- Letting the agent decide which model to call — too much variance, save it for true multi-model agents.
- Not bounding context — agents accumulate trash context and start hallucinating.
- Tools that return unparseable strings — the model retries 4 times then gives up.
- No human-in-the-loop for consequential actions — the agent will eventually do something dumb.