AI Agent Development: Building Production Systems Beyond the Demo

A chatbot is a demo. An AI agent that books appointments, reconciles invoices, or routes claims is a production system - with retry semantics, observability, audit trails, and a cost ceiling. This article describes how MIT-DEV builds AI agents for production use: architecture, evaluation, guardrails, and the operational discipline that separates a working agent from a Friday-afternoon demo.

Demo vs Production: Why the Gap Is Wider Than for Classical Software

Conventional software is deterministic: given the same input, you get the same output. AI agents are probabilistic. The same prompt twice can return two different answers, two different tool calls, and two different downstream effects. Everything downstream - observability, retries, idempotency, evaluation - needs to be designed for that fundamental shift.

The Anthropic engineering team's Building Effective Agents makes the point directly: most production-grade systems do not need agentic autonomy; they need a workflow with one or two LLM-driven steps. Use the simplest construct that solves the problem.

When to Use What

Single LLM call: classification, summarisation, extraction. Most "AI features" should stop here.
Workflow with LLM steps: parallel calls, routing, prompt chaining. Deterministic graph, probabilistic nodes.
Agent: LLM decides its own next tool call inside a loop, with terminating conditions. Reserved for open-ended problems where the workflow graph cannot be drawn in advance.

Production Architecture: The Pieces You Cannot Skip

Tool Layer

Every external action is a typed tool with input validation, idempotency, and a permission boundary. Tools fail in predictable ways the agent can react to; tools never call irreversible operations without a confirmation token.

Memory Layer

Short-term context lives in the conversation. Long-term memory lives in a typed store - PostgreSQL plus a vector index where similarity matters. Never store secrets in the conversation context; never let the agent be the source of truth for a record that already lives in a database.

Evaluation Layer

Every production agent ships with a benchmark of held-out cases that runs in CI. Regression on the benchmark blocks deployment. Without this, prompt changes are guesses.

Observability Layer

Trace every LLM call: model, prompt template version, tools called, tokens in/out, latency, cost. Without this, you cannot debug, optimise, or budget.

Guardrails

PII redaction, prompt-injection defence (treat untrusted input as untrusted, exactly as you would in a SQL query), rate limiting, content filtering. OWASP's Top 10 for LLM Applications is the right starting checklist.

Cost Control: The Invisible Production Killer

An agent that retries three tool calls per turn, at four turns per user interaction, against a frontier model, can quietly cost $1–2 per interaction. Multiply by 10,000 users and the AWS bill is not the surprise - the model bill is. We instrument cost per interaction from the first sprint and build budget cut-offs before launch, not after.

Model Selection: Right Model for the Right Step

Classification and routing → small fast model
Tool selection and short reasoning → mid-tier model
Open-ended planning, multi-step reasoning, high-stakes decisions → frontier model
Embedding generation → dedicated embedding model

Mixing models in one workflow is the single largest cost lever after caching. We treat the model choice per step as an architecture decision, recorded in an ADR.

MIT-DEV's AI Track Record

Production AI workflows across fintech (KYC, fraud triage), travel (itinerary generation), and operations (document extraction)
Evaluation benchmarks in CI on every agent we ship; regressions block release
Cost ceiling per interaction documented in the ADR before launch
Full observability (prompt, model, tools, tokens, cost, latency) by default

Frequently Asked Questions

Do we need an agent or a workflow? If you can draw the decision graph on a whiteboard, you need a workflow, not an agent. Reserve agentic autonomy for problems where the graph itself is the unknown.

Which model do you default to? We default to Claude Sonnet for general reasoning, Haiku for cheap classification, and Opus for the hardest planning steps. Other providers are evaluated case by case.

How do you stop prompt injection? Treat any text that did not originate from a trusted source as untrusted; sanitise before insertion into the prompt; never grant tool-call permissions on the basis of instructions found in user content.

Can you build on top of an existing LLM stack? Yes. We have integrated with LangChain, LangGraph, custom orchestrators, and direct Anthropic/OpenAI SDKs. The orchestrator choice is a per-project decision.

Ready to Move From Demo to Production?

Book an AI scoping session - we will sketch the workflow or agent boundary, model selection, cost ceiling, and evaluation plan together.