# 12-Factor Agents: Patterns of reliable LLM applications — Dex Horthy, HumanLayer

Tom Brewer

These notes are based on the YouTube video published by AI Engineer.


Key Takeaways

  • Agents are just software – building reliable LLM‑driven agents boils down to classic software‑engineering practices: clear control flow, explicit state, and deterministic orchestration.
  • Own every token – the quality of an agent hinges on the prompts and context you feed the model. Treat prompts like hand‑crafted code; iterate, test, and optimise them. As explained in Why Your AI Gets Dumber After 10 Minutes, context management matters—owning each token pays off in cost and reliability.
  • Treat tools as deterministic functions – rather than “magical” external calls, view tool usage as a pure JSON‑in → JSON‑out transformation that your code executes.
  • Control‑flow ownership is critical – implement your own loops, switches, and DAG orchestration so you can pause, resume, retry, and serialize state safely.
  • Small, focused agents win – keep individual agent loops short (roughly 3‑10 steps) and embed them in an otherwise deterministic pipeline.
  • Error handling & context hygiene – surface tool errors to the model deliberately, but prune noisy stack traces before they re‑enter the context window.
  • Human‑in‑the‑loop is a first‑class feature – expose a clear “tool‑or‑human” decision point early in the generation so the model can ask for clarification or escalation.
  • Multi‑channel reach – let agents surface through email, Slack, Discord, SMS, etc., meeting users where they already work. For practical tips on shipping such multi‑channel agents, see Ship Production Software in Minutes, Not Months.
  • Stateless reducers → owned state – keep the LLM itself stateless; persist and manage all execution state yourself (e.g., in a DB).

Detailed Explanations of Core Concepts

1. The “JSON‑from‑sentence” primitive (Factor 1)

The most reliable thing an LLM can do today is translate a natural‑language request into a well‑structured JSON payload:

{
  "action": "create_ticket",
  "priority": "high",
  "summary": "User cannot log in"
}

The downstream code consumes this JSON deterministically. The surrounding factors (prompt design, error handling, etc.) make the whole system robust.
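
As a minimal sketch of that downstream consumption, assuming the model was instructed to emit only the JSON object above (the field names and the create_ticket helper are illustrative, not from the talk):

import json

REQUIRED_FIELDS = {"action", "priority", "summary"}

def parse_action(raw_output: str) -> dict:
    """Parse and validate the model's JSON before any deterministic code runs."""
    payload = json.loads(raw_output)              # raises json.JSONDecodeError on malformed output
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"model output missing fields: {missing}")
    return payload

payload = parse_action('{"action": "create_ticket", "priority": "high", "summary": "User cannot log in"}')
if payload["action"] == "create_ticket":
    create_ticket(priority=payload["priority"], summary=payload["summary"])   # hypothetical downstream function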


2. Own Your Prompts (Factor 2)

  • Prompt as code: Think of a prompt as a function definition. The fewer “hand‑wavy” instructions, the more predictable the output.
  • Iterative refinement: Use A/B tests, token‑level analysis, and prompt‑engineering tools to converge on a “banger” prompt. The five techniques that separate top agentic engineers—covered in The 5 Techniques Separating Top Agentic Engineers Right Now—are especially useful here.
  • Context engineering: Decide how to pack history, memory, RAG results, and system instructions into the OpenAI messages format (or equivalent). Example layout:
[
  {"role": "system", "content": "You are a helpful assistant that returns JSON only."},
  {"role": "user", "content": "Schedule a meeting with Alice tomorrow at 10 am."},
  {"role": "assistant", "content": "{\"action\":\"schedule_meeting\",\"participants\":[\"Alice\"],\"time\":\"2026-01-18T10:00:00\"}"}
]
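
One sketch of the "prompt as code" idea is a context-builder that assembles this message list explicitly on every call; the function name and arguments below are illustrative, not from the talk:

SYSTEM_PROMPT = "You are a helpful assistant that returns JSON only."

def build_messages(history: list[dict], rag_snippets: list[str], user_input: str) -> list[dict]:
    """Deterministically pack system instructions, retrieved context, and history."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if rag_snippets:
        # Retrieved documents go in as explicit, inspectable context; no hidden tokens.
        messages.append({"role": "system", "content": "Relevant context:\n" + "\n".join(rag_snippets)})
    messages.extend(history)                      # prior turns, already trimmed or summarised by you
    messages.append({"role": "user", "content": user_input})
    return messages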

3. Tool Use Is Not “Magical” (Factor 4)

  • Misconception: Treating tool calls as a mystical “agent‑to‑world” interface leads to brittle pipelines.
  • Correct view: The LLM emits JSON that your deterministic code executes (e.g., an HTTP request, a database query). The tool itself is just another pure function.
  • Implementation sketch:
def run_step(model_output):
    # model_output is parsed JSON, e.g. {"tool": "search", "query": "latest LLM papers"}
    if model_output["tool"] == "search":
        return web_search(model_output["query"])
    elif model_output["tool"] == "db_query":
        return db.execute(model_output["sql"])
    # … more deterministic branches
    raise ValueError(f"unknown tool: {model_output['tool']}")
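
Used with the message list from Factor 2, a single turn then looks roughly like this (a sketch, not the talk's exact code):

model_output = {"tool": "search", "query": "latest LLM papers"}   # parsed from the model's response
observation = run_step(model_output)                              # deterministic execution
messages.append({"role": "user", "content": f"Tool result: {observation}"})   # result re-enters the context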

4. Own the Control Flow (Factor 8)

  • DAG vs. naïve loop: A simple “LLM decides next step, feed back, repeat” works only for tiny workflows.
  • Explicit DAG orchestration: Model each step as a node with defined inputs/outputs. Use a lightweight orchestrator (or a custom state machine) to enforce ordering, retries, and branching.
  • Pause/Resume: Serialize the current context window and any pending external calls to a DB. When a long‑running tool finishes, load the saved state, append the result, and continue.
graph LR
A[Receive Event] --> B[Prompt LLM for next step]
B --> C{Tool Call?}
C -->|Yes| D[Execute deterministic tool]
D --> E[Append result to context]
E --> B
C -->|No| F[Return final answer]
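
A sketch of the pause/resume idea, assuming the run state is a plain JSON-serialisable dict and the store is a simple key/value table (all names here are illustrative):

import json, sqlite3

db = sqlite3.connect("agent_state.db")
db.execute("CREATE TABLE IF NOT EXISTS runs (run_id TEXT PRIMARY KEY, state TEXT)")

def pause(run_id: str, state: dict) -> None:
    """Serialize the context window and any pending call so the process can exit safely."""
    db.execute("INSERT OR REPLACE INTO runs VALUES (?, ?)", (run_id, json.dumps(state)))
    db.commit()

def resume(run_id: str, tool_result: dict) -> dict:
    """Reload the saved state, append the long-running tool's result, and continue."""
    (raw,) = db.execute("SELECT state FROM runs WHERE run_id = ?", (run_id,)).fetchone()
    state = json.loads(raw)
    state["messages"].append({"role": "user", "content": f"Tool result: {json.dumps(tool_result)}"})
    return state   # hand back to the orchestrator loop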

5. Small, Focused Agents & Hybrid Pipelines (Factor 10)

  • Pattern: Deterministic CI/CD pipeline → LLM decides a small, ambiguous step → human approval (if needed) → back to deterministic code.
  • Benefits: Keeps context windows tiny, isolates uncertainty to a bounded sub‑task, and makes debugging straightforward.
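
A sketch of such a bounded loop embedded in otherwise deterministic code, reusing the sketches from the earlier factors (call_llm and the exact step budget are illustrative):

MAX_STEPS = 10   # keep the agent loop small and bounded

def resolve_ambiguous_step(task: str) -> dict:
    """Let the LLM handle only the ambiguous sub-task; everything around it stays deterministic."""
    messages = build_messages(history=[], rag_snippets=[], user_input=task)
    for _ in range(MAX_STEPS):
        decision = call_llm(messages)                # returns parsed JSON (Factor 1)
        if decision.get("done"):
            return decision                          # hand control back to the deterministic pipeline
        observation = run_step(decision)             # deterministic tool execution (Factor 4)
        messages.append({"role": "user", "content": f"Tool result: {observation}"})
    raise RuntimeError("agent exceeded step budget; escalate to a human")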

6. Error Propagation & Context Hygiene (Factor 9)

  • Surface errors: When a tool fails, inject a concise error summary into the next prompt so the model can retry or ask for clarification.
  • Avoid noise: Do not dump full stack traces into the context; they consume tokens and confuse the model.
error_summary = {"error":"Timeout while calling external API","retry":True}
prompt_context.append(error_summary) # concise, useful info only
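
One way to produce that summary, assuming full tracebacks go to your own logs rather than the context window (a sketch; names are illustrative):

import logging, traceback

logger = logging.getLogger("agent")

def safe_tool_call(fn, *args, **kwargs):
    """Run a tool; on failure, log the full trace but hand the model only a compact summary."""
    try:
        return {"ok": True, "result": fn(*args, **kwargs)}
    except Exception as exc:
        logger.error("tool failed: %s", traceback.format_exc())   # full detail stays out of the prompt
        return {"ok": False, "error": type(exc).__name__, "detail": str(exc)[:200], "retry": True}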

7. Human‑in‑the‑Loop as a First‑Class Decision (Factor 7)

  • Early branching: The model should decide immediately whether to:

    1. Return a final answer,
    2. Ask the user for clarification,
    3. Escalate to a human operator.
  • Natural‑language token: Encode this decision in the first token(s) of the output, e.g., "human" vs. "tool" vs. "done".
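
One way to make that decision explicit is to require a small envelope in the model's output and branch on it before anything else; the field names and helper functions below are illustrative, not a prescribed schema:

decision = call_llm(messages)        # e.g. {"next": "human", "question": "Which Alice did you mean?"}

if decision["next"] == "done":
    deliver(decision["answer"])                      # final answer back to the user
elif decision["next"] == "human":
    ask_for_clarification(decision["question"])      # pause the run and wait for a person
elif decision["next"] == "tool":
    observation = run_step(decision["tool_call"])    # deterministic tool execution (Factor 4)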


8. Multi‑Channel Delivery (Factor 11)

Agents should be reachable via the channels users already use—Slack, Discord, email, SMS, etc. The underlying logic stays the same; only the transport layer changes. For a deeper dive on building multi‑channel bots quickly, see Ship Production Software in Minutes, Not Months.
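
One way to keep the agent logic channel-agnostic is a thin transport adapter per channel; the interface below is a sketch, not a prescribed API:

from typing import Protocol

class Channel(Protocol):
    def receive(self) -> str: ...           # raw user message from Slack, email, SMS, ...
    def send(self, text: str) -> None: ...  # deliver the agent's reply on the same channel

def serve(channel: Channel) -> None:
    """Same agent logic regardless of transport; only the adapter differs."""
    user_input = channel.receive()
    result = resolve_ambiguous_step(user_input)      # the bounded loop from Factor 10
    channel.send(result["answer"])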


9. Stateless LLM, Owned State (Factor 12)

  • Statelessness: The LLM never retains memory between calls. All conversation history, business state, and workflow progress live in your persistence layer.
  • Transducer pattern: Think of the LLM as a pure function that transforms the current state (JSON) into the next action.
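
Seen that way, each turn is a pure reduction over state you own, roughly like this sketch (the state shape and call_llm are illustrative):

import json

def reduce_step(state: dict) -> dict:
    """Pure 'transducer': current state in, next state out. The LLM call itself holds no memory."""
    decision = call_llm(state["messages"])               # stateless model call
    new_messages = state["messages"] + [{"role": "assistant", "content": json.dumps(decision)}]
    return {**state, "messages": new_messages, "last_decision": decision}

# The caller persists the returned state (e.g. via pause() above) before the next reduction.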

10. “Create 12‑Factor Agent” – A Scaffold, Not a Wrapper (Factor 13)

  • Scaffold, not bootstrap: Provide the minimal plumbing (state store, orchestrator, API surface) and let developers own the actual agent code.
  • Goal: Shift the boilerplate (state storage, orchestration plumbing, API surface) to the scaffold so teams keep ownership of the hard AI parts (prompt engineering, flow control) and can focus on domain‑specific logic.

Summary

Dex Horthy’s talk reframes LLM agents as ordinary software systems that happen to use a stateless language model as a pure function. Reliability comes from:

  1. Explicit ownership of prompts, context windows, and execution state.
  2. Deterministic orchestration of tool calls via JSON contracts and a clear control‑flow graph.
  3. Small, focused agent loops that keep LLM involvement limited to the parts of a workflow that truly need natural‑language reasoning.
  4. Robust error handling and human‑in‑the‑loop pathways that are baked into the model’s output format.
  5. Multi‑channel exposure so agents meet users where they already work.

By treating agents as modular, testable software components and applying classic engineering patterns (DAGs, state serialization, retry logic), developers can build production‑grade, customer‑facing LLM applications without relying on heavyweight “agent frameworks.” The 12‑factor checklist serves as a practical wish‑list for any framework aiming to support high‑velocity, high‑reliability AI agent development.

🔗 See Also: Why Your AI Gets Dumber After 10 Minutes 💡 Related: Claude Code Agents: The Feature That Changes Everything


Thanks for reading my notes! Feel free to check out my other notes or contact me via the social links in the footer.

# Frequently Asked Questions

What does “own every token” mean and how can I practically apply it when building an LLM‑driven agent?

“Own every token” means treating the prompt and context as code you control line‑by‑line, rather than leaving the model to guess what to include. Start by writing a minimal system prompt that forces JSON‑only output, then iteratively A/B test variations while measuring token usage and output quality. Use token‑level analysis tools to prune unnecessary words, cache static parts, and explicitly manage RAG snippets so you know exactly which tokens are sent to the model each call.

Why should tool calls be viewed as deterministic JSON‑in → JSON‑out functions, and how do I implement that pattern?

Viewing tools as pure functions removes the “magical” uncertainty of external calls; the LLM only decides *what* to do, not *how* it happens. Define a schema for each tool (e.g., {"tool":"search","query":"…"}) and write a dispatcher that parses the JSON and invokes a deterministic function such as an HTTP request or DB query. The function returns a clean JSON payload that you feed back to the model, keeping the whole pipeline reproducible and testable.

What is meant by “control‑flow ownership” and what are the key steps to build my own loops or DAG orchestration for an agent?

Control‑flow ownership means you, not the LLM, manage the execution sequence, retries, and state persistence. Implement explicit loops (e.g., while step_count < max_steps) that read the model’s JSON output, decide the next action, and optionally pause for human input. Use a DAG library or simple state machine to serialize each step’s inputs/outputs to a database, allowing you to resume, debug, or replay the workflow reliably.

How should I handle errors and keep the LLM’s context window clean to avoid noisy outputs?

Surface tool errors to the model in a concise JSON format (e.g., {"error":"timeout","detail":"search API 504"}) so the LLM can decide whether to retry or ask the user for clarification. Before appending the error back into the conversation, strip stack traces and large payloads that would consume tokens without adding value. Persist full error logs elsewhere (e.g., log service) for debugging while keeping the model’s context focused on actionable information.

Why do small, focused agents (3‑10 steps) work better, and how do I decide the right granularity for my agent’s loop?

Short loops limit the amount of state the LLM must keep in memory, reducing drift and token cost, while making retries and human‑in‑the‑loop hand‑offs simpler. Break a complex task into micro‑agents that each perform a single, well‑defined JSON transformation, then chain them in a deterministic pipeline. Measure average step count during testing; if a loop frequently exceeds ten iterations, consider refactoring the logic into separate agents or adding explicit checkpoints.
