How to build an AI agent that improves itself from user corrections — no model retraining needed. Real production system: two Python scripts, emergent learning behavior.
TL;DR: A client makes a correction. The AI logs it. The third time the same pattern shows up, a rule gets drafted automatically and flagged for review. No model update. No retraining. Two Python scripts. This is the double-loop feedback architecture we run in production — and the behavior that emerged from it was never explicitly coded.
Most AI agents are deployed once and then stay static.
You prompt-engineer your way to a decent output, deploy, and hope nothing changes. Two months later your client is silently working around the AI's blind spots because fixing them was never part of the plan.
This is the single biggest gap between AI demos and production AI systems. The demo is optimized for the moment it was built. Real usage reveals friction the demo never encountered. Without a mechanism to capture and act on that friction, the system slowly drifts toward irrelevance.
The alternative is not complex. It requires two things: a way to observe when the AI is wrong, and a way to act on that observation. Everything else is implementation detail.
We built a sales copilot for a boutique executive consulting firm. The AI drafts outreach messages, scores leads, and prepares meeting briefs. The consultant reviews every output before it goes out. He accepts some, edits others, and occasionally rejects outright.
Every one of those decisions is data. The question was how to turn that data into improvement.
Two scripts and a threshold define the architecture:
Every time the consultant modifies an AI output, correction_handler.py captures the edit. It stores three things: what type of correction it was, the original AI output, and the corrected version.
The correction types are simple but precise: specific labels like tone_too_formal, opening_removed, or length, not a generic "edited."
All of this goes into a PostgreSQL table called correction_log. Each row is a signal. One row is noise. Three rows are a pattern. The handler's job is just to capture the signal — no analysis, no rules, no judgment. Just logging.
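As a sketch, the capture step fits in a few lines. This version assumes a SQLite store for a v0 (the production system uses PostgreSQL); the table layout mirrors the three stored fields plus a timestamp, and the `log_correction` name is illustrative:

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative schema: one row per human correction.
SCHEMA = """
CREATE TABLE IF NOT EXISTS correction_log (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    correction_type TEXT NOT NULL,  -- e.g. 'tone_too_formal', 'opening_removed'
    original        TEXT NOT NULL,  -- what the AI produced
    corrected       TEXT NOT NULL,  -- what the human changed it to
    created_at      TEXT NOT NULL
)
"""

def log_correction(conn, correction_type, original, corrected):
    """Capture one edit. No analysis, no judgment -- just logging."""
    conn.execute(
        "INSERT INTO correction_log (correction_type, original, corrected, created_at) "
        "VALUES (?, ?, ?, ?)",
        (correction_type, original, corrected,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
log_correction(conn, "opening_removed",
               "Quick question: do you have 15 minutes this week? I wanted to share...",
               "I wanted to share...")
```

The handler never interprets the edit; interpretation is the other script's job.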
rule_graduation.py runs on a daily cron. It scans the correction log and looks for patterns. The threshold is simple: if the same correction type occurs 3 or more times, something systematic is happening.
When a pattern crosses the threshold, the script drafts a candidate rule in plain English and flags it for human review:
"Proposed rule: For C-level contacts in manufacturing, use 2-sentence max opening. Observed 5 times: consultant shortened opening by >40% in this contact type. Review and approve?"
If approved, the rule gets committed to the system prompt. The AI's behavior changes for all future outputs of that type. The correction stops recurring.
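The graduation scan is equally small. A minimal sketch, assuming the same correction_log layout; the `scan_for_patterns` name and the seeded sample rows are illustrative:

```python
import sqlite3

THRESHOLD = 3  # occurrences of the same correction type before proposing a rule

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE correction_log (
    correction_type TEXT, original TEXT, corrected TEXT)""")
# Simulate three logged corrections of one type and one stray edit.
conn.executemany(
    "INSERT INTO correction_log VALUES (?, ?, ?)",
    [("opening_removed", "Quick question...", "..."),
     ("opening_removed", "Hope you are well...", "..."),
     ("opening_removed", "Just circling back...", "..."),
     ("tone_too_formal", "Dear Sir...", "Hi...")],
)

def scan_for_patterns(conn, threshold=THRESHOLD):
    """The whole pattern-detection step is a COUNT(*) GROUP BY."""
    rows = conn.execute(
        "SELECT correction_type, COUNT(*) AS n FROM correction_log "
        "GROUP BY correction_type HAVING n >= ?", (threshold,)
    ).fetchall()
    # Draft a plain-English candidate rule per pattern; a human reviews it.
    return [f"Proposed rule for '{t}': observed {n} times. Review and approve?"
            for t, n in rows]

proposals = scan_for_patterns(conn)
```

Run this from a daily cron and route the proposals to wherever the reviewer already works (email, Slack, a dashboard).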
When people talk about making AI "smarter," they usually mean one of two things: better retrieval (RAG) or better weights (fine-tuning). Both are valid. Neither is what we built here.
RAG gives the AI better information. It doesn't change how the AI behaves with that information. The corrections weren't about missing facts — the AI had the right contact data. It was making systematic behavioral choices that didn't match the consultant's communication style.
Fine-tuning changes the model's weights. It's expensive, slow, and requires significant data volume before it moves the needle. You're not fine-tuning a model to stop adding unnecessary opening questions — that's overkill and the wrong tool.
Rule graduation changes the system prompt. It's fast, cheap, reviewable, and precisely targeted. The human stays in control of every change. The AI can't add rules to itself — it can only surface patterns that a human then approves. This is the "human in the loop" property that distinguishes production AI from experimental AI.
Here is the part that surprised us.
We built a correction logger. We expected to get a list of things to fix manually. What we got instead was a system that gradually aligned itself to the consultant's preferences without anyone sitting down to rewrite the prompts.
After four weeks, the acceptance rate on AI-drafted messages was measurably higher. Not because we engineered that outcome directly — but because the feedback loop was closing corrections faster than new ones appeared. The system was converging on the consultant's communication style through accumulated evidence, not through anyone explicitly describing that style.
Two scripts produced emergent behavior. That is the architectural lesson.
The pattern is not specific to sales copilots. It works anywhere a human is reviewing AI output and making corrections.
The key requirement is that a human is already reviewing the output. If you have human review, you have correction data. If you have correction data, you can build this loop. Most teams have the review process but no mechanism to capture and act on what that review reveals.
If you want to build this, the components are minimal:
1. Correction capture. A way to log when a human changes an AI output. Minimum: store (type, original, corrected, timestamp). PostgreSQL, SQLite, even a JSON file works for v0.
2. Pattern detection. A script that counts occurrences by type, runs on a schedule, and raises a flag when a threshold is crossed. No ML required — this is COUNT(*) GROUP BY.
3. Human review gate. The proposed rule is not applied automatically. A human approves it. This is non-negotiable. The AI surfaces patterns. Humans make decisions.
4. Rule application. Approved rules are committed to the system prompt. Nothing else changes. No model update. The behavior shifts forward from that point.
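The last step can be sketched as plain prompt assembly, assuming approved rules are stored as plain-English strings; `build_system_prompt` and the rule text are illustrative:

```python
BASE_PROMPT = "You draft outreach messages for an executive consultant."

# Approved rules accumulate here; each one came from a reviewed proposal.
approved_rules = [
    "For C-level contacts in manufacturing, use a 2-sentence max opening.",
]

def build_system_prompt(base, rules):
    """Rule application: approved rules are appended to the system prompt.
    Nothing else changes -- no model update, no retraining."""
    if not rules:
        return base
    numbered = "\n".join(f"{i}. {r}" for i, r in enumerate(rules, 1))
    return f"{base}\n\nLearned rules (human-approved):\n{numbered}"

prompt = build_system_prompt(BASE_PROMPT, approved_rules)
```

Because the rules live in the prompt as readable text, every behavior change stays auditable and reversible.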
You can build v0 of this in an afternoon. The sophistication comes from what you measure (correction types), how you present candidate rules, and how you handle edge cases over time.
Four things surprised us after six weeks of running this system:
The threshold matters more than you expect. We started at 5 occurrences before triggering a rule proposal. It was too high — patterns took too long to surface. At 3, the system became responsive without generating noise.
Correction type granularity is everything. Early on, we logged corrections as "edited" or "rejected." That's useless — you can't derive a rule from "the user changed something." When we introduced specific correction types (tone_too_formal, opening_removed, length), the patterns became actionable immediately.
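That shift can be made concrete as an enum of correction types. The type names come from our log; the `CorrectionType` class itself is an illustrative sketch:

```python
from enum import Enum

class CorrectionType(str, Enum):
    """Specific, rule-derivable correction types (vs. a useless generic 'edited')."""
    TONE_TOO_FORMAL = "tone_too_formal"  # the human made it more casual
    OPENING_REMOVED = "opening_removed"  # the human cut the opening line
    LENGTH = "length"                    # the human shortened the message

# Round-trip: a logged string maps back to exactly one type.
logged = CorrectionType("opening_removed")
```

An enum also keeps the logger honest: a typo in a type name fails loudly instead of silently fragmenting one pattern into two.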
Rejected rule proposals are as valuable as approved ones. When a proposed rule gets declined, that's signal too. It means the pattern is real but the proposed rule was wrong, or the context matters more than the frequency. We log rejections with the explanation, which itself feeds into better rule proposals later.
The system reveals things the human didn't know they knew. The consultant didn't consciously know he always shortened messages to C-level prospects. The system found it before he would have articulated it. This is the value of systematic observation over intuition.
The conventional AI deployment model is: build once, prompt engineer until it's good enough, deploy, and manage feedback through support tickets or quarterly reviews. This is the wrong model. It treats AI systems like software releases instead of like employees.
An employee who makes the same mistake three times and doesn't get corrected is being managed badly. An AI system that makes the same mistake three times and has no mechanism to capture that feedback is architected badly.
The correction loop we built is not sophisticated. It's disciplined. It instruments what was already happening (human reviewing outputs), captures the signal (what they change and why), and closes the loop (proposing rules that shift behavior). The human stays in control at every step.
That is what self-improving AI actually looks like in production. Not autonomous optimization. Not unsupervised learning. A tight feedback loop where human judgment and machine pattern recognition each do what they're good at.
The system that learns from its user beats the system that's configured for its user. Every time. The first adapts. The second decays.
If you have a human reviewing AI output right now and no mechanism to capture their corrections — you are generating feedback data and throwing it away. Every edit, every override, every rejection is a signal about the gap between what your AI does and what you actually need.
The gap is not fixed by better prompts at deployment. It is fixed by a system that observes the gap continuously and narrows it over time.
Two scripts and a threshold. That is all it takes to start.
The AI Agent Decision Guide walks you through a 20-question framework to figure out what setup actually fits your workflow. Free PDF.