
AI & Productivity · 11 min read · April 17, 2026

Harness Engineering: Why the Structure Around the Model Matters More Than the Model

By AI Jungle

The model debate is mostly settled. What separates production AI from toy demos is the structure around it — memory, learning loops, tenant isolation. Here is what we learned building an agent template.


TL;DR: The agent question has changed. It is no longer which model to pick; it is which structure to remove. Claude Routines launched last week. A French solo operator ran 51 agents for €350,000 last year. 1 in 4 community-shared agent skills ships with a vulnerability. The bottleneck is no longer the model; it is the structure around it. Three moves separate a harness that compounds from one that rots: keep memory in files, not runtime; gate the learning loop on humans first; treat per-tenant isolation as a first-class invariant.


The model debate is mostly settled

The community caught up to this last week. A widely shared video framed it in one line: it is no longer which model to pick; it is which structure to remove. Claude Routines launched the same week, turning natural language into a full automation runtime with scheduled triggers, webhooks, and cloud containers. A French solo operator made €350,000 last year running 51 agents with zero employees. The pattern is visible across every credible signal right now. Model selection stopped being the bottleneck. Structure is.

Here is the problem. That structure is fragile. A recent audit found that 1 in 4 community-contributed agent skills contains a vulnerability. Most agent failures we see in production are not "the model got it wrong." They are hallucinated database writes, silent permission leaks, memory layouts that lose everything when the runtime restarts, skills that import each other into a spaghetti mess.

So the question for anyone building an agent that actually runs in production is not "which model should I pick?" It is "what is the smallest, most opinionated structure I can wrap around a model so that the thing is debuggable, recoverable, and portable?"

That is harness engineering. It is not a job title yet. It will be.

What harness engineering actually is

The harness is everything between the model call and the useful outcome. Memory layout. Skill loading. Tool schemas. Permissions. The error path when the LLM returns malformed JSON. The cron that wakes the agent up. The audit trail when something goes wrong.

From what we have seen building and breaking our own agents over the last six months, three things separate a harness that compounds from one that rots.

Memory lives in files, not in runtime. If the agent's memory dies when the process dies, the harness is too thick. If editing a skill requires a migration, the harness is too heavy. The agentic-stack project shipped v0.5 in April 2026 under MIT, and its core thesis is exactly this. Files are authoritative. Git is the version control. SQLite is a derived read index. Nothing precious lives in RAM or in a proprietary format. The agent is a thin conductor reading files.

Opinionated defaults beat configurable knobs. Every option that exists is an option that breaks. Every flag is a switch future-you or another operator forgets to flip. The best harnesses pick trade-offs and enforce them. Our template's memory model is one example: four layers (episodic JSONL, semantic markdown, procedural skills, identity), fixed. No "memory backend plugin" abstraction. If you want a different layout, you fork the template.
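
The fixed four-layer layout described above can be sketched in a few lines. This is our illustration of the idea, not the template's actual API; the layer names follow the article, but the paths and the `memory_path` helper are assumptions:

```python
from pathlib import Path

# Fixed four-layer memory layout: opinionated, no "memory backend plugin".
# Layer names from the article; exact paths are illustrative.
MEMORY_LAYERS = {
    "episodic":   "episodes/",    # append-only JSONL event logs
    "semantic":   "notes/",       # human-readable markdown
    "procedural": "skills/",      # promoted, human-reviewed skills
    "identity":   "identity.md",  # single file, mutated only via review
}

def memory_path(brain_root: str, layer: str) -> Path:
    """Resolve a layer to its path; unknown layers fail loudly
    rather than falling back to a configurable default."""
    if layer not in MEMORY_LAYERS:
        raise KeyError(f"no such memory layer: {layer}")
    return Path(brain_root) / MEMORY_LAYERS[layer]
```

Note there is deliberately no way to register a fifth layer: if you want a different layout, you fork.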

Observability is the default, not an add-on. Production agents silently succeed or silently fail. You cannot fix what you cannot see. A working harness emits structured events on every tool call, every memory write, every permission check. Cost attribution per tenant. Error traces with the full tool-call sequence. This is not glamorous work. It is the thing that separates "demo" from "customer."
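
"Structured events by default" can be as simple as one JSONL line per tool call, memory write, or permission check, with the tenant key carried on every event for cost attribution. A minimal sketch; the function and field names are our assumptions, not the template's:

```python
import json
import time

def emit_event(kind: str, tenant: str, payload: dict) -> str:
    """Serialize one structured event line (JSONL). In production this
    would append to a durable log; returning the line is the sketch."""
    event = {
        "ts": time.time(),
        "kind": kind,      # e.g. "tool_call", "memory_write", "perm_check"
        "tenant": tenant,  # per-tenant cost attribution hangs off this key
        **payload,
    }
    return json.dumps(event, sort_keys=True)

# Every tool call emits before and after, so error traces carry
# the full tool-call sequence.
record = json.loads(emit_event("tool_call", "acme", {"tool": "search", "ok": True}))
```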

Three moves that matter more than picking a model

Move 1: Make memory files, not a database

The temptation when you start is to reach for Postgres and pgvector. Resist it. A week ago we finished designing our agent template's memory layer, and the load-bearing decision was this: files are the source of truth, SQLite with full-text search is a derived index, and a vector layer is opt-in per tenant.

Why files? Three reasons, and they compound.

First, git gives you audit and revert for free. Every lesson learned, every promoted procedural skill, every identity update has a commit hash behind it. When the agent graduates a bad lesson into its identity layer, you can see what changed, when, and why. You can roll back. Try doing that with a mutated Postgres row.

Second, files are portable. A customer wants to self-host? Zip the brain repo. Wants to switch runtime providers? Same answer. The file-first model means the brain is not locked to the runtime.

Third, files are debuggable by a human. You can open a markdown file in Obsidian and read what the agent believes. You cannot do that with pgvector. This matters more than it sounds. The single biggest accelerant to improving an agent is a human who can read its state and correct it without running migrations.

The cost is real. Full-text search on files is slower than SQL on rows for large datasets. You need a derived index for anything above a few thousand entries. But the derived index is throwaway. Rebuild it from the files. The files are the truth.
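
The "derived index is throwaway" idea can be sketched with SQLite's built-in full-text search. This assumes an SQLite build with FTS5 enabled (standard in recent CPython); the `(path, body)` pairs stand in for the file tree, and the schema is our illustration:

```python
import sqlite3

def rebuild_index(documents) -> sqlite3.Connection:
    """Rebuild the throwaway FTS index from (path, body) pairs read off
    the file tree. The files stay authoritative; this index is derived
    and can always be dropped and rebuilt from them."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE VIRTUAL TABLE notes USING fts5(path, body)")
    con.executemany("INSERT INTO notes VALUES (?, ?)", documents)
    con.commit()
    return con

def search(con: sqlite3.Connection, query: str) -> list:
    """Full-text search over the derived index; returns matching paths."""
    return [row[0] for row in
            con.execute("SELECT path FROM notes WHERE notes MATCH ?", (query,))]
```

If the index is ever corrupted or the schema changes, you delete the database file and run `rebuild_index` again; no migration touches the files.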

Move 2: Gate the learning loop on humans first

The seductive idea in agent design is the fully automated dream cycle. The agent reflects on its day, graduates lessons into procedural skills, mutates its own identity file, and wakes up smarter tomorrow.

Every team that ships this in week one regrets it by week four.

The failure mode is not that the agent fails to learn. It is that it learns the wrong thing and codifies it. A single bad session in which the agent hallucinated can get promoted into a "rule" that poisons every future interaction. And because the graduation path is automated, no human ever sees the mutation before it binds. By the time you notice the bad rule, three weeks of downstream behavior have been shaped by it.

Our template starts fully human-gated. The dream cycle runs. It proposes graduations. A human opens a weekly review session, reads each proposed lesson, and approves, rejects, or rewrites it before anything binds. Once you have seen 40 graduations in a row and the shape of the failures becomes predictable, you can automate the easy ones. Not before.

The rule we landed on: never automate a decision until you have made it manually 40 times. Most teams try to automate after 2. That is the bug.
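
The human-gated loop and the 40-decision rule fit in a small state machine. A sketch under our own naming; the real template's review API will differ:

```python
class GraduationQueue:
    """Dream cycle proposes; a human approves before anything binds."""

    AUTOMATION_THRESHOLD = 40  # never automate before 40 manual decisions

    def __init__(self):
        self.pending = []       # lessons proposed by the dream cycle
        self.bound = []         # lessons a human approved
        self.manual_count = 0   # decisions made by hand so far

    def propose(self, lesson: str) -> None:
        """The automated dream cycle only ever gets this far."""
        self.pending.append(lesson)

    def review(self, lesson: str, approve: bool) -> None:
        """A human approves, rejects, or (after rewriting) re-proposes.
        Either way it counts as one manual decision."""
        self.pending.remove(lesson)
        self.manual_count += 1
        if approve:
            self.bound.append(lesson)

    def may_automate(self) -> bool:
        """Only after the failure shapes become predictable."""
        return self.manual_count >= self.AUTOMATION_THRESHOLD
```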

Move 3: Treat per-tenant isolation as a first-class invariant

If you build an agent for one person, this move is free. If you build it for more than one person, it is the difference between a sustainable product and a lawsuit.

Our rule: per-agent database isolation, brain repo ownership separated from data ownership, wipe-and-archive on churn. No shared tables. No "tenant_id" column pretending to be a boundary. Each agent lives in its own database, its own filesystem, its own permission scope.

This costs more in infrastructure. It buys three things. One agent's corrupted state cannot damage another. Compliance requests ("delete everything about us") become a file operation instead of a migration. And the moment you want to offer on-prem or sovereign deployment, you ship the same artifact. The only difference is which server it runs on.
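
What "isolation as an invariant" means concretely: tenancy is a path prefix and a permission scope, never a column. A minimal sketch; the layout and scope-token format are our illustration, not the template's actual one:

```python
from pathlib import Path

def tenant_paths(root: str, tenant: str) -> dict:
    """Each agent gets its own database file, filesystem subtree, and
    permission scope. No shared tables, no tenant_id column pretending
    to be a boundary."""
    base = Path(root) / tenant
    return {
        "db":    base / "agent.sqlite",  # one database per agent
        "brain": base / "brain",         # brain repo, owned separately from data
        "scope": f"tenant:{tenant}",     # permission scope token
    }

def wipe_and_archive(root: str, tenant: str) -> Path:
    """Churn and 'delete everything about us' become a file operation,
    not a migration: move the tenant's directory to an archive."""
    base = Path(root) / tenant
    archive = Path(root) / "_archive" / tenant
    archive.parent.mkdir(parents=True, exist_ok=True)
    base.rename(archive)
    return archive
```

Because every artifact under one prefix is self-contained, shipping on-prem or sovereign is the same directory on a different server.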

We flagged one coupling early and fixed it before it shipped. An internal ops agent had started sharing a database with another product. The coupling looked cheap. The unwind would have been a year-one footgun if we had let it run. Writing the invariant down first is what caught it.

The competitive reality

Harness engineering is not a settled field. It is an emerging one, and that means the best minds are currently disagreeing in public, which is a great sign.

Garry Tan's GBrain uses Postgres with pgvector and ships 24 opinionated skills; the trade-off is richer query semantics at the cost of file portability. The agentic-stack project goes file-first with a mechanical dream cycle and a host-agent review CLI. Anthropic's Claude Routines sits closer to the workflow-engine end of the spectrum, with scheduled triggers and cloud containers. Each one is solving the same problem from a different angle: what is the smallest, most opinionated structure I can wrap around a model?

The meaningful work is the structural trade-off, not the model choice. Whether you pick Opus, Sonnet, Haiku, or GPT for your runtime matters less than whether your memory is recoverable, your learning loop is gated, and your tenants are isolated.

There is a security dimension worth stating plainly. The 1-in-4 vulnerability rate in community skills is a harness problem, not a model problem. Shared and portable structures amplify every flaw. If you are installing a community skill into your agent without reading it, you are running untrusted code inside your memory layer. Treat skills like npm packages. Fewer. Audited. Pinned.
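"Fewer. Audited. Pinned." can be enforced mechanically: a lockfile of content hashes, written only after a human audit, checked before any skill loads. A sketch of the idea, with hypothetical names; no real skill registry is assumed:

```python
import hashlib

def verify_skill(name: str, source: str, lockfile: dict) -> bool:
    """A skill loads only if its content hash matches the pinned one.
    Unknown skills and modified skills are both refused."""
    digest = hashlib.sha256(source.encode()).hexdigest()
    pinned = lockfile.get(name)
    return pinned is not None and pinned == digest

# Pin a skill's exact bytes after a human has read it line by line.
AUDITED_SOURCE = "def summarize(): ..."
lock = {"summarize": hashlib.sha256(AUDITED_SOURCE.encode()).hexdigest()}
```

An upstream update, however benign, fails the check until someone re-reads and re-pins it, which is exactly the friction you want around code running inside your memory layer.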

Whichever architecture you pick, you are a harness engineer now. You might as well admit it, write down the trade-offs you are making, and own them.

What we are actually building

The first working version of our agent template is being built this month. It is a fork of agentic-stack (MIT, with attribution), rewired for multi-tenant deployment, with an MCP server as the primary binding and file-reads as an emergency fallback. 25 tasks. Test-driven from day one. Every decision on memory layout, permissions, learning loop, and isolation is written down before a line of code ships.

Next month we migrate our own tools into it. The month after, our first customer's agent moves onto it. By then we will know which of these moves were right and which we were wrong about. We will write both up.

If you are wrestling with the same trade-offs (memory layout, when to automate the learning loop, per-tenant isolation), we would like to compare notes. The AI Jungle Dispatch goes out Fridays with the build log and the decisions we changed our minds about that week.

Frequently Asked Questions

What is harness engineering?

Harness engineering is the discipline of designing the structure that wraps a language model so it can actually run in production. The harness includes memory layout, skill loading, tool schemas, permissions, error paths, observability, and the scheduling that wakes the agent up. The model does the reasoning; the harness determines whether the result is usable, debuggable, and recoverable when things go wrong.

Why does the structure matter more than the model choice?

Frontier models are now close enough in capability that the differentiator for production agents has moved to the surrounding structure. Most agent failures in production are not reasoning failures — they are hallucinated database writes, silent permission leaks, broken memory layouts, and fragile skill dependencies. All of those are harness problems, not model problems.

Should agent memory live in a database or in files?

For most use cases, files are the better default. Files give you git-level audit and revert for free, they are portable between runtimes, and they are debuggable by a human who can just open them and read. A SQLite full-text index over the files handles search. A vector layer can be opted in per tenant if you need semantic retrieval.

How do you prevent an agent from codifying bad lessons into its identity?

Gate the learning loop on humans first. The automated dream cycle can propose graduations, but a human reviews and approves each one before it binds. Our rule is to never automate a decision until you have made it manually 40 times. Most teams try to automate after 2, which is how bad lessons get promoted into binding rules and poison future interactions.

What is per-tenant isolation and why does it matter?

Per-tenant isolation means each agent instance has its own database, its own filesystem, and its own permission scope. It is the difference between a product and a lawsuit once you have more than one customer. Shared-table multi-tenancy with a "tenant_id" column feels cheap at day zero and becomes a year-one footgun when one tenant's corrupted state damages another or a compliance request requires surgical data removal.

Are community agent skills safe to use?

Treat them like npm packages. Recent audits found that roughly 1 in 4 community-contributed agent skills contains a vulnerability. If you are installing a community skill into your agent without reading it line by line, you are running untrusted code inside your memory layer. Fewer skills, audited, pinned by version.

What is the difference between harness engineering and prompt engineering?

Prompt engineering optimizes what the model sees in a single call. Harness engineering optimizes the surrounding infrastructure that makes many calls reliable, recoverable, and observable over time. A well-crafted prompt inside a broken harness will still produce unreliable results; a well-designed harness with a mediocre prompt will still be maintainable and fixable.

Not sure if an AI agent is right for you?

The AI Agent Decision Guide walks you through a 20-question framework to figure out what setup actually fits your workflow. Free PDF.

