NightBuild #3 · HARNESS & LOOP ENGINEERING

Tutorial · for AI practitioners

Exploring harness & loop engineering

Until now, getting useful work from a model has meant continuous interaction with it: prompt, read, judge, prompt again. Harness and loop engineering let the model work autonomously, with minimal interaction from you.

01 — THE MOTIVATION

Why do you need harnesses and loops?

Making a model do a task has been hands-on. You prompt it, read its response, judge the response, and prompt again. The model can't make progress without you.

What if you could write the task down once, set the constraints, then let the model run on its own and come back only to check the result?

To achieve this, you need two things:

  1. Harness. A framework the model works within. It enforces the constraints, gives the model tools, and remembers state between runs.
  2. Loop. A driver that keeps prompting the model toward the goal, so you don't have to.

Let's define each one clearly, with examples.

02 — THE FIRST PRIMITIVE

What is a harness?

A harness is nothing new, and you're already using one. Claude Code, OpenAI Codex, Google Antigravity, and any other agentic IDE or app that decides how a model works are all harnesses. The harness controls and directs the model, which is why the same underlying model behaves differently in different products.

Definition · Harness

A harness is everything wrapped around the model that isn't the model: prompts, tools, context policies, infrastructure, hooks, observability. It gives the model durable state, executable tools, feedback signals, and enforceable constraints. It governs how the model's actions are validated, authorized, executed, and logged, and it controls the context the model works against.

Let's take an example of building an app across many sessions. Here, you can have a harness enforce this rule: read a STATUS.md file at the start of each session to learn where the last one stopped, then write the new status back at the end.

The harness makes the model follow this rule in every session, without you having to tell the model each time.
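
A minimal Python sketch of that rule, assuming the harness wraps every session in one function. `run_session` and the `work` callback are hypothetical names introduced for illustration; only the STATUS.md file comes from the example above.

```python
import tempfile
from pathlib import Path
from typing import Callable

STATUS_FILE = "STATUS.md"  # file name taken from the example above


def run_session(repo: Path, work: Callable[[str], str]) -> str:
    """Read the last status, run one session, write the new status back."""
    status_path = repo / STATUS_FILE
    # Session start: load where the previous session stopped (empty on first run).
    previous = status_path.read_text(encoding="utf-8") if status_path.exists() else ""
    # `work` stands in for the model session; it returns the new status text.
    new_status = work(previous)
    # Session end: persist the status so the next session can pick it up.
    status_path.write_text(new_status, encoding="utf-8")
    return new_status


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        run_session(repo, lambda prev: "done: login form\nnext: signup form\n")
        # The second session starts from what the first one wrote.
        run_session(repo, lambda prev: prev + "done: signup form\n")
        print((repo / STATUS_FILE).read_text(encoding="utf-8"))
```

Because the status lives in a file and not in the conversation, the second session starts from what the first one wrote, with no reminder from you.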

Building blocks of a harness

A harness is made of:

  1. Instructions. CLAUDE.md · AGENTS.md
  2. Capabilities. tools · skills · MCP
  3. Enforcement. hooks · gates
  4. Observability. logs · traces · cost
  5. Environment. filesystem · sandbox · browser
  6. Memory. STATUS.md / PROGRESS.md, read at session start and written at session end

[Figure: the harness (validates · authorizes · executes · logs) wraps the model (predicts text).]
A harness: building blocks wrap the model, and memory on disk carries state across sessions.

The defining equation

Model plus harness equals agent. A raw model only predicts text. Add a harness, with state, tools, feedback, and constraints, and you get something that can finish a task.

Failures become rules

To keep your harness from growing into a sprawling configuration, add a constraint only after you've seen a real failure, and remove it only when a better model makes it redundant. Every line in a good rules file should trace back to a specific thing that went wrong.

For example, if the agent merges a pull request with a commented-out test, add a hook that greps the diff for skipped tests and blocks the merge. If the agent runs a destructive command, add a hook that refuses it.

Each rule becomes permanent, so the agent never repeats that mistake. You encode rules in your harness based on your failure history, the same way you would create a guardrail from an incident postmortem.

What is harness engineering?

Harness engineering means assembling these building blocks (instructions, tools, environment, enforcement) into a framework that makes the model behave reliably, then tightening that framework each time the model slips. See "How do you engineer harnesses and loops?" (section 05) for how to do this in practice.

03 — THE SECOND PRIMITIVE

What is a loop?

At its simplest, a loop is the repeated execution of a task, either indefinitely at intervals or until a goal is met.

Definition · Loop

In agentic work, a loop repeatedly prompts an agent toward a goal without you driving each turn. It comes in two forms:

  1. Cadence loop. Re-runs a task on a timer (every five minutes, every morning). It has no natural stopping point; it's a heartbeat.
  2. Goal loop. Runs until a verifiable condition is true (all tests pass, lint is clean), with a separate check after each turn deciding whether it's done. It has a gate; it exits when the gate is satisfied.

Let's take an example of fixing failing tests across a codebase. Here, you can run a goal loop: re-run the test suite, send each failure to the agent to fix, and repeat until every test passes. The loop keeps prompting the agent until the goal is met, without you checking in between runs.
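
The test-fixing example can be sketched as a small goal loop in Python. `check` and `fix` are hypothetical callbacks standing in for "run the test suite" and "send a failure to the agent", and the `max_turns` budget is an assumption added so the sketch cannot run forever.

```python
from typing import Callable


def goal_loop(
    check: Callable[[], list[str]],
    fix: Callable[[str], None],
    max_turns: int = 10,
) -> bool:
    """Repeat check-then-fix until no failures remain or the turn budget is spent."""
    for _ in range(max_turns):
        failures = check()        # the gate: for example, re-run the test suite
        if not failures:
            return True           # goal met, exit the loop
        for failure in failures:
            fix(failure)          # hand each failure to the agent
    return not check()            # final verdict after the last turn


if __name__ == "__main__":
    # Stand-in for a real test suite: a set of failing test names.
    broken = {"test_login", "test_signup"}
    print(goal_loop(check=lambda: sorted(broken), fix=broken.discard))
```

The check runs separately from the fix on every turn, so the loop exits on the gate's verdict and not on the agent's own claim that it is done.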

In Claude Code these are /loop (re-run on an interval) and /goal (run until a condition holds). Codex and Antigravity expose the same pair of behaviours.

Building blocks of a loop

A loop is made of five capabilities plus memory on disk:

  1. Automations — the heartbeat. A scheduled trigger that does discovery and triage on its own. This is what makes a loop a loop, not one run you did once.
  2. Worktrees — parallel without collisions. Each agent gets its own git working directory on its own branch, so two agents editing at once can't clobber each other. Same fix as engineers branching instead of sharing lines.
  3. Skills — project knowledge written down. A SKILL.md folder holding conventions, build steps, and hard-won lessons, so the agent stops re-deriving your project every cycle.
  4. Connectors — reach into your real tools. Built on MCP, they let the loop read your issue tracker, query a database, open a pull request, or post to a channel, not just see the filesystem.
  5. Sub-agents — keep the maker away from the checker. One agent drafts; a different one (often a different model) reviews it against the spec and the tests.
  6. Memory on disk. A markdown file or tracker board that lives outside any single conversation and records what's done and what's next. The model forgets everything between runs, so state lives in the repo, not the context window.

Claude Code, Codex, and Antigravity all ship the first five building blocks, making loop design portable between them.
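
For the worktree building block, here is a sketch of giving each agent its own checkout with `git worktree add`. The `agent/<name>` branch scheme and the sibling-directory layout are assumptions chosen for illustration, not a convention of any platform.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, agent: str, base: str = "main") -> list[str]:
    """Build the git command that gives one agent its own branch and directory."""
    branch = f"agent/{agent}"                     # hypothetical naming scheme
    path = repo.parent / f"{repo.name}-{agent}"   # one sibling directory per agent
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def create_worktree(repo: Path, agent: str, base: str = "main") -> Path:
    """Create the worktree; raises CalledProcessError if git refuses."""
    command = worktree_command(repo, agent, base)
    subprocess.run(command, check=True)
    return Path(command[-2])  # the worktree path is the second-to-last argument
```

Calling `create_worktree` once for the maker and once for the checker gives each its own directory and branch, so simultaneous edits cannot overwrite each other.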

[Figure: a loop drives on a cadence and stops when the goal is met. Trigger (schedule) → agent acts (maker drafts inside the harness) → checker (goal met?) → pass: exit, done; fail: re-prompt with the error. State on disk: PROGRESS.md records what's tried, passed, and still open between runs.]
A loop: a trigger starts the agent, a checker tests the result, failures re-prompt, and a passing goal exits. State persists on disk.

What is loop engineering?

Loop engineering means designing a system from these building blocks that prompts the agent through its tasks toward a goal, so you don't have to. See "How do you engineer harnesses and loops?" (section 05) for how to do this in practice.

04 — HOW THEY FIT

How do harnesses and loops work together?

The relationship is layered. The harness is the environment a single agent runs inside. The loop sits one floor above: it runs the harness on a timer, spawns helpers, checks their work, and feeds itself the next task.

The harness defines what an agent can do and what it's not allowed to do. The loop decides when, how often, and when to stop.

[Figure: nested layers. The loop (drives and schedules, decides when to stop) wraps the harness (validates · authorizes · executes · logs; tools · skills · MCP; filesystem · sandbox; hooks · gates; logs · traces · cost), which wraps the model (predicts text). Memory on disk (STATUS.md / PROGRESS.md / tracker) survives every reset.]
Model + harness = agent. Loop + agent = autonomous, scheduled work. State lives on disk, not in context.

Trace one cycle. The loop fires on schedule. It reads the memory file to learn where things stand. It picks the next task and hands it to an agent inside the harness. The harness runs the tools, hits the test and security gates, and refuses anything the rules forbid. A checker sub-agent verifies the result. The loop writes the outcome back to the memory file, then picks up the next task, or exits if the goal is met. You designed that once. You prompted none of it.
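
The cycle traced above can be sketched in Python as one function. This is an illustration under assumptions: the memory file is taken to be a markdown checklist (`- [ ]` open, `- [x]` done), and `agent` and `checker` are hypothetical callbacks standing in for the agent inside the harness and the checker sub-agent.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional

OPEN, DONE = "- [ ] ", "- [x] "  # assumed checklist format for the memory file


def next_open_task(progress: str) -> Optional[str]:
    """Return the first unchecked task, or None when everything is done."""
    for line in progress.splitlines():
        if line.startswith(OPEN):
            return line[len(OPEN):]
    return None


def run_cycle(
    memory: Path,
    agent: Callable[[str], str],
    checker: Callable[[str, str], bool],
) -> Optional[str]:
    """One cycle: read memory, do one task, verify it, write the outcome back."""
    progress = memory.read_text(encoding="utf-8")   # learn where things stand
    task = next_open_task(progress)
    if task is None:
        return None                                 # goal met, nothing left to do
    result = agent(task)                            # agent acts inside the harness
    if checker(task, result):                       # separate checker verifies
        progress = progress.replace(OPEN + task, DONE + task, 1)
    else:
        progress = progress.rstrip("\n") + f"\n- note: {task!r} failed review\n"
    memory.write_text(progress, encoding="utf-8")   # outcome survives the reset
    return task


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        memory = Path(tmp) / "PROGRESS.md"
        memory.write_text("- [ ] fix login\n- [ ] fix signup\n", encoding="utf-8")
        # Stand-in agent and an always-approving checker, for demonstration only.
        while run_cycle(memory, lambda task: "patch", lambda task, result: True):
            pass
        print(memory.read_text(encoding="utf-8"))
```

A scheduler would call `run_cycle` on each trigger; a task that fails review stays open and is picked up again on the next cycle.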

05 — DOING IT

How do you engineer harnesses and loops? Does it require coding?

Mostly, you don't code the orchestration runtime yourself anymore. You write the harness and loop specs in plain text as Markdown files, hand them to your coding agent (Claude Code, Codex, or Antigravity), and ask it to wire them to the primitives that agent already understands.

One caveat: the spec is plain text, but the loop still runs on real configuration. Skill files, worktrees, connectors, and scheduled tasks are concrete artifacts on disk. You write intent in markdown and the agent assembles the plumbing. It's configuration, not zero engineering.

Example: A harness for feature-delivery builds, specified as a HARNESS.md file. It defines what the agent should remember, which tools it can use, and the rules and gates (conditions that must pass) it works under.

HARNESS.md · a portable harness spec

# Harness: feature-delivery

memory:
  on_start: read PROGRESS.md; do not trust chat history
  on_end:   write done / next / blockers to PROGRESS.md, then commit

tools:
  - filesystem (read/write within repo)
  - bash (test + lint only)
  - git

gates:            # nothing is "done" until all pass
  - run: npm test         expect: exit 0
  - run: npm run typecheck expect: exit 0
  - run: npm audit         expect: 0 high/critical

rules:
  - never comment out a failing test; fix it or delete it
  - never hardcode secrets; read from env only
  - block: rm -rf, git push --force
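
One way the gates section might be enforced is a small runner that executes each command and treats any non-zero exit as a failed gate. This is a sketch, not how any particular platform implements gates. The `npm audit --audit-level=high` form is an assumed binding for the "0 high/critical" expectation.

```python
import subprocess
import sys

# Commands mirror the gates in HARNESS.md; the audit flag is an assumed binding.
GATES = [
    ["npm", "test"],
    ["npm", "run", "typecheck"],
    ["npm", "audit", "--audit-level=high"],
]


def run_gates(gates: list[list[str]]) -> list[str]:
    """Run every gate; return the commands that failed (empty list means all pass)."""
    failed = []
    for command in gates:
        try:
            exit_code = subprocess.run(command, capture_output=True).returncode
        except OSError:
            exit_code = 127  # a missing or non-executable command counts as a failure
        if exit_code != 0:
            failed.append(" ".join(command))
    return failed


if __name__ == "__main__":
    # Demo with a stand-in command so the sketch runs without npm installed.
    print(run_gates([[sys.executable, "-c", "raise SystemExit(1)"]]))
```

In a real harness you would call `run_gates(GATES)` before accepting any work, and treat the task as done only when the returned list is empty.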

Example: A loop that triages issues in feature-delivery builds once a day and fixes them. This loop drives an agent inside the harness from the example above. It names the goal condition, splits maker from checker, and keeps state on disk.

LOOP.md · a portable loop spec

# Loop: nightly-triage-and-fix

cadence: every morning, 06:00
memory:  PROGRESS.md  (what's tried / passed / open)

run:
  1. triage skill: read CI failures, open issues, recent commits
  2. for each finding worth doing:
       - open an isolated worktree
       - maker sub-agent: draft the fix under harness rules
       - checker sub-agent: review against spec + tests
  3. connector: open PR, link ticket, post to channel

goal:  all tests in src/ pass AND lint clean
stop:  goal satisfied  OR  3 findings failed → park in triage inbox

With both specs in hand, the workflow is:

  1. Write the harness spec. Capture the memory protocol, allowed tools, hard gates, and rules learned from real failures. Start with rules you can justify; ratchet in more as the agent slips.
  2. Write the loop spec. Pick cadence or goal (or both), name the verifiable stop condition, assign the maker and checker roles.
  3. Hand both to your coding agent. Ask it to translate the specs into the primitives it supports: hooks, /loop or /goal, sub-agent config, scheduled tasks.
  4. Run once, supervised. Watch a full cycle. Confirm the gates block and the checker catches things.
  5. Calibrate, then step away. Tune the cadence and per-run task budget against real token use before running it unattended.

06 — ACROSS PLATFORMS

Are harnesses and loops engineered differently for different AI platforms?

Conceptually, no. You don't need to engineer harnesses and loops differently for each AI platform.

Caveat: the harness and loop designs are platform-agnostic, but the files are not drop-in portable. The concepts (memory protocol, gated verification, maker/checker split, scheduled cadence) are identical across all three platforms. What differs is a thin adapter layer that maps your neutral spec onto each platform's command names and config locations.

Once you've written a harness and loop spec, you can largely port them. Hand the same markdown to a different platform's agent and ask it to map the fields onto that platform's primitives. What you invest in (the conventions, the gates, the stop conditions, the failure history encoded as rules) carries over. Only the last-mile binding changes.

One caveat on the spec itself. Keep abstraction levels consistent. A field like run: npm test is platform-neutral. A field that bakes in a platform-specific command or flag is not, and it quietly couples your spec to one platform. When that creeps in, it belongs in the adapter, not the spec.

Claude Code, Codex, and Antigravity ship the same loop primitives: scheduled automations, parallel worktrees, markdown skill files, MCP connectors, and sub-agents. They share file conventions too: Codex and Antigravity both read an AGENTS.md rules file and SKILL.md skill folders, and MCP connectors written for one platform generally work on the others because MCP is an open standard.

Concept (your spec) | Claude Code          | Codex              | Antigravity
Cadence loop        | /loop · cron · hooks | Automations        | hooks · CLI in CI
Goal loop           | /goal                | /goal              | agent goal run
Parallel isolation  | git worktree         | built-in worktrees | parallel sub-agents
Project knowledge   | SKILL.md             | SKILL.md           | SKILL.md
Real-tool access    | MCP                  | MCP                | MCP
Maker / checker     | .claude/agents/      | .codex/agents/     | sub-agents
Rules file          | CLAUDE.md            | AGENTS.md          | AGENTS.md
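
The adapter layer can be pictured as a lookup from neutral spec concepts to platform bindings. The Python sketch below copies its values from the table above, so it inherits the table's claims without verifying them against each platform. The key names and the `bind` helper are hypothetical.

```python
# Bindings copied from the table above; key names are illustrative assumptions.
ADAPTERS: dict[str, dict[str, str]] = {
    "claude-code": {
        "goal_loop": "/goal",
        "parallel_isolation": "git worktree",
        "project_knowledge": "SKILL.md",
        "real_tool_access": "MCP",
        "maker_checker": ".claude/agents/",
        "rules_file": "CLAUDE.md",
    },
    "codex": {
        "goal_loop": "/goal",
        "parallel_isolation": "built-in worktrees",
        "project_knowledge": "SKILL.md",
        "real_tool_access": "MCP",
        "maker_checker": ".codex/agents/",
        "rules_file": "AGENTS.md",
    },
    "antigravity": {
        "goal_loop": "agent goal run",
        "parallel_isolation": "parallel sub-agents",
        "project_knowledge": "SKILL.md",
        "real_tool_access": "MCP",
        "maker_checker": "sub-agents",
        "rules_file": "AGENTS.md",
    },
}


def bind(spec_concepts: list[str], platform: str) -> dict[str, str]:
    """Map neutral spec concepts onto one platform's primitives."""
    adapter = ADAPTERS[platform]
    missing = [concept for concept in spec_concepts if concept not in adapter]
    if missing:
        raise KeyError(f"no binding for {missing} on {platform}")
    return {concept: adapter[concept] for concept in spec_concepts}
```

The spec names only the concepts; everything platform-specific lives in `ADAPTERS`, which is the only part that changes when you switch platforms.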

07 — THE REAL CONSTRAINTS

What do loops cost?

Every iteration spends tokens. Sub-agents multiply that, since each maker and checker runs its own model and tools. Set the interval too tight and you hit rate limits quickly.

Two levers keep it sane. First, cap by task count per checkpoint, not token estimate. Token use per task is unpredictable; task count isn't. Three tasks per checkpoint is a reasonable default to calibrate from.
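
Capping by task count is simple to express in code. In this Python sketch, `checkpoints` is a hypothetical helper; the default of three comes from the text above.

```python
from typing import Iterator


def checkpoints(tasks: list[str], per_checkpoint: int = 3) -> Iterator[list[str]]:
    """Yield tasks in fixed-size batches; the loop pauses for review after each."""
    if per_checkpoint < 1:
        raise ValueError("per_checkpoint must be at least 1")
    for start in range(0, len(tasks), per_checkpoint):
        yield tasks[start:start + per_checkpoint]
```

Each batch is one checkpoint: the loop works through it, records token use, and only then starts the next batch, so the budget is known in tasks even when cost per task varies.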

Second, tier the model by phase. Use a frontier model for planning and security review, where mistakes compound, and a cheaper model for mechanical execution against a good plan. Escalate back to the frontier model only when the test-fix loop gets stuck. Cheaper open-weight models now cost a fraction of what frontier models cost per unit of capability, so tiering is a real lever, not a rounding error.
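
The tiering rule can be sketched as a small routing function in Python. The tier labels, phase names, and the `escalate_after` threshold are assumptions for illustration; the text above only says to escalate when the test-fix loop gets stuck.

```python
FRONTIER, CHEAP = "frontier", "cheap"  # placeholder tier labels, not model names
HIGH_STAKES_PHASES = {"planning", "security_review"}


def pick_tier(phase: str, stuck_attempts: int = 0, escalate_after: int = 2) -> str:
    """Choose a model tier for one phase of the loop."""
    if phase in HIGH_STAKES_PHASES:
        return FRONTIER      # mistakes compound here, so pay for the stronger model
    if stuck_attempts >= escalate_after:
        return FRONTIER      # the test-fix loop is stuck, so escalate
    return CHEAP             # mechanical execution against a good plan
```

For example, `pick_tier("execution")` stays on the cheap tier, while the same phase with `stuck_attempts=2` escalates to the frontier tier.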

08 — STAYING IN CONTROL

Why do the loops still need you?

Three things stay your responsibility even as the loop improves. Verification: a loop running unattended is also a loop making mistakes unattended, and "done" is a claim, not a proof, even with a checker. Understanding: your grasp of the system rots faster if you stop reading what the loop ships. Judgment: accepting whatever the loop hands back without forming an opinion is the dangerous posture.

You step in at three points, each on a different clock:

You stay involved because you hold a context advantage: you know things about the users and the situation that the model doesn't, and that knowledge only enters the system through you.