Trellis

Spec-driven orchestration for AI coding agents. Separates planning from execution because agents that think before they type produce better code.

TypeScript · AI Agents · Orchestration

Most AI coding tools let agents jump straight into your codebase and start writing. No plan. No review. No audit trail. Just vibes and a prayer.

The result is predictable: code that looks right, passes the tests, and slowly rots from the inside. Duplicated blocks. Architectural drift. Changes nobody asked for buried in changes somebody did. The agent ships fast and you spend the next week figuring out what it actually did.

Trellis enforces a simple constraint: think before you type.

Every non-trivial task becomes a YAML specification before a single line of code changes. The spec defines what will change, in what order, with what acceptance criteria, and how to roll it back if it breaks. A human reviews and approves the spec. Only then does the agent execute - phase by phase, validated at every checkpoint, auditable after the fact.

This isn’t a wrapper around a prompt. It’s a development methodology - the same separation of planning from execution that every serious engineering discipline has always required, applied to the one context where people decided to skip it entirely.

The spec as contract

The spec isn’t documentation. It’s a contract between what was requested and what gets delivered.

Before any code changes, there must be a machine-readable YAML artifact that defines precisely what will change. The human approves the plan, not the outcome. Once approved, the agent operates autonomously within those bounds. Another agent - or human - can pick up the same spec and execute it identically. Prompts are ephemeral. Specs are artifacts.

A spec declares:

  • Task definition - objectives, scope boundaries (in/out), assumptions, size, risk level
  • Context - packages affected, files impacted with line ranges, architectural invariants that must be preserved
  • Touchpoints - every system, module, or adapter the change will touch
  • Risks - what could go wrong, impact level, mitigation plan
  • Phases - ordered execution steps, each with file-level change declarations and acceptance criteria
  • Rollback - per-phase undo commands so failure is recoverable, not catastrophic
  • Definition of done - explicit checklist items that get checked off during execution
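
As a sketch, a spec carrying these fields might look like the following; the structure mirrors the list above, but the field names and values are illustrative, not the exact Trellis schema:

```yaml
# Hypothetical spec sketch -- field names mirror the sections above,
# not the exact Trellis schema.
id: add-error-codes
status: draft
task:
  objective: Return structured error codes from the public API
  scope_in: [packages/api]
  scope_out: [packages/sdk]
  size: medium
  risk: medium
context:
  files:
    - path: packages/api/src/errors.ts
      lines: 10-80
  invariants: [public-api-stability]
touchpoints: [api, sdk-adapter]
risks:
  - what: clients may depend on current error strings
    impact: high
    mitigation: keep the legacy message field during rollout
phases:
  - name: introduce-error-enum
    changes: [packages/api/src/errors.ts]
    acceptance: ["npm test -- errors"]
    rollback: git checkout -- packages/api/src/errors.ts
definition_of_done:
  - every endpoint returns a code field
```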

The spec forces the outcome into view before it happens. You can’t review an outcome you didn’t see coming.

Planning mode

Planning isn’t “generate a spec.” It’s a structured exploration cycle that the agent runs conversationally with the developer:

  1. THOUGHT - interpret the request in repo terms, identify unknowns
  2. ACTION - search the codebase, read files, check diffs to answer those unknowns
  3. OBSERVATION - capture what was learned: files, invariants, risks, dependencies
  4. THOUGHT - update the spec, ask clarifying questions when information is missing
  5. REPEAT until all required fields are filled and assumptions are explicit

The agent is in read-only mode during planning. It can explore anything but change nothing outside .ai/specs/. If planning gets blocked on missing information, the spec saves with status: under_review and the agent tells you exactly what it needs. Max 20 cycles - if it’s still uncertain, it documents assumptions and moves on. No infinite loops.
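
A minimal sketch of that bounded loop, assuming a dict-shaped spec and a read-only `explore` callback (the names are illustrative, not Trellis internals):

```python
MAX_CYCLES = 20

def plan(spec, explore, required_fields):
    """THOUGHT -> ACTION -> OBSERVATION cycles, bounded at MAX_CYCLES."""
    for _ in range(MAX_CYCLES):
        unknowns = [f for f in required_fields if f not in spec]
        if not unknowns:
            break                        # spec complete: hand off for approval
        field = unknowns[0]              # THOUGHT: pick the next unknown
        answer = explore(field)          # ACTION: read-only repo exploration
        if answer is None:               # blocked on missing information
            spec.setdefault("open_questions", []).append(field)
            break
        spec[field] = answer             # OBSERVATION folded into the spec
    else:
        # Cycle budget exhausted: record assumptions instead of looping forever
        spec.setdefault("assumptions", []).append("cycle budget exhausted")
    spec["status"] = "under_review"      # planning always ends at the human gate
    return spec
```

Either way out, the spec lands in front of a human: complete, blocked with explicit questions, or out of budget with documented assumptions.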

This is how you get specs that are executable by another agent without any additional back-and-forth. The planning loop does the work upfront so execution doesn’t have to guess.

Lifecycle and execution

                    ┌─── request changes ───┐
                    ▾                       │
  draft ──▸ under_review ──▸ approved ──▸ in_progress ──▸ completed
                                │              │
                             HUMAN           phase loop
                             GATE         ┌───────────────┐
                                          │ apply changes │
                                          │ run criteria  │
                                          │ record result │
                                          └───────────────┘

                                          failed ──▸ rollback ──▸ resume

The filesystem is the state machine. Specs physically move between directories as they progress:

.ai/specs/
  drafts/          planning in progress
  approved/        human-reviewed, ready for execution
  active/          currently executing
  archive/YYYY-MM/ completed, failed, or cancelled

Each transition is enforced by the CLI. You can’t skip the approval gate. You can’t execute a draft. The state machine has opinions and it will tell you when you’re wrong.
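
The enforcement can be pictured as a transition table over directory moves. This sketch assumes the layout above; the function and table are hypothetical, not the actual CLI code:

```python
from pathlib import Path
import shutil

# Hypothetical transition table; approval can't be skipped because
# drafts -> active simply isn't in it.
ALLOWED = {
    ("drafts", "approved"),      # trellis approve
    ("approved", "active"),      # trellis start
    ("active", "archive"),       # trellis complete / fail / cancel
}

def transition(root: Path, task: str, src: str, dst: str) -> Path:
    if (src, dst) not in ALLOWED:
        raise ValueError(f"illegal transition {src} -> {dst}")
    spec = root / src / f"{task}.yaml"
    if not spec.exists():
        raise FileNotFoundError(f"{task} is not in {src}/")
    target = root / dst
    target.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(spec), str(target / spec.name)))
```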

Execution is phase-by-phase. Each phase reads the spec, applies changes, runs every acceptance criterion, and records pass/fail results with timestamps directly into the spec file. If a criterion fails, the phase rolls back independently - completed phases stay intact. If the execution gets interrupted, the resume protocol picks up from the first pending or failed phase, not from scratch.
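
A hedged sketch of that phase loop, with `apply_changes` passed in as a callback; all names are illustrative, and acceptance criteria are assumed to be shell commands:

```python
import subprocess
import datetime

def execute(phases, apply_changes):
    """Run each phase; resume from the first pending/failed one."""
    for phase in phases:
        if phase.get("result") == "pass":
            continue                     # resume protocol: completed phases stay intact
        apply_changes(phase)
        ok = all(
            subprocess.run(cmd, shell=True).returncode == 0
            for cmd in phase["acceptance"]
        )
        phase["result"] = "pass" if ok else "fail"
        phase["checked_at"] = datetime.datetime.now(datetime.timezone.utc).isoformat()
        if not ok:
            # Per-phase rollback: undo only this phase, then stop
            subprocess.run(phase["rollback"], shell=True)
            return False
    return True
```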

Keeping agents honest

Validation profiles

Not all changes deserve the same level of verification. Trellis scales validation in proportion to risk:

  • Light (micro/small, low risk) - compile check + acceptance criteria only
  • Standard (medium risk) - add targeted tests per phase, full test suite + linter + typecheck + security scan before commit
  • Strict (high risk) - everything in standard, plus boundary checks per phase to catch cross-module side effects

The profile derives from the task’s risk level or can be set explicitly. A one-line typo fix doesn’t need the full test suite. A schema migration does.
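
The derivation might be sketched as a simple mapping; only the three profile names come from the list above, the exact rules here are an assumption:

```python
# Illustrative profile table and derivation rules (assumed, not Trellis source).
PROFILES = {
    "light":    ["compile", "acceptance"],
    "standard": ["compile", "acceptance", "tests", "lint", "typecheck", "security"],
    "strict":   ["compile", "acceptance", "tests", "lint", "typecheck",
                 "security", "boundaries"],
}

def profile_for(size: str, risk: str, override: str = "") -> str:
    if override:
        return override                  # explicit profile wins over derivation
    if risk == "high":
        return "strict"
    if risk == "medium":
        return "standard"
    return "light" if size in ("micro", "small") else "standard"
```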

Scope auditing

trellis audit diffs actual git changes against the files declared in the spec. If the agent touched files it didn’t declare - the exact scope creep AI agents are notorious for - the audit flags it:

trellis audit add-error-codes -b main
# Scope drift: 12% (3/25 files undeclared) ── exit 1

Three categories: declared and changed (green), changed but not in spec (red - scope creep), and in spec but not changed (yellow - incomplete). This is how you catch an agent that “helpfully” refactors three unrelated modules while fixing a bug.
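
Under the hood the audit amounts to set arithmetic over file paths. Here is a sketch, with `classify` separated out so it can be checked without a git repo; the names are illustrative:

```python
import subprocess

def audit(declared: set, base: str = "main") -> dict:
    """Diff actual git changes against the spec's declared files."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    ).stdout
    changed = {line for line in out.splitlines() if line}
    return classify(declared, changed)

def classify(declared: set, changed: set) -> dict:
    undeclared = changed - declared          # red: scope creep
    untouched = declared - changed           # yellow: incomplete
    drift = len(undeclared) / len(changed) if changed else 0.0
    return {
        "ok": changed & declared,            # green: declared and changed
        "scope_creep": undeclared,
        "incomplete": untouched,
        "drift_pct": round(100 * drift),
    }
```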

Self-evaluation

After execution, the agent scores its own work against a weighted rubric:

  • Completeness (weight 3) - did it meet all the requirements, handle edge cases, follow conventions?
  • Architecture fidelity (weight 3) - did it respect layer boundaries, use established patterns, improve separation?
  • Spec alignment (weight 2) - did it match what was planned, or propose improvements?
  • Validation depth (weight 2) - did it verify thoroughly with targeted and broader checks?

Below 7/10 triggers a mandatory second pass. And here’s the thing - when the agent gives itself a 9 or 10 without noting any deviations or improvements, the CLI warns you. A perfect score with no self-criticism is a rubber stamp, not a review. Scores above 8 should document at least one deviation or improvement. 10/10 means flawless with improvements beyond the spec - are you sure?
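
The weighted score and the rubber-stamp warning might be computed like this; the weights follow the list above, while the thresholds and warning heuristic are assumptions:

```python
# Weights from the rubric above; heuristics below are illustrative.
WEIGHTS = {"completeness": 3, "architecture": 3, "alignment": 2, "validation": 2}

def self_eval(scores: dict, deviations_noted: int) -> dict:
    total = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS) / sum(WEIGHTS.values())
    result = {"score": round(total, 1), "second_pass": total < 7}
    if total > 8 and deviations_noted == 0:
        result["warning"] = "high score with no self-criticism: rubber stamp?"
    return result
```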

Scores are a permanent record in the archived spec. The agent can’t just ship something and move on. It has to justify its work, and the tooling keeps it honest.

Guardrails

Safety controls

Some actions require human approval regardless of the spec: schema migrations, public API changes, data deletion, production deployments. These are defined in config.yaml and enforced during execution. If a spec’s constraints intersect with the safety rules, the agent pauses and asks.

Trellis also automatically prevents common security violations: hardcoded secrets, unbounded queries, SQL injection patterns, XSS vulnerabilities. The security scan runs as part of the standard and strict validation profiles, checking every file the agent touched.
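
A pattern-based scan along those lines could be sketched as follows; the regexes are deliberately crude illustrations, not the actual Trellis rules:

```python
import re

# Illustrative rules only -- real checks would be richer than line regexes.
RULES = {
    "hardcoded_secret": re.compile(r"(?:api_key|secret|password)\s*=\s*['\"]\w+['\"]", re.I),
    "unbounded_query": re.compile(r"select\s+\*\s+from\s+\w+\s*;?\s*$", re.I),
    "string_built_sql": re.compile(r"execute\(.*(\+|%s|\{)", re.I),
}

def scan(source: str) -> list:
    """Return 'line: rule' findings for one touched file's contents."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), 1):
        for rule, pattern in RULES.items():
            if pattern.search(line):
                findings.append(f"{lineno}: {rule}")
    return findings
```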

Invariants

Non-negotiable architectural rules the agent cannot violate regardless of the task:

  • Domain boundaries - services stay in their layers, no circular dependencies
  • No legacy fallbacks - no dual-reads, dual-writes, or runtime shims. Migrate immediately with a one-off script
  • Public API stability - HTTP contracts and event schemas don’t change without explicit approval
  • Config from environment - never hardcoded
  • No test logic in production - fixtures and mocks stay in test files

These are customisable per project. You define your own invariants in AGENTS.md and reference them by name in config.yaml. Every spec declares which invariants it must preserve, and the agent references them during planning and execution.
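
Wired together, the relevant slice of config.yaml might look like this hypothetical fragment; the key names are assumptions:

```yaml
# Hypothetical config.yaml fragment -- key names are illustrative.
invariants:                        # defined in AGENTS.md, referenced by name
  - domain-boundaries
  - no-legacy-fallbacks
  - public-api-stability
  - config-from-environment
  - no-test-logic-in-production
safety:
  require_human_approval:          # pause and ask, regardless of the spec
    - schema_migrations
    - public_api_changes
    - data_deletion
    - production_deployments
```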

If the task requires violating an invariant, the agent pauses and asks. This is non-optional.

How you use it

Trellis works with Claude Code, Cursor, Copilot, Windsurf, or any AI coding agent. It’s not a Claude wrapper. The spec is a YAML file. The CLI is a Python script. The prompts are markdown. Any agent that can read files and execute shell commands can run the full workflow.

Plan mode and exec mode instructions live in .ai/prompts/plan.md and .ai/prompts/exec.md. Point your agent at the right prompt and it knows what to do. The config, the schema, the invariants, the validation pipeline - all of it is agent-readable by design.

For projects with multiple codebases - an API, a frontend, an SDK, an MCP server - the workspace pattern gives the agent visibility across all of them from a single root. Create a root repo, add your codebases as git submodules, run trellis init. The root holds the orchestration layer and the agent sees the whole picture. If you’re running AI agents across multiple repos without a unified root, you’re asking the agent to plan with half the context.

trellis init                 Scaffold workspace (copies templates, creates directories)
trellis new <task>           Create a spec (scaffold in drafts/)
trellis approve <task>       Human approval gate (drafts/ -> approved/)
trellis start <task>         Begin execution (approved/ -> active/)
trellis exec <task>          Run acceptance criteria, record results
trellis exec <task> -p phase Run criteria for a specific phase
trellis audit <task> -b main Scope drift check against git
trellis diff <task>          Show git history for a spec
trellis complete <task>      Archive with full audit trail
trellis fail <task>          Archive as failed
trellis cancel <task>        Archive as cancelled
trellis status <task>        Review spec details and progress
trellis list [filter]        List specs by state
trellis validate <task>      Check spec against JSON schema
trellis report               Aggregate stats across all specs

trellis report aggregates pass rates, self-eval scores, scope drift, size/risk distributions, and monthly activity across your entire spec history. It flags specs completed without running exec and treats suspiciously high self-eval scores with appropriate skepticism. The archive is permanent and the report reads all of it.

Why this exists

We built Trellis because every AI coding workflow we used was broken in the same way. The agent would receive a task, immediately start modifying files, and produce something that was technically functional but architecturally thoughtless. Ask it to add a feature and it might refactor three other things along the way. Ask it to fix a bug and it might introduce a dependency you didn’t want. There was no contract between what was requested and what was delivered, and no way to verify the difference after the fact.

The spec is the contract. It forces the planning to happen explicitly, in a format that a human can review and a machine can validate. It creates an audit trail that answers “what changed, why, and did it match what was agreed.”

Most tools in the AI coding space optimise for speed. Generate more code, faster. Trellis optimises for correctness. The extra minutes spent planning save hours of debugging. The spec becomes the documentation. The approval gate prevents the kind of architectural drift that turns codebases into archaeological sites.

Completed specs archive to .ai/specs/archive/ with full execution logs, self-evaluation scores, acceptance criteria results, and git diffs. This isn’t temporary logging. This is a permanent audit trail - because when something breaks in production six months from now, you need to trace back to the spec that approved it.