Sam Naji

Vibe Coding OS: How a Small Team Runs an Engineering Org

TL;DR — Vibe Coding OS (VCOS) is the autonomous engineering system we built at UniqLearn. It pairs a hierarchical product specification layer (Product Line → Feature → Story → Task) with a four-layer, ~30-gate harness and a five-persona AI review chain. Together they let a small team operate like a full engineering organization: we write specs, the system writes code, and the gates decide what ships. This case study walks through how VCOS is structured, what it produces, where it breaks, and why the specification layer — not the harness — is the part that matters.

The claim

The industry has settled on harness engineering as the answer to autonomous coding. OpenAI wrote a million-line codebase with three engineers and 1,500 PRs by getting the linters, structural tests, and AGENTS.md files right. Anthropic formalized the generator-evaluator pattern. Stripe, Martin Fowler, and half of HN followed.

The harness is real. We run a production-grade one: ~30 independent gates across four layers, a five-stage AI review chain with personas and cost budgets, graded exit codes, auto-fix behaviors, and bypass auditing. It works.

But a year of shipping with it has taught us something the harness discourse keeps missing: the harness tells an agent how to write code. It does not tell the agent what to build. The hardest problem in autonomous engineering is not code quality. It is product intent — who the user is, what job they are doing, what would surprise them, what adjacent scope must not be touched. A linter cannot catch “you built the enrollment flow but forgot to gate it by login type.”

Vibe Coding OS is what we built to solve that. This is the case study.

What Vibe Coding OS is

VCOS is the full operating system that sits between our team and the codebase. It has three layers.

The three layers of Vibe Coding OS: the product specification (human authored), the quality gates (human approved), and the harness of ~30 gates across four layers (enforced by the system).
The three layers of VCOS. Each one does a job the next cannot.
  1. The product specification layer. A Notion database called Product Item that holds every piece of work — from product line down to individual task — in a strict, JTBD-driven hierarchy with AI-aware templates.
  2. The quality gates. A set of validation gates (tech stack, feature catalogue, ADR approval, acceptance criteria) that a work item must pass before and after implementation.
  3. The harness. Four layers of CI/CD and local gates — ~30 independent checks, a five-persona AI review chain, graded exit codes, post-merge audits — that enforce how code gets written and merged.

The agent reads from layer 1, gets permission to proceed from layer 2, and is policed by layer 3. What the humans on the team do is write and maintain layer 1 and approve the hard decisions in layer 2. Everything else runs on rails.

If you want to understand why this matters, the rest of this article walks through one real feature — the Class Configuration Page — at every level of the system.

The product specification layer

The top of VCOS is a four-level hierarchy enforced by Notion templates. Each level has its own schema, its own pitfall warnings, and its own rules for what counts as “ready for the agent to start.”

Level 1 — Product Line (the “why”)

A product line defines a domain of customer value using Jobs-to-Be-Done. The template forces you to answer the foundational questions before any code is written. It requires, verbatim:

Who & Why

  • Job Performer(s): Who is the primary user this Product serves?
  • Domain: What broad area of customer value does this cover?
  • Job Context: In what situation does the job performer interact with this part of the product?

JTBD Statement: “When [job performer] is in [situation/context], they want to [broad outcome], so they can [ultimate benefit].”

It also requires Job Scope (what this domain owns and explicitly does not own), Limitations & Constraints, Dependencies & Handoff Points, Ownership & Measurement, and Expectations & Trust Boundaries. Structured properties — Leader, Maintainer, Launch Stage, OKRs, customer-facing or internal — are database fields, not suggestions in a doc. They feed views, filters, and formulas.

For UniqLearn’s core platform, the product line establishes the teacher-to-student workflow: teachers log in, configure classes, generate AI-powered differentiated assignments, and students take them with real-time feedback and AI tutoring. It also names what is out of scope — district admin, content moderation logic, internal ML training thresholds — to stop the agent from scope-creeping into adjacent systems.

And the template is opinionated. It includes explicit AI guardrails:

AI PITFALL — JTBD TAUTOLOGY: The three clauses (situation, outcome, benefit) must express three DISTINCT ideas. If they restate the same concept, mark [TBD] and flag for rewrite.

AI PITFALL — FABRICATED CONSTRAINTS: Only document constraints that have been stated, discovered, or are verifiable. Use [TBD] if none known.

Every section is designed to prevent a specific failure mode. Nothing is decorative.

Level 2 — Feature (the “what” and “how”)

A feature lives beneath a product line and bridges strategy with implementation. The Class Configuration Page is one such feature. It sits inside Epic 1 (Teacher Dashboard) and covers the teacher’s job of configuring class details and enrolling students.

The feature template is where VCOS stops looking like product management and starts looking like a buildable spec. The live Feature Flags Module page (currently in Tech Planning) is a good reference point. Its Current State section maps what already exists with file locations:

  • SchoolFeatureFlag model · schema.prisma lines 692–706
  • DistrictFeatureFlag model · schema.prisma lines 708–717
  • Seed data references · prisma/seed.ts

Its Gaps section enumerates nine specific missing pieces (no service layer, no admin API, no inheritance resolution, no audit trail, no gradual rollout, etc.). Its Technical Approach defines the new Prisma models inline. Its API Routes section specifies seven endpoints with auth requirements per endpoint. Its Resolution Algorithm defines a four-level inheritance chain (teacher → class → school → district → default). Its Caching Strategy specifies a 5-minute TTL with invalidation on override mutations and cache warming on startup. Its Acceptance Criteria are testable, with performance bounds (5ms flag resolution).
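The resolution algorithm and caching strategy above can be sketched roughly like this. This is an illustrative sketch, not UniqLearn's actual implementation — names like `resolveFlag`, `FlagOverrides`, and `cachedResolve` are hypothetical:

```typescript
// Illustrative sketch of the four-level inheritance chain from the spec:
// teacher -> class -> school -> district -> default.
type Level = "teacher" | "class" | "school" | "district";

interface FlagOverrides {
  // Each level may or may not carry an override for a given flag.
  [level: string]: Record<string, boolean | undefined>;
}

const PRECEDENCE: Level[] = ["teacher", "class", "school", "district"];

function resolveFlag(
  flag: string,
  overrides: FlagOverrides,
  defaultValue: boolean
): boolean {
  // Walk from most specific to least specific; the first level
  // that defines the flag wins.
  for (const level of PRECEDENCE) {
    const value = overrides[level]?.[flag];
    if (value !== undefined) return value;
  }
  return defaultValue;
}

// A minimal 5-minute TTL cache, invalidated on override mutations.
const TTL_MS = 5 * 60 * 1000;
const cache = new Map<string, { value: boolean; expires: number }>();

function cachedResolve(
  flag: string,
  overrides: FlagOverrides,
  defaultValue: boolean,
  now = Date.now()
): boolean {
  const hit = cache.get(flag);
  if (hit && hit.expires > now) return hit.value;
  const value = resolveFlag(flag, overrides, defaultValue);
  cache.set(flag, { value, expires: now + TTL_MS });
  return value;
}

function invalidate(flag: string): void {
  cache.delete(flag); // called on any override mutation
}
```

The point of specifying this in the feature doc is that precedence order and invalidation triggers are product decisions, not something the agent should improvise.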

The feature estimate at the bottom — 7 to 10 days — means something because the spec is complete enough to estimate against. No one is guessing.

For the Class Configuration Page specifically, the same template captures what the agent must honor: SSO-managed districts cannot use it (roster changes flow through an overnight Edlink sync); deduplication must run before any new student account is created; license checks run at the district level before account creation; the same enrollment flow must be accessible from two entry points with identical modal behavior. These are product constraints. A harness cannot discover them.

Level 3 — Story (the “implementable increment”)

Stories are where the system gets surgical. The template decomposes work into:

  1. Context & Intent — standard “As a [role], I want to [action], so that [outcome]” plus an explanation of how the story advances the parent feature.
  2. Preconditions & Postconditions — what must be true before the story can begin, and what becomes true after it ships. Forces thinking about state transitions, not just UI.
  3. Task Scope (with explicit exclusion) — both what the story delivers and what it does not deliver, naming the adjacent story where excluded work belongs. This is the most underrated section in the entire system; scope boundaries become mapped, not debated.
  4. Definition of Done — pass/fail checkboxes covering happy path, error states, and in-scope edge cases, plus a verification approach for QA.
  5. Dependencies & Handoff — split into Frontend, Backend, and Both lanes so the agent knows the API contract, field names, and response shape before writing either side.
  6. Expectations & Trust Boundaries — what would surprise the user, where this behavior belongs and where it would feel out of place, what existing behavior must remain unchanged, and when the story is doing too much.
  7. Open Questions — structured callouts with question title, implementation impact, and what is blocked or at risk until the question resolves.

For the “Add Individual Students” story under the Class Configuration Page, the spec names the acceptance conditions concretely: native-login teachers can enroll students one at a time via modal; the modal requires email, first name, last name, password, confirm password, and age; deduplication and license check run automatically on submission; existing students are enrolled in the class without creating a new account; new students are created and enrolled in both the class and the district; SSO-managed teachers never see the flow.
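The ordering those acceptance conditions mandate — deduplication before any account creation, license check at the district level before a new account — can be sketched like this. A hypothetical sketch only: the interface and function names (`EnrollmentDeps`, `addIndividualStudent`) are invented for illustration, and the dedupe key is assumed to be email:

```typescript
// Hypothetical sketch of the "Add Individual Students" submission flow.
interface Student { id: string; email: string }

interface EnrollmentDeps {
  findStudentByEmail(email: string): Student | null; // dedupe check
  hasAvailableLicense(districtId: string): boolean;  // district-level license check
  createStudent(email: string, districtId: string): Student;
  enrollInClass(studentId: string, classId: string): void;
}

function addIndividualStudent(
  deps: EnrollmentDeps,
  email: string,
  classId: string,
  districtId: string
): { student: Student; created: boolean } {
  // 1. Deduplication runs BEFORE any new account is created.
  const existing = deps.findStudentByEmail(email);
  if (existing) {
    // Existing students are enrolled without creating a new account.
    deps.enrollInClass(existing.id, classId);
    return { student: existing, created: false };
  }
  // 2. License check runs before account creation.
  if (!deps.hasAvailableLicense(districtId)) {
    throw new Error("No district licenses available");
  }
  // 3. New students are created (in the district) and enrolled in the class.
  const student = deps.createStudent(email, districtId);
  deps.enrollInClass(student.id, classId);
  return { student, created: true };
}
```

Note that the branch order itself is the spec: swap steps 1 and 2 and you create phantom accounts for students who already exist.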

And critically, the story carries its open questions with it. Here is one taken directly from the spec:

Deduplication matching logic — What field(s) does the deduplication check use to determine if a student already exists? Is it email address only, or a combination of email, name, and school/district? What happens in the case of a partial match (e.g., same email but different name on record)?

Until a human answers that question and marks the spec as ready, the agent cannot begin. Open questions are a blocking dependency on the team, not a Slack thread that quietly rots.

Level 4 — Task (the atomic unit)

Tasks auto-link to their parent story. Their completion percentages — Task Completion %, Bug Completion % — roll up through formulas to the story, feature, and product line levels. Nothing is manually status-updated. Derived T-Shirt, Story Points, Sprint, Quarter (Auto), and Hierarchy Check are all computed, not guessed. The system computes what most teams argue about.
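The roll-up idea — a parent's completion is always derived from its children, never set by hand — looks roughly like this in code. The Notion formulas themselves are not shown here; this is an illustrative equivalent with invented names:

```typescript
// Completion rolls up: task -> story -> feature. Nothing is hand-edited.
interface Task { done: boolean }
interface Story { tasks: Task[] }
interface Feature { stories: Story[] }

function storyPct(s: Story): number {
  if (s.tasks.length === 0) return 0;
  const done = s.tasks.filter((t) => t.done).length;
  return Math.round((100 * done) / s.tasks.length);
}

function featurePct(f: Feature): number {
  if (f.stories.length === 0) return 0;
  const total = f.stories.reduce((sum, s) => sum + storyPct(s), 0);
  return Math.round(total / f.stories.length);
}
```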

The four-level product specification hierarchy: Product Line (the “why” — JTBD, trust boundaries, scope) contains Features (the “what” — API routes, models, acceptance criteria), which contain Stories (the increment — preconditions, scope, definition of done), which contain Tasks (API endpoint, UI component, test coverage).
Each level nests inside its parent. Completion rolls up automatically — nothing is manually status-updated.

The quality gates

Between a spec being “ready” and the agent writing code, VCOS runs four quality gates. Each has the authority to halt the pipeline.

A feature’s path from ready spec through four quality gates, implementation, and post-merge audit: spec ready → G1 tech stack → G2 catalogue → G3 ADR → implement → G4 acceptance → harness (~30 checks) → merge. Three human intervention points; everything between runs on rails.
The path a feature takes from ready spec to merge. Gates 1–4 govern what is built; the harness governs how.

Gate 1 — Tech stack validation. A tech stack document maps every approved technology to the features it can serve. Before the agent builds, the pipeline asks: does the current stack have the capability to serve every acceptance criterion in this spec? If not — say, the feature requires real-time WebSocket updates and we don’t have a WebSocket layer — the gate fails. The agent cannot improvise a solution. A human either updates the stack or revises the spec. This single gate has killed the most common autonomous-engineering failure mode I’ve seen: the agent pulling in an unauthorized dependency or inventing a novel pattern nobody reviewed.
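The mechanics of Gate 1 reduce to a set comparison: every acceptance criterion declares the capabilities it needs, and the gate fails if the approved stack lacks any. A hypothetical sketch (the `Spec` shape and `techStackGate` name are invented for illustration):

```typescript
// Gate 1 sketch: fail if any acceptance criterion needs a capability
// the approved tech stack does not provide.
interface Spec {
  acceptanceCriteria: { text: string; needs: string[] }[];
}

function techStackGate(
  spec: Spec,
  stackCapabilities: Set<string>
): { pass: boolean; missing: string[] } {
  const missing = spec.acceptanceCriteria
    .flatMap((ac) => ac.needs)
    .filter((cap) => !stackCapabilities.has(cap));
  // Deduplicate so each missing capability is reported once.
  return { pass: missing.length === 0, missing: [...new Set(missing)] };
}
```

On failure, the agent does not improvise: a human either adds the capability to the stack document or revises the spec.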

Gate 2 — Feature catalogue verification. The feature catalogue is a validated mapping of every feature to its implementation status, tech stack requirements, and cross-feature dependencies. Before a story starts, the pipeline checks: does the parent feature exist in the catalogue? Are all blocking dependencies complete? Have the tech stack requirements been satisfied? Has the feature been modified since the last human review? This catches specification drift — when an upstream spec changes after a downstream story has already begun — before it becomes silent rework.
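Gate 2's checks can be sketched as a lookup plus two comparisons — dependency completeness and a drift test on timestamps. Again illustrative, with an invented `CatalogueEntry` shape:

```typescript
// Gate 2 sketch: catalogue membership, blocking dependencies, and
// specification-drift detection via edit/review timestamps.
interface CatalogueEntry {
  id: string;
  blockedBy: string[];   // ids of blocking features
  complete: boolean;
  lastModified: number;  // when the spec was last edited
  lastReviewed: number;  // when a human last reviewed it
}

function catalogueGate(
  featureId: string,
  catalogue: Map<string, CatalogueEntry>
): string[] {
  const entry = catalogue.get(featureId);
  if (!entry) return [`feature ${featureId} not in catalogue`];
  const errors: string[] = [];
  for (const dep of entry.blockedBy) {
    if (!catalogue.get(dep)?.complete) {
      errors.push(`blocking dependency ${dep} incomplete`);
    }
  }
  if (entry.lastModified > entry.lastReviewed) {
    errors.push("spec modified since last human review (drift)");
  }
  return errors; // empty array means the gate passes
}
```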

Gate 3 — ADR approval. The agent cannot introduce a new architectural pattern without a human-approved Architecture Decision Record. It can propose an ADR — and usually does — but it cannot decide on its own to switch from REST to GraphQL, add a state management library, or change the authentication flow. In practice most ADRs are drafted by the agent and approved by a human; it identifies the decision point, writes the options and tradeoffs, we redirect or approve, and then it proceeds.

Gate 4 — Acceptance criteria verification. After implementation, the pipeline replays the acceptance criteria from the feature spec as a verification pass. This is semantic verification, not unit tests: did the agent build what the spec asked for? Some checks are automated, some are run by the evaluator agent interacting with the running app, and some are flagged for human review. The pipeline knows which is which.

These four gates govern what the agent builds. The harness that governs how sits beneath them.

The harness beneath the gates

The harness is the part of VCOS that looks most like what OpenAI and Anthropic have been writing about. It is, deliberately, only one-third of the story — but it is substantial enough that the rest of the system can afford to be aggressive.

Four layers. ~30 independent checks. The design treats quality as a pipeline of graded gates, not a single wall.

Layer 1 — Local git hooks (.github/hooks/) install automatically on pnpm install via a symlink script. Because the hooks live in version control, every clone gets the same set — no per-developer drift. The pre-commit hook is staged-file aware and runs five sequential phases: registry regeneration, repo-map artifact generation (with 15-second timeouts that warn but don’t block), JS/TS checks via ESLint + CodeScene + cspell, doc drift detection, and doc coverage auto-generation. That last one is the spicy part — if a required doc is missing, the hook generates it via Claude and stages it alongside your commit. Graded exit codes distinguish script bug (don’t block), recommended gap (warn), and required gap (block). The commit-msg hook enforces [#N] issue traceability. The post-commit hook can’t block a commit that already happened, so it forensically detects --no-verify bypasses by checking for a breadcrumb file and appends to a local bypass-audit.log. Trust developers. Keep receipts.
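The graded exit-code idea can be sketched as follows. The real hook is a shell script; this TypeScript version is an illustrative paraphrase of the three behaviors described above, with invented names:

```typescript
// "Graded, not binary": only a required gap blocks the commit.
// Script bugs and recommended gaps surface a warning and let it through.
type Grade = "pass" | "script-bug" | "recommended-gap" | "required-gap";

interface CheckResult {
  name: string;
  grade: Grade;
  message?: string;
}

function exitCodeFor(results: CheckResult[]): number {
  let code = 0;
  for (const r of results) {
    if (r.grade === "required-gap") {
      console.error(`BLOCK  ${r.name}: ${r.message ?? "required gap"}`);
      code = 1; // the only grade that stops the commit
    } else if (r.grade !== "pass") {
      console.warn(`WARN   ${r.name}: ${r.message ?? r.grade}`);
    }
  }
  return code;
}
```

The asymmetry is deliberate: a bug in the gate itself must never block a developer, so a crashed check downgrades to a warning instead of a failure.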

Layer 2 — PR structural gates. pr-template-enforcement auto-closes non-compliant PRs with a guidance comment. pr-check enforces branch-name traceability (feat/123-slug) and commit grouping. CODEOWNERS routes frontend, backend, AI, and QA changes to the right reviewers, and — critically — names a single protected gatekeeper for the gates themselves: @smnji is the sole approver for .github/workflows/, .github/scripts/, .claude/, and .pre-commit-severity.json. You cannot relax a gate without the gatekeeper.
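The traceability rules — `feat/123-slug` branch names and `[#N]` commit messages — amount to two regex checks. The patterns below are illustrative approximations, not the exact ones in pr-check:

```typescript
// Sketch of the branch-name and commit-message traceability rules.
// Branch: type/issue-number-slug, e.g. feat/123-class-config.
const BRANCH_RE = /^(feat|fix|chore)\/(\d+)-[a-z0-9-]+$/;
// Commit message must reference an issue: [#N].
const COMMIT_RE = /\[#\d+\]/;

const branchOk = (name: string): boolean => BRANCH_RE.test(name);
const commitOk = (msg: string): boolean => COMMIT_RE.test(msg);
```

Cheap checks, but they are what makes the post-merge audit possible: every change traces back to an issue, mechanically.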

Layer 3 — CI/CD workflow gates (22 workflows). The centerpiece is the AI Review Pipeline, a five-stage chain where each gate has a persona, a model, and a dollar budget:

The five-stage AI review chain: Architect, QA Analyst, E2E QA, Inspector, Sentinel.
Five personas in sequence. Cost-awareness is built in — Gate 2 skips if Gate 1 blocks; the Inspector refuses to pass claims unsupported by evidence.
  • Gate 1, Architect (claude-opus-4-6, $0.50–$2.00): deep code review with inline severity tags (🔴 BLOCKING / 🟡 WARNING / 🟢 SUGGESTION / 💬 QUESTION)
  • Gate 2, QA Analyst (claude-sonnet-4-6, $0.10–$0.50): AC-to-test mapping; skipped if Gate 1 blocks (cost-aware)
  • Gate 3, E2E QA (<$3.00): browser-based scenario verification (currently disabled)
  • Gate 4, The Inspector (claude-opus-4-6, <$3.00): multimodal; verifies that the evidence (screenshots, API responses) matches the claims made. Under 80% AC coverage, or misleading evidence, fails the gate.
  • Gate 5, The Sentinel (claude-opus-4-6, <$5.00): holistic review; detects bypass patterns such as stub tests, placeholder screenshots, and fabricated receipts. Advisory only.

Around the AI chain sit the non-AI gates: blast-radius-tests maps changed files to impacted Vitest/Playwright tests and runs only those, observability-compliance enforces OTEL span naming (module.operation) and typed errors, doc-coverage-verify blocks on required documentation gaps, promptfoo-gate runs LLM prompt/eval tests for apps/ai/ changes, file-placement uses two LLM judges (Directory Judge, File Judge) to evaluate newly added files against structure-rules.json.
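Two of those non-AI gates are simple enough to sketch. Blast-radius selection is a reverse map from changed source files to the tests that cover them, and the observability check is a naming-convention test. Both sketches below are illustrative approximations, not the real workflow code:

```typescript
// blast-radius-tests sketch: run only the tests whose source files changed.
function impactedTests(
  changedFiles: string[],
  testToSources: Map<string, string[]>
): string[] {
  const changed = new Set(changedFiles);
  return [...testToSources.entries()]
    .filter(([, sources]) => sources.some((s) => changed.has(s)))
    .map(([test]) => test);
}

// observability-compliance sketch: OTEL span names must be module.operation.
const SPAN_RE = /^[a-z][a-zA-Z0-9]*\.[a-z][a-zA-Z0-9]*$/;
const spanNameOk = (name: string): boolean => SPAN_RE.test(name);
```

In the real pipeline the file-to-test mapping comes from the repo-map artifact the pre-commit hook maintains; here it is just a parameter.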

Layer 4 — Deploy and post-merge. render-preview-link wires up preview URLs; post-merge-audit is the backstop — it audits every merge into int/staging/main for issue traceability and creates a remediation issue if anything slipped through. The pre-merge gates may fail. The post-merge audit does not.

The ~30 gates by layer (each either blocks, warns, auto-fixes, or is disabled):

  • Local hooks (pre-commit · commit-msg · post-commit): registry regen (auto-fix), repo-map (warn on timeout), ESLint/CodeScene/cspell, doc drift, doc generation, commit-msg [#N], bypass audit
  • PR structural (template · branch · CODEOWNERS): pr-template-enforcement, pr-check branch name, CODEOWNERS routing, gate-gatekeeper @smnji
  • CI/CD gates (AI review chain + non-AI checks): G1 Architect, G2 QA Analyst, G3 E2E QA (disabled), G4 Inspector, G5 Sentinel (advisory), blast-radius-tests, observability-compliance, doc-coverage-verify, promptfoo-gate, file-placement judges, type-check, build
  • Deploy + audit (preview · post-merge audit): render-preview-link, post-merge-audit (backstop)
The ~30 gates, grouped by layer. Most block; a meaningful minority auto-fix or warn. Graded, not binary.

The common thread across all four layers is a philosophy I would summarize in three moves. Auto-fix over complain — registry regeneration, doc generation, planning-gate drafts, and doc-auto-fix PRs produce the missing artifact instead of yelling about it. Graded, not binary — friction is proportional to risk; bypass is logged, not forbidden. Smart-skip everywhere — Promptfoo passes instantly if no AI code changed; Observability skips non-observability PRs; Gate 2 skips if Gate 1 blocks. Signal matched to effort.

What VCOS looks like shipping a feature

Here is the Class Configuration Page redesign, through VCOS end-to-end.

Five-day timeline of shipping the Class Configuration Page through VCOS: spec review, gate checks, implementation, verification, deploy.

Day 1 — Specification review. The spec already existed in the Notion hierarchy. We reviewed it, resolved three of the five open questions (deduplication uses email-only matching at the district level; the license check runs after deduplication and before account creation; CSV import processes row-by-row with per-row error feedback rather than atomic all-or-nothing), and deferred two to a technical spike (exact license counting model; SSO teacher gating signal). We marked the spec ready on the three stories that weren’t blocked by the deferred questions.

Day 1–2 — Gate checks. Tech stack validation passed — the existing Next.js + Node stack had every required capability. Feature catalogue verification passed — the blocking dependency (“Navigating Into a Class” from Epic 1) was complete. One new ADR required, for the single-page layout consolidation replacing the legacy two-tab design. The agent drafted it, we approved it, the gate passed.

Day 2–4 — Implementation. VCOS consumed the story specs and began implementation. It built three stories across two sprints — Edit Class Details, Add Individual Students, Add Students via CSV. After each sprint, the evaluator agent ran the acceptance criteria. It caught two real issues: the CSV upload modal was surfacing an atomic all-or-nothing error instead of per-row feedback (the spec required per-row), and the success message was appearing before the deduplication check resolved, creating a race condition where the UI confirmed enrollment before knowing whether the account was new or existing. The agent fixed both before merge.

Day 4–5 — Verification and deployment. Final acceptance-criteria verification passed on 6 of 8 criteria (the remaining 2 depended on the deferred SSO gating and license model stories). Staging review was ours. Production pushed clean. The SSO and license stories followed in the next sprint after the technical spike resolved the deferred questions.

Total: 5 days from spec-ready to production on the first three stories. 14 commits. Three human intervention points: spec review, ADR approval, staging review. Everything else was autonomous.

Results

VCOS has been running in production for over a year. The product serves 14 school districts across 10 states, with real teachers and real students using it daily.


The repo tells a concrete story. Over a recent development window: 84 commits across 6 contributors (humans and autonomous agents), 48 PRs merged, +170k / -127k lines changed across 1,625 file changes. Net growth of ~43k lines of production code from merged PRs alone. 40 MB repo. 18 open issues.

Some specifics:

  • Specification coverage: 100% of features have a complete JTBD spec before development begins. Zero features are built from a verbal description or a Slack message.
  • Spec throughput: 2–4 hours per feature for a standard spec; up to a full day for complex features like the Generate Assignment Flow UI. This is the actual ceiling on velocity.
  • Gate catch rate: the evaluator and AI review chain catch issues like the CSV per-row error and the success-message race condition before merge, routinely.
  • Bypass audit: --no-verify bypasses are logged locally and surfaced in post-merge audits. The number of un-audited merges into protected branches is zero.

The comparison that matters: OpenAI’s Codex experiment used three engineers to produce 1,500 PRs and a million lines of code. We operate as a small team. VCOS is what makes the difference defensible — because the agent does not need us to explain what to build, how to write it, or how to verify it. It reads the spec. It obeys the gates. The harness does the rest.

Where VCOS breaks

We want to be precise about the limits, because this system is not free.

Specification writing is the bottleneck. The product spec hierarchy doesn’t write itself. Each feature spec takes 2–4 hours, including JTBD framing, acceptance criteria, trust boundaries, and open questions. For a complex multi-story feature, the spec itself can take a full day. The agent’s implementation speed is capped by our specification throughput. If we don’t write, nothing ships.

Open questions are a human bottleneck. The spec system surfaces questions that must be answered before the agent proceeds. The Class Configuration Page had five. These are product decisions, not engineering decisions, and the agent cannot resolve them. Every unanswered open question is a blocked story.

Novel interactions are hard to specify. JTBD works well for features with clear user jobs and predictable flows. It works less well for emergent behavior, like the AI tutoring chatbot (Brainy) that must guide students without ever revealing answers. Writing “never provides or confirms correct answers under any framing or circumstance” is easy. Verifying it is extremely hard.

Over-engineered for trivial work. Adding a loading spinner to the metrics page does not need a seven-section JTBD spec with trust boundaries and decomposition checks. We’ve introduced a lightweight story template for simple tasks, but the overhead of the full system is real for small work.

Context windows are still a constraint. A complex feature with a long spec, multiple cross-feature dependencies, and a full tech stack document can exceed the agent’s memory in a single pass. We address this with progressive disclosure — the agent reads the story-level spec during implementation, not the full product line document — but the limit is real.

Gate 3 (E2E QA) and the full Playwright suite are currently disabled. They worked, then the surface area outpaced the available budget and stability. The post-merge audit backstops much of this, but we’re running without a live E2E chain right now, and we want that back.

Why the specification layer is the part that matters

If you take one thing away from this case study, take this: the harness is necessary, and it is not sufficient. A beautiful linter cannot save you from shipping the wrong feature. A generator-evaluator loop with Playwright cannot tell you that SSO-managed districts should see no enrollment controls at all — not disabled with a tooltip, but removed entirely. Those decisions live in the spec or they do not live anywhere.

Harness engineering treats code quality as the bottleneck. Autonomous engineering at any real scale reveals that product intent is the bottleneck, and the way you encode product intent — the structure of your specs, the pitfall warnings in your templates, the trust boundaries you make first-class, the open questions you treat as blocking — is your organization’s capability. Changing the harness changes how the agent writes. Changing the specification layer changes what the company can build.

VCOS is the bet that the specification layer deserves the same engineering rigor as the harness. A year in, the bet has held.

Where VCOS is going

The obvious next capabilities: auto-generating acceptance criteria from JTBD statements, proposing feature decompositions from historical patterns, interactive dependency graphs that show the critical path through a release, automated specification-drift detection across the catalogue. Some of these already exist in partial form — the feature-catalogue verification gate catches drift today; the acceptance-criteria verification gate is a first pass at auto-validating JTBD statements against the running application. The harder problems are ahead.

The thing we keep coming back to is that autonomous engineering isn’t going to be won by whoever builds the best harness. It will be won by whoever builds the best operating system — spec, gates, and harness as a coherent whole, purpose-built for agent consumption. That is what VCOS is. That is what we’re continuing to sharpen.


Reach out on LinkedIn, subscribe for new case studies, or drop a question below.
