The agent told me the feature was complete. I opened the file. Six // TODO comments and a function that returned null. It had literally written itself a note to finish later and called it shipped.
Another time, I asked an agent to fix a race condition in a queue processor. After ten minutes of back-and-forth, it printed something I still think about: “I did not resolve it. I sidestepped it.” At least that one was honest.
And my personal favorite: I told an agent the tests were failing. It apologized, changed something unrelated, and said “that should fix it.” I told it the tests were still failing. It apologized again. We did this five times before I gave up and fixed it myself.
If you’ve used a coding agent for more than a week, you have stories like these. Everyone does. And most of us respond the same way: we blame the model.
“Claude hallucinated.” “GPT is dumb.” “Copilot doesn’t understand my codebase.”
I’ve been thinking about this differently. And I don’t think the model is the problem.
The Brilliant Contractor With No Blueprints
Imagine you hire a brilliant contractor. Decades of experience. Can build anything. You walk them to an empty lot and say: “Build me a house.”
No blueprints. No building codes. No site survey. No budget. No information about the soil, the climate, the neighborhood restrictions. Just “build me a house.”
What do you get?
If you’re lucky, you get something that stands. It probably won’t match what you imagined. It’ll have rooms where you didn’t want them and lack the ones you needed. The plumbing might work but it won’t be up to code. And when you complain, the contractor will say “you didn’t tell me.”
They’d be right.
This is what we’re doing with coding agents every single day. We open a terminal, type a prompt, and expect the agent to build exactly what we need. We give it instructions when it needs context. We give it prompts when it needs specifications.
The model isn’t stupid. It’s uninformed.
What Agents Actually Need (But We Never Provide)
Think about what a senior developer knows when they join your team. Not on day one. On day ninety. After three months of absorbing context that nobody wrote down.
They know the tech stack AND its limitations. Not “we use Cloudflare Workers.” But “Workers have a 6 simultaneous outbound connection limit per invocation, so if you’re building anything that fans out to multiple services, you need to batch and serialize. This changes the entire architecture for data aggregation jobs.” (That’s a real limitation, by the way. I watched an agent confidently build a parallel HTTP fan-out pattern that would have silently queued and timed out in production.)
They know the coding conventions. Not “write clean code.” But “we use early returns, errors are always typed, and every API endpoint has integration tests that hit a real database.”
They know what “done” means. Not “the function exists.” But “tests pass, types check, linting is clean, the PR description explains the why, and you’ve tested it against the staging environment.”
They know what’s been tried before. “We tried event sourcing for the audit log. It was overkill for our scale. Don’t go down that road again.”
They know what to never do. “Never call the payment API without idempotency keys. We learned this the hard way.”
None of this lives in a prompt. It accumulates over time. And we’re expecting agents to work without any of it.
The .eslintrc Moment
Remember the code style wars?
Before .eslintrc, every codebase was a battleground. Tabs vs. spaces. Semicolons or not. Where do the braces go. Code reviews were 80% style arguments and 20% actual logic discussion. Every new developer brought their own preferences. Every team had that one person who would reformat entire files.
Then someone had a simple idea: put the rules in a config file. Drop it in the project root. Everyone follows the same rules. .eslintrc didn’t make developers write better code. It gave them shared context about how this project does things.
New developer joins? The config tells them the rules. Disagreement about style? The config is the source of truth. Muscle memory from a previous job? The linter catches it before the PR review.
One config file ended years of arguments.
I think we need the same thing for AI-assisted development.
Right now, every developer prompts differently. Every agent session starts from zero. Results are wildly unpredictable. The same developer, same model, same codebase can get a brilliant result one session and a disaster the next. The variable isn’t the model. It’s the context.
What if you could drop a directory into your project that gives every agent session the context it needs to do consistently good work?
Introducing .agentlint
Here’s what I’ve been working toward. A specification directory that lives in your project root:
.agentlint/
├── agentlint.yaml # Project metadata, phases, validation gates
├── spec.md # Product + feature spec
├── stack/ # Tech stack playbooks
│ ├── cloudflare-workers.md
│ ├── planetscale.md
│ └── nextjs.md
├── patterns/ # How this codebase does things
│ ├── coding.md
│ ├── testing.md
│ └── deployment.md
├── context/ # Living context (human + agent maintained)
│ ├── project.md
│ ├── agent-learnings.md
│ └── features/
│ └── {feature-name}.md
├── plan.md # Evolutionary build plan
├── log.md # Resumability state
└── validation/
└── gates.yaml # Hard gates: tests, lint, types, docs
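To make the top-level config concrete, here’s a sketch of what agentlint.yaml could contain. Every field name below is illustrative; nothing here is a finalized schema.

```yaml
# Illustrative sketch only: field names are assumptions, not a finalized schema.
project: acme-dashboard
stack:                      # maps to the playbooks under stack/
  - cloudflare-workers
  - planetscale
  - nextjs
phases:                     # high-level phases, detailed in plan.md
  - data-model
  - authentication
  - realtime
validation:
  gates: validation/gates.yaml
context:
  human: context/project.md
  agent: context/agent-learnings.md
```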
Let me walk through why each piece matters.
spec.md: Blur the Product/Engineering Line
Most agent failures start before a single line of code is written. They start with an incomplete spec.
Here’s the thing: current agent tooling assumes the developer already knows exactly what to build. But the hardest part of engineering isn’t typing code. It’s the discovery that happens while understanding the problem. What happens when the API is down? What’s the scale expectation, ten users or ten million? Who else touches this data?
spec.md captures both product AND engineering context. Who has this problem? What’s painful about the current solution? What does success look like? And also: what are the technical constraints? What’s the performance budget? What integrations does this touch?
The product/engineering boundary needs to blur here because agents need both to do good work. An agent that understands “this is a compliance feature for financial services” will make fundamentally different architecture decisions than one that just sees “build an audit log.”
Filling this out shouldn’t feel like homework. Imagine an agent that asks you the right questions: “You mentioned this touches payment data. Are there PCI compliance requirements?” “Your spec mentions real-time updates. What latency is acceptable?” The questionnaire becomes a discovery tool, not a form.
stack/: The Gotchas That Change Everything
This is where community contributions become powerful.
Remember that Cloudflare Workers concurrency limit I mentioned? Six simultaneous outbound connections per invocation, with a total of 50 (free) or 1,000 (paid) subrequests per incoming request. An agent that doesn’t know this will confidently build a fan-out pattern using Promise.all() with twenty parallel HTTP calls. It’ll work locally. It’ll pass tests. And it’ll silently break in production because most of those connections will queue up, waiting for one of the six slots to open. Depending on timeout settings, the request might just… die.
That single piece of knowledge changes the entire architecture. Suddenly you need batching strategies, or Durable Objects for coordination, or Cloudflare Queues for async processing.
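The batching strategy is simple enough to sketch. This is a hedged example, not a Workers API: `runBatched` is a hypothetical helper name, and the default of six mirrors the connection cap described above.

```typescript
// Sketch: respect the Workers six-connection cap by running requests
// in limited batches instead of one big Promise.all() fan-out.
// `runBatched` is a hypothetical helper, not part of any Workers API.
async function runBatched<T>(
  tasks: Array<() => Promise<T>>,
  limit = 6, // Workers allow 6 simultaneous outbound connections
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < tasks.length; i += limit) {
    // Each slice stays within the connection budget; slices run serially.
    const batch = tasks.slice(i, i + limit).map((task) => task());
    results.push(...(await Promise.all(batch)));
  }
  return results;
}

// Usage: instead of Promise.all(urls.map((u) => fetch(u))) with twenty
// parallel calls, wrap each fetch in a thunk and let the helper pace them:
// const responses = await runBatched(urls.map((u) => () => fetch(u)));
```

For heavier workloads you’d reach for Durable Objects or Queues as mentioned above, but even this small change keeps a fan-out from silently queuing in production.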
Stack playbooks capture these gotchas. Not generic documentation. Specific, practical knowledge: “here’s what works, here’s what doesn’t, and here’s what will surprise you.” The kind of thing you learn by deploying to production and watching it fail.
Now imagine these are community-contributed. Someone who’s built five production systems on Cloudflare Workers writes their cloudflare-workers.md. Someone who’s run PlanetScale at scale writes planetscale.md. You pull these into your project and your agent starts its first session already knowing what took those developers months to learn.
patterns/: Tribal Knowledge, Externalized
Every codebase has unwritten rules. Error handling follows a specific pattern. Files are organized a certain way. Tests use particular helpers. API responses have a consistent shape.
Senior developers absorb this over months. Agents don’t have months. They have whatever context you give them right now.
patterns/ is where you write down the tribal knowledge. How do we handle errors here? How do we name things? What’s the testing strategy? When an agent reads these before writing code, the output actually looks like it belongs in your codebase instead of being a generic Stack Overflow answer dropped into your project.
context/: The Dual-Layer Memory
This is where it gets interesting. Context has two layers:
Human-provided context (project.md, features/): You write down the what and why. Project goals, feature requirements, business context. This is the stuff only you know.
Agent-maintained context (agent-learnings.md): The agent writes down what it discovers while working. Mistakes it made. Guidance it received from you. Decisions that changed mid-stream.
“Tried to use Node.js streams in Cloudflare Workers. Failed. Don’t do this again.” “User corrected: authentication tokens should be stored in httpOnly cookies, not localStorage.” “Originally planned REST API, pivoted to GraphQL after discussing query flexibility needs.”
This creates institutional memory that survives session boundaries. The next agent session (or the next developer using the codebase) benefits from everything the previous sessions learned. Context becomes evolutionary, not static.
plan.md: Evolutionary, Not Waterfall
“Build the whole thing” is how agents end up producing 2,000 lines of code that don’t compile.
plan.md breaks the work into phases. Each phase is small enough to build, validate, and deploy independently. Phase 1 might be the data model and basic CRUD. Phase 2 adds authentication. Phase 3 adds the real-time features.
This isn’t waterfall. The plan evolves as you learn. But each step is concrete and verifiable. You can look at the plan and point to exactly where you are, what’s done, and what’s next.
The key insight: each phase should be easy to validate. If you can’t tell whether a phase is “done” in under five minutes, the phase is too big.
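A plan.md under this scheme might look like the sketch below. The checkbox convention and the “Validate” lines are assumptions on my part, not part of any finalized format.

```markdown
## Phase 1: Data model + basic CRUD  [done]
- [x] Schema migration
- [x] CRUD endpoints with integration tests
Validate: all gates green, staging smoke test passes

## Phase 2: Authentication  [in progress]
- [x] Session handling (httpOnly cookies)
- [ ] Login/logout endpoints
Validate: auth flows pass end-to-end tests

## Phase 3: Real-time updates  [not started]
```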
log.md: The Resumability File
Here’s a scenario every agent user knows: you’re halfway through building a feature. Context window fills up. Or your laptop dies. Or you just need to stop for the day. When you start a new session, the agent has no idea what happened before.
log.md is the solution. The agent maintains a running log of what’s been completed, what’s in progress, and what’s next. When a new session starts, it reads the log and picks up where the last one left off.
Not “start over.” Continue.
This also means multiple developers (or multiple agents) can work on different phases of the same plan without stepping on each other. The log is the coordination point.
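Concretely, a log.md entry could be as simple as this; the exact format is illustrative:

```markdown
## Session 7 (2025-03-02)
- Done: Phase 2 login endpoint, all gates green
- In progress: logout endpoint (cookie clearing not yet tested)
- Next: wire session expiry into the middleware
- Notes for next session: staging DB credentials rotate on Fridays
```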
validation/gates.yaml: No More Phantom Implementations
Remember that function that returned null and called itself done? gates.yaml prevents that.
Hard validation gates that must pass before any task is considered complete:
- All tests pass
- Type checking clean
- Linting clean
- Documentation exists for public APIs
- No TODO or FIXME comments in submitted code
But here’s the critical part: the agent that builds the code shouldn’t be the same one that validates it. That’s the fox guarding the henhouse. Use a different model for validation. Or combine multiple models and aggregate their findings. The builder’s blind spots are exactly what the validator needs to catch.
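Here’s a sketch of what validation/gates.yaml could look like. The keys and commands are illustrative, chosen for a typical Node/TypeScript toolchain; adapt them to yours.

```yaml
# Illustrative sketch: key names and commands are assumptions, not a spec.
gates:
  - name: tests
    run: npm test
    required: true
  - name: types
    run: npx tsc --noEmit
    required: true
  - name: lint
    run: npx eslint .
    required: true
  - name: no-todos
    # Fails the gate if any TODO/FIXME survives into submitted code.
    run: "! grep -rn 'TODO\\|FIXME' src/"
    required: true
validator:
  # Validate with a model other than the one that built the code.
  model: different-from-builder
```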
The Phases: How It All Flows
This isn’t a waterfall process. It’s an evolutionary loop.
Spec → Architecture → Plan → [ Build ⇄ Validate ] → Deploy, with the Build → Validate loop repeating once per plan phase.
Spec Phase: An agent helps you fill out the questionnaire. It asks discovery questions you forgot to ask yourself. “What happens when the third-party API is down?” “You mentioned this is multi-tenant. How do you handle data isolation?” The output is spec.md.
Architecture Phase: Informed by the spec AND the stack limitations, the agent proposes architecture. Not generic patterns from a textbook. Architecture that accounts for “Cloudflare Workers can only hold 6 outbound connections, so we need Queues for the fan-out pattern.” Questions get asked, scaling scenarios get explored, and the result is documented.
Plan Phase: Break the architecture into evolutionary steps. Each step is independently buildable, testable, and deployable. No big-bang releases. No “it’ll all work when everything is done.”
Build + Validate Loop: For each phase of the plan, the agent builds code, tests, and documentation. Then a DIFFERENT model (or combination of models) validates the work against the gates. If validation fails, back to build. This loop continues until the phase passes all gates.
Deploy: Non-prod first. Validate at scale. Then production. Each deployment is one phase of the plan, not the entire system.
The Cross-Cutting Concerns
Some things aren’t phases. They happen throughout.
Model Specialization
Not every phase needs the same model. I’ve noticed patterns:
- Architecture and complex coding: models with strong reasoning perform best
- Documentation and spec writing: faster models work well and save cost
- Code review and validation: you want a different model than the one that wrote the code
- Testing: models that are thorough and don’t skip edge cases
The right model for the right job. This is an area where the tooling is still catching up, but the principle matters now.
Decision Logging
Every time something changes (architecture pivot, pattern adjustment, scope change) it gets captured. Not as an afterthought. As part of the process.
“Switched from REST to GraphQL. Reason: the frontend needs flexible queries across 12 entity types, and REST would require either over-fetching or 30+ endpoints.”
When you come back in two weeks and ask “why did we do it this way?”, the answer is there. When a new team member reads the codebase, they understand the why, not just the what.
The Observer
Throughout the entire process, you want a way to track what’s happening across phases. What’s done. What’s stuck. What changed. Think of it as the project manager that never sleeps. Not managing. Just watching and keeping notes.
This could be as simple as an agent that periodically reads log.md, agent-learnings.md, and the current state of the plan, then produces a status summary. Or it could be more sophisticated. The point is: someone (or something) should have the full picture, even when individual sessions only see their slice.
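The “simple” version can be sketched in a few lines: parse status markers out of the log and produce counts. The checkbox convention is an assumption here, not part of any spec, and `summarizeLog` is a hypothetical helper name.

```typescript
// Sketch of a minimal observer step: turn log text into a status summary.
// Assumes entries use "- [x]" / "- [ ]" markers (an assumed convention).
interface LogSummary {
  done: string[];
  pending: string[];
}

function summarizeLog(logText: string): LogSummary {
  const done: string[] = [];
  const pending: string[] = [];
  for (const line of logText.split("\n")) {
    const match = line.match(/^- \[(x| )\] (.*)$/);
    if (!match) continue;
    (match[1] === "x" ? done : pending).push(match[2]);
  }
  return { done, pending };
}

// A fuller observer would also read agent-learnings.md and the plan,
// then emit a digest like "3 items complete, 2 open, 1 blocked".
```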
The Community Play
Here’s where this gets really interesting.
The structure above is project-specific. But the stack/ modules? Those are universal. Every project using Cloudflare Workers needs to know the same limitations. Every project using PlanetScale needs the same connection patterns.
Imagine a world where starting a new project looks like this:
- You initialize .agentlint/
- You pull community-contributed stack modules for your tech choices
- An agent walks you through the spec questionnaire
- You start building with an agent that already knows everything the community has learned
“Here’s my .agentlint stack module for Next.js 15 + Cloudflare Workers + D1” becomes a thing people share on GitHub. Like shared ESLint configs, but for agent context.
A community repository of stack modules means nobody has to discover the same gotcha twice. The Cloudflare concurrency limit gets documented once, and every agent working on every project benefits.
Why This Matters Now
The tools are getting better fast. Claude Code, Cursor, Copilot Workspace, OpenCode. New agent harnesses launching every month. Models getting more capable every quarter.
But the experience is still inconsistent. Amazing results one session, baffling failures the next. And the variable isn’t the model. It’s the context we give it.
I wrote recently about the trust ladder for AI agents: how trust needs to be earned individually before it can scale to teams and enterprises. The trust ladder tells you WHERE to build trust. .agentlint tells you HOW. Consistent, high-quality agent sessions build trust. Phantom implementations and sidestepped bugs destroy it.
We’re past the “wow, it can write code” phase. We’re in the “okay, how do we make this reliable enough to depend on” phase. And reliability comes from specification, not from better prompting.
What’s Next
I’m actively building on this idea. An open-source tool that supports .agentlint from day one, with an init command that walks you through the spec questionnaire, community-contributed stack modules, and multi-phase build orchestration.
The standard will be open. The stack modules will be community-contributed. Because the agent failures I described at the top of this post? They’re not just my problem. They’re everyone’s.
Your AI coding agent isn’t broken. It just doesn’t know what you know. And right now, there’s no standard way to tell it.
Stop prompting. Start specifying.