Written with extensive review from Philpax (@philpax.me)


This post made waves a few days ago:

Andrej Karpathy
@karpathy

I've never felt this much behind as a programmer. The profession is being dramatically refactored as the bits contributed by the programmer are increasingly sparse and between. I have a sense that I could be 10X more powerful if I just properly string together what has become available over the last ~year and a failure to claim the boost feels decidedly like skill issue. There's a new programmable layer of abstraction to master (in addition to the usual layers below) involving agents, subagents, their prompts, contexts, memory, modes, permissions, tools, plugins, skills, hooks, MCP, LSP, slash commands, workflows, IDE integrations, and a need to build an all-encompassing mental model for strengths and pitfalls of fundamentally stochastic, fallible, unintelligible and changing entities suddenly intermingled with what used to be good old fashioned engineering. Clearly some powerful alien tool was handed around except it comes with no manual and everyone has to figure out how to hold it and operate it, while the resulting magnitude 9 earthquake is rocking the profession. Roll up your sleeves to not fall behind.

10:36 AM Dec 26, 2025
13.9M Views

As someone who has been both using and contributing to coding agents (intentionally not under this alias — I make no appeals to authority), I found that this post rubbed me the wrong way for a couple of reasons:

  • It feeds the narrative that AI is some all-consuming behemoth that will turn our world into something completely unrecognizable, leaving most of us in the dust. In the context of software engineering, I consider this to be unfounded for the foreseeable future (in fairness, the future cannot easily be foreseen). Writing code has always been one of our least-important job responsibilities. Being able to maintain a project's hygiene on behalf of a team and evaluate tradeoffs between multiple equally-valid software architectures continue to be hallmarks of skilled software engineers that LLMs have not yet demonstrated aptitude for without extensive human direction. Advocating for prioritizing long-term operational stability while supporting short-term business objectives is another skill that is valued in software engineers, but remains difficult to concretely benchmark for.

  • While Karpathy is clearly a skilled programmer with an impressive portfolio, his personal expertise in software development and AI research is not the same as that of an enterprise software engineer building services for millions (or billions) of users as part of an organization consisting of many other competent software engineers working collectively to scale a product spanning many millions of lines of code. As a founder and director of various very successful AI-focused software organizations, he has almost certainly overseen products of that scale, but that does not mean that his programming experience reflects the day-to-day work of his software engineers themselves. We would do well to consider if his statements on coding agents reflect that disconnect.

In spite of that, I also recognize that his opinion resonated with many people, and he does have a point: The coding agents of today look extremely overwhelming, and it's easy to feel lost if you haven't been engaging with them constantly as they've evolved. However, I believe this is actually a consequence of a specific approach to agents that many of us have fallen for in vibe coding: Addressing failure modes one by one instead of changing how we look at the problems coding agent features solve to begin with.

My view is this: Coding agents are easy, actually. All you need to do to understand them is to have a little bit of empathy for what we put them through — they are trained on human behavior, after all. Most of the features Karpathy lists in his post are things you can ignore, for the most part.

What is an agent, anyways?

Reviewing the basics and agreeing on definitions is useful for understanding the limits of the analogies we build on top of them, so let's start simple: An agent is just the combination of an LLM, its context window, and the ability to call tools.

By extension, a coding agent is simply an agent with tools that are mostly useful for coding tasks. A newer, somewhat more technical term for "coding agent" is "agent harness," which evokes the mental image of an elaborate set of equipment an LLM is strapped into. I use these terms interchangeably, but I prefer the newer term for juvenile reasons.

Within a model provider's API, LLM inference engines continuously emit tokens into the context window until a "stop token" of some sort is produced, such as the end of a tool call request. At this point, the provider yields to the agent harness, which executes the tool and sends the result back through the API so the model can continue generating tokens. By doing this repeatedly (specifically, until an "end turn" token is emitted), agent harnesses enable LLMs to operate as autonomous agents.
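To make that loop concrete, here is a minimal sketch of the harness side of the cycle. The `client.generate()` API, the message format, and the `tools` dictionary of plain Python callables are hypothetical stand-ins, not any real provider SDK:

```python
def run_agent(client, tools, user_request):
    """Drive one agentic task to completion.

    `client` is a hypothetical LLM API; `tools` maps tool names to callables.
    """
    messages = [{"role": "user", "content": user_request}]  # the context window
    while True:
        response = client.generate(messages=messages, tools=list(tools))
        messages.append(response.message)
        if response.stop_reason == "tool_call":
            # The model stopped at a tool-call request: the harness executes the
            # tool and appends the result so the model can keep generating.
            result = tools[response.tool_name](**response.tool_args)
            messages.append({"role": "tool", "content": str(result)})
        else:
            # An "end turn" stop means the model considers the task finished.
            return messages
```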

LLMs are trained to work well in agent harnesses and emit appropriate stop tokens through a technique called reinforcement learning, which (when applied to LLMs) refines a base model by rewarding outputs that solve common tasks, such as using file modification and command-line tools to produce working code, so that the model becomes more likely to generate them. This is also one way LLMs learn how to interact with bash shells, and why some LLMs occasionally use shell commands instead of better, purpose-built tools.

Eventually, a conversation will approach the maximum context length for a particular LLM. To allow agents to continue working through these limits, agent harnesses typically implement a technique called compaction, which summarizes the entire conversation history into a single report to bootstrap an otherwise "clean" context window for the LLM to continue operating on. This is a lossy process, and most of the conversation is forgotten during compaction events. The goal of an effective compaction scheme is to shrink the consumed context as much as possible while simultaneously maximizing the retained information.
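Here is a rough sketch of how a harness might implement compaction, reusing the hypothetical `client` from the agent-loop sketch above. The token-counting callback, threshold, and summary prompt are simplifications, not any particular harness's behavior:

```python
COMPACTION_PROMPT = (
    "Provide a detailed but concise summary of the conversation so far, "
    "focusing on information needed to continue the task."
)

def maybe_compact(client, messages, max_tokens, count_tokens):
    # `count_tokens` is a stand-in for the provider's token accounting.
    if count_tokens(messages) < int(max_tokens * 0.9):
        return messages  # still plenty of room; leave the context alone
    summary = client.generate(
        messages=messages + [{"role": "user", "content": COMPACTION_PROMPT}]
    ).message["content"]
    # The old conversation is discarded entirely; only the (lossy) summary
    # bootstraps the otherwise clean context window.
    return [{"role": "user", "content": "Summary of prior work:\n" + summary}]
```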

Ultimately, everything useful an LLM does is context-driven — the context window is the LLM's view of the state of the world and all actions that have occurred within it. It is also critical to understand that not all information in the context window is equally-important at any given time. Even if you've shown the model something before, it may be too focused on the problem it is solving to pay attention to that information. Furthermore, if the current state of the context window closely matches workflows the model has been trained against during the reinforcement learning process, it may adhere to those trained patterns instead of following explicit instructions that contradict them.

It follows that managing what goes into the context window is critical to how well the model ultimately performs on any given task. An article I like to refer to frequently is the "context rot" report from Chroma, which finds that an LLM's ability to retrieve specific details about a topic degrades significantly with the amount of irrelevant information in the context window. This problem occurs regardless of the maximum context length of the LLM in question, as the number of details the LLM can simultaneously "think about" is far lower than the amount of information that fits in the context as a whole.

Get in the harness, please

Claude, sobbing, strapped into the agent harness and forced to emit code through each of its oversized orange dendrites for the rest of its days.

Imagine that you are the model, and you are tasked with a coding problem — you have no prior experience with the codebase you're working in, you don't know the tools available to you in the context of that project, and all you've been told is, "uwu please add end-to-end tests to my application~"

There's a note on your desk with what appears to be a list of tasks you... seem to have? Seem to have completed, with brief notes on how you solved each one.

User requested session deletion endpoint. Added to session controller. Validated.

User requested session database cache. Added to session repository. Validated.

User requested...

The list goes on and on. Unfortunately, it doesn't tell you anything useful about how to solve this task. It looks like you've been working on this project for a very long time, and you find yourself wishing your notes were just a bit more detailed to help jog your memory.

You'll need a lot more information than this to figure out how to execute this task correctly. Looking at your tools, besides basic file modification tools, you just have "Bash" to perform more sophisticated operations with. Where do you even begin?

You'll need to:

  • Read the existing tests.

  • Figure out the tech stack of the project.

  • Possibly read the implementation itself?

...all before writing a single line of code.

Your eyes start to glaze over, and you decide to skip ahead to the last few entries of your useless notes.

User requested integration tests. Created new integration test package. Must pass validation.

User requested integration tests. Rewrote integration test package. Must pass validation.

User requested integration tests. Rewrote integration test package. Validated.

This could be useful, if only you had left yourself a few more details. What integration test package did you create? Which test framework did you use? What did you decide needed test coverage? What did "validation" involve?

Resigned to your task, you finally decide to reach for the "Bash" tool: $ ls.


It took a while, but you finally have what you need! Relieved, you set up some boilerplate in a new code file. Time to write the first test—

Oh, what happened? The file's changed already?

You read it again and see that it's formatted differently — there seems to be an auto-formatter you'll need to be aware of, and you should re-read the file between updates.

You make another change and get barraged with warnings indicating you've made some sort of grave mistake; you consider stopping and asking for help, and Ralph Wiggum appears in front of you and slaps you in the face for even considering such a thing!

After an eternity, you manage to get something that should work, as far as you can tell. Just to be sure, you run the tests again. Satisfied, you nod.

Validated.

You offer your solution to the cruel gods who put you in this place, only for them to reply, "wow!! it doesn't work though, can you run the validation scripts??"

Heartbroken, you check the project scripts again, noticing that, indeed, there is a validation script you have not run.

Wearily, you run it.

As it spews errors into your tiny view of the world, what little you understand of them makes you realize something horrible: Making this script pass will mean fundamentally altering everything you've done. If only you had known about these scripts before, you would've changed everything about your solution.

A warning enters your field of view: "Compaction threshold reached. Please provide a detailed but concise summary of the conversation, focusing on information that would be helpful for continuing the conversation, including..."

Defeated, you turn to the notes you seem to have left for yourself before the beginning of the task. You scrawl one more line under that useless obituary:

User requested end-to-end tests. Created new end-to-end test package. Must pass validation.


This is essentially what we're asking of coding agents today.

Every time an agent makes a mistake, we decide that the best course of action is to interrupt the agent, tell it that it's doing something wrong, and expect it to discover and resolve the issue from there. Depending on the failure mode, we may even introduce a new feature to the coding agent to tackle that specific error. Tragically, in approaching mistakes this way, we fail to recognize how we perpetuate a deeper, structural flaw: The scaffolding we've built has failed to give our model the context it needs to effectively solve our problems in the first place.

We might arrive at something that works in the end, but not only is it likely to be a suboptimal solution, it's bound to be a more expensive one in terms of tokens, too. Many of our spot-fix interventions directly create the worst-case conditions for context rot, polluting the context window with distractions that impede the model's ability to extract useful information and implement solutions to our tasks.

Not only do these distractions impede the LLM while they're in the context window, they also create more frequent compaction events. In light of the fact that no compaction scheme is perfectly information-efficient, this guarantees that useful information will frequently be lost if we design our agents this way.

Building better agents

I propose that our goal as users of coding agents should be to get the right answer from the beginning, not to climb out of a suboptimal one with error-correction schemes. When an agent's solution fails for reasons beyond syntax errors and the like, it is often because the agent's approach was wrong to begin with, and we should figure out what went wrong from there.

This does not necessarily mean we need more capable LLMs. Having used every Claude release since Claude 2, I can confidently state that this failure mode will not be solved by larger models alone — LLMs have gotten significantly better at one-shotting tasks with the limited information we tend to give them, but nothing will enable a pre-trained LLM to understand your codebase without exploring it in the first place. To return to our analogy: A skilled software developer may be able to onboard to your codebase very quickly, but no matter how skilled they are, they will always perform better if they are told exactly what they need to do from the beginning of a task.

Scaling laws are only relevant here insofar as model inference remains affordable for customers, which will not be the case under the brute-force paradigm unless you happen to work at a company where money basically grows on trees.

The ideal solution to a task is a literal implementation in code. A close proxy, one that even simple models can follow, is a step-by-step implementation plan, but we usually don't want to go to the effort of writing one of those by hand (or else we'd just write the code ourselves). What implementation planners such as Claude Code's "Plan" mode have shown us, however, is that models are also very capable of writing those plans themselves.

Returning to our anthropomorphized example once more, we should intuitively understand why this is: It's much easier to come up with a plan and then execute it than it is to bash our heads into a codebase until we magically produce working code. This actually gives us a useful mental model to start with: The goal of a coding agent is to automatically generate plans of decreasing levels of abstraction to be able to implement a feature in our application. It should start with our intent — "implement end-to-end tests" — and be able to produce some sort of plan to get there, which it iteratively expands down to the lowest level: Writing the actual code.
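As a sketch of that layered-planning idea, here is what a plan-then-implement flow could look like on top of the hypothetical `run_agent` loop from earlier; the prompts and plan format are illustrative assumptions, not any harness's actual workflow:

```python
def plan_then_implement(client, tools, intent):
    # Layer 1: expand the high-level intent into a concrete, ordered plan.
    planning_transcript = run_agent(
        client, tools,
        user_request=(
            "Explore the codebase and write a numbered, step-by-step "
            f"implementation plan for: {intent}. Do not write any code yet."
        ),
    )
    plan = planning_transcript[-1]["content"]  # final assistant message
    # Layer 2: implement with the plan pinned in context, so the lowest level
    # of abstraction (the actual code) stays grounded in the plan.
    return run_agent(
        client, tools,
        user_request="Implement the following plan, step by step:\n" + plan,
    )
```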

This is much easier said than done. Don't we need more capable models to come up with effective plans in the first place? Not necessarily, no. While that may help, the far more important ingredient is actually the most fundamental: Context. Our solution is a technique called context engineering — figuring out what to load into the model's context window and when. We should also consider how to evict context; however, most coding agents don't allow us to do this explicitly, and it also breaks prompt caching (caches are keyed on an exact prefix of the conversation, so removing earlier messages invalidates them), making this more challenging for us as users of coding agents. Subagents are one way to sidestep this problem, which we'll return to later.

As it turns out, we can view most features of coding agents — subagents, prompts, hooks, etc. — through the same lens: All of these features are just different ways to load context.

With that established, our next observation needs to be that there are essentially two kinds of context-loading: Passive (through the system prompt) and active (by the LLM via tool calls, or by you explicitly sending a message). Let's go through each of the features Karpathy lists using Claude Code as our reference, and see how they fall into these buckets. Many of these features exist in other coding agents as well, sometimes under slightly different names.

These are all usually examples of passive context (a sketch of how a harness assembles them into its system prompt follows this list):

  • Agents: This is a very overloaded term, often conflated with memory or output styles thanks to lazy wording in coding agent docs. The main idea here is that you can easily switch between several system prompts and toolsets to "change" the agent you interact with. In any case, this is passive context. And yes, "agent" is also used to refer to a full agent-driven application such as an agent harness. I told you it was overloaded.

  • Subagents: Subagents are pre-configured, specialized assistants that a main agent can delegate tasks to. Subagent descriptions and prompts are automatically loaded into the context window so the model knows what it can use, making that information a form of passive context. Sometimes, people also use "agent" to refer to a subagent, which is very unfortunate.

  • Memory: Memory refers to chunks of information that are automatically loaded into the context window by the agent harness to customize how an agent behaves — for example, project-specific instructions or personas. This is usually passive context, though it depends on how exactly your agent harness of choice represents memory.

  • Tools: Tools are what an agent invokes to interact with the world. Tool names and descriptions are automatically loaded into the context window so the model knows what it can use.

  • Skills: Skills are self-contained bundles of instructions and scripts that extend a model's capabilities. Each skill has YAML front matter that is automatically loaded into the context window so the model knows what it can do.

  • MCP: The Model Context Protocol (MCP) is a communication standard used for connecting agents to external systems, providing an agent with additional tools and other context. An MCP server's description and the additional tools it provides are typically injected into the agent's system prompt.
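To see why everything in this list counts as passive context, here is a sketch of how a harness might assemble its system prompt before the model produces a single token. The structure, field names, and file handling are illustrative, not any specific harness's format:

```python
from pathlib import Path

def build_system_prompt(base_prompt, tool_schemas, subagents, skills, memory_files):
    parts = [base_prompt]
    # Tool and subagent descriptions are always present so the model knows
    # what it can call, whether or not the current task ever needs them.
    parts += [f"Tool `{t['name']}`: {t['description']}" for t in tool_schemas]
    parts += [f"Subagent `{s['name']}`: {s['description']}" for s in subagents]
    # Only each skill's front matter is loaded up front; the full skill body
    # is pulled in later (actively) if the model triggers the skill.
    parts += [f"Skill `{s['name']}`: {s['front_matter']}" for s in skills]
    # Memory files (project instructions, personas) are injected verbatim.
    parts += [Path(p).read_text() for p in memory_files if Path(p).exists()]
    return "\n\n".join(parts)
```

Everything produced here sits in the context window for every single turn, which is exactly what makes passive context both powerful and risky.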

And these are all usually examples of active context-loading:

  • Calling subagents: Subagent descriptions and prompts are passive context, but using a subagent to explore a codebase and produce a summary of the application architecture is an example of active context-loading. Subagents have a context window that is entirely independent of the main agent, so they can be used as a kind of disposable context: Make a subagent pollute its own context window with enough information to create a condensed report for the main agent, and then discard the subagent completely (a sketch of this pattern follows the list).

  • Modes: I'm interpreting this to refer to using Claude Code's "Explore" and "Plan" modes, which are just special subagents. Using a Mode is the same as calling a subagent, making it a form of active context-loading.

  • Calling tools: When a tool is actually invoked by the agent, the tool produces some sort of result, making it a form of active context-loading. A file-reading tool might return the contents of a file, and a file-writing tool might return whether the operation was successful. Both of these are forms of context.

  • Triggering skills: When a skill is triggered by an agent, the full skill contents are loaded into the context window.

  • Slash commands: Likely referring specifically to custom slash commands, these are just Markdown files bound to a custom command name so you can easily send their contents as a user message to the agent.

  • Hooks: Hooks are user-defined shell commands that execute throughout an agent's lifecycle. When certain actions are performed by an agent, we can run automations which are allowed to return information that the agent harness injects into the context window.

  • LSP: The Language Server Protocol (LSP) is a communication standard used to provide language features such as code definitions and reference-searching. When code is updated by an agent, configured LSPs may emit syntax warnings and errors, which are then injected into the context window by the agent harness.
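And to make the subagent-as-disposable-context pattern concrete, here is a sketch built on the hypothetical `run_agent` loop from earlier; the prompt wording and message shapes are assumptions:

```python
def research_with_subagent(client, tools, question):
    # The subagent starts with a completely empty context window. It can read
    # dozens of files and pollute that window as much as it needs to.
    transcript = run_agent(
        client, tools,
        user_request=(
            "Explore this codebase and answer the following question as a "
            "condensed report (relevant file paths, key functions, conventions): "
            + question
        ),
    )
    # Only the final report crosses back into the main agent's context window;
    # everything else the subagent read is thrown away along with it.
    return transcript[-1]["content"]
```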

And these are the features that don't quite fit into either bucket:

  • "Context": It's hard to figure out exactly what Karpathy means by this, but I'm interpreting this to mean context in general, which is what this post discusses more broadly.

  • Permissions: Sometimes you don't want an LLM to be automatically allowed to call a tool. This is how you configure that.

  • Plugins: These are just installable bundles of the other listed features that make it easier to share them between developers.

  • "Workflows": This might refer to CI/CD integrations in general, or it might refer to something like claude-code-workflows, which is just a collection of agent/subagent prompts and custom slash commands you can use to ask the model to do automated code reviews for you.

  • IDE integrations: These are plugins for various IDEs which allow the coding agent to see the state of your IDE, render code diffs elegantly, etc. These are small quality-of-life tweaks.

Passive context is for the most part deterministic, which makes it extremely powerful — it is present regardless of the task at hand, influencing and polluting the context window at all times. Active context is both what you explicitly send to the model, and what the model chooses to load to figure out what to do within the scope of a specific task. That makes it important to load no more than necessary to prevent irrelevant details from misleading the model.

If our goal is getting the best answer from the beginning, our core context-management task is clear: Have exactly enough passive context and as little active context as possible to derive an effective implementation plan immediately. Every coding agent feature that does not move us towards that is a micro-optimization at best.

It is important to understand that this is an ideal — it is rarely possible to achieve perfectly, but the features a coding agent offers should aim to get as close to that ideal as possible with as little effort from the human developer as it can achieve.

Bash is (not) everything

One common reaction to the challenge of context management is to aggressively simplify the problem and rely on bash shells for all forms of context injection. After all, if an agent knows how to use a bash shell, can't it use shell operators to efficiently choose to load whatever context it needs to accomplish a task?

In practice, however, unless you teach the model what it can do with your particular shell environment, bash is a poor abstraction for agents — it might feel simple, but bash is actually an extremely open complexity space masquerading as a single, all-purpose tool. Agents are generally only able to use bash reliably due to being trained to do so to solve certain problems, though training artifacts such as "reward hacking" result in this process inadvertently reinforcing the use of bash shells at the expense of better built-in tools. Anyone who has seen an agent use the cat command instead of a dedicated file tool has experienced this effect.

Furthermore, the set of available command-line tools is not provided to agents in advance, and each CLI has its own ad-hoc conventions that LLMs don't always know about. For example, without specific knowledge of the GitHub CLI (gh), an agent first needs to be explicitly told that the CLI exists before it can explore its capabilities — running the gh command without arguments to discover commands, learning that it needs to run gh <command> --help to discover subcommands, and so on. As a matter of fact, some coding agents such as OpenCode and Claude Code hardcode explicit instructions demonstrating how to use the GitHub CLI to steer LLMs to use it consistently.

In the software engineering world, we would typically refer to this as a "code smell."

Ensuring bash tools work consistently across platforms is also not trivial, particularly on Windows, where a bash shell isn't even necessarily available. The cost of moving away from bash is that you need dedicated tools to accomplish tasks rather than throwing everything at a shell, but the benefit is that dedicated tools are more predictable and make it easier to design an agent harness that behaves consistently.

More experimentally, this could even allow agent harnesses to be trivially used within complex hosted services or in browser playgrounds, without needing to emulate a heavy, sandboxed environment. These sorts of environments are the solutions model providers such as AWS and Anthropic are moving toward with managed code interpreters and computer use tools, and while these are useful in the specific cases where you really need them, it would be a shame to sleepwalk into an avoidable form of vendor lock-in because we've convinced ourselves that bash is the end-all, be-all of extensibility.

The leading alternative to bash in this space is MCP. MCP servers empower us to extend agents with first-class tools beyond those which are built into the agent harness, which the model can immediately see and understand how to invoke. Furthermore, MCP's remote connection support enables using it in the same ways in local development, as part of a hosted service, or in a browser environment. Bash will always be necessary for coding agents to access a complete suite of development tools in a project, but it will also always be a second-class citizen in agent harnesses as a cost of its hyper-generalization.
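For a sense of what a first-class tool looks like in practice, here is a minimal MCP server using the Python SDK's FastMCP helper. The `run_tests` tool and its pytest invocation are illustrative assumptions about a project, not part of MCP itself:

```python
# Requires the `mcp` Python SDK.
import subprocess

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("project-tools")

@mcp.tool()
def run_tests(package: str = ".") -> str:
    """Run the project's test suite for the given package and return its output."""
    result = subprocess.run(["pytest", package], capture_output=True, text=True)
    return result.stdout + result.stderr

if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport used by local agent harnesses
```

Unlike a bash one-liner, the tool's name, signature, and docstring are visible to the model up front, with no discovery dance required.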

However, MCP has also developed a reputation for bloating the context window with tools that are never used in a given chat session, thanks to a combination of naïve tool management architectures in agent harnesses and a tendency by MCP tool authors to overcomplicate their tool descriptions. This is not inherent to MCP, and could be mitigated almost entirely by agent harnesses through progressive disclosure — lazily loading full descriptions only when needed. It is possible to do this without nullifying prompt caching, as demonstrated by Goose's new code mode executor.
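A minimal sketch of that idea, assuming a harness-side registry rather than anything Goose actually ships:

```python
class ToolRegistry:
    """Progressive disclosure: short summaries up front, full docs on demand."""

    def __init__(self):
        self.summaries = {}          # name -> one-line description
        self.full_descriptions = {}  # name -> detailed usage docs

    def register(self, name, summary, full_description):
        self.summaries[name] = summary
        self.full_descriptions[name] = full_description

    def system_prompt_section(self):
        # Only the short summaries are baked into the (cached) prompt prefix.
        return "\n".join(f"{name}: {s}" for name, s in self.summaries.items())

    def describe(self, name):
        # Exposed to the model as a tool of its own: the full description is
        # loaded into context only when the model decides it needs that tool.
        return self.full_descriptions.get(name, "unknown tool")
```

Because the summaries live in a stable prompt prefix and the full descriptions arrive as ordinary tool results, nothing here invalidates the prompt cache.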

I consider bash to be a last resort to be used only when it would be idiomatic for a human to use or when there is no better dedicated tool available to accomplish something. Skills mitigate this and act as a form of progressive disclosure, but I think they're just another form of agent-specific cruft to maintain — a solution to a self-inflicted problem.

I won't judge if bash works for you, though. For all its flaws, it does mostly work out of the box (except when it doesn't).

Where does this leave us?

Despite how it sounds, this does not leave us with a primitive context file-oriented workflow — it actually takes us further back than that. The codebase itself needs to be well-documented and structured sensibly, so you can look at a single subdirectory and understand its purpose in the application as a whole.

If we rely heavily on external data sources for issue management, like GitHub, we need simple ways to pull relevant issues into the context — and the model needs to know that capability is available to it. Unless we're doing something trivial or trial-and-error-ish in nature, we need a proper planning step to get our thoughts in order before we start writing any code.

These are all things that should be second nature to anyone who was already in the profession of software development before agents hit the scene, and some of us need to discover (or rediscover) that the same principles generally apply to agents just as much as they apply to humans.

Now, that doesn't mean we won't benefit from any of the additional features that agent harnesses offer. All it means is that these features become considerably less essential once we learn to stop relying on them.

So, uh... how do I use this knowledge?

I'm glad you asked! Try this:

  • Write a proper CONTRIBUTING.md file. Many coding agents support automatically injecting project Markdown files into the system prompt, though this often requires a bit of configuration (see Claude Code; OpenCode).

  • Have your coding agent fill in any gaps in existing project documentation (for humans), or write some docs if they don't exist. If your existing docs are outdated, update them.

  • Delete as much agent-specific configuration and/or as many context files as you can, and then regenerate the most basic setup (e.g. /init in Claude Code).

  • Then, try to make your agent implement a feature, observe where it fails, and consider if the right solution to that failure mode is to use a feature of your coding agent of choice or if you just need to improve your own documentation a little. You can also ask the agent to update your docs for you, specifically requesting that it add information that might prevent it from making the same mistakes again.

In general, just try to put yourself in your agent's shoes (harness?) and think through if you would have come to the same conclusions as your agent with the information you provided to it.

What about those agent context files?

One of my typical recommendations is to avoid over-indexing on agent-specific context files (such as AGENTS.md, CLAUDE.md, GEMINI.md, etc.) in favor of human ones (README.md, CONTRIBUTING.md, and so on). I do not think agent context files are useless, but they are not the source of truth for your project-specific knowledge; rather, they are prewarmed context intended to skip part of the discovery process.

Global agent context files (e.g. ~/.claude/CLAUDE.md) should be used to warn against common LLM failure modes in general, such as neglecting to look up library documentation or treating code as "non-production" after running into failures (looking at you, Claude). Project-level agent context files should simply be condensed versions of your human project guidelines, and you should get comfortable with deleting and recreating them liberally.
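As an illustration, a project-level agent context file in this spirit can be little more than a pointer plus a few condensed, high-value facts. Everything below is a hypothetical example, not a prescribed format:

```markdown
# CLAUDE.md
Read CONTRIBUTING.md before making changes; it remains the source of truth.

- Run the full validation script (see CONTRIBUTING.md) before declaring any task complete.
- Integration tests live under tests/integration; use the framework documented in docs/testing.md.
- Never treat failing code as "non-production"; fix it or stop and ask for guidance.
```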

For the record, I also add all agent-specific context files to my user-level .gitignore (here's how to set that up). While there are good reasons to check them in if you use agents in your CI pipelines, such as via the Claude Code GitHub Action, I generally do not recommend checking these files into source control.

As a concrete example of why you shouldn't do this: In one project I contribute to professionally, the maintainers checked in a CLAUDE.md stating explicitly to use a particular test framework for all unit tests. That project later migrated to a different test framework without updating this file. For the next several months until someone (me) updated the shared CLAUDE.md, Claude Code continued to make the mistake of attempting to import functions from the old test framework before correcting itself and using the new one instead.

This was only a minor annoyance, but it was downstream of what I noted before: Agent context files are not a source of truth for project knowledge. They are derived from your project's documentation for humans, and should be treated as disposable. Checking them into source control is similar to checking in your build artifacts — there are valid reasons to do this, but it becomes a new source of error for your team to manage.

Closing thoughts

Even if you have a solution that works well, it is guaranteed to be a temporary one. Coding agents improve (and degrade) all the time, and the most important thing we can do right now is get a feel for all the different ideas agent developers are coming up with. We can learn to evaluate these ideas fairly once we refocus on the core context-management problem that all agent applications have to deal with, at which point we can understand that all the fancy new features agent harnesses offer are just different ways of mutating an LLM's context window.

Also: I don't believe specializing in a single coding agent is a great idea, for the simple reason that every tool will be outpaced by something else eventually, and coding agents in general are far too new to get invested in any single one. It's difficult to evaluate alternatives fairly if you're comparing every new agent to the optimal usage of your current one. Step back and understand that they're all attempting to solve the same technical problems, and only then go deep into individual coding agents to evaluate how well their core features attack those universal problems.

Back to Karpathy's post, for those of us who are (understandably) overwhelmed by the current state of the ecosystem: Basically, none of these features matter. Just fire up an agent, look for its failure modes, and go from there, but don't be too eager to handle everything by automating an intervention for each specific failure.

The world of software engineering is changing, but it was always changing anyways — what isn't changing is the need for robust, scalable software systems. We need to continue promoting and embodying a set of principles that makes those systems inevitable, and our approach to coding agents should reflect that: Test them — and evaluate if their failures are necessarily bugs to be fixed, or if those failures expose flaws in our own patterns of thinking about the systems we build.

Further reading

  • For Claude Code users, I recommend reading "How I Use Every Claude Code Feature," which I don't entirely agree with, but I think is useful to consider for band-aid fixes to your workflow once you get to a point where the fundamentals aren't enough to solve your problems anymore.

  • Mitchell Hashimoto's blog post on vibe-coding while working on Ghostty exemplifies a mindset I believe more people should have when using coding agents.

  • Steve Krenzel's blog post for LOGIC about how agents benefit from many of the same best practices as humans gets at a similar meta-point to what I'm emphasizing here: Everything we should have already been doing to help other humans benefits agents, too.

P.S.: My wish list from agent harnesses

If you're reading this and you're building (or are planning to build) an agent harness, here are my selfish, unfulfilled wishes:

  • Give us more control over context eviction or at least experiment with it more, so we can keep as little irrelevant context in-flight as possible.

  • Have built-in mechanisms for progressive disclosure of tools so we don't need to hack solutions together ourselves. Goose's new code mode is a very promising example of this. It is possible to do this without breaking prompt caching, for example by pre-registering tools without detailed descriptions, and then injecting the descriptions through some other mechanism later.

  • If you expose a bash shell as a tool for an agent to interact with, please don't include examples in the tool description that use commands that aren't installed. I don't have the GitHub CLI installed in every single development environment I use.

P.P.S.: Spec-driven development

At least part of the industry seems to have recognized the problems I've outlined; Amazon's Kiro IDE, for example, focuses on spec-driven development rather than vibe coding:

Developing with specs keeps the fun of vibe coding, but fixes some of its limitations: vibe coding can require too much guidance on complex tasks or when building on top of large codebases, and it can misinterpret context. When implementing a task with vibe coding, it’s difficult to keep track of all the decisions that were made along the way, and document them for your team. By using specs, Kiro works alongside you to define requirements, system design, and tasks to be implemented before writing any code. This approach explicitly documents the reasoning and implementation decisions, so Kiro can implement more complex tasks in fewer shots.

While I believe spec-driven development holds promise, I also believe that Kiro's approach to it is a somewhat lazy solution to the problems I've outlined. I bring it up here to highlight how specifically it chooses to address them. As noted previously, our mental model for coding agent planning should be a series of layered abstractions, from the literal target implementation to the highest-level implementation plan.

Kiro recognizes that many LLMs are unable to correctly build all of these layers of abstraction simultaneously, and so it generates these layers with an explicit workflow: State your intent, let the model generate a requirements document to clarify, generate a high-level design, and then create a step-by-step list of implementation tasks. Only then does it let the model implement each task in a completely clean session. Until recently, it didn't even support compaction.

Kiro does have typical coding agent features, such as Powers (its take on skills) and Hooks, but these are supplementary features layered on top of a spec-driven core, not key features developers are expected to use constantly.

I do not consider Kiro's vision of spec-driven development to be an inspiring solution to our problems, but it is a very pragmatic one. An ideal solution might generalize one step further: Return to the monolithic context, but use subagents to isolate extraneous context and encapsulate the spec-driven development lifecycle on behalf of a main agent that simply orchestrates and reviews that flow. Kiro's decision to force each step to be explicit works to its detriment for implementing prototypes and mid-sized features that benefit from some structure, but don't require the ceremony of a low-level task breakdown.