My first response to the duration-related failures was procedural rather than architectural in nature. Once it became clear that longer-running work degraded reliability, I did not introduce new tooling or orchestration. I narrowed the problem by changing what I demanded from the system. Tests became explicit and non-negotiable. They had to pass without scaffolding, partial execution stopped counting as progress, and shortened scenarios were treated as failures rather than acceptable compromises.

The point was less about broader coverage and more about regaining control over how work completed. Requiring that all tests pass removed a familiar failure mode. The agent could no longer advance by producing code that merely appeared reasonable, and work resolved more cleanly into completion or failure without lingering ambiguity.

That clarity made another pattern easier to see. When forced to satisfy full test suites without shortcuts, the agent received continuous, concrete feedback about what was working and what was not. Code quality improved as a result, with the feedback loop staying intact even as sessions stretched longer and attention thinned.

At this point, the system worked because the feedback loop was closed. Passing tests fed directly into the next iteration, and reliability improved as long as that signal remained continuous and unambiguous. The loop between generation, validation, and revision became self-reinforcing, and within that loop the system behaved coherently and improved over time.


Subscribe for new essays

Get updates when new writing goes live.