I don’t care about AI.
Outcomes, yes. Value flowing through a system instead of pooling behind bottlenecks, yes. Problems getting solved, absolutely. If agents were the wrong tool for any of that I’d use something else and not think twice about it.
They happen to be extraordinarily good at it right now. So here we are.
The Root Matters
I don’t think of myself as a software engineer. Or not purely. I came up through process ownership: value stream analysis, standard work, change management, obsessing over where work stalls and why. Software became the way I expressed that. The code was never the point; the point was the system working better.
Every engineer is a tree in an orchard. Some grew straight from software soil. Some grew from science, operations, manufacturing, teaching. The roots are different; the angle toward the light is different. The fruit is different. None of that is wrong. But it means the same tool in different hands produces different results, not because of skill, but because of what question you’re asking when you pick it up.
When I pick up an agent I’m asking: where is the waste in this value stream and how do I eliminate it? That’s a process owner’s question. Process owners have always been measured on outcomes, not methods; the method is just whatever actually works. That instinct leads somewhere different than “how do I write this function faster.”
The domains where I’ve asked it: manufacturing operations platforms, data pipelines feeding real-time simulators, game AI with genetic algorithm tournaments and minimax round-robins, benchmarking suites across the -ler stack, frontend and backend development, data analysis. Not variations on one problem; genuinely different domains, different constraints, different failure modes. The same question runs through all of them.
Evaluate the Outcome, Not the Tool
This is my core. It’s unglamorous and doesn’t produce hot takes about which model is best or whether AI is overhyped.
The only question that matters: did value come out the other end?
Smooth experience is relevant; making the line flow smoothly is how you eliminate waste, which is the whole point. Best practices are relevant; Toyota and Danaher built EHS and safety principles into their systems because that’s what makes sustained high performance possible, not because it’s nice in theory. We aren’t cowboys. We learn from others. We protect the forest.
What’s irrelevant: whether the agent impressed me, whether the method felt sophisticated, whether this looked like cutting-edge practice. Did the work move. Did the system improve. Did the problem get solved.
This framing makes a lot of discourse fall away. It also makes some decisions obvious that feel controversial through other frames: how much autonomy to grant an agent, how many confirmation prompts to accept, how tightly to supervise a run.
If the outcome is good and the system is sound, the method is fine.
The Bottleneck Moved
My bottleneck used to be the time it took to write code; analysis done, problem framed, solution understood, and then hours of typing to make it real. That gap between knowing and having is gone now. The code writes itself.
Which means the scarce resource is now the quality of my thinking. The analysis. The problem framing. The “is this even the right question” work. The judgment calls that require experience and context that no model has.
Here’s the personal angle: I’m not just optimizing some external system. I am the throughput unit. My attention, my judgment, my decisions are what flows. Agents are how I increase that throughput; not by automating away my job, but by eliminating the low-leverage parts so I can do more of the high-leverage parts. More problem framing. More analysis. More of the work that actually requires the years of context I’ve built up. Less typing.
Rational response: invest everything into the thing that’s now scarce. Do more of the human work, do it better, do it longer. Let the agents handle what they’re genuinely excellent at. The bottleneck moved; chase it.
Experimentation Compounds
I treat my own usage of these tools as a system to optimize. Personal account, work account. I edit my global CLAUDE.md, my roadmap patterns, my agent delegation rules constantly, between sessions, between repos. Run something one way, watch what breaks, change one variable, run it again.
The instinct is to find what works and stay there. Understandable; especially when what you have is already good. But this is a moving target and the compounding is real; small improvements to how you work add up the same way small improvements to a process add up. You don’t feel the delta in a week. You feel it in six months.
If you believe you have good judgment and real domain knowledge (and if you’ve read this far, you probably do), the opportunity is sitting there. Smarter people than me have gotten less from these tools by not pushing. That’s not a character flaw; it’s just an experiment not run. Run the experiment.
Autonomy Is a Time Decision
I run agents with full autonomy for long stretches. No confirmation prompts. I monitor; I don’t babysit.
Every prompt waiting for human approval is thinking time eaten by watching instead of thinking. Coordinating across machines, parallel agent sessions, managing a real-time system while building the next iteration: every stop is a context-switch, and context-switches aren’t free.
So I design the guard rails into the system: roadmaps, CLAUDE.md, hard architectural constraints, fine-grained permission rules in settings.json.¹ Then let it run. The safety comes from the structure, not from vigilance.²
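For the curious, those permission rules look roughly like this. This is an illustrative sketch of the Claude Code settings.json format, not a copy of my actual config; the specific rule strings are made up:

```json
{
  "permissions": {
    "allow": [
      "Bash(npm run test:*)",
      "WebFetch(domain:docs.anthropic.com)"
    ],
    "ask": [
      "Bash(git push:*)"
    ],
    "deny": [
      "Read(./.env)"
    ]
  }
}
```

The point is the granularity: trusted actions run without stopping, genuinely risky ones still prompt, and a few are simply impossible. That's the structure doing the vigilance.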
Last weekend: sqler benchmarks running all weekend unattended, genetic algorithm tournaments and minimax round-robins on the laptop, five to ten agent sessions active across machines. I was planning other things. Sprinting on other things. Sleeping. Living my life. The agents were working while I wasn’t; that’s not a flex, that’s the whole point of the design. The value kept flowing while the human was somewhere else being human.
Design Accordingly
People say “LLMs are stochastic black boxes” like that settles the conversation. People are stochastic black boxes too; tired, distracted, having bad days, making confident mistakes with calm faces. We built factories around them anyway. I’ve stood on production floors and watched exactly the kinds of mistakes people are afraid agents will make: confidently wrong, hard to catch in the moment, expensive if the process doesn’t account for them. The solution was never to watch every person every second; it was poka-yoke, standard work, process design that made the right action easy and the wrong action obvious or impossible. Same problem. Different workers.
Not all agents get the same leash. This level of autonomy is only possible with Opus 4.6 right now; not older models, not other models. I got burned by an older model that deleted a database. Confidently, without hesitation, then apologized profusely. Recovery was trivial because the guard rails held: wrong database, right contingencies, exactly what the design was supposed to ensure. The lesson wasn’t “never grant autonomy.” It was “this model at this capability level warrants this much rope.” Codex lied to me about committing work last week; I barely trust Codex. Opus 4.6 has earned a level of trust through hundreds of sessions that other models simply haven’t. That’s empirical trust built through reps, not faith.
Guard rails and contingencies aren’t a concept I invented for agents. I managed a large production PostgreSQL database for a real webapp. I’ve built FastAPI auth systems. I’ve worked in environments where a bad query costs real money and a bad process design on a factory floor hurts people. The principle doesn’t change depending on whether the variable element is a tired worker, a junior dev on a Friday afternoon, or an LLM with a long context window and a lot of confidence. Build the system so that when something goes wrong, and something always does, the damage is contained and recovery is trivial. Then let it run.
You Cannot Abdicate
Autonomy doesn’t mean absence; it means structured presence at the right moments.
Every long session produces a report-out. I read it. If I want more depth, I read the tests; they tell the story better than the code does, and my testing requirements are strict enough that they’re worth reading (most of the time that’s as deep as I need to go; a well-written test suite is a narrative). If I need to go deeper still, I read the code. That’s the order: report, then tests, then code.
For frontend work the agent always produces a manual testing checklist. I do it myself. I report failures back. Some things require human eyes and human hands; knowing which ones is part of the job.
When something is high-stakes or sensitive (genuinely unfamiliar territory, production-adjacent, architecturally risky), I watch it work. I read its thinking in real time. I steer it, stop it, correct it. Not because I don’t trust it; because I know what I’m looking at and I can see drift before it gets expensive to fix.
That’s the most important loop: catch something off, stop the run, correct it, then immediately update CLAUDE.md or a rule file so it doesn’t happen again. The session produces output; the correction produces doctrine. The doctrine improves the next session. That’s the actual compounding mechanism: not just doing more, but encoding what you learn so you don’t learn it twice.
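Concretely, a doctrine update is just a few lines of markdown. This is a hypothetical CLAUDE.md entry, invented for illustration rather than lifted from my real file, written the way I'd write it the day a run drifted:

```markdown
## Hard rules (learned the expensive way)

- Never modify a migration file that has already shipped; create a new one.
- Integration tests run against the test database only. If the connection
  string is not the test one, stop and ask before running anything.
```

Thirty seconds of writing, and the mistake is retired permanently instead of re-caught every session.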
You cannot abdicate your role as leader and teacher. The agent is capable and fast; you still set the direction, hold the standard, and update the curriculum when it falls short. The moment you stop doing that, quality drifts. I’ve seen it happen in a single session when I got lazy. The report-out looked fine; the tests told a different story.
Read the report. Read the tests. Know when to watch. Never stop updating the rules.
There’s an advantage to having written software before any of this existed. Not nostalgia; calibration data. You know what a bad test looks like because you’ve shipped one and paid for it. You know what architectural drift feels like because you’ve been three months into a codebase that quietly lost its shape. You’ve been burned by the mistakes the model will confidently make, and that scar tissue is what lets you catch them in the report-out before they’re in production. The model has never been burned. You have. That asymmetry is real and it matters.
Subject to Change
Last December I held different beliefs than these. Some of them I’m embarrassed by now; not the wrong-headed ones, but the timid ones. The autonomy I wasn’t granting. The experiments I wasn’t running.
Models change. Tooling changes. More importantly: I change. More reps, more data, more experiments, more failures. The beliefs update.
March 2026. Ask me again in December.
Footnotes
¹ Claude Code has a proper permission system: allow/ask/deny rules per tool, bypassPermissions mode in settings.json, domain restrictions for web fetching. It's not binary "approve everything" vs. "skip everything." You can configure it precisely so the things you trust run automatically and the things that are genuinely risky still prompt.

² Yes, I know "every thirty seconds" is hyperbole.