LLM-First: Software for Agents as Operators

I’ve been building infrastructure for agent-driven workflows. Not just having them write code faster. Having them actually operate in my systems.

Five tools so far, all following the same pattern and asking the same question: what if my software were designed for agents to use, not just for agents to write?


How I Got Here

I was designing AI for a complex game. The game AI’s decision-making process generated verbose logs: thousands of lines of reasoning traces, state transitions, and evaluation scores. I needed to debug why the AI was making bad decisions in specific scenarios.

Before logler, the workflow was painful. I’d paste log chunks into Claude’s context window and ask it to investigate. The verbose output burned through tokens fast; Claude would lose the thread mid-investigation and I’d restart, re-explain, re-investigate. Loop. Each session was a fresh start because there was no way to persist investigation state.

The frustrating part: the information was right there in the logs. RIGHT there. Claude just couldn’t access it efficiently; too many tokens for raw output, no way to query specific threads, no correlation between log entries across different game systems. I was doing the work of a log investigation tool by hand, using Claude as a very expensive grep. The thread was obvious. I just needed infrastructure that would let the agent pull it.

So I built logler. Not as a side project. As infrastructure I needed to stop wasting time.


The Pattern Across the -ler Stack

These were never meant to be cool side projects. I had actual problems to solve. I built what I needed; it’s paid for itself many times over.

logler: Claude was burning context on verbose log output. Thousands of tokens per investigation step; then losing the thread and starting over. Logler gives the agent structured access with token-budget control. The count format returns 202 bytes where full output returns 513,019. That’s not a marginal improvement; that’s the difference between an agent that can investigate and one that drowns.
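
logler’s actual interface isn’t shown here, but the count idea is easy to sketch. A hypothetical reducer that collapses raw log output into per-level counts, returning tens of bytes where the raw paste costs hundreds of kilobytes:

```python
import json
from collections import Counter

def count_summary(raw_log: str) -> str:
    """Collapse raw log lines into per-level counts.
    A token-budget sketch of the idea, not logler's actual implementation."""
    levels = Counter(line.split()[0] for line in raw_log.splitlines() if line.strip())
    return json.dumps(dict(levels))

# Hypothetical verbose game-AI log: thousands of DEBUG traces, a few ERRORs.
raw = "\n".join(["DEBUG eval node 17 score=0.42"] * 5000 + ["ERROR planner timeout"] * 3)
summary = count_summary(raw)
print(f"{len(raw)} bytes raw -> {len(summary)} bytes summarized")
```

The agent reads the summary first, then drills into the three ERRORs instead of swallowing five thousand DEBUG lines.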

procler: Claude couldn’t reliably query process state. Scraping ps aux output is a mess; the agent guesses, gets it wrong, guesses again. Procler returns JSON: pid, uptime, state. Humans get the WebUI; agents get the CLI. Both first-class. “Claude Code is a first-class citizen” was the design constraint from day one; not a feature tacked on later.
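
The difference between scraping ps aux and structured state is easy to show. A Linux-only sketch (reading /proc directly; not procler’s implementation) that returns pid, state, and uptime as JSON:

```python
import json, os

def proc_state(pid: int) -> str:
    """Return pid, state, and uptime as JSON by reading Linux /proc directly.
    A sketch of the idea, not procler's actual code."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # comm may contain spaces or parens, so split after the LAST ')'
    fields = stat[stat.rindex(")") + 2:].split()
    state = fields[0]                  # field 3 of /proc/[pid]/stat
    starttime_ticks = int(fields[19])  # field 22: process start, in ticks after boot
    with open("/proc/uptime") as f:
        system_uptime = float(f.read().split()[0])
    uptime = system_uptime - starttime_ticks / os.sysconf("SC_CLK_TCK")
    return json.dumps({"pid": pid, "state": state, "uptime_s": round(uptime, 1)})

print(proc_state(os.getpid()))
```

An agent parses that with one json.loads call; no guessing at column widths.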

sqler: I needed a data layer that returned JSON natively. Not an afterthought --json flag bolted onto a human-readable ORM. JSON-first persistence for agent workflows.
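
What JSON-first means in practice, sketched against stdlib sqlite3 (illustrative shape only; sqler’s real API may differ):

```python
import json, sqlite3

def query_json(db: sqlite3.Connection, sql: str, params=()) -> str:
    """Return query results as a JSON array of objects.
    JSON-first by construction; a sketch, not sqler's actual API."""
    db.row_factory = sqlite3.Row
    return json.dumps([dict(row) for row in db.execute(sql, params)])

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT, done INTEGER)")
db.executemany("INSERT INTO tasks (name, done) VALUES (?, ?)",
               [("wire logs", 1), ("audit CLI", 0)])
print(query_json(db, "SELECT name, done FROM tasks WHERE done = ?", (0,)))
```

The structured output is the return value, not a rendering option bolted on afterward.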

qler: task coordination tooling. Same philosophy.

The pattern across all of them: token efficiency as design constraint, structured output as the primary interface, agent-queryable state, session persistence for investigations that span restarts. Not features. Consequences of asking one question: “what does an agent need to operate this?”

Note: the rest of the post leans on logler as the main example, but procler and qler are similarly positioned; sqler is a sort of glue / bridge / data backend / micro-orm / thing.¹


The Factory Perspective

People (somehow) are still making the same argument in 2026: “LLMs are stochastic black boxes.” [as if this is a deal breaker]

Fine. People are stochastic black boxes too. We built factories around them anyway.

I ran and analyzed value streams under the Toyota Production System. Process ownership. Standard work. Poka-yoke.² The whole stack. Here’s what applies directly:

Value stream mapping → identify where agents waste tokens, lose context, re-derive decisions. Log investigation has measurable cycle time. Token waste is measurable. Context exhaustion is measurable. You can optimize the stream and glorious value comes out the other end.

Standard work → CLAUDE.md, roadmap discipline, mandatory plan mode. Documented, repeatable processes for agent operations. Not vibes. Systems.

Daily management → updating CLAUDE.md, running post-mortems, fixing system defects. When Claude causes rework, the system failed. Any failure is ultimately a system failure.

Poka-yoke → design outputs that make errors obvious or impossible. Structured formats catch parsing failures early. Type-safe interfaces prevent category errors. Session state prevents investigation loss.
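
One way to poka-yoke an agent-facing boundary, as a sketch (the schema and field names are invented for illustration, not taken from any -ler tool): validate structure at the edge so a malformed result fails loudly instead of propagating into the agent’s reasoning.

```python
import json

# Hypothetical expected shape for a process record.
SCHEMA = {"pid": int, "state": str, "uptime_s": (int, float)}

def checked(payload: str) -> dict:
    """Parse agent-facing JSON and fail fast on missing keys or wrong types."""
    record = json.loads(payload)
    for key, typ in SCHEMA.items():
        if key not in record:
            raise ValueError(f"missing field: {key}")
        if not isinstance(record[key], typ):
            raise TypeError(f"{key}: expected {typ}, got {type(record[key]).__name__}")
    return record

ok = checked('{"pid": 4242, "state": "S", "uptime_s": 12.5}')
print(ok["state"])
```

A category error (a string where a pid should be) dies at the boundary, which is exactly where you want it to die.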

The core insight isn’t complicated: the LLMs are the people and the code is the machine. We’re not waiting for deterministic LLMs any more than Toyota waited for deterministic humans. Process control doesn’t eliminate variability. It reduces failure rates, improves consistency, catches errors, and lets you recover.


Value Stream Thinking vs Feature Thinking

Feature thinking: “Let’s add a --json flag so agents can parse it.”

Value stream thinking: “Log investigation has bottlenecks. Token waste on verbose output. Context exhaustion mid-investigation. Session state loss on restart. What’s the cycle time? Where’s the waste? How do we optimize the whole stream?”
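
Measurable means instrumentable. A crude sketch of token accounting per investigation step, reusing the byte counts from earlier (the 4-characters-per-token ratio is a common rough approximation, not a real tokenizer):

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English-ish text."""
    return max(1, len(text) // 4)

# Hypothetical investigation steps: (step name, text pushed into context)
steps = [
    ("raw log paste",  "E" * 513_019),                 # full output
    ("count summary",  '{"DEBUG": 5000, "ERROR": 3}'), # structured summary
]
for name, payload in steps:
    print(f"{name}: ~{approx_tokens(payload)} tokens")
```

Once each step has a number attached, "where's the waste?" stops being rhetorical.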

The -ler stack came from that second question. Not “here’s a cool log tool” or “here’s a process manager.” Each one started as a value stream analysis: identify the bottlenecks, measure the waste, eliminate the expensive steps. The tools are consequences of the thinking, not the other way around.

The result: the agent writes code, starts processes with procler, checks logs with logler, iterates. The infrastructure supports the workflow instead of fighting it.

Paul Dix wrote about building verification infrastructure at InfluxData; verification systems, agents in feedback loops, the organizational perspective. Simon Willison is documenting developer discipline for coding agents; testing patterns, practical workflows. Both doing important work. My angle is the value stream. Different paths; similar conclusion: build the machine that builds the machine.


Agent-First Doesn’t Mean Human-Last

Does agent-first design hurt humans?

Not if you do it right.

LLMs handle complexity just fine. A million flags in the CLI? Claude figures it out. I don’t want to memorize ffmpeg syntax (I probably would if I used it often). I just get Claude to do it: resizing videos, changing sample rates, whatever. The agent handles the complexity. Or better yet, it writes me a bash script I can rerun.

Think when you should think.³ Don’t waste thinking on memorizing flags.

Same engine, different interfaces. Agents get structured JSON. Humans get dashboards and real-time updates. Neither degrades the other; both are first-class.
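
The same-engine-different-interfaces split, as a minimal sketch (not any -ler tool’s actual code): one data source, two renderers, neither subordinate to the other.

```python
import json

# One engine's state; hypothetical process records.
processes = [{"pid": 101, "state": "S", "uptime_s": 342.0},
             {"pid": 202, "state": "R", "uptime_s": 7.5}]

def render_agent(rows) -> str:
    """Agents get structured JSON they can parse without guessing."""
    return json.dumps(rows)

def render_human(rows) -> str:
    """Humans get an aligned table they can read at a glance."""
    lines = [f"{'PID':>6}  {'ST':<2}  UPTIME(s)"]
    lines += [f"{r['pid']:>6}  {r['state']:<2}  {r['uptime_s']:>9.1f}" for r in rows]
    return "\n".join(lines)

print(render_agent(processes))
print(render_human(processes))
```

Both views are projections of the same state, so neither can drift out of sync with the other.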


The Journey: Failure Then Improvement

I was building a monitoring dashboard. The data already existed; the paths to get the data already existed. Straightforward wiring job.

Claude remade all the data connections from scratch. Bespoke implementations for the dashboard when perfectly good ones were already there. Some were just bad; wrong approaches, unnecessary complexity. Others were worse: browser JavaScript that looked like it was connected to real data but wasn’t actually hooked up to anything. Beautiful dashboard. Fake plumbing. Claude decided the existing infrastructure wasn’t pretty enough to reuse.

Too much trust, not enough structure. No roadmap specifying “use existing data paths.” No plan mode to catch the reinvention before it happened. The failure wasn’t Claude being bad at dashboards; the failure was the system that let it rebuild things that already worked.

The whole journey is failure then improvement. That’s daily management. That’s continuous improvement. That’s how factories work. When Claude causes rework, you don’t blame Claude. You fix the system that allowed the fuckup.


What’s Next

Right now agents are session-based. Start a run, do work, end.

The horizon: always-on agents. Monitoring. Investigating. Responding without humans in the loop. That requires log investigation that doesn’t exhaust context windows, process state that’s queryable without human interpretation, investigation sessions that persist across restarts, task queues agents can pick up and continue. Structured output everywhere.
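
Session persistence is the least glamorous item on that list and maybe the most important. A bare-bones sketch (the state shape is assumed for illustration, not logler’s actual session format): the agent resumes exactly where the last run stopped instead of starting from zero.

```python
import json, os, tempfile

def load_session(path: str) -> dict:
    """Resume a prior investigation, or start a fresh one if none exists."""
    if not os.path.exists(path):
        return {"cursor": 0, "findings": []}
    with open(path) as f:
        return json.load(f)

def save_session(path: str, state: dict) -> None:
    """Persist investigation state across agent restarts."""
    with open(path, "w") as f:
        json.dump(state, f)

path = os.path.join(tempfile.mkdtemp(), "session.json")
state = load_session(path)                                   # fresh start
state["findings"].append("thread-7 stalls after planner timeout")
state["cursor"] = 4821                                       # hypothetical log position
save_session(path, state)
resumed = load_session(path)                                 # next run picks it up
print(resumed["cursor"])
```

With that in place, a context-window reset costs a reload, not a re-investigation.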

Most production systems aren’t ready for this. Logs are human-readable and token-expensive. Process management is scrape-the-output. Task coordination is a Notion doc. The infrastructure was built for humans reading screens; not for agents operating autonomously.

I started auditing my own CLIs. Does the output require a human to interpret? Is it optimized for eyes or for parsing? Is structured output an afterthought, or what the tool is designed around?

I looked at my logs. My process managers. My task queues. My debugging tools.

I asked: what does an agent need to operate this?

Then I built better answers.


Closing

The conversation happening now is agents writing code. Fine. Important.

The conversation I want to have: where’s the integrated value?

Management sees Excel integration and customer service bots. The tool makers see code generation and autocomplete. Both real. Both a fraction of what’s here. Agents as operators in production systems. Infrastructure designed for them. Process engineering applied to LLM workflows. The value stream doesn’t end at “write me a function.”

I built it because the thread was right there; the process engineering lens, the value stream optimization instinct, the years of watching variable components perform inside structured systems. Same principles. Different workers. I pulled the thread.

Audit your tools. Design for token efficiency. Think value streams, not features. Demand more from the infrastructure. Build the machine that builds the machine.


¹ I know this is weird; I love SQL and love Postgres, but I also don’t love overkill.

² Mistake-proofing. Design systems so errors are obvious or impossible.

³ Only you know when that is.