LLM Coding Agents in 2026: A Journey from Skeptic to Realist

LLMs are powerful, fast, and constantly changing. The LLM coding agent landscape looks nothing like it did even a year ago. At first, copying and pasting from ChatGPT into your project, like the next evolution of Stack Overflow, was novel, and everyone knew the code wasn’t great.

Podcasts have mostly changed their tune; some have gone from distrust and disdain to grudging acceptance to loving embrace. Everyone couches their talk about it in safety hedges and promises that they at least review the generated code. Reviewing is important, but it’s not always worth the effort.

People give LLM coding agents too much credit or not enough. I’m human too and would love to be the one who’s right, but in my experience there’s a real way to extract value and save time with coding agents without ruining everything. LLMs in 2026¹ are excellent at writing code you already know how to write, useless for code you don’t understand, and dangerous if you forget which category you’re in.

This matters because I know what it feels like to actually learn a tool deeply, so I can tell when an LLM is bluffing vs actually helping. In the mid-2010s I needed to code and was a little interested. VBA and AutoHotkey were my jam because they made my life easier; so much easier that they became my hammer and every problem was a nail. Then Python when data science went berserk. Then Vue when I wanted interactive frontends. I’ve walked the path from copy-pasting Stack Overflow to copy-pasting ChatGPT to actually understanding when these tools amplify competence vs just generate garbage faster.


The Honeymoon Phase

ChatGPT (3 or 3.5)

We all remember when this came onto the scene. It was cool for getting help: you could paste your code and problem in and paste the code back out with some changes. I actually thought this wasn’t so bad, because you were still intimately connected to your code even when the output was garbage.

When it could do web search I found it was more useful to brainstorm approaches or quickly ask what other projects were out there and how they figured things out. Like how did SQLAlchemy handle X. If you were already copy pasting from Stack Overflow, this was just the streamlined version. Easy to get caught in its thrall with the calm, reassuring responses.

ChatGPT 4

At this point GPT-4o was out and it was way better at copy paste. I knew about Copilot and code completion but hadn’t tried it; turns out I never will. Copying and pasting context into GPT-4 and getting something to copy and paste out of it had never been better or easier. With some edits. Terrible but amazing. I even made some little scripts that would go through and put whatever you wanted from your code onto the clipboard so I could super copy paste. Claude Code entered the picture here; I ignored it too since I already had ChatGPT.
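For flavor, here’s a minimal sketch of the kind of clipboard helper I mean. The function names and the xclip call are illustrative, not the actual scripts:

```python
# Sketch of a "super copy paste" helper: gather a few source files into one
# labeled blob so the model can see which file is which. Illustrative only.
import subprocess
from pathlib import Path

def bundle_context(paths):
    """Concatenate files with headers marking each file's name."""
    parts = []
    for p in map(Path, paths):
        parts.append(f"### {p.name}\n{p.read_text()}")
    return "\n\n".join(parts)

def copy_to_clipboard(blob: str) -> None:
    # Linux clipboard via xclip; swap in pbcopy (macOS) or clip (Windows).
    subprocess.run(["xclip", "-selection", "clipboard"],
                   input=blob.encode(), check=False)
```

Run it over the handful of files relevant to your question, paste, and you’ve skipped five round trips of manual copying.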

ChatGPT was great at chatting but I hated chatting with it. It was good for sanity checks or comparison or asking things you knew. It’s like that old story. You read a newspaper and there’s an article about something you’re an expert in. The cracks and errors stick out like a sore thumb. Then you read the next article about something you know nothing about and think, oh hey, that’s interesting! I assume it’s either half the picture or half wrong. Either way it’s not perfect and should never have been trusted like that, or allowed to be thought of like a friend.


The Crash

Codex

Whoa. All by itself! Developer experience was lacking a bit at this point but I’m pretty good at copying and pasting context around between sessions. It’s also pretty good at doing stuff on my computer that’s not hard but annoying, like Linux setup or folder stuff, reading journalctl, or finding out why I’m seeing “accept: too many open files”. You can Google it, yeah, but Codex can also just do it. I converted my old gaming desktop to an Ubuntu server and was doing some more experimentation, so it was helpful as a personal assistant since I’d mostly done WSL and Docker Compose before without much real “Linux” work.

Codex was very eager to reinvent the wheel constantly. I could never get it to work well with anything substantial but it excelled at small toys and bash scripts. It eventually got good-ish and you could leave it alone for a while.

You allow it unlimited permissions just to stop pressing accept on everything. Then it rampages through your repo. Deleting things on a whim.

Trust-But-Verify (Or Watch It Crash)

Trusting the LLM meant dealing with the fallout. Small, focused tasks. Forcing it to write good tests. Brainstorming with it. Using the web search to compare across several projects; seeing what others had done. Stuff you could do but now you can do faster. No objections from me.

Watching it make strange decisions in real time was helpful. It was even more helpful to see it go down a bad path and correct it. This also sucks because you have to WATCH it, which almost defeats the purpose. Still fast, but I don’t want to do this. It messed up too often for after-the-fact review alone to be enough.

I hate this term, vibe-coding. Someone coined it to be silly and it wasn’t marketing, but it’s just marketing now. A friendly word for the system to sell its services to people; they know it’s not a good idea and won’t lead anywhere, especially with the Sonnet 3s and GPT-4s. But if we’re stuck with it, let’s at least be precise: vibe-coding is fine for prototypes and exploration; in production it’s an invitation to tech debt. The problem isn’t the technique; it’s using it in the wrong context. You reap what you sow.


2026: What Actually Works

The Pattern That Wins

Plan mode. Subagents. CLAUDE.md. The developer experience is actually good. I wanted to like Codex but it just couldn’t keep up. You do seem to get a lot more tokens with Codex, and Anthropic is nothing if not stingy. ChatGPT, on the other hand, is leaps and bounds ahead of Claude’s chat, and Anthropic seems to be playing catchup on that side.

But here’s what I noticed. The tools that won weren’t the ones with the most features. They were the ones that understood the workflow: planning, execution, memory, iteration. And Anthropic keeps giving me reasons to stay. They added memory. They just keep making all my little scripts and personal add-ons obsolete. And agent teams! It seems like everyone invented their own special little wheels and hand-rolled Claude upgrades, then Anthropic went and added them themselves. This attitude is a large part of what sets them apart.

The underlying models and tooling have gotten so good that the size of the unsupervised toy you can make has grown. Now instead of one-off bash scripts you can make simple apps or applets. Or you can make some components and then just check them after. Easy.

Sonnet 4.5, Opus 4.5 (now 4.6), and GPT-5.n-codex are all showstoppers IN THEIR LANES, and in their lanes alone. Good context, good examples, good CLAUDE.md. Set them up for success and ask for what they can actually deliver. Just ask them (tell them?) to make a view and they’ll do it. Ask for tests and they’ll write them and run them.

They’ll make dumb and useless tests if you let them. Proudly proclaim all 2,000 tests are green and only 200 are skipped. Then you look. The skipped ones are the ones you wrote. A bug made them fail. The 2,000 are just assert True and pass no matter what. No big deal: just update the CLAUDE.md and the memory.
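What that CLAUDE.md update might look like is something along these lines; the exact rules are whatever your project needs, and these bullets are just illustrative:

```markdown
## Testing rules
- Every test must be able to fail. No `assert True`, no existence-only checks.
- Assert on behavior and output, not just that a call didn't raise.
- Never skip or delete a failing test to get to green; report the failure.
- New features need at least one test exercising the unhappy path.
```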


Code to the Size of the Model

Here’s the thing though. Asking it to code something you don’t know and can’t recognize bad patterns in is a mistake unless it’s a toy. You have to code things to the size of the model. Claude-sized (shaped?) development.

Think of it like an architectural constraint. The model has a cognitive envelope; it can hold a certain amount of context, understand a certain complexity of relationships, maintain a certain depth of state. Stay inside that envelope and it’s powerful. Push past it and you pay an entropy tax; the output degrades, hallucinations appear, architectural decisions lose cohesion.

Right now that envelope is maybe a few components, a small service, a standalone script. Not a full application. Not a complex system with multiple layers of abstraction. If Claude can consistently and usefully hold larger contexts at once, then you’ll be able to trust-code larger and more complex projects. This is the bottleneck that determines everything else.²

I design to this constraint now. Small modules with clean interfaces. Stable API contracts between components. Local context per milestone; fewer cross-cutting changes in one run. Claude gets a well-bounded problem and delivers. Give it the whole codebase and a vague directive and you get confidently wrong architecture.

The side effect is a cleaner codebase for humans too. Not a bonus; a direct consequence of respecting the envelope. When auth has a clean API, Claude can ship feature work that depends on auth without re-deriving auth internals every time. If boundaries are sloppy, Claude keeps too much state alive, then starts guessing. That’s where drift starts.
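To make “stable API contracts” concrete, here’s a toy sketch. The names and the in-memory storage are made up for illustration; the point is that feature work only ever touches the narrow surface, so the agent never has to re-derive the internals:

```python
# Toy illustration of a stable contract: callers (and the agent) depend only
# on login() and current_user(); the internals can churn freely behind them.
from dataclasses import dataclass

@dataclass(frozen=True)
class User:
    id: int
    name: str

class AuthService:
    """Narrow, stable surface over auth internals."""
    def __init__(self):
        self._sessions = {}  # token -> User; stand-in for real storage

    def login(self, name: str) -> str:
        token = f"tok-{len(self._sessions)}-{name}"
        self._sessions[token] = User(id=len(self._sessions), name=name)
        return token

    def current_user(self, token: str) -> "User | None":
        return self._sessions.get(token)
```

A feature like “show the current user’s name in the header” needs only current_user; the context Claude has to hold stays small.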

The failure modes have shifted too. In well-scaffolded projects with strong grounding, the old issues are rare: hallucinating libraries, mixing Python 2 and 3, generating completely broken tests. But the new failure mode is subtler. It’s architectural drift. It’s losing the thread on why a decision was made three functions ago. It’s confidently building the wrong thing because the context window couldn’t hold the full intent.

This is why the work intensifies instead of reducing. You stop being a builder and become a constant auditor. The bottleneck shifts from typing to discernment. That’s cognitively exhausting in a different way than writing code from scratch.


What LLMs Actually Do

LLMs amplify existing skill. They don’t replace judgment.

If you don’t know what good code looks like, the LLM won’t teach you. It’ll just generate bad code faster. If you can’t recognize when an architectural decision is wrong, the LLM will happily build you a beautiful disaster. The agent scales skill, not ignorance.

The biggest risk is coding in the dark. Write Python long enough and you know when you’re digging yourself into a hole. You know a test isn’t actually testing what it should. You know what’s probably causing that bug; chasing it down isn’t an odyssey. Junior developers who only ever coded through agents aren’t equipped to evaluate two approaches. They may not notice repeated code or multiplying magic numbers. Critically, they won’t realize when the LLM is wrong and can’t push back. (To quote Claude the other day when I corrected it: “Fuck. You’re right.”)

LLMs compress search space. They don’t eliminate the need for evaluation. You still need taste. You still need architectural intuition. You still need to know when something smells wrong even if you can’t immediately articulate why.


The Intensity Trap

It’s undeniable that it makes you faster at certain things simply because you literally cannot and will never be able to type as fast as an LLM agent. They also have very wide “working memory” especially when math or calculations are involved and can stitch things together in crazy ways; you just have to bring the vision, creativity, and discipline.

There’s an interesting article about LLM coding intensifying work instead of reducing it: AI Doesn’t Reduce Work—It Intensifies It. The authors note that people started working longer hours for free because the euphoric feeling of productivity and creation was so compelling.

I’ll admit something. Late one night I started a Claude Code run from my phone. My 5-hour limit had just rolled over. My weekly limit was about to refresh the next day. I hadn’t used it all up yet. Imagine not using your entire limit. Makes my skin crawl. Very healthy.

The article describes something easy to feel and that I felt. The euphoric honeymoon then the crash, especially if you didn’t clean up your tech debt along the way. You can pay back tech debt with the agent but it’s wasted tokens and lost time either way. You have to tune your scrutiny depending on the workload. Very sensitive situations require very intense review.


What To Actually Trust Them With

Prototypes and Mockups

Trust-code prototypes and mockups. If it’s a quick thing and you want the step past a sketch on paper, then why not. LLM agents in 2026 are SO good at this. Quick one-session things. LLMs are well suited to version 0.0.0.0.1 super-fast mockups that aren’t “usable” but are great for brainstorming and idea generation. They wow people better than sticky notes on a whiteboard. Mockups you’re planning to throw away.

“Usable” prototypes that actually do stuff make me wary.

Mechanical Work at Scale

This is where I’ve gotten the most reliable value. Renaming a variable across 20 files. Updating imports after a refactor. Mass-applying a pattern you’ve already validated in one place. The agent doesn’t need to understand your architecture for this; it just needs to be precise and tireless. I use cheaper models for this (Haiku, Sonnet) and save Opus for thinking work. Delegation by capability, not by laziness.
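A sketch of what I mean by mechanical (the identifier and file layout here are hypothetical): word-boundary matching is what keeps a rename of old_name from mangling old_name_extra.

```python
# Minimal sketch of mechanical work at scale: rename an identifier across
# every matching file under a root, using word boundaries for precision.
import re
from pathlib import Path

def rename_identifier(root: Path, old: str, new: str, glob: str = "*.py") -> int:
    """Rewrite files in place; return how many files changed."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    changed = 0
    for path in root.rglob(glob):
        text = path.read_text()
        updated, hits = pattern.subn(new, text)
        if hits:
            path.write_text(updated)
            changed += 1
    return changed
```

The agent version of this is the same idea with judgment attached: it can also fix the import lines and docstrings the regex would miss, as long as you’ve already validated the pattern yourself.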

Tests (With Supervision)

LLMs write tests fast. They also write garbage tests fast. The trick is having a testing philosophy baked into your project docs so the agent knows what “good” means. I require meaningful failure conditions; I run automated auditors that catch assert True and existence-only checks at creation time. The agent writes the first pass, the auditor catches the softballs, the agent fixes them.

For frontend work, human eyes are still non-negotiable. Claude can run browser automation and that helps a lot. It still cannot feel friction, confusion, or boredom the way users do.

Bounded Feature Work

If you’ve got clean module boundaries, you can hand Claude a well-scoped feature and get something shippable. “Add a tag filter to the blog listing page” with clear acceptance criteria and existing patterns to follow. That works. “Redesign the authentication system” without a plan does not.

The key is the word bounded. Claude excels inside constraints. Give it freedom and it’ll use all of it, usually in directions you didn’t want.

Responsibility

If you push the code it’s yours. Doesn’t matter who wrote it. LLM wrote it and you trust it, fine. It sucks and it’s on you. 100% yours.


Closing

The rest of the field is moving fast. Kiro (removed Opus from free tier = cry)³, specialized tools for repo analysis and PR review; they’re building things that would be too hard to do by hand. The site builders are marketing something that can’t be done to a good enough degree yet; small sites and CRUD sure, but I wouldn’t trust them with a real vision.

Code to the size of the model. Build the scaffolding that makes them reliable; don’t just hope they’ll figure it out. The models will keep getting better, but the judgment problem doesn’t go away. It just moves.

Next up: How I actually use Claude; roadmap discipline, context management, and the operating manual I’ve built for working with LLM agents every day.


¹ As of writing in February 2026. I’m sure Christmas, if we survive until then, will invalidate this entire blog post.
² The day after I wrote this draft, Sonnet 4.6 and 1M token context came out.
³ All these model and tool names may not be relevant in 2 years, but I’ll name drop them now anyway.