I see lots of people saying you should be doing it, but not actually doing it themselves.
Or at least, not showing full examples of exactly how to handle it when it starts to fail or scale, because obviously when you don't have anything, having a bunch of agents doing any random shit works fine.
That sounds crazy to me, Claude Code has so many limitations.
Last week I asked Claude Code to set up a Next.js project with internationalization. It tried to install a third-party library instead of using the internationalization method recommended for the latest version of Next.js (using Next's middleware) and could not produce a functional version of the boilerplate site.
There are some specific cases where agentic AI does help me but I can't picture an agent running unchecked effectively in its current state.
Indeed. Attaching the link to the correct documentation page worked in this case, but I would've been faster than the AI. llms.txt has been hit or miss. Maybe I need to adapt my workflow and have a granular plan of what needs to be done.
However the complexity is in knowing what to do and when. Actually typing the code/running commands doesn't take that much time and energy. I feel like any time gained by overusing an LLM will be offset by having to debug its code when it messes things up.
I'm commenting while agents run in project trying to achieve something similar to this.
I feel like "we all" are trying to do something similar, in different ways, and in a fast moving space (i use claude code and didn't even know subagents were a thing).
My gut feeling from past experience is that we have git, but not git-flow yet: a standardized approach that is simple to learn and implement across teams.
Once (if?) someone just "gets it right" and has a reliable way to break this down to the point that engineers can efficiently review specs and code against expectations, that will be the moment when being a coder takes on a different meaning at large.
So far, all the projects I've seen end up building "frameworks" that match each person's internal workflow. That's great and can be very effective for a single person (it is for me), but unless it can be shared across teams, throughput will still be limited (compared to that of a team of engineers with the same tools).
Also, refactoring a project to fully leverage AI workflows might be inefficient compared to rebuilding it from scratch, since the docs-for-context that should be built alongside development cannot be backported: that knowledge is likely already lost to time and accrued as technical debt.
How do you not get lost mentally in what is exactly happening at each point in time? Just trusting the system and reviewing the final output? I feel like my cognitive constraints become the limits of this parallelized system. With a single workstream I pollute context, but feel way more secure somehow.
i suppose, gradually and then suddenly?
each "fix" to incorrect reasoning/solution doesn't just solve the current instance, it also ends up in a rule-based system that will be used in future
initially, being in the loop is necessary, once you find yourself "just approving" you can be relaxed and think back
or, more likely, initially you need fine-grained tasks; as reliability grows, tasks can become more complex
"parallelizing" allows single (sub)agents with ad-hoc responsibilities to rely on separate "institutionalized" context/rules, .ie: architecture-agent and coder-agent can talk to each others and solve a decision-conflict based on wether one is making the decision based on concrete rules you have added, or hallucinating decisions
i have seen a friend build a rule based system and have been impressed at how well LLM work within that context
I built this tool https://github.com/btree1970/variant-ui where you can use a sub-agent to spin up multiple branches with different code changes into the UI and compare them side by side in the browser.
These prompts remind me of the YouTubers giving people self-actualization advice. “Act like the person you want to be!” Telling the LLM that it is an experienced product manager doesn’t make it an experienced product manager, it just makes it sound like one. This is like launching an entire team of “fake it til you make it” employees.
as much as ai has been a boon to my own development i writhe at the thought of middle managers oversold on the promise of ai and its output, making unrealistic requests and demanding 'MORE PRODUCTIVITY' at the greater cost of making more work in the future. Diluting code-as-craft, and commodifying it down to shovels of coal into the furnace.
Slightly off topic, but I would really like an agentic workflow that is embedded in my IDE as well as my code host provider, like GitHub for pull requests.
Ideally I would like to spin off multiple agents to solve multiple bugs or features. The agents would have to use the CI in GitHub to get feedback on tests. And I would like to view it in the IDE, because I like the ability to understand code by jumping through definitions.
Support for multiple branches at once - I should be able to spin off multiple agents that work on multiple branches simultaneously.
This already exists. Look at Cursor with Linear: you can just reply with @cursor and some instructions and it starts working in a VM. You can watch it work on cursor.com/agents or in the Cursor editor. The result is a PR. GitHub also has Copilot being integrated into the GitHub UI, but it's not that great in my experience.
Would that be solved by having several clones of your repo, each with an IDE and a Claude working on each problem? Much like how multiple people work in parallel.
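One way to get something like that today without juggling full clones is git worktree, which gives each task its own checkout of the same repository. A sketch (the directory and branch names are made up):

```
git worktree add ../myapp-auth-fix -b auth-fix
git worktree add ../myapp-search-tuning -b search-tuning
# open each directory in its own IDE window and run a separate Claude session in each
```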
Why not just use only async agents? You can fire off many tasks and check PRs locally when they complete the work. (I also work on devfleet.ai to improve this experience, any feedback is appreciated)
> One can hardly control one coding agent for correctness
Why not? I'm assuming we're not talking about "vibe coding" as it's not a serious workflow, it was suggested as a joke basically, and we're talking about working together with LLMs. Why would correctness be any harder to achieve than programming without them?
Using a coding agent can make your entire work day turn into doing nothing but code reviews. I.e. the least fun part: constant review of a junior dev that's on the brink of failing their probation period with random strokes of genius.
Is it a good idea to generate more code faster to solve problems? Can I solve problems without generating code?
If code is a liability and the best part is no part, what about leveraging Markdown files only?
The last programs I created were just CLI agents with Markdown files and MCP servers (some code here, but very little).
The feedback loop is much faster, allowing me to understand what I want after experiencing it, and self-correction is super fast. Plus, you don't get lost in the implementation noise.
Code you didn't write is an even bigger liability, because if the AI gets off track and you can't guide it back, you may have to spend the time to learn its code and fix the bugs.
It's no different to inheriting a legacy application though. As well, from the perspective of a product owner, it's not a new risk.
Claude is a junior. The more you work with it, the more you get a feel for which tasks it will ace unsupervised (some subset of grunt work) and which tasks to not even bother using it for.
I don't trust Claude to write reams of code that I can't maintain except when that code is embarrassingly testable, i.e it has an external source of truth.
What tasks are these? I don't doubt they're out there, but if I know the exact code that needs to be generated, typing speed is not a bottleneck.
For me the slow part is determining what to write. And while AI helps with that (search, brainstorm, etc) by the time I know what to write trying to get the AI to enter those lines is often just a slow down. Much like writing up a ticket for a junior, I could write the code faster than I could write the English language rules describing how to write that code.
Was going to ask how much all this cost, but this sort of answers it:
> "Managing Cost and Usage Limits: Chaining agents, especially in a loop, will increase your token usage significantly. This means you’ll hit the usage caps on plans like Claude Pro/Max much faster. You need to be cognizant of this and decide if the trade-off—dramatically increased output and velocity at the cost of higher usage—is worth it."
TBH I think the time it takes the agent to code is best spent thinking about the problem. This is where I see the real value of LLMs. They can free you up to think more about architecture and high level concepts.
Fast decision-making is terrible for software development. You can't make good decisions unless you have a complete understanding of all reasonable alternatives. There's no way that someone who is juggling 4 LLMs at the same time has the capacity to consider all reasonable alternatives when they make technical decisions.
IMO, considering all reasonable alternatives (and especially identifying the optimal approach) is a creative process, not a calculation. Creative processes cannot be rushed. People who rush into technical decisions tend to go for naive solutions; they don't give themselves the space to have real lightbulb moments.
Deep focus is good but great ideas arise out of synthesis. When I feel like I finally understand a problem deeply, I like to sleep on it.
One of my greatest pleasures is going to bed with a problem running through my head and then waking up with a simple, creative solution which saves you a ton of work.
I hate work. Work sucks. I try to minimize the amount of time I spend working; the best way to achieve that is by staring into space.
I've solved complex problems in a few days with a couple of thousand lines of code which took some other developers, more intelligent than myself, months and 20K+ lines of code to solve.
I was bored yesterday and tried to vibe code a simple React app using Claude Code, and it was basically useless. It created a good shell of the code initially, but after 10 minutes I basically had to take over (it would add a feature, then regress the previous one).
Am I the only one convinced that all of the hype around coding agents like codex and claude is 85% BS ?
I am sceptical that these persona-based agents really make that much of a difference; they more "appear" to make a difference because of their talk style.
Underneath it is just a system prompt, or more likely a prompt layered on top: "You are a frontend engineer, competent in React, Next.js, and Tailwind CSS". The stack details, project layout, and other key information are already in the CLAUDE.md. For anything more, the model is going to call file-read tools etc.
I think it's more theatre than utility.
What I have taken to doing is having a parent folder and then frontend/ backend/ infra/ etc as children.
parent/CLAUDE.md
frontend/CLAUDE.md
backend/CLAUDE.md
The parent/CLAUDE.md provides a high-level view of the stack ("FastAPI backend with Postgres, Next.js frontend with Tailwind, etc."). The parent/CLAUDE.md also points to the children's CLAUDE.md files, which have more granular information.
I then just spawn a Claude in the parent folder, set up plan mode, go back and forth on a design, and then have it dump the design out as markdown to RFC/ and after that go to work. I find it does really well then, as all the changes it makes are made with context on the other services.
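As an illustration of how lean the parent file can stay in this setup, a parent/CLAUDE.md might be little more than the following (the stack details and paths here are invented, not taken from the comment above):

```markdown
# Monorepo overview
- backend/: FastAPI + Postgres. Details: backend/CLAUDE.md
- frontend/: Next.js + Tailwind. Details: frontend/CLAUDE.md
- infra/: Terraform + Docker. Details: infra/CLAUDE.md
- Design docs go in RFC/ as markdown before implementation starts.
```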
I'm also skeptical partially because I don't like the huge essays generated by any llm. CLAUDE.md/AGENTS.md/README.md that are 5+ pages long are all equally bad imo. I prefer following the idea that if something is too verbose for me to want to get anything useful out of it, then the llm should behave similarly. Even if it's not true, why waste 2 paragraphs explaining something that could be explained in one short sentence?
My CLAUDE.md or AGENTS.md is usually just a bulleted list of reminders with high level information. If the agent needs more steering, I add more reminders. I try not to give it _too_ broad of a task without prior planning or it'll just go off the rails.
Something I haven't really experimented with is having Claude generate ADRs [1] like your RFC/ idea. I'll probably try that and see how it goes.
[1]: https://adr.github.io/
I too am skeptical about the personas, but I still use them to organize context and instructions for different types of work. I use a top-level .agents dir, with commands, roles, and rules sub-dirs.
CLAUDE.md is kept somewhat lean, with pointers to individual files in ./docs/ and .claude/commands is a symlink to .agents/commands.
After starting Claude, I use /commands to load a role and context, which pulls in only the necessary docs and avoids, say, loading UI design or test architecture docs when adding a backend feature.
I don't want to have to do any of this, but it helps me try and keep the agents on the rails and minimize context rot.
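A sketch of the kind of layout being described, with names chosen for illustration rather than copied from the commenter's setup:

```
.agents/
  commands/     # slash commands; .claude/commands is a symlink to this
  roles/        # e.g. backend.md, frontend.md, each listing which docs to load
  rules/        # always-on constraints the roles reference
CLAUDE.md       # kept lean; points at ./docs/ and .agents/
docs/           # UI design, test architecture, etc., loaded only when relevant
```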
You don't need subagent, I shared this on ClaudeCode sub as well https://www.reddit.com/r/ClaudeCode/s/barbpBxG78
Subagents do not work well for coding at all
> Subagents do not work well for coding at all
Subagents can work very well, especially for larger projects. Based on this statement, I think you're experiencing how I felt in my early experience with them, and that your mental model for how to use them effectively is still embryonic.
I've found that the primary benefit for subagents is context/focus management. For example, I'm doing auth using Stytch. What I absolutely don't want to do is load https://stytch.com/docs/llms.txt and instructions for leveraging it in my CLAUDE.md. But it's perfect for my auth agent, and the quality of the output for auth-related tasks is far higher as a result.
A recommended read: https://jxnl.co/writing/2025/08/29/context-engineering-slash...
I'm unsure if this also qualifies as incompetence/embryonic understanding, though I've used LLMs for hundreds of hours on development tasks and have also found that sub-agents are not good at programming. They're more suitable for research tasks that provide informed context to the parent agent while isolating it from the token consumption that retrieving that context costs.
Zooming out, my findings on LLMs with programming are that they work well in specific patterns and quickly go to shit when completely unsupervised by an SME:
* Prototyping
* Scaffolding (i.e. write an endpoint that does X that I'll refine into a sustainable implementation myself)
* Questions on the codebase that require open-ended searching
* Specific programming questions (i.e. "How do I make an HTTP call in ___ ?")
* Idea generation ("List three approaches for how you'd ____" or "How would you refactor this package to separate concerns?")
The LLMs all fuck up on something in every task they perform, due to the intersection of operating on assumptions and working on large problem spaces. The amount of effort it takes to completely eliminate assumptions from the agent makes the process slower than writing the code yourself. So people try to find the balance they're comfortable with.
> I've found that the primary benefit for subagents is context/focus management. For example, I'm doing auth using Stytch. What I absolutely don't want to do is load https://stytch.com/docs/llms.txt and instructions for leveraging it in my CLAUDE.md.
> But it's perfect for my auth agent, and the quality of the output for auth-related tasks is far higher as a result.
What about just using a sub agent specifically to fetch llms.txt and find the answer to the question for the parent agent, instead of handing a full task off to it?
> your mental model for how to use them effectively is still embryonic.
Well
Subagents suffer from the same overriding problem as "Claude contexting" in general, which is context wrangling. Subagents "should" help to compartmentalize and manage your context better, but not in my experience so far. I found I was jumping through a lot of hoops with special instructions, manual compacts, up-front super-detailed plans, and MCPs just to manage my context. So subagents are probably the same, where you want to have them handle tasks that do not require context from your main thread.
P.S. I know they added 1m context to their API, with a price increase, but AFAIK the subscription still uses the 200k context.
Subagents are literally built into Claude Code via a built-in tool where it can recursively call itself
Yes, I know, but subagents suffer from context amnesia during context handoffs, which is why this subagent use is flawed for the purpose of coding product features. I've been using these tools a lot and have installed every AI agent out there I could find.
Yup, this is the killer. Subagents SEEM good when you use them on greenfield projects, you can grind out a whole first pass without burning through much of your main context, it seems magical. But when you have a complex project that handoff is the kiss of death.
I'm wondering if in large projects, you want subagents to avoid having tasks flush out the main context?
If you're working with large source files, you might want to do each piece of work in an independent context with the information discarded afterwards?
Is the context a sliding window, or are there tiers of importance?
No, the context going out of control is overblown. Lemme example why. First you need to work at feature level. It shouldn't be too large of a feature in one go.
Let's say in my workflow, the first thing the agent must know is where it needs to make changes. So it greps a bunch of files and reads them. We do not need these read calls or grep calls to be part of the history; the knowledge gained by doing them is what needs to be part of the context.
Finally, we do some risk analysis and then just code it right away.
No sliding window needed for this
After this you reset the context with /reset and start on a new feature.
> No, the context going out of control is overblown. Lemme example why. First you need to work at feature level. It shouldn't be too large of a feature in one go.
As a meta point, why write ' Lemme example why.' ?
If someone is still with you at this sentence, that person was ready to understand why.
Otherwise, it delays (and thus endangers the visibility of) whatever your explanation was going to be.
So maybe the solution is to make all subproblems greenfield products?
By this I mean treat features as isolated plugins. I get that there are cross-cutting features that touch multiple pieces of functionality, and those probably need special treatment, but a large class of functionality can be developed in an isolated way with a common set of design tokens and APIs to tie them all together.
This might play better to coding agent strengths.
Full disclosure: this is very much an armchair view. I have all of 2 weeks of experience coding via agents (vs manually), but this thread is nerd sniping me into trying it myself.
I do try to do this. From an architectural standpoint it starts with modular monoliths to avoid coupling, then I try to decompose problems in a way that is very sandboxed, so the blast radius of an agent going off the rails is contained.
So the things people hate Java for will make a big comeback then? Hexagonal architecture with domain-driven design, a big fetish for inversion of control, so the LLM never needs to figure out how the system works, it just magically does. And errors have just the right amount of stack trace, this being 500++ lines.
A lot of old school "java-ish" paradigms are going to come back with AI for the same reason people used them with Java back in the day - they put golden handcuffs on implementors, which is a bad tradeoff for competent, agile humans but a very good tradeoff for sometimes off the rails agents. This includes waterfall, spec driven development, front loaded planning, extensive automated testing suites, formal verification, etc.
Agreed, the roles seem more ceremonial than anything else.
As someone who's built a project in this space, this is incredibly unreliable. Subagents don't get a full system prompt (including stuff like CLAUDE.md directions) so they are flying very blind in your projects, and as such will tend to get derailed by their lack of knowledge of a project and veer into mock solutions and "let me just make a simpler solution that demonstrates X."
I advise people to only use subagents for stuff that is very compartmentalized because they're hard to monitor and prone to failure with complex codebases where agents live and die by project knowledge curated in files like CLAUDE.md. If your main Claude instance doesn't give a good handoff to a subagent, or a subagent doesn't give a good handback to the main Claude, shit will go sideways fast.
Also, don't lean on agents for refactoring. Their ability to refactor a codebase goes in the toilet pretty quickly.
> Their ability to refactor a codebase goes in the toilet pretty quickly.
Very much this. I tried to get Claude to move some code from one file to another. Some of the code went missing. Some of it was modified along the way.
Humans have strategies for refactoring, e.g. "I'm going to start from the top of the file and Cut code that needs to be moved and Paste it in the new location". LLM don't have a clipboard (yet!) so they can't do this.
Claude can only reliably do this refactoring if it can keep the start and end files in context. This was a large file, so it got lost. Even then it needs direct supervision.
> Humans have strategies for refactoring, e.g. "I'm going to start from the top of the file and Cut code that needs to be moved and Paste it in the new location". LLM don't have a clipboard (yet!) so they can't do this.
For my own agent I have a `move_file` and `copy_file` tool with two args each, which at least GPT-OSS seems able to use whenever it suits, like for moving stuff around. I've seen it use them as part of refactoring as well: moving a file to one location, copying that to another, then trimming both of them, but with different trims. Seems to have worked OK.
If the agent has access to `exec_shell` or similar, I'm sure you could add `Use mv and cp if you need to move or copy files` to the system prompt to get it to use that instead, probably would work in Claude Code as well.
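For anyone building their own agent, a minimal sketch of such tools might look like this; the function names, return strings, and the way they are registered are assumptions for illustration, not any particular framework's API:

```python
import shutil
from pathlib import Path

def move_file(src: str, dst: str) -> str:
    """Tool: move src to dst, creating parent directories as needed."""
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    shutil.move(src, dst)
    return f"moved {src} -> {dst}"

def copy_file(src: str, dst: str) -> str:
    """Tool: copy src to dst without touching the original."""
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return f"copied {src} -> {dst}"

# Exposed to the model as two-argument tools, e.g.:
TOOLS = {
    "move_file": {"fn": move_file, "args": ["src", "dst"]},
    "copy_file": {"fn": copy_file, "args": ["src", "dst"]},
}
```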
Claude’s utility really drops when any task requires a working set larger than the context window.
On the one hand, it's kind of irritating when it goes great-great-great-fail.
On the other hand, it really enforces the best practices of small classes, small files, separation of concerns. If each unit is small enough it does great.
Unfortunately, it's also fairly verbose and not great at recognizing that it is writing the same code over and over again, so I often find some basic file has exploded to 3000 lines, and a simple "identify repeated logic and move it to functions" prompt shrinks it to 500 lines.
Remember 20 years ago when Eclipse could move a function by manipulating the AST and following references to adjust imports and callers, and it didn't lose any code?
I have a suite of agent tools that is just waiting on my search service for a release. It includes `srefactor` and `spatch` commands that have fuzzy semantic alignment with strong error guards; they use LSP and tree-sitter to enable refactoring/patching without line numbers or anything and to ensure the patch is correct.
Nice. This sounds like the right approach. As an aside, it’s crazy that a mature LSP server is not a first class requirement for language choice in 2025. I used to write mini LSP servers before working on a project starting when LSP came out a few years ago. Now that there is wider adoption, I don’t find myself reaching for this quite as often, but it’s still a really nice way to ease development on mature codebases that have grown their own design patterns.
It’s still early days for these agents. There isn’t any reason the agents won’t build or understand AST in the future to more quickly refactor.
Why do the agents need to build or understand it? Just give them tools to work with it like we would.
Everyone is talking about MCP and they haven't figured this out. Actually, JetBrains has an IDE MCP server plugin, although I haven't tried it.
I think it's likely that these agent-based development tools will inevitably add more imperative tools to their arsenal to lower cost and improve speed and accuracy.
Codex’s model is much better at actually reading large volumes of code which improves its results compared with CC
I don't use subagents to do things, they're best for analysing things.
Like "evaluate the test coverage" or "check if the project follows the style guide".
This way the "main" context only gets the report and doesn't waste space on massive test outputs or reading multiple files.
This is only a problem if an agent is made in a lazy way (all of them).
Chat completion sends the full prompt history on every call.
I am working on my own coding agent and seeing massive improvements by rewriting history using either a smaller model or a freestanding call to the main one.
It really mitigates context poisoning.
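A minimal sketch of that idea, assuming a hypothetical call_model() helper wrapping whatever smaller (or freestanding) model is used for the rewrite; this is not the commenter's actual implementation:

```python
def render(msgs):
    # Flatten messages into plain text for the summarizer.
    return "\n".join(f"{m['role']}: {m['content']}" for m in msgs)

def call_model(model, prompt):
    # Placeholder: wire this to a smaller model or a freestanding call to the main one.
    raise NotImplementedError

def rewrite_history(messages, keep_recent=6):
    """Keep the system prompt and the most recent turns verbatim; replace the
    older middle of the conversation with a summary, so the next completion
    call carries the knowledge gained rather than raw tool output."""
    system, rest = messages[0], messages[1:]
    if len(rest) <= keep_recent:
        return messages
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    summary = call_model(
        model="small-summarizer",
        prompt="Summarize the facts, decisions, and file paths learned below. "
               "Drop raw tool output and dead ends.\n\n" + render(old),
    )
    return [system, {"role": "user", "content": "Context so far: " + summary}] + recent
```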
There's a large body of research on context pruning/rewriting (I know because I'm knee deep in benchmarks in release prep for my context compiler), definitely don't ad hoc this.
Care to give some pointers on what to look at? Looks like I will be doing something similar soon so that would be much appreciated
Just ask ChatGPT about the state of the art in context pruning and other methods to optimize the context provided to an LLM; it's a good research helper. The right mental model is that it's basically RAG in reverse: instead of trying to select and rank from a data set, you're trying to select and rank from context given a budget.
Everyone complains that when you compact the context, Claude tends to get stupid
Which as far as I understand it is summarizing the context with a smaller model.
Am I misunderstanding you? The practical experience of most people seems to contradict your results.
One key insight I have from having worked on this from the early stages of LLMs (before chatgpt came out) is that the current crop of LLM clients or "agentic clients" don't log/write/keep track of success over time. It's more of a "shoot and forget" environment right now, and that's why a lot of people are getting vastly different results. Hell, even week to week on the same tasks you get different results (see the recent claude getting dumber drama).
Once we start to see that kind of self feedback going in next iterations (w/ possible training runs between sessions, "dreaming" stage from og RL, distilling a session, grabbing key insights, storing them, surfacing them at next inference, etc) then we'll see true progress in this space.
The problem is that a lot of people work on these things in silos. The industry is much more geared towards quick returns now, having to show something now, rather than building strong foundations based on real data. Kind of an analogy to early Linux dev. We need our own Linus, it would seem :)
I've experimented with feature chats, so I start a new chat for every change, just like a feature branch. At the end of a chat I'll have it summarize the feature chat and save it as a markdown document in the project, so the knowledge is still available for the next chats. Seems to work well.
You can also ask the LLM at the end of a feature chat to prepare a prompt for starting the next feature chat, so it can determine what knowledge is important to pass along.
Summarizing a chat also helps getting rid of wrong info, as you’ll often trial and error towards the right solution. You don’t want these incorrect approaches to leak into the context of the next feature chat, maybe just add the “don’t dos” into a guidelines and rules document so it will avoid it in the future.
I ask the bot to come up with a list of "don't dos"/lessons learned based on what went right or required lots of edits. Then I have it merge them into an ongoing list. It works OK.
i too have discovered that feature chats are surely a winner (as well as a prerequisite for parallelization)
in a similar vein, i match github project issues to md files committed to repo
essentially, the github issue content is just a link to the md file in the repo
also, epics are folders with links (+ a readme that gets updated after each task)
i am very happy about it too
it's also very fast and handy to reference either from claude using @
i.e.: did you consider what has been done @
other major improvements that worked for me were
- DOC_INDEX.md built around the concept of "read this if you are working on X (infra, db, frontend, domain, ....)"
- COMMON_TASKS.md (if you need to do X read Y, if you need to add a new frontend component read HOW_TO_ADD_A_COMPONENT.md)
common tasks tend to increase in quality when they are expressed in a checklist format
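For concreteness, a DOC_INDEX.md in that spirit can be nothing more than a routing table; the file names here are illustrative, not the commenter's:

```markdown
# DOC_INDEX.md
- Working on infra?        read docs/INFRA.md
- Working on the db?       read docs/DB_SCHEMA.md
- Working on the frontend? read docs/FRONTEND.md and docs/HOW_TO_ADD_A_COMPONENT.md
- Touching domain logic?   read docs/DOMAIN_GLOSSARY.md
```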
The difference between agents and LLMs is that agents are easy to tune online, because unlike LLMs they're 95% systems software: the prompts, the tools, the retrieval system, the information curation/annotation, context injection, etc. I have a project, still in its early stages, that can monitor queries in ClickHouse for agent failures, group/aggregate them into post-mortem classes, then do system parameter optimization on the retrieval/document annotation system and invoke DSPy on low-efficacy prompts.
> don't log/write/keep track of success over time.
How do you define success of a model's run?
Lots of ways. You could do binary thumbs up/down. You could do a feedback session. You could look at signals like "acceptance rate" (for a pr?) or "how many feedback messages did the user send in this session", and so on.
My point was more on tracking these signals over time. And using them to improve the client, not just the model (most model providers probably track this already).
Ah. Yes!
My somewhat terse/bitter question was because yesterday Claude would continue to claim to have created a "production-ready" solution which was entirely wrong.
I would've loved to have the feedback loop you describe
I do something similar, and I have the best results with not keeping a history at all, but setting the context anew with every invocation.
My experience so far, after trying to keep CC on track with different strategies, is that it will more or less end up in the same ditch sooner or later. Even though i had defined agents, workflows, etc., now i just let it interact with github issues and the quality is pretty much the same.
It was my understanding that the subagents have the same system prompt. How do you know that they don’t follow CLAUDE.md directions?
I’ve been using subagents since they were introduced and it has been a great way to manage context size / pollution.
A few youtubers have done deep dives on this, monitoring claude traffic through a proxy. Subagents don't get the system prompt or anything else, they get their subagent prompt and whatever handoff the main agent gives them.
I was on the subagent hype train myself for a while but as my codebases have scaled (I have a couple of codebases up to almost 400k now) subagents have become a lot more error prone and now I cringe when I see them for anything challenging and immediately escape out. They seem to work great with more greenfield projects though.
I have a bunch of homegrown CLI tools in my $PATH that are only described in the CLAUDE.md file. My subagents use these tools perfectly as if they have full instructions on their use but no such instructions are in the subagent prompts.
This should not be possible if they don't have CLAUDE.md in their context.
My main agent prompt always has a complete ban on the main agent doing any work itself. All work is done by subagents, which it coordinates.
I've been doing this for 2-3 months now on projects upwards of 200k lines and the results have been incredible.
I'm very confused how so many of us can have such completely different experiences with these tools.
Totally agreed. I tried agents for a lot of stuff (I started creating a team of agents: architect, frontend coder, backend coder and QA). Spent around 50 USD on a failed project; the context got contaminated and the project eventually had to be rewritten.
Then I moved some parts into rules and some parts into slash commands, and I got much better results.
Subagents are like freelance contractors (I know, I have been one very recently): good when they need little handoff (not possible in real time) and little overseeing, and when their results are treated as advice, not an action. They don't know what you are doing, and they don't care what you do with the info they produce. They just do the work for you while you do something else, or you wait for them to produce independent results. They come and go with little knowledge of existing functionality, but they are good on their own.
Here are 3 agents I still keep and one I am working on.
1: Scaffolding: I create (and sometimes destroy) a lot of new projects, so I use a scaffolding agent when I am trying something new. It starts with a fresh one-line instruction on what to scaffold (e.g. a new Docker container with Hono and a Postgres connection, or a new Cloudflare Worker which will connect to R2, D1 and AI Gateway, or an AWS serverless API Gateway with SQS that does this, that and that) and where to deploy. At the end of the day it sets up the project structure, creates a GitHub repo and commits it for me. I take it forward from there.
2: Triage: When I face an issue that is not obvious from reading the code alone, I give the agent the relevant place and some logs, and it uses whatever is available (including the DB data) to make a best guess at why the issue happens. I have often found it works best when it is not biased by recent work.
3: Pre-release QA check: This agent tests the entire system (essentially running all integration and end-to-end test suites) to make sure the product doesn't break anything existing. I am now adding a capability for it to see the original business requirements and check whether the code satisfies them. I want this agent to be my advisor in deciding whether something goes into the release pipeline or not.
4: Web search (experimental): Sometimes a search is too costly for the main context's token budget, and we only need the end result, not the searching itself and the 10 pages it found along the way...
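For readers who haven't set these up: Claude Code subagents live as Markdown files with YAML frontmatter under `.claude/agents/`. A minimal sketch of what a scaffolding agent along these lines could look like; the name, tool list, and wording are my assumptions, not the commenter's actual setup:

```markdown
---
name: scaffolder
description: Use this agent to bootstrap a brand-new project from a one-line
  brief (stack, services, deployment target) and push the initial commit.
tools: Bash, Read, Write, Edit, Glob
---
You scaffold new projects from scratch.

Given a one-line brief (e.g. "Cloudflare Worker connected to R2, D1 and AI Gateway"):
1. Create the directory layout, config files, and a minimal hello-world entry point.
2. Add a short README describing the stack and how to run it locally.
3. Initialize git, create the remote repo (e.g. with `gh repo create`), and push the first commit.

Report back only the repo URL and a bullet list of what was created.
```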
I often see people making these subagents modelled on roles like product manager, backend developer, etc.
I spent a few hours trying stuff like this and the results were pretty bad compared to just using CC with no agent specific instructions.
Maybe I needed to push through and find a combination that works but I don't find this article convincing as the author basically says "it works" without showing examples or comparing doing the same project with and without subagents.
Anyone got anything more convincing to suggest it's worth me putting more time into building out flows like this instead of just using a generic agent for everything?
Right: don't make subagents for the different roles; make them to manage context for token-heavy tasks.
A backend developer subagent is going to do the job ok, but then the supervisor agent will be missing useful context about what’s been done and will go off the rails.
The ideal sub agent is one that can take a simple question, use up massive amounts of tokens answering it, and then return a simple answer, dropping all those intermediate tokens as unnecessary.
Documentation search is a good one ("does library X have a Y function?"): the subagent can search the web, read docs via MCPs, and then return a simple answer without the supervisor being polluted with all that context.
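Concretely, here is a hedged sketch of such a documentation-search subagent as a Claude Code agent file (name, tools, and wording are illustrative): the description tells the supervisor when to delegate, and the prompt forces a compact answer back.

```markdown
---
name: docs-searcher
description: Use this agent to check library or framework documentation,
  e.g. "does library X have a Y function?".
tools: WebSearch, WebFetch, Read
---
You answer documentation questions for the main agent.

- Search the web and/or fetch the relevant docs pages.
- Do not dump pages back; distill what you read.
- Reply with a few sentences, the exact API signature if relevant, and one
  source URL. Say "not found in the docs" rather than guessing.
```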
This is my experience too.
Make agents for tasks, not roles.
I've seen this with coding agents using spec-driven development, for example. You can try to divide agents into lots of different roles that roughly correspond to human job positions, as BMad does, or you can simply have each agent do a task, with a template for that task: make an implementation plan using an implementation-plan template, or make a task list using a task-list template. In general, I've gotten much better results from agents with a specific task to do than from giving them a role with a job-like description.
For code review, I don't use a code-reviewer agent; instead I've defined a dozen code-review tasks, each of which runs as a separate agent (though I group some related tasks together).
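One way to express "tasks, not roles" in Claude Code is project slash commands: Markdown files under `.claude/commands/` become `/command-name`, and `$ARGUMENTS` is replaced with whatever you type after the command. A sketch of one such review task; the file name and checklist are illustrative, not the commenter's actual tasks:

```markdown
<!-- .claude/commands/review-error-handling.md (illustrative sketch) -->
Review the changes in $ARGUMENTS for error handling only.

- Flag swallowed exceptions, empty catch blocks, and errors logged but not propagated.
- Check that user-facing errors don't leak internals (stack traces, SQL, secrets).
- Check that network and DB calls have sensible timeouts and retries.

Output a short list of findings with file:line references; do not rewrite code.
```

Invoked as, say, `/review-error-handling src/api/`; a dozen files like this give you the separate review passes described above.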
This!
Subagents open all the new metaphorical tabs to get to some answer, then close those tabs so the main agent can proceed with the main task.
Excellent article on this pattern: https://jxnl.co/writing/2025/08/29/context-engineering-slash...
This is exactly right.
This has been my experience so far as well. It seems like just basic prompting gets me much further than all these complicated extras.
At some point you gotta stop and wonder if you’re doing way too much work managing claude rather than your business problem.
No, this has been my experience as well.
I see lots of people saying you should be doing it, but not actually doing it themselves.
Or at least, not showing full examples of exactly how to handle it when it starts to fail or scale, because obviously when you don't have anything yet, having a bunch of agents doing any random shit works fine.
Frustrating.
I think the trick is the synthesis step, which brings the agents' findings together. That's where I've had the most success, at least.
That sounds crazy to me, Claude Code has so many limitations.
Last week I asked Claude Code to set up a Next.js project with internationalization. It tried to install a third-party library instead of using the internationalization method recommended for the latest version of Next.js (using Next's middleware) and could not produce a functional version of the boilerplate site.
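For context, the built-in approach being referred to is locale handling in middleware: the Next.js App Router i18n guide has `middleware.ts` detect the visitor's locale and redirect to a locale-prefixed path. A minimal sketch under those assumptions (the locale list is made up, and the naive Accept-Language check stands in for the proper locale matcher the docs use):

```typescript
// middleware.ts: minimal locale-redirect sketch (simplified from the pattern
// in the Next.js App Router i18n docs; not production-ready).
import { NextResponse } from 'next/server';
import type { NextRequest } from 'next/server';

const locales = ['en', 'de', 'fr']; // illustrative locale list
const defaultLocale = 'en';

export function middleware(request: NextRequest) {
  const { pathname } = request.nextUrl;

  // Already locale-prefixed? Let the request through.
  if (locales.some((l) => pathname === `/${l}` || pathname.startsWith(`/${l}/`))) {
    return NextResponse.next();
  }

  // Naive Accept-Language check; the official example uses a real locale matcher.
  const header = (request.headers.get('accept-language') ?? '').toLowerCase();
  const preferred = locales.find((l) => header.includes(l)) ?? defaultLocale;

  // e.g. /products -> /en/products
  return NextResponse.redirect(new URL(`/${preferred}${pathname}`, request.url));
}

export const config = {
  // Skip Next.js internals; the docs' matcher also excludes static assets.
  matcher: ['/((?!_next).*)'],
};
```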
There are some specific cases where agentic AI does help me but I can't picture an agent running unchecked effectively in its current state.
I pretty much always attach (insert library here) LLM.txt as context, or a direct link to the documentation page for (insert framework feature)
Not very agentic but it works a lot better.
Indeed. Attaching a link to the correct documentation page worked in this case, but I would've been faster than the AI. LLM.txt has been hit or miss. Maybe I need to adapt my workflow and have a granular plan of what needs to be done.
However, the complexity is in knowing what to do and when. Actually typing the code and running commands doesn't take that much time and energy. I feel like any time gained by overusing an LLM will be offset by having to debug its code when it messes things up.
I’m training myself to have the muscle memory for putting it into planning mode before I start telling it what to do.
I'm commenting while agents run in a project trying to achieve something similar to this. I feel like "we all" are trying to do something similar, in different ways, and in a fast-moving space (I use Claude Code and didn't even know subagents were a thing).
My gut feeling from past experience is that we have git, but not git-flow yet: a standardized approach that is simple to learn and implement across teams.
Once (if?) someone just gets it right and has a reliable way to break this down to the point that engineers can efficiently review specs and code against expectations, that will be the moment when being a coder takes on a different meaning at large.
So far, all the projects I've seen end up building "frameworks" to match each person's internal workflow. That's great and can be very effective for a single person (it is for me), but unless it can be shared across teams, throughput will still be limited (compared to that of a team of engineers with the same tools).
Also, refactoring a project to fully leverage AI workflows might be inefficient compared to rebuilding from scratch with those workflows from day zero, since the context docs that should have been built in tandem with development cannot be backported: that knowledge is likely already lost to time and accrued as technical debt.
How do you not get lost mentally in what exactly is happening at each point in time? Just trusting the system and reviewing the final output? I feel like my cognitive constraints become the limits of this parallelized system. With a single workstream I pollute context, but somehow feel way more secure.
I suppose gradually, and then suddenly? Each "fix" to incorrect reasoning or a wrong solution doesn't just solve the current instance; it also ends up in a rule-based system that will be used in the future.
Initially, being in the loop is necessary; once you find yourself "just approving", you can relax and step back. Or, more likely: initially you need fine-grained tasks, and as reliability grows, the tasks can become more complex.
"parallelizing" allows single (sub)agents with ad-hoc responsibilities to rely on separate "institutionalized" context/rules, .ie: architecture-agent and coder-agent can talk to each others and solve a decision-conflict based on wether one is making the decision based on concrete rules you have added, or hallucinating decisions
I have seen a friend build a rule-based system and have been impressed at how well LLMs work within that context.
Until your rules get poisoned…
Just one more agent...
I built this tool, https://github.com/btree1970/variant-ui, where you can use a subagent to spin up multiple branches with different code changes to the UI and compare them side by side in the browser.
These prompts remind me of the YouTubers giving people self-actualization advice. “Act like the person you want to be!” Telling the LLM that it is an experienced product manager doesn’t make it an experienced product manager, it just makes it sound like one. This is like launching an entire team of “fake it til you make it” employees.
As much as AI has been a boon to my own development, I writhe at the thought of middle managers oversold on the promise of AI and its output, making unrealistic requests and demanding 'MORE PRODUCTIVITY' at the greater cost of creating more work in the future, diluting code-as-craft and commodifying it down to shovels of coal into the furnace.
Let's ask the obvious question: is there any hard evidence that subagent flows give actual developers a better experience than just using CC without them?
What's the difference between using agents and playing at the casino? A large part of the industry is a casino dressed in other clothes.
I see people who have never coded in their life signing up for Lovable or some other coding agent and trying their luck.
What cements this thought pattern in your post is this: "If the agents get it wrong, I don’t really care—I’ll just fire off another run"
Slightly off topic, but I would really like an agentic workflow that is embedded both in my IDE and in my code host, like GitHub, for pull requests.
Ideally I would like to spin off multiple agents to solve multiple bugs or features. The agents have to use the CI in GitHub to get feedback from tests. And I would like to view their work in the IDE, because I like being able to understand code by jumping through definitions.
Support for multiple branches at once - I should be able to spin off multiple agents that work on multiple branches simultaneously.
This already exists. Look at Cursor with Linear: you can just reply with @cursor and some instructions, and it starts working in a VM. You can watch it work on cursor.com/agents or in the Cursor editor; the result is a PR. GitHub is also integrating Copilot into the GitHub UI, but that hasn't been great in my experience.
Would that be solved by having several clones of your repo, each with an IDE and a Claude working on its own problem, much like how multiple people work in parallel?
Yeah but it’s not ideal. I thought of this too.
Why not just use only async agents? You can fire off many tasks and check PRs locally when they complete the work. (I also work on devfleet.ai to improve this experience, any feedback is appreciated)
One can hardly control one coding agent for correctness, let alone multiple ones... It's cool, but not very reliable or useful.
It's resume-driven development.
> One can hardly control one coding agent for correctness
Why not? I'm assuming we're not talking about "vibe coding", as that's not a serious workflow (it was basically suggested as a joke), but about working together with LLMs. Why would correctness be any harder to achieve than programming without them?
Because they output so much code. It's a wall.
Using a coding agent can turn your entire work day into doing nothing but code reviews, i.e. the least fun part: constantly reviewing a junior dev who's on the brink of failing their probation period, with random strokes of genius.
Is it a good idea to generate more code faster to solve problems? Can I solve problems without generating code?
If code is a liability and the best part is no part, what about leveraging Markdown files only?
The last programs I created were just CLI agents with Markdown files and MCP servers (some code here, but very little).
The feedback loop is much faster, allowing me to understand what I want after experiencing it, and self-correction is super fast. Plus, you don't get lost in the implementation noise.
Code you didn't write is an even bigger liability, because if the AI gets off track and you can't guide it back, you may have to spend the time to learn its code and fix the bugs.
It's no different from inheriting a legacy application, though. And from the perspective of a product owner, it's not a new risk.
Claude is a junior. The more you work with it, the more you get a feel for which tasks it will ace unsupervised (some subset of grunt work) and which tasks to not even bother using it for.
I don't trust Claude to write reams of code that I can't maintain except when that code is embarrassingly testable, i.e it has an external source of truth.
> some subset of grunt work
What tasks are these? I don't doubt they're out there, but if I know the exact code that needs to be generated, typing speed is not a bottleneck.
For me the slow part is determining what to write. And while AI helps with that (search, brainstorming, etc.), by the time I know what to write, trying to get the AI to enter those lines is often just a slowdown. Much like writing up a ticket for a junior: I could write the code faster than I could write the English-language rules describing how to write it.
There is no generated code. It is just a user interacting with a CLI terminal (via the LibreChat frontend), guided by Markdown files, with access to MCPs.
Fascinating. Do you have a longer writeup about it or an example repo for me to understand exactly how it fits together?
Using LLMs to code poses a liability most people can't appreciate, and won't admit:
https://www.youtube.com/watch?v=wL22URoMZjo
Have a great day =3
Was going to ask how much all this cost, but this sort of answers it:
> "Managing Cost and Usage Limits: Chaining agents, especially in a loop, will increase your token usage significantly. This means you’ll hit the usage caps on plans like Claude Pro/Max much faster. You need to be cognizant of this and decide if the trade-off—dramatically increased output and velocity at the cost of higher usage—is worth it."
Anyone tried Conductor? I use Claude Code and like the workflow, not sure if adding Conductor makes sense or not.
Follow-up from my last post; lots of people were asking for more examples. I will be around this morning if anybody has questions.
Can it work without Linear, using md files?
TBH I think the time it takes the agent to code is best spent thinking about the problem. This is where I see the real value of LLMs. They can free you up to think more about architecture and high level concepts.
Fast decision-making is terrible for software development. You can't make good decisions unless you have a complete understanding of all reasonable alternatives. There's no way that someone who is juggling 4 LLMs at the same time has the capacity to consider all reasonable alternatives when they make technical decisions.
IMO, considering all reasonable alternatives (and especially identifying the optimal approach) is a creative process, not a calculation. Creative processes cannot be rushed. People who rush into technical decisions tend to go for naive solutions; they don't give themselves the space to have real lightbulb moments.
Deep focus is good but great ideas arise out of synthesis. When I feel like I finally understand a problem deeply, I like to sleep on it.
One of my greatest pleasures is going to bed with a problem running through my head and then waking up with a simple, creative solution which saves you a ton of work.
I hate work. Work sucks. I try to minimize the amount of time I spend working; the best way to achieve that is by staring into space.
I've solved complex problems in a few days with a couple of thousand lines of code which took some other developers, more intelligent than myself, months and 20K+ lines of code to solve.
I’ve got this down to a science.
All of this stuff seems completely insane to me and something my coding agent should handle for me. And it probably will in a year.
I feel the same. We’re still in the very early days of AI agents. Honestly, just orchestrating CC subagents alone could already be a killer product.
I was bored yesterday and tried to vibe-code a simple React app using Claude Code, and it was basically useless. It created a good shell of the code initially, but after 10 minutes I basically had to take over (it would add a feature, then regress the previous one).
Am I the only one convinced that all of the hype around coding agents like Codex and Claude is 85% BS?
0 Days since AI post on HN