# AI Coding Agents: Quality, Complexity, and Engineering Strategy

**Podcast:** The Pragmatic Engineer Podcast
**Published:** 2026-04-29

## Transcript

What if I told you that one of the most influential AI coding agents of 2026 was built by a single developer in Austria who got frustrated with existing AI coding agents?
This is Pi, a minimalist, self-modifiable coding agent, which has quietly become the engine behind the wildly popular personal AI assistant, OpenClaw.
Mario Zechner is the creator of Pi, and joining him today is Armin Ronacher, the creator of Flask, and now an early adopter and contributor to Pi.
In today's episode, we cover the backstory of Pi and why self-modifying software is much easier to do with AI agents.
What Armin learned interviewing 30 plus engineering teams about how AI agents are changing how they work and why software quality feels like it's trending down.
The case against MCP and why CLIs are becoming so popular and many more.
If you want to hear from two very grounded voices in the industry, honestly talk about what's working and what isn't and why we need to slow down as an industry, this episode is for you.
This episode was presented by Statsig, the unified platform for flags, analytics, experiments, and more.
This episode is brought to you by WorkOS.
Engineers love to build.
Today's episode will be a great example of this.
We'll get into why and how Pi was built from the ground up.
But when you're shipping a product, some problems are better to solve with trusted infrastructure built for scale.
Enterprise features like SAML, directory sync, and audit logs are some of those.
WorkOS gives you APIs to add them in days, not in months, without reinventing the wheel.
And now let's get into the episode.
Mario and Armin, it's so good to have you here on the podcast.
Thanks for having us.
Thank you.
So as a kickoff, Mario, how did you get into tech and eventually into building AI stuff?
Oh, well, that's a long story.
How much time do we have?
So I'm a kid of the 90s, actually, and got my first PC in '96.
And the trigger for that was that I loved computer games.
We were kind of working poor, so we couldn't afford any of the Gameboy and NES, Super NES stuff.
But I had an uncle with an Amiga 500, and I would go to his place every second day and just play games there.
And eventually my parents told me, if you work, you can save up and buy yourself a computer.
And in reality, my dad would do, what's it called?
Schwarzarbeit.
Well, you're not necessarily paying your taxes on it.
Yeah, so he would do his normal job.
And after his normal job, he would go fix cars and work at construction sites.
Yeah, it's very common in Europe.
Like, I know everyone does that.
And after two or three years or so, they just said, it's time, and took me to a computer shop in the nearby big city and bought me a 486.
And that's how it started, basically.
Pentium 486?
Yeah, an Intel 486 DX, 40 megahertz, with turbo button.
And that's where I started.
And I've always been into games a lot, which also led to graphics programming.
And through sheer luck, while I was studying at university, I got a job at an applied science organization that was doing NLP stuff, machine learning, applied machine learning, basically taking research results and trying to stuff them into industry applications.
And that's where I learned the ropes of machine learning.
That was all before deep learning became a thing.
And I actually quit that kind of domain in 2010, 11-ish, because I joined a startup in San Francisco.
And then later came back and joined another startup with two friends in Sweden, where we did an ahead-of-time compiler for Java bytecode to iOS that got sold.
And since then, I have a little bit more time.
And I've always kept up with machine learning stuff, because obviously it's super interesting.
And yeah, and then GPT happened.
That's the story.
Yeah, and here we are.
And Armin, where were your roots?
My roots are definitely not working poor, but because my parents ran an architectural office where they kind of adopted computers for CAD drawing, my first computer was like old computers that they recycled.
So my first computer, even though I'm younger, was a 386.
I'm so sorry for you.
And so basically none of the computers that I ever had were capable of playing computer games properly.
Because one, they used Windows NT, which at the time didn't do anything.
So you had to sort of like build your way through it.
And like the only way that you could actually get them to run was because before it didn't know yet how to get the Windows 95 or like Windows 3.11.
That was like before it booted into either one of those, you could boot it into DOS.
Like really old DOS games at a time when you could already get better stuff.
But because it was sort of this kind of thing, I started toying around with Quick Basic a lot.
And with Turbo Pascal. I bought a bunch of books on that.
And that was my roots of learning how these things work.
And I wasn't ever really good at this, but I found it really interesting.
This idea of like...
No, for sure.
We call it a Tiefstapler in German.
No, I swear to you, when I started dabbling with this, I just really sucked.
But over time, if you keep doing this, you get better.
And then in 2002 or 2003, I used to use Delphi a lot, which was like a visual version of Turbo Pascal.
And in 2002 or 2003, someone also showed me, because I've got this idea, like I want to use Linux, and then Delphi didn't work on Linux, and then I found Python.
And through that I started doing some Python programming.
Ubuntu had just come out in 2004, and that was a venture-backed vehicle, but they created all these local communities, so there were Ubuntu associations.
So together with a bunch of friends, I started the German Ubuntu foundation, not a foundation, an association, and we ran this online community called ubuntuusers for four or five years.
And because Ubuntu was popular, the community grew, and then the scaling problems came.
That's how I got into web development.
And then for building this, I just wanted to build a templating engine, a web library, all of this.
And then eventually I bundled that together and made this Flask framework, which was very popular.
And even nowadays still is a thing that clankers like to spit out.
That's hilarious.
But I left it, and then in '13, '14...
So I worked on computer games for a couple of years in London, but then afterwards I went back to open source and I worked on Sentry for 10 years and then left in April last year to try something.
So both of you are originally from Austria.
In fact, you right now live in Austria as well, right?
You were doing games, you were working at Sentry, you also did games before.
And then the third person who's not in the room, but was on this podcast just before us, Peter Steinberger, also from Austria.
Where did the two of you meet?
Where did the three of you meet?
Because I've recently seen a bunch of photos, especially before OpenClaw and Pi started, you hanging out, the three of you experimenting, playing with AI.
I think the two of us met on the internet, right?
On Reddit?
It depends, because I definitely met you once when I was at the university.
All right.
But you didn't recognize me at that time and I was useless.
I was already famous.
But yeah, we sort of abstractly met on the internet.
Eventually we met up in Vienna.
We were screaming a lot at each other, but on the internet.
But in a very cute kind of way, in a very non-confrontational kind of way.
And even though we might not think alike in all areas of our lives, it was a cultured exchange, I would say.
So that was nice.
And Peter, I...
Like six degrees of Peter Steinberger, basically.
I was working at an office in my town and the company that gave me free office space in exchange for being like a mentor to the CEO had some kind of business dealings with Peter's company, PSPDF.
PSPDF, yeah.
And eventually came to the office in Graz and I think that's where we met the first time.
And then also the same year we met at the conference in Istanbul and just hung out for an entire night.
And that's basically where it all started.
Nice.
And then how did the both of you go from being skeptical about AI when these tools came out?
And again, both of you have, at that point, and by 2022, you've been doing a decade plus of building complex software in different domains.
What was your first reaction to it?
And then eventually, how did you kind of come across to the side of like, well, this thing is actually really interesting?
So for me, it was, I think, in 2022.
I think Copilot, GitHub Copilot, came out before GPT.
Yes, in 2021.
Yeah, and through my previous startup stuff I was working with Nat Friedman and Miguel de Icaza from Xamarin.
Because they acquired... With Xamarin, yeah, they acquired the company I talked about earlier, that Java compiler thing.
I knew Nat Friedman from our early startup stuff. He eventually moved to GitHub and then was in my DMs in 2022, I think, and asked if I wanted to have access to GitHub Copilot, the tab, tab, tab autocomplete thingy.
And I was like, I don't really care.
I don't think this is going anywhere.
And he's like, no, man, it's the future.
You've got to try it.
It's the future.
So I tried it, and it was absolutely horrible.
But yeah, when GPT came out, and especially when they started providing API access, I did a lot of projects just figuring out what works and what doesn't work, not necessarily in the coding space.
But eventually, once they had tool calling, or function calling as OpenAI called it back then, that's when they became very interesting.
But it took until, I would say, 2024, end of '24, October or so, for that to actually be useful.
And that's where the coding agents also became kind of interesting.
And then in 2025, the Claude Code team came out with Claude Code, and that introduced agentic search.
So basically, just give the agent a way to plow through your file system and read all your files, and that made the whole difference, actually.
Like all the things that came before, like Cursor with indexing and any AST-based stuff and all of that, that just went away.
And I know that the CEO of Chroma is probably mad at me for saying this, but that was the difference.
It wasn't like a dense and sparse search thing that the agent could go through.
It was just give it access to your files.
That was it for me.
That's where it clicked for me.
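As an illustration of the "just give it access to your files" idea, here is a minimal sketch of the kind of file-system tools an agent might be handed instead of a prebuilt index. The function names (`list_files`, `grep`, `read_file`) and the `TOOLS` table are hypothetical and are not taken from Claude Code, Pi, or any other harness mentioned in the conversation.

```python
import os
import re
from pathlib import Path


def list_files(root, max_results=200):
    """Return relative paths of files under root, skipping hidden directories."""
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories such as .git in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in sorted(filenames):
            results.append(os.path.relpath(os.path.join(dirpath, name), root))
            if len(results) >= max_results:
                return results
    return results


def grep(root, pattern, max_results=50):
    """Return (relative_path, line_number, line) for lines matching a regex."""
    regex = re.compile(pattern)
    hits = []
    for rel in list_files(root):
        try:
            text = Path(root, rel).read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            continue  # unreadable or binary file: skip it
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((rel, lineno, line.strip()))
                if len(hits) >= max_results:
                    return hits
    return hits


def read_file(root, rel_path, start=1, end=None):
    """Return lines start..end (1-indexed, inclusive) of a file as one string."""
    lines = Path(root, rel_path).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[start - 1:end])


# Hypothetical tool table: the model picks a tool name and arguments,
# the harness runs the function and feeds the result back into the context.
TOOLS = {"list_files": list_files, "grep": grep, "read_file": read_file}
```

In this sketch there is no embedding index or AST database; the model decides which files to list, search, and read on each turn.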
I think my path was kind of similar, because I think Copilot came out quite a bit earlier, but I know that there was a program at GitHub that gave you early access to Copilot at the time.
I think it was like this maintainers group or something where I still was in.
I got the feeling for Copilot that this will actually be really interesting, but not in any way in which it is now, because I felt like, oh, I have been in open source for such a long time and now they're training on open source data.
It's like there is something...
At the very least, this will be controversial.
I didn't think about it being productive.
I felt like, oh, this is going to be a controversial thing with training on open source data.
And I remember I was trying to probe it really...
Whether there's Flask in there?
No, I was trying to probe it really adversarially.
So one of the things that I probed on was, will it retell GPL code?
And I remember at one point I got it to spit out Carmack's inverse square root function, which was very easy because it had a very specific name.
So it was very easy to get it to recall.
But I also found out you can sort of tab in a certain way, and it would then continue putting license text on top of it.
It was completely wrong.
So it came from an open source GPL drop of Doom originally, I think.
And so it would have been GPL code if it would have done that.
But it actually attributed MIT license from a random dude.
And I was like, oh, Mr. Copilot, that's the wrong thing.
And that tweet at the time got really, really popular.
And then sort of people started like sharing with me.
Because at the time I was not really exposed to how much actual AI progress was being made in those labs.
Like I didn't come from this AI space or ML space.
So I learned about it at university, and it was like, oh, there's AI winter and then nothing happens.
But through this tweet and some other things, I recognized that there was something there.
Like, there are actually CEOs at certain companies who are convinced this will take off.
And that's how I started like paying attention to it.
And I was essentially, I was trying all kinds of stuff with the API.
Like, can you do like bug fixing things?
But I got really interested in it, but it didn't at all feel like the world is going to change until Claude Code.
And you also changed your stance on the whole, oh my God, this is spitting out open source code.
It memorized.
So because like my shtick for many years now has been that I really, I'm like a, I want people to share stuff.
Like I think like human progress comes from like building on top of each other.
And I'm a huge supporter of the fact that in the US, you can basically take knowledge from one company to another company because there are no non-competes.
Like I like this pirate kind of approach to sharing.
Yeah, spreading knowledge.
Yeah, and so, like, I was, like, my optimal version is, like, copyrights don't exist in a way, or, like, very, very, like, limited kind of version of this.
I was, like, I really didn't care that it spits out GPL code and doesn't attribute.
Like, I was, like, oh, maybe this will just completely destroy copyrights.
And, like, for me, it was, like, oh, this is, like, if that's the outcome, like, I'm fine with it.
But it was an interesting kind of thing in the beginning that it sort of, like, it sort of creates this license violation.
Like, I want to see, like, what chaos will emerge from it.
And so far, I think mostly what has emerged from it is like a strong belief now that the system in place for copyrights has some assumptions in the US about how it's supposed to work.
And we're all kind of ignoring that right now because we want to create a mess first and then re-regulate it, probably because, at least in theory, a lot of the things that we're producing right now are probably, by historic readings of the copyright interpretation, actually not copyrightable.
Yeah, that's an interesting one.
But jumping to today: an interesting thing that you did recently, and we talked about it just before, as part of your new startup building things on top of agents, is that you talked to about 30 different engineering teams, asking, hey, how are you using agents inside of your company, inside of your team?
What did you learn from large companies to startups?
I think a bunch of the learnings are entirely unsurprising. One is that whenever people had vacation, there was more time spent on trying these tools.
And just to be clear, you talked with folks at the likes of Meta, startups.
Yeah.
Like a bunch of different people, right?
So a bunch of different people from like different like European dinosaurs, like.
Why are you pointing at me?
Well, I mean, like, the European dinosaur would be someone like Siemens.
Yeah.
Or I also talked to two companies which are sort of in a critical space.
And what I mean by adoption happening when people have vacation is that when your CEO or your tech lead comes and says, you've got to use Cursor now, you've got to use Claude Code now, you actually don't get it, in a way.
Because you need to actually spend some time on it.
Like, there's a two to three week kind of thing until it really clicks for you.
And so I always felt like with the people that I knew, like I had a lot of free time.
Like I left the company in April until October.
I was like, I can dive into this.
And I was like, this is like, how does nobody get this?
It was like catnip for all.
It was crazy catnip.
I didn't sleep much, all of this.
But what seemingly happened within the companies is that there was Thanksgiving, and for the Europeans, a lot of it was over summer.
And then at Christmas, a lot of people sort of, and they also get free credits during those times.
And so like more and more people get.
Oh, you mean the AI companies often give you generous usage credits?
More and more people went into this.
And especially after Christmas, I would guess like in more than half the companies I talked to, after Christmas, it really exploded.
And it exploded in all the ways you would expect it where like all of a sudden the quality drops.
And it doesn't necessarily drop because like people want to make worse code, but because it actually takes some effort to stay within this.
And we have seen this in the startup ecosystem already in the summer last year.
If you pay attention to the YC startups, some of them have their stuff on GitHub, or had it on GitHub for some period of time, and you can look at it.
At the time there were plan .md files checked in and everything attributed to Claude.
So that vibe coding kind of thing, for prototypes and whatever that then got built out, was already out there to see.
But then gradually a smaller version of this has shown up: code bases with a little bit of vibe slop on top.
And an interesting part of this was how engineering teams and companies are now responding to that.
There are all kinds of different findings, but a lot of it has been the challenge of reviewing PRs.
They're getting larger and larger and they're becoming more psychologically taxing.
Engineers specifically are having a hard time keeping up with the longer PRs, and they're more frequent.
Yeah, and they're also, a lot of the code in those PRs is how an engineer wouldn't do it.
Because as an engineer, you sort of get a really bad feeling committing certain code because you think of your future self.
And the agent really does not care.
I will retell this story over and over, but I worked on an Xbox One game at the time, right around the Xbox One launch.
So that was like a fixed day.
It has to release on that day.
So I worked on the Halo Master Chief Collection.
And there was a game where you had like a matchmaking component and you had to like start this thing and whatever.
And it was like, it was an all hands on deck kind of situation where people had to go in and unslop the human-made slop that was the matchmaker.
And it was like, it was a system with like way too many states.
We call it an emergent state machine because it was like 16 bools on one massive thing.
And in theory, there were only six valid states.
But in reality, it was a dramatic explosion of possible states.
And that's what agentic code feels like.
Where it really should only be like a very clearly defined system.
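To make the "16 bools versus six valid states" point concrete, here is a small sketch. The state names and allowed transitions are invented for illustration; the conversation does not describe the real matchmaker's states.

```python
from enum import Enum, auto


def count_flag_combinations(n_flags):
    """Number of distinct states representable by n independent booleans."""
    return 2 ** n_flags


class MatchState(Enum):
    # Hypothetical states; only the count (six) comes from the conversation.
    IDLE = auto()
    SEARCHING = auto()
    LOBBY = auto()
    STARTING = auto()
    IN_GAME = auto()
    FAILED = auto()


# Explicit transition table: anything not listed here is rejected.
ALLOWED = {
    MatchState.IDLE: {MatchState.SEARCHING},
    MatchState.SEARCHING: {MatchState.LOBBY, MatchState.FAILED, MatchState.IDLE},
    MatchState.LOBBY: {MatchState.STARTING, MatchState.IDLE},
    MatchState.STARTING: {MatchState.IN_GAME, MatchState.FAILED},
    MatchState.IN_GAME: {MatchState.IDLE},
    MatchState.FAILED: {MatchState.IDLE},
}


def transition(current, target):
    """Move to target if the transition is allowed, otherwise fail loudly."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Sixteen independent booleans can encode 2^16 = 65,536 combinations, while the enum can only ever hold one of six values, so invalid combinations cannot be represented at all.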
But in all reality, they're like, oh, the config doesn't load.
Let's catch that and load the default config.
So instead of actually failing, it now recovers. But now your code is way more complex than it should be, because instead of failing properly, it is now recovering and entering these many more failure states.
And that makes it much harder to work with this code, because you also cannot really ask the agent to reflect on that, because it was like, oh yeah, this could be possible, so we need to maintain this variant.
I think it's kind of even worse than what you described about your human-made complex system.
Because there are moments of brilliance in agents where they spit out perfectly fine, simple code.
Exactly the amount and type of code you did need for that specific thing.
And you as the steering engineer are looking at that like, wow, this is amazing.
I can just sit back and not care because it's obviously doing the thing.
Two minutes later, you have another agent running in this window and it spits out the worst, horrible garbage.
But you might not notice because now you have fallen into automation bias and think your agent is doing the job well.
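The config-loading fallback described a moment earlier can be sketched side by side with a version that fails properly. This is an assumed illustration: the JSON format, `DEFAULT_CONFIG`, `ConfigError`, and both function names are hypothetical.

```python
import json
from pathlib import Path

DEFAULT_CONFIG = {"retries": 3}  # hypothetical defaults for illustration


def load_config_with_fallback(path):
    """The pattern being criticised: swallow any error and carry on with defaults.

    The caller can no longer tell a missing or broken config from a valid one,
    so every later code path has to cope with both possibilities.
    """
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except Exception:
        return dict(DEFAULT_CONFIG)


class ConfigError(Exception):
    """Raised when the config cannot be read or parsed."""


def load_config_strict(path):
    """Fail properly: one success state, one clearly reported failure state."""
    try:
        text = Path(path).read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as exc:
        raise ConfigError(f"cannot read config {path}: {exc}") from exc
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ConfigError(f"invalid JSON in {path}: {exc}") from exc
    if not isinstance(config, dict):
        raise ConfigError(f"config in {path} must be a JSON object")
    return config
```

The strict version is longer here only because it reports each failure precisely; the program around it stays simpler because it never runs in a half-recovered state.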
Do you think this might be a bit of a human bias?
Because, you know, typically when onboarding a new engineer, you have a new joiner, a new grad, you review their code.
And if it's terrible code, you will review the next one thoroughly until they get to the point that, oh, they write the code that I would.
And then it typically takes, you know, six months or a year or something like that.
But then, you know, I can trust this person.
Yes, but you don't have anything like that with agents.
Like agents don't learn.
You can put as much stuff in an AGENTS.md or build a memory system, but that's not the same type of learning that a human does.
Obviously, humans are fallible as well, no matter what, but they have some capability of learning.
And retaining that learning, right?
Yes, and they also feel pain.
I think that's one of the defining things about humans.
It kind of ties back to what you said.
Eventually, if the pain gets too big, you as a human are incentivized to fix the cause of your pain.
And in the code base, the cause is usually terrible interfaces, terrible complexity that you want to get rid of because you can no longer maintain that system.
Isn't this why, just going on to, you know, like senior engineers are always in demand because the CEO sees a senior engineer as like, they just get it done.
But in reality, a senior engineer, most senior engineers who are effective, they've had battle scars.
They've been burned.
They felt the pain.
They saw what happened when they let tech debt spiral.
So they now make all these decisions that they know will help avoid it.
And of course, through this, progress goes faster.
I personally think, and your mileage may vary, but a good engineer is an engineer that says "no" a lot, and "I don't need this" a lot.
Because that keeps complexity down.
If you're using agents, the exact opposite happens.
You say, yes, I want this, and that, and I want this, and I want this, because I don't have to type it myself.
I don't have to think about it.
I just give the little machine a prompt and it will spit out something that kind of looks like the thing I wanted.
Good enough.
And that's where all the problems start.
And one thing that I also think is like good engineering is all about knowing the trade-offs that you have to make.
And there is sometimes the right solution is actually if you were to sort of like sit at university and learn about it, you kind of learn that you shouldn't be doing this in a way.
I think Cal Henderson had this once where he said, you do the dumbest solution first, until it doesn't work anymore.
Because the actual problem is there's so much stuff that you need to do that if you actually do the right solution, the correct solution, all of this, you're creating the kind of complexity that kills you at scale.
And the engineer learns that, but also if you don't have that battle scar, it's actually very hard for you to argue correctly, because it is this learning process that gives you the authority to then convince other engineers in the engineering org that you should be doing it this way.
That is part of it.
You learn that.
But the other thing is also that the agents give you now world knowledge access.
And one of the other things that I learned through interviewing engineering teams now is that the senior person says no, knowing something.
And then 48 hours later, the junior comes by and says, I talked to the agent, and I already had this inkling, but now I have all the evidence of why we shouldn't be doing it this way.
Because previously, you really didn't have that ready-made access to...
Someone who can tell a senior off.
Yeah.
And this creates other stresses now that were previously...
Like, not every team has that.
It's like people going to the doctor with a ChatGPT printout and saying, this is what the machine said, you better do that.
Is it fair to say that, based on what you're seeing and the teams you're talking to, we might face a thing where it's very hard for experienced engineers, where it's just harder for them to say no, in spite of the product manager or a junior engineer saying...
It's much worse because the product manager now comes in and sends pull requests and automatically ships them.
Yeah, that's another thing.
Non-engineers participating in engineering processes is a thing now.
Ask Armin how that works.
Ask him, how does it work?
How does it work, Armin?
Well, it's hard because if...
Because on the one hand, like, it's well-intended, right?
If someone who is, like, not an engineer...
What is your experience?
Is this your company talking with other people?
So, first of all, we have a little bit of this at Earendil.
We're small, and so my co-founder, for instance, sometimes sends a pull request on the website.
I talk to people that have that at scale, but, like, the marketing team all of a sudden does stuff on a website.
And the sales team creates ever more elaborate sales demos that sort of end up in a GitHub org.
And one of the funniest ones was where the sales demo built a feature that didn't exist, but nobody noticed, right?
So this is all new, right? Because previously none of that happened.
But I think it's empowering. There's a good thing to it too, if everybody in your org can participate in the creation of software in some form, right?
Previously, people couldn't do that.
Like you had a designer who could figure something out in Figma, but they might not be able to kind of put it into a clickable dummy demo, whatever.
You might have a PM who wants to try out a feature without kind of wasting time of an engineer.
Now you can do that.
The problem is that people are now so focused on everybody can do everything now that they forget that you still need a process to kind of guardrail all of that.
And the integration part is the hard thing.
Peter gave this idea of like the prompt request, but I'm actually really warming up to this idea.
Like once you've demonstrated it, I no longer need your code.
And just to recap, the prompt request was him saying that he doesn't like to get pull requests and said he would rather see the prompt because he will run the prompt or he will tweak it and it will generate it in the style that...
For me, it's less about, I want to see the prompt, and more about, what is it supposed to be doing?
Because in many ways, I think the interesting part is that often you don't really fully know what you wanted to do in the first place, and so the act of creating clarifies what you really want to do.
And so that part is highly valuable.
Often the approach and the code that comes out of it is not what an engineer with sufficient seniority would have done.
So it's not like, I want your prompt so that I can re-clank my clanker so that it does it slightly better, but more like, now that we know what we wanted to build, it's probably faster for me to start over.
Yeah, and I also kind of disagree with Peter on, I just need your prompt.
I actually value seeing a terrible implementation of something.
Like if I get a pull request, and most of the pull requests we get on the Pi repository are made by agents without a lot of human touch, let's say, then I immediately know, okay, this is going to be garbage.
But it's valuable garbage because someone has put in at least a minimum amount of thought instructing their agent to create this pull request.
And I get to see what a shitty implementation of what they wanted to build looks like.
And I don't need to waste my own time on trying that out.
So somebody else tried it out already, the naive, dumb "agent, do the thing, make no mistakes" version.
And that saves me time.
I'm not saying I like pull requests by agents because they're terrible and I auto-close them now.
But they have value.
It's not just a prompt.
It's on an exponential, right?
The speed of everything.
I mean, it's a sigmoid eventually, always, because of thermodynamics, but I think we're going to find out way earlier than in previous cycles that this is a bad idea.
That's good news.
What I think is going to be interesting, and I don't know the answer to this, but I read this fascinating retelling of the British Industrial Revolution and how it changed the textile industry.
The Industrial Revolution, yeah.
Yeah, and so the general thesis of that article was like...
Every time something at the head of the pipeline got optimized, it created an incentive downstream of the whole thing to create something, right?
So in the beginning, if you can weave the thing faster, then eventually you need to have yarn that can be woven at faster speeds.
Then eventually everything sort of moved the bottleneck all the way down.
And like ultimately the biggest bottleneck in the entire thing turned out to be what I think like is actually the next bottleneck we're hitting in engineering, which is like.
At one point you made a shirt and if you didn't like the shirt, you went back to the person that made it and they fixed it up for you.
And so the actual thing was like, if the shirt is bad, nobody cares about anyone who destroyed the shirt in the process.
You're just going to get a new one.
The responsibility actually went from anyone in this chain to the entire factory as a whole doesn't have to carry responsibility anymore because we have commoditized the whole thing so much that you don't have to do this.
And if you take the engineering approach of it, it's like...
A pretty significant part of running a company and running a service is like running it reliably.
And so you have these postmortems on incidents to figure out like what went wrong in the process.
And you go back and fix the shirt.
Yeah, and the thing is, we are all running on this idea that every engineer who is in this creation process that ultimately led up to it carries some responsibility.
And that we are going to that person, not to blame that person, but to figure out, like, why...
What went wrong here?
And so like if you do, if like the machine now produces stuff at like 10 times the speed, the responsibility thing does not scale in the same way because the machine cannot yet be responsible.
And I don't actually know if there is a future where you can abstract away human failure so much in how we run engineering that now the entire company now no longer cares about who signed off on a pull request or something like that.
That would be automated in the same way, I think, as we are sort of automating T-shirt creation.
I just don't yet see that.
So here's the thing.
I think one thing we software engineers or IT people underestimate is just how freaking complex the world is and how much human squishiness is in each little nook and cranny and corner, right?
So we were thinking, oh, we were now able to automate that thing.
Now we can automate everything, like every bit of knowledge work.
But we as software engineers are so bad at becoming domain experts that we don't see all the non-machine parts that go into a workflow.
And we are running through the same fallacy here again.
We are seeing models doing incredible things.
I'm not disputing that.
For me, this is like, whoa.
Basically, all my research in the 2000s is now null and void because transformers can do all the things.
But we are overextending that to everything, like we always do in software.
Like we did in EdTech.
Yeah, we have tablets in classrooms now.
Sure, now it's solved.
Education is solved because we now have computers.
Well, in fact, I've heard, I don't know which country it was, but they're now rolling back.
Sweden.
Sweden, they're taking the tablets out from the classroom.
It turns out if you do some scientific investigations into the didactics and effects on pupils, if you just throw a bunch of tablets into a classroom, close it and hope for the best, it turns out the best is terrible.
So yeah, for me, I think the biggest takeaway in the past two to three years is that the hype is terrible because it dehumanizes everything, and I want to not be part of that circus.
Well, speaking of not wanting to be part of the circus, let's talk about Pi, which is a very popular...
Let me get my clown nose.
...and also minimalist coding agent.
Can we start with the backstory of why you decided to build Pi at a time when there were already agent harnesses around, right?
Because they were suboptimal.
Tell me more.
Yeah, sure.
So I was a believer in Claude Code, just because they kind of created that whole genre through the invention of agentic search.
I mean, invention.
There were precursors to that, and shoulders of giants and so on, but they were the first that packaged it up in a really compelling package.
And at the time, that fit my workflow really well.
It was simple, it was predictable, sans the LLM's stochastic nature of being kind of unpredictable, but everything around the LLM was kind of nice and tidy and easy to know.
So you were a happy user of Claude Code, right?
I was super happy, I was proselytizing it.
But eventually the team started dogfooding and getting more and more tokens, I guess, and kind of increased velocity and team size.
And with that came more features and much, much, much more bugs.
And I personally like simple tools that are stable, that I can rely on, even if they have non-deterministic parts.
But all the deterministic parts should be as stable as possible.
And that was just not the experience with Claude Code around summer 2025.
So I kind of soured on that real hard.
Was it bugs?
Was it unexpected behaviors?
They take away your control of the context.
They would inject stuff behind your back, which is bad.
And then your workflows that used to work stop working because there's now a system reminder that you don't even see in the UI that will modify the behavior of the model.
They would also do this to the system prompt.
I reverse engineered...
I mean, I wouldn't call opening an obfuscated JavaScript file and unobfuscating it reverse engineering, coming from a more low-level background.
But I reverse engineered Claude Code during the summer of 2025 and built a little service where I can track the progression or evolution of the system prompt and tool definitions in Claude Code.
And it's like, every release, it was messing with stuff.
cchistory.mariozechner.at, if you want to see that.
And yeah, that just messed with my workflows, and I don't appreciate that.
If I commit to a development tool, I want it to be a stable, reliable thing, like a hammer.
I don't want my hammer to break in a different spot every day.
Yeah, that's terrible.
So that's what happened with Claude.
But again, I'm not roasting the team.
I think some of them are really nice people I got to know on the internet.
They're just dogfooding, and that's perfectly fine.
We need somebody who goes the full-velocity kind of way, but I don't want to work with a tool like that...
Yep.
...because I can't get work done.
It sounds like with move fast and break things, the break things part was not for you.
No.
And then I looked into alternatives, and Amp and Droid came out around that time, I think.
Pretty early in 2025.
I don't remember.
Amp was very early.
I think they sort of spun off from the same experience of taking...
Because I think Amp was around when Claude Code came out.
I think so, yeah.
In any case, I looked into those harnesses, and they were super good.
They were just super expensive as well.
Because none of them could basically use what made Claude Code enticing on top of it being a cool tool: the subscription.
And that works in an enterprise setting where you're paying by token anyways, but it doesn't work for the small tinkerer in the garage.
While I'm not a small tinkerer in the garage in the financial sense anymore, I kind of still relate to that community and I would like to use my subscription with something.
So I looked into open source alternatives and found OpenCode.
But while that kind of vibes with my OSS roots, it too did stuff to the context behind my back that I didn't appreciate.
Pruning tool results after a certain amount of tool result token output, or asking an LSP server after every single edit the model makes whether there is an error.
Yes, there will be an error, because the model isn't done yet with its work, so the code doesn't compile, so the LSP server will...
So like reaching LSP, the language...
Language server protocol server, yes.
So when you go into VS Code and you type some TypeScript, you have in the bottom some error diagnostics, and that comes from an LSP server for TypeScript.
And OpenCode runs an LSP server on your behalf in the background and feeds the model with diagnostics from that server on every edit.
We as programmers, how do we work, right?
We go into one or more files, we edit line after line after line, and only then look at the errors that resulted from that.
In OpenCode's case, or in the case of other harnesses that also support LSP, the model calls an edit tool to change lines.
And they would inject the diagnostics after every edit call.
And that's just not smart, because now you're confusing the model with: you have an error, you have an error, you have an error.
And the model is like, yeah, I know, I know, I'm not done yet.
It's not.
Yeah, it's not great.
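The alternative described here, checking for errors once the edits are done rather than after each one, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `EditSession`, `Checker`, and `toyChecker` are hypothetical names, and the toy checker merely stands in for a real language server. It is not how OpenCode or Pi is implemented.

```typescript
// Hypothetical types; not the API of OpenCode, Pi, or any LSP client library.
type Diagnostic = { file: string; line: number; message: string };
type Checker = (files: Record<string, string>) => Diagnostic[];

class EditSession {
  private files: Record<string, string> = {};
  private check: Checker;

  constructor(check: Checker) {
    this.check = check;
  }

  // Apply an edit silently: nothing is reported back to the model here.
  applyEdit(file: string, newText: string): void {
    this.files[file] = newText;
  }

  // Run the checker once, after the model signals it is done editing.
  finish(): Diagnostic[] {
    return this.check(this.files);
  }
}

// Toy stand-in for a language server: flags lines that still contain "TODO".
const toyChecker: Checker = (files) => {
  const found: Diagnostic[] = [];
  for (const file of Object.keys(files)) {
    files[file].split("\n").forEach((text, index) => {
      if (text.indexOf("TODO") !== -1) {
        found.push({ file, line: index + 1, message: "unfinished code" });
      }
    });
  }
  return found;
};

const session = new EditSession(toyChecker);
session.applyEdit("a.ts", "const x = TODO;"); // intermediate, broken state
session.applyEdit("a.ts", "const x = 1;"); // final state
const diagnostics = session.finish();
console.log(diagnostics.length);
```

The intermediate broken state never produces a diagnostic, because the checker only sees the files as they stand when `finish()` is called.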
Anyways, TLDR, OpenCode wasn't for me either.
It was also, I had to fork it to modify it, which I don't think should be necessary.
So then I just thought, how hard can it be?
I built my own little thing.
And then your own little thing is pretty minimalistic.
What does it use?
What's the basics of Pi?
The basics of Pi are my own abstraction over all the LLM provider APIs, because I didn't like the Vercel SDK, the Vercel AI SDK, for various reasons.
Armin kind of wrote a blog post eventually about that as well.
It's obviously good to use.
A lot of people use it.
It just didn't fit my old man sense of abstraction.
This is the beauty of software, especially open source.
You can build your own, always.
Yeah, and now with agents, you can even do it faster and produce terrible complex software.
So I built an abstraction for that, and I built a little generalized agent loop with tool calling and streaming, all of that.
I built a bespoke little TUI that doesn't flicker, or not a lot.
And then I tied it all together into a coding agent that looks like Claude Code or Codex or whatever you have.
That's it.
And the extensibility comes from the fact that this minimal core has so many hook points that you can basically hook into with a simple TypeScript module that gets loaded into the same Node process.
That allows you to do things like provide the LLM with custom tools, do your own compaction implementation, or fully revamp the TUI itself.
You can modify everything in the TUI.
So if you have a special terminal UI.
Yes, exactly.
If you want the TUI to behave differently for a specific workflow you have, like say you're non-techie, you can change the TUI to become whatever you need as a non-techie.
And I have a couple of non-techie friends that did that because they don't need to know how to build this.
They can just ask Pi to build it and Pi will modify itself.
Oh, so this is the thing, right?
So you can ask Pi to modify itself because of the extension points, and it can write code that extends itself.
And it's trivial, but it's a big unlock.
Is this what you meant when you said that for OpenCode, you needed to fork it to modify it?
It doesn't have this.
It does have a plugin system, but there's not a lot of extension points, and it was very rigid.
I think they changed it recently.
I think it's much more open now.
I haven't kept up with it, but...
Might be better now.
So I guess Pi starts as this very minimalistic thing.
As I understand, the tools it has are read, write, edit, bash.
That's all you need.
That's it.
And then you can actually start to make it your own.
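As a rough illustration of the idea of a minimal core with hook points, the sketch below shows a core that holds four stub tools and an extension module that registers one more. Everything here is hypothetical: `AgentCore`, `registerTool`, `Extension`, and the `word_count` tool are invented for the example and are not Pi's actual extension API.

```typescript
// Hypothetical extension API for illustration only; not Pi's real hooks.
type ToolArgs = Record<string, string>;

interface Tool {
  name: string;
  description: string;
  run: (args: ToolArgs) => string;
}

class AgentCore {
  private tools: Record<string, Tool> = {};

  registerTool(tool: Tool): void {
    this.tools[tool.name] = tool;
  }

  toolNames(): string[] {
    return Object.keys(this.tools).sort();
  }

  callTool(toolName: string, args: ToolArgs): string {
    const tool = this.tools[toolName];
    if (!tool) throw new Error("unknown tool: " + toolName);
    return tool.run(args);
  }
}

// An extension is just a function that receives the core and hooks into it.
type Extension = (core: AgentCore) => void;

const wordCountExtension: Extension = (core) => {
  core.registerTool({
    name: "word_count",
    description: "Count whitespace-separated words in the 'text' argument",
    run: (args) => {
      const words = (args["text"] || "").split(/\s+/).filter((w) => w.length > 0);
      return String(words.length);
    },
  });
};

const core = new AgentCore();
// Stubs standing in for the four built-in tools named in the conversation.
for (const builtin of ["read", "write", "edit", "bash"]) {
  core.registerTool({ name: builtin, description: "built-in (stub)", run: () => "" });
}
wordCountExtension(core);

console.log(core.toolNames());
console.log(core.callTool("word_count", { text: "pi modifies itself" }));
```

Because the extension only needs a reference to the core, an agent that can write a module like `wordCountExtension` can, in this toy model, extend the harness it is running in without forking it.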
What are examples that people would add?
Pi doesn't have MCP.
People just ask Pi to build MCP support into Pi.
Pi doesn't have a plan mode.
Armin goes, my plan mode must be fantastic and bespoke.
I don't have a plan mode.
Yeah.
But he has, like, five implementations of a plan mode until he realized plan mode is entirely useless.
Other people just like messing with the UI and making it their own, like, a different visual style of the editor box where you enter your prompt stuff, like trivial stuff, more cosmetic stuff.
Other people have retrofitted it into a full-blown RL environment for open-weights models, where they use Pi as the agent that is part of the RL execution environment.
So it's...
You can do anything really.
What drew me to it beyond like actually using the library abstraction was in fact the custom tools part.
Because one moment for me was over Christmas again like many people.
I had some time and I tried to build other things and Peter was talking to me in November that he's like vibing before looking at code more or less.
I don't know exactly how he said like he said he can do this now.
Like okay I want to build a thing where I don't look at the code.
I want it to not look like slop.
I wanted a version of it where afterwards, even though I don't really look at the code, it should look like what I would have written.
And I wanted to make a game.
And so then I basically started the whole experience with just basic Pi.
I was like, I want to build a game, but actually before we build a game, I want you to set up the code base in a way that you can validate the changes that you're making, but also I can see them.
Like a two-pronged kind of approach.
I wanted to be in the loop.
but also have the agent be able to validate itself.
And what sort of emerged out of that was, well, first of all, like, it built itself some debugging tools into the game so it can make screenshots and, like, run a simulation and sort of dump out state and read it again.
But also Pi can show images in the TUI, and I added a bunch of, like, I talked with the clanker to figure out what would be interesting things to do, but we ended up having all these screenshots I can tap through quickly in the UI, or I can...
Pi had also this great feature that can reverse to an earlier state in the conversation, and then it can branch within the conversation to build a bunch of stuff around that.
Because these sessions, especially with screenshots, and it became very token inefficient very quickly.
It was actually one of the other things that Pi was rather quickly, rather good at, was having a lot of screenshots in it.
Because OpenClaw people had a lot of screenshots in their chats, and OpenClaw is using Pi.
But having this, it felt really magical for me to actually treat the problem as, I don't know what the right way of engineering here is, but very clearly part of it is, like, I should be in the loop so we can figure out how to do that specifically for the problem at hand.
And it turned out like for web project and computer games and some of the other things I tried, they're kind of different, but very many of them are sort of come down to a similar thing where like the agent interacts now with my program and should do the most optimal way.
And I want to interact with it in conjunction with it interacting with the program.
And the entire experience should be as little confusing as possible to both me as a human and to the agent.
And I found it very, very fascinating just to see how that emerges.
Where like your tool all of a sudden when you launch it in this program looks and feels different than if you launch it in the other program.
I really like this point Armin made just a few seconds ago, that AI works best when the engineer stays in the loop and the system can actually validate what changed.
And this is a great time to mention our season sponsor, Sonar.
AI can now generate code faster than you can verify it.
Sonar, the makers of SonarQube, sees this leading to a serious gap in verification.
With the rise of coding agents autonomously writing code, verification is no longer a nice-to-have.
While the latest coding models are extremely intelligent, they also are error-prone and they don't fully understand your code base and your context or your objectives.
This is why verification must be mandatory in agentic workflows.
SonarQube provides a zero-trust, multi-layered approach to code verification that is consistent and repeatable.
It analyzes semantic syntax, data flows, and architectural boundaries at agent speed, acting as a critical trust and verification layer before any code reaches production.
Covering 40-plus languages and 7,500 issue types, SonarQube is the most comprehensive code verification platform available.
And with easy integration via MCP, CLI, and hooks, it fits right into your existing AI tool chain.
Let agents move fast and have SonarQube as the independent, multi-layered verification for safe, reliable, and auditable agentic development.
Head to sonarsource.com slash pragmatic to start verifying your agentic workflow today.
I'd also like to talk about our presenting sponsor, Statsig.
Statsig builds a unified platform that enables both experimentation and continuous shipping.
Built-in experimentation means that every rollout automatically becomes a learning opportunity with proper statistical analysis showing you exactly how features impact your metrics.
Feature flags let you ship continuously with confidence.
And because it's all in one platform with the same product data, teams across your organization can collaborate and make data-driven decisions.
To learn more, head to statsig.com slash pragmatic.
With this, let's get back to the episode and to the topic of general versus purpose-made tools.
I spend a lot of my youth on construction sites to earn money.
And you don't use a hammer for all your problems at a construction site.
You have a screwdriver, you have your hammer, you have your drill, you have whatever.
And I think in engineering, it's kind of the same.
I'm not using the same tool for every task I do as an engineer.
So now if I use an agent, I don't want a general agent for every task per se.
I want a specialized thing where I know the performance will be top notch for that specific task, because we built the harness in such a way that the agent can be most effective at this task, just because of the way the harness is constructed.
And that's what I wanted to enable with Pi.
That said, I'm probably the person that has the least amount of modifications in Pi.
I have like two extensions that I use and they're trivial.
They're basically just, if you see a URL that looks like a GitHub issue or pull request thing, pull down the details via the GitHub API and display a small little widget on top of the editor that gives me the issue title, the author account, and a link to the issue.
That's basically all I do.
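The first half of an extension like that, spotting GitHub issue and pull request URLs in text, might look something like the sketch below. The function name `findGitHubRefs`, the `GitHubRef` shape, and the example URL are hypothetical, and the sketch stops short of calling the GitHub API or rendering a widget.

```typescript
// Hypothetical helper: only detects URLs, does not fetch anything from GitHub.
type GitHubRef = {
  owner: string;
  repo: string;
  kind: "issue" | "pull";
  number: number;
};

function findGitHubRefs(text: string): GitHubRef[] {
  // Matches https://github.com/<owner>/<repo>/issues/<n> and .../pull/<n>.
  const pattern = /https:\/\/github\.com\/([\w.-]+)\/([\w.-]+)\/(issues|pull)\/(\d+)/g;
  const refs: GitHubRef[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    refs.push({
      owner: match[1],
      repo: match[2],
      kind: match[3] === "issues" ? "issue" : "pull",
      number: Number(match[4]),
    });
  }
  return refs;
}

// The URL below is a made-up example, not a real issue.
const refs = findGitHubRefs(
  "Please look at https://github.com/example-org/example-repo/issues/42 first."
);
console.log(refs);
```

Each detected reference carries enough information (owner, repository, kind, number) for a later step to look up the title and author and show them above the editor.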
Well, it might work for you as a minimalist.
Yeah, I mean, that's how I work on the pi-mono repository, because I might have two or three sessions open in which I process an issue or pull request.
That way, I remember what the session was about.
But it sounds like you also made your Pi specific to working on the pi-mono repository.
And if you were working on a, if you went back to building games.
I never thought of the fact that you might want a different harness for a different task.
I guess we just kind of assume that most developers, you work on your main thing at work.
You might have a side project and just experiment with whatever.
But I wonder if this is a new thing that we could never have.
We could never have custom tools for a project.
That just sounds crazy.
My intuition is this.
I think where we are going is software that modifies itself on behalf of the user's wishes and needs.
And the agents can do that now if you give them enough rope to modify themselves.
And I think with Pi, that is my first foray into this kind of self-modifiable, malleable thing, just for the coding agent sector.
But I think this actually can be extended to all kind of knowledge work.
So I agree.
For specific tasks within the broader set of knowledge work, obviously, dehumanization and so on, you know.
But yeah, the next plan here is actually to have an alternative user interface to the TUI, because the TUI is obviously limited.
And the best alternative stack is obviously the web, because it works everywhere and can do anything.
So once I have that built out, then it really becomes interesting, because then you're not limited anymore to the line-based rendering of a terminal.
Now you can do really, really interesting stuff.
And so yeah, we'll see how that works out.
And one reason that I learned about Pi, before I knew that it was this minimalist interface, is how OpenClaw is using Pi.
How did that come about?
We were hanging out and reviewing each other's blog posts and just throwing ideas at each other.
And in October, I started building out Pi and Peter started building out warelay, his little WhatsApp assistant.
Oh, that's how it started.
Yeah.
And he was in search of an agent core he could reuse or copy.
I think it started out by him taking Pi and cloning it and calling it Tau and then modifying it.
But eventually he got tired of having to maintain that.
So he just said, I'm going to use your stuff.
And that's how it ended up being.
I wouldn't have compaction if it weren't for OpenClaw.
I specifically built that because Peter was crying in chat and I need compaction.
Okay, you get compaction.
But I'm going to tell all my users, don't use compaction.
It's bad for you.
Yeah, but that's, I guess, the beauty of building on top of one another's open software, right?
I mean...
It has pros and cons, yes.
I now get to enjoy all the OpenClaw instances that think bugs in OpenClaw are actually Pi bugs.
So they autonomously send me a gazillion issues and pull requests without their users probably even knowing, and I get to deal with that in my open source.
So that's a negative side effect.
Well, so you're really on the receiving end of this, I guess.
I mean, just like OpenClaw itself is, which is much more exposed to this problem.
I mean, there are tens of thousands of issues now, and there's no way they can get a good grip on that.
But how are you dealing with the fact that you now have OpenClaw just AI autonomously opening things on your repo as a maintainer?
Do you build tools to battle this and try to close them out?
I built a tool for the OpenClaw ones, which embeds issues and pull requests into a 3D space so I can see the clusters of similar things that agents would have sent to the repository, and then I can bulk select things and close them out.
Oh, really?
So you actually have a 3D, like, visualization?
OpenClaw, for context, I think it's less crazy now, but from the end of December to, I think, mid-February, I mean, it was exploding, obviously, but this explosion almost directly translated to: I was on this repo refreshing pull requests and the number went up.
Yeah.
We actually tried to contribute and help out Peter a little bit, but I immediately gave up.
I didn't know how to do anything useful there.
I was looking at this and I was like, this is a type of software engineering I'm just not used to.
I would fix two things and spend an hour on them, and then five minutes after I committed and pushed it, some clanker comes along and just reverts my fixes.
And this is not how I work.
Can we talk about the name Clanker?
Oh, sure.
So Clone Wars, Star Wars.
I actually never watched it, but kids of friends of mine watched it a lot while we were visiting them.
So I kind of, through osmosis, got the lore.
And there is an army of robots, and the Jedi would call them clankers, or people would call them clankers, because when they move, they clank, clank, clank.
Yeah.
That's the origin of that.
So an AI, a droid.
Yeah, exactly.
But coming back to the how do you deal with the influx of agentic pull requests and issues, I just autoclose every pull request.
Human agent doesn't matter.
What I do is if I haven't had contact with you previously, my GitHub workflow knows about this because if you had, you're in a file in my Git repository, your account name.
So if you're not in there and you send me a pull request, your pull request gets auto-closed.
And then my little workflow posts a comment under your pull request that says, hey, thanks so much for contributing, really appreciate it.
Could you please open an issue in a human voice?
No longer than a screen's worth of text.
And if I like it, I type, looks good to me.
And then that account name gets put into the file.
And the next time they send a pull request, they pass.
And it turns out...
agents don't see the comment my GitHub workflow posts underneath their pull requests.
So this is a great filter for filtering out agents and keeping the humans safe, more or less.
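The gating logic described above can be sketched as a small pure function. This is an assumption-laden illustration, not the actual workflow: the `PullRequest` and `Decision` shapes, the function names, and the account names are hypothetical, the comment text paraphrases what was said here, and a real version would read the approved accounts from a file in the repository and call the GitHub API to close and comment.

```typescript
// Hypothetical shapes; this is not the GitHub API or the actual Pi workflow.
type PullRequest = { number: number; author: string };
type Decision =
  | { action: "keep_open" }
  | { action: "close"; comment: string };

const CLOSE_COMMENT =
  "Thanks so much for contributing, really appreciate it. " +
  "Could you please open an issue in a human voice, " +
  "no longer than a screen's worth of text?";

// GitHub account names are case-insensitive, so compare in lowercase.
function isApproved(author: string, approved: string[]): boolean {
  const wanted = author.toLowerCase();
  return approved.some((account) => account.toLowerCase() === wanted);
}

// Unknown authors get their pull request closed with a friendly comment.
function gatePullRequest(pr: PullRequest, approved: string[]): Decision {
  if (isApproved(pr.author, approved)) return { action: "keep_open" };
  return { action: "close", comment: CLOSE_COMMENT };
}

// Called once the maintainer has signed off on the author's issue.
function approveAuthor(author: string, approved: string[]): string[] {
  return isApproved(author, approved) ? approved : [...approved, author];
}

let approved: string[] = ["known-contributor"];
const first = gatePullRequest({ number: 1, author: "newcomer" }, approved);
approved = approveAuthor("newcomer", approved);
const second = gatePullRequest({ number: 2, author: "newcomer" }, approved);
console.log(first.action, second.action);
```

The first pull request from an unknown account is closed, and after approval the second one from the same account stays open, which is the bottleneck behaviour described in the conversation.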
It's interesting.
I wonder if this might be an unavoidable future where we need a way to separate, is this coming from a human with an intent or an AI?
I don't necessarily care.
If it were actually a good PR, then if it came from a machine, it's actually fine-ish.
I think what's interesting in Pi and OpenClaw even more so is it accumulates pull requests where actually there was no intentionality behind it at all.
And so the person that dispatched the machine didn't actually care that much about it.
Or didn't even know about it.
Or didn't even know about it.
And I've done open source for many years.
And there was also...
There was a big difference between someone sending a pull request up or like an issue and was like, hey, please fix this.
But actually didn't care enough to even reply to questions anymore.
Like this is not uncommon.
And then you don't actually have to fix that.
But you have to close it out because like maybe it's still useful input, but like clearly that person wasn't caring enough.
And with the pull request is even worse now because they come in so quickly that many of them cannot be merged anyways without manual resolution of the conflict.
There's a lack of back pressure mechanism.
Because even I as a human, if I see there's like 500 pull requests open, I probably will not contribute to this thing now.
Because at worst, I will make it worse.
And I think previously in open source, you had the people who would just send issues and be very entitled and say you're the worst person on the planet if you don't fix my little issue.
But that's fine.
That can be handled.
And pull requests were kind of special because it needed a human to invest quite a bit of time to produce them.
You don't have that anymore.
You just have people going, oh, this should be easy. Agent, please do a thing, make no mistakes, send it to this repository.
And that's just not going to happen.
So basically, what we need are bottlenecks.
I don't necessarily need human verification, or a verification that you're human.
I just need a bottleneck that allows me to process the amount of incoming things as a human, because in order for Pi to not deteriorate into a pile of garbage, I still believe that it needs me and other capable people reviewing at least the important code.
And for that, I need bottlenecks, because otherwise I can't deal with it.
It's the second law of thermodynamics, right?
It's like everything degrades towards chaos and you have to put extra energy in to keep it away from this outcome.
And we don't see and feel like the pain of the codebase anymore if we stop looking at it.
And people don't feel the pain or like they feel no restraint anymore.
And the issues are also interesting because on the one hand, it is something great about someone doing an investigation and sending you a description of that.
That can be good and can be bad, but they look very similar.
Like it takes quite a bit of energy to tell apart a good and a bad AI generated issue request.
And unfortunately, like most of them are not great.
But some of them are actually good.
And that's also kind of, it's weird.
Like all of it is weird.
I really don't know what the future of open source is in many ways, because a lot of open source really worked because people piled on hard problems.
And so they congregated around it and said like, now we need to have a good database.
So we're going to put all this energy on building a good database.
And so the value of open source came from there's some hard problems and we're going to throw our energy together and we're trying to figure out how to solve it.
And now it feels like open source is all about, like, throwing stuff up.
What really made me so mad was people... particularly, a lot of agentic engineering right now is like building more stuff for agentic engineering.
So it's like, it's an ouroboros, or whatever you call it.
And I see this tweet and it's like, oh, I solved problem XYZ and here's my solution for it.
And you click on this thing and it's like, it's 48 hours old.
That person probably never used the thing that they built.
I would like to suggest to the viewership to look at Armin's GitHub account over the last year and what happened there.
Yeah, I built a lot of this stuff, but I don't then go on Twitter and say like, hey, I solved the problem.
It's like I have a shit ton of vibe slop on my GitHub account and I wish I could mark it differently because maybe there's some utility in it.
But unless you're going to actually have that code base still be there a year, a year and a half from now and someone is still using it.
The utility of that is actually not validated in a way.
And there's so many markers and metrics you can look at now for GitHub that really demonstrate this explosive growth of it.
But if you were to then maybe find some other number to see like how many of the things that are being created are actually turning into like really fundamental pieces that can sustain open source communities that can actually deliver this value that scales amazingly.
We haven't actually created many Vibe-engineered projects that have become that.
But I like how you mentioned energy and how open source always worked.
If we just think pre-AI, again, let's say Linux, the most successful or widely used open source project, it has both an energy and a structure.
People come in with intent that they want to add something.
They have a process where it goes through.
There's human trust at every level.
There's a little pyramid.
And in the end...
It all goes back.
Each change request goes up one level.
And in the end, Linus does the cut.
But there's a lot of energy.
There's a lot of intent.
There's a lot of humans.
There's a lot of humans.
And it was always about human energy.
And now we suddenly have this AI, which it's just tokens.
Right now, who knows how much they're subsidized or not, or it's just machines doing.
And then suddenly, they create plausible things that look like human energy.
And it's hard to differentiate.
And suddenly...
It just, like, throws a wrench into all this.
Actually, I disagree.
I don't think a lot has changed for open source.
Okay.
The volume has changed.
No.
Yes, but that's just a number.
The amount of, as you said, the amount of actually useful and maintained projects has probably not changed a lot.
So you're saying that the ones that were there, they're so useful and maintained?
Not even the ones that were there.
I mean, there's a specific rate of new open source projects that survive longer than two weeks.
That's always been the case, right?
So now we just have more projects that die after two days than before.
But we still have the same amount of projects that will have a long-term viability just because there are humans that actually care to maintain the thing over a long time, build a community of humans that support the entire thing, build an ecosystem around the entire open source project.
So you're saying you're not a believer in Moltbook?
No.
I mean, good job, Meta, buying that up.
Super useful.
Now, I think at the end of the day, we are kind of freaking out when we don't actually need to, because apart from the fact that I personally can now generate code at close to the speed of light, for me, building an open source project, and that entails not just the code but the community around it, the spirit around it, the ecosystem around it, nothing changed.
What changed is mechanical parts.
I need the bottlenecks to deal with the influx of exponentially growing agents, pull requests, whatever.
GitHub itself is under immense pressure because now it's not just humans hammering their infra, it's now billions or millions of OpenClaw instances hammering their infra.
Everybody complains about GitHub going down.
I actually think they're doing a pretty good job.
Like, that's a lot of traffic that's coming their way since basically Christmas.
It's basically OpenClaw.
So yeah, I would be a little bit more optimistic.
We're just in the messing around and finding out stage at the moment, and everybody wants tokens to be a KPI, just like lines of code used to be a KPI.
We've seen this.
Speaking of things that don't change and messing around and finding out, you wrote a tweet or you wrote somewhere that your biggest enemy is complexity.
It's also your agent's biggest enemy.
Can we talk about that?
Very simple.
If I have a 600K lines of code codebase and my agent can, at best, be effective up to a context window size of around 200,000 tokens...
How much of the code can the agent see?
A third, right?
Right.
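The "a third" answer is simple arithmetic, sketched below. The codebase sizes are assumptions: the 600,000 figure is chosen because it is what makes a 200,000-token window come out to one third, and the sketch treats codebase size in tokens for simplicity. `visibleFraction` is a hypothetical helper name.

```typescript
// Fraction of a codebase that fits into the context window at once.
// Both sizes are measured in tokens; the numbers below are illustrative assumptions.
function visibleFraction(codebaseTokens: number, contextTokens: number): number {
  if (codebaseTokens <= 0) return 1;
  return Math.min(1, contextTokens / codebaseTokens);
}

const contextWindow = 200000;
const codebaseSizes = [100000, 600000, 2000000];
const fractions = codebaseSizes.map((size) => visibleFraction(size, contextWindow));
console.log(fractions);
```

As the codebase grows while the window stays fixed, the visible fraction only shrinks, which is the sense in which added complexity works against the agent.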
If you manage to get all the relevant code for a task into that context window, you're probably okay.
Although that is a separate problem, an information retrieval problem, which is not solved and which agentic search also doesn't solve.
That is, are you sure that the agent finds all the relevant code it needs to find to fulfill a task?
That's also where all the garbage code comes from, because it doesn't see all the things it needs to see.
In this case, let's assume the best case information retrieval is solved, everything fits into a context, agent does a good job.
That's not the reality we're living in, because now the agents spit out so much code that they themselves cannot possibly read into their context on a new task anymore.
You know what I mean?
Yep, they fill up their own context window.
Yeah, exactly.
The complexity they add is their own worst enemy because eventually the codebase will be so big and so complicated and so interconnected that the agent has absolutely no way on a technical level to ingest all the context it needs to do the new task.
And I would like to point out that the agent has learned all of this garbage from the internet and from us because on the internet there's all our old code.
While there are some pearls, there's also a lot of swine.
Because we have a gazillion GitHub projects from the olden days where we just tried out things, and because instances like Linux, or any other really well-maintained and well-written open source project, are minuscule compared to all the rest of the garbage.
And a machine learning model will kind of converge towards, well, simplified, the mean, right?
And what is the mean then?
It's not the handful, comparatively, of excellently engineered projects.
It's all the garbage on the internet, all the cargo culting, all the trend hype of the day kind of stuff.
And that's what we get when we let the agents do all the things for us.
Yeah, so we have this problem of things are getting more complex, which slows agents down, which will in fact impact quality, which we were just talking about.
But Armin, now that you're building your own startup, the two of you are building your startup now, how are you...
And you're working with agents, right?
And they will have these things.
How are you dealing with generating code, building products, balancing quality, tech debt, complexity?
How are we dealing with that?
Badly.
Look, I think that...
We're coping.
We're not dealing.
I don't know if I wrote this in a blog.
I definitely have it on my slides for the conference here.
I enjoyed the time from April to about October immensely.
Because...
It felt like I can do so much, but also like there was no heightened expectation.
Like the world has not yet gotten used to this idea that everything has to now also move at 10 times the speed.
And there was a moment of time where I felt like, like we worked on this Vibe Tunnel thing in the beginning and I was like, it felt so much fun because like I have time now to play with the kids and I just prompted a little bit on my phone and like it felt...
Vibe Tunnel was where you could set up talking from your phone with your agent on the machine, back when that wasn't as easy.
Yeah, it was just like a remote terminal, basically.
And it's not that we did much with it, but it had this happy vibe.
And I know that I spent too much time on the computer, but I didn't feel any pressure.
But now we're collectively feeling like everything has to ship faster.
It has to iterate faster.
The baseline that we want to achieve in terms of fidelity and everything has to be higher.
And so now it feels very stressful.
Even in your own startup.
Yeah, because to some degree you cannot, like, you can be the most stoic person in the world and it's still going to get at you in a way that I'm slowly learning to work with my own emotions in a way on dealing with this.
But I find it very, very hard in a way to, because I was used to things working in a certain way and I knew how I do some stuff.
And then I fell a little bit too much in the trap of giving in to the machine and actually doing things in a way that I normally wouldn't have done things.
That you regret.
It's definitely a gentle regret.
Gentle regret, yeah.
And so quite frankly, the answer is I feel like now with a little bit of power of hindsight, learned some things that I wish I would have learned probably in November.
Tell us.
Well, I mean, a lot of it is really the recognition that if you...
There is no back channel to me or to any other engineer.
When under normal circumstances, there was a back channel.
There was this feeling of like things are not quite right in the code base.
Like there was this, now the change is harder and like the complexity, like you sort of see it in the complexity of pull requests getting higher, but like if you rubber stamp it, then like what's the back channel there?
And so like this mechanism, this back pressure, this friction in the code base, you don't feel when you work with the agent.
I think there's a way to kind of measure it.
And like if I scan through my sessions on a project from start to current date, I think the frequency of curse words increases because the agent starts messing up more because it itself cannot deal with the complexity of the project.
And I would be actually really interested in whether this is measurable because I feel it in most of my projects now that occurs a lot more.
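Whether that is measurable could be checked with something as small as the sketch below. It is only an illustration: the word list, the sample session texts, and the helper names `curse_rate` and `trend` are all made up here, and a real analysis would read the user's messages from actual session logs.

```python
import re
from statistics import mean

# Hypothetical word list; a real one would be longer and personal.
CURSE_WORDS = {"damn", "wtf", "crap"}


def curse_rate(transcript: str) -> float:
    """Curse words per 100 words of user text in one session."""
    words = re.findall(r"[a-z']+", transcript.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in CURSE_WORDS)
    return 100.0 * hits / len(words)


def trend(rates: list[float]) -> float:
    """Least-squares slope of the rate over session index (positive = rising)."""
    n = len(rates)
    if n < 2:
        return 0.0
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(rates)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, rates))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Made-up sessions, ordered from project start to current date.
sessions = [
    "please add a login form",
    "the build fails again, damn",
    "wtf, this crap broke the tests again",
]
rates = [curse_rate(s) for s in sessions]
slope = trend(rates)
print(f"rates per session: {rates}, slope: {slope:.2f}")
```

A positive slope over many sessions would be a rough signal that the agent is struggling more as the project grows, though it would not say why.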
But you mentioned friction in the software.
You didn't say tech debt.
You didn't say complexity.
What is this friction?
Because I don't remember us talking about this pre-AI at all.
So I found this ironically kind of funny.
And it's kind of sad.
I will not name any names.
But there was what I assumed was an incident related, at least in part, to engineering on a company where they shipped out a configuration change that ultimately resulted in a security issue.
And look, things happen.
But the link that I saw on this had the social preview with that company's tagline.
And the tagline was ship without friction.
And that really gave me pause because I know as an engineer, we used to talk about you got to get rid of all the things in the way so that you feel happy shipping stuff.
But there always were changes where you really wanted to think, do you want to drop the database?
Do you want to merge this migration which might take a table lock that could potentially take you down?
There are moments every once in a while where you were really supposed to think, and people created checklists or mechanical gates where you would have to confirm something.
There are certain things that we used to do, particularly if you run a SaaS company, to put stuff in to slow things down.
Or in some of the best engineering teams, in order to mature a service, you have to define an SLO, you have to define expectations, and if your service is supposed to be critical, then there's some other stuff that unlocks on this sort of tree of requirements.
And a lot of engineers feel like, oh, this is also this bureaucracy.
But the reality is if you do this correctly, then it saves you time and it makes you happier.
You're not waking up at three o'clock in the morning.
All of this is useful.
It's like friction injected to deliberately slow things down.
I guess the easiest example in any decent-sized company.
You have services based on tier, based on criticality.
The highest tier software now needs to have, let's say, two or three code reviews or an approval from a director to do a configuration change, which, again, all slows down.
But it's kind of like, we know that this is on purpose.
Like, by adding this friction, we want you to think, do I want to push through this friction in terms of time invested or effort or having to justify things, et cetera?
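As a rough sketch of what such a deliberate gate can look like in code: the tier numbers, review counts, and the `Change` and `may_ship` names below are assumptions for illustration, not any particular company's policy.

```python
from dataclasses import dataclass

# Hypothetical policy: tier 1 is the most critical service tier.
REQUIRED_REVIEWS = {1: 3, 2: 2, 3: 1}


@dataclass
class Change:
    service_tier: int
    approvals: int
    is_config_change: bool = False
    director_approved: bool = False


def may_ship(change: Change) -> bool:
    """Return True only if the change has cleared the friction for its tier."""
    needed = REQUIRED_REVIEWS.get(change.service_tier, 1)
    if change.approvals < needed:
        return False
    # Config changes on the highest tier also need a director sign-off.
    if change.service_tier == 1 and change.is_config_change:
        return change.director_approved
    return True


print(may_ship(Change(service_tier=3, approvals=1)))
print(may_ship(Change(service_tier=1, approvals=3, is_config_change=True)))
```

The point of a gate like this is that it is written down and intentional, so nobody mistakes it for accidental bad developer experience.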
It makes you think: do I really want to add this to the codebase if I know that the end effect will be that it has to go through this entire chain of code reviews?
So we're coming back to saying no to yourself to avoid pain going through that process.
And then taking on the pain when you know that you have the conviction, you have the...
The backing, you have the confidence as well, right?
Like, so typically when it's a high-friction thing, let's say a tier one service or a highest tier service where a director has to sign off.
When you're a new joiner on the first day and you don't know the context, you probably know that that's a pretty large ask, and you'll probably socialize it, get buy-in from someone experienced who says, oh, this is the right thing.
You'll go with them, right?
Back to human dynamics a little bit.
And I think the thing is, there's a very delicate balance in the whole thing.
Some friction is just an accident of having created bad developer experience, right?
But some things look the same.
But they were deliberate; they just may not have been sufficiently documented.
But there's this feeling now of, get rid of all the friction so that the agent can be very autonomous, so that you can run many of them simultaneously.
A lot of it comes from that.
It's like these things are actually rather slow.
And the only real time saving that you get from it is parallelism.
And so somewhere there is this trap.
I feel like a little bit more experienced now in managing the trap, but I don't have the solution for that either.
And I will not say that there is an example code base where I felt really, really great about the stuff that I built, except for pre-existing libraries from before the agentic days.
Those I still feel a strong emotional attachment to, and I'm much more careful about changing them than any of the other code that we have, other than Pi, to which I don't have write access.
Oh no, there's still no write access.
There's a lot of slop in Pi, but I try to avoid it in the bits and pieces where I know that's important code.
Like we have an HTML export functionality where it takes the current session and just spits out an HTML file that you can then host on GitHub or whatever.
I have not looked at a single line of code for that function.
I don't care if it's broken, as long as it looks right when it comes out.
But then there's the agent loop itself, or the extension loading mechanism, and all of that stuff, and that's important.
And the way I deal with ensuring, or at least trying to ensure, that that has high quality is I refactor mercilessly, because that pulls me into the code base.
I need to understand what I want to change structurally, not just line by line and syntactically or whatever.
I need to understand what's going on to do a good refactor.
And I'm doing that every now and then, like I'm doing now at the moment, prompted by wanting to add a new feature that's currently not possible with the current architecture.
Being in the code is the one thing that keeps the code base quality high and the complexity low.
But that's against the industry wisdom of burning as many tokens as possible, token maxing, basically.
Yeah, that's an interesting one happening.
But you just recently wrote, on the same theme, a blog post called We All Need to Slow the F Down.
Can we rehash some of the thinking and what triggered you to just put it out there?
Okay, so the basic idea is this.
Okay, your agent can now spit out 10 times more code a day than you can, but it also means it spits out 10 times more boo-boos, errors.
Even if it has half your error rate, then okay, it's not 10 times more, it's five times more, and still more than you would spit out.
So the rate of deterioration in your codebase has now increased.
And now go dark factory.
Now take 100 agents that do this to your codebase.
What's the end result of that?
So that's the first problem, right?
You need some way to review all of that code that now gets generated to fix all the boo-boos.
But you can't as a human, because as a human you're used to spitting out 1.5K LOC a day, and that's about the limit that you can actually review well, right?
If your agent spits out 10 times that, no chance you can review that.
And not all of that code by the agent might be important, like the HTML export thing, right?
But even if the agent spits out 3 to 5K a day, you have no way of reviewing that in any meaningful sense.
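The back-of-the-envelope math here can be written down as a small sketch. The figures are the ones mentioned in the conversation (10 times the output, half the error rate, about 1.5K LOC of review capacity a day); the function names are made up, and treating 100 agents as 100 times that output is a simplifying assumption.

```python
def error_multiplier(output_multiplier: float, relative_error_rate: float) -> float:
    """How many times more errors land in the codebase than with one human.

    relative_error_rate is the agent's error rate divided by the human's.
    """
    return output_multiplier * relative_error_rate


def unreviewed_lines(generated_per_day: int, review_capacity_per_day: int) -> int:
    """Lines per day that nobody can meaningfully review."""
    return max(0, generated_per_day - review_capacity_per_day)


# One agent: 10x the code at half the error rate is still 5x the errors.
print(error_multiplier(10, 0.5))
# Assumption: 100 such agents simply scale the output by 100.
print(error_multiplier(10 * 100, 0.5))
# 10x a human's 1.5K LOC against 1.5K LOC of review capacity.
print(unreviewed_lines(15_000, 1_500))
# Even at 4K LOC a day, most of it goes unreviewed.
print(unreviewed_lines(4_000, 1_500))
```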
And then if you do the armies, yeah.
And then the armies, this is interesting.
So you call it the dark factory, the idea being that tens or hundreds or thousands of agents, you give them a spec, they go and they break it up, they organize themselves like the mayor and all that jazz, they have the QA agent, they have the, you know, you give them roles, you give them context, and then you give them enormous amounts of tokens and spend.
And the idea is, or the hope is, that your software will be done in...
Oh, there will be, something will be done.
Definitely something's going to be done.
First your purse, and then...
No, yeah, sure.
More power to the people that make that work.
I can't make it work.
And the reason I think I can't make it work is because I still care about the quality of my product.
And I don't care if it's built by hand or by agent.
I just want the quality to be good, both in terms of how easy it is to maintain it and add new stuff to it on a developer side and on the user side.
All the companies claiming that all of their code is now written by agents, yes, we know.
Quality is garbage.
We feel it in our bones when we use your product.
It's garbage.
So I don't want that.
And, yeah, basically, I think people need to turn around and say, hey, what...
What are we even doing here?
We have these wonderful machines now that can take away so much pain from us by doing stuff we hate doing and doing that really well.
Why don't we start by freeing up some more time to work on the interesting bits and delegating the stuff we know they can do to them at large, like across the entire organization?
Find all the things that annoy the sh out of you.
And have the agents automate that for you.
And then you suddenly have time to think about what do we actually want to build?
What do our users need?
And if we decide to build a thing, then we can pull in the agents again and say, and we're going to polish the sh out of that.
Because now we have the time and the means and the tools to do an excellent job.
But that's not how we're working.
We build an army of agents and install Beads and make a big spec that hopefully will result in something amazing.
But here's the thing.
We talked about where did the agents learn their knowledge from, right?
The internet.
So garbage to mediocre.
Now, if you write a spec, what's the best possible spec you can have?
The best possible spec is, well, you define exactly how it should work.
You give it test cases.
The best possible spec is the software itself.
Oh, I see what you mean.
Yes.
Okay, you write a spec that's not the software itself.
So that means there's a lot of blanks that need filling in.
Yes.
What do you think the agent is going to fill those blanks in with?
Most likely from its training data.
And we already identified what the quality of that training data is, right?
Garbage to mediocre.
Well, and even before AI, don't forget, like Stack Overflow had a really big criticism because there was this thing of like, well, you control C, control V from Stack Overflow.
And oftentimes there would be some answers where the first answer was either not correct or not complete.
In many cases, regex for email was a good one.
You googled regex for email.
First page was Stack Overflow.
Everyone just copied the first solution.
And I think underneath number three, it was said it missed a bunch of cases.
But here's the thing, though.
I'm not saying agents or humans are better.
They're clearly not, but agents also don't solve that problem.
And if you then let not just one agent, that's already 10 times more productive than you, do the thing that it's bad at and that you as a human are bad at, but a hundred of those, what do you think is the outcome?
It's just very simple math.
Let's talk about another controversial topic, MCP versus CLI.
It's coming up.
And, you know, right now I'm hearing a lot of people really going for CLI is the future.
And I think I'm sitting with two of them.
But also MCPs are also really popular inside of large companies, especially when you talk with a bunch of people working at large companies.
It seems MCPs have found a real product market fit inside of larger enterprises.
Despite what people might think, I don't actually hate MCP quite as much.
Oh, wait, we have it on recording.
Yeah, no, we don't deal in absolutes.
We're not Sith.
So my fundamental challenge with MCP is that I think, but first of all, the spec is very complex, I think, for it.
But I think this is just generally how specs happen to be.
So it's a bit like the CORBA of its time.
So there's an inherent complexity in it.
But if you were to say, like, okay, so what is it really doing at the end of the day?
It's authentication and it's sort of invoking some stuff.
And MCP, even...
Theoretically, there's structured responses, but MCP for the most part is run some stuff, put stuff back in the context and then work with it.
So it fills your context very quickly.
And Cloudflare has this code mode MCP, which in principle, I really like.
I have an MCP for testing, which is a JavaScript interpreter that gives me access to the Google API.
And between an MCP like this and a skill, there's not a huge difference because the skill also needs to be in a system prompt.
So that defines it.
But the agents are just very, very, very, very good at running code.
And MCP is not quite running code.
It's basically RAG.
It's like input in and do some stuff.
And maybe some state transition that the model also doesn't see.
But in that sense, it's a hard problem to solve.
But it does solve auth.
It solves a whole bunch of things.
I want it to work.
I just still don't get it to work like I wish it could work.
My suspicion is still the glue has to be code execution.
But because MCP servers are largely not defined in a way that the model actually understands them, I haven't found ways to compose MCP tools reliably.
I found ways to make the MCP itself be composable by having the MCP be one tool run code.
But I haven't found ways to then orchestrate larger ones.
I want it to work.
And I think it has found its niche, and I don't think it's going to go away.
I think it's just a victim of its own success, really.
When the whole thing started, I think it was in October 2024, it was more or less a solution to get external services into consumer-facing chat apps.
Connect your emails, connect your OneDrive, connect your whatever.
Pretty much.
And then IDEs also took it over because it was convenient.
The Cursors, the Windsurfs.
Yeah, but I think the origin was basically the consumer side, not the developer side.
And I think that's a totally great use case.
I don't want my mom to have to mess around with code generation or whatever to invoke some API or call some API and so on.
So it's a perfectly fine use case.
And then developer side also picked it up and thought, oh, this is a great way to provide tools to my LLM.
Tools as in, in the system prompt somewhere there is, if you want to call this tool, provide this JSON payload and you get this thing back, right?
And that kind of felt right at the time, because if you read Anthropic's documentation, they would say, our models can deal with about 30 to 40 tools in the context.
And even that wasn't the case.
Like at 12, 20, they would just break down, but it doesn't matter.
But there was still like, yeah, this can work if you kind of keep it small and contained and very specific to your use case.
And then people started building MCP servers that would just basically map an entire OpenAPI spec into a gazillion tools.
Yeah.
And that's where it all fell apart.
So that's the first problem: very bad MCP servers from big corporations that thought, we need this now, what's the fastest thing we can build?
I just push the OpenAPI spec of our APIs through the thing and make it an MCP server.
That's garbage.
The second problem is that it's inherently non-composable.
If you want to combine the MCP tool outputs of two different servers, they need to go through the context.
The model itself needs to do the data transformation, the composition of multiple pieces of data fetched through it.
And then compare this with a CLI.
It's a pipe, right?
Exactly.
The model only sees the end result, and it is super free in how it massages that data.
And that's also the idea behind code mode.
Basically, it's a hack.
It's basically, okay, we now have MCP, we know it doesn't work for this specific use case, because we have multiple sources of data and you want to combine them, but you don't want to pull that through the context.
So let's build code mode.
And code mode is basically we take all the MCP servers, we expose that as functions in TypeScript, and then the model can actually just write some code that calls the MCP servers and then does the composition in the code.
It's like, how many interactions do we want here?
We can just let the model write the code.
We don't need the MCP server.
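The composability point can be illustrated with a toy sketch. The two "tools" below are hypothetical stand-ins for two data sources, not real MCP servers or any real API; the sketch only compares how much text would land in the model's context when each tool result is passed through it versus when the composition happens in code and only the end result comes back.

```python
import json


def list_open_issues() -> list[dict]:
    """Hypothetical tool A: returns a large list of issues."""
    return [{"id": i, "assignee": f"user{i % 3}"} for i in range(200)]


def list_on_call_users() -> list[str]:
    """Hypothetical tool B: returns who is on call."""
    return ["user1"]


def through_context() -> int:
    """Every tool result is serialized into the context; the model joins them.

    Returns the number of characters the model would have to read.
    """
    issues = json.dumps(list_open_issues())
    on_call = json.dumps(list_on_call_users())
    return len(issues) + len(on_call)


def composed_in_code() -> tuple[list[int], int]:
    """The join happens in code, like a pipe; only the end result is returned.

    Returns the matching issue ids and the characters the model would read.
    """
    on_call = set(list_on_call_users())
    ids = [issue["id"] for issue in list_open_issues() if issue["assignee"] in on_call]
    return ids, len(json.dumps(ids))


ids, chars_in_code = composed_in_code()
print(f"through context: {through_context()} chars, composed in code: {chars_in_code} chars")
```

In the second variant the intermediate data never touches the context window, which is the property a shell pipe gives a CLI for free.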
And then the third part is, David from Sentry is a big proponent of MCP because of the auth thing.
And honestly, that's, again, for me, super valid.
But the model itself kind of doesn't make sense anymore.
I think that there's a world for MCP2, which is ironically maybe based more on...
So there's a company called Stainless, which basically generates SDKs out of OpenAPI specs.
And I'm really warming up to the idea of, like, maybe there's an MCP that's entirely based on auth plus libraries, or directly HTTP requests against OAuth specs, because then you can compose it together there.
And I think one of the things that's also kind of underappreciated, and you see it if you watch Pi do its stuff, because it's kind of transparent about the tool calls that it does, is that it's kind of magical at times how creative agents get with large outputs.
For instance, when Pi runs a program in bash and it produces too many lines of output, it actually only reads, I don't know what the cutoff is, but it reads the first couple and it's like, oh, if you want the rest of the file, it's 20 megabytes large and it's in this file.
And then the agent's like, oh, 20 megabytes, that's too much.
I'm going to grep on the file.
And they get really ingenious in how they're interacting with it.
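A minimal sketch of that kind of output handling follows. It is not Pi's actual implementation: the cutoff, the wording of the notice, and the function name `truncate_for_agent` are all assumptions made for illustration.

```python
import os
import tempfile


def truncate_for_agent(output: str, max_lines: int = 50) -> str:
    """Return the output unchanged if it is short; otherwise return only the
    first max_lines lines plus a note saying where the full output was saved."""
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output
    # Save the full output so the agent can grep or read it selectively.
    fd, path = tempfile.mkstemp(suffix=".log", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.write(output)
    size = os.path.getsize(path)
    head = "\n".join(lines[:max_lines])
    return (
        f"{head}\n"
        f"[truncated: showing {max_lines} of {len(lines)} lines; "
        f"full output ({size} bytes) saved to {path}]"
    )


big_output = "\n".join(f"line {i}" for i in range(1000))
print(truncate_for_agent(big_output, max_lines=3))
```

Because the note carries the size and the path, the agent can decide for itself whether to read more, search the file, or move on.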
And MCP takes that away.
The question is like, how would you define MCP in a way where it wouldn't take that away?
Where it still has all of that magic and capability.
And I don't really know the answer because I think it's hard, but auth needs solving and composability needs solving.
And I think there's a bright future of that kind of stuff.
And also like what Mario said, if coding agents wouldn't have become so popular, then the idea of code generation code running for like non-code related problems probably wouldn't have taken off quite as much too.
But like the most capable personal agents, OpenClaw being a good example of it, they're just coding agents hidden from you.
And then that just naturally some random person who is not a programmer is going to say, how am I going to do this?
And the model doesn't say like install this MCP.
The model says like, okay, I can write a Python script that does it.
And so you naturally have this in the sort of the crazy space, you have the adoption of more code execution and the compliant enterprise space.
You don't have that.
There's a different path.
And I personally don't think that models are going anywhere else other than code generation going forward for any kind of agentic task.
I think that's mostly a function of there being a lot of training data for code generation.
And code generation being a very easy means to control computers.
So I don't see a different paradigm there coming out of the model labs anytime soon.
So I think taking that as the assumption where the future is going.
We just need to figure out how to make code generation kind of work within an enterprise setting with auth and all of the other enterprise things that entails.
So let's do a fun trying to predict a year out, which is hard, but in 2027, knowing some of these basics, just again from first principles, where do you think these coding agents might be and the software engineering workflow?
Basically, this is just speculation.
We know we cannot predict the future, but where do you think that there'll be a lot of focus in the coming year and we might, in an optimistic case, see some results in tools and how we work and what's working, what's not working?
I have no idea.
Honestly, I have no idea.
I could make up something that's probably not going to happen.
I think the self-mutability thing is obviously something I believe in.
I think we will see more of that.
Self-mutable software, including the tools themselves with which we build the software.
And I think that will expand not only to the tech sector, but also to non-tech applications of agentic tools.
Is it dog years, where you times it by seven?
Is that how it works?
So that's basically the model I have right now of like how this stuff works.
It's like when you ask me like what's going to be in a year, it's like seven years, right?
To me that makes it incredibly hard to have any sort of predictions about the future, because it's still not one year, maybe now it's one year, from people starting to use Claude Code, but it feels like it's much, much longer, like much more time has passed.
And I think right now the closest that I can imagine is, we know that code execution and code generation and this harness thing around it, this is going to be it, because reinforcement learning gets more of that data.
And my strong hypothesis is that as more and more people are starting to wake up to the fact that you can do interesting things with agents, there will be a societal recognition also of how much more dependent you are on basically two companies.
And I think we'll have a conversation about that part.
We should have a conversation about that part, particularly as Europeans, because we don't really have these labs over here.
And so I hope we have that conversation.
But my best guess is that we will wake up to the fact that we are now, I mean, engineering teams are already now telling me that they have code bases that they think they couldn't maintain anymore without a machine.
My guess is that one of those companies will be public and it will be expensive.
And I think that might actually dominate or at least become a conversation that's much bigger than the question of are you using Pi or using Claude Code or something like this.
I also see a...
We've seen this with, was it Mises, the new Claude model?
Oh no, Spot, the new GPT model.
They will only give this to select partners.
So now we are seeing a split in who can get the best intelligence.
Or the perceived best intelligence.
That'll be interesting dynamics.
So both of you are working on popular AI tools.
You're building a startup that, of course, you're using AI and it's also around agents.
How do you both keep up to date?
I've just seen things.
And it's not as easy to get me on a hype train as it used to be.
But that comes with age.
It's definitely easier not being in San Francisco because I think that just drives me crazy.
I hear so many things from my peers over there that's just like, yeah, I'm not going to go to San Francisco.
Thank you.
So having a peaceful environment around you where it's not all about tech might be helpful.
It helps having a kid.
It helps just going outside, climbing trees, going ice skating, and then looking back at what you did just half an hour ago and be like, why would I do that?
That's just stupid.
I mean, to the detriment of maybe people that are trying to stay in contact with me, I got very good at muting notifications, not reading emails.
And that has in part become necessary, I think, over the last year or so.
But it actually turns out that passage of time sometimes clarifies stuff a lot, because if it's really necessary, it's going to reach you again.
Like I have an unhealthy Twitter addiction, which I'm not particularly proud of.
But in terms of source of like interesting things, that is still a thing.
But I try to now sort of consume it in a form of, if it's really, really important, it will stay in the discourse for quite a while.
And I just wait it out.
And if it's there like three weeks after it originally happened, then probably something to it.
And I don't need the three-week head start necessarily.
But it is, honestly, it's really hard.
It is really hard to deal with this because there's a genuine excitement in it.
And I feel like my more than 20 years of experience in that space of software engineering, it tells me a lot of stuff.
But at the same time, it hits you in certain ways where you felt like there will be grounding and there will be something to build on and a strong foundation.
And now it feels like...
Well, seemingly everybody else doesn't care about that foundation anymore.
So maybe you don't need the foundation.
And for quite a while, it works.
And that is sort of weird.
I kind of feel like since we've been funemployed in 2025 and all this started, that we had like a head start.
Like I see all the excitement the two of us and Peter had in April last year.
Has waned.
Nobody else.
No, no.
But nobody else at the time has kind of shared that excitement that much.
And then the Christmas break came.
And now everybody else has that excitement that we had in April, right?
So now they are learning groups.
Now they are catnipping themselves to immeasurable amounts of lost sleep and terrible code bases.
And I think it will self-correct because it's not sustainable.
Yeah, we did see this as well.
I did a deep dive with The Pragmatic Engineer in early March.
A lot of people were very excited in January about it all: they started to use the new models, saw what they can do, and went all in at work or on side projects.
In about two months time, a lot of them were like, hang on, it introduced all this complexity.
It has these things.
I'm not going as fast as I thought I would be, et cetera.
So I guess there's just a natural thing where you have a time, anything new, right?
A job, anything.
You have a honeymoon period where you've got the blinders on, which you should, by the way.
And then you start to realize and maybe overcorrect.
But there's a natural thing where, in general, it just takes time to see the outcome of your decisions.
Yeah, so I'm not worried about all the dark factory and all the software is dead and SaaS is dead and all that.
I generally believe this is just part of the hype machine and that will self-correct.
Yeah.
As closing, what's a book that you would recommend and why?
Code by Petzold.
Classic.
I just love it.
It's just such a great read.
It's also for non-techies, and it's the first thing I recommend if anybody asks me, what's your job?
I'm pointing at that, and it's like, it has much less to do with computers than you think.
And I read recently Breakneck, which I unfortunately forgot the author of.
It sort of goes a little bit into an exploration of how China works, or maybe how Europe and the US are different, and I found it at least thought-provoking.
Well, Mario and Armin, thanks a lot for this conversation.
It was great to have it in person.
Thanks for having us.
Thank you.
This was a really fun conversation, thanks to Mario and Armin.
The idea of self-modifiable software really grew on me.
Mario said how Pi doesn't have MCP support, plan mode, and many other features that devs would want from it, but you can build it into its own code.
So far, it's working.
Pi is popular because it modifies itself.
I wonder if and when this concept of self-modifying software thanks to AI will spread outside of just this dev tool.
I also liked how we talked about the observation that agents don't feel pain.
But humans do.
When a codebase gets too complex, the human engineer feels the issues this creates.
And this tech debt is what pushes refactors and rewrites.
But agents simply do not do this.
They just keep adding to the complexity.
And in a codebase where devs regularly feel the pain of the codebase and do something about it, the quality will probably be also better.
And finally, the MCP versus the CLI discussion.
This was a good one.
MCP is more about offering tools for AI through context, and CLIs allow piping one tool after the other.
Both Mario and Armin are more fans of the CLI, but in all fairness, MCP has its use cases, for example, inside larger companies.
The right tool for the right job.
Do check out the show notes below for related Pragmatic Engineer deep dives that go even deeper into related topics.
If you've enjoyed the podcast, please do subscribe on your favorite podcast platform and on YouTube.
A special thank you if you also leave a rating for the show.
Thanks, and see you in the next one.
