# Harness Engineering: Optimizing AI Coding Workflows

**Podcast:** Thoughtworks Technology Podcast
**Published:** 2026-05-14

## Transcript

Welcome, everyone, to yet another episode of the Thoughtworks Technology Podcast.
My name is Prem Chandrasekaran, and today I've got my co-host, Nate Schutta.
Nate, do you want to quickly introduce yourself?
Absolutely.
I'm Nate Schutta.
The best way to describe me is architect as a service here at Thoughtworks.
All right.
And today we are joined by Birgitta Böckeler, who is usually a host on the Thoughtworks Technology Podcast.
But today she is playing the role of a guest.
And she recently wrote an article on something that's called Harness Engineering for coding agent users on martinfowler.com.
So for me, that was the clearest mental model that I've seen for what teams running coding assistants day-to-day should actually build around them.
Welcome, Birgitta.
Do you want to quickly introduce yourself as well?
Yeah, hi, Prem and Nate.
Yeah, I'm Birgitta.
I'm a distinguished engineer at Thoughtworks, and I'm based in Berlin in Germany.
And I have been a host in the past indeed, but I haven't been on the podcast in a while, so I'm glad to be back.
So before we get into definitions and such, here's the question that I would want users to keep in their heads the whole time.
So if you're running a coding agent, Claude Code, Cursor, Copilot, every day, and you feel the gap between what these tools can produce and what you would actually trust without supervision in some cases, this episode really, for me, is about closing that gap.
So let's start at the beginning, right?
What is harness engineering and why are you writing about it now?
Yeah, so that was one of the challenges when I was writing the article to figure out how to even define it, because also in these days of AI, people create a lot of content and there's like a lot of discourse happening, like very, very quickly.
There's a lot of throughput of communication as well, right?
So we just throw out a lot of terms and then get semantic diffusion really quickly.
So I ended up kind of describing it almost like an onion kind of model.
You have the large language model as your ultimate tool that you use to do something.
But then you put something around it that people have started calling the harness, right?
And so a coding agent is a harness of an LLM in a way.
So Claude Code is a harness.
The Pi coding agent is a harness, right?
So there's lots of choices there.
And the way that they're harnessing is by putting together a bunch of tools that can be used by the LLM through this harness to do stuff.
So in the case of coding agents, that is like editing files, reading files, certain code search tools, like maybe access to a language server, all of that type of stuff.
And then they also orchestrate prompts.
They have a system prompt.
So all of that is kind of like making the model much more useful to be able to code for us, right?
But then as coding agent users, as users of this harness, we can also expand the harness.
So that's kind of like the next onion layer out, right?
So we can take the ready-made harness like Claude Code or Cursor, where some people have already thought very deeply about what all the things are that we need for coding, but then make it more specific to what we are working on, right?
So if I work on a TypeScript code base, I want to think about what the specific things are in my particular application and TypeScript code base that I also want to harness, right?
So for example, we'll get into it later, like static code analysis or stuff like that, or what are additional tools I want to make available to it?
What are my guidelines, my specifications and stuff like that, that I feed into it?
And so that's also what people have started calling a harness, right?
I almost wish it was a different word because that might make it easier.
It's almost like two different bounded contexts, right?
And so that's what people now call harness engineering.
And when you have conversations with other people about this, I would always try to make sure that everybody's talking about the same type of harness so that there are no misunderstandings.
Right, right.
Absolutely.
You know, so you seem to break the harness into two halves, guides and sensors.
Can you walk us through that distinction?
What's the difference and why it matters?
Yeah, so I was basically just reading about what people are calling harness engineering.
There was an article that got a lot of attention from a team at OpenAI that was called Harness Engineering.
I think the author's name is Ryan Lopopolo.
And so I was reading that.
I was reading some other articles, listening to what our colleagues at Thoughtworks are doing on different clients.
And I was just trying to find some vocabulary for what's going on.
So it's not really like inventing new things, but just trying to find some language to make it easier for us to think about it.
So the one thing is what you were saying also, the guides, so kind of like feed forward.
So we're thinking about what we want the agent to do, and we're trying to anticipate what it might need and where it might do something wrong.
So feed forward is like, my typical entries in the early days of instruction files were things like, remember to activate the virtual environment before you execute a Python command.
Or in this project, we always use the following coding convention patterns or stuff like that, because maybe we've seen the agent fail at something multiple times.
So we're trying to anticipate that and tell it beforehand to give it a good chance to create the code that I want in the first place.
But usually it's never perfect.
Right.
So then in the second step, we also want to give it feedback so it can do some self correction even before I, as a human, even have my first look at the code.
Right.
So it's all about trying to direct where I have to put my attention.
So the feedback then can be stuff like, very classic, a lot of people do that right now, a code review agent.
So it's like another little agent, another LLM, that looks at the code that was generated in the initial generation and finds flaws in it or finds places where it maybe didn't comply with the guides that I gave it.
So that would be a type of feedback.
But we also have actually a lot of tools available already historically that we've been using for years or sometimes even decades that can give automatic feedback as well, that are more computational, right?
So that is not interpreted by an LLM, but is more deterministic.
So the classic example that we can dive deeper into later as well is static code analysis, right?
So everybody listening who's been using coding agents for a while at this point knows that typical failure modes are very long files or classes, very long functions, high cyclomatic complexity, or functions that have 10 arguments or parameters, which is often a smell for bad design.
And we can actually give an agent that feedback.
We don't have to type that out.
A static code analysis tool can do that.
So it's just this constant loop of what do I anticipate and feed forward so that it does a good job in the first place?
And then also, how do I think about giving it feedback?
And then working on those two things is like a new job that I have as a developer, right?
So whenever I see something go wrong, I think about how can I steer this in the future with like a new guide or with like a new feedback sensor or stuff like that.
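As a rough illustration of the kind of computational sensor described here, the following is a minimal sketch using Python's standard `ast` module. The thresholds, the function name `check_source`, and the wording of the hints are assumptions made for the example; in practice a team would more likely configure an existing static analysis tool than write this by hand.

```python
import ast

# Hypothetical thresholds; the conversation does not give specific numbers.
MAX_FUNCTION_LINES = 30
MAX_PARAMETERS = 5


def check_source(source: str, filename: str = "<memory>") -> list[str]:
    """Return one finding per function that is too long or takes too many parameters."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        length = node.end_lineno - node.lineno + 1
        args = node.args
        param_count = len(args.posonlyargs) + len(args.args) + len(args.kwonlyargs)
        if length > MAX_FUNCTION_LINES:
            findings.append(
                f"{filename}:{node.lineno} {node.name} is {length} lines "
                f"(max {MAX_FUNCTION_LINES}); consider splitting it into smaller functions."
            )
        if param_count > MAX_PARAMETERS:
            findings.append(
                f"{filename}:{node.lineno} {node.name} has {param_count} parameters "
                f"(max {MAX_PARAMETERS}); consider grouping related values into one object."
            )
    return findings
```

Each finding carries a short hint, so the agent receives the same answer for the same code every time, plus a nudge toward a fix.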
So for those of us that have been doing this since before AI existed, I mean, this seems an awful lot like what I would have done with like an intern.
You know, so I get a new intern for the summer and, you know, perhaps we have a conversation about, you know, these are our coding standards.
This is how we do things here.
And then I would be watching their work for a while and observing and giving feedback.
And it feels to me like this is yet another example of nested feedback loops, which come up an awful lot in software engineering.
Yeah.
Yeah.
Yeah.
Yeah.
And also, I mean, I literally once had a grad, so a person fresh from university on my team who we often paired with, but not always because, you know, like for grads, I always want them to figure out how to do stuff by themselves as well, right?
And not just always pair.
So and then when they were not pairing with us, with the more experienced people on the team, the static code analysis tool that was set up was actually very helpful to them because like some of these basics they didn't know yet, right?
Like we were already used to some of this stuff.
However, the difference between a grad or an intern and AI is that like this grad at some point very quickly learned, it's not a good idea to have lots of arguments in a function, right?
So like they learn, right?
Whereas with models, they of course also get better.
We know that, right?
But there are certain things where, you know, we just cannot rely on the idea that at some point they will just never do them anymore, right?
So the feedback is extra helpful when it's deterministic like that.
Right.
So you also seem to draw a line between computational and inferential, right?
So can you give us an example of each and where that line actually matters in practice?
In context engineering, right?
I would say harness engineering is a type of context engineering, right?
What everybody seems to be focusing on heavily so far is like basically lots of markdown files, right?
So like a markdown file to describe the coding conventions, to describe the project context, the architecture, or also a markdown file in the form of a skill to do the code review.
So that is all then interpreted by an LLM.
So that's what in the article I call inferential guides and inferential sensors.
But those are always up to interpretation by the large language model.
Right.
So they are particularly helpful on the feedback side, for example, when it's like semantic stuff that we just cannot catch with something like static code analysis or regular expressions, stuff like that.
Right.
So they're super valuable there.
But I think a lot of teams have so far been under using the computational guides and computational sensors.
And I think there's a lot of potential there.
So on the feed forward side, there are different tools that we can make available to the agent that, again, increase the probability that they are good at manipulating the code in the way that we want.
So I mentioned language servers before.
That, for me, would be an example of a computational guide, because with the language server I can, for example, do stuff like: OK, I want to rename this core concept in our code base.
And we use this term all over the place, in functions and class names and so on.
Fine, LLM, please find all the places where we're using it, but then execute the actual renames with the help of a language server, or JetBrains has an MCP server that kind of exposes all of their rename symbol functionality and stuff like that.
Having the model edit every occurrence itself is very token intensive, right?
So maybe I want to reduce that.
And also, it's maybe a little bit more error prone.
So I just give it this tool.
A cool example of that is codemods, right?
So this category of tools that are really good at doing large-scale refactoring, in particular in situations like version upgrades or library upgrades, right?
So tools like OpenRewrite, or there's also a few tools in the JavaScript space for that, I think.
So again, I can make that tool available as part of my expanded harness to the agent so that it has a better chance of doing reliably what I want to do.
And then, of course, on the feedback side, I mentioned already a bunch of examples.
There are so many computational sensors that I think we still have a lot of potential to get stuff out of them.
So I recently used good old static code analysis a lot in a project to set up, like I said, these low-hanging fruit, like long files, long functions, cyclomatic complexity, lots of arguments.
And I was actually positively surprised how much potential there is in that because, you know, we've always kind of said, yeah, static code analysis is fine and all of that, but that ultimately doesn't, you know, guarantee us quality, right?
But actually, when I use coding agents, it gets triggered all of the time now that I have it integrated, and it gives the agent signals about the design, and the agent tries to rethink and break things down further, reduce complexity, without me even having to look at it, right?
So it's, yeah, I was positively surprised how much potential there is in that, I think.
That's interesting.
Anecdotally, I just heard about a team that was complaining about all these Markdown files in their repo.
And so perhaps the thing is, well, you know, use some of these computational tools.
And it also, I think, is a good reminder that just because there's new stuff doesn't mean we throw out all the older stuff.
There's still value in these tools that we've used in some cases for decades on projects.
The main distinction that you seem to be drawing is that computational sensors or computational harnesses are 100% deterministic, whereas the inferential ones may not be.
I mean, I don't want to put a percentage to it.
It's more of a suggestion rather than an assertion where, you know, for the static code analysis tool, it's deterministic.
It's yes or no.
There's no in-between answer, whereas with the inferential ones, although you might have told it to write great code, use only a cyclomatic complexity of less than 10, it might violate that when it's under pressure, which we don't really know when that happens, but it does happen.
Yeah, so the inferential ones are all about semantic interpretation, you know, the interpretation that happens when the token prediction happens, right, which is their strength, but can also in some situations be like a negative, right?
Or like we want to complement it with the other stuff, right?
Yeah.
And then you can also think about how do you balance that, right?
Because then when I was looking at some of the possibilities like static code analysis, structural tests of like module boundaries and stuff like that, right?
Then maybe at some point you can think, okay, this part I actually have covered pretty well with sensors and the AI doesn't even do it wrong that frequently.
And when it does it wrong, then the sensor gives it feedback, right?
Maybe I can just delete a bunch of my guides upfront, right?
Because I have this feedback set up on the backside.
And then on the other hand, I was then wondering, but what if maybe I want to start using weaker models who then do certain things wrong more frequently so the sensor constantly triggers and I always have the self-correction.
Then maybe I want to have them back in the guides.
So I was just speculating about how the combination of the strength of the model, what guides I have, and what sensors I have will all influence each other, and potentially what I want to use.
And having these words to describe it helps me think it through.
Yeah.
So as long as the token economics are pretty significantly subsidized, I guess we don't have to worry about it, but maybe that day will come very soon.
I think that day is coming already.
I mean, we're already seeing the switch from all you can eat to, you know, we're now going to charge for what you use.
And I do think that's going to be one of the fascinating aspects of this, that subsidized tools using subsidized models go away as investors want to see the return on the money they've poured into this space.
I think we are going to have to get a lot more clever about some of that.
And I was just talking to Chris Kramer about that.
The constraints often do set us free in some ways, and they force us to get creative and try some things.
And I think a lot of what I'm hearing here in this, Brigitte, is that it's one is none, two is one, you know, that we use these layers to catch these things because you can't just rely on, well, I mean, I had a system prompt that said, don't delete prod, so it's never going to delete prod, right?
Yeah.
Yeah, yeah, yeah.
And what's also interesting about the static code analysis, we were talking about how it's like a tool from the past, right?
But so there are two things like one is that now with AI, AI can help us build more of these sensors, right?
So we don't even only have to use those tools out of the box.
Like for my like little application that I'm working on, I actually like created lots of like custom scripts or like additional little rules.
And also like, you know, one big potential of this is that you put in your own custom messages for the different rules to give like a little bit of self-correction guidance, right?
So that's also quite powerful.
And then the other new potential or opportunity for static code analysis is that in the past we often got tired of it and it just started gathering dust on a server in the corner, right?
One of the reasons for that was always that it was hard to get the signal to noise ratio right, especially when it came to the warnings, right?
So you would get all of these warnings and you would never take the time to suppress them case by case in the code base.
So you just had all of this noise and you just gave up looking at them.
But now with AI, for example, let's say this thing about long functions.
I can tell AI, in the guidance that I give it for that particular max-lines-per-function violation, hey, this might be a design smell, this might be too complex.
Think about this, right, and make a judgment call.
If you decide that it's okay, it's just two lines over, or that's just what this function is, we can't break it down further, then you're allowed to like suppress it for this particular function in the file, right?
Or like increase the threshold slightly so that we can still get it like violated in the future if it gets even bigger, right?
And then, you know, AI might sometimes take the wrong decision, but I actually have it documented in my code, right?
I can even have yet another custom script that shows me all of the exceptions that it made.
And I just start my review there, right?
That's maybe the first way, because those are kind of decisions that it took about the design.
So that's maybe a good place to start my review, right?
So this actually gives us the chance to wrangle maybe those warnings and actually keep a clean house for once, and, again, make static code analysis more useful than in the past.
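The idea of a custom script that lists every exception the agent granted itself could look roughly like the sketch below. The `# harness-allow: <rule> -- <reason>` comment marker is a hypothetical convention invented for this example, not the syntax of any particular linter, and the function takes file contents as a dictionary purely to keep the sketch self-contained.

```python
import re

# Hypothetical suppression marker: "# harness-allow: <rule> -- <reason>"
SUPPRESSION = re.compile(
    r"#\s*harness-allow:\s*(?P<rule>[\w-]+)(?:\s+--\s+(?P<reason>.+))?"
)


def list_suppressions(files: dict[str, str]) -> list[dict]:
    """Collect every suppression comment so a human can start the review there."""
    report = []
    for path, text in sorted(files.items()):
        for line_no, line in enumerate(text.splitlines(), start=1):
            match = SUPPRESSION.search(line)
            if match:
                reason = (match["reason"] or "").strip() or None
                report.append(
                    {"file": path, "line": line_no, "rule": match["rule"], "reason": reason}
                )
    return report
```

A suppression without a reason shows up with `reason` set to `None`, which makes undocumented judgment calls easy to spot first.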
So we can make the AI do all the anal retentive things that we wanted to do, you know, in our best intentions.
And then, ah, you know, we're just going to not worry about those warnings.
It's fine.
Yeah.
So you make a strong case for shifting quality left, which is something that is par for the course for Thoughtworks teams and that we have been saying for a long while now.
But what does this actually look like for a team that already has CI, but is now layering all of these coding agents into the process.
Is there anything that you have specifically for them in addition to the static code analysis that you're talking about?
So if you've already been doing all these things, you're much better set up to do this, right?
You're much better set up to give these sensors to the coding agent as well, of course, right?
So it's just a question of, how do you run it also during the coding session?
And that would be the shift left, right?
Because I see a lot of stuff where things just happen once a pull request is created, right?
And then all of these reviews and sensors start happening.
And I'm like, why is this happening after the pull request is created?
Shouldn't this be happening even before commit or at least like some of those things, right?
And I think there's also a lot of potential for tooling here.
So I've been playing around a little bit, you know, we can experiment with tooling a lot now as well, right?
I vibe coded some stuff locally, where I built a little sidecar that was running next to the agent and continuously executing all of the cheaper sensors like linting, the test suite, and so on.
And then the agent could just get an agent-optimized snapshot of what is the status of stuff regularly from this little sidecar.
So I think that's one thing to think about: for some of the things that you're only running in the pipeline at the moment, how can you shift them even further left, so that things are already clean when they get into integration, right?
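A sidecar like the one described could be sketched as follows. This is only an assumption about its general shape, not the actual tool: the function names and the snapshot format are invented, and the sensor commands are supplied by the caller. A real sidecar would rerun this on file changes or on a timer.

```python
import subprocess
import sys


def run_sensor(name: str, command: list[str], max_lines: int = 5) -> dict:
    """Run one cheap sensor and keep only the status plus a short tail of its output."""
    result = subprocess.run(command, capture_output=True, text=True)
    output = (result.stdout + result.stderr).strip().splitlines()
    return {"sensor": name, "passed": result.returncode == 0, "tail": output[-max_lines:]}


def snapshot(sensors: dict[str, list[str]]) -> str:
    """Build a compact status report; output is included only for failing sensors."""
    lines = []
    for name, command in sensors.items():
        status = run_sensor(name, command)
        lines.append(f"{name}: {'PASS' if status['passed'] else 'FAIL'}")
        if not status["passed"]:
            lines.extend(f"  {line}" for line in status["tail"])
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in commands; replace them with the project's own linter and test runner.
    print(snapshot({"lint": [sys.executable, "-c", "print('clean')"]}))
```

Trimming the output to a few lines for failures only is one way to keep the snapshot cheap for the agent to read.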
So I come from a world where I think I haven't worked on a team that is pull request by default ever.
Because at the time when Git came around, I was in an environment where the teams that I was working with were doing trunk-based development by default, right?
So I always obsess about the perfection of the commit, right?
I always want every single commit to be deliverable, to be clean, you know, so that's how I always think about it.
Test suites, by the way, are a special kind of sensor, I would say, because they're kind of like half inferential in computational clothing.
So they feel computational and deterministic because it's green or red.
Right.
But a lot of people at the moment just let AI generate all the tests and the tests might be testing something that we don't even want.
Right.
So this whole thing about the behavior is a whole other beast, I would say, that is a lot harder to deal with.
But what is a bit easier is to think about test effectiveness and test quality, right?
And acknowledge that coverage is not enough to tell us if tests are effective, right?
So that's also what I tried in a code base is like I had an AI-generated test suite and it had pretty high coverage, but I found like a bunch of unasserted things with mutation testing, like lots of them actually.
So, you know, I think of the test suite in that sense as a regression sensor for the agent, right?
It tells the agent you broke something or you changed something in the behavior that was there before, right?
So it can then, again, quote unquote, think, reason about, is that a good thing or a bad thing?
What do I have to change, right?
But yeah, is it actually testing the behavior I want?
That's the hard part, where we really need humans to look at the tests.
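To make the point about coverage versus assertions concrete, here is a hand-rolled sketch of what a mutation testing tool automates. All names are hypothetical and the "mutant" is written by hand; real tools generate many such mutants and rerun the existing test suite against each.

```python
def apply_discount(price_cents: int, is_member: bool) -> int:
    """Members get 10% off; everyone else pays the full price."""
    return price_cents * 90 // 100 if is_member else price_cents


def mutant_apply_discount(price_cents: int, is_member: bool) -> int:
    # Mutation: the member discount has been removed.
    return price_cents


def weak_test(fn) -> bool:
    # Executes both branches, so coverage is high, but only asserts on one of them.
    fn(10000, True)
    return fn(10000, False) == 10000


def strong_test(fn) -> bool:
    return fn(10000, True) == 9000 and fn(10000, False) == 10000


def mutant_survives(test, original, mutant) -> bool:
    """A mutant survives when the test passes on both the original and the mutated code."""
    return test(original) and test(mutant)
```

Here `mutant_survives(weak_test, apply_discount, mutant_apply_discount)` is `True`, which exposes the unasserted member branch, while the same call with `strong_test` is `False`.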
Yeah, that's a really important consideration too.
I was actually just talking to someone who told me that their AI agent decided to comment out the authorization because in dev, they didn't have the role to execute this specific function.
So the way the AI decided to get around that problem was just comment out authorization.
And my first thought on that was, well, an intern or a new software engineer might do that.
However, we as senior engineers on the project would see that, take the engineer aside and explain why you should never, ever do that.
And hopefully they would learn and they wouldn't do it again.
Yeah.
One thing that I've tried to do, and Kent Beck talks about this, right, where it's a human responsibility to define what tests exist, and it's a human responsibility to even write them, right?
So that way, the AI is confined to just writing the production implementation and you define what goes in and what stays out.
I'm not as strict as Kent is.
So my workflow is more along the lines of, okay, I'll describe to you what I want.
I'll also let you write a few tests and then I'll review those tests.
And then make sure that the tests are what I want.
And if not, we go into a revision loop until we are both satisfied and then finally move to production.
That seems to have worked reasonably.
I won't say that I've got it and nailed it completely, but it definitely mitigates the risk of the AI just writing a bunch of tests that nobody really cares about because the implementation demands it as opposed to the requirements meeting it.
So that's something I would like to get your thoughts on.
What's your take on that?
Yeah.
So I haven't tried too much in this space yet, but I catch up regularly with our colleague Matteo Vaccari, who has a lot of background in the different testing traditions and all of that.
And he's thinking a lot about this.
And one approach that he has used on multiple teams now already at this point, I think, is he's been trying to find a good place for almost acceptance tests.
I think broader acceptance tests are becoming a lot more popular now because of this, which has advantages and disadvantages.
But basically, so for him in a lot of these teams, it was like the HTTP API kind of entry point.
And then at that point, you have functionality that always has input and output.
And it's always like request response in the case of HTTP APIs, which is useful.
So then he custom-built himself a little test runner where the input and output are always written in a way that is easy for a human to review.
A comparison would also be BDD, behavior-driven development, frameworks that also have input-output scenario descriptions that are easy for a human to review.
And then with that, he focuses a lot of his review on those tests.
And then I don't know what he does with unit tests, actually, like how deeply he reviews those as well.
And then, I mean, what I've seen in my code base is that those acceptance tests then really drive up the coverage, right?
But then often they don't do as many assertions on every single little detail, right?
So then that's kind of like dangerous that then we might have some gaps there that we can catch again with mutation testing, right?
But this, I think it's called the approved scenarios pattern or something.
I've forgotten the name of the person who described it.
There's a website somewhere about AI-assisted coding patterns.
And so this approved scenarios pattern is what he's been using and has quite liked.
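A test runner in this spirit might look like the sketch below. It is not the runner described in the conversation: the `Scenario` shape, the in-memory `handle` stand-in for an HTTP entry point, and the text format are all assumptions chosen so that each scenario reads as a plain request and an approved response.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    request: str   # e.g. "GET /orders/42"
    approved: str  # the response a human has reviewed and approved


def handle(request: str) -> str:
    """Hypothetical in-memory stand-in for the real HTTP entry point."""
    method, path = request.split(" ", 1)
    if method == "GET" and path == "/orders/42":
        return '200 {"id": 42, "status": "shipped"}'
    return "404 not found"


def run_scenarios(scenarios: list[Scenario], handler) -> list[str]:
    """Return a readable diff-style message for every scenario whose output changed."""
    failures = []
    for scenario in scenarios:
        actual = handler(scenario.request)
        if actual != scenario.approved:
            failures.append(
                f"{scenario.name}\n"
                f"  request:  {scenario.request}\n"
                f"  approved: {scenario.approved}\n"
                f"  actual:   {actual}"
            )
    return failures
```

Because the approved responses are plain text, a reviewer can focus on whether each input and output pair describes behavior they actually want.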
But yeah, it still depends on what type of code base it is, what type of functionality.
So it's like, I still don't see really like big patterns on the horizon how to make this easier.
And Nat Pryce and Steve Freeman, in their GOOS book, Growing Object-Oriented Software, Guided by Tests, talk about this approach as well, what they call outside-in testing, where they start with higher-level tests, acceptance tests, as you're calling them, and then move in to say, okay, we'll write more finer-grained tests.
And then finally, when we are at a place where we feel like we've got most of it there, we'll try and retire some of those higher-level tests and then rely more on the lower-level ones, because obviously these unit tests run much faster and a lot more often.
But that does require a lot of discipline.
But it's a technique that has existed obviously for a long time.
There's so many things to rediscover, right?
And I mean, this testing thing is also an example.
Another concept I brought up in the article is that then I try to think about sometimes like almost different dimensions that I'm harnessing, right?
Like, because in some areas, it's a lot easier than in others.
Like for the behavior, it's like a lot trickier, I would say, and we still need a lot more human involvement.
And then for maintainability and internal code quality, the whole static code analysis stuff and structural tests and stuff like that is a lot easier maybe.
And then we can also think about other dimensions that we're almost like regulating with these feed-forward and feedback loops, right?
Like our architecture fitness, for example, right?
So what are our executable architecture fitness functions that we can give as sensors to agents and so on?
So I think that's also useful to just think of it in different dimensions that we're regulating and not trying to do everything at once.
And, you know, harness is just a word for all of those things.
And you can do lots of little small things in that big spectrum of sensors and guides and, you know, dimensions.
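An executable architecture fitness function for module boundaries can be quite small. The sketch below assumes a hypothetical layering rule (code under `app.domain` must not import `app.web` or `app.persistence`) and checks only absolute imports; relative imports are not resolved here.

```python
import ast

# Hypothetical layering rule: modules under the key must not import the listed packages.
FORBIDDEN_IMPORTS = {
    "app.domain": ("app.web", "app.persistence"),
}


def _is_within(module: str, package: str) -> bool:
    return module == package or module.startswith(package + ".")


def boundary_violations(module_name: str, source: str) -> list[str]:
    """Report imports in `source` that cross a forbidden module boundary."""
    banned = [
        package
        for owner, packages in FORBIDDEN_IMPORTS.items()
        if _is_within(module_name, owner)
        for package in packages
    ]
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported = [node.module]
        else:
            continue
        for name in imported:
            for package in banned:
                if _is_within(name, package):
                    violations.append(
                        f"{module_name}:{node.lineno} imports {name}, "
                        f"but {package} is off limits for this layer."
                    )
    return violations
```

Run as a sensor, a check like this gives the agent a deterministic signal for the maintainability and architecture dimensions, while behavior still needs human review.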
What's old is new again, is what I'm hearing you say, you know.
And I thought that was funny.
You mentioned the fact that we seem to constantly rediscover these things.
And I've seen that throughout my career, right?
These things that we've learned, and then, I don't know, a whole new wave of developers comes in and we need to reteach them these things over and over and over again.
And a lot of what I'm hearing in this conversation is, these are things that we have been doing for many, many years.
I mean, I feel like shift left has been like the defining concept of software engineering from like day one.
And this just feels like yet another example of that and the importance of having layers of these things.
You know, I mean, I think we've all been involved in some of those conversations about, you know, do we need unit tests or do we need integration tests?
And it's like, well, yes, you need all of them.
And the exact ratios and percentages, it depends on your project and where you're finding pain.
You know, these fundamentals that many of us have learned the hard way are still as important today as they were when we were using zeros and ones.
Right.
Arguably more, even more important.
Let me push on cost a little bit.
And I think Nate talked about it slightly earlier in this.
We've got a lot of these inferential sensors now moving left.
You know, they're running pretty much maybe after every change, definitely after each commit.
And now you've got a pretty large organization.
Now, is this something that is, one, sustainable?
And have we gotten to a point where the investment actually starts paying off?
Or is it too early to call that?
Yeah, I mean, the investment of sensors, I think, can pay off in different ways, right?
One can be less token usage, but another can also be just having higher quality code in general.
Or, yeah, we've always thought about, in our path to production, where do we put certain things?
Right.
We just talked about the tests and the test pyramid.
And there's always this understanding that some tests are more expensive than others.
Right.
Both in terms of maintenance, but also in terms of like how long they take to run.
Right.
And you always want quick feedback loops, especially in the beginning.
Right.
So some of these inferential sensors also just take a while to run.
Right.
So they give me like a longer feedback loop.
So that might be another reason why I don't want to run them constantly during my session.
So then you kind of think about how do you distribute these things strategically across your path to production, across your pipeline?
Right.
So what do I run before I even commit, before I integrate?
What do I run on a pull request or in the pipeline?
And then what's also happening a lot, that's also mentioned in the article from OpenAI, but I've heard stories about that from multiple teams at Thoughtworks as well, is like, what do you run continuously or on a schedule?
So let's say once a week or every two days or stuff like that, right?
We already have always done that with things like dependency vulnerability scanners, right?
We also run them regularly, even when there's no change, because they relate to our environment, right?
The environment changes around us, right?
So there's all of these different things to consider in where you put these.
And for the, you know, running them repeatedly, the OpenAI team calls that garbage collection, right?
Because even with like all of these like guides and sensors and this harness that they designed, they still saw technical debt compound over time.
So they have things running continuously that double check and review all of these different dimensions, right?
And see like where the garbage is piling up or where the debt is compounding.
And so again, as an example in the application that I'm working on, which is a small internal application, I have three of those in place that, you know, I could maybe run once a week or so, and then a human would look at the result.
One is like a security review.
That's basically a prompt derived from our internal security checklist for internal applications.
Another one is, you could call it an architecture fitness function review, that looks at, specifically for this application, some of the things about how we want to handle data and never show certain data on the UI, and just double-checks that we're really still doing that.
So it's kind of about sensitive data and stuff like that.
And the third one is about dependency freshness.
So it's actually a script that checks, that looks for the dependencies that are quite old, like over six months old or something.
And then AI takes the result of that script and creates a report with web research and so on, about like, oh, this one seems to be deprecated, or this one is really outdated.
You should look for an alternative, stuff like that.
So it's a nice example of a combination of first giving AI a leg up with the script.
So you don't have AI go off on web research tangents, but you just have a deterministic script that tells it what's outdated.
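A minimal sketch of that combination might look like the following, assuming the release date of each dependency is already known. The dependency names, the 183-day threshold, and the prompt wording are hypothetical; the actual script and prompt are not described in detail here.

```python
from datetime import date, timedelta

# Roughly six months; the exact threshold is an assumption.
MAX_AGE = timedelta(days=183)


def find_stale(release_dates: dict[str, date], today: date) -> list[tuple[str, int]]:
    """Deterministic step: return (name, age_in_days) for old dependencies, oldest first."""
    stale = [
        (name, (today - released).days)
        for name, released in release_dates.items()
        if today - released > MAX_AGE
    ]
    return sorted(stale, key=lambda item: item[1], reverse=True)


def build_prompt(stale: list[tuple[str, int]]) -> str:
    """Inferential step input: hand the agent a narrow, pre-computed list."""
    lines = [f"- {name}: released {age} days ago" for name, age in stale]
    return (
        "These dependencies look outdated. For each one, check whether it is "
        "deprecated and suggest alternatives:\n" + "\n".join(lines)
    )


if __name__ == "__main__":
    # Hypothetical data; a real script would read this from a lockfile or registry.
    deps = {
        "left-lib": date(2025, 1, 10),
        "mid-lib": date(2025, 10, 1),
        "fresh-lib": date(2026, 4, 1),
    }
    print(build_prompt(find_stale(deps, today=date(2026, 5, 14))))
```

The script decides what counts as outdated, so the agent only has to research the short list it is given.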
Yeah, but those are examples of these continuously running ones.
From the ThoughtWorks teams that I've talked to, there were examples like an organization that had some tech debt in their APIs, for example.
So they almost created like API linters and API reviewers that would go through all of the APIs in the organization.
And of course, in some cases, you can also immediately trigger new coding agents that make pull requests to suggest the improvements so that the human doesn't even have to do those but review the pull requests.
Yeah, so I diverged a little bit from your question, Prem, about cost, right?
But this is also something to consider when you really think about where do you put what and how often do you run it?
And then, over time, it's another way of steering this, in that we constantly want to balance these things, right?
And learn from it.
But that tech debt is a cost too, right?
And I think that's an interesting sort of thing here, that I think we've always had that as an ideal, that we would stay on top of our dependencies and that we wouldn't let that creep into our code base.
But there's just a certain level of that that you just can't stay on top of as humans.
We have other things, other demands.
Oh, this feature needs to get done.
Oh, there's this critical defect.
And so we end up kind of sliding that aside.
And I think we've always kind of had a little bit of, I don't know, maybe shame's too strong a word, but like guilt over, oh, we let it get, you know, the code base got out of hand again.
But it feels like some of these tools, if applied, can prevent that or can take care of some of that toil, and maybe we end up with much cleaner code bases than we've had in the past.
Right.
And then there is the question, right?
I mean, who owns the harness?
Do platform teams now build it for app teams or does each app team own their own harness or is it a combination of both?
What's your thought on that?
Yeah, there's like lots of open questions, right?
Also, like, what are other tools that can help us with this?
Or like, how do we keep things consistent and coherent, right?
That the guides still match what we have in the sensors and all of that, right?
Who owns it?
Like, I think it would definitely be a combination of things because it's hard to like, say the word like, the harness and then immediately you know what it is.
Because like I said before, I think it's lots of different small things, right?
It's just like the system, right?
And then people are responsible for different components.
But it's probably a combination of both, that you have something from some kind of platform, some central skills that everybody can use.
But then there's stuff that you have in your own code base.
But then you have to think about how do you update it?
How do you distribute it?
How do you version it?
How do you test that it's actually making things better?
I mean, we call it so highfalutin, like, oh, it's engineering.
It's harness engineering.
Like, how do we actually know that what we're doing is making any difference, right?
So I also played around with that a little bit.
And that's what I mean by there's a lot of potential here for tooling, right?
Where I had my little sidecar also log all the time, regularly the status of the sensors.
So I could make some visualizations showing me, during a coding session, what some of these sensors were doing: like, you know, the linting would suddenly be at nine errors, and then they would go down again because the agent would respond to it.
I imagine you can also then take that even further and log what types of things are going wrong.
And then you have statistics about like, what is the most commonly wrong thing that the sensors catch?
And then think about putting that into your guides because it might even avoid the self-correction in the future.
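As a rough sketch of those statistics, assuming the sidecar writes one record per finding, counting the most common ones could look like this. The `Finding` record shape and the sensor and rule names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    sensor: str  # hypothetical sensor name, e.g. "lint"
    rule: str    # hypothetical rule id, e.g. "unused-import"


def most_common_findings(log: list[Finding], top: int = 3) -> list[tuple[str, int]]:
    """Count findings per sensor and rule, most frequent first."""
    counts = Counter(f"{f.sensor}:{f.rule}" for f in log)
    return counts.most_common(top)


if __name__ == "__main__":
    # Hypothetical log collected over several coding sessions.
    log = (
        [Finding("lint", "unused-import")] * 3
        + [Finding("types", "missing-annotation")] * 2
        + [Finding("lint", "line-too-long")]
    )
    for key, count in most_common_findings(log):
        print(f"{key}: {count}")
```

The top entries are candidates for a line in the guides, so the agent avoids the mistake up front instead of self-correcting after the sensor fires.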
Right.
So that's, that's an open question.
And then also all of these sensors, is it just going to get too much, right?
Both for the agents, like I can already see like the more rules I activate, it always finds more stuff.
Is it just getting too much?
Right.
And when those rules, when the sensors are actually in conflict with each other, how does an agent make the trade-off decision?
So that might get interesting as well.
And then also for the humans.
Right.
So if we have that garbage collection running all the time and all of these things, it will just create even more pull requests, even more reports for us to look at.
And then how do we prioritize those still?
Right.
Will we just get another version of the signal-to-noise problem that we already had previously, just at a higher level or in a more sophisticated way, question mark, right?
So there's lots of things to figure out to see how this turns out in practice, I think.
Seems like an interesting evolution of what we do as software engineers.
And I think there's been this, I would almost call it a fixation on the idea that software engineers just write code, and I'd say, well, code is part of the job.
It is maybe an output of the job.
But I think any of us who've done it for a period of time realize that, you know, that's not actually the only thing we're doing.
We're not just typing, you know, for six, seven, eight hours a day.
So if a senior practitioner or any practitioner for that matter is listening to this, already running coding agents pretty much every day, and they walk away wanting to do one thing on Monday, what is it that you would tell them?
Think about how you can cut your markdown by 50%.
What would you do?
There you go.
Yeah.
Thank you, Brigitte.
I really appreciate it.
So the article is on martinfowler.com called Harness Engineering for Coding Agent Users.
And I would definitely encourage everyone to read it.
And thank you all for tuning in.
Until next time, it's the ThoughtWorks Technology Podcast signing off.