# Resilience Engineering: Leveraging Software Failures to Enhance Architecture

**Podcast:** The InfoQ Podcast
**Published:** 2026-03-31

## Transcript

If your team has AI running in a proof of concept, but you're still figuring out how to run it reliably in production, you're not alone.
That's the gap most engineering teams are navigating right now.
QCon AI Boston this June 1st and 2nd brings together senior engineers, software architects, and technical leaders who've already made that shift.
They'll share the patterns that scaled, the mistakes that didn't make the blog post, and what they'd actually do differently.
No hidden product pitches, just senior practitioners helping senior practitioners.
Learn more at boston.ai.
Welcome to the Architects Podcast, where we discuss what it means to be an architect and how architects actually do their job.
Once again, as we have done several times in the past, we are going to talk about something that is very important for architects, but is not often explicitly discussed.
We are going to focus on how to use software failures to improve software architecture.
Today's guest is Lauren Hoxton, who is a staff software engineer, reliability, at Airbnb.
He was previously senior staff engineer at Coupon, senior software engineer at Netflix, senior software engineer at Sendgrid Labs, lead architect for cloud services at NIMBA Services, computer scientist at the University of Southern California's Information Sciences Institute, and assistant professor in the Department of Computer Science and Engineering at the University of Nebraska-Lincoln.
Lauren has a Bachelor of Computer Engineering from McGill University, an MS in electrical engineering from Boston University, and a PhD in computer science from the University of Maryland.
He is a proud member of the Resilience in Software Foundation and the Resilience Engineering Association.
Welcome to the podcast, Lauren.
Hey, Michael.
Reliability looks very different if you come at it not from the perspective of an architect, but from the perspective of site reliability engineering.
How'd you decide to be interested in reliability?
And how is the perspective of a reliability engineer different from that of an architect?
I'll just start off by giving the standard disclaimer that these are just, you know, my opinions and not my employer's.
I don't think anyone wakes up and decides to be a site reliability engineer, an SRE, like there's no path explicitly for that.
I was a traditional software engineer at the time.
I applied to Netflix on what was called at the time their traffic and chaos team, and that was building fault injection tools, right?
Like I worked on Chaos Monkey, I wrote the second version of Chaos Monkey.
And what I found on that team was that I became a lot more interested in how the system actually failed, like real failures, than in the sort of synthetic ones that we were injecting into the systems.
You know, we would try to intentionally make things break, like what happens if you fail requests to this non-critical service, or what happens if you add latency here.
But like real incidents I discovered, like looking over at what the SREs were doing, seemed a lot more interesting to me.
And so I just sort of got hooked on them and I moved over to what was called the core team at Netflix, which was the central SRE team, and they were the ones that had to do incident management.
The incident management itself is not my personal passion.
I do it because it lets me get closer to the incidents and do like analysis type stuff.
But that's really what got me hooked: learning how these systems fail in very bizarre ways.
Were you able to incorporate any of that reality into Chaos Monkey?
Or was that just not practical?
I mean, Chaos Monkey itself is relatively simple.
All it does is terminate instances.
So it just turns off virtual machines.
But that's only one kind of failure, right?
So another system we built on that team on the Chaos team was called the Chaos Automation Platform, where we would run experiments by selectively injecting RPC failures, right?
So Netflix operates what's called a microservice architecture, right?
Where there's a whole bunch of different services that are talking to each other.
And so it would try to say, well, what happens if there's some failure from service A talking to service B, right?
Which is different than just, you know, a server, or a container on a server, or a pod going down in service B.
And so there we were relying on an existing fault injection library that had been built at the platform level where you could inject failures into the RPC calls.
But really, like I didn't see a lot of feedback go back from the like actual incidents we were having into the tooling that was being built to support that.
Like the tooling worked based on like a certain set of failures that you could actually inject based on the libraries that things were built on.
But the truth is that like real incidents happened because of like a confluence of different things happening at the same time, right?
And so typically when you do like a chaos experiment, you're failing one thing at a time.
I've never seen people like actually try to do multiple ones.
And you don't know, but there's so many different possible combinations that you just can't cover that space.
So the real incidents are just too messy to be able to generally reproduce in a way that's generalizable, that like wouldn't just stop that specific one from happening again, which generally people are pretty good at preventing, anyways.
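The single-fault RPC experiment described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Netflix's fault injection library or the Chaos Automation Platform; `FaultInjector`, `InjectedFault`, `get_recommendations`, and `render_home_page` are hypothetical names introduced here, and the "RPC" is a plain function call standing in for service A talking to a non-critical service B.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real RPC error when a fault is injected."""


class FaultInjector:
    """Hypothetical wrapper that fails or delays calls between two services."""

    def __init__(self, failure_rate=0.0, added_latency_s=0.0, rng=None):
        self.failure_rate = failure_rate
        self.added_latency_s = added_latency_s
        self.rng = rng or random.Random()

    def call(self, rpc, *args, **kwargs):
        # Add latency first, then decide whether to fail this call.
        if self.added_latency_s > 0:
            time.sleep(self.added_latency_s)
        if self.rng.random() < self.failure_rate:
            raise InjectedFault("injected failure on call to " + rpc.__name__)
        return rpc(*args, **kwargs)


def get_recommendations(user_id):
    # Stand-in for service A calling non-critical service B.
    return ["title-%d" % user_id]


def render_home_page(user_id, injector):
    # Service A should degrade gracefully when service B fails.
    try:
        recs = injector.call(get_recommendations, user_id)
    except InjectedFault:
        recs = []  # fall back to an empty row rather than failing the page
    return {"user": user_id, "recommendations": recs}
```

Calling `render_home_page(7, FaultInjector(failure_rate=1.0))` exercises the failure path for every call. Note that the sketch injects one kind of fault on one call path at a time, which is exactly the limitation discussed above: it says nothing about several things going wrong together.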
So you would say though that Chaos Monkey was still useful, it's just not useful enough.
Yeah, so ChAP was the more general experimentation platform.
Chaos Monkey was useful in forcing people to think about a certain type of failure.
It was a forcing function for the architecture.
So you needed to be able to withstand a particular instance or pod going down at any point in time.
And so you couldn't maintain like state on that thing that might just go away, or you had to have like a cluster, right?
You can't just have one thing running because when you take it down, the whole thing goes down.
So what I had noticed when I got there is that the real value in the chaos stuff is do people feel comfortable turning it on, right?
Like if you say, no, we can't do that experiment to kill these instances, or can't do that experiment to fail calls of this non-critical service because I know the system's gonna break.
Well, there you know what your problem is.
You know, you need to architect your system to withstand that, right?
But once people have done that, Chaos Monkey can sort of test regressions to make sure that like you haven't fallen back and are now vulnerable again.
But generally speaking, its work has been done.
People have internalized those rules, they've incorporated that into their architectural designs already.
And so most of the value I would say is in forcing people to think about okay, how do I actually architect my system so that it can withstand those failures, right?
So how do I build in fallback behavior so that when this you know non-critical service is down, I can serve stale data or I can serve some reasonable thing.
Do I have like, you know, retries set up correctly?
Things like that.
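As a rough illustration of the fallback behavior mentioned above (retry a call to a non-critical dependency, and serve stale data if it stays down), here is a minimal Python sketch. The class name `StaleCacheFallback` and its parameters are hypothetical and chosen for this example; a production version would also need cache expiry, timeouts, and limits on retry traffic.

```python
import time


class StaleCacheFallback:
    """Hypothetical helper: retry a call, then serve the last good value."""

    def __init__(self, fetch, max_attempts=3, backoff_s=0.0):
        self.fetch = fetch
        self.max_attempts = max_attempts
        self.backoff_s = backoff_s
        self._last_good = {}

    def get(self, key, default=None):
        for attempt in range(self.max_attempts):
            try:
                value = self.fetch(key)
            except Exception:
                # Wait a little longer after each failed attempt.
                time.sleep(self.backoff_s * (2 ** attempt))
                continue
            # Remember the fresh value so it can be served later if needed.
            self._last_good[key] = value
            return value
        # All attempts failed: serve stale data if we have it, else a default.
        return self._last_good.get(key, default)
```

The design choice worth noticing is that the caller never sees the dependency's exception: it gets fresh data, stale data, or a reasonable default. Whether stale data is acceptable depends on the service, which is the architectural question the chaos experiments force teams to answer.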
So that raises the question then, if it's only from real incidents that you can gather knowledge of how a complex system fails, how do you get this knowledge back to the architects so that both the system architecture that you're working on can be improved, but also you learn something for the future.
I think that is the key question, right?
I would say the hardest problem in an organization that's non-trivial in its size is how to get the right information into the heads of the people who need it, right?
Because there's too much information.
Like you could spend all your time, say reading docs or something like that and do no other work and still not even absorb everything that you would need to know.
What I would say to that is you would want the architects to attend the incident review meetings, right?
Most companies, at least all the ones that I've been at, you know, after at least some of the incidents, typically the more severe ones, there's some sort of incident review meeting that is open to the entire company where they go over what happened in the incident.
And that is a great way to learn about not just failures, but actually how the system normally works, right?
And you will learn things about how the system works that you would have never known, even if you had initially designed the system, because like people use it in ways that are surprising.
The changes happen over time, right?
That invalidate initial assumptions about how the system works.
And you can't see that stuff normally.
But when it breaks, that is when we have a chance to actually spend time looking into it.
And also there's the post-mortem document or the incident write up.
So reading that write up and attending the review meeting and actually being able to talk to people who are involved and have conversations about it, I think is like where the real value is.
Do you actually find that happening?
Or is this something that just architects are not interested in?
I do see it happening.
I do see like high-level people attending incident review meetings.
I don't know how much they're internalizing generally.
So one of the challenges is like, how do you know what people are learning, right?
And are they learning?
And I will say one thing that is very, I don't know, disappointing, and one thing that is very encouraging to me.
The disappointing thing is one thing that I will try to do in incident review meetings is there's usually a Slack channel associated with that.
And I will say if you learned anything in this incident review, put it in this Slack thread.
And there's really very little traffic in that thread.
Not many people actually post there, unfortunately.
Although I try to put in stuff that I've learned.
On the other hand, people keep coming back.
I would say that like the incident review meetings that I've attended are surprisingly well attended, including some, you know, sort of high-level engineers we basically call architects.
And so they've got to be getting something out of those meetings, right?
Like these are optional.
They don't have to come unless you know they were directly involved.
And yet they still keep coming.
So they must be getting some kind of value out of these.
Sometimes they contribute and like make suggestions about how you could architect the system differently, but not always.
And so I assume, based on the fact that like their time is scarce, just like all of our time is scarce, they are making time to attend these meetings.
So they must be getting stuff out of it.
I mean, from my experience, you often learn more from failure than you do from actual success.
As disappointing as the failure is.
I mean, one of the things that comes to mind is, you know, on December 5th, they had that Cloudflare outage.
And to my understanding, what they were trying to do is actually improve the system and wound up destabilizing the system.
Yeah, I mean, I have found that, you know, I even have a conjecture about that, which like some of my colleagues call Lauren's Law, which is that like once you reach a certain level of reliability, then all of your like large failures are gonna be either because someone was taking some action to mitigate a smaller incident, and then something happened during that mitigation to make it worse, or it's some subsystem that was designed to improve reliability, had some unexpected interaction with the rest of the system, right?
We talk about like simplicity being important for reliability, but if you look at any real system, the ones that have gotten more reliable, they've added complexity over time to increase that reliability, right?
I mean, even if you look at like a car, like look at seatbelts or airbags or anti-lock brakes, those are all increases in complexity to the system.
There's a trade-off, and like it's a good trade-off.
I'm glad we have, you know, anti-lock brakes and seat belts and airbags, and I'm glad that we have, you know, health checks and load balancers and all sorts of monitors and things like that.
But these things, just like your immune system can attack itself, these complex reliability systems that are monitoring and trying to like take action, things can go wrong because you can't see the entire space.
You didn't realize that like this other thing would happen at the same time.
There's this latent bug we never hit until now.
Pretty much every large-scale Amazon incident for which I've read the public write-up, for example, is like that.
It's always some monitoring system or some system that's designed for reliability.
Because we talk about trying to reduce complexity, and that's good, but at the end of the day, we're always increasing complexity to increase reliability, and that creates new complex failure modes.
And that's sort of just life.
Well, that's interesting.
You talked about the example of the immune system.
Because in the human body, one of the most fundamental functions of the human body is to maintain homeostasis.
And the problem with maintaining homeostasis in the human body is you have all these feedback loops going on.
It's a very complicated, nonlinear system.
When certain things get out of bounds and interact with other things, you get things like the immune system attacks itself.
So it would seem to me that these large complicated systems have some notion of homeostasis.
And because they get complicated and you get external pressures and internal pressures on various things, they can wander off out of homeostasis and you wind yourself up with these problems which are inevitable.
I think failures are fundamentally unavoidable.
We can recover quickly or well, like better, we can do better or worse, but really we cannot build a perfect system that never fails, right?
Like the world's just too complex.
No human being can understand all of the code that people are all changing at the same time and the changing traffic patterns internally and you know, changing things underneath you.
The world is just too dynamic.
And like you said, the way you deal with dynamism is through feedback, right?
Like that's how we build control systems.
But like feedback always has a risk of instability.
I mean, once upon a time I studied electrical engineering, so much of control theory is like stability, right?
Like, how do you ensure that the system that you're building doesn't go unstable?
One of the most common failure modes I've seen is what's called saturation, where something gets overloaded.
That is extremely, extremely common in the complex systems failures in distributed systems.
So like it could be that all the logic is correct, but the cloud's not fully elastic.
You know, it might eventually run out of resources.
You probably hit some limit somewhere, and then bad things happen.
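The saturation failure mode can be shown with a toy model in which all the logic is correct and the only problem is a fixed capacity. The Python sketch below is an illustration under simple assumptions (discrete ticks, a fixed number of requests served per tick, and a bounded queue); the function name `simulate_backlog` and its parameters are hypothetical.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick, queue_limit):
    """Return (backlog after each tick, total dropped requests)."""
    backlog = 0
    dropped = 0
    history = []
    for arrivals in arrivals_per_tick:
        backlog += arrivals
        # Serve up to the fixed capacity for this tick.
        backlog -= min(backlog, capacity_per_tick)
        # Anything beyond the queue limit is rejected.
        if backlog > queue_limit:
            dropped += backlog - queue_limit
            backlog = queue_limit
        history.append(backlog)
    return history, dropped
```

With a capacity of 10 per tick and a queue limit of 5, arrivals of `[8, 8, 12, 12, 12, 8]` produce a backlog that stays at zero while load is under capacity, then grows each tick once load exceeds it, until the limit is hit and requests start being dropped. Nothing in the code is wrong; the system simply ran out of a resource.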
I had David Blank-Edelman on the podcast a couple of months ago.
And one of the things he mentioned, which sort of goes to your point, is that very often we should focus on how the system actually worked.
And it's sort of a miracle sometimes that it actually does work and doesn't fail more than it does.
It kind of is, right?
I mean, look how dependent our entire world is today on software.
And yet we don't have catastrophic failures constantly happening to us.
You know, like things pretty much work.
It's kind of shocking.
It is like legitimately surprising.
There was, I don't know if you know of him, Gerald Weinberg, a famous software author.
He said that if construction engineers built buildings the way software engineers wrote software, the first woodpecker that came along would destroy civilization, right?
But like we've had plenty of woodpeckers and no civilization destroyed here.
I've read quite a few of his books.
And I think The Psychology of Computer Programming is a classic.
He had a very, very big gift for simplifying very complicated things in explanation and getting to the point.
Yeah, he understood the human aspect a lot.
He wrote a book on like general systems thinking, which I like very much.
So, what in some sense you're also saying, and maybe this will appeal more to the engineering mind, is you're not really eliminating risks, you're trading them off.
That's right.
Yeah.
There's a guy named Todd Conklin who works out of one of the national labs, and he's a safety guy.
And one thing I really like that he says is that you don't manage risk.
You manage the capacity to absorb risk, which I think is really nice, right?
Like you can sort of be prepared so that like when the risk happens, you can deal with it effectively, is what you can do.
And so that's one of the big ideas around like this research area that I am really inspired by called resilience engineering, where what you do is you try to in advance build up this capacity, this general capacity for dealing with issues, so that when things go wrong, when something unexpected happens, you are as well positioned as possible to deal with it, to mitigate it, even though you don't know what that's gonna be.
And so, I mean, one of the advantages of the cloud in this sense is you can scale up, right?
Like you throw capacity at it, and often we do that, right?
Like, scaling up is a very common mitigation strategy, and it is because it works well.
Because you can throw more resources at it.
Just staffing on call rotations is a form of generic capacity, right?
It's not the software architecture, it's part of the human part of the entire system architecture, right?
You have these resources that you can make available very, very quickly to solve.
You don't know exactly what the problem is gonna be, but you know they have the expertise to be able to solve that.
And so we don't think about like staffing on call rotations as an architectural concern, but I mean it is part of the architecture of the entire system.
It's just not part of the software architecture.
Well, I mean, if you look at any software problem, any large software problem, it's true, small software problems too, but you have to look at the universe of constraints, including the social ones as well as the technical ones.
My favorite example of this, I'm just thinking a little far afield, but people who advocated, for example, agile development.
That depends critically on having independent software developers who are able to articulate.
And that's only a part of the software world.
There are plenty of software engineers who want to be passive and just do their job and go home at the end of the day.
So you really have to look at all the resources, financial, people, risk.
You know, where are you in terms of risk?
I mean, there's a big difference if you design the software for an airplane or you're designing the software for some video game.
Even just an airplane, if you look at the control software for the airplane versus the entertainment system, right?
Like the entertainment system software.
I've seen many failures in that entertainment system software, but that's fine.
Like, I don't want my ticket to be like five times more expensive to get more reliable entertainment software, right?
It's an inconvenience, but I don't want to pay for that.
But I do want to pay to not crash.
So yes, yes.
I always like to give a very personal example of this: many, many years ago, I wrote some test software for some medical equipment.
And then I left the project and went on to other things.
And then one day I walk into my doctor's office, and lo and behold, he wants to administer a test to me using the equipment that I had written the test software for.
So I said to myself, I sure hope I did a good job.
One thing I'll say about that is in terms of thinking of the larger constraints, say, is that you know, one very common question you're always faced with in like the software world is like build versus buy, right?
Like, do I build this in-house or do I go to a vendor?
And one of the issues that comes up on the buy case is that if there's an incident that involves some interaction between your software and the vendor software, now you have to coordinate across two different organizations.
And like the further you are organizationally from the people you're working with, the harder it's going to be to resolve that.
And I've almost never seen that kind of thing taken into account when making that decision, thinking about, well, what happens when the incident happens and like we don't know if it's us or them or if it's an interaction.
But I mean, it's just an order of magnitude harder.
Just like it's easier if it's just your team versus, you know, another team.
Like they're much further away in the, you know, virtual org chart.
And so that is a constraint that is often overlooked, but it's a real thing, and it's just much harder to deal with across organizational boundaries when an incident happens.
I remember early in my software career where we were building software, and I had to interact with the compiler people and the database people, and then our group was moved to a different building, and all of a sudden, same company, same language, not a different culture.
It became infinitely more difficult to get the information because I couldn't just meet them at lunch or wander into their office if I had a question.
Of course, this was in the days before real email and what have you.
But that I think is underestimated.
Sure, it's funny.
I mean, there's the physical architecture of the organization, right?
Impacts the way the whole system works.
You don't think of architecture in terms of building architecture, but it impacts how the human part of the system functions.
But there's a difference between robustness and resilience.
With robustness, you try to make something as strong as possible, but this is resilience.
It is, yeah, it can adapt to all sorts of things that nature, you know, did not imagine it would have to do.
Right.
I mean, you could say it is an evolving design.
I mean, in other words, you know, the evolutionary pressures act on the body, and the body sort of reacts in a way.
So there's sort of a design to it.
You can look at it and say, how does this work from an engineering perspective, and then abstract away a design, even if there wasn't an original designer.
Yeah.
And you know, one thing that we have in common in our systems is that our systems evolve over time, right?
Like you may have designed it initially, but like there are lots of incremental changes that happen over time.
And that might invalidate initial design assumptions.
And you sort of make, you know, you do the best you can as you're evolving it based on your understanding of the world and the constraints and stuff.
But like we end up with things that are not necessarily optimal based on the problems we're facing today, but we're constrained by history, just like our bodies.
And you know, I can complain about my knees.
I don't think they're particularly well designed, but like that's how they evolved.
And as the systems get older and the technology around them changes, very often, like the aging human, they become less resilient simply because the world around them has changed.
Yeah, it became harder to change, right?
This is something that I think they recognize like in the 70s that like software becomes harder to change over time, right?
But we have to keep changing it.
Yeah.
I mean, the robustness versus resilience distinction is really important because I think it's not super well known in our industry.
Resilience is often used in software as like a synonym for robustness, but they really are different.
Robustness is really like designing for the kinds of failures that you can anticipate.
And there are a lot of failures we can anticipate.
We know a lot of things that can potentially go wrong.
And there's a ton of architectural patterns that are designed to handle known failures.
But we are always going to hit something we didn't design for.
And so that's where the resilience comes in: how can we be best prepared to deal with the problems that we did not anticipate, that we were not explicitly designed for? It may be that, in trying to design to deal with problem X, we are now actually more vulnerable to problem Y.
We didn't even realize that.
And so you need both.
You definitely need both, but our industry historically is really focused on robustness and really doesn't think in terms of like, well, what can we do to generally get better at dealing with the unknown, right?
Like engineers are not good at thinking about how do we deal with problems that we cannot anticipate.
Like prepare to be surprised, you know.
Right.
As Rumsfeld said, it's not the unknowns that we know about, it's the unknown unknowns.
And the amazing thing is it keeps happening, right?
So there are things that I feel like happen to us over and over again, and yet we don't quite internalize the lesson, right?
In every incident, there's often like I never imagined that this kind of thing would happen, right?
And I could tell you the next time an incident happens, that's gonna happen to you again.
I never imagined this would happen.
You know, our field is famous for like not being great at doing estimation, but we seem to make the same mistakes over and over and over again, right?
And I don't think I've seen in my lifetime like a significant improvement in our ability to estimate software project completion time.
This seems to be a very hard problem.
We don't seem to be able to do this very well.
I think other industries have this problem too, and they've sort of owned up to it in a way, because there are things like price escalators or cost escalators in the contract.
I think part of the problem is that in software, unlike other forms of engineering, you're not doing the same thing over and over again.
In other words, you can be a civil engineer and make a career out of building the same bridge over and over again.
That's not how software works.
If I want another copy of Microsoft Word, I just copy the bits.
Intrinsically, you are doing something that has probably not been done exactly that way before.
So it becomes very difficult to find tooling to estimate costs because you're always pushing the frontier in some way.
Yeah, I'm always a little hesitant to compare with other fields just because I don't know them.
Like I never worked in construction, say.
But usually in the construction industry, they have estimation books.
And they know in winter weather, it'll take so long to, you know, even the delays that I've had in home remodeling are usually attributable more to time than they are to cost.
In other words, you know what the parts are.
You know, no one comes into a kitchen, let's say you're doing kitchen remodeling, and decides to put a jacuzzi in the middle of the kitchen.
We do that all the time in software.
There is definitely something, what did Fred Brooks say, like ethereal about the nature of software, right?
On the one hand, we're constrained only by our you know imaginations.
But on the other hand, there are resources underneath, you know, like to sort of circle back.
Like you're running on physical machines and everything is resource constrained.
Like one of the insights I've gotten on the SRE side is that it's not ethereal magic stuff.
There are actual physical and virtual resources that you can run out of.
Just to change focus for a minute, I find that we still have not learned this in society: there is always a great temptation to blame somebody or something.
And if you remember the conversation we had at the end of one of the talks at the San Francisco QCon, somebody raised the objection, well, if you have this focus on not trying to blame humans, which is good because, you know, if you blame humans, then they won't tell you the truth and you'll never find out what really happened.
For example, an airline plane crash.
It's agreed upon that the airlines will bear certain liability, but you're not going to blame somebody in the incident review.
Because if you do, they won't tell you what really happened.
And you won't learn from it.
But in general, we seem not to have learned this, because people want to blame humans.
At the same time, well, sometimes how do you have accountability for this?
Because at some point there is some human responsibility somewhere.
So on that topic, I think it's a very human response to say something bad happened.
Somebody must have done something wrong, right?
Like this is sort of how we understand the world.
And there are some people in my field who like prefer the term blame-aware rather than blameless, because like people are gonna blame.
Like it's just going to happen.
You know, it's just something that humans do.
One of the reasons that I am a big fan of at least the idea of blamelessness is I think that we're looking for systemic problems, not individual ones.
You can look at it in two ways.
You can look at it as hey, somebody did something wrong, right?
Like they didn't test well enough, for example.
And so what do you do?
You tell them to test better next time, I guess.
You sort of admonish them, like, hey, do better next time, right?
Like, what can you really do?
I guess you can fire them.
But if there's a problem that makes it harder to test, right?
Like maybe you can only catch it in end-to-end testing and our end-to-end tests are flaky and they were failing, or we don't have good support for that.
Or they weren't given enough time to do the test.
Yeah, that's a great one.
Production pressure, right?
If there are problems in the system that are increasing the likelihood that errors happen, then if you don't attack those systemic problems, you're gonna have the same sort of issues, because someone else will make those same mistakes, right?
And so if you don't change the system, the system's not going to change, right?
And so that means you have to look for the systemic issues.
And blame doesn't look at systemic issues, it looks at individual ones.
It says, what was the problem with this person, that they weren't following the right procedure or were rushing or whatever?
That doesn't help you improve the system.
So, what I like to do is think about imagine every decision that was made leading up to this was rational, right?
Everyone based on the constraints they were working under and based on the information they had at the time, they made decisions that made sense.
And yet this incident still happened.
How could an incident happen given that everyone is making rational decisions based on their constraints and their local knowledge?
And I feel like you're going to learn a lot more about how incidents happen by doing that, by assuming that individuals are actually doing their job.
In terms of accountability, one of the reasons why I get uncomfortable with that language is that, in my experience, incidents are frequently due to interactions across multiple components or teams or whatever.
And accountability is really about, okay, like who's the throat to choke?
You know what I mean?
Like who's the person who's gonna be on the hook?
But like if you're focused on finding an individual, then you're not going to see the interactions, right?
And those are the ones I worry about a lot more.
And so I don't think accountability can resolve problems that are interactions across teams.
Maybe there's like bad information flow, they don't understand.
And so that's why I'm always a little like I don't know allergic to accountability discussions.
But I understand that like that is one of the tools that management uses, right?
Like we are in large organizations, it's hard to run a large organization.
Like, this is one of the levers management has to ensure things get done.
And so the question is, how do we accommodate the need for accountability with understanding problems that might not be solvable through accountability mechanisms?
And my favorite example of this is airplane crashes because the pilot flips the wrong switch.
Okay, you have a proximate cause: a human being made a bad decision.
The question is, why did this individual flip that switch?
Were there two switches close together that looked alike?
Was the airplane in a mode that no one ever thought it was?
Were the dials wrong, giving incorrect information?
So there could be a lot of reasons.
Yes, the human made the bad decision, but why did the human make the bad decision?
Yeah, I think like trying to get into the heads of the people when they made those decisions, that's the ultimate goal, I think, of a good incident review.
Can we get into their heads to figure out why they did something that from the outside seems bonkers?
Like, why would you do that?
I suppose from the accountability point of view, if a person seems to be involved in a lot of incidents that can't be explained, or people constantly use poor judgment or don't estimate things right, I suppose you could then exercise accountability.
But that's the exception rather than the rule.
That's the result of looking at it through a blameless lens.
I mean, there are sometimes issues of competency, but my hypothesis would be that it wouldn't just be incidents where you would see that.
Like if someone is really not competent in a certain way, then I would think a manager should be able to see that in their day-to-day work.
You know what I mean?
That it shouldn't just come out in incidents.
So I would be uncomfortable using incidents as the lens to assess that.
Especially because some services are more critical than others, right?
Like if it's like the front door service, then like anytime there's a big problem, like that, you know, that service might be involved.
Like some services have large blast radiuses inherently because of architectural decisions.
You might be trying to change that, but then you'll see people on that team show up over and over again.
And it's just because of the architecture of the system, and that happens to be a vulnerable part.
I think that does shine a light on maybe you need to make an architectural change.
But I wouldn't say, well, just because someone pushed a change to that particular thing and then it broke, it's like, well, why is it dangerous to make changes to that service?
Right.
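The point about some services being involved in incidents "inherently because of architectural decisions" can be sketched with a toy dependency graph. This is only an illustration: the service names, the `CALLS` graph, and the assumption that a failure always propagates to every caller are hypothetical and do not describe any real system.

```python
# Hypothetical call graph: each service maps to the services it calls.
CALLS = {
    "front_door": ["search", "checkout"],
    "search": ["catalog"],
    "checkout": ["payments", "catalog"],
    "payments": [],
    "catalog": [],
}


def blast_radius(calls, failed):
    """Return every service that directly or transitively calls `failed`."""
    # Invert the graph so we can walk from a service up to its callers.
    callers = {}
    for svc, deps in calls.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(svc)

    affected, stack = set(), [failed]
    while stack:
        for caller in callers.get(stack.pop(), ()):
            if caller not in affected:
                affected.add(caller)
                stack.append(caller)
    return affected


def involvement(calls):
    """Count, per service, how many other services' failures would reach it."""
    counts = {svc: 0 for svc in calls}
    for failed in calls:
        for svc in blast_radius(calls, failed):
            counts[svc] += 1
    return counts


if __name__ == "__main__":
    print(sorted(blast_radius(CALLS, "catalog")))
    print(involvement(CALLS))
```

In this toy graph the front door is reached by a failure in any of the four other services, so its team would appear in most incidents regardless of who made the triggering change.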
Because there aren't that many people on these teams, right?
Teams are generally relatively small.
So it wouldn't surprise me to see some people come over and over again.
And often like the people I see over and over again, they tend to be more operationally sophisticated because they are operating critical services and they need to be able to respond quickly when they break.
And so I will say, as an incident commander, I actually am happy to see people I've seen several times before.
I know them, I trust them.
It's like when this service that is non-critical has some weird behavior and people get brought in and they've never had to deal with this before.
They don't even actually really know a lot about how it works and stuff.
Those are much, much harder.
Those people don't have the scars.
They don't have the operational expertise.
It's like sort of blaming, you know, in a fire, blaming the fire department because they always show up for the fire.
Well, of course they always show up in the fire.
The cause of the fire is someplace else.
Yeah.
I mean, statistically, you know, people in hospitals are more likely to die, but it doesn't mean you should avoid a hospital if you're sick.
So then this raises, and maybe this is the final sum-up question before I get into the questionnaire that I like to ask all the people who appear on the podcast:
Why are these ideas not as widespread as they should be?
At least in my opinion, and I'm sure in your opinion as well.
Is it that the software resilience community has not spread these ideas widely, or has not done a good job explaining them?
Is it not a corporate priority?
Or is it just really different from other engineering disciplines?
I kind of wish I knew the answer to that.
Because you're sort of asking more generally why certain ideas spread and others don't, right?
So there are ideas that we know spread: Agile spread enormously, DevOps spread, and then there are other ones which didn't spread.
I don't really know.
I mean, if I knew what it would take, then I'm one of the people who is trying to make this spread.
These are ideas that came from a different field that we're trying to make spread, but sometimes it succeeds, right?
So lean came in from manufacturing, right?
That has spread very successfully in our industry, I would say, ideas around lean.
So I don't really know what it takes for an idea to spread.
These are like, I want to say sort of like squishy human stuff, but like agile is squishy human stuff.
DevOps is squishy human stuff.
It's kind of related.
So I don't know.
I really wish I did know why it's taking longer to spread.
I got hooked on it through John Allspaw posting on Twitter many years ago.
And he would post like papers and stuff.
And I'm like, all right, to like shut him up, I'll read the papers.
And then I got like hooked on it.
But it tends to have come from an academic-y background.
And like it's hard to transfer academic ideas, I think.
Although you see like success in transferring academic ideas in distributed systems, those that have made it over.
So yeah, I don't really know.
I don't really have a great theory as to why it's not spreading as much as I'd like, but we're trying.
I think we're doing better than we did 10 years ago.
I mean, you would think that economics would force this a little bit.
You would think the examples of the large companies would help, if maybe they would explain a little more how they did their incident reviews when they have these outages at Cloudflare or Amazon or these things.
I think we're making slow progress, but I think it's not like necessary to embrace these to survive.
And so it's kind of amazing, I don't want to say how poorly organizations can do, but organizations don't have to be optimal in order to be going concerns once they reach a certain size and momentum, right?
Like they will eventually sort of decline and fall, but they can take a long time, right?
And so at the margins, I don't know how much of a difference this would make.
Like you wouldn't see it in the short-term success of the company, right?
But I don't know other fields, but people rotate very quickly through our industry in terms of companies, right?
Like if someone's been around for two years, to me that's like, oh, that's a pretty substantial period of time to have been at a company.
Whereas my parents, for example, were at the same company for their entire lives, right?
And so we're very, I don't know, it's very fleeting the experience within individual companies.
And it's like that for all of them.
You'd think they'd all be very vulnerable, but the momentum keeps them going.
And so I would like to say it's like a matter of survival to learn this stuff, but it really isn't.
All companies have a certain amount of like resilience already.
So one thing we do well, right?
When we hire, we hire for expertise, right?
This is one of the things that all companies do.
When you hire someone, you don't say, okay, like tell me what specifically you're going to build inside my company when you come.
Nobody says that, right?
I don't like the way we actually do coding interviews, but we are hiring people for general expertise when we hire them.
And everybody does that and everyone understands, and they pay more money for seniors than juniors because of that.
And that actually goes a long way.
And there's a lot of people behind the scenes doing this stuff implicitly.
I think we could do much better.
I hope things like this podcast like get these ideas out, but I think it's just taken a long time.
Well, I mean, certainly if people move from company to company, that's part of how these ideas spread.
I'm sure you brought these ideas with you.
Is there anything reflecting on the conversation we had that you wanted to bring up that we haven't sort of covered or talked about?
One thing I would bring up is just the idea of storytelling, using stories around incidents to you know inform people about the system.
I think that there's a pressure, once again, from leadership from above to like just give me like the bullet points, like what do I need to know?
But really, we don't know what other people are going to learn from any particular incident.
And like human beings just absorb a lot more content through stories than they do through, you know, a PowerPoint with like bullets on it, a graph.
And once again, in our industry, you know, software engineers and architects, we're not trained to tell stories, right?
This is not something that we learn about in school, say. But at my current company, Airbnb, we actually have a storytelling session that we do once a quarter. It's run by myself and another engineer who came over from Twitter, who was doing it over there.
He brought it to Airbnb.
It's called Once Upon an Incident.
We get like three storytellers once a quarter.
They talk about an older impactful incident.
And we get a lot of good attendance at that too.
And one thing I hope is that it encourages people to tell more stories about incidents, at least internally. It's a way of spreading knowledge, and as human beings, we're just wired for that sort of thing.
I mean, I've always found that when I give a presentation at a conference, even if it's a technical one, if I cast it as a story, people relate to it more than if I just give a dry presentation.
Yeah.
We love it.
Yes.
I mean, you know, if you look at the social science, they claim that, from an evolutionary perspective, perhaps, storytelling was very important in building the earliest human communities.
So to get to the questionnaire: what is your favorite part of being involved with software reliability?
I have to admit, I love a good complex incident.
I love the story of like, oh, actually, this change had been made like two years ago, and no one noticed at the time that it was there, but it set the stage, and then this other change happened here.
I find it fascinating.
I just really enjoy learning about the complexity of how all these different things interacted and happened to like, I don't know, hit through all of our defenses, right?
Just the perfect storm.
So many incidents are perfect storms.
Just learning about the specific details of that, and learning about, oh, like this team assumed that the other team had deployed already because they normally deploy on Wednesdays, but there was something that had delayed them this week.
There's just all these little details about like how the real work gets done in the system.
I love learning how people actually really do their work and how things actually happen.
And a good incident write-up has a lot of those details.
And I just love that stuff.
I read it for fun, kind of thing.
It's almost like it's a murder mystery or a crime mystery.
Sometimes it's like a horror story, and we're like, oh my god, like you can see like the trap has been set.
Like the bug is there, and it's just like someone's gonna hit it.
Yeah.
Right?
You don't know it, and then they take this action.
Oh no, they maybe don't know what's happening.
Just like, you know, in a real horror story: don't you know?
Can't you see Freddy Krueger's behind you?
What is your least favorite part of your job?
I think my least favorite part of my job is the administration stuff I need to do that I don't think advances the business at all, but it has to be done just for the company to go.
So, an example, we just did performance reviews, right?
And I just hate that stuff because I'm like, I don't think there's any real value in doing this.
I don't know.
I mean, I understand why it has to be done, but anytime I'm doing work that I don't think is actually valuable for the company, it's so hard for me to motivate myself to do it.
Now I will say that being on call makes me anxious, and I'm kind of an anxious person.
So I guess my least favorite is being woken up at two in the morning because an alert has fired, and it turns out to not be a real thing.
That's probably pretty high up there in things I don't enjoy.
Do you think, just to throw out something encouraging off the top of my head, that AI might make an impact here, in trying to be the first responder for certain incidents?
And there's a lot of work in that area right now.
There's a lot of different companies that are doing AI SRE stuff.
I'm sort of taking a wait and see approach myself to see like, okay, is it gonna be useful?
Is it gonna save us time?
Is it gonna help?
I don't think it'll take over.
You know what I mean?
Like I don't think it will be 100%, which I would love, if it was 100% and we didn't have to staff humans on call anymore.
But yeah, it's still kind of early.
You know what I mean?
Like now it's like there's a bunch of companies that are trying to do this.
We don't know how well it's gonna work.
I'm sort of being like agnostic.
Let's see what happens.
But I think there's promise there.
I think it could make the easy ones easier, but the hard ones are the ones that I tend to worry about the most.
But that's one less thing to worry about if it's...
Exactly.
Is there anything creatively, spiritually or emotionally compelling about software reliability engineering or being an SRE?
Well, so there's something very synthetic and holistic about the way you work, which is different. Traditional engineering is very analytic, right?
You break problems apart, right?
We do separation.
I mean, this is a very big thing in architecture, right?
Separation of concerns, for example.
You want to decompose the system in a way that lets you work on the individual pieces. SRE is the entire opposite, right?
Because like when everything is working properly, then analysis works great.
You break things down.
But like when something is broken somewhere and the system is not working, now you have to see how does the entire system work to figure out how that goes.
And so I don't know if spiritual is the right term I would use, but it's a very holistic view that I find to be very different than the traditional analytic approach.
And I find it very rewarding to think about that, to think very holistically of the entire system, especially when you start to include the people in the overall system: not just the software functioning, but the people responding to it.
One thing I'll say that I find rewarding.
So, you know, I'm on the incident command on call rotation.
There's an ad hoc team that forms when an incident happens, right?
And the incident commander keeps things moving forward so people don't get stuck, you know, makes sure the different paths get explored and things like that.
And that can actually be very rewarding.
I mean, it's stressful, but you are there to support other people to fix the system.
And like that actually can feel very, very rewarding.
That you are helping other people to help the customers get back.
You're like the doctor helping the patient recover.
Yeah, but I'm helping coordinate other people to do their work.
Like I'm helping other people do their work.
And I personally like that.
I've always been, you know, when I've been in software engineering directly, it's always been like engineering tools.
Like I like helping other people work better.
And incident command is that, not with tooling, but with coordination.
But I find that can be very rewarding.
What turns you off about software reliability engineering or being an SRE?
One of the things I find frustrating is the traditional view about metrics.
That's how leadership deals with things because the world's too big.
Like one of the reasons we use numbers is to make it easier to deal with the world, right?
The world's big and messy and complex.
And I understand why leadership does that, but I find it frustrating to like boil things down.
So like, what was the time to resolve this incident?
What are the trends on that?
I don't like having to record those numbers and have to like do trends on them.
I don't think that's insightful, but that's one of the things that gets asked for.
So it's when I'm asked to do things that I don't find constructive and that take my time.
And those metrics are one of those things.
Was this a SEV1 or a SEV2?
I find those like which bucket, whenever you're asking like which bucket something falls in, I really don't enjoy that.
There's no insight to be gained.
You're not learning anything more about something by asking which label you want to put on it, which bucket it goes in, A or B.
I mean, part of that, of course, is the fact that when you reduce things to numbers, what can't be measured gets neglected.
Right, exactly.
And also, if you have simple metrics, sometimes metrics really need to be triangulated.
In other words, it's not just the one metric, how long did the incident take, but also what was the complexity of the incident?
You have to take several numbers and put them together, rather than rely on single numbers.
And so John Allspaw talks about how there's a distinction between a complex incident handled well and a simple incident handled poorly, and they both might have the same resolution time, you know, and just looking at that resolution time doesn't tell you how well people did in responding to that incident.
Yes.
Do you have any favorite technologies?
Oh, I have a soft spot for Clojure, but I've never actually used it professionally.
I just use it for hobby-ish stuff, like when I write my own little scripts, I do them in Clojure. I've also enjoyed playing with some of the formal modeling tools.
So like TLA+ and Alloy are these lightweight formal methods tools that I've historically been interested in.
What do they do?
These modeling tools?
So those are tools that are used to build a mathematical model of a software system, and then you use that to check that some property holds.
So for example, I want to model a concurrent algorithm and check to see that there's never a deadlock.
You never have two threads in the same critical section at the same time, stuff like that.
Right.
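To make the idea of checking a property against a model concrete, here is a minimal sketch in plain Python. It is not TLA+ or Alloy and does not use their syntax; it simply enumerates every reachable state of a tiny, hypothetical two-thread model and checks the two properties mentioned above (no deadlock, never two threads in the critical section). The model, the state encoding, and the function names are all assumptions made for illustration.

```python
from collections import deque


def next_states(state, use_lock=True):
    """Return all states reachable in one step from `state`.

    A state is (pcs, lock_held): pcs is a tuple holding "idle" or "critical"
    for each thread. This toy model is an assumption for illustration only.
    """
    pcs, lock_held = state
    result = []
    for i, pc in enumerate(pcs):
        if pc == "idle":
            if use_lock and lock_held:
                continue  # thread i is blocked waiting for the lock
            new_pcs = pcs[:i] + ("critical",) + pcs[i + 1:]
            # Entering takes the lock, but only if the model uses one.
            result.append((new_pcs, use_lock))
        else:
            # Leaving the critical section releases the lock.
            new_pcs = pcs[:i] + ("idle",) + pcs[i + 1:]
            result.append((new_pcs, False))
    return result


def check(use_lock=True):
    """Explore every reachable state breadth-first and check both properties."""
    initial = (("idle", "idle"), False)
    seen = {initial}
    queue = deque([initial])
    violations, deadlocks = [], []
    while queue:
        state = queue.popleft()
        pcs, _ = state
        if pcs.count("critical") > 1:
            violations.append(state)  # mutual exclusion is broken here
        successors = next_states(state, use_lock)
        if not successors:
            deadlocks.append(state)  # no thread can take a step
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen), violations, deadlocks


if __name__ == "__main__":
    print("with lock:   ", check(use_lock=True))
    print("without lock:", check(use_lock=False))
```

With the lock, the search visits three states and finds no violations. Without it, the search reaches a fourth state in which both threads are in the critical section. Real model checkers automate this kind of exhaustive search for far larger models.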
So I've had fun with those, but these are hobby things, right?
Fun things I've done on the side.
That's really it, I guess.
What about software reliability engineering do you love?
I love getting to see the entire system.
That is one thing that I really love.
It's that everyone else zooms in on one specific aspect of the system.
And I love that we get to see the whole thing.
Now, that makes it harder.
And one reason it makes it harder is that you're always gonna hit a problem where you didn't know that thing even existed.
But like I love learning about how that stuff exists.
So I really love that we, as part of our regular work, get to see glimpses of the entire system.
What about software reliability engineering do you hate?
I hate that I cannot show you how many incidents didn't happen because of software reliability work.
Yes.
I can't do an ROI.
It's a little bit like the plumbing where you only notice it if something's not working, right?
And so that's not appreciated.
And so that is one aspect. A lot of the reliability work is around spreading information to different people.
So we don't always have an artifact to show at the end of the day.
Like, look, we built X, right?
The work is often not physically tangible.
And I don't even know, right?
Like I can say, well, look, I'm doing great things, but like you can't see it.
Sometimes I don't know if I'm having impact or not.
That's one of the things that like if I moderate an incident review meeting or help with a write-up, I have no idea whether that's had impact or not.
I will never know, right?
And that can be like a little disillusioning to say, like, I will never know if I'm actually having an impact or not.
What profession other than being an SRE would you like to attempt?
I was a professor once upon a time.
I don't know if I would go back to that.
You know, when I retire, I think I would like to just be a permanent student.
It's not a profession, but that's what I would want to do.
I mean, I loved being in school.
I can see just doing that for the rest of my life once I don't have to work anymore.
Just taking courses and learning about different things.
Do you ever see yourself not being an SRE anymore?
It's hard for me to imagine that.
I've tried to go back to just regular infrastructure platform software engineering, but I keep getting pulled back in to reliability.
And I write about software reliability as a hobby on my blog, right?
So clearly this is where my head is.
And so I think about it too much.
It's just too much a part of my identity at this point that it's hard for me to imagine, unless I get like super burnt out and try to swing back to regular software engineering again.
I think I'm gonna be in it for the long haul.
When a project or an incident review, or however you want to think of a project, is done, what do you like to hear from the clients or your team?
That's a good question.
My favorite is, hey, here's where I used this, right? Here's where it was useful to me that you did this.
You know, on my team we build some tooling; we're not just incident responders. If I see someone using that tooling effectively, then that's actually the thing that gives me the most, I don't know, positive feedback. Like, hey, someone's actually able to use this stuff and do work with it.
It's more than someone saying, hey, this is useful to me. Seeing them actually use it in action is the thing that I think makes me happiest.
And you see the world, at least incrementally, in a better place.
Yeah, and I helped with that. It's funny, I remember when I was earlier in my career I was like, oh, this code I'm writing is never going into production. And then later on I'm like, oh my god, the code I'm writing, it's going into production every time someone flips, like, oh no.
But it does feel good to see people use the stuff that you build.
Well, thank you very much for being on the podcast.
I found the discussion profoundly interesting.
Hopefully the listeners will find it interesting as well.
Yeah, I enjoyed it too.
Thanks so much, Mike.
