# The Architecture of Resilience: Systems Engineering at Scale

**Podcast:** The InfoQ Podcast
**Published:** 2026-04-20

## Transcript

Welcome to the Architects Podcast, where we discuss what it means to be an architect and how architects actually do their job.
Today's guest is Matthew Liss, who is responsible for American Express's data center, resiliency, and multi-cloud strategies.
He is also responsible for all other infrastructure components, including digital workspace, and has operational oversight of site reliability, application support, and mission control for all lines of business.
Matthew brings more than 30 years of infrastructure engineering expertise.
At JPMorgan Chase, he oversaw the design, build, and management of the bank's platform infrastructure.
He was responsible for databases, middleware, and critical infrastructure, including identity, observability, reference data, and cloud integration.
He has also worked at Goldman Sachs, ThruPoint, and Schlumberger.
Away from work, he enjoys spending time with his family, cooking, and traveling.
It's great to have you here on the podcast.
You think of yourself as a systems engineer, but it seems your description of systems engineering corresponds to what I like to think of as architecture and the role of an architect.
Were you trained that way?
How did you arrive at your current role?
Is that something you decided one morning, you woke up and said, today I'm going to be a systems engineer?
Yeah, well, first of all, thanks for having me and really looking forward to this conversation.
You know, I was always a tinkerer, I guess.
I grew up in an age where computers were not ubiquitous or common.
And I had an experience as a kid, and it's kind of instrumental in how my career happened.
We were living in Norway, and my parents, they're not Norwegian.
We moved there when I was a kid, six years old.
And they made a friend who ran the mainframe for the University of Oslo.
One day we went over there.
I was eight or nine at the time.
And he pulled me aside and said, I want to show you something.
He put me in front of a terminal, put his phone in it, and I played chess.
He basically had me play chess against the mainframe.
And that was when I thought, that magic is something I want to be involved with.
So I was always tinkering, putting together computers, writing code, soldering electronics, even though I didn't even know what it was at the time.
But I fell in love with this whole concept of being able to think of something and then go make it happen and manifest itself.
And probably part of it is my father, who was a carpenter; he worked with his hands and manifested stuff.
I knew that wasn't really for me.
I don't have the patience that he had to do that, but I like building stuff.
You know, I started with the electronics, low-level software and electronics really allowed me to be a builder, I guess.
So that's probably the way I just fell into it.
But it was that moment playing chess against the mainframe that set me down my path.
I remember tinkering, you know, similar paths.
It was tinkering, software, push this, see what happens, change that, see what happens.
Yep.
But that is a far way from solid engineering practice.
When you build a platform, and maybe we should talk right now about the nature of platform, you have to sort of distinguish between that urge to tinker and that urge to make something that's solid.
If you build a table and one of the legs is not quite solid, you can still use the table.
But in software, if you build a table and one of the legs is not solid, you can get into a lot of trouble.
You know, systems engineering, the way I think of it, is an apprenticeship, no different than any other craft.
And you get good at a craft by learning from others, from making mistakes and gradually understanding what great looks like.
But it takes experience.
It takes apprenticing.
It takes being willing to take risk and learn from the mistakes and work through it.
I don't think of it as any different than someone apprenticing to become a cabinet maker and learning to make great cabinets.
Like over time, you learn how to do things in the right way.
And I'll use another sort of early career example of how I learned a really hard lesson, but one that stuck with me.
This was a summer job.
I was working for Schlumberger.
I ended up having a career at Schlumberger, but this was during high school.
I had a summer job with them, soldering cables, basically putting together these really complex 40-way, 50-way cables with big connectors on each end and soldering them.
So I had no idea how to solder when I first started at 16.
You know, they gave me two connectors and a bunch of cable and said, like, connect these together.
Like, you know, A goes to A, B to B, and so on.
And my boss spent like half an hour showing me how to solder and said, okay, go build a cable, come back when you're done.
Took me probably two weeks to make my first cable.
Once I got good at it, the same thing took me half a day, but I was learning.
Anyways, I soldered this thing.
I tested it, and it works.
I bring it to him.
I say, I'm very proud of my work.
And he looks at it.
He unscrews the connectors, looks at my soldering.
And then, you know, he reaches into his drawer and pulls out a bolt cutter and then clips my cable in two and gives it back to me and says, do it again.
And I'm in tears.
I'm like, what do you mean do it again?
Like it worked.
I tested it.
You didn't even try it.
He said, I looked at your work and it wasn't solid.
You know, I could see from the way you did it, you didn't do it right.
And he said, do this better, pay more attention to what you're doing, and do it properly.
And these are simple things, like, are you really soldering the joints?
Do they stick?
Or are they just going to stay around for a couple of days and then, you know, break?
That apprenticing and those lessons are really how, at least, my career has been: iterating my way through making small mistakes on a continuous basis, hopefully not too big, taking a lesson from them, also practicing, learning from others, and gradually building up, I hate to say it, a bit of a gut for what is right and what is wrong, an intuition, and then that begets itself.
And over time, you just, yeah, you build.
But I don't know that you can read a book for it.
For lack of a better word, this is something that you build a corpus of knowledge of over time.
This is very interesting on multiple levels, and, you know, eventually we'll get to talk about software platforms.
But this apprenticeship and this development of intuition, I ask myself, do we have enough time for it today?
The way we develop software is very often in a hurry, you know, with Silicon Valley and "break things fast."
Or even, and this is a problem that I've thought about and talked to people about, if AI starts to do the easy coding jobs, where will beginning engineers have their apprenticeship?
And I think this is a very important issue.
I think that's the most profound question, I mean, at least to me right now.
I was lucky I could apprentice.
I could learn to do stupid little things to begin with that gradually became more and more complex things over time.
As I could.
And I think most people our age still do that.
But then you can imagine if you no longer do the stupid stuff because AI is doing that for you, how do you ever learn to do the more complex stuff?
Because you have to learn over time.
So it's a great question.
I don't know the answer to it.
And I think it's for any knowledge worker.
Imagine like legal work or any structured work, accounting and so on.
If you never do the boring stuff because that's now done by AI, then how do you learn to do the more complex stuff?
I don't know, but I'm also somewhat optimistic that we will pivot over time to do that.
I had someone who worked for me once who was a compiler expert, and he was always frustrated that people don't really understand how the CPU works anymore.
They just write code and then it becomes assembly and they don't know what happens under the hood and they should really all know that.
And it's like, but yeah, but you know, they don't really need to because the compiler just does it and does it well.
And the fact is that you don't have to have people write assembly anymore.
It's probably a good thing.
Yeah, but I'll give you a counterexample to that.
The problem comes where the abstraction breaks.
Yeah, 100%.
I remember one time very early in my career that we were programming in Fortran and there was a performance hit every now and then.
And it turned out that the compiler had placed an instruction across a page boundary.
So every so often there was a page fault.
Yeah, and it had to load it.
And that caused the performance hit.
But if you just assume the compiler worked, you never would have found that.
It's a great point.
I think, to your point about abstraction, you want to abstract sufficiently, but you still need people who understand how the machine works.
I mean, it's no different than my car.
I mean...
I cannot pretend to repair my car today.
I could have repaired my car, at least on simple stuff, 20 years ago.
No longer.
It is too obtuse to me, the way that it's built today with modern electronics, for me to be able to do anything meaningful with my car.
And so that abstraction is now beyond me.
But luckily, there's people at a shop that can do that.
And so I do think that software is no different.
But I do think, certainly, to your point on AI, it's very new.
I think that the way that we have been apprenticing in our field for the last 40, 50 years is changing abruptly.
I don't know what it'll look like on the other side of it.
And I think about, you know, I have two boys, both college age.
And five years ago, I was saying to both of them:
You need to do computer science.
You'll be employed for life.
It'll be great.
And I'm kind of happy they didn't listen to me.
Don't get me wrong.
I think there's still going to be a lot of great engineering work.
It will be very different.
And so what I was telling them, like, you should be a coder.
Like, you should learn Java or learn code.
And I'm not so sure anymore that that's going to be the job of the future.
There will be jobs in creating software, but they'll be different.
I don't know exactly what they'll look like.
At the current point, I've talked to enough people who use, you know, Claude or other LLMs to build stuff, and they still have to read the code because the models make mistakes.
You have to figure that out.
And I think actually right now, I think it's no different than if I'm a senior developer and I have a team of 10 junior developers, they will write code and they will make mistakes.
I'm still accountable to make sure it works and I still need to read it.
This, back to your apprenticing question, is: how do I become a senior developer if I never was a junior developer?
And so do we have a pipeline problem?
And do we end up not being able to have that person do that job?
And to my point about getting hired, would someone hire my 22-year-old to write code?
Or do they just use Claude for that?
The senior developer has a job.
No doubt.
Does a junior developer still have a job?
And it's also, I mean, without going down the rat hole, there's also a question who writes the unit test, who does the system test.
I think we'll come back to this.
But how does the systems engineer contribute to it?
Yeah.
So just to give context on myself, what my team and I have done for the last 20-plus years has been in financial services.
And before that, I did similar work in telecom and oil and gas, but it's really about building platforms that are broadly used by developers to build applications on top of, is the best way to put it.
And so I build platforms for other engineers, who use them in turn to deliver business software to whomever they serve.
So in financial services, you can imagine that these platforms, to use your point about the table, need what I describe as the three S's.
They have to be stable, they have to be secure, and they have to be scalable.
Those three are non-negotiable at all times.
They always need to hold true.
Now, you could compromise on it pre-production, but once you are in production and you're trying to support a trading app, a banking app, an ATM app, and so on, the expectation is that these three S's are always true, which means that you're now building platforms that are kind of conservative.
You're always threading the needle between how much risk can I take?
How much change can I make?
You don't want to make too little change because then you don't keep current.
You don't want to make too much change, because then you're running risk.
To me, the system engineering concept is being able to think about the whole.
This platform has platforms on top of it.
And it's sitting on top of other platforms.
And so you think of this whole thing as a system, as an organism.
And it's no different than a human body: you know, if your lungs don't work well, even if everything else is perfectly fine, your body doesn't work.
And so you have to think about your part of the ecosystem downstream and upstream dependencies and how you manage that as a system.
And so that systems thinking is: how do I build and do my thing really, really well, but understand how it fits into the bigger system, what my role to play is, and what I need to be really good at?
And again, financial services is very unforgiving, meaning if you get it wrong, it's very obvious, because you will blow up, and there's very little tolerance for the wrong kind of mistake.
There's often tolerance for the right kind of mistake.
You can make mistakes as long as you make the bank more money than you lose.
But if you lose more money than you make, there's very little tolerance.
Many, many years ago, I heard this story.
They had a test system.
that duplicated the trading floor.
This was when there was not 24-hour-a-day trading yet.
So there was a machine that interacted with the trades and then another one that overnight cleared them.
So there was a duplicate system for us to use to test.
And of course, these were the days before the days of the internet, we had to hook these systems up.
And the test account was the account of one of the biggest stockholders.
They made a copy of it.
It was just because it was so diverse, with so many things in it, that it was good for testing.
Somebody made the mistake of connecting the test server to the actual live stock exchange.
It was so much effort to pull back that trade and undo all what they had done.
Catastrophic, right?
Yes.
I have worked in environments where we've had these kind of things happen.
You learn from them, but you have to be careful with mistakes you make because you make the wrong kind of mistake and you are costing the place millions of dollars.
You don't make any trades, you don't make money, right?
So you have to be willing to take risk.
And I think on system engineering, it's risk management.
How much risk are you willing to take as you think about evolving a system and how do you get that balance right?
Are you familiar with Barry Boehm's sort of risk model of software development, the spiral model of software development?
At every stage in development, you ask yourself, what risk am I taking here?
And based on the risk, you decide what your next course of action is.
Do I do a prototype?
Do I implement something?
Do I need to do more research?
No, I'm not familiar with that particular model, but it's very much the way we think about it.
Like, what do you bring along from prototyping?
Now, I think of this as you have engineering candidates, you've got development candidates, you've got production candidates.
And you think about along those steps, like how much more do I need to know to launch this?
What validations do I need to be comfortable with that?
And it varies enormously.
I could say you will have a business that is very willing to take risks because they're on a more competitive side of it.
And other parts are like, this is the golden goose part of the business.
Like this is highly profitable and do not screw with this.
The interesting thing about being in financial services is that you get the whole gamut of that.
You get high-risk and low-risk environments and everything in between, and you are dialing that up or down.
If I think about SRE, the way we do that is ultimately you really want to think about how much failure can I tolerate?
If I'm failing too little, I'm clearly not taking enough risk, and if I fail too often, I'm taking too much risk.
You dial up and down the changes and the chaos we're introducing, basically, through thinking about that.
And we don't always use formal error budgets and so on, but it's a good way to think about it.
Like, for example, we measure all customer journeys.
You know, can I use my points?
Can I pay with my card?
And so on.
And those journeys are very instrumental to thinking about the risk, because then it's like, well, if I fail a journey more often than I have a risk tolerance for, I may want to dial down the change rate, or I want to test better, invest more in testing, and so on.
So there's a lot of thinking that goes into managing that from that perspective.
And thinking about systems engineering, and back to the point you heard me make about gut: that gut is also informed by data, you know, you have to have data that shows you what is working and what is breaking.
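As an illustration of the journey-based risk dial described above, here is a minimal sketch of an error budget per customer journey. It is an assumption-laden example, not how American Express measures journeys: the class name `JourneyWindow`, the success-rate targets, and the 25% and 100% thresholds in `change_posture` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JourneyWindow:
    """Attempt counts for one customer journey over a measurement window."""
    name: str
    attempts: int
    failures: int
    target_success_rate: float  # hypothetical objective, e.g. 0.999

    def allowed_failures(self) -> float:
        # The error budget: how many failures the objective tolerates.
        return self.attempts * (1.0 - self.target_success_rate)

    def budget_consumed(self) -> float:
        # Fraction of the budget already spent; above 1.0 means overspent.
        allowed = self.allowed_failures()
        if allowed == 0:
            return float("inf") if self.failures else 0.0
        return self.failures / allowed


def change_posture(window: JourneyWindow) -> str:
    """Map budget consumption to a coarse recommendation.

    The thresholds are illustrative assumptions, not values from the talk.
    """
    consumed = window.budget_consumed()
    if consumed >= 1.0:
        return "slow down"  # failing more than tolerated: cut change rate, test more
    if consumed <= 0.25:
        return "room for more change"  # failing very little: more risk is affordable
    return "hold steady"


# Hypothetical numbers for two journeys named in the conversation.
pay = JourneyWindow("pay_with_card", attempts=1_000_000, failures=1_500,
                    target_success_rate=0.999)
statement = JourneyWindow("view_statement", attempts=200_000, failures=20,
                          target_success_rate=0.99)
for journey in (pay, statement):
    print(journey.name, round(journey.budget_consumed(), 2), change_posture(journey))
```

The design choice worth noting is that the dial turns both ways: a journey that has spent almost none of its budget is treated as a signal that more change could be tolerated, which matches the point that failing too little means not taking enough risk.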
You mentioned SREs, and you talked about sort of the customer journey.
The question is, how do you take what the SREs find, or what the customer got, and feed it back into the architecture?
This I find is a big problem and nobody really has a good answer for this.
Yeah.
I mean, it's a great question.
It is.
So the perennial question is how do you have a perfect feedback loop?
You don't, but I will say that, and this is pretty recent in my life, being informed by customer journeys has been very helpful, because it focuses the mind.
If you think about it, ultimately, why are we building software?
We're building software to support certain business outcomes, which support our customers.
And so you say, the most important thing for us is our customers, who of course pay us money to use our products, right?
If you are always forcing yourself to ask, when my software doesn't work as anticipated, how does it impact the customer, that really focuses how you direct that.
Because then you say, if something breaks and the customer never noticed it, it's not that important.
If it breaks and the customer noticed it, it's very important.
And if 100,000 customers noticed it, it's even more important.
And so I think that really helps in terms of reinforcing the feedback loop to go to the right places.
And we use that a lot in our conversations around reviews of what broke and what we think of that.
Of course, we always do postmortems on why this thing broke, and so on.
But then where we focus effort comes down to: is it impacting customer outcomes or not?
And that's a very helpful way of tightening that feedback.
It's still imperfect.
Don't get me wrong.
But I will say that that orientation around customer outcomes has been very helpful.
That raises sort of two things in my mind.
One is, perhaps for the listeners, you might give an example of a customer journey, so it's a little more concrete in their mind.
But the other thing that comes to mind is that a lot of these systems are sort of on the edge of breaking, so to speak.
In other words, sometimes it's a miracle that it works.
And how do you deal with that?
When you talk about stability and you talk about reliability, which is very important to you, and you know when to push and when not, you have to deal with the fact that you are always, so to speak, on the edge of chaos.
Yeah, that's a great point.
So let me start with the journey just to express it.
I'll use a couple of journeys that we have here just to put them in context.
But every company will have equivalents.
So one journey is, can I pay with my card?
Think about it: that's the most foundational one at American Express.
Can I use my card?
That's a very clear journey.
That's, for example, one we measure.
Or another one would be, can I look at my statement?
So these are different journeys, right?
And they have different systems under them.
And so, to your point about complexity and things on the edge of breaking, I think that holds in any complex environment, because that's what systems are like.
"Can I pay with my card?", you can imagine, has a very complex ecosystem underneath it.
So when you go to a store and use your card, it gets authorized pretty much in real time between the merchant you're involved with, the network, the backend, and so on.
There are a number of parties involved in that transaction and a lot of systems.
And so that transaction flows from the point of sale, you know, through a network to a backend that knows everything about you and says, yes, you are authorized, or no, you're not authorized and this is why.
And then all the way back and then you get a yes, usually that you can make the purchase, sometimes a no.
There's a huge amount of complexity that the customer never sees and never should see.
But to your point around things, you know, at the edge of chaos, I would just say that that example is where we pay a lot of attention to managing that, because it is a very important journey to us, clearly, and one where we want to make sure that it always works.
So in the cases where the journeys demand more rigor, we do pay a ton of attention to testing, chaos testing, scenario planning, all kinds of paranoid activities, as in, well, if this breaks and this breaks and that breaks, what then happens?
And then you also anticipate scale, as in, we work fine now, but what if we got double the volume?
Then what happens?
Triple the volume.
What if it's Black Friday?
And so there's also a degree of anticipation.
You learn over time to think about, as I said, scale, security, stability.
Those three S's.
And so scaling is probably the thing that people get wrong the most often, anticipating scale.
But the way I think about it, you want your product to be very successful.
If your product is very successful, guess what?
You're going to get more customers, you're going to drive up scale.
And so you have to have built into your systems how they will deal with scaling.
And to be honest, like in my experience over the many years I've done this, it is usually scaling issues that have broken complex systems.
Because something that was working fine over time was getting closer and closer to some threshold.
A resource contention of some sort.
Yeah, resource contention, you know, network contention, CPU contention, memory contention.
It's somewhere downstream, and it is not obvious at all until it happens.
And then once it happens, of course, everything downstream starts failing.
And so really thinking about scaling up front is one of the most important things to do in complex systems.
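The "what if we got double the volume?" question can be turned into a rough back-of-the-envelope check. The sketch below is a simplification under a loudly stated assumption: that each resource's utilization grows linearly with volume, which real systems often violate near saturation. The function names, the resource names, and the 90% limit are hypothetical.

```python
from typing import Dict, Tuple


def headroom_multiplier(peak_utilization: Dict[str, float],
                        limit: float = 0.9) -> Tuple[str, float]:
    """Return (bottleneck resource, largest volume multiplier before it hits `limit`).

    Assumes utilization scales linearly with volume, which is an
    illustrative simplification, not a property of real systems.
    """
    if not peak_utilization:
        raise ValueError("need at least one resource")
    # The busiest resource today is the first to saturate under linear scaling.
    bottleneck = max(peak_utilization, key=lambda name: peak_utilization[name])
    busiest = peak_utilization[bottleneck]
    if busiest <= 0:
        raise ValueError("utilization must be positive")
    return bottleneck, limit / busiest


def survives(peak_utilization: Dict[str, float], volume_multiplier: float,
             limit: float = 0.9) -> bool:
    """True if the projected load stays at or under `limit` on every resource."""
    _, max_multiplier = headroom_multiplier(peak_utilization, limit)
    return volume_multiplier <= max_multiplier


# Hypothetical peak utilization for the systems behind one journey.
today = {"cpu": 0.30, "memory": 0.45, "db_connections": 0.60}
print(headroom_multiplier(today))
print("double volume ok?", survives(today, 2.0))
```

Even this crude model shows why the threshold tends to hide downstream: the resource that limits growth is often not the one people watch most closely.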
And there are other places you're willing to take risk of saying, listen, if this thing fails, I'm going to let it hit the bottleneck.
And then once it does, I will go fix that after because I'm okay with this thing failing every so often because it's a rapidly evolving business.
We do both of those things.
And to your point around things always being at the edge of breaking, I would say that's true either where you're willing to take more risk or where you have a brand new system and you're still learning the edges.
But for the places that are core to your business, I would say that you don't want that.
I mean, I'm not saying it's ever perfect because, you know, these are very complex environments.
You know, often the complexity is not even in your own system.
It could be with a third party.
So, you know, there's a lot to it.
But I would say that we, from a system perspective, think about dialing up and down that paranoia and think about where things break.
You see it every so often that companies are incredibly good at engineering, like, you know, plenty of great software companies and tech companies that occasionally get it wrong.
But they do a deep introspection.
They do the postmortems.
They look through it and say, all right, this is what we can learn from.
And that's also the other thing is anticipate scale and learn your lessons.
Don't make the same mistake twice.
And so it's also like, when these things do fail, you didn't know there was a bottleneck here, or there was a downstream dependency you didn't anticipate.
Well, then learn from all of these failures and think about how you engineer that out of your system so that particular thing doesn't happen again.
Some of it might be just human intervention.
For example, you talked about the situation where you may deny a charge.
You know, someone goes out of the country, they're in Afghanistan, and they're making a charge, and you have to figure out, well, is this really you?
100%.
And also, you probably have to make concurrency, like optimistic concurrency or pessimistic concurrency decisions along the way.
Oh, yeah, so 100%.
And you are always threading the needle between being perfectly mathematically correct and the customer experience.
And you can't usually have both.
You can't have a perfect customer experience and a perfect technical outcome.
And what I mean by that is, let's use the example of using a card.
We could be incredibly fine-grained on every single attribute of managing every part of it.
But then we might deny more charges, and we would annoy you, because sometimes you would get declined even though you're legitimately using it.
And so we say like, you know, there's some risk we're willing to take here to make the customer experience smooth.
Because do we really want to annoy every customer when they travel and have to check in and say, I'm really in Afghanistan or really in Italy every single time?
No, we're going to use heuristics and model it out.
And to your point on concurrency, it's like, you know, it's probably legitimate in most cases, and we accept that.
Especially if it's a $50 restaurant charge, the risk is not high.
But a thousand-dollar charge, very different thing.
These are all things that go into it; it's the risk appetite.
It is managing that and managing the customer experience through that, that becomes the right way of thinking about it.
And to your point around resiliency, it is not just a technical resiliency.
It's also process resiliency, people resiliency.
Because again, if you think about a system in the true sense of the word, and you think about what you're giving to your customer, it is not just technology that's creating that customer experience, it's everything around it too.
And so you have to really think of that part too.
And a good example with any banking application is worst case, you can't get to your online statement and this holds true of any financial institution.
Well, you can call someone and talk to a human, right?
And therefore that is, to a degree, part of the system and part of how you manage business risk in the worst case.
And then you think about, well, how resilient does it need to be?
It's like, there are a lot of diminishing returns in how hardened you want to make something.
And let's say it is at four nines or five nines, and making it six nines will cost you 10 times more than having it at five nines.
Well, maybe it's just not worth it.
And for the 15 minutes a year it could maybe go down, you say, you know what, if it happens in those 15 minutes, someone can call someone.
That's an example of how you think about system, you know, resiliency and system uptime.
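To make the nines arithmetic above concrete, here is a small sketch that converts a number of nines into allowed downtime per year and back. The function names are illustrative, and the year length of 365.25 days is an assumption chosen to average in leap years. By this arithmetic, four nines allows roughly 53 minutes of downtime a year and five nines roughly 5 minutes, so 15 minutes a year sits at about four and a half nines.

```python
import math

MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, averaging in leap years


def availability_for_nines(nines: float) -> float:
    """Availability as a fraction, e.g. 3 nines -> 0.999."""
    return 1.0 - 10.0 ** (-nines)


def downtime_minutes_per_year(nines: float) -> float:
    """Downtime per year permitted at the given number of nines."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


def nines_for_downtime(minutes_per_year: float) -> float:
    """Inverse: how many nines a yearly downtime figure corresponds to."""
    if minutes_per_year <= 0:
        raise ValueError("downtime must be positive")
    return -math.log10(minutes_per_year / MINUTES_PER_YEAR)


for n in (3, 4, 5, 6):
    print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
print(f"15 minutes/year is about {nines_for_downtime(15):.2f} nines")
```

Each extra nine divides the allowed downtime by ten, which is why the cost of the next nine has to be weighed against a fallback such as a customer calling a human.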
One time I actually did a calculation of how many nines the electric company actually gives my house or business.
And it's really not that many, if you actually think about it, for precisely the reasons that you just outlined.
That's an example you could probably live with.
Occasionally, it's down and it's annoying, but it's not life and death.
But a plane needs a very high degree of nines, right?
And so that's why we do this all the time in real life.
We think about these trade-offs and we accept them naturally.
But then when it comes to software, we expect there's perfection.
And ultimately, we surround ourselves.
We do natural risk management all day long as humans.
But in software, because a system is doing it, we expect this degree of perfection.
You have to let that go.
You mentioned customer experience, but platforms also have another customer.
And that's the developer that is building on top.
And you must have to work with them.
And that raises all kinds of questions.
How do you trade off the long-term versus the short-term?
And they say, we need this maybe for a year, but in two years, things will change.
And how do you say no to them?
Or how do you develop a path from what is needed now to what's eventually necessary?
This is another type of customer experience.
It's a great point.
And by the way, we do also measure journeys for our developers, how things work for them.
But that trade-off is super important because, like any platform provider, I have limited resources.
Or, I don't have infinite resources, is probably a better way to put it.
And so I cannot build everything everyone wants.
It's impossible.
And so it is really figuring out what is it that adds the most value.
I mean, if you think about it in a condensed way, what adds the most value to the most developers in the least amount of time, and costs me the least to maintain?
I think of it a couple of ways.
First of all, I don't want to be too early and I don't want to be too late.
What I mean by that is being in my space, there's always new open source projects, new variety of different things that are starting to pop up that look interesting.
And I'll use Kubernetes as a good example of this.
I started working on and providing container platforms pre-Kubernetes.
And I knew that Kubernetes was coming out of Google, but it wasn't really ready.
And I invested in other platforms to do it, but I had to pivot over.
And so it was one of those things where I made a deliberate decision and say, you know what, it's okay.
I can learn from this, spend a year or two learning from the containers.
And then given it's so early days, I will pivot.
And so it's important to recognize when you're in that stage of innovation versus when it's more mature.
And when do you decide to make that investment?
And when do you just observe and stay away?
And to me, that's the real art.
There's no science to it.
I keep getting it right.
I keep getting it wrong.
I mean, sometimes I think something needs way more maturing, and before I know it, every single developer is using it and they're all upset that we don't have a platform solution for them.
Other times we build things that are robust and no one wants them yet.
What I ask my teams and myself to think about is, have conviction that this is the right thing to do.
And I use the analogy of a puppy and a dog.
You can fall in love with a puppy, but are you ready to care for and feed it for years, and walk it and do all the things?
And if the answer is no, well, then it's probably not time to bring it into the house.
And only when you have the conviction that if this is successful, and platforms are like that, if I decide to do something and people love it and they start using it, I can't get them off it.
Not easily.
No, because now it's work for them they don't want to do.
They're already using, let's say, whatever, container platforms.
And I tell them, well, I'm going to stop supporting you and you're going to have to migrate.
They're pissed off.
They're like, you need to support me.
And that is years, not days or weeks, because they want to focus on writing business software and not on migrating off whatever I have.
And they have finite resources too.
They have finite resources too.
So that's where it's really important to think about, do I have conviction around this thing?
And then it comes down to the art is, when is it the right time to pull the trigger?
And because resources are finite, I always have to think: if I do this, there's something else I'm not doing.
Am I making the right trade-off?
But it is a very particular challenge.
You know, my customers are captive.
They can't go use someone else's infrastructure.
They have to use what we provide to them.
I mean, yes, we provide to them cloud-based infrastructure, public cloud and so on, but it still funnels through my team.
Whereas if you are a cloud provider, you know, you have paying customers, and if you do a good job, you have more paying customers, and if you do a bad job, they can go to someone else's cloud.
And so that really focuses the mind.
When you are in enterprise IT, like I've been for a long time, you have to really drive the discipline more forcefully into your teams because you don't get natural market signals like you do if you sell your software to someone with a real wallet.
But you must have management that understands this because your developer customers can always appeal over your head and this becomes all political.
But you must have a management that understands all these trade-offs and these things that you've just spoken about.
Our job is to really demonstrate, as best we can, that we're building what has the best value for the company.
I know we have a sizable budget, so are we being good stewards of it and managing it well?
And then, you know, part of it is saying no.
And yeah, we do say no quite often and say, this is just not, only you want this.
No one else has asked for this.
And so, while it might be a great idea for you, I don't have bandwidth to go do this because this doesn't take precedence over the things that more people are asking for.
So go do it yourself.
And then if you do it yourself, you have to do it in the right way.
You can't blow the place up.
And so we have to sometimes allow for a degree of innovation happening outside of our team and just accept the fact that sometimes someone's going to build something for them only.
That, yeah, they might not be database experts, but they need a very peculiar database and no one else is asking for it.
I'll let them go build and stand up that database stack because it's just not the right time for me to do it.
But let's say two or three other teams ask for the same thing; then we might take it on.
And you might find that it's special for one team, but as things evolve, more teams want the same thing.
To be fair, open source has really helped with this because it allows people to more democratically build together and collaborate.
And because it's open source, you know, if you take standard database technologies like Postgres or MySQL and so on, it is widely available.
And so it helps enterprises like us collaborate better because we all have access to the code base and can iterate along.
And so that has actually helped tremendously with that problem statement of how we can better do that.
Because think about, you know, in closed source software, a team like mine is the only one that has access to it because we're licensing it from a vendor and standing it up.
And then it becomes a lot more tension filled.
And if we don't have the bandwidth, then it's going to really upset them, because only we can build it and, in their words, you're standing in my way.
Open source, by contrast, allows us a bit more organizational flexibility as to who does what.
The other thing that seems to be implicit in what you're talking about is, you know, that everybody understands these sort of things, which means that there's a cultural element to all of this.
And you must have the culture inside your organization to understand these things, and also the culture of the users of the platform.
Because it's not just knowing the contracts of the API.
It's the implicit contracts, and it's also the understandings that the culture brings.
Yeah, I think the most important job for me at this stage in my career, I mean, leading a big team, is setting the culture.
Because great culture builds great teams, and great teams build great products.
And you can get lucky.
I've gotten lucky.
I've had teams that have really been successful, but it wasn't because I was good.
I was lucky.
And I happened to get lucky and assemble the right people, who then built great products.
But to do so repeatedly, because my job of building cloud, if you think of what I'm building, has many components to it and many teams building these components, you really have to focus on the culture to get that right at scale.
And so a lot of what I describe is like: figure out how you empower teams to make decisions as autonomously as they can, but do so safely, managing, you know, as I described, making the right decisions and working through it.
So it's just a reinforcement of that on a continuous basis, setting that culture.
And for me, the most important thing really comes down to how you set the macro guidelines, and then you let people run within those guardrails as freely as they want and give them as much freedom as you safely can.
And again, there's a bit of an art to it because people will make mistakes.
You always have to accept the fact that sometimes people are going to do something wrong and you're going to say, well, that wasn't too smart.
You blew this thing up.
To use the example of, you did trades in prod.
Well, learn from that.
Don't do that again.
And it's a culture where you're allowed to make mistakes.
And again, back to my apprentice example, there are enough other people with experience around you that they stop you from ever making the worst mistakes, but allow you to make the small mistakes that allow you to grow.
My role is really to reinforce that behavior as much as I can and allow teams to blossom in that because if you build platforms, they need to hang together.
It's like a meal.
You go to a restaurant to eat a meal, not to eat a carrot and a potato and a steak.
It's assembled together, and you expect that it's that experience of the meal you're going for.
They're not looking for the components.
They're looking for all this to work together in a cohesive, coherent way, which means, like Conway's Law, if you have an organization that's dysfunctional and don't talk to each other, well, your platform is not going to hang together.
And as a consumer of that, you're going to be like, well, that was not great.
Like, I'm trying to use this database with this container, with this messaging broker, and I have to sew it all together.
All the observability looks different.
None of this is easy to troubleshoot.
And so that's a very frustrating experience as a consumer.
And so you have to, again, back to your point about putting yourself in the shoes of a developer.
I like to create a team for that, and I have one here too.
I call it developer zero, the first developer.
And so I have a team; they're not real developers, but they're critical people who consume our platforms, and they're free to roam where they see fit.
And their job is to give constructive criticism to any other platform team, saying, well, this sucked.
But they're deliberately outside of those teams because I want them to be treated just like any other developer.
So they don't have inside access to documentation or to special APIs.
All they see is what the developer sees, but they know the people and they'll go and say, hey, by the way, when I tried to use that database with your message broker, the documentation doesn't describe at all how I had to do it.
It was actually like this.
So that also helps, you know, really find things before my real customers, which are the development community, find them.
That's an interesting point that you mentioned about the documentation, because people who know very well what they're writing about often make implicit assumptions in their writing that they do not realize are not obvious, and it's only someone else's eyes, the reviewer's, that can say what you really meant.
I think you have a good idea there because you have people who are technically competent yet don't know the insides.
So they understand the documentation.
They understand what's being explained to them, but they can call out the hidden assumptions being made that are not clear.
Yeah.
And it's very important to your point.
It's like that old expression, you don't see the forest for the trees.
And I find that all the time as our platform teams are building various components, they get all the complaints.
Well, why are they complaining?
This works perfectly.
And it's well-documented.
It's very well understood.
Like, well, to be honest, perception is reality.
And you might think that, but your customers don't.
And so, to your point, they just don't see it until it's pointed out.
And I do think having an explicit function around it helps.
But this function is not what you'd call UAT, because it's a lot more loose and it's a cultural thing.
It's not a formal thing as in, I'm not asking this team to check every single release.
I'm just telling them: go consume stuff at will, go figure it out, and go direct.
And you spend your time where you see fit and where you hear the most noise.
That's where you should spend your time.
But they're not a sort of formal gate.
Because that would slow us down; if I told them, listen, you have to go test everything every single time something releases, we'd never get anything out of the door.
And more importantly, they wouldn't get the important things.
Exactly.
So I want to get back to the question that I sort of teased at the beginning: we now have talked a little bit about platform engineering and the importance of it.
And we are increasingly coming into the world of agentic AI, where the software could be written by AI, it could be written by humans.
How does this change architecture or systems engineering or platform engineering, however you want to conceptualize it, when your customers may not be human or the level of platform engineering goes up?
Because the AIs may be writing software that's sort of like a platform that humans use.
What does that world look like and how does it change things?
So first of all, I'll get back to that.
Right now, I don't view agentic any differently, from a human or cultural perspective, than a senior developer overseeing junior developers or a senior platform engineer overseeing junior platform engineers.
Still the same problem statement.
I'm still accountable to make sure this stuff works.
And if it's built by humans or built by agents, it still needs to function.
And I have to make sure the system hangs together and I'm doing the appropriate tests and validations and so on to have confidence around it.
No different than before.
So what does change, though, is the speed.
And how do I feed this?
So first order for us is the complexity of operating these platforms.
And back to your point earlier around, you're only one mistake away from chaos, or very close.
And these are, as I said before, very complex ecosystems that hang together and clearly a place where we have humans operating it today.
And you can imagine that agentic systems, because they can look at a vast amount of data way faster than a human can, should be able and can triage complex systems quicker than a human can.
And so we're feeding these agentic systems all the telemetry, observability, state, and so on.
And again, as I said, it's early, but everything we see shows a lot of promise that this will be profoundly quicker in finding issues than we can be as humans.
Or making mistakes faster than we can make mistakes.
And make mistakes faster.
But this is a bit of like you have to apply to both sides of the equation.
And that's kind of my point, which is assume you have agents writing code and assume that they will spawn mistakes.
You also need agents observing the systems that also can go just as fast.
And so that's the way I think: as long as you think about applying it to both sides of that trade, it doesn't really change the dynamics, but you have to keep up with that.
It's no different than cyber fraud with agentic, but equally cyber detection and how you manage that with agentic; it's an arms race.
And so I think of it like, as long as I can, you know, from a systems perspective, observe and manage the system at the same speed when things go wrong, then that equilibrium hopefully holds true.
But you're very right.
Like, I don't know if it will.
So that's something I'm acutely looking at: making sure that operationally we are focused on the same level of speed.
And there the complexity comes into data, because we serve up observability today for humans to look at.
So they'll go look at a dashboard.
They'll look at some logs.
They'll troubleshoot through, you know, a system trace, but all of that is built today to feed humans.
And we did not build it to feed APIs and systems.
And so now we have to scale all of that up in the same way for agentic reading.
It's now reading this stuff a hundred, a thousand, 10,000 times faster.
I now need to build the underlying platforms for that, that I have to scale way more than I ever anticipated.
So that's back to my scaling point.
Now I have to go scale those to work at the same speed as software.
And to be stable, and to be secure, maybe operate in a zero-trust environment.
In other words, everything happens faster, so all your three S's become even more critical.
And this is what makes this job fun, right?
Because there's this constant evolution that means that you always have a different dimension to deal with.
And you always have pressure in the system to go faster, to continue to be secure, to continue to scale.
And so that makes this a never-ending challenge.
I often say to people that in my software career of, I guess, over 30 years now, I've only done three things.
Trade off space and time, insert levels of indirection, and try to get my customers to tell me what they really want.
Yeah, that's a great way to summarize it.
It's not really more complicated than that, although getting it right is very, very difficult.
Right.
This is the point in the conversation where I'd like to take a little more human-centric approach and ask you sort of the questionnaire that I ask everybody that appears on the program.
I call it the architect's questionnaire, but it's just as much the systems engineer's questionnaire.
What is your favorite part of being a systems engineer?
I mean, I love building stuff and seeing it being used, that it manifests itself into something in production.
And I always, you know, always use this thing of running code wins.
If it's running in production, that's when it's real to me.
And I love that satisfaction.
And it's no different than the satisfaction my father got out of delivering a cabinet to his customer and seeing them put it on the wall. It's that: the thing is actually being used.
That to me is my favorite part of what I do.
What is your least favorite part of that role?
I guess the least favorite part is, and I'll use the term architect here, because I'm not a big believer in not thinking about the whole end-to-end, as in getting into production.
And so I think that distinguishing between enterprise architecture and building, I don't love that because I think of it as a continuum and you ideate, you build, you put in production.
So when people call me an architect, I'm like, yeah, I don't really think of myself as an architect.
I'm a builder that understands how things need to be designed.
Is there anything creatively, spiritually, or emotionally satisfying about systems engineering or being a practitioner?
Yeah, I think at least I get motivated by, and I think it is quite human, not just being abstract, but being able to see your thoughts come into, for lack of a better term, material form.
I always feel like people in general, at least a lot of people I know, get motivated by thinking of something and then seeing it happen.
I find it incredibly satisfying.
And it gives me a real sense of accomplishment that I don't think much else can achieve from a professional perspective.
Because just thinking about it, but not seeing it happen, is not enough.
And I have had times in my career when I've had those kinds of jobs, and I never found them particularly satisfying.
But to me, it's both.
But equally, I don't enjoy building someone else's vision if I didn't have any part of it.
So it's really being on both sides of that that I find very, very satisfying.
What turns you off about systems engineering?
Your job is to be invisible.
And if you do a really good job, no one ever knows you exist.
Who was that masked man?
Yeah.
It's just like your power example.
It's just expected to come out of the wall.
It just works.
You have no idea who does the job or how they do it.
They're not appreciated.
You never call them and say, thank you for giving me power.
You just assume it's going to be there.
And to be fair, if they do a great job, you never know that they're even there.
Being an unsung hero, for lack of a better word, or you call it silent running, can be very frustrating if the people that you are building for and are funding you don't appreciate everything that goes into it.
And so they take it for granted.
And only when things fail, they say, how can you let this fail?
Well, because you took away all my money, so I couldn't build what you wanted for resiliency.
And then they're like, well, don't make that happen again.
Well, then you have to give me the money.
And so it's that kind of thing, where you're working for clients that don't appreciate what you do and don't want to fund you.
Now, sometimes I've been in places where maybe they didn't appreciate it, but they said, I don't care what you do, but I trust you.
And so just make it work.
That's fine.
Yes.
But where it's not fine is where they think they understand that, well, why do you need someone?
It can't be that difficult.
They trivialize the complexity that goes into building a system, and they over-trivialize it and then say, well, that cannot be that difficult.
It cannot cost that much.
You can do this way easier.
So you should make all these trade-offs.
And then we try to say, well, these are the things that come out of that, but they don't want to hear it.
And then when things blow up, it's like, it's not their fault.
It's your fault.
That is frustrating.
Yes.
Yes.
I've been there.
I know exactly what that's like.
Do you have any favorite technologies?
I don't know about favorite, but I've had profound experiences.
So I spoke about the one where I first saw a computer playing chess.
I had a similar experience the first time I logged into a computer that was not in the same building I was in, so DARPAnet.
And that was a profound sort of aha moment for me, as in, this is really possible.
And I had a friend who has been doing neural networks and doing AI forever, you know, since the 70s.
And so it's a bit of like, well, this is never going to happen.
It's always been an idea that, and now that it's happening, it's like, it's profound.
So what I love about my job, I guess, is not one technology per se, but when you hit these magical moments of saying, this thing that just seemed like an idea for so long is now actually happening; even if you anticipated it maybe happening, when you just first see that, it's so magical.
And then, of course, six months later, you take it for granted.
But those few months of that magic is something I love about being in technology is you get them every so often.
And it's amazing to be able to be a small part of that journey.
What about systems engineering do you love?
I love the team aspect of it because by the nature of a system, there are multiple components and multiple teams and multiple people involved.
And so it is really like an organism.
It is like, how do I get this organism that is incredibly complex with all these people and systems and technology and software and hardware all to interact to deliver a systems outcome?
And it is probably the most complex problem you can imagine.
And I love that complexity of all those moving parts, and having all of that magically deliver these outcomes that, if you looked under it, you'd be both fascinated and horrified at everything that goes into making it happen.
It's kind of like an airport.
I'm fascinated.
How does an airport actually operate?
All the different things that go into running an airport, to have all these thousands of planes take off and land on time, and all the different specialized jobs that go into it.
It's incredibly fascinating how it just works because we...
That's a system.
And so I love that when you take an incredibly complex system and make it appear very simple, even though you know, of course, underneath it, it is incredibly complicated.
What about systems engineering do you hate?
Well, beyond the funding, and it's not just funding, it's the lack of being open-minded or appreciating that complexity.
It's probably just unrealistic expectations from clients around how easy or quick it is to do things.
And so it's more a speed problem.
And you alluded to it before, which is clients always asking, why is this not done, and why can't it be done already?
And it's kind of like doing things right takes time.
And you can either be scope bound or you can be time bound.
You can't be both.
And so if someone asks you to say, like, I want this thing, well, and I want it done this way, I want it to be stable, secure, and scalable, well, guess what?
It's going to take me time to get it to that point.
And so there's a degree of that that is frustrating, which is you want it both ways.
And I can't give it to you both.
And you don't want to make any compromises.
And just recognize our job is making trade-offs.
And what do you want to compromise on?
And where people don't appreciate that, it's frustrating because you feel like it cannot be successful.
It's like the old saying, someone says, I want it fast, cheap, and correct.
And you say back to them, well, which two do you want?
Yeah, which two?
You can only get two out of three.
Yeah.
What profession other than your current role would you like to attempt?
At this point, I just spoke about apprenticing.
You know, I have people that I guess in many ways apprentice with me now, although not formally.
But that degree of being able to impart my lessons, I like doing.
I enjoy that.
So I certainly envision spending at some point more time doing mentoring, teaching, you know, maybe even in a more formal structure.
That would be exciting to me.
And one thing I never did but wanted to do, a course I didn't go down: when I was in college, I actually was, you know, considering maybe being an architect architect, as in a building architect.
I love well-built buildings and, you know, the combination of art and science that goes into them.
And if you look upon a really beautiful building and what it took to both envision it and to build it, you know, to me that is something.
I mean, again, it's too late in my career to go back and do that.
And the reason I, you know, maybe this day and age I could, I'm horrible at drawing and drafting.
And so I was like, I will not be able to pass any draft.
Software does that now.
But software didn't do it back then.
You had to do it by hand.
And so like that turned me off.
I said, yeah, that's not going to work for me.
But yeah, if I could redo my life in this day and age, maybe I would be a building architect.
Do you ever see yourself not doing systems engineering anymore?
Not in this.
I love what I do.
Not as long as I'm working.
I love what I do for all the reasons, the complexity, the teamwork, the cultural aspects, all the different things that go into it.
I've done this for a very long time and I do it happily and see no reason to not continue to do so.
We spoke a little bit about this before, but when a project is done, what ideally would you like to hear from the clients or your team?
Well, when it's done, I want to hear a couple of things.
And again, to use my dog analogy, we raised the puppy to a dog, but we still have to maintain the feeding, the care, and so on.
And so really what I want to hear is, first of all, that we met the expectations of the client.
We delivered an MVP or the first iteration of it.
And then secondly, what we did is sustainable.
That what we did was actually something we can maintain, because building a platform that you can't sustain might feel great when you deliver it, but then it's a nightmare afterwards because maybe you can't scale it or it's not secure.
And now you have to do all that work after the fact.
It's so much more difficult than if you did it up front.
So I want to hear that, yeah, we successfully delivered, but we also delivered in the right way.
Because delivering in the wrong way, and I've done that way too many times myself, feels great day one and feels awful day two.
Well, thank you very much for being on the podcast.
I enjoyed this conversation very much.
It's something we don't explore because, as you say, people take platforms for granted.
But without platforms, we wouldn't have software.
I really enjoyed the conversation too.
And, you know, to all the platform engineers, platform architects, platform designers, systems engineers out there, I appreciate everything you do.
And I use many of your platforms myself and am very happy with them, you know, whether you're a power company or you build clouds and so on.
We are all building on top of each other's platforms and rely upon them.
And I think that's what makes this so exciting.
And it also allows you to appreciate other people's platforms even more when you are in this job.
So I'm really thankful that you let me speak a bit about this and my journey along this path.
Well, thank you very much.
And maybe we can have this conversation sometime again.
I would love to do that.