The VOID

Episode 7: When Uptime Met Downtime

Courtney Nash


Intro:

This is The Void Podcast, an insider's look at software incident reports. When software falls over, as it does, it's people who put it back together again. Each episode, we hear what it was like from the perspective of the people involved.

Courtney:

We took a bit of a hiatus from recording last year, but we're back with an episode that I think everyone is really going to enjoy. Late last year, John Allspaw told me about this new company called Uptime Labs. They simulate software incidents, giving people a safe and constructive environment in which to experience incidents, practice what response is like, and bring what they learn back to their own organizations. For the record, this is not a sponsored podcast. I legitimately just love what they do. And I had the sincere privilege to meet Uptime's co-founder and CEO, Hamed Silatani, at SRECon EMEA in November, where he gave a fantastic talk about some of the things they've learned about incident response from running hundreds of simulations for their customers. They recently had their first serious outage of their own platform. And so Hamed is joined by Joe McEvitt, co-founder and director of engineering at Uptime, to discuss with me the one time that Uptime met downtime.

Hamed:

Thanks, Courtney. Thanks for having us. I must say, it's a great honor to be here and talking to you. For many years I've been following your work at conferences or through videos, so being here is awesome. A quick introduction about me: civil engineer turned software engineer. Early on in my career, I wiped out an Oracle database that listed all the products that our shop was selling, and that was my introduction to how things fail. And I realized that, actually, it was very stressful, but I enjoyed it, got a lot of energy from it. So since then, most of my roles were either in incident management, incident response, or application support. That's me, briefly, background-wise. We loved it so much that I managed to convince Joe and we set up Uptime Labs, just about three years ago. We are in the business of creating incidents. We create many incidents every day, most of them P1. And the idea is to create a safe space for people to experience incident response and work together. It's really stressful in real life, and it's stressful in ours as well, but no one dies, no one gets fired. So it's a great place to practice.

Joe:

Yep. And I'm Hamed's Irish counterpart. My name's Joe. I've been working with Hamed for about 15 years. I'm more of the software development background. Hamed and I worked together in the same company: I was shipping stuff and Hamed was on the operations side fixing stuff. So that's how we

Courtney:

Perfect pairing.

Joe:

started. Yeah, yeah, that's how we started working together. I've sort of always been obsessed with... I love software and I love technology, but I was always equally fascinated with the people side, the social side of how we work together. It always used to bamboozle me when I'd join a post-incident review session and we'd start talking about what went wrong and stuff like that. I was always fascinated and obsessed with how you can improve. And like Hamed, I just think it's bonkers that a lot of people go on call for the first time and are expected to just figure it out. Here you go. It used to be a BlackBerry, that's how I started, right? You'll figure this out, this complex system across multiple regions and stuff like that, maybe after sitting behind someone for a week. So we've always been obsessed at Uptime Labs: there must be a better way to prepare yourself to deal with incidents. So yeah, that's our backstory.

Courtney:

I love it. Bonkers is accurate. You have a lot of children, I don't have as many as you, I just learned this. But it's as bonkers as somebody just handing you a baby as you leave the hospital. You're like, what the... I don't know how to do that. So yeah, here, you're on call. Good luck. It'll be fine. So, wonderful. I am really thrilled to have you two join me, because this is a slight variation on the podcast: you had an incident, but it's not publicly written up currently. So I was given the privilege of reading your internal writeup of it, which is cool, because I don't always get that. I got to nerd out on that. But also because you are, as you said, Hamed, in the business of helping people learn from incidents. The way you all wrote this up and the way you approached it was really different and refreshing, and that's definitely what I want to dig into. And so the title of this episode is When Uptime Met Downtime, which you chose actually, Hamed. I couldn't write a better title if I wanted to. Especially because people don't have an incident report to read, I would love it if you could give us the very quick TL;DR summary of what happened, what went down at Uptime that one fateful day.

Joe:

I started, I started the chain. So we have a morning practice where we do our checks on our systems, health checks, et cetera. And I noticed an internal workflow not working, like a totally minor issue. And I published it into our incident chat room to say, oh, you know, I've noticed a minor incident. And that triggered a set of cascading...

Courtney:

So are you saying that you're the root cause, Joe, you started the incident?

Joe:

I was, yes, exactly. Root cause was me. And I was totally relaxed too. I started the day, I literally went on another call. I wasn't directly on support that day, but it started a cascading set of incidents, or set of issues, that cascaded and cascaded. And ultimately we had a full downtime outage within probably 90 minutes of that initial trigger point. In total, it took us four and a half hours to restore our platform. So yeah, that was on a lovely Monday morning as well, to top it off.

Hamed:

I was just thinking back on what Joe said. I never figured that out until today: when we have to put the pin in the start point of our time-to-recovery measure,

Courtney:

It's a fun one. Yeah.

Hamed:

was it when you noticed it, or when the customers were actually affected? And I'm saying that, Courtney, with tongue in cheek.

Courtney:

I know you're taunting me with that one. When I read through it, I was like, okay, they put it at a certain point. And this is something I've talked about, and other people have talked about: when you're trying to track durations of things, like, when did it start? It's a very metaphysical problem, but also sometimes a very real problem. It's almost like you poked Schrödinger's cat by looking in there to see what was happening. And just for context for people listening, we're not going to get into the gory technical details of the incident, because that's actually not what's interesting. And I'm sure there are a few people who will be like, I really want to know. But the reason I wanted to have you folks on to talk about this was to talk about the other pieces of it. And Joe, you said you'd always been fascinated by the people side. But I am curious, if you can tell me, was this sort of the first really serious one for you all? Was this customer-impacting incident the first real go at this for your own software?

Joe:

Yeah, it was probably three years into our journey and, you know, we've had a few, but this was definitely the largest, most critical incident we've had. And we had a good streak going as well, so it had been a long time since we'd had something like this. Four and a half hours is definitely the longest outage we've ever had, and hopefully it'll be a while before we get another one like that.

Courtney:

Yeah. I think the thing about the duration that people don't always think about when they think about how long an incident is, if you will... they're thinking, obviously and rightfully so, about the customer impact or the financial impact or whatever. But the longer these things go on, the more stressed out and the more tired people get. As you mentioned when you first ran across that yourself, Hamed, there's a whole physical component to it, as well as the mental side of it. I wanted to talk a little bit more about one detail in there, in terms of that timing and that process and what it felt like. People often have neat and tidy ideas about incidents, right? A problem is identified, people investigate, you troubleshoot, you resolve it, right? This nice linear process. As you all know from running these, more than most people probably know, from running incidents for people as a business, that's rarely the case. And in the writeup that you all did internally, I noticed that, as we were talking about, it was about four and a half hours end to end, but halfway through it, I feel like it said a second incident was declared. Can you talk to me a little bit more about what that was? Was there confusion? Was it the same incident? Was the scope of it increasing? Because this is something I think is more common than people realize, that you have these sort of incidents within incidents, or sometimes you have things where you can't tell yet what they are. And I love that little detail in the timeline that was, oh, hey, by the way, second incident declared. I was like, oh God, what was that? Tell me a bit more about what happened there.

Joe:

So I flagged it, okay? It was an internal issue, not customer facing, right? So off we went. And then within a couple of minutes, one of our engineers flagged it, saying, hang on a second, there's a set of change sets in production that are unexpected. He just flagged it internally. Now, we had done initial smoke tests, et cetera, that morning, so, you know, some of the core functionality was working as expected. But very quickly, behind the scenes, we started making corrections. You could see a change set on a project which we'd been working on for a couple of months; it was actually a large change set. As time ran on, the engineer investigating realized, wow, there's actually a massive change set that has been released into our environment. And that started a cascading effect: he started to make these small changes and slowly but surely the cascading issues started. What had happened was that there was a set of pending changes sitting there, and as soon as he fixed an issue, the new set of changes went across into our environment. Totally unexpected, right? We were totally unaware. And this is when things started to get worse. Suddenly the issues started coming in from our live customers, to the point where it got doomsday, where we had a full outage, 90 minutes in. And that's really when, from a personal perspective, it was like, okay, this is way bigger and more challenging than we initially thought. And that's when we had to take a step back. So it was like a two-legged incident, if that makes sense. Yeah.
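
To make the failure mode Joe describes a little more concrete, here is a minimal, purely illustrative Python sketch of a deployment queue in which change sets stuck behind a blocked apply are released automatically once that blockage is fixed. The names and the queue behavior are assumptions for illustration only, not Uptime Labs' actual pipeline.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChangeSet:
    name: str
    applies_cleanly: bool  # whether this change set can currently be applied

@dataclass
class DeployQueue:
    pending: List[ChangeSet] = field(default_factory=list)
    deployed: List[str] = field(default_factory=list)

    def flush(self) -> None:
        # Apply pending change sets in order, stopping at the first one that fails.
        while self.pending:
            if not self.pending[0].applies_cleanly:
                return  # everything queued behind this change silently piles up
            self.deployed.append(self.pending.pop(0).name)

# A large feature change set is stuck, and more work queues up behind it.
queue = DeployQueue(pending=[
    ChangeSet("big-feature-project", applies_cleanly=False),
    ChangeSet("weekend-change-1", applies_cleanly=True),
    ChangeSet("weekend-change-2", applies_cleanly=True),
])
queue.flush()
print(queue.deployed)  # [] -- nothing ships while the head of the queue is blocked

# Someone "fixes the issue" on the stuck change set...
queue.pending[0].applies_cleanly = True
queue.flush()
# ...and everything queued behind it flows into production at once, unexpectedly.
print(queue.deployed)
# ['big-feature-project', 'weekend-change-1', 'weekend-change-2']
```

In a sketch like this, each small "fix" can unblock work no one knew was waiting, which is one way seemingly unrelated issues can cascade the way Joe describes.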

Courtney:

It's a classic situation where the sort of debugging or trying to figure it out makes it worse, right? We've seen that one quite a bit, actually.

Hamed:

It was one of those days when you realize something and then, like, a very cold stone forms in your stomach: this is actually worse than I thought. And then that thing gets bigger and bigger. It was definitely one of those. Joe and the engineers were dealing with really the hard part of it, which is figuring out what it is, but I think it was even harder for me. It was a very tough day, and at many levels I felt conflicted with myself. Bear in mind that I have an engineering background, and I'd been fixing incidents and dealing with them for many years. In this business, I'm put in the position of CEO, just managing. I want to be inside, I want to know what is going on, but not be there. And then dealing with the pressure of sales calls being canceled or delayed, it put me in a position to experience everything that over the years I'd argued management should do or should not do. I faced every single one of those decisions in that 48 hours, during and then afterwards, which, yeah, it was very tough. I can expand on it if you want, but...

Courtney:

I actually would like you to, because one of the things that a group of us have been talking about in the community, the resilience engineering and learning-from-incidents community, is... I think even I'm guilty of this too. A lot of us for a long time abstracted away... we have the sharp end, as we would say, I make a little cheese wedge sign with my fingers whenever I do that, right? The people at the sharp end where everything's happening, Joe and crew, right? And we talk a lot about that and them, and I think we should, there's a lot of benefit to that. And we'd always be like, oh, the blunt end, right? Womp, Charlie Brown's teacher voice, like the business, the management and stuff. But they have their own set of hard trade-offs and implications of these things, and we don't actually talk about that a lot. A bunch of folks have been talking about what it's like to be empathetic to that. What does that look like? It's a part of the incident. It's a reality for the organization. And I think you're in a unique position because you are aware of those dynamics more than probably a lot of managers and execs, maybe especially at much larger companies. When should the CEO get in the incident chat room, questions like that, things I think you're deeply aware of. So I would love to hear a bit more about your perspective on that, especially the conflict, as you say, of having gone from being a sharp-end, frontline person to the person in charge. I think it would be very helpful for people to hear what those trade-off decisions or considerations look like.

Hamed:

Okay, great. So you asked for it. You need to stop me whenever I start to take a lot of time. So I'll start from the morning, when the first message I got was from Joe: I suspect there is something wrong, there might be some issues, I'll let you know later on. And that week was very important for us. We had three sales demos during the week, an important deliverable, one of our features that we wanted to progress and show, and one meeting, I think on the same day, on the Monday evening, a customer session I had with a senior customer. I just wanted to be physically present and see him using the product. As soon as Joe said there could be a problem, my thinking immediately switched: okay, should I keep these engagements? Should I cancel them? And I thought, let me ask Joe. Can this be fixed before the afternoon?

Courtney:

The classic management question, right? You're the one asking that question.

Hamed:

But before I typed that into Slack, it was almost like my previous self held my hand: if you ask that question, is it going to help him? Is it not going to help him? Will it add stress? Either way, they are going to fix it as soon as possible.

Courtney:

yes.

Hamed:

Literally, I had to pull my hand away from the keyboard, bite my tongue, sit down, and I was desperate to get more information. So that was another edge: do I ask how things are going? What do I do? That was the second difficult thing, to step back. Actually, Joe, maybe I slipped once or twice. Would you help me?

Courtney:

Did he give in, Joe? Did he ask? You can tell us, it's a safe space.

Joe:

Hamed's like a brother to me, so there's no filter. One of the things we did do was deliberately create space. For the investigation team that was running it, we really created space for them. And then, which is good practice, we had a different communication channel for updating stakeholders. Hamed was never in the investigation war room at any time. So that helped, and we did practice that, even though it was really tempting to go in and help. We just have this philosophy of leadership. Hamed, maybe you want to elaborate on that, about how even if they want to help with something, just being there at best makes things... yeah, you want to talk about that?

Hamed:

That's, I think, borrowed from John Allspaw: senior management being present on the incident bridge at best is not helpful.

Courtney:

At best.

Hamed:

That's the one I remember. But what I did on that day, which later on I think I was proud of, was: okay, I can't help and I shouldn't really get involved in fixing this, but how I can help the team is just being brave enough to pick up the phone with the customers and the people engaging with them, let them know it won't happen this week, and postpone those engagements. And then the other thing, I think it was on the second day, Joe, we were talking about: so the service is restored, and there's a bunch of work that needs to be done to make sure, A, we understand exactly what happened, and B, we get into a stable state. Do we prioritize that, or do we get back to that important delivery work that we promised? At that point there were a lot of promises being made. And I think that was another time when I seriously felt conflicted, because I so badly wanted that feature out. Then I stepped back and said, it doesn't make sense to distract the team with the delivery work. Let's take as long as it takes, as much time as it takes. We do this incident, we learn from it, the things that need to happen, and then we start this production machine again later in the week or next week. That was, again, a difficult call for me. The last one was when we started to understand how this chain of problems started and manifested itself. Joe talked about a change set being deployed to production when it wasn't meant to be. That change set went to production over a weekend because one of our good engineers, being really thoughtful and helpful, wanted to deliver a piece of work. He used his weekend to complete the work and get it to production. Now, what I was thinking in my head: should I ban everyone from deploying code on the weekend? Should I have a conversation with him? That was the most natural response. But then I thought, actually, what happened here is, we really learned something, or it was more a reflection on myself: if as a business we want to move fast, we need to actually encourage what he did. He wanted to take initiative, to deliver completed work, to deliver changes to production. The fact that it resulted in an incident, I think that's a systemic learning for us: why it resulted in one. So that was another thing: okay, how do we act on the back of this? We encourage everyone: the work you did was great, keep doing it. Everyone else, if you want to do the same, we need to go faster, we don't want to slow that down, but we need to learn what went wrong here, what things went wrong, and avoid that. So yeah, we celebrated that person's work, which was against my natural, not natural, the first response that came to my mind. On the back of all of this, you always say that incidents are an opportunity to learn more about your systems and how they work. For me, it was also an opportunity to learn more about how my brain naturally works. It gave me that spotlight into my own personality, what the tendencies are, and how I can change the things that come to my mind.

Joe:

I'll just riff on two points from when Hamed was talking, two things that I'll never forget. When you have incidents, right, they're literally memories. It's almost like a time machine: you just remember where you were when it happened, et cetera. And there were two moments. The first moment was when the cascading issue was happening. I just got the sense of losing control. I had this mental model of how things would work: okay, so this change went in here, let's make this fix here. And suddenly, you know, cascading issues, these pending changes that I had no idea about, started coming through. And I just felt like I was losing control. That was quite scary.

Courtney:

Yeah.

Joe:

That was a quite scary sensation, losing control. You know, as an engineer, you've got this logical breakdown, and in reality, it just wasn't like that.

Courtney:

And you built the thing, right? Or you were part of building this thing. You should totally know how it works. Yeah, so it's scary. It's humbling. I think this might be the first one in the list that you have in the writeup you all did, and this is an internal thing too, which I think was great because it's better than a lot of internal or external reports I've seen. Better is a very normative judgment, but I'm going to run with it anyway. It had a lot more helpful and interesting information. We'll go with that instead of better, how's that? And you had a number of surprises called out. And I love this because, I don't know if it was intentional or not, but it feels to me a little bit like a callback to our former colleague, Richard Cook, who used to say that all incidents are fundamentally surprises. You went through what the things were that were surprising about this, and one of the ones I wanted to highlight, I think it might have been the first one in the list, I reordered things as I was editing this, was a "feeling of paralysis" during the incident. And I don't know which of you this was, now that you've both told me your stories: no ability to assist meaningfully in technical troubleshooting and resolution. That was probably you, Hamed. Is that where that piece came from?

Hamed:

Yes, because I thought, I've done this all my life. I should be there. I should be part of it,

Courtney:

Okay.

Hamed:

But I couldn't make any difference.

Courtney:

But Joe, you said something similar from, as it were, the sharp end, which is that same feeling. You didn't say helpless. What was the word you used?

Joe:

Out of control.

Courtney:

Out of control. Yeah.

Joe:

Out of control. There was one point when we had all our working theories, and we were working on these theories about what to do and coming up with our plan. And then suddenly we started getting this new information coming through, and then we had a full outage, which was totally unexpected, right? We thought we were moving towards green. Instead, we were going more towards red.

Courtney:

And how could you not know that? Yeah. Why didn't you know that you were out?

Joe:

I wasn't the only one, by the way. The team, we were all standing there, just agape. Virtually, right? This is all virtual, we're a virtual company. And that idea of losing control, that mental model of the current state versus what we need to get to... I can never forget that point when suddenly I hit our platform and we're down, right? So we moved from a minor to a major to a critical. That was a scary thought. And then I consciously had an out-of-body experience, but remained calm

Courtney:

Yeah.

Joe:

to regroup, and the iteration started again. Okay, what's going on, right? I mean, the loop starts again, right, for the

Courtney:

Yeah, and this is what I loved about the writeup, because this is the reality of incidents that never makes it into incident reports. You might talk about what you were factually surprised about, right? That'll maybe be a surprise in an incident report, or people talk about how they got lucky or whatever. But this physical, emotional experience is so common. And so I'm just thankful to you both for writing about it and then sitting here and so honestly talking to me about it. So, one of the other surprises, and you've already largely alluded to this, so we don't have to necessarily get into more details of it, Joe, but if there is more: you said "a chain reaction of seemingly unrelated issues triggered a snowball effect." And this is definitely, I think, the cascading piece of it you were talking about, that phase that people are commonly in where you've got four operating theories or something, right? You've got a choose-your-own-adventure of where do we start, what do we try, because we want to have a good theory about what's happening before we do more, in case it gets worse, or it might get worse even if we don't do anything. So talk to me about the unrelated issues triggering a snowball, if there's any more you can add to that.

Joe:

Okay, so when we started the investigation, we had this change set. Okay, there are changes that have come through that weren't expected. So we sort of had a ring fence around the technology set, right? The area in our architecture which was impacted. And by making those small corrections, suddenly we started to see issues on the other side of the architecture, right? Like the other side of the

Courtney:

Outside of what you thought the scope of control or impact was. Ugh, yeah.

Joe:

And not only that, they were really bad, right? Really bad. It's bad when the platform is offline, and it was like, this doesn't make sense, this is unrelated, right? So then you have to pause, right, and take that space. Okay, let's investigate. The obvious technique is escalating, bringing other eyes to it, fresh eyes and stuff like that. So I brought in all our other folks from the team, increasing the size of the investigation team, fresh pairs of eyes. And then suddenly we start getting more and more spots: okay, well, I can see the error, I can see the problem, et cetera. Now we're looking at another set of changes, and then suddenly we realize this pending change problem. That was the moment we realized there were basically whole change sets coming through that had been blocked, and when those changes went in, suddenly the scope of change tripled, right? And then we start moving into... I had this really scary point where I said, let's just roll back. Can we roll back? And because we were on a video call, I could see the look in the engineers' eyes saying, we're rolling forward. So that, again, that was a...

Courtney:

Can we talk about that a little bit more, please, in as much detail as you're comfortable sharing? Because I hear all the time, ah, just standard incident management stuff: always have a plan to roll back. Always have a plan. Sure, the best-laid plans, blah, blah, blah. Was the fact that you couldn't roll back also a surprise, or was it a known thing you just hadn't had to run into before?

Joe:

No, no, that was it. That was a learning for me in the incident.

Courtney:

Okay.

Joe:

So I think at a certain point I said, okay, I'm going to time-box this. And, you know, in our heads it's okay, we can always revert back, we've got our exit strategy, a plan B. Especially when things started going awry, right, getting worse and worse, I was like, okay, time out, time out. Let's step back. Let's just see if we can roll back the changes. But the problem was, I think, when you talk about rollback: rollback's great when you have small, distinct changesets and you can reason about the change quickly and roll back, right? That's the total advantage of continuous delivery and practices like that. The situation we were in was that we had a larger changeset than we wanted, that had, let's just say, multiple projects pushed in, right? So with a large changeset pushed into prod, in terms of the ability to reason, it was easier for the engineers to fix forward than it was to roll back. And remember, this is all under stressful conditions, right?

Courtney:

yeah.

Joe:

So that was the decision: the engineers weighed up the pros and cons and said, Joe, it's easier if we roll forward and break down these issues that we have, rather than trying to do a wholesale changeset rollback. So that was the decision we made. But that was certainly another memory, a moment in time, when it hit that we can't roll back. One of those gotcha moments.
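
As a rough sketch of the trade-off Joe describes, here is a toy Python model, an assumption for illustration rather than Uptime Labs' actual tooling, of why rolling back one large bundled change set is harder to reason about than fixing forward: the revert undoes every project in the bundle, not just the broken piece.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Deployment:
    # project name -> (version before this deployment, version it shipped)
    changes: Dict[str, Tuple[str, str]]

def roll_back(production: Dict[str, str], deployment: Deployment) -> None:
    # Reverting the bundle restores the previous version of *every* project
    # it touched, including work that had nothing to do with the breakage.
    for project, (before, _after) in deployment.changes.items():
        production[project] = before

def fix_forward(production: Dict[str, str], project: str, patched: str) -> None:
    # A targeted patch touches only the piece that is actually broken.
    production[project] = patched

# A bundled change set spanning several projects lands at once.
bundle = Deployment(changes={
    "project-a": ("1.3", "2.0"),          # fine after the deploy
    "project-b": ("4.1", "5.0"),          # fine after the deploy
    "simulation-engine": ("8.2", "9.0"),  # the broken piece
})
production = {project: after for project, (_, after) in bundle.changes.items()}

# Option 1: roll back, losing the healthy work in project-a and project-b too.
# roll_back(production, bundle)

# Option 2: fix forward, repairing only what is broken.
fix_forward(production, "simulation-engine", "9.0.1")
print(production)
# {'project-a': '2.0', 'project-b': '5.0', 'simulation-engine': '9.0.1'}
```

With small, distinct change sets, option 1 is cheap and easy to reason about; with one large bundle, as in this incident, option 2 was the more tractable choice under stress.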

Courtney:

And probably felt even scarier.

Joe:

We're back to learning from incidents, right? If you're not convinced small changesets are a good thing, then in a situation like this you learn very quickly the power of incremental change.

Courtney:

Yeah, a lot of times what I see is there's some kind of production pressure or sales pressure. And I don't mean pressure in the wrong way, just the natural pressure that led to that, right? Like, why did it make sense at the time that you had a big set of changes all roll out at once? And I'm sitting here thinking, I know that Hamed had a couple of demos and a feature coming up. Was there a bit of looking back at that as well? Like, what were the other forces in the system that landed you there, in that spot?

Hamed:

Yeah, definitely. It's very easy to justify cutting corners to get new features out, because you think about it: okay, I want to try to win new business.

Courtney:

Yeah.

Hamed:

As a startup, every sale counts massively. So I was constantly communicating to people that our advantage is speed, moving things out as quickly as possible, and forgetting about the consequences, what it means. So basically, for me, the lesson during this incident was that by moving, I think, too fast, unsafely, I get to enjoy the reward of getting features out early and forget that this cost is going to come back and get us at some point. And then when you're given the bill, you get surprised: oh, that's what it costs.

Courtney:

Yeah, you used a really interesting word that I think, within our community of resilience engineering and learning from incidents, we use a lot. I would love for you to talk a little bit more about what you mean when you say unsafely. What was unsafe about that, from your perspective?

Hamed:

The ability to release quickly and deliver features quickly to production needs some infrastructure in place. You need to have certain capabilities in place to be able to do that: how the team works, the skillset you have in the team, the testing and deployment infrastructure you have, the practices that you follow, like Joe touched on, the big change set versus the small change set. So it's a whole combination of practices and technical advancement that needs to be in place before you can achieve a certain speed. Without that, yes, you can try. We tried, we did, but there's a risk to it. In this example, and that was one of the learnings from the incident that Joe picked up, we learned that we've got to be very meticulous about not letting changes pile up. Without that mentality in the team, trying to move very fast can have consequences. And that's just one example, there's a lot of that. The other thing for me was the complexity of the system,

Courtney:

I think this is fascinating, because you're a small startup, you've been around for three years. Some people might be like, shruggy face, how complex could it be, right? So can you talk a little bit about... yeah, oh, you're making a face like, oh, it's complex, y'all. Yeah.

Hamed:

It was, like, simple. I had it, all of it could fit in my head. And fast forward two years: I'm trying to understand what happened, and just the list of systems and tools involved, and our CD does this, and then we have this set of Kubernetes here, and then we do this. And I was just, wow, how did we get to this place where it takes even a few days to understand how things are connected? That complexity definitely didn't help during the incident. And I think in general it doesn't help with being able to deliver code fast and safely, because these systems do so many things in your delivery pipeline without you knowing it. And if one of them does something slightly different, unpicking and understanding that is almost impossible.

Joe:

Yeah, it's the abstractions. With Kubernetes, AWS, all these abstractions we have, right, which are brilliant for fast adoption, you're standing on the shoulders of giants. You want that as a startup, right? We follow these conventions, we get to move fast, right? But it's all well and good until something goes wrong. And suddenly you're digging into each abstraction, understanding where the issue is, what's happening. And this was a cascading thing, right? The cascading issues we were having were at different abstractions, and it was just non-trivial to debug. So it's fine when it works great, but when something goes wrong, that's when you're really challenged on these abstractions.

Courtney:

Yeah, and I'll throw the old AI and automation loop in that one too. As I was just telling Hamed, I got the extreme privilege of being on the This Is Fine podcast with Colette and Clint, and we were talking about how complex systems are in general, and then you layer lots of automation, or lots of AI, into them. And then, again, I think it's the same thing: it's all fine and good until it's not, and you can't tell why, or what's happening. So we're making that problem worse real fast as an industry. So it's interesting to hear you all acknowledge running headlong into that as well.

Hamed:

Yes, and it brings up a question that I still haven't answered in my mind: is it worth it? Yes, I get the benefit of the speed, but there are definitely going to be moments when things don't work and fall apart, and those are becoming harder and harder. So how do you mitigate that?

Courtney:

You're in a unique position in that your business is this business of incidents. You might have some pretty good customer goodwill for a while, I would imagine, for that. Or you could just tell them that they're in a meta-incident and now they have to help solve yours. So good luck with that.

Hamed:

I tell you, I need to be careful, probably our customers will listen to this, but smaller incidents for us pretty much go unnoticed, because there's an incident drill going on, and then we have an incident, and that drill all of a sudden becomes a meta-drill. So people think

Courtney:

Yeah, I...

Hamed:

something, but it's a little bit more challenging

Courtney:

You have a certain advantage there. They're like, oh, the incident just got way more interesting. You're like, you have no idea. Okay, so we just talked a bit about some of the themes that you all had. The desire to move faster while maintaining or improving safety was one of them. One of the other ones I would love to talk about, because I think it's also related to this reality of complexity, even in a small startup a few years in, was towards the end of the themes: knowledge gaps within the team regarding specific areas of the infrastructure. I don't know which of you this is, Joe, but I would love to hear more about that, because it's a concrete example of something that a lot of the time we talk about in the abstract, or in nerdy research papers or whatnot, in the resilience space. And here it is. So I would love to hear more about it.

Joe:

A core part of our platform is this engine, and we've got this really quite sophisticated technical architecture that we've been building layers onto over the three years, right? It's like the engine of our simulations, for the incidents. So when you're in an incident, you see Grafana, you see real stats, real metrics, you touch a website, it's really down, you're getting 503s, right? There's all that orchestration, it's really cool stuff, and Hamed started it, right, with his thing two years ago, and we built upon that. But when the incident was happening, we had all these mental models of how things work. I've got the diagrams, trust me. I used to be an architect, so I've got the as-is architecture: this system calls this, A calls B, right? Well, actually, A calls a proxy, and then the proxy calls this abstraction, and this calls this abstraction, and then we need to get the keys, right? And we need to get the secrets. And suddenly I'm like, oh my goodness, okay. Not only that, in the incident you have different opinions as well. Two of our senior engineers during the call were saying, well, hang on a second, if this works like this... You could even see there wasn't a shared understanding within the incident itself.

Courtney:

So it's the mental model of engineer A, the mental model of engineer B, and maybe a Venn diagram that overlaps some of that, but not even necessarily two people working on the same product.

Joe:

Right, right. The sort of meta thing I got from this was, again, imagine an early startup culture, right? It's way faster to have a single person work in a space and optimize that engine, right? To deliver fast. It's way faster, until something goes wrong and that person's not there. You've got this abstract knowledge all built into one person, the bus factor, right? It's way harder, and takes discipline, to work as a team on these core pieces and knowledge-share. It happens all the time: you give someone else the piece of work and get them to make the change in that system, and it takes longer to deliver, right? But you're getting the payoff of knowledge. This was a big thing we learned: the key pieces that we have, we need to spread the knowledge, and that's only by doing, not by having whiteboard sessions and more diagrams. It's

Courtney:

not runbooks or documents, but hands on the systems. Yeah.

Joe:

So yeah, that's something we were very aware of.

Hamed:

Actually, this is a good point, Courtney, it touches back on your question: what do we mean by faster and safer? And I hold my hand up, because it's very easy: oh, X did work on that part before, hasn't he? It's much quicker for him to do this part, which we need to deliver, so let's get X on that. And then X becomes the expert there, Y becomes the expert in something else, and Z works on something else. And as Joe said, that works until something goes wrong and these people aren't around, or even between them they can't agree on what is going on. So that was, okay, I've got it again: the false economy of, X did it, let him do it again, or let her do it again.

Courtney:

Yeah, it was Amy, the Kafka expert, who is my scapegoat for that one from a previous company, actual true story. But I think, Joe, the piece that's really interesting, because with the VOID what I'm always trying to do is connect theory to reality to practice, right? There's all this research out there about expertise and how you only get it by doing. And so to hear you say that sort of warms my academic heart a little bit, because that's how the brain works. You don't get better at something by reading about it. You might learn about it, you might understand something about it, but you certainly don't just watch 14 YouTube videos and then get in a car and drive. Although I did once watch a lot of YouTube videos and teach myself how to use an excavator. I'm not recommending most people do that. I may or may not have done some bad things, but at least it was on my own property. So yeah, I just want to harp on that point a little bit more, because I think it's also my pet peeve with automation and AI systems, where you're abstracting away the work that a human's supposed to be doing. And again, it's probably all good until it's wrong, because then that person has probably had no hands-on experience with that system and what it's supposed to be doing. So it's much harder for them to, A, even understand what it's doing, and B, then go in and try to reason about it and fix it. And so hearing that in the wild, as we say, is I think really important for people to understand: that a runbook is not the same as having your hands on something. And it is discipline, you're right. You, Joe, to let four people on your team rotate through something or spend more time on it, would have to go to Hamed and say, in order to build this degree of safety or whatever into our system, you're not going to get XYZ until two weeks from now, or whatever that looks like, right? Culturally, that has to be acceptable in order for that to work.

Joe:

Yeah, yeah. And you know, you see other practices, like chaos engineering, I know Courtney has mentioned that before as well, and we're trying to mock up that discovery exercise with the team too. Those practices work. And the other thing, this is another theme we had, and Hamed, you should definitely talk about the way you ran the PIR before we go, but one of the themes we had was simplicity. Again, we did take a step back and say, it doesn't need to be this complicated. Yeah, we are where we are now, but is there an opportunity, where we are, to make this simpler? Can we simplify it? And again, that's more investment. I look at the positive: you can call it, glass half empty, technical debt, or, glass half full, technical agility and resiliency, and just make it easier. So maybe there are things we did in-house that now we could probably abstract away, right? But that idea of simplicity: can we make this more simple? There are things, for example, we're a startup and we built stuff that maybe we don't use now; it's time we can churn through and clean that up. So again, another key theme was, let's make our architecture a bit more simple.

Courtney:

So Joe mentioned how you ran the PIR, and I didn't see this in the doc. Well, I do, I lied: I see it in the doc in that I see a very thoughtful post-incident review, and the learnings from it, in this document. But there's not a meta "here's how we did this." I would love to hear a little bit more about that, if you don't mind, Hamed.

Hamed:

So I'm going to share something, it's a safe space, that touches on the PIR as well. It was the first time ever in my career that I picked up on it, and it really had a profound impact on me. When Joe, I think it was day two, called me to explain what we knew about it, how the situation was, and the still-lingering problems, I read a sense of guilt. It was almost like, "Sorry about what happened," and it was...

Courtney:

Guilt from Joe?

Hamed:

From Joe, it was in his voice. And that all of a sudden hit me hard. Never in the last 15 years had I noticed it, but then it reminded me: yes, when I was an engineering manager, when I was doing that job, explaining how incidents were progressing or what happened, I had that sense of guilt in me as well, going to the CTO, as if it was like,

Courtney:

Your fault.

Hamed:

We are at fault. And I just picked up on it for the first time in this incident, probably because the roles were reversed. So at that moment, I promised myself the first thing I was going to do in the PIR, when we met, was just to thank everyone for the amazing job they did and recognize that it was a very stressful time for everyone, and they all did well in it, with special thanks to Joe as well. So that sense of guilt in the incident was something I picked up on for the first time. Coming to the PIR, I don't think I did anything extraordinary. There were two questions we spent a lot of time on. One was: what was so surprising in this incident, from everyone's perspective? Because it was different for each person. So we went around the table and everyone explained what was surprising for them throughout the incident. We captured that, and all of it is reflected in the internal report that's there. And then the other question was: what made this specific incident difficult to fix? So I think that was the only thing I did, unless I've done things or run it in a way that I didn't notice. Yeah.

Joe:

retrospective.

Courtney:

Yep.

Joe:

Hamed set the scene as a blameless setup, and then that openness came. Generally, you follow up after the post-incident review around, okay, what have we learned from this? What are we going to do? What are we going to change? And those are patterns I believe any organization can copy. And also the timing of it as well, Hamed, the cadence is very important when you do the PIR. It's about Goldilocks: it can't be too soon and it can't be too late, right? Just getting it right.

Courtney:

Which is not always easy to do for sure. Joe, do you feel like there was a feeling, not because of organizational pressure or anything, but do you think people felt guilty or did you feel guilty?

Joe:

Yeah, because you're very proud of what you build and ship, and, you know, when you have a four and a half hour outage... as a co-founder of Uptime Labs, I see the impact, the business impact, the opportunities we missed for those four and a half hours. There's an incredible amount of guilt. But generally, first of all, the leadership coming back, Hamed coming back and saying, look, what did we learn, so that as people join the tribe you don't lose that knowledge, is really, really important. So yeah, but it's human, it's very human nature, right? I think it's very human nature to feel bad.

Courtney:

You're showing up for something you care about every day. The age-old thing that Allspaw and others would say is that people don't show up to do this work to not do well, to screw up, to make mistakes, or to cause problems. And so even when it's not your fault, as it were, because it can't be with these kinds of systems, you feel a sense of responsibility and ownership. And I think to tell people not to feel that way is silly, but to help them, as I think you did, Hamed, work through that feeling and get to a place where you can accept what it is... And then, there are so many metaphors between incidents and parenting. I feel like I could have a whole other podcast about that right now, because I'm just thinking about you not being in the room, and I'm like, how many times do I just have to leave my 13-year-old alone and not touch it? Anyway, another podcast, another day. But working through those difficult, real human emotions is another often obscured piece of the process that's so important, and it sounds like you all did a stellar job of it. So on that note, we have gone quite long, which is great, actually. I want to thank you both for joining me. I will put links to anything else you want me to include, and I'll put a link to Uptime Labs. And I will put in a plea for you to work with yourselves and whoever else so that you can write these up for us to put them in the VOID someday, please. Because yours is so full of all the things I want to tell people about, and being able to show that to them is really nice. So even if you have a United States government redacted version, all the details gone, just a lot of black boxes, and then what you learned, I would still be okay with that. So that's my unfair request. Thank you so much for joining me. And I hope things stay up, without a lot of downtime, for a long time.

Hamed:

Thanks Courtney.

Joe:

The one thing guaranteed in life is incidents, right?

Courtney:

Yeah, a hundred percent.