The VOID
The VOID makes public software-related incident reports available to everyone, raising awareness and increasing understanding of software-based failures in order to make the internet a more resilient and safe place. This podcast is an insider's look at software-related incident reports. Each episode, we pull an incident report from the VOID (https://www.thevoid.community/) and invite the author(s) on to discuss their experience both with the incident itself and with the process of analyzing and writing it up for others to learn from.
Uptime Labs and the Multi-Party Dilemma (Part II)
In Part II of the Multi-Party Dilemma (MPD) drill retrospective, we reconvene to dig deeper into the implications and nuances of the simulated incident exercise hosted on the Uptime Labs platform. Eric Dobbs (incident analyst), Alex Elman (deputy IC), and Sarah Butt (incident commander) continue their debrief with Courtney, reflecting on how team behavior evolved under stress, the importance of expertise in managing non-technical aspects of an incident like saturation, and how deeply held assumptions often go unspoken until tested under pressure.
This episode emphasizes the complex social and cognitive dimensions of incident response, such as how people coordinate, communicate, and construct shared understanding. It highlights the value of analyzing drills not for failure points, but for what they reveal about real work, adaptation, and human coordination.
Key Highlights
- Incident Analysis as a Practice:
- Eric Dobbs emphasized understanding how people make sense of unfolding events, rather than judging decisions in hindsight.
- The goal is to study the “why it made sense at the time,” not what was “right” or “wrong.”
- Drills Expose Hidden Assumptions:
- Even experienced responders bring unspoken mental models into incidents.
- The drill revealed assumptions about communication flows, authority boundaries, and vendor interactions that were not made explicit in planning.
- The Value of Human Expertise:
- Everyone involved in this incident brought an unparalleled level of expertise to the work.
- Often this kind of expertise goes unnoticed or is taken for granted, yet it is precisely what makes for smoother, better coordinated, and sometimes faster incident response.
- Importance of Framing:
- The way questions are asked in retrospectives can shape what is revealed—e.g., “What made that hard?” is more productive than “What did you miss?”
- Reframing incidents around constraints and tradeoffs leads to deeper insight.
- Team Learning and Culture:
- Safe, high-trust environments enable better learning during drills.
- Psychological safety allows team members to admit confusion or raise alternate interpretations during real incidents.
Resources and References
A few moments later. So we had an incident while recording our incident retro. For those of you who didn't watch part one of this, go back and watch part one, because this one's not gonna make any sense. We're gonna drop you right into it. But Eric had a power outage. I got a migraine. Sarah I got a migraine. We decided to break, but we're back. Lucky you. We got the band back together. So, buckle up, strap in. Here we go folks.
Courtney:Here we are again. Podcast recording two. Electric Boogaloo.
Eric Dobbs:What I remember of what Sarah was saying before I dropped was her commenting on the comparison: me claiming I would be buried under those circumstances, while Sarah's handling it fine. She comes back with, yeah, but I had Alex, and there's reasons that that matters, and then a lot of detail which just goes on. And I hope we have the recording, because the detail reveals yet more expertise about Sarah's understanding of overload and being able to share the work with Alex.
Sarah Butt:I remember it. I think the two things that I had said functionally were: Eric had said, I would've been overwhelmed. And I said, keep in mind there were two people here, not just one. And I had not only a deputy, but a really fantastic deputy. And there were a lot of unique benefits to the fact that it was Alex and I working together. Even though we've never run an incident together before, we have a really strong friendship and we have a lot of common ground. And so there's two pieces to this. One is that I know Alex, and I know that he knows how to handle saturation and monitor his own saturation. And I trust him. So I threw stuff at him initially, trusting that he would prioritize the right things and offload or otherwise shed load however he needed to. And so I basically just sent it to him and then knew that, unless he said I cannot do this, it was gonna be taken care of and he would manage his own saturation, because he is a very mature incident responder. The second thing is that Alex and I have a strong friendship, even though we've not run an incident together before. We've written papers together and traveled together and all sorts of stuff. And we come from similar incident response backgrounds as far as training. So I don't have to say a lot of the normal sort of trees that I would do with a person that I didn't know. I can be pretty terse. I can be like, I need a CAN. I need this, I need that. Go, go, go. We share a language. So I'm not sitting there going, this is what a CAN report is. I'm saying, Alex, I need a CAN. Alex, can you pop to the main channel and just head off, uh, Tinus and Bez, and say we're aware of an issue, it appears right now to be affecting primarily EMEA, we've currently engaged lead, all of that. Just get them in a box. This basic stuff, go. Alex, put Bez in a box. Go. I'm not explaining what it means to put an executive in a box. I'm not telling him how to do it. 
And I think that piece helps tremendously. Courtney's laughing and I don't know why.
Courtney:Uh, I'm like, well, nobody needs, uh, to be told how to put an executive in the box, uh, out of this crew. So, yeah.
Alex Elman:Also, if you can mention what your experience with Bez was during the incident, because I thought Bez was very disruptive on my side, but you mentioned that you didn't see much of Bez.
Sarah Butt:I saw him, like, maybe once or twice he popped his head in. There was once when he popped his head in and he got after the customer support person for posting the status page too early, which, regardless of whether or not he'd been involved, that entire conversation I was gonna kick to biz comms 'cause I just didn't want it in the middle of the troubleshooting. And then he popped in once to get sort of angry about the data center, and that's when I think I said, because I didn't have the bandwidth to look at the biz comms channel the entire incident. So that was when I actually said, Alex, can you put Bez in a box for me? I had no idea that Alex was already, like, actively, continually putting the guy back in the bullpen. Yeah, like the box was being repeatedly broken open and Alex was just reconstructing it actively. And it's funny because I think it speaks a little bit to different people's demeanor during incidents as well. I was talking with John Alba about this incident, and when we talked about it, he's like, Sarah, you're like Tigger. And Alex is like, not like Eeyore, but it's just the energy level. Alex is very calm, he's very composed, and you kind of hit it with a lot of energy and wear a little bit more of the emotion. So the most that you hear Alex say is this very calm, like, executives are, they have a lot of questions.
Alex Elman:Sarah Bez seems pretty incensed and is asking for status update on number of customers, percentage of customers impacted.
Sarah Butt:Like it's something like that. It's very measured. So I had no idea that Bez was having a meltdown in the other channel because Alex just completely buffered that out.
Alex Elman:And I was so, so worried that Bez was being disruptive, because every time I would try to satisfy his demands, he would go quiet. And I didn't know if he was going quiet because he went elsewhere or because for the moment he was okay with what I said. But I'm glad to hear that it maybe was effective.
Sarah Butt:Hamed, what do you think?
Courtney:Okay,
Sarah Butt:I don't know who was playing Bez or helping the AI behind the scenes with Bez, but can you, like, put a Bez hat on? Do you know what Bez would say about his experience during the incident?
Hamed:so Ed was playing Bez. I was Tinus, Tanya, and, Hamed.
Courtney:There was another Uptime Labs person behind the scenes, not you, that isn't here with us today. And that was who was running Bez, is that correct?
Hamed:it was. There were two of us,
Courtney:Yeah.
Hamed:because it's a very involved drill. But I spoke to Ed about it: how was Bez's experience during the incident? So, I'm quoting Ed here: on one hand, I think he enjoyed the information that Alex was sharing with him. At the same time, he was feeling impatient because of the magnitude of the issue. So all along he had this struggle of staying back or coming in, and the data center issue really set him off: four hours, all our business is down. And I think that was where he had a bit of a burst, if I'm not wrong, Sarah.
Sarah Butt:Yeah, you're right. Actually, you're right. I remember him coming in. That was exactly when he burst in. And what I did, and it's one of my sort of tricks when I need to manage someone in leadership who, (a), has the opportunity to be very helpful for me, but, (b), also has the opportunity to potentially really derail an incident, is I find a helpful task for them to do that is uniquely suited to them. One of the things that executives are amazing at, because they tend to have a whole different set of connections than the standard support channels and stuff that you use, and I do it often if I need to, is: hey, whoever the senior person is, I need you to go get in touch with this company. Do you know the account executive? Do you know someone in leadership over there from a prior piece in your career? Are you the person who signed the contract? I have literally sent executives, where they've said, I don't know how to get in touch, and I said, you and procurement go figure out a different way to get in touch with this vendor, because I cannot get in touch with them yet, or I want every possible inroad. So as soon as he hit, I basically tried to U-turn him back and say, you have two options here. You can help escalate, or you can go get information from Alex, but you cannot be in the middle of the technical bridge just churning. You just don't get to do that.
Hamed:And I think so. Now I remember it, that was a critical moment, because Ed brought that up as well: Bez, like, for the first time in that incident, felt that he could be useful. So he had a task, he had a purpose, he stopped disrupting, and that was interesting. We have run this drill many, probably over 40 times already, and it was the first time that someone actually gave a task to Bez.
Eric Dobbs:This is the thing. I'm so excited to hear this. Sarah, here's another thing about expertise, right? So I geeked out before we lost power last time about what a powerful experience it is to witness expertise in action. Sarah, I don't know if you have any sense of how deep your own expertise is. As many times as they've had people run this drill, you're the first one to show up with a project for Bez, like a concretely useful thing, to give Bez something to do.
Sarah Butt:no.
Eric Dobbs:And what I hear you saying is you have experience. You have played this game before.
Sarah Butt:So it's interesting for me, because to some extent I don't, and you can hear this on Incident Fest, there's a webinar that I did with Beth and a few other folks who had also run this drill, and I did talk a little bit about the law of fluency. But I think the other thing that's interesting is this is actually something that we train. This is in our advanced incident commander training. And so I learned this from several other incident commanders who came before me and trained me in all of that, probably four or five years ago, when I was first starting with this employer. It's a pretty standard strategy we use, and we use it for two reasons, but it's not entirely just, oh, I wanna get the executive off the bridge. It's not that at all. There's a whole interesting discussion, and I don't know if Incident Fest got into it at all with the webinar, about what's the role of an executive on a bridge and different ways to use them. I don't have time to go into all of that, but what I will say is, nobody gets on a bridge intending to be disruptive. And often people don't recognize how intimidating they can be just based on their title. I know when I was in leadership, when I was leading SRE for an organization, people had to pull me aside as I was learning and say, hey, you don't realize this, but when you get on an incident and you've got people three or four levels, you know, it's a big deal. It's disruptive. You don't think you are, you think you're just getting on to help. And I think that's magnified by a hundred x when you have an EVP on the incident. So executives are humans, and they get on as humans just wanting to help. One of the things that I think is valuable in an organization is, how do you figure out the best way for them to help? 
And they often actually do, if you give them the right structure, have unique ways to help. There have been times where I have literally LinkedIn searched during an incident if I needed to get in touch with a vendor. I don't know if I've done this with my current employer, but certainly with past employers, I have LinkedIn searched for people who previously worked at the vendor I'm trying to get in touch with and who are executives at our company, called them cold, and said, hey, I have an incident. It's for this part. I know you're not responsible for this piece of the business, but would you be willing to step in and call whoever you know at that company to help us try and get moving? It's normally done in parallel with the official paths. But I do think there are lots of ways that executives can lift internal and external roadblocks, and they're totally willing to do it. In my experience, it's a matter of making a path for success in how they land on the incident and how they interact with your responders.
Eric Dobbs:This is,
Courtney:amazing, y'all.
Eric Dobbs:solid gold.
Courtney:I, so I wanna keep going though on the themes if we're okay. Eric, and let.
Eric Dobbs:I was about to intervene in the same way, Courtney. Here's the thing, and I'm gonna go a little meta. I'm trying to facilitate a discussion where we learn stuff about the incident, and where we learn stuff that's not about the incident: stuff about incidents in general or about our business in general. So where we were in the plot and where we were in the document was, I was trying to introduce what even is a theme. And there's this common problem that a theme is an abstraction that's hard to explain without giving a concrete example. So I entered by saying, you know, themes are some pattern that came up out of the incident where I heard from more than one person the same kind of topic. "Same" as a word is carrying a lot of weight. Like, what makes these things the same? Well, I heard about saturation from Alex in one context, I heard saturation from Sarah in another context, and I saw them skillfully navigate their mutual saturation at this most tense point in the incident. So this was a specific theme I drew out, and it is inevitable that as you're trying to talk about the abstraction, you have to give an example, because the abstraction doesn't make any sense. And once you're looking at the example, you get lost in the weeds of the example, because that's the nature of it: if you get a good theme, it is so compelling you can't not talk about it. So one of the themes I was hoping we would discuss, and it's sort of out of order for the agenda that I had planned for the retro, but totally fine, I'm happy: we've pretty thoroughly covered the theme of saturation and seen it from many different views. This is so much more valuable than the things we might fix in the air conditioning in the data center for this specific incident. If we are inside the simulated company, we don't even have control over the air conditioners. They're in a vendor's hands. 
We can't work those levers except maybe changing our support contract or setting different expectations with the vendor. We could pursue contractual avenues, but we don't have our hands on the air conditioner or the maintenance schedule or any of it, right? And certainly none of us, even the vendor, can control the weather that is certainly contributing to the air conditioning problems. But what we can do is learn how to coordinate, and learn how important saturation is in any incident, but in this one in particular. And it came up so specifically and in so many deep ways in this. This is the kind of gold that's available in an incident if you look a little more closely than "How do we fix it? How do we prevent the next one?"
Courtney:Yeah.
Eric Dobbs:so
Courtney:Here.
Eric Dobbs:that's the thing we're looking for when we're trying to draw out themes. Let me finish through the list. This item came up in a few places for both Sarah and Alex in particular. I enter this as an analyst saying, you know, this looks like a pretty unusual situation. Sales are down, you've got the executive, you're expecting a big sale, unusual loads of traffic. And both of them were like, yeah, there's nothing unusual about this. This is business as usual when you're an incident commander. And we can possibly dig into that. A fourth item of interest, and this is sort of the most anchored and specific for this incident: once we know that the source of the problem, the trigger of the problem, the thing that's going to give us the most leverage to get out of the problem, is about air conditioning, we have two paths that we're pursuing: waiting for the vendor to fix the air conditioning, or getting out of that data center. Both of those paths have risk. And there's a debate in the midst of the incident. This is probably, narrowly, if we were looking for fixes, the place that we would get lost as a team trying to learn: in the weeds of this trade-off decision about wait for the air conditioning to get fixed or get out of the data center. And so those are the sort of four items that were the most obvious. But I need to, briefly, just for visual impact: I dropped in at the bottom a list of what looks to me, at a skim, like maybe 25 that are well substantiated with evidence. They didn't come up from everybody. The four that I named are the ones that everybody I talked to and everything I saw reinforced. But any of these 25, I think there's plenty of evidence in the material we have of rich themes that could be worth talking about. So I'm gonna back off of those 25. 
I'm gonna use the navigation to get back up to the themes. What I would normally do, if I hadn't let us get derailed digging deeply into saturation, is invite some discussion from the group about which of these we wanna talk about most. The group already sort of decided that saturation was important to talk about. I need to hit this one other detail. It's in the insights. I had already sort of scrolled it into view while we were talking about it. This is particularly great because it's happening during the debate, and the parallel tracks, about waiting for the data center to get the air conditioning fixed or executing the business continuity plan. Sarah asks Alex, and this is one of the places where she fluently has too much on her plate, she's delegating a thing to Alex, but the way she delegates it is
Sarah Butt:Alex, there's a BCP doc. Do you have bandwidth to read it? They sent it over to me. It's this one.
Alex Elman:Yeah. Once I get this CAN out, I'll read it. I'm almost done.
Eric Dobbs:Now, the backstory here is that she's asked about four times in Slack for somebody to tell her, like, what's the story with this plan? She's already been pursuing it. Nobody's answered.
Sarah Butt:I'm sitting there and I'm like, nobody is acknowledging. Like, I want someone to say, ack, I've got it. I'll be there in five minutes.
Eric Dobbs:She's finally like, Alex, we need details about the plan. Have you got the bandwidth to deal with it? So she's delegating with deep awareness that Alex needs to be not saturated in order to take in the importance of it. And she's checking that first. Alex responds fluidly: almost there, I need to finish the thing I'm doing and then I'll get on it. This is the sort of thing that's so subtle, you could completely miss it. Sarah checks before saying, dammit, I need this thing. Although, Sarah, I'm projecting into your head: you've asked several times. I think at the point you're handing it to Alex, you're feeling impatient for an answer about the thing.
Sarah Butt:Yeah. I mean, yes. We needed to be moving in that direction then. And I think, if I remember right, I used this sort of typical strategy of, I push on a person one, two, maybe three times, and then I start to round robin a little bit and pull in people who might be useful. And that's when the disagreement started between Tanya and Hamed, which, I will just add a slight bit of interesting commentary: not knowing the org chart, I thought that Hamed was like the Chief Customer Officer, which is part of why I didn't necessarily want to lean into his recommendation to enact the BCP right away, because I didn't think he was on the sharp end of the system. I didn't realize he was actually Tanya's leader and deep into the platform side. But regardless, that starts happening. And as soon as that happens, it's like, I want to hear that information from them, because I wanna understand from the sharp end why they're nervous about this BCP. But someone's gotta give me the freaking BCP. Someone's gotta tell me what we're doing. Just basic steps. How long is it gonna take? I just need information. And so I look around, and again, it's me reaching for that deputy: where's the trusted unit of adaptive capacity I can grab at? And so I grab for Alex.
Eric Dobbs:So there's this reciprocal example of it from Alex back to Sarah, moments after Sarah's asked. Alex is saying, I'm working on the CAN, I'm almost done. There's a moment when Sarah is sort of voicing and typing a set of pretty complicated questions, trying to manage the parallel tasks. And there's sort of parallel questions in this complicated blurb that she's typing into Slack, and Alex is impatient to deliver on the request she made of him. He knows how important the business continuity plan is, but he's savvy. He recognizes she's overloaded. He doesn't just drop it on her, but he does at some point voice, I've got a thing for you when you're ready, kind of thing. And she says, you know, hold while I get these questions typed. So it's exactly the same thing. Alex is signaling he's got a thing for her, just enough of a signal that it's not going to blow the stack that she's managing, but also enough to get in there, so that he's elbowing his way in with this thing she asked for. It's such a savvy skill in the coordination to have just the right amount of interruption there. Sarah's skilled at saying, can't handle it right now, and then when she gets through typing the question, she's like,
Sarah Butt:Alex, sorry, I put you in a buffer. What was that about the BCP?
Alex Elman:Yeah. We have a BCP.
Eric Dobbs:The fluency with which they are negotiating each other's overload, the fluency with which they are aware of each other's overload and signaling they can handle it. They're both recruiting each other as resources to try to manage the complexity of this high-tension, high-pressure moment in the process. And the expertise on display in just this little exchange is extraordinary. And y'all, this was a drill.
Alex Elman:I wanna stop for a moment to point out the sort of artful facilitation that Eric is engaging in, in this retrospective. So Eric introduced us all to the four themes that he identified in his analysis, but he acknowledged that he certainly hasn't identified all the themes. And then he showed us about 25 observations that he made. They're not themes yet. They're observations. Some of them we can call proto-themes. And then he let us sit with that for a bit. And then he zeroed in on an important observation and went into detail on it. And I think what Eric's doing here is he's trying to lead us to another theme, not telling us what that theme is. Creating themes is very difficult. It's something that takes a lot of time and experience. And so it can be easy to look at Sarah leaning on a deputy and try to use a counterfactual, saying, well, without the deputy, Sarah would be underwater. So that's a thing: we can't use counterfactuals. We have to talk about what happened. And so, Eric, what are, you know, the four different ways that Sarah can deal with that saturation normally, and how did she use the deputy with that?
Eric Dobbs:Oh, thank you Alex. and thank you for the craft of, of your own question. I'm, I'm not gonna do meta on what Alex just did about what I'm doing as meta.'cause we can go all the way down. We can keep going with
Sarah Butt:I love incident.
Eric Dobbs:The specific, so good. So the specific, are four responses to overload. and these apply to, the pattern of overload, is everywhere. It's in bacteria, it's in the way humans communicate. It's in our individual experience of a to-do list. It's everywhere. It's it, it's in a Kafka queue, which
Courtney:I am, I'm having like parenting, PTSD while you guys are talking about this, of like, where I have to be like, no, hold the, oh, you shut up over there politely and kindly to my 11-year-old while I'm talking to my husband. While my thir, I'm like, it's every, it is everywhere, you know?
Eric Dobbs:Overload is everywhere. So the pattern of overload has four mechanisms, four responses. There's four ways that you can adapt to overload. There is a model of overload that suggests there's these four things. It's possible the model's wrong, but I think it's pretty good. Two of them are urgent things that you do without thinking. Two of them are more effective, but require some anticipation. So the first two, the default ways that overload gets handled, are reducing thoroughness on the collection of things you're trying to manage, and just dropping things from getting done. My new hypothesis, this isn't in the research I've read, but my claim is what humans do by default is reduce thoroughness without even thinking. Before we even start shedding load, we reduce thoroughness and try as long as we can to manage the to-do list that's in our brain. And at some point our brain has lost capacity and we just forget the list of things we were trying to work on. That's where we're dropping load. But the default mode is reducing thoroughness; shedding load is the other one that happens. Both of those tend to lead to suboptimal outcomes. The other two that are sort of better, I'm going to throw judgmental language on that, are to recruit more resources, get help from other people or other systems, or to defer work until later. So those four strategies are what you do when you're overloaded. The beautiful thing is this example of saturation. What we have on display in this moment of communications back and forth between Alex and Sarah, when they are at saturation, is we can see both of them deferring and recruiting resources. And it is precisely the engagement in those more effective strategies that is the expertise they're demonstrating. They're not just dropping it. They're not just reducing thoroughness. They're communicating and signaling, saying, hey, I need your help. 
Hey, you asked for my help. And in an incredibly savvy way, they're mutually managing overload that's really too much for both of them, but successfully coping with it. This right here is actually it, for every incident everywhere. Managing overload is the thing that is common to every incident everywhere. There's a paper, we can put this into show notes. It's the theory of graceful extensibility. And one of the core assertions in the theory of graceful extensibility is that it is precisely managing the capacity to adapt, managing the risk of saturation. Those are inverses. There's a risk of saturation, and there's adaptive capacity. And managing those things is how you do graceful extensibility. That's as much as I wanna say about the paper, because that's a whole deep dive.
Courtney:We'll put a pin in it and yes, we'll put all of that in the show notes. and I mean, the, the other thing I wanna highlight about this for, for folks listening or watching along, is none of those themes are technical things. And you're, you're already saying this, but I'm gonna, I'm gonna reiterate it one more time for everyone at home. and I mean, we have some candidates for recommendations below, right? Like what people would more likely consider to be action items to come out of a retro. but the themes have nothing to do with the technical stuff. And even if you go and do all of those action items and you change your BCP or you, you know, have different contracts with your vendors. This will still be there. And, and that's what you're saying. And it's like, so if you have this laser narrow focus on action items, you're missing the bigger picture.
Eric Dobbs:Absolutely, and if I can just reinforce that around the specific content of the multi-party dilemma. The problem we're in as an industry is that we just spent a decade tying our systems together technically. Across company boundaries where we are separated And so the communications between the humans that keep all the software running inhibited or, or held up by the fact that we've gotta go through support communication channels to get to the air conditioner the case of this specific incident. that's happening everywhere. So this notion of coordination across company boundaries is the hidden problem that all software has right now, unless you're in the rare situation of being completely in control of your hardware and software. and I actually, I think that's close to zero in the, in the business world. So I think that the topic we're on the, the, the, the specific subject of the drill is one of the most important things and how to coordinate another one of the most important things. And it is really hard to do when we've got these legal entities slowing down our communications when it matters most technically.
Sarah Butt:Eric, let me jump off of that and just mention two things off of the, multi-party dilemma work that Alex and I did, which, we should probably also, link in the show notes. There's a, paper and a presentation, but, comes out of the presentation that Alex and I did. Two things. One is a lot of the research and the case studies and such that Alex and I looked at for the MPD was around external relationships. But when we look at companies that do have a significant amount of this in-house, we see this internally as well, normally between teams. As the systems got more complex as we needed different bits of expertise, it became easier to have people specialize. And so we created boundaries within our own companies, whether that's there's a networking team or there's a security team. So these dynamics don't necessarily just happen. Externally, they also happen internally sometimes in different forms and flavors, and sometimes you have different ways to potentially address them. The other thing that we talk about in the paper that's interesting is we talk about, nested or hidden dependencies where your vendor has a vendor that you had no idea about until something happens. And so, like in this case, I am sure that the HVAC vendor is probably not directly the colo, it's probably someone else. we see this commonly with, just the, there's like really, really big vendors that, a lot of people tend to build on. And I won't go through and try and name them, but you might sit there and go, no, no, we use this vendor, not that vendor. And it's like, but this vendor uses this vendor that uses this vendor that uses this vendor. And so sometimes you see these incidents and I think we're seeing more and more of them now. And um, like CrowdStrike is probably a good example where right around the one year anniversary of that, where a bunch of people would've potentially said. No, we're, we're not CrowdStrike. We use someone else. We use this and that, but their vendors used. 
And so we end up in these very tangled, complex systems, and it's not necessarily a bad thing. The systems were going to become complex; we had to start to outsource expertise and break it up. I know we talk about this piece in the paper and why it became this way. But my goodness, does it make it complicated.
Courtney:Yeah, it's third-party turtles all the way down.
Eric Dobbs:And I think we're only weeks away from a pretty high-profile incident that was sort of shared between Google and Cloudflare.
Courtney:Yep.
Eric Dobbs:Everybody uses Cloudflare. So there were cascading failures for all of the brand names. All of the 500
Courtney:Yeah. And all of us were like, wait,
Eric Dobbs:that.
Courtney:Cloudflare used... We were all like, what? Nobody knew. Yeah. It was
Eric Dobbs:Yeah.
Courtney:Interesting.
Eric Dobbs:And in none of these cases do we mean to single out any of the companies we've named. This is endemic; this is the entire industry. We can name specific companies where there have been incidents and their name got out, but this is happening everywhere.
Courtney:Yeah. Agreed.
Eric Dobbs:With much smaller impacts. I'm conscious of the time. Are we
Courtney:Yeah.
Eric Dobbs:close to the end of our recorded time together?
Courtney:We're getting close. But maybe lay out the rest of the structure, and let's pick what we're gonna cover in our remaining time here.
Eric Dobbs:Yeah. So, a quick sketch of the document. We hit the themes, and there's clearly more here. There are other themes of interest, and the one I already have the most evidence for in the document is fixation. So maybe that's a candidate because it's available, but fixation might also be well covered already. The doc doesn't have the thing about unusual pressure or about the disagreement, and in fact that's the one most narrowly focused on the incident. Is there anybody in the room who wants us to get to that part? What do people think?
Sarah Butt:I think that one's gonna be the most interesting. Maybe this is just my personal interest, but Alex and I have been able to talk, and Eric and I have been able to talk. The perspectives we've not been able to hear yet, much like when you get to a retro and haven't heard some people's perspectives, are those of the people that Hamed is currently representing. And those were the folks that had very big opinions on this, to the point where Hamed was playing both sides, I think. So I'd love to hear more about that piece.
Hamed:So, the argument between Hamed and Tanya: should we do the failover or should we not? For me, that was the most interesting part because I was playing both of them. I was Hamed and I was Tanya, essentially. I was arguing with myself, but it felt so real. Each time I was Hamed, I was genuinely thinking and pushing for it. And when I was Tanya, it was almost like I chopped that off and brought in another personality, very passionately saying, no, this is risky, let's not do it. And I could think why Tanya would be against it. So from that perspective it was a very interesting, conflicting experience. Tanya being more of a hands-on person, she had worries that every time we do BCP exercises, it's mainly a checkbox to tick for compliance. Every so many months we have to do it, and we need to submit a report that we have done it and that it took this amount of time. And there was a desire in the business (and by the way, this is all coming from a real-life experience that I lived; I'm not gonna mention which of the companies I worked for it was) that, okay, we have to do this BCP exercise, a data center failover. It was called a flip, to report to regulators. But we don't want to disrupt the business, and we don't want it to go beyond a certain time. So there was a soft BCP exercise and a hard BCP exercise, and the soft one was very controlled. Tanya's worry was that everything they'd done so far was soft, and we don't know what's going to happen when one of the data centers is really not accessible. Can data move around or not? What's going to happen to the syncing of the data? That was one of her worries. The other worry was that out of the last three times, only one was done within the timeframes we were talking about. Hamed's perspective, being Tanya's manager and basically the one responsible for the execution of BCPs, was a little bit conflicted.
He runs this BCP so the business can rely on it. So he knew that Tanya might have a point, but it was also his responsibility to make sure that the business has a credible BCP plan. So he naturally overlooks the problems that we had in the past and tends more toward: now we have done it at least once in the timeframe we practiced. Even though the last time took a lot longer, we fixed the issues, so we should be able to do it. And covering all of that is: our vendor doesn't look trustworthy at the moment. I can't believe what it says. So those were the two arguments I was living in the moment.
Eric Dobbs:I love these perspectives. The piece that's jumping out at me, and I guess I'm just gonna have to ask the personas, is that in the pressure of an incident, Tanya feels safe challenging her boss. That suggests extraordinary psychological safety. I wonder if you can comment on that from either or both of their points of view.
Hamed:So, I can talk from experience: yes and no. It could be psychological safety, but it could also be that Tanya knows this is the path that is going to lead to failure, and down this path, I'm going to be in a lot of trouble.
Eric Dobbs:Uh, okay.
Courtney:Uh, it was ruinous to go.
Hamed:Speak.
Courtney:So she was willing to gamble though, because the risk to her of the other avenue felt like it would be worse for her outcome and her team's outcome. Is that accurate
Hamed:Yeah,
Courtney:or not? Gamble? Actually, she wanted to not gamble. You wanted to gamble. I had it the other way around. Yeah. So the,
Hamed:but Hamed was the
Courtney:yeah.
Hamed:who was going to do it. And if he failed, she would be under a huge amount of stress to work it out
Courtney:Yeah.
Hamed:and fix it. And again, she would probably be seen as responsible: oh, why did this happen, why did that happen, why is this correct? From her perspective, there's no way to win, so I might as well just say what I really think.
Courtney:Be safer. Yeah.
Sarah Butt:I think it's interesting to hear from both personas the responsibility they feel in this time of uncertainty, because I think back on it and I know that I felt responsible as well, as the person who, at the end of the day, was ultimately making the decision, for lack of a better word. And that's something I often feel. I'll reference Richard Cook and How Complex Systems Fail here, where he talks about how practitioner actions at the sharp end, in these novel and unknown situations, are always gambles. This is the thing that we miss in retros so often. It's really easy to look at this, because it ended up working out, and go, oh yeah, that's a great decision. It could have been a bad decision where we got lucky. It could have been a great decision where things went well. Whether it was a good decision or a bad decision, looked at in retrospect, there's really no way to tell. You do the best you can. I talk a lot about judgment being a timestamped decision, because you don't know what happens in the future. And that's where I think these retrospectives get so interesting, because we've got three people on this call functionally all saying, I feel responsible for the entire incident if this goes south. Every single one of us wanted to do the right thing, but we were always going to be in a gamble. And it's so easy, if you're just reading a document after the fact, or coming in after the fact, or not living in the place of having to make that decision, to not understand that. You really are making a gamble and hoping that you've made the right choice, and our practitioners carry that.
Courtney:Yeah. I think if there were to be a meta theme of all incidents, it's the principle of local rationality, right? First, everyone who's involved is trying to do the best that they can. No one's showing up to an incident trying to make it worse, even the executives, when it feels like they might be. And secondly, everyone is making the best possible decisions they can given the information they have at hand at the time. And the information you have at hand at the time is never all the information. And then the multi-party dilemma piece amplifies that significantly.
Eric Dobbs:So we have so much evidence about Sarah and Alex deftly hedging throughout the incident on the gambles that are there. That's extraordinary. And one of the other critical insights into why we bother learning from incidents is that, because of that sense of responsibility Sarah points to, where all of the people involved have an urgent sense of responsibility, the incident is where we get to see the lived prioritization of what the people in the room think is the most important thing for us to be dealing with, or the most important fear for us to be dealing with. One of the reasons incidents are such a powerful lens for understanding how the business really works is that it is literally where the rubber meets the road. In our business, we have all these ideas. We have policies and procedures, we have trainings, we have quarterly expectations and OKRs and everything else, but it's during the incident that people really make those prioritization decisions for the business. So inspecting them is a place to see what really was most important according to the people closest to keeping the business running. What is that thing?
Courtney:Alex has to make his own prioritization right now and escape us to go to another meeting.
Alex Elman:This was so much fun.
Courtney:Thanks, Alex.
Sarah Butt:Bye, Alex.
Alex Elman:Bye. Yeah.
Sarah Butt:Can I ask, uh, Eric a question?
Eric Dobbs:Yeah.
Sarah Butt:Eric, I'm curious. Let's say you're facilitating this retro. If we take out all of the, whoops, we lost power, and people lost internet connection and had migraines, and all of the things that we've had happen, we're still probably well over an hour in, and it feels like we're just barely scratching the surface. If you were facilitating this in, quote unquote, real life as a retro, what would you do here? Would you keep going? Are there multiple retro meetings? What's your advice to people who want to try to start some of this format and are trying to figure out how to time-box it with these really rich discussions?
Courtney:It is such a good
Eric Dobbs:I mean, that's incredibly difficult. How do you stop an interesting conversation? Richard Cook has a really beautiful metaphor he uses at the end of many of his talks. He says, as my psychiatry friends like to say, I see we are reaching the end of our time together. If I were a doctor, I could refer to my psychiatry friends. Maybe I could say, as my therapist would say, I see we're reaching the end of our time together. That's a graceful, entertaining way to just say, look, we have to draw the line somewhere. I know everybody's got a busy schedule, so thank you so much for your time, and what I hope you take away from our time together is that there's a ton of value to be gotten by having exactly this kind of conversation about the messy details of the incident. Because of the way I managed it, I'm sorry we didn't get to the action items, and I'm sure that's gonna irritate some of the people in the room. If we're in a real incident, I'll be happy to schedule another half an hour for us to talk about that, or we could possibly take it up asynchronously.
Courtney:or the folks who already know this stuff are gonna go do it anyways.
Eric Dobbs:This is the thing. You can't stop engineers from fixing stuff. The most important fixes that I've seen in the three or four years that I've been doing incident analysis happen before you even get to the retro.
Courtney:Yeah.
Eric Dobbs:As people are cleaning up the mess, they do the thing that is most urgent. And that doesn't even show up in your action items, 'cause it was already done by the time you got to the retro. So the lowest-hanging fruit, that's probably... hmm. Yeah, insert Lorin's law: some of that stuff you do in those minutes becomes the source of your next incident. Sorry, Courtney. Another thing to add to the
Courtney:is. It's going in the notebook. Here it is.
Eric Dobbs:For the leaders in the room who are worried that we haven't left with action items, rest assured that there will be action items. Your teams can't help themselves. Nevertheless, I'm happy to take on the invitation for another 30 minutes to talk about action items, if that's really important for folks.
Courtney:We're not gonna do that for this podcast. It's okay. But if I can be so indulged, I wanna ask the final question of Hamed. You as the executive of the company that this incident happened at, if you were given this report, what do you do with this? Or what do you expect to happen out of this from the team or teams that were involved?
Hamed:First of all, I'm really overwhelmed by this conversation. The amount of learning that I got out of this conversation, I just don't know how to react to it. With my engineering background on top of it, it's so hard for me to think what I would do as an executive. But I would suggest everyone read this report and watch this conversation. There was so much learning.
Courtney:Everyone at the company.
Hamed:In the company, me included.
Courtney:Okay.
Hamed:And I think it goes beyond engineering teams as well, because if you think about the learnings that I had from this conversation about saturation, how to be aware of it and how to deal with it, that doesn't just apply to tech. The learning about how to make executives useful in the stressful situation of an incident, and I think by being useful they will feel a lot more comfortable, that's again very, very rich. Yeah, I'm just thinking, is there any other way to learn so much about what is going on in an incident in such a short amount of time than what we did here? If I was going to watch this podcast, I can't think of any other way to learn so much in this short amount of time.
Courtney:I wanna put a pin in that too, because a lot of the pushback I hear at higher levels in organizations is like, oh, this incident analysis stuff, it's so time-consuming, it's so whatever. And for the actual analyst, I will argue yes, it is, but that's their job, right? If you have somebody who is a dedicated incident analyst, indeed they will be spending the vast majority of their time doing that, and that's what they should be doing. That is great and well worthwhile to the other people involved. Okay, we've cracked a lot of jokes about how long this could go on, but for the other people involved, you've got one, maybe two hours. We did a lot of meta talking and had a power outage, so maybe it's an hour and a half. How valuable is that 90 minutes to your company, right? What other activity, like you said, could you possibly do that would deliver that much value in 90 minutes of four engineers' time? To me, that math makes perfect sense, and I can't unsee it. But, you know, I'm not an executive at a large tech company. I just wanna underscore the point that you're making: there is no other way that is this effective and this efficient to have the kinds of insights that come out of this, insights that could dramatically change the way your teams manage incidents, handle them, prepare for them, all of that. There is nothing better.
Hamed:Yeah, exactly. So I will ask one last question of Eric.
Courtney:Okay.
Hamed:Eric, in analyzing this incident, did it make any difference that you went through this drill yourself? You lived it, you knew what happened. Did it have any bearing on how you approached the analysis?
Eric Dobbs:You know, I have one brain, so I can't completely separate the two. I can't rewind the universe and run the experiment to see what I'd do if I hadn't run the drill myself. There were enough differences between what I understood when I was doing it myself and what I saw Alex and Sarah doing that it was pretty easy to focus on their experience. Although I have access to the recordings of my own run, I haven't watched them. So the one place where I was trying to stay faithful to the idea that I don't really know what's going on, that I'm more on the outside, is that I really immersed myself in the data and evidence from Sarah and Alex's run of it. Now that we've gotten here, I can go back and look at my own and genuinely compare myself to the two of them. That'll be humbling.
Sarah Butt:I would say I think it's apples to oranges though, because you ran it solo, and running this thing solo is a completely different beast than running it as part of a team.
Courtney:but that's kind of exactly the point
Eric Dobbs:Yeah.
Courtney:is we always say no two incidents are the same. You will never have the same incident again, yada yada. But take the exact same set of technical details, procedural details, business details, and given to one person it's completely different than with two people. If you run this exact same incident with some other two people, you're gonna have a completely different set of outcomes, because every person brings their own individual set of skills and experiences to it. And that's the other piece that we tend to not see about incidents: we see them as this black box in which physics happens, and expect the same inputs and outputs and outcomes. But you're not gonna have the same people on call, you're not gonna have the same responders. So I think it also really highlights how, even when the details are so similar in your organization that you think, I've seen this before, you haven't seen exactly that before with the same people. So yeah, there's a whole other path to go down about that.
Eric Dobbs:Yeah.
Courtney:I will be curious, Eric, if you go back and look.
Sarah Butt:And I think it's interesting because that was actually done for Incident Fest, where we had three pairs of people run the identical drill. It was myself and Alex, and then two other pairs. That's probably a whole separate podcast or discussion, but how cool is it that we had the opportunity to do that? I've certainly never been able to watch peers that I really respect in the industry run the same thing that I saw. And I think the learning we can have from how we all handle it, the things we've picked up along the way, and the expertise we bring is amazing.
Courtney:And next thing you know, you're gonna be Uptime Labs on Twitch. That'd be great
Sarah Butt:That would be awesome.
Courtney:if you actually do that. I just like put me in a web credit somewhere that's always don't do it. Okay.
Hamed:hard. They considered it. They looked into it, but...
Courtney:I know some nerds who'd watch it, so, you know.
Sarah Butt:I would,
Courtney:Um.
Sarah Butt:would watch that all the time. I listen to, like, broadcast dispatch and stuff because I think it's so soothing, 'cause I just love hearing the details. I would do that all the time. I'm a huge nerd. Courtney is gonna need to erase that from the podcast. I'm a huge nerd.
Courtney:No, we're all huge nerds. And everyone who's listening to this is a giant nerd too. So we're all here together, thank goodness. All right. Thank you all. This is the weirdest and most wonderful podcast I've ever done, and I hope we get to convene on some of these topics again someday. For those listening, there's all kinds of stuff in the show notes; we've talked about a lot. So thank you all very much. I wish you all a happy Friday without power outages or migraines or any other incidents.