The VOID

Uptime Labs and the Multi-Party Dilemma (Part I)

Courtney Nash Season 2 Episode 2


In this episode I'm joined by a group of seasoned incident response professionals to discuss a simulated incident drill conducted on the Uptime Labs platform. The conversation centers on the Multi-Party Dilemma—the challenge of coordinating incident response across teams or organizations with different missions, contexts, or incentives.

Eric Dobbs, our incident analyst, joins to break down the drill and provide deep insights into the incident dynamics, team interactions, and what true incident analysis looks like when it's done well. Participants Alex Elman and Sarah Butt, who served as deputy and lead incident commanders respectively during the drill, recount their roles and experiences, highlighting realistic stress responses, decision-making, and coordination failures and successes. Hamed Silatani, CEO of Uptime Labs, provides context and insight into the behind-the-scenes work he and his team do as the other "characters" driving the narrative of the drill.

The episode uniquely showcases the value of structured incident analysis and the benefits of using drills to expose hidden assumptions and improve resilience in complex systems.

A few key highlights include:

  • How detailed incident analysis leads to an understanding of the context and rationale behind responders' actions, rather than identifying errors or assigning blame.
  • The real goal is to learn how the system and people actually function, not just fix a broken component.
  • Themes revealed by the analysis and subsequent discussion
    • Saturation and the value of trust in delegation (especially between Sarah and Alex).
    • The role of deep expertise and how it often makes work appear effortless.
    • Importance of recognizing the real work done during incidents—often messy and improvisational.


Greetings, fellow incident nerds, and welcome back to an unusual version of the Void Podcast. In some ways it'll be like all the other ones we've done: we will talk about an incident, an incident report, and the people that were involved in it. It will be different in two notable ways. First, the incident is actually a drill that is hosted on the Uptime Labs platform. You'll hear a little bit more about that as we get into it, but the short version is that it's a way for people to practice incident response, either individually or as a team. I really like it not just as a way to practice incident response and get better at it—I've found, watching some of the drills and talking to people about them, that it also helps reveal a lot of assumptions about how incident response is supposed to happen versus the messy realities of what actually happens. So that's part of what's unique. The second part is that we get to see what incident analysis really looks like on top of one of those drills. I've invited Eric Dobbs to come and do an analysis of the incident that Sarah and Alex run on the Uptime Labs platform. This is not something that most people get to do. Most people don't get to watch someone run an incident unless they're on an incident response team, and we definitely don't get to see what really high quality incident analysis looks like unless you're one of the lucky people who has actually worked with Eric or is at an organization that recognizes the value of this work. These whole shenanigans were my idea. It's not a sponsored thing for Uptime Labs; I just really wanted to peek behind the curtain of what world class incident response and incident analysis look like, and bring you along with me for that. So let's get into it.

Courtney:

Thank you all so much for joining me on this very unique episode of The Void Podcast. There's a bunch of us, so we're gonna go through intros really quick before we get into the meat of this. I would love to have you, Hamed, start off by introducing yourself first.

Hamed:

Thanks, Courtney. Thanks for having me. My name is Hamed Silatani, and I'm from Uptime Labs. For the purposes of this conversation I'm representing four different characters, which for an incident review is a bit odd. But the way it works behind the scenes on our platform is that there are multiple people playing different characters in the drill, so today I'm going to represent more than one person. We'll see how it goes.

Courtney:

We'll get a window into your multiple personalities. Looking forward to it. Okay, so next up, one of our willing participants. Alex, will you introduce yourself briefly and paint a picture of the scenario we got you all into with this one?

Alex Elman:

Thanks for having me on, Courtney. I'm Alex Elman. I've been serving in an incident response capacity for the past 14 years at various tech companies. The dynamic that we're gonna be discussing today in this incident, which I served as the deputy incident commander on, is something that I've seen, and that Sarah, who you'll hear from, has seen, in the numerous incidents we've been a part of: this dynamic called the multi-party dilemma. It shows up in incidents between parties who don't typically work with each other and might be guided by different respective missions. These dynamics showing up in incidents can result in the two parties working at cross purposes or having different agendas, and that has huge implications for resilience, which is why it's an important dynamic. It shows up in complex adaptive systems, so outside of software it can also show up in aviation, military, and medical domains, but it's specifically showing up here in a software incident.

Courtney:

We'll have Sarah introduce herself next and tell us a little bit more about the incident, and then we'll get into the meat of the matter with our analyst, Eric.

Sarah Butt:

All right. Yeah, absolutely. Hi everyone, my name is Sarah Butt, and I'm excited to be here. Like Alex, I have spent the majority of my career, almost 15 years, working for tech companies, specializing in sort of all things incidents and what that means for an organization. In this drill I served as the incident commander, and the scenario that I want to start to paint for you is the situation that Alex and I found ourselves in. We get dropped into the Uptime Labs environment, which is very realistic: you're essentially in a Slack workspace, and Alex and I had a voice bridge going as well. What we've been told is that we are part of an e-commerce company that has had some low sales. They've got this very important online event that is supposed to drive a lot of revenue—a 50% off sale—and we're sitting there watching the CEO sort of joke, we can't jinx it, all of these sorts of things. And suddenly we hear that people in the EU are just not able to load the website at all. So that's the place that we start the incident: quite frantic customer support folks, a very hot CEO, a lot of concern about what's going on, a lot of confusion, frankly a little bit of that first fifteen minutes of incident chaos. From there we work through various troubleshooting pathways, which I'm not going to get into because I don't want to steal Eric's thunder, and ultimately we are able to bring the incident to resolution.

Eric:

Thanks for inviting me. This is uncharted territory and incredibly fun for me to be participating in, so I'm so excited to be here. I have been working for the past four years as an incident analyst. I think there are ten people in the world doing this work in the software business—I might be exaggerating, but there are very few of us. It's a small club, and I steered my career in this direction on purpose. One of the challenges of being an incident analyst, and of the activity of learning from incidents, is that the work is all proprietary, behind the closed doors of corporate software companies, and the legal teams and the PR teams don't really want to tell their customers how the sausage actually gets made. In fact, all of our software companies are an extraordinarily tangled mess inside, for all of the beautiful UX that people interact with, however frustrating those interactions can be. What's on the inside is always more messy than we ever get to see. One of the fantastic things about this opportunity is that because it's a drill, nobody's company material is on the line. So we can maybe even show what the mess looks like and be able to talk more openly in the world about the kinds of things you can learn if you take a different approach when things break. That's the piece I'm super excited about, not least of which is that the content of this activity is the multi-party dilemma. That's close to my heart, because I think one of the most important features of the software business is that we adopted software as a service over the last decade. All of us have software running on somebody else's hardware, so we are all of us in a multi-party dilemma in every incident, whether we know it or not. So this material is really important to get out in public as well.

Courtney:

And if you ever want another job as podcast host, you're hired. Okay. Part of what is so exciting to me about this is the ability to look behind the curtain, right? Because in what I get to do, I never get to do that. I am, you know, the librarian on the outside collecting all of the reports, and so to be able to peek behind and see all of this is a rare treat. And the rarest of the rare treats is to see the kind of work an incident analyst does. As you said, it's a fairly small club, but not an exclusive one—others could do this, and we hope that more people and more companies do. But before we dive into the work you've already done on this, Eric, I'd love to hear a little bit about what you do to get ready to analyze an incident. What are the materials you collect, what are the things you're looking for, what are you pulling together? Talk to me a little bit about what that process looks like.

Eric:

I think that's gonna be really useful context, although I'm actually going to deviate from your specific question and first paint a cartoon of the typical incident in a software company. The typical scenario, the sort of dominant paradigm about incidents in a software company, is that they're bad and you want to avoid them and prevent them, and when they happen, it's usually because somebody screwed up. It can be very hard, depending on the company culture, to learn what really happened, because you have this bias—and this is culture wide, it's not even the software world alone—that when bad things happen, it's because somebody didn't do the thing they were supposed to do. That's the bubble we're trying to burst in the resilience engineering community and a bunch of sibling communities. I'll try not to do too much of that intro. So the conventional retro is to sort of sit around for an hour and talk quickly through what happened, and often the talking about what happened stops pretty short: as soon as we find the first person who made a mistake, we dig into what we should do next time, because we definitely want to not have this experience again. It was painful, we lost money, customers are mad at us. There are a lot of good reasons that we do it this way, but the core mechanic of a typical retro is to get to the code we're gonna fix so that this never happens again. And the subtext of the whole activity is that we're really trying to soothe the emotional stress that an organization feels, and not admit that we don't really have clear control over the systems we're running. What I'm trying to do differently as an analyst is to admit that the business we're running is a mess, that nobody's really got a clear view of it, and that the incident itself is actually a crack that lets me peek into how things really work.
Because most of the time we're too busy to look at how things are actually working, and the sense of urgency or the sense of regret gives us a moment of opportunity where the business is willing to invest, and we can do something different with that time. I spent a bunch of time before the interviews reviewing the Slack transcript and a recording of their drill, to look at the words they actually said, who they were talking to, and what kinds of questions they were asking, trying to get a picture in my own head of what was in their heads at the time they were taking the actions. Because at the end of the incident, I know what broke, and it can be really distracting to only look at what broke. We operate with that urgency, especially during an incident; we're choosing the stuff that feels most important in the moment, and I'm trying to get a sense of what that was. So then I have an interview with the participants to double check that what I read and what I think was most important actually was, and invariably the conversation with the people who were in the hot seat takes me down a different path than I thought I was going to be on. I'm looking at the evidence, and I learn all kinds of things about how the business really works. I learn all kinds of things about the deep expertise that everybody brings into an incident, and that expertise is gold—incredibly difficult to see until you actually do this kind of work. So I've got about an hour of video from the recording and some retro reflection that happened after the drill: the drill itself is about half an hour, there's a little bit of setup at the front, and there's some reflection afterwards, so it's about an hour of footage. And I have, I don't know, an hour and a half that Sarah and I spent, and an hour that Alex and I spent, talking through what happened, plus all of the Slack transcripts.
I then try to process all of that to figure out what is in common and what is in contrast between the testimony I got from Alex and Sarah about their respective views. The premise of doing that is that none of us can see the whole system. If I can get some detail about the different views, I learn more about the system, and if I as an analyst can synthesize as coherent a subset of all of that information as possible—the interesting bits of shared experience and the interesting bits of contrasting experience—then when we have the retro, we have a much richer discussion about what really went on during that incident. But also, and this is the critical bit, this is why it's a magnifying glass and not just a thing to avoid: I get to find out what normal work looks like and why this was different. And finding out what normal work looks like turns out to actually be where the real gold of the activity is.

Courtney:

Something we talk about with incident analysis is that you could be one of the best technical people in your field, and doing incident analysis would still be a very different set of skills, a very different lens. Anyone who is listening to this might be thinking, sheesh, that sounds like being a detective, or an anthropologist, or maybe a sociologist. And the answer is yes, it is all those things—that's why this is such a treat, and it's how this work differs from just going and talking to people, asking what happened, writing that down, and moving on. Okay, but let's get to the meat of the matter. You have all of this information gathered. You have a wealth of written and auditory and other data, and narratives, and people's perspectives. We're here now, you're running the retro. I'm gonna step back, let you take it in the direction you would normally take it, and try not to

Eric:

so,

Courtney:

too much. So

Eric:

Now we're pretending that, after I've done all this work, I've shared the document with the people who are participating. Everybody was too busy to read it, so we come into the meeting cold and I'm sharing that document as my slide deck to run the meeting. What I'd like to do is take just a couple of minutes to review our goals for the retro, and because the kind of retro I do is unusual in the software business, I'm gonna take a little extra time to frame how this is different from your typical retro, although I've already done some of that. This highlighted section is what I'm going to read to you, so apologies for reading text that you can read for yourself, but this part's important. Welcome, and thank you so much for being here. I know you all have busy schedules and could have chosen something else; let's do our best to reward everyone for having made that choice. The first step is to set explicit ground rules for this retrospective. The goal for this meeting is for each of us to discover new insights into the complexity of our continually changing business. It is of particular importance to suspend judgment about mistakes or errors, and instead look closely for why those actions made sense in the context of the incident. I am pausing for dramatic effect; that particular point is really important. So let us agree at the outset: not one of us has a complete or correct understanding of how our whole system works. When things break, or when we make mistakes, we have the opportunity to peek through those cracks and get a new glimpse into what's really going on. We gain insights that we can apply both individually and collectively to improve everything we do, not just the particular things that broke for this particular outage. There's so much more value to be mined from this experience than the conventional post-incident action items.
So the thing I'm gonna ask you to do that's hardest, because you're well conditioned from your other retros, is to not tell me how we could have fixed it or how we could have prevented it. We want to understand what happened first, before we get to how we fix it. The next step is to review the narrative of what happened. I'm eschewing the convention of giving you exact timelines, and I've abbreviated a little bit, but I have the major plot points in the correct order. What I really want to do, besides establishing the story arc that we followed through this incident, is to also validate, especially with the folks who were there, that I have not misrepresented my understanding of how our incident proceeded. There's an opportunity here for our collective understanding to be refined, and that's part of the point, so please don't feel like this is me lecturing you about what you went through; we're trying to have a conversation. The day began, as Sarah hinted in her intro, with high expectations and an anticipation of high traffic. There's a 50% off sale that was intended to boost sales numbers, which have been in a recent slump. Then we have an immediate partial failure that raises confusion.

Sarah Butt:

Reports are flooding in. Ah, okay. Here we go. Um, Bob gave me the exact user impact. Uh, Alex, can you go ahead and grab, uh, grab that incident management dashboard and just take a look?

Alex Elman:

Yeah. Doing that now.

Eric:

There's a significant impact, but it seems to be entirely within the EU, during US business hours, based on our observability tooling. In the process of troubleshooting and trying to understand what was going on, the responders were briefly fixated on a hypothesis that a recent deployment had triggered the incident. Many changes had been deployed in the system; the most recent had been five hours earlier.

Sarah Butt:

Alex, let's think. What, what can affect all services in a region like network, DNS, um, something like Akamai on the front end.

Alex Elman:

Well, we're in the online boutique and we're seeing an unbranded NGINX 503 page. So we're still getting network connectivity to the web server, but something maybe is affecting—like, there was a change that affected NGINX. I didn't see anything in the change log that jumped out at me; I'll mention that.

Sarah Butt:

Yeah. I'm kicking myself. I asked earlier, I said, I don't know where we are—are we on AWS or similar? I'm kicking myself for not knowing that.

Alex Elman:

They're saying the website's working again.

Sarah Butt:

Yeah, I think it's something at the DC level and it came back up. Alex, over in biz comms, if you haven't already, can you let them know that we seem to be currently out of impact? We're monitoring, we're gonna be working with the data center; there may have been a disruption at the infrastructure level. Uh, god dammit. It's down again.

Eric:

In hindsight, there's a clue here. I'm not going to dwell on it for a minute, because the little bit of dramatic tension in the narrative is useful. In one of the most jarring moments in the incident, we discovered that there was an interaction between our vendors and the symptoms in our data center. And in trying to contact our vendors, we learned that we had multiple support contracts. The way we learned this was by reaching for the first one we found and discovering there was a four-hour response time on that support contract.

Sarah Butt:

Alex, are you in a spot to reach out to your executive contacts at the DC vendor to escalate? We're not waiting four hours if we can fail out of there. And I didn't think of it. Freaking Lord, help me.

Eric:

Our site is hard down, and has been for about five or ten minutes at the point we learn this, and we cannot expect a response from the vendor for four hours. Needless to say, we were a little alarmed in that moment. And it turns out, happy news: we discovered not long after that we had a second support contract, for the production environment, with a faster turnaround, so we didn't end up having to wait the four hours. But there was definitely some immediate response to the news of four hours, and some immediate efforts in contingency planning. That line of work was cut short when we did make contact with the vendor who's running our data center. The next key insight in this journey is that there was an email notification that had gone to a spam folder.

Sarah Butt:

There's an email that just came in from the data center provider, so I guess we were on first party hardware, maybe.

Eric:

And, as I have come to learn, I think the only person who got that email was also our CTO. If others got it, maybe they didn't even know that it went to their spam folder. But we're, you know, 15 minutes into the incident when we discover this email that would've been really good to know about much earlier. It's in a spam folder, and it's gone to one recipient, a very busy recipient in our business. That's a contributing factor to the drama of our initial diagnosis. Once we understand that we're dealing with an air conditioning issue, which was the content of that email—our vendor is having air conditioning problems in the data center where our services are running—there's now a question in front of us about whether we can wait for the vendor to fix the air conditioning, or whether we should engage our business continuity plan and get into a different data center. So that was one of the important pieces in the story arc.

Sarah Butt:

Danielle, can the DC vendor give us an ETA? Is there anything else they can do to cool the room that they are not already doing? Are we the only tenant there? I need answers to these questions. I want to understand, because Hamed and Tanya seem to have different levels of risk tolerance—Tanya is the platform lead, Hamed is on the customer support side—and that makes me nervous, but let's get those questions answered. Alex, sorry, I put you in a buffer. What was that about the BCP?

Alex Elman:

You're responsible for coordinating the flip and communication with stakeholders. Platform network engineers handle the DNS updates and rerouting. So the first thing you need to do is notify stakeholders: the IT teams, the managers, and customers. Then we need to stop non-critical services in the data center—they only have to sync critical retail data—then reconfigure the IP address in the DNS console, activate the application systems in the new DC, and then validate that transactions and orders are working.

Sarah Butt:

I'm just trying to make a risk assessment here. Alex, let's you and I take a minute and talk through this. Okay. The DC has put us on an extra 10 minutes, so we're down for 25 minutes. We're functionally dead in the water right now, so it's basically a zero-sum game. That being said, it sounds like we've had intermittent inability to fail the DC over in scheduled BCP runs, which we know always go better than real BCP runs. I haven't gotten an answer. Oh, we just got it. Okay. We can't shut it down partially, so we're sort of hosed there. Alex, what's your thought?

Alex Elman:

I mean, considering the state of the site right now, I think it's prudent to just fail over. It's already been well over 15 minutes.

Sarah Butt:

Okay.

Eric:

In the end, we did engage the business continuity plan. In this case it went smoothly, contrary to the worries of the folks who were against engaging it. We finished in that spot with some discomfort, because our business continuity plan has demonstrated some inconsistency in the past, and we don't want to just say the business continuity plan saved us, move on, and pursue other things.

Courtney:

Sarah and Alex, going into this, how real does all of this feel from your perspective? Not Eric's retelling in the retro, but does hearing this again evoke what you felt like during that incident, and what was going through your brain and your body at that time?

Sarah Butt:

Yeah, I mean, I think it feels incredibly realistic. It feels realistic enough—and Eric can attest to this—that, as someone who gets drill anxiety but also gets this adrenaline rush of really loving incidents, I came off the incident sort of convinced that I'd run a real one. I did like an hour-and-a-half recording for Eric that night, a debrief of my own performance as IC, just to get it out of my head, because even though my brain knew that this was a drill, my body still felt like I had gone through an incident. With my employer, we have this great debrief process that we use to help our incident commanders process through and regulate down a little bit, so you don't go and stare at a wall at night and not sleep, if you're of the high-strung variety like myself. Because I didn't do that and it felt so real, I had to mimic it, which gave Eric a bit more material. But yeah, I think it felt very real. Even now I look at the narrative and there is that moment where I want to jump in and say something. So I think it feels incredibly realistic.

Alex Elman:

For me, it didn't necessarily start out that way, because I was in a Slack workspace I wasn't familiar with. I wasn't familiar with the infrastructure, I wasn't familiar with things like the checkout service. But almost immediately, as the incident really took shape, it evoked for me the same experience as being at a company and having to respond to a dark corner of the system that you're not familiar with, that has always existed but has maybe not had problems. It had very much that same sort of character to it: I have to quickly get up to speed on this, I had to examine this long list of changes that I didn't make, I have to coordinate with multiple people. All of that felt very familiar and is the character that most of the incidents I'm part of take on. You can generalize that across experiences, and that's what made it feel very real for me.

Eric:

And that's definitely a key finding. There's so much available material that we're not going to get to in this short retro, but I'm glad that surfaced. Where I would like to go next—I guess before I dip into themes, let me just double check that I have paraphrased the narrative sufficiently for our conversation. Do you think there are important pieces of your experience in the incident that I have missed, or emphasized one more than another, or anything along those lines? I'll just open it to the floor. Anybody got stuff to add to the story I've told so far?

Alex Elman:

One of the unfortunate downsides is that all of the analysis work you did, Eric, is hidden in the fluency of just having this narrative, this deep understanding of what happened. What gets lost in that is that it sounds like it was so obvious or easy when you describe it—like, oh, and then they just realized it was an issue on the vendor side. And maybe my memory of this is not quite right, but I remember we were on a couple of different threads, looking into the change log, trying to figure out if maybe there was a deploy that seemed kind of suspect. I don't know if it was the email from the vendor that we found in the spam folder, or somebody else recognizing it, but recognizing that the vendor might have an issue in that data center completely pulled us off of the threads we were on and into this new direction. And it was not obvious, at least to me at the time, from the information that I had. Sarah, how do you remember it?

Sarah Butt:

Yeah, I actually wanna jump in. It's interesting that you and I picked up different sides of the same thing, 'cause I worried that in my retrospectives and such I had potentially biased the narrative, because I was very frustrated with my own perceived fixation on the change piece. But I do think it's important to call out that we were actually running multiple investigations in parallel at that point. I think the first thing that we asked about was the big infra things—I think the ask was network, load balancers, and something else. We had Daniel off looking, and I know there was a point where he came back relatively quickly and said the load balancers are fine. Alex and I were talking about the fact that it's an unbranded NGINX 500 page, so network connectivity, at least to a certain point, is fine. So I don't know that we jumped immediately to change. There was a lot of discussion pretty quickly about change, and that discussion did last for a little while—I wish I had pulled us out of that a little sooner—but we did have multiple swim lanes in parallel on the infra side as well. And I think Alex's memory of how we pivoted is actually pretty correct, and I think several people started to realize what happened at once. So we got the email that Tinus found—Hamed, is it Tinus? Am I right on my CTO's name? Perfect. What happened in my brain is that I snapped back to two prior incidents I've seen with HVAC failures in a data center where we had to fail out, and a drill that I've run where we did this. So for me, as soon as that came in—if I'm remembering it right, we were sort of mid-conversation, or I was asking questions or something—I actually told Alex, I'm gonna have to hard pivot the bridge right now.

And as that was happening, for someone else—I think it was Tanya—the light bulb went on at the same time. So I feel like as soon as we knew that that email existed, several of us very quickly got to the point of: we didn't necessarily know that the data center was restarting the server—and I still have a lot of questions about the mechanics of whether they were actually restarting the server—but we knew that there was an infra-level equipment failure. We're going to have to get out of there, or we're going to be at the mercy of them getting the HVAC back online. In my head I'm also thinking through all of the repercussions that can come from sudden heat failures, such as blown equipment and ways that heat doesn't dissipate, and all of this. I think that was the moment that we pivoted the bridge pretty quickly, but we did have infra-level swim lanes going prior to that point.

Eric:

I'm glad that you all have included those. I absolutely have evidence for all of that that I can bring forward into the document, and in my haste to try to get to the major plot points, I missed some. So thank you very much; I'm delighted with your contributions to our improved narrative. Hamed, are there any things that you would like to add to what Alex or Sarah have said before I move on from the narrative? From any of your many roles.

Hamed:

So I'm gonna wear my Bob, the customer service, hat now. I think what's worth mentioning is that as a business, we were completely caught off guard. My team started receiving a lot of calls and complaints from customers, but I hadn't initially heard anything internally about something going on. Which kind of threw me off, because instead of focusing on putting up a status page and managing customer calls, I was just trying to figure things out, pressuring Sarah and Alex: what is happening? What do I tell customers? Do we really have an issue, or do we not? For me, that was a difficult part in the beginning.

Eric:

Am I paraphrasing this well? As I typed this into the doc: you were unable to focus on your primary communications with customers because you were so busy trying to understand what was happening. You didn't even have a status you could tell them.

Hamed:

yeah.

Hamed:

Yeah. Normally what happens is an incident is raised through an internal mechanism, either alerting or someone noticing it. This one was a deluge of customer calls: what happened? What is going on? What's happening to my order? I didn't have anything internally that looked wrong immediately.

Sarah Butt:

I think, to say the quiet part out loud, what Bob's trying to express is a little bit of emotion. I'm not going to say if it's confusion or frustration or what, but some emotion, which I shared as well, of surprise that we didn't have a monitoring alert or anything. Because I think we did see, and Alex knows better, he was in the Grafana, I wasn't in the dashboards, but I think we did see the volume drop off. So what I'm hearing, and I agree, was that there was surprise that this was not detected internally, and that created a little bit more of the initial crush and saturation that we had on the biz comms side, because we were trying to give information and get information via biz comms and customers, a lot of which was being done by Alex. Versus us knowing and having, you know: we have an issue, we're standing everything up, we're already looking at it, and just pushing to customers. I think that bidirectional communication made for a pretty, like, Bob was pretty crushed at one point, because we're telling him, do a status page, and also, we need to know the customer experiences.

Eric:

I need to take a moment to go meta in our conversation, step out of the retro, and make an observation about the retro we're having, having sat through a lot of retros. One of the experiences for me as an analyst in this is that I have never analyzed an incident where my participants already knew so much about resilience engineering. There are things that Alex and Sarah are bringing forward that I need a moment for the audience to recognize as deep insights, and extremely unusual ones. These two are rock stars as incident commanders, with a lot of experience in the complexity of being in incidents, and they are speaking about this with, there's no better word for it, the fluency of expertise. In the resilience engineering space, we have a law for it; I'm gonna skip some of those details. But when Alex was first challenging what I've left in the narrative, his framing for it was: one of the things about having so much evidence is that it's easy to not see blah, blah, blah. He is pointing at my expertise as an analyst, and that I made the synopsis look easy and obvious. That is a perfect paraphrase of the nature of expertise, and most folks don't know this, so it bears going meta on the conversation to draw attention to it. Alex understands a thing about expertise: experts make it look easy. So with my expertise as an analyst, I make it look easy, like you can just go sift through a bunch of incidents, find the key plot points, and summarize 'em in a couple of sentences. Turns out that ain't easy. Alex knows it. But the other thing about that skill, that property of expertise, is that it hides stuff from you. The nature of an expert's work means you don't see how hard they're working, or how hard it would be for somebody who didn't have their skills. Now, the second key idea and fluency that's shown up here is that both of them stepped in with a story about saturation, which is one of the themes in our incident.

This will become a segue back into the retro. Understanding that the participants in an incident get overloaded by the flood of information and the confusion is a very high-level awareness about incident response. It is really easy and really common for folks to be in overload, and in a retro to totally not know that that's a thing worth talking about. But Hamed shows up with Bob, and in fact the non-player character of Bob is already savvy. Because when he shows up with, hey, I'm overloaded by customer support, he also says, and I'm not going to be as responsive to you as you're used to me being. That kind of signaling, knowing to say so when you're overloaded during an incident, is deep skill in the coordination. I can't emphasize this enough: these are unusual skills from an elite group of folks involved in incident response, and I didn't wanna let their expertise pass without commentary. What I'd like to do now, unless there's anything further, is step out of the meta, go back to the retro, and start talking about themes.

Courtney:

The only meta thing I'll add to that is: never have I recorded a podcast where I just sit here and say nothing, because everybody answers all the questions I would have asked. So just, like, keep going.

Eric:

I'll dip into the themes a bit. I pulled four themes out. Because I know you're not used to being in a retro like I run, and this is an unfamiliar format, I'll take a moment to explain what a theme is and why we care about it. Maybe the fastest way to explain it is: a theme is something I heard from more than one person in the conversation, or a pattern I saw in my own investigation of what happened, in the Slack transcript before I interviewed people, or in the discussions I had with the participants independently. These patterns showed up. For example, the sense of feeling overloaded, of feeling saturated. I asked Alex the same question you did, Courtney, about whether this felt real, and he started that conversation by saying, yeah, it felt real as soon as I was saturated. Because Alex and I know each other, we didn't have to explain what saturation is, and we were able to move through that conversation quickly. That experience of saturation was firsthand; he knew he was saturated. That's part of what made it realistic. Sarah, on several occasions, in fact, I asked Sarah about this particular moment in the incident when she was successfully managing four or five different threads of action by four or five different people along parallel tracks for troubleshooting. I was like, if that was me, I'd have been overloaded in that moment. How was that for you? And she was like, no, business as usual, I do that all the time. I mean, not quite as blasé as I just put it, but in all seriousness, Sarah was not saturated by stuff that absolutely would've buried me.

Sarah Butt:

I think the other thing we have to call out here, and I mean, 'cause you are exceptionally kind, Eric, is that I had a luxury that not every company has. I had a deputy, and not only did I have a deputy, I had an incredible deputy. While Alex and I have never run an incident together, we have done a lot of other work together. We've written papers together, we've traveled together, we've worked together for a very long time. And that allowed two things with managing saturation. One is, you see in the beginning that I throw a lot of things at Alex. I'm like, hey, I need the Grafana checked, I need you to interface with biz comms, put Bez in a box, he is making an entire mess of this whole thing and he is terrifying my engineers, get him out. And I knew, because I trust Alex and how Alex has experienced incidents, that Alex would load shed as appropriate, or deprioritize; that he would take the patterns of managing saturation that we see in resilience engineering, and I trusted him to apply them. So I didn't feel like I had to carefully manage his workload as much as I did some of the NPCs, because I knew of his experience. There's something to be said about that. The other thing the relationship I have with Alex allowed is that I could be very blunt with him, particularly on the voice bridge. You hear me sort of fire things off at him: get me a CAN, do this, put Bez in a box. There's not a lot of fluffy language around it, like, would you please go put Bez in a box, or, you're going to need to do this. Because Alex and I operate on this common ground that enables us to just communicate in a way that, had we not had that combined experience, I would've faced a lot more saturation, because I wouldn't have been able to really efficiently offload.

Alex Elman:

Yeah, Sarah knows what I know, and I know what Sarah knows. There are certain things that she doesn't have to explain to me. Like, can you go work on the CAN report? She doesn't have to explain a CAN report to me. And I was dealing with Bez in the business comms channel. He was quite pushy, quite noisy, kind of coming down on Sarah, and I was managing that, but I didn't know if any of it was leaking through on the other channel, because I didn't have time to context switch.

Sarah Butt:

Had zero idea. Zero idea. I thought he was totally calm.

Alex Elman:

And when you mentioned that during the incident, that's when I realized: oh, maybe what I'm doing here, in putting Bez in the box, is effective. I just didn't know it at the time, because I was so saturated.

Courtney:

So we're having a bit of an incident of our own right now, in that Eric has disappeared out of the channel, out of the recording tool that we are currently using. I noticed this because he was sharing a screen with us to show the document that he uses to run the retro, and the screen disappeared, and then I see all four of us. And I'm looking while you're talking, and I'm like, uh-huh, he's gone. He is not in the list of participants. Eric is gone, y'all. And I don't know what to do.

Sarah Butt:

Do we have a BCP for, uh, facilitating a retro?

Courtney:

I am quite certain that someone has actually gotten called into another incident while conducting a retro. There is a very non-zero chance that has happened to somebody.

Sarah Butt:

Oh yeah.

Courtney:

But so I'm now trying to incident command this. I'm looking in Slack, but I have nothing from him, and if he lost his internet, then he's also not going to be able to tell me via Slack.

Sarah Butt:

Alex, can you text him? I don't know if I have his number. I might.

Courtney:

I don't know if I do.

Sarah Butt:

Do you have his? I would say, I'm pretty sure of all the people, Alex is gonna be the one.

Courtney:

okay. I am gonna hit stop really quick.

So it turned out that Eric had a power outage in Boulder, Colorado, where he lives, and lost his internet and all of that. And we were already pushing, looking at the timeline, something like 45 minutes, a full episode of the podcast, and we hadn't even scratched the surface. Thankfully, I guess, it was an omen, because right after Eric's power went out, I got a migraine. Unbeknownst to me, during the recording Sarah was also getting a migraine. So I don't know if it's, like, incident PTSD or what the hell was going on, but we decided to break, and we have a part two coming for you. You'll get to pick up where we left off. See you in the next one.