The VOID
The VOID makes public software-related incident reports available to everyone, raising awareness and increasing understanding of software-based failures in order to make the internet a more resilient and safe place. This podcast is an insider's look at software-related incident reports. Each episode, we pull an incident report from the VOID (https://www.thevoid.community/), and invite the author(s) on to discuss their experience both with the incident itself and the process of analyzing and writing it up for others to learn from.
Episode 6: Laura Nolan and Control Pain
In the second episode of the VOID podcast, Courtney Wang, an SRE at Reddit, said that he was inspired to start writing more in-depth narrative incident reports after reading the write-up of the Slack January 4th, 2021 outage. That incident report, along with many other excellent ones, was penned by Laura Nolan and I've been trying to get her on this podcast since I started it.
So, this is a very exciting episode for me. And for you all, it's going to be a bit different because instead of just discussing a single incident that Laura has written about, we get to lean on and learn from her accumulated knowledge doing this for quite a few organizations. And she's come with opinions.
A fun fact about this episode, I was going to title it "Laura Nolan and Control Plane Incidents," but the automated transcription service that I use, which is typically pretty spot on (thanks, Descript!), kept changing "plane" to "pain" and well, you're about to find out just how ironic that actually is...
We discussed:
- A set of incidents she's been involved with that featured some form of control plane or automation as a contributing factor to the incident.
- What we can learn from fields of study like Resilience Engineering, such as the notion of Joint Cognitive Systems
- Other notable incidents that have similar factors
- Ways that we can better factor in human-computer collaboration in tooling to help make our lives easier when it comes to handling incidents
References:
- Slack's outage on January 4th, 2021
- A Terrible, Horrible, No-Good, Very Bad Day at Slack
- Google's "satpocalypse"
- Meta (Facebook) outage
- Reddit Pi Day outage
- Ironies of Automation (Lisanne Bainbridge)
Courtney Nash:In the second episode of the VOID podcast, Courtney Wang, an SRE at Reddit, said that he was inspired to start writing more in-depth narrative incident reports after reading the write-up of the Slack January 4th, 2021 outage. That incident report, along with many other excellent ones, was penned by Laura Nolan, and I've been trying to get her on this podcast since I started it. So, this is a very exciting episode for me. And for you all, it's going to be a bit different, because instead of just discussing a single incident that Laura has written about, we get to lean on and learn from her accumulated knowledge doing this for quite a few organizations. And she's come with opinions. A fun fact about this episode: I was going to title it "Laura Nolan and Control Plane Incidents," but the automated transcription service that I use, which is typically pretty spot on (thanks, Descript!), kept changing "plane" to "pain" and, well, you're about to find out just how ironic that actually is. I'm thrilled to introduce you all to, if you don't already know her, Laura Nolan. Laura, please tell us a bit about yourself.
Laura Nolan:Yeah, so I'm a software engineer and I guess a site reliability engineer as well. I have been doing computery stuff for quite a long time now. To this audience I might be best known for writing engineering blog posts about various incidents at Slack; I did a few of those during the three and a bit years that I was at Slack. Before that I was at Google, where I contributed to the Site Reliability Engineering book. And I'm currently at a very, very tiny startup. We're called Stanza and we're trying to build tools to help people run their systems more stably and with less stress. I'm on the USENIX board of directors, I contribute to ;login:, and I am one of the organizers of the SREcon conference. I'm on the steering committee for that. And I'll give you a fun fact. I campaigned for an international ban treaty against lethal autonomous weapons, or killer robots. So
Courtney Nash:Oh.
Laura Nolan:keeps me busy.
Courtney Nash:That's a whole nother podcast. Killer robots. So, okay. Since you've been involved in a few of these at a few different places, you've built up not only quite a body of work, but also some opinions, per se, about how these systems work and how they fail. And so we have a theme today and a set of incidents that we're going to talk about: control plane incidents, automation, the role that automation plays, the way we're building these systems, and how that can lead to some fairly, I believe the word you used earlier when we were talking about this was, gnarly incidents.
Laura Nolan:My thesis is that we have kind of everyday incidents. So your minor things where you burn a bit of SLO but you can pretty quickly figure out what it is, fix it, and go on about your business. Kind of small accidents. If we were looking at this from a classical safety science viewpoint, these are your slips and your trips and your minor things, right? And then you've got the big incidents, the ones that take you down for hours on end, the really big ones, the ones that make the front of the New York Times. These are a different category of beast, right? In safety science, if you look at the distribution of incidents, there's this classical idea that you have this pyramid of incidents, and at the top you have the unlikely, improbable things that hardly ever happen. Then you go down the pyramid and you get to the more likely and more sort of everyday kinds of things. You cannot just say, if we can reduce the number of everyday incidents, we reduce the risk of the big things, because they're happening in different distributions. They're not the top of one pyramid. They're entirely different things, so you cannot estimate the risk of these big things happening on the basis of whether or not small things are happening. And this sort of invalidates some of the assumptions that SLOs are based on a little bit. I mean, they're still useful, but they can't necessarily stop you having these big, gnarly outages.

So what are these big, gnarly outages? A lot of them, I think, stem from a control plane or some powerful piece of automation going wrong. Our job has changed completely as software operators, as, you know, software infrastructure people. Our job has changed hugely in the past decade to 15 years. And the way it's changed is that we stopped running our infrastructure directly and we started building autonomous systems, control planes, that run our infrastructure. And they either do it completely by themselves, or they do it based on us telling them what to do and their interpretations of that. And this is a different thing, and I think a lot of people listening to this may have heard of or read Lisanne Bainbridge's very, very, very influential paper called "Ironies of Automation." And Bainbridge's argument is that when you go down the path of automation, you fundamentally change the work. So instead of me as a computer system operator 10 years ago running my Ansible or doing whatever, you know, kind of clicking around in an AWS console by hand, I now have Kubernetes and other sorts of powerful automation that are doing things. And I need to not only understand what I want to achieve, what I want to do and what's happening, but also what is the automation doing? I have to now act as a team with, with pieces of automation. And Bainbridge says that this is a much harder thing to do, because instead of understanding just the system that you're trying to run and influence, you have to understand the automation as well. So you've got huge amounts of extra complexity.

So my question is, does this lead us to have incidents that are related to that? And I think that we do. I think we do see a pattern of that. And some examples that I can give that we can talk through: Slack had an autoscaling incident on, just on the first day of work, January 4th, 2021.
Courtney Nash:When, when everybody was, you know, just hanging around. I mean, it was an epic timing moment too, right? Everyone was home. People who had never worked from home. The whole world was working from home. And we all came back.
Laura Nolan:Exactly. Yeah. And there was a very big control plane aspect to that, because autoscaling had a large part to play in that outage. It wasn't the only part of it, but it had a large part to play in it. And then there was one where I wrote a blog post called "A Terrible, Horrible, No-Good, Very Bad Day at Slack," and this was about May 12th, 2020. This had a big control plane aspect as well. This basically was an issue that we had with host lists that were distributed to our load balancing tier. And this went badly wrong and we had a very nasty outage as a result of that. Another one that I've seen directly was at Google. There was a tooling issue that involved basically Google's entire CDN getting deleted within the space of several seconds. You can't delete all that data physically in that time, but you can certainly throw away your logical encryption keys, which is what happened. Then, if we look at recent examples in industry, there was the Reddit Pi Day, 14th of March, incident. They messed up some of their networking control planes as part of a Kubernetes upgrade. And then there was the big Meta outage that we had in October of 2021. That was a networking control plane: as they were doing an operation on their networking backbone, they drained the whole thing by accident. So we have lots of examples of this.
Courtney Nash:So many good examples. And I love that you brought up Bainbridge and that work; I will put links to all the things that we're talking about in here. Because part of what she talks about in that paper, in terms of the way the work changes, is that now we've got automation doing many things and we're not involved in the guts of that as it's happening, per se. When that stops working, you as the human operator probably don't have all the context. You may or may not have even been paying attention to what was happening with that. So you suddenly appear on the scene, as it were, right, and have to acquire all of this context. And I'm curious what you could potentially comment on that, as someone who's been involved in these types of control plane automation incidents. Would you consider that one of her ironies a part of what that experience was like, as somebody responding to, and then writing these up after the fact?
Laura Nolan:Yeah. What I will say is, I think in all these cases I'm talking about, we had a case where automation was not a great team player. This is sort of a theme of, again, safety science. There's a whole field called joint cognitive systems engineering, and there's this idea that automation can be a very bad team player, because it can be very hard to direct. It can be hard to tell it what to do, particularly if you want it to do something, you know, a little bit more complex. It can be hard to know what it has recently done and what it is going to do and why. Very often we build these very opaque systems. We're humble software engineers. We sit down and we go, okay, I want a tool that does this thing. And we're not thinking about, well, how is this tool going to be a good team player to the person who's gonna have my job in three years when I've left this company, and, you know, somebody else who doesn't understand this automation end to end is gonna have to work with it.
Courtney Nash:Not only are we not generally thinking about automation as a team player, or as part of a joint cognitive system, but from what I've seen the prevailing attitude or opinion has been more of a replacement theory, right, than an augmentation or a team theory. So it's like, well, we're gonna get all this automation to just do the work for us, and then we don't have to worry about it. So it's even worse, right? I think what you're saying is, if you were to show up at a physical accident and there was somebody there, you could go to them and say, what happened? And they would say, well, this car drove down the road and it ran into this other car... You can't go to your automation and say, what happened, in the same way? Yeah. I mean, is that partly what you're saying?
Laura Nolan:I think so. Yeah. And even if the automation will give you some sort of logs or some sort of information about what it has done, it's very often hard to find. Like, you might need to go digging in your ELK cluster, or, we rarely build a good status screen for these things. You know, it's just stuff is happening and you have to kind of go and dig to find out what happened. And it can be very, very simple automation. It doesn't actually need to be a huge code base or very complicated. So I think a great example of what I'm talking about here, in terms of automation not being a great team player, was the Slack outage on January 4th, 2021. So we had a big networking problem, which was related to part of our network infrastructure becoming very, very saturated. But we made it worse.
Courtney Nash:Oh no.
Laura Nolan:Or our control planes made it worse. So basically, because the network was bad, a bunch of instances got marked unhealthy by automation, because they were unreachable because of the network problem. They were perfectly healthy, just unreachable. So what happened then was their CPU utilization dropped. And because their CPU utilization dropped, it triggered some autoscaling rules and the system got scaled down. So a bunch of instances got terminated. This is a great example of automation being a terrible team player, because no human being, knowing that Slack was having an enormous incident and a big problem with the network, would downscale the main web tier, right? It's a ludicrous thing to do, but the autoscaling rules don't have any of that context, and this is very characteristic of automation, right? We build a piece of automation to do one very well defined thing, which is great during normal operations. You wanna downscale the instances that you're not using. But in an incident, it doesn't have the context that the human responders have, and it can do downright stupid things, such as downscaling your web tier.

But it gets worse. There's a really fantastic book, quite recently out on O'Reilly, about building PID controllers. A PID controller is basically used to try and preserve a system's homeostasis, say a level of CPU utilization or something, in the case of autoscaling. It says, "Do not try and use two PID controllers on the same thing." And they are right, because we were doing that at Slack: we had a second autoscaling thing that was based on utilization of Apache worker threads. So what happened was, at the same time as the network went bad and the CPU utilization dropped, the worker thread utilization went way up, because the network was slow. So the other autoscaling rule said, "Okay, let's scale everything way up." So you've got one piece of automation scaling you down and another piece of automation scaling you up. Nothing is communicating with anything.

It certainly is possible to go into the console and turn these things off, but that's an extra thing that you have to do during an incident. You have to remember that you have these scaling rules, while you've got a whole bunch of other things going on, and you have to say, "Okay, well this potentially will cause a problem," and yada yada. Turning off autoscaling is actually something that we did sometimes do during incidents after this, just anticipating that there can be negative impacts of it. But there's no easy way to say, "Hey, don't autoscale if we're having a serious incident." The automation is not built to coordinate in that sort of way. It all has to be done by the human being. The human being has to adapt to the rather stupid automation rather than anything else. So you just have all this complexity and it's very, very difficult to anticipate how it's going to influence your current situation. Because it got even worse: the scale-up was too large for our provisioning service to handle.
Courtney Nash:Hmm.
Laura Nolan:And it basically wasn't able to keep up with the number of instances that we were trying to spin up, especially given that the network was still recovering. And so, yeah, it all went bad. So while automation wasn't the trigger for this incident, it made it worse, and, you know, that's something that we see as well. There was a really interesting Reddit writeup a few years back where they had a nasty outage because they were doing an upgrade on their service discovery system, I think. And what they had done was they had turned off a bunch of the automation that depended on that, so that it wouldn't, say, read from an empty service discovery cluster as the cluster was coming back up. But they forgot that they had systemd or some other thing that was configured to restart this automation that they had turned off for the duration of their operations. And it came back up and it read from the empty database and it broke everything. It's just another example of the lack of context sensitivity that we have with these very powerful control plane tools that we build.
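To make the dual-autoscaler pattern Laura describes concrete, here is a minimal, hypothetical sketch of two independent scaling rules watching different signals with no shared context and no awareness that an incident is underway. The metric names, thresholds, and scaling amounts are invented for illustration; this is not Slack's actual configuration.

```python
# A hypothetical sketch of two independent scaling rules with no shared context.
# All names and thresholds are invented for illustration.

from dataclasses import dataclass


@dataclass
class Metrics:
    cpu_utilization: float       # fraction of CPU in use, 0.0 - 1.0
    worker_utilization: float    # fraction of worker threads busy, 0.0 - 1.0


def cpu_scaling_rule(metrics: Metrics, current_instances: int) -> int:
    """Scale *down* when CPU looks idle -- which is exactly what happens
    when instances are healthy but unreachable over a broken network."""
    if metrics.cpu_utilization < 0.25:
        return -max(1, current_instances // 4)   # shed a quarter of the fleet
    if metrics.cpu_utilization > 0.75:
        return +max(1, current_instances // 4)
    return 0


def worker_thread_scaling_rule(metrics: Metrics, current_instances: int) -> int:
    """Scale *up* when worker threads are saturated -- which is also exactly
    what happens when the network is slow and requests back up."""
    if metrics.worker_utilization > 0.80:
        return +max(1, current_instances // 2)   # aggressively add capacity
    return 0


if __name__ == "__main__":
    # During the kind of network problem described above, both conditions can
    # be true at once: CPU drops (unreachable hosts) while worker threads
    # saturate (slow requests). The two rules then issue contradictory
    # instructions, and nothing reconciles them or knows an incident is on.
    incident_metrics = Metrics(cpu_utilization=0.10, worker_utilization=0.95)
    fleet_size = 1000
    print("cpu rule says:   ", cpu_scaling_rule(incident_metrics, fleet_size))
    print("worker rule says:", worker_thread_scaling_rule(incident_metrics, fleet_size))
```

Run against those incident-like metrics, one rule asks to remove 250 instances while the other asks to add 500, which is the "scaling you down and scaling you up at the same time" behavior described above.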
Courtney Nash:We talk about this all the time, where one person who may have been designing that piece doesn't necessarily know how all the other pieces interact or can have these kinds of effects on each other. And we've seen talk of this in other kinds of language, around sort of metastable failures, or maybe what some people often refer to as cascading kinds of failures. Do you consider these types of automation failures a subcategory of those, or its own, its own nemesis? I'm curious what you think about that.
Laura Nolan:I think it's its own thing. So I think that metastable failures are interesting, and there's been a lot of interest in them because they're another contributor to these big, gnarly, multi-hour outages that we see. They're very hard to recover from. And automation and control plane type accidents can also be gnarly and hard to recover from. And sometimes you can wind up with a cascading or metastable element as part of this. I think we saw that in the recent Reddit outage that they had, where, as a result of their Pi Day outage, I think they had a metastable failure as they were trying to come back up, because of the traffic trying to pour in and having that empty state.
Courtney Nash:The reason they seem similar to me, I guess, or at least maybe they share some characteristics. My dog is stomping at my door. Hold on.
Laura Nolan:Dogs are also terrible team players,
Courtney Nash:[In the distance to her dog]: You're a terrible team player. Come sit on your favorite chair.
Laura Nolan:But very cute.
Courtney Nash:Your kitties are much less disruptive. So what I understand from the types of metastable failures, like the Reddit one and some other ones, is the way to recover from that is some human intervening, and ultimately you have to shed some load or bring that system back down to a different level. Maybe it's more that a metastable failure is an adjacent phenomenon to these kinds of automation failures.
Laura Nolan:There's something there, because when you have a control plane go very badly wrong, it normally snarls up your system so badly that you do need to intervene. So, yeah, it's another kind of failure that you're unlikely to automatically recover from. Usually what you end up with is some sort of heroic human effort to manually cobble together enough coordination to get things back on track. So, you know, in the case of that Slack outage we had, we were scrambling around getting more quota to try and increase the number of instances that we could provision. In the case of Google deleting all of its CDN, it was a case of let's pull out all the stops to reprovision all of this and fill up that content again as quickly as possible, and to prioritize where the bandwidth was most needed.
Courtney Nash:And in the Meta case, they had to figure out how to get into their own building.
Laura Nolan:Oh, yes. I mean the, yeah, I mean, they didn't just break their control plane. They broke their entire network, including their badge access and everything else. Yeah.
Courtney Nash:Yeah. It is one of the more unique cascading type failures I think we've ever seen. I believe there was talk of an ax grinder or something, I don't know. But yes, the humans have to come in and help figure out what's going on and then adjust to that. So we're not gonna stop, we're not gonna stop building automated tools.
Laura Nolan:No, we're not.
Courtney Nash:And, you know, I love this notion of automation and humans as team players. This is an interesting one, because we start talking about it this way and it's almost a little creepy, right? But we both have a view of the system, right? The automation has some view of the system, and we have our own views of those systems, and we have to work together to do this, and so far that's been not awesome. What are your thoughts on how we grapple with that? We could talk about building different kinds of tools. In your experience, especially firsthand dealing with these, what do you think folks dealing with these kinds of systems can do to make things a little less painful, if possible?
Laura Nolan:Well, I mean, there is work on this. You know, there's a whole field called joint cognitive systems, Cognitive Systems Engineering, which does give us some guidance on this. But I think there are principles that we've been largely ignoring in software. So first off, when we build these control planes, we need to think of them as really important pieces of software. If you're building a tool that can down your instances, or that can change the configuration of your network control plane, or can delete your CDN, you build that with a lot of attention, and you spend a lot of time testing it, and you are extremely careful with your regression tests and all of these things. And too often we treat these things as sort of throwaway pieces of software that maybe one person writes. There isn't a lot of coordination with other folks in the organization. And if we think about that in terms of the other Slack outage that I was talking about, the one from May 12th, 2020, which involved a piece of software that was updating host lists on our load balancer tier: that was something that one person had written kind of one time, and it basically just sort of sat there never being updated. So it was forgotten, and it was really a bit under-engineered for the sort of job it was doing, in terms of the way it was reporting status and the way it was interacting with human beings.

So that brings us on to one really important principle when we're building control plane tools, automation tools for running our systems, particularly ones that can do destructive or potentially destructive things, which is: really make the interactions with humans a core principle of your system. Don't just say, "Okay, it does the thing, it's done." Think also about how your operators are gonna interact with that, and how you're gonna interact with that in an incident. Do you wanna just build a tool that can turn it off right away? You know, a big red button tool. Do you wanna build a status page? Do you wanna make it sensitive to things like whether or not there's an ongoing incident? How are you going to surface what it plans to do in the future? This is one of the key things. Many pieces of automation will decide to do something by themselves based on whatever sensors or signals they're watching; they'll make a plan and they'll just execute it. There's nowhere you can go to say, "Okay, well, what is this thing going to do next? And why did it do that thing that it just did?" So think about those interfaces and how you can build them. And whether it's a case of building a command line tool or a sort of visual interface tool, it's very helpful to build something that says, "Okay, this command that you just sent me, I'm now about to delete 100,000 machines." You know, that kind of thing. So that's pretty basic, and that's not the whole story, but that's certainly something that you can do to support human beings that have to interact with these things.

Then the second thing, I think, is surfacing complexity in principled ways. So one of the problems that Bainbridge identified in the "Ironies of Automation" that we talked about earlier is that when we start handing over control of our systems to software, we start to lose track of what's actually happening, all the complexity that the software is dealing with under the hood. I think it is actually useful to surface that, and it needs to be surfaced in ways that human beings can deal with.
So perhaps it's a case of showing a hierarchical approach and drilling down. That's not necessarily the best way in all cases, but showing ways that you can trace through a particular operation that the system is doing, and why, so that humans can sort of stay in touch with: what is this thing actually doing and why is it doing it? Hiding complexity makes tools sort of simpler to use on the face of them, but then it makes it harder to actually interact with those tools, and harder to maintain them, and harder to predict what they're going to do in an incident. So I think making that complexity at least available for people to see is really important.
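As an illustration of the kind of "team player" interface being described here, this is a small hypothetical sketch of a controller that records what it has just done, what it plans to do next, and why, so an operator can ask it directly during an incident. The class and method names are invented and reflect no particular vendor's API.

```python
# A hypothetical sketch of automation that can answer "what did you just do,
# what will you do next, and why?" without digging through a log cluster.

import time
from dataclasses import dataclass, field


@dataclass
class Action:
    description: str
    reason: str
    timestamp: float = field(default_factory=time.time)


class ObservableController:
    def __init__(self) -> None:
        self.history: list[Action] = []     # what I have recently done, and why
        self.planned: list[Action] = []     # what I intend to do next, and why

    def plan(self, description: str, reason: str) -> None:
        """Record an intended action before doing it, so it can be inspected."""
        self.planned.append(Action(description, reason))

    def execute_next(self) -> None:
        if not self.planned:
            return
        action = self.planned.pop(0)
        # ... the real side effect would happen here ...
        self.history.append(action)

    def status(self) -> str:
        """A human-readable answer to 'what happened?' during an incident."""
        lines = ["Recently done:"]
        lines += [f"  {a.description} (because: {a.reason})" for a in self.history[-5:]]
        lines += ["Planned next:"]
        lines += [f"  {a.description} (because: {a.reason})" for a in self.planned]
        return "\n".join(lines)


if __name__ == "__main__":
    controller = ObservableController()
    controller.plan("terminate 250 instances", "CPU utilization below 25% for 10 minutes")
    controller.execute_next()
    controller.plan("terminate 250 more instances", "CPU utilization still below 25%")
    print(controller.status())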
Courtney Nash:The way Bainbridge puts that is, essentially, to some degree we've given the automation what we think are the easy parts.
Laura Nolan:Yeah,
Courtney Nash:And that leaves us the harder parts. When that ease sort of fails, in a way, or when that becomes unpredictable, then we're left grappling with the complexity of it. Right.
Laura Nolan:One of my favorite examples of this is Terraform, which I think many people are familiar with. A Terraform plan isn't a plan: it gives you your current state and your end state. It's a diff, and it's not a plan. And the reason it's not a plan is that it doesn't give you the order of the things it's gonna do. You as an operator have to figure out that if I'm gonna replace this with that, first it's gonna delete that, and maybe your system goes down at this point, and then it'll bring up the other thing, hopefully. It doesn't give you feedback about things like: this is the amount of quota you're going to need to do these operations, do you have that? You know, how long is it gonna take, most likely? And you can only make estimates about this, to be fair, but the tool doesn't even attempt any sort of notion of time or any notion of sequencing. When we build our automation, we should build it in a way that does try and give people clearer plans, clearer sort of notions of time.
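A hypothetical sketch of the distinction being drawn here: an ordered plan with rough time and quota estimates, rather than a diff of current state versus end state. The step model and the numbers are invented for illustration and claim no resemblance to Terraform's internals.

```python
# A hypothetical sketch of a "real" plan: ordered steps with rough duration
# and capacity estimates, shown to the operator before anything runs.

from dataclasses import dataclass


@dataclass
class Step:
    order: int
    action: str
    estimated_minutes: int
    quota_needed: int = 0


def render_plan(steps: list[Step]) -> str:
    """Print steps in execution order so the operator can see, for example,
    that the old resource is destroyed *before* its replacement exists."""
    total = sum(s.estimated_minutes for s in steps)
    lines = [
        f"{s.order}. {s.action}  (~{s.estimated_minutes} min, quota: {s.quota_needed})"
        for s in sorted(steps, key=lambda s: s.order)
    ]
    lines.append(f"Estimated total: ~{total} min")
    return "\n".join(lines)


if __name__ == "__main__":
    steps = [
        Step(1, "destroy old load balancer tier", 5),
        Step(2, "provision replacement instances", 20, quota_needed=400),
        Step(3, "attach replacement instances to new load balancer", 10),
    ]
    # A diff would only say "old tier: gone, new tier: present"; the ordering
    # makes it obvious there is a window where nothing is serving traffic.
    print(render_plan(steps))
```

The point of the sketch is only that sequencing, time, and quota are part of what a human needs in order to predict what the automation is about to do.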
Courtney Nash:Well, so what's interesting to me about that is, I mean, you mentioned joint cognitive systems as a field. We've seen a lot of research and work and thought come out of other industries, especially aviation. You know, cockpit design, right? This switch is next to that thing, and why are you suddenly flipping the mic on when you're trying to change, you know, something like that. And some of those things are more easily graspable in a physical environment. To be able to do what you're suggesting, people have to have a view of their systems as not just us making computers do things, which I think is a big conceptual shift for our industry. It's a different set of skills to think about that than to write the code to do it.
Laura Nolan:Yes. We, we, we haven't even started to develop those skills.
Courtney Nash:No, I mean, those are, those are, dare I almost say, like anthropological, right? Sociological sorts of...
Laura Nolan:But, but, but also engineering. Also
Courtney Nash:But, but all of those things together. Right. There's a hole there that we haven't figured out how to fill just yet.
Laura Nolan:The other thing that you need to think about is putting humans in control. Fundamentally, humans always need to be able to say: stop that, don't do that, do that. We shoot ourselves in the foot a lot of the time by building systems that can't be easily turned off, that can't be easily told to do something other than what they would sort of do autonomously. It's not good enough to say, okay, well, you know, in an incident you have to run this Terraform to turn this thing off, or, you know, you have to go click through five screens of a console under pressure to try and do that, because you're asking people under pressure to do complex things, and when you're under pressure, that's the last thing you want to be doing. The principle is: build the software so that humans under pressure can easily control it, turn it off, turn it on, tell it to do some specific thing. Because the human in the loop is always going to know more than the software. The human's always going to have a lot more context, and it's best to trust that human.
Courtney Nash:So handles around the automation as it were.
Laura Nolan:Handles around the automation. Yeah, I mean, don't build things that make people's lives harder. Build things that try at least have a chance of making people's lives easier. As you say, we're, we're not going to stop building these sorts of tooling. My suggestion is invest in it. Invest in thinking about the interactions with the human beings and keeping the human being informed and part of the work. Don't hide the complexity, surface it, give controls.
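One hypothetical way to give operators the kind of "handles" discussed above: a single kill-switch flag that every piece of automation checks before acting, and that a person under pressure can flip with one command. The flag path, file name, and function names are invented for illustration; this is a sketch of the idea, not a recommended implementation.

```python
# A hypothetical "big red button": one shared flag that all automation checks
# before doing anything destructive. Names and paths are invented.

import os
import sys

KILL_SWITCH_PATH = "automation.disabled"   # hypothetical flag file


def automation_enabled() -> bool:
    """Every control loop calls this before acting; one flag stops them all."""
    return not os.path.exists(KILL_SWITCH_PATH)


def maybe_scale_down(instances_to_remove: int) -> None:
    if not automation_enabled():
        print("automation disabled by operator; refusing to scale down")
        return
    print(f"scaling down {instances_to_remove} instances")
    # ... the real termination call would go here ...


if __name__ == "__main__":
    # Operator usage, one command each way (assuming this is saved as, say,
    # big_red_button.py) -- no console screens to click through:
    #   python big_red_button.py off   -> stop all automation
    #   python big_red_button.py on    -> resume
    if len(sys.argv) > 1 and sys.argv[1] == "off":
        open(KILL_SWITCH_PATH, "w").close()
    elif len(sys.argv) > 1 and sys.argv[1] == "on":
        if os.path.exists(KILL_SWITCH_PATH):
            os.remove(KILL_SWITCH_PATH)
    maybe_scale_down(instances_to_remove=100)
```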
Courtney Nash:Do you think that there are also structural factors around why we haven't invested in those? What teams, or who's tasked with building these versus who's tasked with building features, that kind of a thing? You could say to someone, well, pay attention to this, or make this important. And they're like, okay, I believe it's important, but up the chain from me, they don't see it as important.
Laura Nolan:Yeah, it's just a lot of engineering effort. We don't have open source tooling that I think really respects these kinds of cognitive systems engineering principles. If you wanna have tooling like this, you're largely building it yourself, or you're at least sort of integrating it out of other things and putting on the usability parts. First off, engineers don't necessarily know how to build these things better. A lot of people are perfectly content to build a thing that works for them right now, and they're not thinking about, well, how is this tool going to be operating in three years' time, when I'm gone and we have this terrible outage and someone else is trying to use it? I mean, people are just not thinking about that. People are trying to solve the needs that they have right now. There's always more to do than time to do it. There's definitely a production pressure aspect to this. If we want to move past this phase in our industry's history where we are regularly shooting ourselves in the foot with our own control planes, we need to build better tools collectively. So whether that be open source or whether that be vendors providing better tools, I don't know. I suspect that we will start seeing these things emerging over the next number of years. It's really only been 10 to 15 years since we've started building this kind of tooling to manage our infrastructure and systems. Ten years ago, most people were still doing this stuff by hand. Even 15 years ago, maybe the likes of Google, Facebook, Microsoft may have been building these kinds of tooling, but nobody else was. But now, you know, mid-size and small firms, every organization, is expected to build these things, and it's completely unrealistic economically to expect a tiny or even relatively small engineering organization of a hundred people, with barely much of a software infrastructure organization to begin with, to build high quality tools that work well with humans. I think even at a substantially larger organizational size it's difficult. It boils down to collectively finding a way to make the economics of building better tools work.
Courtney Nash:It's just a small, small challenge.
Laura Nolan:Small challenge, small challenge. This is my challenge to all of the cloud providers and you know, infrastructure tooling providers.
Courtney Nash:I hope that they will say challenge accepted. But yeah, it's both maybe a bit humbling, but also a bit encouraging, to step back and realize how recent this is and how nascent it is. Because at the very least, then, we're grasping this opportunity, or folks like you are, seeing it perhaps sooner than other industries that have had these kinds of challenges themselves and have sometimes taken decades longer to start figuring it out, and we can hopefully learn from them along the way. So that's my optimistic take. We'll figure it out sooner.
Laura Nolan:Yes. Well, hopefully we can both share this sort of Marxist narrative of humanity moving towards a particular goal as the history unfolds.