The VOID
The VOID makes public software-related incident reports available to everyone, raising awareness and increasing understanding of software-based failures in order to make the internet a more resilient and safe place. This podcast is an insider's look at software-related incident reports. Each episode, we pull an incident report from the VOID (https://www.thevoid.community/) and invite the author(s) on to discuss their experience, both with the incident itself and with the process of analyzing and writing it up for others to learn from.
Episode 2: Reddit and the Gamestop Shenanigans
At the end of January, 2021, a group of Reddit users organized what's called a "short squeeze." They intended to wreak havoc on hedge funds that were shorting the stock of a struggling brick and mortar game retailer called GameStop. They were coordinating to buy more stock in the company and drive its price further up.
In large part, they were successful—at least for a little while. One hedge fund lost somewhere around $2 billion, and one Reddit user purportedly made off with around $13 million. Things managed to get even weirder from there, when online trading company Robinhood restricted trading for GameStop shares and sent its value plummeting, losing three-fourths of its value in just over an hour. But that's less relevant to this episode.
What matters is that while all this was happening, traffic to a very specific page on Reddit (called a subreddit), r/wallstreetbets, went to the moon. Long after the dust had settled, and the team had a chance to recover and reflect, some of the engineers wrote up an anthology of reports based on the numerous incidents they had that week. We talk to Courtney Wang, Garrett Hoffman, and Fran Garcia about those incidents, and their write-ups, in this episode.
A few of the things we discussed include:
- The precarious dynamic where business successes (traffic surges based on cultural whims) are hard to predict, and can hit their systems in wild and surprising ways.
- How incidents like these have multiple contributing factors, not all of which are purely technical
- How much they learned about their company's processes, assumptions, organizational boundaries, and other "non-technical" factors
- How people are the source of resilience in these complex sociotechnical systems
- Creating psychologically safe environments for people who respond to incidents
- Their motivation for investing so much time and energy into analyzing, writing, and publishing these incident reviews
- What studying near misses illuminated for them about how their systems work
Resources mentioned in this episode include:
- Reddit's r/wallstreetbets incident anthology, which links to all the reports we discuss.
- "Work as imagined and work as done" by Steven Shorrock (video)
Courtney Nash:At the end of January, 2021, a group of Reddit users organized what's called a "short squeeze." They intended to wreak havoc on hedge funds that were shorting the stock of a struggling brick-and-mortar game retailer called GameStop. They were coordinating to buy more stock in the company and drive its price further up. In large part, they were successful, at least for a little while. One hedge fund lost somewhere around $2 billion, and one Reddit user purportedly made off with around $13 million. Things managed to get even weirder from there, when online trading company Robinhood restricted trading for GameStop shares and sent its value plummeting, losing three-fourths of its value in just over an hour. But that's less relevant to this episode. What matters is that while all this was happening, traffic to a very specific page on Reddit, called a subreddit, r/wallstreetbets, went to the moon. Long after the dust had settled, and the team had a chance to recover and reflect, some of the engineers wrote up an anthology of reports based on the numerous incidents they had that week. Fran, Courtney, and Garrett from Reddit: so what happened? What was your experience when all of a sudden all of these people were swarming Reddit to talk about this situation? Courtney, why don't you go ahead and kick it off?
Courtney Wang:A confession: I didn't really know about it until the first incident happened. I was not following very much of the crazy hype. We have an incident room channel at Reddit, and I will just go in and, you know, see what happens. We were seeing a lot of traffic and somebody just mentioned, oh, this must be r/wallstreetbets. And I was kind of like, huh, what is that? And this was already like Wednesday or Tuesday. I think it had already started over the weekend, right? And we hadn't seen real impact until the week started. So that was really my first introduction to it. And really through the first day, until I could sign off and actually go onto Reddit when I wasn't managing the incident, I still had relatively no idea what was going on.
Courtney Nash:So you all, some of you at least, were learning what this social phenomenon was while fighting fires on the infrastructure side of the house.
Fran Garcia:Yeah. I remember, in my case, one of the first inklings I had of what was going on was when we were dealing with, you know, traffic surges. We were looking at what was going on and someone mentioned, oh yeah, these traffic surges are happening because the markets are opening. That was kind of a very quick crash course into what was going on and why the markets were suddenly, for a week or two, very, very important to my job. There were all of a sudden times I had to set an alarm because there's going to be a high-traffic event: the market is going to close, a lot of people are going to go into one particular subreddit, and they're going to share with each other how much money they made or lost. And that was something I was not prepared for. That was a lot of learning that we had to do on the fly.
Courtney Nash:Garrett, was that the same experience for you as well?
Garrett Hoffman:Yeah. So I was a little bit aware that something was going on with GameStop. Prior to Reddit, I worked at a social media company focused solely on the retail trading market, so I have former colleagues and friends really close to the markets and close to that space. I'd actually seen crazy market-driven stuff like this at that job, but I didn't really understand the full extent of what was happening until we started seeing these incidents. We're looking at data pipelines and being like, why are we getting so much hotspotting here? It was really only when we saw the impact it was having on our systems that we fully grasped the magnitude of the situation.
Courtney Nash:It wasn't just like, "oh crap, our stuff's on fire," it was also, what is going on in the world at the same time? Which I think is in some ways one of the more unique aspects of this whole series of incidents. Before we get into all the individual ones, what led you all to decide to write this anthology? Which, by the way, I've never heard of an "incident anthology" before, and I really loved that notion, because you're like anthropologists, to a certain degree, of your own organization. What prompted you all to go to this level of effort and detail, to not just analyze this, but to write these up and publish them so that other people could read them?
Courtney Wang:I just really love telling stories outside my work life as well. I think storytelling is one of the fundamental ways that humans interact, learn from each other, and support each other, and I think there are so many cool incident stories out there that aren't being told. My motivation as I was going through this process, which was really the span of a couple of months of doing longer post-mortem interviews than we normally do, and coming and knocking on Fran's door and Garrett's door and saying, "Please do this!", was kind of to set an example and to say, hey, this is one way that incident stories can be written, can be told. I really was hoping, I still am hoping, that folks at other places will come across this set and say, oh, you know, I could write a story like this. Actually, one of the main inspirations for me was Slack's January outage write-up. I think if I hadn't read that and seen something presented so nicely, I might not have had as much motivation to do this one.
Courtney Nash:That's a great one. Laura Nolan is the person behind that one, and I agree, it's great. When you see these other examples of other people doing it, it's so exciting to me to hear someone else go, oh, we could do that. Not just because it's fun, but, I mean, what's the value you see in that?
Courtney Wang:For me, the value was hopefully learning. It was hopefully to tell the stories in a compelling way that external folks could learn from. But also, this was such a unique experience for all of us, and one at a time of very rapid growth for Reddit, that it seemed really important for us to capture a lot of the intricacies around dealing with 13 or 14 different incidents. It felt very important to capture that.
Courtney Nash:Oh, wow. It was hard to tell from the write-up that you're talking about 13 or 14. I think that wasn't even apparent to me from rereading these so many times.
Courtney Wang:Yeah, so there were a bunch, actually. We picked and chose certain things to talk about, and similarly there were a lot of other scaling wins and operational near misses that I alluded to that aren't fully captured in this anthology because of time.
Courtney Nash:There's a lot of pressure to move on, right? To go on to the next thing. Why did you choose these from the set, then?
Courtney Wang:I think these ones were especially compelling because there were several outside forces. These were ways that subreddit interactions messed with our systems, or maybe "messed with" is the wrong word, interacted with our internal systems. They were patterns of behavior that we had really never seen, so things like modmail spam, where a subreddit goes private and people start spamming modmail. That was interesting. Very cool. And in retrospect, what made the ones I wanted to highlight interesting to me was that they were unpredictable. I'm going to say they were unpredictable. I think some people might read the incidents and say, oh, you know, you could have done this or that, and I'm going to say no, that's hindsight bias. There's no way we could have predicted this kind of rapid growth so quickly; even if we had provisioned, there are so many other factors in play. These are examples of ways that things will just fail. And the reason I also chose Fran's and Garrett's near misses is because they were anticipatory, up to an extent. They were investments into things that were like, maybe they will happen, right? Maybe we will need this in the future. Hey, we did need this in the future. So that's kind of the reasoning behind the specific ones that I pulled out.
Courtney Nash:Garrett, had you done this before? Had you invested this much time to analyze and write up an incident when Courtney came knocking at your door? And from there, what motivated you to actually say yes and go ahead and do it?
Garrett Hoffman:In my career, I've done the traditional post-mortem write-up, but I think that's more procedural, rather than this more narrative approach to reflecting on your work. I'm a big proponent of writing stuff down, whether it's before you design something or after you've built it and you're reflecting on it. I think that gives you the mechanism to really think critically and clearly, and to be able to think about your systems in a way that's coherent enough to explain them to an outside audience who doesn't have full context. And I think there's value in doing that internally as well, because tons of people come in and out of your org, and not everyone can have full context about everything. So to the extent that you're able to get this stuff down and share it, it's great in the broader ecosystem of the technology landscape, but it's also a little bit self-serving and helpful for us internally as well. As far as saying yes to Courtney: I've done a little bit of writing before, but the thing that was so intriguing to me about this series of blog posts is that it takes the technical writing and flips it on its head, because the standard engineering blog post is kind of, we had this problem, we came up with this solution, and here's your new solution packaged nicely with a tight bow. But we all know that that's not the reality of most of your day-to-day work environment. So to be part of sharing stories that are opening up the curtain and giving people a view of the raw, unfiltered situation of what went down, it was just a really, really cool idea to me, and different than most of the blogging I've had experience with in the past.
Courtney Nash:There's a really great researcher in the UK named Steven Shorrock, who writes and researches a lot about work as imagined versus work as actually done, which I think is exactly what you're hitting at here. I'll put a link with some more details about that in the show notes if people are curious. You alluded to this also, Courtney, in the meta post that introduces the things you're going to talk about: how much it exposed about how your company works. Which I think is one of those really important things about these kinds of analyses: you're not assuming you're going to find a technical "trigger," and then you're going to solve that and it'll be done. When you really dig into these things, and especially in the near misses that we'll talk about later, you discovered a bunch of things about how your company works, how your processes work, how teams communicate. And I really appreciated that you all called that out at the beginning of it. So Fran, along those lines, what's been your experience with this kind of post-incident analysis and write-up?
Fran Garcia:In my previous life, before joining Reddit, I worked for a monitoring company, so we tried to take that kind of thing very seriously, because you have highly technical customers that really need a very highly available system. So whenever something goes wrong, you try to spend as much time as possible giving a very detailed postmortem, a very detailed write-up of what happened. I would definitely spend a lot of time doing that kind of thing. So when Courtney approached us saying, you know, we want to write these blog posts, the feeling was, I'm always happy to do more writing. The framing of this series of posts is completely different: we're going to let you see what was going on and what was happening with our systems. And I feel like that's a much more interesting way of approaching it. In terms of telling these kinds of stories, it serves a purpose that is not really served by anything else that we do, even postmortems or even documentation, because this is basically memory, right? We have memory as people, and we share that memory across people by telling stories. We need to do that as engineers, and the only way that we can actually do it with people across different organizations is by writing it down and having that collective memory grow. So the more we can do that, and the more we can encourage people to do that, the better, I feel. This is something everybody should be doing. The reason I hadn't even tried to do it before Courtney approached me is something that I think happens to many of us, which is we'll have a story or two inside of us, but we'll think, is someone really going to be interested in this? Is it not really incredibly obvious in hindsight? And, you know... Courtney is a big ball of enthusiasm, so he will come to you and he will say, "No! This is awesome. You need to write it because everybody needs to read it." And that, I think, was the key: that kind of dialogue, people needing to share this.
Courtney Nash:There's so much shared experience that's not public, and in particular, there's all this human stuff behind these incidents, which is one of my next questions. This wasn't just one incident; these dozen-plus sound like a doozy, and that must have been quite a week. Can you take me back a little bit to what the team felt like? What were you all experiencing? Was everybody super stressed out? What did it feel like behind the technical detective work?
Fran Garcia:I can tell you, for example, that not necessarily that week, but the week after, is a blur to me. I don't remember much of it. I think there was this general understanding that that week was going to be a cool-down week. There was a lot of clean-up that needed to happen, and we did a lot of cleanup. Courtney himself went on a rampage of writing postmortems and getting a lot of things ready from that documentation point of view. It was a week of holding the fort, cooling down, and getting to a physical, psychological, emotional state where we could keep going. That particular week where everything happened, I think my experience was a little bit different from a lot of people's, because I am based in Europe, so my hours are shifted a little bit. That means the beginning of my day was a lot more me trying to hold the fort in isolation, in anticipation of the markets opening. Is someone going to do something weird somewhere? You never know. So that was usually the first half of my day, and there's definitely a lot of pressure: you need to make sure that you're holding the fort, you need to make sure that you're keeping an eye on everything. And then everybody starts coming online, and that's when all the cooperation starts happening, where you need to have all these teams talking to each other. It was, I don't want to say surprising, but it was very nice to see how naturally all that cooperation happened. It was all very natural, and people knew who they needed to talk to. Our community team knows how to reach out to us and say, you know, maybe people are complaining about this, do we need to talk to any of our communities about any of this? So it was very good to see that while it was happening, but I think you only get to appreciate it after the fact, when you sit down and look at your Slack logs and say, oh look, everybody's cooperating, that's very nice. At the time it's just a big blur.
Courtney Nash:I recall the specific mention in one of the write-ups about the relationship with the community team and how well that went. Was that the result of explicit investment in that relationship? Was it more organic? Do you have established processes and runbooks and all that kind of stuff? How did that come to be something that worked so well?
Fran Garcia:I wouldn't say we have a process that's as clearly defined as that. It's something that grew very naturally, because it's something that we have to take advantage of very, very frequently. By the nature of how Reddit works, there will always be that one subreddit that's doing something weird, that's tickling the database in a strange way, right? So we need to be able to contact someone from the community team and talk to them and see what our options are. That relationship needs to be there because there are always times when we very quickly might need to reach out to them, or they very quickly might need to reach out to us. So I think that kind of thing grows very naturally, because at some point it's not just infrastructure engineers working on the servers and that's it. It's all part of a whole.
Courtney Nash:You're a victim of your own success in that regard, right? Having subreddits sort of explode is a good thing, right? It's a feature, not a bug, but you feel the consequences of that in interesting ways, and you can't resolve those just technologically.
Fran Garcia:I think that's one of the things that makes Reddit so unique, not only as a platform but as a whole: every subreddit is its own ecosystem. It will have its own behavior patterns, it will have its own group of people that use it. So you can't make assumptions based on the behavior of all the others. You need to be ready for anything at any given time, which is...fun.
Courtney Wang:The community aspect of these incidents specifically, and the interactions that we had, was something I really wanted to share internally with new infrastructure folks, anyone who's on call essentially, to open their eyes more to this group of people that works generally behind the scenes. Our community team is an incredible team, and it's a huge source of resilience for us in handling things that we can't predict, because like Fran said, there's no way we are going to predict what every subreddit is going to do. There's so much technology we can build towards guard railing or preventing some major technology things, but that shouldn't mean we ignore this group of people and the processes that are built in. And a lot of it was so organic, and that stood out to me. It was a lot of back-channeling and conversations; we didn't really have a wiki at the time on how to do this. Since then, I will say, we've improved that to be less organic and more explicit about how we do these sorts of interactions. And that was, I think, one of the huge organizational wins of all the things that happened: we on the technology side are now much more ingrained with our community team.
Courtney Nash:I like the theme across all of these in terms of how the people are the sources of resilience for you all. To that end, I want to dive into the open systems post, which I believe you wrote, Courtney, where you talk about the various teams and responders. Can you give me a sense of how many people were involved over the course of managing all of these incidents that week?
Courtney Wang:That is a great question. And there are so many questions in the list of sample questions you gave me that I was like, I wish I'd asked this during the review process. Why didn't I? It's at least 60 or 70 individual people across all of our teams. Every single team at Reddit contributed in some way to facilitating our response. I think in individual incidents, there was a solid group of eight or 10 that were in most of them. Seeing the same 10 names come up consistently was also, to me, a very interesting kind of red flag, right? An indication of, oh, why was it just these 10 people? And to your question earlier about how everyone was feeling, that's a really good question that we didn't capture, and we can't now because it was so far back. Actually, before this, I thought that this event happened last year. I thought there was a whole year in between, and it was like five months ago. Wait, no way. So my brain is already so foggy about that time. It's a really good question and one that we aren't able to answer. I'd actually like to go back and get an actual number; these are all people that jumped into Slack, and there are also a bunch of people behind the scenes probably doing a lot of work that isn't captured, even in my review.
Courtney Nash:There's usually marketing people or PR people when things reach this level. You've got the executive team involved. It really hits all of these nooks and crannies of an organization, so I thought it was really great to see the ones that you did identify. Wow, 60 to 70. Even if you're just ballparking that and you're close, that's kind of amazing. And the names coming up again and again is an interesting one. There's a whole other podcast, probably, about burnout and incident response. But the sort of flip side, you know, of humans helping is expertise. You kept seeing those same names. I'm going to spitball here: some of those names kept showing up because they were the folks who probably had the deepest expertise about how those systems work, and you can't just clone that. So when you need that expertise, it's a heavy burden to a certain degree. I don't know if that's the right word, but you rely on those people even more. And that's a factor, I think, when you have this set of incidents recurring over and over again. And maybe that's why it was a bit of a blur?
Courtney Wang:I think that's a value of writing up incidents: in a lot of ways it helps new people and existing folks learn more about what they need to learn about, so the same eight or nine or 10 people aren't always the ones on the thing. More people should be reading incident reviews, internally and also externally, to better understand what you don't know.
Fran Garcia:I think there's a lot of value in telling these stories in this way, because there are really only two ways to get people to learn all these things. Well, there's really only one way: they need to be there and they need to go through it. Now, that's not a way that scales particularly well, for many reasons, and sometimes it might not even be the most psychologically safe way of doing it. The other thing you can do is this kind of training where you say, I'm going to get you to go through the story of how I saw it, and you can see it as close as you can to seeing it through my eyes. So it's not enough for me to tell you, "Don't do that thing with the database because the database doesn't like that," right? You need to say, "Here I was on a dark stormy night, and the database was doing the thing, and these are the things I was seeing, and these are the things I was doing." And that's a lower barrier to help people grow into understanding what that looks like and how they can get to that place.
Courtney Nash:I have so many questions, and I want to make sure I get to as many of the reports as I can, so I'm going to kind of spelunk into the open systems one. You covered some surprising details related to the Crowd Control-related database issues, but then, after you all got through that, you have this follow-on secondary effect with the modmail flood that you alluded to, Courtney, and it highlighted how unforeseen user features and behaviors can impact your infrastructure. You talk about it being surprising, and I was hoping you could take me back to what that was like for the team. What was surprising about what was happening, and how did you begin to tease that apart to try to understand it?
Courtney Wang:Those were very interesting ones. They were surprising because they didn't follow patterns that we had seen previously. To take you into our mindset before this series of incidents opened our eyes to how unpredictable Reddit is: I'm going to make the assumption that a lot of us thought Reddit was relatively predictable. We have a couple of events, like the Super Bowl, for example, or election days, that for many years we were kind of like, these are the days that things are going to happen, we're going to see these kinds of traffic surges, and we will prescale or add more resources accordingly. And there might have been things that existed that maybe individuals knew about, but it wasn't shared broadly. So it was surprising mostly because none of us had ever seen the combination of effects that we were looking at on dashboards together before. Two entirely disparate systems, modmail and all of our other infrastructure, separately going down: okay, they must not be connected. Actually, yesterday I was talking with somebody, and we were looking at some 500 errors, and they said, "Oh, these 500 errors are this status code. I'm not even looking at this other system because there's no way they can be related." It turns out they were related. So I think the surprising part is just how we didn't know what we didn't know. And the process of teasing that out was extremely difficult and caused a lot of reflection. It was extremely difficult because it's interesting that a lot of the tools that we build, and the dashboards that we build, and the fixes and mitigations that we build, are for things that we've seen before. And that's kind of wild when you think about how many things we just don't know about. It's hubris to think that we can predict all of those things. Garrett's recently consumed work is an example, and the reason I highlighted it is that it's an example of a project that scales infinitely, in that it is a good solution to a lot of problems, including ones that we might not know about. And I think highlighting that as an example of a near-miss project was really powerful, because we're not solving for a specific case. We're solving for the general case of "traffic might increase in a lot of different ways, and we can solve it this way." And similarly with the autoscaler, right?
Courtney Nash:I would love to talk about Garrett's write-up, but before I do that, I just wanted to call out how invaluable I think it is to study these kinds of near misses. Fran, you alluded to this early on: it could be framed very differently. It could be framed like, "here's how Reddit's awesome, and this is why our system is this way, and you should go do the same." Which, spoiler alert, people probably shouldn't, because their business model looks nothing like yours. But the framing is that it's not that this was a giant success, but that it was a near miss. And I love that framing, because I think you all have teased out and understood just as much about how your business and your systems and your social organizations work from studying the near misses. So it sounds like maybe a conscious choice on your part, Courtney, but Garrett, talk me through what you were thinking about in terms of writing this one up. Were you thinking about it as a near miss when you first did it?
Garrett Hoffman:Yeah, I think so, because it's a Friday night, it's late, and you have a database at 95% capacity out of nowhere, based on, like Courtney said, unpredictable patterns putting additional load on this database. And we're on Slack, we're writing up docs: how are we going to scale this up? Do we have to have any downtime? What's the mitigation plan? How long is this going to take? And you're writing this up and you're going through these steps, and it all goes according to plan. We can scale it up easily, we can scale it up online with no downtime. I think in that exact moment, me and Courtney slacked each other and were like, could you imagine if this had happened nine months ago, before we redesigned the system? I do not think what just happened would have happened. We weren't that conscious of it in the moment of responding to this incident, and I don't think we immediately said, hey, this is something we need to write about or talk about. But I think that little conversation between me and Courtney resonated enough that he was like, "no, this is worth talking about," because you need to talk about this to remind yourself why it's so important to do that redesign, and maybe push off that one new feature by a quarter to prepare for this exact situation.
Courtney Nash:So you alluded to that trade-off of, should we spend X amount of time on feature Y, or should we invest in this thing that may or may not pay off one day. What was that decision-making process like at Reddit? Was that an easy sell? Were there a lot of people that had to be involved, hard trade-offs, engineering managers? What did that look like internally, to make that decision?
Garrett Hoffman:Courtney and Fran might have a bit more to add, because they have a bit more time at Reddit than I do, but especially in my roughly year and a half here, more and more focus has been given to foundational work and quality. So I think it's becoming easier than I imagine it may have been in the past, when Reddit was in a smaller, super high growth phase. Not that we're not in that anymore, but I think we're at the point where we're finding the right balance of that high-growth, new-feature development along with the foundational work. It's largely more of a trade-off in resourcing. Reddit has notoriously operated very light relative to the scale that it serves, so I think those decisions mostly come down to: when you're strapped for resources, what do you do? And there's merit in both approaches. What's the point of preparing for more users when you aren't building the features that you need to get those users? There are certain systems where, when the limitations become super apparent, it becomes easier. In this system, we were fortunate enough to have hit those limitations in a more gradual fashion, rather than being slammed all at one time.
Courtney Nash:So we're talking about two "What Worked" posts, just for context for folks, and all of these will be included in the show notes. One was around the autoscaler, and one was around recently consumed. I believe it was you, Fran, for the autoscaler. I'm going to read what you wrote: "The code was originally written in response to a scaling event years ago and had been largely unchanged. Unfortunately, this means that it wasn't an easy code base to understand if you didn't have the right context. And the prospect of making changes to it was intimidating, which only exacerbated the issue." And I'm sure most engineers can agree: that's the spooky shit, don't touch it. But take me back to when that code was written, if you can, and the event that precipitated it. You said it was a scaling event years ago. Something precipitated your willingness to wade into this. Why was it intimidating for the people who were dealing with this code later downstream, and why did nobody want to touch it?
Fran Garcia:I actually just checked this right now. The code was created eight and a half years ago, which is more than half the life of Reddit, so that should give you an idea. It basically remained fairly unchanged since then, and I think that's in itself a testament to it being a solid piece of code, right? It did its job and mostly didn't complain. It was written by Jason Harvey, one of our engineers here at Reddit; he's still around. Legend says he had the flu when he wrote it, so I equate that to Michael Jordan's flu game, but I don't know if pizza was involved or anything. The point is that it was created and it works reasonably well. And to touch on what Garrett was talking about, there are two types of software projects in terms of the value they're bringing and the kind of investment they get. There are the projects that are delivering a lot of value, and those are going to get a lot of investment, right? Because it's very obvious that there's a reward there. Then there are the ones on the opposite end of the spectrum, which is, everything's terrible, everything's broken, and you need to do a lot of investment just to keep things running. And then there's everything else that falls into the middle, where they're mostly working, they mostly don't complain. You could make them better, but it's not necessarily obvious how or why, so a lot of the time they fall through the cracks and investment doesn't happen. The autoscaler kind of fell into that category. So if you want to test it, how do you test a thing that requires so many moving parts, right? You need all these autoscaling groups, all these large pools of servers that are serving varying amounts of traffic. You need all of those things, and we didn't really have that infrastructure, we didn't really invest in the infrastructure to test that. So if you wanted to make any change to that particular piece of code, you needed to have a very deep awareness of: how do load balancers work? How are requests distributed to all those load balancers? What happens if you screw up your code? How do you revert it very quickly? What can break? So the end result is maybe you don't want to change it. And there's a very small number of people, like I was checking the commit logs, a very small number of people that have contributed to it over the years, and most of the changes were out of necessity. One of the things that we tried to do to fix this was: okay, can we refactor this in a way that if people from other teams ask me, "Can I see how the autoscaling decisions for my server pool are made? Can I see that? Can I modify that?", then people from any team, with no infrastructure background whatsoever, should be able to come to me, and I should be able to point them to the code and say, "Yeah, just modify this or look at this, and it's all you need." That's basically the problem we set out to solve, because nothing was broken, but we weren't giving people the flexibility to make those changes, which is what worked in this particular case, right? In this particular case, during the wallstreetbets shenanigans, we had a case where we thought we would fine-tune the autoscaler, but we needed to have a way to do it easily.
We needed to have a way for maybe different people to contribute to that and be able to chime in and say, I think you can make this particular change, or I think you can make that particular change, it's going to work better. In the previous version of the autoscaler, that would have been more difficult to achieve.
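To make the idea of readable, per-pool autoscaling decisions concrete, here is a minimal sketch of what a target-tracking policy that any team could read and tweak might look like. The episode doesn't show Reddit's actual autoscaler code, so the pool names, limits, thresholds, and structure below are invented purely for illustration.

```python
# Hypothetical sketch, not Reddit's actual autoscaler: a per-pool scaling policy
# written so that someone without an infrastructure background can read and
# adjust it. Pool names, limits, and thresholds here are invented.
from dataclasses import dataclass
import math


@dataclass
class PoolPolicy:
    name: str                  # e.g. "comment-api"
    min_instances: int
    max_instances: int
    target_utilization: float  # fraction of capacity each instance should run at


def desired_capacity(policy: PoolPolicy, current_instances: int, avg_utilization: float) -> int:
    """Target-tracking rule: scale the pool so utilization lands near the target."""
    if current_instances <= 0:
        return policy.min_instances
    raw = current_instances * (avg_utilization / policy.target_utilization)
    return max(policy.min_instances, min(policy.max_instances, math.ceil(raw)))


if __name__ == "__main__":
    comments = PoolPolicy("comment-api", min_instances=10, max_instances=400, target_utilization=0.6)
    # Markets just closed and r/wallstreetbets is busy: utilization jumps to 0.9.
    print(desired_capacity(comments, current_instances=120, avg_utilization=0.9))  # -> 180
```

The point of a shape like this is the one Fran describes: the scaling decision for a pool lives in a few lines that a non-infrastructure engineer can read, question, and safely change.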
Courtney Nash:I want to get through one more of the cases really quickly, if we can. The other one that you wrote up, Courtney, was the "more data, more problems" incident, and early on you mention a quote-unquote known pattern. I love known patterns, right? Because when we talk about expertise, there are heuristics, there's all this acquired knowledge: "oh, this looks like that." Sometimes that can lead you down a garden path, and you're like, "no, those two...that can't be connected. It looks like this," right? How did the team come to acquire their knowledge of that particular known pattern? I think you said the four tables were a pattern that pointed to a hot post on Reddit. What did that knowledge acquisition process look like for you all?
Courtney Wang:This is another question that I wish I had asked during the review with the responders who came up with that. The people who responded to those particular incidents, the three main engineers, I'm going to shout out Jason Harvey, re-hear mirrors, and Brian Simpson, are collectively the most senior infrastructure engineers who worked on the infra team at the time. It must be like 8, 9, 10 years individually, so something like 30 years of collective experience doing this sort of work. So my bet is that they figured out that pattern and just had it ingrained in them from years of interacting with those data models and seeing how they broke. That was just time they could use to understand those intricacies.
Courtney Nash:And then there was a solution. It was a form of an availability-versus-consistency sacrifice decision you had to make there. So it was sort of, do you turn the feature off and preserve the core experience? Who was involved in making that trade-off and that decision? What factors were considered? How did you reach that conclusion?
Courtney Wang:I do know who was involved, because this was something that I saw, and I went back and checked Slack logs and looked it up. At the time it was actually Fran who was leading, and I would like him to chime in a little on what he was thinking about when he said to pull the trigger on this. But one of the engineers basically just said, "I think if we just turn off this particular set of features, it will work." I didn't ask what went into that decision-making. In general, I can speak now to some of our internal processes around that, when we're thinking about turning features off. A big part is, how easy is it to disable the feature? These are questions that we ask during incidents, when an engineer for a project comes in and says, "Hey, I can turn this off if we need to." That actually happened during this incident as well: we had a couple of other engineers chime in and say, "oh, you know, these services interact with these data stores. We can turn off X, Y, and Z feature." And that was really cool to see; now, looking back, we never really took advantage of that, and we never needed to. But we decide: okay, how easy is it to turn off? What is the actual impact on the core Reddit experience? Is it a lower-level feature? In this case, being able to save and hide links was kind of a, okay, many people use this, but how necessary is it to the core? And that is a tough business decision that, at the time, we as the infrastructure group made. We were given the agency, implicitly I think, at the time to make those decisions. Now looking at them, I was like, that's a really hard decision to make. I don't know that every engineer, if you don't have, you know, 10 years of experience here, would be able to make that decision on their own. And that's why, in the months since, we've started a more robust incident command process, and we have these sorts of flows. A lot of that, I think, came out of those learnings. At the end of the day, our core goal is to get read availability up, so anything where we can get users to actually just be able to look at content again, as opposed to interacting, is one of the main levers that we pull. But I would like to ask Fran: in your mind, when Brian said, "Hey, I can turn this off," and you said, "go for it," what went through your head?
Fran Garcia:Brian said it was good to do it. No, I mean, to add a little bit more context to it: this is something that is very integral to how things work at Reddit. There's always a hot key, or there's always a hot something, due to the way that Reddit works. There will always be one subreddit that's more active than anything else, one user that's more active than anything else, one thread that's more active than anything else. So that kind of pattern happens very, very frequently, and I think we're getting to the point where we are very good at identifying those and saying, "okay, let's shed that load as best as we can." That also works with features, right? Let's degrade as gracefully as we can. Sometimes we don't do a good enough job of preparing for those, of having the features behind a simple toggle. There are many, many incidents where you will see someone very quickly making a pull request which basically says, if you're going to read that particular key, don't do it. It's not pretty, but it works. And that informs decisions later on: okay, we need a toggle here to allow us to shed load there; we need to be in a position to stop all these kinds of reads from the database at the drop of a hat. So every time that happens, it informs a situation where hopefully we can be in a better position in the future. That's something I feel we do a lot, and the fact that we as incident responders feel empowered to make that kind of decision is really, really important, because I've been part of companies where that kind of decision probably involved a conference call with at least 10 different people.
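As an illustration of the kind of quick "don't read that particular key" change and feature toggle described here, the following is a minimal sketch of a runtime kill switch. The episode doesn't show Reddit's actual code, so the key IDs, flag names, and fallback behavior below are assumptions made up for the example.

```python
# Hypothetical sketch of load shedding via a hot-key blocklist and a feature flag.
# Everything here (key IDs, flag names, fallbacks) is invented for illustration;
# it is not Reddit's implementation.
BLOCKED_HOT_KEYS = {"t3_l8rf4k"}                # e.g. the ID of a runaway-hot post
FEATURE_FLAGS = {"saved_links_enabled": False}  # flipped off during the incident


def fetch_post(post_id: str, cache_get, db_get):
    """Skip the expensive lookup entirely for keys we've decided to shed."""
    if post_id in BLOCKED_HOT_KEYS:
        return None  # degrade gracefully instead of hammering the database
    return cache_get(post_id) or db_get(post_id)


def saved_links(user_id: str, db_get_saved):
    """A non-essential feature behind a simple toggle: easy to turn off under load."""
    if not FEATURE_FLAGS["saved_links_enabled"]:
        return []
    return db_get_saved(user_id)
```

The design choice being traded off is the one discussed above: a read-side switch like this sacrifices a non-essential feature (or one hot key) to preserve core read availability.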
Courtney Nash:Maybe a war room, which...
Fran Garcia:Oh no! At the very least.
Courtney Nash:I have PTSD from previous jobs with that one.
Fran Garcia:The strategy Courtney described was basically: you have Brian, and Brian says, "I can do it!" And then whatever engineer is there is like, "Yeah, do it." And that's basically all you need. Feeling empowered to be able to do that is very, very important.
Courtney Nash:I would love to close out. Courtney, you mentioned that you are developing a broader sort of incident response plan... team?
Courtney Wang:Yeah, it's probably a good topic for a future blog post. Coming out of the wallstreetbets incident stuff, the first initiative I took on was: hey, we're starting to see a lot of coordination needed among a bunch of teams, and this can't be the same three or four people. We need to train more people to find these connections and to make them. So we now have an incident commander rotation internally, comprised of individuals, that we would like to expand to the broader org, with the idea that anyone can become an Incident Commander. It's more of a non-technical role, if anything. These folks are there to facilitate communication and coordination. The biggest value I see is safety: helping incident responders feel safe, especially as we are, as an org, scaling very rapidly. Our on-call model is that service owners are in charge of their services and of being on call, but sometimes services talk to other services and other teams. And sometimes, you know, you're on call and you get paged, and we want to make sure that there is a safety net for people to come in. We've been running this program for a few months now and starting to collect some data on how useful it is, right? How much safer do people feel? How much quicker do we resolve incidents? The stated goal is that these people are there to help reduce the severity and duration of incidents and help folks find what they don't know. I think if we had had the program in place for r/wallstreetbets, I'm not sure that things would have necessarily resolved more quickly, but I do think psychologically folks would have felt more comfortable and cared for in a very stressful time.
Courtney Nash:Well, thank you all, first of all, for writing all of these up. It is an incredible investment of time, but it seems pretty clear to me the payoff in terms of what you've learned has been well worth it. I hope you will continue to do more of that, because you're going to have more incidents. I wish I could tell you you're not... Fran's shaking his head.
Courtney Wang:We have had our last incident, officially. Just going to announce that.
Courtney Nash:Oh, you're done!
Fran Garcia:We had a meeting. We all agreed that that was it.
Courtney Nash:Cool. Good. I'm glad to hear that. I'm sure everyone else will hop on that train too.