The VOID

Canva and the Thundering Herd

Courtney Nash Season 2 Episode 1

Greetings fellow incident nerds, and welcome to Season 2 of The VOID podcast. The main new thing for this new season is we’re now available in video, so if you’re listening to this and prefer watching me make odd faces and nod a lot, you can find us here on YouTube.

The other new thing is we now have sponsors! These folks help make this podcast possible, but they don’t have any say over who joins us or what we talk about, so fear not. 

This episode’s sponsor is Uptime Labs. Uptime Labs is a pioneering platform specializing in immersive incident response training. Their solution helps technical teams build confidence and expertise through realistic simulations that mirror real-world outages and security incidents. While most investment in the incident space these days goes to technology and process, Uptime Labs focuses on sharpening the human element of incident response.

In this episode, we talk to Simon Newton, Head of Platforms at Canva, about their first public incident report. It’s not their first incident by any means, but it’s the first time they chose as a company to invest in sharing the details of an incident with the rest of us, which of course we’re big fans of here at the VOID. 

We discuss:

  • What led to Canva finally deciding to publish a public incident report
  • What the size and nature of their incident response looks like (this incident involved around 20 different people!)
  • Their progression from a handful of engineers handling incidents to having a dedicated Incident Command (IC) role
  • Avoiding blame when a known performance fix was ready to deploy but hadn't yet gone out, which contributed to the incident getting worse as it progressed
  • The various ways the people involved in the incident collaborated and improvised to resolve it


Courtney:

Greetings, fellow incident nerds, and welcome to season two of The VOID Podcast. The main new thing for this new season is we're now available in video. So if you're listening to this and prefer watching me make odd faces and nod a lot, you can find us on YouTube. The link is in the show notes. The other new thing is we now have sponsors. Don't worry, these folks help make the podcast possible, but they don't have any say over who joins us or what we talk about. So fear not. This episode's sponsor is Uptime Labs. Uptime Labs is a pioneering platform specializing in immersive incident response training. Their solution helps technical teams build confidence and expertise through realistic simulations that mirror real-world outages and security incidents. While most investment in the incident space these days goes to technology and process, Uptime Labs focuses on sharpening the human element of incident response. All right. In this episode, we talk to Simon Newton, Head of Platforms at Canva. They just published their first public incident report. It's not their first incident, but it's the first time we get to delve into the details of what happened, and we really appreciate them sharing that with us and the world. So let's get into it and talk to Simon. Simon, thank you so much for joining me on The VOID Podcast. Will you do our listeners a favor and introduce yourself, tell me where you work and what you do?

Simon Newton:

Hi, I'm very happy to be here. I'm Simon Newton, Head of Platforms at Canva. I've been there about three years or so, and Platforms at Canva consists of a bunch of different areas. Most interesting for today, that includes the teams that run our edge and gateway, the front door into Canva, as well as the teams that manage all of our cloud resources. So, Canva, in case you haven't heard of it, is a visual communication company. What we do is make it really easy for people to express their creative ideas to others using visual content. We have a bunch of different products built into what we call the editor, so we have whiteboards and presentations and responsive docs and websites, and drawing tools for making social media posts. So yeah, there's a huge amount you can do.

Courtney:

Yep. That's how I'm gonna make the social media posts for this podcast, so thank you all very much at Canva. That's why I was very excited, not for you really, but when I saw this incident report. I'm in a Slack group, this resilience in software group, and we have a channel that's just called cases, and honestly, I could sit and watch that all day. Sometimes we know about incidents because things happen to us, but I found out about this incident report and the outage from that Slack group; I did not experience the outage personally. And I also noticed that it's the first one you've ever done, right? I don't recall ever reading a Canva incident report before. In this report, I think you say you've had an internal incident process since 2017, and we're gonna get into the incident itself, but I'm always so curious about what it is that pushes an organization over the edge. I mean, I don't want it to sound so negative, but what brought you to the point where you said, okay, we're gonna go to the effort of publishing our first public incident report?

Simon Newton:

Yeah, and it is a bit more effort, right? We've been publishing incident reports internally, like you said, for many years, but there is a difference between a report for internal consumption and a polished report for external consumption. In terms of the why behind it, it's really a reflection of how we're evolving as a company. Canva got its start with a bunch of individuals and small businesses; those were typically the user groups, and they'd be creating social media content and marketing for their small businesses, that sort of material. But in the last couple of years we've really seen increasing adoption within enterprises, and enterprises bring their own new and different set of customer requirements and expectations. So it's partly us wanting to show our commitment to transparency to those enterprise customers, and also because we believe, and I very much believe, that being open about incident reports does benefit the broader industry. And I think...

Courtney:

I agree.

Simon Newton:

it's important, yeah. And I love what the VOID is doing here. My take is that software engineering generally, as a field, does a fairly poor job of learning from previous failures. If you look at other engineering fields, they're often more highly regulated and there are processes around this where failures are understood and investigated. But in software engineering we seem to make the same mistakes many times, right? There's not that learning as a broader field. So yeah, I'm very keen on the VOID's efforts to see improved education around this, so that we can all uplift the industry as a whole.

Courtney:

Thank you for the shameless plug. I appreciate it. I wanna ask you the same question I ask everyone at the start of this podcast, which I think is cruel, and yet I do it every time. Can you give a brief summary of what happened?

Simon Newton:

I will try and keep it brief. As I'm sure you're aware, the large and interesting incidents happen because of a confluence of factors, right? So in this particular case there were multiple contributing factors that lined up in a way that caused this much larger incident. But maybe a little bit of background. The Canva editor is a single page app, and we deploy it multiple times a day. Amongst other things, as part of that deploy we build the JavaScript assets and publish them into an S3 bucket. Then when clients reload the page, or the editor self-updates, those clients go and download the new assets via our edge provider. Once those assets are loaded and the editor is functional and running, it starts making API calls to load in all the content. Those API calls also go via our edge provider to what we call our Canva API Gateways, and then those gateways route the requests to the various services within our systems, which handle them. The other key bit of information is that those gateways run as auto scale groups within our cloud provider, AWS. So in this incident, the first factor that contributed towards it was we did a deployment of the editor at the same time that there were network issues occurring within our edge provider. And normally these network issues happen all the time, right? The internet is a messy place; this is happening day in, day out.

Courtney:

Yep.

Simon Newton:

And so normally what happens is our edge provider has automated software that can detect these issues, mitigate them, and route around them. Unfortunately, in this particular case, and again we only learned this later, they had a bit of manual stale configuration in place that had prevented that automation from running. So the first contributing factor is there's a network issue, and it hasn't been mitigated automatically.

Courtney:

As often happens with those things, unfortunately. Yep.

Simon Newton:

And so then maybe the next bit to discuss is how the edge proxies work, right? You've got a lot of clients making requests for resources, and if a resource isn't in the local cache of the proxy, it has to go and fetch it from the origin server. It's very likely that there are multiple clients requesting the same bit of content, and it's pretty inefficient to have the proxy server turn around and request it multiple times from the origin. So what you want to do is coalesce all those requests, do a single origin fetch, and then send that content back out to all the clients. And you can probably see where this is going, but with a bunch of clients requesting the same JavaScript asset...

Courtney:

Is that the sound of a thundering herd I hear in the distance? Yeah.

Simon Newton:

But the interesting bit here was, if everything behaved as it normally would, latency would go up because the origin fetches are taking a bit longer, and that's fine, it's just a bit of increased latency for the clients. Where we got really unlucky was that one of those origin fetches didn't just go from being measured in milliseconds to being measured in seconds. The origin fetch actually took 20 minutes to complete. And as I'm sure you know, 20 minutes in computing time is an...

Courtney:

It's like geological time. Yeah. That's rough.

Simon Newton:

And where we got particularly unlucky again was that this one particular asset was the JavaScript that loads what we call the object panel in Canva. This is the main place where you interact with content in your designs, right? So without the object panel loading, the editor is essentially frozen and dead. What it also means is that once that one bit of JavaScript loads, it then triggers a bunch of API calls, because...

Courtney:

Yeah.

Simon Newton:

Yes, if you've got a fetch taking 20 minutes, you've got all these clients coalescing and waiting for it, and then suddenly it becomes available. That turns into a large number of API requests hitting the gateways.
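
For anyone who wants to see the mechanics Simon is describing, here is a minimal sketch of the request-coalescing (sometimes called "singleflight") pattern in Python with asyncio. The fetch_origin coroutine and asset keys are hypothetical stand-ins, not the edge provider's actual implementation; the point is only to show why one slow origin fetch can hold back every waiting client and then release them all at once.

import asyncio

class Coalescer:
    """Collapse concurrent requests for the same key into one origin fetch."""

    def __init__(self, fetch_origin):
        self._fetch_origin = fetch_origin   # hypothetical coroutine: key -> bytes
        self._in_flight = {}                # key -> asyncio.Task

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # The first requester triggers the real origin fetch...
            task = asyncio.create_task(self._fetch_origin(key))
            self._in_flight[key] = task
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        # ...and every later requester awaits the same task. If that single
        # fetch takes 20 minutes, every coalesced client waits 20 minutes,
        # and they all unblock at the same instant: the thundering herd.
        return await task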

Courtney:

Yeah, it is pretty ugly. At that point things fell over in terms of the autoscaling and whatnot, right? It all just collapsed on top of itself there, right?

Simon Newton:

Yeah. And so that turned into, I think it was about one and a half million API requests in a very short amount of time, and that was enough to overwhelm the gateways. The auto scalers can't react fast enough to that, the gateways ran out of memory, and then that started the cascading failure due to overload.

Courtney:

Okay. That's the shortest version of this incident, which is hard to do. We'll talk a little bit more about some of these details, but how did you all find out about this?

Simon Newton:

I can run you through that. The first signs that something was wrong came about 10 minutes after the deploy, because we started noticing a drop in search traffic, and so our search on-call team was paged for a drop in traffic. I think it went down around 20% or so. They started investigating, and at first they thought, oh, maybe this is something very isolated to search. It was a sev one, which is our second highest tier of incident, at that point. As they started communicating in the incident room, other service owners started getting involved and saying, oh, actually, we've also noticed a drop in traffic. As they were doing that, that was about the time the fetch itself completed, and then our gateway and edge teams got paged for massive gateway failures. At that point the incident was upgraded to a sev zero, the gateway teams were pulled in, I was pulled in, the incident coordinator was activated, and then we went from there.

Courtney:

Yeah. What does your incident response look like? Do you all have protocols around it, and how many people ended up being involved, if you can tell me? Because it sounds like it was a doozy and a lot of people got called in. I'm always really curious how many hands are on deck.

Simon Newton:

Yeah, I would guess probably at least 20 or so. So let me describe the process. Typically, alerts will page service on-call teams, which I think is a very typical process, right? Those service teams will triage and open an incident. If it's a sev zero or sev one, that'll activate the incident coordinator rotation, which is a small group of people that have a lot of training and skills around managing and coordinating these large-scale incidents. The sev zeros will page myself. And then it really depends on the incident as to the structure of it, right? If it's a smaller incident, you might have only a handful of people and they'll be fulfilling all the roles. If it's a large incident, like this one, we'll set it up so that there are dedicated people running each particular role. Maybe one thing that's important as well is that the larger, highest-severity incidents will automatically assign a representative from our customer support organization.

Courtney:

Oh, okay.

Simon Newton:

They'll handle all of the user reports, looking for signal there, so as not to overwhelm the responding team. And...

Courtney:

Mm-hmm.

Simon Newton:

They also keep user support in the loop, so they can feed back, look, this is the state of things. If people are contacting or reaching out to Canva, this is what we should be telling them, and if there's anything they can do to mitigate it from a user perspective, that as well. And then post-incident, like I mentioned, we'll write those incident reports. We've also just recently started using AI to try and extract common themes across all of those reports, to look for areas where we can improve.

Courtney:

I have so many questions. So the IC is getting pulled in sort of automatically if it's a sev one or sev zero. How is that team set up? Are they also engineers on different teams? Are they kind of a dedicated SRE-esque function? And when they're not IC, what are they doing? I'm kind of curious what the shape of that looks like.

Simon Newton:

Yeah. So early on, our ICs were just a bunch of very experienced, battle-hardened engineers at Canva. Over time, of course, as we got bigger, we needed to set up a dedicated function, because those folks also have their own projects and their own deliverables that they're working on. So yeah, IC is a dedicated function at Canva. When they're not on call, they're doing a bunch of other things. Some of that is improving the incident process itself and looking for those patterns in incidents. Plus there's a bunch of work that goes into our large launch events. We do two main launches a year, and we've got one coming up now in April, so they're doing a bunch of planning for that. There's a bunch of capacity planning involved, understanding and mitigating the risks that are unique to each one of those launches. So yeah, that's a dedicated reliability function.

Courtney:

When you have a pretty big incident like this one, do you have execs in the active incident channels? Or do you have a dedicated role who's talking to folks and coordinating back and forth? What does that look like?

Simon Newton:

Yeah. So in this particular case our founders did join the incident channel. Typically, sometimes they're there, sometimes they're busy with other things. Usually myself or one of the other senior leaders will act as that channel to the founders, to give them updates and help them understand what's going on.

Courtney:

So you mentioned you found out later. You didn't mention this explicitly, but I think this is what you're getting at with the network configuration: the Cloudflare wrinkle was that their traffic was going over the public internet, not over their private backbone. Is that a piece of it that you found out later, so at the time you didn't know why that was contributing to the problem?

Simon Newton:

Yes. We don't have visibility into that traffic between Cloudflare and...

Courtney:

Yeah.

Simon Newton:

AWS. So yes, that was a detail that we didn't find out until, I think, maybe one or two days after the incident, as we were working with Cloudflare to understand what happened.

Courtney:

At some point, did anyone have the intuition that this was gonna get worse in the way that it did? Like once the files were downloaded from the origin server, did you hear the thundering herd coming? I'm always a little curious what the existential dread might be like in an incident. Or did that catch you by surprise as well?

Simon Newton:

Yeah, it all happened fairly quickly in terms of...

Courtney:

Okay.

Simon Newton:

the unfolding incident, yeah. By the time folks were in the room saying, oh, this is broader than search, that was about the time the origin fetch completed, and then the gateways started falling over.

Courtney:

And so with all the clients hitting the API gateway, you've also got a known performance issue. Did that all happen really quickly too, or were you only aware of that piece after the fact as well?

Simon Newton:

Yes. I didn't actually touch on the performance side.

Courtney:

Let's get into that, then. I jumped ahead.

Simon Newton:

So the gateways are now crashing, we have a bunch of people on the call, and that's where we started breaking up into different work streams, right? There was a set of people responsible for contacting our vendors and getting them involved. That process differs across vendors, right? Some of them will be, go to our portal and submit a P zero request. Others will be, email this address and it will page your on-call account managers. That process needs to be well practiced, right? And I think this is actually something we improved afterwards as well. We have internal docs on how to escalate to each of our vendors, but we tidied up that documentation and just made it a bit clearer, because when you're doing that, you're already in a high-stress environment, right? And so you don't want to...

Courtney:

Absolutely.

Simon Newton:

have four pages of workflow to go through. It needs to be very clear: do these four things, right?

Courtney:

Yeah.

Simon Newton:

So we had a set of people contacting vendors. We had one engineer who was not on call at the time, but saw the incident and jumped on, as many people at Canva do; there's a very good culture of banding together, especially for these large incidents. So we had one engineer who went off and started profiling the gateway, and he was reporting back in saying, oh look, actually, the profile looks different. If I remember correctly, he'd done this before, other times, so he had past profiles to compare it to, and he was like, look, when I did this two weeks ago, this one looks different, right? And I'm seeing lock contention here. So that was one avenue that we thought might have been a contributing factor, and we had a bunch of people exploring that. It turned out to be a change in our metrics library. We were making some changes to the way that metrics were collected, and so we were integrating a third-party library there. What that had done was inadvertently put metric registration behind a lock, and so the capacity of our gateways was reduced, which would then also contribute to the overload situation and what we saw with the cascading failure. But that was all emerging in parallel to the...

Courtney:

Yeah.

Simon Newton:

activity that was happening on the call.
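
As a hedged illustration of the kind of regression being described here, not Canva's actual gateway code or metrics library, this is roughly what accidentally putting metric updates behind a shared lock on the hot path looks like in Python, and one way to keep the lock off the common case:

import threading

_registry = {}
_registry_lock = threading.Lock()

def record_request(endpoint: str, latency_ms: float) -> None:
    # Regressed version: every request takes the global lock, so under heavy
    # load all request threads serialize here and effective capacity drops.
    with _registry_lock:
        stats = _registry.setdefault(endpoint, {"count": 0, "total_ms": 0.0})
        stats["count"] += 1
        stats["total_ms"] += latency_ms

def record_request_less_contended(endpoint: str, latency_ms: float) -> None:
    # Sketch of a fix: only take the lock on the rare registration path.
    stats = _registry.get(endpoint)
    if stats is None:
        with _registry_lock:
            stats = _registry.setdefault(endpoint, {"count": 0, "total_ms": 0.0})
    # A real metrics library would use atomic or thread-local counters here;
    # plain increments are shown only to keep the sketch short.
    stats["count"] += 1
    stats["total_ms"] += latency_ms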

Courtney:

So you've got the performance issue. And if I'm not wrong, there was this really juicy detail in there, and these are the kinds of things I love it when companies share: it was a known issue and there was a fix for it in the deployment pipeline for that day that had not gone out yet. In many organizations there could be a lot of blame and post hoc, I can't believe we didn't do that, we should have. What was the general sense when you all realized that that was also contributing to this incident?

Simon Newton:

Yeah, there was definitely never any blame. It was just very unfortunate. Yes, the metrics team had already identified this, they'd already fixed it, it was already merged. What would that have been, 12 hours later? The new gateway would've been deployed, and this would have looked very different.

Courtney:

yeah.

Simon Newton:

It was just getting unlucky. We do have a project underway to move to incremental deploys. Right now I'd say most backend components are deployed on a daily cycle, and then components can individually opt into incremental deploys and deploy on merge. But we do want to move more and more of our components over to that incremental model, which ideally would've caught this in this case.

Courtney:

And then last, but certainly not least, I believe, if I've got the timeline right here, you've got that going on, and then basically the load balancers on your ECS containers were just getting hammered at this point as well, right? Was that also a matter of reaching out and trying to figure out what was happening with AWS? What was that last piece of the puzzle like?

Simon Newton:

Yeah, so we had a set of people escalating to Cloudflare, and we had a set of people reaching out to AWS, because at that point, just after that origin fetch had completed, we were concerned that there might be something going on on the AWS side. But thankfully we have, I would say, reasonably good observability into all of this, right? And a cascading failure due to overload is a very distinct pattern in a graph, right? You get this sawtooth pattern, like if you're looking at...

Courtney:

Yeah.

Simon Newton:

uptime, you get this sawtooth that's offset with every instance. I've certainly seen that before, and a number of other people on the call had seen it before, so it was very clear at that point: oh no, this is an overload situation. And again, we've faced this before, right? I've seen quite a few of these, and there's really only one option, which is you've got to bring the load down and get the system back into a stable state. You can try and add capacity, but the problem is, unless you can do that in an atomic flip fashion, any additional capacity you bring up is just gonna get hammered into the ground. So you've gotta cut that load back. In this...

Courtney:

Yeah.

Simon Newton:

case, what would have been really nice is if we had a sort of lever where we could just say, okay, great, dial the load to 20% of demand. We didn't have that sort of control within the cloud provider's systems. So instead what we did was reach for the country-level controls, right? We put in a block that said, for any country in the world, we want to display this status message instead of forwarding the requests onto the backend. So that was the lever that we had at the time. And what we ended up...

Courtney:

Yeah.

Simon Newton:

doing was, once we got that block in place, we saw the traffic drop to zero, we saw all the gateways come up and stabilize, and then we started admitting more load. Maybe the interesting bit here is that we added Europe back in first, because that was where the peak load was at the time, right? This happened in Australian early-to-mid-evening time, which would've been European daytime; the US was mostly asleep at that point. So we chose Europe to be admitted back first, and then rolled out to the rest of the world.
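
A minimal sketch of the kind of country-level lever Simon describes, written as plain Python rather than any particular edge provider's rule language; the request shape, the admitted-country set, and the staged re-admission are all hypothetical illustrations of the idea, not Canva's configuration:

# Countries currently allowed through to the backend. Emptying the set sheds
# all load at the edge; regions are then re-admitted in stages, starting with
# wherever the peak traffic is (Europe, in the incident described above).
ADMITTED_COUNTRIES: set[str] = set()

STATUS_MESSAGE = "Canva is recovering from an outage. Please try again shortly."

def serve_status_page(status: int, message: str) -> tuple[int, str]:
    # Stand-in for the edge answering locally with a static status page.
    return status, message

def forward_to_origin(request: dict) -> tuple[int, str]:
    # Stand-in for proxying the request on to the API gateways.
    return 200, f"proxied {request['path']}"

def handle_request(request: dict) -> tuple[int, str]:
    country = request.get("country", "UNKNOWN")
    if country not in ADMITTED_COUNTRIES:
        # Shed load before it reaches the overloaded gateways.
        return serve_status_page(503, STATUS_MESSAGE)
    return forward_to_origin(request)

# Staged recovery: admit Europe first, then the rest of the world, watching
# gateway health between each step.
# ADMITTED_COUNTRIES.update({"DE", "FR", "GB", "ES", "IT"})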

Courtney:

Yeah, that was an interesting piece of what I like to think of as hard-earned expertise, where instead of just turning everything back on again and potentially recreating the same problem, and I've definitely seen incidents where they're like, oh no, and it literally happens again. The only other piece I was curious about, in terms of what that incident response looks like: are people remote? Is everything in a Slack channel? Is everybody in the same time zone? How complicated does that look for you all?

Simon Newton:

Yeah. Canva is a hybrid work setup, and I'd say the bulk of our engineering is in the New Zealand to west coast of Australia time zones, so this was early to mid evening for them. But yes, as part of the automation, when an incident is triggered it automatically creates a Slack room, ties that Slack room to the main incident Slack room so people can go in there and find it, and sets up the Zoom call as well. Then I'd say the majority of the discussion occurs on Zoom. Maybe another interesting point is that we've deliberately set up that Zoom so that Zoom chat is disabled, so it funnels all of the chat through the Slack room, so...

Courtney:

Ah, okay. Yeah.

Simon Newton:

that way you don't have two different...

Courtney:

Two sources of, yeah. Yeah.

Simon Newton:

And so typically people will be sharing links to dashboards or logs, et cetera, in the Slack channel. And then we have a way of annotating key moments within the Slack channel, so that when you're going back later trying to write that incident report, it's very easy to pull out the key events and say, okay, this is what we knew at this particular time.
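
For anyone building similar tooling, a minimal sketch of annotating key moments in an incident Slack channel so they're easy to pull into a timeline later. This uses the public slack_sdk client with a placeholder token and is not Canva's internal tool:

from datetime import datetime, timezone
from slack_sdk import WebClient

client = WebClient(token="xoxb-...")  # placeholder bot token

def annotate(channel_id: str, note: str) -> None:
    """Post a clearly marked timeline annotation into the incident channel."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M:%S UTC")
    resp = client.chat_postMessage(
        channel=channel_id,
        text=f":pushpin: TIMELINE {stamp} - {note}",
    )
    # Pinning the message makes the key moments easy to collect when the
    # post-incident report is written later.
    client.pins_add(channel=channel_id, timestamp=resp["ts"])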

Courtney:

Is that a homegrown tool that you all use, or is that a third-party tool for the incident process?

Simon Newton:

Yeah, I think it's mostly a bunch of tools developed by that same set of incident coordinators.

Courtney:

So some companies say, for sev whatever number, we're always gonna have an incident review and we're gonna try to do it within this amount of time. Do you have a dedicated incident analysis team with analysts, and do they have processes around that and timeframes for when you want things to happen? Or is that more ad hoc?

Simon Newton:

Yeah, so we don't have a dedicated team of people analyzing these. The incident coordinators will work with the service teams that were contributing towards the incident, and then we have a template for the PIRs, as we call them. I don't think there is a set deadline, but certainly, because I'll be involved in the sev zeros, I'll be looking for that report, and if it hasn't shown up in maybe two or three weeks I'll ping people asking where it's at. There's no sort of system that, you know...

Courtney:

Yeah. Yeah.

Simon Newton:

It is all linked in that automation, in that Slack room as well. As the state moves through from, I guess, triage to responding to mitigated to resolved, that's all captured

Courtney:

Yeah.

Simon Newton:

and fed into that timeline.

Courtney:

Is there like a PIR meeting anyone can come to? Like how are the reports shared internally?

Simon Newton:

Yeah, so when it is published, it's typically done as the last message in that Slack room, which then gets archived. We have a weekly meeting as well where we review the sev zeros, plus some of the other, lesser-severity but maybe interesting failure modes. And then in our monthly engineering leads meeting we take the list of incidents and go through them briefly there. That's so that all the engineering leads have the same level of visibility, and so that we're identifying common themes.

Courtney:

Do you find that those reports get referenced in planning? Do you see them being used after the fact as learning pieces, either for planning or for architecture reviews or any of those kinds of things?

Simon Newton:

Yeah, so the action items that come out of those reports get created in Jira, which will then be sitting on those teams' backlogs. Then next planning iteration, the team will be looking at their reliability backlog, mixing that with the other various categories of work on the backlog, and using that to prioritize what work occurs.

Courtney:

And is there anything outside of action items in terms of distributing what you learned, or any other activity around the incidents like that?

Simon Newton:

I'd say no, other than the sort of internal Slack channels where they're posted,

Courtney:

Yeah.

Simon Newton:

right? And so anyone in the company is free to go and look at those and read through the reports.

Courtney:

What was the most surprising thing to you about this incident?

Simon Newton:

Yeah, probably the biggest one is that our edge provider would coalesce requests indefinitely. I can understand why that was the outcome, and I can put myself in the shoes of the person writing that code: if there's a fetch in flight, wait for it to complete and then respond, right? That was what allowed that thundering herd to develop. All of our internal RPC traffic has timeouts, which are propagated and respected at each hop in that service graph. In this particular case there wasn't any timeout within that edge provider's layer, which makes it somewhat difficult from a client code perspective to understand: is this an error, is this just a slow fetch, or is this part of a thundering herd? In which...

Courtney:

Hey, there's

Simon Newton:

case you would wanna do something differently, like add a timeout.

Courtney:

right.

Simon Newton:

Yeah.
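
To make the timeout point concrete, here's a small, hypothetical sketch of how a client-side deadline could be layered over a coalescer like the one sketched earlier, so a caller stops waiting without cancelling the shared fetch that other clients are still coalesced onto. This illustrates the principle, not how Canva's edge provider actually behaves:

import asyncio

async def get_with_deadline(coalescer, key, timeout_s: float = 30.0):
    # asyncio.shield protects the shared fetch from being cancelled when this
    # particular caller's wait_for gives up at the deadline.
    try:
        return await asyncio.wait_for(asyncio.shield(coalescer.get(key)), timeout_s)
    except asyncio.TimeoutError:
        # Surface an explicit, bounded failure so the client can retry or
        # degrade, instead of silently waiting on someone else's 20-minute fetch.
        raise TimeoutError(f"origin fetch for {key!r} exceeded {timeout_s}s")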

Courtney:

Yeah, it definitely seemed like one of the themes from this incident, and I see this a lot and write about it a lot, was how automation actually made it worse. Right? Not because you want the automation to make things worse, but because it can't always react to these kinds of situations, and its model of the world is not as flexible as ours is. I definitely saw action items around that in the report, but was there any reaction to these automation surprises? What was your response to the automation making it harder, and how are you all adapting to that for the future?

Simon Newton:

Yeah, so maybe a good example is, again, I didn't mention this, but part of what happened on that call was there was a person responsible for freezing the auto scaling. Because what can happen here is your traffic goes to zero, especially over a long amount of time, and the auto scalers are like, oh great, we can downscale all of the groups, and...

Courtney:

Yeah.

Simon Newton:

of course they can't...

Courtney:

And like, oh no again. Yeah.

Simon Newton:

So we've got processes in place where, for those particular types of incidents, we have tooling we can run, again developed by those reliability folks, where we can say, hey, freeze all the auto scalers. And what we can also do is go back and say, actually, reset them to where they were prior to the incident occurring, right? So if any have downscaled, we can get them back. That's a process we've adapted, given past experience and thinking through the bad outcomes that could happen in these sorts of events: turn the automation off for a second and let humans take control. Yeah.
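
As a hedged sketch of what "freeze the auto scalers, then put them back" can look like against AWS Auto Scaling groups with boto3 (Canva's actual tooling is internal, and the group name here is hypothetical):

import boto3

autoscaling = boto3.client("autoscaling")

def freeze_group(group_name: str) -> dict:
    """Stop an ASG from scaling and remember its pre-incident sizing."""
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )["AutoScalingGroups"][0]
    snapshot = {
        "MinSize": group["MinSize"],
        "MaxSize": group["MaxSize"],
        "DesiredCapacity": group["DesiredCapacity"],
    }
    # Suspending these processes stops scale-in while traffic is deliberately
    # held at zero during load shedding.
    autoscaling.suspend_processes(
        AutoScalingGroupName=group_name,
        ScalingProcesses=["AlarmNotification", "ScheduledActions", "Terminate"],
    )
    return snapshot

def restore_group(group_name: str, snapshot: dict) -> None:
    """Return the group to its pre-incident size and resume normal scaling."""
    autoscaling.update_auto_scaling_group(AutoScalingGroupName=group_name, **snapshot)
    autoscaling.resume_processes(AutoScalingGroupName=group_name)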

Courtney:

Yeah, that's definitely a theme of my work, obviously, but instead of just saying, well, we'll add more automation to make the automation better, y'all were like, let's give the humans better controls over what that might look like. I like seeing those kinds of adaptations.

Simon Newton:

I'm a very big fan of modes within automated software where it just throws up its hands and says, this does not match my view, what I've been modeled for, and...

Courtney:

yeah.

Simon Newton:

Right? When I've been building systems in the past, I've built in these lockdown modes, right? Where it's, hey, the inputs don't match anymore, I'm gonna go into lockdown mode, I'm gonna flag it in my telemetry, a human needs to come and get me out of this. Rather than just continuing down a path that no one has prepared the software for, and possibly making things much, much worse.
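
A small, hypothetical sketch of that lockdown-mode idea: an automated controller that refuses to act when its inputs fall outside the envelope it was designed for, flags itself, and waits for a human:

import logging

log = logging.getLogger("scaling-controller")

class Controller:
    # Envelope the automation was designed and tested for; anything outside it
    # is treated as "not my problem to solve" rather than acted on.
    MAX_CREDIBLE_DROP = 0.5  # a >50% traffic drop in one interval looks like an incident

    def __init__(self) -> None:
        self.locked_down = False

    def decide(self, previous_rps: float, current_rps: float) -> str:
        if self.locked_down:
            return "hold"  # stay put until a human clears the lockdown
        drop = (previous_rps - current_rps) / max(previous_rps, 1.0)
        if drop > self.MAX_CREDIBLE_DROP:
            # Inputs no longer match the model: don't downscale to zero,
            # flag it in telemetry, and wait for a human.
            self.locked_down = True
            log.error("lockdown: traffic dropped %.0f%%, refusing to act", drop * 100)
            return "hold"
        return "scale_down" if drop > 0.2 else "no_change"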

Courtney:

And what do you think was the biggest learning for the team out of this incident?

Simon Newton:

I think definitely practicing those controls on the edge provider's network. When we were putting that block in place, we added it with a very simple status message. I can't actually remember what it said, but it was definitely not a user-friendly status message, because at the...

Courtney:

All bad. Good luck. Yeah.

Simon Newton:

It wasn't quite that bad. But because at the time...

Courtney:

I could come up with a lot of bad error messages for you if you'd like, but probably not.

Simon Newton:

Yeah. Because at the time the priority was, get the load down, rather than debate the error message or how it was gonna be visually presented, et cetera.

Courtney:

Yeah.

Simon Newton:

We've learned from that. We now have a canned response ready to go, in a much more user-friendly and visually appealing style. So that's one thing that's come out of that. Also our ability to practice using those controls and get more familiar with them, because you really don't want to be using something for the first time in a high-pressure environment, right? You want to have it built up almost in your muscle memory. One of the other general principles I try to describe when building these systems is that any sort of emergency or failure mode should be exercised day to day. That way your...

Courtney:

Yep.

Simon Newton:

ready for it. You've been using it day to day or once a week, and it's not, oh, I've got to do this thing for the first time on a call, and there's a documentation page that's four pages long, and I've got to follow this list...

Courtney:

Yeah,

Simon Newton:

of steps perfectly.

Courtney:

The oven's on fire and I've never used the fire extinguisher before and I can't get the... yeah, basically. Yep. And do you all do drills, like tabletop drills, chaos engineering type stuff? What does that preparedness look like, and how is it structured?

Simon Newton:

I'd say there's a variety of drills that occur, right? Teams themselves will do wheel-of-misfortune style or role-playing incidents. But then there are also larger drills that we do as a company, where we say, okay, we're gonna do a sort of business continuity drill, and teams will get involved and do those exercises.

Courtney:

Thank you so much for joining me and sharing your internal process with the world. And while I hope you don't have more sev zero incidents, I do hope that you continue to share your incident reports with us. I really appreciate all the time and effort and cat herding that goes into that, but as you said, it really is so beneficial to everyone else in the industry, so thank you for doing it.

Simon Newton:

Yeah, no worries. I just hope people can learn from it and we can all get better together.

Courtney:

Absolutely. Okay. Thanks so much.

Simon Newton:

Thanks very much.