The VOID

Episode 5: Incident.io and The First Big Incident

February 14, 2023 · Courtney Nash

What happens when you use your own incident management software to manage your own incidents, but one of those incidents takes out the incident management product itself? Tune in to find out...

We chat with engineer Lawrence Jones about:

  • How their product is designed, and how that design both contributed to the incident and helped them quickly resolve it
  • The role that organizational scaling (hiring lots of folks quickly) can play in making incident response challenging
  • What happens when reality doesn't line up with your assumptions about how your system(s) works
  • The importance of taking a step back and making sure the team is taking care of each other when you can get a break in the urgency of an incident
Courtney Nash:

Today I am joined by Lawrence Jones, an engineer from Incident.io. We are here to talk about an incident report that you yourself authored from the end of November last year. Walk me through the top level summary of what you feel like happened and we can talk through some more of the interesting things that you all experienced.

Lawrence Jones (Incident.io):

Yeah, sure.

Courtney Nash:

And I'll put the link to that incident report in the show notes for people listening.

Lawrence Jones (Incident.io):

The incident that we wrote up, we titled it "Intermittent Downtime From Repeated Crashes," which I think kind of encapsulates what happened over the course of this incident. So it happened back in November, and it took place over about 32 minutes, or at least that was the period during which we experienced some downtime. Though it wasn't the whole 32 minutes, it was periods within those 32 minutes where we were seeing the app kind of crash a bit. And I think it's quite useful to explain the context of the app, because otherwise you don't really understand what "incident" means.

Courtney Nash:

What is it y'all do?

Lawrence Jones (Incident.io):

Yeah, exactly. So, I work at a company called Incident.io. We offer a tool to help other companies deal with incidents. So, yeah, in a kind of ironic sense, this is an incident company having an incident. So...

Courtney Nash:

Everyone has them, right? No one is immune.

Lawrence Jones (Incident.io):

Exactly. So what happened here was, our app, which serves the dashboard, which is where you would go to view some information about the incident, and also the API that receives webhooks from Slack, because a lot of our incident response tool is Slack based, that was the thing that ended up failing. Obviously this means that our customers, if they were at this particular moment to try and open an incident or do things with incidents, they would be unable to, or they'd see some errors, in the period during which we were down. Hopefully they would retry maybe a minute later when it came back up and they'd get through. Obviously it's not ideal, which is exactly why we did a public write up and went through the whole process of: how can we make sure that we learn from this incident and get the most out of the experience, so that we can do better in future and hopefully eliminate some of the sources that caused this thing. And maybe if it happens again, or something similar happens again, we'll know and be able to respond a lot faster.

Courtney Nash:

There are always these little interesting details that people might choose or not choose to include. Early on, in the initial crash description, there's this little detail that makes me think that something about this incident might have been a little stressful... You wrote "20 minutes before our end of week team debrief," which to me just makes me feel: it's Friday, everybody's trying to wrap up. And your on-call engineer was paged, like the app crashed out of Heroku.

Lawrence Jones (Incident.io):

We actually use our own product to respond to incidents. So we dogfood the whole thing. Which honestly, when I first joined, I thought "that can't possibly work." But it works quite well, actually. We get a Sentry error in, and we page on any kind of new error that we've seen, just because any of those errors could mean a customer hasn't been able to create an incident, and we take that really seriously. Our pager has a fair amount of stuff coming through. Not crazy, we're very happy with the volume, but a page is not something, at least in working hours, that we would be too worried about; it's kind of business as usual. But obviously when you get the page through and you realize that you haven't got an incident in your own system because we failed to create it, that suddenly means: okay, cool, you're on a bit more of a serious incident here, because it's not just a degraded system. You're seeing something that's actually impacting the way that we are creating incidents on our end, which is not good. Like you said, it was on a Friday. We're a fairly small team, about 40 to 45 people at the moment, and about 15 engineers in that. And every Friday at about 4:00 PM we host team time, where we go and discuss, you know, how's the week been, et cetera. Obviously this happens 15 minutes beforehand, when everyone is winding down, or, you know, maybe someone is rushing to try and get a demo ready for demo time, something...

Courtney Nash:

No. No.

Lawrence Jones (Incident.io):

Geared for the team time that we've got coming up. And then suddenly you realize that the app is, like, pretty badly down. Well, firstly, our on-call engineer starts going through this and starts responding. And eventually the system comes back up, so we do get an incident in on our side, which means we've now got a Slack channel and we start coordinating it in the normal way. But I don't think we realized until a few minutes in that this is actually something that will be bringing us down repeatedly, because when we fixed it the first time round, it went down again a couple of minutes after. And then it gets a bit more serious, you know, and you start getting a couple more senior engineers join in, and you start building up a team around it, and then it kind of snowballs and you suddenly realize team time isn't gonna happen in the normal way. So, yeah, that was how we started, which is never exactly fun, but, yeah, it happens, doesn't it?

Courtney Nash:

You mentioned from previous incidents you'd already learned something.

Lawrence Jones (Incident.io):

Yeah, so as I said, we're a small team, but we've been growing quite a lot. I think over the last 12 months we've gone from maybe four engineers to the 15 that we have now. So we've got a core group of engineers here who have dealt with all sorts of different things happening over that year. And one of the things that we do have that is a benefit to us is that we run our app in a very simple way. At the point of this incident, at least, we just had a monolithic Go app that we run in Heroku using the Docker container runtime. So we go and build the image, and we ship that image as a web process. This serves all of our incoming API requests. It also processes a lot of async work that's coming in from pubsub. And we have a cron component, which is just running regular jobs in the background. But that's basically all there is to it. It's super simple. Obviously with simplicity you get some trade-offs. Some of the trade-offs for us are: if something goes wrong and the process was to crash, then you are gonna bring down the whole app. Now, we run several replicas, which means it's not necessarily the whole app that goes down, but if there is something that is a common crash cause to all of the different processes that are running at the same time, then you're gonna find that this thing will turn off. Now, coming from running systems in Kubernetes and other similar environments before, you might be used to, I certainly was, a kind of aggressive restart strategy. You know that processes are gonna die, so immediately something will bring it back up. In Heroku, anyway, you won't necessarily get this. Heroku will try and bring you up quite quickly afterwards, but if something happens again, then you don't enter this kind of exponential back off that's trying to bring you back up. It goes quite harsh. So I think it might be placed in what Heroku call a cool off period, where for up to 20 minutes it just won't do a single thing. Which of course for an app like ours just can't work. Twenty minutes of downtime because Heroku is just waiting around and has kind of put your app on the naughty step is not what we need; being told to cool off when everything goes down is not gonna work. So I think we've learned before that if the app is continually crashing, if we're seeing something that is bringing the app down, then if we jump into the Heroku console and press restart, we can kind of jump-start out of that cool off period, which will bring us back up. Because one of the benefits of it being a Go app is it boots very, very quickly; in fact, so quickly that we occasionally run into issues on Heroku, where I think they're built more for Ruby-like apps, like racing against port allocation and stuff like that. But it does mean that it comes up really, really quick. So it is quite an immediate relief to the incident, I guess, which is why we pressed it initially. We're back up. Cool. Let's start looking into this. And then it's only two or three minutes later that we hit the same issue again. Presumably that brings the app down, and we have to go back and start hitting this manual restart button more often than we would like.
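
To make that trade-off concrete, here's a minimal sketch, not incident.io's actual code, of what a single-process app of this shape can look like in Go: one binary serving HTTP, pulling async work from Pub/Sub, and running cron-style jobs. Every name and handler below is an assumption for illustration. The point is that an uncaught panic anywhere in the process takes the web traffic, the async workers, and the cron work down together, and then you're at the mercy of the platform's restart policy.

```go
// Hypothetical sketch of a single-process, "run everything" Go app, similar
// in shape to what Lawrence describes: one binary serving HTTP, consuming
// Pub/Sub, and running cron-style work. Names and handlers are assumptions.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"time"

	"cloud.google.com/go/pubsub"
)

func handleEvent(m *pubsub.Message)                             { /* assumed business logic */ }
func runScheduledJobs(ctx context.Context)                      { /* assumed clock-driven jobs */ }
func handleSlackWebhook(w http.ResponseWriter, r *http.Request) { /* assumed webhook handling */ }

func main() {
	ctx := context.Background()

	// Async work: a Pub/Sub subscriber running inside the same process.
	client, err := pubsub.NewClient(ctx, os.Getenv("GOOGLE_PROJECT"))
	if err != nil {
		log.Fatalf("pubsub client: %v", err)
	}
	go func() {
		sub := client.Subscription("app-events") // assumed subscription name
		// An unrecovered panic in any handler kills the whole process:
		// web traffic, async work, and cron all go down together.
		if err := sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
			handleEvent(m)
			m.Ack()
		}); err != nil {
			log.Fatalf("pubsub receive: %v", err)
		}
	}()

	// Cron-style work: a ticker loop, also in the same process.
	go func() {
		for range time.Tick(time.Minute) {
			runScheduledJobs(ctx)
		}
	}()

	// Web: serves the dashboard and the Slack webhook API.
	http.HandleFunc("/slack/webhook", handleSlackWebhook)
	log.Fatal(http.ListenAndServe(":"+os.Getenv("PORT"), nil))
}
```

Several replicas of a process like this soften a one-off crash, but a poison message that every replica eventually pulls will crash them all, which is essentially the failure mode described in this incident.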

Courtney Nash:

So it's interesting because, you know, one of the things that I talk a lot about with the VOID is how interventions from people, based on their intimate knowledge of the system, are typically what we rely on in these kinds of situations. Was this the poison pill situation with the sort of subsequent crashes? How did you get there?

Lawrence Jones (Incident.io):

So I'll say upfront, by the way, when we came around to doing this review of the incident, our internal debrief has a lot more detail that's relevant to how we're running our teams and how we want to help our teams learn as well. And the preview of that is that there was a lot of stuff that went on inside of this incident that people took actions on manually that wasn't common knowledge amongst the team. So one of the things that we have definitely resolved to do out of the back of this is try and schedule some game days, which we're gonna do, I think, next month, where we'll simulate some of these issues and do a bit more of an official kind of training, making sure that everyone's on the same page when it comes to what to do in these situations. So whilst I might have mentioned before that Heroku is not doing a normal exponential back off kind of pattern, thankfully we do have something in our system that is, and that's our pubsub handler. Usually when this is happening, if you can see a crash in the process, you are going to go look for the thing that caused the crash. So we were kind of sat there in the logs looking: well, where is our Sentry? Where is our exception? Where's it come from? We were really struggling to get it. And I think one of the problems here is that if the app immediately crashes, then you've got Sentry events and exceptions that are buffered, as well as traces and logs, that don't necessarily find their way into your observability system. So what we were having was the app would crash and then we would lose all the information that it had at that particular point, which leaves you in a state where you don't quite know what caused the crash. You just know what led up to the crash, and if it's happening very quickly after you started processing the information, that's not ideal. The good thing was that we could kind of tell from the pattern of the crashes that it was something that was going into some type of exponential back off. And I think, having run the system for a while now, one of the patterns that you realize is: our async work that exists in Google pubsub, when you fail to ack a pubsub message, you're going into a standard exponential back off pattern. So the fact that we are seeing subsequent crashes come maybe two minutes, and then three minutes, four minutes, that sort of thing, from each one of the retries, that gives you a pretty strong gut check that this is probably coming from some pubsub messages. Which is why we went from subsequent crashes into: cool, we might not be able to tell exactly what has crashed, because, you know, the Heroku logs are being buffered, so we don't even get the goroutine traces coming out of this app when it's crashing. We obviously don't have any of our exceptions. We don't have any of our normal monitoring. But we know this thing is happening to a regular rhythm, it's probably pubsub, and then you start looking for where we've got kind of an errant pubsub message across all of our subscriptions. So you start looking through them and going: cool, the subscriptions that I know I'm okay to clear, I'm gonna start clearing. And we were just basically going through those and trying to clear them out so that we could get rid of the bad message, which presumably was the thing that was being retried each time and bringing it down.
And that's how we had a couple of our engineers allocated to going through subscriptions, whilst the others were looking through recent code changes, or trying to find, from what logs we did have from Heroku, whether or not they could see a pattern in the goroutine traces.
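
He doesn't say exactly how they cleared those subscriptions (it may well have been the Google Cloud console), but as a rough illustration of the idea, here's a hypothetical sketch using the Go Pub/Sub client: seeking a subscription to the current time discards its unacknowledged backlog, which is one way to stop a suspected poison message from being redelivered. The project ID and subscription names are made up, and this is only safe for work you know you can afford to drop.

```go
// Hypothetical sketch: clear the backlog on Pub/Sub subscriptions we believe
// are safe to drop, so a suspected un-ack'd poison message stops being
// redelivered. Seeking a subscription to "now" marks everything published
// before that time as acknowledged. Project and subscription names are
// assumptions, not incident.io's actual tooling.
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()

	client, err := pubsub.NewClient(ctx, "my-project") // assumed project ID
	if err != nil {
		log.Fatalf("pubsub client: %v", err)
	}
	defer client.Close()

	// Only subscriptions whose backlog we know we can afford to lose.
	safeToClear := []string{"clock-nudges", "clock-reminders"} // assumed names

	for _, id := range safeToClear {
		sub := client.Subscription(id)
		if err := sub.SeekToTime(ctx, time.Now()); err != nil {
			log.Printf("seek %s: %v", id, err)
			continue
		}
		log.Printf("cleared backlog on %s", id)
	}
}
```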

Courtney Nash:

While this is going on, you noted that, you know, you turned off a number of non-critical parts of the app, right, that are related to sort of async work stuff. I think it's like, "it's been a while since you last updated, do you wanna update your incident?" Was that just part of trying to figure out what was going on, or...

Lawrence Jones (Incident.io):

So I think I've mentioned how, in this particular situation, we didn't have access to a lot of the tools that we would normally have access to in an incident. So observability wise, it was not particularly great. One of the things that you can do whenever you're in that situation, where there's essentially something bad happening and a lot of potential things that could cause it, is you just try and simplify. And I think that was the decision that we made. We know that we have some regular work that gets scheduled via what we call the clock, which is just a thing that's piping pubsub messages in on a regular basis. That's providing functionality that actually, in terms of how our customers will experience our product, is not critical. They're like nudges to remind you to take an incident role, or do something at a particular time. We could disable those and you'd still be able to create an incident and drive one proactively through Slack webhooks. So that's obviously the thing that you'd prioritize if you had to pick some subsystems to keep up rather than others. And if you can get rid of that, then you suddenly remove a whole host of regular work, any of which might be providing this event or this job that is causing the thing to crash. It was more about: we don't know what's going on, we have limited options, so let's try and simplify the problem on the chance that it might fix it. But also we know if we remove that, we'll have less noise. So there will be simply less hay through which to find the needle.
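
The conversation doesn't describe the exact mechanism they used to turn the clock off, so the following is just a hypothetical sketch of what a kill switch for non-critical scheduled work could look like: an environment variable checked by the scheduler loop, with the variable name and job names invented for illustration.

```go
// Hypothetical "clock" loop with a kill switch for non-critical scheduled
// work. The environment variable and job names are invented for illustration.
package main

import (
	"context"
	"log"
	"os"
	"time"
)

func enqueueCriticalJobs(ctx context.Context) { /* assumed: must-run work */ }
func enqueueNudges(ctx context.Context)       { /* assumed: "update your incident" nudges */ }
func enqueueReminders(ctx context.Context)    { /* assumed: role/timing reminders */ }

func main() {
	ctx := context.Background()

	for range time.Tick(time.Minute) {
		// Work that customers depend on directly keeps running.
		enqueueCriticalJobs(ctx)

		// Non-critical nudges and reminders can be switched off during an
		// incident, cutting the volume of async work flowing through Pub/Sub
		// and leaving less hay to search through for the needle.
		if os.Getenv("DISABLE_NON_CRITICAL_CLOCK") != "" {
			log.Println("clock: non-critical jobs disabled, skipping")
			continue
		}
		enqueueNudges(ctx)
		enqueueReminders(ctx)
	}
}
```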

Courtney Nash:

Yeah. And so you, I mean, you found the needle. Was it "aha, we think we know it's this," or was it just "we turned a bunch of stuff off and finally it just stopped breaking"?

Lawrence Jones (Incident.io):

Yeah, so I think it was quite a funny one. We had a suspicion that it might be coming from something. And that was because, right at the particular point where we'd started seeing the crashes, we could see in the Google pubsub metrics that there was an event that had started failing, or was left un-ack'd, in the subscription. So that was giving us a kind of "this looks bad." It was also part of a piece of the code base that we had seen an error come from, I think, the day prior, so we were already feeling a bit suspect about this. We didn't quite know how it would've happened, because we were under the assumption at this point that any errors that were returned by these subscription handlers, regardless of whether or not they panicked, the app should proceed and continue. As it happened, that was an incorrect assumption. At this point, you already don't have a consistent viewpoint of exactly what's causing the incident. So you start ditching some of the rules that you think you have in mind, because they're logically inconsistent with what you see actually happening in production.

Courtney Nash:

How hard is it to get to that point where you have to abandon an assumption, like a well-earned assumption? That's a tough thing to do, I think, especially in a somewhat pressured situation, right?

Lawrence Jones (Incident.io):

Now, when I say that that assumption was logically inconsistent with what we were seeing in production, at this point you have to have a fair amount of confidence in how you think the system should work to come to the conclusion that it is logically inconsistent. Otherwise, you might think that you're missing something, or you're just interpreting the signals in the wrong way. Now, I think the thing that we had definitely concluded, given what we could see from whatever Heroku logs we did manage to get out of the buffer, was that this app was crashing and we were seeing the whole thing halt. That usually only happens if there's a panic that hasn't been caught. So even though you think that you have the panic handlers all over the app, you start considering other ways that it might go wrong. And we are aware that there are, for example, third party libraries inside of our app. Even down to the Prometheus handler, so you've got a little endpoint that's serving Prometheus metrics; there could be a bug in there that's causing a panic, and if it's not handled correctly, that could be it for the app. So there are definitely causes that could lead to this. You start breaking out of your "well, doing work inside of our normal constraints should be safe." And you go, well, something has failed somewhere, some safeguard, so we have to walk that one back, at least for now.
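
The interview doesn't pin down exactly where the uncaught panic escaped from, but one well-known way this assumption breaks in Go is that recover only catches panics raised on the goroutine where the deferred recover runs. The sketch below is illustrative Go behaviour, not incident.io's handler code: a wrapper like withRecover catches a panic in the handler itself, but a panic in a goroutine the handler spawns kills the whole process.

```go
// Illustrative Go behaviour, not incident.io's handler code: recover() only
// catches a panic raised on the goroutine where the deferred recover runs.
package main

import (
	"fmt"
	"log"
	"time"
)

// withRecover wraps a handler so a panic inside it is turned into an error.
func withRecover(handler func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("handler panicked: %v", r)
		}
	}()
	handler()
	return nil
}

func main() {
	// Case 1: a panic on the same goroutine is caught; the process survives.
	if err := withRecover(func() { panic("boom") }); err != nil {
		log.Printf("recovered as error: %v", err)
	}

	// Case 2: the handler spawns a goroutine that panics. The deferred
	// recover in withRecover never sees it, so the runtime kills the whole
	// process, taking web, workers, and cron down with it.
	_ = withRecover(func() {
		go func() { panic("boom from a spawned goroutine") }()
	})

	time.Sleep(time.Second)                               // give the goroutine time to crash the process
	log.Println("unreachable: the process exited above") // never printed
}
```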

Courtney Nash:

So there are a couple of things that you talk about in here in terms of the regroup, and I'm curious, because you did mention something. I also appreciate that you made sure people stopped to eat and drink, something like take a little bit of a...

Lawrence Jones (Incident.io):

Yeah, that's a big one for me. I've done that wrong too many times before.

Courtney Nash:

Yeah. Yeah. These are these important reminders that we're humans. You mentioned something further back that I made a mental note I wanted to come back to, in terms of the growth of the team, and some people taking certain actions during the incident. Can you talk to me a little bit about what that looked like?

Lawrence Jones (Incident.io):

So Pete, our CTO, he came back, and he'd been watching the incident from afar. In fact, at one point there was a very funny message, like, it'd be really nice if you could provide incident updates for this incident. Using our tool, of course. To which we had to go...

Courtney Nash:

Oh no

Lawrence Jones (Incident.io):

Like, yes. It's a real horrible message to look back on. You're like, oh, yes, no, that would be why. Going back to your point about the team growth and where we were at at this particular point, our app has actually been very robust. We have quite a lot of experience building apps like this, and we prefer simple technology over complex, just literally because we want this thing to be rock solid and we've dealt with the pain of those things before. So we've got a couple of burnt fingers that mean the app has been kind of blessedly stable for a while. That meant that this was our first actual "ooooh, how are we gonna fix this?" moment, where you actually hit a production incident and you go, well, I don't quite know what the cause of this would be, and if I don't fix it, then, you know, we're gonna be down for a while and that's gonna be really painful. And honestly, that was a bit of a shock. I've been on a lot of incidents before, but it's been now maybe about a year and a half that I've spent at Incident.io. So it means I'm a year and a half away from having dealt with production incidents like that before. And I think that was the same even for our more senior responders within the team. So, whilst we had done really quite well at dealing with the problem itself, one of the first things that you forget whenever you are doing something like this, and you are involved in the nitty gritty of fixing it, is doing all the bits around an incident. So, getting someone to do proactive communications. We kind of had that halfway there, but someone had taken the lead role and then they weren't necessarily doing a lot of the things that we would expect the lead to do because, quite frankly, they were just super involved in the technical discussions. So I think Pete coming in did what you should do in that situation. He made a call that was like: look, it looks like we're out of the woods, or at least some urgency is diminished now. We've found some things out, I'm not clear on exactly what's happening, and I think we've got a lot of the team over here who weren't necessarily the most senior of the people actively responding, but they definitely want to help and they can if we need them. So let's take a moment. Take a breath. And we all just stood around, did an incident stand up, had a couple of people catch up on comms. We reallocated the incident lead role, got everyone to relax their shoulders and get their posture back, and then just decided to split up the work that we needed to do, and moved on to trying to do it afterwards. So I think the regroup was exceptionally useful, and came at exactly the right time in the incident. And it's just something that, for me anyway, was a gut check of, oh wow, it's been a while since I've done this. And it's really easy to forget if you're just in the weeds doing the actual fix.

Courtney Nash:

It doesn't matter how many times you've done this, at however many different places, each local context has its own, yeah, sort of learning curve.

Lawrence Jones (Incident.io):

I mean, it should go without saying, but it's often forgotten: every time you change up the people who are responding, it doesn't matter if they have individually got that context from other places. You don't know, until you really practice and put the time in, how this is gonna work. It's why we're scheduling game days and doing some drills and practicing, because obviously going forward this is gonna be something that's super important to us and we want to do it well. Also, a bit selfishly, for our product: we think our product can help. So it would be a bit embarrassing if we weren't doing all this best practice stuff ourselves.

Courtney Nash:

How much of the things that you did would you say were sort of technical, and how much would you say was organizational, you know, in your mind, after the fact?

Lawrence Jones (Incident.io):

There were two angles to this incident debrief that we had. We ran an internal incident debrief where we got several people together in a room to chat through this incident. At the time we were walking through a writeup that we'd produced that contained both technical detail and observations about how we'd worked as a company. So I think that debrief focused probably only one quarter on the technical stuff. The public postmortem that we've released, that's primarily about the technical elements, quite frankly, because that's interesting to a public audience, and I think it explains the nature of the incident and kind of gives color to what we've done to try and fix it. The internal stuff was a lot about the learnings that we got out of it as a company. So, as I said, this was one of the bigger incidents that we've had in the past. Hopefully it will be the biggest one that we have for at least a couple of months. But one thing that it does test is, you know, are we communicating correctly with customer support individuals? We have a lot of people who are even selling the product to people, and when they're doing that, they're doing demos. There's probably something organizationally there where, you know, if we've got a major incident going on, you don't want to be scheduling a demo call over it. That's not going to look good for us or make sense for the person on the other end. So probably proactively bail out of those and reschedule them, rather than try and make a judgment call on whether it will continue. So there's a lot of trying to figure out who or what are the levers that we will pull organizationally next time something like this happens, to make sure that customer success is involved properly, and make sure that the communications that we need to be sending out are cohesive and make sense, across both what we're putting on our status page and what's going via our customer success team. And, yeah, make sure that everyone in the company is kept up to date in the right way. This is exactly what you would expect when you are running an incident for the first time at the size of company that we're at, especially given that, when the last major one happened, honestly, half the people who are here now just weren't here yet. So it's something that you really need to drill, and try and get everyone on the same page before these incidents happen, if you want to make sure that it goes well.

Courtney Nash:

Yeah. I mean, the irony, right, is you get better at them if you have them. Ideally you don't wanna have them, and so if you're fortunate enough to not be having a lot of them, then you don't practice a lot unless you intentionally practice. Was there anything really surprising to you about the whole thing?

Lawrence Jones (Incident.io):

I don't think so. It's very interesting from the position of a startup as well, just because you think quite carefully about what is the right type of investment to make at different points. And especially coming from someone who's scaled an app like this before at a much larger company, and gone through all of those, honestly, just hitting your head on the door as you go through each different gate, I'm kind of waiting for different types of problem to appear as they come. And this was one of them that we had called out as a risk, which was why we had something quite well prepared for it, and why we were able to, within about an hour and a half of the incident closing, make a change to our app that split the way that the workloads are run in Heroku, which makes us entirely insensitive to a problem like this happening again, or at least our customers wouldn't necessarily notice it if our workers were to crash like this. And that's kind of not what you would expect from an incident like this, to have a fix in place two hours after. So we've given some thought to how this stuff might crop up, but you never really know when the thing might appear. We were quite happy with the decisions we made leading up to this, and learned a lot about how to fine-tune that barometer on when we're looking at technical investment going forward, and exactly what we want to do to try and make sure that we're keeping ahead of the curve. So ideally we won't need one of these to keep us on exactly the right balance there. But given everything in retrospect, I think we were probably making quite good decisions, prioritizing that along the way.
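
The public write-up describes the fix as splitting how the workloads run in Heroku; the sketch below is a hypothetical version of that idea, with one binary selecting which component to run so that web traffic, Pub/Sub workers, and the clock live in separate process types, and a crash-looping worker no longer takes the API down with it. The flag, process names, and run functions are assumptions, not incident.io's actual setup.

```go
// Hypothetical sketch of splitting one binary into separate Heroku process
// types, so a crash-looping worker can't take the web process down with it.
// The flag, process names, and run functions are assumptions. A Procfile for
// this might look like:
//
//   web:    bin/app --component=web
//   worker: bin/app --component=worker
//   cron:   bin/app --component=cron
package main

import (
	"context"
	"flag"
	"log"
)

func runWeb(ctx context.Context)          { /* assumed: dashboard + Slack webhook API */ }
func runPubsubWorker(ctx context.Context) { /* assumed: Pub/Sub Receive loop */ }
func runClock(ctx context.Context)        { /* assumed: scheduled jobs */ }

func main() {
	component := flag.String("component", "web", "which component to run: web, worker, or cron")
	flag.Parse()

	ctx := context.Background()

	switch *component {
	case "web":
		// Keeps serving customers even if a poison message is crash-looping
		// the worker dynos.
		runWeb(ctx)
	case "worker":
		runPubsubWorker(ctx)
	case "cron":
		runClock(ctx)
	default:
		log.Fatalf("unknown component %q", *component)
	}
}
```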

Courtney Nash:

The other thing that, you know, I was fascinated by with this is that you use your own product, because it's an incident response tool. My favorite case of these kinds of incident things was when AWS went down, and then the AWS status page relied on AWS, so they couldn't update the status page. Like, they fixed that finally, I think. But what's your strategy going forward for using your own product...

Lawrence Jones (Incident.io):

yeah, I mean,

Courtney Nash:

When your own product goes down?

Lawrence Jones (Incident.io):

I think honestly we will definitely continue to do this, and the benefit of us using it is, quite frankly, I mean, there's one reason that we're able to sell this thing, and that's because we really do believe in it. It makes you so much better at responding to incidents. And what I've learned from how this app has behaved over the last year and a half is that when it does go kind of wonky, it's either in a very local area, or eventually it does kind of catch up on itself. And, honestly, a lot of the value that you get is the creation of the incident channel, the coordination of everything, and the app only needs to be up for, you know, a brief moment to do all that stuff. So immediately you get the incident channel, you've escalated, you've invited everyone, and that in itself is just, people call it like the first few minutes of an incident, right? And it manages to shorten those and get you in the right place. And from that moment on, you're able to coordinate. It would be a similar situation if Slack goes down. That's obviously an issue for us, because people are using Slack to drive our product and we use Slack internally to communicate. Now, we have plans around how we would communicate if that was to go down, but then also there's limited ways that we can help. At that point, we're waiting for Slack to bring things back up, and, quite frankly, our customers, who are also at that time having incidents, are impaired themselves. Very few people have the backup plan or the policy to go somewhere else when Slack goes down. And if they are, then they're usually going for, I mean, not as ridiculous as this, but like collaborative Microsoft Paint in a...

Courtney Nash:

Yeah.

Lawrence Jones (Incident.io):

it's not a tool that's going to be particularly familiar to them, or they're trying to communicate via, like, a Google Doc link. And at that point everyone's kind of in degraded mode and you're waiting to bring these things back up. But yeah, we have backup plans. At the moment, anyway, this is working really quite well for us, and I'd rather solve the problem, if it's at all possible, by making our app incredibly robust and investing in it like that.

Courtney Nash:

So, you collect a lot of data on your customers' incidents, and you look at a lot of incidents from that perspective. I've seen a few things you've written about that that I thought were really interesting. Is there anything from looking at your customers' incident data and incident response that's informed how you all think about incident response in general?

Lawrence Jones (Incident.io):

We work really closely with a lot of our customers, especially our larger ones. It's genuinely fascinating to find out how different organizations are structuring their incident response. So we often learn from our larger customers about the processes that they've created, especially out of the back of incidents. So things like reviewing incidents, and kind of everything that happens after you've closed one: incident debriefs and retrospectives, and writing postmortem docs, and processes around following up on incident actions, and things like that. We often look to them to give us a view: as a very large org, how are you managing this, and how can your tooling help you? There's also just a ton of stuff that we're doing around insights and helping those organizations understand their own behaviors when it comes to incidents. So I think if you look collectively, for example, over all of our incident metrics, it's really fascinating that you can kind of see a hole in all of our graphs around Christmas. Stuff like that is really interesting for someone who's interested in incidents, because you go, well, actually that's the combined impact of people putting in change freezes, or people going on holiday and just simply changing things less. There's also generally all the stuff that happens around the holidays that kind of leads to this sort of thing. But you can help a larger org, for example, identify seasonality in their trends. Or, I hope, in future we can take some of this data and, with permission from our customers obviously, understand it a bit more in terms of how the industry is responding to incidents, and try and help people, for example, implement the stuff that we know empirically is working for some of our other customers. Yeah, there's a huge amount of potential in incident data. I'm super excited about it, having spent a lot of time working on our insights product. I think it's a very untapped area; when you have this type of data on how people are running their incidents, you can do lots of really fascinating things with it.

Courtney Nash:

You have this richness of the people involved, and sort of what those communication webs look like, and kind of the patterns and ebbs and flows of those things, which I think tells you so much more about how an organization's handling their incidents than the numbers we've had so far, as an industry.

Lawrence Jones (Incident.io):

There are many things that we're looking at at the moment, like around operational readiness and helping you get a pulse on that, for example. But one thing that really lands with a lot of our customers at the moment, as a result of us having this depth of data, is being able to automatically assign a number of hours that you've invested to each incident. Now, if you've got organizations who are running incidents from small all the way up to very large, and you can segment that data by the users, so the people who are actually responding to the incident, and also by causes of the incidents, or services that are affected, and things like that, you can do some really, really cool things. We, for example, are able to, where we've had our product roadmap and we've had certain projects come out, draw lines between that and a spike in the amount of workload that we've had to put into incidents, especially related to, for example, Jira. If we've just built a Jira project, you can come up with a cost at the back end of this project: how much is it going to cause in terms of operational workload, so that you can predict and understand a bit more about how your organization is working. And that's stuff that I had never been able to do before. I mean, I think we've all built those spreadsheets at companies...

Courtney Nash:

Yeah, but you don't believe them. Like, you don't believe them, and hand waving...

Lawrence Jones (Incident.io):

Yeah. And it's not your job to spend all your hours full-time trying to figure this out. Whereas we've sat there and we've got some way of trying to detect when you're active in an incident, and done all the stuff like slicing it and making sure that we don't double book you across the hour in multiple incidents. It comes together into something that is a lot more of a rich picture of how you are responding to incidents across your organization, and that's where I'm really excited to try and do this. Because I think for people in leadership positions at these companies, they're so far away from the day to day, but they're so very much interested in it, and it's kind of this horrible contrast where the further away you get, the more interested you are in it, the less data you get, or the less you can trust the data that you're getting out of this stuff. And I think it helps in both of those situations, right? A tech lead who wants to articulate that their operational workload is increasing can be helped humongously if you can provide them with some of that data to back up their claim, and then the executive that they're speaking to, asking for more investment, can see the ROI that they get from putting in the investment. That's the sort of stuff I think we should be trying to get out of our incidents, and if we can, then, yeah, I'd be very happy with it.

Courtney Nash:

Oh, I was just on a podcast talking about this, like how data driven our industry is, to our advantage, and I think sometimes to our detriment. Right? Of...

Lawrence Jones (Incident.io):

course

Courtney Nash:

There's plenty of things that we should do that we just can't get the data for; it's very costly, or takes time, or whatever. Or you feel like you're kind of pulling them out of thin air, or your own intuition is probably right, but they're like, well, you just made these numbers up, you know? And so, for better or for worse, when you can get your hands on those kinds of data and you're trying to make a case further up, you know, leadership further away, the blunt end versus the pointy end, that's a huge advantage to engineering teams who know, after these kinds of incidents, the kinds of investments they want to make.

Lawrence Jones (Incident.io):

Yeah, so I think, especially around training and making sure that people are onboarded correctly, we've got a ton of plans going forward into this year. One of them is being able to model this idea of what someone needs to have done to be recently fresh for an incident. If you look at our situation here, one of the issues that we had was really that we want a corpus of people who are able to respond, or have recently responded, to a major incident or something of that kind. Now, you can express that either as they've dealt with a real major production outage, or they've run a game day recently that helps them simulate it. Either one of them is fine. And we are planning on helping you track that within the product and express it as well. I want to see how many people within the last 60 days have met these types of criteria. And then when you do that, hopefully over time you can see, as you hire more people, that you might have lost some of your more tenured staff, so the number of people who are ready and able to respond might have decreased. And it's the type of stuff where, I always think about interviews: if you've ever been upskilled for interviews at a company, you often have that spreadsheet for the onboarding, and it's kind of tracked very manually. I want something very similar for incidents, but a lot more automated, that can help us keep a pulse on how many people are actually ready and onboarded, ready to respond to this stuff. It's super important that you know that the one person who has recently responded to Postgres incidents has left the company. That shouldn't be a thing that you should have to guess.

Courtney Nash:

Or well or find out the hard way.

Lawrence Jones (Incident.io):

Or find out the hard way. Yeah. You do one or the other, I guess. Exactly.

Courtney Nash:

Is there anything else you want to wrap up with or tell folks about, about Incident.io?

Lawrence Jones (Incident.io):

If you just head over to Incident.io, then you can book a demo, and we'll do our best to make sure that our services are running happily when you try and take the demo... yeah.

Courtney Nash:

Perfect!