The VOID

Episode 1: Honeycomb and the Kafka Migration

November 01, 2021 Courtney Nash

"We no longer felt confident about what the exact operational boundaries of our cluster were supposed to be."

In early 2021, observability company Honeycomb dealt with a series of outages related to their Kafka architectural migration, culminating in a 12-hour incident, which is an extremely long outage for the company. In this episode, we chat with two engineers involved in these incidents, Liz Fong-Jones and Fred Hebert, about the backstory that is summarized in this meta-analysis they published in May. 

We cover a wide range of topics beyond the specific technical details of the incident (which we also discuss), including:

  • Complex socio-technical systems and the kinds of failures that can happen in them (they're always surprises)
  • Transparency and the benefits of companies sharing these outage reports
  • Safety margins, performance envelopes, and the role of expertise in developing a sense for them
  • Honeycomb's incident response philosophy and process
  • The cognitive costs of responding to incidents
  • What we can (and can't) learn from incident reports

Resources mentioned in the episode:


Published in partnership with Indeed.

Courtney Nash:

I'm your host, Courtney Nash, and welcome to the inaugural episode of The VOID podcast. Today I'm joined by Liz Fong-Jones and Fred Hebert of Honeycomb. We are going to be talking about a Kafka-related multi-incident report that they recently published. I'll start by asking Liz: what motivated you all to write this meta report in the first place?

Liz Fong-Jones:

We are a company that is very transparent, and we try to be candid about both our engineering successes and our engineering failures. It's part of our commitment to really foster this ecosystem in which people openly discuss incidents, what they learned, and how they debugged them, because that is our bread and butter. Our bread and butter is helping you debug incidents better. Sometimes that's with Honeycomb's tool itself, and other times it's with the lessons that people learn from our own explosions.

Fred Hebert:

From my perspective, it aligns with this idea that there's a trust relationship between us and our customers and users. We have the huge pleasure of having a technical audience and set of customers. If we're seen as going down or being less reliable, being very transparent about the process is a way to retain trust, and hopefully to make it something where the customers themselves appreciate being able to understand what went on. To me, the ideal thing would be that they're almost happy when there's an incident at some point, because they know there's going to be something interesting at the end of it.

Liz Fong-Jones:

Honeycomb at this point doesn't tend to go down because of the easy things; Honeycomb tends to go down because of the hard things. And therefore they're always noteworthy to our customers. It's also an interesting signal for people who are choosing whether to build or buy an observability big data solution, whether they should really be building it themselves. And certainly seeing the ways in which our system has failed is a lesson to other people in "you probably don't want to do this at home."

Courtney Nash:

Can you say a little bit more about why it is that your failure modes or incidents are not the garden variety? Because people might think, "oh, Honeycomb is so sophisticated, they shouldn't have really crazy outages or really big problems."

Liz Fong-Jones:

The simple answer is that we have processes in place, we have automation in place, that ensure that the typical ways a system might fail at first glance, like pushing a bad deploy or having a customer overwhelm us with some kind of traffic, are failure modes we can deal with automatically, or in a matter of seconds to minutes. So any kind of longer-lasting performance impact or outage is something that is hard to predict in advance, and that normal automation and tooling is not going to do a good job of dealing with. It's the "if it were simple, we would have done it already" kind of thing. Whereas we know that a lot of companies in the software industry find themselves with a lot of technical debt; they are not necessarily doing the simple things, and they need to do the simple things first. In our case, we've been proactive in paying down the technical debt, and that's why we're in this situation where most of our failures are interesting failures.

Fred Hebert:

The complexity of a system always increases, just to be able to scale up. Usually the easy early failure modes are at some point stamped out rather quickly, and then the only thing you're left with is the very, very surprising stuff that nobody on the team could predict at the time. Those are what's left: fuzzy, surprising, interesting interactions causing these outages. So it has to do with the experience of the engineering team, and the team in general, and what can surprise them. And for Honeycomb, our engineering team is pretty damn solid, so we're left with some of these issues. This incident specifically is interesting because the trigger for it is what could be considered just a typo or miscommunication, the kind of stuff that you usually don't want to see in a big incident like that. But for us, it was important to mention that this was the starting point, or the direct trigger, of the whole thing.

Liz Fong-Jones:

The other interesting thing about this incident is that we did it in the course of trying to improve the system and make it more reliable, which was particularly ironic.

Fred Hebert:

So every couple of years, or at a given time, we need to scale up the Kafka cluster that's at the center of the ingestion pipeline we have for Honeycomb. And every time there's also a bit of a re-evaluation that comes with it: what are the safety margins that we keep there, which could be the retention buffer in Kafka that we have, how many hours, how much space? How do we minimize the cost while keeping it safe, and all of that. By the end of last summer, in 2020, Liz and a few other engineers before my time joining the company had started this project of looking into the next generation, the next iteration, of the Kafka cluster. And starting around December/January, until the spate of incidents that was February to March, I think, the changes were being put in place to change the version of Kafka that we have, to go for something that Confluent does that handles tiered storage: instead of keeping everything locally on the device, it sends a bunch of it to S3 and keeps a smaller local buffer. For us, the major scaling point was always disk size and disk usage on the instances in our cluster; we had something like 38 instances. Moving to that would let us do the same amount of throughput with more storage and better safety on something like six instances, which could be cheaper to run. And we would move away from something where the disk is the deciding factor to something where we can scale based on CPU, based on RAM, and disk is no longer the limiting factor. So we were going to update the Kafka software, change the options that were in there, adopt the Confluent stuff, and try to improve the stack at the same time, because it's a big, complex, critical thing. All the small updates, you wait to do them until you do the big bang deploy, because it's scary to touch it all the time.
The spate of outages came from all these rotations and changes that we were slowly rolling out over the course of multiple weeks while regaining control of the cluster. In some cases it had to do with rotten tools; it had to do with confusing parts of our deployment mechanisms and stuff like that, which were not necessarily in the public reports, because it would not be useful for all our customers to know about that. And then we had the biggest issue, which was a 12-hour outage where the Kafka cluster was going down and out of capacity because we picked the wrong instance type for production.
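For readers unfamiliar with the feature Fred describes, Confluent Platform's tiered storage is enabled through broker configuration. The sketch below is drawn from Confluent's public documentation, not from Honeycomb's actual setup; the bucket name and retention values are hypothetical:

```properties
# Enable Confluent tiered storage on the broker (illustrative values only)
confluent.tier.feature=true
confluent.tier.enable=true
confluent.tier.backend=S3
confluent.tier.s3.bucket=example-kafka-tiered-storage
confluent.tier.s3.region=us-east-1

# Keep only a small "hotset" of recent segments on local disk;
# older segments are served from S3
confluent.tier.local.hotset.ms=7200000
```

With older segments offloaded to S3, local disk stops being the scaling bottleneck, which is the shift Fred describes from disk-bound scaling to scaling on CPU and RAM.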

Liz Fong-Jones:

The other interesting bit about this, of course, is that we had successfully deployed all of these changes in a pre-production environment over the course of October, November, and December. None of these issues surfaced in the smaller dogfood cluster; they only really surfaced once we started doing the production migration. So we talk a lot about this idea of "test in production, test in production!" and this doesn't mean that we don't test in production, but this set of outages really exemplifies that no pre-production environment can be a faithful reproduction of your production environment.

Courtney Nash:

The very things that you just described there, Fred, could be considered technical, but there's a whole bunch of sociotechnical things in there, right? Decisions and forces and pressures, and fear of big bang changes, so you batch everything up. Those are very much socio-technical decisions. I don't see this language a lot in incident reports, and I would love it if you could talk a little bit more about what you mean by "safety margin." Are those codified? Are they feelings? What is a safety margin at Honeycomb?

Liz Fong-Jones:

This is why I'm so glad that we brought Fred on board. Fred joined the company very, very recently, and this is Fred's language, right? Socio-technical systems, safety margins.

Fred Hebert:

Right, right. So in this case, it's a thing I noticed in one of the incidents. In fact, it's one of the near incidents that we had; it wasn't an incident, and it was the most glaring example of that I have seen at a company. We had this bit where we were deploying the system, and I can't even remember the exact cause of it, but at some point the disks on the Kafka instances started filling up. We were reaching like 90%, 95%, 99%. It was not the first near outage that we had. Liz was on the call and had the idea of saying: run this command, it's going to drop the retention by this amount of time and this size. At like 99-point-something percent disk usage on the Kafka cluster, we managed to essentially cut back on the usage, drop the retention, and avoid the disaster of the entire cluster going down because all the disks were full. It would have been a nightmare.
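For context, lowering retention on a Kafka topic is typically done with the stock `kafka-configs` tool that ships with Kafka. This is only a sketch of that kind of intervention, not the command actually run during the incident; the topic name and broker address are made up:

```shell
# Cut a topic's retention to ~2 hours (7,200,000 ms) to free disk space.
# Kafka's log cleaner will then delete segments older than the new limit.
# (topic name and broker address are hypothetical)
kafka-configs --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name ingest-events \
  --alter --add-config retention.ms=7200000
```

Shrinking retention frees disk immediately at the cost of replay buffer, which is exactly the margin trade-off Liz and Fred go on to describe.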

Liz Fong-Jones:

It's trading one margin, one mechanism of safety, for another, right? We traded off having the disk buffer, but in exchange we lost the ability to go back in time, to replay more than a few hours of data, whereas previously we had 20 hours of data on disk.

Fred Hebert:

Right, for me that's the concept of a safety margin. We had that thing where ideally we have like 24 to 48 hours of buffer, so that if the core time series storage has an issue where it corrupts data, then we have the ability to take a bit of time to fix the bug, replay the data, and lose none of the customer information. For me that's a safety margin, because that 24 hours is the time you have to detect an issue, fix it, roll it out, and replay without data loss. And when we came to this near incident, where almost all the disk was gone and we'd have a huge availability issue, that trade-off was made between the disk storage and that margin of buffer: given we're at 95%, 99% right now, the chances are much, much higher that we're going to go down hard on full disks than that we're going to corrupt data right now. So you take that extra capacity on the disk, in this case quite literally, and you give it to something else, in this case to buy us two, three hours so that we can understand what's going on, fix the underlying issue, and then go with it. Rather than having an incident that goes from zero to a hundred extremely quickly, we have three hours to deal with it. This speaks to the expertise of the people operating the system, that they understand these kinds of safety margins and measures that are put in place in some areas. Those are essentially anti-optimizations, right? The optimization would have been to say, "we don't need that buffer, we only need one hour and that's about it," and then you move on. But there are inefficiencies, sort of pockets of capacity, that we leave throughout the system to let us cope with these surprises. And when something unexpected happens, people who have a good understanding of where all that stuff is located are able to tweak the controls, change the buttons, and turn some demand down to give the capacity to some other purpose.
That's one of the reasons why, when we talk about sociotechnical systems, the social aspect is so important. Everything that's codified and immutable gets tweaked and adjusted by the people working in the system, and I try to be extremely aware of all of that taking place all the time.

Courtney Nash:

There's one phrase in the report that caught my attention, and it's what you're describing here, so I just want to ask a little bit more about it. I'm going to read the phrase out. You wrote: "We no longer felt confident about what the exact operational boundaries of our cluster were supposed to be." I'm really curious who was involved in those conversations. You say "we"; was that you, Liz, one other person, three other people? Take me back into those conversations a little bit, and into how you reached that point of not feeling confident. What did it look like to get there?

Liz Fong-Jones:

I think going through the cast of characters is really interesting, because I think that's how we wound up in this situation. Originally, two years ago, we had one Kafka expert on the team. Then I started doing some Kafka work, and then the Kafka expert left the company, so it was just me for a little while. Then the platform engineering manager made the decision of, okay, we think we're going to try this new tiered storage thing from Confluent, let's sign the contract, let's figure out the migration. And we thought we had accomplished all of it. Then we had one engineer from our team, Martin, sign up to finish the migration, right? It's already running in dogfood, make it run right in prod. And then we started having a bunch of incidents. That's when Fred, Martin, the platform engineering manager, and me, all four of us, sat down together to figure out where we go from there.

Fred Hebert:

Yeah, and we had extra help from other people just pitching in, Ian and Dean, who are on the team, if ever they're listening to this. At some point the incidents last a long time and people lend a hand and help each other. But there was this transfer of knowledge, and sometimes pieces fall on the ground, and that's where some of the surprises come from. The lack of confidence, in one aspect, is just this idea that people are tired of the incidents, and you kind of see that easily. The fact that you no longer trust what you know; the pre-prod environments are not reliable; we don't know what's going to happen every time we touch it; there's this feeling that something will explode. For me, it's not something for which you can have quantitative metrics. It's something you have qualitative metrics for, which is: do you feel nervous around that? Is this something that makes you afraid? Getting a feeling for the feelings that people have towards the system is how you kind of figure that one out. Someone who feels extremely confident is not necessarily going to take extra precautions for the sake of them, but everyone was sort of walking on eggshells around that part of the system, and feeling, or self-imposing, pressure about the uptime, because people take pride in the work they do.

Courtney Nash:

The feelings part is really interesting, right? How people build up expertise in these kinds of systems and start to understand those boundaries much more, sometimes very intuitively. In the writeup, you mentioned some near misses, and those ended up being good training for the bigger incident, that big 12-hour incident that eventually happened. But most people don't say, "yay! We had an incident, but we learned something from it." Can you talk a bit about some of the near misses, and that aspect of them being training for the bigger incident?

Fred Hebert:

The obvious one is probably the 99% disk usage one, which, you know, Liz sort of saved the day on by chiming in and coming through. But we had other ones that had to do with a bad deployment, where the deploy script we thought we had didn't work exactly the way we thought it did and caused this sort of cascading failure, or near failure. In that case, I recall people were on a call, specifically Martin and Dean, and then a few other onlookers, and I came in and watched. But Liz made a point of remaining back on chat and monitoring the status updates, and this was a great move, because there's also a sort of semi-explicit policy to avoid heroes at the company: one person concentrating all the knowledge. In the first one, where we had the 99% near outage, Liz was there to explain to Martin and Dean the sort of commands to be running. And when the other incident happened, Liz stayed back and let them do their thing, while still keeping an eye from afar, without coming in and saying, "I know how to fix this one, I'm going to fix it for you." For me, that was a transfer of knowledge that happened over the course of multiple incidents, and it was only this visible because they were so close together.

Courtney Nash:

Liz, I think that's a really interesting and very obviously conscious choice about the role that you played, allowing the people who run the systems to understand the properties of their systems. Can you talk a little bit about what incident response looks like at Honeycomb, how it's structured, and how you all tend to handle it in general?

Liz Fong-Jones:

Yeah, incident response at Honeycomb is particularly interesting because we expect all engineers to be responsible for the code that they push. If something that you pushed breaks production, you are the immediate first responder. Now, that being said, we also have people who are on call at all times; there are typically two to three people on call for Honeycomb at any given time. Those are the first people who will jump in and help, and also the first people to get alerts. After that, it's "are we declaring an incident? Who's going to become the incident commander?" and then how do we delegate and assign the work so that people are not stepping on each other's toes. At the time that both of these incidents broke out, I was neither the person who proximately pushed a change nor a person who was on call, and therefore it was a matter of coordinating and saying, "hey, can I be helpful here? What role would you like me to play?" In the first one, I said that this looked like it was on the brink of going really, really bad; here's a suggestion, would you like to do that? And the team decided to go with it. It's not like I stepped in and ran the command silently and no one knew what happened. What's interesting is that for the second one, I was actually not available that day. I was not available to hop on the call; I was really tired, in fact, from the work in the previous incident. So part of me staying out of it was not my choice, and part of it was that the team hadn't asked for my help, so I was just going to keep an eye on things and make sure that nothing was about to go off the rails. Other than that, we have some safety room in our systems for people to try to solve the problem in whatever way they deem acceptable, even if it's not necessarily the most efficient one.

Courtney Nash:

You mentioned something really important in the report in a couple of places, which is also not something I see in almost any other incident write-up, and you allude to it, Liz. "Cognitive costs" is the phrase you use, Fred: when people experience these kinds of cascading or repeating or related or close-together incidents, people get tired. These are stressful things. Like you said, Fred, people are deeply invested and take pride in their work, and it has all of these other kinds of costs that aren't just technical. Liz, you've alluded to that for yourself personally, but Fred, maybe you could talk a little bit more broadly about what those cognitive costs looked like for the team over the course of this set of incidents.

Fred Hebert:

For me, the basic perception there comes from the learning-from-incidents community and this idea of blamelessness, and not having what I like to call shallow blamelessness, where you say "we don't blame someone," and what it means is that we don't do retribution on anyone, but we still assume that all the mistakes come from someone fucking up at some point. You have to take the perspective that people come here to do a good job; they were making decisions that were locally rational based on what they were perceiving. So a lot of these investigations, for me, are oriented around the idea of: what were the signals people were looking for? What made this look reasonable? What was the situation they were in? That's what influences how they interpret things as they happen. So the cognitive cost, for me, is in some cases this idea that when you have to make these decisions: how many things do you need to keep track of? How many signals are coming through? What are the noises? What's your capacity to deal with them? Are you busy doing something else? Are you tired? Because it all has an impact on the quality of the work we do and the kinds of decisions we make, and on whether we can actually keep track of all of that. So for me, the cognitive cost is the burden of keeping everything, and all the interactions, in your mind.

Liz Fong-Jones:

And that also goes to the question of how many people are on call for Honeycomb, because originally the answer was one, and then it grew to two, and then it grew to three, because the surface of Honeycomb increased such that you could no longer have one engineer remember how all of the pieces of the frontend work, how all the pieces of the backend work, how all the pieces of our integrations work. That's why we divided and conquered the problem space at Honeycomb.

Fred Hebert:

Everyone has their own mental model of how the system works, and a mental model is never up to date; it's never perfect. It's based on experience, and the mental model is how you make predictions: "I'm seeing this happen, and by my understanding, this could be the cause, this could be the relationship with the other components." The cognitive burden is also the capacity of tracking that mental model, how complex it needs to be to make good predictions. It doesn't necessarily need to be very complex, but the moment it becomes outdated, the signals you see are no longer interpreted to mean what is actually going to happen in the system. And this is normal, right? This sort of drift is normal. So for me, that all speaks to the cognitive burden: everything is too complex to understand, there are many things going on, people are tired. What were they seeing? How were they interpreting it? Rather than asking how we could have prevented this incident from happening, the question becomes: how do we change the conditions so that next time something as surprising happens, we either come with a different preparation, or the signals are made more legible to the people operating the system?

Liz Fong-Jones:

I love what Fred said there about the difference between the sophistication of the model versus the freshness of the model. I think that's something we talk about all the time when we think about Honeycomb's product philosophy and design, which is that Honeycomb shouldn't get in your way, Honeycomb shouldn't make decisions for you. You need to be in the driver's seat; you need to be developing that mental model yourself. Otherwise, if we take agency away from you as an operator, you are going to do a worse job of operating the system over time, because you don't have exposure to the context and the signals to know when your mental model is out of date.

Courtney Nash:

This conversation brings to mind two pieces of scholarly work that we don't have a lot of time to get into, but I'll drop some resources for listeners in the list of resources for this podcast. One is Laura Maguire, a researcher at Jeli now; her work was on managing the hidden costs of coordination, which is what you were sort of talking about, Fred. It's not just the individual cognitive load and cognitive costs; when the surface area of your systems becomes more complex, you need more people involved in incidents, and you need all those people and their mental models, and you have to deal with all of that too. It's this whole other system on top of the system. The other piece is Richard Cook, a researcher who's spent a lot of time on complexity and safety in these kinds of systems, who talks about above-the-line and below-the-line thinking. I'll drop some resources related to that for anyone who wants to dig in a little more, because it speaks to the notion of what we think the system looks like, and how humans step in, fill gaps, and make things work when things go wrong. There's one more thing I really wanted to get to about this. Well, there's two. One was something you alluded to, Fred; you said, oh, well, here are some details that our audience doesn't need. What I thought was really interesting about the meta writeup you did is that there are two versions of it, by the way, for folks: if you spelunk the one that I'll link to, there's a link to a much longer version with a lot more of the engineering details and that engineering background, but that's still two degrees removed from what might be known internally at Honeycomb, obviously. I would love to get your perspective on how you write these and who you're writing them for. I think we all know that what's available in public write-ups is not the whole story, obviously. I think yours is much more of the story than almost anyone else ever gives us, but I'd love to get a little bit of your take on how you all approach that.

Fred Hebert:

Yeah. Initially this was a joint report of multiple incidents, and the reason for that is, first of all, economical: we could have done four or five incident reports of near misses and everything like that, but everyone felt that would take a lot of time. So the task I took on as site reliability engineer was to make this one overview of the decisions that were made throughout the project, the kinds of things that were happening, figure out what the surprises were, and make a sort of inventory of the lessons learned. Being a bit of a retrospective, it's a bit like being a project historian: I go and dig into the chat logs and the older documents, try to rebuild the context, see what happened, and build it that way. In that way, the internal report is very much for the internal audience: we felt that the deployment system worked this way, here's how it works instead, here are the fixes that we made, and here's this sort of typology of surprises that we might see. The things about the incident itself that you see in the public report were all already in the private one, and picking the audience then is a question of "what are the things they might be able to learn from that?" I tend to go very, very much in depth. I think right now the full public report is 13 pages; the internal one was something like 26 or 27 pages. And for me, that's kind of my personal challenge: I write and I talk a lot, and so it needs to be trimmed down, because there's this balance between how much information you want to put in there and how much attention people have to give it to get the stuff that's really important out of it.

Liz Fong-Jones:

The question basically is: how much of this is going to be similar to what someone else might see? Our build pipeline is bespoke to us; our deployment tooling is bespoke to us. It doesn't make sense, you know, to talk about the details from which our mental model is built up, because no one else has a mental model of that. Whereas everyone in the world who operates a streaming data solution has a Kafka, and they need to hear about how the Kafka works.

Fred Hebert:

And the interesting exercise here is that the full public report is 13 pages, but then there's the blog post, which is not even a third of that; it's just under 2,000 words. So this one is even more boiled down: what is the most interesting thing about this spate of incidents that people who have like 15 minutes are going to get out of it? It's an interesting exercise because it forces you to figure out, okay, what's really the core thing I would like someone to remember from this? In our case, it was this idea of the shifting dynamic envelope of performance that lets you predict how a thing behaves. For the full report, there's the interesting stuff about the bugs we've seen, the issues we had with some of the Confluent stuff, with some of the processors that we have, the EBS drive issues that we encountered. In the internal one, there's this focus, this kind of approach, of "here's how we can build or improve our own tooling." It really depends on the audience, right? Not all the same facts are relevant to the same people, depending on where they're interpreting from.

Courtney Nash:

The last thing I wanted to discuss is this culture of sharing. I know you've mentioned this in the context of your customers, and wanting your customers to understand what happens when you have incidents, but maybe you could talk a little bit, Fred, about your perspective on the importance of these kinds of reports for the software and technology industry as a whole. I'd like to know: should everybody do this? What do you think?

Fred Hebert:

I think more of everybody should do this. For Honeycomb, we have this interesting fact where it can line up with some of our technical marketing, so it's easier for us to do than in a lot of places. But I think there's a lot of value there. The tech industry, in my mind, is really keen on commoditization and externalization, whether of components or of expertise. Everyone uses these frameworks, assuming that people pick up the knowledge of how to operate them in the wild, in their free time, at previous employers, and then just bring that to the table when they show up. This is deeply entrenched in the tech industry. And so for me, part of it is that the way tech industry workers have gotten around that is to have these parallel systems where they do share the knowledge: the types of conferences that we have, the blog posts, and whatnot. Having these reports, for me, is that idea that we're benefiting from that commoditization, and it should be normal to give back and share some of that knowledge with everybody else. Because, you know, we haven't had to write Kafka, we're using Amazon for a lot of components, and we're making drastic savings on a lot of open source projects that people usually work on in their free time. It's only fair to return the knowledge to other people as well.

Courtney Nash:

So there's the fairness aspect, which I think is incredibly important, and a rising tide lifts all boats. Software is running so much of our world now, and some of it is incredibly safety-critical: healthcare systems, financial systems, voting systems, God help us all. Much like other industries, I think notably the airline industry: they took on this mantra of sharing this information, of being transparent, and not just transparent for transparency's sake, but because it was going to increase the safety profile of everyone.

Liz Fong-Jones:

It definitely, though, was a place where they had to offer certain protections, right? When you report an aviation safety incident to the NASA system, you are protected from action by the FAA. I think that's a huge thing in getting people to self-report, and that's kind of where Honeycomb can go first, because it's a competitive advantage for us to self-report, and we do have a culture where engineers speak up freely, even if we're going to publish details afterwards, because they feel safe from retaliation, they feel safe from blame. That's not necessarily the case everywhere, and it's something that we, as an industry, are going to have to work on.

Fred Hebert:

There was recently an article about a big company having an incident, and the entire thing blames one worker for not respecting procedure. There's this super interesting paper that I know you can put a reference to in the show notes: "'Those found responsible have been sacked': some observations on the usefulness of error" by Richard Cook. It mentions that, as an organizational defense, the idea of error, and human error specifically, is a kind of lightning rod that directs all the harmful stuff away from the organizational structure and into an individual. And so the organization is able to ignore all the changes it would have to make in terms of operating pressures and just say, oh, this was a one-off, and next time we're going to respect the procedure harder. Having the ability for us to also put reports out there in the wild that act as good examples of what we think is a humane, respectful, and helpful report can help counteract these, I would say, directly bad ones that smaller companies or people might emulate just because this is what the big companies do. So there's this importance of putting it out there and giving a positive example.

Courtney Nash:

I think you've done exactly that. It's a long road to get companies, especially much larger ones in highly regulated environments, to move in this kind of direction. I want to thank you both for having that approach, for sharing it with us, and for joining me today. Thank you both so much.

Fred Hebert:

Yeah. See you on the next incident because they're going to keep happening.