The VOID

Episode 8: A Tale of A Near Miss


On this episode of the VOID podcast, I’m joined by Nick Travaglini, who is a Technical Customer Success Manager at Honeycomb. Nick wrote up a near miss that his team tackled towards the end of 2023, and I’ve been really wanting to discuss a near miss incident report for a very long time. What’s a Near Miss you might ask, or how is that an incident, or is it? What IS an incident? Keep listening, because we’re going to get into those questions, along with discussing whether or not it’s a good idea to say nasty things about other companies in your incident reports. 


Courtney:

Greetings, fellow incident nerds. On this episode of The VOID Podcast, I'm joined by Nick Travaglini, who is a Technical Customer Success Manager at Honeycomb. Nick wrote up a near miss that his team tackled towards the end of 2023, and I have really been wanting to discuss a near miss incident report for quite some time now. What's a near miss, you might ask? Or how is that an incident? Or is it? What IS an incident? Keep listening, because we're going to get into those questions, along with discussing whether or not it's a good idea to say nasty things about other companies in your incident reports. Nick, super excited to have you joining me on the VOID podcast today. For those listening, I've known Nick for some time, so this is, in some ways, an extra special treat. For folks who might not know you, lucky them, they get to know you now. Why don't you tell us a quick bit about yourself before we dive into this report?

Nick Travaglini (He/Him):

Great. Thank you, first of all, so much for inviting me on to the show. I really appreciate it. My name is Nick, he/him pronouns. I work at Honeycomb as a Technical Customer Success Manager. I have a background in philosophy, organizational studies, and science and technology studies; I like to dabble in a bunch of different things. I started working in the tech industry after finishing my undergrad, my first big kid job, at a SaaS CI/CD business that ended up getting acquired by GE. I worked there for a while, worked at another company before doing a master's, and from there came to work at Honeycomb in the DCS patent program.

Courtney:

I love how people who are drawn to resilience engineering, learning from incidents, all this stuff, have these unique backgrounds: philosophy, psychology, all of these things. It's, in some ways, a hallmark. Not a requirement, I feel like, but maybe a hazard of caring about the things we care about. And it comes through in this report that you've written up, actually, and we're going to get into some of the details. So, you wrote a post for Honeycomb called Preempting Problems in a Sociotechnical System, and we'll get into some of that, especially the one big word in there. Can you, and this is so unfair, but I do this to everyone, can you do the sort of TL;DR version of what happened? And then, like I said, we'll dive into the details.

Nick Travaglini (He/Him):

The blog post is about an almost-incident that we had at Honeycomb a couple of years ago; this is 2023. The situation was that OpenTelemetry, the open source project, whose goal is to create an industry-standard way of instrumenting your services so that they emit telemetry data you can send to the analytics platform of your choice, was updating its semantic conventions. That is, the syntax for how things like HTTP requests are recorded by the instrumentation and then sent to that analytics platform was getting updated. There had been an announcement that this was happening several months before the hard cutover was set to occur, and all of the libraries involved that had to make this change were, for the interim period before the hard cutover, supposed to dual-send with both the old syntax and the new syntax. Then, on a particular date, they were permitted to stop sending the old syntax. Now, what happened was that this date came and went, and several engineers at Honeycomb who participate in the project, and we contribute quite a bit to it, knew that this was happening. It was going to be required to be in release notes, the sort of quote-unquote normal mechanisms of letting people know that a change is happening would convey it to engineers, and it's all well and good. A couple of days after that date, a member of my team over in the customer success department flagged it to the rest of us, saying: hey, I see that this could be really problematic for people, because a lot of customers of Honeycomb rely on a sampling service. We have open sourced a project called Refinery. It's a means of sampling this telemetry data, so you only need to keep, say, 1 in 10 of the data that conforms to specific classifications, and you define those classifications, and how you want to define your sampling rules, in a config file. It's just a YAML file that's part of Refinery, but you have to hard-code in the syntax. So when you're defining things like, hey, I want to keep 1 in 10 of my HTTP requests, it's got a particular syntax that it's looking for. And so if that changes, then Refinery doesn't know to do that, and that means it's probably just going to let through everything with the new syntax. And…
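For illustration, a hard-coded sampling rule of the kind Nick describes might look roughly like this. This is a hypothetical sketch in the spirit of a Refinery rules file, not necessarily Refinery's exact schema. The field names here are from the older OTel HTTP semantic convention (`http.method`, `http.status_code`), which the stabilized convention renamed to `http.request.method` and `http.response.status_code`:

```yaml
# Hypothetical sampling rules: keep all server errors, 1 in 10 of other HTTP traffic.
# The Field values are hard-coded strings; if instrumentation starts emitting
# http.request.method instead of http.method, these conditions simply never match,
# and unmatched traffic can fall through unsampled.
Samplers:
  __default__:
    RulesBasedSampler:
      Rules:
        - Name: keep-all-server-errors
          SampleRate: 1
          Conditions:
            - Field: http.status_code
              Operator: ">="
              Value: 500
        - Name: sample-http-requests
          SampleRate: 10
          Conditions:
            - Field: http.method
              Operator: exists
```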

Courtney:

Did you just say hard-coded YAML? I'm just like, oh boy. Sounds fun.

Nick Travaglini (He/Him):

Yeah, I'm not a software engineer; I don't pretend to play one on TV. Being a YAML engineer is about the best I can do. So, if anyone's looking for a YAML engineer…

Courtney:

I say that with love and respect. I have broken an entire website with one wrong line of YAML. Sorry, I did not mean to interrupt, though.

Nick Travaglini (He/Him):

Yeah, so Mike, my colleague, foresees that people will have problems with this. They won't catch that this has changed in their instrumentation. They've upgraded, but potentially they've upgraded to a new version that supports this, they don't realize what they've done, and they start sending a flood of traffic to Honeycomb, and they have to pay for that. Whereas if they were sampling at a 1 in 10 rate, they're now sending 100%, and their potential bill, or at least the rate at which they're sending traffic, goes up correspondingly. And we were really concerned.
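The arithmetic Nick is gesturing at is simple but worth making concrete. A minimal sketch, with made-up traffic numbers:

```python
# If hard-coded sampling rules silently stop matching, the effective sample
# rate drops to 1 (keep everything), and ingested volume rises by a factor
# equal to the old sample rate.
def ingested_events(events_per_day: int, sample_rate: int) -> int:
    """Events actually kept when keeping 1 out of every `sample_rate` events."""
    return events_per_day // sample_rate

daily = 10_000_000                        # hypothetical daily event volume
before = ingested_events(daily, 10)       # rules match: 1 in 10 kept
after = ingested_events(daily, 1)         # rules stop matching: keep all
print(before, after, after // before)     # 1000000 10000000 10
```

So a customer sampling at 1 in 10 would see a 10x jump in ingested traffic, with a correspondingly larger bill, without any component misbehaving.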

Courtney:

Can I jump in for a second here? First of all, a lot of companies would see that as a feature, not a bug. But right. So, I'm not sponsored by Honeycomb, I have no affiliation, but that little bit in the report caught my eye: kudos that you all were trying to do the right thing by your customers on this.

Nick Travaglini (He/Him):

Yeah. So we figured that would be a poor customer experience. Another thing that Mike didn't explicitly call out is that the alerts you can build in Honeycomb based off the data we're receiving are also hard-coded, things like triggers, or burn alerts based on SLOs. These are all going to be affected by this, and so people may not get alerts that something is up with their system where they're anticipating them. So there's more than just the financials at stake. There's also their ability to understand their systems, and our ability to really provide a good service, in the sense of being able to do observability. So we, myself included, go and ask engineering: hey, are you familiar with this? Do you know what's going on? And they say, yes, this is a thing that was announced a while ago. And we're like, oh my God, this is going to cause problems for people. And I declare an incident. We use Jeli, now part of PagerDuty; at the time they were an independent company. I use the Jeli Slack bot to declare an incident, and that starts rallying the troops. We get engineers, folks in CS, folks in product. Particularly, I want to shout out my colleague Mary, who works in docs, who starts pulling together: hey, how are we going to communicate that this is a potential problem to customers? While we're investigating, one of the things here is that these libraries and whatnot have been permitted to upgrade to a version that has the breaking change, but they may not have done so. So that's the thing we realized pretty quickly: we need to go check what the latest version of each is, and when they released that latest version. Pretty simple to actually go check, you just go look in the GitHub repositories. But we need to start communicating this in case something has gone out and one of these has upgraded.

We need to let customers know, to forewarn them in case they haven't upgraded yet, so they can be on the lookout for this. So myself and other TCSMs send quick notes to accounts that we work with: hey, go take a look at this. We're checking on our side, but we want to give you a heads up so you can also take some action here. Turns out, fortunately, none of them had actually upgraded since the date. It was only just a couple of days later, so we got really lucky. So what did we do from there? We obviously sent out that initial quick communication. We drafted up longer-form instructions: hey, this is what you should be looking for. There's actually a post on the Honeycomb blog that details some of this stuff. We reached out to more folks in the OTel community, including maintainers: hey, we've got to do more to publicize this. We're really concerned, because folks are using tools like, say, Dependabot, which will automatically generate a PR that engineers can accept when one of their dependencies releases a new version. But depending on the production pressure and how busy they are, well, there's a lot of dependencies out there.
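The quick check Nick describes, looking at each repository's latest release and when it shipped relative to the cutover, boils down to a date comparison. A minimal sketch; the cutover date and library names are made up for illustration:

```python
from datetime import date

# Hypothetical cutover date after which libraries were permitted to stop
# dual-emitting the old syntax.
CUTOVER = date(2023, 11, 15)

def needs_warning(latest_release: date, cutover: date = CUTOVER) -> bool:
    """A library released on or after the cutover may have dropped the old syntax."""
    return latest_release >= cutover

# Latest release dates, e.g. gathered by hand from each GitHub repository
releases = {
    "example-instrumentation-a": date(2023, 11, 20),
    "example-instrumentation-b": date(2023, 11, 10),
}
at_risk = [name for name, when in releases.items() if needs_warning(when)]
print(at_risk)  # ['example-instrumentation-a']
```

In practice the release dates could come from GitHub's releases API, but the triage logic is just this: anything released after the cutover warrants a customer heads-up.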

Courtney:

There's a lot in here, and you refer to this as a near miss; you said something very early on, a near miss, an almost-incident. What does a near miss mean to you? I sort of want to get specific on that terminology from you, at least, because I know other people can have different meanings for it.

Nick Travaglini (He/Him):

Sure. So a near miss, or as George Carlin would say, a near hit, is when one is looking at trajectories, I'm going to say, of various forces, just to be as abstract as possible, and it looks like they're going to intersect each other in a way that would make your life unhappy. You would be very unhappy if these things intersected. And then you get lucky, or maybe you've taken some action, so that they don't actually end up hitting each other. And so you've avoided this collision, you've avoided this problem. So it assumes that things are going well, and you either luck out or you, through your own active agency, are able to avoid a problem.

Courtney:

It sounds to me like what you're saying here is that avoiding this near miss was definitely not luck, but it was anticipation and expertise of folks who are very intimately familiar with these aspects of how the system works.

Nick Travaglini (He/Him):

Right. So we did get lucky in the sense that none of these libraries had actually upgraded

Courtney:

yeah.

Nick Travaglini (He/Him):

during the couple of days since they had been permitted to do so. So I totally want to say that there was an aspect of luck here; I'm not going to pretend there wasn't. But we in customer success understand our customers' behavioral patterns: that they want to sample, that they use alerts and various features of our product, and how those are technically implemented. Such that we can trace out, we can anticipate, how various aspects of the technical parts will behave, given this social change, which is the change to a naming convention and the specific syntax that's used. Not to mention the social aspects of the production pressure that engineers may be under, when, if they're using something like Dependabot and reading through a PR, maybe they read the release notes but they don't read the actual code, the diffs themselves. They may miss something; they may introduce a bug where there hadn't been one.

Courtney:

I have a question about that. What I like about what you just explained there, and the way you write it up, is that I think it lays bare what we mean by a sociotechnical system, which is trying to drive home the fact that it's not just that we work in software, but that we as humans work with software; we are a part of that system. And so the successes and the failures and all those things are not inherently one or the other. It's how humans work with machines. And the central thesis you have early on is that this stuff can't work without us, and what we tend to have is these near misses because of the adaptability of humans in a joint system with machines. Would you say that's a fair restatement of your central thesis?

Nick Travaglini (He/Him):

Absolutely. Absolutely. So when it comes to technical artifacts, technical objects, they are typically considered to be, at best, and I'm going to use a little bit of jargon here, complicated. They are things that are built up from, that are composed of and decomposable into, discrete, say, digital components. There's a part-whole relationship, and that whole can be broken down cleanly. That's what complicated means. A complex system is not decomposable nicely. With a complicated system, you've got a certain number of discrete components that you can aggregate back into the same whole; with a complex system, if you decompose it, you actually get things that are qualitatively irreducible to each other. That's the idea of complexity. And certainly in an organization like a business, you've got organic beings, living things like people, and then you've got technical artifacts, which are often considered to be at most complicated, at least simple. To sustain itself, the organization requires both, because people need technical objects to do things, to extend our own powers and capacities. And there are a lot of things in an organization that have to work together that were not built to work together, a bunch of technical objects that were not built to work together. And humans are the ones who are able to mediate that and make them work together.

Courtney:

And not just mediate that, but anticipate. Machines cannot anticipate and adapt and plan based on accumulated expertise; that is an inherently human capacity. By looking at this near miss, you're looking at normal work, right? Somebody could argue: well, why are you calling this a near miss? This is just stuff we do all day long. It's just stuff that product teams and engineering teams do. That's just our daily work. Why would that be a near miss?

Nick Travaglini (He/Him):

So one of the interesting things, I think, about this is that we had an incident and nothing was broken, right? There's no technical component here that wasn't functioning properly, or the way that we really wanted it to. The breakdown was in the communication about the change from one syntax to the other in the semantic convention. Refinery was working exactly as you would want, Honeycomb was ingesting data exactly as you would want, and Dependabot, if that was being used, was working precisely. And so this is about understanding how these technical components work when they're functioning appropriately, when they're functioning properly. That's understanding normal work, because most of the time they are up; we're able to keep them up. And then it's about having a sense for human culture and sociology, how people work in organizations, and the sorts of pressures they face in their day-to-day activities in a for-profit business. You've got to innovate, you've got to make money so you can reinvest it, so you can stay ahead of your competition, and so on and so forth. And so thinking about the collisions, or possible collisions, of these forces is where understanding normal work allows you to get ahead. You can anticipate; you're not just heads down, focused on your work in the moment, immediately engaged. You're also somewhat temporally disengaged, and this is where the philosophy stuff comes in. You're actually of multiple times. You're of the immediate present, where I have to know what each of these things does in its normal operations, and I'm also able to take a step back and reflect on: where is this going? That's a totally different dimension of time than the immediacy of the present, and that's something that only humans can do.

And so that's where we provide the adaptive capacity, to use some language from resilience engineering, to make sure that, as we understand normal work, we can actually deflect and redirect and adapt and change to make sure that things continue to work.

Courtney:

What was particularly interesting to me about the way you talk about production pressures here is the notion of causality. It's something that comes up in lots of incident reports, but also in a near miss. I have seen, and I have one company in mind whom I'm not going to name right now, incident reports that slag other companies as the cause, or the problem, or what have you. Generally speaking, I'm not a fan of that approach, regardless of how egregious the situation was. But you go the opposite direction: instead of throwing OTel under the bus, you identify production pressures on your end for this incident.

Nick Travaglini (He/Him):

What I put into the blog post is that it's totally possible to say: oh, OTel didn't do enough to let people know that this change was coming. They had one post about it, issued months before, on their blog, and how many people are reading that blog? They should have known. But this is a sort of hindsight. Now that this problem has arisen, we know something that, back at the time, they probably didn't know. They probably weren't thinking in particular about, for example, how Honeycomb's Refinery system works. They may have had no idea that it even existed, whoever wrote that post or decides on announcing these things. So it's totally possible to say: well, they should have issued another one two weeks before the change, and a week before, and 24 hours before, and it's all their fault; if they had just done more, we could have avoided this whole situation. And I think that's problematic. And I think a lot of folks in resilience engineering, and you, will appreciate that there's more going on here. There's the way that Honeycomb has designed Refinery, to use a YAML file that hard-codes the syntax, right? That's what we have done. Then again, we also built Refinery not knowing that this change was coming, so it's also not our fault, right? So we needed to take some responsibility, but not more than our responsibility. I'm not going to claim that we, Honeycomb, had made any decision about the semantic convention change. The point is not to cast aspersions; the point is to understand how things work and to use our wet-brained capacities as humans to think about the situation, take a step back from the immediacy, and do something that only we can do, which is to really think.

That's really what thinking is: the ability to step back and be like, oh F, this is going to happen.

Courtney:

Because of something you brought up, and I think you linked to another talk from the Learning from Incidents conference that happened a couple of years ago, I was also thinking about some work that Sarah Butt and Alex Ellman had done on this front: how do we think about incidents and our systems in the context of third parties or other organizations that we work with? Which is probably the reality for, I'm going to go out on a limb and say, almost every single software company out there now, right?

Nick Travaglini (He/Him):

So the thing about engineering is that it involves people, and people are these open, energetic systems. We're biological organisms; we have to ingest food in order to survive. We talk to people; we're very, very communicative. Organizations are not hard boundaries. I work with customers all the time, it's literally my job, and we have engineers who interact with engineers at our customers all the time. This is actually super important to the engineering process generally: organizations are not hard, closed systems. There's a fantastic book called Hitting the Brakes by a woman named Ann Johnson. It's a study of how the anti-lock braking system was created, and one of the things she points out in her study is that the engineers at the various organizations that were competing with each other to get anti-lock braking systems to market, dealing with a problem in cars and vehicles, had to communicate with employees at these competitors. There had to be a conversation at a higher level, at the industry level, in order for the engineers at a given organization, in a given business, to have the ideas to make progress on the problems they were dealing with in their local environment. So there's this idea that businesses are hermetically sealed, and yet when things are good, I can get resources and help from others. If you're listening to this podcast, you're doing this, you're participating in this; you're part of the software industry, you're literally engaged in this. And when things go well for me, I will give them all the credit in the world, but when things go bad, then I'm going to throw them overboard? That's just total hypocrisy, right? You've got to take responsibility here.

That's one of the things that I really like, actually, about resilience engineering and the sort of safety science it's coming out of: it really is about, let me take responsibility for what I did in a reasonable way. I contributed to this; I will totally cop to that. I will totally say, yes, I did this. Don't then throw me overboard

Courtney:

Yeah.

Nick Travaglini (He/Him):

and put an inappropriate, inordinate, that was the word I was going for, amount of blame on me. If something goes wrong, I already feel bad; I want to do something about that because I want to take responsibility. So help me to take responsibility, to make this better, right?

Courtney:

As a system, as a piece of a whole system. This makes me think of one other note I had on the report, so I'm going to loop backwards a little bit, because I want to talk about soft and hard boundaries if we have time. The thing I want to loop back to is that in the report, you talk about how you weren't sure if this was an incident or not an incident. Is it an incident? I don't know. "I cut the Gordian knot and declared an incident." And there is an iceberg of context underneath those couple of sentences in this report, because I feel like anybody who has been involved in incident response, in incident management, understands the existential question of "Is this an incident?" Being the person who pushes the "this is an incident" button is a pretty scary place to be, and when you're not sure, I think it's the scariest. Maybe I could be wrong; people can please come and disagree with me on that, actually. What was that process like? And what does that look like at Honeycomb? If you have other experiences to compare it to, that's great. But I think that's a less examined piece of this world than deserves attention.

Nick Travaglini (He/Him):

Let me say, first of all, I'm very grateful to Jeli for having the Slack bot, the ability for anybody to just do it in Slack. That was super helpful for us getting together and being able to treat this with the severity that I thought it merited. Maybe other people wouldn't have, but I'm at least grateful for that. I'm also grateful to my colleagues for including me in the training on how to declare an incident and use the Jeli bot and do all that sort of stuff, and for having me read the Howie guide, all that sort of material. It was great. So, one of the things I'm going to do is go ahead and define an incident. There is a certain expectation of how a system will perform, and an incident is where the behavior of that system changes in a way that somebody doesn't like. That's it. That's an incident. And then you start getting into all kinds of contextual questions. Who is this person? When did this change happen? Under what other conditions? What else is going on? It requires a lot of context, and that's the point. There's not going to be a single definition of an incident; it's always going to be a judgment call. So that, for me, is what an incident is. In terms of my thinking about declaring this an incident, one of the things that I'm also very grateful to Honeycomb for is that we don't count the number of incidents that we have. We don't try to track things like MTTR; they're not helpful. My colleague Fred has written a blog post about how counting forest fires is not a great way of understanding whether or not your firefighters are doing a good job. It may be indicative of something else, a bigger systemic issue like climate change, but it's not good for asking: are your firefighters doing a good job? And so I felt no compunction about declaring this an incident. In fact, it's beneficial to declare it an incident, because then we get to practice.

It's a low severity incident, though, by the way, Honeycomb has moved away from severity levels and categorizing incidents by severity; we have a typology, a different classification system. We get to practice communicating and handling the unexpected, learning how our colleagues work, and updating our prior understandings of how the system works. There are things about Honeycomb that I learned from participating in this incident, and from attending and reading incident reviews about incidents that have happened at Honeycomb. It's a real learning opportunity. We get to practice working together in an ambiguous situation during the incident, and then afterwards we get the chance to learn things. So it's actually beneficial to declare this an incident, and to pay attention to, recognize, and acknowledge that something about how things are going has changed. Understanding that something has changed, and getting into the details of what has changed, is good for me going forward, because now I can take that into account. Now I know: hey, customer, look out for this change. It's coming down the pipe. Eventually they're going to update; there's going to be this breaking change. I can advise my customers how not to get caught in this trap.

Courtney:

Yeah, and this is the notion, again from resilience engineering, of work-as-imagined versus work-as-done. The short version of that is: there are ways we think about how our organization functions, maybe ways we're not even aware of, and then there are opportunities like this, incidents, near misses, that throw into focus how it actually functions. And it's a really unique opportunity to be able to understand that distinction. There are things you learned about how your own company works. Honeycomb's not that big, but the system as a whole is complex enough that none of you can possibly know enough about all of it. And this is one of the ways that you all begin to learn those pieces and how things actually work, right?

Nick Travaglini (He/Him):

Yes, 100%.

Courtney:

Sorry, this ends my TED talk on work-as-imagined versus work-as-done. What I find is that every time now I talk to somebody about an incident, an incident report, anything like this, all of these other things I've learned come to mind. What I love about this field of resilience engineering is that it's not a fuzzy, academic, you know, unattainable field of study. Not all academic things are, but it's easy for people to feel that way about them. Resilience engineering and the topics that we talk about come from the lived experiences of practitioners, previously people who were airline pilots or, you know, air traffic controllers or surgeons or what have you. And we're just bringing that lens to software, and it stops me in my tracks sometimes how relevant the papers I read in academic journals are to the experiences that you're describing to me right now. Nick is also, I'm going to do a little pitch here, a member of this group, the Resilience in Software Foundation. I'm going to put a little bit more about that at the bottom, because I feel like we've covered so much of the territory that we discuss in that space a lot. But closing thoughts: you really talk a lot about humans being the pieces that keep these systems together and running and handling edge cases. Do you have any closing thoughts on that before I wrap this one up?

Nick Travaglini (He/Him):

So, the ability for people to make technical artifacts, technical objects, work together: that really came to me as part of my study, reading a philosopher, again, surprise, philosophy, named Gilbert Simondon. He's got a great work called "On the Mode of Existence of Technical Objects". It's dense, just a heads up, but it's really great, because he talks about how people really are these transducers. We're able to convey and transmit information and modulate it between technical objects that were not designed to accept information from each other in particular formats. We handle the rough edges of things. There are better and worse ways to do that, you know, but it's what people do, and it's essential for collaboration and just anything that we want to do, any way that we want to be productive.

Courtney:

It's funny, because I will end with a Bluetooth rant. I was on a call yesterday with someone, and I had recently applied for a HELOC, and my phone just kept ringing. Because, heads up, folks: if you do an online application for a HELOC, even through your own bank, that information gets out onto the internet and then people just start calling you, which is horrible. But they started calling me during this recording, and I had my Bluetooth headphones on, and I am intimately aware of this problem of how my phone and my computer don't communicate very well. Because if I am recording a Zoom thing and my phone rings, it hijacks the Bluetooth on my headphones, and then when I turn the phone off, the headphones start playing the last thing my son was watching on my computer into those headphones while I'm on the Zoom call. My adaptive capacity for that was to put my phone in airplane mode before you and I got on this recording. Which is a perfect bow on A) how much Bluetooth sucks and B) how good humans are at working around it. So that's a great note to end on. Thanks for joining me.

Nick Travaglini (He/Him):

Thank you so much for having me. Thank you for the work you do with the VOID. It's great, people; read it, read the reports. Courtney's work is fantastic.