The only bug-free code is the code that doesn’t exist.
As a software engineer, you read this and smile, and I know why.
First, you know such code doesn’t solve many problems.
Second, writing code still pays better than not writing it.
As a side effect, we create bugs. And nothing’s more frustrating than the alerts when something blows up in production. On weekends. We’ve all been through this, and some of those debugging sessions left us with memories for a lifetime.
So why even use a system for responding to bugs? Don’t you just go in there and fix it?
Sure.
When we see an error message, we have a hunch about where it’s coming from, but we don’t know the reason for it, so we make an assumption. A hunch and an assumption in the same sentence: a recipe for disaster.
Uncaught TypeError: Cannot read properties of undefined (reading 'length')
We’re tempted to roll up our sleeves, change the code, and push commits like “this should fix it.” Second attempt. Third time’s the charm.
But unless you remember everything you wrote (I don’t), this process quickly gets out of control.
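To make that concrete, here’s a minimal sketch of how that TypeError typically shows up and why the reflexive patch only hides it. The `SearchResponse` shape and the render helpers are hypothetical, not code from any real app:

```typescript
// Hypothetical example: the declared type promises `items`,
// but a real payload occasionally arrives without it.
interface SearchResponse {
  items: string[];
}

function renderResults(response: SearchResponse): string {
  // 💥 "Uncaught TypeError: Cannot read properties of undefined (reading 'length')"
  // whenever the backend sends a payload with no `items` field.
  return `${response.items.length} results`;
}

// The reflexive "this should fix it" patch silences the crash...
function renderResultsPatched(response: SearchResponse): string {
  return `${response.items?.length ?? 0} results`;
}
// ...but it never answers the real question: why was `items` missing?
```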
➡️ Key Takeaways
Don’t jump right into your editor. Instead:
Raise awareness, let people know you’re on it to avoid duplicate work
Look for evidence, understand what’s affected, and prioritize based on data
Document for yourself, your team, and the next person who will look into this bug if it reappears. But most importantly, document to understand.
Here’s my pragmatic response guide:
1. Status Updates
Regardless of team size, I give a heads-up to the code owners before I start looking into the issue.
This is especially important in remote teams for two main reasons:
Raising awareness of the issue - some people might have missed the alarms or the earlier messages in Slack/Discord/Teams, and they might know exactly what’s going on and why the bug appeared.
Avoiding duplicate work - maybe you missed someone else’s heads-up. 🫣 Your peers a couple of hours ahead of you have already looked into it and are working on a fix.
2. Look for Evidence
Alerts, logs, Slack notifications, Datadog outputs: evidence comes in different shapes and sizes.
Debugging with a comprehensive logging and reporting system is a walk in the park.
On a lucky day, you see what needs to be fixed from the logs.
Other times, especially with B2C software, the evidence is more like:
This 💩 doesn’t work!
This is when you have to ask for evidence.
It can be a screencast or a short, step-by-step description of how to reproduce the bug.
Finding evidence will help you with:
Identifying the error: a specific log line, or whatever symptom the error produces when it appears, will be crucial for verifying that it’s actually “gone” once you deploy the fix. Test reproducers are nice (more on them later), but they usually check the fix in isolation, which is detached from your production software.
Categorizing the error: looking up the error message in your issue tracker or Slack, you might find earlier attempts to fix the same bug. Is it a recurring bug? Maybe you should raise its priority. Where does the error surface? Is it a low-priority or top-priority production bug based on the system affected?
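When the evidence is thin, it also pays to make future evidence richer. Here’s a minimal sketch of attaching structured context to every reported error; `reportError`, the field names, and the Node-style `process.env.RELEASE` lookup are assumptions, stand-ins for whatever logging or error-tracking setup you already have:

```typescript
// Sketch: attach context to errors so the evidence is easy to find,
// verify, and categorize later. Field names are illustrative only.
interface ErrorEvidence {
  message: string;
  stack?: string;
  route: string;     // where the error surfaced (helps with prioritization)
  userId?: string;   // who is affected (helps with impact assessment)
  release: string;   // which deploy introduced it
  occurredAt: string;
}

function reportError(error: unknown, context: { route: string; userId?: string }): void {
  const evidence: ErrorEvidence = {
    message: error instanceof Error ? error.message : String(error),
    stack: error instanceof Error ? error.stack : undefined,
    route: context.route,
    userId: context.userId,
    release: process.env.RELEASE ?? "unknown", // assumes a Node-style environment
    occurredAt: new Date().toISOString(),
  };
  // Searching the logs for `message` later is how you verify the error
  // is actually "gone" once the fix ships.
  console.error(JSON.stringify(evidence));
}
```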
3. Prioritize
Not every bug has to be solved now.
Sometimes, it’s enough to turn off part of the UI so it doesn’t crash and users can use the rest of the app without interruption.
Sometimes, it’s enough to add that .skip for that flaky integration test so engineers can keep deploying their sweet code to production without interruption.
Other times, it’s only a dropdown that doesn’t have the correct value selected by default.
Because you already know how important it is to categorize the errors, you might decide it’s not a top priority, create an issue in your issue tracker with all the evidence, and simply solve it later.
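Both of those mitigations are tiny changes. Here’s a rough sketch, assuming a Jest/Vitest-style test runner and a hypothetical `flags` module and `renderRecommendations` function; the point is to buy time, not to fix anything yet:

```typescript
// Sketch of two "solve it later" mitigations; `flags` and
// `renderRecommendations` are hypothetical stand-ins.
import { flags } from "./flags"; // hypothetical feature-flag module

declare function renderRecommendations(): string; // the code path that currently throws

// 1. Unblock deploys by skipping the flaky integration test for now,
//    and link the tracking issue so the skip doesn't become permanent.
it.skip("syncs the cart with the billing service", async () => {
  // ...flaky integration test body...
});

// 2. Turn off the crashing part of the UI behind a flag so users can keep
//    using the rest of the app while you investigate.
function renderSidebar(): string {
  if (!flags.isEnabled("recommendations-widget")) {
    return ""; // degrade gracefully instead of crashing the whole page
  }
  return renderRecommendations();
}
```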
4. Fix
When solving later doesn’t work, it’s time to act.
Bugs need to be understood and fixed.
Without diving into the nitty-gritty of how exactly you find the bug, whether it’s good old log lines or a debugger, I consider these an essential part of any PR that fixes the issue:
- a unit and/or an integration test (a minimal reproducer sketch follows below)
- if it’s a UI bug, a video or screenshot showing the issue fixed
- if it’s a non-trivial bug, e.g., something that works when the initial value is set to undefined but breaks when it’s set to null, a description of the fix (a great use of code comments, by the way)
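As an example of that first item, here’s what a small reproducer might look like in a Jest/Vitest-style test. The `formatLabel` helper is hypothetical, standing in for the function that actually had the bug:

```typescript
// Minimal reproducer sketch; `formatLabel` stands in for the buggy function.
function formatLabel(value?: string | null): string {
  // The fix: treat null the same way as undefined instead of
  // only checking for undefined (the original bug).
  return (value ?? "n/a").toUpperCase();
}

describe("formatLabel", () => {
  it("works when the initial value is undefined", () => {
    expect(formatLabel(undefined)).toBe("N/A");
  });

  it("also works when the value is null (the case that used to crash)", () => {
    expect(formatLabel(null)).toBe("N/A");
  });
});
```

The second test is the one that documents the bug: it would have failed before the fix, and it keeps the regression from coming back.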
One thing I use all the time is GitHub Pull Request Templates. They’re easy to set up, and whenever you open a pull request, they remind you of what to include in a PR.
5. Document
You should be documenting your progress (especially in remote teams working across multiple time zones) for two reasons:
Teams work around the clock. You found the issue, it’s Friday night, and you shouldn’t be working anyway. Meanwhile, Rob is just getting started, but he doesn’t have any context yet.
To understand the problem.
The first is a practical thing. Making notes on where you left off allows someone else to pick up your findings, figure out the problem, and push a fix to production while you sleep.
Documenting is problem-solving. But don’t take my word for it; read: What can solving a 300-year-old mathematical problem teach software engineers?
There’s something magical that happens when we try to organize the contents of our mind into meaningful, structured text that helps other people understand what’s going on:
Overall, you get a better picture of what’s happening.
6. Prevent
The best way to fix bugs is to prevent them from happening.
Enumeration attacks, optimistic access, returning unexpected values. While some issues can be caught with static type checking and linters, some problems come from coding conventions, or they’re simply part of other tools we use.
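For the undefined-length class of bug from the intro, for example, strict compiler settings move the decision to build time. A small sketch, assuming a TypeScript project with `strictNullChecks` enabled:

```typescript
// With "strictNullChecks": true in tsconfig.json, the compiler forces us to
// handle the missing-value case before the code ever reaches production.
interface SearchResponse {
  items?: string[]; // honest type: the field can be absent
}

function countResults(response: SearchResponse): number {
  // return response.items.length;     // ❌ compile error: 'items' is possibly 'undefined'
  return response.items?.length ?? 0;  // ✅ we had to decide what "missing" means
}
```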
For Hall-of-Fame-type bugs that brought down your entire app, postmortems are a great way to prevent the same thing from happening next time.
These documents provide a structured approach to identify what happened, why, and how to prevent similar incidents in the future. They usually have sections like the incident summary, timeline of events, root cause analysis, impact assessment, and action items for resolution and prevention.
The postmortem process emphasizes clear communication and learning from incidents to improve future operations and processes.
🗣️ Join the discussion
What’s your way of dealing with production bugs? Is there anything missing from my list? Leave a comment below or simply hit Reply.
I answer every email I get.
📰 Weekly shoutout
🐲 How to prepare the system design interview by . This guide has been a tremendous help because I’m preparing for a system design interview. It contains Fran’s recommendations on how to study and many other resources!
📚 Simple tricks to level up your technical design docs in 2024 👌 Learn from how to write technical design docs that get read.
The ridiculous policy that forced engineers to come to the office at 2 AM twice a year. Who doesn’t love a good production story? Find out from the organizational complexity a wrong data type in your DB can bring. 😃
📣 Share
There’s no easier way to help this newsletter grow than by sharing it with the world. If you liked it, found something helpful, or you know someone who knows someone to whom this could be helpful, share it:
🏆 Subscribe
Actually, there’s one easier thing you can do to grow and help grow: subscribe to this newsletter. I’ll keep putting in the work and distilling what I learn/learned as a software engineer/consultant. Simply sign up here:
Thanks for the mention, I'm glad you found the article on interviews useful!
I've loved postmortems ever since I wrote one for an issue I caused.
1. It's a document where the writing matters (at least in Amazon culture). This is definitely what got me started with writing.
2. You are focusing on setting mechanisms in place to work better next time. It's not only about this occurrence of an issue but how to prevent, detect, and mitigate better in a future occurrence.
I think operations can't be just a training you do once and then that's it.
You wouldn't do just a self-service training to become a firefighter or a doctor working in an emergency room. Their training has to be through drills on how they would react to those emergencies.
I think we have a lot to improve on the training side to make sure everyone has the working principles you described: visibility into your work, prioritizing mitigation instead of root-causing...
Loved the broken-commits screenshot, definitely guilty of that too 😂
I would flip the order and start with prioritizing. I think the biggest mistake engineers make is not consulting with anyone before trying to fix bugs. I've often seen cases where a hotfix caused a much bigger issue than the bug it solved.
Once I've identified the scope of the problem, I would first talk to my team leader to prioritize.