Measure twice, cut once.
Carpenters have lived by this rule for centuries.
The proverb suggests that you double-check your measurement before cutting a piece of wood because, if it is inaccurate, you must cut a new piece, wasting time and material.
Does it ring a bell?
Before we dive into this issue, a HUGE shoutout to
, and . Your encouragement and endorsement of this newsletter resulted in ginormous growth. Most importantly, thanks to the 45 new subscribers who have joined since the last issue. I respect your time and will make the most of it.
Thank you! 🙏
Here’s how you can break down this proverb and apply it to your work as a software engineer:
Measure - Identify
Twice - Understand: the more, the better
Cut once - Fix it once and for all
However, from my experience, this is not always how we approach fixing stuff.
We’ve all seen those PRs where we asked ourselves how this could possibly resolve the issue, but it worked. 🤷‍♂️
I’m even guilty of making them.
➡️ Key Takeaways
Create reproducible experiments, a.k.a. tests
Understand before you act (exceptions apply)
Don’t make a change unless you know why you’re making it
If you remember last week’s issue, this is basically expanding on take 4. If you missed it, check it out here:
Let’s dive into it!
🪪 Identify
To hunt down a bug, you must identify it.
Typically, you look at server logs and alerts or turn up the logging levels. If this doesn’t bring the desired results, you might check out the codebase and try to reproduce the bug locally.
What you’re looking to get out of this process is a reproducer.
It can be a manual one, where you replicate the buggy behavior by repeatedly walking through the same steps, but I don’t like manual reproducers for two reasons:
First, it can take forever in complex apps to get to specific parts of the system. When something is buggy, it’s guaranteed to be the 10th step of a wizard, visible only after filling out ten input fields and card information.
Second, some bugs are unstable. Even if you walk through all this, the bug might not appear, and you just wasted all your time coming up with fake data and filling out forms.
You need a reproducible1 experiment.
Instead of relying on luck to reproduce the issue, you can examine it in a controlled environment, usually as a unit, integration, or system test.
🧠 Understand
Being able to face the issue almost instantly through our experiment, as often as we want, is the best way to understand it. It’s time to ask some questions!
Does the code behave the same way for a different set of inputs?
Did we cover all the edge cases?
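These questions map naturally onto a table of test cases. A sketch, with a hypothetical `normalize_username` function standing in for the code under investigation:

```python
def normalize_username(name: str) -> str:
    """Hypothetical code under test."""
    return name.strip().lower()

# Each case is a question: does the code behave the same for this input?
cases = [
    ("Alice", "alice"),       # happy path
    ("  Bob  ", "bob"),       # surrounding whitespace
    ("", ""),                 # empty string: is this even a valid input?
    ("ÖZGÜR", "özgür"),       # non-ASCII: does lower() do what we expect?
]

for raw, expected in cases:
    actual = normalize_username(raw)
    verdict = "ok" if actual == expected else "MISMATCH"
    print(f"{raw!r} -> {actual!r} (expected {expected!r}): {verdict}")
```

Every new row either confirms your mental model of the code or breaks it, and both outcomes teach you something.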
But why ask all these questions?
Why not pick the top-voted answer from StackOverflow or the one suggested by ChatGPT or GitHub Copilot?
I believe rushing into code changes is harmful for two reasons.
First, lucky changes strip us of learning opportunities. A hasty change that makes your code work, but you don’t know why, might stay there forever.
There is one exception. That’s when business is blocked. But be careful because once such fixes are deployed and things start to work again, you need a strong engineering culture to return to such fixes and understand what was happening.
Second, are you sure you fixed the bug?
We might have fixed one form of it, but we can’t guarantee it won’t fail for another set of inputs or that we didn’t break something else (although tests can reduce the chances of this).
So, I made up a formula:
🧑‍🔧 Fix
This equation can describe the quality of a fix: F = I × U, where F is the quality of the fix, I is identification, and U is understanding.
Let me explain.
If we fail to identify the bug (I is 0), the fix is a guess at best, bringing F to 0.
If we don’t understand what’s happening (U is 0), we might have replaced the bug with another one, resulting in a 0 quality fix.
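As a toy illustration, here is my reading of the formula as code (assuming the two factors are simply multiplied, so either one being zero zeroes out the result):

```python
def fix_quality(identification: float, understanding: float) -> float:
    """Toy model of F = I x U; both inputs in [0, 1].
    The fix is only as good as the weaker of the two factors."""
    return identification * understanding

print(fix_quality(0.0, 1.0))  # never identified the bug -> 0.0
print(fix_quality(1.0, 0.0))  # never understood it -> 0.0
print(fix_quality(0.5, 0.5))  # 0.25: partial measurement, partial quality
```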
Your fix should be a deliberate change, altering the code so it behaves in an expected way for a specific set of inputs. In other words:
Measure twice, cut once.
🥊 Take Action Now
Next time you debug a piece of code, be sure of what you’re fixing before deciding how to fix it. What would your F value be?
Quick fixes are fine as long as:
they keep the business rolling
they don’t cause more harm than good
they’re followed up with an investigation and an understanding of what happened
📰 Weekly Shoutout
🦆 Are you blocked? (And how to unblock yourself) - Do you have any blockers? Nah, it’s just gonna take an extra week! Identifying whether you’re blocked can be tricky. Luckily,
’s article will help you with that!
You are hurting your team without even noticing - A great ego check for team leads from
and . While ego can be helpful in some cases, it might shut down voices you want to hear in your meetings.
2024 Guide to Mentoring for Software Engineers - Mentoring is hands down one of the most unique experiences in software engineering.
gives you some actionable tips on how to approach it.
📣 Share
There’s no easier way to help this newsletter grow than by sharing it with the world. If you liked it, found something helpful, or you know someone who knows someone to whom this could be helpful, share it:
🏆 Subscribe
Actually, there’s one easier thing you can do to grow and help grow: subscribe to this newsletter. I’ll keep putting in the work and distilling what I learn/learned as a software engineer/consultant. Simply sign up here:
“that results obtained by an experiment or an observational study or in a statistical analysis of a data set should be achieved again with a high degree of reliability when the study is replicated.” Reproducibility - https://en.wikipedia.org/wiki/Reproducibility
Thanks for the shoutout Akos!
I do think that sometimes it's ok to not come back to understand the root cause and be satisfied with the lucky fix. I had such a case just today: we touched an area owned by another team, hit a bug, and suddenly it disappeared. I think it was because of a cron job that runs every hour and does something, but I'm not sure.
I can go deeper here, but I think my time is better invested in other things.
Thanks for the mention, Akos. I'm glad I could encourage you in some way. I also felt the same with other people in the community when I started. And I think that's the beauty of what we are doing here :)
About the post, you just triggered a memory of the template I have for these situations. It's called "🧪Labnotes debugging". When I need it, I go completely into an experimentation mode, getting rid of my biased understanding of the problem and approaching it with fresh eyes. The name and emoji help me get into this "scientist persona".
As you mentioned, we identify and understand with experiments and finding reproducible behaviors. Taking notes of these brings a better understanding. Only then we have the knowledge to go for a solution.