When Bugs Break Production: A Pragmatic…

Akos Komuves

Feb 4, 2024

The only bug-free code is the code that doesn’t exist.

8 Comments

Fran Soto

Feb 4, 2024

Thanks for the mention, I'm glad you found the article on interviews useful!

I've loved postmortems ever since I wrote one for an issue I caused.

1. It's a document where the writing matters (at least in Amazon culture). This is definitely what got me started in writing.

2. You focus on putting mechanisms in place to work better next time. It's not only about this occurrence of an issue but about how to prevent, detect, and mitigate better the next time it happens.

I don't think operations can be just a training you do once and that's it.

You wouldn't rely on a single self-service training to become a firefighter or an emergency room doctor. Their training has to include drills on how they would react to those emergencies.

I think we have a lot to improve on the training side to make sure everyone has the working principles you described: visibility into your work, prioritizing mitigation over root-causing...

Akos Komuves

Feb 5, 2024

That's a great point. As I'm applying for a position where on-call is mandatory, I can't help but think about what happens when something goes wrong in production in a system that's entirely new to you. I hope they think along the same lines and provide training and scenarios for what you need to do and when.

Fran Soto

Feb 5, 2024

Yeah, same on my side with mandatory on-call.

I see it as a positive, a way to "change it from the inside". When I'm on-call I see the pain points and I can take action on them.

Akos Komuves

Feb 5, 2024

I have many questions regarding on-call 😃 First, how do you wake up in the middle of the night if there's an incident? Can you even code in that state? How does that work?

Fran Soto

Feb 5, 2024

> How do you wake up in the middle of the night if there's an incident?

The pager app on my phone is loud enough to wake me up, and the entire building if I put it at max volume :)

> Can you even code in that state?

The purpose of operations is not coding, or even root-causing.

It's mitigating the impact.

For instance, you may need to understand if there was a deployment on a dependency and it broke something. Or if you have a feature behind a feature flag and you need to turn it off...

In many situations, you won't know WHAT caused it, but HOW to mitigate the impact.

And it shouldn't all be up to the person who is on-call. Otherwise, junior engineers would be very unlucky. Good runbooks should give a clear picture of what to do in hypothetical scenarios.
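To make the feature-flag mitigation concrete, here is a minimal Python sketch of a "kill switch". It assumes a hypothetical in-memory flag store; `FlagStore`, `checkout`, and the `new_checkout` flag are made-up names for illustration, not the API of any real flag service.

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """Hypothetical in-memory feature-flag store (stand-in for a real flag service)."""
    flags: dict[str, bool] = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a missing flag never enables new code.
        return self.flags.get(name, False)

    def disable(self, name: str) -> None:
        # The "kill switch": the mitigation step an on-call engineer would take.
        self.flags[name] = False


def checkout(store: FlagStore) -> str:
    # The new code path only runs while its flag is on; the old path is the fallback.
    if store.is_enabled("new_checkout"):
        return "new checkout flow"
    return "old checkout flow"


store = FlagStore({"new_checkout": True})
before = checkout(store)       # new path is live
store.disable("new_checkout")  # mitigate first, root-cause later
after = checkout(store)        # traffic is back on the old path
print(before, "->", after)
```

The point of the sketch is that the mitigation needs no code change or deployment: flipping the flag is enough, and finding out WHAT broke can wait until morning.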

> How does that work?

And of course, it's not a curse hanging over you forever. The on-call rotates, and in a healthy, big enough team you'd be on-call once every 2 months or so. For me, the most frequent I'd accept is once a month.

And new joiners never join the on-call rotation immediately. I've changed teams, been there 4 weeks already, and still don't have a grasp on half the services we own. So I'm not on-call.
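As a rough illustration of the rotation arithmetic, here is a small Python sketch. It assumes one-week shifts handed out round-robin, which is an assumption and not necessarily how any particular team schedules; the names, team size, and start date are hypothetical.

```python
from datetime import date, timedelta


def rotation(engineers: list[str], start: date, shifts: int) -> list[tuple[date, str]]:
    """Assign one-week shifts round-robin, starting on `start`."""
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(shifts)
    ]


team = ["ana", "ben", "chloe", "dev", "eli", "fay", "gus", "hana"]  # hypothetical names
schedule = rotation(team, date(2024, 2, 5), shifts=16)

# Each engineer's shifts are len(team) weeks apart.
ana_weeks = [week for week, who in schedule if who == "ana"]
gap_weeks = (ana_weeks[1] - ana_weeks[0]).days // 7
print(gap_weeks)  # 8, i.e. roughly once every two months
```

With weekly shifts the gap between turns equals the team size in weeks, so "once every 2 months" needs about eight people in the rotation, and "once a month" about four.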

Akos Komuves

Feb 6, 2024

Thanks for the insights! Here the rotation is said to be one week on-call every one or two months, with no on-call in the first 6 months. Sounds exciting! I was hoping I wouldn't have to deploy hotfixes 😅 Thanks, Fran!

Anton Zaides

Feb 4, 2024

Loved the broken commits screenshot, definitely guilty of that too 😂

I would flip the order, and start with prioritizing. I think the biggest mistake engineers make is not consulting anyone before trying to fix bugs. I have often seen cases where a hotfix caused a much bigger issue than the bug it solved.

Once I've identified the scope of the problem, I would first talk to my team leader to prioritize.

Akos Komuves

Feb 4, 2024

I'm grateful we squash commits 😃

> I think the biggest mistake engineers make is not consulting anyone before trying to fix bugs.

Totally. I got burned several times when I ended up throwing away my changes (git checkout .) because another team was already working on the fix. 🤦‍♂️

> I would flip the order, and start with prioritizing.

Good point. There's no need to dive into the logs if it's not something urgent. This works best for bugs where you're almost certain what's happening. We have a production bug right now in a client app where the Submit button on a Forgot password screen gives no feedback after you click it. The result? Users click it a couple of times 😂 It raises errors, but it's not prioritized because we know both the cause and the end result.
