COE
One of my favorite things about the time I spent at Amazon was its COE process.
COE stands for correction of errors: it’s a document your team fills out whenever you cause a customer-facing outage.
Some people (both inside and outside the company) balk at the process, but I think it is terrific: it is a chance to dispassionately examine the context and institutions that resulted in a failure, and it provides a means to address them.
Things that a COE contains:
- The context of the problem.
- The impact of the problem.
- Immediate steps to remediate the problem.
- The root cause of the problem, generally as investigated through a five whys analysis.
Some things that a COE does not contain:
- Names of folks. You don’t blame someone for writing buggy code. As soon as that code is pushed live, it’s the team’s responsibility; and if the code is buggy, the onus is on the systems and culture that allowed the error to sneak through an organization’s defenses.
- Downplaying of the problem. It is, quite literally, an advertisement of failure — avoiding the issue or trying to diminish the impact/ramifications of it defeats the point.
Here’s an example of something I ran into recently with Buttondown! (By recently, I mean, uh, literally last night. Don’t worry: it’s already fixed. [And yes, it is slightly nerve-wracking to write about how I broke the software I built!])
The problem: Emails that take a while to send (emails with a lot of links to check, tweets to render, that kind of thing) could time out, meaning that the final Email object itself wouldn’t finish being created even though the emails had already been sent. Worse, this gets propagated up to the UI as just a generic error, which means users might retry the request and send everything all over again.
Why (did the request fail)? I deployed a change making the big ol’ email sending API (as in the thing that gets kicked off when you click “Yes, I want to send this email, I promise!”) atomic, meaning that it only saves to the database if everything goes off without a hitch.
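To make the failure mode concrete, here’s a rough sketch of what that atomic path can look like in a Django-style app; the `Email` model and helper names are made up for illustration, not Buttondown’s actual code:

```python
from django.db import transaction

def send_email(newsletter, subject, body, subscribers):
    # The Email row only persists if everything inside the block succeeds.
    with transaction.atomic():
        email = Email.objects.create(
            newsletter=newsletter, subject=subject, body=body
        )
        for subscriber in subscribers:
            # The actual sends, though, are external side effects: they can't
            # be rolled back. If the request times out partway through, the
            # transaction aborts and the Email object vanishes even though
            # real emails have already gone out the door.
            send_to_subscriber(email, subscriber)
    return email
```

That mismatch (the database rolls back, the outside world doesn’t) is the whole bug.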
Why (did I make the change)? Because someone else ran into a bug where multiple emails were getting created due to errors downstream.
Why (didn’t the issue with the change get caught)? Because I didn’t test the atomicity with a sufficiently intense data set.
Why (is there a lack of test coverage for this scenario)? Because, well, testing stuff with emails is tough.
Why (is testing stuff with emails tough)? I’ve been reluctant to build out a lot of good email testing infrastructure. Best practice is to mock out emails in a testing environment with console or file mailers or something like that, but honestly the real reason is effort, which is a bad reason.
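For what it’s worth, the mocking itself is usually just a backend swap in Django settings; the backends below are stock Django, though whether they map cleanly onto Buttondown’s setup is my assumption:

```python
# In a test/development settings module: swap the real SMTP backend for one
# that never actually delivers anything. Both of these ship with Django.
EMAIL_BACKEND = "django.core.mail.backends.console.EmailBackend"  # print to stdout

# Or write each outgoing message to a file on disk instead:
# EMAIL_BACKEND = "django.core.mail.backends.filebased.EmailBackend"
# EMAIL_FILE_PATH = "/tmp/app-messages"
```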
The solution: build out more email testing infrastructure at a building-blocks level, making it trivial to instrument changes to the most delicate (and important) part of the application.
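Concretely, the kind of test that infrastructure should make easy might look something like this sketch. Django’s test runner already captures outgoing mail in `django.core.mail.outbox`; the factory helpers and `send_email` here are hypothetical stand-ins for the real code:

```python
from django.core import mail
from django.test import TestCase

class AtomicSendTests(TestCase):
    def test_send_with_a_large_subscriber_list(self):
        # Hypothetical factories; the point is a sufficiently intense data
        # set, not these particular names.
        newsletter = create_newsletter()
        subscribers = create_subscribers(newsletter, count=500)

        email = send_email(newsletter, "Subject", "Body", subscribers)

        # Exactly one Email object, and exactly one message per subscriber;
        # if the send blew up partway through, one of these would fail.
        self.assertEqual(Email.objects.filter(pk=email.pk).count(), 1)
        self.assertEqual(len(mail.outbox), len(subscribers))
```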
It all comes back to tests, I guess: tests, confidence in the changes you make, and durability. It’s not exciting to build out test infrastructure rather than, like, a new analytics dashboard (though that’s coming too!), but excitement is a bad heuristic.
Happy Sunday.
You should go see Call Me By Your Name. It was terrific.