COE
One of my favorite things about the time I spent at Amazon was its COE process.
COE stands for correction of errors: it’s a document your team fills out whenever you cause a customer-facing outage.
Some people (both inside and outside the company) balk at the process, but I think it is terrific: it is a chance to dispassionately examine the context and institutions that resulted in a failure, and it provides a means to address them.
Things that a COE contains:
- The context of the problem.
- The impact of the problem.
- Immediate steps to remediate the problem.
- The root cause of the problem, generally as investigated through a five whys analysis.
Some things that a COE does not contain:
- Names of folks. You don’t blame someone for writing buggy code. As soon as that code is pushed live, it’s the team’s responsibility; and if the code is buggy, the onus is on the systems and culture that allowed the error to sneak through an organization’s defenses.
- Downplaying of the problem. It is, quite literally, an advertisement of failure — avoiding the issue or trying to diminish the impact/ramifications of it defeats the point.
Here’s an example of something I ran into recently with Buttondown! (By recently, I mean, uh, literally last night. Don’t worry: it’s already fixed. [And yes, it is slightly nerve-wracking to write about how I broke the software I built!])
The problem: Emails that take a while to send (emails with a lot of links to check, tweets to render, that kind of thing) could time out, meaning that the final Email object itself wouldn’t finish being created even though the emails had already been sent. Worse, this gets propagated up to the UI as just a generic error, which means users might retry the request and send everything all over again.
Why (did the request fail)? I deployed a change making the big ol’ email sending API (as in the thing that gets kicked off when you click “Yes, I want to send this email, I promise!”) atomic, meaning that it only saves to the database if everything goes off without a hitch.
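To make the failure mode concrete, here’s a rough sketch of what that atomic path can look like in a Django-style app; the `Email` model and helper names are made up for illustration, not Buttondown’s actual code:

```python
from django.db import transaction

def send_email(newsletter, subject, body, subscribers):
    # The Email row only persists if everything inside the block succeeds.
    with transaction.atomic():
        email = Email.objects.create(
            newsletter=newsletter, subject=subject, body=body
        )
        for subscriber in subscribers:
            # The actual sends, though, are external side effects: they can't
            # be rolled back. If the request times out partway through, the
            # transaction aborts and the Email object vanishes even though
            # real emails have already gone out the door.
            send_to_subscriber(email, subscriber)
    return email
```

That mismatch (the database rolls back, the outside world doesn’t) is the whole bug.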
Why (did I make the change)? Because someone else ran into a bug where multiple emails were getting created due to errors downstream.
Why (didn’t the issue with the change get caught)? Because I didn’t test the atomicity with a sufficiently intense data set.
Why (is there a lack of test coverage for this scenario)? Because, well, testing stuff with emails is tough.
Why (is testing stuff with emails tough)? I’ve been reluctant to build out a lot of good email testing infrastructure. Best practice is to mock out emails in a testing environment with console or file mailers or something like that, but honestly the real reason is effort, which is a bad reason.
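For what it’s worth, the mocking itself is usually just a backend swap in Django settings; the backends below are stock Django, though whether they map cleanly onto Buttondown’s setup is my assumption:

```python
# In a test/development settings module: swap the real SMTP backend for one
# that never actually delivers anything. Both of these ship with Django.
EMAIL_BACKEND = "django.core.mail.backends.console.EmailBackend"  # print to stdout

# Or write each outgoing message to a file on disk instead:
# EMAIL_BACKEND = "django.core.mail.backends.filebased.EmailBackend"
# EMAIL_FILE_PATH = "/tmp/app-messages"
```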
The solution: build out more email testing infrastructure at a building-blocks level, making it trivial to instrument changes to the most delicate (and important) part of the application.
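Concretely, the kind of test that infrastructure should make easy might look something like this sketch. Django’s test runner already captures outgoing mail in `django.core.mail.outbox`; the factory helpers and `send_email` here are hypothetical stand-ins for the real code:

```python
from django.core import mail
from django.test import TestCase

class AtomicSendTests(TestCase):
    def test_send_with_a_large_subscriber_list(self):
        # Hypothetical factories; the point is a sufficiently intense data
        # set, not these particular names.
        newsletter = create_newsletter()
        subscribers = create_subscribers(newsletter, count=500)

        email = send_email(newsletter, "Subject", "Body", subscribers)

        # Exactly one Email object, and exactly one message per subscriber;
        # if the send blew up partway through, one of these would fail.
        self.assertEqual(Email.objects.filter(pk=email.pk).count(), 1)
        self.assertEqual(len(mail.outbox), len(subscribers))
```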
It all comes back to tests, I guess: tests, confidence in the changes you make, and durability. It’s not exciting to build out test infrastructure rather than, like, a new analytics dashboard (though that’s coming too!), but excitement is a bad heuristic.
Happy Sunday.
You should go see Call Me By Your Name. It was terrific.