Happy little accidents
You’ve probably heard this anecdote before, right?
A very large government bid, approaching a million dollars, was on the table. The IBM Corporation—no, Thomas J. Watson Sr.—needed every deal. Unfortunately, the salesman failed. IBM lost the bid. That day, the sales rep showed up at Mr. Watson’s office. He sat down and rested an envelope with his resignation on the CEO’s desk. Without looking, Mr. Watson knew what it was. He was expecting it.
He asked, “What happened?”
The sales rep outlined every step of the deal. He highlighted where mistakes had been made and what he could have done differently. Finally he said, “Thank you, Mr. Watson, for giving me a chance to explain. I know we needed this deal. I know what it meant to us.” He rose to leave.
Tom Watson met him at the door, looked him in the eye and handed the envelope back to him saying, “Why would I accept this when I have just invested one million dollars in your education?”
I can’t find the ‘original’ permutation of this story, which sits somewhere between apocrypha and the peculiar brand of HBS Short Case mythos. Sometimes it’s a few million dollars. Sometimes it’s an engineer (and a broken mainframe) rather than a salesman. Sometimes it’s a meeting and not a resignation letter. But you get the idea, right?
A year or so ago, most of the internet went down. I don’t want to get into the tedium of whether consolidated cloud infrastructure is a boon or a boondoggle for the health of the overall Internet, but I think a lot of people focused on how it happened (‘human error’) rather than why it happened:
Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
Notice the passive phrasing here! Notice how the subject is the input rather than the engineer who mistyped it. Notice how the described next steps focus on the tool being bad because of insufficient guard rails — it’s not that the user Did A Catastrophe, it’s that the tool is bad because it lets a user Do A Catastrophe:
We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks.
I’ve been biking a lot lately, much to the chagrin of all of my friends and acquaintances, whom I bombard endlessly with bike talk. I am neither a good biker nor an accomplished one — my relationship with the sport is one of good-natured and zealous incompetence. But it’s just so fun.
Here is one of my favorite things about biking: it is very hard to steer yourself too far astray. The cost of any given wrong turn is minimal at worst, and usually it’s just a fun little wrinkle that lets you explore a new path.
This Friday, I was meeting up with a couple of friends in Georgetown and, given my aforementioned zeal, decided to yank one of the LimeBikes outside my office and bike there. I did the thing I always do:
- Glance at Google Maps real quick.
- Foolishly assume that I can instantly mentally translate a blue zig-zag into actual directions.
- Miss like three turns in a row.
But this turned out great! My trip ended up being like six miles instead of four, but I found this lovely little trail that ran alongside the light rail, cutting through Airport Way and treating me to a series of gorgeous brick-stack murals.
One of the slightly unpleasant consequences of Buttondown’s growth is that I can’t quite, you know, ship broken code with the same cavalier spirit that I used to, now that there are a non-trivial number of folks actually using the broken code.
I think failure-resilient organizations are distinguished along three axes:
- Lots of engineering effort has been invested in surfacing potential failures or errors as quickly, transparently, and painlessly as possible. (There’s a small sketch of what I mean by this just after the list.)
- Lots of architectural effort has been invested in reducing the surface area and depth to which things can fail.
- Lots of organizational effort has been invested in creating and preserving an environment where failures are blameless and reflective.
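To make that first axis slightly more concrete, here is a minimal sketch of the shape I mean, assuming a Python stack; the decorator and logger names are hypothetical, not anything that actually exists in Buttondown. The point is just that a failure gets logged loudly, with context, and then re-raised rather than swallowed.

```python
import functools
import logging

# Hypothetical logger name; the real sink might be Sentry, PagerDuty, or email.
logger = logging.getLogger("failures")


def surface_failures(func):
    """Log any exception loudly, with context, and then re-raise it.

    The library doesn't matter; what matters is that nothing gets
    swallowed silently on its way to production.
    """

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            logger.exception(
                "failure in %s(args=%r, kwargs=%r)", func.__name__, args, kwargs
            )
            raise  # surface it to the caller, too
    return wrapper


@surface_failures
def send_email(email_id: str) -> None:
    # Stand-in for the real send path.
    raise RuntimeError(f"could not render email {email_id}")
```

The decorator itself is trivial; the hard part is the organizational discipline of putting it (or its moral equivalent) on every path that can fail.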
As it stands, Buttondown is…kind of bad at most of these! Which is not to say it’s unstable so much as it’s fragile. I still have pangs of terror whenever I deploy; I still sweat bullets whenever I get a bug report. (Though I love the bug reports, I really do; I just wish I didn’t need them at all.)
This is also perhaps the primordial bugaboo of the entire app: the failure mode of a malformed email is irreversible. Once you send an email with a typo or a formatting error or a broken link, it’s out there in the SMTP ether, unclean and unrepentant.
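If I were going to bolt a guardrail onto that, the obvious place is a check right before the irreversible step. Something in this spirit, as a hypothetical sketch rather than Buttondown’s actual send path: it can’t catch a typo, but it can catch a dead link before the email leaves the building.

```python
import re
import urllib.request


def broken_links(html_body: str, timeout: float = 5.0) -> list[str]:
    """Return any hrefs in the email body that don't respond successfully.

    A crude pre-send check: it won't catch typos or broken formatting,
    but it closes off one class of irreversible mistake.
    """
    urls = re.findall(r'href="(https?://[^"]+)"', html_body)
    dead = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if response.status >= 400:
                    dead.append(url)
        except OSError:  # DNS failures, timeouts, 4xx/5xx responses
            dead.append(url)
    return dead


# Hypothetical usage at the send boundary: refuse to send, and bounce the
# problem back to the author instead of out into the SMTP ether.
# if problems := broken_links(email.body):
#     raise ValueError(f"refusing to send; broken links: {problems}")
```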
This is the time of month when I start thinking about what I want to work on for the next thirty days, and my heuristic is usually “what would I wish for if I could wave a magic wand of +2 Technical Debt?” Sometimes it is a feature; sometimes it is more users; sometimes it is an extra couple dozen hours to read the Iliad.
Right now, I think, it is a little more institutional rigor. If codebases are tiny little plots of land, I have been expanding Buttondown’s borders without properly improving its infrastructure. There is a world where I have a couple hundred more frontend tests, an actual performance regression harness, and integrations around deliverability. I think those things will be what I work toward.
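For what it’s worth, the seed of a performance regression harness is small. Here is the kind of thing I am imagining, as a hypothetical sketch using pytest-django and Django’s test client; the paths and budgets are invented for illustration.

```python
import time

import pytest
from django.test import Client

# Rough per-endpoint time budgets, in seconds. Paths and numbers are made up;
# the point is that a slowdown fails loudly in CI instead of arriving later
# as a vague bug report.
BUDGETS = {
    "/": 0.5,
    "/archive/": 1.0,
}


@pytest.mark.django_db
@pytest.mark.parametrize("path,budget", BUDGETS.items())
def test_endpoint_stays_under_budget(path, budget):
    client = Client()
    started = time.monotonic()
    response = client.get(path)
    elapsed = time.monotonic() - started
    assert response.status_code == 200
    assert elapsed < budget, f"{path} took {elapsed:.2f}s (budget: {budget}s)"
```

Wall-clock budgets are a blunt instrument (CI machines are noisy), but even a blunt harness beats hearing from a reader that the archives got slow.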
Happy Sunday.
I hope you accidentally find something neat.