On call
How do you explain on-call to a friend?
I mean, they are your friend — they are smart and kind and patient. When you fumble through the first inexact metaphor (well, it’s kind of like being a doctor, but all your patients are computers, and sometimes the computers get sick) they smile and nod. They ask completely reasonable questions (so, if your stuff goes down then the company is losing money?) to which sometimes the answers are simple and sometimes less so (well, not exactly, but there’s this thing called an SLA — have you heard of five nines?).
Eventually the conversation steers towards the raw mechanics (how often do you get woken up in the middle of the night?, so what’s an example of a problem that you’d be paged for?), to which the answers can either inspire a certain level of existential dread (it’s not the waking that’s so bad as the fear of every ring, every blip and boop of your phone an omen) or a comical banality (sometimes there are too many files on the computer and the computer can’t do anything so we burn the computer alive and replace it with a new one).
At this point, the conversation tends to peter out: perhaps your friend grabs another drink and makes a mental note never to ask someone why they brought their laptop to the bar ever again. Sometimes, though, especially if your friend is in the industry, you will lapse into war story mode, trading anecdotes of the time a fleet of two hundred servers collapsed due to too many heap dumps, or the time someone fat-fingered a provision, or or or. This is always fun.
Anyway.
I think there are a number of useful models for thinking of on-call rotations. The two I like the most, though, are:
On-call rotations internalize and incentivize ownership of your software and processes..
When you are responsible for the negative externalities of a Thing, you are provoked into making the Thing as good as possible. This, I think, does not require elaboration.
Perhaps less self-evident, though; On-call rotations are a process by which we reify operational wisdom and processes.. (Or, phrased by its inverse: on-call rotations are a process by which we systematically reveal and destroy tribal knowledge.)
Your data pipeline goes down at 12pm on a Tuesday with a FooBarIntegrityException; you poke Bob, who wrote the pipeline, and he immediately tells you oh, yeah, that happens every time there’s a full moon, run these three lines and it’ll be fine.
What happens, though, when the BazBarCheckFailure triggers at four in the morning on a Sunday and Bob is in the San Juans working on his fly fishing? (Can you fly fish in the San Juans? Let’s assume, for the purpose of this exercise, that you can.)
- You check the exception to see how actionable the information it contains is. (Does it just say “Job failed?” That’s pretty bad.)
- You check the runbook to see if the exception (and remediation steps) are outlined.
- You check old emails, pages, Slack messages, carrier pigeon missives, etc. to see if anyone had recently run into it, so as to retrace their steps.
- You, at the very last resort (and with much gnashing of teeth) call Bob, who dictates the three-line fix over the phone.
A lot of people might read that list and be tempted to blame Bob for being the sole owner of information, for bringing down the bus factor. But of course it’s not Bob’s fault. Bob, nor anyone else, is not aware of the delta between his accumulated knowledge and yours; he just knows what he knows, and when he runs into the BazBarCheckFailure he doesn’t need to check the runbook because he already knows what to do. This is why on-call is good: it pushes that information up like some kind of forcing function, transforming it into concrete.
(And sure, that concrete gets old over time; runbooks become outdated, exceptions become irrelevant or noisy, the map starts losing the territory. But it’s a process, and processes are good.)
I finished up my first week of on-call this week at Stripe. It wasn’t perfect, but I never got woken up and I got to update a bunch of runbooks and I never had to call Bob.
And I remember the first time I was ever on-call: for Amazon, in 2013, still terrified of the idea of being paid to do what I did and faintly convinced that my first on-call rotation would be the proof the company needed to fire me.
And I remember the last time I was on-call for Amazon, and how it was painless and breezy: not out of luck, but because of years of a smart team and a smart manager that relentlessly prioritized minimizing debt, friction, and operational burden.
Things are good; but it’s nice to be reminded how they can be even better.
Have a good Sunday.
I hope that, if you have to go on-call, it goes just swimmingly.