DevOps principles and practices aim at creating and operating faster, more reliable, more resilient applications and systems. A few recent incidents from my own work illustrate these themes.
Our team does a good job of making sure all customer jobs complete correctly — not just nearly all, but all. One of my responsibilities centers on a specialized machine-learning application that processes legal documents.
It has demanding requirements for reliability, fidelity, and performance, but not latency; several operations are guaranteed to finish only within 24 hours, not the 24 seconds typical of conventional web-oriented applications. And our services know how to restart themselves and report outages, which is essential for the resiliency we require.
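At its simplest, "restart themselves and report outages" is a supervised retry loop around each unit of work. Here is a minimal sketch of that pattern, not our actual implementation; `job`, `report`, and every parameter name are hypothetical stand-ins:

```python
import time


def run_with_restarts(job, max_restarts=3, backoff_seconds=0, report=print):
    """Run `job`, restarting after failures and reporting each outage.

    `job` and `report` are hypothetical stand-ins for a unit of customer
    work and an alerting hook; a real service would persist job state and
    page an operator rather than print.
    """
    for attempt in range(1, max_restarts + 1):
        try:
            return job()
        except Exception as exc:
            report(f"outage on attempt {attempt}: {exc}")
            time.sleep(backoff_seconds)  # a real loop would back off exponentially
    raise RuntimeError(f"job failed after {max_restarts} restarts")
```

Because our guarantees run on a 24-hour clock rather than a 24-second one, a loop like this can afford generous backoff between restarts, which is exactly the latitude described above.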
We started one particular Tuesday with a few alarms, which I … casually dismissed. I had already scheduled partial outages for that day to accommodate hardware maintenance, including blade renewal, memory swaps, and so on. I saw that other worker nodes appeared to be taking up the slack as they normally do, and headed into a day filled with meetings.
Around six hours later, a customer representative asked why jobs for one of his customers hadn’t made progress in a while. About 30 seconds later, I got over my first reaction, that he was just excessively impatient, and promised I’d look into it. Two minutes later, I agreed there was a problem.
Four minutes after that, I sent out the word to apologize for a delay, saying we’d start to have results again within the next hour. Eleven minutes after that, jobs began to advance again. After an additional seven minutes, operators told me that everything looked normal. Within roughly two more hours, all the jobs from the morning completed, and operations were back to normal from my perspective.
My apology was superfluous, in the sense that all customer commitments were met; recall that several of our quality-of-service commitments are on a 24-hour clock. Still, I was sorry, because several jobs took six or seven hours to complete. While that didn’t exceed the guaranteed window, it’s a lot more than the 20 to 40 minutes I expect for those particular jobs.
For end-users, the day was a non-event. Only the most sophisticated of them could have detected any delay, and all the delays stayed within contracted bounds.
The episode sure got my attention though. For more than a year, most of our incidents amounted to delays of under half an hour. They hardly merit labeling as an “outage,” because backup and failover mechanisms worked so well. This affair lasted nearly an entire working day. What did I do wrong on this Tuesday?
The answer turned out to be a lot and a little. Every upset, in my observation, inevitably involves a wealth of detail beyond a binary judgment of “that was the mistake.” At a high level, nothing was wrong: The recovery mechanisms worked, and customers weren’t aware of any problem.
The root cause of my surprises that Tuesday was a small skew in a Puppet script’s assignment of privileges. For reasons that weren’t worth fully investigating, the upgraded servers were configured with slightly different security settings than my application-level scripts expected. I didn’t notice at first because the diagnostics looked like the timeouts and announcements that are normal during reboots. Once I realized in the afternoon that something truly was in error and wasn’t correcting itself, it took only a few minutes to update the configuration, at which point everything began working normally again.
Coincidentally, one of our internal subject matter experts, a colleague who often monitored the customer dashboard, was also tied up all day, so he also didn’t notice the anomalies that he would have caught quickly on nearly any other day.
Incidents, particularly dramatic ones, are always like this — multiple causes are in play.
The week’s excitement wasn’t over. Friday, we went through more of the same, refreshing more hardware. This time the updated Puppet scripts did their part, but application-level configurations that had been updated during the week turned out to be broken: They assumed more about the Puppet scripts than they should have.
In this case, I had to work for a few hours to get the right corrections in place. No one on the customer side cared though; as clumsy as it felt to chase down the errors, we had enough capacity in place from Monday to meet all operational needs.
How could I allow servers to be unavailable for hours in two separate incidents in one week, after months with little more than seconds of downtime? In large part, it was a choice. It was organizationally preferable to test lightly and push forward with the hardware swaps quickly, even though the result was a few hours of disruption and anxiety. The alternatives would all have involved scheduling time in the data center for several weeks or even months later.
Things turned out fine, as far as I was concerned. If the costs and benefits for our operations were different — if we needed more stringent assurances on downtime, for instance — I would have made a different choice.
For me, the week reinforced the importance of staying in touch with people well outside our own small department, having applications that largely take care of their own operations, and practicing roll-forwards often enough to be comfortable with updates, even with alarms going off. Our hardware ended the week fresher than it began, and we identified several weak points in the software to reinforce for more robust responses. It was a good week.