Days in the DevOps Life

Nov 21, 2019 | Test Automation Insights


DevOps principles and practices are all about creating and operating faster, more reliable, more resilient applications and systems. Some recent incidents from my own work illustrate these themes.

Our team does a good job of making sure all customer jobs complete correctly — not just nearly all, but all. One of my responsibilities centers on a specialized machine-learning application having to do with legal documents.

It has demanding requirements for reliability, fidelity, and performance, but not latency; several operations are guaranteed to finish only within 24 hours, not the 24 seconds that might be typical for a conventional web-oriented application. And our services know how to restart themselves and report outages, which is essential for the resiliency we require.

Tardy Tuesday

We started one particular Tuesday with a few alarms, which I … casually dismissed. I had already scheduled partial outages for that day to accommodate hardware maintenance, including blade renewal, memory swaps, and so on. I saw that other worker nodes appeared to be taking up the slack as they normally do, and headed into a day filled with meetings.

Around six hours later, a customer representative asked why jobs for one of his customers hadn’t made progress in a while. It took me about 30 seconds to get past my first reaction, that he was simply being impatient, and promise I’d look into it. Two minutes later, I agreed there was a problem.

Four minutes after that, I sent out word apologizing for the delay and saying we’d start to see results again within the next hour. Eleven minutes after that, jobs began to advance again. After another seven minutes, operators told me everything looked normal. Within roughly two more hours, all of the morning’s jobs had completed, and operations were back to normal from my perspective.

My apology was superfluous in the sense that all customer commitments were met; recall that several of our quality-of-service commitments run on a 24-hour clock. Still, I was sorry, because several jobs took six or seven hours to complete. That didn’t exceed the guaranteed window, but it’s far more than the 20 to 40 minutes I expect for those particular jobs.

For end-users, the day was a non-event. Only the most sophisticated of them could have detected any delay, and all the delays stayed within contracted bounds.

The episode sure got my attention, though. For more than a year, most of our incidents had amounted to delays of under half an hour; they hardly merited the label “outage,” because the backup and failover mechanisms worked so well. This affair lasted nearly an entire working day. What did I do wrong on this Tuesday?

The answer turned out to be a lot and a little. Every upset, in my observation, inevitably involves a wealth of detail beyond a binary judgment of “that was the mistake.” At a high level, nothing was wrong: The recovery mechanisms worked, and customers weren’t aware of any problem.

The root cause of my surprises that Tuesday was a small skew in a Puppet script’s assignment of privileges. For reasons that weren’t worth fully investigating, the upgraded servers were configured with slightly different security settings than my application-level scripts expected. I didn’t notice at first because the diagnostics looked like the timeouts and announcements that are normal during reboots. Once I realized in the afternoon that something truly was wrong and wasn’t correcting itself, it took only a few minutes to update the configuration, at which point everything started working normally again.
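To make that concrete, here is a minimal sketch of the kind of application-level preflight check that would have shortened the confusion: it separates a genuine privilege skew, which never heals on its own, from the transient errors that are normal while a node reboots. Everything in it (the paths, the modes, the function names) is hypothetical, not taken from our actual scripts.

```python
import os
import stat

# Hypothetical example paths and modes, invented for illustration only.
EXPECTED_MODES = {
    "/var/spool/jobs": 0o770,        # directory the worker processes write to
    "/etc/myapp/app.conf": 0o640,    # config file readable by the service group
}

def preflight(expected=EXPECTED_MODES):
    """Return (skews, transient) lists of human-readable findings."""
    skews, transient = [], []
    for path, wanted in expected.items():
        try:
            actual = stat.S_IMODE(os.stat(path).st_mode)
        except FileNotFoundError:
            # A missing path can be ordinary mid-reboot noise; note it,
            # but don't page anyone yet.
            transient.append(f"{path} not present (node still coming up?)")
        except PermissionError:
            # Being unable even to stat the path points at a privilege skew.
            skews.append(f"cannot stat {path}: permission denied")
        else:
            if actual != wanted:
                # A mode mismatch won't fix itself after a reboot; flag it loudly.
                skews.append(f"{path}: mode {oct(actual)}, expected {oct(wanted)}")
    return skews, transient

if __name__ == "__main__":
    skews, transient = preflight()
    for finding in skews:
        print("ALERT:", finding)
    for finding in transient:
        print("note:", finding)
```

A check along these lines, run by the workers themselves, would have turned a quiet six-hour stall into a loud alarm within minutes.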

Coincidentally, one of our internal subject matter experts, a colleague who often monitored the customer dashboard, was also tied up all day, so he also didn’t notice the anomalies that he would have caught quickly on nearly any other day.

Incidents, particularly dramatic ones, are always like this — multiple causes are in play.


Another Alarm

The week’s excitement wasn’t over. On Friday, we went through more of the same, refreshing more hardware. This time the updated Puppet scripts did their part, but application-level configurations that had been updated during the week turned out to be broken: They assumed more about what the Puppet scripts provided than they should have.
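As an illustration of the kind of defensive check that would have caught this sooner, here is a minimal sketch that verifies a few provisioning assumptions before any application-level configuration is applied. The user, group, and binary names in it are invented for illustration; they are not the ones from our environment.

```python
import grp
import pwd
import shutil
import sys

# Hypothetical prerequisites the provisioning (Puppet) layer is assumed
# to have delivered; names are invented for illustration only.
REQUIRED_USERS = ["appsvc"]
REQUIRED_GROUPS = ["appops"]
REQUIRED_BINARIES = ["java", "rsync"]

def missing_prerequisites():
    """List every provisioning assumption that does not actually hold."""
    missing = []
    for name in REQUIRED_USERS:
        try:
            pwd.getpwnam(name)
        except KeyError:
            missing.append(f"user {name!r} not present")
    for name in REQUIRED_GROUPS:
        try:
            grp.getgrnam(name)
        except KeyError:
            missing.append(f"group {name!r} not present")
    for name in REQUIRED_BINARIES:
        if shutil.which(name) is None:
            missing.append(f"binary {name!r} not on PATH")
    return missing

if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("refusing to apply application config:")
        for problem in problems:
            print(" -", problem)
        sys.exit(1)
```

Refusing to proceed when an assumption fails is cheap insurance: the script stops with a clear message instead of half-applying a configuration that the next reboot will expose.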

In this case, I had to work for a few hours to get the right corrections in place. No one on the customer side cared though; as clumsy as it felt to chase down the errors, we had enough capacity in place from Monday to meet all operational needs.

How could I allow servers to be unavailable for hours in two separate incidents in one week, after months with little more than seconds of downtime? In large part, it was a choice. It was organizationally preferable to test lightly and push forward with the hardware swaps quickly, even though the result was a few hours of disruption and anxiety. The alternatives would all have involved scheduling time in the data center for several weeks or even months later.

Things turned out fine, as far as I was concerned. If the costs and benefits for our operations were different — if we needed more stringent assurances on downtime, for instance — I would have made a different choice.

For me, the week reinforced the importance of staying in touch with people well outside our own small department, having applications that largely take care of their own operations, and practicing roll-forwards often enough to be comfortable with updates, even with alarms going off. Our hardware ended the week fresher than it began, and we identified several weak points in the software to reinforce for more robust responses. It was a good week.
