Software tests that fail in TDD

High-quality projects need to practice failing tests deliberately, to help ensure that passing tests pass for the right reasons. That may sound controversial or counter-intuitive, but the four examples below show how the idea embodies basic principles of test-driven development (TDD).

1. TDD best practice

TDD’s key methodology is “Red, Green, Refactor” (RGF). In other words, the very first step of proper programming is a test that fails. Numerous articles elaborate the intentions of RGF; for the moment, just recognize that development only begins with a Red.
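As a minimal sketch of what beginning with a Red looks like, the Python below writes the test first, runs it against a stub to watch it fail, and only then supplies an implementation. The `slugify` function, its expected behavior, and the `run_tests` helper are hypothetical names chosen for illustration; they are not taken from any particular project.

```python
import io
import unittest


def slugify(title):
    # Red: only a stub exists, so the test below has to fail.
    raise NotImplementedError("slugify is not written yet")


class SlugifyTest(unittest.TestCase):
    def test_spaces_become_hyphens(self):
        self.assertEqual(slugify("Hello World"), "hello-world")


def run_tests():
    """Run SlugifyTest quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyTest)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)


red = run_tests()
print("Red step passed?", red.wasSuccessful())


def slugify(title):
    # Green: the simplest implementation that satisfies the test.
    return "-".join(title.lower().split())


green = run_tests()
print("Green step passed?", green.wasSuccessful())
```

Seeing the first run fail is the point: it is evidence that the test can detect the absence of the behavior it is about to drive.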

As I’ve written elsewhere, even this simplest possible model presents at least one severe practical problem: Red never leaves the individual contributor’s desktop. It’s nearly universal practice for Red to have no shared artifact. It’s equally universal, at least from the reports dozens of senior developers have shared with me, to skip Red accidentally on occasion. Even the best programmers do it. The consequence is inevitable: implementations that do something other than what was intended. That’s a true failure.

At this point, I have no remedy more satisfying than moral exhortation. Conscientious programmers need to take Red seriously, at the start of every RGF cycle. Red is essential, even though experience shows that it’s often overlooked, and its neglect always harms the quality of the final implementation.

2. Handling exceptions

An entirely different kind of test failure is a unit test confirming that an implementation correctly handles an error. Test frameworks generally support this kind of test well: It looks like just another verification of a requirement. Instead of verifying, for instance, the “happy path” that 2 + 2 is indeed 4, a distinct test might confirm that 2 + “goose” accurately reports, “The string ‘goose’ cannot be added to the integer ‘2.’”

Test frameworks usually have the technical capability for this requirement. If a framework can verify that 2 + 2 yields 4, it can equally well verify that 2 + “goose” yields a specified error.
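A short sketch with Python's standard `unittest` module shows the two kinds of test side by side. The `add` function and the wording of its error message are assumptions made up for this illustration; a real project would test whatever message its own requirements specify.

```python
import io
import unittest


def add(a, b):
    """Hypothetical adder that reports unsupported operands in plain words."""
    for value in (a, b):
        if not isinstance(value, int):
            raise TypeError(
                f"The {type(value).__name__} {value!r} cannot be added to an integer."
            )
    return a + b


class AddTest(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(add(2, 2), 4)

    def test_rejects_string_operand(self):
        # The test passes only if the specified error is actually raised.
        with self.assertRaisesRegex(TypeError, "'goose' cannot be added"):
            add(2, "goose")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddTest)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print("All tests passed?", result.wasSuccessful())
```

The error-path test reads much like the happy-path one, which is why the hard part is usually getting the requirement written down rather than writing the test.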

The problem is that organizations too rarely specify these requirements. In isolation, the incentives for marketing, product, engineering or other departments are to focus on features (“affordances”) with positive capabilities. Decision-makers don’t often sign up for services based on high-quality exception-handling. The best that can happen, it appears, is that user experience, technical support or another secondary department makes a point of advocating for and documenting requirements having to do with errors. Once those are in writing, engineering and QA generally test these requirements adequately. Teams need to be aware, though, of the importance of testing error-handling, and need to stay alert to its unintended absence.

3. Warning lights

Return, for a moment, to the second and third steps of RGF. TDD teaches that the whole programming sequence should be relatively quick and lightweight; “heavy” development segments into multiple manageable RGF cycles.

Sometimes it happens, though, that a Green or Refactor step doesn’t go as planned. Errors turn up. Progress stalls.

This is important information. These errors in RGF are a symptom of a design or architecture that deserves improvement. Be sensitive to errors that turn up in these stages, as they can be guides to hotspots that might deserve rework — that is, additional RGF cycles.

4. Validating the validators

A fourth and final kind of test failure has to do with false positives in systematic validation: checks that report “all good” when they should not.

We construct quality assurance practices and continuous testing (CT) implementations, deliver software artifacts to them, and then relax when an “all good” result emerges. This is exactly as it should be. It’s the way our system is supposed to work.

One of its frailties, though, is that it leaves us with no immediate evidence that the tests themselves are reliable. One failure mode for tests is to pass more artifacts than they should. They stay Green even when an error is present.

This is especially common for locally customized validators. Suppose a particular system is dependent on a number of XML sources. (The analysis applies equally to JSON, INI or other human-readable formats.) The CT for this system is good enough to include a validator for the XML. Notice that the validator is specific to this system, at least in its configuration. The configuration embodies local rules about the semantics of the XML.

At the product level, what we want is for the validator to pass all the system’s XML. It’s easy, though, to misconfigure such validators so they pass too much. Passing this kind of validator tells us less about the XML than we expect — maybe, in an extreme case, nothing at all.

A good remedy for this vulnerability exists, though. It’s generally easy to automate generation of perturbations or mutations of XML instances into invalid instances, and verify that those result in appropriate error reports. With these supplementary tests in place, we can have confidence not just that the XML passes a test for validity, but that the XML passes a discriminating test for validity.
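One way to sketch this idea in Python uses the standard library's `xml.etree.ElementTree`. Everything specific here is an assumption for illustration: the `<config>`/`<server>` document shape, the rules in `validate`, the deliberately weak `lax_validate`, and the particular mutations. A real system would substitute its own validator and its own local rules.

```python
import copy
import xml.etree.ElementTree as ET

VALID_XML = """<config>
  <server name="alpha" port="8080"/>
  <server name="beta" port="8443"/>
</config>"""


def validate(root):
    """Hypothetical local validator: return error strings (empty means valid)."""
    errors = []
    if root.tag != "config":
        errors.append(f"unexpected root element <{root.tag}>")
    servers = root.findall("server")
    if not servers:
        errors.append("no <server> elements")
    for server in servers:
        if not server.get("name"):
            errors.append("server without a name")
        port = server.get("port", "")
        if not port.isdecimal() or not 0 < int(port) < 65536:
            errors.append(f"invalid port {port!r}")
    return errors


def lax_validate(root):
    """A misconfigured validator that checks only the root element."""
    return [] if root.tag == "config" else ["unexpected root element"]


def mutations(root):
    """Yield (description, mutated copy) pairs, each meant to be invalid."""
    for index in range(len(root.findall("server"))):
        for attribute in ("name", "port"):
            mutant = copy.deepcopy(root)
            del mutant.findall("server")[index].attrib[attribute]
            yield f"server {index}: removed {attribute}", mutant
        mutant = copy.deepcopy(root)
        mutant.findall("server")[index].set("port", "goose")
        yield f"server {index}: non-numeric port", mutant
    mutant = copy.deepcopy(root)
    mutant.tag = "settings"
    yield "renamed root element", mutant


def survivors(validator, root):
    """List the mutations that the validator wrongly accepts."""
    return [text for text, mutant in mutations(root) if not validator(mutant)]


original = ET.fromstring(VALID_XML)
print("Original valid?", validate(original) == [])
print("Accepted by strict validator:", survivors(validate, original))
print("Accepted by lax validator:", survivors(lax_validate, original))
```

Both validators pass the original document, so a Green on the real XML cannot tell them apart. Only the list of surviving mutants shows that the lax one passes too much.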


Software’s purpose is to produce correct results. Careful thinking about different kinds of failure, though, helps bring certainty about correctness.


About the Author

Cameron Laird is an award-winning software developer and author. Cameron participates in several industry support and standards organizations, including voting membership in the Python Software Foundation. A long-time resident of the Texas Gulf Coast, Cameron's favorite applications are for farm automation.
