TDD red green traffic light

Test-driven development (TDD), continuous testing and related practices have many advantages, including more flexibility, easier code maintenance, and a faster development cycle. But they also have a weakness: failure exhibition, or what I call the “Red moral hazard.” This vulnerability is common, but many software professionals are still unaware that they’re causing it. Here’s what you can do about it.

Red, Green, Refactor

First, a bit of context is necessary. “Red, Green, Refactor” is the widely recognized motto of TDD:

  •  Red: Create a test and make it fail
  •  Green: Minimally modify the implementation to pass the test
  •  Refactor: Eliminate duplication and otherwise enhance the style of the implementation

Tutorials and references appropriately focus on the challenges and subtleties of the Green and Refactor steps; these are more time-consuming. The Red step should be straightforward and quick.

Too often, though, programmers entirely skip Red. To do so undercuts the value of Green and Refactor, and therefore all of TDD. Worse, all the possible remedies for the omission have their own problems.

Consider a concrete, simplified example: Imagine a new requirement that passwords be at least nine characters long. Python is an effective pseudocode for this example; exactly the same conceptual challenge arises in such languages as C# and JavaScript.

First, a programmer might write:

def is_long_enough(trial_password: str, minimum: int = 9):
    '''
    The return value should be True for trial_password of
    nine characters or more; otherwise it's False.
    '''

Next, the programmer writes a test:

import unittest

class Tests(unittest.TestCase):
    def test_length(self):
        self.assertTrue(is_long_enough("long-password"))

At this point, the test probably will show something like AssertionError: None is not true. That AssertionError is a good thing — exactly what TDD tells practitioners they need.

A normal TDD sequence might then continue by updating the is_long_enough definition to

def is_long_enough(trial_password: str, minimum: int = 9):
'''
The return value should be True for trial_password of
nine characters or more; otherwise it's False.
'''
return len(trial_password) >= minimum

At this point, the test result goes to OK or Green.

A Red-less example

Imagine a slightly different example now: A programmer begins to write:

import unittest
def is_long_enough(trial_password: str, minimum: int = 9):
    '''
    The return value should be True for trial_password of
    nine characters or more; otherwise it's False.
    '''
    return True

class Tests(unittest.TestCase):
    def test_length(self):
       self.assertTrue(is_long_enough("long-password"))

The programmer is interrupted. After a delay, the programmer returns, runs the test, sees OK and concludes it’s time to move forward.

This example is so simplified that it’s easy to see the error: Return True is wrong, and certainly no substitute for return len(trial_password) >= 9. As obvious as that seems now in isolation, I can testify that mistakes of this sort happen in real-world commercial shops on a daily basis. Even top-notch programmers have a tendency to skip Red until they’ve experienced its consequences a few times for themselves.

Understand what “experience” means: incorrect code becomes part of an application, and everyone involved believes it has been tested. The software includes a weak link–doubly weak because it passes its test. That broken piece is likely to turn up when no one expects a problem, as a mystery that demands expert diagnosis and remedy. Worse: customers are as likely to find the breakage as any employee. Red’s neglect can damage reputation with customers, create a crisis, and make for a problem that’s far harder to solve than if TDD had just been done right in the first place.

A temptation

Even then, starting with Green is a powerful temptation. Part of the point of routines and checklists and habits is to shift the execution from our “executive brain” to muscle memory. When we pause to judge whether Red is truly necessary in a case at hand, we deviate from our routine, and we are more prone to stumble in one step or another.

And the melancholy reality is that Red always remains on the periphery of our routines, because we don’t share Red with our peers. The most experienced, zealous TDD practitioners check in or commit their Green work products, but rarely anything that goes Red. Many organizations practice a culture, in fact, that explicitly expects all check-ins to pass all tests. That means that Red is a transient never seen by other programmers, unless the organization conscientiously practices pairing or mobbing or swarming. There’s no record of the source at the time of Red and no review of the Red test or its details. Red lacks accountability. Any challenge in the day — a well-meaning effort to hit a deadline, or annoyance with the morning’s commute, or distraction by a corporate re-organization — is likely to result in loss of the attention that Red deserves.

In the normal TDD programming world, Red depends on the good intentions and even ethics of individual practitioners, unsupported by any systematic review, measurement or peering. That’s a formula for failure.

And failure is the too-frequent result. In my own observation, programmers at all levels, even at the very top, at least occasionally lose their footing in the vicinity of Red.

A remedy

What’s the solution? I’ve found no entirely satisfying one. As I see it, these are the main possibilities:

  1. Continue as we are, with Red subordinate to Green and Refactor
  2. Relax any continuous integration (CI) rules to accommodate commits that are supposed to fail — of course, this complexifies the CI
  3. Expand tests to cover more cases
  4. Redefine Red as an inverted test that must pass

Possibility 3 is the promise of property-based testing. It solves the case at hand by writing not only test_length, but also

def test_length2(self):
 self.assertFalse(is_long_enough("shrtpswd"))

While this approach certainly has fans, it’s different enough from traditional TDD to demand a separate analysis.

Possibility 4 looks something like this: a first commit along the lines of

import unittest

def is_long_enough(trial_password: str, minimum: int = 9):
    '''
    The return value should be True for trial_password of
    nine characters or more; otherwise it's False.
    '''
    return False

class Tests(unittest.TestCase):
    def test_length(self):
        self.assertFalse(is_long_enough("long-password"))

followed by a correction to

import unittest

def is_long_enough(trial_password: str, minimum: int = 9):
    '''
    The return value should be True for trial_password of
    nine characters or more; otherwise it's False.
    '''
    return len(trial_password) >= minimum

class Tests(unittest.TestCase):
    def test_length(self):
        self.assertTrue(is_long_enough("long-password"))

The advantage here is that both check-ins become part of the permanent history of the source. Both are available for inspection. The change set between them has a strong, clear relation to the Red-Green transition.

It also looks like more work, and it’s correspondingly unpopular with developers. While I’ve experimented with various modifications of the unittest definition to support the Red step better, none of them yet have the polish working teams deserve.

Notice, too, that the permanence of this approach helps even with pairing or other collaborative methodologies. Pair programmers should “skip” Red less often, of course; it’s natural to assume at least one of the pair surely will remember Red. My own experience, though, is that sometimes even a mob of programmers will silently, even unconsciously, skip over Red, perhaps with the shared thought that the Red condition is obvious. A permanent milestone such as a commit to version control seems to be necessary to ensure Red’s reality.

Mind the Red

Nearly universal practice treats Red in a way that makes it easy to skip, or at least shortcut. This damages the overall reliability of TDD. No comprehensive solution exists; the best we can do in the short term is to be aware of the hazard and plan organization-specific workarounds.

All-in-one Test Automation

Cross-Technology | Cross-Device | Cross-Platform

About the Author

Cameron Laird is an award-winning software developer and author. Cameron participates in several industry support and standards organizations, including voting membership in the Python Software Foundation. A long-time resident of the Texas Gulf Coast, Cameron's favorite applications are for farm automation.

You might also like these articles