A while back I was sitting at my desk slinging some code when my boss came by to tell me, “You validate too much.”
“What?” I replied.
“Every function in your classes is validating the data it’s getting. I can understand validating when the data comes into the web server, but for classes such as the ones that you are making, it’s just too much validation,” he said with a stern look.
“Oh, OK,” I said as he walked away. At the time, I really didn’t want to get into an argument, so I just did as he requested. From that point on, none of the functions in my classes validated the data being passed to them. I just trusted that the data was good.
It was a lot to ask.
When it comes to validating data, I am reminded of the Russian proverb Доверяй, но проверяй, which means “trust, but verify.” I’ve been bitten too many times to forgo thorough validation of the data my code is going to process. Still, my boss had a point. Data validation eats cycles and takes time, which adds to the expense of running the code. My boss’s assertion forced me to move beyond the habits formed by my opinions and question whether my validation practices really justified the expense they incurred.
Validate According to the Deployment Unit
So I gave it some thought. What I came up with is that reliable validation depends on the deployment unit. For example, imagine I’m writing a Node.js package, Who Lives There, that takes a street address and returns some profile information about the people living at that address. Of course, for the package to provide the expected service, the submitted information needs to describe a valid address and arrive in the expected format. (See Figure 1.)
Figure 1: A Node.js package is an example of a deployment unit
The package accepts data at an entry point, index.js. The index.js code uses the code in addressValidator.js to validate the address and the code in profileAnalyzer.js to get the information about people living at the address. The profileAnalyzer.js does not need to validate the address because that work is done by addressValidator.js. Both addressValidator.js and profileAnalyzer.js reside in the same deployment unit, the Node.js package. This means that profileAnalyzer.js can have a good deal of trust that index.js, the controller object, has done the address validation already.
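To make the division of labor concrete, here is a minimal single-file sketch of the three modules. Everything in it is an assumption made for illustration: the function names (validateAddress, getProfiles, whoLivesThere), the address fields (street, city, postalCode) and the stand-in lookup are hypothetical, not the actual package code.

```javascript
// --- addressValidator.js (hypothetical contents) ---
// Returns a list of human-readable problems; an empty list means the address is usable.
function validateAddress(address) {
  if (address === null || typeof address !== 'object') {
    return ['address must be an object with street, city and postalCode'];
  }
  const errors = [];
  for (const field of ['street', 'city', 'postalCode']) {
    if (typeof address[field] !== 'string' || address[field].trim() === '') {
      errors.push(`${field} must be a non-empty string`);
    }
  }
  return errors;
}

// --- profileAnalyzer.js (hypothetical contents) ---
// No validation here: this module trusts that index.js already checked the address.
function getProfiles(address) {
  // Stand-in for a real lookup; it only echoes the address back.
  const label = `${address.street}, ${address.city} ${address.postalCode}`;
  return [{ name: 'Sample Resident', address: label }];
}

// --- index.js (hypothetical contents) ---
// The single entry point of the deployment unit, and the only place that validates.
function whoLivesThere(address) {
  const errors = validateAddress(address);
  if (errors.length > 0) {
    throw new TypeError(`Invalid address: ${errors.join('; ')}`);
  }
  return getProfiles(address);
}
```

The point of the layout is that validation runs once, at the boundary of the package, and everything behind that boundary is written on the assumption that it has run.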
Now let’s consider a scenario in which the deployment unit for Who Lives There is a Docker container, as shown below in Figure 2.
Figure 2: A Docker container is an example of a deployment unit
Figure 2 describes a scenario in which the service that provides profile information about the people living at a particular address is encapsulated in a web server running in a Docker container. In this case, address information is submitted as a request, and validation of that address information is performed by code in addressValidator.js when the request is received by server.js, the running instance of the web server. This is the only place in the containerized web application where address validation is performed. Neither routes.js nor profileAnalyzer.js performs address validation. Rather, they trust that such validation has been performed upstream. Is such trust warranted? Yes, because all the code is running within the boundary of the deployment unit, the container.
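The containerized variant can be sketched the same way, this time with Node's built-in http module. Again, this is only an illustration under assumptions: the JSON request body, the field names, the handleBody helper and the single catch-all route are all hypothetical. The server is created but not started, so the sketch can be loaded without binding a port.

```javascript
const http = require('http');

// addressValidator.js (hypothetical): the only validation in the container.
function validateAddress(address) {
  if (address === null || typeof address !== 'object') {
    return ['address must be a JSON object'];
  }
  return ['street', 'city', 'postalCode']
    .filter((f) => typeof address[f] !== 'string' || address[f].trim() === '')
    .map((f) => `${f} must be a non-empty string`);
}

// profileAnalyzer.js (hypothetical): trusts its input completely.
function getProfiles(address) {
  return [{ name: 'Sample Resident', street: address.street }];
}

// routes.js (hypothetical): also trusts that server.js validated upstream.
function route(address) {
  return { status: 200, body: { profiles: getProfiles(address) } };
}

// server.js (hypothetical): validate once, when the request body arrives.
function handleBody(rawBody) {
  let address;
  try {
    address = JSON.parse(rawBody);
  } catch {
    return { status: 400, body: { errors: ['request body must be valid JSON'] } };
  }
  const errors = validateAddress(address);
  if (errors.length > 0) {
    return { status: 400, body: { errors } };
  }
  return route(address);
}

const server = http.createServer((req, res) => {
  let raw = '';
  req.on('data', (chunk) => { raw += chunk; });
  req.on('end', () => {
    const { status, body } = handleBody(raw);
    res.writeHead(status, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(body));
  });
});
// server.listen(3000); // left commented out so the sketch does not bind a port
```

Keeping handleBody separate from the http callback is a design choice made here so the validation boundary can be exercised without sending real requests.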
Granted, in order to avoid doing redundant work, there needs to be an agreed-upon policy among those working on the web application that all data validation is to be performed by the entity accepting the request, in this case, server.js.
However, there is a risk. Should the logic in profileAnalyzer.js change in a way that alters the expected data structure of the address, the validation logic in addressValidator.js will need to be rewritten. There is no magic here. Somehow, the person making the change in profileAnalyzer.js needs to communicate with the person maintaining addressValidator.js so that a corresponding change is made. If addressValidator.js does not change, things can get weird.
Remember, profileAnalyzer.js is always at risk of being exposed to bad data. Without its own validation mechanisms, profileAnalyzer.js will just emit errors that are particular to its operation and that occur so far down in the call stack that those experiencing the error will have little understanding of the nature of the problem. The nice thing about well-written validation logic is that when things go wrong, the error messages usually describe how to fix the problem. Errors emitted outside of validation logic can be cryptic and hard to understand, let alone fix.
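The difference between the two kinds of error is easy to see in a small sketch. The buildLookupKey helper below is hypothetical; it stands in for whatever profileAnalyzer.js does internally with an address. The exact wording of the unvalidated error depends on the Node.js version, so it is printed rather than quoted here.

```javascript
// Hypothetical internal helper of profileAnalyzer.js, written without validation.
function buildLookupKey(address) {
  return `${address.street.toUpperCase()}|${address.postalCode}`;
}

// The same helper behind a check whose message says what to fix.
function buildLookupKeyChecked(address) {
  if (address === null || typeof address !== 'object' || typeof address.street !== 'string') {
    throw new TypeError(
      'address.street is missing; pass an object such as { street: "1 Main St", postalCode: "12345" }'
    );
  }
  return buildLookupKey(address);
}

// Runs fn and returns the message of the error it throws, or null if it succeeds.
function messageOf(fn) {
  try {
    fn();
    return null;
  } catch (err) {
    return err.message;
  }
}

const badAddress = {}; // e.g. the upstream validator was never updated for a new field

// The engine's own complaint about an undefined value, with no hint about addresses.
console.log('unvalidated:', messageOf(() => buildLookupKey(badAddress)));

// A message that names the field and shows the expected shape.
console.log('validated:  ', messageOf(() => buildLookupKeyChecked(badAddress)));
```

Both calls fail, but only the second failure tells the caller what was wrong with the data.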
The Importance of a Data Validation Policy
When it comes to testing data-driven code, such as Who Lives There, understanding the data validation policy in force is an important part of the software development process. For developers, this means knowing where to test happy and sad paths in terms of data validation. In the scenarios above, it makes little sense to write sad-path unit tests for profileAnalyzer.js that check for anything more than general failure when it is given bad data. However, sad-path testing addressValidator.js for more fine-grained error responses does make sense.
For test practitioners who implement functional tests, this means having a clear understanding of where data validation takes place relative to the associated entry point and then executing tests that exercise validation accordingly. Performance tests should also monitor validation activities to make sure that they’re being done efficiently in terms of CPU consumption and time to execute.
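A very rough way to put a number on the cost of validation is to time the validator in a loop. The sketch below uses performance.now() from Node's perf_hooks module; the validator, the sample address and the iteration count are all hypothetical, and a micro-benchmark like this gives only an order-of-magnitude estimate, not a substitute for a real performance test.

```javascript
const { performance } = require('perf_hooks');

// Hypothetical stand-in for addressValidator.js.
function validateAddress(address) {
  if (address === null || typeof address !== 'object') {
    return ['address must be an object'];
  }
  return ['street', 'city', 'postalCode']
    .filter((f) => typeof address[f] !== 'string' || address[f].trim() === '')
    .map((f) => `${f} must be a non-empty string`);
}

// Average wall-clock time per call, in microseconds, over the given number of iterations.
function averageMicros(fn, iterations) {
  const start = performance.now();
  for (let i = 0; i < iterations; i += 1) {
    fn();
  }
  const elapsedMs = performance.now() - start;
  return (elapsedMs * 1000) / iterations;
}

const sample = { street: '1 Main St', city: 'Springfield', postalCode: '12345' };
const cost = averageMicros(() => validateAddress(sample), 100000);
console.log(`validateAddress: about ${cost.toFixed(3)} microseconds per call`);
```

Multiplying a figure like this by the number of places the same check is repeated gives a sense of what redundant validation actually costs.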
Putting It All Together
Data validation is a critical aspect of application activity. If I had a dollar for every bug I had to fix that ended up being about bad data, I’d be well on my way to an all-expenses-paid trip to an island resort. But there is a good argument to be made that a developer can be overzealous when writing validation code. Remember, each if/then statement you write is a programming expense. Over time, they add up. The trick is to use CPU cycles wisely. Thus, creating a data validation policy that is well known to, and easy to follow for, everyone involved in a product’s software development life cycle is a good way to make fault-tolerant code that runs efficiently.