Machine learning (ML) is expanding explosively. ML’s applications are so diverse, though, that a testing approach that makes perfect sense in one situation might be pointless in another.
If you work in the internet of things, legal technology, the financial industry, or several other regulated fields, you need to be careful to distinguish the scopes of software testing and scientific testing for your models — and to do both well.
Machine Learning Oracles
Consider the credit-card industry. Processors invest a great deal in fraud detection and mitigation. Machine learning is one of their techniques: in broad terms, automated detectors are trained to conclude from certain combinations of data that particular charges are likely to be fraudulent. ML appears to be indispensable in this domain because of the volume and complexity of data involved. One article describing such a detector concludes with a 52-arrow flow chart visualizing an ML algorithm; it’s characteristic of expectations in this area that the article’s author says the model is explained “very simply.”
In this technique, the model is trained on a control set. The ML is then, ideally, validated on a different data set, the test set. This is a situation that matches conventional software-testing approaches: Given a fixed control set as an input, does our algorithm provide consistent predictions about the test set? Can we use the data as a regression test to validate the invariance of certain results?
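The regression-test idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the nearest-centroid model, and the names train_centroids and predict are hypothetical stand-ins for a real fraud detector, not anything a processor actually uses.

```python
# Hypothetical toy data: (amount_in_dollars, hour_of_day) -> 1 for fraud, 0 for legitimate.
CONTROL_SET = [
    ((12.0, 14), 0),
    ((25.0, 10), 0),
    ((18.5, 16), 0),
    ((950.0, 3), 1),
    ((1200.0, 2), 1),
    ((875.0, 4), 1),
]
TEST_SET = [(20.0, 13), (1000.0, 3), (15.0, 11), (900.0, 1)]


def train_centroids(labeled_rows):
    """Return the mean feature vector for each label (a nearest-centroid model)."""
    sums, counts = {}, {}
    for features, label in labeled_rows:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def distance(label):
        return sum((f - c) ** 2 for f, c in zip(features, centroids[label]))
    return min(sorted(centroids), key=distance)


def test_predictions_are_stable():
    """Regression test: retraining on the same control set must not change predictions."""
    first = [predict(train_centroids(CONTROL_SET), row) for row in TEST_SET]
    second = [predict(train_centroids(CONTROL_SET), row) for row in TEST_SET]
    assert first == second
    # Pinned results from a reviewed run; a change here signals a behavioral regression.
    assert first == [0, 1, 0, 1]
```

A real model with random initialization would also need a fixed seed before a test like this could be expected to pass reliably.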
The deliverable of this kind of ML is an oracle that learns, from an entire population, characteristics pertinent to all members. ML in other domains acts to distinguish photographs of cute kittens from photographs of terrorists, or to diagnose illnesses based on medical histories, or to predict when transport engines are best overhauled. When new data arrives, the ML improves the “knowledge” of a single central oracle that applies to the entire population.
A second, less-publicized kind of ML acts on individuals. A product in this category might receive data from a particular bank on its check transactions, apply ML, and predict fraud or other problems about checks received by that single bank. There’s no assumption that the oracle for one bank works for another.
Multiple Oracles in Play
This turns out to be a harder testing problem. The product in this case is actually a generalization of the first kind of ML: The software doesn’t just create a single oracle from a control set, but it must be capable of creating a different oracle for each of a universe of control sets. This additional level of abstraction makes testing combinatorially harder.
More than that, testing is more subtle. In the first category, where the outputs are fraud predictions, outputs have a simple score based on the number of frauds correctly detected and missed in the whole population. The quality of the second product has to aggregate the qualities of the oracles produced by the ML for each individual bank.
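One way to picture that aggregation is a sketch like the following. The choice of recall as the per-bank score, the bank names, and the function aggregate_quality are assumptions made for illustration; a real product might weight banks by volume or use a different metric entirely.

```python
from statistics import mean


def recall(predicted, actual):
    """Fraction of true frauds that the oracle flagged; None when there were no frauds."""
    flags_on_true_frauds = [p for p, a in zip(predicted, actual) if a == 1]
    if not flags_on_true_frauds:
        return None
    return sum(flags_on_true_frauds) / len(flags_on_true_frauds)


def aggregate_quality(per_bank_results):
    """Summarize per-bank recall with the mean, the worst score, and the worst bank."""
    scores = {}
    for bank, (predicted, actual) in per_bank_results.items():
        score = recall(predicted, actual)
        if score is not None:
            scores[bank] = score
    worst_bank = min(scores, key=scores.get)
    return {
        "mean": mean(list(scores.values())),
        "min": scores[worst_bank],
        "worst_bank": worst_bank,
    }


# Hypothetical results: (predicted flags, actual frauds) for each bank's own oracle.
results = {
    "bank_a": ([1, 1, 0, 0], [1, 1, 0, 0]),  # both frauds caught
    "bank_b": ([1, 0, 0, 0], [1, 1, 0, 0]),  # one of two frauds caught
    "bank_c": ([0, 0, 0, 1], [1, 1, 0, 0]),  # no frauds caught
}
summary = aggregate_quality(results)
```

Reporting the worst bank alongside the mean matters here: an average can look acceptable while one customer's oracle is useless.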
Another way to express this difficulty is to note the decoupling of the software testing from the scientific testing. Suppose the oracle generated for a particular customer gives bad predictions. It’s possible the software perfectly embodies the required ML, but the ML applies so poorly to this customer that it produces a low-value oracle. It’s equally possible that the ML is sound, but the implementation mishandles an edge case that previous quality assurance didn’t detect.
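The two kinds of testing can be kept apart in code as well. In this sketch, the toy threshold "model," the function names, and the 0.9 quality floor are all hypothetical; the point is only that one check verifies the implementation against a hand-computed answer, while the other asks whether the method suits a given customer's data.

```python
def fit_threshold(amounts, labels):
    """Toy 'ML': the midpoint between the mean legitimate and mean fraudulent amount."""
    legit = [a for a, y in zip(amounts, labels) if y == 0]
    fraud = [a for a, y in zip(amounts, labels) if y == 1]
    return (sum(legit) / len(legit) + sum(fraud) / len(fraud)) / 2


def accuracy(threshold, amounts, labels):
    """Fraction of rows where 'amount above threshold' agrees with the fraud label."""
    hits = sum((a > threshold) == bool(y) for a, y in zip(amounts, labels))
    return hits / len(labels)


def software_test():
    """Software testing: does the code compute what the specification says?"""
    # Hand-computed: legitimate mean is 10, fraudulent mean is 30, midpoint is 20.
    assert fit_threshold([5, 15, 25, 35], [0, 0, 1, 1]) == 20.0


def scientific_test(amounts, labels, holdout_amounts, holdout_labels, floor=0.9):
    """Scientific testing: is the resulting oracle good enough for this customer?"""
    threshold = fit_threshold(amounts, labels)
    return accuracy(threshold, holdout_amounts, holdout_labels) >= floor
```

For a customer whose fraud tracks transaction size, both checks pass. For one whose fraud is unrelated to size, software_test still passes while scientific_test returns False, which points at the method rather than the code.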
While regression testing the second category demands richer data sets, simply multiplying instances is unlikely to be sufficient. Statistics isn’t linear, in the sense that inferences don’t scale on a straight line: To double the quality of a test might, depending on the details involved, take four times or 60 times as many data points, or even more. Separate oracles for separate consumers are an abstraction whose testing can be tamed only by introducing more introspection into the oracles.
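The "four times" case can be made concrete with the standard error of an estimated proportion, which shrinks with the square root of the sample size. The sketch below assumes independent samples and a simple proportion such as a fraud rate; it does not model the harder cases where far more data is needed, and the function names are illustrative.

```python
import math


def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1 - p) / n)


def samples_needed(p, target_error):
    """Smallest n whose standard error for proportion p is at most target_error."""
    return math.ceil(p * (1 - p) / target_error ** 2)


# Quadrupling the data halves the uncertainty for a hypothetical 2% fraud rate.
error_at_10k = standard_error(0.02, 10_000)
error_at_40k = standard_error(0.02, 40_000)
```

Under these assumptions, halving target_error multiplies samples_needed by about four, which is why doubling the data does not double the precision.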
Introspection, in this case, means that a trained oracle not only gives an answer (“This check is unusually likely to be returned for insufficient funds” or “Replace the diaphragm assembly now”) but also provides insight into the facts that led to this conclusion. While the oracle might not make available to the end-user intermediate propositions such as “The check was presented to a small bank out of state” or “Phosphorus levels were more than one standard deviation above average,” it’s crucial for quality assurance purposes that those intermediate propositions be tested, along with the final results. That’s generally the only effective way to control the extra level of ML abstraction.
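A test that exercises intermediate propositions might look like the sketch below. The rule-based explain_check function stands in for a trained oracle purely for illustration; the field names, reason labels, and two-reason decision rule are assumptions, not a description of any real product.

```python
from dataclasses import dataclass, field


@dataclass
class Explanation:
    """A prediction together with the intermediate propositions behind it."""
    prediction: str
    reasons: list = field(default_factory=list)


def explain_check(check, stats):
    """Hypothetical oracle: returns a prediction plus the facts that produced it."""
    reasons = []
    if check["bank_state"] != check["payer_state"]:
        reasons.append("presented_out_of_state")
    if check["amount"] > stats["mean_amount"] + stats["std_amount"]:
        reasons.append("amount_above_one_std")
    prediction = "likely_return" if len(reasons) >= 2 else "likely_clear"
    return Explanation(prediction, reasons)


def test_intermediate_propositions():
    """Assert on the reasons as well as on the final answer."""
    stats = {"mean_amount": 200.0, "std_amount": 50.0}
    check = {"bank_state": "OH", "payer_state": "TX", "amount": 400.0}
    result = explain_check(check, stats)
    assert result.prediction == "likely_return"
    assert "presented_out_of_state" in result.reasons
    assert "amount_above_one_std" in result.reasons
```

Because the reasons are part of the return value, a test can catch an oracle that reaches the right answer for the wrong reasons, which a check on the final result alone would miss.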
Securing Adequate Testing
Conventional ML can be tested much like other software. While the data involved might be particularly large or complex, in principle it’s still possible to synthesize appropriate regression tests for this kind of ML.
ML that produces a customized oracle for each client or end-user, though, demands more specialized and sophisticated testing. Usual coverage and quality scores can be high, yet many functionally significant computations may be left unchecked. A deeper background in data science is usually necessary for adequate testing of this second class of ML.