Breaking it down

What does “testing artificial intelligence” even mean? AI is now often talked about in relation to how it can be used to help with testing, so how do we go about testing a technology employed in testing?

AI effectively means “advanced and difficult,” or even “whatever machines haven’t done yet.” AI is hard, and testing AI is hard. But a little care with basic concepts demonstrates that familiar testing skills apply, even on software’s bleeding edge.

The first and most strategic move when testing AI is to take it one bite at a time: identify, isolate and resolve individual difficulties in AI testing. What’s left is a little bit smaller and easier to tackle.

AI is part of a larger whole

AI applications inevitably have plenty of parts that are not AI. Traditional testing applies to these just fine. Test the user interfaces, networking, storage, configuration, error-handling, arithmetic and so on as you would any other application, and the AI core that remains will be more manageable.

“AI” covers a number of specific technologies, including machine learning, natural language processing, perception and robotics. It’s generally important and valuable to identify and test different AI technologies individually.

A speech recognition application, for instance, probably combines a number of advanced technologies, such as digitization informed by neural networks, natural language processing, generative modeling, supervised training, and more. Savvy testers will find ways to test the individual segments of a speech recognition system. Some parts will involve testing big data at scale. Variations on the histogram method are standard for analog-to-digital conversion, and standardization of test suites for natural language processing is in at least its third decade.

Don’t expect to be an expert in all of these technologies. Your greatest contribution as a tester is to clarify the different components for which you’re responsible and construct realistic, useful test plans for each of them.

One common pattern in AI is that a process has different phases or aspects, which demand different kinds of testing. Often a training or learning phase results in a model that can be applied to yield actionable results. Construction of the model demands one kind of testing – accurate application of the model is a different process, with different testing techniques.

Models tend to be “fuzzy,” meaning that a single set of training data can yield different models that only experts can judge and discriminate. Model validation is often hard to automate and involves considerable human effort. Model use, on the other hand, is more deterministic and easier to test in traditional ways. In all cases, it’s important to distinguish AI phases and craft tests appropriate to each one.
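The two-phase distinction above can be sketched in code. This is a minimal, hypothetical illustration, not any particular framework’s API: once training has produced a fixed model artifact, applying it is deterministic and can be tested with exact assertions, while judging the model itself calls for a looser, threshold-style check.

```python
# Hypothetical example: a "trained model" reduced to fixed weights.
# Applying the model (model use) is deterministic; judging the model
# (model validation) is checked against a threshold, not an exact value.

def apply_model(weights, bias, features):
    """Model use: deterministic scoring with an already-trained model."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else 0

def validation_accuracy(weights, bias, labeled_data):
    """Model validation: fraction of labeled examples classified correctly."""
    hits = sum(apply_model(weights, bias, f) == y for f, y in labeled_data)
    return hits / len(labeled_data)

# Model use: exact, repeatable expectation -- traditional testing applies.
assert apply_model([0.5, -1.0], 0.1, [2.0, 0.5]) == 1

# Model validation: hedge with a threshold, since retraining can shift results.
data = [([2.0, 0.5], 1), ([0.0, 2.0], 0), ([3.0, 0.0], 1)]
assert validation_accuracy([0.5, -1.0], 0.1, data) >= 0.66
```

The design point is the asymmetry: the use-phase test can fail hard on any deviation, while the validation-phase test deliberately tolerates variation between training runs.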

Yet another dimension of model testing focuses on transparency, intelligibility or interpretability. The literature is full of “deep learning” neural nets and other AI applications that produce good results but that no human understands. According to one popular article, “No one really knows how the most advanced algorithms do what they do.”

Whether that’s a problem or not, testing specialists can help in at least a couple of ways:

  • Leverage experience with white-box and black-box testing to clarify which models are transparent and which can only be judged in terms of their outputs
  • Teach implementers that interpretability is a virtue, like testability. Just as it’s often possible to rework naive conventional software to make it more testable, AI applications can often be instrumented and otherwise enhanced to make their models more transparent. Testable and transparent software is better and more reliable software, even if sometimes it requires a bit more initial effort
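When a model can only be judged by its outputs, black-box techniques such as range checks and metamorphic relations still apply. The sketch below uses a toy stand-in scorer purely for illustration; the checks, not the scorer, are the point. Both function names are hypothetical.

```python
# Hypothetical example: testing an opaque model as a black box.
# Even without understanding the model's internals, we can assert
# properties of its outputs, including a simple metamorphic relation
# (surrounding whitespace should not change the score).

def sentiment_score(text):
    # Toy stand-in for an opaque model; imagine a call into a trained network.
    positives = {"good", "great", "excellent"}
    words = text.split()
    if not words:
        return 0.5
    return sum(w.lower() in positives for w in words) / len(words)

def check_black_box(scorer, text):
    score = scorer(text)
    assert 0.0 <= score <= 1.0                  # output stays in a valid range
    assert scorer("  " + text + "  ") == score  # metamorphic: whitespace-invariant
    return score

check_black_box(sentiment_score, "great service and good prices")
```

Checks like these do not explain the model, but they catch a class of regressions without requiring interpretability, and they pressure implementers toward the transparency the article recommends.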


AI and UX

Another challenge of AI testing is that because AI is so advanced, users’ expectations about it are unrefined. Our collective expectations for how such legacy applications as a spreadsheet or web browser should behave have largely converged. But that’s not true for an AI-augmented security surveillance tool. Consumers end up relating to AI applications in surprising, innovative and sometimes dysfunctional ways.

Consider this example: A basic profile form might ask for first and last name. While even this basic interaction embeds plenty of cultural assumptions and pitfalls (people in some countries commonly possess only a single name, the meanings of first and last are reversed in certain societies, and so on), we at least can build on centuries of experience in government censuses and other contexts that range over the possible values for names. AI applications, in contrast, are so new that it’s not always clear what users will do with them, let alone when testing should pass or fail a particular result.
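The name example is worth making concrete: encoding the cultural cases named above as explicit test data keeps the assumptions visible. The validator below is hypothetical, a sketch of one lenient design rather than a recommendation.

```python
# Hypothetical example: a deliberately lenient name parser whose test data
# encodes the cultural cases mentioned in the text (mononyms, accented
# characters, family-name-first conventions).

def parse_name(raw):
    """Accept one or more name parts; single-part (mononym) names are legal."""
    parts = raw.strip().split()
    if not parts:
        raise ValueError("name must not be empty")
    return {"given": parts[0], "rest": " ".join(parts[1:])}

cases = [
    "Sukarno",       # mononym, common in Indonesia
    "José García",   # accented Latin characters
    "王小明",         # family name written first, no spaces between parts
]
for raw in cases:
    parsed = parse_name(raw)
    assert parsed["given"]  # every accepted name yields at least one part
```

For conventional fields like this, the point is that decades of census-style experience tell us which inputs to enumerate; for genuinely novel AI features, no such catalog of cases exists yet.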

One consequence is that AI puts user experience testing at a premium. Whenever possible, AI testing should include a component of observing use by real users, because AI is new enough that all of us are learning what it can do.

Another way to say this is that we have good knowledge about the range, or space of possible inputs, for conventional computing applications, like accounting or presentation or media playback software, and we can shape tests in terms of that knowledge. AI not only involves arcane technologies with meager testing traditions; we don’t even have a firm grip on what users end up doing with the applications. What kinds of inputs might show up in the real world of users? Uncertainty about this question amplifies the difficulties of testing AI.
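When the input space cannot be enumerated in advance, one pragmatic response is fuzz-style testing: feed the application large volumes of randomized input and assert only that it fails gracefully. A minimal sketch, using only the standard library, with a hypothetical classify() standing in for an AI front end:

```python
import random
import string

# Minimal sketch: when realistic inputs cannot be enumerated, fuzzing at
# least confirms the application degrades gracefully instead of crashing.
# classify() is a hypothetical stand-in for an AI-backed text handler.

def classify(text):
    if not isinstance(text, str):
        raise TypeError("expected str")
    return "long" if len(text) > 20 else "short"

def fuzz(fn, trials=200, seed=42):
    rng = random.Random(seed)  # fixed seed: failures are reproducible
    alphabet = string.printable + "éß中💬"
    failures = []
    for _ in range(trials):
        text = "".join(rng.choice(alphabet)
                       for _ in range(rng.randrange(0, 50)))
        try:
            fn(text)
        except Exception as exc:  # any crash on a string input is a bug
            failures.append((text, exc))
    return failures

assert fuzz(classify) == []
```

Fuzzing does not tell us what users will actually do, but it widens coverage of the unknown input space cheaply, and the fixed seed keeps any failure it finds reproducible.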

Put it all together

AI software is still software, and many of our typical testing techniques apply equally well to AI.

The hard parts of AI are hard to test, but we can usually isolate them for testing purposes, cutting the difficulties of testing down to size. Look for the different segments or phases of an AI application, and learn from the specialists for each technology involved. Work with your team not just on testability, but, in the case of AI components, interpretability.

And remember to emphasize observation of how real consumers relate to AI software. It’s likely to surprise you.

If testing still presents problems after you put all these tips in play, at least you’ll be in a position to report those problems cogently and constructively.


About the Author

Cameron Laird is an award-winning software developer and author. Cameron participates in several industry support and standards organizations, including voting membership in the Python Software Foundation. A long-time resident of the Texas Gulf Coast, Cameron's favorite applications are for farm automation.
