Evaluation Metrics to Determine AI Model Performance

Jan 20, 2021 | Programming, Test Automation Insights

If you ask anyone working in a technology company what one thing would help them grow faster and change the world, the answer would be data. It is the new currency. To analyze vast volumes of data and find common patterns that would otherwise be hard for humans to recognize, companies are turning to AI.

AI-based systems can make decisions based on this data far faster than humans. But how do we know the systems are working correctly and won’t have harmful effects when released to end-users? Like other systems, AI-based systems have acceptance criteria, in this case in the form of evaluation metrics. These metrics determine whether the performance of an AI model is at an acceptable level.

There are three commonly used evaluation metrics:

  • Accuracy
  • Precision
  • Recall

Before training the AI model, the team collectively decides on acceptable values for these metrics. The trained model's performance is then judged against those values.

How to calculate evaluation metrics for an AI model

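Each metric is a ratio of four outcome counts:

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)

where: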

  • True positives (TP): The cases where we predicted YES and the actual output was also YES
  • True negatives (TN): The cases where we predicted NO and the actual output was NO
  • False positives (FP): The cases in which we predicted YES and the actual output was NO
  • False negatives (FN): The cases in which we predicted NO and the actual output was YES
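As a minimal sketch, the three metrics can be written as small Python functions over the four counts defined above. The function names are illustrative, and returning 0.0 when a denominator is zero is an assumption made here for convenience, not something the article prescribes.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all predictions that were correct."""
    total = tp + tn + fp + fn
    # Assumption: report 0.0 rather than raise when there are no cases at all.
    return (tp + tn) / total if total else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of predicted YES cases whose actual output was YES."""
    predicted_yes = tp + fp
    # Assumption: report 0.0 when the model never predicted YES.
    return tp / predicted_yes if predicted_yes else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of actual YES cases that the model predicted as YES."""
    actual_yes = tp + fn
    # Assumption: report 0.0 when there were no actual YES cases.
    return tp / actual_yes if actual_yes else 0.0
```

For example, with hypothetical counts of 8 true positives, 5 true negatives, 2 false positives and 1 false negative, accuracy(8, 5, 2, 1) returns 0.8125 and precision(8, 2) returns 0.8.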

Performance example

For example, say we build an AI model to determine if a coffee mug has a crack. Let’s take three coffee mugs and figure out how we evaluate the performance of this AI model.

Coffee mug 1: No crack (Actual output: NO)
Coffee mug 2: Has a crack (Actual output: YES)
Coffee mug 3: Has a design that looks like a crack but has no actual cracks (Actual output: NO)

Our AI model analyzes the above coffee mugs and gives the following predictions:

Coffee mug 1: No crack (Predicted: NO)
Coffee mug 2: Has a crack (Predicted: YES)
Coffee mug 3: Has a crack (Predicted: YES)

In the last case, the coffee mug does not really have a crack, but its design looked like one and confused the AI model into giving an incorrect output: a false positive.

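Let’s apply the evaluation metrics to this example. The model produced one true positive (mug 2), one true negative (mug 1), one false positive (mug 3) and no false negatives, so:

  • Accuracy = (1 + 1) / 3 ≈ 67%
  • Precision = 1 / (1 + 1) = 50%
  • Recall = 1 / (1 + 0) = 100%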
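The tally for the three mugs can also be sketched in a few lines of Python. The list names and the encoding (True for "has a crack") are assumptions chosen for illustration. The values come straight from the example above.

```python
# Ground truth and model predictions for mugs 1-3 (True means "has a crack").
actual = [False, True, False]
predicted = [False, True, True]

pairs = list(zip(actual, predicted))

# Count each outcome type by comparing prediction and actual output.
tp = sum(a and p for a, p in pairs)          # predicted YES, actually YES
tn = sum(not a and not p for a, p in pairs)  # predicted NO, actually NO
fp = sum(p and not a for a, p in pairs)      # predicted YES, actually NO
fn = sum(a and not p for a, p in pairs)      # predicted NO, actually YES

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
# prints: TP=1 TN=1 FP=1 FN=0
print(f"accuracy={accuracy:.1%} precision={precision:.1%} recall={recall:.1%}")
# prints: accuracy=66.7% precision=50.0% recall=100.0%
```

The single false positive from mug 3 is what pulls both accuracy and precision down, while recall stays at 100% because the one mug that really was cracked was caught.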

Before starting the AI model training, the team would have already decided on the acceptable value for each of these metrics.

Say the team decided that accuracy should be >90%, precision should be >90% and recall should be >85%. With accuracy at about 67%, precision at 50% and recall at 100%, the AI model has not met two of the three acceptance criteria: only recall passes.
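A check like this could be automated as a simple acceptance gate. The sketch below is one possible shape, not a prescribed implementation: the dictionary layout and the function name failed_criteria are hypothetical. The metric values are those of the three-mug example (TP = 1, TN = 1, FP = 1, FN = 0).

```python
# Metric values from the three-mug example (TP=1, TN=1, FP=1, FN=0).
metrics = {"accuracy": 2 / 3, "precision": 1 / 2, "recall": 1 / 1}

# Acceptance thresholds the team agreed on before training.
thresholds = {"accuracy": 0.90, "precision": 0.90, "recall": 0.85}


def failed_criteria(metrics: dict, thresholds: dict) -> list:
    """Return the names of metrics that do not exceed their threshold."""
    return [
        name
        for name, minimum in thresholds.items()
        if not metrics[name] > minimum
    ]


failed = failed_criteria(metrics, thresholds)
print(failed)  # prints: ['accuracy', 'precision']
```

An empty list would mean the model meets every criterion. Here two names come back, matching the two unmet criteria described above.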


There are other evaluation metrics to determine AI model performance, such as the receiver operating characteristic (ROC) curve, the area under that curve (AUC) and the F-score. Which metrics are appropriate depends on the type of AI model used, such as regression, classification, clustering or something else.
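Of these, the F-score is the simplest to illustrate because it is built directly from precision and recall. The sketch below uses the standard F-beta formula. The function name f_score is illustrative, beta is assumed to be greater than zero, and returning 0.0 when both inputs are zero is an assumption made here.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall.

    Assumes beta > 0 and non-negative precision and recall.
    """
    if precision == 0.0 and recall == 0.0:
        # Assumption: define the score as 0.0 instead of dividing by zero.
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall from the three-mug example: 50% and 100%.
print(round(f_score(0.5, 1.0), 3))  # prints: 0.667
```

A beta above 1 weights recall more heavily and a beta below 1 weights precision more heavily, which lets a team tune the score to whichever kind of error is more costly for them.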

Using AI doesn’t mean you simply feed data into a model and then have to accept whatever results come out. Testers can determine whether the AI model is working as expected, so that there are no surprise consequences when end-users use the system.
