Accuracy Is Not Always Accurate
Fake news and media manipulation are becoming ever more common, and with the complexity of modern life, assessing what is true is increasingly difficult, particularly when data is involved. Data has always had the ability to mislead – as a student, I counted How to Lie with Statistics among my favorite books. The situation is even more pronounced today, with data influencing more of life through complex AI-based solutions, and commentators, either through ignorance or deliberate intent, misinterpreting the results.
While AI is destined to be a tremendous force for good for individuals, organizations, and industries, there must be a clear commitment to use AI responsibly, with the public able to make an informed opinion about its benefits.
Accuracy Is Suspicious
In this post I want to discuss an important aspect that we must all be conscious of when evaluating AI: How do you fairly determine the performance of AI-based systems? In particular, I want to focus on classification systems which, as the name implies, are designed to classify a given sample into two or more separate groups. AI has shown great promise in this area, outperforming human experts in many fields. For example, is this skin discoloration cancerous? Is this tomato fresh enough for sale? What song is being played in this TV show?
I was prompted to write this article in June after reading an opinion piece in a leading UK newspaper. The author was unashamedly arguing against the use of AI-based facial recognition software by police forces. A key argument was the lack of accuracy. He quoted the results of trials run by the London Metropolitan Police searching for known offenders:
Of 150 people identified by the facial recognition software, only 10 were actual offenders, the other 140 were innocent bystanders. This, the author argued, gave an error rate of over 93%, or an accuracy of less than 7%. On the face of it, that’s a terrible result. Accuracy of less than 7%! The facial recognition system is clearly unreliable and unfit for purpose.
But is it?
Actually, the reality is a lot more complex.
Let’s make a couple of reasonable assumptions about the London trial to illustrate my point. Assume the system was looking for 1,000 known offenders out of the Greater London population of 8 million people. Now let’s suppose that the system scanned 10% of the population – that’s 800,000 people. From that sample, we know it identified 150 possible offenders. This would mean it classified 799,850 people as not being offenders. Even if we assume a worst case in which all remaining 990 offenders were included in the 799,850 people scanned and eliminated by the trial, the error rate would be only around 0.1% (990 divided by 799,850) – giving an accuracy of 99.9%!
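The two competing "accuracy" figures fall straight out of the same confusion matrix. A short sketch in Python, using the hypothetical numbers assumed in this paragraph (800,000 people scanned, 150 flagged, 10 true offenders among them, 990 offenders missed):

```python
# Hypothetical figures from the London trial discussion above.
scanned = 800_000
flagged = 150
true_positives = 10                                    # offenders correctly flagged
false_positives = flagged - true_positives             # 140 innocent people flagged
false_negatives = 990                                  # offenders the system missed
true_negatives = scanned - flagged - false_negatives   # people correctly cleared

# "Precision": of those flagged, how many were actually offenders?
precision = true_positives / flagged

# "Accuracy": of everyone scanned, how many were classified correctly?
accuracy = (true_positives + true_negatives) / scanned

print(f"Precision: {precision:.1%}")   # the "less than 7%" figure
print(f"Accuracy:  {accuracy:.2%}")    # the "about 99.9%" figure
```

The newspaper's "accuracy of less than 7%" is really the precision; the 99.9% figure is the conventional accuracy. Same trial, same numbers, wildly different headlines.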
This is a major problem when evaluating the power of AI classification systems in such situations. Is the “accuracy” of the solution 7% or 99.9%? Both calculations are correct, but neither is particularly helpful in determining the value of this solution. Unfortunately, people with an agenda can and do exploit this confusion to mislead, either accidentally or knowingly. So whenever you see “accuracy” quoted in an article related to AI, look very closely at the motivations of the author.
To understand why “accuracy” is such a poor measure of performance, consider the detection of lung cancer. In the UK, approximately 1 person in every 1,000 is diagnosed with lung cancer annually. Now, I could easily develop an AI lung cancer test that is 99.9% accurate – all the test needs to do is always report a negative result – in 999 cases out of 1,000 it will be totally correct. The system is not helpful; in fact, it would be downright harmful, but it is still 99.9% accurate. This is a common problem – many AI tools are aimed at detecting rare occurrences, be it criminal identification, cancer detection, equipment failure, or more mundane things like targeted advertising.
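The always-negative "test" described above takes only a few lines to write down, which is exactly the point. A sketch, assuming a hypothetical cohort of one million people with the 1-in-1,000 prevalence quoted:

```python
# A "classifier" that always reports a negative result, as described above.
def always_negative(sample):
    return False  # never diagnoses cancer, regardless of the input

# Hypothetical cohort with 1 case per 1,000 people.
population = 1_000_000
cases = population // 1_000        # 1,000 actual sufferers

correct = population - cases       # every healthy person is classified correctly
accuracy = correct / population
false_negative_rate = cases / cases  # every single sufferer is missed

print(f"Accuracy: {accuracy:.1%}")                        # 99.9%
print(f"False negative rate: {false_negative_rate:.0%}")  # 100%
```

A 99.9% accurate test with a 100% false negative rate: accuracy alone tells you almost nothing when the condition you are hunting for is rare.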
When searching for a needle in a haystack, calling everything hay is very accurate, but not very helpful.
If Not Accuracy Then What?
If accuracy is a bad measure, how can you evaluate the value of an AI solution? Let’s revisit our criminal offender detection AI solution. How would a perfect facial recognition system perform? Well, it would successfully identify every offender detected by the cameras, and it would never falsely accuse anyone who is innocent. Unfortunately, such algorithms are never perfect, so some innocent people may get labeled as offenders (false positives), while some offenders may not get detected (false negatives). There is nearly always a trade-off between these two events. By raising the threshold for positively identifying an offender, perhaps by specifying that the person’s full face must be seen and match exactly, the False Positive Rate can be greatly reduced, but this will also increase the likelihood that offenders will be missed, increasing the False Negative Rate. This trade-off is shown in the figure below:
The figure shows an example of the performance of a general AI-based classification system. In most cases such a system can be adjusted to operate at any point along the curve, reducing the False Positive Rate or reducing the False Negative Rate, but not both at the same time. The aim is to create a solution that will offer perfect performance, as indicated by the star. No false negatives and no false positives. However, until such a solution exists, the user will always face a trade-off.
Where do you draw the line in this trade-off? Like the answer to many questions, it depends. What are the consequences of false negatives and false positives? For example, if AI is predicting when to maintain your household boiler, you probably want to bias the algorithm to give a low false positive rate, otherwise you will be spending money on unnecessary maintenance when it might be cheaper to wait until it fails before seeking help. Conversely, if you are trying to detect cancer, you want a very low false negative rate, since missing the diagnosis is likely to have severe consequences. (Note that my own hypothetical algorithm, where every sufferer is classed as healthy, has a False Negative Rate of 100% – so it is clearly dangerous, even though I can still legitimately claim 99.9% accuracy.)
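The threshold trade-off described above can be made concrete with a small simulation. This is purely illustrative: the match-score distributions below are invented (offenders tend to score higher than innocents, but the two distributions overlap), and the population sizes are arbitrary.

```python
import random

random.seed(0)

# Synthetic face-match scores. Offenders tend to score higher than
# innocent passers-by, but the distributions overlap -- so no single
# threshold can separate them perfectly.
offenders = [random.gauss(0.7, 0.15) for _ in range(1_000)]
innocents = [random.gauss(0.3, 0.15) for _ in range(99_000)]

def rates(threshold):
    """False positive and false negative rates at a given match threshold."""
    fn = sum(score < threshold for score in offenders) / len(offenders)
    fp = sum(score >= threshold for score in innocents) / len(innocents)
    return fp, fn

# Sweeping the threshold traces out the trade-off curve: raising it
# lowers the false positive rate but raises the false negative rate.
for t in (0.3, 0.5, 0.7, 0.9):
    fp, fn = rates(t)
    print(f"threshold={t:.1f}  FPR={fp:6.2%}  FNR={fn:6.2%}")
```

Each threshold corresponds to one point on the curve in the figure; the operator chooses where to sit on it based on which kind of error is more costly.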
So where does this leave us? Well, my recommendation is that whenever you see an accuracy or error-rate figure quoted for an AI-based solution, first look very closely at the motivation of the author. Might they have an agenda to promote? Second, look twice at the data. Is there sufficient information for you to calculate the False Positive and False Negative Rates? Which is more important, and where would you strike the balance to maximize economic and societal gain?
With this in mind, have another look at the result of the Metropolitan Police facial recognition trial. Is 10 offenders out of 150 flagged up really such a bad result? You decide.