This is the fourth in a series of articles intended to make Machine Learning more approachable to those without technical training. Prior articles introduce the concept of Machine Learning, show how the process of learning works in general, and describe commonly used algorithms. You can start the series here.
In this installment of the series, we review some common measures and considerations when assessing the effectiveness and value of a classification algorithm.
To get started, let’s assume that an algorithm has been developed on a training set and has generalized well. We now consider how we can measure the performance of the algorithm on a test set. If you didn’t understand what I just said, you need to go back to this installment.
The confusion matrix
Let’s assume that our algorithm is a simple discrete classifier which returns p if the instance is predicted to be part of a class of interest (positive) and n if not (negative). Let’s also assume that, in the actual data in the test set, p’ represents a positive instance and n’ represents a negative instance. Recall, from an earlier article the definitions of True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN):
TP = p AND p’
TN = n AND n’
FP = p AND n’
FN = n AND p’
It is useful to display the total instances of each of these in a matrix, known as a Confusion Matrix (or Error Matrix or Contingency Table), for example:
A number of valuable measures can be calculated from the confusion matrix, as follows:
- Accuracy = (TP + TN)/(P + N). The Accuracy is the proportion of all instances which were predicted correctly.
- Precision = TP/P. The Precision is the proportion of predicted positives which were accurate.
- TP rate or Recall or Sensitivity = TP/P’. The TP rate is the proportion of positive examples which were predicted accurately.
- FP rate = FP/N’. The FP rate is the proportion of negative examples which were incorrectly predicted to be positive.
- Specificity = TN/N’ = 1- FP rate. The specificity is the proportion of negative examples which were predicted correctly.
All of these measures are of interest to the data scientist depending on circumstances. However, of particular interest is the trade-off between the TP rate and the FP rate. For an algorithm to be of value, it needs to correctly classify positive examples (benefits) while ensuring that the number of negatives that are incorrectly classified as positives (costs) is kept to an acceptable level.
Example of a confusion matrix
Let’s go back to our original example from Installment 2. Let’s assume that we have developed an algorithm that predicts which job applications are ‘High Potential’.
The algorithm simply returns ‘HP’ or ‘Not HP’. We run our algorithm through a test set of 500 prior job applications and generate this confusion matrix:
For this algorithm we can therefore calculate:
- Accuracy = 80%. Four-fifths of all test set instances were correctly predicted.
- Precision = 50%. Half of the positive predictions were accurate.
- TP rate = 75%. Three-quarters of the positive examples were correctly predicted.
- FP rate = 19%. About a fifth of the negative examples were incorrectly predicted to be positive.
- Specificity = 81%. About four-fifths of negative examples were correctly predicted.
While a 75% TP rate may be considered to be very good in itself, it is achieved at the expense of a 19% FP rate. This FP rate may be unacceptably high. It is down to human decision makers, working with the data scientist, to determine an acceptable balance of TP rate and FP rate, assuming one can be achieved. It is usually necessary to consider volume as part of this decision. If the company receives 5000 applications, this means that almost 1000 will be incorrectly classified as ‘high potential’, and the company may not have the capacity to deal with this.
ROC graphs
Given that the trade-off between the TP rate and the FP rate is of such interest, it is natural to visualize and compare algorithms by plotting their TP rates and FP rates on a two dimensional graph. Such a graph is called an ROC graph. (Receiver Operating Characteristics, a term that originates from the engineering and radar development during the Second World War).
Using our ‘high potential’ example above, and assuming that the data scientist produced four other algorithms that had a different TP and FP rates, we could use an ROC graph to compare these algorithms, plotted below with Algorithm A as the one discussed in the previous section. We could immediately identify Algorithm C as the best performing algorithm. Algorithm E is a perfectly random algorithm which selects equal rates of true positives and false positives. Algorithm B is better at identifying false positives than true positives.
ROC curves
Now let’s assume that our algorithm is a probabilistic classifier, and returns a probability that the instance falls into a certain class. In this case, a confusion matrix can only be determined based on a ‘cutoff’ probability.
Using our example, we could imagine that the algorithm may have calculated a probability for each application in the test set, and the company may have decided to try a cutoff of 0.8 as the minimum probability that would be considered ‘high potential’.
Thus, such a probabilistic classifier will have different confusion matrices for different cutoff points and so the TP rate and FP rates will change according to the cutoff points being applied.
In this case, it is helpful to draw the ROC curve of the algorithm, that is, to plot how the TP rate and FP rate change relative to each other as different cutoff points are used. As the cutoff point is lowered, the TP rate stays the same or increases.
As a consequence of this the FP rate must also stay the same or increase. This type of curve is known as a Lorenz curve.
An example of what an ROC curve for our probabilistic ‘high potential’ algorithm might look like is given below with the specific cutoff point that gave rise to our earlier example marked with an X.
ROC curves can be used in combination with volume considerations to determine the optimal cutoff point of a probabilistic classifier. If volume is a concern, such as if the organization’s resources are limited, then False Positives will need to be minimized. If not, a higher level of false positives can be accepted as a trade-off for a high TP rate.
What is a ‘good’ ROC curve?
Imagine we developed a completely random probabilistic classifier algorithm which, at each cutoff point, identified true positives and false positives at the same rate. If we were to plot such a classifier on an ROC graph, it would look like a diagonal straight line linking the point (0,0) to the point (100,100). This algorithm is of no value to us as we might as well use blind luck to classify our dataset.
Usually, if we are measuring an algorithm that has been developed on a test set and has generalized well, the algorithm will be significantly better than a random classifier, and its ROC curve will take a convex form, such as the two algorithms A and B illustrated here.
Clearly, the ‘golden zone’ on an ROC graph is the top left hand corner, where there is a high TP rate and a low FP rate. Therefore, data scientists like to see ROC curves that have an apex as far as possible into this corner of the ROC graph.
In this case, a data scientist might consider Algorithm B ‘better’ than Algorithm A based on a comparison of their ROC curves.
Area under the curve
Given the observation that ‘good’ ROC curves apex as far as possible to the top left of an ROC chart, the area underneath an ROC curve can act as a general indicator of the strength of an algorithm. The greater the area the curve occupies ,the stronger the chance that the curve might occupy the ‘golden zone’ in the top left part of the ROC graph.
Therefore the Area Under the Curve, or AUC, is often used as a general indicator of the strength of a probabilistic classifier. Note that the AUC for a random classifier is 50%, so ideally a promising algorithm will show an AUC significantly above this, and usually greater than 80%. Here is an illustration of this.
Note that the AUC is linearly equivalent to the measure known as the Gini coefficient, which is a common macroeconomic measure of wealth distribution. Specifically,
Gini = 2 × AUC — 1.
The AUC performs very well as a general measure of the strength of a probabilistic classifier. However, it should be used with caution. It is entirely possible that an algorithm which produces a high AUC might not perform well at specific points of interest on the ROC graph. An example of two algorithms with similar AUCs is illustrated below, but the performance of these algorithms is very different depending on the point of interest on the ROC chart.
Summary
Summarizing the observations in this installment:
- Discrete classifiers can be assessed using a simple confusion matrix together with measures such as Accuracy, TP rate, FP rate and others. These measures, combined with considerations of volume, can help determine if the algorithm meets operational requirements.
- Probabilistic classifiers can be plotted using an ROC curve. The Area Under the Curve (AUC) is a good general indicator of the predictive utility of the algorithm, and an AUC greater than 80% should be aimed for. However, even with a high AUC, a deeper look at specific points of interest in the ROC graph is necessary, along with volume considerations, to determine if a cutoff exists which meets operational requirements.
Next time, and to close off the series, I will provide a glossary ‘cheat sheet’ of key terms related to Machine Learning. Find it here.