Machine Learning for Beginners: A Comprehensive Guide to Classifier Evaluation Metrics


AI Workshop Press: Choosing the right evaluation metric for your classifier is critical. If you choose poorly, you can end up believing your model performs well when it actually does not.

Recently, an article on TowardsDataScience gave an in-depth introduction to classifier evaluation metrics and the scenarios in which each should be used. The AI Research Institute has compiled its content as follows:

In this article, you'll learn why evaluating a classifier is difficult; why a classifier that appears to have high classification accuracy often does not perform as well as it seems; which metrics are appropriate for evaluating classifiers; when you should use each of these metrics; and how to build a classifier with the high accuracy you expect.


  Contents

  • Why are evaluation metrics so important?

  • Confusion Matrix

  • Precision and Recall

  • F-Score

  • The tradeoff between precision and recall

  • Precision/Recall Curves

  • ROC Curve and ROC AUC Score

  • Summary

  Why are evaluation metrics so important?

In general, evaluating a classifier is much more difficult than evaluating a regression algorithm. A good example is the well-known MNIST dataset, which contains images of handwritten digits from 0 to 9. Suppose you want to build a classifier that tells whether an image shows a 6. An algorithm that simply classifies every input as "not 6" will reach about 90% accuracy on MNIST, because only about 10% of the images in the dataset are 6s. This is a major problem in machine learning, and it is why you need to test your classifier with several evaluation metrics.
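To make this concrete, here is a minimal sketch of such a "never 6" baseline. The original article does not show this code, so the data loading and the names `Never6Classifier` and `y_train_6` are my own assumptions:

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.datasets import fetch_openml
from sklearn.model_selection import cross_val_score

# Hypothetical setup: fetch MNIST and build binary labels for "is this a 6?".
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X_train = mnist["data"][:60000]
y_train = mnist["target"][:60000].astype(np.uint8)
y_train_6 = (y_train == 6)          # True only for images of a 6

class Never6Classifier(BaseEstimator):
    """Trivial baseline that labels every image as 'not 6'."""
    def fit(self, X, y=None):
        return self
    def predict(self, X):
        return np.zeros(len(X), dtype=bool)

# Accuracy comes out around 0.90 even though the model has learned nothing.
print(cross_val_score(Never6Classifier(), X_train, y_train_6,
                      cv=3, scoring="accuracy"))
```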

  Confusion Matrix

First, you should know about the confusion matrix, also called the error matrix. It is a table that describes the performance of a supervised learning model on test data for which the true values are known. Each row of the matrix represents the instances of an actual class and each column represents the instances of a predicted class (or vice versa, depending on the convention). It is called a "confusion matrix" because it makes it easy to see where the system confuses two categories.

The image below shows the output of sklearn's "confusion_matrix()" function on the MNIST dataset:

(Figure: confusion matrix produced by sklearn's confusion_matrix() on the MNIST "not 6" vs "6" task)

Each row represents an actual class, and each column represents a predicted class.

The first row covers the images that are actually "not 6" (the negative class). Of these, 53,459 images were correctly classified as "not 6" (true negatives). The remaining 623 images were incorrectly classified as "6" (false positives).

The second row covers the images that really are "6". Of these, 473 images were incorrectly classified as "not 6" (false negatives), and 5,445 images were correctly classified as "6" (true positives).

Note that a perfect classifier would be 100% correct, meaning it would have only true positives and true negatives.
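For reference, here is a hedged sketch of how a confusion matrix like the one above could be produced, continuing from the variables in the previous snippet. The classifier and the out-of-fold predictions are my reconstruction (the article later mentions an SGD classifier, so SGDClassifier is used here), not the article's exact code:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Continuing from the previous sketch (X_train, y_train_6).
sgd_clf = SGDClassifier(random_state=42)

# Out-of-fold predictions, so every image is predicted by a model that never saw it.
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_6, cv=3)

# Rows are the actual classes ("not 6", "6"); columns are the predicted classes.
print(confusion_matrix(y_train_6, y_train_pred))
```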

  Precision and Recall

A confusion matrix gives you a lot of information about how well your classification model is doing, but there are more concise metrics. One of them is precision: the accuracy of the positive predictions, i.e., the fraction of samples predicted as positive that really are positive, TP / (TP + FP). It usually appears together with recall, the proportion of all actual positive instances that the classifier correctly detects, TP / (TP + FN).

sklearn provides built-in functions for calculating precision and recall:

(Figure: precision and recall computed with sklearn's built-in functions)
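Continuing from the earlier snippets, the calls might look roughly like this (the article's screenshot shows its own code, which may differ):

```python
from sklearn.metrics import precision_score, recall_score

# Continuing from the confusion-matrix sketch (y_train_6, y_train_pred).
print(precision_score(y_train_6, y_train_pred))  # the article reports roughly 0.89
print(recall_score(y_train_6, y_train_pred))     # the article reports roughly 0.92
```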

Now we have better metrics for evaluating our classifier: our model is correct 89% of the time when it predicts that a picture is a "6" (precision), and it detects 92% of the images that actually are a "6" (recall).

Of course, there are better ways to evaluate.

  F-Score

You can combine precision and recall into a single evaluation metric called the F-score (also known as the F1 score). The F1 score is useful if you want to compare two classifiers. It is computed as the harmonic mean of precision and recall, which gives more weight to low values. As a result, a classifier only gets a high F1 score if both precision and recall are high. The F1 score is easy to compute with sklearn.

As the image below shows, our model achieves an F1 score of about 0.9:

(Figure: F1 score computed with sklearn, about 0.9)
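Continuing from the same variables, a short sketch of that computation, with the harmonic-mean formula spelled out in the comments:

```python
from sklearn.metrics import f1_score

# F1 is the harmonic mean of precision and recall:
#   F1 = 2 * precision * recall / (precision + recall)
# Continuing from the earlier sketches (y_train_6, y_train_pred).
print(f1_score(y_train_6, y_train_pred))  # roughly 0.9 for the article's model
```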

However, the F1 score is not a panacea. It favors classifiers whose precision and recall are similar. This is a problem, because sometimes you mainly want high precision and sometimes you mainly want high recall. In practice, increasing precision tends to reduce recall and vice versa. This is called the precision/recall trade-off, which we discuss in the next section.

  The tradeoff between precision and recall

To explain this better, here are some examples of when you want high precision and when you want high recall.

High precision:

If you train a classifier to detect whether videos are suitable for children, you probably want it to have high precision. This means the resulting classifier may reject many videos that are actually suitable for children, but it will not serve you videos containing adult content, so it is the safer choice. (In other words, its precision is high.)

High recall:

If you want to train a classifier to detect people trying to break into a building, you need high recall. The classifier may have only 25% precision (which leads to some false alarms), but as long as it has 99% recall and alerts you almost every time someone tries to break in, it is still a good classifier.

To better understand this trade-off, let's look at how a Stochastic Gradient Descent (SGD) classifier makes its classification decisions on the MNIST dataset. For each image it has to classify, it computes a score from a decision function and assigns the image to the positive class when the score is above a threshold, or to the negative class when the score is below it.
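The article does not include this code, but here is a rough sketch of how you could get at those raw decision scores and apply a threshold of your own with sklearn (variable names are carried over from the earlier snippets):

```python
from sklearn.model_selection import cross_val_predict

# Continuing from the SGD sketch above: ask for raw decision scores instead of
# hard predictions, then apply a threshold of our own choosing.
y_scores = cross_val_predict(sgd_clf, X_train, y_train_6, cv=3,
                             method="decision_function")

threshold = 0                            # SGDClassifier's default threshold
y_pred_default = (y_scores > threshold)

threshold = 50000                        # stricter threshold: precision up, recall down
y_pred_strict = (y_scores > threshold)
```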

The image below shows handwritten digits ordered from the lowest score (left) to the highest score (right). Suppose you have a classifier trained to detect "5"s, and the threshold sits in the middle of the image (at the central arrow). To the right of this threshold you see 4 true positives (images that really are "5"s) and 1 false positive (actually a "6"). With this threshold the precision is 80% (4 out of 5), but it only finds 4 of the 6 actual "5"s in the image, so the recall is 67% (4 out of 6). If you move the threshold to the arrow on the right, precision increases but recall decreases, and vice versa if you move the threshold to the arrow on the left.

(Figure: handwritten digits sorted by decision score from low to high, with three candidate thresholds marked by arrows)

  Precision/Recall Curves

The trade-off between precision and recall can be observed with a precision/recall curve, which lets you see which threshold works best.

(Figure: precision and recall plotted as functions of the decision threshold)

Another approach is to plot precision directly against recall in a single curve:

(Figure: precision plotted against recall)

In the figure above, you can clearly see that once precision rises to around 95%, recall drops off sharply. Based on these two plots, you can choose the threshold that gives the best precision/recall trade-off for your current machine learning task. For example, if you want 85% precision, look at the first plot and pick a threshold of roughly 50,000.
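As a hedged sketch, the two plots above could be reproduced roughly like this with sklearn's precision_recall_curve, using the decision scores y_scores from the previous snippet:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Continuing from the threshold sketch (y_train_6, y_scores).
precisions, recalls, thresholds = precision_recall_curve(y_train_6, y_scores)

# First plot: precision and recall as functions of the decision threshold.
plt.plot(thresholds, precisions[:-1], label="precision")
plt.plot(thresholds, recalls[:-1], label="recall")
plt.xlabel("threshold")
plt.legend()
plt.show()

# Second plot: precision plotted directly against recall.
plt.plot(recalls, precisions)
plt.xlabel("recall")
plt.ylabel("precision")
plt.show()
```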

  ROC Curve and ROC AUC Score

The ROC curve is another tool for evaluating and comparing binary classifiers. It has a lot in common with the precision/recall curve, but it is different: instead of plotting precision against recall, it plots the true positive rate (i.e., recall) against the false positive rate (the proportion of negative instances that are incorrectly classified as positive).

(Figure: ROC curve, with the red diagonal representing a completely random classifier)

Of course, there is a trade-off here as well: the more false positives the classifier produces, the higher the true positive rate becomes. The red line in the middle represents a completely random classifier; a good classifier's curve should stay as far away from it as possible.

By measuring the area under the curve (AUC), the ROC curve provides a way to compare the performance of two classifiers. This area is called the ROC AUC score; a classifier that is 100% correct has a ROC AUC of 1.

A completely random classifier has a ROC AUC of 0.5. The output for our MNIST model is shown below:

(Figure: ROC AUC score for the MNIST model)
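Continuing from the same decision scores, a rough sketch of how the ROC curve and the ROC AUC score could be computed (the article's exact code is not shown, so this is my reconstruction):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Continuing from the threshold sketch (y_train_6, y_scores).
fpr, tpr, thresholds = roc_curve(y_train_6, y_scores)

plt.plot(fpr, tpr, label="SGD classifier")
plt.plot([0, 1], [0, 1], "r--", label="purely random classifier")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate (recall)")
plt.legend()
plt.show()

# 1.0 for a perfect classifier, 0.5 for a completely random one.
print(roc_auc_score(y_train_6, y_scores))
```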

  Summary

From the discussion above, you have learned how to evaluate classifiers and which tools to use. You have also learned how to trade precision off against recall, and how to compare the performance of different classifiers using the ROC curve and the ROC AUC score.

We also saw that a classifier with high precision is not as satisfying as it sounds, because high precision usually comes with low recall.

The next time you hear someone claim that a classifier has 99% precision or accuracy, you will know to ask how it does on the other metrics discussed in this post.

  Resources

  • https://en.wikipedia.org/wiki/Confusion_matrix

  • https://github.com/Donges-Niklas/Classification-Basics/blob/master/Classification_Basics.ipynb

  • https://www.amazon.de/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291/ref=sr_1_1?ie=UTF8&qid=1522746048&sr=8-1&keywords=hands+on+machine+learning

via towardsdatascience

Compiled and organized by Lei Feng.com AI Research Institute.
