cs231n learning record - understanding linear classifiers

Table of contents

Foreword:

Brief introduction:

1. Disadvantages of NN classifier

2. What is a linear classifier

3. The principle of linear classifier

4. Parametric mapping from image to label score

5. Explaining the "b"

6. Think of linear classifiers as template matching

7. Conclusion


Foreword:

        I am studying Stanford University's course "Convolutional Neural Networks for Computer Vision", course code cs231n. The original course page is Stanford University CS231n: Deep Learning for Computer Vision, and there is a Chinese translation of the course notes on Zhihu: Congratulations! CS231n Official Notes Authorized Translation Collection Released - 知乎.

        The main reason for writing this blog is to push myself to finish the entire course and to record my learning journey, so the content will mostly be my own understanding after reading the course notes and consulting other materials. Since I know almost nothing about this field, there will certainly be mistakes and misunderstandings in what I write. If anyone more experienced happens to come across this blog and is willing to point them out, I would be very grateful.

Brief introduction:

        In this blog I plan to focus on how to understand linear classifiers, because this was the hardest part for me to grasp while learning. I did not understand the mathematical principles at all on my first read, so I decided to study it carefully, check references, and write as I read. I also think the order of explanation in the official notes is a bit unfriendly to beginners, so I plan to organize this blog according to the steps by which I came to understand the linear classifier, drawing on other materials and adding some of my own thinking. There may well be gaps or inaccuracies, and I welcome corrections.

1. Disadvantages of NN classifier

        Picking up where the previous post left off: the earlier notes introduced the Nearest Neighbor (NN) and k-Nearest Neighbor (k-NN) classifiers, but both have obvious drawbacks:

        1. The classifier must store the entire training set so it can compare against it at test time, which consumes a lot of storage;

        2. At test time, every test image must be compared against every training image, which is computationally expensive and slow.

        Therefore, if a classifier could leave the "cradle" of the training set once training is finished (even if training itself takes a bit longer), and then classify quickly in actual use, it would greatly reduce the storage and computation costs of using the classifier. The linear classifier solves exactly this problem.

2. What is a linear classifier

        The original notes introduce the linear classifier with the rather abstract phrase "a parameterized mapping from image to label scores". Now that I understand it, I will use a very simple example to illustrate what a linear classifier is.

        First, let's take a look at the picture I put below:

[Figure: a scatter of points in two colors on a 2D plane, separated by a straight line]

        In this plane there are many points of two colors, and we want to separate the two kinds of points. What is the easiest way? Draw a line as shown in the figure (or any other line that does the job) and cut the plane in two: everything on one side of the line is yellow, everything on the other side is blue. Taking this two-dimensional distribution as an example, the point map in the figure is our training data (each point is one sample, and together they form the training set). From this figure we want to determine a straight line that not only separates the two colors well on this training set, but also separates yellow and blue points with the highest possible accuracy on other figures (the test set), assuming that points of the same color are distributed similarly in the space and points of different colors occupy clearly different regions.

        This is linear classification on a two-dimensional plane; in higher-dimensional spaces it works the same way.

       Tip: a word for beginners like me. Don't be intimidated by "high-dimensional space". Although it is hard for a normal brain to picture a space with more than three dimensions, it becomes much easier if you think of it this way: an n-dimensional space is simply the space made up of all n-dimensional vectors, just as points make up a line, lines make up a surface, and surfaces make up a solid.

The classification method of a linear classifier is to find, for a given distribution of data, a hyperplane that serves as the decision boundary, so that the classifier's classification error on the data (including the test set) is as small as possible.

Hyperplane: a subspace of dimension n-1 inside an n-dimensional linear space; it divides the space into two disjoint parts. For example, in two-dimensional space a straight line is one-dimensional and divides the plane into two parts; in three-dimensional space a plane is two-dimensional and divides the space into two parts.

Decision boundary: A boundary that can correctly classify samples. The straight line drawn just now between the yellow and blue dots is a decision boundary.
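
        To make this concrete, here is a minimal sketch in Python/NumPy (my own toy example; the values of w, b, and the points are invented): the weight vector w and intercept b define the line w·x + b = 0, and the sign of w·x + b tells us which side of the line a point falls on.

```python
import numpy as np

# Toy decision boundary: w . x + b = 0 (a straight line in 2D).
# The values of w, b, and the points are made up for illustration.
w = np.array([1.0, -1.0])
b = 0.5

def classify(points):
    """Label points by which side of the line w . x + b = 0 they fall on."""
    scores = points @ w + b              # one signed score per point
    return np.where(scores > 0, "yellow", "blue")

points = np.array([[2.0, 0.0],           # lies on the positive side
                   [0.0, 2.0]])          # lies on the negative side
print(classify(points))                  # ['yellow' 'blue']
```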

3. The principle of linear classifier

        We now know how a linear classifier classifies, but we face a problem: if we try to determine a decision boundary in the figure above, we find that there seem to be countless straight lines that all meet the requirement.

[Figure: many different straight lines, each of which separates the two classes of points]

But the final classifier can have only one decision boundary, so which one should we pick? In fact, some of these decision boundaries are better than others. Let us draw two decision boundaries that both fit the training set, and then add a few new points to the picture as test data:

[Figures: two candidate decision boundaries drawn on the training points; the same boundaries after a few test points are added]

        With this comparison, the better boundary is obvious. Here I quote from another blog, The working principle of the linear SVM classifier_圻子-的博客-CSDN Blog:

         We cannot guarantee that a decision boundary will perform well on an unseen data set (the test set). For the existing dataset, suppose we have two possible decision boundaries, B1 and B2.

[Figure from the quoted blog: two decision boundaries B1 and B2, with parallel hyperplanes b11 and b12 marking the margin of B1]

        We can translate the decision boundary B1 toward both sides until it touches the square and the circle closest to it, forming two new hyperplanes b11 and b12; we then move the original boundary to the middle of b11 and b12 so that its distance to b11 equals its distance to b12. The distance between b11 and b12 is called the margin of decision boundary B1, usually denoted d.

        It can be seen that a decision boundary with a larger margin has a lower generalization error. If the margin is small, any slight perturbation of the data can have a large impact on the decision boundary; a small margin corresponds to a model that performs well on the training set but poorly on the test set, i.e. the phenomenon of "overfitting". So when looking for the decision boundary, the larger the margin, the better.
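
        As a small sketch of the margin idea (again my own example with made-up numbers): the distance from a point x to the boundary w·x + b = 0 is |w·x + b| / ||w||, and the training point closest to the boundary determines how large the margin can be.

```python
import numpy as np

# Toy boundary w . x + b = 0 with made-up parameters.
w = np.array([1.0, -1.0])
b = 0.5

def distance_to_boundary(points):
    """Unsigned distance from each point to the hyperplane w . x + b = 0."""
    return np.abs(points @ w + b) / np.linalg.norm(w)

train_points = np.array([[2.0, 0.0],
                         [0.0, 2.0],
                         [1.0, 0.2]])

# The training point closest to the boundary limits the margin
# (once the boundary is centered between the classes, the margin d
# between b11 and b12 is twice this smallest distance).
print(distance_to_boundary(train_points).min())
```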

        After repeated reading and thinking, my conclusion is that whatever kind of linear classifier we use, the ultimate goal is generally the same; what differs is the loss function used during training.

4. Parametric mapping from image to label score

        Next, based on the material I found and my own thinking, I will explain the terms that appear in the notes' phrase "a parameterized mapping from image to label scores".

[Figure from the notes: a small image is stretched into a column vector, multiplied by a weight matrix W and added to a bias vector b to produce one score per class]

        For this picture, assume it consists of only four pixels (a real image has far more), and flatten the two-dimensional matrix into a column vector. Converting a two-dimensional image into a vector like this is a very common technique, convenient for feature extraction and training.
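
        A quick sketch of the flattening step, using a made-up 2x2 "image" with four pixel values:

```python
import numpy as np

# A made-up 2x2 grayscale "image" with four pixel values.
image = np.array([[ 56, 231],
                  [ 24,   2]])

# Stretch the 2D pixel grid into a single vector of length 4.
x = image.reshape(-1)
print(x)          # [ 56 231  24   2]
print(x.shape)    # (4,)
```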

        Then, for each label category we want to choose between, we create a separate vector that acts as the exclusive classifier for that class, and finally we stack all of these classifiers together into a matrix, called the weight matrix. You may notice that the number of columns of the weight matrix in the figure above and the dimension of the image vector happen to be the same, both 4. Is that a coincidence? No. Anyone who has studied linear algebra will say they must match, otherwise the matrix multiplication in the figure could not be carried out. True, but there is a more fundamental reason.
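
        Here is a minimal sketch of that mapping with 4 pixels and 3 classes, as in the figure; the numbers in W and b are purely illustrative. Each row of W is one class's classifier, and f(x) = Wx + b yields one score per class.

```python
import numpy as np

x = np.array([56.0, 231.0, 24.0, 2.0])    # the flattened 4-pixel image

# Weight matrix: one row per class (3 classes), one column per pixel (4 pixels).
# All of these numbers are invented purely for illustration.
W = np.array([[ 0.2, -0.5,  0.1,  2.0],
              [ 1.5,  1.3,  2.1,  0.0],
              [ 0.0,  0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])            # one bias per class

scores = W @ x + b                        # shape (3,): one score per label
print(scores)
print("predicted class index:", scores.argmax())
```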

        Here I will again use binary classification in two-dimensional space to explain the reasoning, which then extends to three dimensions, higher dimensions, and so on.

[Figures: handwritten derivation notes. Left: the decision boundary written as Wx + b = 0. Right: the vector W of the yellow-point classifier drawn pointing toward the region of yellow points]

                        (Here W refers to the parameter vector of a single classifier, not to the whole weight matrix!)

        When I was looking things up, I had a question: in the derivation in the figure, we used only one Wx+b for the two classes, but it was said earlier that a separate classifier should be built for each label, so shouldn't the classification use two Wx+b lines? I pestered a more experienced friend for a long time to understand this, so let me explain it here. Take classifying cats and dogs as an example. If we only want the computer to decide whether a picture is a cat or not, one classifier is enough: the computer judges whether the picture meets the "cat" standard, and if it does not, it simply attaches the label "not a cat" and the job is done. But what if we want the computer to tell whether a picture is a cat or a dog? If the dataset we give the computer contains only pictures of cats and dogs, the result will probably still be accurate: anything judged "not a cat" must be a dog, which is essentially the same as the previous task. But what if a photo of an alien is mixed into the dataset? The computer will decide the photo is not a cat and label it a dog. Is that right? Clearly not. Therefore we train one classifier for cats and one for dogs, each judging whether the picture belongs to its own class; if a picture matches neither, we attach the label "other". This improves the classifier's ability to handle abnormal data. In short, the choice depends on the actual requirements.

        The equation Wx+b=0 in the left-hand note above is the expression of the hyperplane that serves as our decision boundary. So what do the numbers in the rightmost three-dimensional column vector in the earlier figure (the so-called mapping result) actually mean?

        This is a judgment rule that we define ourselves. Treat W as a vector and give the size of Wx+b a meaning: the larger the value of W multiplied by x, the more closely the position of x agrees with the direction of W. For example, the W in the notes on the right side of the figure above is the classifier for the yellow points and points toward the region where the yellow points lie, so multiplying a yellow point by W gives a relatively large result, while multiplying a blue point by W gives a relatively small one. We let the computer use the size of Wx+b as the basis for deciding which label the data should get. So where does this W come from? The correct W does not fall from the sky: the point of repeatedly training on the training data is to keep adjusting the values of W until we find a W for which the yellow points produce results that are large enough and the blue points produce results that are small enough; that W then serves as the classifier for the yellow points. Stacking these classifiers W together gives the scary-looking weight matrix in the earlier figure. At this point, does the phrase "a parameterized mapping from image to label scores" still sound too high-end to approach?
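
        A tiny sketch of this intuition (all numbers invented): a point lying roughly in the direction of W gets a larger score W·x + b than a point on the opposite side.

```python
import numpy as np

# Classifier of the "yellow" points: W points toward the yellow region.
W = np.array([1.0, 2.0])
b = -1.0

yellow_point = np.array([2.0, 3.0])    # roughly aligned with W
blue_point   = np.array([-2.0, -1.0])  # on the opposite side

print(W @ yellow_point + b)   # large score:  7.0
print(W @ blue_point + b)     # small score: -5.0
```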

5. Explaining the "b"

        Now let's explain the "b" in Wx+b. There is nothing scary about this b: it is the same thing as the b in "kx+b", just an intercept, except that this intercept may be a higher-dimensional vector. Here b is called the bias vector. It influences the output scores, but it does not interact with the actual data x_i. Like W, b is not set by hand: it is a parameter learned from the training data together with W.
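
        A minimal sketch of what b does (invented numbers): with the same W, changing b shifts the decision boundary, so the very same point can end up on different sides.

```python
import numpy as np

W = np.array([1.0, 1.0])
x = np.array([0.2, 0.3])       # one fixed point

for b in (-1.0, 0.0, 1.0):
    score = W @ x + b          # same W, different intercepts
    side = "positive" if score > 0 else "negative"
    print(f"b = {b:+.1f} -> score = {score:+.1f} ({side} side)")
```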

6. Think of linear classifiers as template matching

        Here is another way to understand it, copied directly from the notes:

        Another interpretation of the weights W is that each row corresponds to a template (sometimes called a prototype) for one class. The score of an image for each class is obtained by comparing the image with each template using the inner product (also called the dot product) and seeing which template it matches best. From this point of view, the linear classifier is doing template matching with learned templates. Yet another way to think of it is that we are still effectively doing Nearest Neighbor, except that instead of comparing against all of the training images, we use only one image per class (and this image is learned rather than taken from the training set), and we use the (negative) inner product as the distance instead of the L1 or L2 distance.

————————————————————————————————————————

Shown here (in the notes) is an example of the weights learned with CIFAR-10 as the training set. Note that, as expected, the ship template contains a lot of blue pixels; if an image shows a ship sailing on the sea, this template will give it a high score through the inner product.
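
        A small sketch of the template-matching view (the templates here are invented): scoring an image just means taking its inner product with each class template and picking the best match.

```python
import numpy as np

x = np.array([56.0, 231.0, 24.0, 2.0])        # flattened image

# One learned "template" per class (rows of W); values invented here.
templates = {
    "cat":  np.array([ 0.2, -0.5,  0.1,  2.0]),
    "dog":  np.array([ 1.5,  1.3,  2.1,  0.0]),
    "ship": np.array([ 0.0,  0.25, 0.2, -0.3]),
}

# Template matching: the class whose template has the largest inner product
# with the image wins (equivalently, the smallest negative inner product
# if we treat -<template, x> as a distance).
scores = {name: t @ x for name, t in templates.items()}
print(max(scores, key=scores.get))            # best-matching template
```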

7. Conclusion

       While writing this blog I frequently referred to The working principle of the linear SVM classifier_圻子-的博客-CSDN Blog. It is very detailed and saved me a lot of time.

        By rights my second post should have followed "Image Classification (part 1)" with part 2, but I wanted to use the blog as an opportunity to take a serious look at linear classification, which I had only a hazy grasp of before, so I changed the order. Although this article is about the same length as my previous one, it took several times longer to write, because while writing I found my understanding of linear classifiers was not solid enough and I spent a great deal of time searching for and comparing references (I feel I have browsed hundreds of web pages for these few thousand words). Of course, the reward for all that time is a much deeper understanding of this part of the notes. My understanding of the SVM and Softmax loss functions is still not sufficient, but the basic principles are easy to follow once you know the formulas.

        Since I have hardly touched the cs231n code yet, I have decided to tackle the big project of assignment1 after I post "Image Classification (part 2)", and then write up the code and the process. Wish me luck.

Original post: blog.csdn.net/qq_62861466/article/details/126330526