Machine Learning Notes (1): Principles of the Perceptron Algorithm

These notes emphasize geometric intuition and the motivation behind the perceptron algorithm. Due to space limits, only the general setting and the way the perceptron's loss function works are discussed here. The kernel perceptron and the dual form of the perceptron will be covered in a separate article. Implementation code will not appear here either; a dedicated article will describe how to implement the perceptron and will include several small classification cases.


The perceptron is a classic model in neural networks: although it has only a single layer, it already contains the idea of forward propagation. In essence, the perceptron is the mapping function \(sign(w \cdot x + b)\): feed a data point in, obtain an output value, and compare the sign of the output with the sign of the original label to judge how well the model has been trained.

1. What is a perceptron?

Following the description above, the mathematical model of the perceptron is \(sign(w \cdot x_i + b)\), where \(w = (w_1, w_2, ..., w_n)\) is the weight vector and \(x_i = (x_i^{(1)}, x_i^{(2)}, ..., x_i^{(n)})\) is a sample point. As the notation suggests, both are defined in \(R^n\): a sample point has \(n\) dimensions, so the weight vector must also have \(n\) dimensions to match it.

Here is a classic Perceptron model diagram:

Single-layer Perceptron

The leftmost nodes are the components of a sample's feature vector; each is multiplied by its weight and propagated forward to the neuron, which sums them and passes the result through an activation function to decide whether it is activated. The \(x_0\) at the bottom is the bias term, usually written as \(b\) in the literature.
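As a minimal sketch (separate from the full implementation promised for a later article), the forward pass of this model takes only a few lines of Python; the function name `perceptron_output` and the use of NumPy are my own choices for illustration.

```python
import numpy as np

def perceptron_output(w, x, b):
    """Forward pass of a single perceptron: sign(w . x + b)."""
    z = np.dot(w, x) + b       # weighted sum of the inputs plus the bias
    return int(np.sign(z))     # +1, -1, or 0 when z is exactly 0

# Example: weights (2, 2), bias -1, sample (1, 0)
print(perceptron_output(np.array([2, 2]), np.array([1, 0]), -1))   # prints 1
```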


2. What problems can the perceptron solve?

Here is a classic example of using the perceptron model to solve the classic \(OR\) problem.

[Example 1] Take the four sample points \((0,0), (0,1), (1,0), (1,1)\) under \(OR\) logic: the result is true as long as at least one input is not \(0\). Translated into machine learning language, the labels are \((-1, 1, 1, 1)\), where a negative number marks a negative sample and a positive number marks a positive sample.

OR logic

We hope to find a straight line (a classifier) that separates the positive and negative sample points exactly. As described in Example 1, the sample points live in \(R^2\), so the weight vector is \(w = (w_1, w_2)\). To handle the bias term \(b\), it is customary to append a component \(x_i^{(0)} = -1\) to each sample so that \(\sum\limits_{i=1}^{n} w_i x_i + b\) can be written directly as \(\sum\limits_{i=0}^{n} w_i x_i\). The weight update rule used below is \(w_i \leftarrow w_i - \eta (y - t) x_i\), where \(y\) is the perceptron's output, \(t\) is the target label and \(\eta\) is the learning rate; an explanation of this rule is given later.

Set the initial weights to \(w = (0, 0)\) and \(b = 0\), take the learning rate \(\eta = 1\), and agree that \(sign(0) = 0\), an output of \(0\) being counted as a misclassification. Now substitute the sample points one by one:

  1. Substituting \((0,0)\) (label \(-1\)): \(sign(w_1 x^{(1)} + w_2 x^{(2)} + b \cdot x^{(0)}) = sign(0*0 + 0*0 + 0*(-1)) = 0\), so the point is not recognized as negative; this is an error.

    • Update weights

      \(b = 0 - 1 * (0 - (-1)) * (-1) = 1\)

      \(w_1 = 0 - 1 * (0 - (-1)) * 0 = 0\)

      \(w_2 = 0 - 1 * (0 - (-1)) * 0 = 0\)

  2. Substituting \((0,1)\) (label \(+1\)): \(sign(0*0 + 0*1 + 1*(-1)) = -1\), so the point is classified as negative; this is an error.

    • Update weights

      \(b = 1 - 1 * ((-1) - 1) * (-1) = -1\)

      \(w_1 = 0 - 1 * ((-1) - 1) * 0 = 0\)

      \(w_2 = 0 - 1 * ((-1) - 1) * 1 = 2\)

  3. Substituting \((1,0)\) (label \(+1\)): \(sign(0*1 + 2*0 + (-1)*(-1)) = 1\), classified as positive; correct.

  4. Substituting \((1,1)\) (label \(+1\)): \(sign(0*1 + 2*1 + (-1)*(-1)) = sign(3) = 1\), classified as positive; correct.

  5. Second pass, substituting \((0,0)\) (label \(-1\)): \(sign(0*0 + 2*0 + (-1)*(-1)) = 1\), classified as positive; this is an error.

    • Update weights

      \(b = -1 - 1*(1-(-1))*(-1)=1\)

      \(w_1 = 0 - 1*(1-(-1))*0=0\)

      \(w_2 = 2 - 1*(1-(-1))*0 = 2\)

  6. Substituting \((0,1)\) (label \(+1\)): \(sign(0*0 + 2*1 + 1*(-1)) = 1\), classified as positive; correct.

The remaining steps are omitted; readers can check them by hand. With the initial weights set to the extreme values \([0, 0, 0]\), learning rate \(1\), and the sample points visited in the order above, the run needs a few more passes before the weights stabilize. The final classifier is \(sign(2x^{(1)} + 2x^{(2)} - 1)\), which readers can verify against the data.
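The trace above can be reproduced with a short script. This is only a sketch of the convention used in this particular example: the bias is treated as a weight on a constant input \(x^{(0)} = -1\), the update is \(w \leftarrow w - \eta (y - t) x\), and \(sign(0) = 0\) counts as an error. The function name `train_or` is my own.

```python
import numpy as np

def sign(z):
    return int(np.sign(z))   # +1, -1, or 0 (an output of 0 counts as a mistake here)

def train_or(eta=1, max_epochs=10):
    # Augmented samples (x1, x2, x0) with the constant bias input x0 = -1
    X = np.array([[0, 0, -1], [0, 1, -1], [1, 0, -1], [1, 1, -1]])
    T = np.array([-1, 1, 1, 1])            # OR labels in {-1, +1}
    w = np.zeros(3)                        # (w1, w2, b), all starting at 0
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(X, T):
            y = sign(np.dot(w, x))
            if y != t:                     # misclassified (includes y == 0)
                w = w - eta * (y - t) * x
                errors += 1
                print(f"update on {x[:2]}: w1={w[0]}, w2={w[1]}, b={w[2]}")
        if errors == 0:                    # a full clean pass: stop
            break
    return w

print(train_or())   # should end at [2. 2. 1.], i.e. the classifier sign(2*x1 + 2*x2 - 1)
```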


3. What problems can the perceptron not solve?

There is a classic piece of history here: the book "Perceptrons" brought this line of neural network research to a halt for roughly twenty years. While the book describes the principle of the perceptron in detail, it also pointed out the model's fatal flaw: it cannot solve the \(XOR\) problem.

\(XOR\), also known as exclusive OR, follows this logic:

| Variable 1 | Variable 2 | XOR |
| --- | --- | --- |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

This is a rather peculiar logic: the result is true if and only if the two variables are unequal.

XOR logic

With a bit of geometric intuition we quickly see that no such hyperplane exists: no single line can accurately separate the red and blue samples.

To summarize:

The perceptron can completely solve linearly separable problems, and only those.

This may sound like a truism, but the idea behind the perceptron is powerful: it later led to SVM, a peak of traditional machine learning. For linearly inseparable samples, one can use a multilayer neural network or the kernel perceptron; in fact, a two-layer arrangement of perceptrons is enough to solve the \(XOR\) problem, as the sketch below shows.
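A minimal sketch of that last claim, with hand-picked rather than learned weights: two hidden perceptrons compute OR and NAND of the inputs, and a third perceptron takes the AND of their outputs, which is exactly XOR. All names and weights here are illustrative choices of mine.

```python
import numpy as np

def perceptron(w, b, x):
    return 1 if np.dot(w, x) + b > 0 else -1     # outputs in {-1, +1}

def xor(x1, x2):
    x = np.array([x1, x2])
    h1 = perceptron(np.array([2, 2]), -1, x)     # OR:   fires if x1 + x2 >= 1
    h2 = perceptron(np.array([-2, -2]), 3, x)    # NAND: fires unless both inputs are 1
    return perceptron(np.array([1, 1]), -1, np.array([h1, h2]))   # AND of h1 and h2

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor(a, b))   # -1, +1, +1, -1, matching the XOR table above
```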


4. How does the perceptron work?

The example above was a bit long-winded, but it shows how the perceptron operates. Its algorithm can be summarized as follows:

  1. Choose initial values \(w_0, b_0\)
  2. Select a data point \((x_i, y_i)\) from the training set
  3. Check whether the model misclassifies this sample point; if it does, update \(w, b\)
  4. Return to step 2 until no point in the training set is misclassified

From the algorithm above we can read off at least the following points (a generic code sketch follows this list):

  1. The initial values are given arbitrarily; setting them to 0 is the simplest choice, but in practice small random values are usually used, as discussed later.
  2. Each update actually uses only a single sample point, no matter how many points the training set contains. This is exactly the \(w_i x_i\)-style computation spelled out in the numerical example above.
  3. The heart of the algorithm lies in the misclassified points, so how misclassification is defined is very important.
  4. The algorithm is iterative and stops only once every sample point is classified correctly. Read the other way around, this means the perceptron algorithm applies only to completely linearly separable data sets. If the data set is not completely linearly separable, the perceptron will never stop, unless stopping conditions are set before training, such as a bound on the number of epochs or on the error (stop once the error drops below some \(\varepsilon\)).
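A generic sketch of steps 1 to 4, including the maximum-epoch safeguard mentioned in point 4. It assumes the misclassification test \(y_i (w \cdot x_i + b) \le 0\) and the update rule \(w \leftarrow w + \eta y_i x_i\), \(b \leftarrow b + \eta y_i\) that are derived in sections 5 and 6 below; the function name and the small random initialization are my own choices.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=1000, seed=0):
    """X: (N, n) array of samples, y: (N,) array of labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])   # small random initial weights
    b = 0.0
    for _ in range(max_epochs):                   # safeguard for non-separable data
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:   # misclassified (or on the boundary)
                w += eta * y_i * x_i              # update rule from section 6
                b += eta * y_i
                mistakes += 1
        if mistakes == 0:                         # every point classified correctly
            break
    return w, b

# Usage on the OR data of Example 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
print(train_perceptron(X, y))
```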

Clearly, how to define the error is the perceptron algorithm's top priority.


5. How to define the error?

A popular definition: an error is a sample for which the value produced by the mapping function \(sign(wx + b)\) differs from the original label, that is, a misclassified sample.

With this definition, the data set splits easily into two parts: correctly classified points and misclassified points. Obviously, what we need is a way to fix the model on the misclassified points without spoiling the points that are already classified correctly.

With the geometric approach, we keep adjusting the hyperplane (the straight line above; "hyperplane" is the general term) until the goal is reached, which works as long as the given data set is completely linearly separable. For data sets in high-dimensional spaces, however, geometric intuition is no longer effective and an algebraic tool must be introduced. That tool is the loss function, and what we rely on it to do is the same as what geometric intuition did: all points correctly classified \(\iff\) the loss function is minimized.

The most intuitive loss function would simply count the misclassified samples: if the sample space contains \(M\) samples and \(E\) is the set of misclassified ones, the loss would be \(\sum\limits_{x_i \in E} 1\), the size of \(E\). The problem is that we have no tool for optimizing this function, that is, for driving it to its minimum. The natural next thought is to compute the distance from every misclassified point to the hyperplane and optimize the sum of those distances instead. Concretely, the idea is this:

  1. Find the formula for the distance from a sample point to the hyperplane
  2. Discard the distances of the correctly classified sample points; they are not our concern
  3. Define the sum of distances over the misclassified points and minimize it

In \(R^n\), the distance from any sample point \(x_0\) to the hyperplane is \(\frac{1}{\|w\|} |w \cdot x_0 + b|\). A misclassified point satisfies \(y_i (w \cdot x_i + b) < 0\), so if \(x_i\) is a misclassified point, its distance to the hyperplane can be written as \(-y_i \frac{1}{\|w\|} (w \cdot x_i + b)\). Summing over all misclassified points, the loss function can be written as \[L(w, b) = -\frac{1}{\|w\|} \sum\limits_{x_i \in E} y_i (w \cdot x_i + b)\] The purpose of the leading minus sign is to make the loss a non-negative quantity, so that convex optimization tools for finding minima can be applied. Clearly, the fewer the misclassified sample points, the smaller \(L(w, b)\), and the closer the misclassified points are to the hyperplane, the smaller \(L(w, b)\).
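A small code sketch of these quantities, assuming the loss is computed exactly as written above (with the \(\frac{1}{\|w\|}\) factor still present; the next paragraph explains why it can be dropped). The function names and the deliberately imperfect separator are my own illustrative choices.

```python
import numpy as np

def distance_to_hyperplane(w, b, x):
    """Geometric distance from the point x to the hyperplane w . x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

def perceptron_loss(w, b, X, y):
    """Sum of the distances of the misclassified points to the hyperplane."""
    margins = y * (X @ w + b)          # y_i (w . x_i + b) for every sample
    misclassified = margins < 0        # the set E
    return -np.sum(margins[misclassified]) / np.linalg.norm(w)

# OR data with an imperfect separator w = (1, 1), b = 0.5, which misclassifies (0, 0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
print(perceptron_loss(np.array([1.0, 1.0]), 0.5, X, y))   # about 0.354
```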

The factor \(\frac{1}{\|w\|}\) is discarded here because it is a positive coefficient that merely rescales \(L(w, b)\) and therefore does not affect its minimization.


6. How to correct the mistakes?

Recall a bit of calculus: a convex function attains its global minimum where the derivative is 0. From the viewpoint of multivariate functions, the gradient plays the role of the derivative of a single-variable function: moving along the negative gradient decreases the function value fastest, while moving along the positive gradient increases it fastest. So let us find the gradient of the loss.

\[\frac{\partial L}{\partial w} = - \sum\limits_{x_i \in E} y_i x_i\]

\[\frac{\partial L}{\partial b} = - \sum\limits_{x_i \in E} y_i\]

Look at the gradient with respect to \(w\): in essence it runs over all misclassified points, using each label \(y_i\) (\(+1\) or \(-1\)) as a weight. If \(x_1\) is a misclassified point whose original label is positive but which is currently classified as negative, the sign in front of its term indicates compensation in the opposite direction. Therefore \(\sum\limits_{x_i \in E} y_i x_i\) can be understood as a movement direction assembled from all misclassified sample points; moving \(w\) against the gradient, that is, along this direction, reduces the model's error.
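A sketch of the two gradient formulas in code, under the same assumptions as the loss sketch above (labels in \(\{-1, +1\}\), the \(\frac{1}{\|w\|}\) factor dropped); the deliberately bad separator is again only for illustration.

```python
import numpy as np

def perceptron_gradients(w, b, X, y):
    """Gradients of L(w, b) = -sum over E of y_i (w . x_i + b)."""
    mis = y * (X @ w + b) < 0                             # mask of the misclassified set E
    grad_w = -np.sum(y[mis][:, None] * X[mis], axis=0)    # dL/dw = -sum y_i x_i
    grad_b = -np.sum(y[mis])                              # dL/db = -sum y_i
    return grad_w, grad_b

# OR data with a bad separator w = (-1, -1), b = 0.5: every point is misclassified
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
print(perceptron_gradients(np.array([-1.0, -1.0]), 0.5, X, y))   # grad_w = [-2, -2], grad_b = -2
```

A full (batch) gradient step would subtract \(\eta\) times these sums from \(w\) and \(b\); the perceptron instead applies the correction one misclassified point at a time, as described next.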

In Example 1 we select only one sample point at a time. Suppose the chosen point is \((x_i, y_i)\); then the update formula is \(w = w + \eta y_i x_i\), where \(\eta\) is the step size of each update, known as the learning rate. Similarly, the update formula for the bias term is \(b = b + \eta y_i\).
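To make the connection explicit, one gradient-descent step on the loss contributed by the single misclassified point \((x_i, y_i)\), namely \(L_i(w, b) = -y_i (w \cdot x_i + b)\), reproduces exactly this update rule:

\[\frac{\partial L_i}{\partial w} = -y_i x_i, \qquad \frac{\partial L_i}{\partial b} = -y_i\]

\[w \leftarrow w - \eta \frac{\partial L_i}{\partial w} = w + \eta y_i x_i, \qquad b \leftarrow b - \eta \frac{\partial L_i}{\partial b} = b + \eta y_i\]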


7. The \(Novikoff\) Theorem

Let us state the theorem. It is taken from page 42 of the second edition of "Statistical Learning Methods" by Li Hang.

Suppose the training data set \(T = \{(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)\}\) is linearly separable, where \(x_i \in X = R^n\), \(y_i \in Y = \{-1, +1\}\), \(i = 1, 2, ..., N\). Then:

(1) There exists a hyperplane \(\hat{w}_{opt} \cdot \hat{x} = w_{opt} \cdot x + b_{opt} = 0\) satisfying \(\|\hat{w}_{opt}\| = 1\) that separates the training data set completely and correctly; moreover, there exists \(\gamma > 0\) such that for all \(i = 1, 2, ..., N\), \[y_i (\hat{w}_{opt} \cdot \hat{x}_i) = y_i (w_{opt} \cdot x_i + b_{opt}) \ge \gamma\]

(2) Let \(R = \max\limits_{1 \le i \le N} \|\hat{x}_i\|\). Then the number of misclassifications \(k\) made by the perceptron algorithm on the training data set satisfies the inequality \[k \le \left(\frac{R}{\gamma}\right)^2\]

Let us first grasp the meaning of this theorem intuitively. It says that for a linearly separable data set a separating hyperplane can be found; that there is a lower bound \(\gamma > 0\) such that every sample point's margin to that hyperplane is at least \(\gamma\); and that the number of misclassifications is bounded, so after finitely many mistakes the classification becomes correct, which is exactly the convergence of the algorithm.
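As a purely illustrative check (not part of the book's statement), suppose we take the classifier found in Example 1, \(sign(2x^{(1)} + 2x^{(2)} - 1)\), as the separating hyperplane that defines \(\gamma\), and use the theorem's augmentation \(\hat{x} = (x, 1)\), \(\hat{w} = (w, b)\). Then \(R\), \(\gamma\) and the bound can be computed for the OR data:

```python
import numpy as np

# Augmented OR samples (x1, x2, 1) and their labels
X_hat = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])

# Separating hyperplane from Example 1: w = (2, 2), b = -1, normalized to unit norm
w_hat = np.array([2.0, 2.0, -1.0])
w_hat /= np.linalg.norm(w_hat)

gamma = np.min(y * (X_hat @ w_hat))          # smallest margin, about 1/3
R = np.max(np.linalg.norm(X_hat, axis=1))    # largest augmented norm, sqrt(3)
print(gamma, R, (R / gamma) ** 2)            # mistake bound, about 27
```

The bound of roughly 27 comfortably covers the handful of updates actually made in Example 1.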

Proof

(1) Since the data set is linearly separable, a separating hyperplane \(S\) must exist; take its parameters to be \(\hat{w}_{opt}\). Every point \(x\) on the hyperplane satisfies \(\hat{w}_{opt} \cdot \hat{x} = w_{opt} \cdot x + b_{opt} = 0\), and we may normalize so that \(\|\hat{w}_{opt}\| = 1\). Since every sample point of the training data set is correctly classified by \(S\), we have \(y_i (w_{opt} \cdot x_i + b_{opt}) > 0\) for all \(i\). Taking \(\gamma = \min\limits_{i} \{ y_i (w_{opt} \cdot x_i + b_{opt}) \}\) gives the lower bound.

(2) The bound \(k \le \left(\frac{R}{\gamma}\right)^2\) is equivalent to \(k \gamma \le \sqrt{k} R\), which is what we will establish.

Suppose \((x_i, y_i)\) is the sample point misclassified at the \(k\)-th mistake, which means \(y_i (\hat{w}_{k-1} \cdot \hat{x}_i) \le 0\).

By the update rule, \(\hat{w}_k = \hat{w}_{k-1} + \eta y_i \hat{x}_i\).

  • \(\hat{w}_{k} \cdot \hat{w}_{opt} = \hat{w}_{k-1} \cdot \hat{w}_{opt} + \eta y_i \hat{w}_{opt} \cdot \hat{x}_i \ge \hat{w}_{k-1} \cdot \hat{w}_{opt} + \eta \gamma\) .

    Iterating this inequality \(k-1\) more times (taking the initial value \(\hat{w}_0 = 0\)) immediately gives \(\hat{w}_k \cdot \hat{w}_{opt} \ge k \eta \gamma\).

  • Next, take the squared norm of both sides of the update for \(\hat{w}_k\):

    \(\|\hat{w}_k\|^2 = \|\hat{w}_{k-1}\|^2 + 2 \eta y_i \hat{w}_{k-1} \cdot \hat{x}_i + \eta^2 \|\hat{x}_i\|^2\). Note that the middle term is not positive, since \((x_i, y_i)\) is a misclassified point.

    This gives \(\|\hat{w}_k\|^2 \le \|\hat{w}_{k-1}\|^2 + \eta^2 \|\hat{x}_i\|^2 \le \|\hat{w}_{k-1}\|^2 + \eta^2 R^2\).

    Iterating \(k-1\) more times immediately gives \(\|\hat{w}_k\|^2 \le k \eta^2 R^2\).

Combining the two bullet points, we obtain the chain of inequalities

\(k \eta \gamma \le \hat{w}_{k} \cdot \hat{w}_{opt} \le \|\hat{w}_{k}\| \cdot \|\hat{w}_{opt}\| \le \sqrt{k} \eta R\). Cancelling \(\eta\) and rearranging gives exactly \(k \le \left(\frac{R}{\gamma}\right)^2\), which completes the proof.

Here \(R\) is really just a piece of notation: it denotes the largest norm among all the (augmented) sample points, and it arises naturally out of the proof.


As an introduction to the principles of the perceptron, this article is already quite long, so let us stop here.

If there are any errors or omissions, comments and discussion are welcome.



Origin www.cnblogs.com/learn-the-hard-way/p/11810264.html