LDA derivation: linear discriminant dimensionality reduction

LDA dimensionality reduction principle

Earlier, LDA as a classifier was derived in detail: its core is Bayes' formula, i.e., from the known total probability, find the class with the largest posterior (conditional) probability. Here we look at LDA for dimensionality reduction, which is a similar kind of problem. Its principle is:

a. Take the labeled data points and project them onto a lower-dimensional space,

b. such that after projection, points of the same class are "cohesive" and points of different classes are "separated",

much like the coding principle of "high cohesion and low coupling." Note that the points carry labels: both X and y are taken into account, which is different from PCA, where y is ignored. The intuition is easy enough to grasp; the hard part is how to determine the direction of the projection.

Suppose we first compute the centers of the different classes of data (assume two dimensions here, for easy visualization). If the data is projected directly onto the vector w connecting the centers (that is the dimensionality reduction), you will find that the projected classes overlap (picture it in your head; I don't feel like drawing it).

Recently I have come to a somewhat odd view: understanding an abstract concept by "combining numbers and shapes", i.e., explaining with pictures, helps in a way, but it is not always conducive to genuine intuitive understanding.

For example, in many dimensions it is hard to visualize anything, yet the picture in your head may still be stuck at 2 dimensions, which leaves an overall feeling of "abstractness".

In fact,

I think

the best way to understand an abstract concept

is

to go back to the abstraction itself: to understand it from its rules, its internal structure, its definition, "rather than by drawing pictures", because low-dimensional pictures cannot really capture the high-dimensional.

For example, to understand a "vector space", don't draw coordinate axes and arrows and so on at the "technique level"; spend more time understanding its internal structure, its addition and scalar multiplication, at the level of ideas.

Projecting onto the line connecting the centers can therefore make data that was originally linearly separable become inseparable after dimensionality reduction. So, taking the line connecting the class centers as the projection line does not work.
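To make the overlap concrete, here is a minimal numpy sketch (my own synthetic example, not from the original post): two elongated Gaussian clusters are projected onto the direction connecting their centers, and a noticeable fraction of points lands on the wrong side of the midpoint, while another direction separates these particular clusters almost cleanly.

```python
# Two long, thin clusters stretched along x, offset mostly along y (synthetic example).
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[9.0, 0.0],
                [0.0, 0.2]])
X1 = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=300)
X2 = rng.multivariate_normal(mean=[4.0, 1.5], cov=cov, size=300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# Direction connecting the two centers (the "obvious" but poor choice).
w_centers = (mu2 - mu1) / np.linalg.norm(mu2 - mu1)

def overlap(w):
    """Fraction of points on the wrong side of the midpoint after projecting onto w."""
    p1, p2 = X1 @ w, X2 @ w
    mid = (p1.mean() + p2.mean()) / 2
    sign = np.sign(p2.mean() - p1.mean())
    errors = np.sum(sign * (p1 - mid) > 0) + np.sum(sign * (p2 - mid) < 0)
    return errors / (len(p1) + len(p2))

print("overlap when projecting on the center-connecting line:", overlap(w_centers))
# Compare with the y-axis, which separates these particular clusters much better.
print("overlap when projecting on the y-axis:", overlap(np.array([0.0, 1.0])))
```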

So how to determine the best projection line is the problem to solve next.

LDA dimensionality reduction derivation

Fisher's linear dimensionality reduction

  • Different classes: after dimensionality reduction, the differences between classes should be large
  • The same class: after dimensionality reduction, the data points within the class should differ little

That is a good natural-language description; how to translate it into the language of mathematics is actually a very interesting question. It suddenly reminds me of a sentence that struck me:

"Compared with solving a problem, the hardest part is actually finding (defining) the problem."

Enough rambling; let's get to it. Suppose there is a large sample X with K classes, and the center (mean vector) of each class is \(\mu_1, \mu_2, \dots, \mu_K\).

First of all,

Described, "the total amount of the difference" between the global mean and the mean of how different categories , to obtain a matrix

\(S_B = \sum \limits_{k=1}^K(\mu_k - \bar x) (\mu_k - \bar x)^T\)

These are column vectors, mind you, so the result is a matrix. B stands for Between, i.e., between classes.

Then,

describe, within the same class, how far each sample point is from the class "center", again obtaining a matrix:

\(\sum \limits_{i =1}^{n} (x_i - \mu_k)(x_i -\mu_k)^T\)

Note: here n denotes the number of samples in class k, not the total number of samples.

Taking all classes into account, i.e., the "total" within-class scatter of the sample points, we add an outer sum over k:

\(S_W = \sum \limits_{k=1}^K \sum \limits_{i =1}^n (x_i - \mu_k)(x_i -\mu_k)^T\)

W stands for Within, i.e., among sample points of the same class.
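A small numpy sketch of the two scatter matrices just defined, following the post's formulas (the unweighted \(S_B\) over class means, and \(S_W\) summed within each class). The data X and labels y here are just a fabricated toy set so the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))              # toy data, n x d
y = rng.integers(0, 3, size=150)           # 3 classes: 0, 1, 2

x_bar = X.mean(axis=0)                     # global mean
d = X.shape[1]
S_B = np.zeros((d, d))                     # between-class scatter
S_W = np.zeros((d, d))                     # within-class scatter

for k in np.unique(y):
    X_k = X[y == k]
    mu_k = X_k.mean(axis=0)
    diff = (mu_k - x_bar).reshape(-1, 1)   # column vector, as in the post
    S_B += diff @ diff.T
    S_W += (X_k - mu_k).T @ (X_k - mu_k)   # sum of (x_i - mu_k)(x_i - mu_k)^T

print(S_B.shape, S_W.shape)                # both d x d
```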

Goal: what we actually want is to find an optimal projection direction, that is, a vector.

A quick refresher on vectors: a vector has a magnitude (its length, or norm) and a direction; in two dimensions the direction can be measured by the "angle" between the segment from the origin to the point and a coordinate axis.

\(\max \limits_w \ J(w) = \frac {w'S_Bw}{w'S_Ww}\)

\(w'\) means \(w^T\): for a vector/matrix, the prime ' denotes the transpose (for a function it would denote the derivative).
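A tiny sketch of evaluating this objective for candidate directions; a larger \(J(w)\) means a better projection line, which is exactly what the next lines maximize. The helper name `fisher_J` and the 2x2 matrices are my own placeholders, just to make the snippet self-contained.

```python
import numpy as np

S_B = np.array([[4.0, 1.0], [1.0, 2.0]])   # placeholder between-class scatter
S_W = np.array([[3.0, 0.5], [0.5, 1.0]])   # placeholder within-class scatter

def fisher_J(w, S_B, S_W):
    """Between-class over within-class scatter along the direction w."""
    w = np.asarray(w, dtype=float).reshape(-1, 1)
    return float((w.T @ S_B @ w) / (w.T @ S_W @ w))

# Larger J(w) means a better projection direction.
print(fisher_J([1.0, 0.0], S_B, S_W))
print(fisher_J([0.0, 1.0], S_B, S_W))
```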

Different values of w, i.e., different direction vectors, represent projection lines in different directions.

"Large between groups, small within groups" means the numerator of the expression should be as large as possible relative to the denominator; overall, this is a maximization problem of the objective J(w) over the projection line (vector) w.

The vector w could take any value; to make it easier to solve, we can impose a constraint on w:

\(w'S_Ww = 1\)

Why 1 and not 3 or 4? Any constant would do, actually; 1 just looks nicer.

Why is it valid to constrain w? Because \(S_W\) is a known quantity: if \(S_W\) is large, then w must be correspondingly small, so the constraint really does pin w down.

What is the constraint for? It controls the overall range of w, so the search does not wander around without direction.

Turn it into a constrained minimization problem (a maximum is just the negative of a minimum):

\(\min \limits_w \ J(w) = -\frac{1}{2} w'S_Bw\)

\(s.t. \ w'S_Ww = 1\)

Introduce the Lagrangian:

\(L(w) = - \frac {1}{2} w'S_B w + \frac {1}{2} \lambda (w'S_Ww - 1)\)

The \(\frac{1}{2}\) in front is only there so the derivative looks a bit nicer later; it carries no meaning.

Take the partial derivative with respect to w and set it to 0:

\(\nabla_w L = 0 = -S_Bw + \lambda S_Ww\)

(For the matrix differentiation rules, a quick search online will do; they are not hard to follow.)

That is: \(S_Bw = \lambda S_Ww\)

Multiplying both sides by \(S_W^{-1}\) gives:

\(S_W^{-1} S_B w = \lambda w\)

Look at the form, folks: isn't this the eigendecomposition form "\(Ax = \lambda x\)" (a transformation that only stretches a vector along its own direction)? Strictly speaking, though:

if \(S_W^{-1} S_B\) were a symmetric matrix, solving for w would be an ordinary eigendecomposition problem; since it is generally not symmetric, the problem becomes a generalized eigenvector problem.
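As a practical aside, the generalized problem \(S_B w = \lambda S_W w\) can be handed directly to scipy: `scipy.linalg.eigh` accepts a second matrix for exactly this \(Ax = \lambda Bx\) form, with \(S_W\) assumed symmetric positive definite. The matrices below are the same small placeholders as in the earlier snippets.

```python
import numpy as np
from scipy.linalg import eigh

S_B = np.array([[4.0, 1.0], [1.0, 2.0]])
S_W = np.array([[3.0, 0.5], [0.5, 1.0]])

eigvals, eigvecs = eigh(S_B, S_W)   # solves S_B v = lambda S_W v, eigenvalues ascending
w_best = eigvecs[:, -1]             # direction with the largest lambda
print("lambda_max:", eigvals[-1])
print("best projection direction w:", w_best)
```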

Since \(S_B\) is a symmetric positive semi-definite matrix (no rigorous proof here, just the quick argument below; we also assume it is invertible so that its square root can be inverted later), we can eigendecompose \(S_B\).

\(S_B\) has the same outer-product structure as a covariance matrix, so here is a quick proof that a covariance matrix is positive semi-definite.

The covariance matrix is defined as: \(\Sigma = E[(X-\mu)(X-\mu)^T]\)

A simple proof: for any nonzero vector z, we have

\(z^T \Sigma z \ge 0\)

\(z' \Sigma z = z' E[(X-\mu)(X-\mu)^T] z\)

\(= E[z'(X-\mu)(z'(X-\mu))^T]\)

\(= E[(z'(X-\mu))^2] \ge 0\)

The expectation of a square is necessarily non-negative, so the covariance matrix is positive semi-definite, which completes the proof.
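A quick numerical spot-check of this claim (illustrative only), using a sample covariance matrix and a few random vectors z:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
Sigma = np.cov(X, rowvar=False)        # 4 x 4 sample covariance

for _ in range(5):
    z = rng.normal(size=4)
    print(z @ Sigma @ z >= -1e-12)     # non-negative up to floating-point noise
```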

Back to the eigendecomposition of \(S_B\):

\(S_B = Q \Lambda Q^{-1} = Q \Lambda Q'\)

(Matrix property: since Q is orthogonal, \(Q^{-1} = Q'\).)

\(S_B^{\frac {1}{2}} = Q \Lambda^{\frac {1}{2}} Q'\), where \(\Lambda\) is a diagonal matrix, i.e., a matrix with values only on the main diagonal.

Therefore,

\(S_B = S_B^{\frac {1}{2}} S_B^{\frac {1}{2}}\)
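A short sketch of this matrix square root: eigendecompose, take the square roots of the eigenvalues, and check that \(S_B^{\frac {1}{2}} S_B^{\frac {1}{2}}\) recovers \(S_B\). The matrix here is again a made-up symmetric positive definite placeholder.

```python
import numpy as np

S_B = np.array([[4.0, 1.0], [1.0, 2.0]])

eigvals, Q = np.linalg.eigh(S_B)                   # S_B = Q diag(eigvals) Q'
S_B_half = Q @ np.diag(np.sqrt(eigvals)) @ Q.T     # S_B^{1/2} = Q Lambda^{1/2} Q'

# Check: S_B^{1/2} S_B^{1/2} recovers S_B (up to floating-point error).
print(np.allclose(S_B_half @ S_B_half, S_B))
```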

Now define \(v = S_B^{\frac {1}{2}} w\); then the earlier equation \(S_W^{-1} S_B w = \lambda w\) can be written as:

\(S_W^{-1} S_B^{\frac {1}{2}} S_B^{\frac {1}{2}} w = \lambda w\)

Substituting v gives:

\(S_W^{-1} S_B^{\frac {1}{2}} v = \lambda w\)

Then multiply both sides by \(S_B^{\frac {1}{2}}\) to get:

\(S_B^{\frac {1}{2}} S_W^{-1} S_B ^{\frac {1}{2}} v = S_B^{\frac {1}{2}} \lambda w = \lambda v\)

To shorten the notation, let \(A = S_B ^{\frac {1}{2}} S_W ^{- 1} S_B ^{\frac {1}{2}}\). This A is a symmetric matrix (positive definite under our assumptions), so the equation takes a familiar form:

\(Av = \lambda v\), where \(v = S_B ^{\frac {1}{2}} w\)

Why go through all these conversions? To solve for w, of course.

Then eigendecompose A to find the \(\lambda_k\) and \(v_k\) (this yields many pairs \((\lambda_k, v_k)\); just take the few with the largest eigenvalues).

Solving for w then gives the optimal projection direction, the one that satisfies "high cohesion, low coupling":

\(w = (S_B^{\frac {1}{2}})^{-1}v_k\)
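Putting the reformulated route together as a numpy sketch (placeholder matrices again, with \(S_B\) assumed invertible as the derivation does): build \(A\), eigendecompose it, recover \(w\), and check that the recovered \(w\) indeed satisfies \(S_W^{-1} S_B w = \lambda w\).

```python
import numpy as np

S_B = np.array([[4.0, 1.0], [1.0, 2.0]])
S_W = np.array([[3.0, 0.5], [0.5, 1.0]])

# S_B^{1/2} via eigendecomposition (see the previous snippet).
vals, Q = np.linalg.eigh(S_B)
S_B_half = Q @ np.diag(np.sqrt(vals)) @ Q.T

A = S_B_half @ np.linalg.inv(S_W) @ S_B_half     # symmetric, so eigh applies
lam, V = np.linalg.eigh(A)
v_k = V[:, -1]                                   # eigenvector of the largest lambda
w = np.linalg.inv(S_B_half) @ v_k                # recover the projection direction

# Sanity check: w satisfies S_W^{-1} S_B w = lambda w (up to floating-point error).
lhs = np.linalg.inv(S_W) @ S_B @ w
print(np.allclose(lhs, lam[-1] * w))
```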

LDA vs PCA

  • PCA's projection line (direction) only makes the variance of the projected data points as large as possible (the direction perpendicular to the projection line can then serve as a dividing line); it does not care about the label values at all

  • LDA's projection line (direction) makes the distance between points with different labels (between classes) large and the distance between points with the same label small; the labels must be taken into account (see the sketch after this list)
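For a concrete feel of this difference, here is a short scikit-learn sketch (my own addition, not from the post): project the iris data to one dimension with PCA, which ignores y, and with LDA, which uses it.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=1).fit_transform(X)                            # unsupervised: max variance
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)  # supervised: uses labels

print(X_pca.shape, X_lda.shape)   # both (150, 1), but found with vs. without the labels
```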

Summary

So that makes two rounds of deriving LDA now...

As a linear classifier, its core idea is Bayes.

As a dimensionality reduction tool, its principle is the coding style of high cohesion and low coupling.

Then, as an extension to think about: kernel LDA, which uses the kernel trick to first raise the dimension and then classify. Something to mull over...


Source: www.cnblogs.com/chenjieyouge/p/11999919.html