PCA can be defined in several ways, of which three are the most common. First, PCA can be understood as dimensionality reduction: reducing the number of variables while retaining as much of the spread (variance) information as possible and eliminating linear correlation between variables. Second, PCA can be understood as an orthogonal projection onto a set of directions chosen so that the variance of the projected points is maximized (Hotelling, 1933). Third, it can be understood as the orthogonal projection that minimizes the reconstruction loss, measured by the mean squared distance between the data points and their estimates (Pearson, 1901).
We consider these three definitions in turn.
1. Dimensionality Reduction
Here a "dimension" is simply a variable. When recording data — say, collecting a person's measurements — we record height, weight, chest circumference, and so on; these are the "variables". One person's measurements, e.g. (173 cm, 65 kg, 887 mm), form a sample, or data point. PCA can then be understood as doing two things at once: on the one hand, reducing the number of variables while keeping as much of the spread information as possible; on the other hand, eliminating the linear correlation between variables (in the example above, height, weight and chest circumference are clearly positively correlated). As for why this is worth doing, that touches on the origins of PCA; see: A Tutorial on Principal Component Analysis (译).
Spread information is measured by a variable's variance: the larger the variance, the more information it carries.
Linear correlation between variables is measured by the absolute value of their covariance: the larger the absolute value, the stronger the correlation; zero covariance means no linear correlation.
The idea of PCA is to form new variables as linear combinations of the original ones, such that each new variable's variance is as large as possible and the covariances between different new variables are zero.
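As a quick numerical illustration of these two measures (variance for spread, covariance for linear correlation), here is a small sketch with `numpy`; the first two columns reuse the numbers from the example above, the remaining people are made up for illustration only:

```python
import numpy as np

# Rows = variables (height cm, weight kg, chest mm), columns = samples.
# The first two columns reuse the example above; the rest are invented.
X = np.array([
    [173, 159, 181, 165, 170],   # height
    [ 65,  55,  80,  58,  63],   # weight
    [887, 853, 930, 870, 880],   # chest circumference
], dtype=float)

C = np.cov(X)   # 3x3 covariance matrix of the three variables

# Diagonal entries are the variances (the "spread information");
# off-diagonal entries measure linear correlation between variables.
print(np.diag(C))   # variances of height, weight, chest
print(C[0, 1])      # Cov(height, weight): positive, as expected
```

The positive off-diagonal entries confirm the correlation that PCA's change of variables is designed to remove.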
Here is the detailed derivation. Let $X$ be an $m$-dimensional random vector,

$$X=\begin{pmatrix}x_1\\x_2\\\vdots\\x_m\end{pmatrix}$$

and apply the transformation

$$PX=Y$$

where $P$ is an $m\times m$ matrix,

$$P=(p_{ij})_{m\times m}=\begin{bmatrix}p_1^T\\p_2^T\\ \vdots \\p_m^T\end{bmatrix}$$

Then

$$\begin{bmatrix}y_1\\ y_2\\ \vdots \\ y_m\end{bmatrix}=Y=PX=\begin{bmatrix}p_{11}x_1+p_{12}x_2+\cdots+p_{1m}x_m \\ p_{21}x_1+p_{22}x_2+\cdots+p_{2m}x_m \\ \vdots \\ p_{m1}x_1+p_{m2}x_2+\cdots+p_{mm}x_m \end{bmatrix}=\begin{bmatrix}p_1^TX\\p_2^TX\\ \vdots \\p_m^TX\end{bmatrix}$$

As we can see, each new variable $y_i$ is a linear combination of the original variables.
$$\begin{aligned}Var(y_i)&=E[y_i-E(y_i)]^2\\&=E[(p_i^TX-E(p_i^TX))(p_i^TX-E(p_i^TX))^T]\\&=p_i^TE[(X-E(X))(X-E(X))^T]p_i\\&=p_i^TC_Xp_i\\Cov(y_i,y_j)&=p_i^TC_Xp_j,\,\,i,j=1,2,\cdots,m\end{aligned}$$

Making $Var(y_i)$ large on its own is trivial: simply scale $p_i$ up. But that is meaningless and not what we want, so we constrain $p_i$ to be a unit vector, i.e. $\Vert p_i \Vert^2=p_i^Tp_i=1$. Our goal is to choose the $p_i$ so that each $Var(y_i)$ is as large as possible while $Cov(y_i,y_j)=0$ for $i\neq j$. First, some preparation: since $C_X$ is a real symmetric positive definite matrix, its eigenvalues are all positive, and there exists an orthogonal matrix $U$ such that

$$U^TC_XU=D,\quad D=diag(\lambda_1,\lambda_2,\cdots,\lambda_m)$$

where $\lambda_1\geq\lambda_2\geq\cdots\geq\lambda_m>0$, $U=(u_1\,u_2\,\cdots\,u_m)$, and $u_i$ is an eigenvector for $\lambda_i$. Hence $C_X$ can be written as

$$C_X=UDU^T$$
First, we want $Var(y_1)$ to be as large as possible:

$$Var(y_1)=p_1^TC_Xp_1=p_1^TUDU^Tp_1$$

Write $z_1=U^Tp_1=(z_{11},z_{12},\cdots,z_{1m})^T$; then

$$\Vert z_1 \Vert^2=z_{11}^2+z_{12}^2+\cdots+z_{1m}^2=z_1^Tz_1=p_1^TUU^Tp_1=p_1^Tp_1=1$$

$$\begin{aligned}Var(y_1)&=z_1^TDz_1\\&=z_{11}^2\lambda_1+z_{12}^2\lambda_2+\cdots+z_{1m}^2\lambda_m\\&\leq z_{11}^2\lambda_1+z_{12}^2\lambda_1+\cdots+z_{1m}^2\lambda_1\\&=\lambda_1\end{aligned}$$

Taking $z_1=(1,0,\cdots,0)^T$ attains equality, so $Var(y_1)$ reaches its maximum $\lambda_1$ at $p_1=Uz_1=u_1$.
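The bound just derived can be checked numerically. In this sketch the matrix `C` is a randomly generated positive definite stand-in for $C_X$ (not data from the text): no random unit vector beats the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random symmetric positive definite matrix standing in for C_X.
A = rng.standard_normal((4, 4))
C = A @ A.T + 4 * np.eye(4)

# eigh returns eigenvalues of a symmetric matrix in ascending order.
lam, U = np.linalg.eigh(C)
u1, lam1 = U[:, -1], lam[-1]          # largest eigenpair

# Var(y_1) = p^T C p is at most lambda_1 for any unit vector p ...
for _ in range(1000):
    p = rng.standard_normal(4)
    p /= np.linalg.norm(p)
    assert p @ C @ p <= lam1 + 1e-9

# ... with equality at p = u_1.
assert np.isclose(u1 @ C @ u1, lam1)
```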
Next, consider $p_2$, which must satisfy

$$\max\,\,Var(y_2)\qquad s.t.\,\,Cov(y_1,y_2)=0$$

i.e.

$$\max\,\,p_2^TC_Xp_2\qquad s.t.\,\,p_2^TC_Xp_1=0$$

Similarly write $z_2=U^Tp_2=(z_{21},z_{22},\cdots,z_{2m})^T$; then

$$p_2^TC_Xp_1=p_2^TUDU^Tp_1=z_2^TDz_1=\lambda_1z_{21}=0$$

so $z_{21}=0$, and

$$\Vert z_2 \Vert^2=0+z_{22}^2+\cdots+z_{2m}^2=z_2^Tz_2=p_2^TUU^Tp_2=p_2^Tp_2=1$$

$$\begin{aligned}Var(y_2)&=z_{22}^2\lambda_2+z_{23}^2\lambda_3+\cdots+z_{2m}^2\lambda_m\\&\leq z_{22}^2\lambda_2+z_{23}^2\lambda_2+\cdots+z_{2m}^2\lambda_2\\&=\lambda_2\end{aligned}$$

Taking $z_2=(0,1,0,\cdots,0)^T$ attains equality, so $Var(y_2)$ reaches its maximum $\lambda_2$ while satisfying $Cov(y_1,y_2)=0$, at $p_2=Uz_2=u_2$.
Continuing in the same way, $p_i$ must satisfy

$$\max\,\,Var(y_i)\qquad s.t.\,\,Cov(y_j,y_i)=0,\,\,j=1,2,\cdots,i-1$$

and solving gives $Var(y_i)$ its maximum value $\lambda_i$, attained at $p_i=u_i$.
In summary, $p_i=u_i$, i.e.

$$P=\begin{bmatrix}p_1^T\\p_2^T\\ \vdots \\p_m^T\end{bmatrix}=\begin{bmatrix}u_1^T\\u_2^T\\ \vdots \\u_m^T\end{bmatrix}=U^T,\qquad Y=U^TX$$

and then

$$\begin{aligned}C_Y&=\begin{bmatrix}Cov(y_1,y_1) & Cov(y_1,y_2) & \cdots & Cov(y_1,y_m)\\Cov(y_2,y_1) & Cov(y_2,y_2) & \cdots & Cov(y_2,y_m)\\\vdots & \vdots & \ddots & \vdots\\Cov(y_m,y_1) & Cov(y_m,y_2) & \cdots & Cov(y_m,y_m)\end{bmatrix}\\&=\begin{bmatrix}\lambda_1&0&\cdots&0\\0&\lambda_2&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\lambda_m\end{bmatrix}\\&=D\end{aligned}$$

That is, $C_Y$ is a diagonal matrix whose diagonal entries are the eigenvalues of $C_X$ in descending order. The $p_i$ can also be found by the method of Lagrange multipliers, but that route is less direct than the one above; interested readers may consult the literature or work it out themselves.
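The whole result of this part can be sketched numerically on toy correlated data (all numbers below are generated, not from the text). Note that `np.linalg.eigh` returns eigenvalues in ascending order, so its columns are reordered to match the descending convention used above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy correlated data: rows = m variables, columns = n samples.
X = rng.standard_normal((3, 500))
X[1] += 0.8 * X[0]                 # introduce linear correlation
X[2] -= 0.5 * X[0]

C_X = np.cov(X)
lam, U = np.linalg.eigh(C_X)       # ascending eigenvalues
lam, U = lam[::-1], U[:, ::-1]     # reorder: lambda_1 >= ... >= lambda_m

Y = U.T @ X                        # the transform P = U^T derived above
C_Y = np.cov(Y)

# C_Y is (numerically) diagonal, with the eigenvalues on the diagonal.
off_diag = C_Y - np.diag(np.diag(C_Y))
print(np.max(np.abs(off_diag)))    # essentially zero
print(np.diag(C_Y), lam)           # matching values
```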
2. Maximum Variance Interpretation
The aim of PCA is to find a new basis whose vectors are mutually orthogonal and such that the variance of the data points projected onto each basis direction is maximal. Denote the orthonormal basis to be found by $\lbrace p_1,p_2,\cdots,p_m\rbrace$ with $p_i^Tp_j=\delta_{ij}$. (What matters here is the direction of each basis vector, not its length, so we take the basis vectors to be unit vectors.) Let $X$ be an $m\times n$ data matrix in which each row is a variable and each column is a data point, or sample. For example, with the data from the previous part,

$$X=\begin{bmatrix}173 &159 \\ 65 & 55 \\ 887 & 853\end{bmatrix}$$

the rows are the variables height, weight and chest circumference, each column is one person's data, and $m=3,n=2$.
Partition $X$ by columns, $X=(x_1\,x_2\,\cdots\,x_n)$, so that $x_j$ is a data point and $p_i^Tx_j$ is the projection of $x_j$ onto the direction $p_i$. The variance of the projections of the data points onto $p_i$ can then be written

$$\begin{aligned}Var(i)&=\frac{1}{n}\sum_{j=1}^n(p_i^Tx_j-\overline{p_i^Tx_j})^2\\&=\frac{1}{n}\sum_{j=1}^n(p_i^T(x_j-\overline{x}))^2\\&=\frac{1}{n}\sum_{j=1}^np_i^T(x_j-\overline{x})(x_j-\overline{x})^Tp_i\\&=p_i^T\Big(\frac{1}{n}\sum_{j=1}^n(x_j-\overline{x})(x_j-\overline{x})^T\Big)p_i\\&=p_i^TC_Xp_i\end{aligned}$$

First consider maximizing $Var(1)$. This $Var(1)$ is the same quantity as $Var(y_1)$ in the previous part, so the same method gives the same result: $p_1=u_1$, the unit eigenvector of the largest eigenvalue $\lambda_1$. Next, find $p_2$ maximizing $Var(2)$ subject to $p_2^Tp_1=0$. A careful reader will notice that the previous part instead required $p_2^TC_Xp_1=0$, which looks different. But is it really different?

$$\begin{aligned}&\because \,\,C_Xp_1=\lambda_1p_1\ (\lambda_1>0)\\ &\therefore \,\,p_2^TC_Xp_1=\lambda_1 p_2^Tp_1\\ &\therefore \,\,p_2^TC_Xp_1=0 \iff p_2^Tp_1=0\end{aligned}$$

See? The two constraints are indeed the same. Likewise, all the later constraints agree: $p_i^TC_Xp_j=0 \iff p_i^Tp_j=0$.
When $p_i=u_i$, the unit eigenvector of $\lambda_i$, the variance of the projected data in direction $p_i$ attains its maximum $\lambda_i$, and distinct basis vectors are mutually orthogonal. So the orthonormal basis sought by PCA consists of the orthonormal eigenvectors of the covariance matrix $C_X$.
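A brief numerical sanity check of this statement, on made-up data, using the $1/n$ covariance convention from the derivation above:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 1000))
X[2] += 0.7 * X[1]                 # make the variables correlated

xbar = X.mean(axis=1, keepdims=True)
# Covariance with the 1/n convention used in the derivation.
C_X = (X - xbar) @ (X - xbar).T / X.shape[1]

lam, U = np.linalg.eigh(C_X)
u_top = U[:, -1]                   # unit eigenvector of the largest eigenvalue

# The variance of the projections p^T x_j in direction u_top is lambda_max.
proj = u_top @ X
var_proj = np.mean((proj - proj.mean()) ** 2)
assert np.isclose(var_proj, lam[-1])
```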
3. Minimum Mean-Squared Error Interpretation
Partition the data matrix $X_{m\times n}$ by columns, one data point per column:

$$X=(x_1\,x_2\,\cdots\,x_n)$$

Now re-express each data point in an orthonormal basis $\lbrace p_1,p_2,\cdots,p_m\rbrace$:

$$x_i=a_{i1}p_1+a_{i2}p_2+\cdots+a_{im}p_m,\,\,i=1,2,\cdots,n$$

Because the basis is orthonormal, taking the inner product of both sides with $p_j$ gives the coefficients directly: $a_{ij}=x_i^Tp_j$.
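A minimal sketch of this expansion, using a random orthonormal basis built via a QR factorization (all values generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Any orthogonal matrix gives an orthonormal basis {p_1, ..., p_m}.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
P = [Q[:, j] for j in range(4)]

x = rng.standard_normal(4)

# Expansion coefficients in an orthonormal basis: a_j = x^T p_j.
a = np.array([x @ p for p in P])
x_rebuilt = sum(a[j] * P[j] for j in range(4))
assert np.allclose(x, x_rebuilt)   # the expansion reproduces x exactly
```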
Consider the $d$-dimensional ($d<m$) subspace $V_d=span\lbrace p_1,p_2,\cdots,p_d\rbrace$. Our aim is to re-express the samples in $V_d$ while keeping the "loss" as small as possible. Write the estimate of each data point as

$$\widetilde{x}_i=\sum_{j=1}^db_{ij}p_j+\sum_{j=d+1}^mz_jp_j$$

where the $b_{ij}$ depend on the data point and the $z_j$ do not. Take the loss function to be the mean squared distance between the data points and their approximations,

$$J=\frac{1}{n} \sum_{i=1}^n \Vert x_i-\widetilde{x}_i\Vert^2$$

To minimize this loss, we are free to choose $b_{ij}$, $z_j$ and $\lbrace p_j\rbrace$.
First consider $b_{ij}$:

$$x_i-\widetilde{x}_i=\sum_{j=1}^d(a_{ij}-b_{ij})p_j+\sum_{j=d+1}^m(a_{ij}-z_j)p_j$$

Since $\lbrace p_1,p_2,\cdots,p_m\rbrace$ is an orthonormal basis,

$$\Vert x_i-\widetilde{x}_i\Vert^2=\sum_{j=1}^d(a_{ij}-b_{ij})^2+\sum_{j=d+1}^m(a_{ij}-z_j)^2$$

$$J=\frac{1}{n} \sum_{i=1}^n \Vert x_i-\widetilde{x}_i\Vert^2=\frac{1}{n} \sum_{i=1}^n\Big(\sum_{j=1}^d(a_{ij}-b_{ij})^2+\sum_{j=d+1}^m(a_{ij}-z_j)^2\Big)$$

Choosing $b_{ij}$ to minimize $J$, clearly $J$ is smallest when $b_{ij}=a_{ij}=x_i^Tp_j$, in which case

$$J=\frac{1}{n} \sum_{i=1}^n\sum_{j=d+1}^m(a_{ij}-z_j)^2=\frac{1}{n} \sum_{j=d+1}^m\sum_{i=1}^n(a_{ij}-z_j)^2$$

The same result follows from differentiating with respect to $b_{ij}$ and setting the partial derivatives to zero.
Next, consider $z_j$:

$$\begin{aligned}J&=\frac{1}{n} \sum_{j=d+1}^m\sum_{i=1}^n(z_j^2-2z_ja_{ij}+a_{ij}^2)\\&=\frac{1}{n} \sum_{j=d+1}^m\Big(nz_j^2-2\Big(\sum_{i=1}^na_{ij}\Big)z_j+\sum_{i=1}^na_{ij}^2\Big)\end{aligned}$$

Taking the partial derivative with respect to $z_j$ and setting it to zero,

$$\frac{\partial J}{\partial z_j}=2nz_j-2\sum_{i=1}^na_{ij}=0$$

gives

$$z_j=\frac{1}{n}\sum_{i=1}^na_{ij}= \frac{1}{n}\sum_{i=1}^nx_i^Tp_j=\Big(\frac{1}{n}\sum_{i=1}^nx_i^T\Big)p_j=\overline{x}^Tp_j$$

With these choices,

$$\begin{aligned}J&=\frac{1}{n} \sum_{i=1}^n\sum_{j=d+1}^m(a_{ij}-z_j)^2\\&=\frac{1}{n} \sum_{i=1}^n\sum_{j=d+1}^m(x_i^Tp_j-\overline{x}^Tp_j)^2\\&=\frac{1}{n} \sum_{i=1}^n\sum_{j=d+1}^m((x_i-\overline{x})^Tp_j)^2\\&=\frac{1}{n} \sum_{j=d+1}^m\sum_{i=1}^np_j^T(x_i-\overline{x})(x_i-\overline{x})^Tp_j\\&= \sum_{j=d+1}^mp_j^T\Big(\frac{1}{n}\sum_{i=1}^n (x_i-\overline{x})(x_i-\overline{x})^T\Big)p_j\\&=\sum_{j=d+1}^m p_j^TC_Xp_j\end{aligned}$$
Only the last task remains: choose $p_1,p_2,\cdots,p_m$ so that $J$ is minimized subject to $p_i^Tp_j=\delta_{ij}$.
First let $d=m-1$. By the same method as in the first part — using the fact that $C_X$ is orthogonally similar to a diagonal matrix — the minimizing $p_m$ is a unit eigenvector of the smallest eigenvalue $\lambda_m$, and the minimum of $J$ is $\lambda_m$. Then, letting $d=m-2,m-3,\cdots,1$ in turn, we obtain: $p_i$ is a unit eigenvector of the eigenvalue $\lambda_i$, $i=1,2,\cdots,m$, and the minimum of $J$ is

$$J_{min}=\sum_{j=d+1}^m \lambda_j$$
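This final claim can be checked numerically: reconstructing toy data from the top $d$ principal directions (with the mean added back, as the optimal $z_j=\overline{x}^Tp_j$ implies) gives a mean squared error equal to the sum of the discarded eigenvalues. All data below are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, d = 5, 2000, 2
# Toy data with very different per-variable scales.
X = rng.standard_normal((m, n)) * np.array([[3.0], [2.0], [1.0], [0.5], [0.1]])

xbar = X.mean(axis=1, keepdims=True)
C_X = (X - xbar) @ (X - xbar).T / n       # 1/n covariance, as in the derivation

lam, U = np.linalg.eigh(C_X)
lam, U = lam[::-1], U[:, ::-1]            # descending eigenvalues

# Keep the top-d directions; project the centered data, then add the mean back
# (the mean term is exactly the optimal z_j contribution derived above).
Ud = U[:, :d]
X_hat = Ud @ (Ud.T @ (X - xbar)) + xbar

J = np.mean(np.sum((X - X_hat) ** 2, axis=0))
assert np.isclose(J, lam[d:].sum())       # J_min = sum_{j=d+1}^m lambda_j
```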
References:
[1] Shlens J. A Tutorial on Principal Component Analysis. arXiv preprint arXiv:1404.1100, 2014.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning [M]. Singapore: Springer, 2006.
[3] 周长宇. A Tutorial on Principal Component Analysis (译). https://blog.csdn.net/zhouchangyu1221/article/details/103949967, 2020-01-22.
[4] 范金城, 梅长林. 数据分析 (Data Analysis) [M]. 2nd ed. Beijing: Science Press, 2018.