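Algorithm
1. Compute the mean of each attribute (column) of the data set.
2. Subtract the corresponding column mean from every value of the original matrix.
3. Compute the covariance matrix of the mean-removed matrix.
4. Compute the eigenvalues and eigenvectors of the covariance matrix.
5. Given the target dimension k, take the k eigenvectors with the largest eigenvalues and use them as the columns of the feature matrix.
6. The reduced matrix is: mean-removed matrix * feature matrix.
7. The reconstructed original data is: reduced matrix * transpose of the feature matrix + mean.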
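The steps can also be written compactly in matrix form. This is only a notational sketch: the symbols (X for the m-by-n data matrix with one sample per row, mu for the vector of column means, W for the n-by-k feature matrix) are introduced here for illustration and do not appear in the original text, and the 1/(m-1) factor assumes the unbiased covariance estimate that numpy's cov uses by default.

```latex
\begin{aligned}
\mu &= \frac{1}{m}\, X^{\top} \mathbf{1}_m
  && \text{(step 1: column means)} \\
X_c &= X - \mathbf{1}_m \mu^{\top}
  && \text{(step 2: remove the mean)} \\
C &= \frac{1}{m-1}\, X_c^{\top} X_c
  && \text{(step 3: covariance matrix)} \\
C\, w_i &= \lambda_i\, w_i, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n
  && \text{(step 4: eigen-decomposition)} \\
W &= \begin{bmatrix} w_1 & w_2 & \cdots & w_k \end{bmatrix}
  && \text{(step 5: top-}k\text{ eigenvectors)} \\
Y &= X_c\, W
  && \text{(step 6: reduced data)} \\
\hat{X} &= Y\, W^{\top} + \mathbf{1}_m \mu^{\top}
  && \text{(step 7: reconstruction)}
\end{aligned}
```

The transpose in the last line is what maps the k-dimensional rows of Y back into the original n-dimensional space. When k = n and the eigenvectors are orthonormal, W W^T is the identity and the reconstruction reproduces X exactly.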
The code is as follows:
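from numpy import argsort, cov, linalg, loadtxt, mean


def load_data_set(file_name):
    # Assumption: the original loader is not shown. This stand-in reads
    # whitespace-separated numeric columns, one sample per line.
    return loadtxt(file_name)


def pca(data_set, feature_amount):
    # Steps 1-2: center every column (attribute) on its mean.
    data_set_mean = mean(data_set, axis=0)
    data_set_mean_removed = data_set - data_set_mean
    cov_mat = cov(data_set_mean_removed, rowvar=False)
    # Step 4: eigh suits symmetric matrices and returns real eigenvalues.
    eig_values, eig_vectors = linalg.eigh(cov_mat)
    # Step 5: keep the eigenvectors of the k largest eigenvalues as rows.
    order = argsort(eig_values)[::-1][:feature_amount]
    feature = eig_vectors[:, order].T
    # Steps 6-7: project onto the components, then map back and add the mean.
    low_data_set = data_set_mean_removed @ feature.T
    re_data_set = low_data_set @ feature + data_set_mean
    return low_data_set, re_data_set


def main():
    data_set = load_data_set('data.txt')
    low_data_set, re_data_set = pca(data_set, 1)


if __name__ == '__main__':
    main()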
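Since data.txt is not included here, the following is a minimal sketch for trying the procedure on synthetic data. Everything in it is an assumption made for illustration: the name pca_sketch, the made-up data set (200 points near the line y = 2x), and the use of mean squared error as the reconstruction measure. It restates the seven steps in compact form so that the block runs on its own.

```python
import numpy as np


def pca_sketch(data_set, feature_amount):
    """Compact restatement of the seven steps, so this block runs on its own."""
    data_set = np.asarray(data_set, dtype=float)
    data_set_mean = data_set.mean(axis=0)
    centered = data_set - data_set_mean
    # eigh is meant for symmetric matrices such as a covariance matrix.
    eig_values, eig_vectors = np.linalg.eigh(np.cov(centered, rowvar=False))
    # Indices of the k largest eigenvalues, largest first.
    order = np.argsort(eig_values)[::-1][:feature_amount]
    feature = eig_vectors[:, order]  # shape (n_attributes, k)
    low_data_set = centered @ feature
    re_data_set = low_data_set @ feature.T + data_set_mean
    return low_data_set, re_data_set


# Synthetic data (made up for illustration): points close to the line y = 2x.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data_set = np.column_stack([x, 2.0 * x + 0.05 * rng.normal(size=200)])

low_1, re_1 = pca_sketch(data_set, 1)
low_2, re_2 = pca_sketch(data_set, 2)

# Mean squared reconstruction error when keeping 1 or 2 components.
err_1 = float(np.mean((data_set - re_1) ** 2))
err_2 = float(np.mean((data_set - re_2) ** 2))
print(low_1.shape, err_1, err_2)
```

Keeping both components should reconstruct the data up to floating-point rounding, while keeping one component loses only the small amount of variance perpendicular to the line. Comparing the two errors is a quick sanity check that the projection and the reconstruction use matching orientations of the feature matrix.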