I. Introduction: Applications of SVD

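With SVD we can represent the original data set using a much smaller one. In doing so we effectively remove noise and redundant information. In short, SVD is a way to extract the most important information from a large amount of data.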

Here are a few application scenarios:

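1. Latent semantic indexing
One of the earliest applications of SVD was information retrieval. The SVD-based approach is called latent semantic indexing (LSI) or latent semantic analysis (LSA). In LSI, a matrix is built from documents and words. When SVD is applied to this matrix, the resulting singular values represent the topics or main concepts of the documents.
A plain keyword search can miss documents that contain only a synonym of the query term. If we extract concepts from thousands of similar documents, synonyms map to the same concept, so those documents can still be matched.
2. Recommender systems
A simple recommender system computes the similarity between items or between people. More advanced approaches first use SVD to build a topic space from the data and then compute similarities in that space.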
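
The synonym effect described for LSI can be illustrated with a minimal sketch. The term-document matrix below is hypothetical (it is not from the original article), and the choice of two concepts is an assumption made for this toy example. Documents 0 and 1 share only the word "engine", yet they land on the same direction in the two-dimensional concept space.

```python
import numpy as np

# Hypothetical term-document count matrix (rows: terms, columns: documents).
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
], dtype=float)


def cosine(a, b):
    """Plain cosine of the angle between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of concepts kept (an assumption for this toy example)
# Coordinates of each document in the k-dimensional concept space, one row per document.
doc_concepts = (np.diag(s[:k]) @ Vt[:k, :]).T

raw_sim = cosine(A[:, 0], A[:, 1])                      # documents compared word by word
latent_sim = cosine(doc_concepts[0], doc_concepts[1])   # documents compared by concept
print("raw cosine:", raw_sim)
print("concept-space cosine:", latent_sim)
```

In the raw word space the "car" document and the "automobile" document overlap only partially, while in the concept space they are treated as being about the same topic.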

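II. Matrix Factorization

SVD is one type of matrix factorization, and matrix factorization is the process of decomposing a data matrix into several separate parts.
In many cases a small portion of the data carries most of the information in the data set, while the rest is either noise or irrelevant.
Linear algebra offers many matrix factorization techniques. A factorization rewrites the original matrix in a new form that is easier to work with, namely a product of two or more matrices.
Different factorization techniques have different properties, and some suit one application better than others. One of the most common is SVD.
The formula is: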
$$Data_{m \times n} = U_{m \times m} \, \Sigma_{m \times n} \, V^{T}_{n \times n}$$
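This decomposition produces a matrix Σ whose only nonzero entries lie on its diagonal. By convention, the diagonal entries of Σ are sorted from largest to smallest. These diagonal entries are called singular values, and they are the singular values of the original data matrix Data. Singular values and eigenvalues are related: the singular values here are the square roots of the eigenvalues of the matrix $Data \cdot Data^{T}$.
A common observation in science and engineering is that after a certain number of singular values (say r), the remaining ones are so small that they can be set to 0. This means the data set has only r important features, and the rest are noise or redundant.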
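
As a quick check of the two claims above, the following sketch compares the singular values with the square roots of the eigenvalues of Data · Data^T, and counts how many singular values are numerically nonzero. It uses the 7x5 example matrix that appears later in this article; the tolerance `tol` is an arbitrary choice for illustration.

```python
import numpy as np

# The example matrix returned by loadExData() later in this article.
data = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

# Singular values only, sorted from largest to smallest.
sigma = np.linalg.svd(data, compute_uv=False)

# Eigenvalues of the symmetric matrix Data * Data^T (eigvalsh returns ascending order).
eigvals = np.linalg.eigvalsh(data @ data.T)[::-1]
eigvals = np.clip(eigvals, 0.0, None)        # remove tiny negative rounding errors
sqrt_eig = np.sqrt(eigvals[:len(sigma)])     # keep as many values as there are singular values

tol = 1e-10                                  # threshold below which a value counts as zero
r = int(np.sum(sigma > tol))                 # number of important singular values

print("singular values:   ", sigma)
print("sqrt(eigenvalues): ", sqrt_eig)
print("number of nonzero singular values r =", r)
```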

III. Implementing SVD in Python

import pandas as pd
import numpy as np
# IPython magic; only valid inside a Jupyter/IPython session
%matplotlib inline
import matplotlib.pyplot as plt
from numpy import *

# svdRec.py is the helper module listed later in this article
import svdRec

# Try SVD on the example data
Data = svdRec.loadExData()
Data
#[[1, 1, 1, 0, 0],
# [2, 2, 2, 0, 0],
# [1, 1, 1, 0, 0],
# [5, 5, 5, 0, 0],
# [1, 1, 0, 2, 2],
# [0, 0, 0, 3, 3],
# [0, 0, 0, 1, 1]]

U,Sigma,VT = linalg.svd(Data)
# Singular values (the last two are effectively zero)
Sigma
#array([9.72140007e+00, 5.29397912e+00, 6.84226362e-01, 1.30585973e-15, 1.86360781e-31])

# Sig3: a 3x3 diagonal matrix built from the three largest singular values
Sig3 = mat([[Sigma[0],0,0],[0,Sigma[1],0],[0,0,Sigma[2]]])
Sig3
#matrix([[9.72140007, 0.        , 0.        ],
#        [0.        , 5.29397912, 0.        ],
#        [0.        , 0.        , 0.68422636]])

# Reconstruct an approximation of the original data matrix
U[:,:3]*Sig3*VT[:3,:]
#matrix([[ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00,          1.74773390e-16,  1.54715650e-16],
#        [ 2.00000000e+00,  2.00000000e+00,  2.00000000e+00,          5.16080234e-17,  1.14925430e-17],
#        [ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00,         -5.66278795e-16, -5.86336535e-16],
#        [ 5.00000000e+00,  5.00000000e+00,  5.00000000e+00,          6.89552582e-17,  1.69135539e-17],
#        [ 1.00000000e+00,  1.00000000e+00, -3.33066907e-16,          2.00000000e+00,  2.00000000e+00],
#        [ 8.32667268e-17, -2.77555756e-17, -1.11022302e-16,          3.00000000e+00,  3.00000000e+00],
#        [-2.08166817e-17, -5.55111512e-17,  4.16333634e-17,          1.00000000e+00,  1.00000000e+00]])
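
To see how the quality of the reconstruction depends on the number of singular values kept, here is a small sketch using plain NumPy arrays instead of `mat`. The helper name `rank_k_approx` is made up for this illustration. The Frobenius norm of the difference measures how far each approximation is from the original matrix.

```python
import numpy as np

data = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, sigma, Vt = np.linalg.svd(data, full_matrices=False)


def rank_k_approx(k):
    """Rebuild the matrix from the k largest singular values only."""
    return U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]


# Frobenius-norm reconstruction error for k = 1, 2, 3.
errors = {k: float(np.linalg.norm(data - rank_k_approx(k))) for k in (1, 2, 3)}
for k, err in errors.items():
    print("k = %d, reconstruction error = %.6g" % (k, err))
```

With three singular values the error is at the level of floating-point rounding, which matches the reconstruction shown above.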

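IV. Similarity

Collaborative filtering makes recommendations by comparing a user's data with that of other users; the only mathematics it requires is a similarity computation.
Similarity is computed from users' opinions of the items: this is the approach collaborative filtering takes. It ignores the descriptive attributes of the items and computes similarity strictly from the opinions of many users.
We want the similarity to vary between 0 and 1, with more similar pairs of items getting larger values. We can use the formula similarity = 1/(1 + distance). When the distance is 0 the similarity is 1.0, and when the distance is very large the similarity approaches 0.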

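1- Euclidean distance: the distance is the square root of the sum of squared differences between the two vectors.
2- The second way to measure distance is the Pearson correlation.
Its advantage over Euclidean distance is that it is insensitive to the magnitude of a user's ratings. For example, if an enthusiastic user consistently rates items higher than a gloomy user does, but their ratings rise and fall together, the Pearson correlation treats the two rating vectors as equivalent. (If a user gives every item exactly the same score, the vector has zero variance and the correlation is undefined.) In NumPy the Pearson correlation is computed by corrcoef(), which we will use shortly. The Pearson correlation ranges from -1 to +1, so we compute 0.5 + 0.5*corrcoef() to normalize it to the range 0 to 1.
3- Cosine similarity
It computes the cosine of the angle between two vectors. If the angle is 90 degrees, the cosine is 0; if the two vectors point in the same direction, it is 1.0. Like the Pearson correlation, the cosine ranges from -1 to +1, so the cosSim function in svdRec.py normalizes it with 0.5 + 0.5*cosΘ.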
$$\cos\Theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
Here ‖A‖ and ‖B‖ are the 2-norms of A and B. Any vector norm could be used, but when the order of the norm is not specified the 2-norm is assumed.
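
For reference, the three similarity measures described above can be written out as follows, in the normalized form that the svdRec.py functions in this article compute. The symbols r, Ā and B̄ (the Pearson coefficient and the vector means) are notation introduced here for illustration.

```latex
\begin{aligned}
\mathrm{sim}_{\text{euclid}}(A, B) &= \frac{1}{1 + \lVert A - B \rVert_2},
  \qquad \lVert A - B \rVert_2 = \sqrt{\sum_i (A_i - B_i)^2} \\[4pt]
\mathrm{sim}_{\text{pearson}}(A, B) &= 0.5 + 0.5\, r,
  \qquad r = \frac{\sum_i (A_i - \bar{A})(B_i - \bar{B})}
                  {\sqrt{\sum_i (A_i - \bar{A})^2}\,\sqrt{\sum_i (B_i - \bar{B})^2}} \\[4pt]
\mathrm{sim}_{\cos}(A, B) &= 0.5 + 0.5\, \cos\Theta,
  \qquad \cos\Theta = \frac{A \cdot B}{\lVert A \rVert_2 \, \lVert B \rVert_2}
\end{aligned}
```

Each expression maps its underlying measure onto the interval from 0 to 1, so the three results can be used interchangeably as weights.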

Computing similarity

myMat = mat(svdRec.loadExData())

# Euclidean distance
svdRec.ecludSim(myMat[:,0],myMat[:,4])
#0.13367660240019172

# Cosine similarity
svdRec.cosSim(myMat[:,0],myMat[:,4])
#0.5472455591261534

# Pearson correlation
svdRec.pearsSim(myMat[:,0],myMat[:,4])
#0.23768619407595826
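
The three values above can be reproduced without svdRec, using plain NumPy arrays. This is a minimal sketch; the function names `euclid_sim`, `pearson_sim` and `cos_sim` are chosen here for illustration and simply mirror the formulas used in svdRec.py.

```python
import numpy as np

data = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)


def euclid_sim(a, b):
    """1 / (1 + Euclidean distance)."""
    return float(1.0 / (1.0 + np.linalg.norm(a - b)))


def pearson_sim(a, b):
    """Pearson correlation rescaled from [-1, 1] to [0, 1]."""
    return float(0.5 + 0.5 * np.corrcoef(a, b)[0, 1])


def cos_sim(a, b):
    """Cosine of the angle rescaled from [-1, 1] to [0, 1]."""
    cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(0.5 + 0.5 * cos_theta)


a, b = data[:, 0], data[:, 4]    # compare item 0 with item 4
e_sim = euclid_sim(a, b)
c_sim = cos_sim(a, b)
p_sim = pearson_sim(a, b)
print(e_sim, c_sim, p_sim)

# Pearson ignores scale and offset: a rescaled, shifted copy is still a perfect match.
p_scaled = pearson_sim(a, 2 * a + 3)
print(p_scaled)
```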

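V. Recommender System

Example: a restaurant dish recommendation engine

# How the recommender works: given a user, the system returns the N best dishes to recommend to that user.
# Pseudocode:
# (1) Find the dishes the user has not rated, i.e. the 0 entries in the user-item matrix;
# (2) For each item the user has not rated, predict the rating the user would likely give it.
# (3) Sort these items by predicted rating from high to low and return the top N items.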

myMat=mat(svdRec.loadExData())
myMat[0,1]=myMat[0,0]=myMat[1,0]=myMat[2,0]=4
myMat[3,3]=2
myMat
#matrix([[4, 4, 1, 0, 0],
#        [4, 2, 2, 0, 0],
#        [4, 1, 1, 0, 0],
#        [5, 5, 5, 2, 0],
#        [1, 1, 0, 2, 2],
#        [0, 0, 0, 3, 3],
#        [0, 0, 0, 1, 1]])

# Similarity measure 1: cosine similarity (the default simMeas), recommendations for user 2
svdRec.recommend(myMat, 2)
#the 3 and 0 similarity is: 0.916025
#the 3 and 1 similarity is: 0.916025
#the 3 and 2 similarity is: 1.000000
#the 4 and 0 similarity is: 1.000000
#the 4 and 1 similarity is: 1.000000
#the 4 and 2 similarity is: 0.000000
#Out[13]:[(4, 2.5), (3, 1.9703483892927431)]

# Similarity measure 2: Euclidean distance
svdRec.recommend(myMat, 2, simMeas=svdRec.ecludSim)
#the 3 and 0 similarity is: 0.240253
#the 3 and 1 similarity is: 0.240253
#the 3 and 2 similarity is: 0.250000
#the 4 and 0 similarity is: 0.500000
#the 4 and 1 similarity is: 0.500000
#the 4 and 2 similarity is: 0.000000
#Out[14]:[(4, 2.5), (3, 1.98665729687295)]

# Similarity measure 3: Pearson correlation
svdRec.recommend(myMat, 2, simMeas=svdRec.pearsSim)
#the 3 and 0 similarity is: 1.000000
#the 3 and 1 similarity is: 1.000000
#the 3 and 2 similarity is: 1.000000
#the 4 and 0 similarity is: 1.000000
#the 4 and 1 similarity is: 1.000000
#the 4 and 2 similarity is: 0.000000
#Out[15]:[(4, 2.5), (3, 2.0)]
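
The predicted ratings in the outputs above are similarity-weighted averages of the ratings the user has already given, which is what standEst in svdRec.py returns. The sketch below recomputes the cosine-similarity estimates for user 2 by hand from the printed similarities; `weighted_rating` is a hypothetical helper name used only for this illustration.

```python
def weighted_rating(similarities, ratings):
    """Similarity-weighted average of the user's known ratings."""
    total = sum(similarities)
    if total == 0:
        return 0.0
    return sum(s * r for s, r in zip(similarities, ratings)) / total


# User 2 rated items 0, 1 and 2 with 4, 1 and 1 (row 2 of myMat).
user_ratings = [4, 1, 1]

# Similarities printed by recommend(myMat, 2) with the default cosine measure.
sims_item3 = [0.916025, 0.916025, 1.000000]
sims_item4 = [1.000000, 1.000000, 0.000000]

est_item3 = weighted_rating(sims_item3, user_ratings)
est_item4 = weighted_rating(sims_item4, user_ratings)
print("item 3:", est_item3)
print("item 4:", est_item4)
```

Item 4 gets the higher estimate because the two items it resembles most include the one the user rated 4, while the item with rating 1 that is dissimilar to it contributes nothing.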

Using SVD to improve recommendations

from numpy import linalg as la
U,Sigma,VT=la.svd(mat(svdRec.loadExData2()))

# Total energy (sum of the squared singular values)
Sig2 = Sigma**2
sum(Sig2)
#541.9999999999994

# 90% of the total energy
sum(Sig2)*0.9
#487.7999999999995

# Energy captured by the first two singular values (below the 90% threshold)
sum(Sig2[:2])
#378.8295595113579

# Energy captured by the first three singular values (above the 90% threshold)
sum(Sig2[:3])
#500.5002891275792
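
The manual comparison above can be automated. The following sketch finds the smallest number of singular values whose squared sum reaches 90% of the total energy for the loadExData2() matrix. The 0.9 threshold is the one used above; the variable names are chosen here for illustration.

```python
import numpy as np

# The matrix returned by loadExData2() in svdRec.py.
ratings = np.array([
    [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
    [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
    [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
    [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
    [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
    [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
    [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
    [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
    [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
    [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0],
], dtype=float)

sigma = np.linalg.svd(ratings, compute_uv=False)
energy = sigma ** 2                              # energy of each singular value
cumulative = np.cumsum(energy) / energy.sum()    # fraction of energy kept by the first k values

threshold = 0.9
# Index of the first cumulative fraction that reaches the threshold, plus one.
k = int(np.searchsorted(cumulative, threshold) + 1)
print("total energy:", energy.sum())
print("singular values needed for 90% of the energy:", k)
```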

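# SVD-based recommendation 1: default cosine similarity, user 1
# (myMat is still the modified 7x5 matrix from the previous example)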
recommend1 = svdRec.recommend(myMat, 1, estMethod=svdRec.svdEst)
recommend1
#the 3 and 0 similarity is: 0.441210
#the 3 and 1 similarity is: 0.523799
#the 3 and 2 similarity is: 0.650061
#the 4 and 0 similarity is: 0.561288
#the 4 and 1 similarity is: 0.475190
#the 4 and 2 similarity is: 0.343564
#Out[21]:[(4, 2.813435807030927), (3, 2.546365997875842)]

# SVD-based recommendation 2: Pearson correlation
recommend2 = svdRec.recommend(myMat, 1, estMethod=svdRec.svdEst,simMeas=svdRec.pearsSim)
recommend2
#the 3 and 0 similarity is: 0.833133
#the 3 and 1 similarity is: 0.472864
#the 3 and 2 similarity is: 0.706605
#the 4 and 0 similarity is: 0.912539
#the 4 and 1 similarity is: 0.442920
#the 4 and 2 similarity is: 0.357391
#Out[22]:[(4, 3.065521710794845), (3, 2.827916320630774)]

# SVD-based recommendation 3: cosine similarity passed explicitly (same result as recommendation 1)
recommend3 = svdRec.recommend(myMat, 1, estMethod=svdRec.svdEst,simMeas=svdRec.cosSim)
recommend3
#the 3 and 0 similarity is: 0.441210
#the 3 and 1 similarity is: 0.523799
#the 3 and 2 similarity is: 0.650061
#the 4 and 0 similarity is: 0.561288
#the 4 and 1 similarity is: 0.475190
#the 4 and 2 similarity is: 0.343564
#Out[23]:[(4, 2.813435807030927), (3, 2.546365997875842)]
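
The svdEst function maps every item into a low-dimensional space with the expression dataMat.T * U[:,:k] * Sig.I before computing similarities. As a sanity check, the sketch below shows that this projection yields the first k right singular vectors, so each item is described by k numbers instead of one number per user. It uses the unmodified 7x5 example matrix and k = 3 (an assumption made here because that matrix has only three nonzero singular values; svdEst itself hard-codes 4).

```python
import numpy as np

data = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, sigma, Vt = np.linalg.svd(data, full_matrices=False)

k = 3  # number of singular values kept (assumption for this example)
sig_k_inv = np.linalg.inv(np.diag(sigma[:k]))

# Same projection as in svdEst: one row per item, k columns.
items_k = data.T @ U[:, :k] @ sig_k_inv

same_as_v = np.allclose(items_k, Vt[:k, :].T)
print("items in reduced space:", items_k.shape)
print("equals the first k right singular vectors:", same_as_v)
```

Similarities between items are then computed on these short rows rather than on the sparse rating columns.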

VI. Recommender System: svdRec.py Source


from numpy import *
from numpy import linalg as la

def loadExData():
    return[ [1, 1, 1, 0, 0],
          [2, 2, 2, 0, 0],
          [1, 1, 1, 0, 0],
          [5, 5, 5, 0, 0],
          [1, 1, 0, 2, 2],
          [0, 0, 0, 3, 3],
          [0, 0, 0, 1, 1]]
    
    

def loadExData2():
    return[[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
           [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
           [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
           [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
           [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
           [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
           [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
           [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
           [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
           [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
           [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]
    
def ecludSim(inA,inB):
    return 1.0/(1.0 + la.norm(inA - inB))

def pearsSim(inA,inB):
    if len(inA) < 3 : return 1.0
    return 0.5+0.5*corrcoef(inA, inB, rowvar = 0)[0][1]

def cosSim(inA,inB):
    num = float((inA.T*inB)[0,0])  # take the scalar out of the 1x1 matrix product
    denom = la.norm(inA)*la.norm(inB)
    return 0.5+0.5*(num/denom)

def standEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]
    simTotal = 0.0; ratSimTotal = 0.0
    for j in range(n):
        userRating = dataMat[user,j]
        if userRating == 0: continue
        overLap = nonzero(logical_and(dataMat[:,item].A>0, \
                                      dataMat[:,j].A>0))[0]
        if len(overLap) == 0: similarity = 0
        else: similarity = simMeas(dataMat[overLap,item], \
                                   dataMat[overLap,j])
        print ('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0: return 0
    else: return ratSimTotal/simTotal
    
def svdEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]
    simTotal = 0.0; ratSimTotal = 0.0
    U,Sigma,VT = la.svd(dataMat)
    Sig4 = mat(eye(4)*Sigma[:4]) #arrange Sig4 into a diagonal matrix
    xformedItems = dataMat.T * U[:,:4] * Sig4.I  #create transformed items
    for j in range(n):
        userRating = dataMat[user,j]
        if userRating == 0 or j==item: continue
        similarity = simMeas(xformedItems[item,:].T,\
                             xformedItems[j,:].T)
        print ('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0: return 0
    else: return ratSimTotal/simTotal

def recommend(dataMat, user, N=3, simMeas=cosSim, estMethod=standEst):
    unratedItems = nonzero(dataMat[user,:].A==0)[1]#find unrated items 
    if len(unratedItems) == 0: return 'you rated everything'
    itemScores = []
    for item in unratedItems:
        estimatedScore = estMethod(dataMat, user, simMeas, item)
        itemScores.append((item, estimatedScore))
    return sorted(itemScores, key=lambda jj: jj[1], reverse=True)[:N]

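def printMat(inMat, thresh=0.8):
    # print a 32x32 matrix as rows of 1s and 0s, thresholding each entry
    for i in range(32):
        for k in range(32):
            if float(inMat[i,k]) > thresh:
                print (1, end='')  # Python 3: stay on the same output line
            else: print (0, end='')
        print ('')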

def imgCompress(numSV=3, thresh=0.8):
    myl = []
    # read the 32x32 image, one row of digit characters per line
    with open('0_5.txt') as f:
        for line in f:
            newRow = []
            for i in range(32):
                newRow.append(int(line[i]))
            myl.append(newRow)
    myMat = mat(myl)
    print ("****original matrix******")
    printMat(myMat, thresh)
    U,Sigma,VT = la.svd(myMat)
    SigRecon = mat(zeros((numSV, numSV)))
    for k in range(numSV):#construct diagonal matrix from vector
        SigRecon[k,k] = Sigma[k]
    reconMat = U[:,:numSV]*SigRecon*VT[:numSV,:]
    print ("****reconstructed matrix using %d singular values******" % numSV)
    printMat(reconMat, thresh)
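
To make the benefit of imgCompress concrete, the sketch below counts how many numbers must be stored for the truncated factors compared with the full 32x32 image. The helper `svd_storage` is a hypothetical name for this illustration, and the count assumes that only the first numSV columns of U, the numSV singular values, and the first numSV rows of VT are kept.

```python
def svd_storage(rows, cols, num_sv):
    """Numbers stored for a truncated SVD: U[:, :k], k singular values, VT[:k, :]."""
    return rows * num_sv + num_sv + num_sv * cols


original = 32 * 32                      # every pixel of the 32x32 image
compressed = svd_storage(32, 32, 3)     # numSV=3, the default in imgCompress
ratio = original / compressed

print("original values:  ", original)
print("compressed values:", compressed)
print("compression ratio: %.2f" % ratio)
```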

Machine Learning in Action (《机器学习实战》), CH14: Simplifying Data with SVD
