Exploring the user-based recommender 2 (similarity metrics)

Sample Data (userID,itemID,preference)

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
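If this data is saved to a file (the name intro.csv below is just an assumed example), Mahout's FileDataModel can read it directly. A minimal sketch:

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class LoadSampleData {
    public static void main(String[] args) throws Exception {
        // FileDataModel parses lines of the form userID,itemID,preference
        DataModel model = new FileDataModel(new File("intro.csv"));
        System.out.println("users: " + model.getNumUsers()); // 5
        System.out.println("items: " + model.getNumItems()); // 7
    }
}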
  • Pearson correlation–based similarity

The Pearson correlation (Pearson product-moment correlation) is a number between –1 and 1 that measures the tendency of two series of numbers, paired up one-to-one, to move together. That is to say, it measures how likely a number in one series is to be relatively large when the corresponding number in the other series is large, and vice versa. It measures the tendency of the numbers to move together proportionally, such that there’s a roughly linear relationship between the values in one series and the other. When this tendency is high, the correlation is close to 1. When there appears to be little relationship at all, the value is near 0. When there appears to be an opposing relationship—one series’ numbers are high exactly when the other series’ numbers are low—the value is near –1.

 

It measures the tendency of two users’ preference values to move together—to be relatively high, or relatively low, on the same items.

Formula

r(p, q) = \frac{\sum_{i=1}^{n}(p_i - \bar{p})(q_i - \bar{q})}{\sqrt{\sum_{i=1}^{n}(p_i - \bar{p})^2}\,\sqrt{\sum_{i=1}^{n}(q_i - \bar{q})^2}}

where p_i and q_i are the two users' preference values for the i-th of the n items that both have rated, and \bar{p}, \bar{q} are the means of those values.

The similarity computation can only operate on items that both users have expressed a preference for.
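As a worked example from the sample data: users 1 and 5 rate three items in common (101, 102, 103), with preferences (5.0, 3.0, 2.5) and (4.0, 3.0, 2.0), so \bar{p} = 3.5 and \bar{q} = 3.0:

r = \frac{(1.5)(1.0) + (-0.5)(0.0) + (-1.0)(-1.0)}{\sqrt{1.5^2 + (-0.5)^2 + (-1.0)^2}\,\sqrt{1.0^2 + 0.0^2 + (-1.0)^2}} = \frac{2.5}{\sqrt{3.5}\,\sqrt{2.0}} \approx 0.945

which matches the intuition that users 1 and 5 rate items in a very similar pattern.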


 

Pearson correlation problems

First, it doesn’t take into account the number of items in which two users’ preferences overlap, which is probably a weakness in the context of recommender engines.

Second, if two users overlap on only one item, no correlation can be computed because of how the computation is defined.

Finally, the correlation is also undefined if all of the preference values in either series are identical, because that series then has zero variance.

 

Class Constructor

PearsonCorrelationSimilarity(DataModel dataModel)

PearsonCorrelationSimilarity(DataModel dataModel, Weighting weighting)
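As a rough sketch of how these constructors fit into a complete user-based recommender (class names are from Mahout's Taste API; the file name intro.csv and the neighborhood size of 2 are arbitrary choices here):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class PearsonRecommenderExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("intro.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // Pairwise similarity between two users, e.g. users 1 and 5
        System.out.println(similarity.userSimilarity(1, 5)); // about 0.945

        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);

        // One recommendation for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 1);
        for (RecommendedItem item : recommendations) {
            System.out.println(item);
        }
    }
}

Trying a different metric only requires changing the line that constructs the UserSimilarity, which is how the following sections slot in.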

 

  • Euclidean distance–based similarity

This implementation is based on the distance between users. This idea makes sense if you think of users as points in a space of many dimensions (as many dimensions as there are items), whose coordinates are preference values.

 

Formula

d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_n - q_n)^2}

r = \frac{1}{1 + d(p, q)}

The distance d is computed over the items both users have rated, and it is turned into a similarity r that approaches 1 as the distance approaches 0.
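Applying the formula above to the same pair of users (1 and 5, common items 101, 102, and 103):

d = \sqrt{(5.0 - 4.0)^2 + (3.0 - 3.0)^2 + (2.5 - 2.0)^2} = \sqrt{1.25} \approx 1.118

r = \frac{1}{1 + 1.118} \approx 0.47

(The value actually returned by EuclideanDistanceSimilarity may differ slightly, since some Mahout versions normalize the distance by the number of co-rated items; the arithmetic here only illustrates the formula as written.)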


Class Constructor

EuclideanDistanceSimilarity(DataModel dataModel)

EuclideanDistanceSimilarity(DataModel dataModel, Weighting weighting)
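Swapping this metric into the earlier recommender sketch is a one-line change; Weighting.WEIGHTED is shown only to illustrate the two-argument constructor:

import org.apache.mahout.cf.taste.common.Weighting;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// In place of new PearsonCorrelationSimilarity(model):
UserSimilarity similarity = new EuclideanDistanceSimilarity(model);

// Two-argument form: weights the result by how much data the two users have in common
UserSimilarity weighted = new EuclideanDistanceSimilarity(model, Weighting.WEIGHTED);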
  • cosine measure similarity

The cosine measure similarity is another similarity metric that depends on envisioning user preferences as points in space. Hold in mind the image of user preferences as points in an n-dimensional space. Now imagine two lines from the origin, or point (0,0,...,0), to each of these two points. When two users are similar, they’ll have similar ratings, and so will be relatively close in space—at least, they’ll be in roughly the same direction from the origin. The angle formed between these two lines will be relatively small. In contrast, when the two users are dissimilar, their points will be distant, and likely in different directions from the origin, forming a wide angle.

 

Formula

\cos(\theta) = \frac{p \cdot q}{\|p\|\,\|q\|} = \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2}\,\sqrt{\sum_{i=1}^{n} q_i^2}}

where p and q are the two users' preference vectors over their n common items.
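Applied directly to the raw (un-centered) preference values of users 1 and 5 on their three common items, the formula gives:

\cos(\theta) = \frac{5.0 \cdot 4.0 + 3.0 \cdot 3.0 + 2.5 \cdot 2.0}{\sqrt{5.0^2 + 3.0^2 + 2.5^2}\,\sqrt{4.0^2 + 3.0^2 + 2.0^2}} = \frac{34.0}{\sqrt{40.25}\,\sqrt{29.0}} \approx 0.995

i.e. these two users' preference vectors point in almost exactly the same direction.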



 

Class Constructor

Note that these are the same constructors as for the Pearson correlation. When each user's preference values are centered around their mean, the cosine measure reduces to the Pearson correlation, so in Mahout the cosine measure is obtained through PearsonCorrelationSimilarity (an UncenteredCosineSimilarity implementation also exists for data that is not mean-centered).

PearsonCorrelationSimilarity(DataModel dataModel)

PearsonCorrelationSimilarity(DataModel dataModel, Weighting weighting)

 

 

 

References

http://www.statisticshowto.com/what-is-the-pearson-correlation-coefficient/

http://www.socialresearchmethods.net/kb/statcorr.php

http://en.wikipedia.org/wiki/Cosine_similarity

 

 

Books

The Practically Cheating Statistics Handbook, the Sequel! (2nd Edition)

Reposted from ylzhj02.iteye.com/blog/2057165