CDN Notes II: Locality Sensitive Hashing, Continued

This is part of my general survey on LSH for my CDN class.

NN search
Given a set P of n points, design an algorithm that, given any point q, returns a point in P that is closest to q (its “nearest neighbor” in P).

ANN search
The approximate nearest neighbor search problem is to find a point whose distance from the query is at most c times the distance from the query to its nearest point. Formally:
Given a set P and a point q, find a point $p_i \in P$ such that
$d(p_i, q) \leq c \cdot \min_{p_j \in P} d(p_j, q)$

The appeal of this approach is that in many cases it is almost as good as the exact one. In particular, if the distance measure accurately captures the notion of quality to the user, then small differences in the distance should not matter.

There are two main families of approaches to the ANN problem: tree-based methods and LSH.

LSH
LSH refers to a family of functions (an LSH family) that hash data points into buckets so that points near each other land in the same bucket with high probability, while points far apart are likely to land in different buckets. This makes it easier to identify similar observations in high-dimensional space. The formal definition is:
A family of hash functions is $(r, cr, p_1, p_2)$-LSH, with $p_1 \geq p_2$ and $c \geq 1$, if:
$\Pr[h(x) = h(y)] \geq p_1$ when $d(x, y) \leq r$
$\Pr[h(x) = h(y)] \leq p_2$ when $d(x, y) \geq cr$

A simple LSH example
Assume that distance is measured by Hamming distance and that $v(p)$ denotes the Hamming representation of a point p, which is a binary vector: each coordinate takes a value $d \in [1, 4]$ and is encoded in unary as d ones followed by $4 - d$ zeros, so every two-dimensional point maps to an 8-bit vector. In Hamming space, the LSH family is a set H that satisfies: for each hash function h in H, $h(p) = 0$ if the r-th bit of $v(p)$ is 0, where r is a random index in $[1, |v(p)|]$; otherwise $h(p) = 1$. Say we have six data points: A(1,1), B(2,1), C(1,2), D(2,2), E(4,2), and F(4,3). The corresponding Hamming vectors are shown in Table 1.
Table 1. Hamming vectors of the six data points:
A(1,1) → 10001000
B(2,1) → 11001000
C(1,2) → 10001100
D(2,2) → 11001100
E(4,2) → 11111100
F(4,3) → 11111110

Assume that we have three hash tables g1, g2, and g3, each with two hash functions. For g1 we sample the second and fourth bits, for g2 the first and sixth, and for g3 the third and eighth. We then map each point into the hash tables. For example, to map point A into g1: the second bit of $v(A)$ is 0 and the fourth bit is 0, so A is hashed into bucket 00. The full hashing result is shown in Figure 1.

Suppose we want to search for the point Q(4,4). First, transform Q into Hamming space: $v(Q) = 11111111$. Second, map Q into the three hash tables; in each one it falls into bucket 11. We then only need to compare Q against the points in those buckets: E and F in g1, and C, D, E, F in g2. The number of candidates is relatively large in this small example, but as the number of hash tables and of hash functions per table grows, the candidates shrink to a small fraction of the whole dataset.

Figure 1. Bucket contents of the three hash tables:
g1 (bits 2, 4): 00 → {A, C}, 10 → {B, D}, 11 → {E, F}
g2 (bits 1, 6): 10 → {A, B}, 11 → {C, D, E, F}
g3 (bits 3, 8): 00 → {A, B, C, D}, 10 → {E, F}
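
The whole walkthrough fits in a short, self-contained Python sketch. This is a minimal illustration: the `to_hamming` and `HashTable` names are mine, but the unary encoding, the sampled bit positions, and the query follow the example above.

```python
import random
from collections import defaultdict

MAX_COORD = 4  # coordinates in the example range over [1, 4]

def to_hamming(point):
    """Unary-encode each coordinate: d ones padded with zeros to length 4."""
    return "".join("1" * d + "0" * (MAX_COORD - d) for d in point)

class HashTable:
    """One LSH table: sample fixed bit positions of the Hamming vector."""
    def __init__(self, bit_positions):
        self.bits = bit_positions            # 1-based positions, as in the text
        self.buckets = defaultdict(list)

    def key(self, vec):
        return "".join(vec[i - 1] for i in self.bits)

    def insert(self, label, vec):
        self.buckets[self.key(vec)].append(label)

    def query(self, vec):
        return self.buckets.get(self.key(vec), [])

points = {"A": (1, 1), "B": (2, 1), "C": (1, 2),
          "D": (2, 2), "E": (4, 2), "F": (4, 3)}

# g1 samples bits 2 and 4, g2 bits 1 and 6, g3 bits 3 and 8 (as above).
tables = [HashTable([2, 4]), HashTable([1, 6]), HashTable([3, 8])]
for label, p in points.items():
    vec = to_hamming(p)
    for t in tables:
        t.insert(label, vec)

# Query Q(4,4): v(Q) = 11111111, so we collect every 11 bucket.
q = to_hamming((4, 4))
candidates = set()
for t in tables:
    candidates.update(t.query(q))
print(sorted(candidates))  # ['C', 'D', 'E', 'F'] -> compare only these to Q
```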

LSH Schemes

  • Min Hash
    Min Hash is a hashing scheme originally proposed for detecting duplicate web pages and eliminating them from search results. The goal of MinHash is to estimate the Jaccard similarity of two sets quickly, without explicitly computing their intersection and union (a code sketch follows this list).
    In the Min Hash scheme, the hash function should satisfy, for any two sets A and B,
    $\Pr[h(A) = h(B)] = J(A, B)$
    An example hash function: let $f: U \rightarrow \{0, \dots, 2^{64} - 1\}$ be a random function and $h(A) = \min_{a \in A} f(a)$. Then $\Pr[\min_{a \in A} f(a) = \min_{b \in B} f(b)] = J(A, B)$.

  • Sim Hash
    The Sim Hash scheme was proposed for the same purpose as Min Hash, i.e. estimating the similarity of two sets. The idea of Sim Hash is to project documents into a vector space and measure their angular distance. The hash function consists of two steps. The first step is projection: select a random vector $r$ and compute the dot product between u and r, where u is the vector representation of the document. The second step is rounding, which returns $h_r(u) = \mathrm{sign}(\langle u, r \rangle)$. Then
    $\Pr[h(u) \neq h(v)] = d(u, v)/\pi$
    where $d(u, v)$ is the angular distance between vectors u and v (a code sketch also follows this list).

  • Other Schemes
    SRS
    QALSH
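
Here is the MinHash sketch promised above, as a minimal, illustrative Python implementation; the seeded blake2b hash and the parameter k = 128 are my choices, not part of the original scheme. Averaging collisions over k independent hash functions estimates $J(A, B)$.

```python
import hashlib

def minhash_signature(items, k=128):
    """k MinHash values: for each seed, the minimum of a seeded 64-bit hash."""
    sig = []
    for seed in range(k):
        def f(x, s=seed):
            h = hashlib.blake2b(f"{s}:{x}".encode(), digest_size=8)
            return int.from_bytes(h.digest(), "big")
        sig.append(min(f(x) for x in items))
    return sig

def jaccard_estimate(sig_a, sig_b):
    """Fraction of signature coordinates where the two sets collide."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A = set("the quick brown fox jumps over the lazy dog".split())
B = set("the quick brown fox leaps over a sleepy dog".split())
true_j = len(A & B) / len(A | B)
est_j = jaccard_estimate(minhash_signature(A), minhash_signature(B))
print(f"exact Jaccard = {true_j:.3f}, MinHash estimate = {est_j:.3f}")
```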
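
And a corresponding Sim Hash sketch, under the same caveat that the names and parameters here are illustrative: each bit is the sign of the dot product between the document's word-count vector and a random Gaussian vector, so the fraction of agreeing bits estimates $1 - d(u, v)/\pi$.

```python
import random
from collections import Counter

def simhash_bits(text, planes):
    """One bit per random hyperplane: sign of <u, r> for the word-count vector u."""
    u = Counter(text.split())
    bits = []
    for r in planes:
        dot = sum(count * r[word] for word, count in u.items())
        bits.append(dot >= 0)
    return bits

def angular_similarity(bits_a, bits_b):
    """Fraction of agreeing bits; estimates 1 - d(u, v)/pi."""
    agree = sum(a == b for a, b in zip(bits_a, bits_b))
    return agree / len(bits_a)

k = 256
vocab = "the quick brown fox jumps over a lazy sleepy dog leaps".split()
rng = random.Random(0)
# One random Gaussian coordinate per vocabulary word, per hyperplane.
planes = [{w: rng.gauss(0, 1) for w in vocab} for _ in range(k)]

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over a sleepy dog"
sim = angular_similarity(simhash_bits(doc1, planes), simhash_bits(doc2, planes))
print(f"estimated 1 - d(u, v)/pi = {sim:.3f}")  # the documents are close, so near 1
```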

Reference

  • 局部敏感哈希深度解析(locality-sensetive hashing, LSH)
  • Locality Sensitive Hashing归总
  • A. Z. Broder, On the resemblance and containment of documents, 1997
  • M. S. Charikar, Similarity estimation techniques from rounding algorithms, 2002
  • Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin, SRS: solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index, 2014
  • Q. Huang, J. Feng, Y. Zhang, Q. Fang, and W. Ng, Query-aware locality-sensitive hashing for approximate nearest neighbor search, 2015

Reposted from blog.csdn.net/thormas1996/article/details/85246910