Locality Sensitive Hashing: A Short Intro Based on MinHash

Hashing maps objects into bins. Unlike conventional hash functions, which minimize collision probability, locality sensitive hashing (LSH) functions maximize it for similar objects. In other words, for a given distance measure, similar items are more likely to be mapped to the same bin under LSH. This way, we can find neighbors of a given item within a small neighborhood (hence the name “locality sensitive” hashing), without comparing it against all items. This dramatically speeds up nearest neighbor search.

Here we use “objects” in a broad sense to mean any data (e.g., sets, arrays, documents). One example is a text document represented by a bag of words; another is a shopping basket containing a set of grocery items.

There are different LSH functions for different distance measures. LSH functions group similar items into the same buckets, but before we can do that we first need to define “similar” in terms of the distance measure under consideration. That is why an LSH function is always associated with a distance measure.

Loosely speaking, for a given distance measure, we choose an LSH function such that similar items are more likely to be grouped together after hashing.

MinHash functions for Jaccard distance

The MinHash functions convert a set into a signature; the length of the signature equals the number of row permutations (i.e., the number of hash functions) used.

The probability that the minhash value for a random permutation of the rows is the same for two sets is equal to the Jaccard similarity of those two sets.

Another way to interpret this: each minhash value is the outcome of one such random experiment, so the fraction of matching values across a signature is an empirical estimate of that probability, i.e., of the Jaccard similarity.

Given the signatures of two sets, we can simply compute the fraction of positions at which the two signatures agree; this fraction is a good approximation of the Jaccard similarity of the two sets.
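
To make the row-permutation idea concrete, here is a minimal from-scratch sketch (my own illustration, not the datasketch API): each permutation is simulated by a salted hash function, and the signature keeps the minimum hash value per function. The fraction of positions where two signatures agree then estimates the Jaccard similarity.

import random

def minhash_signature(items, num_perm=128, seed=0):
    # Simulate num_perm random row permutations with salted hash functions;
    # for each one, keep the smallest hash value over the set.
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perm)]
    return [min(hash((salt, item)) for item in items) for salt in salts]

def estimate_jaccard(sig1, sig2):
    # Fraction of signature positions where the two sets agree.
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

s1 = {'apple', 'banana', 'cherry', 'date'}
s2 = {'apple', 'banana', 'cherry', 'fig'}
print(estimate_jaccard(minhash_signature(s1), minhash_signature(s2)))
# should be close to |s1 ∩ s2| / |s1 ∪ s2| = 3/5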

Using the datasketch package:

from datasketch import MinHash
import numpy as np

data1 = ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
         'estimating', 'the', 'similarity', 'between', 'datasets']
data2 = ['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
         'estimating', 'the', 'similarity', 'between', 'documents']

m1, m2 = MinHash(), MinHash()
for d in data1:
    m1.update(d.encode('utf8'))
for d in data2:
    m2.update(d.encode('utf8'))
print("Estimated Jaccard for data1 and data2 is", m1.jaccard(m2))

# or compute it from the digests of the two MinHashes
dg1 = m1.digest()
dg2 = m2.digest()
# the numerator is the number of positions where the digests are equal;
# the denominator can be len(dg1) or len(dg2), they have the same length
j = np.sum(dg1 == dg2) / len(dg2)
print("another estimate from the digests", j)

s1 = set(data1)
s2 = set(data2)
actual_jaccard = float(len(s1.intersection(s2))) / float(len(s1.union(s2)))
print("Actual Jaccard for data1 and data2 is", actual_jaccard)
# the estimated similarity is 0.7109375, the actual Jaccard is 10/14 ≈ 0.714

We can verify that the estimate produced by MinHash is close to the exact Jaccard similarity computed directly from the sets.

Alternatively, one can verify the result by accessing the signatures directly:

common_positions = np.argwhere(m1.digest() == m2.digest()).flatten()
print(len(common_positions) / 128)  # 128 is the default number of permutations

In summary, we can use the MinHash functions associated with Jaccard distance to group similar items together (similar in terms of the distance measure under consideration, which here is the Jaccard distance).

Bit sampling for Hamming distance

We can use bit sampling as an LSH function for Hamming distance.
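
A rough sketch of the idea (the helper names are mine, not from any library): the hash of a binary vector is just the bit at a randomly chosen position, so two vectors of length n collide with probability 1 − d_H(x, y)/n, and the closer they are in Hamming distance, the more often they collide.

import random

def make_bit_sampler(n_bits, seed=None):
    # The LSH "function" is simply a randomly chosen bit position.
    pos = random.Random(seed).randrange(n_bits)
    return lambda bits: bits[pos]

x = [1, 0, 1, 1, 0, 0, 1, 0]
y = [1, 0, 1, 0, 0, 0, 1, 1]   # Hamming distance 2 out of 8 bits

samplers = [make_bit_sampler(len(x), seed=s) for s in range(1000)]
collisions = sum(h(x) == h(y) for h in samplers)
print(collisions / len(samplers))  # close to 1 - 2/8 = 0.75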

See Mining of Massive Datasets, Section 3.7.

Or this nice post: https://randorithms.com/2019/09/19/Visual-LSH.html

Random projection for cosine distance

Similar to minhash for Jaccard similarity, we can use random projections to approximate the cosine similarity.

By randomly choosing hyperplanes, each defined by a unit normal vector, we can use a Monte Carlo method to estimate the similarity.

If we choose a large number of random hyperplanes, each defined by a normal vector v, the probability that the dot products x·v and y·v have different signs is θ/π, where θ is the angle between x and y.

As we can see, the closer (more similar) two vectors are, the more often a random projection yields the same sign for both dot products.
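
Here is a minimal sketch of this estimator (dimensions and variable names are illustrative only): the signature of a vector is its sign pattern under a set of random hyperplane normals, the fraction of disagreeing signs estimates θ/π, and from θ we can recover the cosine similarity.

import numpy as np

def sign_signature(x, planes):
    # One sign bit per random hyperplane normal.
    return (planes @ x) >= 0

rng = np.random.default_rng(0)
dim, num_planes = 10, 1000
planes = rng.normal(size=(num_planes, dim))   # random hyperplane normals

x = rng.normal(size=dim)
y = x + 0.5 * rng.normal(size=dim)            # a vector similar to x

# Fraction of differing signs estimates theta / pi.
theta_est = np.pi * np.mean(sign_signature(x, planes) != sign_signature(y, planes))
cos_true = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.cos(theta_est), cos_true)            # the two values should be close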

This blog post (see note 5 below) provides a good explanation of this.

One can also see Mining of Massive Datasets, Section 3.7.

LSH functions for Euclidean space

This topic is a bit more involved; see this nice post: https://randorithms.com/2019/09/19/Visual-LSH.html
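
For completeness, one common construction (often called p-stable or E2LSH-style hashing) projects a point onto a random direction, adds a random offset, and cuts the line into buckets of fixed width, so that nearby points tend to land in the same bucket. The sketch below is only illustrative; the bucket width w is an arbitrary choice here.

import numpy as np

def euclidean_lsh(x, a, b, w):
    # Project onto random direction a, shift by offset b, and quantize into
    # buckets of width w; points that are close often share a bucket.
    return int(np.floor((a @ x + b) / w))

rng = np.random.default_rng(1)
dim, w = 10, 4.0                      # bucket width w chosen arbitrarily here
a = rng.normal(size=dim)              # random projection direction
b = rng.uniform(0, w)                 # random offset

p = rng.normal(size=dim)
q = p + 0.1 * rng.normal(size=dim)    # a point close to p
print(euclidean_lsh(p, a, b, w), euclidean_lsh(q, a, b, w))  # likely the same bucket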

Notes

  1. A visual and intuitive introduction: https://randorithms.com/2019/09/19/Visual-LSH.html
  2. On random projection for cosine distance: https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection
  3. LSH in the larger family of Nearest Neighbour Search problem: https://en.wikipedia.org/wiki/Nearest_neighbor_search
  4. Two popular open source implementations: Facebook's https://github.com/facebookresearch/faiss and Spotify's https://github.com/spotify/annoy
  5. A very good blog post on angular/cosine distance LSH: https://tostr.pl/blog/locality-sensitive-hashing-for-angular-distance-in-python/
