MinHash is a technique from Locality Sensitive Hashing allowing to find similarities among 2 sets. This is a buzzword frequently met in Data Mining and Data Science fields of CS. What surprising is that this method was invented in 1997 and used in AltaVista web-search engine back in the 90s to find similarities among web-documents and it also can be used to:

- Find duplicates
- Near duplicate image detection
- Near neighbor search

Basically the algorithm can be applied to anything that can be presented by numbers.

Let’s start with a bit of math from theory of probability and statistics.

Define a formula of two sets *A* and *B:*

This is so-called a Jaccard coefficient.

Where: J ∈ [0..1]

j = 0 – if *A* ∩ *B = 0, that is 2 sets are disjoint meaning there are no similarities*

*j = 1 – if A ∩ B = A = B, that is 2 sets are identical.*

A, B are more similar when their **Jaccard coefficient** is closer to 1.

This simple formula is cumbersome if the sets are quite large, e.g. 2 web-documents of more than 1MB in size. Ouch, that’s too much. 1MB of text-data is **1,048,576 characters **provided that 1 ASCII char = 8 bits (of course for unicode charset it is greater).

Now that we understand a bit of theory let’s try to apply hashing to Jaccard coefficient. Everywhere I hear hashing it always leads to randomized algorithms.

Ok, let’s move on. The main idea is that similar objects hash to the same bucket. This follows from the fact that **probability of collision higher for similar objects**.

Here we give an example for 2 sets A and B but the algorithm can be applied to any number of sets.

1. Essentially, we need to construct a set of independent hash functions <h1,h2,h3,…hk> **randomly**. *k* = O(1/ε^{2}), ε > 0 such that the expected error of the estimate is at most ε. For example, 400 hashes would be required to estimate *J*(*A*,*B*) with an expected error less than or equal to .05. So, k can be varied to increase/decrease the likelihood of false negatives.

2. Next we initialize for each set A and B the value to infinity.

3. For each element s in both sets A and B we compute the element’s hash:

such as: If then .

Eventually we should have for both sets A and B.

4. If 2 sets A and B are similar then the probability P( A = B) = |*A* ∩ *B*| / |A U B|- is high and **it** **is the actual Jaccard coefficient!**

5. We calculated statistics to estimate how similar are these 2 sets. General formula is: Similarity = identical s / k

In real world this requires considering more thoroughly different parameters, hash-functions etc. However, to demonstrate the algorithm I wrote a simple java code:

import java.util.HashMap; import java.util.HashSet; import java.util.Map; import java.util.Random; import java.util.Set; public class LSHMinHash<T> { private final int hashes[]; private final int numOfHashes; private final int numOfSets; private final Set<T> setA; private final Set<T> setB; private final Map<T, boolean[]> matrix; private final int[][] minHashes; public LSHMinHash(double e, Set<T> setA, Set<T> setB){ this.numOfHashes = (int)(1 / (e * e)); this.numOfSets = 2; this.setA = setA; this.setB = setB; matrix = buildSetMatrix(setA, setB); minHashes = initMinHashes(numOfSets, numOfHashes); hashes = computeHashes(numOfHashes); } private Map<T,boolean[]> buildSetMatrix(Set<T> setA, Set<T> setB) { Map<T,boolean[]> matrix = new HashMap<T,boolean[]>(); for(T element : setA){ matrix.put(element, new boolean[] { true, false } ); } for(T element : setB){ if(matrix.containsKey(element)){ matrix.put(element, new boolean[] { true, true } ); }else if(!matrix.containsKey(element)){ matrix.put(element, new boolean[] { false, true } ); } } return matrix; } private int[][] initMinHashes(int numOfSets, int numOfHashes) { int[][] minHashes = new int[numOfSets][numOfHashes]; for (int i = 0; i < numOfSets; i++) { for (int j = 0; j < numOfHashes; j++) { minHashes[i][j] = Integer.MAX_VALUE; } } return minHashes; } private int[] computeHashes(int numOfHashes) { int[] hashes = new int[numOfHashes]; Random r = new Random(31); for (int i = 0; i < numOfHashes; i++){ int a = (int)r.nextInt(); int b = (int)r.nextInt(); int c = (int)r.nextInt(); hashes[i] = (int)((a * (a * b * c >> 4) + a * a * b * c + c) & 0xFFFFFFFF); } return hashes; } private void computeMinHashForSet(Set<T> set, int setIndex){ int hashIndex = 0; for(T element : matrix.keySet()) { for (int i = 0; i < numOfHashes; i++) { if(set.contains(element)) { int hashValue = hashes[hashIndex]; if (hashValue < minHashes[setIndex][hashIndex]) { minHashes[setIndex][hashIndex] = hashValue; } } } hashIndex++; } } private double computeMinHash(int[][] minHashes, int numOfHashes) { int identicalMinHashes = 0; for (int i = 0; i < numOfHashes; i++){ if (minHashes[0][i] == minHashes[1][i]) { identicalMinHashes++; } } return (1.0 * identicalMinHashes) / numOfHashes; } public double findSimilarities() { computeMinHashForSet(setA, 0); computeMinHashForSet(setB, 1); return computeMinHash(minHashes, numOfHashes); } public static void main(String[] args){ Set<String> setA = new HashSet<String>(); setA.add("THIS"); setA.add("IS "); setA.add("THE"); setA.add("CASE"); Set<String> setB = new HashSet<String>(); setB.add("THAT"); setB.add("IS "); setB.add("THE"); setB.add("CASE"); double errorFactor = 0.001; LSHMinHash<String> minHash = new LSHMinHash<String>(errorFactor, setA, setB); System.out.println(minHash.findSimilarities()); } }