MinHash algorithm or how to quickly find similarities among 2 documents

MinHash is a technique from Locality Sensitive Hashing allowing to find similarities among 2 sets. This is a  buzzword frequently met in Data Mining  and Data Science fields of CS. What surprising is that this method was invented in 1997 and used in AltaVista web-search engine back in the 90s to find similarities among web-documents and it also can be used to:

  • Find duplicates
  • Near duplicate image detection
  • Near neighbor search

Basically the algorithm can be applied to anything that can be presented by numbers.

Let’s start with a bit of math from theory of probability and statistics.

Define a formula of two sets A and B:

 J(A,B) = {{|A \cap B|}\over{|A \cup B|}}.

This is so-called  a Jaccard coefficient.

Where: J ∈ [0..1]

j = 0 – if A ∩ B = 0, that is 2 sets are disjoint meaning there are no similarities

j = 1 – if A ∩ B = A = B, that is 2 sets are identical.

A, B are more similar when their Jaccard coefficient is closer to 1.

This simple formula is cumbersome if the sets are quite large, e.g. 2 web-documents of more than 1MB in size. Ouch, that’s too much. 1MB of text-data is 1,048,576 characters provided that 1 ASCII char = 8 bits (of course for unicode charset it is greater).

Now that we understand a bit of theory let’s try to apply hashing to Jaccard coefficient. Everywhere I hear hashing it always leads to randomized algorithms.

Ok, let’s move on. The main idea is that similar objects hash to the same bucket. This follows from the fact that probability of collision higher for similar objects.

Here we give an example for 2 sets A and B but the algorithm can be applied to any number of sets.

1. Essentially, we need to construct a set of independent hash functions <h1,h2,h3,…hk> randomly.  k = O(1/ε2), ε > 0 such that the expected error of the estimate is at most ε. For example, 400 hashes would be required to estimate J(A,B) with an expected error less than or equal to .05. So, k can be varied to increase/decrease the likelihood of false negatives.

2. Next we initialize for each set A and B the  value to infinity.

3. For each element s in both sets A and B we compute the element’s hash:

 such as: If  then .

Eventually we should have  for both sets A and B.

4. If 2 sets A and B are similar then the probability P(  A =  B) = |A ∩ B| / |A U B|- is high and it is the actual Jaccard coefficient!

5. We calculated   statistics to estimate how similar are these 2 sets. General formula is: Similarity = identical s / k

In real world this requires considering more thoroughly different parameters, hash-functions etc. However, to demonstrate the algorithm I wrote a simple java code:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class LSHMinHash<T> {

    private final int hashes[];
    private final int numOfHashes;
    private final int numOfSets;
    private final Set<T> setA;
    private final Set<T> setB;
    private final Map<T, boolean[]> matrix;
    private final int[][] minHashes;

    public LSHMinHash(double e, Set<T> setA, Set<T> setB){
        this.numOfHashes = (int)(1 / (e * e));
        this.numOfSets = 2;
        this.setA = setA;
        this.setB = setB;
        matrix = buildSetMatrix(setA, setB);
        minHashes = initMinHashes(numOfSets, numOfHashes);
        hashes = computeHashes(numOfHashes);
    }

    private Map<T,boolean[]> buildSetMatrix(Set<T> setA, Set<T> setB) {

        Map<T,boolean[]> matrix = new HashMap<T,boolean[]>();

        for(T element : setA){
            matrix.put(element, new boolean[] { true, false } );
        }

        for(T element : setB){
            if(matrix.containsKey(element)){
                matrix.put(element, new boolean[] { true, true } );
            }else if(!matrix.containsKey(element)){
                matrix.put(element, new boolean[] { false, true } );
            }
        }

        return matrix;
    }

    private int[][] initMinHashes(int numOfSets, int numOfHashes) {
        int[][] minHashes = new int[numOfSets][numOfHashes];

        for (int i = 0; i < numOfSets; i++) {
            for (int j = 0; j < numOfHashes; j++) {
                minHashes[i][j] = Integer.MAX_VALUE;
            }
        }
        return minHashes;
    }

    private int[] computeHashes(int numOfHashes) {
        int[] hashes = new int[numOfHashes];
        Random r = new Random(31);

        for (int i = 0; i < numOfHashes; i++){
            int a = (int)r.nextInt();
            int b = (int)r.nextInt();
            int c = (int)r.nextInt();
            hashes[i] = (int)((a * (a * b * c >> 4) + a * a * b * c + c) & 0xFFFFFFFF);
        }
        return hashes;
    }

    private void computeMinHashForSet(Set<T> set, int setIndex){
        int hashIndex = 0;

        for(T element : matrix.keySet()) {
            for (int i = 0; i < numOfHashes; i++) {
                if(set.contains(element)) {
                    int hashValue = hashes[hashIndex];
                    if (hashValue < minHashes[setIndex][hashIndex]) {
                        minHashes[setIndex][hashIndex] = hashValue;
                    }
                }
            }
            hashIndex++;
        }
    }

    private double computeMinHash(int[][] minHashes, int numOfHashes) {
        int identicalMinHashes = 0;
        for (int i = 0; i < numOfHashes; i++){
            if (minHashes[0][i] == minHashes[1][i]) {
                identicalMinHashes++;
            }
        }
        return (1.0 * identicalMinHashes) / numOfHashes;
    }

    public double findSimilarities() {
        computeMinHashForSet(setA, 0);
        computeMinHashForSet(setB, 1);
        return computeMinHash(minHashes, numOfHashes);
    }

    public static void main(String[] args){
        Set<String> setA = new HashSet<String>();
        setA.add("THIS");
        setA.add("IS ");
        setA.add("THE");
        setA.add("CASE");

        Set<String> setB = new HashSet<String>();
        setB.add("THAT");
        setB.add("IS ");
        setB.add("THE");
        setB.add("CASE");

        double errorFactor = 0.001;

        LSHMinHash<String> minHash = new LSHMinHash<String>(errorFactor, setA, setB);
        System.out.println(minHash.findSimilarities());
    }
}