Ever Wondered How Facebook Knows Which Post Is Similar To Which?

If you have two or more friends posting the same link to an article, Facebook will intelligently group these two articles into one single feed. Well, Facebook uses some sort of clustering algorithm and Jaccard is one of them.

What is Jaccard Coefficient or Jaccard Similarity?

"The Jaccard index, also known as the Jaccard similarity coefficient (originally coined coefficient de communauté by Paul Jaccard), is a statistic used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets: J(A,B) = |A n B|/|A uB|." - Wikipedia

I whipped out a Python script to illustrate this point.

from __future__ import division

class Jaccard:

    def  __init__(self,args,args_ii):
        self.args = self.tokenize(args)
        self.args_ii = self.tokenize(args_ii)

    def intersect(self):
        return [i for i in self.args if i in self.args_ii]

    def union(self):
        return set(self.args + self.args_ii)

    def similarity(self):
        intersect = len(self.intersect())
        union = len(self.union())
        if intersect == None and union == None:
            return 1
        return intersect / union

    def distance(self):
        return 1 - self.similarity()

    def tokenize(self,item):
        item = item.lower()
        return item.split(" ")

jaccard = Jaccard("I am Sam","I am Sam, Sa is awesome. Sam is a genius")  
similarity = jaccard.similarity()  
print similarity

From the class above, I decided to break it down into tiny bits - functions/methods. We are using two sentences here for our test.

1. We tokenize each sentence. We basically made them into sets.

2. We calculate the intesection of these sentences.

3. We calculate the union of the sentences.

4. We calculate the similarity which is basically the division of the intersection of both sets by its union.

Simple right?

Jaccard only needs two variables, the union and intersection of both sets. Similarity ranges between 1 and 0. Anything outside that is probably wrong. 😊

Why should you care?

  • E-commerce companies uses Jaccard Similarity to recommend items and deals to customer based on the purchases made by other customers. (Disclaimer - I work for an eCommerce company)

  • This algorithm can be used to detect plagiarism.

  • Similar comments in an online forum can be grouped together.
  • Search engines use this same algorithm to group and sometimes discard similar search results.