Cosine Similarity in MapReduce (Hadoop)

Look, here is the thing. I scoured the Internet looking for a solution to this problem, and every blog/article/comment that I read assumes I already know how to do it. Computing pairwise document similarity in MapReduce is actually the subject of a research paper by the famous Jimmy Lin and his band of programmers. Even the paper assumes I already know how to compute cosine similarity in MapReduce.

Take the dot product of the document vectors and divide by the product of their magnitudes (the L2 norms).
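To make that one-liner concrete, here is a minimal plain-Python sketch of the formula (no Hadoop yet), run on the two-word vectors for doc1 and doc3 that appear later in this post:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length term vectors:
    dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# doc1 = (6, 9), doc3 = (4, 2) for the words "cat" and "Hot"
print(cosine_similarity([6, 9], [4, 2]))
```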

Oh? You want to calculate similarity between documents in Hadoop? Very simple, step one: calculate cosine similarity. GODDAMN, I DON'T KNOW how to do that! Mind explaining? No? Fine, that's why I have a professor to talk to.

Sure enough, after a chat over lunch, I finally understood how to compute cosine similarity MapReduce-style.

And I am putting up this post (even though this is not a coding blog) so that some unfortunate soul will not be as lost as I was. So hey, you're welcome.

Here is how you do it and I promise you it’s the easiest thing in the world.

Things you will need: an inverted index (<word  doc1  doc2  doc5> ...)

Now, the inverted index is the key component of our MapReduce program. There will be no code here because, you see, the challenge that I was facing was getting the vectors of the two documents (that I was comparing) into the reducer.

Suppose this is what you feed into the mapper:

 Word     occurrence@DocID.......
 cat      6@Doc1    3@Doc2     4@Doc3... etc
 Hot      9@Doc1    2@Doc3    10@Doc5

and so on.

Let's say you are looking at the pairwise similarity between <Doc1, Doc3>. How do you get their respective vectors (in this case, just 2 words) into the reducer? It's very, very simple.

This is what your Mapper should emit:

    Key:<Document_one, Document_two> Value:<occurrence,occurrence>

So for the two inverted-index lines above, the emission for <doc1, doc3> for the first word/line "cat" would be:

 Key:<doc1, doc3> Value:<6, 4>

This way, you are getting the vectors associated with each word, and in the reducer you will have a list of these values, i.e. the vectors for both documents, on which you can compute cosine similarity. Of course you can replace "occurrence" with TF or TF-IDF weights or anything else. The key part here is getting the KEY-VALUE pair right.
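The mapper logic above can be sketched in a few lines of Python (outside Hadoop, just to show the emissions). The input format `word count@DocID count@DocID ...` is the one assumed throughout this post:

```python
from itertools import combinations

def map_inverted_index_line(line):
    """Sketch of the mapper: one inverted-index line in, one
    (doc-pair, count-pair) record out for every pair of documents
    that share the word."""
    parts = line.split()
    postings = parts[1:]  # parts[0] is the word itself
    # parse "6@Doc1" -> ("Doc1", 6)
    docs = []
    for p in postings:
        count, doc_id = p.split("@")
        docs.append((doc_id, int(count)))
    # emit Key:<docA, docB> Value:<countA, countB> for each pair
    for (d1, c1), (d2, c2) in combinations(docs, 2):
        yield (d1, d2), (c1, c2)

# The "cat" line from the post:
for key, value in map_inverted_index_line("cat 6@Doc1 3@Doc2 4@Doc3"):
    print(key, value)
```

MapReduce's shuffle phase then groups all values sharing the same <docA, docB> key, which is exactly what delivers both vectors to one reducer call.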

In the reducer, you will have something like this for all <doc1,doc3>:

Key:<doc1, doc3>
Value: [<6,4>, <9,2>]

Thus, the numerator of the cosine similarity is just the dot product of doc1 (6,9) and doc3 (4,2), which is calculated like so ->

(6*4)+(9*2).....etc for larger vectors

And the denominator is calculated like so ->

 sqrt(6^2+9^2) * sqrt(4^2+2^2) .... etc. for larger vectors

And you can do all of this in the reducer because you have the vectors of both documents. Of course, after your value is computed, you can output it from the reducer any way you like.
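Putting the reducer side together, here is a minimal sketch in the same spirit. Note that, following the method as described in this post, the norms here cover only the words the two documents share (see the comment below for the correction using full document norms):

```python
import math

def reduce_doc_pair(key, values):
    """Sketch of the reducer: key is a (docA, docB) pair, values is
    the list of (countA, countB) pairs collected for every word the
    two documents share."""
    dot = sum(a * b for a, b in values)
    norm_a = math.sqrt(sum(a * a for a, _ in values))
    norm_b = math.sqrt(sum(b * b for _, b in values))
    return key, dot / (norm_a * norm_b)

# For <doc1, doc3> with the two shared words "cat" and "Hot":
pair, sim = reduce_doc_pair(("doc1", "doc3"), [(6, 4), (9, 2)])
print(pair, sim)
```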

Hope this helps and please do correct my mistakes if there are any in my method.


2 thoughts on "Cosine Similarity in MapReduce (Hadoop)"

  1. This reducer is not correct as written. As presented above, the denominator only considers words that are common to both documents. Instead, the denominator should be the product of the L2 norms of the complete term vectors for each document. This means it must include all the words that are present in each document, not just the shared terms. Note, I would also suggest removing stop words to avoid inflating the similarities between documents.
