Look, here is the thing. I scoured the Internet looking for a solution to this problem, and every blog/article/comment that I read assumes I already know how to do it. Computing pairwise document similarity in MapReduce is actually a research paper published by the famous Jimmy Lin and his band of programmers. Even **the paper** assumes I already know how to compute **cosine similarity in MapReduce**.

Oh? You want to calculate similarity between documents in Hadoop? Very simple, step one —> calculate cosine similarity. GODDAM I DON’T **KNOW** how to do that! **Mind explaining?** No? Fine, that’s why I have a professor to talk to.

Sure enough, after a chat over lunch, I finally understood how to compute cosine similarity MapReduce style.

**And I am putting up this post (even though this is not a coding blog) so that some unfortunate soul will not be as lost as I was.** So hey, you’re welcome.

Here is how you do it and I promise you it’s the easiest thing in the world.

**Things you will need:** An inverted index ( <word doc1 doc2 doc5>….)

Now, the inverted index is the key component of our MapReduce program. There will be no code here because, you see, the challenge that I was facing **was getting the vectors of the two documents (that I was comparing) into the reducer.**
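In case the inverted index itself is unfamiliar, here is a toy Python sketch of the shape of that data — `build_inverted_index` is my own name for it, and this is a single-machine illustration (in practice, building the index is itself a classic MapReduce job, à la word count):

```python
def build_inverted_index(docs):
    """Toy inverted index: word -> {DocID: occurrence count}.
    `docs` maps DocID -> document text. Single-machine sketch only;
    in a real pipeline this would be its own MapReduce job."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, {})
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

# e.g. build_inverted_index({"Doc1": "cat cat hot", "Doc3": "cat"})
# gives {"cat": {"Doc1": 2, "Doc3": 1}, "hot": {"Doc1": 1}}
```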

Suppose this is what you feed into the mapper:

Word occurrence@DocID.......

cat 6@Doc1 3@Doc2 4@Doc3

Hot 9@Doc1 2@Doc3 10@Doc5

and so on.

Let’s say you are looking at the pairwise similarity between <Doc1, Doc3>. How do you get their respective vectors (in this case, just two words) into the reducer? It’s very, very simple.

This is what your Mapper should emit:

Key:<Document_one, Document_two> Value:<occurrence,occurrence>

So for our two documents above, the emission for <doc1, doc3> for the first word/line “cat” would be:

Key:<doc1, doc3> Value:<6, 4>

This way, you are getting **the vectors associated with each word**, and in the reducer you will have a list of these values, **i.e. the vectors for both documents, on which you can compute cosine similarity in the reducer.** Of course, you can replace “occurrence” with TF or TF-IDF weights or anything else. The key part here is getting the KEY-VALUE right.
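To make the mapper step concrete, here is a minimal sketch in plain Python (not the Hadoop API — `map_inverted_line` and the line format are just my reading of the example above):

```python
from itertools import combinations

def map_inverted_line(line):
    """Map one inverted-index line ("word count@DocID count@DocID ...")
    to the pairwise emissions described in the post:
    key = (docA, docB), value = (countA, countB)."""
    parts = line.split()
    postings = []
    for p in parts[1:]:  # skip the word itself
        count, doc = p.split("@")
        postings.append((doc, int(count)))
    # Emit one key-value pair per pair of documents sharing this word.
    # Sorting keeps the pair key in a consistent order across words,
    # so all values for <doc1, doc3> land in the same reducer.
    for (d1, c1), (d2, c2) in combinations(sorted(postings), 2):
        yield (d1, d2), (c1, c2)

# The "cat" line from the post:
emissions = list(map_inverted_line("cat 6@Doc1 3@Doc2 4@Doc3"))
# contains (("Doc1", "Doc3"), (6, 4)) among its three pairs
```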

In the reducer, you will have something like this for all <doc1,doc3>:

Key:<doc1, doc3> Value: [<6,4>, <9,2>]

Thus, the numerator of the cosine similarity can be computed by just getting the dot product of doc1 (6,9) and doc3 (4,2), which is calculated like so —>

(6*4)+(9*2).....etc for larger vectors

And the denominator is calculated like so —->

sqrt(6^2+9^2) * sqrt(4^2+2^2).... etc for larger vectors

And you can do all of this in the reducer because you have the vectors of both documents. Of course, after your value is computed, you can output it from the reducer any way you like.
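The reducer side can be sketched the same way (plain Python again; `reduce_pair` is my own name). It takes the reducer’s list of value pairs for one document-pair key and computes the dot product over the norms — with the caveat, raised in the comment below, that these norms only cover the shared words:

```python
from math import sqrt

def reduce_pair(values):
    """Cosine similarity for one (docA, docB) key.
    `values` is the reducer's list, e.g. [(6, 4), (9, 2)] for <doc1, doc3>.
    Caveat (see the comment at the end of the post): these norms span only
    the shared words; a true cosine similarity needs each document's full
    L2 norm, including words the other document doesn't have."""
    dot = sum(a * b for a, b in values)
    norm_a = sqrt(sum(a * a for a, _ in values))
    norm_b = sqrt(sum(b * b for _, b in values))
    return dot / (norm_a * norm_b)

sim = reduce_pair([(6, 4), (9, 2)])  # the <doc1, doc3> example
```

For the two-word example above, this works out to 42 / (sqrt(117) * sqrt(20)) ≈ 0.868.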

Hope this helps and please do correct my mistakes if there are any in my method.

This reducer is not correct as written. As presented above, the denominator only considers words that are common to both documents. Instead, the denominator should be the product of the L2 norms of the complete term vectors for each document. This means it must include all the words present in each document, not just the shared terms. Note: I would also suggest removing stop words to avoid inflating the similarities between documents.

Thanks for pointing that out, Mark. I did eventually find that out (this was written a while ago), and I guess I never updated it.