Look, here is the thing. I scoured the Internet looking for a solution to this problem, and every blog/article/comment I read assumes I already know how to do it. Pairwise document similarity in MapReduce is actually the subject of a research paper by the famous Jimmy Lin and his band of programmers. Even the paper assumes I already know how to compute cosine similarity in MapReduce.
Oh? You want to calculate similarity between documents in Hadoop? Very simple, step one: calculate cosine similarity. GODDAM, I DON'T KNOW how to do that! Mind explaining? No? Fine, that's why I have a professor to talk to.
Sure enough after a chat over lunch, I finally understand how to compute cosine similarity MapReduce style.
And I am putting up this post (even though this is not a coding blog) so that some unfortunate soul will not be as lost as I was. So hey, you're welcome.
Here is how you do it and I promise you it’s the easiest thing in the world.
Things you will need: An inverted index ( <word doc1 doc2 doc5>….)
Now, the inverted index is a key component of our MapReduce program. There will be no code here because, you see, the challenge I was facing was getting the vectors of the two documents (that I was comparing) into the reducer.
Suppose this is what you feed into the mapper:
word   occurrence@DocID ...
cat    6@Doc1  3@Doc2   4@Doc3
hot    9@Doc1  2@Doc3  10@Doc5
and so on.
Let's say you are looking at the pairwise similarity between <Doc1, Doc3>. How do you get their respective vectors (in this case, just 2 words each) into the reducer? It's very, very simple.
This is what your Mapper should emit:
Key:<Document_one, Document_two> Value:<occurrence,occurrence>
So for our two documents above, the emission for <doc1, doc3> on the first word/line, "cat", would be:
Key:<doc1, doc3> Value:<6, 4>
This way, you are getting the vectors associated with each word and in the reducer you will have a list of these values, i.e. the vectors for both documents on which you can compute cosine similarity in the reducer. Of course you can replace “occurrence” with TF or TF-IDF vectors or anything else. The key part here is getting the KEY-VALUE right.
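If it helps to see this as code, here is a minimal sketch of the mapper logic in plain Python (the function name and streaming-style line format are my own assumptions; real Hadoop code would wrap this in a Mapper class or a streaming script):

```python
from itertools import combinations

def mapper(line):
    """For one inverted-index line like "cat 6@Doc1 3@Doc2 4@Doc3",
    emit ((docA, docB), (countA, countB)) for every pair of documents
    that contain the word."""
    _word, *tokens = line.split()
    postings = []
    for token in tokens:
        count, doc = token.split("@")
        postings.append((doc, int(count)))
    # Sort so a pair key always comes out in the same order,
    # e.g. always (Doc1, Doc3) and never (Doc3, Doc1).
    postings.sort()
    for (doc_a, n_a), (doc_b, n_b) in combinations(postings, 2):
        yield (doc_a, doc_b), (n_a, n_b)
```

Feeding it the "cat" line from above yields, among others, the pair key (Doc1, Doc3) with value (6, 4), which is exactly what the shuffle will group for the reducer.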
In the reducer, you will have something like this for all <doc1,doc3>:
Key:<doc1, doc3> Value: [<6,4>, <9,2>]
Thus, the cosine similarity can be computed by just taking the dot product of doc1 (6,9) and doc3 (4,2), which gives you the numerator, calculated like so:
(6*4) + (9*2) ..... etc. for larger vectors
And the denominator is the product of the two vector magnitudes, calculated like so:
sqrt(6^2 + 9^2) * sqrt(4^2 + 2^2) .... etc. for larger vectors
And you can do all of this in the reducer because you have the vectors of both documents. Of course, after your value is computed, you can output it from the reducer any way you like.
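The reducer step can be sketched the same way (again, just illustrative Python; note that, as in the setup above, the norms here only cover the words the two documents share, since those are the only values the mapper emits for this pair key):

```python
import math

def cosine_reducer(values):
    # `values` is everything the shuffle collected for one <docA, docB>
    # key, e.g. [(6, 4), (9, 2)] for <Doc1, Doc3>.
    dot = sum(a * b for a, b in values)                # (6*4) + (9*2)
    norm_a = math.sqrt(sum(a * a for a, _ in values))  # sqrt(6^2 + 9^2)
    norm_b = math.sqrt(sum(b * b for _, b in values))  # sqrt(4^2 + 2^2)
    return dot / (norm_a * norm_b)
```

For our <Doc1, Doc3> example this comes out to 42 / (sqrt(117) * sqrt(20)), roughly 0.868.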
Hope this helps and please do correct my mistakes if there are any in my method.