Introduction
When you write an email to some friends in Gmail, more contacts are suggested. I've been wondering for some time how that worked, because I want something similar on epinardscaramel, for the tags to be attached to posts.
I was thinking of doing something along the lines of :
//create groups of tags. Give them each a score.
//increase their score each time the group is used.
//when a user picks tags A and B, find the groups which contains A and B
//suggest tags, different from A and B, that belong to the group with the best score.
But by looking around, someone pointed me in the direction of research.google.com, where one can find a paper detailing how the gmail team has implemented this algorithm : Suggesting Friends Using the Implicit Social Graph.
The algorithm
The paper start by suggesting the use of an hypergraph, where each node is an email address. Each time a mail, sent or received, is considered, google creates an "hyperedge" of the graph, which is simply put a set of email addresses. So far so good.
Things get complicated when we get to the formula for the score, IR, of each group :
Isn't math grand ? Don't leave yet, I will explain, this formula is rather smart. The first thing we notice is that we have to do the sum of two identical parts, one concerning the incoming emails Iin, and the other the outgoing emails Iout. That part of the formula is :
This part of the computation is concerned with the messages' dates, where tnow is today's date and t(i) the email's date. The difference is divided by λ, a constant that denotes the "half-life" of the score. Indeed, if the mail has just been created, "tnow - t(i)" will be zero, same as zero divided by λ, and the message's score will be (½) to the power of 0, which is 1.
Once λ seconds have passed between the message creation and now, "tnow - t(i)" will be equal to λ, λ divided by itself is 1, and ½ to the power of one is ½. When 2λ seconds have passed, the score will then be (½)², or ¼. Therefore, we see that the score is divided by two each time that the duration λ passes.
The formula computes the score of each mail, then sums them (symbol Σ). Finally, to show that outgoing mails are more important than incoming mails, we multiply the sum of the outgoing mails' scores by ωout, another constant.
Taking into account the time that has passed since the message's creation is really smart; Gmail rightly considers that a recent group of friends is more relevant that an older one.
Suggesting friends
Now that we have a score for each group of emails, we are capable of suggesting addresses that would go nicely with the ones already inputed.
In order to do so, it's a little more complex than finding the group with the best score and suggesting all the members of this group : We need to compute a score for each email, considering all the groups and their scores.
The paper suggests 5 different ways of calculating each address' score, I'll let you read it if you want to go further.
What about épinards & caramel ?
For this site, I want to implement the algorithm for groups of tags : the formula for IR will be simpler in my case, since I don't have the difference between outgoing and incoming. My hypergraph will have each tag as a node, and the groups' scores will take into account the publication date of each blog post.