PageRank sculpting

by gowtham 2010-02-05 19:24:29

People think about PageRank in lots of different ways. People have compared PageRank to a “random surfer” model in which PageRank is the probability that a random surfer clicking on links lands on a page. Other people think of the web as a link matrix in which the value at position (i,j) indicates the presence of links from page i to page j. In that case, PageRank corresponds to the principal eigenvector of that normalized link matrix.
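To make the link-matrix view a bit more concrete, here is a minimal sketch in Python. It assumes a made-up three-page web, normalizes each row of the link matrix by the page's outdegree, and finds the principal eigenvector by power iteration (repeatedly pushing rank along the links). The function names and the tiny graph are illustrative assumptions, not anything Google actually runs.

```python
def normalize_rows(adjacency):
    """Divide each row by its outdegree so that every row sums to 1."""
    normalized = []
    for row in adjacency:
        outdegree = sum(row)
        normalized.append([value / outdegree for value in row])
    return normalized


def principal_eigenvector(matrix, iterations=100):
    """Power iteration: start uniform, then repeatedly apply rank <- rank * matrix."""
    n = len(matrix)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [sum(rank[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return rank


# Hypothetical web: adjacency[i][j] = 1 means page i links to page j.
adjacency = [
    [0, 1, 1],  # page 0 links to pages 1 and 2
    [0, 0, 1],  # page 1 links to page 2
    [1, 0, 0],  # page 2 links to page 0
]

rank = principal_eigenvector(normalize_rows(adjacency))
print([round(r, 3) for r in rank])  # [0.4, 0.2, 0.4]
```

The same numbers are what the random surfer model predicts: in the long run the surfer spends 40% of the time on page 0, 20% on page 1, and 40% on page 2.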

Disclaimer: Even when I joined the company in 2000, Google was doing more sophisticated link computation than you would observe from the classic PageRank papers. If you believe that Google stopped innovating in link analysis, that’s a flawed assumption. Although we still refer to it as PageRank, Google’s ability to compute reputation based on links has advanced considerably over the years. I’ll do the rest of my blog post in the framework of “classic PageRank” but bear in mind that it’s not a perfect analogy.

Probably the most popular way to envision PageRank is as a flow that happens between documents across outlinks. In a recent talk at WordCamp I showed an image from one of the original PageRank papers:

Flow of PageRank

In the image above, the lower-left document has “nine points of PageRank” and three outgoing links. The resulting PageRank flow along each outgoing link is consequently nine divided by three = three points of PageRank.
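As a quick sketch of that arithmetic, assuming the simple flow model with no decay, the per-link flow is just the page's PageRank divided by its number of outgoing links. The function name is a hypothetical one chosen for illustration.

```python
def flow_per_link(pagerank, outlink_count):
    """PageRank carried by each outgoing link in the simple flow model."""
    if outlink_count == 0:
        return 0.0  # assumption for this sketch: a page with no outlinks passes nothing on
    return pagerank / outlink_count


# The lower-left document from the image: nine points, three outgoing links.
print(flow_per_link(9, 3))  # 3.0
```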

That simplistic model doesn’t work perfectly, however. Imagine if there were a loop:

A closed loop of PageRank flow

No PageRank would ever escape from the loop, and as incoming PageRank continued to flow into the loop, eventually the PageRank in that loop would reach infinity. Infinite PageRank isn’t that helpful, so Larry and Sergey introduced a decay factor: you could think of it as 10-15% of the PageRank on any given page disappearing before the PageRank flows along the outlinks. In the random surfer model, that decay factor is as if the random surfer got bored and decided to head for a completely different page. You can do some neat things with that reset vector, such as personalization, but that’s outside the scope of our discussion.
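Here is a toy simulation of that closed loop, under some loudly stated assumptions: two pages that link only to each other, one point of PageRank arriving from outside at every step, and a decay of 15%. It is a sketch of the intuition, not the real PageRank computation. With no decay the total in the loop grows without bound; with decay it settles at the inflow divided by the decay.

```python
def simulate_loop(decay, steps, inflow=1.0):
    """Total PageRank held by a two-page loop A <-> B after `steps` rounds.

    `decay` is the fraction that disappears before PageRank flows along a link;
    `inflow` is the PageRank arriving at page A from outside the loop each round.
    """
    a, b = 0.0, 0.0
    for _ in range(steps):
        # Each page hands what survives the decay to the other page;
        # page A also receives the external inflow.
        a, b = inflow + (1.0 - decay) * b, (1.0 - decay) * a
    return a + b


print(simulate_loop(decay=0.0, steps=1000))             # 1000.0, and still growing
print(round(simulate_loop(decay=0.15, steps=1000), 3))  # 6.667, i.e. 1 / 0.15
```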

Now let’s talk about the rel=nofollow attribute. Nofollow is a method (introduced in 2005 and supported by multiple search engines) to annotate a link to tell search engines “I can’t or don’t want to vouch for this link.” In Google, nofollow links don’t pass PageRank and don’t pass anchortext [*].

So what happens when you have a page with “ten PageRank points” and ten outgoing links, and five of those links are nofollowed? Let’s leave aside the decay factor to focus on the core part of the question. Originally, the five links without nofollow would have flowed two points of PageRank each (in essence, the nofollowed links didn’t count toward the denominator when dividing PageRank by the outdegree of the page). More than a year ago, Google changed how the PageRank flows so that the five links without nofollow would flow one point of PageRank each.
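The two behaviors differ only in the denominator, which a short sketch can show. This again ignores the decay factor, and the function and parameter names are hypothetical ones picked for illustration.

```python
def flow_per_followed_link(pagerank, total_links, nofollowed_links,
                           nofollow_counts_in_denominator):
    """PageRank flowing along each link that does not carry nofollow."""
    followed_links = total_links - nofollowed_links
    if followed_links == 0:
        return 0.0  # nothing flows if every link is nofollowed
    if nofollow_counts_in_denominator:
        denominator = total_links      # newer behavior: divide by all outgoing links
    else:
        denominator = followed_links   # original behavior: nofollowed links are skipped
    return pagerank / denominator


# Ten points of PageRank, ten outgoing links, five of them nofollowed.
print(flow_per_followed_link(10, 10, 5, nofollow_counts_in_denominator=False))  # 2.0
print(flow_per_followed_link(10, 10, 5, nofollow_counts_in_denominator=True))   # 1.0
```

Under the newer behavior, the five followed links pass on five points in total; the share assigned to the nofollowed links is not redistributed to them.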
