On Google's PageRank


Anyone who runs a website knows that a successful site needs a high Google PageRank (PR). Unfortunately, Google updates its publicly available PR values only every few months. Moreover, the PR is given only in discrete increments, so it is hard to know whether a site has improved if the actual increase is less than one PR point. Given that I am curious about the PR of my site, I thought I should do something. And I did.

The PR is almost all about incoming links and their quality. Thus, an accurate page rank determination requires detailed knowledge of the page rank of each and every back-linked page. It would therefore be a formidable task to calculate the page rank of a high profile site such as cnn.com.

Statistically, however, if you have many incoming links, they tend to have the same rank distribution (many low quality links and fewer high quality links, and so forth). So in principle, we could try to estimate the page rank using just the number of inbound links to a site, $N_{in}$. Moreover, since the PageRank is logarithmic in the number of inbound links, we could for example approximate it as:
$$ PR \approx f(N_{in}) \approx a \log_{10}(N_{in}) $$
This approximation is already not too bad, but one can do somewhat better, as we'll see below.
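As a minimal sketch of this approximation, the snippet below evaluates $a \log_{10}(N_{in})$ for a given link count. The function name `estimate_pagerank` is hypothetical. The default slope of 1.12 is the value from the linear fit reported in this article; the zero intercept follows the formula above; rounding to an integer and clamping to the 0 to 10 range are my own assumptions about how the result would be turned into a discrete PR.

```python
import math


def estimate_pagerank(n_in, a=1.12, b=0.0):
    """Rough PR estimate from the number of inbound links.

    Implements PR ~ a * log10(n_in) + b.
    a = 1.12 is the fitted slope quoted in the article.
    b = 0.0 mirrors the formula in the text (no intercept is given).
    Rounding and clamping to 0..10 are assumptions, not part of the fit.
    """
    if n_in < 1:
        # log10 is undefined for zero links; treat as the lowest rank.
        return 0
    raw = a * math.log10(n_in) + b
    return max(0, min(10, round(raw)))


if __name__ == "__main__":
    for links in (10, 1000, 10**6):
        print(links, estimate_pagerank(links))
```

With these defaults, a site with 1000 inbound links comes out at a PR of 3, since 1.12 × 3 = 3.36 rounds down.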

Figure 1 - Actual PageRank vs. inbound site links, for 120 random sites. The horizontal scale is logarithmic. The nearly linear slope implies that the Google PR is logarithmic as well. The linear fit gives a logarithm base of about 8.
I took about 120 random websites (all those in my bookmarks file) and plotted the actual PR as a function of log10(Nin). One can see from this graph that the slope varies. Hence, the PR is not exactly linear in log10(Nin). This could arise from several things. For example, Google's algorithm may not be exactly logarithmic, or the assumption that the average quality of a site's links does not depend on the quality of the site itself may be an oversimplification. There may be other reasons as well.

The linear fit has a slope of 1.12 (which implies an increase of one PR unit for every factor of 7.8 increase in the inbound links). By comparing the fit to the actual PR data, one finds that the standard deviation in PR is 1.2 PR units, and also that for about 82% of the sites the predicted PR is either correct or within 1 PR unit of the actual value.
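The arithmetic behind these numbers can be sketched as follows. A slope of 1.12 PR units per decade of links means one PR unit corresponds to a factor of 10^(1/1.12) in links. The helper names below are hypothetical, the least-squares routine is a generic one and not necessarily the fitting procedure used for the figure, and "standard deviation" is taken here to mean the root-mean-square residual, which is an assumption about how the 1.2 figure was computed.

```python
import math


def links_factor_per_pr_unit(slope):
    """Factor by which inbound links must grow to gain one PR unit."""
    return 10 ** (1.0 / slope)


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept.

    xs would be log10(N_in) values and ys the actual PR values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_quality(predicted, actual):
    """Return (RMS residual, fraction of sites within 1 PR unit).

    The prediction is rounded before the within-one comparison,
    since published PR values are integers (an assumption here).
    """
    n = len(actual)
    rms = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
    within = sum(1 for p, a in zip(predicted, actual) if abs(round(p) - a) <= 1)
    return rms, within / n


if __name__ == "__main__":
    print(round(links_factor_per_pr_unit(1.12), 1))
```

Feeding `fit_line` the (log10(Nin), PR) pairs for a set of sites, and then `fit_quality` the resulting predictions, would give the same kind of summary numbers quoted above.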

Figure 2 - Histogram of the predicted PR minus the actual PR for 120 random sites. We see that in just over 50% of the cases, the predicted PR is the same as the real one, while in an extra 40%, it is within one PR unit of the real value.
To improve the fit, we can use a higher-order polynomial in log10(Nin) and also use other available data, such as the number of pages in a site, the back-links as seen by different search engines (which would be differently sensitive to pages of different quality) and also the number of links within a site. I will spare you the ugly looking fitting formula (it has 8 different terms).

The standard deviation obtained with the improved fit is only 0.85, and 91% of the sites have a predicted PR which is the correct one or +/-1 PR unit. It is probably impossible to obtain a notably better fit with the data I use. To improve the PR prediction, one would in principle require more data, such as the actual PR of the back-linked pages.

Here is a calculator to estimate the PageRank of any site you wish.

Interesting Prediction

We do feel our site is underpredicted by Google PR. It is

www.LookInTheAttic.com

Your algorithm gives us a PR of 5, whereas we have been stuck at a PR of 3 for nearly two years, although our traffic and sales have greatly increased.

Not sure why this is? It is somewhat strange!

Any insights?

JC

Quality! you need quality!

The current tool does not take into consideration the quality of the links. It just assumes that you have a typical distribution of links in terms of quality. In your case, however, you most likely have many low quality links and fewer high quality links than an average site with a PR of 5. In fact, your site is in the bottom 10% or so of sites in terms of link quality.
If you had the same number of links, but their quality distribution was average, your PR would have been 5 and not 3.

Quality Links

That makes sense - is there any way you could assist us in this endeavor? It seems like you know a lot about the Google PR system. Please contact me if you are interested.

Thanks,

John

Not Really, cannot help.

There are a few things you can do with your site (e.g., make sure all the pages link to each other well), but you cannot avoid the number one necessity... good quality links. With this I cannot help you much since I don't deal with commercial sites (my experience is with science related ones only, like mine).
Good luck with your endeavor.
Nir

Temporarily down due to bad links

Hi!
What's the meaning of "Temporarily down due to bad links"? How did your algorithm get the information that my website (EapTips) has bad links and what are these bad links?

Thanks!
Saverio

Thanks for sharing

Wow, wonderful content. Thanks for sharing. Learned a lot
