Guide to Link Building Pt.2
25 October 2006
Overview of Link-Based Ranking Algorithms
In this section we’ll give an overview of link-based ranking algorithms in order to provide some perspective for our future link development efforts.
PageRank
Prior to Google’s launch in late 1998, search engines relied on page content to rank webpages. Since this content was under the direct control of webmasters, website owners were able to easily influence rankings by modifying the content appearing on their sites. Google changed this landscape with the launch of their ‘PageRank’ ranking algorithm.
PageRank is unique in that it focuses on the one factor that is almost entirely outside the control of webmasters: the number of times that outside sites link to a given webpage. In September 2001 a patent covering ‘PageRank’ was granted to Stanford University, with Google co-founder Larry Page named as inventor. The published patent document, #6,285,999 - Method for node ranking in a linked database, provides detailed insight into the original Google algorithm.
PR(A) = (1 - d)/N + d × (PR(B)/L(B) + PR(C)/L(C) + …)
PR(x) = PageRank of page x
L(x) = number of outbound links on page x
d = damping factor
N = number of documents in the index (set)
B, C, … = pages that link to page A
As we can see from the above equation, PageRank is calculated by taking the PageRank of each of the inlinking pages, dividing this PageRank by the number of outbound links on that page, adding the fractional PageRank values together, and adjusting the total with a damping factor designed to add a small cost to every PageRank transaction (preventing unlimited, lossless PR cycling between pages).
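The calculation described above can be sketched in a few lines of Python. This is only an illustration of the published equation, not Google's production code: the function name pagerank, the toy link graph, the damping value of 0.85, the fixed iteration count, and the handling of pages with no outbound links are all assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively apply PR(A) = (1 - d)/N + d * sum(PR(B)/L(B)).

    links maps each page to the list of pages it links out to.
    d=0.85 is a commonly quoted damping value (an assumption here).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start with an even split
    for _ in range(iterations):
        new_pr = {p: (1.0 - d) / n for p in pages}  # base value for every page
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Divide this page's PageRank between its outbound links.
                share = d * pr[page] / len(targets)
                for t in targets:
                    new_pr[t] += share
            else:
                # Assumption: a page with no outbound links spreads its
                # PageRank evenly over all pages so that none is lost.
                for p in pages:
                    new_pr[p] += d * pr[page] / n
        pr = new_pr
    return pr


# Hypothetical four-page web: C is linked from A, B, and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In this toy graph page C ends up with the highest score because it collects links from three pages, while page D, which has no inlinks at all, is left with only the base value (1 - d)/N.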
The PageRank algorithm revolutionized search and launched Google to the top of the search engine world. However, the algorithm as originally published contained a number of vulnerabilities that over time have eroded its usefulness. The original PageRank equation was essentially an iterative link tallying system that counted the number of links to a given page and granted additional priority to links that originated from pages that were themselves heavily linked.
PageRank does not evaluate the topic or authoritative status of inlinking pages, and this omission opened the door to a number of artificial link boosting strategies that influenced Google rankings from 1999-2001. During this time strategies such as free-for-all link pages, link exchanges, and the creation of artificial ‘link farms’ made up of hundreds of interlinked sites controlled by one webmaster threatened the validity of Google’s results and led to the development of next generation PageRank ‘filters’ and alternate link-based ranking algorithms. Today PageRank remains a central part of the Google ranking algorithm, but is now heavily modified in order to combat attempts at manipulation.
Hilltop
Hilltop was created in 1999 by Krishna Bharat and George A. Mihaila of the University of Toronto with the goal of building a link-based ranking method that was resistant to outside manipulation. As discussed in their paper Hilltop: A Search Engine based on Expert Documents, Hilltop differs from PageRank chiefly in that it’s a topic-specific link algorithm. At the core of Hilltop is an initial set of ‘expert documents’, defined as pages on a given topic that link out to a large number of non-affiliated pages on related topics. Non-affiliated status in this case is determined by analysis of IP C-class, and links between sites within the same C-class are discounted by Hilltop. Hilltop ranking results for a given webpage are based on both the number of inlinks from expert documents and the anchor text of those links.
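The C-class affiliation test described above can be illustrated with a short Python sketch. The function names, the min_targets threshold, and the sample IP addresses (taken from the reserved documentation ranges) are all hypothetical; this is a simplified reading of the idea, not the procedure from the Hilltop paper or from Google.

```python
def c_class(ip):
    """Return the first three octets of a dotted IPv4 address."""
    return ".".join(ip.split(".")[:3])


def unaffiliated_targets(page_ip, outlink_ips):
    """Count distinct C-class blocks a page links to, ignoring its own block."""
    own_block = c_class(page_ip)
    return len({c_class(ip) for ip in outlink_ips} - {own_block})


def is_expert(page_ip, outlink_ips, min_targets=5):
    """Treat a page as an expert candidate if it links to enough
    non-affiliated blocks. min_targets is an illustrative threshold."""
    return unaffiliated_targets(page_ip, outlink_ips) >= min_targets


# A hypothetical directory page hosted at 192.0.2.10.
outlinks = [
    "198.51.100.1",  # counts once for the 198.51.100 block
    "198.51.100.2",  # same block as above, so no extra credit
    "203.0.113.5",   # a second independent block
    "192.0.2.77",    # same block as the directory itself, discounted
]
count = unaffiliated_targets("192.0.2.10", outlinks)
print(count, is_expert("192.0.2.10", outlinks, min_targets=2))
```

Under this sketch, four outbound links only earn the page credit for two independent targets, which is why a network of sites hosted on one IP block gains little from linking to itself.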
Hilltop improves on the original PageRank algorithm in two important ways. First, it incorporates the authoritative status of the linking page. This sharply reduces the effectiveness of link farms, which incorporate large numbers of low authority spam sites. Second, it adds a valuable topic-specific aspect to the algorithm that rewards on-topic links and discounts untargeted links. This effectively reduces the artificial benefit previously granted by free-for-all link pages and link exchange programs. While the Hilltop algorithm can still be gamed, successful strategies typically require creation of artificial expert documents, a process which is prohibitively costly for most webmasters.
Google was granted a patent on the Hilltop concept in February 2003 (#6,526,440 - Ranking search results by reranking the results based on local inter-connectivity) and is believed to be incorporating the Hilltop system into its current ranking algorithm in some form. Krishna Bharat is currently employed by Google.
TrustRank
TrustRank was developed by Zoltán Gyöngyi and Hector Garcia-Molina of Stanford University and Jan Pedersen of Yahoo as a way to combat artificial linking schemes. In August 2006, Google was granted a patent on the concept with #7096214 - System and method for supporting editorial opinion in the ranking of search results. TrustRank is similar to Hilltop in that it’s based around an initial set of expert documents. However, instead of using an automated evaluation of authority as Hilltop does, TrustRank relies on human experts to define a small ‘seed set’ of expert documents. This seed set is iteratively crawled to obtain primary linked pages, secondary linked pages, etc. The initial seed set is given a set number of points, and by analyzing how these points filter down to the linked pages, TrustRank is able to calculate the importance of a given page.
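The way seed points filter down to linked pages can be sketched as a PageRank-style calculation in which only the seed pages receive the base value. This is a simplified illustration under stated assumptions: the function name trust_scores, the page names, the damping value of 0.85, and the choice to let trust leak away at pages with no outbound links are all made up for the example.

```python
def trust_scores(links, seeds, d=0.85, iterations=50):
    """Propagate trust from a hand-picked seed set along outbound links.

    links maps each page to the pages it links to; seeds is the set of
    human-reviewed pages. Each hop passes on only a fraction d of the trust.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts} | set(seeds))
    seed_share = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_share)  # all initial points sit on the seed set
    for _ in range(iterations):
        new_trust = {p: (1.0 - d) * seed_share[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            for t in targets:
                # Split the page's trust between the pages it links to.
                new_trust[t] += d * trust[page] / len(targets)
        trust = new_trust
    return trust


# Hypothetical graph: one reviewed seed, and a spam pair that only
# links to itself and is never linked from the trusted side.
web = {
    "seed": ["news", "blog"],
    "news": ["shop"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
trust = trust_scores(web, seeds={"seed"})
for page, score in sorted(trust.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Pages one step from the seed score higher than pages two steps away, and the isolated spam pair receives no trust at all, however heavily its members link to each other.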
TrustRank is interesting in that it can be used as both a positive and negative ranking factor. Good sites tend to link to good sites, and bad sites tend to link to bad sites. By defining positive and negative seed sets, TrustRank can identify both sites that are likely to be valuable and those that are likely to be spam. This ability to define negative seed sets (bad seeds) and penalize sites that are linked from them is likely the source of Google’s recommendation in its Webmaster Guidelines, which advises site owners to “avoid links to web spammers or bad neighborhoods on the web, as your own ranking may be affected adversely by those links.”
TrustRank raises the bar for link-based ranking algorithms since it requires webmasters to be very selective about where they obtain inlinks. Under TrustRank, massive untargeted link development campaigns may actually hurt a site’s rankings by increasing the odds of connecting to documents in a negative seed set. TrustRank also amplifies the importance of links from positive authoritative sources, making inlinks from reputable on-topic sites such as media, trade organizations, and educational institutions significantly more valuable.
TrustRank functions as an excellent adjunct to PageRank and is thought to be incorporated in some form into the current Google ranking algorithm.
LocalRank
LocalRank is an addition to PageRank that was developed by Krishna Bharat at Google and filed as a continuation of the ‘Hilltop’ patent on January 30, 2003 (#6725259 - Ranking search results by reranking the results based on local inter-connectivity). LocalRank is similar to TrustRank in that it relies on an initial set of topic-specific documents. LocalRank differs from TrustRank in that the selection of this initial set is fully automated and doesn’t require human intervention. LocalRank essentially takes a set of the top 1000 pages returned by a user’s search query and then reorders them according to how many times a given page is linked to from pages within this set.
LocalRank further raises the bar for link development by requiring highly ranked pages to be well-linked from pages that rank for a specific query, as opposed to just pages in an expert set or on a given topic. Webmasters should devote attention to obtaining links from sites that already rank well for their desired phrases in order to effectively target LocalRank.
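The reordering step can be sketched as follows. This is a deliberately reduced illustration: it reorders purely by the number of links received from other pages in the result set, whereas the description above says nothing about how the local count is blended with the original ranking, so no blending is attempted. The function name local_rerank and the sample pages are hypothetical.

```python
def local_rerank(results, links):
    """Reorder an initial result list by links received from within the list.

    results is the initial ranking (best first); links maps each page to
    the pages it links to. Links from pages outside the result set and
    self-links are ignored.
    """
    result_set = set(results)
    local_votes = {page: 0 for page in results}
    for source in results:
        for target in set(links.get(source, ())):
            if target in result_set and target != source:
                local_votes[target] += 1
    # sorted() is stable, so ties keep their original relative order.
    return sorted(results, key=lambda page: -local_votes[page])


# Hypothetical initial results for one query, best first.
initial = ["a", "b", "c", "d"]
link_graph = {
    "a": ["c"],
    "b": ["c", "d"],
    "d": ["c", "x"],
    "x": ["a"],  # x is not in the result set, so its link to a is ignored
}
reranked = local_rerank(initial, link_graph)
print(reranked)  # ['c', 'd', 'a', 'b']
```

Page c moves to the top because three other results link to it, while the link that a receives from x counts for nothing since x did not rank for the query.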
Historical Data
On December 23, 2003 Google filed patent #20050071741 - Information Retrieval Based on Historical Data, which, while not specifically related to link-based ranking, has significantly changed the way in which links are weighted in the Google ranking algorithms. The document outlines a number of novel strategies for using collected historical data on websites to improve ranking accuracy and reduce spam. The historical data covers several items of interest, including domain registration dates, document change frequency over time, inlink gain/loss over time, age of inlinks, overall link ‘freshness’, and link churn rate.
These criteria, when combined with PageRank/Hilltop/TrustRank/LocalRank, are extremely effective at eliminating most remaining link manipulation techniques, since the combined algorithm not only reviews links for topical connections but also analyzes the behavior of these links over time. Short-term link development campaigns are revealed in historical data as a brief spike in new indexed links and are clearly indicative of non-natural link development. The new algorithm also allows Google to gradually mature the influence of new links over 6-12 month intervals, significantly increasing the cost and time-to-benefit of paid text link ad campaigns.
As with any algorithm, there are still opportunities here to maximize results. However, the historical data algorithm more than any other has forced webmasters to consider the long-term implications when planning link development. Best results are now obtained by executing link development slowly and consistently over the life of the site.
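To see why a short campaign stands out in historical data, consider this minimal sketch, which flags months where new inlinks far exceed a site's typical monthly gain. The function name spike_months, the factor of 5, and the sample counts are all invented for illustration; the patent filing does not specify thresholds, and nothing here reflects how Google actually evaluates link history.

```python
from statistics import median


def spike_months(new_links_per_month, factor=5.0):
    """Return the indexes of months whose new inlink count exceeds
    `factor` times the median monthly gain (illustrative threshold)."""
    baseline = max(median(new_links_per_month), 1)  # avoid a zero baseline
    return [
        month
        for month, count in enumerate(new_links_per_month)
        if count > factor * baseline
    ]


# Hypothetical sites: one runs a burst campaign in month 3,
# the other gains links slowly and consistently.
campaign_site = [4, 6, 5, 120, 7, 5]
steady_site = [4, 6, 5, 7, 8, 9]

campaign_flags = spike_months(campaign_site)
steady_flags = spike_months(steady_site)
print(campaign_flags, steady_flags)  # [3] []
```

The burst of 120 links in a single month is flagged immediately, while the steady site, which gains a similar number of links per month throughout, raises no flag.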


