
The PageRank ™ Algorithm
Topics: Pagerank Easy, pagerank hard...Topic sensitive
Pagerank
U.s Patent file # 6,285,999 ;
September 4, 2001
"Method for node ranking in a linked database"
Abstract : A method assigns importance ranks to nodes in a linked
database, such as any database of documents containing citations, the
world wide web or any other hypermedia database. The rank assigned to a
document is calculated from the ranks of documents citing it. In
addition, the rank of a document is calculated from a constant
representing the probability that a browser through the database will
randomly jump to the document. The method is particularly useful in
enhancing the performance of search engine results for hypermedia
databases, such as the world wide web, whose documents have a large
variation in quality.
Inventors:
Page; Lawrence (Stanford, CA)
Assignee:
The Board of Trustees of the Leland Stanford Junior
University (Stanford, CA)
Appl. No.: 004827 ; Filed: January 9, 1998
Read original Patent - click here
PAGERANK : THE EASY WAY TO
UNDERSTAND IT
Google is the brightest piece of modern intelligent crawlers. It does
more than crawling and indexing web pages. Google analyzes the
structure of the links between pages. It uses a technology of its own
called "Pagerank(tm) ; Pagerank(tm) uses the link structure of the web
as an organized road-map and organization device. The more pages that
link to a particular site, the higher that sites ranking (importance).
Pagerank(tm) also looks at the site; the ranking of the page is
proportional to the number of high ranked pages it links to. Google
therefore does not assign a ranking to a page simply on the number of
times that a keyword might appear in the text (or the code) of a
particular page, but also on the particular characteristic of the page
itself (the "on-page variables"). Google results are supposed to pages
that match all the terms of a user's query, either in the contents of
the page or in the quality inbound links to that page.
See a graphical sample below

As a result of this, you
MUST have at least one site or page linking to your website to remain
into the index.
The Topic Sensitive Pagerank™
check out original paper from Stanford
University here
(Pdf document, require Adobe Acrobat Reader)
Theorized by Taher H. Haveliwala from
Stanford University, it's basically a technology that allows grouping
many specific words and phrases into a certain "topic". Applied to
Search engines algorithms this COULD bring some more detailed and
relevant results.
PAGERANK™: THE HARD WAY.
The original PageRank algorithm was described by
Lawrence Page and Sergey Brin (well-known creators of Google) in
several publications.

It is given by
PR(A) = (1-d) + d
(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where
· PR(A)
is the PageRank of page A,
· PR(Ti) is the PageRank of pages Ti
which link to pageA,
· C(Ti) is the number of outbound links
on page Ti and
· d is a damping factor which can be
set between 0and1.
So, first of all, we see that PageRank does not rank
web sites as a whole, but is determined for each page individually.
Further, the PageRank of page A is recursively defined by the PageRanks
of those pages which link to page A.
The PageRank of pages Ti
which link to page A does not influence
the PageRank of page A
uniformly. Within the PageRank algorithm, the PageRank of a page T is always weighted by the number
of outbound links C(T) on
page T. This means that the more outbound
links a page T has, the less
will page A benefit from a link to it on
page T. The weighted PageRank
of pages Ti is then added up.
The outcome of this is that an additional inbound link for page A will always increase page A's PageRank.
Finally, the sum of the weighted PageRanks of all pages Ti is multiplied with a damping factor d which can be set between 0 and 1.
Thereby, the extend of PageRank benefit for a page by another page
linking to it is reduced.
The Random Surfer Model
In their publications, Lawrence Page and Sergey Brin give a very simple
intuitive justification for the PageRank algorithm. They consider
PageRank as a model of user behaviour, where a surfer clicks on links
at random with no regard towards content. The random surfer visits a
web page with a certain probability which derives from the page's
PageRank. The probability that the random surfer clicks on one link is
solely given by the number of links on that page. This is why one
page's PageRank is not completely passed on to a page it links to, but
is devided by the number of links on the page.
So, the probability for the random surfer reaching one
page is the sum of probabilities for the random surfer following links
to this page. Now, this probability is reduced by the damping factor d. The justification within the Random Surfer
Model, therefore, is that the surfer does not click on an infinite
number of links, but gets bored sometimes and jumps to another page at
random.
The probability for the random surfer not stopping to
click on links is given by the damping factor d, which is, depending on
the degree of probability therefore, set between 0 and 1. The higher d
is, the more likely will the random surfer keep clicking links. Since
the surfer jumps to another page at random after he stopped clicking
links, the probability therefore is implemented as a constant (1-d) into the algorithm.
Regardless of inbound links, the probability for the random surfer
jumping to a page is always (1-d),
so a page has always a minimum PageRank.

A Different Notation of the PageRank Algorithm
Lawrence Page and Sergey Brin have published two different versions of
their PageRank algorithm in different papers. In the second version of
the algorithm, the PageRank of page A is given as
PR(A) = (1-d) / N + d
(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where N is
the total number of all pages on the web. The second version of the
algorithm, indeed, does not differ fundamentally from the first one.
Regarding the Random Surfer Model, the second version's PageRank of a
page is the actual probability for a surfer reaching that page after
clicking on many links. The PageRanks then form a probability
distribution over web pages, so the sum of all pages' PageRanks will be
one. Contrary, in the first version of the algorithm the probability
for the random surfer reaching a page is weighted by the total number
of web pages. So, in this version PageRank is an expected value for the
random surfer visiting a page, when he restarts this procedure as often
as the web has pages. If the web had 100 pages and a page had a
PageRank value of 2, the random surfer would reach that page in an
average twice if he restarts 100 times.
As mentioned above, the two versions of the algorithm
do not differ fundamentally from each other. A PageRank which has been
calculated by using the second version of the algorithm has to be
multiplied by the total number of web pages to get the according
PageRank that would have been caculated by using the first version.
Even Page and Brin mixed up the two algorithm versions in their most
popular paper "The Anatomy of
a Large-Scale Hypertextual Web
Search Engine", where they
claim the first version of the algorithm to form a probability
distribution over web pages with the sum of all pages' PageRanks being
one.
In the following, we will use the first version of the
algorithm. The reason is that PageRank calculations by means of this
algorithm are easier to compute, because we can disregard the total
number of web pages.
The Characteristics of PageRank
The characteristics of PageRank shall be illustrated by a small
example.
We regard a small web consisting of three pages A, B
and C, whereby page A links to the pages B and C, page B links to page
C and page C links to page A. According to Page and Brin, the damping
factor d is usually set to 0.85, but to keep the calculation simple we
set it to 0.5. The exact value of the damping factor d admittedly has
effects on PageRank, but it does not influence the fundamental
principles of PageRank. So, we get the following equations for the
PageRank calculation:
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
These equations can easily be solved. We get the
following PageRank values for the single pages:
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
It is obvious that the sum of all pages' PageRanks is 3
and thus equals the total number of web pages. As shown above this is
not a specific result for our simple example.
For our simple three-page example it is easy to solve
the according equation system to determine PageRank values. In
practice, the web consists of billions of documents and it is not
possible to find a solution by inspection.
The Iterative Computation of PageRank
Because of the size of the actual web, the Google
search engine uses an approximative, iterative computation of PageRank
values. This means that each page is assigned an initial starting value
and the PageRanks of all pages are then calculated in several
computation circles based on the equations determined by the PageRank
algorithm. The iterative calculation shall again be illustrated by our
three-page example, whereby each page is assigned a starting PageRank
value of 1.
Iteration
PR(A)
PR(B)
PR(C)
0
1
1
1
1
1
0.75
1.125
2
1.0625
0.765625
1.1484375
3
1.07421875
0.76855469
1.15283203
4
1.07641602
0.76910400
1.15365601
5
1.07682800
0.76920700
1.15381050
6
1.07690525
0.76922631
1.15383947
7
1.07691973
0.76922993
1.15384490
8
1.07692245
0.76923061
1.15384592
9
1.07692296
0.76923074
1.15384611
10
1.07692305
0.76923076
1.15384615
11
1.07692307
0.76923077
1.15384615
12
1.07692308
0.76923077
1.15384615
We see that we get a good approximation of the real
PageRank values after only a few iterations. According to publications
of Lawrence Page and Sergey Brin, about 100 iterations are necessary to
get a good approximation of the PageRank values of the whole web. Also,
by means of the iterative calculation, the sum of all pages' PageRanks
still converges to the total number of web pages. So the average
PageRank of a web page is 1. The
minimum PageRank of a page is given by (1-d).
Therefore, there is a maximum PageRank for a page which is given by dN+(1-d),
where N is total number of web pages. This maximum can theoretically
occur, if all web pages solely link to one page, and this page also
solely links to itself.
|
|