In simple terms, PageRank is an intuitive way of ranking web pages, which formed the basis for Google’s web indexing algorithm during its early phase. In this article, you’ll learn about the intuition behind page rank and implementing page rank in python. The article is divided into the following sections:

  • Basic Idea behind Page Rank
  • Understanding the Pank Rank algorithm
  • Implementing Page Rank from scratch

Basic Idea behind Page Rank

The intuition behind the Page-Rank is based on the idea that popularity of a webpage is determined not only by the number of incoming links but also by the kind of incomings links. Citations from highly ranked pages contribute more than lower ranked web pages, for example if your website in linked by forbes website it will affect your ranking more than compared to a random website.

Taking it further let’s take an example for calculation the PR of a web page A cited by web page B shown in Fig. 1:

Fig.1 Example of web graph.

\begin{equation} PR(A)=(1-d)* (1/N)+ d* P(B,A)*PR(B) \end{equation}

PR(A): PR of A
PR(B): PR of B
P(B,A): Probability going from B to A (here it is equal to one)
N: Total number of webpages(in our case 2).
d: is known as damping factor, to add some randomness to the equation.

Simultaneously PR of B is calculated. This process continues until it PR does not change beyond some value.

Page Rank Algorithm

Taking it further we and to have a better understanding of how page rank works, we consider a graph (shown by fig 2) of web pages having links shown by the arrow.
Note that, if there are web pages with no out link then they do not contribute to the page ranking (they are usually referred to as dangling pages).

Fig.2

Our aim in now to figure of the PR of individual web pages. Inorder to do so, we need to perform the following steps:

  • Find the probabilities of going from one web page to another (respresented using probability transition matrix)
  • Apply the page rank algorithm our the web page until it converges.

STEP 1

Initially, page rank of all the web pages is taken as 1. The weight of the edge is the probability of going from a web page X to Y ( the web page A has 2 out links, therefore, the probability to visit each web page is 1/2 ). After expressing the web graph in terms of probabilities the web graph looks something like:

Fig.3

Probability transition matrix is represented as:

STEP 2

The page rank of each web page is determined by applying the PageRank equation. This process is repeated until the algorithm converges i.e. the values of page rank do not change beyond a small value ( know as epsilon usually fixed as 1e-4 ). The damping factor (d) introduced is to add some randomness over the web graph i.e. d is a probability that a user will move to the linked web page and 1-d is the probability of choosing a random web page, it is usually taken as 0.85.

Iteration 1:


\(PR(C)=(.15)*(1/4)+ .85*(.5*1 + 1*1 + 1*1) = 2.16\)

The PR(C) can also be calculated by matrix dot product.

Similarly, extending this for all the web pages we end up with the equation:
( * represents matrix multiplication)

\begin{equation} PR=(1-d)(1/N)+ d* (C^T*PR) \end{equation}

Where matrix C represents the probability transition ( C[i][j] = probability of the user transitioning from page i to page j). The C matrix of our example can be expressed as the matrix represented above. Also, the initial page ranks are as assigned 1 for all the web pages. The PRs of web pages are calculated until the PRs converge to a certain value.

Implementing Page Rank

Page Rank implementation in python:

import numpy as np
def pagerank(C, eps=0.0001, d=0.85):
    P = np.ones(len(C)) 
    while True:
        P_ = np.ones(len(A))*(1/N)* (1 - d) + d * C.T.dot(P)
        delta = abs(P_ - P).sum()
        if delta <= eps:
            return P_
        P = P_
p=pagerank(C)
#result
#p=[1.16, 0.644, 1.19, 0.15]

Final ranking of our example are C > A > B > D !

Notice: page rank of A is high even when it has only one incoming link.

References