Simplified PageRank Algorithm
QQ1703105484
In late 90’s as the number of webpages on the internet was growing exponentially different search engines were trying different approaches to rank the webpages. At Stanford, two computer science Ph.D. students, Sergey Brin and Larry Page were working on the following questions: How can we trust information? Why are some web pages more important than others? Their research led to the formation of the Google search engine.
In this project, you are required to implement a simplified version of the original PageRank algorithm on which Google was built by representing the web as a graph and implementing this graph using an Adjacency List or an equivalent data structure. The PageRank algorithm is an algorithm that is used to order or rank different web pages on the internet.
The entire internet consists of different webpages that can be represented as a graph. Each node represents a webpage and each edge represents a link between two webpages. This graph can be implemented as an Adjacency Matrix or an Adjacency List. In this assignment, you are supposed to implement the Adjacency List representation. If you use an Adjacency Matrix representation, you will be penalized by 50% of your implementation points and will also fail the large test cases.
2 |
3 |
1 |
4 |
5 |
|
k |
kr/out_degree(k) |
ir/3 |
ir/3 |
i |
jr/out_degree(j) |
ir/3 |
j |
Rank(i) = Rank(j)/out_degree(j) + Rank(k)/out_degree(k)
2 |
3 |
1 |
4 |
5 |
|
In this assignment, you need to compute the rank of the webpages using a Simplified PageRank Algorithm explained in the example below. You are supposed to implement an Adjacency List data structure to represent the graph.
Line 1 contains the number of lines (n) that will follow and the number of power iterations (p) you need to perform. Each line from 2 to n+1 will contain two URL’s – from_page to_page separated by a single space. This means that the from_page points to the URL to_page.
Print the PageRank of all pages after p powerIterations in ascending alphabetical order of webpage. Also, round off the rank of the page to two decimal places.
Input 7 2 google.com gmail.com google.com maps.com facebook.com ufl.edu ufl.edu google.com ufl.edu gmail.com maps.com facebook.com gmail.com maps.com | Output facebook.com 0.20 gmail.com 0.20 google.com 0.10 maps.com 0.30 ufl.edu 0.20 |
Note: Here, p = 2, n = 7, |V| = 5
(1-outdegrees) |
(1) |
2 |
3 |
(2) |
1 |
(1) |
4 |
5 |
|
(2)
Step 1: Mapping for Simplicity (Optional but you will need a mechanism to store unique vertices) Use a map/associative array to map the URL’s with a unique id
In PageRank, the equation to calculate the rank for your graph is given as follows:
M is the matrix with values given by the following:
Mij = 1/dj if there is an edge from Vj to Vi (dj is the outdegree of node j)
|
This means that a rank of the webpage at time t+1 is equal to the rank of that page at time t multiplied by matrix, M. To achieve this, we create our matrix M based on input. Next, we initialize r(t) which is a matrix of size |V|x1 and consists of the ranks of every webpage. We initialize r(t) to 1/|V|. Next, we compute power_iterations based on our input, p. There is a mathematical proof that the matrix r converges, i.e. r(t+1)
= r(t) at which point the algorithm stops. However, this is difficult to test and we therefore will be testing your algorithm on a fixed power iteration value, p.
Example r and M matrices for our input:
r(0) M
|
1 | 1/5 |
2 | 1/5 |
3 | 1/5 |
4 | 1/5 |
5 | 1/5 |
|
|
|
r(1) = M.r(0) = =
x
M x r(0) = r(1)
Note: In our input case, the number of power_iterations, p is 2. Therefore we print r(1) of the urls sorting in alphabetical order. If p was 1 then return the initializing rank matrix or r(0). If p>2, the process repeats where you multiply the matrix, M with the new rank matrix r(t+1) at the next iteration.
|
|
|
r(t+1) = r(2) = M.r(1) = =
M x r(1) = r(2)
You are allowed to use your own template but make sure your code passes the sample test cases. An example template to think about the problem is:
class AdjacencyList {
private:
public:
};
//Think about what member variables you need to initialize
//Think about what helper functions you will need in the algorithm
void AdjacencyList::PageRank(int n){ } // prints the PageRank of all pages after p powerIterations in
ascending alphabetical order of webpages and rounding rank to two decimal places
// This class and method are optional. To accept the input, you can use this method:
int main()
{
int no_of_lines, power_iterations; std::string from, to;
std::cin >> no_of_lines; std::cin >> power_iterations; for(int i=0; i< no_of_lines; i++)
{
std::cin >> from; std::cin >> to;
// Do Something
}
//Create a graph object Created_Graph.PageRank(power_iterations);
}
.h file that has all your implementation. Remember this code should be the same that you tested on Stepik.
Based on the number of repeated questions we got in Project 1 on Slack, the course staff will now maintain an active FAQ Google document to answer your questions. Post your questions in Slack, but this time we will answer in this document and send you the link. The link to the document is: https://docs.google.com/document/d/1a9hR1Ep2IYK-MnsXwl2VxyotO1Yd-XCC8NWUkdp24JM/