Google PageRank Algorithm By: Danny Lin
Table of Contents
Google Search
History / What is Page Rank?
Page Rank Algorithm
Inbound/Outbound Links
Dangling Nodes
Constraints
Calculating your page rank
How to maximize your page rank score
Loopholes
Neat stuff
Google Search
Google search using PageRank:
1) Crawl the web and locate all publicly accessible webpages
2) Index the data from step 1 to allow for efficient searches for keywords or phrases
3) Rate the importance of each page in the database using PageRank
4) Return results in descending order of importance with respect to the search query
Google’s Original Architectural Design Source: http://infolab.stanford.edu/~backrub/over.gif
History Page Rank was conceptualized by Sergey Brin and Lawrence Page; discussed in their 1998 paper: The anatomy of a large-scale hypertextual web search engine (http://infolab.stanford.edu/~backrub/google.html) Used to rank the importance of web pages Source: https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/PageRank-hi-res.png/1280px-PageRank-hi-res.png
Page Rank Algorithm
PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
T1 … Tn - The pages that link to page A.
PR(Tn) - The importance of page Tn.
C(Tn) - The number of outgoing links for page Tn.
PR(Tn)/C(Tn) - The calculated importance passed to page A from page Tn.
d - The damping factor, usually set to 0.85.
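As a rough illustration of the formula, the sketch below applies it repeatedly to a tiny three-page web. The link graph, the function name pagerank_step, the starting rank of 1.0 and the 50 passes are all assumptions made for the example; none of them come from the slides.

```python
def pagerank_step(links, ranks, d=0.85):
    """Apply PR(A) = (1-d) + d * sum(PR(T)/C(T)) once to every page."""
    new_ranks = {}
    for page in links:
        # T1..Tn: the pages whose outbound links include this page.
        inbound = [t for t, outs in links.items() if page in outs]
        passed = sum(ranks[t] / len(links[t]) for t in inbound)
        new_ranks[page] = (1 - d) + d * passed
    return new_ranks


# Hypothetical web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1.0 for page in links}  # assumed starting value
for _ in range(50):
    ranks = pagerank_step(links, ranks)

for page, score in sorted(ranks.items()):
    print(page, round(score, 4))
```

In this example C ends up with the highest score because it receives importance from both A and B, while B only receives half of A's.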
Inbound/Outbound Links With respect to page A: Inbound links – links that point towards page A Outbound links – links within page A pointing towards other pages
Dangling Nodes
A dangling node is a page that does not have any outbound links.
Issue: Dangling nodes act as sinks that drain importance from the web.
Solution: Assume that the dangling node has a link to every other page, so the next page is selected uniformly at random.
This creates a stochastic matrix: all entries are nonnegative and the sum of each column is equal to 1.
Source: http://www.webworkshop.net/images/pr1.gif
Constraints
The link matrix S:
Must be stochastic, i.e. all entries are nonnegative and the sum of each column is equal to 1.
Must be irreducible, i.e. no simultaneous permutation of rows and columns puts it into block upper-triangular form. Equivalently, the nodes must be strongly connected.
Must be primitive, i.e. for some n, S^n has all positive entries. The dominant eigenvalue is then λ1 = 1 and every other eigenvalue satisfies |λ2| < 1.
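The dangling-node fix can be sketched as follows. The helper name build_stochastic_matrix and the three-page graph are hypothetical, and the code assumes at least two pages and no duplicate links. It follows the wording above by linking a dangling page to every other page; some write-ups spread the weight over all pages including the page itself, which also gives a stochastic matrix.

```python
def build_stochastic_matrix(links, pages):
    """Column j holds the probabilities of moving from page j to each page i."""
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    S = [[0.0] * n for _ in range(n)]
    for src in pages:
        j = index[src]
        outs = links.get(src, [])
        if outs:
            # Split the page's importance evenly over its outbound links.
            for target in outs:
                S[index[target]][j] = 1.0 / len(outs)
        else:
            # Dangling node: pretend it links to every other page.
            for i in range(n):
                if i != j:
                    S[i][j] = 1.0 / (n - 1)
    return S


# Hypothetical web in which C is a dangling node.
pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": []}
S = build_stochastic_matrix(links, pages)
column_sums = [sum(S[i][j] for i in range(len(pages))) for j in range(len(pages))]
print(column_sums)
```

Every column sums to 1, including the column for the dangling page C.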
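For small matrices these constraints can be checked directly. The sketch below is one possible way to do it, not something taken from the slides: the function names are made up, and the primitivity test relies on Wielandt's bound, which says a primitive n x n matrix has all positive entries by the power (n-1)^2 + 1 at the latest. Only the zero/nonzero pattern is multiplied, so no floating-point underflow can hide a positive entry.

```python
def is_column_stochastic(S, eps=1e-9):
    """True if all entries are nonnegative and every column sums to 1."""
    n = len(S)
    nonnegative = all(x >= 0 for row in S for x in row)
    sums_ok = all(abs(sum(S[i][j] for i in range(n)) - 1.0) < eps for j in range(n))
    return nonnegative and sums_ok


def is_primitive(S):
    """True if some power of S has all positive entries (Wielandt bound)."""
    n = len(S)
    pattern = [[x > 0 for x in row] for row in S]
    power = pattern
    for _ in range((n - 1) ** 2 + 1):
        if all(all(row) for row in power):
            return True
        # Boolean matrix product: the pattern of the next power of S.
        power = [[any(power[i][k] and pattern[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return False


# Hypothetical examples: a 3-page web (A->B, A->C, B->C, C->A) and a 2-page cycle.
web = [[0.0, 0.0, 1.0], [0.5, 0.0, 0.0], [0.5, 1.0, 0.0]]
cycle = [[0.0, 1.0], [1.0, 0.0]]
print(is_column_stochastic(web), is_primitive(web))
print(is_column_stochastic(cycle), is_primitive(cycle))
```

The two-page cycle is stochastic and strongly connected but not primitive, because its powers alternate between two patterns and never become all positive.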
Calculating your page rank
“Page Rank can be calculated using a simple iterative algorithm and corresponds to the principal eigenvector of the normalized link matrix (probability distribution) of the web”
Algorithm to calculate the normalized probability distribution:
1) Multiply the stochastic matrix, S, by a random initial probability vector, i1, to get a new vector, i2.
2) Repeat step 1 with the newest vector until i(n-1) = i(n) (approx.). The vector it settles on is the principal eigenvector of S.
LINEAR ALGEBRA TIME!!! Page Rank calculation time!
How to maximize your page rank score
Internal Linking: having links to other pages within your website
  Hierarchical
  Fully meshed
Good and plentiful content
  E.g. news website
Provide a useful service or product
  E.g. phpbb, an online bulletin board system
Loopholes
SEO (Search Engine Optimization): optimize webpages to increase traffic flow, which leads to conversions and revenue ($$)
An issue that arose from this: the selling of links from high-PR pages
Source: http://www.bloggingcage.com/wp-content/uploads/2015/07/pr8links.png
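The iteration described on this slide is commonly called power iteration. A minimal sketch is shown below; the function name, the tolerance, the iteration cap and the 3 x 3 example matrix are assumptions for illustration. It also starts from a uniform vector rather than a random one so that the run is repeatable; for a primitive stochastic matrix any starting probability vector leads to the same result.

```python
def power_iteration(S, tol=1e-10, max_iter=1000):
    """Multiply S by the current vector until successive vectors agree."""
    n = len(S)
    v = [1.0 / n] * n  # assumed uniform start instead of a random vector
    for _ in range(max_iter):
        nxt = [sum(S[i][j] * v[j] for j in range(n)) for i in range(n)]
        # Stop when the total change between iterations is below tol.
        if sum(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v


# Hypothetical column-stochastic matrix for A->B, A->C, B->C, C->A.
S = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
rank = power_iteration(S)
print([round(x, 4) for x in rank])
```

For this matrix the result can be checked by hand: solving S v = v with the entries of v summing to 1 gives v = (0.4, 0.2, 0.4).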
Neat stuff Overview of a google search (1-2 minutes): http://www.google.com/insidesearch/howsearchworks/thestory/index.html How search has evolved (6 minutes): https://www.youtube.com/watch?v=mTBShTwCnD4 Changes to Google’s search algorithm: https://moz.com/google-algorithm-change
References
Content
http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
http://www.ams.org/samplings/feature-column/fcarc-pagerank
http://infolab.stanford.edu/~backrub/google.html
http://www.rose-hulman.edu/~bryan/googleFinalVersionFixed.pdf
http://www.google.com/insidesearch/howsearchworks/thestory/index.html
Images
https://lh4.googleusercontent.com/-vAlbgOEKiNI/TtkBZvZLnDI/AAAAAAAAMrw/ooZ1Thuutmw/w1034-h587-no/OriginalGooglePage.PNG
http://infolab.stanford.edu/~backrub/over.gif
https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/PageRank-hi-res.png/1280px-PageRank-hi-res.png
http://www.webworkshop.net/images/pr1.gif
http://www.bloggingcage.com/wp-content/uploads/2015/07/pr8links.png
Questions? Source: https://lh4.googleusercontent.com/-vAlbgOEKiNI/TtkBZvZLnDI/AAAAAAAAMrw/ooZ1Thuutmw/w1034-h587-no/OriginalGooglePage.PNG