Google’s Billion Dollar Eigenvector Gerald Kruse, PhD. Associate Professor of Mathematics and Computer Science Juniata College Huntingdon, PA

Presentation transcript:

But first, a brief mention of the exchange agreement between Juniata and FH-Muenster (faculty and students).

Students:
Juniata        FH-Muenster
Tim Auman      Robin Segglemann
Mike Link      Frank Volkmer
??             Sascha Hlusiak
??             Morin Ostkamp

A “taste” of Juniata College and its IT/CS Department. Juniata College is located in central Pennsylvania, in the eastern U.S., a 3 hour drive (this is considered “close” in America) from Baltimore, Washington, DC, Philadelphia, and Pittsburgh. Students: 45% male / 55% female. Predominantly residential campus, Liberal Arts curriculum, reputation as strong in the sciences.

Information Technology at Juniata is multi-disciplinary, with courses and faculty also a part of the Computer Science, Business, Communication, Criminal Justice, and Environmental Sciences departments. Between 10 and 20 IT and CS majors graduate each year, with many more choosing to minor (an 8-course sequence). IT at Juniata is characterized by a 3-semester “Innovations for Industry” sequence (abbreviated as I4I).

Brumbaugh Academic Center

von Liebig Science Center

Innovations for Industry Meeting Room

C – 102 Classroom

Digital Media “Green Room”

Now, back to Search Engines… What must they do? (1) Crawl the web and locate all public pages. (2) Index the “crawled” data so it can be searched. (3) Rank the pages for more effective searching (the focus of this talk).

PageRank is NOT a simple citation index. NOTE: While PageRank is an important part of Google’s search results, it is not the sole means used to rank pages. [Figure: two pages, A and B, with their incoming links.] Which is the more popular page, A or B? What if the links to A were from unpopular pages, and the one link to B was from ?

Intuitively, PageRank is analogous to popularity. The web as a graph: each page is a vertex, each hyperlink a directed edge. A page is popular if a few very popular pages point (via hyperlinks) to it. A page could also be popular if many not-necessarily popular pages point (via hyperlinks) to it. [Figure: link graph of Page A, Page B, and Page C.] Which of these three would have the highest page rank?

So what is the mathematical definition of PageRank? In particular, a page’s rank is equal to the sum of the ranks of all the pages pointing to it. Note the scaling of each page rank.
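One common way to write this definition, following the notation of [2], is sketched below. The symbols $x_k$, $L_k$, and $n_j$ are assumptions taken from that paper, since the slide's own formula did not survive in this transcript.

```latex
x_k = \sum_{j \in L_k} \frac{x_j}{n_j}
```

Here $x_k$ is the rank of page $k$, $L_k$ is the set of pages that link to page $k$, and $n_j$ is the number of outgoing links on page $j$. Dividing by $n_j$ is the scaling mentioned on the slide: a page splits its rank evenly among the pages it links to.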

Writing out the equation for each web-page in our example gives: [Figure: link graph of Page A, Page B, and Page C, with one equation per page.]
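As a rough sketch of how such a system of equations arises, the snippet below builds the coefficient matrix from a list of links and prints one equation per page. The three-page graph (A links to B and C, B links to C, C links to A) is an assumption chosen for illustration, because the slide's figure is not reproduced here; the helper name `link_matrix` is also hypothetical.

```python
from fractions import Fraction

# Hypothetical link structure (an assumption, not taken from the slide):
# each key is a page, each value lists the pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def link_matrix(links):
    """Return (pages, matrix) where matrix[i][j] = 1/n_j if page j links to page i."""
    pages = sorted(links)
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    matrix = [[Fraction(0)] * n for _ in range(n)]
    for source, targets in links.items():
        for target in targets:
            # Page `source` splits its vote evenly among its outgoing links.
            matrix[index[target]][index[source]] = Fraction(1, len(targets))
    return pages, matrix


pages, A = link_matrix(links)
for page, row in zip(pages, A):
    # Print only the nonzero terms of each page's equation.
    terms = [f"{coef} * x_{p}" for p, coef in zip(pages, row) if coef != 0]
    print(f"x_{page} = " + " + ".join(terms))
```

Exact fractions are used instead of floats so that the printed coefficients read the same way as the hand-written equations.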

Even though this is a circular definition, we can calculate the ranks. Re-write the system of equations as a matrix-vector product. The PageRank vector is simply an eigenvector of the coefficient matrix, with eigenvalue $\lambda = 1$.

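[Figure: link graph of Page A, Page B, and Page C, labeled with their ranks; the surviving labels read PageRank = 0.4 and PageRank = 0.2.] Note: we choose the eigenvector whose entries sum to 1.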
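The eigenvector claim can be checked by hand on a small case. The sketch below assumes a hypothetical graph (A links to B and C, B links to C, C links to A), which is not confirmed to be the slide's figure but does produce ranks of 0.4, 0.2, and 0.4, and it verifies that this vector is left unchanged by the coefficient matrix.

```python
from fractions import Fraction

# Coefficient matrix for a hypothetical graph (an assumption, not from the slide):
# A -> B, A -> C, B -> C, C -> A. Column j holds 1/n_j for each page that j links to.
half = Fraction(1, 2)
A = [
    [0, 0, 1],     # x_A = x_C
    [half, 0, 0],  # x_B = x_A / 2
    [half, 1, 0],  # x_C = x_A / 2 + x_B
]


def mat_vec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Candidate PageRank vector, normalized so its entries sum to 1.
x = [Fraction(2, 5), Fraction(1, 5), Fraction(2, 5)]

print(mat_vec(A, x) == x)  # checks whether A x = 1 * x
print(sum(x) == 1)         # checks the normalization
```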

Note that the coefficient matrix is column-stochastic*: its entries are non-negative and each column sums to 1. Every column-stochastic matrix has 1 as an eigenvalue. (* As long as there are no “dangling nodes” and the graph is connected.)

Dangling nodes have no outgoing links. [Figure: link graph of Page A, Page B, and Page C.] In this example, Page C is a dangling node. Note that its associated column in the coefficient matrix is all 0. Matrices like these are called column-substochastic. In Page, Brin, et al. [1], they suggest dangling nodes most likely would occur from pages which haven’t been crawled yet, and so they “simply remove them from the system until all the PageRanks are calculated.” It is interesting to note that a column-substochastic matrix does have a positive eigenvalue and a corresponding eigenvector with non-negative entries, which is called the Perron eigenvector, as detailed in Bryan and Leise [2].
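A minimal sketch of how a dangling node shows up in the column sums follows. The link lists are an assumption (only the fact that Page C has no outgoing links comes from the slide), and `column_sums` is a hypothetical helper name.

```python
from fractions import Fraction

# Hypothetical graph (an assumption): A -> B, A -> C, B -> C, and C has no outgoing links.
links = {"A": ["B", "C"], "B": ["C"], "C": []}
pages = sorted(links)


def column_sums(links, pages):
    """Sum each page's column of the coefficient matrix: 1/n_j per outgoing link."""
    sums = {}
    for page in pages:
        targets = links[page]
        sums[page] = sum(Fraction(1, len(targets)) for _ in targets)
    return sums


sums = column_sums(links, pages)
# A column that sums to 0 instead of 1 marks a dangling node.
dangling = [page for page in pages if sums[page] == 0]
print(sums)
print("dangling nodes:", dangling)
```

Columns for pages with at least one link still sum to 1, so the matrix as a whole is column-substochastic rather than column-stochastic.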

A disconnected graph could lead to non-unique rankings. [Figure: disconnected link graph of Pages A through E.] Notice the block diagonal structure of the coefficient matrix. In this example, the eigenspace associated with eigenvalue $\lambda = 1$ is two-dimensional. Which eigenvector should be used for ranking? Note: Re-ordering via permutation doesn’t change the ranking, as in [2].
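To make the non-uniqueness concrete, the sketch below uses a hypothetical five-page web with two subwebs (the exact links are an assumption, not the slide's figure) and shows two different normalized vectors that both satisfy $Ax = x$. The helper `apply_links` is a hypothetical name.

```python
from fractions import Fraction

# Hypothetical disconnected web (an assumption, not the slide's figure):
# subweb 1: A <-> B; subweb 2: C <-> D, plus E -> C and E -> D.
links = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"], "E": ["C", "D"]}
pages = sorted(links)


def apply_links(links, pages, x):
    """Compute A x without forming A: each page passes x_j / n_j along every outgoing link."""
    result = {page: Fraction(0) for page in pages}
    for source in pages:
        share = x[source] / len(links[source])
        for target in links[source]:
            result[target] += share
    return result


half = Fraction(1, 2)
zero = Fraction(0)
# Two different vectors, each with entries summing to 1.
x1 = {"A": half, "B": half, "C": zero, "D": zero, "E": zero}
x2 = {"A": zero, "B": zero, "C": half, "D": half, "E": zero}

print(apply_links(links, pages, x1) == x1)  # checks whether x1 satisfies A x = x
print(apply_links(links, pages, x2) == x2)  # checks the same for x2
```

Any weighted average of `x1` and `x2` is also unchanged by the matrix, so this link structure alone cannot say how pages in one subweb compare with pages in the other.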

Add a “random-surfer” term to the simple PageRank formula. This models the behavior of a real web-surfer, who might jump to another page by directly typing in a URL or by choosing a bookmark, rather than clicking on a hyperlink. Originally, m = 0.15 in Google, according to [2]. Let S be an n x n matrix with all entries 1/n. S is column-stochastic, and we consider the matrix M, which is a weighted average of A and S: $M = (1 - m)A + mS$. The equation $x = Mx$ can also be written as $x = (1 - m)Ax + ms$, where s is a column vector with all entries 1/n, and $Sx = s$ if $\sum_i x_i = 1$. Important Note: We will use this formulation with A when computing x.

M for our previous disconnected graph, with m = 0.15. [Figure: link graph of Pages A through E, with the matrix M.] The eigenspace associated with $\lambda = 1$ is one-dimensional, so the normalized eigenvector is unique. So the addition of the random surfer term permits comparison between pages in different subwebs.
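The effect of the random-surfer term can be sketched numerically. The code below assumes the form $M = (1 - m)A + mS$ from [2] and a hypothetical five-page disconnected web (not the slide's figure), then applies M repeatedly from two different starting vectors; `power_method` is a hypothetical helper name.

```python
# Hypothetical disconnected web (an assumption): A <-> B, C <-> D, E -> C, E -> D.
links = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"], "E": ["C", "D"]}
pages = sorted(links)
n = len(pages)
m = 0.15  # random-surfer weight quoted in the talk

# Dense M = (1 - m) A + m S, where every entry of S is 1/n.
M = [[m / n for _ in pages] for _ in pages]
for j, source in enumerate(pages):
    for target in links[source]:
        M[pages.index(target)][j] += (1 - m) / len(links[source])


def power_method(matrix, start, steps=200):
    """Repeatedly apply the matrix to the start vector."""
    x = list(start)
    for _ in range(steps):
        x = [sum(a * b for a, b in zip(row, x)) for row in matrix]
    return x


# Two different starting guesses, each with positive entries summing to 1.
uniform = power_method(M, [1 / n] * n)
skewed = power_method(M, [0.96, 0.01, 0.01, 0.01, 0.01])
for page, rank in zip(pages, uniform):
    print(page, round(rank, 4))
```

In this sketch both starting vectors settle on the same ranking, which is the behavior expected when the eigenspace for eigenvalue 1 is one-dimensional.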

Iterative Calculation. By many estimates, the web currently contains at least 8 billion pages. How does Google compute an eigenvector for something this large? One possibility is the power method. In [2], it is shown that every positive (all entries are > 0) column-stochastic matrix M has a unique vector q with positive components such that Mq = q, with $\|q\|_1 = 1$, and it can be computed as $q = \lim_{k \to \infty} M^k x_0$, for any initial guess $x_0$ with positive components and $\|x_0\|_1 = 1$.

Iterative Calculation, continued. Rather than calculating the powers of M directly, we could use the iteration $x_k = M x_{k-1}$. Since M is positive, $M x_{k-1}$ would be an $O(n^2)$ calculation. As we mentioned previously, Google uses the equivalent expression in the computation: $x_k = (1 - m) A x_{k-1} + ms$. These products can be calculated without explicitly creating the huge coefficient matrix, since A contains mostly 0’s. The iteration is guaranteed to converge, and it will converge quicker with a better first guess, so the previous PageRank vector is used as the initial vector.
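A minimal sketch of this iteration follows, assuming it takes the form $x_k = (1 - m) A x_{k-1} + ms$ as in [2]. It stores only the link lists, never the matrix, and assumes every page has at least one outgoing link. The function name `pagerank`, its parameters, and the five-page web are all hypothetical.

```python
def pagerank(links, m=0.15, tol=1e-12, max_steps=1000, start=None):
    """Iterate x_k = (1 - m) A x_{k-1} + m s using only the link lists.

    `links` maps each page to the pages it links to; every page is assumed
    to have at least one outgoing link (no dangling nodes).
    """
    pages = sorted(links)
    n = len(pages)
    x = dict(start) if start else {page: 1 / n for page in pages}
    for step in range(1, max_steps + 1):
        # m * s: every page receives the same random-surfer share m / n.
        new_x = {page: m / n for page in pages}
        # (1 - m) * A x: touch only the nonzero entries of A, one per hyperlink.
        for source in pages:
            share = (1 - m) * x[source] / len(links[source])
            for target in links[source]:
                new_x[target] += share
        change = sum(abs(new_x[page] - x[page]) for page in pages)
        x = new_x
        if change < tol:
            return x, step
    return x, max_steps


# Hypothetical five-page web (an assumption, not the slide's figure).
links = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"], "E": ["C", "D"]}
ranks, cold_steps = pagerank(links)
_, warm_steps = pagerank(links, start=ranks)  # restart from the previous result
print(ranks)
print(cold_steps, warm_steps)
```

Each pass costs work proportional to the number of hyperlinks rather than to $n^2$, and passing a previous result as `start` illustrates the warm-start idea described on the slide.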

“Google-ing” Google

Results in an early paper from Page, Brin et al. while in graduate school

Attempts to Manipulate Search Results Via a “Google Bomb”

Liberals vs. Conservatives!

Juniata’s own “Google Bomb”

At Juniata, CS 315 is my “Analysis and Algorithms” course

“Ego Surfing” Be very careful…

More than one Gerald Kruse…

Miscellaneous points Try a search in Google on “PigeonRank.” What types of sites would Google NOT give good results on? PageRank is not the only means Google uses to order search results. This exchange has been a wonderful opportunity for me professionally and personally with my family (wife Lisa, children Olivia, Peter, and Isabel)

Bibliography
[1] S. Brin, L. Page, et al., The PageRank Citation Ranking: Bringing Order to the Web, Stanford Digital Libraries Project (January 29, 1998).
[2] K. Bryan and T. Leise, The $25,000,000,000 Eigenvector: The Linear Algebra behind Google, SIAM Review, 48 (2006), pp
[3] G. Strang, Linear Algebra and Its Applications, Brooks-Cole, Boston, MA,
[4] D. Poole, Linear Algebra: A Modern Introduction, Brooks-Cole, Boston, MA, 2005.

Any Questions? Slides available at

In matrix notation we have $x = (1 - m)Ax + ms$. Since $s = Sx$ whenever $\sum_i x_i = 1$, we can rewrite this as $x = \left((1 - m)A + mS\right)x = Mx$. This gives a regular matrix. The new coefficient matrix is regular, so we can calculate the eigenvector iteratively. This iterative process is a series of matrix-vector products, beginning with an initial vector (typically the previous PageRank vector). These products can be calculated without explicitly creating the huge coefficient matrix.