Exploiting Web Matrix Permutations to Speedup PageRank Computation Presented by: Aries Chan, Cody Lawson, and Michael Dwyer.


Introduction Internet Statistics 151 million active internet users as of January % used a search engine at least once during the month. Average time spent searching was about 40 minutes.

Introduction Search Engines Most common means of accessing the Web. Easiest method of organizing and accessing information. Therefore, high-quality, usable search engines are very important.

Rank | Provider | Searches (000) | Share of Searches
- | All Search | 10,812, | %
1 | Google Search | 6,986, | %
2 | Yahoo! Search | 1,726, | %
3 | MSN/Windows Live/Bing Search | 1,156, | %

Introduction Basic Idea of a Search Engine Scanning Web Graph Using crawlers to create an index of the information Ranking Organizing and ordering information in a usable way

Introduction Webpage Ranking Problems Personalization Updating – Keeping order up to date 25% of links are changed in one week 5% “new content” in one week 35% of entire web changed in eleven weeks Reducing Computation Time

Introduction Accelerating the PageRank Use compression to fit the Web Graph into main memory Use sequence of scans of the Web Graph to efficiently compute in external memory Reduce computation time through numerical methods

Introduction Reduce computation time through numerical methods Use iterative methods such as the Gauss-Seidel and Jacobi methods (Arasu et al. and Bianchini et al.) Subtract off estimates of non-principal eigenvectors (Kamvar et al.) Sort the graph lexicographically by URL, resulting in an approximate block structure (Kamvar et al.) Split into two problems: dangling nodes and non-dangling nodes (Lee et al.) Others have been working on ways to update only the ranks of nodes influenced by Web Graph changes

Introduction Del Corso, Gulli, and Romani (authors of the paper) Numerical optimization of PageRank View the PageRank computation as a linear system Transform a dense problem into one which uses a sparse matrix Treatment of dangling nodes which naturally adapts to the random surfer model Exploiting web matrix permutations Increase data locality and reduce number of iterations necessary to solve the problem

Google’s PageRank Overview Web as an oriented graph The random surfer $v_i = \sum_j p_{ji} v_j$ (sum of PageRanks of nodes linking to i weighted by the transition probability) Equilibrium distribution $v^T = v^T H$ (left eigenvector of H with eigenvalue 1)

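Problems with the ideal model Dangling nodes (trap the user) Impose a random jump to every other page: $B = H + a u^T$ (a is the indicator vector of dangling nodes, u a probability distribution such as the uniform one) Cyclic paths in the Web Graph (reducibility) Brin and Page suggested adding artificial transitions (low probability jump to all nodes): $G = \alpha B + (1 - \alpha) e u^T$ (e is the vector of all ones)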
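
As a small illustration of the two corrections on this slide, the sketch below builds the dense Google matrix for a toy four-node graph. The graph, the name `google_matrix`, the uniform choice of u, and the value 0.85 for the damping parameter α are all assumptions made for the example, not values taken from the slides.

```python
def google_matrix(links, alpha=0.85):
    """Dense G = alpha * (H + a u^T) + (1 - alpha) * e u^T for a small graph.

    links maps each node 0..n-1 to the list of nodes it links to.
    """
    n = len(links)
    u = [1.0 / n] * n  # assumption: uniform jump distribution
    G = []
    for i in range(n):
        out = links[i]
        if out:
            # Row i of H: equal probability on each out-link.
            b_row = [0.0] * n
            for j in out:
                b_row[j] += 1.0 / len(out)
        else:
            # Dangling node: row i of B = H + a u^T is just u.
            b_row = u[:]
        # Blend with the low-probability jump to all nodes.
        G.append([alpha * b_row[j] + (1 - alpha) * u[j] for j in range(n)])
    return G


# Hypothetical toy graph: node 3 has no out-links (dangling).
links = {0: [1, 2], 1: [2, 3], 2: [0], 3: []}
G = google_matrix(links)
```

Every row of G sums to 1 and every entry is strictly positive, which is what removes both the dangling-node trap and the reducibility problem. The price is that G is completely dense even though H is sparse.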

Current PageRank Solution Since G is just a rank one modification of $\alpha H$, the power method takes advantage of the sparsity of matrix H.
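
A minimal sketch of how the power method can avoid ever forming the dense matrix G: each step multiplies by the sparse H (stored as adjacency lists) and adds the rank-one terms as a single scalar times u. The toy graph, the function names, the uniform u, and the damping value 0.85 are assumptions for illustration.

```python
def apply_G(v, links, alpha=0.85):
    """Return v^T G using only the sparse link structure.

    v^T G = alpha * v^T H + (alpha * v^T a + (1 - alpha) * v^T e) * u^T,
    with v^T e = 1 for a probability vector v.
    """
    n = len(links)
    u = 1.0 / n  # assumption: uniform jump distribution
    dangling_mass = sum(v[i] for i in range(n) if not links[i])
    new = [(alpha * dangling_mass + (1 - alpha)) * u] * n
    for i in range(n):
        if links[i]:
            share = alpha * v[i] / len(links[i])
            for j in links[i]:
                new[j] += share
    return new


def pagerank_power(links, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate v <- v^T G until the 1-norm change drops below tol."""
    n = len(links)
    v = [1.0 / n] * n
    for _ in range(max_iter):
        new = apply_G(v, links, alpha)
        diff = sum(abs(p - q) for p, q in zip(new, v))
        v = new
        if diff < tol:
            break
    return v


# Hypothetical toy graph: node 3 is dangling.
links = {0: [1, 2], 1: [2, 3], 2: [0], 3: []}
v = pagerank_power(links)
```

The work per iteration is proportional to the number of links plus the number of nodes, not to the square of the number of nodes.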

Google’s PageRank eigenproblem as a linear system Expand $v^T G = v^T$ (eigenproblem) $G = \alpha (H + a u^T) + (1 - \alpha) e u^T$ (expansion of Google matrix) $\Rightarrow v^T (\alpha H + \alpha a u^T) + (1 - \alpha) v^T e u^T = v^T$ Restructure Taking $S = I - \alpha H^T - \alpha u a^T$ And with $v^T e = 1$ $\Rightarrow S v = (1 - \alpha) u$ (after taking transpose and rearranging)

Dangling Nodes Problems Dangling Nodes Pages with no links to other pages Pages whose existence is inferred but which the crawler has not reached According to a 2001 sample, approximately 76% of pages are dangling nodes.

Dangling Nodes Problems Natural Model $B = H + a u^T$ Jump with probability 1 to any other node Drastic Model Completely removes dangling nodes Problems Dangling nodes themselves are not ranked Removing nodes creates new dangling nodes Self-loop Model e.g. $B = H + F$, where $F_{ij} = 1$ if $i = j$ and node $i$ is dangling, and $0$ otherwise Still row stochastic and is similar to natural Problem Gives unreasonably high rank to the dangling nodes

Which model is the best? Natural model is the most “accurate”. Problem Gives a much denser matrix B It is at least partially for this reason that the problem is approached as an eigenproblem, to exploit the sparsity of H Can we have an equally lightweight iterative approach?

Iterative Approach with Sparsity Sparse matrix R: $R = I - \alpha H^T$ PageRank v obtained by solving $R y = u$, $v = \gamma y$ such that $\|v\|_1 = 1$ Why does this work? Since $S = R - \alpha u a^T$ and $S v = (1 - \alpha) u$, we have $(R - \alpha u a^T) v = (1 - \alpha) u$ Use Sherman-Morrison formula to calculate the inverse

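Iterative Approach with Sparsity We get $(R - \alpha u a^T)^{-1} = R^{-1} + \frac{R^{-1} u a^T R^{-1}}{1/\alpha - a^T R^{-1} u}$ With $y = R^{-1} u$: $v = (1 - \alpha)\left(1 + \frac{a^T y}{1/\alpha - a^T y}\right) y$ That is, $v = \gamma y$ with $\gamma = (1 - \alpha)\left(1 + \frac{a^T y}{1/\alpha - a^T y}\right)$ The vector v is our PageRank vector and was solved using sparse matrix R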
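
The sparse formulation can be checked numerically on a toy graph. The sketch below solves (I − αHᵀ)y = u with a plain fixed-point iteration y ← u + αHᵀy (one simple choice of solver, not necessarily the one used in the paper), then scales y by the Sherman-Morrison factor γ = (1 − α)(1 + aᵀy / (1/α − aᵀy)). The graph, the function name `solve_R`, the uniform u, and the damping value 0.85 are assumptions for illustration.

```python
def solve_R(links, alpha=0.85, tol=1e-13, max_iter=10000):
    """Solve (I - alpha * H^T) y = u by iterating y <- u + alpha * H^T y.

    links maps each node 0..n-1 to its out-links; only non-dangling
    rows of H are non-zero, so the dangling correction never appears.
    """
    n = len(links)
    u = [1.0 / n] * n  # assumption: uniform u
    y = u[:]
    for _ in range(max_iter):
        new = u[:]
        for i in range(n):
            for j in links[i]:
                new[j] += alpha * y[i] / len(links[i])
        diff = sum(abs(p - q) for p, q in zip(new, y))
        y = new
        if diff < tol:
            break
    return y


# Hypothetical toy graph: node 3 is dangling.
links = {0: [1, 2], 1: [2, 3], 2: [0], 3: []}
alpha = 0.85
y = solve_R(links, alpha)

# a^T y: sum of y over the dangling nodes.
a_dot_y = sum(y[i] for i in links if not links[i])
gamma = (1 - alpha) * (1 + a_dot_y / (1 / alpha - a_dot_y))
v = [gamma * yi for yi in y]
```

On this example γ agrees with 1/‖y‖₁, so in practice the PageRank vector can be obtained simply by normalizing y to unit 1-norm.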

Exploiting Web Matrix Permutations Use a variety of “cheap” operators to permute the web graph in an organized way, in the hope of: increasing data locality reducing the number of iterations needed to solve the problem Explore different iterative methods in order to solve the problem more quickly. Compare the performance of different iterative methods on specifically permuted web graphs.

Permutation Strategies The following operations were chosen based on their “limited impact on the computational cost” (Del Corso et al.) O - orders nodes by increasing out-degree Q - orders nodes by decreasing out-degree X - orders nodes by increasing in-degree Y - orders nodes by decreasing in-degree B - ordering the nodes according to their BFS (Breadth First Search) order of visit T - transposes the matrix
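
The operators above can be sketched on an adjacency-list graph as follows. This is only an illustration: the function names, the tie-breaking by original node index, the BFS start node (lowest unvisited index, restarting when the queue empties), and the toy graph are assumptions, and the sketch says nothing about how the paper composes the operators.

```python
from collections import deque


def by_out_degree(links, decreasing=False):
    """O (increasing) or Q (decreasing): node order sorted by out-degree."""
    return sorted(links, key=lambda i: len(links[i]), reverse=decreasing)


def by_in_degree(links, decreasing=False):
    """X (increasing) or Y (decreasing): node order sorted by in-degree."""
    in_deg = {i: 0 for i in links}
    for targets in links.values():
        for j in targets:
            in_deg[j] += 1
    return sorted(links, key=lambda i: in_deg[i], reverse=decreasing)


def bfs_order(links):
    """B: nodes in breadth-first order of visit."""
    seen, order = set(), []
    for start in links:  # restart from the next unvisited node
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            i = queue.popleft()
            order.append(i)
            for j in links[i]:
                if j not in seen:
                    seen.add(j)
                    queue.append(j)
    return order


def transpose(links):
    """T: reverse every edge (transpose of the adjacency structure)."""
    rev = {i: [] for i in links}
    for i, targets in links.items():
        for j in targets:
            rev[j].append(i)
    return rev


def relabel(links, order):
    """Apply a permutation: the node at order[k] becomes node k."""
    pos = {old: new for new, old in enumerate(order)}
    return {pos[i]: sorted(pos[j] for j in links[i]) for i in order}


# Hypothetical toy graph: node 3 is dangling.
links = {0: [1, 2], 1: [2, 3], 2: [0], 3: []}
permuted = relabel(links, by_out_degree(links))
```

Each operator needs at most a sort or a single traversal of the graph, which is why these permutations add little to the overall computational cost.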

Permutation Strategies (cont.) O, Q, X, Y, and T operators → a full matrix The B operator conveniently arranges R into a lower block triangular structure due to BFS order of visit. Combining these operations on R, the following structures of the permuted web matrix are produced.

Permutation Strategies (cont.) Visual representations of the permuted web graph.

Iterative Strategies Power Method Computes dominant eigenvector Jacobi Method Using an initial guess, approximates the solution to the linear system of equations. Each successive iteration uses the previous approximate solution as its next guess until a chosen degree of convergence is reached. (*further explained) Gauss-Seidel Method (Reverse Gauss-Seidel) Modification of the Jacobi Method, which approximates the solution to each successive equation in the linear system based on the values derived from the previous equations, all within each iterative loop. (*further explained) *(
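
The difference between the Jacobi and Gauss-Seidel sweeps can be sketched on the sparse system (I − αHᵀ)y = u. The code assumes a graph without self-loops (so the diagonal of the matrix is 1), a uniform u, and a damping value of 0.85; the function names and the toy graph are hypothetical. Reverse Gauss-Seidel is modeled here simply as sweeping the equations in the opposite order.

```python
def in_links_of(links):
    """Map each node to the list of nodes that link to it."""
    rev = {j: [] for j in links}
    for i, targets in links.items():
        for j in targets:
            rev[j].append(i)
    return rev


def jacobi(links, alpha=0.85, tol=1e-12, max_iter=10000):
    """Jacobi: every component is updated from the previous iterate only."""
    n = len(links)
    u = 1.0 / n
    rev = in_links_of(links)
    y = [u] * n
    for k in range(1, max_iter + 1):
        new = [u + alpha * sum(y[i] / len(links[i]) for i in rev[j])
               for j in range(n)]
        diff = sum(abs(p - q) for p, q in zip(new, y))
        y = new
        if diff < tol:
            return y, k
    return y, max_iter


def gauss_seidel(links, alpha=0.85, tol=1e-12, max_iter=10000, reverse=False):
    """Gauss-Seidel: each component uses values already updated in this sweep."""
    n = len(links)
    u = 1.0 / n
    rev = in_links_of(links)
    order = range(n - 1, -1, -1) if reverse else range(n)
    y = [u] * n
    for k in range(1, max_iter + 1):
        diff = 0.0
        for j in order:
            new_j = u + alpha * sum(y[i] / len(links[i]) for i in rev[j])
            diff += abs(new_j - y[j])
            y[j] = new_j  # overwrite in place so later equations see it
        if diff < tol:
            return y, k
    return y, max_iter


# Hypothetical toy graph: node 3 is dangling.
links = {0: [1, 2], 1: [2, 3], 2: [0], 3: []}
y_jacobi, sweeps_jacobi = jacobi(links)
y_gs, sweeps_gs = gauss_seidel(links)
y_rgs, sweeps_rgs = gauss_seidel(links, reverse=True)
```

All three runs converge to the same y. Because Gauss-Seidel reuses fresh values within a sweep, its sweep count depends on the order of the equations, which is one reason permuting the matrix can change the number of iterations.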

Iterative Strategies (cont.) Further exploration led to iterative methods based on the distinct block structure of certain web graph permutations. DN (or DNR) Method The permuted matrix $R_{OT}$ has the property that the lower diagonal block coincides with the identity matrix. The matrix can be easily partitioned into non-dangling and dangling portions. Then the non-dangling part is solved by Gauss-Seidel (or Reverse Gauss-Seidel respectively). LB/UB/LBR/UBR Methods Use the Gauss-Seidel or Reverse Gauss-Seidel methods to solve the individual blocks of the triangular block matrices produced by the B operator.
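
A rough sketch of the idea behind the DN method, under stated assumptions: if the nodes are split into non-dangling and dangling sets, the equations of (I − αHᵀ)y = u for non-dangling nodes involve only other non-dangling nodes, and the dangling block is the identity. The non-dangling part can then be solved by Gauss-Seidel, and the dangling part follows from one direct substitution. The function name `solve_dn`, the toy graph, the uniform u, the absence of self-loops, and the damping value 0.85 are assumptions; this is not the paper's implementation, and it uses index sets instead of an explicit permutation.

```python
def solve_dn(links, alpha=0.85, tol=1e-12, max_sweeps=10000):
    """Solve (I - alpha * H^T) y = u by treating dangling nodes separately.

    links maps each node 0..n-1 to its out-links (no self-loops assumed).
    """
    n = len(links)
    u = 1.0 / n  # assumption: uniform u
    nondangling = [i for i in range(n) if links[i]]
    dangling = [i for i in range(n) if not links[i]]

    # In-links always come from non-dangling nodes, since dangling
    # nodes have no out-links at all.
    in_links = {j: [] for j in range(n)}
    for i in nondangling:
        for j in links[i]:
            in_links[j].append(i)

    y = [u] * n
    # Gauss-Seidel sweeps restricted to the non-dangling block.
    for _ in range(max_sweeps):
        change = 0.0
        for j in nondangling:
            new_j = u + alpha * sum(y[i] / len(links[i]) for i in in_links[j])
            change += abs(new_j - y[j])
            y[j] = new_j
        if change < tol:
            break

    # Identity block: dangling entries follow from one substitution.
    for j in dangling:
        y[j] = u + alpha * sum(y[i] / len(links[i]) for i in in_links[j])
    return y


# Hypothetical toy graph: nodes 3 and 4 are dangling.
links = {0: [1, 2], 1: [2, 3], 2: [0, 4], 3: [], 4: []}
alpha = 0.85
y = solve_dn(links, alpha)
total = sum(y)
v = [yi / total for yi in y]  # normalize to unit 1-norm
```

Since a large share of web pages are dangling, iterating only over the non-dangling block shrinks the system that actually needs an iterative solver.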

Results

Results (cont.) For both the Power Method and Jacobi Method, the number of Mflops is not dependent on permutations of the web matrix. (“the small differences in numerical data are due to the finite precision” [Del Corso et al.]) The Jacobi Method (applied to the matrix R ) is only a slight improvement (about 3%) compared to the Power Method (applied to the matrix G ).

Results (cont.) The Gauss-Seidel and Reverse Gauss-Seidel Methods reduced Mflops by around 40% and running time by around 45% on the full matrix compared to the Power Method on the full matrix. In particular, the Reverse Gauss-Seidel Method performed on the permuted matrix $R_{YTB}$ reduced the number of Mflops by 51% and running time by 82% when compared to the Power Method on the full matrix.

Results (cont.) The block methods achieved even better results. In particular, the best overall reduction in computation time and number of Mflops was achieved by the LBR method on the permuted matrix $R_{QTB}$: 58% reduction in terms of Mflops 89% reduction in terms of running time

Conclusion Objective: Accelerating Google PageRank by numerical methods Contribution: Viewed web matrix as a sparse linear system Formalized new method for treating dangling nodes Explored new iterative methods and applied them to web matrix permutations Achievement: Computation time cut to about 1/10 Mflops reduced by over 50%

References G.M. Del Corso, A. Gulli, F. Romani, “Exploiting Web Matrix Permutations to Speedup PageRank Computation”. Nielsen MegaView Search (us.nielsen.com/rankings/insights/rankings/internet)