Estimating the Global PageRank of Web Communities. Paper by Jason V. Davis & Inderjit S. Dhillon, Dept. of Computer Sciences, University of Texas at Austin.


Estimating the Global PageRank of Web Communities. Paper by Jason V. Davis & Inderjit S. Dhillon, Dept. of Computer Sciences, University of Texas at Austin. Presentation given by Scott J. McCallen, Dept. of Computer Science, Kent State University, December 4th, 2006.

Localized Search Engines
What are they?
- Search engines that focus on a particular community.
- Examples: a single site (site specific) or all computer science related websites (topic specific).
Advantages
- Searching for particular terms that have several meanings.
- Relatively inexpensive to build and use.
- Use less bandwidth, space, and time.
- Local domains are orders of magnitude smaller than the global domain.

Localized Search Engines (cont'd)
Disadvantages
- Lack of global information, i.e. only local PageRanks are available.
- Why is this a problem? Only pages that are highly regarded within that community will have high PageRanks.
- There is a need for a global PageRank for pages within a local domain.
- Traditionally, this can only be obtained by crawling the entire domain.

Some Global Facts
2003 study by Lyman on the global domain:
- 8.9 billion static pages on the internet, approximately 18.7 kilobytes each.
- 167 terabytes needed to download and crawl the entire web.
- These resources are only available to major corporations.
Local domains:
- May only contain a few hundred thousand pages.
- May already be contained on a local web server.
- There is much less restriction on accessing the entire dataset.
- The advantages of localized search engines become clear.

Global (N) vs. Local (n)
[Diagram: overlapping regions labeled environmental websites, EDU websites, political websites, and other websites]
- Some parts overlap, but others don't; overlap represents links to other domains.
- Each local domain isn't aware of the rest of the global domain.
- How is it possible to extract global information when only the local domain is available?
- Excluding the overlap from other domains gives a very poor estimate of global rank.

Proposed Solution
- Find a good approximation to the global PageRank values without crawling the entire global domain.
- Find a superdomain of the local domain that approximates the global PageRank well.
- Find this superdomain by crawling as few as n or 2n additional pages, given a local domain of n pages.
- Essentially, add as few pages to the local domain as possible until we find a very good approximation of the PageRanks in the local domain.

PageRank - Description
- Defines the importance of pages based on the hyperlinks from one page to another (the web graph).
- Computes the stationary distribution of a Markov chain created from the web graph.
- Uses the "random surfer" model to create a random walk over the chain.

PageRank Matrix
Given the m x m adjacency matrix U for the web graph, define the PageRank matrix P_U, where:
- D_U is a diagonal matrix such that U D_U^-1 is column stochastic,
- 0 ≤ α ≤ 1,
- e is the vector of all 1's,
- v is the random surfer vector.
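The matrix expression on this slide did not survive the transcript. Reconstructed from the definitions listed above (this is the standard PageRank matrix, so the form is well established):

```latex
P_U = \alpha \, U D_U^{-1} + (1 - \alpha) \, v e^{T}
```

Since U D_U^{-1} is column stochastic and v e^T places the random surfer vector in every column, each column of P_U sums to one, so P_U is itself column stochastic.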

PageRank Vector
- The PageRank vector r represents the PageRank of every node in the web graph.
- It is defined as the dominant eigenvector of the PageRank matrix.
- It is computed using the power method from a random starting vector.
- Computation can take as much as O(m^2) time for a dense graph, but in practice is normally O(km), where k is the average number of links per page.

Algorithm 1
Computing the PageRank vector based on the adjacency matrix U of the given web graph.

Algorithm 1 (Explanation)
- Input: adjacency matrix U.
- Output: PageRank vector r.
Method:
- Choose a random initial value for r^(0).
- Iterate using the random surfer probability and vector until the convergence threshold is reached.
- Return the last iterate as the dominant eigenvector of the PageRank matrix of U.
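The steps above can be sketched as a short power-method implementation. This is an illustration of the standard PageRank iteration, not the paper's exact pseudocode; the dangling-page guard is an assumption added to keep the sketch self-contained.

```python
import numpy as np

def pagerank(U, alpha=0.85, tol=1e-10, max_iter=1000, seed=0):
    """Power-method PageRank, a sketch of Algorithm 1.

    U is the m x m adjacency matrix with U[i, j] = 1 when page j
    links to page i (columns are source pages, as on the slide).
    """
    m = U.shape[0]
    v = np.full(m, 1.0 / m)                  # uniform random surfer vector
    deg = U.sum(axis=0)
    deg[deg == 0] = 1                        # guard against dangling pages
    # P_U = alpha * U D_U^{-1} + (1 - alpha) * v e^T
    P = alpha * (U / deg) + (1 - alpha) * np.outer(v, np.ones(m))
    r = np.random.default_rng(seed).random(m)
    r /= r.sum()                             # random starting vector r^(0)
    for _ in range(max_iter):
        r_new = P @ r
        if np.abs(r_new - r).sum() < tol:    # convergence threshold
            break
        r = r_new
    return r_new
```

Because P is column stochastic, each iteration preserves the total mass of r, and the result is independent of the random start.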

Defining the Problem (G vs. L)
- For a local domain L, let G be the entire global domain, with an N x N adjacency matrix.
- Define G as the following partitioned matrix, i.e. partition G into separate sections so that L is contained within it.
- Assume that L has already been crawled and that L_out (the links from L to the rest of G) is known.
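The partitioned form of G was an image and is missing from the transcript. A plausible reconstruction, in which both the block placement and the block names other than L and L_out are assumptions of this sketch:

```latex
G = \begin{pmatrix} L & G_{\text{in}} \\ L_{\text{out}} & G_{\text{within}} \end{pmatrix}
```

Here L is the n x n adjacency block of the local domain, L_out holds the links leaving L, G_in the links pointing into L, and G_within the remainder of the global domain.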

Defining the Problem (p* in g)
- If we partition G this way, we can denote the actual PageRank vector of L, taken with respect to g (the global PageRank vector), as p*.
- Note: E_L selects only the nodes of g that correspond to L.

Defining the Problem (n << N)
- Define p as the PageRank vector computed by crawling only the local domain L.
- Note that p will be much different from p*.
- Crawling more nodes of the global domain would shrink the difference, but crawling everything is not feasible.
- Instead, find the supergraph F of L that minimizes the difference between p and p*.

Defining the Problem (finding F)
- We need the F that gives the best approximation of p*, i.e. that minimizes the difference between the actual global PageRank and the estimated PageRank.
- F is found with a greedy strategy, using Algorithm 2.
- Essentially, start with L and, over a total of T iterations, add the nodes in F_out that most reduce the objective.
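The minimization problem itself is missing from the transcript. Based on the slide's description (the difference between the actual global PageRank of L and the estimate obtained from F), it plausibly has the form:

```latex
F^{*} = \operatorname*{arg\,min}_{F \supseteq L} \;
\left\| \frac{E_L^{T} f}{\lVert E_L^{T} f \rVert_1} - p^{*} \right\|_1
```

where f is the PageRank vector of F and E_L^T f selects and renormalizes the entries corresponding to L; the choice of the L1 norm is an assumption of this reconstruction.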

Algorithm 2

Algorithm 2 (Explanation)
- Input: L (local domain), L_out (outlinks from L), T (number of iterations), k (pages to crawl per iteration).
- Output: p (an improved estimated PageRank vector).
Method:
- First set F (the supergraph) and F_out equal to L and L_out.
- Compute the PageRank vector f of F.
- While T has not been exceeded:
  - Select k new nodes to crawl based on F, F_out, and f.
  - Expand F to include those new nodes and update F_out.
  - Compute the new PageRank vector f for F.
- Select the elements of f that correspond to L and return them as p.
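The loop above can be sketched as follows. Everything here is illustrative: the toy graph stands in for what a crawler would fetch page by page, and frontier pages are scored by outlink count from F (the paper's baseline heuristic), not by the influence scores of Algorithm 3.

```python
def pagerank_dict(nodes, links, alpha=0.85, iters=200):
    """Plain-Python power-method PageRank over `nodes`; links leaving
    the node set are ignored, dangling mass is spread uniformly."""
    order = sorted(nodes)
    idx = {u: i for i, u in enumerate(order)}
    m = len(order)
    r = [1.0 / m] * m
    for _ in range(iters):
        nxt = [(1 - alpha) / m] * m
        for u in order:
            targets = [t for t in links.get(u, ()) if t in idx]
            share = r[idx[u]]
            if targets:
                for t in targets:
                    nxt[idx[t]] += alpha * share / len(targets)
            else:                               # dangling page
                for i in range(m):
                    nxt[i] += alpha * share / m
        r = nxt
    return {u: r[idx[u]] for u in order}

def estimate_local_pagerank(local, global_links, T=2, k=1):
    """Greedy supergraph expansion, a sketch of Algorithm 2."""
    F = set(local)
    for _ in range(T):
        # frontier F_out: uncrawled pages linked to from F
        frontier = {t for u in F for t in global_links.get(u, ()) if t not in F}
        if not frontier:
            break
        # baseline scoring: number of links from F to each frontier page
        scores = {j: sum(j in global_links.get(u, ()) for u in F) for j in frontier}
        F.update(sorted(frontier, key=lambda j: -scores[j])[:k])  # "crawl" k pages
    f = pagerank_dict(F, global_links)          # PageRank of the supergraph F
    p = {u: f[u] for u in local}                # restrict to L ...
    s = sum(p.values())
    return {u: x / s for u, x in p.items()}     # ... and renormalize
```

Swapping the outlink-count scoring for an influence-based score is exactly the refinement the later slides develop.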

Global (N) vs. Local (n) (Again)
[Diagram repeated: environmental, EDU, political, and other websites]
- Using PageRank on only the local domain gives very inaccurate estimates.
- We know how to create the PageRank vector using the power method.
- How far can selecting more nodes proceed without crawling the entire global domain?
- How can we select nodes from other domains (i.e. expand the current domain) to improve accuracy?

Selecting Nodes
- Select nodes to expand L to F.
- Selected nodes must bring us closer to the actual PageRank vector.
- Some nodes will greatly influence the current PageRank.
- We only want to select at most O(n) more pages than those already in L.

Finding the Best Nodes
- For a page j in the global domain on the frontier of F (F_out), the result of adding page j to F is the expanded matrix F_j, where:
- u_j is the vector of outlinks from F to j,
- s is the vector of estimated inlinks from j into F (j has not yet been crawled),
- s is estimated from the expected inlink counts of the pages already crawled.
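The expanded matrix on this slide did not survive extraction. Given the definitions of u_j and s above, a natural reconstruction (the exact block layout is an assumption) is:

```latex
F_j = \begin{pmatrix} F & s \\ u_j^{T} & 0 \end{pmatrix}
```

The appended last column holds s, the estimated links from j into F, and the appended last row holds u_j^T, the known links from F to j.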

Finding the Best Nodes (cont'd)
- We defined the PageRank of F to be f; the PageRank of the expanded graph F_j is f_j^+.
- x_j is the PageRank of node j (appended to the current PageRank vector).
- Directly optimizing the objective requires knowing the global PageRank p*.
- How can we minimize the objective without knowing p*?

Node Influence
- Find the nodes in F_out that will have the greatest influence on the local domain L.
- This is done by attaching an influence score to each node j: the sum, over all pages in L, of the difference that adding page j makes to the PageRank vector.
- The influence score correlates strongly with the minimization of the GlobalDiff(f_j) objective (as compared to a baseline such as the total outlink count from F to node j).
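The influence formula itself is not in the transcript; the verbal description above (summing, over the pages of L, the change that adding j makes to the PageRank vector) corresponds to something like:

```latex
\mathrm{influence}(j) \;=\; \sum_{k \in L} \left| \, f_j[k] - f[k] \, \right|
```

where f is the PageRank of F and f_j the PageRank after adding page j; taking the absolute difference rather than the signed one is an assumption of this reconstruction.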

Node Influence Results
[Plot: node influence vs. outlink count on a crawl of conservative web sites]

Finding the Influence
- The influence must be calculated for each candidate node j on the frontier of F.
- Since we consider O(n) pages and each calculation is O(n), we are left with an O(n^2) computation.
- To reduce this complexity, approximating the influence of j may be acceptable, but how?
- The power method for computing PageRank may lead to a good approximation.
- However, using Algorithm 1 requires a good starting vector.

PageRank Vector (again)
- The PageRank algorithm converges at a rate equal to the random surfer probability α.
- With a starting vector x^(0), the complexity of the algorithm grows with the accuracy demanded: the more accurate the vector must become, the more iterations the process takes.
- Saving grace: find a very good starting vector x^(0), in which case we only need to perform one iteration of Algorithm 1.

Finding the Best x^(0)
Partition the PageRank matrix for F_j.

Finding the Best x^(0)
Simple approach:
- Use the current PageRank vector f as the starting vector.
- Perform one PageRank iteration.
- Remove the element that corresponds to the added node.
Issues:
- The estimate of f_j^+ will have an error of at least 2αx_j.
- So if the PageRank of j is very high, the estimate is very bad.

Stochastic Complement
- In expanded form, the PageRank equation for f_j^+ can be written out and solved in closed form.
- Observation: the resulting matrix is the stochastic complement of the PageRank matrix of F_j.
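Neither equation on this slide survived the transcript, but the stated observation pins down the construction: for a stochastic matrix partitioned into blocks as below, the stochastic complement of the upper-left block is, by Meyer's standard definition,

```latex
P = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix},
\qquad
S = P_{11} + P_{12}\,(I - P_{22})^{-1}\,P_{21}
```

Here P_11 would be the block for F and the remaining blocks the row and column for node j; that identification is an assumption, since the slide's own equations are lost.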

Stochastic Complement (Observations)
- The stochastic complement of an irreducible matrix is unique.
- The stochastic complement is also irreducible and therefore has a unique stationary distribution.
- Regarding the matrix S: its subdominant eigenvalue is bounded above by a quantity that, for large l, is very close to α.

The New PageRank Approximation
- Estimate the vector f_j of length l by performing one PageRank iteration over S, starting at f.
- Advantages:
  - We start and end with a vector of length l.
  - The error has a lower bound of zero.
- Example: consider adding a node k to F that has no influence over the PageRank of F; using the stochastic complement yields the exact solution.

The Details
- Begin by expanding the difference between the two PageRank vectors. [Derivation equations not shown]

The Details (cont'd)
- Substitute P_F into the equation.
- Summarize the result into vectors. [Equations not shown]

Algorithm 3 (Explanation)
- Input: F (the current local subgraph), F_out (outlinks of F), f (current PageRank of F), k (number of pages to return).
- Output: k new pages to crawl.
Method:
- Compute the outlink sums for each page in F.
- Compute a scalar for every known global page j (how many pages link to j).
- Compute y and z as formulated.
- For each of the pages in F_out:
  - Compute x as formulated.
  - Compute the score of the page using x, y, and z.
- Return the k pages with the highest scores.

PageRank Leaks and Flows
- The change in PageRank caused by adding a node j to F can be described in terms of leaks and flows.
- A flow is an increase in local PageRanks, represented by a scalar (the total amount j has to distribute) and a vector (how it will be distributed).
- A leak is a decrease in local PageRanks.
- Leaks come from the non-positive vectors x and y: x is proportional to the weighted sum of sibling PageRanks, and y is an artifact of the random surfer vector.

Leaks and Flows
[Diagram: node J, local pages, flows, leaks, the random surfer, and siblings]

Experiments
Methodology:
- Resources are limited, so the global graph is approximated.
Baseline algorithms:
- Random: nodes are chosen uniformly at random from the known global nodes.
- Outlink count: the nodes chosen have the highest outlink counts from the current local domain.

Results (Data Sets)
- Both data sets are restricted to http pages that do not contain the characters ? or =.
- EDU data set: a crawl of the top 100 computer science universities, yielding 4.7 million pages and 22.9 million links.
- Politics data set: a crawl of the pages under politics in the dmoz directory, yielding 4.4 million pages and 17.2 million links.

Results (EDU Data Set)
The norm-based measures show difference; the Kendall's tau measure shows similarity. [Results table not shown]

Results (Politics Data Set)

Result Summary
- The stochastic complement method outperformed the other methods in nearly every trial.
- The results are significantly better than the random selection approach, with minimal additional computation.

Conclusion
- Accurate estimates of the global PageRank can be obtained using local results.
- Expand the local graph based on influence, crawling at most O(n) more pages.
- Use the stochastic complement to accurately estimate the new PageRank vector.
- The method is not computationally or storage intensive.

Estimating the Global PageRank of Web Communities The End Thank You