Estimating the Global PageRank of Web Communities. Paper by Jason V. Davis and Inderjit S. Dhillon, Dept. of Computer Sciences, University of Texas at Austin. Presentation given by Scott J. McCallen, Dept. of Computer Science, Kent State University, December 4th, 2006.
Localized Search Engines: What are they? Search engines that focus on a particular community. Examples: a single site (site specific) or all computer science related websites (topic specific). Advantages: they disambiguate search terms that have several meanings; they are relatively inexpensive to build and use; they use less bandwidth, space and time; and local domains are orders of magnitude smaller than the global domain.
Localized Search Engines (cont'd): Disadvantages: lack of global information, i.e. only local PageRanks are available. Why is this a problem? Only pages that are highly regarded within that community will have high PageRanks. There is a need for a global PageRank for the pages within a local domain. Traditionally, this can only be obtained by crawling the entire global domain.
Some Global Facts: A 2003 study by Lyman on the global domain found 8.9 billion static pages on the internet, approximately 18.7 kilobytes each, so 167 terabytes are needed to download and crawl the entire web. These resources are only available to major corporations. Local domains may contain only a couple hundred thousand pages and may already be held on a local web server, so access to the entire dataset is far less restricted. The advantages of localized search engines become clear.
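As a quick check on the figures above, the sketch below multiplies the quoted page count by the quoted average page size. It assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes), which the slide does not state, and it uses 200,000 pages as a hypothetical stand-in for "a couple hundred thousand".

```python
# Back-of-the-envelope crawl sizes from the figures quoted on the slide.
# Assumption: decimal units (1 KB = 1e3 bytes, 1 GB = 1e9, 1 TB = 1e12).
global_pages = 8.9e9      # static pages in the global domain
kb_per_page = 18.7        # average page size in kilobytes
local_pages = 200_000     # hypothetical local domain size

global_tb = global_pages * kb_per_page * 1e3 / 1e12
local_gb = local_pages * kb_per_page * 1e3 / 1e9

print(f"global crawl: about {global_tb:.0f} TB")
print(f"local crawl:  about {local_gb:.1f} GB")
```

Under these assumptions the global estimate lands within a terabyte of the quoted figure, while the local domain fits in a few gigabytes.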
Global (N) vs. Local (n). [Diagram: overlapping regions labeled Environmental Websites, EDU Websites, Political Websites, and Other Websites.] Some parts overlap, but others don't. Overlap represents links to other domains. Each local domain isn't aware of the rest of the global domain. How is it possible to extract global information when only the local domain is available? Excluding overlap from other domains gives a very poor estimate of global rank.
Proposed Solution: Find a good approximation to the global PageRank values without crawling the entire global domain. Find a superdomain of the local domain that approximates the PageRank well. Find this superdomain by crawling as few as n or 2n additional pages, given a local domain of n pages. Essentially, add as few pages to the local domain as possible until we find a very good approximation of the PageRanks in the local domain.
PageRank - Description: Defines the importance of pages based on the hyperlinks from one page to another (the web graph). Computes the stationary distribution of a Markov chain created from the web graph. Uses the "random surfer" model to create a "random walk" over the chain.
PageRank Matrix: Given the m x m adjacency matrix U of the web graph, define the PageRank matrix as $P_U = \alpha U D_U^{-1} + (1 - \alpha) v e^T$, where $D_U$ is the diagonal matrix such that $U D_U^{-1}$ is column stochastic, $0 \le \alpha \le 1$, e is the vector of all 1's, and v is the random surfer vector.
PageRank Vector: The PageRank vector r represents the PageRank of every node in the web graph. It is defined as the dominant eigenvector of the PageRank matrix. It is computed using the power method with a random starting vector. Computation can take as much as O(m^2) time for a dense graph, but in practice is normally O(km), k being the average number of links per page.
Algorithm 1: Computing the PageRank vector based on the adjacency matrix U of the given web graph.
Algorithm 1 (Explanation): Input: adjacency matrix U. Output: PageRank vector r. Method: choose a random initial value for r^(0); iterate using the random surfer probability and vector until reaching the convergence threshold; return the last iterate as the dominant eigenvector of the PageRank matrix built from U.
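The method described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's Algorithm 1: the graph is given as a hypothetical `out_links` dictionary instead of a matrix, the random surfer vector v is uniform, the starting vector is uniform rather than random, and pages without outlinks spread their rank according to v.

```python
def pagerank(out_links, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for r = alpha * U D^-1 r + (1 - alpha) * v."""
    nodes = sorted(out_links)
    m = len(nodes)
    v = {u: 1.0 / m for u in nodes}   # uniform random surfer vector (assumption)
    r = dict(v)                       # starting vector r^(0)
    for _ in range(max_iter):
        nxt = {u: 0.0 for u in nodes}
        dangling = 0.0                # rank held by pages without outlinks
        for j in nodes:
            targets = [t for t in out_links[j] if t in out_links]
            if targets:
                share = r[j] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                dangling += r[j]
        for u in nodes:
            nxt[u] = alpha * (nxt[u] + dangling * v[u]) + (1 - alpha) * v[u]
        delta = sum(abs(nxt[u] - r[u]) for u in nodes)
        r = nxt
        if delta < tol:               # convergence threshold reached
            break
    return r


# Tiny example: both "a" and "b" link to "c", and "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

Each iteration touches every link once, which matches the O(km) cost per pass mentioned earlier for sparse graphs.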
Defining the Problem (G vs. L): For a local domain L, let G be the entire global domain, with an N x N adjacency matrix. G is partitioned into blocks so that L is contained as one of them. Assume that L has already been crawled and L_out is known.
Defining the Problem (p* in g): If we partition G as such, we can denote the actual PageRank vector of L as $p^* = E_L g$, with respect to g (the global PageRank vector). Note: $E_L$ selects only the entries of g that correspond to nodes in L.
Defining the Problem (n << N): We define p as the PageRank vector computed by crawling only the local domain L. Note that p can differ substantially from p*. Crawling more nodes of the global domain makes the difference smaller, but crawling all of it is not feasible. Instead, find the supergraph F of L that minimizes the difference between p and p*.
Defining the Problem (finding F): We need to find the F that gives us the best approximation of p*, i.e. the F that minimizes the difference between the actual global PageRank and the estimated PageRank. F is found with a greedy strategy, using Algorithm 2. Essentially, start with L, add the nodes in F_out that minimize our objective, and continue doing so for a total of T iterations.
Algorithm 2
Algorithm 2 (Explanation): Input: L (local domain), L_out (outlinks from L), T (number of iterations), k (pages to crawl per iteration). Output: p (an improved estimated PageRank vector). Method: first set F (the supergraph) and F_out equal to L and L_out; compute the PageRank vector f of F; then, for T iterations, select k new nodes to crawl based on F, F_out and f, expand F to include those new nodes and update F_out, and compute the new PageRank vector for F; finally, select the elements of f that correspond to L and return them as p.
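The loop described above might look like the following Python sketch. It is a simplified illustration, not the paper's Algorithm 2: `crawl` is a hypothetical callback that returns the outlinks of a page, `select` is a hypothetical hook for the node-selection step, the frontier F_out is recomputed from F each round instead of being passed in, and the returned values are not renormalized.

```python
def pagerank(out_links, alpha=0.85, iters=200):
    """Power iteration restricted to the pages that are keys of out_links."""
    nodes = list(out_links)
    m = len(nodes)
    r = {u: 1.0 / m for u in nodes}
    for _ in range(iters):
        nxt = {u: 0.0 for u in nodes}
        lost = 0.0  # rank held by pages with no crawled targets
        for j in nodes:
            targets = [t for t in out_links[j] if t in out_links]
            if targets:
                for t in targets:
                    nxt[t] += r[j] / len(targets)
            else:
                lost += r[j]
        r = {u: alpha * (nxt[u] + lost / m) + (1 - alpha) / m for u in nodes}
    return r


def estimate_local_pagerank(L, crawl, T, k, select):
    """Greedily expand L into a supergraph F, then read off the entries for L."""
    F = {page: list(targets) for page, targets in L.items()}
    f = pagerank(F)
    for _ in range(T):
        # F_out: pages linked from F that have not been crawled yet
        frontier = {t for targets in F.values() for t in targets if t not in F}
        if not frontier:
            break
        for j in select(F, frontier, f, k):
            F[j] = list(crawl(j))     # crawl page j and add it to F
        f = pagerank(F)
    return {page: f[page] for page in L}
```

A caller would pass the crawled local pages as `L` and plug in whichever selection rule is being evaluated as `select`; the rule receives F, the frontier, f, and k, and returns up to k frontier pages.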
Global (N) vs. Local (n) (Again). [Diagram: overlapping regions labeled Environmental Websites, EDU Websites, Political Websites, and Other Websites.] We know how to create the PageRank vector using the power method. Using it on only the local domain gives very inaccurate estimates of the PageRank. How can we select nodes from other domains (i.e. expand the current domain) to improve accuracy? How far can selecting more nodes be allowed to proceed without crawling the entire global domain?
Selecting Nodes: Select nodes to expand L to F. Selected nodes must bring us closer to the actual PageRank vector. Some nodes will greatly influence the current PageRank. We only want to select at most O(n) more pages than those already in L.
Finding the Best Nodes: For a page j in the frontier of F (F_out), adding page j to F gives an expanded graph F_j, built from two vectors: u_j, the outlinks from F to j, and s, the estimated inlinks from j into F (j has not yet been crawled). s is estimated from the expected inlink counts of the pages already crawled.
Finding the Best Nodes (cont'd): We defined the PageRank of F to be f. The PageRank of F_j consists of f_j^+, for the pages already in F, together with x_j, the PageRank of the added node j. Directly optimizing the objective requires us to know the global PageRank p*. How can we minimize the objective without knowing p*?
Node Influence: Find the nodes in F_out that will have the greatest influence on the local domain L. This is done by attaching an influence score to each node j: the sum, over all pages in L, of the difference that adding page j makes to the PageRank vector. The influence score correlates strongly with the minimization of the GlobalDiff(f_j) function (as compared to a baseline such as the total outlink count from F to node j).
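To make the influence score concrete, here is a brute-force Python sketch that recomputes PageRank with and without a candidate page and sums the absolute changes over the pages in L. It is only an illustration: `guessed_out_links` is a hypothetical stand-in for the estimated links from the uncrawled page j back into F, and the full recomputation is exactly the expensive step the later slides work to avoid.

```python
def pagerank(out_links, alpha=0.85, iters=200):
    """Power iteration restricted to the pages that are keys of out_links."""
    nodes = list(out_links)
    m = len(nodes)
    r = {u: 1.0 / m for u in nodes}
    for _ in range(iters):
        nxt = {u: 0.0 for u in nodes}
        lost = 0.0  # rank held by pages with no crawled targets
        for j in nodes:
            targets = [t for t in out_links[j] if t in out_links]
            if targets:
                for t in targets:
                    nxt[t] += r[j] / len(targets)
            else:
                lost += r[j]
        r = {u: alpha * (nxt[u] + lost / m) + (1 - alpha) / m for u in nodes}
    return r


def influence(F, local_pages, j, guessed_out_links, alpha=0.85):
    """Sum over pages in L of |PageRank with j added - PageRank without j|."""
    f = pagerank(F, alpha)                       # PageRank of F
    F_j = {page: list(targets) for page, targets in F.items()}
    F_j[j] = list(guessed_out_links)             # F expanded with page j
    f_j = pagerank(F_j, alpha)                   # PageRank of F_j
    return sum(abs(f_j[page] - f[page]) for page in local_pages)


# Two local pages that link to each other; "a" also links to candidates x and y.
F = {"a": ["b", "x", "y"], "b": ["a"]}
infl_x = influence(F, ["a", "b"], "x", ["b"])
infl_y = influence(F, ["a", "b"], "y", ["a"])
print(f"influence of x: {infl_x:.4f}, influence of y: {infl_y:.4f}")
```

In this toy graph both candidates receive one link from F, so an outlink-count rule could not tell them apart, while the influence score can.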
Node Influence Results: Node influence vs. outlink count on a crawl of conservative web sites.
Finding the Influence: Influence must be calculated for each node j in the frontier of F that is considered. Since we are considering O(n) pages and each calculation is O(n), we are left with an O(n^2) computation. To reduce this complexity, approximating the influence of j may be acceptable, but how? Using the power method for computing PageRank may lead us to a good approximation. However, using the algorithm (Algorithm 1) requires having a good starting vector.
PageRank Vector (again): The PageRank algorithm converges at a rate equal to the random surfer probability α. With a starting vector x^(0), the cost of the algorithm depends on how close x^(0) is to the solution and on the accuracy required; that is, the more accurate the vector must be, the more work the process takes. Saving grace: find a very good starting vector x^(0), in which case we only need to perform one iteration of Algorithm 1.
Finding the Best x^(0): Partition the PageRank matrix for F_j.
Finding the Best x^(0): Simple approach: use the current PageRank vector as the starting vector, perform one PageRank iteration, and remove the element that corresponds to the added node. Issues: the estimate of f_j^+ will have an error of at least 2αx_j, so if the PageRank of j is very high, the estimate is very bad.
Stochastic Complement: In expanded form, the PageRank f_j^+ satisfies a block equation that can be solved for f_j^+ alone. Observation: the matrix in the solved equation is the stochastic complement of the PageRank matrix of F_j.
Stochastic Complement (Observations): The stochastic complement of an irreducible matrix is unique. The stochastic complement is also irreducible and therefore has a unique stationary distribution. With regard to the matrix S: the subdominant eigenvalue has an upper bound that, for large l, is very close to α.
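For readers unfamiliar with the stochastic complement, the following Python sketch shows the general construction for a column-stochastic matrix when a single state (here the last one) is collapsed: S = P11 + P12 (1 - P22)^-1 P21. The 3 x 3 matrix is made up for illustration and is not a PageRank matrix from the paper. The sketch also checks the standard property that the stationary distribution of S equals the stationary distribution of P restricted to the remaining states and renormalized.

```python
def stochastic_complement(P):
    """Collapse the last state of a column-stochastic matrix P (list of rows)."""
    j = len(P) - 1
    stay = P[j][j]                                # P22: probability of staying in j
    return [
        [P[i][k] + P[i][j] * P[j][k] / (1.0 - stay) for k in range(j)]
        for i in range(j)
    ]


def stationary(P, iters=500):
    """Stationary distribution of a column-stochastic matrix by power iteration."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(iters):
        x = [sum(P[i][k] * x[k] for k in range(n)) for i in range(n)]
    return x


# Made-up column-stochastic matrix: entry [i][k] is the probability of k -> i.
P = [
    [0.1, 0.5, 0.3],
    [0.6, 0.2, 0.3],
    [0.3, 0.3, 0.4],
]
S = stochastic_complement(P)
pi = stationary(P)
sigma = stationary(S)
kept = pi[0] + pi[1]
print([round(x, 6) for x in sigma])
print([round(pi[0] / kept, 6), round(pi[1] / kept, 6)])
```

The two printed vectors agree, which is why working with S lets the estimate start and end with a vector of length l, without carrying the added node along.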
The New PageRank Approximation: Estimate the vector f_j^+ of length l by performing one PageRank iteration over S, starting at f. Advantages: we start and end with a vector of length l, and the lower bound on the error is zero. Example: consider adding a node k to F that has no influence over the PageRank of F; using the stochastic complement yields the exact solution.
The Details: Begin by expanding the difference between two PageRank vectors.
The Details: Substitute P_F into the equation, then summarize into vectors.
Algorithm 3 (Explanation): Input: F (the current local subgraph), F_out (outlinks of F), f (current PageRank of F), k (number of pages to return). Output: k new pages to crawl. Method: compute the outlink sums for each page in F; compute a scalar for every known global page j (how many pages link to j); compute y and z as formulated; for each of the pages in F_out, compute x as formulated and compute the score of the page using x, y and z; return the k pages with the highest scores.
PageRank Leaks and Flows: The change in PageRank caused by adding a node j to F can be described in terms of leaks and flows. A flow is an increase in local PageRanks, represented by a scalar times a vector: the scalar is the total amount j has to distribute, and the vector determines how it will be distributed. A leak is a decrease in local PageRanks. Leaks come from the non-positive vectors x and y: x is proportional to the weighted sum of sibling PageRanks, and y is an artifact of the random surfer vector.
Leaks and Flows. [Diagram labels: j, Local Pages, Flows, Leaks, Random Surfer, Siblings.]
Experiments: Methodology: resources are limited, so the global graph is approximated. Baseline algorithms: Random, where nodes are chosen uniformly at random from the known global nodes; and Outlink Count, where the nodes chosen are those with the highest number of outlinks from the current local domain.
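The two baselines are simple enough to sketch directly. The function names, the dictionary representation of F, and the tie-breaking by page name are assumptions made for this illustration.

```python
import random


def select_random(frontier, k, rng=None):
    """Baseline 1: pick k known-but-uncrawled pages uniformly at random."""
    rng = rng or random.Random(0)     # fixed seed only to make the sketch repeatable
    pool = sorted(frontier)
    return rng.sample(pool, min(k, len(pool)))


def select_by_outlink_count(F, frontier, k):
    """Baseline 2: pick the k frontier pages with the most links from F."""
    counts = {j: 0 for j in frontier}
    for targets in F.values():
        for t in targets:
            if t in counts:
                counts[t] += 1
    # Highest count first; ties broken by page name (an arbitrary choice).
    return sorted(counts, key=lambda j: (-counts[j], j))[:k]


# Toy local graph whose pages link out to the uncrawled pages x, y and z.
F = {"a": ["x", "y"], "b": ["x"], "c": ["x", "y", "z"]}
frontier = {"x", "y", "z"}
print(select_by_outlink_count(F, frontier, 2))
print(select_random(frontier, 2))
```

Neither baseline looks at the current PageRank vector f, which is the main difference from the influence-based selection.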
Results (Data Sets): Data sets are restricted to http pages that do not contain the characters ?, or =. EDU data set: a crawl of the top 100 computer science universities, which yielded 4.7 million pages and 22.9 million links. Politics data set: a crawl of the pages under politics in the dmoz directory, which yielded 4.4 million pages and 17.2 million links.
Results (EDU Data Set): The norms show the difference between the estimated and actual PageRank vectors; Kendall's tau shows their similarity.
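The two kinds of measure can be illustrated with a short Python sketch. The choice of the L1 norm and of the tau-a variant of Kendall's tau are assumptions made here; the slide does not say which norms or which variant were used, and the example scores are invented.

```python
from itertools import combinations


def l1_distance(p, q):
    """L1 distance between two score dicts after normalizing each to sum to 1."""
    sp, sq = sum(p.values()), sum(q.values())
    return sum(abs(p[u] / sp - q[u] / sq) for u in p)


def kendall_tau(p, q):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / all pairs."""
    concordant = discordant = 0
    for u, w in combinations(list(p), 2):
        s = (p[u] - p[w]) * (q[u] - q[w])
        if s > 0:
            concordant += 1       # both vectors order u and w the same way
        elif s < 0:
            discordant += 1       # the vectors disagree on the order
    pairs = len(p) * (len(p) - 1) // 2
    return (concordant - discordant) / pairs


# Invented scores for three pages: an estimate and a reference.
estimate = {"a": 0.5, "b": 0.3, "c": 0.2}
reference = {"a": 0.6, "b": 0.3, "c": 0.1}
print(l1_distance(estimate, reference), kendall_tau(estimate, reference))
```

A smaller distance and a tau closer to 1 both indicate a better estimate; tau only looks at the ordering of pages, so it can be perfect even when the scores themselves differ.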
Results (Politics Data Set)
Result Summary: Stochastic Complement outperformed other methods in nearly every trial. The results are significantly better than the random walk approach with minimal computation.
Conclusion: Accurate estimates of the PageRank can be obtained by using local results. Expand the local graph based on influence. Crawl at most O(n) more pages. Use the stochastic complement to accurately estimate the new PageRank vector. The approach is not computationally or storage intensive.
The End. Thank you.