1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 10 June 4, 2006

2 2 Random Sampling of Web Pages

3 3 Outline
- Problem definition
- Random sampling of web pages according to their PageRank
- Uniform sampling of web pages (Henzinger et al.)
- Uniform sampling of web pages (Bar-Yossef et al.)

4 4 Random Sampling of Web Pages
- W = a snapshot of the “indexable web”
  - Consider only “static” HTML web pages
- $\pi$ = a probability distribution over W
- Goal: design an efficient algorithm for generating samples from W distributed according to $\pi$.
- Our focus:
  - $\pi$ = PageRank
  - $\pi$ = Uniform

5 5 Random Sampling of Web Pages: Motivation
- Compute statistics about the web
  - Ex: What fraction of web pages belong
  - Ex: What fraction of web pages are written in Chinese?
  - Ex: What fraction of hyperlinks are advertisements?
- Compare coverage of search engines
  - Ex: Is Google larger than MSN?
  - Ex: What is the overlap between Google and Yahoo?
- Data mining of the web
  - Ex: How frequently do computer science pages cite biology pages?
  - Ex: How are pages distributed by topic?

6 6 Random Sampling of Web Pages: Challenges
Naïve solution: crawl, index, sample
- Crawls cannot get complete coverage
- Web is constantly changing
- Crawling is slow and expensive
Our goals:
- Accuracy: generate samples from a snapshot of the entire indexable web
- Speed: samples should be generated quickly
- Low cost: sampling procedure should run on a desktop PC

7 7 A Random Walk Approach
- Design a random walk on W whose stationary distribution is $\pi$
  - P = the random walk’s probability transition matrix
  - $\pi P = \pi$
- Run the random walk for sufficiently many steps
  - Recall: for any initial distribution q, $\lim_{t \to \infty} q P^t = \pi$
  - Mixing time: # of steps required to get close to the limit
- Use the reached node as a sample
- Repeat for as many samples as needed
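The convergence of $q P^t$ to the stationary distribution can be checked numerically on a small chain. The sketch below is only an illustration: the 3-state transition matrix is a made-up example (not from the lecture), chosen to be irreducible and aperiodic, and its stationary distribution (0.25, 0.5, 0.25) can be verified by hand.

```python
def step(q, P):
    """One step of the walk: returns the row vector q P."""
    n = len(q)
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]

def distribution_after(q, P, t):
    """Distribution of the walk after t steps, starting from distribution q."""
    for _ in range(t):
        q = step(q, P)
    return q

# Hypothetical 3-state chain (an assumption for illustration only).
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

q0 = [1.0, 0.0, 0.0]                  # start deterministically in state 0
q = distribution_after(q0, P, 50)     # distribution after 50 steps
pi = [0.25, 0.5, 0.25]                # satisfies pi P = pi for this P
```

Starting from any other initial distribution gives the same limit, which is why the node reached after enough steps can be used as a sample from the stationary distribution.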

8 8 A Random Walk Approach: Advantages & Issues
Advantages:
- Accuracy: random walk can potentially visit every page on the web
- Speed: no need to scan the whole web
- Low cost: no need for large storage or multiple processors
Issues:
- How to design the random walk so it converges to $\pi$?
- How to analyze the mixing time of the random walk?

9 9 PageRank Sampling [Henzinger et al. 1999]
- Use the “random surfer” random walk:
  - Start at some initial node $v_0$
  - When visiting a page v:
    - Toss a coin with heads probability $\epsilon$
    - If the coin is heads, go to a uniformly chosen page
    - If the coin is tails, go to a random out-neighbor of v
- Limit distribution: PageRank
- Mixing time: fast (will see later)
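A minimal simulation of the random surfer on a toy graph may make the walk concrete. Everything specific here is an assumption for illustration: the 4-page graph, the jump probability 0.15, and the rule that a page without out-links always jumps. In this idealized setting the uniform jump is trivial because the whole page set is known, which is not the case on the real web.

```python
import random
from collections import Counter

def random_surfer(out_links, epsilon, steps, rng):
    """Simulate the random surfer and return the visit frequency of each page."""
    pages = list(out_links)
    v = pages[0]                               # initial node v_0
    visits = Counter()
    for _ in range(steps):
        if rng.random() < epsilon or not out_links[v]:
            v = rng.choice(pages)              # heads (or dead end): uniform jump
        else:
            v = rng.choice(out_links[v])       # tails: random out-neighbor of v
        visits[v] += 1
    return {p: visits[p] / steps for p in pages}

# Hypothetical toy web: page 3 has no in-links, so it is reached only by jumps.
out_links = {0: [1], 1: [2], 2: [0], 3: [0]}
freq = random_surfer(out_links, epsilon=0.15, steps=200_000, rng=random.Random(1))
```

In this toy graph page 3 can only be entered through a uniform jump, so its visit frequency should be close to epsilon/4 = 0.0375, while pages on the cycle receive far more visits.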

10 10 PageRank Sampling: Reality
Problem: how to pick a page uniformly at random?
Solutions:
- Jump to a random page from the history of the walk
  - Creates bias towards dense web-sites
- Pick a random host from the hosts in the walk’s history and jump to a random page from the pages visited on that host
  - Not converging to PageRank anymore
  - Experiments indicate it is still fine

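11 11 Uniform Sampling via PageRank Sampling [Henzinger et al. 2000]
Sampling algorithm:
1. Use the previous random walk to generate a sample w according to the PageRank distribution
2. Toss a coin with heads probability $\frac{1}{C \cdot PR(w)}$, where $C \ge 1 / \min_{w'} PR(w')$
3. If the coin is heads, output w as a sample
4. If the coin is tails, go to step 1
Analysis:
- In each iteration, every page w is output with the same probability $PR(w) \cdot \frac{1}{C \cdot PR(w)} = \frac{1}{C}$, so the output is uniform
- Need C/|W| iterations in expectation until getting a single sample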
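The sampling loop can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the exact PageRank values are known and replaces the random walk by a direct draw from them, and it assumes the coin's heads probability is 1/(C·PR(w)) with C = 1/min PR, the choice that matches the stated C/|W| expected iterations. The three PageRank values are made up.

```python
import random

def uniform_from_pagerank(pr, rng):
    """Rejection-sample a uniform page from PageRank-distributed samples.

    pr maps page -> PR(page). Returns (page, number of iterations used).
    """
    pages = list(pr)
    weights = [pr[w] for w in pages]
    C = 1.0 / min(weights)            # guarantees 1/(C*PR(w)) <= 1 for all w
    iterations = 0
    while True:
        iterations += 1
        w = rng.choices(pages, weights=weights)[0]   # step 1: PageRank sample
        if rng.random() < 1.0 / (C * pr[w]):         # steps 2-3: biased coin
            return w, iterations                     # heads: output w
        # tails: fall through and go back to step 1

pr = {"a": 0.5, "b": 0.3, "c": 0.2}   # hypothetical PageRank values
rng = random.Random(0)
draws = [uniform_from_pagerank(pr, rng) for _ in range(30_000)]
share = {w: sum(1 for x, _ in draws if x == w) / len(draws) for w in pr}
mean_iterations = sum(it for _, it in draws) / len(draws)
```

With these values C = 5 and |W| = 3, so each page should be output about a third of the time and the average number of iterations should be close to C/|W| ≈ 1.67.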

12 12 Uniform Sampling via PageRank Sampling: Reality
How to estimate PR(w)?
- Use the random walk itself:
  - VR(w) = visit ratio of w (# of times w was visited by the walk divided by the length of the walk)
  - Approximation is very crude
- Use the subgraph spanned by the visited nodes to compute PageRank
  - Bias towards the neighborhood of the initial page
- Use Google

13 13 Uniform Sampling by RW on Regular Graphs [Bar-Yossef et al. 2000]
Fact: A random walk on an undirected, connected, non-bipartite, and regular graph converges to a uniform distribution.
Proof:
- P: random walk’s probability transition matrix
  - P is stochastic
  - $\Rightarrow$ $\mathbf{1}$ is a right eigenvector with e.v. 1: $P\mathbf{1} = \mathbf{1}$
- Graph is connected $\Rightarrow$ RW is irreducible
- Graph is non-bipartite $\Rightarrow$ RW is aperiodic
- Hence, RW is ergodic, and thus has a stationary distribution $\pi$:
  - $\pi$ is a left eigenvector of P with e.v. 1: $\pi P = \pi$

14 14 Random Walks on Regular Graphs
Proof (cont.):
- d: graph’s degree
- A: graph’s adjacency matrix
  - Symmetric, because the graph is undirected
- P = (1/d) A
  - Hence, P is also symmetric
  - $\Rightarrow$ Its left eigenvectors and right eigenvectors are the same
  - $\Rightarrow$ $\pi = (1/n) \mathbf{1}$, where n is the number of nodes

15 15 Web as a Regular Graph
Problems:
- Web is not connected
- Web is directed
- Web is non-regular
Solutions:
- Focus on the indexable web, which is connected
- Ignore directions of links
- Add a weighted self loop to each node
  - weight(w) = $\deg_{\max} - \deg(w)$
  - All pages then have degree $\deg_{\max}$
  - An overestimate of $\deg_{\max}$ doesn’t hurt
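The effect of the weighted self loops can be checked on a small undirected graph. The sketch below is an illustration under stated assumptions: the 4-node graph is made up, and deg_max is deliberately set to 5 although the true maximum degree is 3, to show that an overestimate still leads to the uniform distribution.

```python
def regular_walk_matrix(adj, deg_max):
    """Transition matrix after adding a self loop of weight deg_max - deg(w) to each node."""
    nodes = sorted(adj)
    index = {w: i for i, w in enumerate(nodes)}
    P = [[0.0] * len(nodes) for _ in nodes]
    for w in nodes:
        i = index[w]
        for u in adj[w]:
            P[i][index[u]] += 1.0 / deg_max               # step along a real edge
        P[i][i] += (deg_max - len(adj[w])) / deg_max      # step along the self loop
    return nodes, P

def step(q, P):
    """One step of the walk: returns the row vector q P."""
    n = len(q)
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical undirected graph: a triangle b-c-d with a pendant node a.
adj = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b", "d"], "d": ["b", "c"]}
nodes, P = regular_walk_matrix(adj, deg_max=5)   # overestimate of the true maximum, 3

q = [1.0, 0.0, 0.0, 0.0]      # start at node a
for _ in range(200):
    q = step(q, P)            # q approaches the uniform distribution
```

The self loops also make the walk aperiodic, so no separate non-bipartiteness condition is needed once at least one node has a self loop of positive weight.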

16 16 Mixing Time Analysis
Theorem: Mixing time of a random walk is $\log(|W|) / (1 - \lambda_2)$
- $1 - \lambda_2$: spectral gap of P
Experiment (over a large web crawl):
- $1 - \lambda_2$ ~ 1/100,000
- log(|W|) ~ 34
Hence: mixing time ~ 3.4 million steps
- Self loop steps are free
- About 1 in 30,000 steps is not a self loop step ($\deg_{\max}$ ~ 300,000, $\deg_{avg}$ ~ 10)
- Actual mixing time: ~ 115 steps!
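The arithmetic behind these estimates can be reproduced in a few lines. The numbers are the ones quoted on the slide; the variable names are only illustrative.

```python
spectral_gap = 1 / 100_000     # 1 - lambda_2, measured on the web crawl
log_size = 34                  # log(|W|), as quoted
deg_max = 300_000
deg_avg = 10

# Mixing time counted in all steps, self loops included.
nominal_steps = log_size / spectral_gap

# Fraction of steps that follow a real link rather than a self loop.
real_step_fraction = deg_avg / deg_max          # 1 in 30,000

# Only the real steps cost anything (a page fetch).
actual_steps = nominal_steps * real_step_fraction

print(round(nominal_steps), round(actual_steps))
```

The product comes out at roughly 113 real steps, which the slide rounds to about 115.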

17 17 Random Walks on Regular Graphs: Reality
How to get incoming links?
- Search engines
  - Potential bias towards the search engine’s index
  - May not provide the full list of in-links
  - Costly communication
- Random walk’s history
  - Important for avoiding dead ends
  - Requires storage
How to estimate deg(w)?
Solution: run the random walk on the sub-graph of W spanned by the available links
- Sub-graph may no longer have the good mixing time properties

18 18 Top 20 Internet Domains (Summer 2003)

19 19 Search Engine Coverage (Summer 2000)

20 20 Random Sampling from a Search Engine’s Index

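21 21 Search Engine Samplers
[Diagram: the sampler sends queries to the search engine’s public interface and receives the top k results for each; it outputs a random document $x \in D$, where D is the set of documents from the web that the search engine has indexed]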

22 22 Motivation
Useful tool for search engine evaluation:
- Freshness
  - Fraction of up-to-date pages in the index
- Topical bias
  - Identification of overrepresented/underrepresented topics
- Spam
  - Fraction of spam pages in the index
- Security
  - Fraction of pages in the index infected by viruses/worms/trojans
- Relative size
  - Number of documents indexed compared with other search engines

23 23 Size Wars
- August 2005: “We index 20 billion documents.”
- September 2005: “We index 8 billion documents, but our index is 3 times larger than our competition’s.”
So, who’s right?

24 24 The Bharat-Broder Sampler: Preprocessing Step
[Diagram: from a large corpus C, build a lexicon L that lists each term $t_1, t_2, \ldots$ together with its frequency freq($t_1$, C), freq($t_2$, C), …]

25 25 The Bharat-Broder Sampler
- The BB sampler picks two random terms $t_1$, $t_2$ from the lexicon L
- It sends the query “$t_1$ AND $t_2$” to the search engine and receives the top k results
- It outputs a random document from the top k results
Samples are uniform only if:
- all queries return the same number of results ≤ k
- all documents are of the same length

26 26 The Bharat-Broder Sampler: Drawbacks
- Documents have varying lengths
  - Bias towards long documents
- Some queries have more than k matches
  - Bias towards documents with high static rank

27 27 Search Engines as Hypergraphs
- results(q) = { documents returned on query q }
- queries(x) = { queries that return x as a result }
- P = query pool = a set of queries
- Query pool hypergraph:
  - Vertices: indexed documents
  - Hyperedges: { results(q) | $q \in P$ }
[Figure: example hypergraph with hyperedges “news”, “bbc”, “google”, “maps”]

28 28 Query Cardinalities and Document Degrees
- Query cardinality: card(q) = |results(q)|
- Document degree: deg(x) = |queries(x)|
- Examples from the figure:
  - card(“news”) = 4, card(“bbc”) = 3
  - one pictured document has degree 1, another has degree 2
[Figure: example hypergraph with hyperedges “news”, “bbc”, “google”, “maps”]
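Both quantities are easy to compute once results(q) is known for every query in the pool. In the sketch below the toy index is an assumption: only card(“news”) = 4 and card(“bbc”) = 3 come from the slide, while the document names and the remaining memberships are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy index: query -> set of documents it returns.
results = {
    "news":   {"d1", "d2", "d3", "d4"},
    "bbc":    {"d3", "d4", "d5"},
    "google": {"d6", "d7"},
    "maps":   {"d7", "d8"},
}

# Invert the hyperedges: document -> set of queries that return it.
queries = defaultdict(set)
for q, docs in results.items():
    for x in docs:
        queries[x].add(q)

card = {q: len(docs) for q, docs in results.items()}   # query cardinalities
deg = {x: len(qs) for x, qs in queries.items()}        # document degrees
```

Each (query, document) incidence is counted once in the cardinalities and once in the degrees, so the two totals always agree: here both sum to 11.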

29 29 Pool-Based Sampler [Bar-Yossef, Gurevich]
Preprocessing step: from a large corpus C, build a query pool P = { $q_1, q_2, \ldots$ }
- Example: P = all 3-word phrases that occur in C
  - If “to be or not to be” occurs in C, P contains: “to be or”, “be or not”, “or not to”, “not to be”
- Choose P that “covers” most documents in D
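Building such a phrase pool is a sliding window over each document. A minimal sketch, assuming whitespace tokenization and a corpus given as a list of strings (both are simplifications; the function name is hypothetical):

```python
def phrase_pool(corpus, n=3):
    """Return the set of all n-word phrases that occur in the corpus."""
    pool = set()
    for doc in corpus:
        words = doc.split()                      # naive whitespace tokenization
        for i in range(len(words) - n + 1):      # slide a window of n words
            pool.add(" ".join(words[i:i + n]))
    return pool

pool = phrase_pool(["to be or not to be"])
```

For the example sentence this yields exactly the four phrases listed above.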

30 30 Monte Carlo Simulation
- We don’t know how to generate uniform samples from D directly
- How to use biased samples to generate uniform samples?
- Samples with weights that represent their bias can be used to simulate uniform samples
Monte Carlo simulation methods:
- Rejection Sampling
- Importance Sampling
- Metropolis-Hastings
- Maximum-Degree

31 31 Document Degree Distribution
- Can generate biased samples from the “document degree distribution”: $p(x) = \frac{\deg(x)}{\sum_{x' \in D} \deg(x')}$
- Advantage: can compute weights representing the bias of p: $w(x) = \frac{1}{\deg(x)}$

32 32 Unnormalized forms of distributions
- p: a distribution on a domain D
- Unnormalized form of p: $\hat{p}(x) = Z_p \cdot p(x)$
  - $Z_p$: (unknown) normalization constant
- Examples:
  - p = uniform: $\hat{p}(x) = 1$
  - p = degree distribution: $\hat{p}(x) = \deg(x)$

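33 33 Monte Carlo Simulation
- $\pi$: target distribution
  - In our case: $\pi$ = uniform on D
- p: trial distribution
  - In our case: p = document degree distribution
- Bias weight of p(x) relative to $\pi(x)$: $w(x) = \hat{\pi}(x) / \hat{p}(x)$, the ratio of the unnormalized forms of $\pi$ and p
  - In our case: $w(x) = 1/\deg(x)$
[Diagram: a p-sampler produces samples from p; the Monte Carlo simulator turns the weighted samples $(x_1, w(x_1)), (x_2, w(x_2)), \ldots$ into a sample x from $\pi$]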

34 34 Rejection Sampling [von Neumann]
- C: envelope constant
  - C ≥ w(x) for all x
- The algorithm:
    accept := false
    while (not accept)
        generate a sample x from p
        toss a coin whose heads probability is w(x)/C
        if coin comes up heads, accept := true
    return x
- In our case: C = 1 and acceptance prob = 1/deg(x)
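The loop translates almost line by line into code. The sketch below is generic in the trial sampler and the weight function; the usage part assumes a made-up set of three documents with known degrees, a degree distribution sampler simulated with `random.choices`, and the acceptance probability w(x)/C with w(x) = 1/deg(x) and C = 1.

```python
import random

def rejection_sample(sample_trial, weight, C, rng):
    """Simulate one sample from the target using samples from the trial distribution p.

    sample_trial() draws x from p; weight(x) is the bias weight w(x); C >= w(x) for all x.
    """
    while True:
        x = sample_trial()                     # generate a sample x from p
        if rng.random() < weight(x) / C:       # coin with heads probability w(x)/C
            return x                           # heads: accept

deg = {"x1": 1, "x2": 2, "x3": 4}              # hypothetical document degrees
rng = random.Random(0)
docs, degs = list(deg), list(deg.values())

def trial():
    """Stand-in for the degree distribution sampler: p(x) proportional to deg(x)."""
    return rng.choices(docs, weights=degs)[0]

samples = [rejection_sample(trial, lambda x: 1 / deg[x], 1.0, rng) for _ in range(30_000)]
share = {x: samples.count(x) / len(samples) for x in deg}
```

Although x3 is proposed four times as often as x1, it is accepted only a quarter of the time, so the accepted samples come out close to uniform.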

35 35 Pool-Based Sampler
[Diagram: the degree distribution sampler sends queries $q_1, q_2, \ldots$ to the search engine and receives results($q_1$), results($q_2$), …; it outputs documents sampled from the degree distribution with corresponding weights, $(x_1, 1/\deg(x_1)), (x_2, 1/\deg(x_2)), \ldots$; rejection sampling turns these into a uniform sample x]
Degree distribution: $p(x) = \deg(x) / \sum_{x'} \deg(x')$

36 36 Sampling documents by degree
- Select a random $q \in P$
- Select a random $x \in$ results(q)
- Documents with high degree are more likely to be sampled
- If we sample q uniformly $\Rightarrow$ “oversample” documents that belong to narrow queries
- We need to sample q proportionally to its cardinality
[Figure: example hypergraph with hyperedges “news”, “bbc”, “google”, “maps”]

37 37 Query Cardinality Distribution
- results(q) = results returned on query q
- card(q) = |results(q)|
- Cardinality distribution: $c(q) = \frac{\mathrm{card}(q)}{\sum_{q' \in P} \mathrm{card}(q')}$
Unrealistic assumptions:
- Can sample queries from the cardinality distribution
  - In practice, don’t know a priori card(q) for all $q \in P$
- $\forall q \in P$, card(q) ≤ k
  - In practice, some queries overflow (card(q) > k)

38 38 Degree Distribution Sampler
[Diagram: the cardinality distribution sampler outputs a query q; the degree distribution sampler submits q to the search engine, receives results(q), and samples x uniformly from results(q)]
Analysis: if the query is sampled from the cardinality distribution, then the document is sampled from the degree distribution
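One way to see why this two-stage procedure yields the degree distribution is the short calculation below. It is a reconstruction under the slide's assumptions (no query overflows, q is drawn from the cardinality distribution, x is drawn uniformly from results(q)), not necessarily the derivation shown in the lecture.

```latex
\begin{align*}
\Pr[x]
  &= \sum_{q \in \mathrm{queries}(x)} \Pr[q] \cdot \Pr[x \mid q] \\
  &= \sum_{q \in \mathrm{queries}(x)}
     \frac{\mathrm{card}(q)}{\sum_{q' \in P} \mathrm{card}(q')}
     \cdot \frac{1}{\mathrm{card}(q)} \\
  &= \frac{\deg(x)}{\sum_{q' \in P} \mathrm{card}(q')}
   = \frac{\deg(x)}{\sum_{x' \in D} \deg(x')}
\end{align*}
```

The last equality holds because every pair (q, x) with x in results(q) is counted exactly once in the sum of cardinalities and exactly once in the sum of degrees.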

39 39 Sampling queries by cardinality
- Sampling queries from the pool uniformly: easy
- Sampling queries from the pool by cardinality: hard
  - Requires knowing the cardinalities of all queries in the search engine
- Use Monte Carlo methods to simulate biased sampling via uniform sampling:
  - Target distribution: the cardinality distribution
  - Trial distribution: uniform distribution on P

40 40 Sampling queries by cardinality
- Bias weight of the cardinality distribution relative to the uniform distribution: w(q) = card(q)
  - Can be computed using a single search engine query
- Use rejection sampling:
  - Envelope constant for rejection sampling: C = k
  - Queries are sampled uniformly from P
  - Each query q is accepted with probability card(q)/k

41 41 Complete Pool-Based Sampler
[Diagram: a uniform query sampler produces a uniform query sample with weights, (q, card(q)), …; rejection sampling turns it into a query sampled from the cardinality distribution; the degree distribution sampler sends the query to the search engine, receives (q, results(q)), …, and outputs documents sampled from the degree distribution with corresponding weights, (x, 1/deg(x)), …; a second rejection sampling step outputs a uniform document sample]
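The two rejection sampling stages can be put together in a short end-to-end sketch. Everything concrete here is an assumption for illustration: the toy index, the `search` function standing in for the search engine, the result limit k = 4, and in particular the way deg(x) is computed by querying the whole pool, which is only feasible on a toy example.

```python
import random

def pool_based_sample(pool, search, k, rng):
    """Return one document, uniform over the documents covered by the pool."""
    while True:
        # Stage 1: uniform query, accepted with probability card(q)/k.
        q = rng.choice(pool)
        res = search(q)
        if not res or len(res) > k:            # skip empty and overflowing queries
            continue
        if rng.random() >= len(res) / k:
            continue
        # q now follows the cardinality distribution; x follows the degree distribution.
        x = rng.choice(res)
        # Stage 2: accept x with probability 1/deg(x).
        deg = sum(1 for q2 in pool if x in search(q2))   # toy shortcut for deg(x)
        if rng.random() < 1.0 / deg:
            return x

# Hypothetical toy index standing in for a search engine.
INDEX = {
    "news":   ["d1", "d2", "d3", "d4"],
    "bbc":    ["d3", "d4", "d5"],
    "google": ["d6", "d7"],
    "maps":   ["d7", "d8"],
}

def search(q):
    """Hypothetical search interface: returns the list of results for query q."""
    return INDEX.get(q, [])

rng = random.Random(0)
pool = list(INDEX)
samples = [pool_based_sample(pool, search, 4, rng) for _ in range(40_000)]
share = {x: samples.count(x) / len(samples) for x in sorted(set(samples))}
```

In one pass through the loop, a fixed document x is returned with probability 1/(|P|·k) regardless of its degree, which is why all eight documents end up with a share close to 1/8.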

42 42 Dealing with Overflowing Queries
Problem: some queries may overflow (card(q) > k)
- Bias towards highly ranked documents
Solutions:
- Select a pool P in which overflowing queries are rare (e.g., phrase queries)
- Skip overflowing queries
- Adapt rejection sampling to deal with approximate weights
Theorem: Samples of the PB sampler are at most $\epsilon$-away from uniform ($\epsilon$ = overflow probability of P).

43 43 Relative Sizes of Google, MSN and Yahoo!
- Google = 1
- Yahoo! = 1.28
- MSN Search = 0.73

44 44 End of Lecture 10

