1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 10 June 4, 2006


1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 10 June 4, 2006

2 Random Sampling of Web Pages

3 Outline
- Problem definition
- Random sampling of web pages according to their PageRank
- Uniform sampling of web pages (Henzinger et al)
- Uniform sampling of web pages (Bar-Yossef et al)

4 Random Sampling of Web Pages
W = a snapshot of the "indexable web"
- Consider only "static" HTML web pages
π = a probability distribution over W
Goal: Design an efficient algorithm for generating samples from W distributed according to π.
Our focus:
- π = PageRank
- π = Uniform

5 Random Sampling of Web Pages: Motivation
Compute statistics about the web
- Ex: What fraction of web pages belong to .il?
- Ex: What fraction of web pages are written in Chinese?
- Ex: What fraction of hyperlinks are advertisements?
Compare coverage of search engines
- Ex: Is Google larger than MSN?
- Ex: What is the overlap between Google and Yahoo?
Data mining of the web
- Ex: How frequently do computer science pages cite biology pages?
- Ex: How are pages distributed by topic?

6 Random Sampling of Web Pages: Challenges
Naïve solution: crawl, index, sample
- Crawls cannot get complete coverage
- The web is constantly changing
- Crawling is slow and expensive
Our goals:
- Accuracy: generate samples from a snapshot of the entire indexable web
- Speed: samples should be generated quickly
- Low cost: the sampling procedure should run on a desktop PC

7 A Random Walk Approach
Design a random walk on W whose stationary distribution is π
- P = the random walk's probability transition matrix
- πP = π
Run the random walk for sufficiently many steps
- Recall: for any initial distribution q, qPᵗ → π as t → ∞
- Mixing time: # of steps required to get close to the limit
Use the reached node as a sample
Repeat for as many samples as needed

8 A Random Walk Approach: Advantages & Issues
Advantages:
- Accuracy: the random walk can potentially visit every page on the web
- Speed: no need to scan the whole web
- Low cost: no need for large storage or multiple processors
Issues:
- How to design the random walk so it converges to π?
- How to analyze the mixing time of the random walk?

9 PageRank Sampling [Henzinger et al 1999]
Use the "random surfer" random walk:
- Start at some initial node v₀
- When visiting a page v:
  - Toss a coin with heads probability ε
  - If the coin is heads, go to a uniformly chosen page
  - If the coin is tails, go to a random out-neighbor of v
Limit distribution: PageRank
Mixing time: fast (we will see later)
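The random-surfer walk above is easy to simulate. The sketch below uses a hypothetical four-page web (the graph, the jump probability eps = 0.15, and all names are illustrative assumptions, not data from the lecture); visit ratios of a long walk approximate PageRank.

```python
import random

# Toy web graph as out-link adjacency lists (hypothetical example data).
WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],   # "d" has out-links but no in-links
}

def random_surfer(graph, steps=100_000, eps=0.15, seed=0):
    """Simulate the random surfer: with probability eps jump to a uniformly
    chosen page, otherwise follow a random out-link of the current page."""
    rng = random.Random(seed)
    pages = list(graph)
    v = rng.choice(pages)
    visits = {p: 0 for p in pages}
    for _ in range(steps):
        if rng.random() < eps or not graph[v]:
            v = rng.choice(pages)        # uniform jump (also taken at dead ends)
        else:
            v = rng.choice(graph[v])     # follow a random out-neighbor
        visits[v] += 1
    return {p: c / steps for p, c in visits.items()}

pr = random_surfer(WEB)
# Visit ratios approximate PageRank; "d", having no in-links, gets the
# smallest share (it is reached only via uniform jumps).
```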

10 PageRank Sampling: Reality
Problem: how to pick a page uniformly at random?
Solutions:
- Jump to a random page from the history of the walk
  - Creates bias towards dense web-sites
- Pick a random host from the hosts in the walk's history and jump to a random page from the pages visited on that host
  - Not converging to PageRank anymore
  - Experiments indicate it is still fine

11 Uniform Sampling via PageRank Sampling [Henzinger et al 2000]
Sampling algorithm:
1. Use the previous random walk to generate a sample w according to the PageRank distribution
2. Toss a coin with heads probability proportional to 1/PR(w)
3. If the coin is heads, output w as a sample
4. If the coin is tails, go to step 1
Analysis:
- Need C/|W| iterations until getting a single sample
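The rejection step above can be sketched concretely. Assuming we can draw PageRank-distributed samples and evaluate PR(w) (here faked with hypothetical values on a three-page web), accepting w with probability PR_min/PR(w) cancels the PageRank bias and yields uniform samples:

```python
import random
from collections import Counter

# Hypothetical PageRank values on a tiny three-page web.
PR = {"u": 0.6, "v": 0.3, "w": 0.1}
rng = random.Random(2)
pages, probs = zip(*PR.items())
pr_min = min(PR.values())

def uniform_sample():
    """Draw x ~ PageRank, accept with probability PR_min / PR(x);
    accepted samples are uniform over the pages."""
    while True:
        x = rng.choices(pages, weights=probs)[0]
        if rng.random() < pr_min / PR[x]:
            return x

counts = Counter(uniform_sample() for _ in range(3000))
# each of the three pages now appears roughly equally often
```

Note the cost: low-PageRank pages force a small acceptance probability, which is why the slide's analysis charges many iterations per output sample.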

12 Uniform Sampling via PageRank Sampling: Reality
How to estimate PR(w)?
- Use the random walk itself: VR(w) = visit ratio of w (# of times w was visited by the walk divided by the length of the walk)
  - Approximation is very crude
- Use the subgraph spanned by the nodes visited to compute PageRank
  - Bias towards the neighborhood of the initial page
- Use Google

13 Uniform Sampling by RW on Regular Graphs [Bar-Yossef et al 2000]
Fact: A random walk on an undirected, connected, non-bipartite, and regular graph converges to a uniform distribution.
Proof:
- P: the random walk's probability transition matrix
  - P is stochastic, so 1 is a right eigenvector with eigenvalue 1: P1 = 1
- Graph is connected ⇒ RW is irreducible
- Graph is non-bipartite ⇒ RW is aperiodic
- Hence, the RW is ergodic, and thus has a stationary distribution π:
  - π is a left eigenvector of P with eigenvalue 1: πP = π

14 Random Walks on Regular Graphs
Proof (cont.):
- d: the graph's degree
- A: the graph's adjacency matrix
  - Symmetric, because the graph is undirected
- P = (1/d) A
  - Hence P is also symmetric
  - Its left eigenvectors and right eigenvectors are the same
- Therefore π = (1/n) 1
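The fact can be checked numerically on a small example. Taking K₄ (a hypothetical 3-regular, connected, non-bipartite graph, chosen only for illustration), repeated application of q ↦ qP drives any initial distribution to the uniform vector:

```python
# K4: complete graph on 4 nodes, 3-regular, non-bipartite.
n, d = 4, 3
A = [[0, 1, 1, 1],
     [1, 0, 1, 1],
     [1, 1, 0, 1],
     [1, 1, 1, 0]]
P = [[a / d for a in row] for row in A]   # P = (1/d) A, symmetric and stochastic

q = [1.0, 0.0, 0.0, 0.0]   # arbitrary initial distribution
for _ in range(50):
    # one step of q <- qP
    q = [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]
# q has converged to the uniform vector (1/n, ..., 1/n)
```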

15 Web as a Regular Graph
Problems:
- The web is not connected
- The web is directed
- The web is non-regular
Solutions:
- Focus on the indexable web, which is connected
- Ignore directions of links
- Add a weighted self loop to each node: weight(w) = deg_max − deg(w)
  - All pages then have degree deg_max
  - An overestimate of deg_max doesn't hurt
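The self-loop trick can be sketched as a walk: at a node v, move along a real edge with probability deg(v)/deg_max and stay put otherwise, so every node effectively has degree deg_max and the stationary distribution is uniform. The graph and deg_max below are hypothetical (deg_max deliberately overestimates the true maximum degree of 3):

```python
import random

# Undirected toy graph, link directions ignored (hypothetical data).
ADJ = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a", "d"],
    "d": ["a", "c"],
}
DEG_MAX = 10   # overestimate of the maximum degree; correctness is unaffected

def max_degree_walk(adj, deg_max, steps=200_000, seed=1):
    """Walk on the regularized graph: each node gets a self loop of weight
    deg_max - deg(v), making the walk doubly stochastic, hence uniform."""
    rng = random.Random(seed)
    v = next(iter(adj))
    visits = {u: 0 for u in adj}
    for _ in range(steps):
        if rng.random() < len(adj[v]) / deg_max:
            v = rng.choice(adj[v])   # traverse a real edge
        # else: take the self loop and stay at v
        visits[v] += 1
    return {u: c / steps for u, c in visits.items()}

freq = max_degree_walk(ADJ, DEG_MAX)
# all four visit ratios approach 1/4 despite the uneven degrees
```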

16 Mixing Time Analysis
Theorem: The mixing time of the random walk is log(|W|) / (1 − λ₂)
- 1 − λ₂: spectral gap of P
Experiment (over a large web crawl):
- 1 − λ₂ ≈ 1/100,000
- log(|W|) ≈ 34
Hence: mixing time ≈ 3.4 million steps
- Self loop steps are free
- About 1 in 30,000 steps is not a self loop step (deg_max ≈ 300,000, deg_avg ≈ 10)
- Actual mixing time: ≈ 115 steps!
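The arithmetic behind these figures can be reproduced directly (|W| is chosen here only so that log₂|W| ≈ 34, matching the slide; the gap and degree figures are the slide's measured values):

```python
from math import log

W = 2 ** 34                      # hypothetical web size with log2(|W|) ~ 34
spectral_gap = 1 / 100_000       # 1 - lambda_2, measured over a large crawl
mixing_steps = log(W, 2) / spectral_gap   # ~3.4 million steps in total

# Self-loop steps cost nothing; only ~deg_avg/deg_max of the steps
# traverse a real edge.
deg_avg, deg_max = 10, 300_000
real_steps = mixing_steps * deg_avg / deg_max   # ~113 real steps, i.e. ~115
```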

17 Random Walks on Regular Graphs: Reality
How to get incoming links?
- Search engines
  - Potential bias towards the search engine's index
  - Do not provide the full list of in-links
  - Costly communication
- Random walk's history
  - Important for avoiding dead ends
  - Requires storage
How to estimate deg(w)?
Solution: run the random walk on the sub-graph of W spanned by the available links
- The sub-graph may no longer have the good mixing time properties

18 Top 20 Internet Domains (Summer 2003)

19 Search Engine Coverage (Summer 2000)

20 Random Sampling from a Search Engine’s Index

21 Search Engine Samplers
[Figure: a sampler interacts with the search engine's public interface, submitting queries and receiving the top k results, in order to produce a random document x ∈ D, the set of indexed documents.]

22 Motivation
A useful tool for search engine evaluation:
- Freshness: fraction of up-to-date pages in the index
- Topical bias: identification of overrepresented/underrepresented topics
- Spam: fraction of spam pages in the index
- Security: fraction of pages in the index infected by viruses/worms/trojans
- Relative size: number of documents indexed compared with other search engines

23 Size Wars
August 2005 (Yahoo!): "We index 20 billion documents."
September 2005 (Google): "We index 8 billion documents, but our index is 3 times larger than our competition's."
So, who's right?

24 The Bharat-Broder Sampler: Preprocessing Step
[Figure: a large corpus C is scanned to build a lexicon L of terms t₁, t₂, … together with their frequencies freq(t₁, C), freq(t₂, C), …]

25 The Bharat-Broder Sampler
[Figure: the BB sampler picks two random terms t₁, t₂ from L, sends the query "t₁ AND t₂" to the search engine, and returns a random document from the top k results.]
Only if:
- all queries return the same number of results ≤ k, and
- all documents are of the same length,
are the samples uniform.
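A minimal sketch of the Bharat-Broder procedure, with a hypothetical three-document "index" (all names and data are illustrative); it also exhibits the bias the conditions above warn about, since the document containing the most terms matches the most conjunctive queries:

```python
import random

rng = random.Random(7)

# Toy "index": document -> set of terms it contains (hypothetical data).
INDEX = {
    "d1": {"apple", "banana"},
    "d2": {"apple", "cherry"},
    "d3": {"apple", "banana", "cherry"},   # the "long" document
}
LEXICON = sorted({t for terms in INDEX.values() for t in terms})

def bb_sample(k=2):
    """One Bharat-Broder trial: draw two random lexicon terms, run the
    AND query, return a random document from the (truncated) top-k results."""
    t1, t2 = rng.sample(LEXICON, 2)
    results = [d for d, terms in INDEX.items() if t1 in terms and t2 in terms]
    results = results[:k]                    # the engine returns only top k
    return rng.choice(results) if results else None

samples = [s for s in (bb_sample() for _ in range(3000)) if s is not None]
# "d3" matches every term pair, so it is heavily oversampled: the
# long-document bias the next slide describes.
```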

26 The Bharat-Broder Sampler: Drawbacks
- Documents have varying lengths
  - Bias towards long documents
- Some queries have more than k matches
  - Bias towards documents with high static rank

27 Search Engines as Hypergraphs
results(q) = { documents returned on query q }
queries(x) = { queries that return x as a result }
P = query pool = a set of queries
Query pool hypergraph:
- Vertices: indexed documents
- Hyperedges: { results(q) | q ∈ P }
Example: queries "news", "bbc", "google", "maps" over documents news.google.com, news.bbc.co.uk, maps.google.com, maps.yahoo.com, en.wikipedia.org/wiki/BBC

28 Query Cardinalities and Document Degrees
Query cardinality: card(q) = |results(q)|
Document degree: deg(x) = |queries(x)|
Examples (for the hypergraph of the previous slide):
- card("news") = 4, card("bbc") = 3
- deg(en.wikipedia.org/wiki/BBC) = 1, deg(news.bbc.co.uk) = 2
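These two quantities fall straight out of the results(q) sets. The sketch below uses hypothetical query memberships chosen to be consistent with the slide's example counts (the actual edges of the figure are not fully recoverable from the transcript):

```python
# Hypothetical results(q) sets consistent with the slide's example counts.
RESULTS = {
    "news": ["news.google.com", "news.bbc.co.uk",
             "maps.google.com", "maps.yahoo.com"],
    "bbc":  ["news.google.com", "news.bbc.co.uk",
             "en.wikipedia.org/wiki/BBC"],
}

def card(q):
    """Query cardinality: number of documents the query returns."""
    return len(RESULTS[q])

def queries(x):
    """Queries in the pool that return document x."""
    return [q for q, docs in RESULTS.items() if x in docs]

def deg(x):
    """Document degree: number of pool queries that return x."""
    return len(queries(x))
```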

29 Pool-Based Sampler [Bar-Yossef, Gurevich]
Preprocessing step: build a query pool P = {q₁, q₂, …} from a large corpus C.
Example: P = all 3-word phrases that occur in C
- If "to be or not to be" occurs in C, P contains: "to be or", "be or not", "or not to", "not to be"
Choose P that "covers" most documents in D

30 Monte Carlo Simulation
We don't know how to generate uniform samples from D directly
How to use biased samples to generate uniform samples?
- Samples with weights that represent their bias can be used to simulate uniform samples
Monte Carlo simulation methods:
- Rejection sampling
- Importance sampling
- Metropolis-Hastings
- Maximum-degree

31 Document Degree Distribution
Can generate biased samples from the "document degree distribution": p(x) = deg(x) / Σ_x' deg(x')
Advantage: can compute weights representing the bias of p: w(x) ∝ 1/deg(x)

32 Unnormalized Forms of Distributions
p: a distribution on a domain D
Unnormalized form of p: p̂(x) = Z·p(x), where Z is an (unknown) normalization constant
Examples:
- p = uniform: p̂(x) = 1
- p = degree distribution: p̂(x) = deg(x)

33 Monte Carlo Simulation
π: target distribution
- In our case: π = uniform on D
p: trial distribution
- In our case: p = document degree distribution
Bias weight of p(x) relative to π(x): w(x) = π̂(x) / p̂(x)
- In our case: w(x) = 1/deg(x)
[Figure: a p-sampler feeds weighted samples (x₁, w(x₁)), (x₂, w(x₂)), … into a Monte Carlo simulator, which outputs a sample x distributed according to π.]

34 Rejection Sampling [von Neumann]
C: envelope constant
- C ≥ w(x) for all x
The algorithm:
- accept := false
- while (not accept):
  - generate a sample x from p
  - toss a coin whose heads probability is w(x)/C
  - if the coin comes up heads, accept := true
- return x
In our case: C = 1 and the acceptance probability is 1/deg(x)
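The loop above translates almost line for line into code. The sketch below runs it with the degree distribution as the trial distribution and w(x) = 1/deg(x), C = 1, as on the slide; the index and its degrees are hypothetical example data:

```python
import random
from collections import Counter

rng = random.Random(3)

def rejection_sampler(trial, weight, C):
    """von Neumann rejection sampling: draw x from the trial distribution
    and accept it with probability weight(x)/C; repeat until acceptance."""
    while True:
        x = trial()
        if rng.random() < weight(x) / C:
            return x

# Trial = document degree distribution over a toy index (hypothetical data).
DEG = {"d1": 1, "d2": 2, "d3": 3}
docs, degs = zip(*DEG.items())

def trial():
    return rng.choices(docs, weights=degs)[0]

counts = Counter(rejection_sampler(trial, lambda x: 1 / DEG[x], C=1)
                 for _ in range(6000))
# the degree bias cancels: all three documents appear about equally often
```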

35 Pool-Based Sampler
[Figure: a degree-distribution sampler submits queries q₁, q₂, … to the search engine and receives results(q₁), results(q₂), …; it emits documents sampled from the degree distribution p(x) = deg(x) / Σ_x' deg(x'), with weights (x₁, 1/deg(x₁)), (x₂, 1/deg(x₂)), …; rejection sampling then turns these into a uniform sample x.]

36 Sampling Documents by Degree
- Select a random q ∈ P
- Select a random x ∈ results(q)
Documents with high degree are more likely to be sampled
If we sample q uniformly, we "oversample" documents that belong to narrow queries
We need to sample q proportionally to its cardinality

37 Query Cardinality Distribution
results(q) = results returned on query q
card(q) = |results(q)|
Cardinality distribution: p(q) = card(q) / Σ_q' card(q')
Unrealistic assumptions:
- Can sample queries from the cardinality distribution
  - In practice, we don't know card(q) a priori for all q ∈ P
- ∀q ∈ P, card(q) ≤ k
  - In practice, some queries overflow (card(q) > k)

38 Degree Distribution Sampler
Analysis:
- Sample a query q from the cardinality distribution
- Submit q to the search engine and sample x uniformly from results(q)
- The resulting document x is distributed according to the degree distribution
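The two-step sampler above can be checked on a toy pool (all data hypothetical): picking q proportionally to card(q) and then x uniformly from results(q) makes each document appear with probability proportional to its degree.

```python
import random
from collections import Counter

rng = random.Random(11)

# Hypothetical pool: card("q1") = 3, card("q2") = 1;
# "c" is returned by both queries, so deg("c") = 2.
RESULTS = {"q1": ["a", "b", "c"], "q2": ["c"]}
queries, cards = zip(*[(q, len(r)) for q, r in RESULTS.items()])

def degree_sample():
    """Pick q proportionally to card(q), then x uniformly from results(q);
    x then comes out with probability proportional to deg(x)."""
    q = rng.choices(queries, weights=cards)[0]
    return rng.choice(RESULTS[q])

counts = Counter(degree_sample() for _ in range(8000))
# deg("c") = 2 while deg("a") = deg("b") = 1, so "c" appears about
# twice as often as "a" or "b"
```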

39 Sampling Queries by Cardinality
Sampling queries from the pool uniformly: easy
Sampling queries from the pool by cardinality: hard
- Requires knowing the cardinalities of all queries in the search engine
Use Monte Carlo methods to simulate biased sampling via uniform sampling:
- Target distribution: the cardinality distribution
- Trial distribution: uniform distribution on P

40 Sampling Queries by Cardinality
Bias weight of the cardinality distribution relative to the uniform distribution: w(q) = card(q)
- Can be computed using a single search engine query
Use rejection sampling:
- Envelope constant for rejection sampling: C = k
- Queries are sampled uniformly from P
- Each query q is accepted with probability w(q)/C = card(q)/k
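This inner rejection loop is easy to sketch. Assuming a hypothetical pool whose cardinalities are all ≤ k (so no query overflows and C = k is a valid envelope), uniform trials accepted with probability card(q)/k come out distributed proportionally to card(q):

```python
import random
from collections import Counter

rng = random.Random(5)

# Hypothetical pool with known cardinalities, all bounded by k.
CARD = {"q1": 3, "q2": 1, "q3": 2}
k = 3
pool = list(CARD)

def sample_query_by_cardinality():
    """Uniform trials from P, each accepted with probability card(q)/k,
    yield queries distributed proportionally to card(q)."""
    while True:
        q = rng.choice(pool)                 # uniform over the pool: easy
        if rng.random() < CARD[q] / k:       # card(q) costs one engine query
            return q

counts = Counter(sample_query_by_cardinality() for _ in range(6000))
# counts come out roughly in the ratio 3 : 1 : 2
```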

41 Complete Pool-Based Sampler
[Figure: a uniform query sampler feeds rejection sampling (with weights (q, card(q))) to produce queries sampled from the cardinality distribution; the degree-distribution sampler submits them to the search engine, receives (q, results(q)), and emits documents with weights (x, 1/deg(x)); a second round of rejection sampling yields a uniform document sample.]

42 Dealing with Overflowing Queries
Problem: some queries may overflow (card(q) > k)
- Bias towards highly ranked documents
Solutions:
- Select a pool P in which overflowing queries are rare (e.g., phrase queries)
- Skip overflowing queries
- Adapt rejection sampling to deal with approximate weights
Theorem: Samples of the PB sampler are at most ε-away from uniform (ε = overflow probability of P).

43 Relative Sizes of Google, MSN and Yahoo!
- Google = 1
- Yahoo! = 1.28
- MSN Search = 0.73

44 End of Lecture 10