1 Random Sampling from a Search Engine’s Index. Ziv Bar-Yossef and Maxim Gurevich, Department of Electrical Engineering, Technion. Presentation at group meeting, Oct. 24, by Allen (Zhenjiang Lin).

2 Outline
- Introduction
  - Search engine samplers
  - Motivation
- The Bharat-Broder sampler (WWW'98)
- Infrastructure of the proposed methods
  - Search engines as hypergraphs
  - Monte Carlo simulation methods: rejection sampling
- The pool-based sampler
- The random walk sampler
- Experimental results
- Conclusions

3 Search Engine Samplers
[Diagram: a sampler interacts with the search engine's public interface, issuing queries and receiving the top-k results, in order to output a random document x ∈ D, the set of indexed documents crawled from the web.]

4 Motivation
A useful tool for search engine evaluation:
- Freshness: fraction of up-to-date pages in the index
- Topical bias: identification of overrepresented/underrepresented topics
- Spam: fraction of spam pages in the index
- Security: fraction of indexed pages infected by viruses/worms/trojans
- Relative size: number of documents indexed compared with other search engines

5 Size Wars
- August 2005: "We index 20 billion documents."
- September 2005: "We index 8 billion documents, but our index is 3 times larger than our competition's."
So, who's right?

6 Why Does Size Matter, Anyway?
- Comprehensiveness: a good crawler covers as many documents as possible
- Narrow-topic queries: e.g., finding the homepage of John Doe
- Prestige: a marketing advantage

7 Measuring Size Using Random Samples [BharatBroder98, CheneyPerry05, GulliSignorini05]
Sample pages uniformly at random from the search engine's index. Two alternatives:
- Absolute size estimation: sample until a collision occurs. A collision is expected after k ≈ N^(1/2) random samples (birthday paradox), so return k² as the size estimate.
- Relative size estimation: check how many samples from search engine A are present in search engine B, and vice versa.
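As an illustration, here is a minimal Python sketch of the collision-based estimator. The `sample()` function is an assumed interface that returns a uniformly random document ID; producing such samples is exactly the hard problem the rest of this talk addresses.

```python
import random

def estimate_index_size(sample):
    """Draw uniform samples until the first collision, then return k^2
    (birthday paradox: a collision is expected after k ~ N^(1/2) draws).
    `sample` is an assumed interface returning a random document ID."""
    seen = set()
    k = 0
    while True:
        doc = sample()
        k += 1
        if doc in seen:
            return k * k
        seen.add(doc)

# Sanity check against a simulated index of known size N:
N = 1_000_000
estimates = [estimate_index_size(lambda: random.randrange(N)) for _ in range(200)]
print(sum(estimates) / len(estimates))  # same order of magnitude as N; high variance
```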

8 Related Work
- Random sampling from a search engine's index [BharatBroder98, CheneyPerry05, GulliSignorini05]
- Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]
- Queries from user query logs [LawrenceGiles98, DobraFienberg04]
- Random sampling from the whole web [Henzinger et al. 00, Bar-Yossef et al. 00, Rusmevichientong et al. 01]

9 The Bharat-Broder Sampler: Preprocessing Step
[Diagram: a large corpus C is scanned to build a lexicon L, listing each term t1, t2, … together with its corpus frequency freq(t1,C), freq(t2,C), ….]

10 The Bharat-Broder Sampler
[Diagram: the BB sampler draws two random terms t1, t2 from the lexicon L, submits the query "t1 AND t2" to the search engine, and returns a random document from the top-k results.]
Samples are uniform only if:
- all queries return the same number of results ≤ k, and
- all documents are of the same length.
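A minimal sketch of this sampler, assuming a hypothetical `search(query)` interface that returns the top results as a list of document IDs:

```python
import random

def bharat_broder_sample(lexicon, search, k=100):
    """One BB sample: conjoin two random lexicon terms, query the engine,
    and return a random document from the top-k results.
    `lexicon` is a list of terms; `search` is an assumed interface."""
    while True:
        t1, t2 = random.sample(lexicon, 2)
        results = search(f"{t1} AND {t2}")[:k]
        if results:  # retry if the conjunction matches nothing
            return random.choice(results)
```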

11 The Bharat-Broder Sampler: Drawbacks
- Documents have varying lengths: bias towards long documents
- Some queries have more than k matches: bias towards documents with high static rank

12 Two Novel Samplers
- A pool-based sampler (the focus of this talk)
  - Guaranteed to produce near-uniform samples
  - Needs a lexicon / query pool
- A random walk sampler
  - After sufficiently many steps, guaranteed to produce near-uniform samples
  - Does not need an explicit lexicon / pool at all!

13 Search Engines as Hypergraphs
- results(q) = { documents returned on query q }
- queries(x) = { queries that return x as a result }
- P = query pool = a set of queries
- Query pool hypergraph:
  - Vertices: indexed documents
  - Hyperedges: { results(q) | q ∈ P }
[Example hypergraph: vertices news.google.com, news.bbc.co.uk, maps.google.com, maps.yahoo.com, en.wikipedia.org/wiki/BBC; hyperedges for the queries "news", "bbc", "google", "maps".]

14 Query Cardinalities and Document Degrees
- Query cardinality: card(q) = |results(q)|
- Document degree: deg(x) = |queries(x)|
Examples (in the hypergraph above):
- card("news") = 4, card("bbc") = 3
- deg(en.wikipedia.org/wiki/BBC) = 1, deg(news.bbc.co.uk) = 2

15 Sampling Documents Uniformly
- Sampling documents from D uniformly: hard
- Sampling documents from D non-uniformly: easier
- Will show later: we can sample documents proportionally to their degrees, i.e., from p(x) = deg(x) / Σx' deg(x')

16 Sampling Documents by Degree
In the example hypergraph the degrees sum to 13, so:
- p(news.bbc.co.uk) = 2/13
- p(en.wikipedia.org/wiki/BBC) = 1/13

17 Monte Carlo Simulation
- We need: samples from the uniform distribution
- We have: samples from the degree distribution
- Can we somehow use the samples from the degree distribution to generate samples from the uniform distribution?
- Yes! Monte Carlo simulation methods: rejection sampling, importance sampling, Metropolis-Hastings, maximum-degree

18 Rejection Sampling Algorithm
Goal: sample values from an arbitrary target distribution f(x) using an instrumental distribution g(x). The algorithm (due to John von Neumann):
- Sample x from g(x) and u from U(0,1).
- Check whether u < f(x) / (M g(x)).
  - If this holds, accept x as a realization of f(x);
  - if not, reject x and repeat the sampling step.
Here M > 1 is a bound satisfying M ≥ f(x)/g(x) for all x ∈ D, which guarantees f(x)/(M g(x)) ≤ 1. Correctness: the density of accepted samples is p_RS(x) ∝ g(x) · f(x)/(M g(x)) = f(x)/M, i.e., accepted samples are distributed according to f(x).

19 Rejection Sampling: An Example
- Sampling u.a.r. from the square (density g(x)): easy
- Sampling u.a.r. from the inscribed disc (density f(x)): hard
Since f(x) = F inside the disc and g(x) = G on the square, set M = F/G.
- Generate a candidate point x from the unit square, i.e., from g(x).
- If x is in the disc, f(x) = F ≠ 0, so f(x)/(M g(x)) = 1: accept x.
- If x is in the square but not the disc, f(x) = 0, so f(x)/(M g(x)) = 0: reject x.
Therefore x is sampled u.a.r. from the unit disc.
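This example is easy to run end to end; a minimal sketch:

```python
import random

def sample_unit_disc():
    """Propose uniformly from the square [-1,1]^2 (the easy distribution g)
    and accept only points inside the unit disc (the target f)."""
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:  # f/(M*g) = 1 inside the disc, 0 outside
            return (x, y)

# The acceptance rate approaches area(disc)/area(square) = pi/4 ~ 0.785.
```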

20 Monte Carlo Simulation
- π: target distribution. In our case: π = uniform on D.
- p: trial distribution. In our case: p = degree distribution.
- Bias weight of p(x) relative to π(x): w(x) = π(x) / p(x). In our case: w(x) ∝ 1/deg(x).
[Diagram: a p-sampler feeds weighted samples (x1, w(x1)), (x2, w(x2)), … into a Monte Carlo simulator, which outputs a sample x distributed according to π.]

21 Bias Weights
- Unnormalized forms of π and p: π(x) = π̂(x) / Zπ and p(x) = p̂(x) / Zp, where Zπ, Zp are (unknown) normalization constants.
- Examples:
  - π = uniform: π̂(x) = 1
  - p = degree distribution: p̂(x) = deg(x)
- Bias weight: w(x) = π̂(x) / p̂(x) = 1/deg(x)

22 Rejection Sampling [von Neumann]
- C: envelope constant, C ≥ w(x) for all x
- The algorithm:
  - accept := false
  - while (not accept):
    - generate a sample x from p
    - toss a coin whose heads probability is w(x)/C
    - if the coin comes up heads, accept := true
  - return x
- In our case: C = 1 and the acceptance probability is 1/deg(x)
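In code, this degree-to-uniform correction is a single loop; a sketch assuming hypothetical `degree_sample()` and `deg(x)` interfaces:

```python
import random

def uniform_sample(degree_sample, deg):
    """Rejection sampling with C = 1: draw x from the degree distribution
    and accept it with probability w(x)/C = 1/deg(x), yielding a
    near-uniform sample. Both arguments are assumed interfaces."""
    while True:
        x = degree_sample()
        if random.random() < 1.0 / deg(x):
            return x
```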

23 Pool-Based Sampler
[Diagram: a degree-distribution sampler submits queries q1, q2, … to the search engine and receives results(q1), results(q2), …; the documents it emits, weighted as (x1, 1/deg(x1)), (x2, 1/deg(x2)), …, pass through a rejection-sampling stage that outputs a uniform sample x.]
Degree distribution: p(x) = deg(x) / Σx' deg(x')

24 Sampling Documents by Degree
- Select a random query q
- Select a random x ∈ results(q)
Documents with high degree are more likely to be sampled, but if we sample q uniformly we "oversample" documents that belong to narrow queries, because queries carry different weights. We need to sample q proportionally to its cardinality.

25 Sampling Documents by Degree (2)
- Select a query q proportionally to its cardinality
- Select a random x ∈ results(q)
Analysis: Pr[x] = Σ_{q ∈ queries(x)} [card(q) / Σq' card(q')] · [1/card(q)] = deg(x) / Σq' card(q'), which is exactly the degree distribution, since Σq' card(q') = Σx' deg(x').

26 Degree Distribution Sampler
[Diagram: a cardinality-distribution sampler produces a query q sampled from the cardinality distribution; the search engine returns results(q); sampling x uniformly from results(q) yields a document sampled from the degree distribution.]

27 Sampling Queries by Cardinality
- Sampling queries from the pool uniformly: easy
- Sampling queries from the pool by cardinality: hard, as it requires knowing the cardinalities of all queries in the search engine
- Use Monte Carlo methods to simulate biased sampling via uniform sampling:
  - Target distribution: the cardinality distribution
  - Trial distribution: uniform on the query pool

28 Sampling Queries by Cardinality (2)
- Bias weight of the cardinality distribution relative to the uniform distribution: w(q) = card(q), which can be computed using a single search engine query
- Use rejection sampling:
  - Envelope constant: C = k suffices for non-overflowing queries, since then card(q) ≤ k
  - Queries are sampled uniformly from the pool
  - Each query q is accepted with probability w(q)/C = card(q)/k
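A sketch of this query-level rejection step, assuming a hypothetical `card(q)` interface (one engine call per invocation) and non-overflowing queries, so that card(q) ≤ k:

```python
import random

def sample_query_by_cardinality(pool, card, k):
    """Draw queries uniformly from the pool and accept each with
    probability card(q)/k; accepted queries are then distributed
    proportionally to their cardinalities."""
    while True:
        q = random.choice(pool)
        if random.random() < card(q) / k:
            return q
```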

29 Complete Pool-Based Sampler
[Diagram: a uniform query sampler feeds (q, card(q)) pairs into a rejection-sampling stage, yielding queries sampled from the cardinality distribution; the resulting (q, results(q)) pairs drive the degree-distribution sampler, whose weighted outputs (x, 1/deg(x)), … pass through a second rejection-sampling stage to produce a uniform document sample.]
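Putting the two rejection stages together, a sketch of the whole pipeline under the same assumed interfaces (`search(q)` returning all of results(q), and `deg(x)` computed, as in the paper's setting, by matching x's content against the pool):

```python
import random

def pool_based_sample(pool, search, deg, k):
    """Pool-based sampler sketch: (1) uniform query draw, (2) rejection
    with probability card(q)/k makes accepted queries
    cardinality-distributed, (3) a uniform pick from results(q) is then
    degree-distributed, (4) rejection with probability 1/deg(x) makes
    the returned document near-uniform."""
    while True:
        q = random.choice(pool)
        results = search(q)
        if not results or random.random() >= len(results) / k:
            continue  # query rejected (or matched nothing)
        x = random.choice(results)
        if random.random() < 1.0 / deg(x):
            return x
```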

30 Dealing with Overflowing Queries
Problem: some queries may overflow (card(q) > k), causing bias towards highly ranked documents.
Solutions:
- Select a pool P in which overflowing queries are rare (e.g., phrase queries)
- Skip overflowing queries
- Adapt rejection sampling to deal with approximate weights
Theorem: samples of the PB sampler are at most ε-away from uniform (ε = overflow probability of P).

31 Creating the Query Pool
[Diagram: a large corpus C is scanned to extract queries q1, q2, … forming the query pool P.]
Example: P = all 3-word phrases that occur in C. If "to be or not to be" occurs in C, P contains: "to be or", "be or not", "or not to", "not to be".
Choose P so that it "covers" most documents in D.
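A minimal sketch of pool construction from a training corpus, using the 3-word-phrase example above:

```python
def phrase_pool(corpus_docs, n=3):
    """Collect every n-word window of every corpus document as a
    phrase query (a simplification of the pool construction)."""
    pool = set()
    for text in corpus_docs:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            pool.add(" ".join(words[i:i + n]))
    return sorted(pool)

print(phrase_pool(["to be or not to be"]))
# ['be or not', 'not to be', 'or not to', 'to be or']
```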

32 A Random Walk Sampler
- Define a graph G over the indexed documents: (x,y) ∈ E iff queries(x) ∩ queries(y) ≠ ∅
- Run a random walk on G; its limit distribution is the degree distribution
- Use MCMC methods (Metropolis-Hastings, Maximum-Degree) to make the limit distribution uniform
- Does not need a preprocessing step, but is less efficient than the pool-based sampler
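A sketch of the Metropolis-Hastings variant, assuming hypothetical `queries_of(x)` and `results_of(q)` interfaces. Proposing a neighbor by picking a random query of x and then a random result of that query, and accepting the move with probability min(1, deg(x)/deg(y)), turns the degree-biased walk into one with a uniform limit distribution:

```python
import random

def mh_random_walk(x0, queries_of, results_of, steps=10_000):
    """Walk on the document graph: from x, pick a random query of x,
    then a random result of that query, and accept the proposed move
    with probability min(1, deg(x)/deg(y)); otherwise stay at x."""
    x = x0
    for _ in range(steps):
        q = random.choice(queries_of(x))
        y = random.choice(results_of(q))
        if random.random() < min(1.0, len(queries_of(x)) / len(queries_of(y))):
            x = y
    return x
```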

33 Bias towards Long Documents

34 Relative Sizes of Google, MSN and Yahoo!
- Google = 1
- Yahoo! = 1.28
- MSN Search = 0.73

35 Top-Level Domains in Google, MSN and Yahoo!

36 Conclusions
- Two new search engine samplers: the pool-based sampler and the random walk sampler
- Both samplers are guaranteed to produce near-uniform samples, under plausible assumptions
- Both samplers show little or no bias in experiments

37 Thank You