1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Department of Electrical Engineering, Technion Maxim Gurevich Department of Electrical Engineering,

Slides:

Advertisements

Similar presentations

Towards Data Mining Without Information on Knowledge Structure

Advertisements

TWO STEP EQUATIONS 1. SOLVE FOR X 2. DO THE ADDITION STEP FIRST

Bellwork If you roll a die, what is the probability that you roll a 2 or an odd number? P(2 or odd) 2. Is this an example of mutually exclusive, overlapping,

Lower Bounds for Local Search by Quantum Arguments Scott Aaronson.

STATISTICS HYPOTHESES TEST (II) One-sample tests on the mean and variance Professor Ke-Sheng Cheng Department of Bioenvironmental Systems Engineering National.

Effective Change Detection Using Sampling Junghoo John Cho Alexandros Ntoulas UCLA.

By D. Fisher Geometric Transformations. Reflection, Rotation, or Translation 1.

…to Ontology Repositories Mathieu dAquin Knowledge Media Institute, The Open University From…

1 Use of Electronic Resources in Research Prof. Dr. Khalid Mahmood Department of Library & Information Science University of the Punjab.

1 Web Search Environments Web Crawling Metadata using RDF and Dublin Core Dave Beckett Slides:

1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion.

Business Transaction Management Software for Application Coordination 1 Business Processes and Coordination.

Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13

Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13

ALGEBRAIC EXPRESSIONS

DIVIDING INTEGERS 1. IF THE SIGNS ARE THE SAME THE ANSWER IS POSITIVE 2. IF THE SIGNS ARE DIFFERENT THE ANSWER IS NEGATIVE.

MULTIPLYING MONOMIALS TIMES POLYNOMIALS (DISTRIBUTIVE PROPERTY)

ADDING INTEGERS 1. POS. + POS. = POS. 2. NEG. + NEG. = NEG. 3. POS. + NEG. OR NEG. + POS. SUBTRACT TAKE SIGN OF BIGGER ABSOLUTE VALUE.

SUBTRACTING INTEGERS 1. CHANGE THE SUBTRACTION SIGN TO ADDITION

MULT. INTEGERS 1. IF THE SIGNS ARE THE SAME THE ANSWER IS POSITIVE 2. IF THE SIGNS ARE DIFFERENT THE ANSWER IS NEGATIVE.

Year 6 mental test 10 second questions Numbers and number system Numbers and the number system, fractions, decimals, proportion & probability.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 4.1 Chapter 4 : Searching the Web The mechanics.

STATISTICAL INFERENCE ABOUT MEANS AND PROPORTIONS WITH TWO POPULATIONS

Correctness of Gossip-Based Membership under Message Loss Maxim GurevichIdit Keidar Technion.

Chapter 7 Sampling and Sampling Distributions

Google as a Hacking Tool James Lee Advanced Searching.

Context-Sensitive Query Auto-Completion AUTHORS:NAAMA KRAUS AND ZIV BAR-YOSSEF DATE OF PUBLICATION:NOVEMBER 2010 SPEAKER:RISHU GUPTA 1.

5-1 Chapter 5 Theory & Problems of Probability & Statistics Murray R. Spiegel Sampling Theory.

ACM CIKM 2008, Oct , Napa Valley 1 Mining Term Association Patterns from Search Logs for Effective Query Reformulation Xuanhui Wang and ChengXiang.

© S Haughton more than 3?

Text Categorization.

Polynomial Factor Theorem Polynomial Factor Theorem

Solving Equations How to Solve Them

Linking Verb? Action Verb or. Question 1 Define the term: action verb.

Routing and Congestion Problems in General Networks Presented by Jun Zou CAS 744.

Past Tense Probe. Past Tense Probe Past Tense Probe – Practice 1.

Module 16: One-sample t-tests and Confidence Intervals

Dan Bassett, Jonathan Canfield December 13, 2011.

Addition 1’s to 20.

25 seconds left…...

Test B, 100 Subtraction Facts

‘You Be George’ Activity. ProblemLearning TargetRight?Wrong?Simple mistake? More study? 1 Place Value: I can write numerals in expanded form to 10 thousands.

We will resume in: 25 Minutes.

1 Random Sampling - Random Samples. 2 Why do we need Random Samples? Many business applications -We will have a random variable X such that the probability.

12 January 2009SDS batch generation, distribution and web interface 1 ExESS IT tool for SDS batch generation, distribution and web interface ExESS IT tool.

INFORMATION SOLUTIONS Citation Analysis Reports. Copyright 2005 Thomson Scientific 2 INFORMATION SOLUTIONS Provide highly customized datasets based on.

Application of Ensemble Models in Web Ranking

Ziv Bar-YossefMaxim Gurevich Google and Technion Technion TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AA A A AA.

1 Random Sampling from a Search Engine‘s Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion.

1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 11 June 11, 2006

Automatic Evaluation Of Search Engines Project Poster Team members: Levin Boris Laserson Itamar Instructor Name: Gurevich Maxim.

Automatic Evaluation Of Search Engines Project Presentation Team members: Levin Boris Laserson Itamar Instructor Name: Gurevich Maxim.

1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 10 June 4, 2006

The community-search problem and how to plan a successful cocktail party Mauro SozioAris Gionis Max Planck Institute, Germany Yahoo! Research, Barcelona.

1 Uniform Sampling from the Web via Random Walks Ziv Bar-Yossef Alexander Berg Steve Chien Jittat Fakcharoenphol Dror Weitz University of California at.

1 Random Sampling from a Search Engine‘s Index Ziv Bar-Yossef and Maxim Gurevich Department of Electrical Engineering Technion Presentation at group meeting,

Efficient Search Engine Measurements Maxim Gurevich Technion Ziv Bar-Yossef Technion and Google.

Introduction to Monte Carlo Methods D.J.C. Mackay.

윤언근 DataMining lab.  The Web has grown exponentially in size but this growth has not been isolated to good-quality pages.  spamming and.

Contextual Ranking of Keywords Using Click Data Utku Irmak, Vadim von Brzeski, Reiner Kraft Yahoo! Inc ICDE 09’ Datamining session Summarized.

Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.

Uniform Sampling from the Web via Random Walks

Agreeing to Disagree: Search Engines and Their Public Interfaces

Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay

Presentation transcript:

1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Department of Electrical Engineering, Technion Maxim Gurevich Department of Electrical Engineering, Technion IBM Haifa Research Lab

2 Search Engine Samplers Index Public Interface Public Interface Search Engine Sampler Web D Queries Top k results Random document x D Indexed Documents

3 Motivation Useful tool for search engine evaluation: Freshness Fraction of up-to-date pages in the index Topical bias Identification of overrepresented/underrepresented topics Spam Fraction of spam pages in the index Security Fraction of pages in index infected by viruses/worms/trojans Relative Size Number of documents indexed compared with other search engines

4 Size Wars August 2005 : We index 20 billion documents. So, whos right? September 2005 : We index 8 billion documents, but our index is 3 times larger than our competitions.

5 Related Work Random Sampling from a Search Engines Index [BharatBroder98, CheneyPerry05, GulliSignorni05] Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00] Queries from user query logs [LawrenceGiles98, DobraFeinberg04] Random sampling from the whole web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]

6 Our Contributions A pool-based sampler Guaranteed to produce near-uniform samples A random walk sampler After sufficiently many steps, guaranteed to produce near-uniform samples Does not need an explicit lexicon/pool at all! Focus of this talk

7 Search Engines as Hypergraphs results(q) = { documents returned on query q } queries(x) = { queries that return x as a result } P = query pool = a set of queries Query pool hypergraph: Vertices:Indexed documents Hyperedges:{ result(q) | q P } news.google.com news.bbc.co.uk maps.google.com maps.yahoot.com news bbc google maps en.wikipedia.org/wiki/BBC

8 Query Cardinalities and Document Degrees Query cardinality: card(q) = |results(q)| Document degree: deg(x) = |queries(x)| Examples: card(news) = 4, card(bbc) = 3 deg( = 1, deg(news.bbc.co.uk) = news.google.com news.bbc.co.uk maps.google.com maps.yahoot.com news bbc google maps en.wikipedia.org/wiki/BBC

9 The Pool-Based Sampler: Preprocessing Step C Large corpus P q1q1 … … Query Pool Example: P = all 3-word phrases that occur in C If to be or not to be occurs in C, P contains: to be or, be or not, or not to, not to be Choose P that covers most documents in D q2q2

10 Monte Carlo Simulation We dont know how to generate uniform samples from D directly How can we use biased samples to generate uniform samples? Samples with weights that represent their bias can be used to simulate uniform samples Monte Carlo Simulation Methods Rejection Sampling Importance Sampling Metropolis- Hastings Maximum- Degree

11 Document Degree Distribution We are able to generate biased samples from the document degree distribution Advantage: Can compute weights representing the bias of p:

12 Rejection Sampling [von Neumann] accept := false while (not accept) generate a sample x from p toss a coin whose heads probability is w p (x) if coin comes up heads, accept := true return x

13 Pool-Based Sampler Degree distribution sampler Search Engine Rejection Sampling q 1,q 2,… results(q 1 ), results(q 2 ),… x Pool-Based Sampler (x 1,1/deg(x 1 )), (x 2,1/deg(x 2 )),… Uniform sample Documents sampled from degree distribution with corresponding weights Degree distribution: p(x) = deg(x) / x deg(x)

14 Sampling documents by degree Select a random q P Select a random x results(q) Documents with high degree are more likely to be sampled If we sample q uniformly oversample documents that belong to narrow queries We need to sample q proportionally to its cardinality news.google.com news.bbc.co.uk maps.google.com maps.yahoot.com news bbc google maps en.wikipedia.org/wiki/BBC

15 Sampling queries by cardinality Sampling queries from pool uniformly:Easy Sampling queries from pool by cardinality: Hard Requires knowing cardinalities of all queries in the search engine Use Monte Carlo methods to simulate biased sampling via uniform sampling: Sample queries uniformly from P Compute cardinality weight for each sample: Obtain queries sampled by their cardinality

16 Dealing with Overflowing Queries Problem: Some queries may overflow (card(q) > k) Bias towards highly ranked documents Solutions: Select a pool P in which overflowing queries are rare (e.g., phrase queries) Skip overflowing queries Adapt rejection sampling to deal with approximate weights Theorem: Samples of PB sampler are at most -away from uniform. ( = overflow probability of P)

17 Bias towards Long Documents

18 Relative Sizes of Google, MSN and Yahoo! Google = 1 Yahoo! = 1.28 MSN Search = 0.73

19 Conclusions Two new search engine samplers Pool-based sampler Random walk sampler Samplers are guaranteed to produce near- uniform samples, under plausible assumptions. Samplers show no or little bias in experiments.

20 Thank You