CS276 Information Retrieval and Web Search Hugh E. Williams

Search in Industry
A few stories with practical illustrations:
– Understanding users with A/B testing at scale
– Using data to build systems: clicks, graphs, and their applications
– The practical intersection of software and people

Search in Industry
A “flywheel” of software, data, and analytics:
– Optimizing the “spinning” of the flywheel
– Ensuring you have more data to build better products
– Ensuring you have better analytics to build better products
[Diagram: the flywheel of (software) products, data, and analytics]

UNDERSTANDING USERS WITH A/B TESTING

An A/B test: Larger Images in eBay’s Search

A/B experimentation
– Divide the users into populations
– One (or two) of the populations is the control (or A)
– One or more populations are the tests (or B)
– Collect (vast amounts of) data from each population
– Compute “metrics” from the data, including confidence intervals
– Understand the results
– Make decisions
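
To make the population split concrete, here is a minimal sketch (hypothetical code, not eBay's or any particular production system) of deterministically assigning a user to the control or test population by hashing a stable user identifier, so the same user always sees the same variant:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, test_fraction: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'test' for one experiment."""
    # Salt the hash with the experiment name so different experiments
    # split the population (approximately) independently of each other.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # 10,000 fine-grained buckets
    return "test" if bucket < test_fraction * 10_000 else "control"

print(assign_variant("user-12345", "larger-search-images", test_fraction=0.1))
```

Salting by experiment name is also what lets one user sit in several concurrent experiments at once, a point the later slides return to.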

The A/B Experimentation Flywheel
An example of a “flywheel” of software, data, and analytics:
– Build an experimental product
– Collect data about the new feature
– Compare the new feature to existing features – iterate and improve

Basic Scientific Method
A/B experimentation must follow the basic scientific method. We need to:
– Have a hypothesis
– Have a control
– Have one independent variable
– Measure statistical significance
– Draw conclusions
– Keep track of data
– Ensure it is repeatable

A/B experimentation…
From the scientific method follows experimental design:
– Writing, agreeing on, and approving a hypothesis for testing; it must include the metrics
– Understanding the nature and limitations of the control
– Trying to minimize the number of independent variables
– Getting enough traffic to find signal from noise:
  - Understanding the exposure of the feature
  - Deciding how many users see the experiment concurrently
  - Deciding how long to run the experiment
  - Deciding on the confidence interval (CI)
– Agreeing on the metrics, statistical measures, and CIs
– Keeping track of data, and watching the experiment
– Ensuring it is actually repeatable
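
As an illustration of finding signal from noise and of confidence intervals, this is a hedged sketch of a standard two-proportion z-test on a conversion-rate metric; the counts are invented, and real experimentation platforms use more sophisticated statistical machinery:

```python
from math import sqrt
from statistics import NormalDist

def ab_summary(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Lift, confidence interval, and p-value for a test-vs-control conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # z-test under the pooled null hypothesis of "no difference"
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pool)))
    # Confidence interval on the difference (unpooled standard error)
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value

# Illustrative numbers only: 100,000 users per population.
lift, ci, p = ab_summary(conv_a=4_950, n_a=100_000, conv_b=5_200, n_b=100_000)
print(f"lift={lift:+.4%}  95% CI=({ci[0]:+.4%}, {ci[1]:+.4%})  p={p:.3f}")
```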

Is it ready for real customers?

A/B experimentation…
On any given day, Internet companies run many experiments. It’s very likely each user is in one or more:
– It could be “more” when the experiments are probably orthogonal
– It could be “more” when they are small, short sanity checks – they do not always worry whether they’re orthogonal

[Chart: eBay conversion increased by 13% between 2010 and 2012, attributed to Better Search, Simple Flows, Better Images, Merchandising, and Other]

A/B experimentation…
It is hard to segment the population:
– Many users are not logged in
– Many users use several browsers
– Many users use several devices
– (It would be easier to segment in other ways – but it is user effects we want to measure)
Attributing a result to a change is hard:
– Need to know that the user was affected by the change
– Need to understand interaction effects caused by changes
– Metrics are noisy
Decisions are made with less than 100% confidence
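
One concrete aspect of attribution (knowing that the user was actually affected by the change) is restricting the analysis to sessions that triggered the new feature rather than everyone nominally assigned to the test population. A toy sketch with made-up records:

```python
# Hypothetical session records; "feature_shown" marks sessions that
# actually triggered the experimental feature.
sessions = [
    {"user": "u1", "variant": "test",    "feature_shown": True,  "purchased": True},
    {"user": "u2", "variant": "test",    "feature_shown": False, "purchased": False},
    {"user": "u3", "variant": "control", "feature_shown": False, "purchased": True},
    {"user": "u4", "variant": "test",    "feature_shown": True,  "purchased": False},
]

def conversion(rows):
    return sum(r["purchased"] for r in rows) / max(len(rows), 1)

assigned_test = [r for r in sessions if r["variant"] == "test"]
triggered_test = [r for r in assigned_test if r["feature_shown"]]
print("assigned-only rate :", conversion(assigned_test))
print("triggered-only rate:", conversion(triggered_test))
```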

A/B experimentation…
Test versus control experimentation is challenging:
– Hard to measure long-term effects
– Hard to understand “burn-in” time
– Does not work well for large changes
– May not work well for very small changes
– Some effects are seasonal
– Metrics do not always capture the change
– Difficult in the modern world of mobile applications, and multi-device users

Taking care with decisions
Do not use or report only the mean; measure quartiles or percentiles
There are (almost) always examples where the test is worse than the control:
– What fraction of users have a worse experience?
– How bad is the worst 5%?
– What can you learn from failures? What can you learn from successes?
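
A small sketch of looking beyond the mean: compare test and control at several percentiles, so a regression in the worst 5% is visible even when the median improves. The latency samples below are invented:

```python
def percentile(xs, p):
    """Nearest-rank percentile of a sample (simple illustration, not production stats)."""
    xs = sorted(xs)
    k = max(0, min(len(xs) - 1, round(p / 100 * (len(xs) - 1))))
    return xs[k]

control = [120, 125, 130, 135, 140, 145, 150, 155, 160, 900]
test    = [110, 112, 115, 118, 120, 122, 125, 128, 130, 1500]

# The test wins at the median but loses badly in the tail.
for p in (50, 75, 95):
    print(f"p{p}: control={percentile(control, p)}  test={percentile(test, p)}")
```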

What won’t quantitative testing tell you?
– (Usually) An unambiguous picture
– What you do not measure
– Exactly what to do
– Anything qualitative
– When to take risks
– When to be decisive
– How to leap from one mountain to another (more on this later)

CLICKS, GRAPHS, AND THEIR APPLICATIONS

The Product Development Flywheel
A related “flywheel” of software, data, and analytics:
– Working with (vast) data sets to analyze and understand user behaviors
– Building products based on insights and (often) powered by data sets

Example #1: Ranking with Clicks
– For each prior query, sort results by prior “good click” count
– Display results sorted by decreasing “good click” count
– Pros: wisdom of the crowds; simple; distills query intent
– Cons: needs bootstrapping; self-reinforcing prophecy problems; less effective with less data (“tail” problems)
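
A minimal sketch of this click-count ranking idea (hypothetical log records, not eBay's pipeline): aggregate historical “good clicks” per query and item, then order results for a repeated query by that count:

```python
from collections import defaultdict

click_log = [   # hypothetical (query, item, was_good_click) records
    ("ipod nano", "item-17", True),
    ("ipod nano", "item-17", True),
    ("ipod nano", "item-42", True),
    ("ipod nano", "item-42", False),
    ("ipod nano", "item-99", False),
]

# Count only the "good" clicks (e.g. clicks followed by a purchase or long dwell).
good_clicks = defaultdict(lambda: defaultdict(int))
for query, item, good in click_log:
    if good:
        good_clicks[query][item] += 1

def rank_by_clicks(query):
    counts = good_clicks.get(query, {})
    return sorted(counts, key=counts.get, reverse=True)

print(rank_by_clicks("ipod nano"))   # ['item-17', 'item-42']
```

The “cons” on the slide fall out directly: a brand-new query has no counts to sort by, and whatever ranks first keeps collecting clicks.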

Ranking in the tail: the “Panda Graph” (from Craswell and Szummer, 2007)
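
In the spirit of the click-graph work cited above, here is a simplified random-walk sketch; it is not Craswell and Szummer's exact model (they study forward and backward walks with a self-transition probability), but it shows how probability mass started at a query spreads through clicked documents and related queries, which is what helps sparse tail queries:

```python
from collections import defaultdict

clicks = [  # hypothetical (query, document, click_count) edges of a click graph
    ("pilzlampe", "doc-A", 3),
    ("pilz lampe", "doc-A", 5),
    ("pilz lampe", "doc-B", 2),
    ("mushroom lamp", "doc-B", 4),
]

# Symmetric adjacency over the bipartite query/document graph.
adj = defaultdict(dict)
for q, d, c in clicks:
    adj[("q", q)][("d", d)] = c
    adj[("d", d)][("q", q)] = c

def walk(start, steps=3):
    """Probability of being at each node after `steps` steps of a random walk."""
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, prob in dist.items():
            total = sum(adj[node].values())
            for nbr, weight in adj[node].items():
                nxt[nbr] += prob * weight / total
        dist = nxt
    return dist

# An odd number of steps from a query node ends on document nodes.
scores = walk(("q", "pilzlampe"), steps=3)
docs = {node[1]: prob for node, prob in scores.items() if node[0] == "d"}
print(sorted(docs.items(), key=lambda kv: -kv[1]))   # doc-A ranked above doc-B
```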

The argument for data-oriented approaches
– With less data, more complex algorithms outperform simple algorithms
– With more data, simple algorithms improve
– Sometimes, simple algorithms outperform complex algorithms when there is enough data
(From Banko and Brill, 2001.)

Example #2: Using query-click graphs for query rewriting
In 2009, eBay’s search engine was literal; the goal was to make it more intuitive
Idea: Using query, click, and session data, build query rewrites that improve effectiveness by intuitively including synonyms and structured data
[Diagram: user query → query rewrite → search query → search → eBay results]

How do eBay buyers purchase the pilzlampe?
It turns out, they do one of a few things:
– Type pilzlampe, and purchase
– Type pilzlampe, …, pilz lampe, and purchase
– Type pilzlampe, …, pilzlampen, and purchase
– Type pilz lampen, …, pilzlampe, and purchase
– …

How do buyers purchase the pilzlampe?
From eBay’s large-scale data mining and processing:
– Without human supervision, eBay automatically discovers that pilz lampe and pilzlampe are the same
– They also discover that pilz and pilze are the same, and lampe and lampen are the same
From these patterns, they (figuratively) rewrite the user’s query pilzlampe as:
pilzlampe OR “pilz lampe” OR “pilz lampen” OR pilzlampen OR “pilze lampe” OR pilzelampe OR “pilze lampen” OR pilzelampen
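
A hedged sketch of how mined equivalences like these might be turned into the OR-style rewrite shown above; the equivalence table is hand-written here to stand in for what eBay mines automatically from query, click, and session data:

```python
from itertools import product

equivalents = {            # hypothetical mined equivalence classes
    "pilzlampe": ["pilzlampe", "pilz lampe", "pilzlampen", "pilz lampen",
                  "pilzelampe", "pilze lampe", "pilzelampen", "pilze lampen"],
}

def rewrite(query: str) -> str:
    """Expand each query token into its known variants and OR the combinations."""
    variant_lists = [equivalents.get(tok, [tok]) for tok in query.split()]
    alternatives = []
    for combo in product(*variant_lists):
        terms = [f'"{t}"' if " " in t else t for t in combo]   # quote multi-word variants
        alternatives.append(" ".join(terms))
    return " OR ".join(alternatives)

print(rewrite("pilzlampe"))
# pilzlampe OR "pilz lampe" OR pilzlampen OR "pilz lampen" OR ...
```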

Scale and the tail
Nothing is easy in the tail or at scale:
– Incorrect strong signals:
  - CMU is not Central Michigan University
  - Mariners is not the same as Marines
– Context matters:
  - Correcting Seattle Marines to Seattle Mariners is (generally) right
  - Denver Nuggets is not Denver in the Jewelry & Watches category at eBay
– Freshness matters:
  - When the iPad launched, eBay corrected iPad to iPod
  - Even the query britney spears needs freshness and is dynamic
– The tail is sparse:
  - Error prone, and hard to tell when errors have been made
  - Need to use traditional techniques, or need to map a query carefully to another query
The reality is that blended approaches work best, and they are built using machine learning (see the “Learning to Rank” lecture in a few weeks)

Example #3: Other Applications
Query-document ranking and query alterations are two prominent areas. Others include:
– Adult and spam filtering
– Query suggestions (related searches)
– Query autosuggest
– Creating document snippets / captions / summaries
– Spike and news-worthiness detection
– Showing local results
– …

HUMANS IN THE LOOP

Where People Matter in Search
Within the “flywheel”:
– Analytics to product: decisions, intuition, and innovation
– Product to data: instrumentation, cleansing, and organization
– Data to analytics: metrics, measurement, and understanding
There is more to search than the flywheel…

Example #1: The local vs. global maxima phenomenon

Example #2: Intuition using Data

Example #3: Don’t over-engineer

QUESTIONS & ANSWERS PS. Enjoyed the lecture? Give something back at

Additional Reading
Nick Craswell and Martin Szummer. 2007. Random walks on the click graph. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07). ACM, New York, NY, USA.
Michele Banko and Eric Brill. 2001. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL '01). Association for Computational Linguistics, Stroudsburg, PA, USA.