Ronny Kohavi, Microsoft Joint work with Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, Ya Xu Based on KDD 2012 talk, available at

Presentation transcript:

Ronny Kohavi, Microsoft Joint work with Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, Ya Xu Based on KDD 2012 talk, available at Which Test Won 8/27/2012

2 This is an extended presentation of the KDD paper presented in Beijing a few weeks ago (KDD = Knowledge Discovery and Data mining).
At Bing, we ran thousands of experiments.
It is not uncommon to see experiments that impact annual revenue by millions of dollars, sometimes tens of millions.
Trustworthiness is critical, so surprising results are investigated.
We share puzzling results that each took weeks to months to analyze deeply, understand, and explain.
Moreover, the issues uncovered in these specific examples surfaced in multiple other experiments, so they are not isolated incidents.

3 Any figure that looks interesting or different is usually wrong.
If something is “amazing,” find the flaw! It’s usually there.
Examples:
If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01.
If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut.
Traffic to web sites doubled between 1 and 2 AM on November 6, 2011 for many sites, relative to the same hour a week prior. Why?
In this talk, we share puzzling results that triggered Twyman’s law for us, so we investigated and found the flaw.

4 “Find a house” widget variations.
Which is best for the OEC (Overall Evaluation Criterion) of revenue to Microsoft, generated every time a user clicks?
[Figure: six widget variants, labeled A through F]

5 Version C was 8.5% better.
Since this is the #1 monetization for MSN Real Estate, it improved revenues significantly.
In the “throwdown” (vote for the winning variant), nobody from MSN Real Estate or the company that did the creative voted for the winning widget.
This is very common: we are terrible at correctly assessing the value of our own ideas/designs.
This is why running controlled experiments is so critical if we want to be data-driven.

6 Concept is trivial:
Randomly split traffic between two (or more) versions: A/Control and B/Treatment.
Collect metrics of interest.
Analyze.
Unless you are testing on one of the largest sites in the world, use 50/50% (high statistical power).
Must run statistical tests to confirm differences are not due to chance.
Best scientific way to prove causality, i.e., that the changes in metrics are caused by the changes introduced in the treatment(s).
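The split-and-test loop described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the system used at Bing: the function names (`assign_variant`, `two_proportion_z_test`), the MD5-based hashing, and the choice of a pooled two-proportion z-test on a conversion metric are all assumptions made for the example.

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant, roughly uniformly.

    Hashing (experiment, user_id) keeps a user in the same variant on
    every visit without storing any per-user state.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Compare conversion rates of B (treatment) against A (control).

    Returns (relative_delta, z, two_sided_p_value) using the pooled
    normal approximation, which is reasonable for large samples.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return (p_b - p_a) / p_a, z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 5.0% vs. 6.0% conversion on 10,000 users each.
    delta, z, p = two_proportion_z_test(500, 10_000, 600, 10_000)
    print(f"relative delta={delta:+.1%}, z={z:.2f}, p={p:.4f}")
```

With two variants the hash gives an approximately 50/50 split, the allocation the slide recommends for statistical power. The test only says whether a difference is unlikely to be chance; it does not protect against the instrumentation and carryover problems discussed later in the talk.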

7 Your baby is not as beautiful as you think.
Our statistic from thousands of controlled experiments: only 10-30% of experiments move the metrics they were designed to improve.
“Google ran approximately 12,000 randomized experiments in 2009, with [only] about 10 percent of these leading to business changes” -- Jim Manzi
“80% of the time you/we are wrong about what a customer wants” -- Avinash Kaushik
“Netflix considers 90% of what they try to be wrong” -- Mike Moran

8 An OEC is the Overall Evaluation Criterion.
It is a metric (or set of metrics) that guides the org as to whether A is better than B in an A/B test.
In prior work, we emphasized long-term focus and thinking about customer lifetime value, but operationalizing it is hard.
Search engines (Bing, Google) are evaluated on query share (distinct queries) and revenue as long-term goals.
Puzzle:
A ranking bug in an experiment resulted in very poor search results.
Distinct queries went up over 10%, and revenue went up over 30%.
What metrics should be in the OEC for a search engine?

10 A piece of code was added, such that when a user clicked on a search result, additional JavaScript was executed (a session cookie was updated with the destination) before navigating to the destination page.
This slowed down the user experience slightly, so we expected a slightly negative experiment.
Results showed that users were clicking more! Why?

11 User clicks (and form submits) are instrumented and form the basis for many metrics.
Instrumentation is typically done by having the web browser request a web beacon (1x1 pixel image).
Classical tradeoff here:
Waiting for the beacon to return slows the action (typically navigating away).
Making the call asynchronous is known to cause click-loss, as the browsers can kill the request (a classical browser optimization, because the result can’t possibly matter for the new page).
Small delays, on-mouse-down, or redirects are used.

12 Click-loss varies dramatically by browser.
Chrome, Firefox, and Safari are aggressive at terminating such requests. Safari’s click-loss is > 50%.
IE respects image requests for backward-compatibility reasons.
White paper available on this issue here.
Other cases where this impacts experiments:
Opening a link in a new tab/window will overestimate the click delta. Because the main window remains open, browsers can’t optimize and kill the beacon request, so there is less click-loss.
Using HTML5 to update components of the page instead of refreshing the whole page has the same overestimation problem.
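To see how click-loss alone can make a treatment look better, here is a small Python sketch. It assumes users click exactly the same amount in both variants, and that the treatment's extra delay before navigation halves the click-loss rate. The browser mix, the loss rates, and the halving are hypothetical numbers chosen for illustration; only the qualitative pattern (no loss on IE, heavy loss on Safari) follows the slide.

```python
def observed_clicks(true_clicks_by_browser: dict, loss_rate_by_browser: dict) -> float:
    """Clicks that reach the logs after the browser drops some beacon requests."""
    return sum(
        clicks * (1.0 - loss_rate_by_browser[browser])
        for browser, clicks in true_clicks_by_browser.items()
    )


def apparent_click_lift(true_clicks: dict, control_loss: dict, treatment_loss: dict) -> float:
    """Relative click delta seen in the logs when true behavior is identical."""
    control = observed_clicks(true_clicks, control_loss)
    treatment = observed_clicks(true_clicks, treatment_loss)
    return treatment / control - 1.0


# Hypothetical true click counts per browser, identical in both variants.
TRUE_CLICKS = {"ie": 6000, "chrome": 2000, "firefox": 1500, "safari": 500}

# Hypothetical click-loss rates in control.
CONTROL_LOSS = {"ie": 0.0, "chrome": 0.2, "firefox": 0.2, "safari": 0.5}

# Assumption: the treatment's extra JavaScript delays navigation enough
# for more beacons to complete, halving the loss in each browser.
TREATMENT_LOSS = {browser: rate / 2 for browser, rate in CONTROL_LOSS.items()}

if __name__ == "__main__":
    lift = apparent_click_lift(TRUE_CLICKS, CONTROL_LOSS, TREATMENT_LOSS)
    print(f"apparent click lift with no change in user behavior: {lift:+.2%}")
    for browser in TRUE_CLICKS:
        one = {browser: TRUE_CLICKS[browser]}
        print(browser, f"{apparent_click_lift(one, CONTROL_LOSS, TREATMENT_LOSS):+.2%}")
```

Under these assumptions the logs show a positive click delta although nothing about user behavior changed. The per-browser breakdown is one way to spot the artifact: the browser with no click-loss shows no lift, while the lossy browsers show the largest one.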

13 Primacy effect occurs when you change the navigation on a web site.
Experienced users may be less efficient until they get used to the new navigation.
Control has a short-term advantage.
Novelty effect happens when a new design is introduced.
Users investigate the new feature, click everywhere, and introduce a “novelty” bias that dies quickly if the feature is not truly useful.
Treatments have a short-term advantage.

14 Given the high failure rate of ideas, new experiments are followed closely to determine if the new idea is a winner.
Multiple graphs of effect look like this:
Negative on day 1: -0.55%
Less negative on day 2: -0.38%
Less negative on day 3: -0.21%
Less negative on day 4: -0.13%
The experimenter extrapolates linearly and says: primacy effect. This will be positive in a couple of days, right?
Wrong! This is expected.

16 The longer graph.
This was an A/A test, so the true effect is 0.
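One way to convince yourself that such trends are expected is to simulate an A/A test and watch the cumulative effect. The Python sketch below is a simplified model, not the analysis from the talk: it assumes independent new users every day, a normally distributed per-user metric with known standard deviation, and hypothetical parameter values. `simulate_aa_cumulative` is a name made up for this example.

```python
import math
import random


def simulate_aa_cumulative(days=14, users_per_day=5000, mean=1.0, sd=1.0, seed=0):
    """Simulate an A/A test and report the cumulative effect by day.

    Both variants draw from the same distribution, so the true effect is 0.
    Returns a list of (day, cumulative_delta_pct, ci_half_width_pct).
    """
    rng = random.Random(seed)
    sum_control = sum_treatment = 0.0
    n = 0  # cumulative users per variant
    rows = []
    for day in range(1, days + 1):
        sum_control += sum(rng.gauss(mean, sd) for _ in range(users_per_day))
        sum_treatment += sum(rng.gauss(mean, sd) for _ in range(users_per_day))
        n += users_per_day
        mean_c, mean_t = sum_control / n, sum_treatment / n
        delta_pct = 100.0 * (mean_t - mean_c) / mean_c
        # 95% half-width for the difference of two means with known sd,
        # expressed as a percentage of the true mean.
        half_width_pct = 100.0 * 1.96 * sd * math.sqrt(2.0 / n) / mean
        rows.append((day, delta_pct, half_width_pct))
    return rows


if __name__ == "__main__":
    for day, delta, half_width in simulate_aa_cumulative():
        print(f"day {day:2d}: delta={delta:+.2f}%  95% CI half-width=+/-{half_width:.2f}%")
```

In this model the confidence interval narrows in proportion to 1/sqrt(days), so whatever the cumulative delta happens to be on day 1, later cumulative values tend to drift toward the true effect of 0. A run that starts negative will usually look "less negative" each day, which is the pattern the experimenter on the previous slide mistook for a primacy effect.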

19 Experiment is run, results are surprising. (This by itself is fine, as our intuition is poor.)
Rerun the experiment, and the effects disappear.
Reason: the bucket system recycles users, and the prior experiment had carryover effects.
These can last for months!
Must run A/A tests, or re-randomize.
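Re-randomizing can be pictured as changing the seed of the hash that maps users to buckets. The Python sketch below is an assumption about how such a bucket system might look, not a description of the Bing, Google, or Yahoo implementations. The names (`bucket`, `carryover_overlap`), the SHA-256 hash, the seed strings, and the rule that the lower half of the buckets forms the treatment are all hypothetical.

```python
import hashlib


def bucket(user_id: str, seed: str, num_buckets: int = 100) -> int:
    """Map a user to a bucket; changing the seed reshuffles every user."""
    digest = hashlib.sha256(f"{seed}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def carryover_overlap(users, old_seed: str, new_seed: str, num_buckets: int = 100) -> float:
    """Fraction of the previous treatment's users who land in the new treatment.

    Treatment is taken to be the lower half of the buckets (an assumption
    for this sketch). Without rehashing the fraction is 1.0; with a fresh
    seed it should be close to 0.5, so carryover is spread evenly over
    both variants of the next experiment.
    """
    half = num_buckets // 2
    old_treatment = [u for u in users if bucket(u, old_seed, num_buckets) < half]
    reused = sum(1 for u in old_treatment if bucket(u, new_seed, num_buckets) < half)
    return reused / len(old_treatment)


if __name__ == "__main__":
    users = [f"user{i}" for i in range(20_000)]
    print("same seed:", carryover_overlap(users, "exp-2012-07", "exp-2012-07"))
    print("new seed: ", carryover_overlap(users, "exp-2012-07", "exp-2012-08"))
```

Reusing the same seed keeps the previous treatment group intact, so any lingering effect of the earlier experiment lands entirely on one side of the new one. Running an A/A test on the new assignment before starting the real experiment is a complementary check that no such imbalance remains.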

20 OEC: evaluate long-term goals through short-term metrics.
The difference between theory and practice is greater in practice than in theory:
Instrumentation issues (e.g., click-tracking) must be understood.
Carryover effects impact the “bucket systems” used by Bing, Google, and Yahoo, which require rehashing and A/A tests.
Experimentation insight:
Effect trends are expected.
Longer experiments do not increase power for some metrics. Fortunately, we have a lot of users.

21 Multiple papers available at
Survey and practical guide
Seven Pitfalls to Avoid when Running Controlled Experiments on the Web
Online Experimentation at Microsoft
Talks and tutorials at
Questions?