Entropy of Search Logs - How Big is the Web? - How Hard is Search? - With Personalization? With Backoff? Qiaozhu Mei †, Kenneth Church ‡ † University of.

Presentation transcript:

1 Entropy of Search Logs: How Big is the Web? How Hard is Search? With Personalization? With Backoff?
Qiaozhu Mei †, Kenneth Church ‡
† University of Illinois at Urbana-Champaign
‡ Microsoft Research

2 How Big is the Web? 5B? 20B? More? Less?
What if a small cache of millions of pages
– Could capture much of the value of billions?
Could a Big bet on a cluster in the clouds
– Turn into a big liability?
Examples of Big Bets
– Computer Centers & Clusters: Capital (Hardware), Expense (Power), Dev (MapReduce, GFS, Bigtable, etc.)
– Sales & Marketing >> Production & Distribution

3 Millions (Not Billions)

4 Population Bound
With all the talk about the Long Tail
– You’d think that the Web was astronomical
– Carl Sagan: Billions and Billions…
Lower Distribution $$ → Sell Less of More
But there are limits to this process
– NetFlix: 55k movies (not even millions)
– Amazon: 8M products
– Vanity Searches: Infinite???
Personal Home Pages << Phone Book < Population
Business Home Pages << Yellow Pages < Population
Millions, not Billions (until market saturates)

5 It Will Take Decades to Reach Population Bound
Most people (and products) don’t have a web page (yet)
Currently, I can find famous people (and academics) but not my neighbors
– There aren’t that many famous people (and academics)…
– Millions, not billions (for the foreseeable future)

6 Equilibrium: Supply = Demand
If there is a page on the web,
– And no one sees it,
– Did it make a sound?
How big is the web?
– Should we count “silent” pages
– That don’t make a sound?
How many products are there?
– Do we count “silent” flops
– That no one buys?

7 Demand Side Accounting
Consumers have limited time
– Telephone Usage: 1 hour per line per day
– TV: 4 hours per day
– Web: ??? hours per day
Suppliers will post as many pages as consumers can consume (and no more)
Size of Web: O(Consumers)

8 How Big is the Web?
Related questions come up in language
How big is English?
– Dictionary Marketing
– Education (Testing of Vocabulary Size)
– Psychology
– Statistics
– Linguistics
Two Very Different Answers
– Chomsky: language is infinite
– Shannon: 1.25 bits per character
How many words do people know? What is a word? Person? Know?

9 Chomskian Argument: Web is Infinite
One could write a malicious spider trap
– Not just an academic exercise
The Web is full of benign examples
– Infinitely many months
– Each month has a link to the next

10 How Big is the Web? 5B? 20B? More? Less?
More (Chomsky) – Less (Shannon)
More Practical Answer: Millions (not Billions)
MSN Search Log: 1 month → x18
Entropy (H)
– Query: 21.1 → 22.9
– URL: 22.1 → 22.4
– IP: 22.1 → 22.6
– All But IP: 23.9
– All But URL: 26.0
– All But Query: 27.1
– All Three: 27.2
Cluster in Cloud → Desktop → Flash
Comp Ctr ($$$$) → Walk in the Park ($)

11 Entropy (H)
Difficulty of encoding information (a distribution)
– Size of search space; difficulty of a task
H = 20 bits corresponds to 1 million items distributed uniformly (2^20 ≈ 1 million)
Powerful tool for sizing challenges and opportunities
– How hard is search?
– How much does personalization help?
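The "H = 20" rule of thumb on this slide can be checked numerically. The sketch below is an illustration only: the function name `entropy_bits` and the toy data are assumptions, not part of the talk. It estimates Shannon entropy from observed items and shows that a uniform distribution over 2^20 items has exactly 20 bits, while a skewed distribution has far fewer.

```python
import math
from collections import Counter

def entropy_bits(items):
    """Shannon entropy, in bits, of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Uniform case: 2**20 (about one million) distinct items, each seen once.
uniform_h = entropy_bits(range(2 ** 20))

# Skewed case (hypothetical data): one item dominates, so the effective
# search space is much smaller than the number of distinct items suggests.
skewed_h = entropy_bits(["a"] * 97 + ["b", "c", "d"])
```

The skewed case is the one that matters for search logs: a few very popular queries and URLs keep the entropy well below the log of the number of distinct items.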

12 How Hard Is Search?
Traditional Search
– H(URL | Query)
– 2.8 (= 23.9 – 21.1)
Personalized Search
– H(URL | Query, IP)
– 1.2 (= 27.2 – 26.0)
Personalization cuts H in Half!
Entropy (H)
– Query: 21.1
– URL: 22.1
– IP: 22.1
– All But IP: 23.9
– All But URL: 26.0
– All But Query: 27.1
– All Three: 27.2
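The subtractions on this slide follow from the chain rule, H(URL | Query) = H(URL, Query) − H(Query), where "All But IP" is the joint entropy H(Query, URL). A minimal sketch of the same calculation on a tiny log follows; the log rows, field names, and `.example` hosts are hypothetical and stand in for the MSN log, which is not reproduced here.

```python
import math
from collections import Counter

def entropy_bits(items):
    """Shannon entropy, in bits, of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(log, target, given):
    """Chain rule: H(target | given) = H(target, given) - H(given)."""
    joint = [tuple(row[k] for k in (*given, target)) for row in log]
    cond = [tuple(row[k] for k in given) for row in log]
    return entropy_bits(joint) - entropy_bits(cond)

# Hypothetical toy log: the same query leads two users to different URLs.
toy_log = [
    {"ip": "u1", "query": "msg", "url": "garden.example"},
    {"ip": "u1", "query": "msg", "url": "garden.example"},
    {"ip": "u2", "query": "msg", "url": "glutamate.example"},
    {"ip": "u2", "query": "msg", "url": "glutamate.example"},
]

h_url_given_q = conditional_entropy(toy_log, "url", ["query"])           # 1 bit
h_url_given_q_ip = conditional_entropy(toy_log, "url", ["query", "ip"])  # 0 bits

# The slide's own figures, using the same identity.
slide_traditional = 23.9 - 21.1    # H(URL | Query)
slide_personalized = 27.2 - 26.0   # H(URL | Query, IP)
```

In the toy log, knowing the user removes all remaining uncertainty. In the slide's figures it removes a bit more than half.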

13 Difficulty of Queries
Easy queries (low H(URL|Q)):
– google, yahoo, myspace, ebay, …
Hard queries (high H(URL|Q)):
– dictionary, yellow pages, movies, “what is may day?”

14 How Hard are Query Suggestions?
The Wild Thing? C* Rice → Condoleezza Rice
Traditional Suggestions
– H(Query)
– 21 bits
Personalized
– H(Query | IP)
– 5 bits (= 26 – 21)
Personalization cuts H in Half! Twice
Entropy (H)
– Query: 21.1
– URL: 22.1
– IP: 22.1
– All But IP: 23.9
– All But URL: 26.0
– All But Query: 27.1
– All Three: 27.2

15 Personalization with Backoff
Ambiguous query: MSG
– Madison Square Garden
– Monosodium Glutamate
Disambiguate based on user’s prior clicks
When we don’t have data
– Back off to classes of users
Proof of Concept:
– Classes defined by IP addresses
Better:
– Market Segmentation (Demographics)
– Collaborative Filtering (Other users who click like me)

16 Backoff
Proof of concept: bytes of IP define classes of users
If we only know some of the IP address, does it help?
Bytes of IP address: H(URL | IP, Query)
– *
– *.*
– *.*.*: 1.95
– *.*.*.*: 2.74
Cuts H in half even if using the first two bytes of IP
Some of the IP is better than none
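The user classes on this slide can be sketched as wildcard patterns over the bytes of an address. The helper below is a hypothetical illustration (the function name and the sample address from the documentation range 192.0.2.0/24 are assumptions, not taken from the talk): it lists the classes an address belongs to, from most to least specific.

```python
def ip_backoff_classes(ip):
    """Return the backoff classes for a dotted IPv4 address, from the full
    address (all 4 bytes known) down to '*.*.*.*' (no bytes known)."""
    parts = ip.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected a dotted IPv4 address, got {ip!r}")
    # Keep the first k bytes and replace the rest with wildcards.
    return [".".join(parts[:k] + ["*"] * (4 - k)) for k in range(4, -1, -1)]

classes = ip_backoff_classes("192.0.2.7")
# ['192.0.2.7', '192.0.2.*', '192.0.*.*', '192.*.*.*', '*.*.*.*']
```

Each pattern can serve as a key for click statistics at that level of specificity, so a user with little history can still borrow evidence from neighbours who share a prefix.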

17 Backing Off by IP
Personalization with Backoff
λs estimated with EM and CV
A little bit of personalization
– Better than too much
– Or too little
λ4: weights for first 4 bytes of IP
λ3: weights for first 3 bytes of IP
λ2: weights for first 2 bytes of IP
……
(Figure labels: Sparse Data, Missed Opportunity)

18 Personalization with Backoff → Market Segmentation
Traditional Goal of Marketing:
– Segment Customers (e.g., Business v. Consumer)
– By Need & Value Proposition
Need: Segments ask different questions at different times
Value: Different advertising opportunities
Segmentation Variables
– Queries, URL Clicks, IP Addresses
– Geography & Demographics (Age, Gender, Income)
– Time of day & Day of Week

19 Business Queries on Business Days; Consumer Queries (Weekends & Every Day)

20 Business Days v. Weekends: More Clicks and Easier Queries
(Figure annotations: Easier, More Clicks)

21 Day vs. Night: More Queries, More Diversified Queries
– Day: more clicks and more diversified queries
– Night: fewer clicks, more uniform queries

22 Harder Queries at TV Time
(Figure annotations: Harder queries, Weekends are harder)

23 Conclusions: Millions (not Billions)
How Big is the Web?
– Upper bound: O(Population)
– Not Billions
– Not Infinite
Shannon >> Chomsky
– How hard is search?
– Query Suggestions?
– Personalization?
Cluster in Cloud ($$$$) → Walk-in-the-Park ($)
Entropy is a great hammer

24 Conclusions: Personalization with Backoff
Personalization with Backoff
– Cuts search space (entropy) in half
– Backoff → Market Segmentation
Example: Business v. Consumer
– Need: Segments ask different questions at different times
– Value: Different advertising opportunities
Demographics:
– Partition by IP, day, hour, business/consumer query…
Future Work:
– Model combinations of surrogate variables
– Group users by similarity → collaborative search

25 Thanks!

26 Prediction Task: Historical Logs → Coverage of Future Demand
Training: Estimate Pr(x)
– Given what we know today,
– Estimate Pr(x), tomorrow’s demand for URL x
– (There is an infinite set of URLs x.)
Test: Score Pr(x) by Cross Entropy
– Given tomorrow’s demand, x_1 … x_k
– Score ≡ −log2 of the geometric mean of Pr(x_1) … Pr(x_k)
One forecast is better than another
– If it has better (lower) cross entropy
Cross entropy ≥ Entropy (H)
– Entropy is the score for the best possible forecast (that only God knows)
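The score on this slide, −log2 of the geometric mean of the forecast probabilities, is the same as the average of −log2 Pr(x_i) over tomorrow's demand. A minimal sketch follows. The forecasts and demand list are made-up examples, and the `floor` probability for URLs missing from a forecast is an assumption of this sketch, not something the slides specify.

```python
import math

def cross_entropy_bits(forecast, demand, floor=1e-12):
    """Average of -log2 Pr(x) over observed demand, i.e. -log2 of the
    geometric mean of the forecast probabilities.

    `floor` (an assumption of this sketch) stands in for the probability of
    URLs the forecast never saw, so the score stays finite.
    """
    total = sum(math.log2(max(forecast.get(x, 0.0), floor)) for x in demand)
    return -total / len(demand)

# Hypothetical forecasts for three URLs, and tomorrow's observed demand.
forecast_a = {"a": 0.5, "b": 0.25, "c": 0.25}
forecast_b = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
demand = ["a", "a", "b", "c"]

score_a = cross_entropy_bits(forecast_a, demand)  # 1.5 bits
score_b = cross_entropy_bits(forecast_b, demand)  # log2(3), about 1.585 bits
```

Forecast A matches the observed demand exactly, so its score equals the entropy of that demand. The uniform forecast B scores worse (higher).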

27 Millions, Not Billions (Until Market Saturates)
Telephones are a Mature Market
– Saturated
– Universal Service is limited by population
– Loops (telephone numbers) ≈ population
– Everyone (and every business) is listed in the phonebook (unless they have opted out)
Web is a Growth Market
– Decades from saturation
– When everybody and every product has a page, the number of pages will be bounded by the population
– In the meantime, millions are good enough

28 Smoothing (Cont.)
Use interpolation smoothing: where IP_i is the first i bytes of an IP address
– e.g., IP_4 = ; IP_2 = *.*
Use one month’s search log (Jan 06) as training data, and the new incoming log (Feb 06) as the test set
λ_i determined by the EM algorithm, minimizing the conditional cross entropy on the test set
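The slide's formula did not survive in this transcript. A common form of interpolation smoothing, assumed here purely for illustration, mixes the estimates from each backoff level: Pr(url | IP, Query) = Σ_i λ_i · Pr(url | IP_i, Query), with the λ_i non-negative and summing to 1. Under that assumption, EM for the weights can be sketched as below; the function name and the held-out numbers are hypothetical.

```python
def em_mixture_weights(component_probs, iterations=50):
    """Estimate interpolation weights (lambdas) by EM.

    component_probs: one row per held-out click; row[i] is the probability
    that backoff level i assigned to the URL that was actually clicked.
    Returns weights that sum to 1 and locally maximize held-out likelihood
    (equivalently, minimize cross entropy).
    """
    k = len(component_probs[0])
    lambdas = [1.0 / k] * k  # start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * k
        for row in component_probs:
            mix = sum(l * p for l, p in zip(lambdas, row))
            if mix <= 0.0:
                continue  # no level explains this click; skip it
            # E-step: share of responsibility each level takes for the click.
            for i in range(k):
                expected[i] += lambdas[i] * row[i] / mix
        total = sum(expected)
        if total == 0.0:
            break
        # M-step: renormalize the responsibilities into new weights.
        lambdas = [e / total for e in expected]
    return lambdas

# Hypothetical held-out data for three levels:
# [query only, first two IP bytes, full IP]. The full-IP level is often 0
# because most future IPs were never seen in training (sparse data).
held_out = [
    [0.2, 0.5, 0.0],
    [0.1, 0.4, 0.9],
    [0.3, 0.3, 0.0],
    [0.2, 0.6, 0.0],
]
weights = em_mixture_weights(held_out)
```

With data like this, most of the weight moves to the middle level, which is the "a little bit of personalization" outcome described earlier in the talk.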

29 Cross Validation
An IP seen in the future might not have been seen in the history
But parts of it are seen in the history
Cross Entropy: H(future | history)
(Figure labels: No personalization; Personalization with backoff; Complete personalization; Knows every byte; Knows at least two bytes)

30 Cross Validations
Weekends are harder to predict
Weekdays to predict weekdays >> weekends to predict weekdays
Day time to predict day time >> nights to predict day time

31 Partition by Day-Week (CV)
Training data: weekdays and weekends
Test data: weekdays and weekends
Weekdays are easier to predict from the history of weekdays
Weekends are more difficult to predict