BBM: Bayesian Browsing Model from Petabyte-scale Data
Chao Liu, MSR-Redmond; Fan Guo, Carnegie Mellon University; Christos Faloutsos, Carnegie Mellon University

Presentation transcript:

Massive Log Streams
Search log
– 10+ terabytes each day (and growing)
– Involves billions of distinct (query, url) pairs
Questions
– Can we infer user-perceived relevance for each (query, url) pair?
– How many passes over the data are needed? Is one enough?
– Can the inference be parallelized?
Our answer: Yes, Yes, and Yes!

BBM: Bayesian Browsing Model
[Figure: graphical model for one query with results URL 1 to URL 4. Each position i has a relevance variable S_i, an examination variable E_i (examine snippet), and a click variable C_i (click-through).]

Dependencies in BBM
[Figure: dependencies among S_1, E_1, C_1, S_2, E_2, C_2, …, S_i, E_i, C_i. An annotation marks the preceding click position before i.]

Road Map
– Exact Model Inference
– Algorithms through an Example
– Experiments
– Conclusions

Notations
For a given query
– Top-M positions, usually M = 10
  Positional relevance
  M(M+1)/2 combinations of (r, d)
– n search instances
– N documents impressed in total
  Document relevance

Model Inference Ultimate goal Observation: conditional independence

P(C|S) by Chain Rule Likelihood of search instance From S to R:

Putting things together
Posterior, re-organized by the R_j's:
– How many times d_j was clicked
– How many times d_j was not clicked when it is at position (r + d) and the preceding click is on position r
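The formulas on this slide did not survive extraction. The block below is a hedged reconstruction that is consistent with the two counts described above, not a copy of the slide. The symbols are assumptions introduced here: N_j is the number of clicks on d_j, N_{j,r,d} is the number of non-clicks on d_j at position r + d with the preceding click at position r, and \beta_{r,d} is the probability that a snippet is examined in that configuration. A uniform prior on each R_j is also assumed.

```latex
% Assumed form of the joint posterior (uniform prior on each R_j)
p(R_1, \dots, R_N \mid C^{1:n})
  \;\propto\;
  \prod_{j=1}^{N} \left[ R_j^{\,N_j}
  \prod_{r=0}^{M-1} \prod_{d=1}^{M-r}
  \bigl(1 - \beta_{r,d}\, R_j\bigr)^{N_{j,r,d}} \right]
```

Under this form the product runs over M(M+1)/2 pairs (r, d), which matches the number of combinations given on the Notations slide.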

What This Tells Us
– Exact inference with the joint posterior in closed form
– The joint posterior factorizes, so the R_j's are mutually independent
– At most M(M+1)/2 + 1 numbers fully characterize each posterior
– Count vector:
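To illustrate why a short count vector is enough, here is a minimal Python sketch that evaluates one document's posterior on a grid. It assumes the posterior is proportional to R^(clicks) times a product of (1 - beta[r, d] * R)^(non-clicks at (r, d)); that form, the function names, and the beta values are assumptions for illustration, not taken from the slides.

```python
import math


def posterior_weights(n_clicks, nonclick_counts, beta, grid_size=1001):
    """Discretized posterior over relevance R in [0, 1] for one document.

    n_clicks: number of times the document was clicked.
    nonclick_counts: dict mapping (r, d) to a non-click count.
    beta: dict mapping (r, d) to an ASSUMED examination probability.
    """
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    tiny = 1e-300  # floor that keeps log() finite at the interval ends

    log_w = []
    for x in grid:
        value = n_clicks * math.log(max(x, tiny))
        for (r, d), count in nonclick_counts.items():
            value += count * math.log(max(1.0 - beta[(r, d)] * x, tiny))
        log_w.append(value)

    # Subtract the peak before exponentiating to avoid underflow everywhere.
    peak = max(log_w)
    weights = [math.exp(v - peak) for v in log_w]
    total = sum(weights)
    return grid, [w / total for w in weights]


def posterior_mean(grid, weights):
    """Posterior mean of R under the discretized weights."""
    return sum(x * w for x, w in zip(grid, weights))


# Illustrative numbers only: 3 clicks, 3 non-clicks in two (r, d) slots.
beta = {(0, 1): 1.0, (1, 1): 0.6}
grid, weights = posterior_weights(3, {(0, 1): 1, (1, 1): 2}, beta)
print(round(posterior_mean(grid, weights), 3))
```

The only inputs are the click count and the per-(r, d) non-click counts, so no raw log data is needed once the counts exist.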

LearnBBM: One-Pass Counting
Find R_j

An Example
[Figure: worked example computing the count vector entries N_4 and N_{4,r,d}.]
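The one-pass counting idea can be sketched in Python as follows. This is a minimal illustration, not the authors' LearnBBM code: the input layout (a list of document ids by position plus a parallel list of 0/1 clicks) and the name `learn_bbm_counts` are assumptions, and r = 0 is used here to mean "no preceding click".

```python
from collections import Counter, defaultdict


def learn_bbm_counts(instances):
    """Single pass over search instances, accumulating per-document counts.

    instances: iterable of (doc_ids, clicks); doc_ids lists documents from
    the top position down, clicks is a parallel list of 0/1 flags.
    Returns (click_counts, nonclick_counts) where nonclick_counts[doc]
    maps (r, d) to a count.
    """
    click_counts = Counter()
    nonclick_counts = defaultdict(Counter)

    for doc_ids, clicks in instances:
        prev_click = 0  # r: position of the preceding click, 0 if none yet
        for pos, (doc, clicked) in enumerate(zip(doc_ids, clicks), start=1):
            if clicked:
                click_counts[doc] += 1
                prev_click = pos
            else:
                # d is the distance from the preceding click position.
                nonclick_counts[doc][(prev_click, pos - prev_click)] += 1

    return click_counts, nonclick_counts


# Two made-up search instances for one query.
instances = [
    (["u1", "u2", "u3"], [0, 1, 0]),
    (["u1", "u3", "u4"], [1, 0, 0]),
]
clicks, nonclicks = learn_bbm_counts(instances)
print(dict(clicks))
print({doc: dict(c) for doc, c in nonclicks.items()})
```

Each instance is read once and only counters are updated, which is what makes a single pass over the log sufficient.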

LearnBBM on MapReduce
Map: emit((q,u), idx)
Reduce: construct the count vector

Example on MapReduce
Map: (U1, 0) (U2, 4) (U3, 0)
Map: (U1, 1) (U3, 0) (U4, 7)
Map: (U1, 1) (U3, 0) (U4, 0)
Reduce: (U1, 0, 1, 1) (U2, 4) (U4, 0, 7) (U3, 0, 0, 0)
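The example above can be replayed in plain Python to show the grouping and counting steps. This is a sketch under assumptions: it simulates the shuffle in memory rather than using a real MapReduce framework, the function names are hypothetical, and `idx` is treated as an opaque integer slot in the count vector because the slide does not define its encoding.

```python
from collections import Counter, defaultdict

# Pairs emitted by the three map tasks, copied from the slide.
map_outputs = [
    [("U1", 0), ("U2", 4), ("U3", 0)],
    [("U1", 1), ("U3", 0), ("U4", 7)],
    [("U1", 1), ("U3", 0), ("U4", 0)],
]


def shuffle(outputs):
    """Group emitted idx values by key, as the framework would before reduce."""
    grouped = defaultdict(list)
    for task_output in outputs:
        for key, idx in task_output:
            grouped[key].append(idx)
    return grouped


def reduce_to_count_vector(indices):
    """Build a sparse count vector: slot idx -> number of occurrences."""
    return Counter(indices)


grouped = shuffle(map_outputs)
count_vectors = {key: reduce_to_count_vector(v) for key, v in grouped.items()}

for key in sorted(grouped):
    print(key, sorted(grouped[key]), dict(count_vectors[key]))
```

A sparse counter is used instead of a dense vector because most slots stay at zero for a typical (query, url) pair; that is a design choice of this sketch.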

Experiments
Compare with the User Browsing Model (Dupret and Piwowarski, SIGIR'08)
– The same dependence structure
– But point estimation of document relevance rather than Bayesian
– Approximate inference through iterations
Data:
– Collected in August and September 2008
– 10 algorithmic results only
– Split into training/test sets according to time stamps for each query
– 51 million search instances of 1.15 million distinct queries, 10X larger than the SIGIR'08 study

Overall Comparison on Log-Likelihood Experiments in 20 batches LL Improvement Ratio =

Comparison w.r.t. Frequency Intuition – Hard to predict clicks for infrequent queries – Easy for frequent ones

Model Comparison on Efficiency 57 times faster

Petabyte-Scale Experiment
Setup:
– 8 weeks of data, 8 jobs
– Job k takes the first k weeks of data
Experiment platform
– SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets [Chaiken et al., VLDB'08]

Scalability of BBM
– Increasing computation load: more queries, more URLs, more impressions
– Near-constant elapsed time
[Figure panels: Computation Overload; Elapse Time on SCOPE]
– 3 hours
– Scan 265 terabytes of data
– Full posteriors for 1.15 billion (query, url) pairs

Bayesian Browsing Model for search streams
– Exact Bayesian inference
– Joint posterior in closed form
– A single pass suffices
– Map-Reducible for parallelism
– Amenable to incremental updates
– Well suited to mining click streams
Models for other stream data
– Browsing, twittering, Web 2.0, etc.?
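The incremental-update point follows from the counts being additive. A minimal Python sketch, assuming count vectors are stored as sparse counters keyed by (query, url); the storage layout and the name `merge_counts` are hypothetical:

```python
from collections import Counter


def merge_counts(existing, new_batch):
    """Add count vectors from a new batch of logs to the stored ones.

    Both arguments map a (query, url) key to a Counter of slot -> count.
    The stored counts are not modified; a merged copy is returned.
    """
    merged = {key: Counter(vec) for key, vec in existing.items()}
    for key, vec in new_batch.items():
        merged.setdefault(key, Counter()).update(vec)
    return merged


# Made-up counts: one stored pair, then a new day of data.
stored = {("q", "U1"): Counter({0: 1, 1: 2})}
today = {("q", "U1"): Counter({0: 1}), ("q", "U4"): Counter({7: 1})}
updated = merge_counts(stored, today)
print({key: dict(vec) for key, vec in updated.items()})
```

Because merging is plain addition, processing new data never requires rescanning the older logs.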

Thanks!