Bayesian inference for Plackett-Luce ranking models

Bayesian inference for Plackett-Luce ranking models
John Guiver and Edward Snelson, Microsoft Research Cambridge (MSRC)

Distributions over orderings
Many problems in ML/IR concern ranked lists of items.
Data come in the form of multiple independent orderings of a set of K items.
How do we characterise such a set of orderings? We need to learn a parameterised probability model over orderings.
Examples: meta-search, multi-player games and competitions.

Notation
Easy to get confused! An ordering lists the items from first to last place; a ranking gives each item's rank position.
Ordering: B, A, C
Ranking: 2, 1, 3 for (A, B, C)

Distributions
Ranking distributions are defined over the domain of all K! rankings (or orderings).
A fully parameterised distribution would have a probability for each possible ranking, summing to 1; e.g. for three items there are 3! = 6 such probabilities, which live on a simplex.
A ranking distribution is a point in this simplex; a model is a parameterised family within the simplex.

Plackett-Luce: vase interpretation
Probability of the ordering red, green, blue: P(R, G, B) = vR / (vR + vG + vB) × vG / (vG + vB) × vB / vB.
In this talk we are looking at a particular parameterised ranking model with very nice properties, which we'll cover over the next two slides. In this model there is one parameter per item. Consider an example with three items, Red, Green and Blue, with parameters vR, vG and vB. In this picture these parameters represent the proportions of the different coloured balls in infinite urns. First we mix all the balls together; then we draw a ball at random, note its colour, discard every ball of that colour, and repeat. This sampling procedure leads to an ordering, i.e. one of the six orderings {RGB, RBG, ...}, and its probability is exactly the P-L likelihood for the observed ordering R, G, B.
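A minimal sketch of this sampling procedure (illustrative code, not from the talk; the parameter values are made up):

```python
import numpy as np

def sample_pl_ordering(v, rng=None):
    """Sample one ordering from a Plackett-Luce model with parameters v by
    repeatedly drawing from the remaining items with probability
    proportional to their parameters (the vase/urn procedure)."""
    if rng is None:
        rng = np.random.default_rng()
    remaining = list(range(len(v)))
    ordering = []
    while remaining:
        weights = np.array([v[i] for i in remaining])
        pick = rng.choice(len(remaining), p=weights / weights.sum())
        # all 'balls' of the chosen colour are discarded before the next draw
        ordering.append(remaining.pop(pick))
    return ordering

# three items R, G, B with illustrative parameters vR, vG, vB
print(sample_pl_ordering(np.array([0.5, 0.3, 0.2])))
```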

Plackett-Luce model
PL likelihood for a single complete ordering (written out below): at each stage we pick the item for ranking position k out of the items that have not yet been ranked, so a complete ordering has K factors.
More realistically, in ML we have many items but only some of them are ranked in any one datum, i.e. sparse data.
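The likelihood the slide refers to is the standard Plackett-Luce form; writing π(k) for the item in ranking position k:

```latex
P(\pi \mid v) \;=\; \prod_{k=1}^{K} \frac{v_{\pi(k)}}{v_{\pi(k)} + v_{\pi(k+1)} + \cdots + v_{\pi(K)}}
```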

Plackett-Luce: vase interpretation, top-N partial orderings
We only need as many factors as items ranked for a given data point.
PL is internally consistent: marginalising out items leads to the same model form, and the same likelihood as if they were not there.
The Bradley-Terry model is the special case of pairs; the top-N and pairwise forms are written out below.
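Written out (standard forms): for a top-N partial ordering the denominators still run over every item not yet ranked, and with exactly two items the model reduces to Bradley-Terry:

```latex
P\big(\pi(1), \ldots, \pi(N) \mid v\big) \;=\; \prod_{k=1}^{N} \frac{v_{\pi(k)}}{\sum_{j \in S_k} v_j},
\quad S_k = \{\text{all items}\} \setminus \{\pi(1), \ldots, \pi(k-1)\},
\qquad
P(i \succ j) \;=\; \frac{v_i}{v_i + v_j}.
```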

Luce's Choice Axiom
The probability of choosing wine as your favourite drink equals the probability of choosing an alcoholic drink as your favourite, times the probability that wine is your favourite alcoholic drink. This is not the same as a conditional probability (stated formally below).
P-L is the only multi-stage model (choose your favourite, then choose your next favourite from the remaining items, and so on) whose probabilities satisfy Luce's choice axiom at every stage.
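In symbols (the standard statement of the axiom; P(x; S) denotes the probability that x is chosen from the set S):

```latex
x \in B \subseteq A: \qquad P(x;\, A) \;=\; P(B;\, A)\, P(x;\, B),
\qquad \text{where } P(B;\, A) = \sum_{y \in B} P(y;\, A).
```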

Gumbel Thurstonian model
Thurstonian model: each item is represented by a score distribution on the real line.
[Figure: marginal matrix, i.e. the probability of each item appearing in each rank position.]
Assume a Thurstonian model in which the score distributions are identical apart from their means. Then the scores give rise to a Plackett-Luce model if and only if they are Gumbel distributed (Yellott); this is stated in full on the next slide.

Thurstonian models and Yellott's theorem
Assume a Thurstonian model with each score having an identical distribution except for its mean. Then the score distributions give rise to a Plackett-Luce model if and only if the scores are distributed according to a Gumbel distribution (Yellott).
The result depends on some nice properties of the Gumbel distribution: the pairwise comparison probability is a probit (erf) with Gaussian scores, but logistic with Gumbel scores.
Relation to the Plackett-Luce parameters: v = exp(mu/beta) (details below).
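Concretely, for Gumbel scores with locations μ_i and a common scale β (standard results):

```latex
F(s_i) = \exp\!\big(-e^{-(s_i - \mu_i)/\beta}\big), \qquad
P(s_i > s_j) = \frac{1}{1 + e^{-(\mu_i - \mu_j)/\beta}} = \frac{v_i}{v_i + v_j}, \qquad
v_i = e^{\mu_i/\beta}.
```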

Maximum likelihood estimation
Hunter (2004) describes a minorize/maximize (MM) algorithm to find the MLE. The MM algorithm is like EM (EM is a special case); the E-step is a specific example of minorization.
The MLE can over-fit with sparse data (especially incomplete rankings); we'll show an example of over-fitting later.
There is a strong assumption for convergence: "in every possible partition of the items into two nonempty subsets, some item in the second set ranks higher than some item in the first set at least once in the data". A sketch of the MM updates follows.
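A minimal sketch of the MM updates, in the spirit of Hunter (2004); this is illustrative code, not the authors' implementation, and the function and variable names are my own:

```python
import numpy as np

def plackett_luce_mm(orderings, n_items, n_iters=200, tol=1e-9):
    """Minorize/maximize (MM) updates for the Plackett-Luce MLE.
    `orderings` is a list of lists of item indices, each listing the
    observed items from first to last place; incomplete orderings
    simply list fewer items."""
    v = np.full(n_items, 1.0 / n_items)
    w = np.zeros(n_items)   # w[i]: number of stages at which item i is the one chosen
    stages = []             # items still available at each choice stage
    for o in orderings:
        for k in range(len(o) - 1):   # the last stage is a trivial factor, skip it
            w[o[k]] += 1
            stages.append(o[k:])
    for _ in range(n_iters):
        denom = np.zeros(n_items)
        for avail in stages:
            denom[avail] += 1.0 / v[avail].sum()
        # items that never appear in any stage keep their current value
        v_new = np.where(denom > 0, w / np.where(denom > 0, denom, 1.0), v)
        v_new /= v_new.sum()          # v is only identified up to a scale factor
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# two toy incomplete orderings over 5 items (0='A', ..., 4='E'): B A E and D E
print(plackett_luce_mm([[1, 0, 4], [3, 4]], n_items=5))
```

With these toy orderings (the same B A E and D E used later in the factor-graph example), item E is never ranked above anything, so its estimate is driven to zero and item C keeps its initial value: exactly the degenerate behaviour the convergence condition above rules out.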

Bayesian inference: factor graph
[Figure: factor graph with Gamma priors on the parameters vA, vB, vC, vD, vE and P-L likelihood factors for the observed orderings B A E and D E.]
Five items; two data points with incomplete rankings, B A E and D E. Item C is not involved in any rankings, so its posterior remains the prior.

Fully factored approximation
Posterior over the P-L parameters given N orderings, approximated as a fully factorised product of Gammas (see below).
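The two expressions the slide refers to, with the Gamma priors from the factor graph (the shared prior hyperparameters a_0, b_0 are placeholders):

```latex
p(v \mid \pi_1, \ldots, \pi_N) \;\propto\; \prod_{i=1}^{K} \mathrm{Gamma}(v_i;\, a_0, b_0) \prod_{n=1}^{N} P(\pi_n \mid v),
\qquad
q(v) \;=\; \prod_{i=1}^{K} \mathrm{Gamma}(v_i;\, a_i, b_i).
```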

Expectation Propagation [Minka 2001]
Iterate until convergence – only a single pass is needed if the graph is a tree or chain.
The marginal of a variable is the product of its incoming messages.
Computing the message from a factor to a variable requires a sum or integral over the factor and the variables connected to it, which is not always tractable. Therefore we use an extension called power EP; the generic update is written out below.
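For reference, the generic EP update for a single factor f_a (cavity, moment-matching projection onto the approximating family, site update); power EP, used in this work, replaces f_a in the projection by a fractional power of the factor (here α = -1) with a corresponding fractional site update:

```latex
q^{\setminus a}(v) \;\propto\; \frac{q(v)}{\tilde f_a(v)}, \qquad
q^{\text{new}}(v) \;=\; \operatorname{proj}\!\big[\, f_a(v)\, q^{\setminus a}(v) \,\big], \qquad
\tilde f_a^{\text{new}}(v) \;\propto\; \frac{q^{\text{new}}(v)}{q^{\setminus a}(v)}.
```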

Alpha-divergence
Similarity measures between two probability distributions: p is the truth and q an approximation. Let p, q be two distributions (they don't need to be normalised).
The Kullback-Leibler (KL) divergence and the alpha-divergence (α is any real number) are defined below.
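The two definitions referred to, in Minka's form for possibly unnormalised p and q:

```latex
\mathrm{KL}(p \,\|\, q) = \int p(x) \log\frac{p(x)}{q(x)}\, dx + \int \big(q(x) - p(x)\big)\, dx,
\qquad
D_{\alpha}(p \,\|\, q) = \frac{1}{\alpha(1 - \alpha)} \int \Big( \alpha\, p(x) + (1 - \alpha)\, q(x) - p(x)^{\alpha} q(x)^{1 - \alpha} \Big)\, dx.
```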

Alpha-divergence – special cases
Similarity measures between two distributions (p is the truth, and q an approximation). Well-known special cases are recovered at particular values of α (listed below); note p(x) in the denominator for α = -1.
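The standard special cases the slide alludes to (limits taken for α → 0 and α → 1):

```latex
\begin{aligned}
\alpha \to 0 &: \;\; \mathrm{KL}(q \,\|\, p) \\
\alpha \to 1 &: \;\; \mathrm{KL}(p \,\|\, q) \\
\alpha = \tfrac{1}{2} &: \;\; 2 \int \big(\sqrt{p(x)} - \sqrt{q(x)}\big)^{2}\, dx \quad \text{(Hellinger)} \\
\alpha = 2 &: \;\; \tfrac{1}{2} \int \frac{(p(x) - q(x))^{2}}{q(x)}\, dx \\
\alpha = -1 &: \;\; \tfrac{1}{2} \int \frac{(q(x) - p(x))^{2}}{p(x)}\, dx \quad \text{(note } p(x) \text{ in the denominator)}
\end{aligned}
```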

Minimum alpha-divergence
[Figure: the Gaussian q that minimises D_α(p||q) for a fixed p, shown for α = -∞, 0, 0.5, 1 and ∞.]

Structure of alpha space
[Figure: the real α axis. Smaller α is zero forcing, larger α is inclusive (zero avoiding); MF corresponds to α = 0 and BP/EP to α = 1.]

Bayesian inference: factor graph
[Figure: the same factor graph as before, with Gamma priors on vA, ..., vE and P-L factors for the orderings B A E and D E.]
A choice of α = -1 leads to a particularly nice simplification for the P-L likelihood: the resulting sum of Gammas can be projected back onto a single Gamma.

Inferring known parameters

Ranking NASCAR drivers
Hunter used this example of NASCAR races to illustrate MLE estimation in P-L: 36 races in the NASCAR 2002 season and 83 drivers. Some drivers competed in only a few races, some in all races, so the rankings are incomplete.
The table is ordered by the simplest possible ranking method, average place, and shows the rank given by the MLE estimate of P-L, the rank given by the EP estimate (ordered on means), and the MLE and EP parameter estimates.
Jones and Pruett each competed in only one race, but did well; MLE places them first and second. EP places them only 18th and 19th, with high uncertainty on their P-L parameters. In contrast, EP places Martin, who competed in all 36 races, at the top. Similarly, at the bottom end of the scale some drivers such as Morgan Shepherd and Dick Trickle have gone substantially down in the rankings compared to maximum likelihood, because there is more evidence that they consistently do badly. And there's Dick Trickle bringing up the rear...

Posterior rank distributions
Imagine all drivers did race in one race: where would you expect them to come?
[Figure: visualisations, for EP and for MLE, of the posterior rank distributions over driver ranks 1 to 83, i.e. the P-L distributions over ranks implied by the learnt parameters. Light = high probability, dark = low probability; the horizontal axis is driver rank from 1 to 83.]

Conclusions and future work
We have given an efficient Bayesian treatment of P-L models using power EP.
The advantages of the Bayesian approach are that it avoids over-fitting on sparse data, gives uncertainty information on the parameters, and gives an estimate of the model evidence.
Future work: mixture models; feature-based ranking models.

Thank you http://www.research.microsoft.com/infernet

Ranking movie genres

Incomplete orderings
Internally consistent: "the probability of a particular ordering does not depend on the subset from which the items are assumed to be drawn".
The likelihood for an incomplete ordering (only a few items, or the top-S items, are ranked) is simple: only include factors for those items that are actually ranked in datum n.
Suppose we have two sets of items A and B where B ⊆ A. Internal consistency means that the probability of a particular ordering of the items in B, marginalising over all possible unknown positions of the left-over items in A, is exactly the same as the P-L probability of ordering the items in B on their own, completely independently of A.

Power EP for Plackett-Luce
A choice of α = -1 leads to a particularly nice simplification for the P-L likelihood. As an example of the type of calculation in the EP updates, consider a factor connecting two items A and E: raising the factor to the power α = -1 makes the integral tractable, and the resulting sum (mixture) of Gammas can be projected back onto a single Gamma by moment matching; a sketch of such a projection follows.
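A minimal sketch of the kind of projection step mentioned here (my own illustration, not the paper's code): matching the mean and variance of a weighted mixture of Gammas with a single Gamma.

```python
import numpy as np

def project_gamma_mixture(weights, shapes, rates):
    """Project a mixture of Gamma densities (given mixture weights and
    shape/rate parameters) onto a single Gamma by matching mean and variance."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    a = np.asarray(shapes, dtype=float)
    b = np.asarray(rates, dtype=float)
    mean = np.sum(w * a / b)                  # E[x] of the mixture
    second = np.sum(w * a * (a + 1) / b**2)   # E[x^2] of the mixture
    var = second - mean**2
    shape = mean**2 / var                     # single Gamma with the same
    rate = mean / var                         #   mean and variance
    return shape, rate

# e.g. a two-component mixture (illustrative numbers)
print(project_gamma_mixture([0.7, 0.3], [2.0, 5.0], [1.0, 2.0]))
```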