An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators by Percy Liang and Michael Jordan (ICML 2008). Presented by Lihan He.

An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators by Percy Liang and Michael Jordan (ICML 2008). Presented by Lihan He, ECE, Duke University, June 27, 2008.

Outline
- Introduction
- Exponential family estimators: generative, fully discriminative, pseudolikelihood discriminative
- Asymptotic analysis
- Experiments
- Conclusions

Introduction
- Data points are not assumed to be drawn independently: there are correlations between them.
- Given the data, we therefore have to consider the joint distribution over all data points.
- Correspondingly, the overall likelihood is not the product of the per-data-point likelihoods.

Introduction: Generative vs. Discriminative
- Generative model: a model for randomly generating the observed data; learns a joint probability distribution over both observations and labels.
- Discriminative model: a model of the label variables conditioned only on the observed data; learns a conditional distribution over labels given observations.
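A minimal sketch (my own illustration, not the paper's code) contrasting the two on a toy binary problem: the generative estimator fits the joint p(x, y) by counting and predicts via Bayes' rule, while the discriminative estimator fits p(y | x) directly by logistic regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)                                     # binary labels
x = (rng.random(n) < np.where(y == 1, 0.8, 0.3)).astype(int)  # binary feature

# Generative: maximum likelihood for the joint p(x, y) via counts
p_y1 = y.mean()
p_x1_given_y = np.array([x[y == 0].mean(), x[y == 1].mean()])

def generative_posterior(xi):
    # p(y = 1 | x = xi) from the fitted joint, via Bayes' rule
    lik1 = p_x1_given_y[1] ** xi * (1 - p_x1_given_y[1]) ** (1 - xi)
    lik0 = p_x1_given_y[0] ** xi * (1 - p_x1_given_y[0]) ** (1 - xi)
    return p_y1 * lik1 / (p_y1 * lik1 + (1 - p_y1) * lik0)

# Discriminative: maximize the conditional likelihood directly
w, b = 0.0, 0.0
for _ in range(500):                      # plain gradient ascent
    p = 1 / (1 + np.exp(-(w * x + b)))
    w += 0.1 * np.mean((y - p) * x)
    b += 0.1 * np.mean(y - p)

# Both should give similar posteriors on this well-specified toy problem
print(generative_posterior(1), 1 / (1 + np.exp(-(w + b))))
```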

Introduction: Full Likelihood vs. Pseudolikelihood
- Full likelihood: the joint likelihood over all data points, respecting the full set of dependencies between them; can be intractable and computationally inefficient.
- Pseudolikelihood: an approximation of the full likelihood that replaces the joint by a product of local conditionals, e.g. $\prod_i p(z_i \mid z_{-i}; \theta)$; computationally more efficient.
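A minimal sketch (my illustration; the three-node model is hypothetical) of the two objectives on a tiny binary pairwise MRF: the full log-likelihood needs the global partition function, which is exponential in the number of nodes, while each pseudolikelihood term normalizes over a single node's two values only.

```python
import itertools
import numpy as np

edges = [(0, 1), (1, 2), (0, 2)]          # a 3-node fully connected MRF

def score(z, theta):
    # unnormalized log-probability: theta times the number of agreeing edges
    return theta * sum(float(z[i] == z[j]) for i, j in edges)

def full_loglik(z, theta):
    # exact log-partition by enumerating all 2^3 configurations
    logZ = np.log(sum(np.exp(score(c, theta))
                      for c in itertools.product([0, 1], repeat=3)))
    return score(z, theta) - logZ

def pseudo_loglik(z, theta):
    # sum over nodes of log p(z_i | z_{-i}); each term normalizes over
    # the two values of z_i only, so no global partition function
    total = 0.0
    for i in range(3):
        num = np.exp(score(z, theta))
        den = sum(np.exp(score(z[:i] + (v,) + z[i+1:], theta)) for v in (0, 1))
        total += np.log(num / den)
    return total

z = (1, 1, 0)
print(full_loglik(z, 1.0), pseudo_loglik(z, 1.0))
```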

Estimators: Exponential Family
$p_\theta(z) = \exp\{\langle \theta, \phi(z) \rangle - A(\theta)\}$, where $\phi(z)$ is the vector of features, $\theta$ the vector of model parameters, and $A(\theta)$ the log-normalization constant.
Example: conditional random field.
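A small sketch of this form on an enumerable space (my illustration), showing the standard exponential-family identity that the log-likelihood gradient equals empirical features minus expected features under the model:

```python
import itertools
import numpy as np

space = list(itertools.product([0, 1], repeat=2))  # z = (z1, z2)

def phi(z):
    # features: each coordinate, plus their agreement
    return np.array([z[0], z[1], float(z[0] == z[1])])

def A(theta):
    # log-partition function by enumeration
    return np.log(sum(np.exp(theta @ phi(z)) for z in space))

def expected_phi(theta):
    probs = np.array([np.exp(theta @ phi(z) - A(theta)) for z in space])
    return probs @ np.array([phi(z) for z in space])

theta = np.array([0.2, -0.1, 0.5])
data = [(1, 1), (0, 1), (1, 1)]
# gradient of the average log-likelihood: empirical minus model expectation
grad = np.mean([phi(z) for z in data], axis=0) - expected_phi(theta)
print(grad)
```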

Estimators: Composite Likelihood Estimators [Lindsay, 1988]
- One class of pseudolikelihood estimators.
- Consists of a weighted sum of component likelihoods, each of which is the probability of one subset of the data points conditioned on another.
- Partitions the output space according to components $r$ drawn from a fixed distribution $P_r$, and obtains the corresponding component likelihoods.
- Defines a criterion function that reflects the quality of the estimator; the maximum composite likelihood estimator maximizes this criterion over $\theta$.

Estimators: the three estimators compared in the paper (a sketch follows the list)
- Generative: one component, the joint likelihood $p_\theta(x, y)$.
- Fully discriminative: one component, the conditional likelihood $p_\theta(y \mid x)$.
- Pseudolikelihood discriminative: one component $p_\theta(y_j \mid y_{-j}, x)$ for each label variable.
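A minimal sketch (hypothetical two-label model and my own notation, not the paper's) writing the three criteria as composite likelihoods over z = (x, y):

```python
import itertools
import numpy as np

X = [0, 1]
Y = list(itertools.product([0, 1], repeat=2))   # structured label y = (y1, y2)

def logp_joint(x, y, theta):
    # unnormalized score, normalized over the whole (x, y) space
    s = lambda x_, y_: (theta[0] * x_ * y_[0] + theta[1] * x_ * y_[1]
                        + theta[2] * (y_[0] == y_[1]))
    logZ = np.log(sum(np.exp(s(x_, y_)) for x_ in X for y_ in Y))
    return s(x, y) - logZ

def crit_generative(x, y, theta):
    return logp_joint(x, y, theta)              # single component p(x, y)

def crit_discriminative(x, y, theta):
    # single component p(y | x) = p(x, y) / p(x)
    logZx = np.log(sum(np.exp(logp_joint(x, y_, theta)) for y_ in Y))
    return logp_joint(x, y, theta) - logZx

def crit_pseudolikelihood(x, y, theta):
    total = 0.0
    for j in range(2):                          # one component per label variable
        flips = [tuple(v if k == j else y[k] for k in range(2)) for v in (0, 1)]
        den = np.log(sum(np.exp(logp_joint(x, y_, theta)) for y_ in flips))
        total += logp_joint(x, y, theta) - den  # log p(y_j | y_{-j}, x)
    return total

theta = np.array([0.5, -0.3, 0.8])
print(crit_generative(1, (1, 0), theta),
      crit_discriminative(1, (1, 0), theta),
      crit_pseudolikelihood(1, (1, 0), theta))
```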

Estimators: Risk Decomposition
The risk of an estimator decomposes as: risk = Bayes risk + approximation error + estimation error. The estimation error arises because we have only finite data; the approximation error reflects the intrinsic suboptimality of the estimator. Defining $\theta_\infty$ as the limit of the estimator as the sample size grows, the approximation error depends only on $\theta_\infty$ and is unrelated to the data samples $z$.

Asymptotic Analysis
- Well-specified model: all three estimators achieve the $O(n^{-1})$ convergence rate.
- Misspecified model: only the fully discriminative estimator achieves the $O(n^{-1})$ rate; the other two degrade to $O(n^{-1/2})$.
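A quick Monte Carlo sketch (my illustration, using a Bernoulli MLE rather than the paper's estimators) of what an $O(n^{-1})$ rate looks like: the expected excess KL risk of a well-specified maximum likelihood estimate roughly halves each time n doubles.

```python
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.3

def kl(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

for n in (100, 200, 400, 800):
    ests = rng.binomial(n, p_true, size=20000) / n      # MLEs from 20000 replicates
    ests = np.clip(ests, 1 / (2 * n), 1 - 1 / (2 * n))  # keep logs finite
    print(n, np.mean(kl(p_true, ests)))                 # roughly halves as n doubles
```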

Asymptotic Analysis

Experiments
Toy example: a four-node binary-valued graphical model. The true model contains a coupling parameter that the learned model omits: when that parameter equals zero, the learned model is well-specified; when it is nonzero, the learned model is misspecified.
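A hypothetical reconstruction (the slide's exact model was lost in transcription; the chain structure, the extra edge, and all parameter values below are my assumptions): sample from a "true" four-node binary model with an extra coupling c, then fit a family without that coupling. With c = 0 the fit is well-specified and the KL gap vanishes; with c != 0 every estimator in the restricted family is misspecified and the gap persists.

```python
import itertools
import numpy as np

configs = list(itertools.product([0, 1], repeat=4))
chain = [(0, 1), (1, 2), (2, 3)]

def dist(theta, c):
    # chain couplings with weight theta, plus an extra 0-3 edge with weight c
    s = np.array([theta * sum(z[i] == z[j] for i, j in chain) + c * (z[0] == z[3])
                  for z in configs], dtype=float)
    p = np.exp(s)
    return p / p.sum()

rng = np.random.default_rng(2)
for c_true in (0.0, 1.5):                      # well-specified vs. misspecified
    p_true = dist(1.0, c_true)
    counts = rng.multinomial(2000, p_true)
    # fit theta within the no-extra-edge family (c = 0) by grid-search MLE
    grid = np.linspace(-2, 3, 501)
    ll = [counts @ np.log(dist(t, 0.0)) for t in grid]
    t_hat = grid[int(np.argmax(ll))]
    kl = p_true @ (np.log(p_true) - np.log(dist(t_hat, 0.0)))
    print(c_true, t_hat, kl)                   # KL stays > 0 when misspecified
```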

Experiments
[Figure: toy-example results for the well-specified and misspecified settings.]

Experiments
Part-of-speech (POS) tagging:
- Input: a sequence of words.
- Output: a sequence of POS tags, e.g. noun, verb, etc. (45 tags in total).
- Specified model: node features are indicator functions of (word, tag) pairs; edge features are indicator functions of (tag, tag) transitions.
- Training: Wall Street Journal, 38K sentences.
- Testing: Wall Street Journal, 5.5K sentences, from sections different from those used for training.
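A minimal sketch of such indicator features (the exact forms on the slide were lost in transcription; these are the usual CRF tagging features, and the words and tags below are made up):

```python
def node_feature(word, tag):
    # fires when position i has this (word, tag) pair
    return lambda x, y, i: float(x[i] == word and y[i] == tag)

def edge_feature(tag_a, tag_b):
    # fires when positions i, i+1 carry this tag transition
    return lambda x, y, i: float(y[i] == tag_a and y[i + 1] == tag_b)

x = ["the", "dog", "runs"]
y = ["DT", "NN", "VBZ"]
f_node = node_feature("dog", "NN")
f_edge = edge_feature("NN", "VBZ")
print(f_node(x, y, 1), f_edge(x, y, 1))       # both fire here: 1.0 1.0
```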

Experiments Use the learned generative model to sample 1000 training samples and 1000 test samples, as synthetic data.

Conclusions
- When the model is well-specified:
  - all three estimators achieve the $O(n^{-1})$ convergence rate;
  - there is no approximation error;
  - the asymptotic estimation errors are ordered: generative < fully discriminative < pseudolikelihood discriminative.
- When the model is misspecified:
  - the fully discriminative estimator still achieves the $O(n^{-1})$ convergence rate, but the other two estimators achieve only $O(n^{-1/2})$;
  - both the approximation error and the asymptotic estimation error of the fully discriminative estimator are lower than those of the generative and pseudolikelihood discriminative estimators.