1 Clustering in Generalized Linear Mixed Model Using Dirichlet Process Mixtures. Ya Xue, Xuejun Liao. April 1, 2005

2 Introduction Concept drift fits within the framework of the generalized linear mixed model, but it raises a new question: how to exploit the structure of the auxiliary data. Mixtures with a countably infinite number of components can be handled in a Bayesian framework by employing Dirichlet process priors.

3 Outline Part I: generalized linear mixed model Generalized linear model (GLM) Generalized linear mixed model (GLMM) Advanced applications Bayesian feature selection in GLMM Part II: nonparametric method Chinese restaurant process Dirichlet process (DP) Dirichlet process mixture models Variational inference for Dirichlet process mixtures

4 Part I Generalized Linear Mixed Model

5 Generalized Linear Model (GLM) A linear model specifies the relationship between a dependent (or response) variable Y and a set of predictor variables X_1, ..., X_p, so that Y = β_0 + β_1 X_1 + ... + β_p X_p + ε. GLM is a generalization of normal linear regression models to the exponential family (normal, Poisson, gamma, binomial, etc.).

6 Generalized Linear Model (GLM) GLM differs from the linear model in two major respects: the distribution of Y can be non-normal and does not have to be continuous, and Y is still predicted from a linear combination of the Xs, but the two are "connected" via a link function.
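As a minimal illustration (not part of the slides; the data and settings below are assumptions), a binomial GLM with a logit link can be fit with standard software:

```python
# Minimal GLM sketch: logistic regression via statsmodels on synthetic dose/response data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
dose = rng.uniform(0.0, 10.0, size=200)             # predictor X (assumed dose range)
p = 1.0 / (1.0 + np.exp(-(-2.0 + 0.4 * dose)))      # true success probability
y = rng.binomial(1, p)                               # binary response Y

X = sm.add_constant(dose)                            # intercept + dose
model = sm.GLM(y, X, family=sm.families.Binomial())  # binomial family, canonical logit link
result = model.fit()
print(result.params)                                 # estimated (intercept, slope)
```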

7 DDE Example: binomial distribution Scientific interest: does DDE exposure increase the risk of cancer? Test on rats; let i index rat. Dependent variable: y_i, a binary indicator of whether rat i develops cancer. Independent variable: dose of DDE exposure, denoted x_i.

8 Generalized Linear Model (GLM) Likelihood function of y_i: the Bernoulli (binomial) density with success probability p_i. Choosing the canonical link, logit(p_i) = β_0 + β_1 x_i, the likelihood becomes a logistic function of the linear predictor, as sketched below.
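A standard way to write this (a sketch, assuming a single dose covariate x_i):

```latex
% Bernoulli likelihood for rat i with success probability p_i
p(y_i \mid p_i) = p_i^{\,y_i} (1 - p_i)^{1 - y_i}, \qquad y_i \in \{0, 1\}
% Canonical (logit) link:
\mathrm{logit}(p_i) = \log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_i
% Substituting the link into the likelihood:
p(y_i \mid \beta, x_i) = \frac{\exp\{y_i(\beta_0 + \beta_1 x_i)\}}{1 + \exp(\beta_0 + \beta_1 x_i)}
```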

9 GLMM – Basic Model Returning to the DDE example, 19 labs all over the world participated in this bioassay. There are unmeasured factors that vary between the different labs, for example rodent diet. GLMM is an extension of the generalized linear model obtained by adding random effects to the linear predictor (Schall 1991).

10 GLMM – Basic Model The previous linear predictor is modified as η_ij = x_ij'β + z_ij'b_i, where i indexes lab and j indexes rat within lab. β are the "fixed" effects, parameters common to all rats. b_i are the "random" effects, deviations for lab i.

11 GLMM – Basic Model If we choose x_ij = z_ij, then all the regression coefficients are assumed to vary for the different labs. If we choose z_ij = 1, then only the intercept varies for the different labs (random intercept model).
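To make the random-intercept structure concrete, the following sketch simulates binary responses for rats nested within labs, each lab receiving its own random intercept; the dimensions and parameter values are assumptions for illustration.

```python
# Random-intercept logistic GLMM, generative sketch (illustrative values assumed).
import numpy as np

rng = np.random.default_rng(1)
n_labs, rats_per_lab = 19, 30
beta0, beta1 = -2.0, 0.4          # fixed effects shared by all rats
sigma_b = 0.8                     # std. dev. of lab-level random intercepts

b = rng.normal(0.0, sigma_b, size=n_labs)          # b_i, one per lab
lab = np.repeat(np.arange(n_labs), rats_per_lab)   # lab index i for each rat
dose = rng.uniform(0.0, 10.0, size=lab.size)       # x_ij: DDE dose for rat j in lab i

eta = beta0 + beta1 * dose + b[lab]                # linear predictor with z_ij = 1
p = 1.0 / (1.0 + np.exp(-eta))
y = rng.binomial(1, p)                             # binary cancer indicator

print(y.mean(), [round(v, 2) for v in b[:3]])
```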

12 GLMM – Implementation Gibbs sampling. Disadvantage: slow convergence; solution: hierarchical centering reparameterization (Gelfand 1994; Gelfand 1995). Deterministic methods are available only for logit and probit models: EM algorithm (Anderson 1985), simplex method (Im 1988).

13 GLMM – Advanced Applications Nested GLMM: within each lab, rats were group-housed with three rats per cage; let i index lab, j index cage, and k index rat. Crossed GLMM: for all labs, four dose protocols were applied to different rats; let i index lab, j index rat, and k indicate the protocol applied to rat (i, j).

14 GLMM – Advanced Applications Nested GLMM: within each lab, rats were group-housed with three rats per cage. Two-level GLMM: level I – lab, level II – cage. Crossed GLMM: for all labs, four dose protocols were applied to different rats. Rats are sorted into 19 groups by lab and into 4 groups by protocol.
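One way to write the two designs as linear predictors (a sketch with assumed notation):

```latex
% Nested design: cage j is nested within lab i, rat k within cage (i, j)
\eta_{ijk} = \mathbf{x}_{ijk}^{\top}\boldsymbol{\beta} + b_i + b_{ij},
\qquad b_i \sim N(0, \sigma_{\text{lab}}^2), \; b_{ij} \sim N(0, \sigma_{\text{cage}}^2)
% Crossed design: rat j in lab i receives protocol k; lab and protocol effects cross
\eta_{ijk} = \mathbf{x}_{ijk}^{\top}\boldsymbol{\beta} + b_i + c_k,
\qquad b_i \sim N(0, \sigma_{\text{lab}}^2), \; c_k \sim N(0, \sigma_{\text{prot}}^2)
```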

15 GLMM – Advanced Applications Temporal/spatial statistics: account for correlation between the random effects at different times/locations. Dynamic latent variable model (Dunson 2003): let i index patient and t index follow-up time; the random effects are then allowed to evolve over time.

16 GLMM – Advanced Applications Spatially varying coefficient processes (Gelfand 2003): random effects are modeled as a spatially correlated process. Possible application: a landmine field where landmines tend to be close together.

17 Bayesian Feature Selection in GLMM Simultaneous selection of fixed and random effects in GLMM (Cai and Dunson 2005). Mixture prior: a point mass at zero mixed with a continuous density, as sketched below.
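A generic sketch of such a mixture (spike-and-slab) prior, with assumed notation and not necessarily the exact prior of Cai and Dunson (2005):

```latex
% Spike-and-slab mixture prior on a coefficient \beta_j:
\beta_j \sim \pi_0\, \delta_0 + (1 - \pi_0)\, N(0, \sigma_\beta^2)
% \delta_0 is a point mass at zero; \pi_0 is the prior probability
% that the effect is excluded from the model.
```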

18 Bayesian Feature Selection in GLMM Fixed effects: choose mixture priors for the fixed-effects coefficients. Random effects: reparameterize via an LDU decomposition of the random-effect covariance and choose mixture priors for the elements of the diagonal matrix.

19 Missing Identification in GLMM Data table of the DDE bioassay: the first column identifies the lab for each rat (…, Berlin, Berlin, Tokyo, Tokyo, …). What if the first column is missing? This is an unusual case in statistics, so few people have worked on it, but it is the problem we have to solve for concept drift.

20 Concept Drift Primary data are modeled by the classifier directly; auxiliary data are allowed an additional drift term. If we treat the drift variable as a random variable, concept drift is a random intercept model, a special case of GLMM. A sketch follows below.
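A hedged sketch of one way to write this (the notation below is an assumption, not the authors' exact formulation): primary data are fit by an ordinary logistic classifier, while each auxiliary point is allowed a drift term that acts as a random intercept.

```latex
\text{Primary data: } P(y_i = 1 \mid \mathbf{x}_i) = \sigma(\mathbf{w}^{\top}\mathbf{x}_i),
\qquad
\text{Auxiliary data: } P(y_i = 1 \mid \mathbf{x}_i) = \sigma(\mathbf{w}^{\top}\mathbf{x}_i + \mu_i),
% where \sigma(\cdot) is the logistic function and \mu_i is the (random) drift variable.
```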

21 Clustering in Concept Drift [Figure: distribution of drift values over the auxiliary data; K = 51 clusters (including 0) out of 300 auxiliary data points; bin resolution = 1.]

22 Clustering in Concept Drift There are intrinsic clusters in the auxiliary data with respect to drift value. "The simplest explanation is best." (Occam's razor) Why don't we instead give each cluster a random effect variable?

23 Clustering in Concept Drift In typical statistical applications, we know which individuals share the same random effect. However, in concept drift, we do not know which individuals (data points or features) share the same random intercept. Can we train the classifier and cluster the auxiliary data simultaneously? This is a new problem we aim to solve.

24 Clustering in Concept Drift How many clusters (K) should we include in our model? Does choosing K actually make sense? Is there a better way?

25 Part II Nonparametric Method

26 Nonparametric method Parametric method: the forms of the underlying density functions are assumed known. Nonparametric methods form a wide category, e.g. nearest neighbors, minmax, bootstrapping... Nonparametric Bayesian method: make use of the Bayesian calculus without prior parameterized knowledge.

27 Cornerstones of NBM The Dirichlet process (DP) allows flexible structures to be learned and allows sharing of statistical strength among sets of related structures. The Gaussian process (GP) allows sharing in the context of multiple nonparametric regressions (a separate seminar on GP is suggested).

28 Chinese Restaurant Process The Chinese restaurant process (CRP) is a distribution on partitions of the integers. The CRP is used to represent uncertainty over the number of components in a mixture model.

29 Chinese Restaurant Process Unlimited number of tables; each table has an unlimited capacity to seat customers.

30 Chinese Restaurant Process The (m+1)th customer sits at a table drawn from the following distribution: P(sit at occupied table i) = m_i / (α + m), P(sit at the next unoccupied table) = α / (α + m), where m_i is the number of previous customers at table i and α is a concentration parameter.

31 Chinese Restaurant Process Example: the probability that the next customer sits at each table, illustrated below with assumed counts.
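For instance, suppose 10 customers are already seated, 5 at table 1, 3 at table 2, 2 at table 3, and α = 1 (assumed values); then

```latex
P(\text{table 1}) = \frac{5}{11}, \quad
P(\text{table 2}) = \frac{3}{11}, \quad
P(\text{table 3}) = \frac{2}{11}, \quad
P(\text{new table}) = \frac{\alpha}{10 + \alpha} = \frac{1}{11}
```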

32 Chinese Restaurant Process The CRP yields an exchangeable distribution on partitions of integers, i.e., the specific ordering of the customers is irrelevant. An infinite set of random variables is said to be infinitely exchangeable if, for every finite subset (x_1, ..., x_n), we have p(x_1, ..., x_n) = p(x_σ(1), ..., x_σ(n)) for any permutation σ.

33 Dirichlet Process G_0: any probability measure on the reals; (A_1, ..., A_r): a finite measurable partition. A random measure G is a Dirichlet process if the following holds for all partitions: (G(A_1), ..., G(A_r)) ~ Dir(α G_0(A_1), ..., α G_0(A_r)), where α is a concentration parameter. Note: Dir denotes the Dirichlet distribution, DP the Dirichlet process.

34 Dirichlet Process Denote a sample from the Dirichlet process as G ~ DP(α, G_0); G is itself a distribution. Denote a sample from the distribution G as θ ~ G. [Graphical model for a DP generating the parameters.]

35 Dirichlet Process Properties of DP: E[G(A)] = G_0(A) for any measurable set A; draws G are discrete with probability one; given samples θ_1, ..., θ_n from G, the posterior of G is again a Dirichlet process, DP(α + n, (α G_0 + Σ_i δ_θi) / (α + n)).

36 Dirichlet Process Marginalizing out G, the predictive distribution of a new θ_{n+1} given θ_1, ..., θ_n is (α G_0 + Σ_{i=1}^{n} δ_θi) / (α + n). This is the Chinese restaurant process.

37 DP Mixtures G ~ DP(α, G_0), θ_i ~ G, y_i ~ F(θ_i). If F is a normal distribution, this is a Gaussian mixture model with a countably infinite number of components.
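As a minimal generative sketch (parameter values are assumptions, not from the talk), the following draws data from a DP mixture of Gaussians by sampling table assignments from the CRP and drawing each cluster mean from the base measure G_0.

```python
# Generative sketch of a DP mixture of Gaussians via the CRP (values assumed).
import numpy as np

rng = np.random.default_rng(2)
alpha, n = 1.0, 500
mu0, tau0 = 0.0, 5.0          # base measure G0 = N(mu0, tau0^2) over cluster means
sigma = 0.5                   # within-cluster standard deviation (F is normal)

counts, means, data = [], [], []
for i in range(n):
    # CRP: join table k with prob counts[k]/(i + alpha), new table with prob alpha/(i + alpha)
    probs = np.array(counts + [alpha]) / (i + alpha)
    k = rng.choice(len(probs), p=probs)
    if k == len(counts):                      # new cluster: draw its mean from G0
        counts.append(0)
        means.append(rng.normal(mu0, tau0))
    counts[k] += 1
    data.append(rng.normal(means[k], sigma))  # y_i ~ F(theta_k)

print(f"{len(counts)} clusters generated for n={n}, alpha={alpha}")
```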

38 Applications of DP Infinite Gaussian Mixture Model (Rasmussen 2000) Infinite Hidden Markov Model (Beal 2002) Hierarchical Topic Models and the Nested Chinese Restaurant Process (Blei 2004)

39 Implementation of DP Gibbs sampling. If G_0 is a conjugate prior for the likelihood given by F: (Escobar 1995). Non-conjugate prior: (Neal 1998). A sketch of the conjugate case follows.
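To make the conjugate case concrete, here is a compact sketch of collapsed Gibbs sampling for a DP mixture of Gaussians with known variance and a normal base measure; the hyperparameters and test data are assumptions, and this is an illustration rather than the exact algorithm of the cited papers.

```python
# Collapsed Gibbs sketch for a conjugate DP mixture of Gaussians (hyperparameters assumed).
import numpy as np

MU0, TAU0, SIGMA = 0.0, 5.0, 0.5      # base measure N(MU0, TAU0^2); known data std SIGMA

def log_pred(y, n_k, s_k):
    """Log predictive density of y for a cluster with n_k members summing to s_k."""
    var_post = 1.0 / (1.0 / TAU0**2 + n_k / SIGMA**2)       # posterior variance of cluster mean
    mean_post = var_post * (MU0 / TAU0**2 + s_k / SIGMA**2)  # posterior mean of cluster mean
    v = var_post + SIGMA**2                                   # predictive variance
    return -0.5 * np.log(2 * np.pi * v) - 0.5 * (y - mean_post) ** 2 / v

def gibbs_dpm(y, alpha=1.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    z = np.zeros(len(y), dtype=int)                           # all points start in one cluster
    for _ in range(n_iter):
        for i in range(len(y)):
            z[i] = -1                                         # remove point i from its cluster
            labels = [k for k in np.unique(z) if k >= 0]
            logw = []
            for k in labels:                                  # weight for each existing cluster
                mask = z == k
                logw.append(np.log(mask.sum()) + log_pred(y[i], mask.sum(), y[mask].sum()))
            logw.append(np.log(alpha) + log_pred(y[i], 0, 0.0))   # weight for a new cluster
            logw = np.array(logw)
            w = np.exp(logw - logw.max())
            k_new = rng.choice(len(w), p=w / w.sum())
            z[i] = labels[k_new] if k_new < len(labels) else (max(labels, default=-1) + 1)
    return z

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(m, SIGMA, 100) for m in (-3.0, 0.0, 3.0)])
print("inferred clusters:", len(np.unique(gibbs_dpm(y))))
```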

40 Variational Inference for DPM The goal is to compute the predictive density under the DP mixture. We minimize the KL divergence between the true posterior p and a variational distribution q. This algorithm is based on the stick-breaking representation of the DP, shown below. (I would suggest a separate seminar on the stick-breaking view of DP and variational DP.)
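The stick-breaking representation referred to here is the standard construction:

```latex
V_k \sim \mathrm{Beta}(1, \alpha), \qquad
\pi_k = V_k \prod_{j=1}^{k-1} (1 - V_j), \qquad
\theta_k^{*} \sim G_0, \qquad
G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\theta_k^{*}}
```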

41 Open Questions Can we apply ideas of infinite models beyond identifying the number of states or components in a mixture? Under what conditions can we expect these models to give consistent estimates of densities? ... Specific to our problem: the model is non-conjugate due to the sigmoid link function.