Unsupervised Group Discovery in Relational Datasets: A Nonparametric Bayesian Approach
P.S. Koutsourelakis, School of Civil and Environmental Engineering, Cornell University


Unsupervised Group Discovery in Relational Datasets: A Nonparametric Bayesian Approach
P.S. Koutsourelakis
School of Civil and Environmental Engineering, Cornell University
Artificial Intelligence Seminar, 10/12/07
Joint work with T. Eliassi-Rad, LLNL

P.S. Koutsourelakis, Problem Setting
[Figure: four objects A, B, C, D, each with attributes (age, income, location, ...), linked by relations labeled friend, co-worker, and phone call; traditional clustering shown for comparison.]
 Can we improve clustering by using relational data?
 What if only relational data was available?
 Can we make predictions about missing links or attributes?

P.S. Koutsourelakis, Problem Setting
A collection of objects belonging to various types/domains (e.g. people, papers, locations, devices, movies)
Each object might have (observable) attributes
Links/relations between:
 – Two or more objects
 – Objects of the same or different types
 – Binary (absence/presence), integer- or real-valued
Each link might have (observable) attributes
Goal:
 Find groups of objects of each type, or
 Find common identities between objects of each type, or
 Organize objects into clusters that relate to each other in predictable ways

P.S. Koutsourelakis, Problem Setting
Given an adjacency matrix where R_i,j = 0 or 1 (observable), find the cluster assignments I_i (hidden/latent).

      A  B  C  D
  A   -  0  0  0
  B   0  -  0  0
  C   1  0  -  0
  D   0  1  1  -

Probabilistic (Bayesian) formulation: posterior ∝ likelihood × prior, i.e. p(I | R) ∝ p(R | I) p(I)

P.S. Koutsourelakis, Problem Setting
Likelihood: the relational behavior of the objects is completely determined by their cluster assignments I_i.
For example: a matrix η whose entry η(a, b) specifies the link probability between any two groups a and b.
Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T. & Ueda, N. Learning systems of concepts with an infinite relational model. AAAI 2006.
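A minimal sketch of this likelihood, assuming a single domain with a binary adjacency matrix R, group assignments I, and a group-to-group link-probability matrix eta (all names are illustrative, not the talk's own code):

```python
import numpy as np

def log_likelihood(R, I, eta):
    """Log-likelihood p(R | I, eta): each observed link R[i, j] is Bernoulli
    with probability eta[I[i], I[j]] given the cluster assignments I."""
    n = R.shape[0]
    ll = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # ignore self-links
            p = eta[I[i], I[j]]
            ll += R[i, j] * np.log(p) + (1 - R[i, j]) * np.log(1 - p)
    return ll
```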

P.S. Koutsourelakis, Augmented Problem Setting
If objects have observed attributes x_i, the likelihood can be augmented with an attribute term.
If links R_i,j are real-valued (e.g. duration of a phone call, number of bytes, etc.), the link likelihood is adjusted accordingly.
In both cases the new likelihood terms are functions of the group assignments.
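A sketch of how the likelihood could be augmented with per-object attributes, assuming a user-supplied attribute log-density log_lik_attr and per-group attribute parameters theta (both hypothetical names, reusing the log_likelihood sketch above):

```python
def log_joint_likelihood(R, X, I, eta, theta, log_lik_attr):
    """Augmented likelihood sketch: the relational term p(R | I, eta) plus an
    attribute term log p(x_i | theta[I[i]]) for every object i."""
    ll = log_likelihood(R, I, eta)  # relational part (sketch above)
    for i, x in enumerate(X):
        ll += log_lik_attr(x, theta[I[i]])  # attribute part, parameters tied to the group
    return ll
```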

P.S. Koutsourelakis, Problem Setting
We need a prior on group assignments, p(I).
 What is an appropriate prior p(K) on the number of clusters K? Groups are unlikely to be related as above.
 The distribution on I_i should be exchangeable: the order in which nodes are assigned can be permuted without changing the probability of the resulting partition.

P.S. Koutsourelakis, Nonparametric Bayesian Methods*
Bayesian methods are most powerful when your prior adequately captures your beliefs. Inflexible models (e.g. with a fixed number of groups) might yield unreasonable inferences.
Nonparametrics provide a way of getting very flexible models.
Nonparametric models can automatically infer an adequate model size/complexity from the data, without needing to explicitly perform Bayesian model comparison.
Many can be derived by starting with a finite parametric model and taking the limit as the number of parameters goes to infinity.
* Nonparametric doesn't mean there are no parameters, but that "the number of parameters grows with the data" (e.g. as in Parzen window density estimation).

P.S. Koutsourelakis, Chinese Restaurant Process (CRP)
[Figure: a restaurant menu with potentially infinitely many dishes.]

P.S. Koutsourelakis, Chinese Restaurant Process (CRP)
Each new customer joins an existing dish j with probability proportional to the number of people already eating dish j, and starts a new dish with probability proportional to γ.
Properties:
 CRP is exchangeable (i.e. the order in which customers entered doesn't matter)
 The number of groups grows as O(log n), where n is the number of nodes
 Inference with Gibbs sampling can be based on the conditionals above
 Larger γ favors more clusters
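A minimal sketch of the CRP sampling rule stated above; gamma is the concentration parameter, and the function name and interface are illustrative:

```python
import numpy as np

def sample_crp(n, gamma, rng=None):
    """Draw one partition of n customers from a CRP: customer i joins an
    existing dish j with probability proportional to the number of people
    already eating it, or starts a new dish with probability prop. to gamma."""
    rng = np.random.default_rng() if rng is None else rng
    assignments, counts = [], []
    for _ in range(n):
        probs = np.array(counts + [gamma], dtype=float)
        j = rng.choice(len(probs), p=probs / probs.sum())
        if j == len(counts):
            counts.append(1)   # open a new dish/group
        else:
            counts[j] += 1
        assignments.append(j)
    return assignments
```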

P.S. Koutsourelakis, Infinite Relational Model (IRM)
"Forward" interpretation (single domain):
1) Sample group assignments I_i from CRP(γ), resulting in K clusters
2) Sample η(a, b) i.i.d. from Beta(β_1, β_2) for all a, b = 1, 2, ..., K
3) Sample each R_i,j i.i.d. from Bernoulli(η(I_i, I_j))
From Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T. & Ueda, N. Learning systems of concepts with an infinite relational model. AAAI 2006.
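Putting the three steps together, a forward-sampling sketch of the single-domain IRM (reusing the sample_crp sketch above; all parameter names are illustrative):

```python
import numpy as np

def sample_irm(n, gamma, beta1, beta2, rng=None):
    """Forward-sample the single-domain IRM: I ~ CRP(gamma),
    eta[a, b] ~ Beta(beta1, beta2) i.i.d., R[i, j] ~ Bernoulli(eta[I_i, I_j])."""
    rng = np.random.default_rng() if rng is None else rng
    I = np.array(sample_crp(n, gamma, rng))     # step 1: group assignments
    K = I.max() + 1
    eta = rng.beta(beta1, beta2, size=(K, K))   # step 2: group-to-group link probabilities
    R = rng.binomial(1, eta[np.ix_(I, I)])      # step 3: binary links
    np.fill_diagonal(R, 0)                      # no self-links, by convention
    return I, eta, R
```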

P.S. Koutsourelakis, Application: Object-Feature Dataset
2 domains (animals + features)
Animals form two groups: birds + 4-legged mammals

P.S. Koutsourelakis, Application: Object-Feature Dataset
Maximum-Likelihood Configuration
Animal Domain
 Group 1: dove, hen, owl, falcon, eagle
 Group 2: duck, goose
 Group 3: fox, cat
 Group 4: horse, zebra
 Group 5: dog, wolf, tiger, lion, cow
Feature Domain
 Group 1: small, 2-legs, feathers, fly
 Group 2: medium, hunt
 Group 3: big, hooves, mane, run
 Group 4: 4-legs, hair
 Group 5: swim

P.S. Koutsourelakis, Application: Object-Feature Dataset

P.S. Koutsourelakis, Predicting Missing Links
Can we make predictions about missing links?
[Table: AUC and accuracy as a function of the percentage of missing links (10% and higher).]

P.S. Koutsourelakis, Infinite Relational Model (IRM)
Advantages:
 It is an unsupervised learner with only two tunable parameters, β and γ.
 It can be applied to multiple node types and relations.
 It has all the advantages of a Bayesian formulation (missing data, confidence intervals) and of nonparametric methods (adaptation to the data, outlier accommodation).
 It has been successfully used for co-clustering objects and features, learning ontologies, and social networks.
Disadvantages:
 Significant computational effort
 It does not capture "multiple personalities."

P.S. Koutsourelakis, "Multiple Personalities"
In real data, objects (e.g. people) do not belong exclusively to one group: their identity is a mixture of basic components. These components can be the same for each object type, but the mixing proportions might vary from one object to another.
IRM assumes that each object participates in all the relations it is involved in with a single identity.
A proper model should account for a different mixture for each object over all the possible identity components (which are common for the whole domain). This way we learn not only the groups of the population but also the existing mixtures of them.
This can be achieved by introducing a Bayesian hierarchy.
groups ≡ identities

P.S. Koutsourelakis, Mixed-Membership Model (MMM)
Q: Can we use an independent CRP for each object?
A: No, because the groups for each CRP will not be shared across objects.

P.S. Koutsourelakis, Chinese Restaurant Franchise
N restaurants with a common menu:
 Object 1 = restaurant 1
 Object 2 = restaurant 2
 ...
 Object N = restaurant N
Phase 1: Table Assignment
Phase 2: Dish Assignment
Y.W. Teh, M.I. Jordan, M.J. Beal and D.M. Blei. Hierarchical Dirichlet Processes. JASA, 2006.

P.S. Koutsourelakis, Chinese Restaurant Franchise
Phase 1 (table assignment): customer m at restaurant i sits at an existing table t with probability proportional to the number of customers already sitting at table t, or at a new table with probability proportional to the table-level concentration parameter.
Phase 2 (dish assignment): each new table in restaurant i is served an existing dish k with probability proportional to the number of tables (across all restaurants) already eating dish k, or a new dish with probability proportional to the dish-level concentration parameter.
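A sketch of seating a single new customer under these two phases. The data structures and names are illustrative; alpha and gamma stand for the table-level and dish-level concentration parameters, and rng is a numpy random Generator:

```python
import numpy as np

def crf_seat_customer(table_counts_i, table_dishes_i, dish_counts, alpha, gamma, rng=None):
    """Seat one new customer in restaurant i of a Chinese Restaurant Franchise.
    table_counts_i: customers per table in restaurant i.
    table_dishes_i: dish served at each table of restaurant i.
    dish_counts:    franchise-wide number of tables serving each dish.
    Returns the dish (shared identity component) the customer ends up eating."""
    rng = np.random.default_rng() if rng is None else rng
    # Phase 1: table assignment, proportional to occupancy, or alpha for a new table
    probs = np.array(table_counts_i + [alpha], dtype=float)
    t = rng.choice(len(probs), p=probs / probs.sum())
    if t < len(table_counts_i):
        table_counts_i[t] += 1
        return table_dishes_i[t]      # existing table keeps its dish
    table_counts_i.append(1)
    # Phase 2: the new table picks a dish, proportional to the number of tables
    # already serving it anywhere in the franchise, or gamma for a new dish
    dprobs = np.array(dish_counts + [gamma], dtype=float)
    k = rng.choice(len(dprobs), p=dprobs / dprobs.sum())
    if k == len(dish_counts):
        dish_counts.append(1)         # brand-new dish shared by all restaurants
    else:
        dish_counts[k] += 1
    table_dishes_i.append(k)
    return k
```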

P.S. Koutsourelakis, Mixed-Membership Model
[Conditional distributions for the dish assignment of node i]
Properties:
 - Has a few more parameters, γ_i, but also higher expressivity
 - Inference with Gibbs sampling can be based on the conditionals above

P.S. Koutsourelakis, Non-Identifiability
[Figure: two objects A and B and two groups 1, 2; A is 100% group 1, B is 50% group 1 and 50% group 2; a table gives the probability of a 1-link between any pair of groups.]

P.S. Koutsourelakis, Non-Identifiability
[Figure: A is 100% group 1; B is 50% group 1 and 50% group 2.]
 Different configurations (with 2, 3 or 4 groups) have the same likelihood
 Prior determines inference results

P.S. Koutsourelakis, Application: Mixed-Membership
1 domain – 16 objects
4 distinct identities
fully observed adjacency matrix

P.S. Koutsourelakis, Application: Mixed-Membership Model

P.S. Koutsourelakis, Application: Mixed-Membership Model

P.S. Koutsourelakis, Application: Mixed-Membership Model

P.S. Koutsourelakis, Application: Mixed-Membership
[Figure: error with respect to the actual probability that any pair of objects belongs to the same group, for IRM and MMM.]

P.S. Koutsourelakis, Application: Mixed-Membership
[Figure: IRM vs. MMM comparison.]

P.S. Koutsourelakis, Application: Mixed-Membership Model
2 domains (animals + features)
Animals form two groups: birds + 4-legged mammals

P.S. Koutsourelakis, Application: Mixed-Membership Model

P.S. Koutsourelakis, Application: Mixed-Membership Model COW: Average posterior pairwise probabilities of belonging to the same group

P.S. Koutsourelakis, Zachary's Karate Club (from M. Girvan and M.E.J. Newman, Proc. Natl. Acad. Sci. USA, 2002)
34 people
A disagreement between the administrator (node 34) and the instructor (node 1) led to a split of the club in two (circles and squares)
Used a binary matrix that records the "like" relation

P.S. Koutsourelakis, Zachary’s Karate Club

P.S. Koutsourelakis, Learning Hierarchies
Can we meaningfully infer a hierarchy of groups/identities?
[Figure: a tree over Identity 1, Identity 2, Identity 3, Identity 4, ranging from most general at the root to most specific at the leaves.]

P.S. Koutsourelakis, Learning Hierarchies
Nonparametric prior on trees
[Figure: levels 0 through L; each box is a different group/identity; level L is drawn from CRP_L(a_L), level L-1 from CRP_{L-1}(a_{L-1}), and so on.]
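One illustrative reading of this level-wise prior, in which each coarser level simply clusters the groups of the level below with its own CRP; this is a sketch under that assumption, not the exact HMMM construction (it reuses the sample_crp sketch above):

```python
import numpy as np

def sample_levelwise_tree(n, alphas, rng=None):
    """Sample a hierarchy of group assignments: the finest level clusters the
    n objects with CRP(alphas[-1]); each coarser level clusters the groups of
    the level below with its own CRP.  Returns one assignment list per level,
    finest first."""
    rng = np.random.default_rng() if rng is None else rng
    levels, items = [], n
    for alpha in reversed(alphas):                   # level L, L-1, ..., 1
        assignment = sample_crp(items, alpha, rng)   # CRP sketch defined earlier
        levels.append(assignment)
        items = max(assignment) + 1                  # groups become the items of the next level
    return levels
```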

P.S. Koutsourelakis, Learning Hierarchies
"Forward" interpretation (for a single domain): the Hierarchical Mixed-Membership Model (HMMM)

P.S. Koutsourelakis, Application: Artificial Dataset
1 domain – 40 objects
4 distinct identities
fully observed adjacency matrix

P.S. Koutsourelakis, Application: Artificial Dataset

P.S. Koutsourelakis, Application: Political Books
 43 liberal, 49 conservative, 13 neutral
 Links imply frequent co-purchasing by the same buyers (Amazon.com)

P.S. Koutsourelakis, Application: Political Books
[Figure: composition percentages of the inferred groups.]

P.S. Koutsourelakis, Reality Mining MIT Data
 1 node type (people)
 97 people + all outsiders merged into one node
 22 different positions (professor, staff, 1st-year grad, ...)
[Figure: Sloan 29%, faculty & staff 5%, students 52%, other 14%]

P.S. Koutsourelakis, Reality Mining MIT Data
[Figure: composition percentages of the inferred groups.]

P.S. Koutsourelakis, Conclusions and Outlook
 Relational data contain significant information about group structure.
 Bayesian models allow the analyst to make inferences about communities of interest while quantifying the level of confidence, even when a significant proportion of the data is missing.
 Nonparametric models provide very flexible priors that allow the model to adapt to the data.
 IRM is a very lightweight framework with a very wide range of applicability, but it cannot capture multiple identities.
 MMM and HMMM allow for increased flexibility and provide additional information about objects that simultaneously belong to several groups.
Challenges:
 Accelerated inference, especially when dealing with large datasets:
 - Variational methods
 - Sequential Monte Carlo
 Appropriate priors for time-dependent datasets are needed

P.S. Koutsourelakis, Application: Senate Vote 2002
 50 Democrats, 49 Republicans, 1 Independent
 Link R_i,j = 1 if the two senators:
 - voted the same
 - have both taken more (or both less) than the average contribution (average: $13,800)

P.S. Koutsourelakis, Application: Senate Vote 2002
[Figure: composition percentages of the inferred groups.]