Announcements  Project proposal is due on 03/11  Three seminars this Friday (EB 3105) Dealing with Indefinite Representations in Pattern Recognition.

Slides:



Advertisements
Similar presentations
Generative Models Thus far we have essentially considered techniques that perform classification indirectly by modeling the training data, optimizing.
Advertisements

ECG Signal processing (2)
Image classification Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing them?
Data Mining Classification: Alternative Techniques
An Introduction of Support Vector Machine
CSCI 347 / CS 4206: Data Mining Module 07: Implementations Topic 03: Linear Models.
Supervised Learning Recap
Middle Term Exam 03/01 (Thursday), take home, turn in at noon time of 03/02 (Friday)
Chapter 4: Linear Models for Classification
Segmentation and Fitting Using Probabilistic Methods
Overview Full Bayesian Learning MAP learning
Lecture 17: Supervised Learning Recap Machine Learning April 6, 2010.
Bayesian Learning Rong Jin. Outline MAP learning vs. ML learning Minimum description length principle Bayes optimal classifier Bagging.
Support Vector Machines (SVMs) Chapter 5 (Duda et al.)
Decision Tree Rong Jin. Determine Milage Per Gallon.
Logistic Regression Rong Jin. Logistic Regression Model  In Gaussian generative model:  Generalize the ratio to a linear model Parameters: w and c.
Logistic Regression Rong Jin. Logistic Regression Model  In Gaussian generative model:  Generalize the ratio to a linear model Parameters: w and c.
Announcements  Homework 4 is due on this Thursday (02/27/2004)  Project proposal is due on 03/02.
CES 514 – Data Mining Lecture 8 classification (contd…)
Expectation Maximization Algorithm
Classification 10/03/07.
Generative Models Rong Jin. Statistical Inference Training ExamplesLearning a Statistical Model  Prediction p(x;  ) Female: Gaussian distribution N(
Classification and Prediction by Yen-Hsien Lee Department of Information Management College of Management National Sun Yat-Sen University March 4, 2003.
Bayesian Learning Rong Jin.
Expectation-Maximization (EM) Chapter 3 (Duda et al.) – Section 3.9
Bayes Classifier, Linear Regression 10701/15781 Recitation January 29, 2008 Parts of the slides are from previous years’ recitation and lecture notes,
Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)
Ensemble Learning (2), Tree and Forest
Radial Basis Function Networks
Crash Course on Machine Learning
Methods in Medical Image Analysis Statistics of Pattern Recognition: Classification and Clustering Some content provided by Milos Hauskrecht, University.
EM and expected complete log-likelihood Mixture of Experts
1 Logistic Regression Adapted from: Tom Mitchell’s Machine Learning Book Evan Wei Xiang and Qiang Yang.
CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 16 Nov, 3, 2011 Slide credit: C. Conati, S.
1 SUPPORT VECTOR MACHINES İsmail GÜNEŞ. 2 What is SVM? A new generation learning system. A new generation learning system. Based on recent advances in.
CHAPTER 7: Clustering Eick: K-Means and EM (modified Alpaydin transparencies and new transparencies added) Last updated: February 25, 2014.
Empirical Research Methods in Computer Science Lecture 7 November 30, 2005 Noah Smith.
CSE 446 Logistic Regression Winter 2012 Dan Weld Some slides from Carlos Guestrin, Luke Zettlemoyer.
Today Ensemble Methods. Recap of the course. Classifier Fusion
Machine Learning CUNY Graduate Center Lecture 4: Logistic Regression.
MACHINE LEARNING 8. Clustering. Motivation Based on E ALPAYDIN 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2  Classification problem:
Chapter 11 Statistical Techniques. Data Warehouse and Data Mining Chapter 11 2 Chapter Objectives  Understand when linear regression is an appropriate.
Overview of the final test for CSC Overview PART A: 7 easy questions –You should answer 5 of them. If you answer more we will select 5 at random.
ECE 5984: Introduction to Machine Learning Dhruv Batra Virginia Tech Topics: –Classification: Logistic Regression –NB & LR connections Readings: Barber.
An Introduction to Support Vector Machine (SVM)
Linear Models for Classification
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Mehdi Ghayoumi MSB rm 132 Ofc hr: Thur, a Machine Learning.
Linear Methods for Classification Based on Chapter 4 of Hastie, Tibshirani, and Friedman David Madigan.
Ensemble Methods in Machine Learning
Final Exam Review CS479/679 Pattern Recognition Dr. George Bebis 1.
Elements of Pattern Recognition CNS/EE Lecture 5 M. Weber P. Perona.
Logistic Regression William Cohen.
1 Machine Learning Lecture 9: Clustering Moshe Koppel Slides adapted from Raymond J. Mooney.
CSC321: Introduction to Neural Networks and Machine Learning Lecture 23: Linear Support Vector Machines Geoffrey Hinton.
Linear Models (II) Rong Jin. Recap  Classification problems Inputs x  output y y is from a discrete set Example: height 1.8m  male/female?  Statistical.
Machine Learning: A Brief Introduction Fu Chang Institute of Information Science Academia Sinica ext. 1819
Hierarchical Mixture of Experts Presented by Qi An Machine learning reading group Duke University 07/15/2005.
SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.
Tree and Forest Classification and Regression Tree Bagging of trees Boosting trees Random Forest.
1 Kernel Machines A relatively new learning methodology (1992) derived from statistical learning theory. Became famous when it gave accuracy comparable.
CMPS 142/242 Review Section Fall 2011 Adapted from Lecture Slides.
Unsupervised Learning Part 2. Topics How to determine the K in K-means? Hierarchical clustering Soft clustering with Gaussian mixture models Expectation-Maximization.
Combining Models Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya.
Data Mining Lecture 11.
Pattern Recognition CS479/679 Pattern Recognition Dr. George Bebis
دانشگاه صنعتی امیرکبیر Instructor : Saeed Shiry
Unsupervised Learning II: Soft Clustering with Gaussian Mixture Models
Text Categorization Berlin Chen 2003 Reference:
Linear Discrimination
Presentation transcript:

Announcements  Project proposal is due on 03/11  Three seminars this Friday (EB 3105) Dealing with Indefinite Representations in Pattern Recognition (10:00 am - 11:00 am) Computational Analysis of Drosophila Gene Expression Pattern Image (11:00 am - 12:00 pm) 3D General Lesion Segmentation in CT (3:00 pm - 4:00 pm)

Hierarchical Mixture Expert Model Rong Jin

Good Things about Decision Trees: decision trees introduce nonlinearity through the tree structure (e.g., viewing A^B^C as A*B*C). Compared to kernel methods, they are less ad hoc and easier to understand.

Example (figure): a kernel method splitting at x = 0 versus a generalized tree, on data labeled +, -, -, +. In general, mixture models are powerful at fitting complex decision boundaries; related combination methods include stacking, boosting, and bagging.

Generalize Decision Trees (from slides of Andrew Moore): each node of a decision tree depends on only a single feature. Is this the best idea?

Partition Datasets  The goal of each node is to partition the data set into disjoint subsets such that each subset is easier to classify. Original Dataset Partition by a single attribute cylinders = 4 cylinders = 5 cylinders = 6 cylinders = 8

Partition Datasets (cont'd): more complicated partitions split the original dataset by multiple attributes (e.g., conditions combining cylinders and weight, such as weight < 3 tons), plus other cases. How do we accomplish such a complicated partition? Note that each partition corresponds to a class: partitioning a dataset into disjoint subsets is the same as classifying it into multiple classes, so we can use a classification model at each node.

A More General Decision Tree (figure): a decision tree with simple, single-attribute data partitions versus a decision tree that uses classifiers for data partitions; in the latter, each node is a linear classifier over the attributes (attribute 1, attribute 2).

General Schemes for Decision Trees: each node within the tree is a linear classifier. Pros:
- usually results in shallow trees;
- introduces nonlinearity into linear classifiers (e.g., logistic regression);
- overcomes overfitting through the regularization mechanism built into the classifier;
- partitions datasets with soft memberships, which is a better way to deal with real-valued attributes.
Examples: neural networks and the Hierarchical Mixture Expert Model.

Hierarchical Mixture Expert Model (HME). The input x enters a router r(x) in the group layer, which sends it to Group 1 (gate g1(x)) or Group 2 (gate g2(x)); each gate then selects one of its experts in the expert layer (m1,1(x), m1,2(x) for Group 1; m2,1(x), m2,2(x) for Group 2). A classifier (expert) determines the class for the input x; the router decides which classifier x should be routed to.

Hierarchical Mixture Expert Model (HME): which group should be used for classifying x?

Hierarchical Mixture Expert Model (HME): the router decides r(x) = +1, so x is routed to Group 1.

Hierarchical Mixture Expert Model (HME): which expert should be used for classifying x?

Hierarchical Mixture Expert Model (HME): the gate decides g1(x) = -1, so x is routed to expert m1,2.

Hierarchical Mixture Expert Model (HME): the expert outputs m1,2(x) = +1, so the class label for x is +1.

Hierarchical Mixture Expert Model (HME), more complicated case: which group should be used for classifying x?

Hierarchical Mixture Expert Model (HME), more complicated case: the router outputs probabilities r(+1|x) = ¾ and r(-1|x) = ¼.

Hierarchical Mixture Expert Model (HME), more complicated case: given r(+1|x) = ¾ and r(-1|x) = ¼, which expert should be used for classifying x?

Hierarchical Mixture Expert Model (HME), more complicated case: r(+1|x) = ¾, r(-1|x) = ¼; g1(+1|x) = ¼, g1(-1|x) = ¾; g2(+1|x) = ½, g2(-1|x) = ½; and the experts output p(+1|x) / p(-1|x) of m1,1: ¼ / ¾, m1,2: ¾ / ¼, m2,1: ¼ / ¾, m2,2: ¾ / ¼. How do we compute the probabilities p(+1|x) and p(-1|x)?

HME: Probabilistic Description. Introduce a random variable g ∈ {1, 2} for the group: r(+1|x) = p(g=1|x) and r(-1|x) = p(g=2|x). Introduce a random variable m ∈ {11, 12, 21, 22} for the expert: g1(+1|x) = p(m=11|x, g=1), g1(-1|x) = p(m=12|x, g=1), g2(+1|x) = p(m=21|x, g=2), and g2(-1|x) = p(m=22|x, g=2).
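The equations on these slides were images that did not survive transcription; under the definitions above, the predictive distribution marginalizes over the hidden group and expert variables (a reconstruction consistent with the worked example that follows):

```latex
p(y\mid x) \;=\; \sum_{g\in\{1,2\}} p(g\mid x)\,\sum_{m\in\mathcal{M}_g} p(m\mid x, g)\,p(y\mid x, m),
\qquad \mathcal{M}_1=\{11,12\},\quad \mathcal{M}_2=\{21,22\}.
```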

HME: Probabilistic Description (worked example). With r(+1|x) = ¾, r(-1|x) = ¼; g1(+1|x) = ¼, g1(-1|x) = ¾; g2(+1|x) = ½, g2(-1|x) = ½; and expert outputs p(+1|x) / p(-1|x) of m1,1: ¼ / ¾, m1,2: ¾ / ¼, m2,1: ¼ / ¾, m2,2: ¾ / ¼, compute P(+1|x) and P(-1|x).
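The worked solution on the original slide was an image; the short Python sketch below (an illustration, not the author's code) evaluates the marginalization above with the numbers from this example:

```python
# p(y|x) = sum_g p(g|x) * sum_m p(m|x,g) * p(y|x,m), with the numbers from this slide.
r = {1: 3/4, 2: 1/4}                                   # router: p(g|x)
gate = {1: {11: 1/4, 12: 3/4}, 2: {21: 1/2, 22: 1/2}}  # gates: p(m|x, g)
expert = {11: {+1: 1/4, -1: 3/4}, 12: {+1: 3/4, -1: 1/4},
          21: {+1: 1/4, -1: 3/4}, 22: {+1: 3/4, -1: 1/4}}  # experts: p(y|x, m)

def p_y_given_x(y):
    # Marginalize over the hidden group g and expert m.
    return sum(r[g] * sum(pm * expert[m][y] for m, pm in gate[g].items())
               for g in r)

print(p_y_given_x(+1))  # 0.59375  (= 19/32)
print(p_y_given_x(-1))  # 0.40625  (= 13/32)
```

So for this example p(+1|x) = 19/32 and p(-1|x) = 13/32.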

Hierarchical Mixture Expert Model (HME): combining the router r(x), the gates g1(x), g2(x), and the experts m1,1(x), m1,2(x), m2,1(x), m2,2(x) produces the prediction y for the input x. Is HME more powerful than a simple majority-vote approach?

Problem with Training HME: we use logistic regression to model r(x), the g(x)'s, and the m(x)'s, but there are no direct training examples for r(x) and g(x). For each training example (x, y) we do not know its group ID or expert ID, so we cannot apply the standard logistic regression training procedure to r(x) and g(x) directly. The random variables g and m are called hidden variables because they are not observed in the training data. How do we train a model with incomplete data?

Start with a Random Guess: training data +: {1, 2, 3, 4, 5}, -: {6, 7, 8, 9}. Iteration 1 (random guess): randomly assign the points to groups and experts, e.g., group 1 gets {1, 2} and {6, 7}, group 2 gets {3, 4, 5} and {8, 9}, and the experts receive {1}{6}, {2}{7}, {3}{9}, and {4, 5}{8}. Then learn r(x), g1(x), g2(x), m11(x), m12(x), m21(x), m22(x) from these assignments. Now, what should we do?

Refine the HME Model: iteration 2 regroups the data points, reassigning the group membership of each data point and the expert membership within each group, e.g., group 1 now gets {1, 5} and {6, 7}, group 2 gets {2, 3, 4} and {8, 9}. But how?

Determine Group Memberships: consider an example (x, +1) with the same model as before: r(+1|x) = ¾, r(-1|x) = ¼; g1(+1|x) = ¼, g1(-1|x) = ¾; g2(+1|x) = ½, g2(-1|x) = ½; expert outputs p(+1|x) / p(-1|x) of m1,1: ¼ / ¾, m1,2: ¾ / ¼, m2,1: ¼ / ¾, m2,2: ¾ / ¼. Compute the posterior p(g|x, +1) on your own sheet!

Determine Expert Memberships: for the same example (x, +1) and the same router, gate, and expert outputs, compute the posterior over the experts, p(m|x, +1, g).
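The posterior calculations on these slides were also images; the sketch below reconstructs the Bayes-rule computations they ask for, using the same numbers (again an illustration, not the author's code):

```python
# Posteriors for one example (x, y=+1), with the numbers from the slides.
y = +1
r = {1: 3/4, 2: 1/4}                                   # p(g|x)
gate = {1: {11: 1/4, 12: 3/4}, 2: {21: 1/2, 22: 1/2}}  # p(m|x, g)
expert = {11: {+1: 1/4, -1: 3/4}, 12: {+1: 3/4, -1: 1/4},
          21: {+1: 1/4, -1: 3/4}, 22: {+1: 3/4, -1: 1/4}}  # p(y|x, m)

# p(y|x, g) = sum_m p(m|x,g) p(y|x,m), and p(y|x) = sum_g p(g|x) p(y|x,g)
p_y_given_g = {g: sum(pm * expert[m][y] for m, pm in gate[g].items()) for g in r}
p_y = sum(r[g] * p_y_given_g[g] for g in r)

# Group posteriors p(g|x,y) and expert posteriors p(m|x,y,g) via Bayes' rule.
post_g = {g: r[g] * p_y_given_g[g] / p_y for g in r}
post_m = {g: {m: gate[g][m] * expert[m][y] / p_y_given_g[g] for m in gate[g]} for g in r}

print(post_g)   # {1: 0.789..., 2: 0.210...}
print(post_m)   # {1: {11: 0.1, 12: 0.9}, 2: {21: 0.25, 22: 0.75}}
```

For this example, group 1 receives posterior weight of about 0.79 and, within group 1, expert m1,2 receives weight 0.9; these posteriors become the soft memberships used to retrain the model.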

Refine the HME Model: iteration 2 regroups the data points by computing the posteriors p(g|x, y) and p(m|x, y, g) for each training example (x, y) and then retraining r(x), g1(x), g2(x), m11(x), m12(x), m21(x), m22(x) using the estimated posteriors; e.g., group 1 now gets {1, 5} and {6, 7}, group 2 gets {2, 3, 4} and {8, 9}.

Logistic Regression: Soft Memberships. Example: train r(x). Each training example contributes to the likelihood of r(x) with a soft membership weight, namely its group posterior.
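The objective on the original slide was an image; a hedged reconstruction of the soft-membership training criterion for the router, where the group posteriors act as per-example weights:

```latex
\max_{\theta_r}\ \sum_{i=1}^{N}\Big[\, p(g{=}1\mid x_i, y_i)\,\log r(+1\mid x_i;\theta_r)
  \;+\; p(g{=}2\mid x_i, y_i)\,\log r(-1\mid x_i;\theta_r) \Big]
```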

Logistic Regression: Soft Memberships. Example: train m11(x). Each training example contributes with a soft membership weight given by the posterior probability that it was routed to that expert.
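A minimal sketch of the weighted (soft-membership) training step, assuming a feature matrix X, labels y in {+1, -1}, and a vector of per-example soft membership weights; this is an illustration of the idea, not the author's implementation:

```python
import numpy as np

def train_weighted_logreg(X, y, weights, lr=0.1, n_iters=500, reg=1e-3):
    """Fit p(+1|x) = sigmoid(w.x + b) by gradient ascent on the weighted log-likelihood
    sum_i weights[i] * log sigmoid(y_i * (w.x_i + b)) - (reg/2) * ||w||^2.
    X: (N, d) features; y: (N,) labels in {+1, -1}; weights: (N,) soft memberships."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margin = y * (X @ w + b)
        # Gradient of the weighted log-likelihood with respect to the decision value f(x_i).
        coef = weights * y * (1.0 - 1.0 / (1.0 + np.exp(-margin)))
        w += lr * ((X.T @ coef) / N - reg * w)
        b += lr * coef.sum() / N
    return w, b
```

Setting all weights to 1 recovers ordinary logistic regression; in the M-step the weights are the posteriors computed in the E-step.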

Start with a Random Guess (cont'd): iteration 2 regroups the data points, e.g., group 1 now gets {1, 5} and {6, 7}, group 2 gets {2, 3, 4} and {8, 9}, and the experts receive {1}{6}, {5}{7}, {2, 3}{9}, and {4}{8}; compute the posteriors p(g|x, y) and p(m|x, y, g) for each training example (x, y) and retrain r(x), g1(x), g2(x), m11(x), m12(x), m21(x), m22(x). Repeat this procedure until it converges (it is guaranteed to converge to a local maximum of the likelihood). This is the famous Expectation-Maximization (EM) algorithm!

Formal EM Algorithm for HME
Unknowns: the logistic regression models r(x; θ_r), {g_i(x; θ_g)}, and {m_i(x; θ_m)}, and the group and expert memberships p(g|x, y) and p(m|x, y, g).
E-step: fix the logistic regression models and estimate the memberships:
- estimate p(g=1|x, y) and p(g=2|x, y) for all training examples;
- estimate p(m=11, 12|x, y, g=1) and p(m=21, 22|x, y, g=2) for all training examples.
M-step: fix the memberships and learn the logistic regression models:
- train r(x; θ_r) using the soft memberships p(g=1|x, y) and p(g=2|x, y);
- train g1(x; θ_g) and g2(x; θ_g) using the soft memberships p(m=11, 12|x, y, g=1) and p(m=21, 22|x, y, g=2);
- train m11(x; θ_m), m12(x; θ_m), m21(x; θ_m), and m22(x; θ_m) using the soft memberships p(m=11, 12|x, y, g=1) and p(m=21, 22|x, y, g=2).
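A compact Python sketch of this EM loop, reusing train_weighted_logreg from the earlier sketch; the model layout (one router, two gates, four experts), the helper names, and the random soft initialization are illustrative assumptions, not the original implementation:

```python
import numpy as np
# Illustrative sketch only: reuses train_weighted_logreg() from the earlier block.

def p_pos(model, X):
    w, b = model
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))          # p(+1 | x) under a logistic model

def fit_soft(X, p_target, base_w):
    # Fit a logistic model to soft targets by duplicating each example with labels +1 / -1,
    # weighted by base_w * p_target and base_w * (1 - p_target) respectively.
    X2 = np.vstack([X, X])
    y2 = np.concatenate([np.ones(len(X)), -np.ones(len(X))])
    w2 = np.concatenate([base_w * p_target, base_w * (1.0 - p_target)])
    return train_weighted_logreg(X2, y2, w2)

def hme_em(X, y, rounds=20, seed=0):
    # EM for the 2-group / 4-expert HME of the slides; y must be in {+1, -1}.
    rng = np.random.default_rng(seed)
    N = len(X)
    post_g1 = rng.uniform(0.2, 0.8, N)        # p(g=1 | x, y): random initial guess (iteration 1)
    post_m1 = rng.uniform(0.2, 0.8, N)        # p(m=11 | x, y, g=1)
    post_m2 = rng.uniform(0.2, 0.8, N)        # p(m=21 | x, y, g=2)
    for _ in range(rounds):
        # M-step: retrain router, gates, and experts with soft-membership weights.
        router = fit_soft(X, post_g1, np.ones(N))
        gate1 = fit_soft(X, post_m1, post_g1)
        gate2 = fit_soft(X, post_m2, 1.0 - post_g1)
        joint = {11: post_g1 * post_m1, 12: post_g1 * (1 - post_m1),
                 21: (1 - post_g1) * post_m2, 22: (1 - post_g1) * (1 - post_m2)}
        experts = {m: train_weighted_logreg(X, y, w) for m, w in joint.items()}
        # E-step: recompute the posteriors p(g|x,y) and p(m|x,y,g) from the current models.
        like = {m: np.where(y == 1, p_pos(mod, X), 1 - p_pos(mod, X)) for m, mod in experts.items()}
        r1, g1, g2 = p_pos(router, X), p_pos(gate1, X), p_pos(gate2, X)
        py_g1 = g1 * like[11] + (1 - g1) * like[12]
        py_g2 = g2 * like[21] + (1 - g2) * like[22]
        py = r1 * py_g1 + (1 - r1) * py_g2
        post_g1 = r1 * py_g1 / py
        post_m1, post_m2 = g1 * like[11] / py_g1, g2 * like[21] / py_g2
    return router, (gate1, gate2), experts
```

Training the router and gates against soft targets is done here by duplicating each example with both labels, which is one simple way to optimize the weighted objectives of the M-step.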

What Are We Doing? What is the objective of Expectation-Maximization? It is still simple maximum likelihood: the EM algorithm tries to maximize the log-likelihood of the observed data. Most of the time it converges to a local maximum, not a global one; an improved version is annealing EM.
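The objective on the original slide was an image; a reconstruction of the log-likelihood that EM maximizes for this model, in the notation of the earlier slides:

```latex
\ell(\theta_r,\theta_g,\theta_m)\;=\;\sum_{i=1}^{N}\log \sum_{g\in\{1,2\}} p(g\mid x_i;\theta_r)
  \sum_{m\in\mathcal{M}_g} p(m\mid x_i, g;\theta_g)\, p(y_i\mid x_i, m;\theta_m)
```

Each EM iteration cannot decrease this quantity, which is why the procedure converges, but only to a local maximum (or stationary point) of the log-likelihood.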

Annealing EM
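The body of this slide was an image; one common form of annealing EM (deterministic annealing EM) tempers the E-step posteriors with an inverse temperature β that is gradually raised to 1, and that is assumed here to be the variant intended:

```latex
p_\beta(g\mid x, y)\;\propto\;\big[\,p(g\mid x)\,p(y\mid x, g)\,\big]^{\beta},
\qquad 0<\beta\le 1,\ \ \beta\ \text{increased toward}\ 1\ \text{over the iterations.}
```

Smoothing the posteriors early on makes the procedure less sensitive to the random initial assignments.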

Improve HME: it is sensitive to the initial assignments; how can we reduce this risk? Possible extensions: binary tree to K-way trees; logistic regression to conditional exponential models; tree structure: can we determine the optimal tree structure for a given dataset?

Comparison of Classification Models  The goal of classifier Predicting class label y for an input x Estimate p(y|x)  Gaussian generative model p(y|x) ~ p(x|y) p(y): posterior = likelihood  prior Difficulty in estimating p(x|y) if x comprises of multiple elements  Naïve Bayes: p(x|y) ~ p(x 1 |y) p(x 2 |y)… p(x d |y)  Linear discriminative model Estimate p(y|x) Focusing on finding the decision boundary

Comparison of Classification Models  Logistic regression model A linear decision boundary: w  x+b A probabilistic model p(y|x) Maximum likelihood approach for estimating weights w and threshold b

Comparison of Classification Models  Logistic regression model Overfitting issue  In text classification problem, words that only appears in only one document will be assigned with infinite large weight Solution: regularization  Conditional exponential model  Maximum entropy model A dual problem of conditional exponential model

Comparison of Classification Models  Support vector machine Classification margin Maximum margin principle: two objective  Minimize the classification error over training data  Maximize classification margin Support vector  Only support vectors have impact on the location of decision boundary denotes +1 denotes -1 Support Vectors

Comparison of Classification Models  Separable case  Noisy case Quadratic programming!

Comparison of Classification Models  Similarity between logistic regression model and support vector machine Log-likelihood can be viewed as a measurement of accuracy Identical terms Logistic regression model is almost identical to support vector machine except for different expression for classification errors

Comparison of Classification Models (figure): generative models have trouble at the decision boundary; the figure contrasts a classification boundary that achieves the least training error with one that achieves a large margin.