Slide 1 EE3J2 Data Mining
Lecture 10: Statistical Modelling
Martin Russell

Slide 2 EE3J2 Data Mining Objectives
• To review basic statistical modelling
• To review the notion of a probability distribution
• To review the notion of a probability density function
• To introduce mixture densities
• To introduce the multivariate Gaussian density

Slide 3 EE3J2 Data Mining Discrete variables
• Suppose that Y is a random variable which can take any value in a discrete set X = {x_1, x_2, …, x_M}
• Suppose that y_1, y_2, …, y_N are samples of the random variable Y
• If c_m is the number of times that y_n = x_m, then an estimate of the probability that Y takes the value x_m is given by:
  P(Y = x_m) ≈ c_m / N
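For concreteness, here is a minimal Python sketch of this counting estimate; the sample values and variable names are illustrative, not from the lecture:

```python
from collections import Counter

# Hypothetical samples of a discrete random variable Y taking values in X = {'a', 'b', 'c'}
y = ['a', 'b', 'a', 'c', 'a', 'b', 'b', 'a']
N = len(y)

counts = Counter(y)                                    # c_m = number of times x_m occurs in the sample
pmf = {x_m: c_m / N for x_m, c_m in counts.items()}    # estimate P(Y = x_m) ~ c_m / N

print(pmf)   # {'a': 0.5, 'b': 0.375, 'c': 0.125}
```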

Slide 4 EE3J2 Data Mining Discrete Probability Mass Function
(Table of symbols with their number of occurrences and the resulting probability estimates.)

Slide 5 EE3J2 Data Mining Continuous Random Variables
• In most practical applications the data are not restricted to a finite set of values – they can take any value in N-dimensional space
• Simply counting the number of occurrences of each value is no longer a viable way of estimating probabilities…
• …but there are generalisations of this approach which are applicable to continuous variables – these are referred to as non-parametric methods

Slide 6 EE3J2 Data Mining Continuous Random Variables
• An alternative is to use a parametric model
• In a parametric model, probabilities are defined by a small set of parameters
• The simplest example is a normal, or Gaussian, model
• A Gaussian probability density function (PDF) is defined by two parameters – its mean μ and its variance σ²

Slide 7 EE3J2 Data Mining Gaussian PDF
• 'Standard' 1-dimensional Gaussian PDF:
  – mean μ = 0
  – variance σ² = 1

Slide 8 EE3J2 Data Mining Gaussian PDF
(Plot of a Gaussian PDF: the shaded area between the points a and b is P(a ≤ x ≤ b).)

Slide 9 EE3J2 Data Mining Gaussian PDF
• For a 1-dimensional Gaussian PDF p with mean μ and variance σ²:
  p(y | μ, σ²) = (1 / √(2πσ²)) · exp( −(y − μ)² / (2σ²) )
  – the constant 1/√(2πσ²) ensures that the area under the curve is 1
  – the exponential term defines the 'bell' shape
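A minimal NumPy sketch of this density (the function name and test value are my own, not from the slides):

```python
import numpy as np

def gaussian_pdf(y, mu, var):
    """1-dimensional Gaussian PDF with mean mu and variance var."""
    norm = 1.0 / np.sqrt(2.0 * np.pi * var)              # constant ensuring unit area
    return norm * np.exp(-(y - mu) ** 2 / (2.0 * var))   # exponential gives the 'bell' shape

# The 'standard' Gaussian of Slide 7: mu = 0, variance = 1
print(gaussian_pdf(0.0, mu=0.0, var=1.0))   # ~0.3989, the height of the peak
```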

Slide 10 EE3J2 Data Mining More examples
(Plots of Gaussian PDFs with variances σ² = 0.1, 1.0, 10.0 and 5.0.)

Slide 11 EE3J2 Data Mining Fitting a Gaussian PDF to Data
• Suppose y = y_1, …, y_n, …, y_N is a set of N data values
• Given a Gaussian PDF p with mean μ and variance σ², define:
  p(y | μ, σ²) = ∏_{n=1}^{N} p(y_n | μ, σ²)
• How do we choose μ and σ² to maximise this probability?
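In practice the product of many small densities underflows, so this quantity is usually computed as a log-likelihood; a hedged sketch (the data values are invented for illustration):

```python
import numpy as np

def log_likelihood(y, mu, var):
    """log p(y | mu, var) = sum_n log p(y_n | mu, var) for a 1-D Gaussian."""
    y = np.asarray(y)
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - (y - mu) ** 2 / (2.0 * var))

y = [1.2, 0.8, 1.1, 0.9, 1.0]               # illustrative data centred near 1
print(log_likelihood(y, mu=1.0, var=0.02))  # good fit: relatively high log-likelihood
print(log_likelihood(y, mu=5.0, var=0.02))  # poor fit: much lower log-likelihood
```

This also gives a numerical version of the 'poor fit / good fit' comparison on the next slide.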

Slide 12 EE3J2 Data Mining Fitting a Gaussian PDF to Data
(Two plots: a Gaussian giving a poor fit to the data and one giving a good fit.)

Slide 13 EE3J2 Data Mining Maximum Likelihood Estimation
• Define the best-fitting Gaussian to be the one such that p(y | μ, σ²) is maximised
• Terminology:
  – p(y | μ, σ²), thought of as a function of y, is the probability (density) of y
  – p(y | μ, σ²), thought of as a function of μ, σ², is the likelihood of μ, σ²
• Maximising p(y | μ, σ²) with respect to μ, σ² is called Maximum Likelihood (ML) estimation of μ, σ²

Slide 14 EE3J2 Data Mining ML estimation of μ, σ²
• Intuitively:
  – the maximum likelihood estimate of μ should be the average value of y_1, …, y_N (the sample mean)
  – the maximum likelihood estimate of σ² should be the variance of y_1, …, y_N (the sample variance)
• This turns out to be true: p(y | μ, σ²) is maximised by setting:
  μ = (1/N) ∑_{n=1}^{N} y_n and σ² = (1/N) ∑_{n=1}^{N} (y_n − μ)²
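A short NumPy check of these formulas on synthetic data (the generated values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=3.0, size=10_000)   # synthetic data: true mean 2, variance 9

mu_ml  = np.mean(y)                  # ML estimate of the mean: the sample mean
var_ml = np.mean((y - mu_ml) ** 2)   # ML estimate of the variance: divide by N, not N-1

print(mu_ml, var_ml)                 # should be close to 2.0 and 9.0
```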

Slide 15 EE3J2 Data Mining Multi-modal distributions
• In practice the distributions of many naturally occurring phenomena do not follow the simple bell-shaped Gaussian curve
• For example, if the data arise from several different sources, there may be several distinct peaks (e.g. the distribution of heights of adults)
• These peaks are the modes of the distribution, and the distribution is called multi-modal

Slide 16 EE3J2 Data Mining Gaussian Mixture PDFs
• Gaussian Mixture PDFs, or Gaussian Mixture Models (GMMs), are commonly used to model multi-modal, or other non-Gaussian, distributions
• A GMM is just a weighted average of several Gaussian PDFs, called the component PDFs
• For example, if p_1 and p_2 are Gaussian PDFs, then p(y) = w_1 p_1(y) + w_2 p_2(y) defines a 2-component Gaussian mixture PDF

Slide 17 EE3J2 Data Mining Gaussian Mixture – Example
• 2-component mixture model
  – Component 1: μ = 0, σ² = 0.1
  – Component 2: μ = 2, σ² = 1
  – w_1 = w_2 = 0.5

Slide 18 EE3J2 Data Mining Example 2
• 2-component mixture model
  – Component 1: μ = 0, σ² = 0.1
  – Component 2: μ = 2, σ² = 1
  – w_1 = 0.2, w_2 = 0.8

Slide 19 EE3J2 Data Mining Example 3
• 2-component mixture model
  – Component 1: μ = 0, σ² = 0.1
  – Component 2: μ = 2, σ² = 1
  – w_1 = 0.2, w_2 = 0.8

Slide 20 EE3J2 Data Mining Example 4
• 5-component Gaussian mixture PDF

Slide 21 EE3J2 Data Mining Gaussian Mixture Model
• In general, an M-component Gaussian mixture PDF is defined by:
  p(y) = ∑_{m=1}^{M} w_m p_m(y)
  where each p_m is a Gaussian PDF and ∑_{m=1}^{M} w_m = 1, with each w_m ≥ 0
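One possible NumPy implementation of this weighted average of component PDFs (the helper name and the parameter values are assumptions for illustration):

```python
import numpy as np

def gmm_pdf(y, weights, means, variances):
    """p(y) = sum_m w_m p_m(y): an M-component 1-D Gaussian mixture PDF."""
    y   = np.atleast_1d(np.asarray(y, dtype=float))[:, None]   # shape (N, 1)
    w   = np.asarray(weights)                                   # shape (M,)
    mu  = np.asarray(means)
    var = np.asarray(variances)
    comp = np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)  # p_m(y_n), shape (N, M)
    return comp @ w                                              # mixture density for each y_n

# The 2-component example of Slide 17: means 0 and 2, variances 0.1 and 1, equal weights
print(gmm_pdf([0.0, 1.0, 2.0], weights=[0.5, 0.5], means=[0.0, 2.0], variances=[0.1, 1.0]))
```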

Slide 22 EE3J2 Data Mining Estimating the parameters of a Gaussian mixture model
• A Gaussian Mixture Model with M components has:
  – M means: μ_1, …, μ_M
  – M variances: σ_1², …, σ_M²
  – M mixture weights: w_1, …, w_M
• Given a set of data y = y_1, …, y_N, how can we estimate these parameters?
• I.e. how do we find a maximum likelihood estimate of μ_1, …, μ_M, σ_1², …, σ_M², w_1, …, w_M?

Slide 23 EE3J2 Data Mining Parameter Estimation
• If we knew which component each sample y_n came from, then parameter estimation would be easy:
  – set μ_m to be the average value of the samples which belong to the m-th component
  – set σ_m² to be the variance of the samples which belong to the m-th component
  – set w_m to be the proportion of samples which belong to the m-th component
• But we don't know which component each sample belongs to (the sketch below illustrates the easy 'known components' case)
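If the component labels really were known, the estimation described above reduces to per-component averages; a minimal sketch with invented labelled data:

```python
import numpy as np

# Hypothetical 1-D samples together with the (normally unknown) component of each one
y      = np.array([0.1, -0.2, 0.05, 2.3, 1.8, 2.1, 1.9])
labels = np.array([0,    0,    0,   1,   1,   1,   1  ])
M = 2

weights, means, variances = [], [], []
for m in range(M):
    y_m = y[labels == m]                                # samples belonging to the m-th component
    means.append(y_m.mean())                            # mu_m: their average
    variances.append(np.mean((y_m - y_m.mean()) ** 2))  # sigma_m^2: their variance
    weights.append(len(y_m) / len(y))                   # w_m: proportion of samples in component m

print(weights, means, variances)
```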

Slide 24 EE3J2 Data Mining Solution – the E-M algorithm
• Guess initial values for the parameters μ_m, σ_m², w_m
• For each n, calculate the probabilities
  P(m | y_n) = w_m p_m(y_n) / p(y_n)
  – this is a measure of how much the sample y_n 'belongs to' the m-th component
• Use these probabilities to re-estimate μ_m, σ_m² and w_m, as weighted versions of the 'known components' estimates on the previous slide
• REPEAT (one possible implementation is sketched below)
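The following is my own illustration of this loop for a 1-D GMM, not the lecture's code; the initialisation scheme and the toy data are arbitrary choices:

```python
import numpy as np

def em_gmm_1d(y, M, n_iter=50, seed=0):
    """E-M for a 1-D Gaussian mixture: alternate soft assignment and re-estimation."""
    rng = np.random.default_rng(seed)
    mu  = rng.choice(y, size=M, replace=False)    # guess initial means from the data
    var = np.full(M, np.var(y))                   # guess initial variances and weights
    w   = np.full(M, 1.0 / M)

    for _ in range(n_iter):
        # E-step: gamma[n, m] = P(m | y_n) = w_m p_m(y_n) / p(y_n),
        # i.e. how much sample y_n 'belongs to' the m-th component
        comp  = np.exp(-(y[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        joint = w * comp
        gamma = joint / joint.sum(axis=1, keepdims=True)

        # M-step: weighted versions of the 'known components' estimates
        Nm  = gamma.sum(axis=0)
        mu  = (gamma * y[:, None]).sum(axis=0) / Nm
        var = (gamma * (y[:, None] - mu) ** 2).sum(axis=0) / Nm
        w   = Nm / len(y)
    return w, mu, var

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 0.3, 200), rng.normal(2, 1.0, 300)])  # bimodal toy data
print(em_gmm_1d(y, M=2))   # weights, means, variances roughly (0.4, 0.6), (0, 2), (0.09, 1)
```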

Slide 25 EE3J2 Data Mining The E-M algorithm
(Plot of the likelihood p(y | parameters) as a function of the parameter set: the successive E-M estimates, from the initial guess through iteration i, climb towards a local optimum.)

Slide 26 EE3J2 Data Mining Multivariate Gaussian PDFs
• All PDFs so far have been 1-dimensional – they take scalar values
• But most real data will be represented as D-dimensional vectors
• The vector equivalent of a Gaussian PDF is called a multivariate Gaussian PDF

Slide 27 EE3J2 Data Mining Multivariate Gaussian PDFs
(Plot showing contours of equal probability for a multivariate Gaussian PDF, with the corresponding 1-dimensional Gaussian PDFs along each axis.)

Slide 28 EE3J2 Data Mining Multivariate Gaussian PDFs
(Further plots showing the 1-dimensional Gaussian PDFs that correspond to a multivariate Gaussian PDF.)

Slide 29 EE3J2 Data Mining Multivariate Gaussian PDF
• The parameters of a multivariate Gaussian PDF are:
  – the (vector) mean μ
  – the (vector of) variances
  – the covariances between dimensions
• The variances and covariances are collected in the covariance matrix Σ
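A minimal sketch of evaluating such a density with NumPy (the function name and the 2-dimensional example values are assumptions for illustration):

```python
import numpy as np

def multivariate_gaussian_pdf(x, mean, cov):
    """Density of a D-dimensional Gaussian with vector mean and covariance matrix cov."""
    x, mean, cov = np.asarray(x, float), np.asarray(mean, float), np.asarray(cov, float)
    D = mean.shape[0]
    diff = x - mean
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** D * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

mean = np.array([0.0, 1.0])
cov  = np.array([[1.0, 0.5],    # variances on the diagonal,
                 [0.5, 2.0]])   # covariances off the diagonal
print(multivariate_gaussian_pdf([0.0, 1.0], mean, cov))   # density at the mean
```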

Slide 30 EE3J2 Data Mining Multivariate Gaussian PDFs
• Multivariate Gaussian PDFs are commonly used in pattern processing and data mining
• Vector data is often not unimodal, so we use mixtures of multivariate Gaussian PDFs
• The E-M algorithm works for multivariate Gaussian mixture PDFs

Slide 31 EE3J2 Data Mining Summary
• Basic statistical modelling
• Probability distributions
• Probability density functions
• Gaussian PDFs
• Gaussian mixture PDFs and the E-M algorithm
• Multivariate Gaussian PDFs

Slide 32 EE3J2 Data Mining Summary