
1 Review Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

2 PatReco: Introduction Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

3 PatReco: Applications  Speech/audio/music/sounds: Speech recognition, Speaker verification/id  Image/video: OCR, AVASR, Face id, Fingerprint id, Video segmentation  Text/Language: Machine translation, document classification, language modeling, text understanding  Medical/Biology: Disease diagnosis, DNA sequencing, Gene disease models  Other Data: User modeling (books/music), Linguistic analysis (web), Games

4 Basic Concepts  Why statistical modeling? Variability: differences between two examples of the same class (both in training); Mismatch: differences between two examples of the same class (one in training, one in testing)  Learning modes: Supervised learning: class labels known; Unsupervised learning: class labels unknown; Reinforcement learning: only positive/negative feedback

5 Basic Concepts  Feature selection Separate classes, Low correlation  Model selection Model type, Model order  Prior knowledge E.g., a priori class probability  Missing features/observations  Modeling of time series Correlation in time (model?), segmentation

6 PatReco: Algorithms  Parametric vs Non-Parametric  Supervised vs Unsupervised  Basic Algorithms: Bayesian Non-parametric Discriminant Functions Non-Metric Methods

7 PatReco: Algorithms  Bayesian methods Formulation (describe class characteristics) Bayes classifier Maximum likelihood estimation Bayesian learning Expectation-Maximization Markov models, hidden Markov models Bayesian Nets  Non-parametric Parzen windows Nearest Neighbor

8 PatReco: Algorithms  Discriminant Functions Formulation (describe boundary) Learning: Gradient descent Perceptron MSE=minimum squared error LMS=least mean squares Neural Net generalizations Support vector machines  Non-Metric Methods Classification and Regression Trees String Matching

9 PatReco: Algorithms  Unsupervised Learning: Mixture of Gaussians K-means  Other topics not covered: Multi-layered Neural Nets Stochastic Learning (Simulated Annealing) Genetic Algorithms Fuzzy Algorithms Etc…

10 PatReco: Problem Solving 1. Data Collection 2. Data Analysis 3. Feature Selection 4. Model Selection 5. Model Training 6. Classification 7. Classifier Evaluation

11

12

13 PatReco: Problem Solving 1. Data Collection 2. Data Analysis 3. Feature Selection 4. Model Selection 5. Model Training 6. Classification 7. Classifier Evaluation

14

15

16

17 PatReco: Problem Solving 1. Data Collection 2. Data Analysis 3. Feature Selection 4. Model Selection 5. Model Training 6. Classification 7. Classifier Evaluation

18

19

20

21

22 PatReco: Problem Solving 1. Data Collection 2. Data Analysis 3. Feature Selection 4. Model Selection 5. Model Training 6. Classification 7. Classifier Evaluation

23 Evaluation  Training Data Set 1234 examples of class 1 and class 2  Testing/Evaluation Data Set 134 examples of class 1 and class 2  Misclassification Error Rate Training: 11.61% (150 errors) Testing: 13.43% (18 errors)  Correct for chance (Training 22%, Testing 26%) Why?

24 PatReco: Discriminant Functions for Gaussians Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

25 PatReco: Problem Solving 1. Data Collection 2. Data Analysis 3. Feature Selection 4. Model Selection 5. Model Training 6. Classification 7. Classifier Evaluation

26 Discriminant Functions  Define class boundaries (instead of class characteristics)  Dualism: Parametric class description (Bayes classifier) ↔ Decision boundary (Parametric Discriminant Functions)

27 Normal Density  1D  Multi-D Full covariance Diagonal covariance Diagonal covariance + univariate  Mixture of Gaussians Usually diagonal covariance

28

29 Gaussian Discriminant Functions  Same variance for ALL classes: Hyper-planes  Different variance among classes: Hyper-quadratics (hyper-parabolas, hyper-ellipses etc.)

30

31

32 Hyper-Planes  When the covariance matrix is common across Gaussian classes: The decision boundary is a hyper-plane that is perpendicular to the line connecting the means of the Gaussian distributions; If the a-priori class probabilities are equal, the hyper-plane cuts the line connecting the Gaussian means in the middle  Euclidean classifier

33 Gaussian Discriminant Functions  Same variance for ALL classes: Hyper-planes  Different variance among classes: Hyper-quadratics (hyper-parabolas, hyper-ellipses etc.)

34

35

36

37

38 Hyper-Quadratics  When the Gaussian class variances are different the boundary can be a hyper-plane, multiple hyper-planes, a hyper-sphere, a hyper-parabola, a hyper-ellipsoid etc. The boundary is in general NOT perpendicular to the line connecting the Gaussian means; If the a-priori class probabilities are equal the resulting classifier is a Mahalanobis classifier

39 Conclusions  Parametric statistical models describe class characteristics x by modeling the observation probabilities p(x|class)  Discriminant functions describe class boundaries parametrically  Parametric statistical models have an equivalent parametric discriminant function  For Gaussian p(x|class) distributions the decision boundaries are hyper-planes or hyper-quadratics

40 PatReco: Detection Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

41 Detection  Goal: Detect an Event  Outcomes: Hit (Success), False Alarm, Miss (Failure), False Reject
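A small illustration (not from the slides) of one common way to turn detection outcomes into rates, using the standard 2×2 detection table; the counts and the "correct rejection" cell are made up for the example:

```python
# Illustrative sketch: hit rate and false-alarm rate from raw detection counts.
def detection_rates(hits, misses, false_alarms, correct_rejects):
    hit_rate = hits / (hits + misses)                                   # P(detect | event present)
    false_alarm_rate = false_alarms / (false_alarms + correct_rejects)  # P(detect | event absent)
    return hit_rate, false_alarm_rate

print(detection_rates(hits=90, misses=10, false_alarms=5, correct_rejects=95))  # (0.9, 0.05)
```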

42

43

44 PatReco: Estimation/Training Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

45 Estimation/Training  Goal: Given observed data (re-)estimate the parameters of the model e.g., for a Gaussian model estimate the mean and variance for each class

46 Supervised-Unsupervised  Supervised training: All data has been (manually) labeled, i.e., assigned to classes  Unsupervised training: Data is not assigned a class label

47 Observable data  Fully observed data: all information necessary for training is available (features, class labels etc.)  Partially observed data: some of the features or some of the class labels are missing

48 Supervised Training (fully observable data)  Maximum likelihood estimation (ML)  Maximum a posteriori estimation (MAP)  Bayesian estimation (BE)

49 Training process  Collected data used for training consists of the following examples: D = {x1, x2, …, xN}  Step 1: Label each example with the corresponding class label ω1, ω2, ..., ωK  Step 2: For each class separately, estimate the model parameters using ML, MAP or BE and the corresponding training examples D1, D2, ..., DK

50 Training Process: Step 1  D = {x1, x2, x3, x4, x5, …, xN}  Label manually with ω1, ω2, ..., ωK  D1 = {x11, x12, x13, …, x1N1}, D2 = {x21, x22, x23, …, x2N2}, …, DK = {xK1, xK2, xK3, …, xKNK}

51 Training Process: Step 2  Maximum Likelihood: θ1 = argmax_θ P(D1|θ1)  Maximum-a-posteriori: θ1 = argmax_θ P(D1|θ1) P(θ1)  Bayesian estimation: P(x|ω1) = ∫ P(x|θ1) P(θ1|D1) dθ1

52 ML Estimation Assumptions 1. P(x|ωi) follows a parametric distribution with parameters θ 2. Dj tells us nothing about P(x|ωi) for j ≠ i (functional independence) 3. Observations x1, x2, x3, …, xN are iid (independent, identically distributed) 4a. (ML only!) θ is a quantity whose value is fixed but unknown

53 ML estimation θ = argmax_θ P(θ|D) = argmax_θ P(D|θ) P(θ) = (by assumption 4a) argmax_θ P(D|θ) = argmax_θ P(x1, x2, …, xN|θ) = (by assumption 3) argmax_θ Π_j P(x_j|θ)  =>  ∂[Π_j P(x_j|θ)] / ∂θ = 0  =>  θ = …

54 ML estimate for Gaussian pdf If P(x|ω) = N(μ, σ²) and θ = (μ, σ²) then (1-D): μ = (1/N) Σ_{j=1..N} x_j, σ² = (1/N) Σ_{j=1..N} (x_j − μ)²  Multi-D, θ = (μ, Σ): μ = (1/N) Σ_{j=1..N} x_j, Σ = (1/N) Σ_{j=1..N} (x_j − μ)(x_j − μ)^T
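A minimal NumPy sketch of these ML formulas (not part of the slides; the synthetic data is only there to check the estimates):

```python
import numpy as np

def ml_gaussian(X):
    """ML estimates for a multivariate Gaussian. X: (N, d) samples of one class."""
    N = X.shape[0]
    mu = X.mean(axis=0)                 # mu = (1/N) sum_j x_j
    diff = X - mu
    Sigma = diff.T @ diff / N           # Sigma = (1/N) sum_j (x_j - mu)(x_j - mu)^T
    return mu, Sigma

X = np.random.default_rng(0).normal(loc=[1.0, -2.0], scale=1.5, size=(500, 2))
mu, Sigma = ml_gaussian(X)
print(mu)      # close to [1, -2]
print(Sigma)   # close to 2.25 * I (variance 1.5^2 on the diagonal)
```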

55 Bayesian Estimation Assumptions 1. P(x|ωi) follows a parametric distribution with parameters θ 2. Dj tells us nothing about P(x|ωi) for j ≠ i (functional independence) 3. Observations x1, x2, x3, …, xN are iid (independent, identically distributed) 4b. (MAP, BE) θ is a random variable whose prior distribution p(θ) is known

56 Bayesian Estimation P(x|D) = ∫ P(x,θ|D) dθ = ∫ P(x|θ,D) P(θ|D) dθ = ∫ P(x|θ) P(θ|D) dθ  STEP 1: P(θ) → P(θ|D) = P(D|θ) P(θ) / P(D)  STEP 2: P(x|θ) → P(x|D)

57 Bayesian Estimate for Gaussian pdf and priors If P(x|θ) = N(μ, σ²) and p(θ) = N(μ0, σ0²) then STEP 1: P(θ|D) = N(μn, σn²); STEP 2: P(x|D) = N(μn, σ² + σn²), where μn = [σ0² / (n σ0² + σ²)] Σ_j x_j + [σ² / (n σ0² + σ²)] μ0 and σn² = σ² σ0² / (n σ0² + σ²)  For large n (number of training samples) maximum likelihood and Bayesian estimation are equivalent!!!
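A sketch of the two steps above for a 1-D Gaussian with known variance σ² and a Gaussian prior N(μ0, σ0²) on the mean; the data and prior values are made up for illustration:

```python
import numpy as np

def bayes_gaussian_mean(x, sigma2, mu0, sigma0_2):
    n = len(x)
    denom = n * sigma0_2 + sigma2
    mu_n = (sigma0_2 / denom) * np.sum(x) + (sigma2 / denom) * mu0  # posterior mean
    sigma_n2 = sigma2 * sigma0_2 / denom                            # posterior variance
    return mu_n, sigma_n2, sigma2 + sigma_n2   # predictive P(x|D) = N(mu_n, sigma2 + sigma_n2)

x = np.random.default_rng(1).normal(3.0, 1.0, size=20)
print(bayes_gaussian_mean(x, sigma2=1.0, mu0=0.0, sigma0_2=10.0))
# With 20 samples the posterior mean already sits close to the sample mean;
# as n grows, the Bayesian and ML estimates coincide, as the slide notes.
```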

58 Conclusions  Maximum likelihood estimation is simple and gives good estimates when the number of training samples is large  Bayesian adaptation gives good estimates even for small amounts of training data provided that a good prior is selected  Bayesian adaptation is hard and often does not have a closed form solution (in which case try: iterative recursive Bayesian estimation)

59 PatReco: Model and Feature Selection Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

60 Breakdown of Classification Error  Bayes error  Model selection error  Model estimation error  Data mismatch error (training-testing)

61

62 True statements about Bayes error (valid within statistical significance)  The Bayes error is ALWAYS smaller than (or equal to) the total (empirical) classification error  If the model, estimation and mismatch errors are zero then the total classification error equals the Bayes error  The ONLY way to reduce the Bayes error is to add new features in the classifier design

63 More true statements  Adding new features can only reduce the Bayes error (this is not true about the total classification error!!!)  Adding new features will NOT reduce the Bayes error if the new features are Very bad at discriminating between classes (feature pdfs overlapping) Highly correlated with existing features

64 Gaussian classification Bayes Error For two classes ω1 and ω2 following Gaussian distributions with means μ1, μ2 and the same variance σ², the Bayes error is: P(error) = [1/√(2π)] ∫_{r/2}^{∞} exp{−u²/2} du, where r = |μ1 − μ2| / σ
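The integral is the tail of a standard normal, so it can be evaluated with the complementary error function; a small sketch (equal priors and equal variance assumed, as on the slide):

```python
import math

def bayes_error(mu1, mu2, sigma):
    r = abs(mu1 - mu2) / sigma
    # (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du  =  0.5 * erfc(r / (2*sqrt(2)))
    return 0.5 * math.erfc(r / (2.0 * math.sqrt(2.0)))

print(bayes_error(0.0, 2.0, 1.0))   # r = 2  ->  about 0.159
```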

65 Feature Selection  If we had infinite amounts of data then The more features the better!  However in practice finite data: More features  more parameters to train!!!  Good features: Uncorrelated Able to discriminate among classes

66 Model selection  Model order: the number of parameters that need to be estimated  Overfitting: too many parameters, too little data!!!  Model selection for Gaussian models: Single Gaussians Mixture of Gaussians Fixed Variance Tied Variance Diagonal Variance

67 Conclusion  Introducing more features and/or more complex models can only reduce the classification error (if infinite amounts of training data are available)  In practice: the number of features and the number of model parameters are a function of the amount of training data available (avoid overfitting!)  Good features are uncorrelated and discriminative

68 PatReco: Expectation Maximization Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

69 When do we use EM?  Partially observable data Missing some features from some samples, e.g., D={(1,2),(2,3),(?,4)} Missing class labels, e.g., hidden states of HMMs Missing class sub-labels, e.g., mixture label for mixture of Gaussian models

70 The EM algorithm  The Expectation Maximization algorithm (EM) consists of alternating expectation and maximization steps  During expectation steps the “best estimates of the missing information” are computed  During maximization step maximum likelihood training on all data is performed

71 EM
Initialization: θ(0)
for i = 1..iterno   // usually iterno = 2 or 3
  E step: Q(i) = E_{D_bad} { log p(D; θ) | x, θ(i−1) }
  M step: θ(i) = argmax_θ { Q(i) }
end

72 Pseudo-EM
Initialization: θ(0)
for i = 1..iterno   // usually iterno = 2 or 3
  Expectation step: D_bad = E{ D_bad | θ(i−1) }
  Maximization step: θ(i) = argmax_θ { p(D | θ) }
end

73 Convergence  EM is guaranteed to converge to a local optimum (NOT the global optimum!)  Pseudo-EM has no convergence guarantees but is used often in practice

74 Conclusions  EM is an iterative algorithm used when there are missing or partially observable training data  EM is a generalization of ML training  EM is guaranteed to converge to a local optimum (NOT the global optimum!)

75 PatReco: Bayesian Networks Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

76 Definitions  Bayesian networks consist of nodes and (usually directional) arcs  Nodes or states represent a classification class or in general events and are described with a pdf  Arcs represent relations between nodes, e.g., cause and effect, time sequence  Two nodes that are connected via another node are conditionally independent (given that node)

77 When to use Bayesian nets  Bayesian networks (or networks of inference) are statistical models that are used for classification (or in general pattern recognition) problems where there are dependencies among classes, e.g., time dependencies, cause and effect dependencies

78 Conditional Independence  Full independence of A and B: P(A|B) = P(A), or P(A,B) = P(A) P(B)  Conditional independence of A and B given C: P(A|B,C) = P(A|C), or P(A,B|C) = P(A|C) P(B|C)

79 Conditional Independence  A, C independent given B: P(C|B,A) = P(C|B)  B, C independent given A: P(B,C|A) = P(B|A) P(C|A)  A, C dependent given B: P(A,C|B) cannot be reduced!  (diagrams of the corresponding three-node network topologies)

80 Three problems 1. Probability computation (use independence) 2. Training/Parameter Estimation: Maximum likelihood (ML) if all is observable; Expectation maximization (EM) if data is missing 3. Inference (Testing): Diagnosis P(cause|effect), bottom-up; Prediction P(effect|cause), top-down

81 Probability Computation For a Bayesian Network that consists of N nodes: 1. Compute P(n1, n2, .., nN) using the chain rule, starting from the “last/bottom” node and working your way up: P(n1, n2, .., nN) = P(nN | n1, n2, .., nN−1) P(nN−1 | n1, n2, .., nN−2) … P(n2 | n1) P(n1) 2. Identify conditional independence conditions from the Bayesian network topology 3. Simplify the conditional probabilities using the independence conditions

82 Probability Computation Topology: P(C,S,R,W) = P(W|C,S,R) P(S|C,R) P(R|C) P(C)  Independent: (W,C) | S,R and (S,R) | C  Dependent: (S,R) | W  Hence: P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)  (network diagram: C with arcs to S and R, which both have arcs to W)
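A sketch of the factored joint computed by multiplying the conditional tables; all probability values below are invented for illustration and are not taken from the slides:

```python
# Factorization from the slide: P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)
P_C = {0: 0.5, 1: 0.5}
P_S_given_C = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}   # P_S_given_C[c][s]
P_R_given_C = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}   # P_R_given_C[c][r]
P_W_given_SR = {(0, 0): {0: 1.0, 1: 0.0}, (0, 1): {0: 0.1, 1: 0.9},
                (1, 0): {0: 0.1, 1: 0.9}, (1, 1): {0: 0.01, 1: 0.99}}

def joint(c, s, r, w):
    return P_W_given_SR[(s, r)][w] * P_S_given_C[c][s] * P_R_given_C[c][r] * P_C[c]

total = sum(joint(c, s, r, w) for c in (0, 1) for s in (0, 1)
            for r in (0, 1) for w in (0, 1))
print(joint(1, 0, 1, 1), total)   # 0.9 * 0.9 * 0.8 * 0.5 = 0.324, and the joint sums to 1.0
```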

83 Probability Computation  There are general algorithms for identifying cliques in the Bayesian net  Cliques are islands of conditional dependence, i.e., terms in the probability computation that cannot be further reduced  (cliques for the example: {S,C}, {R,C}, {W,S,R})

84 Training/Parameter Estimation  Instead of estimating the joint pdf of the whole network the joint pdf of each of the cliques is estimated  For example if the network joint pdf is P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C) instead of computing P(C,S,R,W) we compute each of P(W|S,R), P(S|C), P(R|C), P(C) for all possible values of W, S, R, C (much simpler)

85 Training/Parameter Estimation  For fully observable data and discrete probabilities compute maximum likelihood estimates of the parameters, e.g., for discrete probabilities: P(W=1|S=1,R=0)_ML = counts(W=1, S=1, R=0) / counts(W=*, S=1, R=0)

86 Training/Parameter Estimation  Example: the following observations are given for (W,C,S,R): (1,0,1,0), (0,0,1,0), (1,1,1,0), (0,1,1,0), (1,0,1,0), (0,1,0,0), (1,0,0,1), (0,1,1,1), (1,1,1,0)  Using Maximum Likelihood Estimation: P(W=1|S=1,R=0)_ML = #(1,*,1,0) / #(*,*,1,0) = 4/6 ≈ 0.67
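The same counting estimate can be reproduced directly from the listed observations; a short sketch:

```python
data = [(1, 0, 1, 0), (0, 0, 1, 0), (1, 1, 1, 0), (0, 1, 1, 0), (1, 0, 1, 0),
        (0, 1, 0, 0), (1, 0, 0, 1), (0, 1, 1, 1), (1, 1, 1, 0)]   # (W, C, S, R)

num = sum(1 for (w, c, s, r) in data if w == 1 and s == 1 and r == 0)   # #(1,*,1,0)
den = sum(1 for (w, c, s, r) in data if s == 1 and r == 0)              # #(*,*,1,0)
print(num, den, num / den)   # 4 6 0.666...
```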

87 Training/Parameter Estimation  When data is non observable or missing the EM algorithm is employed  There are efficient implementations of the EM algorithm for Bayesian nets that operate on the clique network  When the topology of the Bayesian network is not known structural EM can be used

88 Inference  There are two types of inference (testing): Diagnosis P(cause|effect), bottom-up; Prediction P(effect|cause), top-down  Once the parameters of the network are estimated the joint network pdf can be estimated for ALL possible network values  Inference is simply probability computation using the network pdf

89 Inference  For example P(W=1|C=1) = P(W=1,C=1) / P(C=1), where P(W=1,C=1) = Σ_{R,S} P(W=1, C=1, R=*, S=*) and P(C=1) = Σ_{R,W,S} P(W=*, C=1, R=*, S=*)

90 Inference  Efficient algorithms exist for performing inference in large networks which operate on the clique network  Inference is often posed as a probability maximization problem, e.g., what is the most probable cause or effect? argmax_W P(W|C=1)

91 Continuous Case  In our examples the network nodes represented discrete events (states or classes)  Network nodes often hold continuous variables (observations), e.g., length, energy  For the continuous case parametric pdfs are introduced and their parameters are estimated using ML (observed data) or EM (hidden data)

92 Some Applications  Medical diagnosis  Computer problem diagnosis (MS)  Markov chains  Hidden Markov Models (HMMs)

93 Conclusions  Bayesian networks are used to represent dependencies between classes  Network topology defines conditional independence conditions that simplify the network pdf modeling and computation  Three problems: probability computation, estimation/training, inference/testing

94 PatReco: Hidden Markov Models Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

95 Markov Models: Definition  Markov chains are Bayesian networks that model sequences of events (states)  Sequential events are dependent  Two non-sequential events are conditionally independent given the intermediate events (MM-1)

96 Markov chains  (diagrams of the state chains q0, q1, q2, q3, q4, … for MM-0, MM-1, MM-2 and MM-3, with an increasing number of dependency arcs between states)

97 Markov Chains
MM-0: P(q1, q2, .., qN) = Π_{n=1..N} P(qn)
MM-1: P(q1, q2, .., qN) = Π_{n=1..N} P(qn | qn−1)
MM-2: P(q1, q2, .., qN) = Π_{n=1..N} P(qn | qn−1, qn−2)
MM-3: P(q1, q2, .., qN) = Π_{n=1..N} P(qn | qn−1, qn−2, qn−3)
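A small MM-1 sketch: the probability of a state sequence is the prior of the first state times the product of transition probabilities (P(q1) is used for the first factor). The state names and probability values are made up:

```python
prior = {"A": 0.6, "B": 0.3, "C": 0.1}                 # P(q1)
trans = {"A": {"A": 0.7, "B": 0.2, "C": 0.1},          # P(q_n | q_{n-1})
         "B": {"A": 0.3, "B": 0.4, "C": 0.3},
         "C": {"A": 0.1, "B": 0.4, "C": 0.5}}

def mm1_prob(seq):
    p = prior[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p

print(mm1_prob(["A", "A", "B", "C"]))   # 0.6 * 0.7 * 0.2 * 0.3 = 0.0252
```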

98 Hidden Markov Models  Hidden Markov models describe sequences of events and the corresponding sequences of observations  Events form a Markov chain (MM-1)  Observations are conditionally independent given the sequence of events  Each observation is directly connected with a single event (and is conditionally independent of the rest of the events in the network)

99 Hidden Markov Models  (HMM-1 diagram: state chain q0, q1, q2, q3, q4, … with one observation on attached to each state qn)  P(o0, o1, .., oN, q0, q1, .., qN) = Π_{n=0..N} P(qn | qn−1) P(on | qn)

100 Parameter Estimation  The parameters that have to be estimated are the a-priori probabilities P(q 0 ) transition probabilities P(q n |q n-1 ) observation probabilities P(o n |q n )  For example if there are 3 types of events and continuous 1-D observations that follow a Gaussian distribution there are 18 parameters to estimate: 3 a-priori probabilities 3x3 transition probabilities matrix 3 means and 3 variances (observation probabilities)

101 Parameter Estimation  If both the sequence of events and the sequence of observations are fully observable then ML is used  Usually the sequence of events q0, q1, .., qN is non-observable, in which case EM is used  The EM algorithm for HMMs is the Baum-Welch or forward-backward algorithm

102 Inference/Decoding  The main inference problem for HMMs is known as the decoding problem: given a sequence of observations, find the best sequence of states: q = argmax_q P(q|O) = argmax_q P(q,O)  An efficient decoding algorithm is the Viterbi algorithm

103 Viterbi algorithm max_q P(q,O) = max_q P(o0, o1, .., oN, q0, q1, .., qN) = max_q Π_{n=0..N} P(qn | qn−1) P(on | qn) = max_{qN} { P(oN|qN) max_{qN−1} { P(qN|qN−1) P(oN−1|qN−1) … max_{q2} { P(q3|q2) P(o2|q2) max_{q1} { P(q2|q1) P(o1|q1) max_{q0} { P(q1|q0) P(o0|q0) P(q0) }}} … }}

104 Viterbi algorithm  (trellis diagram: states 1, 2, 3, 4, …, K plotted against time)  At each node keep only the best (most probable) path among all the paths passing through that node
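A minimal Viterbi sketch for a discrete-observation HMM-1, following the nested-max recursion of slide 103; the two-state weather/umbrella model below is a made-up illustration:

```python
def viterbi(obs, states, prior, trans, emit):
    """Most probable state sequence for the observation sequence obs."""
    delta = {s: prior[s] * emit[s][obs[0]] for s in states}   # best-path score ending in state s
    backpointers = []
    for o in obs[1:]:
        new_delta, back = {}, {}
        for s in states:
            best_prev = max(states, key=lambda sp: delta[sp] * trans[sp][s])
            new_delta[s] = delta[best_prev] * trans[best_prev][s] * emit[s][o]
            back[s] = best_prev
        delta = new_delta
        backpointers.append(back)
    last = max(states, key=lambda s: delta[s])                # best final state
    path = [last]
    for back in reversed(backpointers):                       # backtrack along the kept pointers
        path.append(back[path[-1]])
    return list(reversed(path))

states = ("Rain", "Dry")
prior = {"Rain": 0.5, "Dry": 0.5}
trans = {"Rain": {"Rain": 0.7, "Dry": 0.3}, "Dry": {"Rain": 0.3, "Dry": 0.7}}
emit = {"Rain": {"umbrella": 0.9, "none": 0.1}, "Dry": {"umbrella": 0.2, "none": 0.8}}
print(viterbi(["umbrella", "umbrella", "none"], states, prior, trans, emit))
```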

105 Deep Thoughts  HMM-0 (HMM with MM-0 event chain) is the Bayes classifier!!!  MMs and HMMs are poor models but simple and efficient computationally How do you fix this? (dependent observations?)

106 Some Applications  Speech Recognition  Optical Character Recognition  Part-of-Speech Tagging  …

107 Conclusions  HMMs and MMs are useful modeling tools for dependent sequences of events (states or classes)  Efficient algorithms exist for training HMM parameters (Baum-Welch) and for decoding the most probable sequence of states given an observation sequence (Viterbi)  HMMs have many applications

108 Non Parametric Classifiers Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

109 Histograms-Parzen Windows  Main idea: Instead of selecting a parametric distribution (e.g., Gaussian) to describe the properties of the features of a class, compute directly the empirical distribution, i.e., the class feature histogram

110 Feature Histogram Example  (histogram of feature X: number of samples in each bin)  Normalize the histogram curve to get the feature PDF

111 Parzen Windows: Issues  When compared to parametric methods empirical distributions are: Better because no specific form of the PDF is assumed; Worse because over-fitting can easily occur (too small a histogram bin)  Parzen proposed rules for adapting the bin size based on the number of samples in each bin to avoid over-fitting

112 Nearest Neighbor Rule  Main idea (1-NNR): No explicit model (i.e., no training) For each test sample x the “nearest” sample x’ in the training set is found, i.e., argmin x’ d(x, x’) and x is classified to the class where x’ belongs

113 Generalizations  k-NNR: Instead of finding the single nearest neighbor we find the k nearest neighbors in the training set; the sample x is classified to the class where most of the k neighbors belong  k-l-NNR: Like k-NNR but at least l of the k nearest neighbors must belong to the same class for a classification decision to be taken (else no decision)
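A short k-NNR sketch for 1-D features, using the training set from slide 114 below (standard library only; the test point 0.4 is arbitrary):

```python
from collections import Counter

def knn_classify(x, train, k=1):
    """train: list of (feature, label) pairs; returns the majority label of the k nearest."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# D1 = {0, -1, -2} (class 1) and D2 = {1, 1, 1} (class 2), as on slide 114
train = [(0, 1), (-1, 1), (-2, 1), (1, 2), (1, 2), (1, 2)]
print(knn_classify(0.4, train, k=1))   # 1: the nearest sample is 0
print(knn_classify(0.4, train, k=3))   # 2: two of the three nearest samples are the 1's
```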

114 Example  Training set D1 = {0, −1, −2} and D2 = {1, 1, 1}  (number line from −2 to 3 marking the 1-NNR decision boundary, the 3-NNR decision boundary and the 3-3-NNR no-decision region)

115 Computational Efficiency  To speed up NNR classification the training set size can be reduced using the condensing algorithm: The training set is classified using the NNR rule; misclassified samples are added to the new (condensed) training set one by one until all training samples are correctly classified

116 Conclusions  Non parametric classification algorithms are easy to implement are computationally efficient (in training) don’t make any assumptions are prone to over-fitting are hard to adapt (no detailed model)

117 Discriminant Functions Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

118 Discriminant Functions  Main Idea: Describe parametrically the decision boundary (instead of the properties of the class), e.g., the two classes are separated by a straight line a x1 + b x2 + c = 0 with parameters (a,b,c) (instead of, e.g., modeling the feature PDFs as 2-D Gaussians)

119 Example: Two classes, two features  Model the class boundary: the line a x1 + b x2 + c = 0 separating ω1 from ω2 in the (x1, x2) plane  Model the class characteristics: one 2-D Gaussian per class, N(μ1, Σ1) and N(μ2, Σ2)

120 Duality  Dualism: Parametric class description (Bayes classifier) ↔ Decision boundary (Parametric Discriminant Functions)  For example modeling class features by Gaussians with the same (across-class) variance results in hyper-plane discriminant functions

121

122 Discriminant Functions  Discriminant functions gi(x) are functions of the features x of a class i  A sample x is classified to the class c for which gi(x) is maximized, i.e., c = argmax_i {gi(x)}  The equation gi(x) = gj(x) defines the class boundary for each pair of (different) classes i and j

123 Linear Discriminant Functions  Two class problem: A single discriminant function is defined as g(x) = g1(x) − g2(x)  If g(x) is a linear function g(x) = w^T x + w0, then the boundary is a hyper-plane (point, line, plane for 1-D, 2-D, 3-D features respectively)

124 Linear Discriminant Functions  (plot of the line a x1 + b x2 + c = 0 in the (x1, x2) plane: the weight vector w = (a,b) is normal to the boundary, which intercepts the axes at −c/a and −c/b)

125 Non Linear Discriminant Functions  Quadratic discriminant functions: g(x) = w0 + Σ_i wi xi + Σ_{ij} wij xi xj; for example, for a two-class 2-D problem, g(x) = a + b x1 + c x2 + d x1²  Any non-linear discriminant function can become linear by increasing the dimensionality, e.g., y1 = x1, y2 = x2, y3 = x1² (2-D non-linear → 3-D linear): g(y) = a + b y1 + c y2 + d y3
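A tiny sketch of the linearization trick: the same quadratic function evaluated directly in (x1, x2) and as a linear function of the augmented features (y1, y2, y3) = (x1, x2, x1²). The coefficient values are made up:

```python
a, b, c, d = -1.0, 2.0, 1.0, -3.0            # g(x) = a + b*x1 + c*x2 + d*x1^2

def g_quadratic(x1, x2):
    return a + b * x1 + c * x2 + d * x1 ** 2

def g_linear(y):                             # the same function, linear in y = (y1, y2, y3)
    return a + b * y[0] + c * y[1] + d * y[2]

x1, x2 = 0.7, -0.2
print(g_quadratic(x1, x2), g_linear((x1, x2, x1 ** 2)))   # identical values
```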

126 Parameter Estimation  The parameters w are estimated by functional minimization  The function J to be minimized models the average distance of training samples from the decision boundary, computed over either: Misclassified training samples, or All training samples  The function J is minimized using gradient descent

127 Gradient Descent  Iterative procedure towards a local minimum: a(k+1) = a(k) − n(k) ∇J(a(k)), where k is the iteration number, n(k) is the learning rate and ∇J(a(k)) is the gradient of the function to be minimized evaluated at a(k)  Newton descent is gradient descent with the learning rate equal to the inverse Hessian matrix

128 Distance Functions  Perceptron Criterion Function: Jp(a) = Σ_{misclassified} (−a^T y)  Relaxation With Margin b: Jr(a) = Σ_{misclassified} (a^T y − b)² / ||y||²  Least Mean Squares (LMS): Js(a) = Σ_{all samples} (a^T y_i − b_i)²  Ho-Kashyap rule: Js(a,b) = Σ_{all samples} (a^T y_i − b_i)²
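A sketch of gradient descent on the perceptron criterion: since ∇Jp(a) = Σ_{misclassified} (−y), the update adds the sum of the misclassified (augmented, sign-normalized) samples. The toy data and learning rate are made up, and a linearly separable training set is assumed:

```python
import numpy as np

def perceptron_train(X, labels, lr=0.1, max_iter=100):
    """X: (N, d) features; labels in {+1, -1}. Returns the augmented weight vector a."""
    Y = np.hstack([np.ones((X.shape[0], 1)), X])   # augment with a bias component
    Y = Y * labels[:, None]                        # sign-normalize: want a^T y > 0 for all y
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]
        if len(misclassified) == 0:                # converged: no misclassified samples left
            break
        a = a + lr * misclassified.sum(axis=0)     # a(k+1) = a(k) - n(k) * grad Jp(a(k))
    return a

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
labels = np.array([1, 1, -1, -1])
a = perceptron_train(X, labels)
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ a))   # matches the labels: [ 1.  1. -1. -1.]
```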

129 Discriminant Functions  Working on misclassified samples only (Perceptron, Relaxation with Margin) provides better results but converges only for separable training sets

130 High Dimensionality  Using non-linear discriminant functions and linearizing them in a high-dimensional space can make ANY training set separable, at the cost of a large number of parameters (curse of dimensionality)  Support vector machines: A smart way to select the appropriate terms (dimensions) is needed

131 Non-Metric Methods: Decision Trees Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

132 Decision Trees  Motivation: There are features (discrete) that don’t have an obvious notion of similarity or ordering (nominal data), e.g., book type, shape, sound type  Taxonomies (i.e., trees with is-a relationship) are the oldest form of classification

133 Decision Trees: Definition  Decision Trees are classifiers that classify samples based on a set of questions that are asked hierarchically (tree of questions)  Example questions is color red? is x < 0.5?  Terminology: root, leaf, node, arc, branch, parent, children, branching factor, depth

134 Fruit classifier  (decision tree: root question Color? with branches green / yellow / red, followed by the questions Size?, Shape? and Taste? with answers big / med / small, round / thin and sweet / sour)

135 Fruit classification  (the same tree, with the path of answers that classifies the test sample as CHERRY highlighted)

139 Fruit classifier  (the full tree with the class labels at the leaves: watermelon, grape, grapefruit, cherry, grape)

140 Binary Trees  Binary trees: each parent node has exactly two children nodes (branching factor = 2)  Any tree can be represented as a binary tree by changing the set of questions and by increasing the tree depth, e.g., the three-way question Color? (green / yellow / red) becomes Color = green? (Y/N) followed by Color = yellow? (Y/N)

141 Decision Trees: Problems 1. List of questions (features): All possible questions are considered 2. Which questions to split first (best split): The questions that split the data best (reduce impurity at each node) are asked first 3. Stopping criteria (pruning criteria): Stop when further splits don’t reduce impurity

142 Best Split example  Two-class problem with 100 examples from w1 and 100 examples from w2  Three binary questions Q1, Q2 and Q3 that split the data as follows: 1. Node 1: (50,50), Node 2: (50,50) 2. Node 1: (100,0), Node 2: (0,100) 3. Node 1: (80,0), Node 2: (20,100)

143 Impurity Measures  Impurity measures the degree of homogeneity of a node; a node is pure if it consists of training examples from a single class  Impurity Measures: Entropy Impurity: i(N) = −Σ_i P(wi) log2 P(wi)  Variance (two-class): i(N) = P(w1) P(w2)  Gini Impurity: i(N) = 1 − Σ_i P²(wi)  Misclassification: i(N) = 1 − max_i P(wi)

144 Total Impurity  Total Impurity at Depth 0: i(depth=0) = i(N)  Total Impurity at Depth 1: i(depth=1) = P(NL) i(NL) + P(NR) i(NR)  (diagram: node N at depth 0 splits via a yes/no question into children NL and NR at depth 1)

145 Impurity Example  Node 1: (80,0), Node 2: (20,100)  i(node 1) = 0  i(node 2) = −(20/120) log2(20/120) − (100/120) log2(100/120) = 0.65  P(node 1) = 80/200 = 0.4, P(node 2) = 120/200 = 0.6  i(total) = P(node 1) i(node 1) + P(node 2) i(node 2) = 0 + 0.6 × 0.65 = 0.39
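A short check of these numbers with the entropy impurity (base-2 logarithm):

```python
import math

def entropy_impurity(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

n1, n2 = (80, 0), (20, 100)
i1, i2 = entropy_impurity(n1), entropy_impurity(n2)
p1 = sum(n1) / (sum(n1) + sum(n2))            # 80/200 = 0.4
p2 = sum(n2) / (sum(n1) + sum(n2))            # 120/200 = 0.6
print(round(i2, 2), round(p1 * i1 + p2 * i2, 2))   # 0.65 0.39
```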

146 Continuous Example  For continuous features: questions are of the type x < a, where x is the feature and a is a constant  Decision Boundaries (two-class, 2-D example): (the (x1, x2) plane is partitioned into rectangular R1 / R2 decision regions by axis-parallel splits)

147 Summary  Decision trees are useful categorical classification tools especially for nominal (non-metric) data  CART creates trees that minimize impurity on the training set at each node  Decision region shape  CART is a useful tool for feature selection

148 Unsupervised Training and Clustering Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

149 Unsupervised Training  Definition: The training set samples are unlabelled (unclassified)  Motivation: Labeling is hard/time consuming Fully automatic adaptation of models (in the field)

150 Maximum Likelihood Training  Given: N training examples drawn from c classes, i.e., D = {x1, x2, …, xN} (no class assignments are given!)  Estimate: Class priors p(wi); Feature PDF parameters θ: p(x|θi, wi)  Sometimes the number of classes c is not given and has to be estimated as well

151 Unsupervised ML estimation  ML equations: Σ_k P(wi | xk, θ) ∇_{θi} log p(xk | wi, θi) = 0  Compared with supervised ML there is an additional term P(wi | xk, θ)  P(wi | xk, θ) is the class membership function for each sample xk  Unsupervised ML is a version of EM  Pseudo-EM: P(wi | xk, θ) is binary, 0 or 1

152 Mixture of Gaussians Estimates  Linear combination of Gaussians with weights ai: p(xk) = Σ_i ai N(xk; μi, Σi)  ML estimates: ai = (1/N) Σ_k P(wi | xk)  μi = Σ_k P(wi | xk) xk / Σ_k P(wi | xk)  Σi = Σ_k P(wi | xk) (xk − μi)(xk − μi)^T / Σ_k P(wi | xk)
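A NumPy sketch iterating the E step (class memberships P(wi|xk)) and these M-step re-estimates for a 1-D, two-component mixture; the data and initial parameter values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])   # unlabeled samples

a = np.array([0.5, 0.5])        # mixture weights a_i
mu = np.array([-1.0, 1.0])      # component means
var = np.array([1.0, 1.0])      # component variances

for _ in range(50):
    # E step: membership P(w_i | x_k) for every sample and component
    lik = a * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    post = lik / lik.sum(axis=1, keepdims=True)
    # M step: the ML re-estimates from the slide
    Nk = post.sum(axis=0)
    a = Nk / len(x)
    mu = (post * x[:, None]).sum(axis=0) / Nk
    var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(a.round(2), mu.round(2), var.round(2))   # weights near (0.6, 0.4), means near (-2, 3)
```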

153 Clustering  Basic Isodata: 1. Select an initial partition of the data into c classes and compute the cluster means 2. Classify the training samples using a classification criterion (Euclidean distance) 3. Recompute the cluster means based on the training set classification decisions 4. If there is no change in the sample means stop, else go to step 2
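A NumPy sketch of the four Basic Isodata steps (the 2-D data and the choice c = 2 are made up for the example):

```python
import numpy as np

def basic_isodata(X, c, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=c, replace=False)]          # step 1: initial cluster means
    for _ in range(n_iter):
        # step 2: classify each sample to the nearest mean (Euclidean distance)
        labels = np.argmin(((X[:, None, :] - means) ** 2).sum(axis=2), axis=1)
        # step 3: recompute the cluster means
        new_means = np.array([X[labels == i].mean(axis=0) for i in range(c)])
        if np.allclose(new_means, means):                         # step 4: stop if means unchanged
            break
        means = new_means
    return means, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
means, _ = basic_isodata(X, c=2)
print(means.round(1))   # one mean near (0, 0), the other near (5, 5)
```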

154 Iterative clustering algorithms  Top-down algorithms: Start from a single class (all data); Split a class (e.g., by perturbing its mean by ± one standard deviation); Continue splitting the “largest” class until the desired number of clusters is reached  Bottom-up algorithms: Start with each training sample as a different class; Start merging classes (e.g., using a NNR criterion) until the desired number of classes is reached

