1
Review Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005
2
PatReco: Introduction Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005
3
PatReco: Applications Speech/audio/music/sounds Speech recognition, Speaker verification/id, Image/video OCR, AVASR, Face id, Fingerprint id, Video segmentation Text/Language Machine translation, document class., lang. mod., text underst. Medical/Biology Disease diagnosis, DNA sequencing, Gene disease models Other Data User modeling (books/music), Ling analysis (web), Games
4
Basic Concepts Why statistical modeling? Variability: differences between two examples of the same class in training Mismatch: differences between two examples of the same class (one in training, one in testing) Learning modes: Supervised learning: class labels known Unsupervised learning: class labels unknown Reinforcement learning: only positive/negative feedback
5
Basic Concepts Feature selection Separate classes, Low correlation Model selection Model type, Model order Prior knowledge E.g., a priori class probability Missing features/observations Modeling of time series Correlation in time (model?), segmentation
6
PatReco: Algorithms Parametric vs Non-Parametric Supervised vs Unsupervised Basic Algorithms: Bayesian Non-parametric Discriminant Functions Non-Metric Methods
7
PatReco: Algorithms Bayesian methods Formulation (describe class characteristics) Bayes classifier Maximum likelihood estimation Bayesian learning Estimation-Maximization Markov models, hidden Markov models Bayesian Nets Non-parametric Parzen windows Nearest Neighbour
8
PatReco: Algorithms Discriminant Functions Formulation (describe boundary) Learning: Gradient descent Perceptron MSE=minimum squared error LMS=least mean squares Neural Net generalizations Support vector machines Non-Metric Methods Classification and Regression Trees String Matching
9
PatReco: Algorithms Unsupervised Learning: Mixture of Gaussians K-means Other topics not covered: Multi-layered Neural Nets Stochastic Learning (Simulated Annealing) Genetic Algorithms Fuzzy Algorithms Etc…
10
PatReco: Problem Solving 1.Data Collection 2.Data Analysis 3.Feature Selection 4.Model Selection 5.Model Training 6.Classification 7.Classifier Evaluation
23
Evaluation Training Data Set 1234 examples of class 1 and class 2 Testing/Evaluation Data Set 134 examples of class 1 and class 2 Misclassification Error Rate Training: 11.61% (150 errors) Testing: 13.43% (18 errors) Correct for chance (Training 22%, Testing 26%) Why?
24
PatReco: Discriminant Functions for Gaussians Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005
25
PatReco: Problem Solving 1.Data Collection 2.Data Analysis 3.Feature Selection 4.Model Selection 5.Model Training 6.Classification 7.Classifier Evaluation
26
Discriminant Functions Define class boundaries (instead of class characteristics) Dualism: parametric class description ↔ Bayes classifier; decision boundary ↔ parametric discriminant functions
27
Normal Density 1D Multi-D Full covariance Diagonal covariance Diagonal covariance + univariate Mixture of Gaussians Usually diagonal covariance
29
Gaussian Discriminant Functions Same variance ALL classes Hyper-planes Different variance among classes Hyper-quadratics (hyper-parabolas, hyper- ellipses etc.)
32
Hyper-Planes When the covariance matrix is common across Gaussian classes the decision boundary is a hyper-plane that is perpendicular to the line connecting the means of the Gaussian distributions If the a-priori probabilities of the classes are equal the hyper-plane cuts the line connecting the Gaussian means in the middle → Euclidean classifier
33
Gaussian Discriminant Functions Same variance ALL classes Hyper-planes Different variance among classes Hyper-quadratics (hyper-parabolas, hyper- ellipses etc.)
38
Hyper-Quadratics When the Gaussian class variances are different the boundary can be a hyper-plane, multiple hyper-planes, a hyper-sphere, a hyper-parabola, a hyper-ellipsoid etc. The boundary is in general NOT perpendicular to the line connecting the Gaussian means If the a-priori probabilities of the classes are equal the resulting classifier is a Mahalanobis classifier
39
Conclusions Parametric statistical models describe class characteristics x by modeling the observation probabilities p(x|class) Discriminant functions describe class boundaries parametrically Parametric statistical models have an equivalent parametric discriminant function For Gaussian p(x|class) distributions the decision boundaries are hyper-planes or hyper-quadratics
40
PatReco: Detection Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005
41
Detection Goal: Detect an Event Possible outcomes: Hit (Success), False Alarm, Miss (Failure), False Reject
44
PatReco: Estimation/Training Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005
45
Estimation/Training Goal: Given observed data (re-)estimate the parameters of the model e.g., for a Gaussian model estimate the mean and variance for each class
46
Supervised-Unsupervised Supervised training: All data has been (manually) labeled, i.e., assigned to classes Unsupervised training: Data is not assigned a class label
47
Observable data Fully observed data: all information necessary for training is available (features, class labels etc.) Partially observed data: some of the features or some of the class labels are missing
48
Supervised Training (fully observable data) Maximum likelihood estimation (ML) Maximum a posteriori estimation (MAP) Bayesian estimation (BE)
49
Training process Collected data used for training consists of the following examples: D = {x_1, x_2, … x_N} Step 1: Label each example with the corresponding class label ω_1, ω_2, … ω_K Step 2: For each of the classes separately estimate the model parameters using ML, MAP, BE and the corresponding training examples D_1, D_2, … D_K
50
Training Process: Step 1 D = {x_1, x_2, x_3, x_4, x_5, … x_N} Label manually with ω_1, ω_2, … ω_K: D_1 = {x_11, x_12, x_13, … x_1N1} D_2 = {x_21, x_22, x_23, … x_2N2} … D_K = {x_K1, x_K2, x_K3, … x_KNK}
51
Training Process: Step 2 Maximum likelihood: θ_1 = argmax_θ P(D_1|θ_1) Maximum-a-posteriori: θ_1 = argmax_θ P(D_1|θ_1) P(θ_1) Bayesian estimation: P(x|ω_1) = ∫ P(x|θ_1) P(θ_1|D_1) dθ_1
52
ML Estimation Assumptions 1. P(x|ω_i) follows a parametric distribution with parameters θ 2. D_j tells us nothing about P(x|ω_i) for j ≠ i (functional independence) 3. Observations x_1, x_2, x_3, … x_N are iid (independent, identically distributed) 4a. (ML only!) θ is a quantity whose value is fixed but unknown
53
ML estimation θ = argmax_θ P(θ|D) = argmax_θ P(D|θ) P(θ) = (by 4a) argmax_θ P(D|θ) = argmax_θ P(x_1, x_2, … x_N|θ) = (by 3) argmax_θ Π_j P(x_j|θ) ⇒ ∂/∂θ Π_j P(x_j|θ) = 0 ⇒ θ = …
54
ML estimate for Gaussian pdf If P(x|ω) = N(μ, σ²) and θ = (μ, σ²) then 1-D: μ = (1/N) Σ_{j=1..N} x_j, σ² = (1/N) Σ_{j=1..N} (x_j – μ)² Multi-D: θ = (μ, Σ), μ = (1/N) Σ_{j=1..N} x_j, Σ = (1/N) Σ_{j=1..N} (x_j – μ)(x_j – μ)^T
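A minimal numpy sketch of the ML estimates above; the data array is made up for illustration and is not from the slides.

```python
import numpy as np

# Toy training data for one class: N samples, d features (values are illustrative).
D = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [0.9, 2.2],
              [1.2, 2.1]])
N = D.shape[0]

# ML estimates: sample mean and (1/N) sample covariance, matching the slide's
# formulas mu = (1/N) sum_j x_j and Sigma = (1/N) sum_j (x_j - mu)(x_j - mu)^T.
mu = D.mean(axis=0)
centered = D - mu
Sigma = (centered.T @ centered) / N   # outer-product form, divided by N (not N-1)

print("mu =", mu)
print("Sigma =\n", Sigma)
```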
55
Bayesian Estimat. Assumptions 1.P(x|ω i ) follows a parametric distribution with parameters θ 2.D j tells us nothing about P(x|ω i ) (functional independence) 3.Observations x 1, x 2, x 3, … x N are iid (independent identically distributed) 4b (MAP, BE) θ is a random variable whose prior distribution p(θ) is known
56
Bayesian Estimation P(x|D) = ∫ P(x,θ|D) dθ = ∫ P(x|θ,D) P(θ|D) dθ = ∫ P(x|θ) P(θ|D) dθ STEP 1: P(θ) → P(θ|D), where P(θ|D) = P(D|θ) P(θ) / P(D) STEP 2: P(x|θ) → P(x|D)
57
Bayesian Estimate for Gaussian pdf and priors If P(x|θ) = N(μ, σ²) and p(θ) = N(μ_0, σ_0²) then STEP 1: P(θ|D) = N(μ_n, σ_n²) STEP 2: P(x|D) = N(μ_n, σ² + σ_n²) where μ_n = σ_0²/(n σ_0² + σ²) (Σ_j x_j) + σ²/(n σ_0² + σ²) μ_0 and σ_n² = σ² σ_0² / (n σ_0² + σ²) For large n (number of training samples) maximum likelihood and Bayesian estimation are equivalent!!!
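A minimal sketch of these posterior and predictive formulas for a 1-D Gaussian with known variance; the prior parameters and toy samples below are illustrative assumptions.

```python
import numpy as np

# Known observation variance sigma^2 and Gaussian prior N(mu0, sigma0^2) on the mean
# (all numbers below are illustrative assumptions).
sigma2, mu0, sigma0_2 = 1.0, 0.0, 2.0
x = np.array([0.8, 1.1, 0.9, 1.3, 1.0])   # toy 1-D training samples
n = len(x)

# Posterior parameters from the slide:
#   mu_n      = sigma0^2/(n sigma0^2 + sigma^2) * sum_j x_j + sigma^2/(n sigma0^2 + sigma^2) * mu0
#   sigma_n^2 = sigma^2 sigma0^2 / (n sigma0^2 + sigma^2)
denom = n * sigma0_2 + sigma2
mu_n = (sigma0_2 / denom) * x.sum() + (sigma2 / denom) * mu0
sigma_n2 = sigma2 * sigma0_2 / denom

# Predictive density: P(x|D) = N(mu_n, sigma^2 + sigma_n^2)
print("posterior mean :", mu_n)
print("posterior var  :", sigma_n2)
print("predictive var :", sigma2 + sigma_n2)
```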
58
Conclusions Maximum likelihood estimation is simple and gives good estimates when the number of training samples is large Bayesian adaptation gives good estimates even for small amounts of training data provided that a good prior is selected Bayesian adaptation is hard and often does not have a closed-form solution (in which case try recursive Bayesian estimation)
59
PatReco: Model and Feature Selection Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005
60
Breakdown of Classification Error Bayes error Model selection error Model estimation error Data mismatch error (training-testing)
62
True statements about Bayes error (valid within statistical significance) The Bayes error is ALWAYS smaller than the total (empirical) classification error If the model, estimation and mismatch errors are zero then the total classification error equals the Bayes error The ONLY way to reduce the Bayes error is to add new features in the classifier design
63
More true statements Adding new features can only reduce the Bayes error (this is not true about the total classification error!!!) Adding new features will NOT reduce the Bayes error if the new features are Very bad at discriminating between classes (feature pdfs overlapping) Highly correlated with existing features
64
Gaussian classification Bayes Error For two classes ω_1 and ω_2 following Gaussian distributions with means μ_1, μ_2 and the same variance σ², the Bayes error is: P(error) = 1/(2π)^0.5 ∫_{r/2}^{∞} exp{-u²/2} du, where r = |μ_1 - μ_2|/σ
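A small sketch that evaluates this Gaussian tail integral numerically via the complementary error function; the example means and variance are illustrative.

```python
import math

def bayes_error(mu1, mu2, sigma):
    """P(error) = (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du, with r = |mu1-mu2|/sigma."""
    r = abs(mu1 - mu2) / sigma
    # The Gaussian tail integral Q(r/2), written with the complementary error function.
    return 0.5 * math.erfc((r / 2) / math.sqrt(2))

# Example: means one standard deviation apart (r = 1) -> error ~ 0.3085
print(bayes_error(0.0, 1.0, 1.0))
```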
65
Feature Selection If we had infinite amounts of data then the more features the better! However, in practice data is finite: more features → more parameters to train!!! Good features are: Uncorrelated Able to discriminate among classes
66
Model selection Number of model parameters is the number of parameters that need to be estimated Overfitting: too many parameters, too little data!!! Gaussian models - Model selection: Single Gaussians Mixture of Gaussians Fixed Variance Tied Variance Diagonal Variance
67
Conclusion Introducing more features and/or more complex models can only reduce the classification error (if infinite amounts of training data are available) In practice: the number of features and the number of model parameters is a function of the amount of training data available (avoid overfitting!) Good features are uncorrelated and discriminative
68
PatReco: Expectation Maximization Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005
69
When do we use EM? Partially observable data Missing some features from some samples, e.g., D={(1,2),(2,3),(?,4)} Missing class labels, e.g., hidden states of HMMs Missing class sub-labels, e.g., mixture label for mixture of Gaussian models
70
The EM algorithm The Expectation Maximization (EM) algorithm consists of alternating expectation and maximization steps During the expectation step the “best estimates of the missing information” are computed During the maximization step maximum likelihood training on all data is performed
71
EM
Initialization: θ(0)
for i = 1..iterno   // usually iterno = 2 or 3
    E step: Q(i) = E_{D_bad} { log p(D; θ) | D_good, θ(i-1) }
    M step: θ(i) = argmax_θ { Q(i) }
end
72
Pseudo-EM
Initialization: θ(0)
for i = 1..iterno   // usually iterno = 2 or 3
    Expectation step: D_bad = E{ D_bad | θ(i-1) }
    Maximization step: θ(i) = argmax_θ { p(D | θ) }
end
73
Convergence EM is guaranteed to converge to a local optimum (NOT the global optimum!) Pseudo-EM has no convergence guarantees but is used often in practice
74
Conclusions EM is an iterative algorithm used when there are missing or partially observable training data EM is a generalization of ML training EM is guaranteed to converge to a local optimum (NOT the global optimum!)
75
PatReco: Bayesian Networks Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005
76
Definitions Bayesian networks consist of nodes and (usually directional) arcs Nodes or states represent a classification class or, in general, events and are described with a pdf Arcs represent relations between nodes, e.g., cause and effect, time sequence Two nodes that are connected via another node are conditionally independent (given that node)
77
When to use Bayesian nets Bayesian networks (or networks of inference) are statistical models that are used for classification (or in general pattern recognition) problems where there are dependencies among classes, e.g., time dependencies, cause and effect dependencies
78
Conditional Independence Full independence between A and B P(A|B) = P(A) or P(A,B) = P(A) P(B) Conditional independence of A, B given C P(A|BC) = P(A|C) or P(A,B|C) = P(A|C)P(B|C)
79
Conditional Independence A, C independent given B: P(C|B,A) = P(C|B) B, C independent given A: P(B,C|A) = P(B|A) P(C|A) A, C dependent given B: P(A,C|B) cannot be reduced! [Diagrams: chain A → B → C; fork B ← A → C; v-structure A → B ← C]
80
Three problems 1. Probability computation (use independence) 2. Training/Parameter Estimation: Maximum likelihood (ML) if all is observable, Expectation maximization (EM) if missing data 3. Inference (Testing): Diagnosis P(cause|effect) (bottom-up), Prediction P(effect|cause) (top-down)
81
Probability Computation For a Bayesian Network that consists of N nodes: 1. Compute P(n_1, n_2, … n_N) using the chain rule, starting from the “last/bottom” node and working your way up: P(n_1, n_2, … n_N) = P(n_N | n_1, n_2, … n_N-1) P(n_N-1 | n_1, n_2, … n_N-2) … P(n_2 | n_1) P(n_1) 2. Identify conditional independence conditions from the Bayesian network topology 3. Simplify the conditional probabilities using the independence conditions
82
Probability Computation Topology: [Diagram: network with root C, children S and R, both parents of W] Chain rule: P(C,S,R,W) = P(W|C,S,R) P(S|C,R) P(R|C) P(C) Independent: (W,C)|S,R and (S,R)|C Dependent: (S,R)|W Simplified: P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)
83
Probability Computation There are general algorithms for identifying cliques in the Bayesian net Cliques are islands of conditional dependence, i.e., terms in the probability computation that cannot be further reduced [Diagram: clique network with nodes SC, WSR, RC]
84
Training/Parameter Estimation Instead of estimating the joint pdf of the whole network the joint pdf of each of the cliques is estimated For example if the network joint pdf is P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C) instead of computing P(C,S,R,W) we compute each of P(W|S,R), P(S|C), P(R|C), P(C) for all possible values of W, S, R, C (much simpler)
85
Training/Parameter Estimation For fully observable data and discrete probabilities compute maximum likelihood estimates of the parameters, e.g., for discrete probabilities: P(W=1|S=1,R=0)_ML = counts(W=1,S=1,R=0) / counts(W=*,S=1,R=0)
86
Training/Parameter Estimation Example: the following observation tuples are given for (W,C,S,R): (1,0,1,0), (0,0,1,0), (1,1,1,0), (0,1,1,0), (1,0,1,0), (0,1,0,0), (1,0,0,1), (0,1,1,1), (1,1,1,0) Using Maximum Likelihood Estimation: P(W=1|S=1,R=0) ML = #(1,*,1,0)/#(*,*,1,0) = 2/5 = 0.4
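A minimal sketch of the counting (relative-frequency) estimate described above; the short list of (W, C, S, R) tuples below is made up for illustration and is not the slide's data.

```python
# Toy fully-observed samples as (W, C, S, R) tuples -- values are made up for illustration.
data = [(1, 0, 1, 0), (0, 0, 1, 0), (1, 1, 1, 0), (0, 1, 0, 0), (1, 0, 0, 1)]

def ml_conditional(samples, w_value, s_value, r_value):
    """Relative-frequency (ML) estimate of P(W=w_value | S=s_value, R=r_value)."""
    matches_condition = [t for t in samples if t[2] == s_value and t[3] == r_value]
    if not matches_condition:
        return None  # the conditioning event was never observed
    matches_joint = [t for t in matches_condition if t[0] == w_value]
    return len(matches_joint) / len(matches_condition)

# counts(W=1,S=1,R=0) / counts(W=*,S=1,R=0) on the toy data above
print(ml_conditional(data, w_value=1, s_value=1, r_value=0))
```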
87
Training/Parameter Estimation When data is non observable or missing the EM algorithm is employed There are efficient implementations of the EM algorithm for Bayesian nets that operate on the clique network When the topology of the Bayesian network is not known structural EM can be used
88
Inference There are two types of inference (testing): Diagnosis P(cause|effect) (bottom-up), Prediction P(effect|cause) (top-down) Once the parameters of the network are estimated the joint network pdf can be estimated for ALL possible network values Inference is simply probability computation using the network pdf
89
Inference For example P(W=1|C=1) = P(W=1,C=1) / P(C=1), where P(W=1,C=1) = Σ_{R,S} P(W=1, C=1, R, S) and P(C=1) = Σ_{W,R,S} P(W, C=1, R, S)
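A small sketch of this marginalization for the C, S, R, W network; only the factorization P(W|S,R) P(S|C) P(R|C) P(C) comes from the slides, while the CPT values below are invented for illustration.

```python
import itertools

# Assumed CPTs for a C -> {S, R} -> W network (all numbers are illustrative).
P_C = {1: 0.5, 0: 0.5}
P_S_given_C = {1: {1: 0.1, 0: 0.9}, 0: {1: 0.5, 0: 0.5}}               # P(S|C)
P_R_given_C = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.2, 0: 0.8}}               # P(R|C)
P_W1_given_SR = {(1, 1): 0.99, (1, 0): 0.9, (0, 1): 0.9, (0, 0): 0.0}  # P(W=1|S,R)

def joint(c, s, r, w):
    """P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C), the factorization from the slides."""
    pw = P_W1_given_SR[(s, r)] if w == 1 else 1.0 - P_W1_given_SR[(s, r)]
    return pw * P_S_given_C[c][s] * P_R_given_C[c][r] * P_C[c]

# P(W=1|C=1) = P(W=1,C=1) / P(C=1), each term obtained by summing the joint.
num = sum(joint(1, s, r, 1) for s, r in itertools.product([0, 1], repeat=2))
den = sum(joint(1, s, r, w) for s, r, w in itertools.product([0, 1], repeat=3))
print("P(W=1|C=1) =", num / den)
```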
90
Inference Efficient algorithms exist for performing inference in large networks which operate on the clique network Inference is often posed as a probability maximization problem, e.g., what is the most probable cause or effect? argmax_W P(W|C=1)
91
Continuous Case In our examples the network nodes represented discrete events (states or classes) Network nodes often hold continuous variables (observations), e.g., length, energy For the continuous case parametric pdfs are introduced and their parameters are estimated using ML (observed data) or EM (hidden data)
92
Some Applications Medical diagnosis Computer problem diagnosis (MS) Markov chains Hidden Markov Models (HMMs)
93
Conclusions Bayesian networks are used to represent dependencies between classes Network topology defines conditional independence conditions that simplify the network pdf modeling and computation Three problems: probability computation, estimation/training, inference/testing
94
PatReco: Hidden Markov Models Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005
95
Markov Models: Definition Markov chains are Bayesian networks that model sequences of events (states) Sequential events are dependent Two non-sequential events are conditionally independent given the intermediate events (MM-1)
96
Markov chains [Diagrams: chains of states q0, q1, q2, q3, q4, … drawn for MM-0, MM-1, MM-2 and MM-3, with arcs reaching back 0, 1, 2 and 3 states respectively]
97
Markov Chains MM-0: P(q_1, q_2, … q_N) = Π_{n=1..N} P(q_n) MM-1: P(q_1, q_2, … q_N) = Π_{n=1..N} P(q_n|q_n-1) MM-2: P(q_1, q_2, … q_N) = Π_{n=1..N} P(q_n|q_n-1, q_n-2) MM-3: P(q_1, q_2, … q_N) = Π_{n=1..N} P(q_n|q_n-1, q_n-2, q_n-3)
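A minimal sketch of the MM-1 product above, scoring a state sequence with an initial distribution and a transition table; the states and probabilities are illustrative assumptions.

```python
# MM-1 sequence probability: P(q_1..q_N) = prod_n P(q_n | q_{n-1}),
# with the first factor read as the initial-state probability.
initial = {"A": 0.6, "B": 0.3, "C": 0.1}
trans = {"A": {"A": 0.7, "B": 0.2, "C": 0.1},
         "B": {"A": 0.3, "B": 0.4, "C": 0.3},
         "C": {"A": 0.2, "B": 0.3, "C": 0.5}}

def mm1_prob(seq):
    """Probability of a state sequence under the MM-1 factorization."""
    p = initial[seq[0]]
    for prev, cur in zip(seq[:-1], seq[1:]):
        p *= trans[prev][cur]
    return p

print(mm1_prob(["A", "A", "B", "C"]))   # 0.6 * 0.7 * 0.2 * 0.3
```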
98
Hidden Markov Models Hidden Markov chains model sequences of events and corresponding sequences of observations Events form a Markov chain (MM-1) Observations are conditionally independent given the sequence of events Each observation is directly connected with a single event (and conditionally independent of the rest of the events in the network)
99
Hidden Markov Models [Diagram: state chain q0 → q1 → q2 → q3 → q4 → … with an observation o0, o1, o2, o3, o4, … attached to each state] HMM-1: P(o_0, o_1, … o_N, q_0, q_1, … q_N) = Π_{n=0..N} P(q_n|q_n-1) P(o_n|q_n)
100
Parameter Estimation The parameters that have to be estimated are the a-priori probabilities P(q 0 ) transition probabilities P(q n |q n-1 ) observation probabilities P(o n |q n ) For example if there are 3 types of events and continuous 1-D observations that follow a Gaussian distribution there are 18 parameters to estimate: 3 a-priori probabilities 3x3 transition probabilities matrix 3 means and 3 variances (observation probabilities)
101
Parameter Estimation If both the sequence of events and the sequence of observations are fully observable then ML is used Usually the sequence of events q_0, q_1, … q_N is non-observable, in which case EM is used The EM algorithm for HMMs is the Baum-Welch or forward-backward algorithm
102
Inference/Decoding The main inference problem for HMMs is known as the decoding problem: given a sequence of observations find the best sequence of states: q = argmax_q P(q|O) = argmax_q P(q,O) An efficient decoding algorithm is the Viterbi algorithm
103
Viterbi algorithm max_q P(q,O) = max_q P(o_0, o_1, … o_N, q_0, q_1, … q_N) = max_q Π_{n=0..N} P(q_n|q_n-1) P(o_n|q_n) = max_{q_N} { P(o_N|q_N) max_{q_N-1} { P(q_N|q_N-1) P(o_N-1|q_N-1) … max_{q_2} { P(q_3|q_2) P(o_2|q_2) max_{q_1} { P(q_2|q_1) P(o_1|q_1) max_{q_0} { P(q_1|q_0) P(o_0|q_0) P(q_0) } } } … } }
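A compact sketch of this nested maximization as the usual dynamic program over the trellis (in log space); the toy 2-state, 2-symbol HMM at the bottom is an illustrative assumption.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most probable state sequence argmax_q P(q, O) for a discrete-observation HMM.

    log_pi : (K,)   log initial state probabilities
    log_A  : (K, K) log transition probabilities, log_A[i, j] = log P(q_n=j | q_{n-1}=i)
    log_B  : (K, M) log observation probabilities, log_B[j, o] = log P(o | q=j)
    obs    : list of observation symbol indices
    """
    K, N = len(log_pi), len(obs)
    delta = np.zeros((N, K))           # best log-probability of any path ending in state j at time n
    psi = np.zeros((N, K), dtype=int)  # back-pointers to recover the best path

    delta[0] = log_pi + log_B[:, obs[0]]
    for n in range(1, N):
        scores = delta[n - 1][:, None] + log_A    # (K, K): previous state x current state
        psi[n] = scores.argmax(axis=0)
        delta[n] = scores.max(axis=0) + log_B[:, obs[n]]

    # Backtrack: keep only the best (most probable) path through each node, as on the trellis slide.
    path = [int(delta[-1].argmax())]
    for n in range(N - 1, 0, -1):
        path.append(int(psi[n][path[-1]]))
    return path[::-1]

# Toy 2-state, 2-symbol HMM (all probabilities are illustrative assumptions).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(np.log(pi), np.log(A), np.log(B), obs=[0, 1, 1, 0]))
```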
104
Viterbi algorithm [Diagram: trellis of states 1, 2, 3, 4, … K unrolled over time] At each node keep only the best (most probable) path from all the paths passing through that node
105
Deep Thoughts HMM-0 (HMM with MM-0 event chain) is the Bayes classifier!!! MMs and HMMs are poor models but simple and efficient computationally How do you fix this? (dependent observations?)
106
Some Applications Speech Recognition Optical Character Recognition Part-of-Speech Tagging …
107
Conclusions HMMs and MMs are useful modeling tools for dependent sequences of events (states or classes) Efficient algorithms exist for training HMM parameters (Baum-Welch) and for decoding the most probable sequence of states given an observation sequence (Viterbi) HMMs have many applications
108
Non Parametric Classifiers Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005
109
Histograms-Parzen Windows Main idea: Instead of selecting a parametric distribution (e.g., Gaussian) to describe the properties of the features of a class, compute the empirical distribution directly (class feature histogram)
110
Feature Histogram Example [Plot: histogram of feature X; y-axis shows the # of samples in each bin] Normalize the histogram curve to get the feature PDF
111
Parzen Windows: Issues When compared to parametric methods empirical distributions are: Better because no specific form of the PDF is assumed Worse because over-fitting can easily occur (too small a histogram bin) Parzen proposed rules for adapting the bin size based on the number of samples in each bin to avoid over-fitting
112
Nearest Neighbor Rule Main idea (1-NNR): No explicit model (i.e., no training) For each test sample x the “nearest” sample x’ in the training set is found, i.e., argmin x’ d(x, x’) and x is classified to the class where x’ belongs
113
Generalizations k-NNR: Instead of finding the nearest neighbor we find the k nearest neighbors from the training set; the sample x is classified to the class where most of the k neighbors belong k-l-NNR: Like k-NNR but at least l of the k nearest neighbors must belong to the same class for a classification decision to be taken (else no decision)
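A minimal sketch of the k-NNR rule with Euclidean distance; the training points and test points are illustrative, chosen only loosely in the spirit of the slides' two-class example.

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=1):
    """k-NNR: classify x to the class most common among its k nearest training samples."""
    dists = np.linalg.norm(train_X - x, axis=1)   # Euclidean distance d(x, x')
    nearest = np.argsort(dists)[:k]               # indices of the k closest training samples
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 1-D training set with two classes (labels 1 and 2); values are illustrative.
train_X = np.array([[0.0], [-1.0], [-2.0], [1.0], [1.1], [0.9]])
train_y = np.array([1, 1, 1, 2, 2, 2])
print(knn_classify(np.array([0.4]), train_X, train_y, k=1))   # nearest sample is 0.0 -> class 1
print(knn_classify(np.array([0.8]), train_X, train_y, k=3))   # majority of the 3 nearest -> class 2
```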
114
Example Training set D_1 = {0, -1, -2} and D_2 = {1, 1, 1} [Number-line plot from -2 to 3 marking the 1-NNR decision boundary, the 3-NNR decision boundary and the 3-3-NNR no-decision region]
115
Computational Efficiency To speed up NNR classification the training set size can be reduced using the condensing algorithm: the training set is classified using the NNR rule; misclassified samples are added to the new (condensed) training set one by one until all training samples are correctly classified
116
Conclusions Non parametric classification algorithms are easy to implement are computationally efficient (in training) don’t make any assumptions are prone to over-fitting are hard to adapt (no detailed model)
117
Discriminant Functions Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005
118
Discriminant Functions Main Idea: Describe parametrically the decision boundary (instead of the properties of the class), e.g., the two classes are separated by a straight line a x_1 + b x_2 + c = 0 with parameters (a, b, c) (instead of, e.g., modeling the feature PDFs as 2-D Gaussians)
119
Example: Two classes, two features [Two plots in the (x_1, x_2) feature plane for classes w_1, w_2: “Model Class Boundary” shows the separating line a x_1 + b x_2 + c = 0; “Model Class Characteristics” shows the class densities N(μ_1, Σ_1) and N(μ_2, Σ_2)]
120
Duality Dualism: parametric class description ↔ Bayes classifier; decision boundary ↔ parametric discriminant functions For example, modeling class features by Gaussians with the same (across-class) variance results in hyper-plane discriminant functions
122
Discriminant Functions Discriminant functions g_i(x) are functions of the features x of a class i A sample x is classified to the class c for which g_i(x) is maximized, i.e., c = argmax_i {g_i(x)} The equation g_i(x) = g_j(x) defines the class boundary for each pair of (different) classes i and j
123
Linear Discriminant Functions Two-class problem: A single discriminant function is defined as: g(x) = g_1(x) – g_2(x) If g(x) is a linear function g(x) = w^T x + w_0 then the boundary is a hyper-plane (point, line, plane for 1-D, 2-D, 3-D features respectively)
124
Linear Discriminant Functions [Plot: the line a x_1 + b x_2 + c = 0 in the (x_1, x_2) plane, with normal vector w = (a, b) and axis intercepts -c/a and -c/b]
125
Non Linear Discriminant Functions Quadratic discriminant functions: g(x) = w_0 + Σ_i w_i x_i + Σ_ij w_ij x_i x_j For example, for a two-class 2-D problem: g(x) = a + b x_1 + c x_2 + d x_1² Any non-linear discriminant function can become linear by increasing the dimensionality, e.g., y_1 = x_1, y_2 = x_2, y_3 = x_1² (2-D nonlinear → 3-D linear): g(y) = a + b y_1 + c y_2 + d y_3
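A small sketch of the linearization trick above: the same quadratic discriminant evaluated directly and via the mapping (x_1, x_2) → (y_1, y_2, y_3); the coefficients and test point are illustrative.

```python
import numpy as np

def quadratic_feature_map(x):
    """Map a 2-D point (x1, x2) to (y1, y2, y3) = (x1, x2, x1**2), so that the quadratic
    boundary a + b*x1 + c*x2 + d*x1**2 = 0 becomes linear in y."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2])

# Quadratic discriminant g(x) = a + b*x1 + c*x2 + d*x1^2 with illustrative coefficients.
a, b, c, d = -1.0, 0.5, 1.0, 2.0
w0, w = a, np.array([b, c, d])          # the same boundary, now linear: g(y) = w0 + w.y

x = np.array([0.3, 0.7])
g_quadratic = a + b * x[0] + c * x[1] + d * x[0] ** 2
g_linear = w0 + w @ quadratic_feature_map(x)
print(g_quadratic, g_linear)            # identical values: the mapping preserves g
```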
126
Parameter Estimation The parameters w are estimated by functional minimization The function J to be minimized models the average distance of the training samples from the decision boundary, computed over either: Misclassified training samples All training samples The function J is minimized using gradient descent
127
Gradient Descent Iterative procedure towards a local minimum: a(k+1) = a(k) – η(k) ∇J(a(k)) where k is the iteration number, η(k) is the learning rate and ∇J(a(k)) is the gradient of the function to be minimized evaluated at a(k) Newton descent is gradient descent with the learning rate equal to the inverse Hessian matrix
128
Distance Functions Perceptron Criterion Function: J_p(a) = Σ_{misclassified} (– a^T y) Relaxation With Margin b: J_r(a) = Σ_{misclassified} (a^T y – b)² / ||y||² Least Mean Square (LMS): J_s(a) = Σ_{all samples} (a^T y_i – b_i)² Ho-Kashyap rule: J_s(a,b) = Σ_{all samples} (a^T y_i – b_i)²
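A minimal sketch of minimizing the perceptron criterion J_p by gradient descent, using the usual augmented, sign-normalized samples y (an assumption about notation); the toy data and learning rate are illustrative.

```python
import numpy as np

def perceptron_train(X, y, lr=0.1, iters=100):
    """Minimize J_p(a) = sum_misclassified (-a^T y) by gradient descent:
    a <- a + lr * sum_misclassified y, where y are augmented, sign-normalized samples
    so that a^T y > 0 means 'correctly classified'."""
    # Augment with a constant 1 (bias term) and flip the sign of class-2 samples.
    Y = np.hstack([X, np.ones((len(X), 1))])
    Y[y == 2] *= -1
    a = np.zeros(Y.shape[1])
    for _ in range(iters):
        misclassified = Y[Y @ a <= 0]
        if len(misclassified) == 0:
            break                                    # converges only if the set is separable
        a = a + lr * misclassified.sum(axis=0)       # step along the negative gradient of J_p
    return a

# Toy linearly separable 2-D data (illustrative values).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, 2, 2])
a = perceptron_train(X, y)
print("boundary coefficients [w1, w2, w0]:", a)      # boundary: w1*x1 + w2*x2 + w0 = 0
```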
129
Discriminant Functions Working on misclassified samples only (Perceptron, Relaxation with Margin) provides better results but converges only for separable training sets
130
High Dimensionality Using non-linear discriminant functions and linearizing them in a high-dimensional space can make ANY training set separable, but at the cost of a large # of parameters (curse of dimensionality) Support vector machines: a smart way to select appropriate terms (dimensions) is needed
131
Non-Metric Methods: Decision Trees Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005
132
Decision Trees Motivation: There are features (discrete) that don’t have an obvious notion of similarity or ordering (nominal data), e.g., book type, shape, sound type Taxonomies (i.e., trees with is-a relationship) are the oldest form of classification
133
Decision Trees: Definition Decision trees are classifiers that classify samples based on a set of questions that are asked hierarchically (tree of questions) Example questions: is color red? is x < 0.5? Terminology: root, leaf, node, arc, branch, parent, children, branching factor, depth
134
Fruit classifier [Tree diagram: root question Color? with branches green / yellow / red; follow-up questions Size?, Shape?, Size?, Taste? with answers such as big / med / small, round / thin, sweet / sour]
135
Fruit classification [Same tree, with the path of answers leading to the leaf CHERRY highlighted]
139
Fruit classifier [Tree diagram as before, now with class labels at the leaves: watermelon, grape, grapefruit, cherry, grape]
140
Binary Trees Binary trees: each parent node has exactly two children nodes (branching factor = 2) Any tree can be represented as a binary tree by changing the set of questions and by increasing the tree depth, e.g., the 3-way question Color? (green / yellow / red) becomes the binary questions Color = green? (Y/N) followed by Color = yellow? (Y/N)
141
Decision Trees: Problems 1. List of questions (features): all possible questions are considered 2. Which questions to split on first (best split): the questions that split the data best (reduce impurity at each node) are asked first 3. Stopping criteria (pruning criteria): stop when further splits don’t reduce impurity
142
Best Split example Two-class problem with 100 examples each from w_1 and w_2 Three binary questions Q1, Q2 and Q3 that split the data as follows: 1. Node 1: (50,50), Node 2: (50,50) 2. Node 1: (100,0), Node 2: (0,100) 3. Node 1: (80,0), Node 2: (20,100)
143
Impurity Measures Impurity measures the degree of homogeneity of a node; a node is pure if it consists of training examples from a single class Impurity Measures: Entropy Impurity: i(N) = – Σ_i P(w_i) log2(P(w_i)) Variance (two-class): i(N) = P(w_1) P(w_2) Gini Impurity: i(N) = 1 – Σ_i P²(w_i) Misclassification: i(N) = 1 – max_i P(w_i)
144
Total Impurity Total Impurity at Depth 0: i(depth=0) = i(N) Total Impurity at Depth 1: i(depth=1) = p(N_L) i(N_L) + p(N_R) i(N_R) [Diagram: node N at depth 0 splits (yes/no) into children N_L and N_R at depth 1]
145
Impurity Example Node 1: (80,0), Node 2: (20,100) I(node 1) = 0 I(node 2) = – 20/120 log2(20/120) – 100/120 log2(100/120) = 0.65 P(node 1) = 80/200 = 0.4 P(node 2) = 120/200 = 0.6 I(total) = P(node 1) I(node 1) + P(node 2) I(node 2) = 0 + 0.6*0.65 = 0.39
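A short sketch that reproduces this entropy-impurity calculation for the (80,0) / (20,100) split from the slides.

```python
import math

def entropy_impurity(counts):
    """i(N) = -sum_i P(w_i) log2 P(w_i), with 0*log(0) treated as 0."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# The split from the slides: Node 1: (80, 0), Node 2: (20, 100), 200 samples in total.
i1 = entropy_impurity([80, 0])      # 0.0   (pure node)
i2 = entropy_impurity([20, 100])    # ~0.65
total = (80 / 200) * i1 + (120 / 200) * i2
print(i1, i2, total)                # ~0.0, 0.65, 0.39
```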
146
Continuous Example For continuous features: questions are of the type x < a, where x is the feature and a is a constant Decision Boundaries (two-class, 2-D example): [Plot: the (x1, x2) plane partitioned into axis-parallel rectangular regions labeled R1 and R2]
147
Summary Decision trees are useful categorical classification tools, especially for nominal (non-metric) data CART creates trees that minimize impurity on the training set at each node Decision region shape: for continuous features with x < a questions, unions of axis-parallel (hyper-)rectangles CART is a useful tool for feature selection
148
Unsupervised Training and Clustering Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005
149
Unsupervised Training Definition: The training set samples are unlabelled (unclassified) Motivation: Labeling is hard/time consuming Fully automatic adaptation of models (in the field)
150
Maximum Likelihood Training Given: N training examples drawn from c classes, i.e., D = {x_1, x_2, … x_N} (no class assignments are given!) Estimate: Class priors p(w_i) and feature PDF parameters θ: p(x|θ_i, w_i) Sometimes the number of classes c is not given and also has to be estimated
151
Unsupervised ML estimation Σ_k P(w_i|x_k, θ) ∇_θi log p(x_k|w_i, θ_i) = 0 Compared with supervised ML there is an additional term P(w_i|x_k, θ), the class membership function for each sample x_k Unsupervised ML is a version of EM Pseudo-EM: P(w_i|x_k, θ) is binary, 0 or 1
152
Mixture of Gaussians Estimates Linear combination of Gaussians with weights a_i: p(x_k) = Σ_i a_i N(x_k; μ_i, Σ_i) ML estimates: a_i = (1/N) Σ_k P(w_i|x_k) μ_i = (Σ_k P(w_i|x_k) x_k) / Σ_k P(w_i|x_k) Σ_i = (Σ_k P(w_i|x_k) (x_k – μ_i)(x_k – μ_i)^T) / Σ_k P(w_i|x_k)
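A minimal 1-D sketch of these updates run as EM: the E step computes the membership posteriors P(w_i|x_k), the M step applies the three re-estimation formulas. The toy data, initialization and component count are illustrative assumptions.

```python
import numpy as np

def gaussian(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gmm_em_1d(x, n_components=2, iters=50):
    """Unsupervised ML for a 1-D mixture of Gaussians via EM, using the slide's updates."""
    # Crude initialization (an assumption, not part of the slides).
    a = np.full(n_components, 1.0 / n_components)
    mu = np.linspace(x.min(), x.max(), n_components)
    var = np.full(n_components, x.var())

    for _ in range(iters):
        # E step: membership posteriors P(w_i | x_k) for every component and sample.
        resp = np.array([a[i] * gaussian(x, mu[i], var[i]) for i in range(n_components)])
        resp /= resp.sum(axis=0, keepdims=True)
        # M step: the ML re-estimation equations for a_i, mu_i and the variances.
        Nk = resp.sum(axis=1)
        a = Nk / len(x)
        mu = (resp @ x) / Nk
        var = np.array([(resp[i] @ (x - mu[i]) ** 2) / Nk[i] for i in range(n_components)])
    return a, mu, var

# Toy data drawn from two Gaussians (illustrative).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])
print(gmm_em_1d(x))
```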
153
Clustering Basic Isodata: 1. Select an initial partition of the data into c classes and compute the cluster means 2. Classify the training samples using a classification criterion (Euclidean distance to the cluster means) 3. Recompute the cluster means based on the training set classification decisions 4. If there is no change in the cluster means stop, else go to step 2
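A compact sketch of these four steps (essentially k-means); the random-mean initialization and toy 2-D data are illustrative assumptions.

```python
import numpy as np

def basic_isodata(X, c, iters=100, seed=0):
    """Basic Isodata (k-means): alternate Euclidean-distance classification and
    cluster-mean re-estimation until the means stop changing."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=c, replace=False)]   # step 1: initial cluster means
    for _ in range(iters):
        # step 2: classify each sample to the nearest cluster mean (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute cluster means from the classification decisions
        new_means = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else means[k]
                              for k in range(c)])
        # step 4: stop if the means did not change, else repeat
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels

# Toy 2-D data with two clear clusters (illustrative).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.3, (50, 2)), rng.normal([3, 3], 0.3, (50, 2))])
means, labels = basic_isodata(X, c=2)
print(means)
```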
154
Iterative clustering algorithms Top-down algorithms: Start from a single class (all data) Split a class (e.g., based on its std) Continue splitting the “largest” class until the desired number of clusters is reached Bottom-up algorithms: Each training sample starts as a different class Start merging classes (e.g., using a NNR criterion) until the desired number of classes is reached