
1 Review Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

2 PatReco: Introduction Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

3 PatReco: Applications  Speech/audio/music/sounds: Speech recognition, Speaker verification/id  Image/video: OCR, AVASR, Face id, Fingerprint id, Video segmentation  Text/Language: Machine translation, document classification, language modeling, text understanding  Medical/Biology: Disease diagnosis, DNA sequencing, Gene disease models  Other Data: User modeling (books/music), Linguistic analysis (web), Games

4 Basic Concepts  Why statistical modeling? Variability: differences between two examples of the same class (both in training); Mismatch: differences between two examples of the same class (one in training, one in testing)  Learning modes: Supervised learning: class labels known; Unsupervised learning: class labels unknown; Reinforcement learning: only positive/negative feedback

5 Basic Concepts  Feature selection Separate classes, Low correlation  Model selection Model type, Model order  Prior knowledge E.g., a priori class probability  Missing features/observations  Modeling of time series Correlation in time (model?), segmentation

6 PatReco: Algorithms  Parametric vs Non-Parametric  Supervised vs Unsupervised  Basic Algorithms: Bayesian Non-parametric Discriminant Functions Non-Metric Methods

7 PatReco: Algorithms  Bayesian methods Formulation (describe class characteristics) Bayes classifier Maximum likelihood estimation Bayesian learning Expectation-Maximization Markov models, hidden Markov models Bayesian Nets  Non-parametric Parzen windows Nearest Neighbor

8 PatReco: Algorithms  Discriminant Functions Formulation (describe boundary) Learning: Gradient descent Perceptron MSE=minimum squared error LMS=least mean squares Neural Net generalizations Support vector machines  Non-Metric Methods Classification and Regression Trees String Matching

9 PatReco: Algorithms  Unsupervised Learning: Mixture of Gaussians K-means  Other topics not covered: Multi-layered Neural Nets Stochastic Learning (Simulated Annealing) Genetic Algorithms Fuzzy Algorithms Etc…

10 PatReco: Problem Solving 1. Data Collection 2. Data Analysis 3. Feature Selection 4. Model Selection 5. Model Training 6. Classification 7. Classifier Evaluation

11

12

13 PatReco: Problem Solving 1. Data Collection 2. Data Analysis 3. Feature Selection 4. Model Selection 5. Model Training 6. Classification 7. Classifier Evaluation

14

15

16

17 PatReco: Problem Solving 1. Data Collection 2. Data Analysis 3. Feature Selection 4. Model Selection 5. Model Training 6. Classification 7. Classifier Evaluation

18

19

20

21

22 PatReco: Problem Solving 1. Data Collection 2. Data Analysis 3. Feature Selection 4. Model Selection 5. Model Training 6. Classification 7. Classifier Evaluation

23 Evaluation  Training Data Set 1234 examples of class 1 and class 2  Testing/Evaluation Data Set 134 examples of class 1 and class 2  Misclassification Error Rate Training: 11.61% (150 errors) Testing: 13.43% (18 errors)  Correct for chance (Training 22%, Testing 26%) Why?

24 PatReco: Discriminant Functions for Gaussians Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

25 PatReco: Problem Solving 1. Data Collection 2. Data Analysis 3. Feature Selection 4. Model Selection 5. Model Training 6. Classification 7. Classifier Evaluation

26 Discriminant Functions  Define class boundaries (instead of class characteristics)  Dualism: Parametric class description (Bayes classifier) ↔ Decision boundary (Parametric Discriminant Functions)

27 Normal Density  1D  Multi-D Full covariance Diagonal covariance Diagonal covariance + univariate  Mixture of Gaussians Usually diagonal covariance

28

29 Gaussian Discriminant Functions  Same variance for ALL classes: Hyper-planes  Different variance among classes: Hyper-quadratics (hyper-parabolas, hyper-ellipses etc.)

30

31

32 Hyper-Planes  When the covariance matrix is common across Gaussian classes: The decision boundary is a hyper-plane that is perpendicular to the line connecting the means of the Gaussian distributions; If the a-priori class probabilities are equal, the hyper-plane cuts the line connecting the Gaussian means in the middle  Euclidean classifier

33 Gaussian Discriminant Functions  Same variance for ALL classes: Hyper-planes  Different variance among classes: Hyper-quadratics (hyper-parabolas, hyper-ellipses etc.)

34

35

36

37

38 Hyper-Quadratics  When the Gaussian class variances are different the boundary can be a hyper-plane, multiple hyper-planes, a hyper-sphere, a hyper-parabola, a hyper-ellipsoid etc. The boundary is in general NOT perpendicular to the line connecting the Gaussian means; If the a-priori class probabilities are equal the resulting classifier is a Mahalanobis classifier

39 Conclusions  Parametric statistical models describe class characteristics x by modeling the observation probabilities p(x|class)  Discriminant functions describe class boundaries parametrically  Parametric statistical models have an equivalent parametric discriminant function  For Gaussian p(x|class) distributions the decision boundaries are hyper-planes or hyper-quadratics

40 PatReco: Detection Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

41 Detection  Goal: Detect an Event  Outcomes: Hit (Success), False Alarm, Miss (Failure), False Reject
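A small illustration (not from the slides) of one common way to turn detection outcomes into rates, using the standard 2×2 detection table; the counts and the "correct rejection" cell are made up for the example:

```python
# Illustrative sketch: hit rate and false-alarm rate from raw detection counts.
def detection_rates(hits, misses, false_alarms, correct_rejects):
    hit_rate = hits / (hits + misses)                                   # P(detect | event present)
    false_alarm_rate = false_alarms / (false_alarms + correct_rejects)  # P(detect | event absent)
    return hit_rate, false_alarm_rate

print(detection_rates(hits=90, misses=10, false_alarms=5, correct_rejects=95))  # (0.9, 0.05)
```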

42

43

44 PatReco: Estimation/Training Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

45 Estimation/Training  Goal: Given observed data (re-)estimate the parameters of the model e.g., for a Gaussian model estimate the mean and variance for each class

46 Supervised-Unsupervised  Supervised training: All data has been (manually) labeled, i.e., assigned to classes  Unsupervised training: Data is not assigned a class label

47 Observable data  Fully observed data: all information necessary for training is available (features, class labels etc.)  Partially observed data: some of the features or some of the class labels are missing

48 Supervised Training (fully observable data)  Maximum likelihood estimation (ML)  Maximum a posteriori estimation (MAP)  Bayesian estimation (BE)

49 Training process  Collected data used for training consists of the following examples: D = {x1, x2, …, xN}  Step 1: Label each example with the corresponding class label ω1, ω2, ..., ωK  Step 2: For each class separately, estimate the model parameters using ML, MAP or BE and the corresponding training examples D1, D2, ..., DK

50 Training Process: Step 1  D = {x1, x2, x3, x4, x5, …, xN}  Label manually with ω1, ω2, ..., ωK  D1 = {x11, x12, x13, …, x1N1}, D2 = {x21, x22, x23, …, x2N2}, …, DK = {xK1, xK2, xK3, …, xKNK}

51 Training Process: Step 2  Maximum Likelihood: θ1 = argmax_θ P(D1|θ1)  Maximum-a-posteriori: θ1 = argmax_θ P(D1|θ1) P(θ1)  Bayesian estimation: P(x|ω1) = ∫ P(x|θ1) P(θ1|D1) dθ1

52 ML Estimation Assumptions 1. P(x|ωi) follows a parametric distribution with parameters θ 2. Dj tells us nothing about P(x|ωi) for j ≠ i (functional independence) 3. Observations x1, x2, x3, …, xN are iid (independent, identically distributed) 4a. (ML only!) θ is a quantity whose value is fixed but unknown

53 ML estimation θ = argmax_θ P(θ|D) = argmax_θ P(D|θ) P(θ) = (by assumption 4a) argmax_θ P(D|θ) = argmax_θ P(x1, x2, …, xN|θ) = (by assumption 3) argmax_θ Π_j P(x_j|θ)  =>  ∂[Π_j P(x_j|θ)] / ∂θ = 0  =>  θ = …

54 ML estimate for Gaussian pdf If P(x|ω) = N(μ, σ²) and θ = (μ, σ²) then (1-D): μ = (1/N) Σ_{j=1..N} x_j, σ² = (1/N) Σ_{j=1..N} (x_j − μ)²  Multi-D, θ = (μ, Σ): μ = (1/N) Σ_{j=1..N} x_j, Σ = (1/N) Σ_{j=1..N} (x_j − μ)(x_j − μ)^T
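A minimal NumPy sketch of these ML formulas (not part of the slides; the synthetic data is only there to check the estimates):

```python
import numpy as np

def ml_gaussian(X):
    """ML estimates for a multivariate Gaussian. X: (N, d) samples of one class."""
    N = X.shape[0]
    mu = X.mean(axis=0)                 # mu = (1/N) sum_j x_j
    diff = X - mu
    Sigma = diff.T @ diff / N           # Sigma = (1/N) sum_j (x_j - mu)(x_j - mu)^T
    return mu, Sigma

X = np.random.default_rng(0).normal(loc=[1.0, -2.0], scale=1.5, size=(500, 2))
mu, Sigma = ml_gaussian(X)
print(mu)      # close to [1, -2]
print(Sigma)   # close to 2.25 * I (variance 1.5^2 on the diagonal)
```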

55 Bayesian Estimation Assumptions 1. P(x|ωi) follows a parametric distribution with parameters θ 2. Dj tells us nothing about P(x|ωi) for j ≠ i (functional independence) 3. Observations x1, x2, x3, …, xN are iid (independent, identically distributed) 4b. (MAP, BE) θ is a random variable whose prior distribution p(θ) is known

56 Bayesian Estimation P(x|D) = ∫ P(x,θ|D) dθ = ∫ P(x|θ,D) P(θ|D) dθ = ∫ P(x|θ) P(θ|D) dθ  STEP 1: P(θ) → P(θ|D) = P(D|θ) P(θ) / P(D)  STEP 2: P(x|θ) → P(x|D)

57 Bayesian Estimate for Gaussian pdf and priors If P(x|θ) = N(μ, σ²) and p(θ) = N(μ0, σ0²) then STEP 1: P(θ|D) = N(μn, σn²); STEP 2: P(x|D) = N(μn, σ² + σn²), where μn = [σ0² / (n σ0² + σ²)] Σ_j x_j + [σ² / (n σ0² + σ²)] μ0 and σn² = σ² σ0² / (n σ0² + σ²)  For large n (number of training samples) maximum likelihood and Bayesian estimation are equivalent!!!
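A sketch of the two steps above for a 1-D Gaussian with known variance σ² and a Gaussian prior N(μ0, σ0²) on the mean; the data and prior values are made up for illustration:

```python
import numpy as np

def bayes_gaussian_mean(x, sigma2, mu0, sigma0_2):
    n = len(x)
    denom = n * sigma0_2 + sigma2
    mu_n = (sigma0_2 / denom) * np.sum(x) + (sigma2 / denom) * mu0  # posterior mean
    sigma_n2 = sigma2 * sigma0_2 / denom                            # posterior variance
    return mu_n, sigma_n2, sigma2 + sigma_n2   # predictive P(x|D) = N(mu_n, sigma2 + sigma_n2)

x = np.random.default_rng(1).normal(3.0, 1.0, size=20)
print(bayes_gaussian_mean(x, sigma2=1.0, mu0=0.0, sigma0_2=10.0))
# With 20 samples the posterior mean already sits close to the sample mean;
# as n grows, the Bayesian and ML estimates coincide, as the slide notes.
```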

58 Conclusions  Maximum likelihood estimation is simple and gives good estimates when the number of training samples is large  Bayesian adaptation gives good estimates even for small amounts of training data provided that a good prior is selected  Bayesian adaptation is hard and often does not have a closed form solution (in which case try: iterative recursive Bayesian estimation)

59 PatReco: Model and Feature Selection Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

60 Breakdown of Classification Error  Bayes error  Model selection error  Model estimation error  Data mismatch error (training-testing)

61

62 True statements about Bayes error (valid within statistical significance)  The Bayes error is ALWAYS smaller than (or equal to) the total (empirical) classification error  If the model, estimation and mismatch errors are zero then the total classification error equals the Bayes error  The ONLY way to reduce the Bayes error is to add new features in the classifier design

63 More true statements  Adding new features can only reduce the Bayes error (this is not true about the total classification error!!!)  Adding new features will NOT reduce the Bayes error if the new features are Very bad at discriminating between classes (feature pdfs overlapping) Highly correlated with existing features

64 Gaussian classification Bayes Error For two classes ω1 and ω2 following Gaussian distributions with means μ1, μ2 and the same variance σ², the Bayes error is: P(error) = [1/√(2π)] ∫_{r/2}^{∞} exp{−u²/2} du, where r = |μ1 − μ2| / σ
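The integral is the tail of a standard normal, so it can be evaluated with the complementary error function; a small sketch (equal priors and equal variance assumed, as on the slide):

```python
import math

def bayes_error(mu1, mu2, sigma):
    r = abs(mu1 - mu2) / sigma
    # (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du  =  0.5 * erfc(r / (2*sqrt(2)))
    return 0.5 * math.erfc(r / (2.0 * math.sqrt(2.0)))

print(bayes_error(0.0, 2.0, 1.0))   # r = 2  ->  about 0.159
```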

65 Feature Selection  If we had infinite amounts of data then The more features the better!  However in practice finite data: More features  more parameters to train!!!  Good features: Uncorrelated Able to discriminate among classes

66 Model selection  Model order: the number of parameters that need to be estimated  Overfitting: too many parameters, too little data!!!  Model selection for Gaussian models: Single Gaussians Mixture of Gaussians Fixed Variance Tied Variance Diagonal Variance

67 Conclusion  Introducing more features and/or more complex models can only reduce the classification error (if infinite amounts of training data are available)  In practice: the number of features and the number of model parameters are a function of the amount of training data available (avoid overfitting!)  Good features are uncorrelated and discriminative

68 PatReco: Expectation Maximization Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

69 When do we use EM?  Partially observable data Missing some features from some samples, e.g., D={(1,2),(2,3),(?,4)} Missing class labels, e.g., hidden states of HMMs Missing class sub-labels, e.g., mixture label for mixture of Gaussian models

70 The EM algorithm  The Expectation Maximization algorithm (EM) consists of alternating expectation and maximization steps  During expectation steps the “best estimates of the missing information” are computed  During maximization step maximum likelihood training on all data is performed

71 EM
Initialization: θ(0)
for i = 1..iterno   // usually iterno = 2 or 3
  E step: Q(i) = E_{D_bad} { log p(D; θ) | x, θ(i−1) }
  M step: θ(i) = argmax_θ { Q(i) }
end

72 Pseudo-EM
Initialization: θ(0)
for i = 1..iterno   // usually iterno = 2 or 3
  Expectation step: D_bad = E{ D_bad | θ(i−1) }
  Maximization step: θ(i) = argmax_θ { p(D | θ) }
end

73 Convergence  EM is guaranteed to converge to a local optimum (NOT the global optimum!)  Pseudo-EM has no convergence guarantees but is used often in practice

74 Conclusions  EM is an iterative algorithm used when there are missing or partially observable training data  EM is a generalization of ML training  EM is guaranteed to converge to a local optimum (NOT the global optimum!)

75 PatReco: Bayesian Networks Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

76 Definitions  Bayesian networks consist of nodes and (usually directional) arcs  Nodes or states represent a classification class or in general events and are described with a pdf  Arcs represent relations between nodes, e.g., cause and effect, time sequence  Two nodes that are connected via another node are conditionally independent (given that node)

77 When to use Bayesian nets  Bayesian networks (or networks of inference) are statistical models that are used for classification (or in general pattern recognition) problems where there are dependencies among classes, e.g., time dependencies, cause and effect dependencies

78 Conditional Independence  Full independence of A and B: P(A|B) = P(A), or P(A,B) = P(A) P(B)  Conditional independence of A and B given C: P(A|B,C) = P(A|C), or P(A,B|C) = P(A|C) P(B|C)

79 Conditional Independence  A, C independent given B: P(C|B,A) = P(C|B)  B, C independent given A: P(B,C|A) = P(B|A) P(C|A)  A, C dependent given B: P(A,C|B) cannot be reduced!  (diagrams of the corresponding three-node network topologies)

80 Three problems 1. Probability computation (use independence) 2. Training/Parameter Estimation: Maximum likelihood (ML) if all is observable; Expectation maximization (EM) if data is missing 3. Inference (Testing): Diagnosis P(cause|effect), bottom-up; Prediction P(effect|cause), top-down

81 Probability Computation For a Bayesian Network that consists of N nodes: 1. Compute P(n1, n2, .., nN) using the chain rule, starting from the “last/bottom” node and working your way up: P(n1, n2, .., nN) = P(nN | n1, n2, .., nN−1) P(nN−1 | n1, n2, .., nN−2) … P(n2 | n1) P(n1) 2. Identify conditional independence conditions from the Bayesian network topology 3. Simplify the conditional probabilities using the independence conditions

82 Probability Computation Topology: P(C,S,R,W) = P(W|C,S,R) P(S|C,R) P(R|C) P(C)  Independent: (W,C) | S,R and (S,R) | C  Dependent: (S,R) | W  Hence: P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)  (network diagram: C with arcs to S and R, which both have arcs to W)
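A sketch of the factored joint computed by multiplying the conditional tables; all probability values below are invented for illustration and are not taken from the slides:

```python
# Factorization from the slide: P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C)
P_C = {0: 0.5, 1: 0.5}
P_S_given_C = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}   # P_S_given_C[c][s]
P_R_given_C = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}   # P_R_given_C[c][r]
P_W_given_SR = {(0, 0): {0: 1.0, 1: 0.0}, (0, 1): {0: 0.1, 1: 0.9},
                (1, 0): {0: 0.1, 1: 0.9}, (1, 1): {0: 0.01, 1: 0.99}}

def joint(c, s, r, w):
    return P_W_given_SR[(s, r)][w] * P_S_given_C[c][s] * P_R_given_C[c][r] * P_C[c]

total = sum(joint(c, s, r, w) for c in (0, 1) for s in (0, 1)
            for r in (0, 1) for w in (0, 1))
print(joint(1, 0, 1, 1), total)   # 0.9 * 0.9 * 0.8 * 0.5 = 0.324, and the joint sums to 1.0
```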

83 Probability Computation  There are general algorithms for identifying cliques in the Bayesian net  Cliques are islands of conditional dependence, i.e., terms in the probability computation that cannot be further reduced  (cliques for the example: {S,C}, {R,C}, {W,S,R})

84 Training/Parameter Estimation  Instead of estimating the joint pdf of the whole network the joint pdf of each of the cliques is estimated  For example if the network joint pdf is P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C) instead of computing P(C,S,R,W) we compute each of P(W|S,R), P(S|C), P(R|C), P(C) for all possible values of W, S, R, C (much simpler)

85 Training/Parameter Estimation  For fully observable data and discrete probabilities compute maximum likelihood estimates of the parameters, e.g., for discrete probabilities: P(W=1|S=1,R=0)_ML = counts(W=1, S=1, R=0) / counts(W=*, S=1, R=0)

86 Training/Parameter Estimation  Example: the following observations are given for (W,C,S,R): (1,0,1,0), (0,0,1,0), (1,1,1,0), (0,1,1,0), (1,0,1,0), (0,1,0,0), (1,0,0,1), (0,1,1,1), (1,1,1,0)  Using Maximum Likelihood Estimation: P(W=1|S=1,R=0)_ML = #(1,*,1,0) / #(*,*,1,0) = 4/6 ≈ 0.67
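The same counting estimate can be reproduced directly from the listed observations; a short sketch:

```python
data = [(1, 0, 1, 0), (0, 0, 1, 0), (1, 1, 1, 0), (0, 1, 1, 0), (1, 0, 1, 0),
        (0, 1, 0, 0), (1, 0, 0, 1), (0, 1, 1, 1), (1, 1, 1, 0)]   # (W, C, S, R)

num = sum(1 for (w, c, s, r) in data if w == 1 and s == 1 and r == 0)   # #(1,*,1,0)
den = sum(1 for (w, c, s, r) in data if s == 1 and r == 0)              # #(*,*,1,0)
print(num, den, num / den)   # 4 6 0.666...
```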

87 Training/Parameter Estimation  When data is non observable or missing the EM algorithm is employed  There are efficient implementations of the EM algorithm for Bayesian nets that operate on the clique network  When the topology of the Bayesian network is not known structural EM can be used

88 Inference  There are two types of inference (testing): Diagnosis P(cause|effect), bottom-up; Prediction P(effect|cause), top-down  Once the parameters of the network are estimated the joint network pdf can be estimated for ALL possible network values  Inference is simply probability computation using the network pdf

89 Inference  For example P(W=1|C=1) = P(W=1,C=1) / P(C=1), where P(W=1,C=1) = Σ_{R,S} P(W=1, C=1, R=*, S=*) and P(C=1) = Σ_{R,W,S} P(W=*, C=1, R=*, S=*)

90 Inference  Efficient algorithms exist for performing inference in large networks which operate on the clique network  Inference is often posed as a probability maximization problem, e.g., what is the most probable cause or effect? argmax_W P(W|C=1)

91 Continuous Case  In our examples the network nodes represented discrete events (states or classes)  Network nodes often hold continuous variables (observations), e.g., length, energy  For the continuous case parametric pdfs are introduced and their parameters are estimated using ML (observed data) or EM (hidden data)

92 Some Applications  Medical diagnosis  Computer problem diagnosis (MS)  Markov chains  Hidden Markov Models (HMMs)

93 Conclusions  Bayesian networks are used to represent dependencies between classes  Network topology defines conditional independence conditions that simplify the network pdf modeling and computation  Three problems: probability computation, estimation/training, inference/testing

94 PatReco: Hidden Markov Models Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

95 Markov Models: Definition  Markov chains are Bayesian networks that model sequences of events (states)  Sequential events are dependent  Two non-sequential events are conditionally independent given the intermediate events (MM-1)

96 Markov chains  (diagrams of the state chains q0, q1, q2, q3, q4, … for MM-0, MM-1, MM-2 and MM-3, with an increasing number of dependency arcs between states)

97 Markov Chains
MM-0: P(q1, q2, .., qN) = Π_{n=1..N} P(qn)
MM-1: P(q1, q2, .., qN) = Π_{n=1..N} P(qn | qn−1)
MM-2: P(q1, q2, .., qN) = Π_{n=1..N} P(qn | qn−1, qn−2)
MM-3: P(q1, q2, .., qN) = Π_{n=1..N} P(qn | qn−1, qn−2, qn−3)
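A small MM-1 sketch: the probability of a state sequence is the prior of the first state times the product of transition probabilities (P(q1) is used for the first factor). The state names and probability values are made up:

```python
prior = {"A": 0.6, "B": 0.3, "C": 0.1}                 # P(q1)
trans = {"A": {"A": 0.7, "B": 0.2, "C": 0.1},          # P(q_n | q_{n-1})
         "B": {"A": 0.3, "B": 0.4, "C": 0.3},
         "C": {"A": 0.1, "B": 0.4, "C": 0.5}}

def mm1_prob(seq):
    p = prior[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p

print(mm1_prob(["A", "A", "B", "C"]))   # 0.6 * 0.7 * 0.2 * 0.3 = 0.0252
```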

98 Hidden Markov Models  Hidden Markov models describe sequences of events and the corresponding sequences of observations  Events form a Markov chain (MM-1)  Observations are conditionally independent given the sequence of events  Each observation is directly connected with a single event (and is conditionally independent of the rest of the events in the network)

99 Hidden Markov Models  (HMM-1 diagram: state chain q0, q1, q2, q3, q4, … with one observation on attached to each state qn)  P(o0, o1, .., oN, q0, q1, .., qN) = Π_{n=0..N} P(qn | qn−1) P(on | qn)

100 Parameter Estimation  The parameters that have to be estimated are the a-priori probabilities P(q 0 ) transition probabilities P(q n |q n-1 ) observation probabilities P(o n |q n )  For example if there are 3 types of events and continuous 1-D observations that follow a Gaussian distribution there are 18 parameters to estimate: 3 a-priori probabilities 3x3 transition probabilities matrix 3 means and 3 variances (observation probabilities)

101 Parameter Estimation  If both the sequence of events and the sequence of observations are fully observable then ML is used  Usually the sequence of events q0, q1, .., qN is non-observable, in which case EM is used  The EM algorithm for HMMs is the Baum-Welch or forward-backward algorithm

102 Inference/Decoding  The main inference problem for HMMs is known as the decoding problem: given a sequence of observations, find the best sequence of states: q = argmax_q P(q|O) = argmax_q P(q,O)  An efficient decoding algorithm is the Viterbi algorithm

103 Viterbi algorithm max_q P(q,O) = max_q P(o0, o1, .., oN, q0, q1, .., qN) = max_q Π_{n=0..N} P(qn | qn−1) P(on | qn) = max_{qN} { P(oN|qN) max_{qN−1} { P(qN|qN−1) P(oN−1|qN−1) … max_{q2} { P(q3|q2) P(o2|q2) max_{q1} { P(q2|q1) P(o1|q1) max_{q0} { P(q1|q0) P(o0|q0) P(q0) }}} … }}

104 Viterbi algorithm  (trellis diagram: states 1, 2, 3, 4, …, K plotted against time)  At each node keep only the best (most probable) path among all the paths passing through that node
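A minimal Viterbi sketch for a discrete-observation HMM-1, following the nested-max recursion of slide 103; the two-state weather/umbrella model below is a made-up illustration:

```python
def viterbi(obs, states, prior, trans, emit):
    """Most probable state sequence for the observation sequence obs."""
    delta = {s: prior[s] * emit[s][obs[0]] for s in states}   # best-path score ending in state s
    backpointers = []
    for o in obs[1:]:
        new_delta, back = {}, {}
        for s in states:
            best_prev = max(states, key=lambda sp: delta[sp] * trans[sp][s])
            new_delta[s] = delta[best_prev] * trans[best_prev][s] * emit[s][o]
            back[s] = best_prev
        delta = new_delta
        backpointers.append(back)
    last = max(states, key=lambda s: delta[s])                # best final state
    path = [last]
    for back in reversed(backpointers):                       # backtrack along the kept pointers
        path.append(back[path[-1]])
    return list(reversed(path))

states = ("Rain", "Dry")
prior = {"Rain": 0.5, "Dry": 0.5}
trans = {"Rain": {"Rain": 0.7, "Dry": 0.3}, "Dry": {"Rain": 0.3, "Dry": 0.7}}
emit = {"Rain": {"umbrella": 0.9, "none": 0.1}, "Dry": {"umbrella": 0.2, "none": 0.8}}
print(viterbi(["umbrella", "umbrella", "none"], states, prior, trans, emit))
```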

105 Deep Thoughts  HMM-0 (HMM with MM-0 event chain) is the Bayes classifier!!!  MMs and HMMs are poor models but simple and efficient computationally How do you fix this? (dependent observations?)

106 Some Applications  Speech Recognition  Optical Character Recognition  Part-of-Speech Tagging  …

107 Conclusions  HMMs and MMs are useful modeling tools for dependent sequences of events (states or classes)  Efficient algorithms exist for training HMM parameters (Baum-Welch) and for decoding the most probable sequence of states given an observation sequence (Viterbi)  HMMs have many applications

108 Non Parametric Classifiers Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

109 Histograms-Parzen Windows  Main idea: Instead of selecting a parametric distribution (e.g., Gaussian) to describe the properties of the features of a class, compute directly the empirical distribution, i.e., the class feature histogram

110 Feature Histogram Example  (histogram of feature X: number of samples in each bin)  Normalize the histogram curve to get the feature PDF

111 Parzen Windows: Issues  When compared to parametric methods empirical distributions are: Better because no specific form of the PDF is assumed; Worse because over-fitting can easily occur (too small a histogram bin)  Parzen proposed rules for adapting the bin size based on the number of samples in each bin to avoid over-fitting

112 Nearest Neighbor Rule  Main idea (1-NNR): No explicit model (i.e., no training) For each test sample x the “nearest” sample x’ in the training set is found, i.e., argmin x’ d(x, x’) and x is classified to the class where x’ belongs

113 Generalizations  k-NNR: Instead of finding the single nearest neighbor we find the k nearest neighbors in the training set; the sample x is classified to the class where most of the k neighbors belong  k-l-NNR: Like k-NNR but at least l of the k nearest neighbors must belong to the same class for a classification decision to be taken (else no decision)
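A short k-NNR sketch for 1-D features, using the training set from slide 114 below (standard library only; the test point 0.4 is arbitrary):

```python
from collections import Counter

def knn_classify(x, train, k=1):
    """train: list of (feature, label) pairs; returns the majority label of the k nearest."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# D1 = {0, -1, -2} (class 1) and D2 = {1, 1, 1} (class 2), as on slide 114
train = [(0, 1), (-1, 1), (-2, 1), (1, 2), (1, 2), (1, 2)]
print(knn_classify(0.4, train, k=1))   # 1: the nearest sample is 0
print(knn_classify(0.4, train, k=3))   # 2: two of the three nearest samples are the 1's
```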

114 Example  Training set D1 = {0, −1, −2} and D2 = {1, 1, 1}  (number line from −2 to 3 marking the 1-NNR decision boundary, the 3-NNR decision boundary and the 3-3-NNR no-decision region)

115 Computational Efficiency  To speed up NNR classification the training set size can be reduced using the condensing algorithm: The training set is classified using the NNR rule; misclassified samples are added to the new (condensed) training set one by one until all training samples are correctly classified

116 Conclusions  Non parametric classification algorithms are easy to implement are computationally efficient (in training) don’t make any assumptions are prone to over-fitting are hard to adapt (no detailed model)

117 Discriminant Functions Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

118 Discriminant Functions  Main Idea: Describe parametrically the decision boundary (instead of the properties of the class), e.g., the two classes are separated by a straight line a x1 + b x2 + c = 0 with parameters (a,b,c) (instead of, e.g., modeling the feature PDFs as 2-D Gaussians)

119 Example: Two classes, two features  Model the class boundary: the line a x1 + b x2 + c = 0 separating ω1 from ω2 in the (x1, x2) plane  Model the class characteristics: one 2-D Gaussian per class, N(μ1, Σ1) and N(μ2, Σ2)

120 Duality  Dualism: Parametric class description (Bayes classifier) ↔ Decision boundary (Parametric Discriminant Functions)  For example modeling class features by Gaussians with the same (across-class) variance results in hyper-plane discriminant functions

121

122 Discriminant Functions  Discriminant functions gi(x) are functions of the features x of a class i  A sample x is classified to the class c for which gi(x) is maximized, i.e., c = argmax_i {gi(x)}  The equation gi(x) = gj(x) defines the class boundary for each pair of (different) classes i and j

123 Linear Discriminant Functions  Two class problem: A single discriminant function is defined as g(x) = g1(x) − g2(x)  If g(x) is a linear function g(x) = w^T x + w0, then the boundary is a hyper-plane (point, line, plane for 1-D, 2-D, 3-D features respectively)

124 Linear Discriminant Functions  (plot of the line a x1 + b x2 + c = 0 in the (x1, x2) plane: the weight vector w = (a,b) is normal to the boundary, which intercepts the axes at −c/a and −c/b)

125 Non Linear Discriminant Functions  Quadratic discriminant functions: g(x) = w0 + Σ_i wi xi + Σ_{ij} wij xi xj; for example, for a two-class 2-D problem, g(x) = a + b x1 + c x2 + d x1²  Any non-linear discriminant function can become linear by increasing the dimensionality, e.g., y1 = x1, y2 = x2, y3 = x1² (2-D non-linear → 3-D linear): g(y) = a + b y1 + c y2 + d y3
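A tiny sketch of the linearization trick: the same quadratic function evaluated directly in (x1, x2) and as a linear function of the augmented features (y1, y2, y3) = (x1, x2, x1²). The coefficient values are made up:

```python
a, b, c, d = -1.0, 2.0, 1.0, -3.0            # g(x) = a + b*x1 + c*x2 + d*x1^2

def g_quadratic(x1, x2):
    return a + b * x1 + c * x2 + d * x1 ** 2

def g_linear(y):                             # the same function, linear in y = (y1, y2, y3)
    return a + b * y[0] + c * y[1] + d * y[2]

x1, x2 = 0.7, -0.2
print(g_quadratic(x1, x2), g_linear((x1, x2, x1 ** 2)))   # identical values
```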

126 Parameter Estimation  The parameters w are estimated by functional minimization  The function J to be minimized models the average distance of training samples from the decision boundary, computed over either: Misclassified training samples, or All training samples  The function J is minimized using gradient descent

127 Gradient Descent  Iterative procedure towards a local minimum: a(k+1) = a(k) − n(k) ∇J(a(k)), where k is the iteration number, n(k) is the learning rate and ∇J(a(k)) is the gradient of the function to be minimized evaluated at a(k)  Newton descent is gradient descent with the learning rate equal to the inverse Hessian matrix

128 Distance Functions  Perceptron Criterion Function: Jp(a) = Σ_{misclassified} (−a^T y)  Relaxation With Margin b: Jr(a) = Σ_{misclassified} (a^T y − b)² / ||y||²  Least Mean Squares (LMS): Js(a) = Σ_{all samples} (a^T y_i − b_i)²  Ho-Kashyap rule: Js(a,b) = Σ_{all samples} (a^T y_i − b_i)²
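A sketch of gradient descent on the perceptron criterion: since ∇Jp(a) = Σ_{misclassified} (−y), the update adds the sum of the misclassified (augmented, sign-normalized) samples. The toy data and learning rate are made up, and a linearly separable training set is assumed:

```python
import numpy as np

def perceptron_train(X, labels, lr=0.1, max_iter=100):
    """X: (N, d) features; labels in {+1, -1}. Returns the augmented weight vector a."""
    Y = np.hstack([np.ones((X.shape[0], 1)), X])   # augment with a bias component
    Y = Y * labels[:, None]                        # sign-normalize: want a^T y > 0 for all y
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]
        if len(misclassified) == 0:                # converged: no misclassified samples left
            break
        a = a + lr * misclassified.sum(axis=0)     # a(k+1) = a(k) - n(k) * grad Jp(a(k))
    return a

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
labels = np.array([1, 1, -1, -1])
a = perceptron_train(X, labels)
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ a))   # matches the labels: [ 1.  1. -1. -1.]
```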

129 Discriminant Functions  Working on misclassified samples only (Perceptron, Relaxation with Margin) provides better results but converges only for separable training sets

130 High Dimensionality  Using non-linear discriminant functions and linearizing them in a high-dimensional space can make ANY training set separable, at the cost of a large number of parameters (curse of dimensionality)  Support vector machines: A smart way to select the appropriate terms (dimensions) is needed

131 Non-Metric Methods: Decision Trees Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

132 Decision Trees  Motivation: There are features (discrete) that don’t have an obvious notion of similarity or ordering (nominal data), e.g., book type, shape, sound type  Taxonomies (i.e., trees with is-a relationship) are the oldest form of classification

133 Decision Trees: Definition  Decision Trees are classifiers that classify samples based on a set of questions that are asked hierarchically (tree of questions)  Example questions is color red? is x < 0.5?  Terminology: root, leaf, node, arc, branch, parent, children, branching factor, depth

134 Fruit classifier  (decision tree: root question Color? with branches green / yellow / red, followed by the questions Size?, Shape? and Taste? with answers big / med / small, round / thin and sweet / sour)

135 Fruit classification  (the same tree, with the path of answers that classifies the test sample as CHERRY highlighted)

139 Fruit classifier  (the full tree with the class labels at the leaves: watermelon, grape, grapefruit, cherry, grape)

140 Binary Trees  Binary trees: each parent node has exactly two children nodes (branching factor = 2)  Any tree can be represented as a binary tree by changing the set of questions and by increasing the tree depth, e.g., the three-way question Color? (green / yellow / red) becomes Color = green? (Y/N) followed by Color = yellow? (Y/N)

141 Decision Trees: Problems 1. List of questions (features): All possible questions are considered 2. Which questions to split first (best split): The questions that split the data best (reduce impurity at each node) are asked first 3. Stopping criteria (pruning criteria): Stop when further splits don’t reduce impurity

142 Best Split example  Two-class problem with 100 examples from w1 and 100 examples from w2  Three binary questions Q1, Q2 and Q3 that split the data as follows: 1. Node 1: (50,50), Node 2: (50,50) 2. Node 1: (100,0), Node 2: (0,100) 3. Node 1: (80,0), Node 2: (20,100)

143 Impurity Measures  Impurity measures the degree of homogeneity of a node; a node is pure if it consists of training examples from a single class  Impurity Measures: Entropy Impurity: i(N) = −Σ_i P(wi) log2 P(wi)  Variance (two-class): i(N) = P(w1) P(w2)  Gini Impurity: i(N) = 1 − Σ_i P²(wi)  Misclassification: i(N) = 1 − max_i P(wi)

144 Total Impurity  Total Impurity at Depth 0: i(depth=0) = i(N)  Total Impurity at Depth 1: i(depth=1) = P(NL) i(NL) + P(NR) i(NR)  (diagram: node N at depth 0 splits via a yes/no question into children NL and NR at depth 1)

145 Impurity Example  Node 1: (80,0), Node 2: (20,100)  i(node 1) = 0  i(node 2) = −(20/120) log2(20/120) − (100/120) log2(100/120) = 0.65  P(node 1) = 80/200 = 0.4, P(node 2) = 120/200 = 0.6  i(total) = P(node 1) i(node 1) + P(node 2) i(node 2) = 0 + 0.6 × 0.65 = 0.39
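A short check of these numbers with the entropy impurity (base-2 logarithm):

```python
import math

def entropy_impurity(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

n1, n2 = (80, 0), (20, 100)
i1, i2 = entropy_impurity(n1), entropy_impurity(n2)
p1 = sum(n1) / (sum(n1) + sum(n2))            # 80/200 = 0.4
p2 = sum(n2) / (sum(n1) + sum(n2))            # 120/200 = 0.6
print(round(i2, 2), round(p1 * i1 + p2 * i2, 2))   # 0.65 0.39
```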

146 Continuous Example  For continuous features: questions are of the type x < a, where x is the feature and a is a constant  Decision Boundaries (two-class, 2-D example): (the (x1, x2) plane is partitioned into rectangular R1 / R2 decision regions by axis-parallel splits)

147 Summary  Decision trees are useful categorical classification tools especially for nominal (non-metric) data  CART creates trees that minimize impurity on the training set at each node  Decision region shape  CART is a useful tool for feature selection

148 Unsupervised Training and Clustering Alexandros Potamianos Dept of ECE, Tech. Univ. of Crete Fall 2004-2005

149 Unsupervised Training  Definition: The training set samples are unlabelled (unclassified)  Motivation: Labeling is hard/time consuming Fully automatic adaptation of models (in the field)

150 Maximum Likelihood Training  Given: N training examples drawn from c classes, i.e., D = {x1, x2, …, xN} (no class assignments are given!)  Estimate: Class priors p(wi); Feature PDF parameters θ: p(x|θi, wi)  Sometimes the number of classes c is not given and has to be estimated as well

151 Unsupervised ML estimation  ML equations: Σ_k P(wi | xk, θ) ∇_{θi} log p(xk | wi, θi) = 0  Compared with supervised ML there is an additional term P(wi | xk, θ)  P(wi | xk, θ) is the class membership function for each sample xk  Unsupervised ML is a version of EM  Pseudo-EM: P(wi | xk, θ) is binary, 0 or 1

152 Mixture of Gaussians Estimates  Linear combination of Gaussians with weights ai: p(xk) = Σ_i ai N(xk; μi, Σi)  ML estimates: ai = (1/N) Σ_k P(wi | xk)  μi = Σ_k P(wi | xk) xk / Σ_k P(wi | xk)  Σi = Σ_k P(wi | xk) (xk − μi)(xk − μi)^T / Σ_k P(wi | xk)
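A NumPy sketch iterating the E step (class memberships P(wi|xk)) and these M-step re-estimates for a 1-D, two-component mixture; the data and initial parameter values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])   # unlabeled samples

a = np.array([0.5, 0.5])        # mixture weights a_i
mu = np.array([-1.0, 1.0])      # component means
var = np.array([1.0, 1.0])      # component variances

for _ in range(50):
    # E step: membership P(w_i | x_k) for every sample and component
    lik = a * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    post = lik / lik.sum(axis=1, keepdims=True)
    # M step: the ML re-estimates from the slide
    Nk = post.sum(axis=0)
    a = Nk / len(x)
    mu = (post * x[:, None]).sum(axis=0) / Nk
    var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(a.round(2), mu.round(2), var.round(2))   # weights near (0.6, 0.4), means near (-2, 3)
```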

153 Clustering  Basic Isodata: 1. Select an initial partition of the data into c classes and compute the cluster means 2. Classify the training samples using a classification criterion (Euclidean distance) 3. Recompute the cluster means based on the training set classification decisions 4. If there is no change in the sample means stop, else go to step 2
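A NumPy sketch of the four Basic Isodata steps (the 2-D data and the choice c = 2 are made up for the example):

```python
import numpy as np

def basic_isodata(X, c, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=c, replace=False)]          # step 1: initial cluster means
    for _ in range(n_iter):
        # step 2: classify each sample to the nearest mean (Euclidean distance)
        labels = np.argmin(((X[:, None, :] - means) ** 2).sum(axis=2), axis=1)
        # step 3: recompute the cluster means
        new_means = np.array([X[labels == i].mean(axis=0) for i in range(c)])
        if np.allclose(new_means, means):                         # step 4: stop if means unchanged
            break
        means = new_means
    return means, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
means, _ = basic_isodata(X, c=2)
print(means.round(1))   # one mean near (0, 0), the other near (5, 5)
```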

154 Iterative clustering algorithms  Top-down algorithms: Start from a single class (all data); Split a class (e.g., by perturbing its mean by ± one standard deviation); Continue splitting the “largest” class until the desired number of clusters is reached  Bottom-up algorithms: Start with each training sample as a different class; Start merging classes (e.g., using a NNR criterion) until the desired number of classes is reached

