Ch 1. Introduction (Latter Half)
Pattern Recognition and Machine Learning, C. M. Bishop
Summarized by J.W. Ha
Biointelligence Laboratory, Seoul National University
(C) 2006, SNU Biointelligence Lab
1.4 The Curse of Dimensionality
1.5 Decision Theory
1.6 Information Theory
The Curse of Dimensionality
The High Dimensionality Problem
Ex. Mixture of Oil, Water, and Gas
- 3 Classes (Homogeneous, Annular, Laminar)
- 12 Input Variables
- Scatter Plot of x6 and x7
- Predict the Class of a New Point x
- Simple and Naïve Approach: divide the input space into cells and assign x to the majority class of its cell
The Curse of Dimensionality (Cont’d)
The Shortcomings of the Naïve Approach
- The number of cells increases exponentially with the dimensionality D.
- A correspondingly large training data set is needed so that the cells are not empty.
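The exponential growth in the number of cells can be made concrete with a minimal sketch. The choice of 10 bins per axis and the function name `num_cells` are assumptions made for illustration; they do not come from the slides.

```python
def num_cells(bins_per_axis: int, dims: int) -> int:
    """Number of grid cells when each of `dims` axes is split into `bins_per_axis` bins."""
    return bins_per_axis ** dims


# Hypothetical setting: 10 bins per axis (an assumption, not a value from the slides).
for d in (1, 2, 3, 12):
    print(f"D = {d:2d}: {num_cells(10, d)} cells")
```

With 12 input variables, as in the oil flow example, even this coarse grid has far more cells than any realistic training set has points.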
The Curse of Dimensionality (Cont’d)
Polynomial Curve Fitting (Order M)
- As D increases, the number of coefficients grows in proportion to D^M (a power law rather than exponential growth, but still unwieldy).
The Volume of a High-Dimensional Sphere
- Concentrated in a thin shell near the surface
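The thin-shell claim follows from the fact that the volume of a D-dimensional sphere of radius r scales as r^D, so the fraction of a unit sphere's volume lying between radius 1 - eps and 1 is 1 - (1 - eps)^D. A minimal sketch, where the function name `shell_fraction` and the values of eps and D are illustrative assumptions:

```python
def shell_fraction(eps: float, dims: int) -> float:
    """Fraction of a unit D-sphere's volume lying between radius 1 - eps and 1."""
    return 1.0 - (1.0 - eps) ** dims


# Illustrative values only: a shell of thickness 0.1 in increasing dimension.
for d in (1, 2, 5, 20, 100):
    print(f"D = {d:3d}: {shell_fraction(0.1, d):.4f}")
```

As D grows the fraction approaches 1, even for a thin shell.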
The Curse of Dimensionality (Cont’d)
Gaussian Distribution
- In high dimensions, the probability mass of a Gaussian is concentrated in a thin shell at a particular radius
Decision Theory
Make Optimal Decisions
- Inference Step & Decision Step
- Select the Class with the Higher Posterior Probability
Minimizing the Misclassification Rate
- Objective: minimize the probability of a mistake, i.e. the colored area in the figure
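For two classes, the quantity being minimized can be written out as follows. This is the standard form from Bishop's Section 1.5.1, not something recovered from the slide itself; the decision regions are denoted R_1 and R_2.

```latex
p(\text{mistake})
  = p(\mathbf{x} \in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x} \in \mathcal{R}_2, \mathcal{C}_1)
  = \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\, d\mathbf{x}
  + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\, d\mathbf{x}
```

Because p(x, C_k) = p(C_k|x) p(x) and p(x) is common to both terms, the integrand is smallest when each x is assigned to the class with the larger posterior p(C_k|x).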
Decision Theory (Cont’d)
Minimizing the Expected Loss
- The damage caused by a misclassification differs from class to class.
- Introduction of a Loss Function (Cost Function)
- Objective: Minimize the Expected Loss
The Reject Option
- Threshold θ
- Reject if the largest posterior probability is below θ
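The two ideas on this slide can be combined in one small decision rule. A minimal sketch, assuming a loss matrix indexed as `loss[k][j]` (true class k, decision j) and a hypothetical function name `decide`. The example matrix follows Bishop's cancer/normal illustration, where missing a cancer costs 1000 and a false alarm costs 1.

```python
def decide(posterior, loss, theta=0.0):
    """Return the decision index with minimum expected loss, or None to reject.

    posterior[k] = p(C_k | x); loss[k][j] = loss when the truth is k and we decide j.
    The input is rejected when the largest posterior is below the threshold theta.
    """
    if max(posterior) < theta:
        return None
    n_decisions = len(loss[0])
    # Expected loss of decision j: sum_k L_kj * p(C_k | x)
    expected = [
        sum(posterior[k] * loss[k][j] for k in range(len(posterior)))
        for j in range(n_decisions)
    ]
    return min(range(n_decisions), key=expected.__getitem__)


# Classes: 0 = cancer, 1 = normal (illustrative loss values).
cancer_loss = [[0, 1000], [1, 0]]
print(decide([0.01, 0.99], cancer_loss))          # asymmetric loss favours "cancer"
print(decide([0.55, 0.45], [[0, 1], [1, 0]], theta=0.6))  # too uncertain: rejected
```

With the asymmetric loss, a 1% posterior for cancer is already enough to choose the cancer decision, which a pure maximum-posterior rule would not do.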
Decision Theory (Cont’d)
Inference and Decision
- Three Distinct Approaches
1. Generative Models (Obtain the Posterior Probability via the Data Distribution)
- Model the class-conditional density p(x|C_k) for each class
- Obtain p(C_k) and p(x), then use Bayes' theorem to get p(C_k|x)
- Can generate synthetic data points
- Computationally expensive
1.5 Decision Theory (Cont’d)
2. Discriminative Models Using the Posterior
- Model the posterior p(C_k|x) directly
- Assign each new input to a class
- Sufficient when only classification is needed
3. Discriminant Function
- Maps each input x directly to a class label
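The inference step of the generative approach (item 1) is Bayes' theorem applied to the class-conditional densities and priors. A minimal sketch, with a hypothetical function name `posterior` and made-up numbers for one input x:

```python
def posterior(likelihoods, priors):
    """Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x).

    likelihoods[k] = p(x | C_k) evaluated at one fixed input x; priors[k] = p(C_k).
    """
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x) = sum_k p(x | C_k) p(C_k)
    return [j / evidence for j in joint]


# Illustrative numbers only.
print(posterior([0.2, 0.6], [0.5, 0.5]))
```

A discriminative model (item 2) would output these posteriors directly, and a discriminant function (item 3) would skip them and return only the class label.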
1.5 Decision Theory (Cont’d)
Why Do We Compute the Posterior?
1. Minimizing Risk
- The loss matrix may change frequently
2. Reject Option
3. Compensating for Class Priors
- Useful when the class probabilities differ greatly
- The posterior is proportional to the prior
4. Combining Models
- Separate the problem into subproblems and obtain a posterior for each
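Item 4 can be sketched as follows. The combination rule used here, p(C_k|x_A, x_B) ∝ p(C_k|x_A) p(C_k|x_B) / p(C_k), relies on the assumption that the two inputs are conditionally independent given the class (the naive Bayes assumption); that assumption and the function name `combine_posteriors` are not stated on the slide.

```python
def combine_posteriors(post_a, post_b, priors):
    """Combine two per-subproblem posteriors, assuming conditional independence.

    p(C_k | x_A, x_B) is proportional to p(C_k | x_A) * p(C_k | x_B) / p(C_k).
    """
    unnormalized = [a * b / p for a, b, p in zip(post_a, post_b, priors)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Illustrative numbers: two models that each favour class 0 reinforce each other.
print(combine_posteriors([0.8, 0.2], [0.8, 0.2], [0.5, 0.5]))
```

Dividing by the prior avoids counting it twice, since each individual posterior already includes it.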
Decision Theory (Cont’d)
Loss Function for Regression
- Generalizes to a vector of multiple target variables
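The slide's equations did not survive extraction. For reference, the standard expected squared loss and its minimizer from Bishop's Section 1.5.5 are given below; these are supplied from the textbook rather than recovered from the slide.

```latex
\mathbb{E}[L] = \iint \{ y(\mathbf{x}) - t \}^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt,
\qquad
y(\mathbf{x}) = \int t \, p(t \mid \mathbf{x}) \, dt = \mathbb{E}_t[t \mid \mathbf{x}]
```

For a vector of target variables t, the same argument gives y(x) = E_t[t|x], the conditional mean of the target vector.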
1.5 Decision Theory (Cont’d)
Minkowski Loss
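The Minkowski loss generalizes the squared loss to E[L_q] = ∫∫ |y(x) - t|^q p(x, t) dx dt. Its minimizer is the conditional mean for q = 2 and the conditional median for q = 1. A minimal numerical sketch on a made-up sample of targets, using a brute-force grid search; the function name, the data, and the grid resolution are all assumptions for illustration:

```python
import statistics


def minkowski_minimizer(targets, q, grid_steps=2000):
    """Grid-search the constant prediction y that minimizes mean |y - t|**q."""
    lo, hi = min(targets), max(targets)
    best_y, best_loss = lo, float("inf")
    for i in range(grid_steps + 1):
        y = lo + (hi - lo) * i / grid_steps
        loss = sum(abs(y - t) ** q for t in targets) / len(targets)
        if loss < best_loss:
            best_y, best_loss = y, loss
    return best_y


# Hypothetical targets observed at one fixed input x (skewed on purpose).
targets = [0.0, 1.0, 2.0, 3.0, 10.0]
print(minkowski_minimizer(targets, q=2), statistics.mean(targets))
print(minkowski_minimizer(targets, q=1), statistics.median(targets))
```

The skewed sample makes the difference visible: the q = 2 solution is pulled toward the outlier, while the q = 1 solution is not.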
Information Theory
Entropy
- Low-probability events correspond to high information content: h(x) = -log_2 p(x)
- Entropy is the expected value of the information content: H[x] = -Σ_x p(x) log_2 p(x)
- Higher Entropy, Larger Uncertainty
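The definition of entropy can be checked on the eight-state example from Bishop's Section 1.6. The function name `entropy_bits` is an assumption for illustration.

```python
import math


def entropy_bits(probs):
    """Entropy H[x] = -sum_x p(x) log2 p(x), in bits; terms with p(x) = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


uniform = [1 / 8] * 8
skewed = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]
print(entropy_bits(uniform))  # uniform distribution over 8 states
print(entropy_bits(skewed))   # non-uniform distribution over the same 8 states
```

The uniform distribution has the higher entropy, matching the slide's point that higher entropy means larger uncertainty.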
Information Theory (Cont’d)
Maximum Entropy Configuration for a Continuous Variable
- For a fixed mean and variance, the distribution that maximizes the differential entropy is the Gaussian
- Use Lagrange multipliers to obtain the maximum-entropy distribution
Conditional Entropy
- H[x,y] = H[y|x] + H[x]
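The relation H[x,y] = H[y|x] + H[x] can be verified numerically for a discrete joint distribution. A minimal sketch; the joint table and the helper name `H` are made up for illustration.

```python
import math


def H(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical joint distribution p(x, y): rows index x, columns index y.
joint = [[0.25, 0.25],
         [0.40, 0.10]]

p_x = [sum(row) for row in joint]                    # marginal p(x)
h_joint = H([p for row in joint for p in row])       # H[x, y]
# H[y|x] = sum_x p(x) * H[y | x], using p(y|x) = p(x, y) / p(x)
h_y_given_x = sum(px * H([p / px for p in row]) for px, row in zip(p_x, joint))

print(h_joint, h_y_given_x + H(p_x))
```

The two printed values agree up to floating-point rounding.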
Information Theory (Cont’d)
Relative Entropy [Kullback-Leibler Divergence]
- Model an unknown distribution p(x) with an approximating distribution q(x)
Convex Functions (Jensen’s Inequality)
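For discrete distributions the relative entropy is KL(p‖q) = Σ_x p(x) ln(p(x)/q(x)). Jensen's inequality guarantees that it is non-negative, with equality only when p = q. A minimal sketch, assuming q(x) > 0 wherever p(x) > 0; the function name and the example distributions are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) ln(p(x) / q(x)), in nats.

    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) = 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))  # not symmetric
print(kl_divergence(p, p))                       # zero when the distributions match
```

The asymmetry is why the KL divergence is not a true distance between distributions.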
1.6 Information Theory (Cont’d)
Mutual Information
- I[x,y] = H[x] - H[x|y] = H[y] - H[y|x]
- If x and y are independent, I[x,y] = 0
- The reduction in the uncertainty about x by virtue of being told the value of y
- The relative entropy between the joint distribution and the product of the marginals
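The last bullet gives a direct way to compute mutual information for a discrete joint distribution: I[x,y] = Σ p(x,y) log_2( p(x,y) / (p(x) p(y)) ). A minimal sketch with made-up joint tables; the function name is an assumption.

```python
import math


def mutual_information(joint):
    """I[x, y] in bits: the KL divergence between p(x, y) and p(x) p(y).

    joint[i][j] = p(x = i, y = j); terms with zero joint probability contribute 0.
    """
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (p_x[i] * p_y[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


independent = [[0.25, 0.25], [0.25, 0.25]]  # p(x, y) = p(x) p(y)
dependent = [[0.5, 0.0], [0.0, 0.5]]        # y is fully determined by x
print(mutual_information(independent))
print(mutual_information(dependent))
```

In the fully dependent case, knowing y removes all uncertainty about x, so the mutual information equals H[x], which is one bit here.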