CHAPTER 5: Multivariate Methods
Multivariate Distribution
(Based on E. Alpaydın, Introduction to Machine Learning, © The MIT Press.)
Assume all members of a class come from a joint distribution.
The class-conditional distributions P(x|C) can be learned from data.
A new instance is assigned to the most probable class P(C|x) using Bayes' rule.
An instance is described by a vector of correlated attributes, which puts us in the realm of multivariate distributions, in particular the multivariate normal.
Multivariate Data
Multiple measurements (sensors).
d inputs/features/attributes: d-variate data.
N instances/observations/examples.
$X = \begin{bmatrix} X_1^1 & X_2^1 & \cdots & X_d^1 \\ X_1^2 & X_2^2 & \cdots & X_d^2 \\ \vdots & \vdots & & \vdots \\ X_1^N & X_2^N & \cdots & X_d^N \end{bmatrix}$
Multivariate Parameters
Mean: $E[x] = \mu = [\mu_1, \ldots, \mu_d]^T$
Covariance: $\sigma_{ij} \equiv \mathrm{Cov}(X_i, X_j)$
Correlation: $\mathrm{Corr}(X_i, X_j) \equiv \rho_{ij} = \dfrac{\sigma_{ij}}{\sigma_i \sigma_j}$
Covariance matrix:
$\Sigma \equiv \mathrm{Cov}(X) = E\!\left[(X-\mu)(X-\mu)^T\right] = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$
Parameter Estimation
Sample mean $m$: $m_i = \dfrac{\sum_{t=1}^{N} x_i^t}{N}, \quad i = 1, \ldots, d$
Covariance matrix $S$: $s_{ij} = \dfrac{\sum_{t=1}^{N} (x_i^t - m_i)(x_j^t - m_j)}{N}$
Correlation matrix $R$: $r_{ij} = \dfrac{s_{ij}}{s_i s_j}$
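As a minimal NumPy sketch of these estimators (the synthetic data and its parameters are illustrative assumptions), the sample mean m, the covariance matrix S (dividing by N, as above), and the correlation matrix R follow directly from an N × d data matrix:

```python
import numpy as np

# Synthetic N x d data matrix: rows are instances, columns are attributes.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, 2.0], cov=[[2.0, 0.8], [0.8, 1.0]], size=500)

N, d = X.shape
m = X.mean(axis=0)            # sample mean m, length d
Xc = X - m                    # centered data
S = (Xc.T @ Xc) / N           # covariance matrix S (divide by N, as in the slides)
s = np.sqrt(np.diag(S))       # per-attribute standard deviations
R = S / np.outer(s, s)        # correlation matrix R

print(m, S, R, sep="\n")
```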
Estimation of Missing Values
What to do if certain instances have missing attributes?
Ignore those instances: not a good idea if the sample is small.
Use "missing" as an attribute value: it may carry information.
Imputation: fill in the missing value.
Mean imputation: use the most likely value (e.g., the mean).
Imputation by regression: predict the missing value from the other attributes.
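A hedged sketch of mean imputation and imputation by regression in NumPy; the toy values and the choice of a simple linear fit for the regression step are assumptions for illustration only:

```python
import numpy as np

# Toy (age, income) data; NaN marks missing entries. Values are made up.
X = np.array([[25.0, 40000.0],
              [32.0, np.nan],
              [47.0, 90000.0],
              [np.nan, 60000.0]])

# Mean imputation: replace each missing entry with its column mean over observed values.
col_means = np.nanmean(X, axis=0)
X_mean = np.where(np.isnan(X), col_means, X)

# Imputation by regression (sketch): predict income from age using the complete rows.
complete = ~np.isnan(X).any(axis=1)
a, b = np.polyfit(X[complete, 0], X[complete, 1], deg=1)   # income ~ a * age + b
X_reg = X_mean.copy()
needs_income = np.isnan(X[:, 1]) & ~np.isnan(X[:, 0])
X_reg[needs_income, 1] = a * X[needs_income, 0] + b
```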
Multivariate Normal
We have d attributes, and we can often assume each one is normally distributed.
The attributes may be dependent/correlated.
What is the joint distribution of several correlated variables, P(X1 = x1, X2 = x2, ..., Xd = xd)?
Each Xi is normally distributed with mean μi and variance σi².
Multivariate Normal
$x \sim N_d(\mu, \Sigma)$
$p(x) = \dfrac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
Mahalanobis distance: $(x-\mu)^T \Sigma^{-1} (x-\mu)$.
When variables have large variance or are strongly correlated, deviations along those directions are divided by the large (co)variance, so they contribute less to the Mahalanobis distance and reduce the density less.
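A small sketch that evaluates the Mahalanobis distance and the density above with NumPy and cross-checks the result against SciPy's multivariate_normal (the mean, covariance, and query point are arbitrary examples):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([1.0, -0.5])

# Squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
diff = x - mu
d2 = diff @ np.linalg.solve(Sigma, diff)

# Multivariate normal density, written out from the formula above
d = len(mu)
p_manual = np.exp(-0.5 * d2) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

# Sanity check against SciPy
p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(np.sqrt(d2), p_manual, p_scipy)
```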
Bivariate Normal (figure)
Multivariate Normal Distribution
Mahalanobis distance: $(x-\mu)^T \Sigma^{-1} (x-\mu)$ measures the distance from x to μ in terms of Σ (it normalizes for differences in variances and for correlations).
Bivariate case, d = 2:
$\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$
$p(x_1, x_2) = \dfrac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\!\left(-\dfrac{1}{2(1-\rho^2)} \left(z_1^2 - 2\rho z_1 z_2 + z_2^2\right)\right), \qquad z_i = \dfrac{x_i - \mu_i}{\sigma_i}$
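The bivariate form above can be checked against the general d-dimensional density; a brief sketch (parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

def bivariate_normal_pdf(x1, x2, mu1, mu2, s1, s2, rho):
    """Bivariate normal density written with the z-scores z_i = (x_i - mu_i) / s_i."""
    z1 = (x1 - mu1) / s1
    z2 = (x2 - mu2) / s2
    norm = 2 * np.pi * s1 * s2 * np.sqrt(1 - rho ** 2)
    expo = -(z1 ** 2 - 2 * rho * z1 * z2 + z2 ** 2) / (2 * (1 - rho ** 2))
    return np.exp(expo) / norm

# Check that it agrees with the general d-dimensional formula
mu1, mu2, s1, s2, rho = 0.0, 1.0, 1.5, 0.7, 0.6
Sigma = np.array([[s1 ** 2, rho * s1 * s2], [rho * s1 * s2, s2 ** 2]])
print(bivariate_normal_pdf(0.5, 0.8, mu1, mu2, s1, s2, rho))
print(multivariate_normal(mean=[mu1, mu2], cov=Sigma).pdf([0.5, 0.8]))
```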
Independent Inputs: Naive Bayes
If the xi are independent, the off-diagonal entries of Σ are 0, and the Mahalanobis distance reduces to a weighted (by 1/σi) Euclidean distance.
If the variances are also equal, it reduces to the ordinary Euclidean distance.
$p(x) = \prod_{i=1}^{d} p_i(x_i) = \dfrac{1}{(2\pi)^{d/2} \prod_{i=1}^{d} \sigma_i} \exp\!\left(-\tfrac{1}{2} \sum_{i=1}^{d} \left(\dfrac{x_i - \mu_i}{\sigma_i}\right)^2\right)$
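A short sketch of the diagonal-covariance density, verifying that it equals the product of the univariate normal densities (the numbers are arbitrary):

```python
import numpy as np
from scipy.stats import norm

def diag_normal_pdf(x, mu, sigma):
    """Density of a d-variate normal with diagonal covariance (independent inputs)."""
    x, mu, sigma = map(np.asarray, (x, mu, sigma))
    d = x.size
    z2 = np.sum(((x - mu) / sigma) ** 2)     # weighted Euclidean distance squared
    return np.exp(-0.5 * z2) / ((2 * np.pi) ** (d / 2) * np.prod(sigma))

# Equivalent to the product of the univariate densities
x, mu, sigma = [0.3, -1.0], [0.0, 0.0], [1.0, 2.0]
print(diag_normal_pdf(x, mu, sigma))
print(np.prod(norm.pdf(x, loc=mu, scale=sigma)))
```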
Projection Distribution
Example: a vector of 3 features with a multivariate normal distribution.
Project it onto a 2-dimensional space (e.g., the XY plane): the resulting vectors of 2 features also follow a multivariate normal distribution.
In general, the projection of a d-dimensional normal onto a k-dimensional space is a k-dimensional normal.
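An empirical sketch of this property: if x ~ N_d(μ, Σ) and z = Ax, then z ~ N_k(Aμ, AΣA^T). The 3-dimensional parameters and the projection onto the XY plane are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
A = np.array([[1.0, 0.0, 0.0],     # projection onto the first two coordinates (XY plane)
              [0.0, 1.0, 0.0]])

X = rng.multivariate_normal(mu, Sigma, size=20000)
Z = X @ A.T

print(Z.mean(axis=0), A @ mu)                      # sample mean vs. A mu
print(np.cov(Z, rowvar=False), A @ Sigma @ A.T)    # sample covariance vs. A Sigma A^T
```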
1D Projection (figure)
Multivariate Classification
Assume the members of a class come from a single multivariate distribution.
The multivariate normal is a good choice: it is easy to analyze and models many natural phenomena.
A class is modeled as a single prototype source (the mean) that is slightly randomly perturbed.
Example
Matching cars to customers: each car defines a class of matching customers.
Customers are described by (age, income), and there is a correlation between age and income.
Assume each class is multivariate normal.
Learn P(x|C) from data, then use Bayes' rule to compute P(C|x).
Parametric Classification
If $p(x \mid C_i) \sim N(\mu_i, \Sigma_i)$:
$p(x \mid C_i) = \dfrac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2} (x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right)$
The discriminant functions are $g_i(x) = \log p(x \mid C_i) + \log P(C_i)$.
We need the mean and covariance matrix of each class to compute the discriminants; P(x) can be ignored since it is the same for all classes.
Estimation of Parameters
(with $r_i^t = 1$ if $x^t \in C_i$ and 0 otherwise)
$\hat{P}(C_i) = \dfrac{\sum_t r_i^t}{N} \qquad m_i = \dfrac{\sum_t r_i^t x^t}{\sum_t r_i^t} \qquad S_i = \dfrac{\sum_t r_i^t (x^t - m_i)(x^t - m_i)^T}{\sum_t r_i^t}$
$g_i(x) = -\tfrac{1}{2} \log |S_i| - \tfrac{1}{2} (x - m_i)^T S_i^{-1} (x - m_i) + \log \hat{P}(C_i)$
Covariance Matrix per Class
Quadratic discriminant; requires estimating K · d(d+1)/2 parameters for the covariance matrices.
$g_i(x) = -\tfrac{1}{2} \log |S_i| - \tfrac{1}{2} \left(x^T S_i^{-1} x - 2 x^T S_i^{-1} m_i + m_i^T S_i^{-1} m_i\right) + \log \hat{P}(C_i) = x^T W_i x + w_i^T x + w_{i0}$
where
$W_i = -\tfrac{1}{2} S_i^{-1} \qquad w_i = S_i^{-1} m_i \qquad w_{i0} = -\tfrac{1}{2} m_i^T S_i^{-1} m_i - \tfrac{1}{2} \log |S_i| + \log \hat{P}(C_i)$
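A minimal sketch of the resulting quadratic-discriminant classifier, assuming labelled training data and using the estimators from the previous slide (the two synthetic Gaussian classes are purely illustrative):

```python
import numpy as np

def fit_quadratic_discriminant(X, y):
    """Estimate prior P(C_i), mean m_i, and covariance S_i for each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        prior = len(Xc) / len(X)
        mean = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False, bias=True)   # ML estimate (divide by N_c)
        params[c] = (prior, mean, cov)
    return params

def predict(params, x):
    """g_i(x) = -1/2 log|S_i| - 1/2 (x - m_i)^T S_i^{-1} (x - m_i) + log P(C_i)."""
    scores = {}
    for c, (prior, mean, cov) in params.items():
        diff = x - mean
        maha2 = diff @ np.linalg.solve(cov, diff)
        scores[c] = -0.5 * np.linalg.slogdet(cov)[1] - 0.5 * maha2 + np.log(prior)
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=200)
X1 = rng.multivariate_normal([3, 2], [[2, -0.6], [-0.6, 1]], size=200)
X = np.vstack([X0, X1]); y = np.array([0] * 200 + [1] * 200)

params = fit_quadratic_discriminant(X, y)
print(predict(params, np.array([0.2, 0.1])), predict(params, np.array([2.8, 2.2])))
```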
(Figure: class likelihoods, the posterior for C1, and the discriminant P(C1|x) = 0.5.)
Common Covariance Matrix S
If there is not enough data, we can assume all classes share the same sample covariance matrix S:
$S = \sum_i \hat{P}(C_i) S_i$
The discriminant reduces to a linear discriminant, since the quadratic term $x^T S^{-1} x$ is common to all discriminants and can be dropped:
$g_i(x) = -\tfrac{1}{2} (x - m_i)^T S^{-1} (x - m_i) + \log \hat{P}(C_i)$
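A sketch of the shared-covariance (linear) discriminant under these assumptions; it reuses the per-class `params` dictionary from the quadratic-discriminant sketch above (a hypothetical structure, not a library API):

```python
import numpy as np

def fit_linear_discriminant(params):
    """Pool covariances with S = sum_i P(C_i) S_i and build the linear discriminant
    g_i(x) = w_i^T x + w_i0, with w_i = S^{-1} m_i, w_i0 = -1/2 m_i^T S^{-1} m_i + log P(C_i)."""
    S = sum(prior * cov for prior, mean, cov in params.values())
    S_inv = np.linalg.inv(S)
    weights = {}
    for c, (prior, mean, cov) in params.items():
        w = S_inv @ mean
        w0 = -0.5 * mean @ S_inv @ mean + np.log(prior)
        weights[c] = (w, w0)
    return weights

def predict_linear(weights, x):
    return max(weights, key=lambda c: weights[c][0] @ x + weights[c][1])
```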
Common Covariance Matrix S (figure)
Diagonal S
When the xj, j = 1, ..., d, are independent, Σ is diagonal:
$p(x \mid C_i) = \prod_j p(x_j \mid C_i)$ (the naive Bayes assumption)
Classify based on the weighted Euclidean distance (in sj units) to the nearest mean:
$g_i(x) = -\tfrac{1}{2} \sum_{j=1}^{d} \left(\dfrac{x_j - m_{ij}}{s_j}\right)^2 + \log \hat{P}(C_i)$
Diagonal S (figure; variances may be different)
Diagonal S, Equal Variances
Nearest mean classifier: classify based on the Euclidean distance to the nearest mean.
Each mean can be considered a prototype or template, so this is template matching.
$g_i(x) = -\dfrac{\lVert x - m_i \rVert^2}{2 s^2} + \log \hat{P}(C_i) = -\dfrac{1}{2 s^2} \sum_{j=1}^{d} (x_j - m_{ij})^2 + \log \hat{P}(C_i)$
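A tiny sketch of the nearest-mean (template-matching) classifier implied by this case, assuming equal priors and a shared spherical covariance; the prototype means are made up:

```python
import numpy as np

def nearest_mean_predict(means, x):
    """means: dict mapping class label -> mean (prototype) vector; x: query point."""
    return min(means, key=lambda c: np.linalg.norm(x - means[c]))

means = {0: np.array([0.0, 0.0]), 1: np.array([3.0, 2.0])}   # illustrative prototypes
print(nearest_mean_predict(means, np.array([0.4, -0.2])))    # -> 0
print(nearest_mean_predict(means, np.array([2.5, 2.1])))     # -> 1
```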
Diagonal S, equal variances (figure)
Model Selection
Assumption                    Covariance matrix          No. of parameters
Shared, hyperspheric          Si = S = s²I               1
Shared, axis-aligned          Si = S, with sij = 0       d
Shared, hyperellipsoidal      Si = S                     d(d+1)/2
Different, hyperellipsoidal   Si                         K · d(d+1)/2
As we increase complexity (a less restricted S), bias decreases and variance increases.
Assume simple models (allow some bias) to control variance (regularization).
Model Selection
A different covariance matrix for each class means many parameters to estimate: small bias, but large variance.
A common covariance matrix, a diagonal covariance, etc. reduce the number of parameters: the bias increases, but the variance is controlled.
Are there in-between options?
Regularized Discriminant Analysis (RDA)
Write each class covariance as a combination $S_i' = a\, s^2 I + b\, S + (1 - a - b)\, S_i$, where S is the shared covariance and Si the per-class covariance.
a = b = 0: quadratic classifier.
a = 0, b = 1: shared covariance, linear classifier.
a = 1, b = 0: diagonal (spherical, s²I) covariance.
Choose the best a, b by cross-validation.
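A hedged sketch of the covariance combination used by RDA; taking the single variance estimate s² from the trace of the pooled covariance is an assumption for illustration:

```python
import numpy as np

def rda_covariance(S_i, S_shared, a, b):
    """S_i' = a * s^2 * I + b * S + (1 - a - b) * S_i, with a, b chosen by cross-validation."""
    d = S_shared.shape[0]
    s2 = np.trace(S_shared) / d            # one common variance estimate (an assumption)
    return a * s2 * np.eye(d) + b * S_shared + (1 - a - b) * S_i

# a = b = 0    -> S_i        (quadratic classifier)
# a = 0, b = 1 -> S_shared   (linear classifier)
# a = 1, b = 0 -> s^2 * I    (spherical covariance, nearest-mean-like)
```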
Model Selection: Example (figure)
Multivariate Regression
Multivariate linear model: $r^t = w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t + \varepsilon$
As in the univariate case, the parameters are estimated by minimizing the sum of squared errors on the training set.
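A minimal least-squares sketch of the multivariate linear model (synthetic data; the "true" weights exist only to check the fit):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 200, 3
X = rng.normal(size=(N, d))
true_w = np.array([1.0, -2.0, 0.5])
r = 4.0 + X @ true_w + rng.normal(scale=0.1, size=N)   # r^t = w0 + w^T x^t + noise

X1 = np.hstack([np.ones((N, 1)), X])                   # prepend a column of ones for w0
w_hat, *_ = np.linalg.lstsq(X1, r, rcond=None)         # least-squares estimate
print(w_hat)    # approximately [4.0, 1.0, -2.0, 0.5]
```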
CHAPTER 6: Dimensionality Reduction
Dimensionality of Input
The number of observables (e.g., age and income).
If the number of observables is increased:
More time to compute.
More memory to store inputs and intermediate results.
More complicated explanations (knowledge extracted by learning), e.g., a regression with 100 vs. 2 parameters.
No simple visualization: a 2D vs. a 10D graph.
Much more data is needed (the curse of dimensionality): 1M one-dimensional inputs are not equivalent to 1 input of dimension 1M.
Dimensionality Reduction
Some features (dimensions) bear little or no useful information (e.g., hair color for car selection): we can drop them, but we have to estimate from data which features can be dropped.
Several features can be combined without loss, or even with a gain, of information (e.g., the incomes of all family members for a loan application): we have to estimate from data which features to combine.
Feature Selection vs. Extraction
Feature selection: choose k < d important features, ignoring the remaining d − k (subset selection algorithms).
Feature extraction: project the original xi, i = 1, ..., d dimensions to new k < d dimensions zj, j = 1, ..., k.
Principal Components Analysis (PCA), Linear Discriminant Analysis (LDA), Factor Analysis (FA).
Usage
Given data of dimension d, reduce the dimensionality to k < d by discarding unimportant features or combining several features into one.
Use the resulting k-dimensional data set for:
Learning a classification problem (e.g., the parameters of the probabilities P(x|C)).
Learning a regression problem (e.g., the parameters of a model y = g(x|θ)).
Subset Selection
Start from an initial set of d features; there are 2^d possible subsets.
We need a criterion to decide which subset is best, and a way to search over the possible subsets.
We cannot go over all 2^d possibilities, so we need heuristics.
"Goodness" of a Feature Set
Supervised: train using the selected subset and estimate the error on a validation set.
Unsupervised: look at the inputs only (e.g., age, income, and savings) and select the subset of 2 that carries most of the information about the person.
Mutual Information
Suppose we have 3 random variables (features) X, Y, Z and have to select the 2 that give the most information.
If X and Y are "correlated", much of the information in Y is already in X, so it makes sense to select features that are "uncorrelated".
Mutual information (based on the Kullback-Leibler divergence) is a more general measure of this shared information.
It can be extended to n variables (the information that variables x1, ..., xn carry about variable xn+1).
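A hedged sketch of ranking features by their estimated mutual information with the class label, using scikit-learn's mutual_info_classif; the synthetic age/income/noise features and the labelling rule are assumptions:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(3)
N = 1000
age = rng.normal(40, 10, N)
income = 1000 * age + rng.normal(0, 5000, N)       # strongly correlated with age
noise = rng.normal(size=N)                          # irrelevant feature
y = (income + rng.normal(0, 8000, N) > 40000).astype(int)

X = np.column_stack([age, income, noise])
scores = mutual_info_classif(X, y, random_state=0)  # higher score = more informative
print(dict(zip(["age", "income", "noise"], scores)))
```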
Subset Selection: Forward and Backward Search
Forward search: start from an empty set of features; try each remaining feature and estimate the classification/regression error of adding it; add the feature that gives the largest improvement in validation error; stop when there is no significant improvement.
Backward search: start with the original set of size d and drop the features with the smallest impact on the error.
Subset Selection
There are 2^d subsets of d features.
Forward search: add the best feature at each step.
The set of features F is initially Ø. At each iteration, find the best new feature, j = argmin_i E(F ∪ {x_i}), and add x_j to F if E(F ∪ {x_j}) < E(F).
This is a hill-climbing O(d²) algorithm.
Backward search: start with all features and remove one at a time, if possible.
Floating search: add k, remove l.
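A greedy forward-selection sketch along these lines, using cross-validated error as E(F); the choice of logistic regression as the underlying model is an assumption:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, cv=5):
    """Grow F one feature at a time; stop when no addition lowers the CV error."""
    remaining = list(range(X.shape[1]))
    F, best_err = [], np.inf
    while remaining:
        errs = {j: 1 - cross_val_score(LogisticRegression(max_iter=1000),
                                       X[:, F + [j]], y, cv=cv).mean()
                for j in remaining}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:        # no improvement: stop
            break
        best_err = errs[j_best]
        F.append(j_best)
        remaining.remove(j_best)
    return F, best_err
```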
Floating Search
Forward and backward search are "greedy" algorithms: they select the best option at each single step and do not always reach the optimal subset.
Floating search uses two types of steps, add k and remove l, at the cost of more computation.
Feature Extraction
Face recognition problem: the training data are pairs of image + label (name); the classifier input is an image and its output is a label (name).
An image is a matrix of 256 × 256 = 65536 values in the range 0..255.
Each individual pixel bears little information, so we cannot simply select the 100 best pixels.
An average of the pixels around specific positions, however, may give an indication of, say, eye color.
Projection
Find a projection matrix W from d-dimensional to k-dimensional vectors that keeps the error (information loss) low.
PCA: Motivation
Assume the d observables can be summarized by k < d new variables, each a linear combination of the original attributes: $z_i = w_{i1} x_1 + \cdots + w_{id} x_d$.
We would like to work with this basis, since it has lower dimension yet retains all (or almost all) of the required information.
What we expect from such a basis:
The new variables are uncorrelated; otherwise the basis could be reduced further.
They have large variance; otherwise they bear no information.
PCA: Motivation
Choose directions such that the total variance of the data is maximal (maximize total variance).
Choose directions that are orthogonal (minimize correlation).
That is, choose k < d orthogonal directions that maximize the total variance.
PCA: Choosing the Directions
Maximize the variance of the projection, $\mathrm{Var}(z_1) = w_1^T \Sigma w_1$, subject to the constraint $\lVert w_1 \rVert = 1$, using a Lagrange multiplier:
$\max_{w_1} \; w_1^T \Sigma w_1 - \alpha \left(w_1^T w_1 - 1\right)$
Taking the derivative and setting it to zero gives $\Sigma w_1 = \alpha w_1$, so $w_1$ is an eigenvector of $\Sigma$.
Since the variance of the projection equals the eigenvalue α and we want to maximize it, we choose the eigenvector with the largest eigenvalue.
PCA
In the d-dimensional feature space, estimate the d × d symmetric covariance matrix from the samples.
Select the k largest eigenvalues of the covariance matrix and the associated k eigenvectors.
The first eigenvector is the direction with the largest variance.
What PCA Does
$z = W^T (x - m)$, where the columns of W are the eigenvectors of Σ and m is the sample mean.
This centers the data at the origin and rotates the axes.
How to Choose k?
Proportion of variance (PoV) explained, with the λi sorted in descending order:
$\mathrm{PoV}(k) = \dfrac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_k + \cdots + \lambda_d}$
Typically, stop at PoV > 0.9.
A scree graph plots PoV vs. k; stop at the "elbow".
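A compact NumPy sketch of the whole PCA recipe from the last few slides: eigendecompose the sample covariance, sort the eigenvalues, pick k from the PoV, and project with z = W^T(x − m). The synthetic data is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0, 0, 0],
                            [[4.0, 1.5, 0.5],
                             [1.5, 2.0, 0.3],
                             [0.5, 0.3, 0.5]], size=1000)

m = X.mean(axis=0)
S = np.cov(X - m, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)          # eigh: S is symmetric
order = np.argsort(eigvals)[::-1]             # sort eigenvalues in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pov = np.cumsum(eigvals) / np.sum(eigvals)    # PoV(k) for k = 1, ..., d
k = int(np.searchsorted(pov, 0.9) + 1)        # smallest k with PoV >= 0.9

W = eigvecs[:, :k]
Z = (X - m) @ W                               # z = W^T (x - m) for every instance
print(pov, k, Z.shape)
```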
PCA
PCA is unsupervised: it does not take class information into account.
Classes can be taken into account:
Karhunen-Loève expansion: estimate the covariance per class and take the average weighted by the priors.
Common principal components: assume all classes have the same eigenvectors (directions) but different variances.
PCA
PCA does not try to explain noise: large noise can become a new dimension, or even the largest principal component.
PCA is interested in uncorrelated variables that explain a large portion of the total sample variance.
Sometimes we are instead interested in the shared variance (common factors) that affects the data, which is the goal of factor analysis.