Linear Discriminant Analysis (LDA)
Discriminant analysis (Jolliffe): Discriminant analysis is concerned with data in which each observation comes from one of several well-defined groups or populations. Assumptions are made about the structure of the populations, and the main objective is to construct rules for assigning future observations to one of the populations so as to minimize the probability of misclassification or some similar criterion. Roughly speaking, cluster analysis is unsupervised learning, while discriminant analysis falls under supervised learning.
Quadratic discriminant: Suppose that we have two classes of data, D_1 = {x^1_i} with class label y_1 and D_2 = {x^2_i} with class label y_2. We assume both are normally distributed with means and covariances (m_1, σ_1) and (m_2, σ_2).
Let us consider two events, E = {x is in D_1} and H = {x is in D_2}. By Bayes' rule, the posterior probability is P(H|E) = P(E|H)P(H)/P(E), where P(H|E) is a conditional probability, and vice versa. We may say that x is in D_2 if the log-likelihood condition (x − m_1)^T [σ_1]^{-1}(x − m_1) + ln|σ_1| − (x − m_2)^T [σ_2]^{-1}(x − m_2) − ln|σ_2| > h is satisfied for a threshold h. This is the quadratic classifier.
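As a minimal sketch (not from the original slides), the left-hand side of this quadratic rule can be evaluated with NumPy; the argument names m1, S1, m2, S2 for the class means and covariances and the threshold h are illustrative assumptions.

import numpy as np

def quadratic_score(x, m1, S1, m2, S2):
    # (x-m1)^T S1^{-1} (x-m1) + ln|S1| - (x-m2)^T S2^{-1} (x-m2) - ln|S2|
    d1, d2 = x - m1, x - m2
    return (d1 @ np.linalg.solve(S1, d1) + np.log(np.linalg.det(S1))
            - d2 @ np.linalg.solve(S2, d2) - np.log(np.linalg.det(S2)))

# assign x to D_2 when quadratic_score(x, m1, S1, m2, S2) > h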
LDA: We make two main assumptions; the two classes satisfy the following. 1. The covariances are identical: σ_1 = σ_2 = σ, so that x^T [σ_1]^{-1} x = x^T [σ_2]^{-1} x and, by symmetry of σ, x^T [σ_i]^{-1} m_i = (m_i)^T [σ_i]^{-1} x. 2. The x's are independent. Then the quadratic classifier reduces to the linear rule w·x > c with w = [σ]^{-1}(m_2 − m_1) and c = 1/2 (h − m_1^T [σ]^{-1} m_1 + m_2^T [σ]^{-1} m_2).
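A minimal sketch of this linear rule, assuming the two classes are given as NumPy arrays X1 and X2 with one observation per row (names of my choosing, not from the slides):

import numpy as np

def lda_two_class(X1, X2, h=0.0):
    # returns (w, c) of the rule w.x > c, using a pooled covariance estimate
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S = (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)) / 2   # common sigma
    w = np.linalg.solve(S, m2 - m1)                                 # sigma^{-1}(m2 - m1)
    c = 0.5 * (h - m1 @ np.linalg.solve(S, m1) + m2 @ np.linalg.solve(S, m2))
    return w, c

# classify x as D_2 when w @ x > c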
Multiclass LDA: Suppose there are k classes {D_i, i = 1, 2, …, k}. We assume the covariances are identical, say σ_1 = … = σ_k = σ. The class separation in direction w is obtained by maximizing S = (w^T [σ_s] w)/(w^T [σ] w), where σ_s is the covariance of the class means, σ_s = (1/k) Σ_{i=1}^{k} (m_i − m)(m_i − m)^T, with m the overall mean. w is an eigenvector of [σ]^{-1}[σ_s] and S is the corresponding separation level.
In fact, the maximizer satisfies 1/2 dS/dw = ((w^T [σ] w) [σ_s] w − (w^T [σ_s] w) [σ] w)/(w^T [σ] w)^2 = 0, so w satisfies the eigenvector equation [σ]^{-1}[σ_s] w = S w. Note that w is orthogonal to the decision boundary.
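A short sketch of this eigenvector computation, assuming sigma (the common covariance) and sigma_s (the covariance of the class means) are already available as NumPy arrays:

import numpy as np

def lda_directions(sigma, sigma_s):
    # eigenpairs of sigma^{-1} sigma_s; each eigenvalue is a separation level S
    vals, vecs = np.linalg.eig(np.linalg.solve(sigma, sigma_s))
    order = np.argsort(vals.real)[::-1]          # largest separation first
    return vals.real[order], vecs.real[:, order]

# the first returned column is the direction w of maximal separation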
Data from UC Irvine: http://archive.ics.uci.edu/ml/index.html. Go there and get the Iris data as follows (Jupyter notebook):
>>> import pandas as pd
>>> df = pd.read_csv('http://archive.ics.uci.edu/ml/'
...                  'machine-learning-databases/iris/iris.data',
...                  header=None)
>>> df.tail()   # check the data: 150 vectors with 4 attributes and a label
       0    1    2    3               4
145  6.7  3.0  5.2  2.3  Iris-virginica
146  6.3  2.5  5.0  1.9  Iris-virginica
147  6.5  3.0  5.2  2.0  Iris-virginica
148  6.2  3.4  5.4  2.3  Iris-virginica
149  5.9  3.0  5.1  1.8  Iris-virginica
Figure: scatter plot of the 50 Iris Setosa and 50 Iris Versicolour samples, sepal length vs. petal length.
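A small matplotlib sketch of such a plot from the DataFrame df loaded above (my own code, not part of the original notebook); in iris.data, column 0 is sepal length and column 2 is petal length:

import matplotlib.pyplot as plt

setosa = df[df[4] == 'Iris-setosa']
versicolor = df[df[4] == 'Iris-versicolor']
plt.scatter(setosa[0], setosa[2], marker='o', label='Iris-setosa')
plt.scatter(versicolor[0], versicolor[2], marker='x', label='Iris-versicolor')
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.legend()
plt.show()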
Dimension reduction: When the random vectors have a large dimension, we may combine LDA with PCA (principal component analysis) before classification. One problem is that the classes might have different variances or even different distributions; note that LDA assumes that each class has the same covariance.
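One possible way to combine the two steps, sketched with scikit-learn (a library choice of mine, not from the slides; the number of components is illustrative):

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# reduce dimension with PCA first, then classify with LDA
clf = make_pipeline(PCA(n_components=2), LinearDiscriminantAnalysis())
# clf.fit(X_train, y_train); clf.predict(X_test)   # X_train, y_train, X_test assumed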
Summary for practical LDA (a code sketch follows below):
0. Before we start, normalize the data by dividing each attribute by its standard deviation.
1. Compute the mean vector m_i for each class.
2. Compute the within-class scatter matrix σ_i = Σ_{x_k ∈ D_i} (x_k − m_i)(x_k − m_i)^T for each class and the between-class scatter matrix σ_s = Σ_k (m_k − m)(m_k − m)^T, where m is the overall mean.
3. Find the eigenvectors and eigenvalues {(e_i, λ_i), i = 1, …, d} of [σ]^{-1}[σ_s], where σ = Σ_i σ_i is the pooled within-class scatter.
4. Rearrange the eigenvalues in decreasing order and choose the k largest eigenvalues and their eigenvectors {(e_i, λ_i), i = 1, …, k}, where we use the same indices in decreasing order.
5. Project the data vectors onto the k-dimensional subspace span{e_i, i ≤ k}.
6. Conduct LDA on the projected vectors in R^k.
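The steps above as a minimal NumPy sketch (my own illustration, not from the slides); X is an n×d data matrix and y its label vector, both assumed names:

import numpy as np

def lda_project(X, y, k=2):
    # steps 0-5: normalize, build scatter matrices, solve the eigenproblem, project
    X = X / X.std(axis=0)                           # 0. divide by standard deviation
    m = X.mean(axis=0)                              # overall mean
    d = X.shape[1]
    S_w = np.zeros((d, d))                          # within-class scatter (sigma)
    S_b = np.zeros((d, d))                          # between-class scatter (sigma_s)
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                        # 1. class mean m_i
        S_w += (Xc - mc).T @ (Xc - mc)              # 2. sum of (x_k - m_i)(x_k - m_i)^T
        S_b += np.outer(mc - m, mc - m)             # 2. (m_k - m)(m_k - m)^T
    vals, vecs = np.linalg.eig(np.linalg.solve(S_w, S_b))   # 3. eigenpairs of sigma^{-1} sigma_s
    order = np.argsort(vals.real)[::-1][:k]         # 4. k largest eigenvalues
    W = vecs.real[:, order]
    return X @ W                                    # 5. project onto span{e_1, ..., e_k}

# step 6: run the two-class (or pairwise) LDA rule on the projected k-dimensional vectors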
Thank you for your attention