Feature Generation and Cluster-based Feature Selection
Feature Generation – A case study: some data has no obvious features (sequence data) – Feature generation method – Feature Selection
Virus
Virus Recombination
Genome Distance
Previous Work on Genome Distance Measurement – Multiple sequence alignment based – Gene content based – Data compression based
Complete Composition Vector – Composition information embedded in one genome sequence – R: genome sequence of length L – Elements: {A, G, C, T} – K: maximum pattern length – f(a_1 a_2 … a_k): appearance probability of the string a_1 a_2 … a_k in R
Composition Value – Second-Order Markov Model – Expected appearance probability: q(a_1 … a_k) = f(a_1 … a_{k-1}) · f(a_2 … a_k) / f(a_2 … a_{k-1}) – Composition value: CCV(a_1 … a_k) = (f(a_1 … a_k) - q(a_1 … a_k)) / q(a_1 … a_k)
Composition Value – Example: R = ATATCTATATACT (L = 13). f(ATA) = 3/(13-3+1) = 3/11, f(AT) = 4/(13-2+1) = 4/12, f(TA) = 4/(13-2+1) = 4/12, f(T) = 6/13. Expected appearance probability: q(ATA) = f(AT)·f(TA)/f(T) = 13/54. CCV(ATA) = (f(ATA) - q(ATA))/q(ATA) = 19/143.
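A minimal sketch of this composition-value computation, reproducing the worked example above; the helper names kmer_freq and composition_value are illustrative, not from the slides.

```python
def kmer_freq(seq, k):
    """Appearance probability f of every k-mer in seq (sliding window)."""
    total = len(seq) - k + 1
    counts = {}
    for i in range(total):
        w = seq[i:i + k]
        counts[w] = counts.get(w, 0) + 1
    return {w: c / total for w, c in counts.items()}

def composition_value(seq, kmer):
    """CCV entry (f - q) / q, with q from the second-order Markov model."""
    k = len(kmer)
    f_k = kmer_freq(seq, k).get(kmer, 0.0)
    f_km1 = kmer_freq(seq, k - 1)
    f_km2 = kmer_freq(seq, k - 2)
    q = f_km1[kmer[:-1]] * f_km1[kmer[1:]] / f_km2[kmer[1:-1]]
    return (f_k - q) / q

print(composition_value("ATATCTATATACT", "ATA"))  # 0.1328... = 19/143
```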
Complete Composition Vector – Using the complete composition vector to represent the whole genome – String selection
String Selection Score Function (Relative Entropy)
Pair-wise Evolution Distance – Given two sequences R and S – Their composition vectors CCV(R) = (CCV_R(s_1), …, CCV_R(s_n)) and CCV(S) = (CCV_S(s_1), …, CCV_S(s_n)) over the selected strings s_1, …, s_n – Euclidean distance d(R, S) = sqrt( Σ_i (CCV_R(s_i) - CCV_S(s_i))^2 )
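A hedged sketch of the pair-wise evolution distance: the Euclidean distance between two composition vectors, assuming both vectors are indexed by the same selected strings in the same order.

```python
import numpy as np

def evolution_distance(ccv_r, ccv_s):
    """Euclidean distance d(R, S) between two composition vectors."""
    ccv_r, ccv_s = np.asarray(ccv_r, float), np.asarray(ccv_s, float)
    return float(np.linalg.norm(ccv_r - ccv_s))
```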
HIV-1 Genotyping – Identify the HIV-1 genotype – Three major groups: M (further categorized into 9 subtypes: A-D, F-H, J, K), O, and N – Recombinant strains
Neighbor Joining on 42 Pure-Subtype Strains
Genotyping Classifier – Mean-classifier – Leave-one-out cross validation – Independent testing
Genotyping Results – Top 500 strings selected by relative entropy – Pure subtype prediction: LOOCV accuracy on the training dataset 100%; independent test 100% – Myers et al. % – Oliveira et al. % – Rozanov et al. %
Genotyping Results – Recombinant accuracy 87.3%
Feature Selection – Two feature selection categories: – Selecting the topmost features with the most individual discriminatory power – Selecting a group of features with the most overall discriminatory power
Feature Selection – Examples of feature selection algorithms – F-test: compute a score for each feature and select the topmost features with the highest scores – SFS/SFFS: iteratively include one more feature at a time, then try removing each feature from the selected pool to see which of the remaining combinations is the best one.
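A minimal sketch of the first (filter) category: score each feature individually with a one-way ANOVA F-statistic and keep the top k. scikit-learn's f_classif is used here as one possible scorer; X, y, and k are placeholders.

```python
import numpy as np
from sklearn.feature_selection import f_classif

def top_k_by_f_score(X, y, k):
    """Return the indices of the k features with the largest F-scores."""
    f_scores, _ = f_classif(X, y)          # one F-statistic per feature
    return np.argsort(f_scores)[::-1][:k]  # indices of the k best features
```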
Cluster-based Feature Selection – (figure: Team A vs. Team B)
Cluster-based Feature Selection
Cluster-based Feature Selection – Perform feature clustering – Choose from each cluster the topmost features with the strongest individual discriminatory power
Discrimination Power Vector
Class means:
            Class 1      Class 2      Class 3
Feature 1   Mean_1(1)    Mean_1(2)    Mean_1(3)
Feature 2   Mean_2(1)    Mean_2(2)    Mean_2(3)
Discrimination power vectors:
            Class 2 - Class 1          Class 3 - Class 1          Class 3 - Class 2
Feature 1   |Mean_1(2) - Mean_1(1)|    |Mean_1(3) - Mean_1(1)|    |Mean_1(3) - Mean_1(2)|
Feature 2   |Mean_2(2) - Mean_2(1)|    |Mean_2(3) - Mean_2(1)|    |Mean_2(3) - Mean_2(2)|
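A hedged sketch of the cluster-based procedure built on these discrimination power vectors: compute one vector per feature, cluster the features on those vectors, and keep the top feature(s) of each cluster by individual discriminatory power. The concrete choices here (k-means clustering, scoring a feature by the sum of its absolute mean differences) are assumptions for illustration, not necessarily the exact setup from the slides.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def discrimination_power_vectors(X, y):
    """One row per feature; one |mean difference| per class pair."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])  # (classes, features)
    pairs = list(combinations(range(len(classes)), 2))
    return np.array([[abs(means[b, f] - means[a, f]) for a, b in pairs]
                     for f in range(X.shape[1])])

def cluster_based_selection(X, y, n_clusters=10, per_cluster=1):
    dpv = discrimination_power_vectors(X, y)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(dpv)
    score = dpv.sum(axis=1)                  # individual discriminatory power
    selected = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        best = members[np.argsort(score[members])[::-1][:per_cluster]]
        selected.extend(best.tolist())
    return sorted(selected)
```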
FMDV Genotyping – Datasets: Foot-and-Mouth Disease Virus (7 subtypes) – Genotypers: linear-kernel support vector machine (SVM) genotyper; mean-genotyper
Genotyping Results
Experiments with the Cluster-Based Method
Linear Regression
Application: Credit Approval (Credit Limit) – x_1 = Score, x_2 = Salary, x_3 = Status. Y = 0.5 x_1; Y = 0.5 x_1 + x_2/100; Y = 0.5 x_1 + x_2/100 + 20 x_3.
Linear Fitting of the Data – We want to fit a linear function to an observed set of points X = [x_1, x_2, x_3, …, x_n] with associated labels Y = [y_1, y_2, y_3, …, y_n]. After we fit the function, we can use it to predict a new y for a new x. Least squares: find the function that minimizes the sum (equivalently, the average) of squared distances between the actual y's in the training set and the predicted ones. The fitted line y = ax + b is used as the predictor.
Linear Function – Linear least squares fitting. General form: f(x) = w_0 + w_1 x_1 + … + w_d x_d. 1D case (X = R): a line. 2D case (X = R^2): a plane.
Loss function – Suppose target labels come from a set Y: binary classification Y = {0, 1}; regression Y = R (the real numbers). A loss function L(ŷ, y) maps decisions to costs: it defines the penalty for predicting ŷ when the true value is y. Standard choice for classification: the 0/1 loss L(ŷ, y) = 1 if ŷ ≠ y, else 0 (same as misclassification error). Standard choice for regression: the squared loss L(ŷ, y) = (ŷ - y)^2.
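A minimal sketch of the two standard losses named above; y_hat is the prediction and y the true value.

```python
def zero_one_loss(y_hat, y):
    """0/1 loss for classification: 1 if the prediction is wrong, else 0."""
    return 0 if y_hat == y else 1

def squared_loss(y_hat, y):
    """Squared loss for regression."""
    return (y_hat - y) ** 2
```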
Linear Fitting of the Data – Minimize the error function E(a, b) = Σ_i (y_i - (a x_i + b))^2 over the training points (x_i, y_i).
Example – Use least squares to find the equation of the line y = ax + b that best approximates the three points (2,0), (-1,1), and (0,2).
Example – Minimize the error function E(a, b) = (2a + b - 0)^2 + (-a + b - 1)^2 + (b - 2)^2. To find the optimal a, b, set the derivatives w.r.t. a and b equal to zero: 10a + 2b + 2 = 0 and a + 3b - 3 = 0, giving a = -3/7 and b = 8/7; therefore y = -(3/7)x + 8/7.
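A quick numerical check of this worked example with NumPy's least squares solver (an aside of this sketch, not part of the slides): fitting y = ax + b to (2,0), (-1,1), (0,2) should reproduce a = -3/7 and b = 8/7.

```python
import numpy as np

x = np.array([2.0, -1.0, 0.0])
y = np.array([0.0, 1.0, 2.0])
A = np.column_stack([x, np.ones_like(x)])      # design matrix [x, 1]
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a, b)                                    # about -0.4286 (-3/7) and 1.1429 (8/7)
```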
Example – Use least squares to find the equation of the line y = ax + b that best approximates the points (1,6), (2,5), (3,7), and (4,10). Result: y = 1.4x + 3.5.
Approaches to solve Ax ≈ b – Normal equations: quick and dirty – QR: standard in libraries, uses an orthogonal decomposition – SVD: a decomposition that also indicates how linearly independent the columns are – Conjugate gradient: no decomposition, good for large sparse problems
Ax = b – For example, for the points (1,6), (2,5), (3,7), and (4,10): A = [[1 1], [2 1], [3 1], [4 1]], x = [a b]^T, b = [6 5 7 10]^T. Normal system of equations, general form: A^T A x = A^T b.
Least Square: Ax = b ⇒ A^T A x = A^T b ⇒ x = (A^T A)^{-1} A^T b
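A hedged sketch of the normal-equations route x = (A^T A)^{-1} A^T b applied to the four-point example from the previous slide; in practice one solves A^T A x = A^T b rather than forming an explicit inverse.

```python
import numpy as np

px = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([6.0, 5.0, 7.0, 10.0])
A = np.column_stack([px, np.ones_like(px)])   # columns: x and the intercept
x = np.linalg.solve(A.T @ A, A.T @ b)         # solves A^T A x = A^T b
print(x)                                      # approximately [1.4, 3.5]
```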
QR Factorization – A matrix Q is said to be orthogonal if its columns are orthonormal, i.e. Q^T Q = I. Orthogonal transformations preserve the Euclidean norm, since ||Qx||^2 = x^T Q^T Q x = x^T x = ||x||^2. Orthogonal matrices can transform vectors in various ways, such as rotations or reflections, but they do not change the Euclidean length of a vector. Hence, they preserve the solution to a linear least squares problem.
QR Factorization – If A is an m×n matrix with linearly independent columns, then A can be decomposed as A = QR, where Q is an m×n matrix whose columns form an orthonormal basis for the column space of A (so (q_i, q_j) = 0 for i ≠ j and |q_i| = 1), and R is an n×n nonsingular upper triangular matrix.
Gram-Schmidt Orthonormalization Process – Given a linearly independent set a = {a_1, a_2, …, a_n}, set β_1 = a_1 and β_k = a_k - Σ_{j<k} ((a_k, β_j)/(β_j, β_j)) β_j, so that (β_i, β_j) = 0 for i ≠ j; then normalize θ_k = β_k / |β_k|.
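A minimal sketch of the (classical) Gram-Schmidt process applied to the columns of A, returning Q with orthonormal columns and an upper triangular R with A = QR; the columns of A are assumed linearly independent. Modified Gram-Schmidt or Householder reflections are numerically safer in practice.

```python
import numpy as np

def gram_schmidt_qr(A):
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        v = A[:, k].copy()
        for j in range(k):
            R[j, k] = Q[:, j] @ A[:, k]   # projection coefficient (a_k, q_j)
            v -= R[j, k] * Q[:, j]        # remove the component along q_j
        R[k, k] = np.linalg.norm(v)
        Q[:, k] = v / R[k, k]             # normalize to unit length
    return Q, R
```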
QR Factorization
We get A = UR, where U = (θ_1, θ_2, θ_3, …, θ_n) has orthonormal columns (this U plays the role of Q) and R is upper triangular.
Example of QR Factorization
Example of QR Factorization – Applying the Gram-Schmidt process to compute the QR decomposition, step by step (steps 1–6).
Therefore, A = QR. The QR decomposition is widely used in numerical software to find the eigenvalues of a matrix, to solve linear systems, and to compute least squares approximations.
Least square solution using QR Decomposition
1. Least squares normal equations: Ax = b, so A^T A x = A^T b
2. Substitute A = QR: R^T Q^T Q R x = R^T Q^T b
3. Since Q^T Q = I: R^T R x = R^T Q^T b
4. R is nonsingular, so: R x = Q^T b (solve by back substitution)
Least square solution using QR Decomposition – Running Time
1. QR factorization of A: A = QR (2mn^2 flops)
2. Form d = Q^T b (2mn flops)
3. Solve Rx = d by back substitution (n^2 flops)
Cost for large m, n: 2mn^2 flops
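A hedged sketch of least squares via QR for the same four-point example: factor A = QR, then solve Rx = Q^T b by back substitution (SciPy's triangular solver stands in for the back-substitution step).

```python
import numpy as np
from scipy.linalg import solve_triangular

px = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([6.0, 5.0, 7.0, 10.0])
A = np.column_stack([px, np.ones_like(px)])

Q, R = np.linalg.qr(A)              # reduced QR: Q is 4x2, R is 2x2 upper triangular
x = solve_triangular(R, Q.T @ b)    # back substitution on R x = Q^T b
print(x)                            # approximately [1.4, 3.5]
```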
QR for Least Square
Singular Value Decomposition – A = USV^T. The singular values are the diagonal entries of the S matrix and are arranged in descending order. The singular values are always real, non-negative numbers. If A is a real matrix, U and V are also real.
Singular Value Decomposition – The SVD of an m-by-n matrix A is given by the formula A = USV^T, where: U is an m-by-m matrix of the orthonormal eigenvectors of AA^T (so U^T = U^{-1}); V^T is the transpose of an n-by-n matrix containing the orthonormal eigenvectors of A^T A (so V^T = V^{-1}); S is an m-by-n diagonal matrix of the singular values, which are the square roots of the eigenvalues of A^T A.
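A hedged sketch of least squares via the SVD for the same example: with A = USV^T, the minimizer is x = V S^{-1} U^T b (the pseudoinverse applied to b), and the singular values s also show how close to rank-deficient the columns are.

```python
import numpy as np

px = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([6.0, 5.0, 7.0, 10.0])
A = np.column_stack([px, np.ones_like(px)])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # reduced SVD
x = Vt.T @ ((U.T @ b) / s)                        # V S^{-1} U^T b
print(x, s)                                       # x is approximately [1.4, 3.5]
```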
SVD for Least Square
Example
Conjugate Gradient – Solving the linear system Ax = b. Problem: the dimension n is too big, or there is not enough time for Gaussian elimination, so iterative methods are used to get an approximate solution. A is known, square, symmetric, and positive definite. Definition (iterative method): given a starting point x_0, take steps that hopefully converge to the right solution x.
Conjugate Gradient – From now on we assume we want to minimize the quadratic function f(x) = (1/2) x^T A x - b^T x + c. This is equivalent to solving the linear problem Ax = b, since ∇f(x) = Ax - b. There are generalizations to general functions.
Background for Gradient Methods – The min (max) problem: min_x f(x). But we learned in calculus how to solve that kind of problem!
Directional Derivatives
Directional Derivatives: in a general direction…
Directional Derivatives
The Gradient: Definition in the Plane – In R^2: ∇f(x, y) = (∂f/∂x, ∂f/∂y).
The Gradient: Definition – In R^n: ∇f(x) = (∂f/∂x_1, …, ∂f/∂x_n).
The Gradient: Properties – The gradient defines a (hyper)plane approximating the function infinitesimally: f(x + δ) ≈ f(x) + ∇f(x)^T δ.
The Gradient: Properties – By the chain rule (important for later use): d/dα f(x + α d) |_{α=0} = ∇f(x)^T d.
The Gradient: Properties – Proposition 1: over unit directions d, the directional derivative ∇f(x)^T d is maximal for d = ∇f(x)/||∇f(x)|| and minimal for d = -∇f(x)/||∇f(x)||. (Intuitively, the gradient points in the direction of greatest change.)
The Gradient: Properties – We found the best infinitesimal direction at each point. Looking for a minimum: a “blind man” procedure. How can we derive the way to the minimum using this knowledge?
Steepest Descent – Steepest descent algorithm:
Data: starting point x_0, tolerance ε.
Step 0: set i = 0.
Step 1: if ||∇f(x_i)|| ≤ ε, stop; else compute the search direction h_i = -∇f(x_i).
Step 2: compute the step size λ_i minimizing f(x_i + λ h_i) over λ ≥ 0.
Step 3: set x_{i+1} = x_i + λ_i h_i, i = i + 1, and go to Step 1.
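A minimal sketch of steepest descent for the quadratic f(x) = (1/2) x^T A x - b^T x, where the negative gradient is the residual r = b - Ax and the exact line-search step size has the closed form (r^T r)/(r^T A r); the A and b below are an arbitrary small example, not from the slides.

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10_000):
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):
        r = b - A @ x                      # residual = negative gradient = search direction
        if np.linalg.norm(r) <= tol:       # Step 1: stopping test
            break
        lam = (r @ r) / (r @ (A @ r))      # Step 2: exact line search
        x = x + lam * r                    # Step 3: take the step
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(steepest_descent(A, b, np.zeros(2)))  # converges to A^{-1} b = [1/11, 7/11]
```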
Steepest Descent
Steepest descent finds a critical point, a local minimum. Implicit step-size rule: we actually reduced the problem to a one-dimensional minimization, λ_i = argmin_{λ ≥ 0} f(x_i + λ h_i). There are extensions that give the step-size rule in a discrete sense (Armijo).
Conjugate Gradient – Recall: we want to minimize the quadratic function f(x) = (1/2) x^T A x - b^T x + c, which is equivalent to solving the linear problem Ax = b. There are generalizations to general functions.
Conjugate Gradient What is the problem with steepest descent? We can repeat the same directions over and over… Conjugate gradient takes at most n steps.
Conjugate Gradient – We say that two non-zero vectors u and v are conjugate (with respect to A) if u^T A v = 0. If A = I, orthogonal vectors are a special case of conjugate vectors.
Conjugate Gradient – The search directions d_0, d_1, …, d_{n-1} should span R^n.
Conjugate Gradient – Given the search direction d_i, how do we calculate the step size α_i? As before, by exact line search: α_i = (d_i^T r_i) / (d_i^T A d_i), where r_i = b - A x_i is the residual.
Conjugate Gradient – How do we find the directions d_i? We want the error after n steps to be 0: expand the initial error in the basis of search directions, e_0 = Σ_{j=0}^{n-1} δ_j d_j.
Conjugate Gradient – Here is an idea: if each step cancels exactly one component of that expansion, then after n steps the error is e_n = 0. So if we can recover the coefficients δ_j, we are done.
Conjugate Gradient – So we look for directions d_j such that the coefficients δ_j can be computed from quantities known at step j. A simple calculation shows that this works if we take the directions A-conjugate (A-orthogonal).
Conjugate Gradient – We have to find an A-conjugate basis. We can use a “Gram-Schmidt”-like process on some series of vectors, but we should be careful, since in general that is an O(n³) process.
Conjugate Gradient – So for an arbitrary choice of the vector series we gain nothing. Luckily, we can choose the series so that the conjugate-direction calculation is O(m), where m is the number of non-zero entries in A. The correct choice is to build the directions from the residuals r_i.
Conjugate Gradient – The conjugate gradient algorithm for minimizing f:
Data: A, b, starting point x_0.
Step 0: d_0 = r_0 = b - A x_0.
Step 1: α_i = (r_i^T r_i) / (d_i^T A d_i).
Step 2: x_{i+1} = x_i + α_i d_i.
Step 3: r_{i+1} = r_i - α_i A d_i.
Step 4: β_{i+1} = (r_{i+1}^T r_{i+1}) / (r_i^T r_i), d_{i+1} = r_{i+1} + β_{i+1} d_i; repeat at most n times.
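A minimal sketch of the iteration above for a symmetric positive-definite A; in exact arithmetic it reaches the solution of Ax = b in at most n steps, and each iteration costs one matrix-vector product. The small A and b at the end are an arbitrary example.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    r = b - A @ x                        # initial residual
    d = r.copy()                         # first search direction
    for _ in range(n):
        if np.linalg.norm(r) <= tol:
            break
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)       # step size along d
        x = x + alpha * d
        r_new = r - alpha * Ad           # updated residual
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d             # next A-conjugate direction
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))          # approximately [0.0909, 0.6364]
```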
Conjugate Gradient Algorithm (flowchart) – Start with an initial trial point x_0. Find the search direction d_i. Find the step size λ_i that minimizes f(x_i + λ d_i). Is x_{i+1} optimal? If yes, stop; if no, find the next search direction and repeat.