Feature Generation and Cluster-based Feature Selection
Feature Generation – A case study: some data has no obvious features (sequence data) – Feature generation method – Feature Selection
Virus
Virus Recombination
Genome Distance
Previous Work on Genome Distance Measurement – Multiple sequence alignment based – Gene content based – Data compression based
Complete Composition Vector – Composition information embedded in one genome sequence – R: genome sequence of length L – Elements: {A, G, C, T} – K: maximum pattern length – f(a_1 a_2 … a_k): appearance probability of the string a_1 a_2 … a_k in R
Composition Value – Second-Order Markov Model – Expected appearance probability: q(a_1 … a_k) = f(a_1 … a_{k-1}) · f(a_2 … a_k) / f(a_2 … a_{k-1}) – Composition value: CCV(a_1 … a_k) = (f(a_1 … a_k) - q(a_1 … a_k)) / q(a_1 … a_k)
Composition Value – Example: R = ATATCTATATACT (L = 13). f(ATA) = 3/(13-3+1) = 3/11, f(AT) = 4/(13-2+1) = 4/12, f(TA) = 4/(13-2+1) = 4/12, f(T) = 6/13. Expected appearance probability: q(ATA) = f(AT)·f(TA)/f(T) = 13/54. CCV(ATA) = (f(ATA) - q(ATA))/q(ATA) = 19/143.
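A minimal sketch of this composition-value computation, reproducing the worked example above; the helper names kmer_freq and composition_value are illustrative, not from the slides.

```python
def kmer_freq(seq, k):
    """Appearance probability f of every k-mer in seq (sliding window)."""
    total = len(seq) - k + 1
    counts = {}
    for i in range(total):
        w = seq[i:i + k]
        counts[w] = counts.get(w, 0) + 1
    return {w: c / total for w, c in counts.items()}

def composition_value(seq, kmer):
    """CCV entry (f - q) / q, with q from the second-order Markov model."""
    k = len(kmer)
    f_k = kmer_freq(seq, k).get(kmer, 0.0)
    f_km1 = kmer_freq(seq, k - 1)
    f_km2 = kmer_freq(seq, k - 2)
    q = f_km1[kmer[:-1]] * f_km1[kmer[1:]] / f_km2[kmer[1:-1]]
    return (f_k - q) / q

print(composition_value("ATATCTATATACT", "ATA"))  # 0.1328... = 19/143
```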
Complete Composition Vector – Using the complete composition vector to represent the whole genome – String selection
String Selection Score Function (Relative Entropy)
Pair-wise Evolution Distance – Given two sequences R and S – Their composition vectors CCV(R) = (CCV_R(s_1), …, CCV_R(s_n)) and CCV(S) = (CCV_S(s_1), …, CCV_S(s_n)) over the selected strings s_1, …, s_n – Euclidean distance d(R, S) = sqrt( Σ_i (CCV_R(s_i) - CCV_S(s_i))^2 )
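A hedged sketch of the pair-wise evolution distance: the Euclidean distance between two composition vectors, assuming both vectors are indexed by the same selected strings in the same order.

```python
import numpy as np

def evolution_distance(ccv_r, ccv_s):
    """Euclidean distance d(R, S) between two composition vectors."""
    ccv_r, ccv_s = np.asarray(ccv_r, float), np.asarray(ccv_s, float)
    return float(np.linalg.norm(ccv_r - ccv_s))
```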
HIV-1 Genotyping – Identify the HIV-1 genotype – Three major groups: M (further categorized into 9 subtypes: A-D, F-H, J, K), O, and N – Recombinant strains
Neighbor Joining on 42 Pure-Subtype Strains
Genotyping Classifier – Mean-classifier – Leave-one-out cross validation – Independent testing
Genotyping Results – Top 500 strings selected by relative entropy – Pure subtype prediction: LOOCV accuracy on the training dataset 100%; independent test 100% – Myers et al. % – Oliveira et al. % – Rozanov et al. %
Genotyping Results – Recombinant accuracy 87.3%
Feature Selection – Two feature selection categories: – Selecting the topmost features with the most individual discriminatory power – Selecting a group of features with the most overall discriminatory power
Feature Selection – Examples of feature selection algorithms – F-test: compute a score for each feature and select the topmost features with the highest scores – SFS/SFFS: iteratively include one more feature at a time, then try removing each feature from the selected pool to see which of the remaining combinations is the best one.
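A minimal sketch of the first (filter) category: score each feature individually with a one-way ANOVA F-statistic and keep the top k. scikit-learn's f_classif is used here as one possible scorer; X, y, and k are placeholders.

```python
import numpy as np
from sklearn.feature_selection import f_classif

def top_k_by_f_score(X, y, k):
    """Return the indices of the k features with the largest F-scores."""
    f_scores, _ = f_classif(X, y)          # one F-statistic per feature
    return np.argsort(f_scores)[::-1][:k]  # indices of the k best features
```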
Cluster-based Feature Selection – (figure: Team A vs. Team B)
Cluster-based Feature Selection
Cluster-based Feature Selection – Perform feature clustering – Choose from each cluster the topmost features with the strongest individual discriminatory power
Discrimination Power Vector
Class means:
            Class 1      Class 2      Class 3
Feature 1   Mean_1(1)    Mean_1(2)    Mean_1(3)
Feature 2   Mean_2(1)    Mean_2(2)    Mean_2(3)
Discrimination power vectors:
            Class 2 - Class 1          Class 3 - Class 1          Class 3 - Class 2
Feature 1   |Mean_1(2) - Mean_1(1)|    |Mean_1(3) - Mean_1(1)|    |Mean_1(3) - Mean_1(2)|
Feature 2   |Mean_2(2) - Mean_2(1)|    |Mean_2(3) - Mean_2(1)|    |Mean_2(3) - Mean_2(2)|
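A hedged sketch of the cluster-based procedure built on these discrimination power vectors: compute one vector per feature, cluster the features on those vectors, and keep the top feature(s) of each cluster by individual discriminatory power. The concrete choices here (k-means clustering, scoring a feature by the sum of its absolute mean differences) are assumptions for illustration, not necessarily the exact setup from the slides.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def discrimination_power_vectors(X, y):
    """One row per feature; one |mean difference| per class pair."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])  # (classes, features)
    pairs = list(combinations(range(len(classes)), 2))
    return np.array([[abs(means[b, f] - means[a, f]) for a, b in pairs]
                     for f in range(X.shape[1])])

def cluster_based_selection(X, y, n_clusters=10, per_cluster=1):
    dpv = discrimination_power_vectors(X, y)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(dpv)
    score = dpv.sum(axis=1)                  # individual discriminatory power
    selected = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        best = members[np.argsort(score[members])[::-1][:per_cluster]]
        selected.extend(best.tolist())
    return sorted(selected)
```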
FMDV Genotyping – Datasets: Foot-and-Mouth Disease Virus (7 subtypes) – Genotypers: linear-kernel support vector machine (SVM) genotyper; mean-genotyper
Genotyping Results
Experiments with the Cluster-Based Method
Linear Regression
Application: Credit Approval (Credit Limit) – x_1 = Score, x_2 = Salary, x_3 = Status. Y = 0.5 x_1; Y = 0.5 x_1 + x_2/100; Y = 0.5 x_1 + x_2/100 + 20 x_3.
Linear Fitting of the Data – We want to fit a linear function to an observed set of points X = [x_1, x_2, x_3, …, x_n] with associated labels Y = [y_1, y_2, y_3, …, y_n]. After we fit the function, we can use it to predict a new y for a new x. Least squares: find the function that minimizes the sum (equivalently, the average) of squared distances between the actual y's in the training set and the predicted ones. The fitted line y = ax + b is used as the predictor.
Linear Function – Linear least squares fitting. General form: f(x) = w_0 + w_1 x_1 + … + w_d x_d. 1D case (X = R): a line. 2D case (X = R^2): a plane.
Loss function – Suppose target labels come from a set Y: binary classification Y = {0, 1}; regression Y = R (the real numbers). A loss function L(ŷ, y) maps decisions to costs: it defines the penalty for predicting ŷ when the true value is y. Standard choice for classification: the 0/1 loss L(ŷ, y) = 1 if ŷ ≠ y, else 0 (same as misclassification error). Standard choice for regression: the squared loss L(ŷ, y) = (ŷ - y)^2.
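A minimal sketch of the two standard losses named above; y_hat is the prediction and y the true value.

```python
def zero_one_loss(y_hat, y):
    """0/1 loss for classification: 1 if the prediction is wrong, else 0."""
    return 0 if y_hat == y else 1

def squared_loss(y_hat, y):
    """Squared loss for regression."""
    return (y_hat - y) ** 2
```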
Linear Fitting of the Data – Minimize the error function E(a, b) = Σ_i (y_i - (a x_i + b))^2 over the training points (x_i, y_i).
Example – Use least squares to find the equation of the line y = ax + b that best approximates the three points (2,0), (-1,1), and (0,2).
Example – Minimize the error function E(a, b) = (2a + b - 0)^2 + (-a + b - 1)^2 + (b - 2)^2. To find the optimal a, b, set the derivatives w.r.t. a and b equal to zero: 10a + 2b + 2 = 0 and a + 3b - 3 = 0, giving a = -3/7 and b = 8/7; therefore y = -(3/7)x + 8/7.
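A quick numerical check of this worked example with NumPy's least squares solver (an aside of this sketch, not part of the slides): fitting y = ax + b to (2,0), (-1,1), (0,2) should reproduce a = -3/7 and b = 8/7.

```python
import numpy as np

x = np.array([2.0, -1.0, 0.0])
y = np.array([0.0, 1.0, 2.0])
A = np.column_stack([x, np.ones_like(x)])      # design matrix [x, 1]
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a, b)                                    # about -0.4286 (-3/7) and 1.1429 (8/7)
```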
Example – Use least squares to find the equation of the line y = ax + b that best approximates the points (1,6), (2,5), (3,7), and (4,10). Result: y = 1.4x + 3.5.
Approaches to solve Ax ≈ b – Normal equations: quick and dirty – QR: standard in libraries, uses an orthogonal decomposition – SVD: a decomposition that also indicates how linearly independent the columns are – Conjugate gradient: no decomposition, good for large sparse problems
Ax = b – For example, for the points (1,6), (2,5), (3,7), and (4,10): A = [[1 1], [2 1], [3 1], [4 1]], x = [a b]^T, b = [6 5 7 10]^T. Normal system of equations, general form: A^T A x = A^T b.
Least Square: Ax = b ⇒ A^T A x = A^T b ⇒ x = (A^T A)^{-1} A^T b
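A hedged sketch of the normal-equations route x = (A^T A)^{-1} A^T b applied to the four-point example from the previous slide; in practice one solves A^T A x = A^T b rather than forming an explicit inverse.

```python
import numpy as np

px = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([6.0, 5.0, 7.0, 10.0])
A = np.column_stack([px, np.ones_like(px)])   # columns: x and the intercept
x = np.linalg.solve(A.T @ A, A.T @ b)         # solves A^T A x = A^T b
print(x)                                      # approximately [1.4, 3.5]
```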
QR Factorization – A matrix Q is said to be orthogonal if its columns are orthonormal, i.e. Q^T Q = I. Orthogonal transformations preserve the Euclidean norm, since ||Qx||^2 = x^T Q^T Q x = x^T x = ||x||^2. Orthogonal matrices can transform vectors in various ways, such as rotations or reflections, but they do not change the Euclidean length of a vector. Hence, they preserve the solution to a linear least squares problem.
QR Factorization – If A is an m×n matrix with linearly independent columns, then A can be decomposed as A = QR, where Q is an m×n matrix whose columns form an orthonormal basis for the column space of A (so (q_i, q_j) = 0 for i ≠ j and |q_i| = 1), and R is an n×n nonsingular upper triangular matrix.
Gram-Schmidt Orthonormalization Process – Given a linearly independent set a = {a_1, a_2, …, a_n}, set β_1 = a_1 and β_k = a_k - Σ_{j<k} ((a_k, β_j)/(β_j, β_j)) β_j, so that (β_i, β_j) = 0 for i ≠ j; then normalize θ_k = β_k / |β_k|.
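A minimal sketch of the (classical) Gram-Schmidt process applied to the columns of A, returning Q with orthonormal columns and an upper triangular R with A = QR; the columns of A are assumed linearly independent. Modified Gram-Schmidt or Householder reflections are numerically safer in practice.

```python
import numpy as np

def gram_schmidt_qr(A):
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        v = A[:, k].copy()
        for j in range(k):
            R[j, k] = Q[:, j] @ A[:, k]   # projection coefficient (a_k, q_j)
            v -= R[j, k] * Q[:, j]        # remove the component along q_j
        R[k, k] = np.linalg.norm(v)
        Q[:, k] = v / R[k, k]             # normalize to unit length
    return Q, R
```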
QR Factorization
We get A = UR, where U = (θ_1, θ_2, θ_3, …, θ_n) has orthonormal columns (this U plays the role of Q) and R is upper triangular.
Example of QR Factorization
Example of QR Factorization – Applying the Gram-Schmidt process to compute the QR decomposition, step by step (steps 1–6).
Therefore, A = QR. The QR decomposition is widely used in numerical software to find the eigenvalues of a matrix, to solve linear systems, and to compute least squares approximations.
Least square solution using QR Decomposition
1. Least squares normal equations: Ax = b, so A^T A x = A^T b
2. Substitute A = QR: R^T Q^T Q R x = R^T Q^T b
3. Since Q^T Q = I: R^T R x = R^T Q^T b
4. R is nonsingular, so: R x = Q^T b (solve by back substitution)
Least square solution using QR Decomposition – Running Time
1. QR factorization of A: A = QR (2mn^2 flops)
2. Form d = Q^T b (2mn flops)
3. Solve Rx = d by back substitution (n^2 flops)
Cost for large m, n: 2mn^2 flops
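A hedged sketch of least squares via QR for the same four-point example: factor A = QR, then solve Rx = Q^T b by back substitution (SciPy's triangular solver stands in for the back-substitution step).

```python
import numpy as np
from scipy.linalg import solve_triangular

px = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([6.0, 5.0, 7.0, 10.0])
A = np.column_stack([px, np.ones_like(px)])

Q, R = np.linalg.qr(A)              # reduced QR: Q is 4x2, R is 2x2 upper triangular
x = solve_triangular(R, Q.T @ b)    # back substitution on R x = Q^T b
print(x)                            # approximately [1.4, 3.5]
```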
QR for Least Square
Singular Value Decomposition – A = USV^T. The singular values are the diagonal entries of the S matrix and are arranged in descending order. The singular values are always real, non-negative numbers. If A is a real matrix, U and V are also real.
Singular Value Decomposition – The SVD of an m-by-n matrix A is given by the formula A = USV^T, where: U is an m-by-m matrix of the orthonormal eigenvectors of AA^T (so U^T = U^{-1}); V^T is the transpose of an n-by-n matrix containing the orthonormal eigenvectors of A^T A (so V^T = V^{-1}); S is an m-by-n diagonal matrix of the singular values, which are the square roots of the eigenvalues of A^T A.
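A hedged sketch of least squares via the SVD for the same example: with A = USV^T, the minimizer is x = V S^{-1} U^T b (the pseudoinverse applied to b), and the singular values s also show how close to rank-deficient the columns are.

```python
import numpy as np

px = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([6.0, 5.0, 7.0, 10.0])
A = np.column_stack([px, np.ones_like(px)])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # reduced SVD
x = Vt.T @ ((U.T @ b) / s)                        # V S^{-1} U^T b
print(x, s)                                       # x is approximately [1.4, 3.5]
```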
SVD for Least Square
Example
Conjugate Gradient – Solving the linear system Ax = b. Problem: the dimension n is too big, or there is not enough time for Gaussian elimination, so iterative methods are used to get an approximate solution. A is known, square, symmetric, and positive definite. Definition (iterative method): given a starting point x_0, take steps that hopefully converge to the right solution x.
Conjugate Gradient – From now on we assume we want to minimize the quadratic function f(x) = (1/2) x^T A x - b^T x + c. This is equivalent to solving the linear problem Ax = b, since ∇f(x) = Ax - b. There are generalizations to general functions.
Background for Gradient Methods – The min (max) problem: min_x f(x). But we learned in calculus how to solve that kind of problem!
Directional Derivatives
Directional Derivatives: in a general direction…
Directional Derivatives
The Gradient: Definition in the Plane – In R^2: ∇f(x, y) = (∂f/∂x, ∂f/∂y).
The Gradient: Definition – In R^n: ∇f(x) = (∂f/∂x_1, …, ∂f/∂x_n).
The Gradient: Properties – The gradient defines a (hyper)plane approximating the function infinitesimally: f(x + δ) ≈ f(x) + ∇f(x)^T δ.
The Gradient: Properties – By the chain rule (important for later use): d/dα f(x + α d) |_{α=0} = ∇f(x)^T d.
The Gradient: Properties – Proposition 1: over unit directions d, the directional derivative ∇f(x)^T d is maximal for d = ∇f(x)/||∇f(x)|| and minimal for d = -∇f(x)/||∇f(x)||. (Intuitively, the gradient points in the direction of greatest change.)
The Gradient: Properties – We found the best infinitesimal direction at each point. Looking for a minimum: a “blind man” procedure. How can we derive the way to the minimum using this knowledge?
Steepest Descent – Steepest descent algorithm:
Data: starting point x_0, tolerance ε.
Step 0: set i = 0.
Step 1: if ||∇f(x_i)|| ≤ ε, stop; else compute the search direction h_i = -∇f(x_i).
Step 2: compute the step size λ_i minimizing f(x_i + λ h_i) over λ ≥ 0.
Step 3: set x_{i+1} = x_i + λ_i h_i, i = i + 1, and go to Step 1.
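A minimal sketch of steepest descent for the quadratic f(x) = (1/2) x^T A x - b^T x, where the negative gradient is the residual r = b - Ax and the exact line-search step size has the closed form (r^T r)/(r^T A r); the A and b below are an arbitrary small example, not from the slides.

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10_000):
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):
        r = b - A @ x                      # residual = negative gradient = search direction
        if np.linalg.norm(r) <= tol:       # Step 1: stopping test
            break
        lam = (r @ r) / (r @ (A @ r))      # Step 2: exact line search
        x = x + lam * r                    # Step 3: take the step
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(steepest_descent(A, b, np.zeros(2)))  # converges to A^{-1} b = [1/11, 7/11]
```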
Steepest Descent
Steepest descent finds a critical point, a local minimum. Implicit step-size rule: we actually reduced the problem to a one-dimensional minimization, λ_i = argmin_{λ ≥ 0} f(x_i + λ h_i). There are extensions that give the step-size rule in a discrete sense (Armijo).
Conjugate Gradient – Recall: we want to minimize the quadratic function f(x) = (1/2) x^T A x - b^T x + c, which is equivalent to solving the linear problem Ax = b. There are generalizations to general functions.
Conjugate Gradient What is the problem with steepest descent? We can repeat the same directions over and over… Conjugate gradient takes at most n steps.
Conjugate Gradient – We say that two non-zero vectors u and v are conjugate (with respect to A) if u^T A v = 0. If A = I, orthogonal vectors are a special case of conjugate vectors.
Conjugate Gradient – The search directions d_0, d_1, …, d_{n-1} should span R^n.
Conjugate Gradient – Given the search direction d_i, how do we calculate the step size α_i? As before, by exact line search: α_i = (d_i^T r_i) / (d_i^T A d_i), where r_i = b - A x_i is the residual.
Conjugate Gradient – How do we find the directions d_i? We want the error after n steps to be 0: expand the initial error in the basis of search directions, e_0 = Σ_{j=0}^{n-1} δ_j d_j.
Conjugate Gradient – Here is an idea: if each step cancels exactly one component of that expansion, then after n steps the error is e_n = 0. So if we can recover the coefficients δ_j, we are done.
Conjugate Gradient – So we look for directions d_j such that the coefficients δ_j can be computed from quantities known at step j. A simple calculation shows that this works if we take the directions A-conjugate (A-orthogonal).
Conjugate Gradient – We have to find an A-conjugate basis. We can use a “Gram-Schmidt”-like process on some series of vectors, but we should be careful, since in general that is an O(n³) process.
Conjugate Gradient – So for an arbitrary choice of the vector series we gain nothing. Luckily, we can choose the series so that the conjugate-direction calculation is O(m), where m is the number of non-zero entries in A. The correct choice is to build the directions from the residuals r_i.
Conjugate Gradient – The conjugate gradient algorithm for minimizing f:
Data: A, b, starting point x_0.
Step 0: d_0 = r_0 = b - A x_0.
Step 1: α_i = (r_i^T r_i) / (d_i^T A d_i).
Step 2: x_{i+1} = x_i + α_i d_i.
Step 3: r_{i+1} = r_i - α_i A d_i.
Step 4: β_{i+1} = (r_{i+1}^T r_{i+1}) / (r_i^T r_i), d_{i+1} = r_{i+1} + β_{i+1} d_i; repeat at most n times.
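A minimal sketch of the iteration above for a symmetric positive-definite A; in exact arithmetic it reaches the solution of Ax = b in at most n steps, and each iteration costs one matrix-vector product. The small A and b at the end are an arbitrary example.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    r = b - A @ x                        # initial residual
    d = r.copy()                         # first search direction
    for _ in range(n):
        if np.linalg.norm(r) <= tol:
            break
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)       # step size along d
        x = x + alpha * d
        r_new = r - alpha * Ad           # updated residual
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d             # next A-conjugate direction
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))          # approximately [0.0909, 0.6364]
```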
Conjugate Gradient Algorithm (flowchart) – Start with an initial trial point x_0. Find the search direction d_i. Find the step size λ_i that minimizes f(x_i + λ d_i). Is x_{i+1} optimal? If yes, stop; if no, find the next search direction and repeat.