1
Presented at the Albany Chapter of the ASA, February 25, 2004, Washington DC
2
Magnetocardiography at CardioMag Imaging, Inc., with Bolek Szymanski and Karsten Sternickel
3
Left: Filtered and averaged temporal MCG traces for one cardiac cycle in 36 channels (the 6x6 grid). Right Upper: Spatial map of the cardiac magnetic field, generated at an instant within the ST interval. Right Lower: T3-T4 sub-cycle in one MCG signal trace
4
Classical (Linear) Regression Analysis: predict y from X_{n x m} using the pseudo-inverse (prediction model: y_hat = X w, with w = pinv(X) y). Can we apply this "wisdom" to data and forecast them correctly? Here n = 19 and m = 7: 19 data records and 7 attributes (1 response).
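A minimal sketch of this prediction model, assuming NumPy and using random placeholder data rather than the 19x7 data set from the slide:

```python
import numpy as np

# Illustrative sketch (not the original data): fit a linear model y ~ X w
# for n = 19 records and m = 7 attributes via the Penrose pseudo-inverse.
rng = np.random.default_rng(0)
n, m = 19, 7
X = rng.standard_normal((n, m))                                  # placeholder attribute matrix X_{n x m}
y = X @ rng.standard_normal(m) + 0.1 * rng.standard_normal(n)    # placeholder response

w = np.linalg.pinv(X) @ y            # the "wisdom" vector from the pseudo-inverse
y_hat = X @ w                        # prediction model
print("training RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))
```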
5
Fundamental Machine Learning Paradox
Learning occurs because of redundancy (patterns) in the data.
Machine Learning Paradox: if the data contain redundancies, (i) we can learn from the data, but (ii) the "feature kernel matrix" K_F = X^T X is ill-conditioned.
How to resolve the Machine Learning Paradox?
- (i) fix the rank deficiency of K_F with principal components (PCA)
- (ii) regularization: use K_F + λI instead of K_F (ridge regression)
- (iii) local learning
6
Principal Component Regression (PCR): replace X_{n x m} by T_{n x h}, where T_{n x h} holds the principal components: the projection of the n data records on the h "most important" eigenvectors of the feature kernel K_F.
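A short PCR sketch, assuming X and y are already Mahalanobis scaled; it keeps the h eigenvectors of the feature kernel X^T X with the largest eigenvalues:

```python
import numpy as np

def pcr_fit(X, y, h):
    """Principal Component Regression: regress y on the h leading
    principal components T_{n x h} instead of on X_{n x m} directly."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)     # feature kernel K_F = X^T X
    order = np.argsort(eigvals)[::-1][:h]          # h largest eigenvalues
    B = eigvecs[:, order]                          # m x h eigenvector directions
    T = X @ B                                      # n x h principal-component scores
    coef = np.linalg.pinv(T) @ y                   # well-conditioned regression on T
    return B, coef

def pcr_predict(X_new, B, coef):
    return (X_new @ B) @ coef
```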
7
Ridge Regression in Data Space
- The "wisdom" is now obtained from the right-hand (Penrose) inverse: w = X^T (X X^T)^{-1} y
- A ridge term is added to resolve the learning paradox: w = X^T (K_D + λI)^{-1} y
- Data kernel: K_D = X X^T
- Needs kernels only
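A minimal sketch of ridge regression in data space, assuming a linear data kernel K_D = X X^T and a small ridge parameter lam chosen by the user:

```python
import numpy as np

def data_space_ridge_fit(X, y, lam=1e-3):
    """Ridge regression in data space: alpha = (K_D + lam*I)^{-1} y,
    where K_D = X X^T is the (linear) data kernel; w = X^T alpha."""
    K_D = X @ X.T
    alpha = np.linalg.solve(K_D + lam * np.eye(K_D.shape[0]), y)
    return alpha                               # dual weights (one per data record)

def data_space_ridge_predict(X_train, alpha, X_new):
    # prediction needs kernels only: k(x_new, x_i) = x_new . x_i
    return (X_new @ X_train.T) @ alpha
```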
8
Implementing Direct Kernel Methods
Linear Model:
- PCA model
- PLS model
- Ridge Regression
- Self-Organizing Map...
9
What have we learned so far?
There is a "learning paradox" because of redundancies in the data.
We resolved this paradox by "regularization":
- In the case of PCA we used the eigenvectors of the feature kernel
- In the case of ridge regression we added a ridge to the data kernel
So far prediction models involved only linear algebra and were strictly linear.
What is in a kernel? The data kernel contains linear similarity measures (correlations) of data records: K_ij = x_i · x_j
10
Kernels: What is a kernel?
- The data kernel expresses a similarity measure between data records
- So far, the kernel contains linear similarity measures: linear kernel k(x_i, x_j) = x_i · x_j
- We can actually make up nonlinear similarity measures as well
- Radial Basis Function kernel (nonlinear, based on the distance or difference between records): k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))
11
Review: What is in a Kernel?
A kernel can be considered as a (nonlinear) data transformation.
- Many different choices for the kernel are possible
- The Radial Basis Function (RBF) or Gaussian kernel is an effective nonlinear kernel
The RBF or Gaussian kernel is a symmetric matrix.
- Entries reflect nonlinear similarities amongst data descriptions
- As defined by: K_ij = exp(-||x_i - x_j||^2 / (2σ^2))
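A small helper that builds such an RBF kernel matrix; sigma is a user-chosen width, not a value prescribed by the slides:

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    """Gaussian / RBF kernel matrix: K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)).
    Entries are nonlinear similarity measures between data records."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    sq_dist = np.maximum(sq_dist, 0.0)       # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * sigma**2))
```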
12
Direct Kernel Methods for Nonlinear Regression/Classification
Consider the kernel as a (nonlinear) data transformation:
- This is the so-called "kernel trick" (Hilbert, early 1900s)
- The Radial Basis Function (RBF) or Gaussian kernel is an efficient nonlinear kernel
Linear regression models can be "tricked" into nonlinear models by applying them to kernel-transformed data (sketched below):
- PCA → DK-PCA
- PLS → DK-PLS (Partial Least Squares Support Vector Machines)
- (Direct) Kernel Ridge Regression → Least Squares Support Vector Machines
- Direct Kernel Self-Organizing Maps (DK-SOM)
These methods work in the same space as SVMs:
- DK models can usually also be derived from an optimization formulation (similar to SVMs)
- Unlike the original SVMs, DK methods are not sparse (i.e., all data are support vectors)
- Unlike SVMs, there is no patent on direct kernel methods
- Performance on hundreds of benchmark problems compares favorably with SVMs
Classification can be considered as a special case of regression.
Data pre-processing: data are usually Mahalanobis scaled first.
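A sketch of the direct-kernel recipe (Mahalanobis scaling, kernel transform, then an ordinary linear learner), here a DK-PLS built on scikit-learn's PLSRegression and rbf_kernel as assumed stand-ins for the Analyze/StripMiner implementation:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics.pairwise import rbf_kernel

def mahalanobis_scale(X, mean=None, std=None):
    # column-wise centering and unit-variance scaling ("Mahalanobis scaling" in the slides)
    mean = X.mean(axis=0) if mean is None else mean
    std = X.std(axis=0, ddof=1) if std is None else std
    return (X - mean) / std, mean, std

def dk_pls_fit(X_train, y_train, sigma=1.0, n_components=5):
    Xs, mean, std = mahalanobis_scale(X_train)
    K = rbf_kernel(Xs, Xs, gamma=1.0 / (2.0 * sigma**2))   # kernel transform as pre-processing
    model = PLSRegression(n_components=n_components).fit(K, y_train)
    return model, (mean, std, Xs, sigma)

def dk_pls_predict(model, state, X_new):
    mean, std, Xs, sigma = state
    Xn, _, _ = mahalanobis_scale(X_new, mean, std)          # reuse the training scaling factors
    return model.predict(rbf_kernel(Xn, Xs, gamma=1.0 / (2.0 * sigma**2)))
```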
13
Nonlinear PCA in Kernel Space
Like PCA, but consider a nonlinear data kernel transformation up front (the data kernel), and derive principal components for that kernel (e.g., with NIPALS).
Examples:
- Haykin's Spiral
- Cherkassky's nonlinear function model
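A minimal kernel-PCA sketch; it centers the data kernel and takes its leading eigenvectors directly, a small-scale shortcut for the NIPALS iteration mentioned on the slide:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_pca_scores(X, n_components=2, gamma=1.0):
    """Nonlinear PCA in kernel space: build the data kernel up front,
    center it, and keep the leading eigenvectors."""
    K = rbf_kernel(X, X, gamma=gamma)
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = J @ K @ J
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    # component scores: eigenvectors scaled by the square roots of their eigenvalues
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
```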
14
PCA Example: Haykin's Spiral (demo: haykin1)
15
Linear PCR Example: Haykin’s Spiral (demo: haykin2)
16
K-PCR Example: Haykin's Spiral, with 3 PCAs and 12 PCAs (demo: haykin3)
17
Scaling, centering & making the test kernel centering consistent (flow chart):
- Training data → Mahalanobis-scaled training data → kernel-transformed training data → centered direct kernel (training data)
- The Mahalanobis scaling factors and the vertical kernel centering factors from the training data are stored and reused on the test data
- Test data → Mahalanobis-scaled test data → kernel-transformed test data → centered direct kernel (test data)
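A sketch of the centering bookkeeping implied by this flow chart: the training kernel's centering factors are stored and reused so the test kernel is centered consistently (this is standard kernel centering; the slides' exact recipe may differ in detail):

```python
import numpy as np

def center_train_kernel(K_train):
    """Center the direct (training) kernel and keep the centering factors."""
    col_means = K_train.mean(axis=0)                 # vertical kernel centering factors
    row_means = K_train.mean(axis=1, keepdims=True)
    total_mean = K_train.mean()
    Kc = K_train - col_means - row_means + total_mean
    return Kc, col_means, total_mean

def center_test_kernel(K_test, col_means, total_mean):
    """Center the test kernel consistently: subtract the *training* column means
    and each test row's own mean, then add back the training grand mean."""
    row_means = K_test.mean(axis=1, keepdims=True)
    return K_test - col_means - row_means + total_mean
```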
18
36 MCG T3-T4 Traces
Preprocessing:
- horizontal Mahalanobis scaling
- D4 wavelet transform
- vertical Mahalanobis scaling (features and response)
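A hedged sketch of this preprocessing chain using PyWavelets; 'db4' is an assumption for the slide's D4 wavelet (swap in the variant actually used), and the response would be scaled in the same vertical pass:

```python
import numpy as np
import pywt  # PyWavelets

def preprocess_traces(traces):
    """traces: array of shape (n_traces, n_samples)."""
    # horizontal Mahalanobis scaling: each trace to zero mean, unit variance
    traces = (traces - traces.mean(axis=1, keepdims=True)) / traces.std(axis=1, keepdims=True)
    # wavelet transform of every trace, coefficients concatenated into a feature vector
    feats = np.array([np.concatenate(pywt.wavedec(t, "db4")) for t in traces])
    # vertical Mahalanobis scaling: each wavelet feature to zero mean, unit variance
    return (feats - feats.mean(axis=0)) / feats.std(axis=0)
```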
19
[Comparison panels: SVMLib, Linear PCA, Direct Kernel PLS, SVMLib]
20
Direct Kernel PLS with 3 Latent Variables
21
Predictions on Test Cases with K-PLS
22
K-PLS Predictions After Removing 14 Outliers
23
Benchmark Predictions on Test Cases
24
Direct Kernel with Robert Bress and Thanakorn Naenna
25
www.drugmining.com Kristin Bennett and Mark Embrechts
26
Docking Ligands is a Nonlinear Problem
27
[Slide background: a long genomic DNA sequence (GATCAATGAGGTGGACACCAGAGGCGGGGACTTGTAAATAACACTGGG...), shown twice] WORK IN PROGRESS
30
Direct Kernel Partial Least Squares (K-PLS)
[Diagram: inputs x1, x2, x3 → latent variables t1, t2 → response y]
Direct Kernel PLS is PLS with the kernel transform as a pre-processing step.
- Consider K-PLS as a "better" nonlinear PLS
- Consider PLS as a "better" PCA
K-PLS gives almost identical (but more stable) results to SVMs:
- PLS is the method of choice for chemometrics and QSAR drug design
- hyper-parameters are easy to tune (5 latent variables)
- unlike SVMs, there is no patent on K-PLS
31
What have we learned so far?
There is a "learning paradox" because of redundancies in the data.
We resolved this paradox by "regularization":
- In the case of PCA we used the eigenvectors of the feature kernel
- In the case of ridge regression we added a ridge to the data kernel
So far prediction models involved only linear algebra and were strictly linear.
What is in a kernel? The data kernel contains linear similarity measures (correlations) of data records: K_ij = x_i · x_j
32
Kernels: What is a kernel?
- The data kernel expresses a similarity measure between data records
- So far, the kernel contains linear similarity measures: linear kernel k(x_i, x_j) = x_i · x_j
- We can actually make up nonlinear similarity measures as well
- Radial Basis Function kernel (nonlinear, based on the distance or difference between records): k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))
33
PCR in Feature Space (network view)
[Diagram: inputs x1 ... xm feeding layers of summation nodes]
- Weights correspond to the H eigenvectors with the largest eigenvalues of X^T X; the projections on the eigenvectors are divided by the corresponding variance (cf. Mahalanobis scaling)
- Weights correspond to the scores (PCAs) for the entire training set; this layer gives a weighted similarity score with each data point
- Weights correspond to the dependent variable for the entire training data: a kind of nearest-neighbor weighted prediction score
34
PCR in Feature Space (continued)
[Diagram: inputs x1 ... xm → weights w1 ... wh → latent variables t1 ... th → y]
- Weights correspond to the H eigenvectors with the largest eigenvalues of X^T X
- Principal components can be thought of as a data pre-processing step
- Rather than building a model for an m-dimensional input vector x, we now have an h-dimensional vector t
35
Predictions on Test Cases with DK-SOM: use of a direct kernel self-organizing map in testing mode for the detection of patients with ischemia (red patient IDs). The darker hexagons, colored during a separate training phase, represent nodes corresponding to ischemia cases.
36
Outlier/Novelty Detection Methods in Analyze/StripMiner
- One-class SVM with LibSVM, with auto-tuning for regularization
- Outliers flagged on Self-Organizing Maps (SOMs and DK-SOMs)
- Extended pharmaplots:
  - PCA-based pharmaplot
  - PLS-based pharmaplot
  - K-PLS-based pharmaplot
  - K-PCA-based pharmaplot
Will explore outlier detection options with CardioMag data:
- 1152 mixed wavelet descriptors
- 74 training data and 10 test data
37
Outlier Detection Procedure in Analyze
- Start
- One-class SVM on training data (proprietary regularization mechanism)
- Determine the number of outliers from the elbow plot
- Eliminate the outliers from the training set (output: list of outlier pattern IDs)
- Run K-PLS for the new training/test data
- See whether the outliers make sense on pharmaplots (outliers are flagged in the pharmaplots)
- Inspect outlier clusters on SOMs
- End
38
Tagging Outliers on Pharmaplot with Analyze Code
39
"Elbow" Plot for Specifying # Outliers: the "elbows" suggest 7-14 outliers
40
One-Class SVM Results for MCG Data
41
Outlier/Novelty Detection Methods in Analyze: Hypotheses
- One-class SVMs are commonly cited for outlier detection (e.g., Suykens)
  - used publicly available SVM code (LibSVM)
  - Analyze has user-friendly interface operators for using LibSVM
- Proprietary heuristic tuning for C in SVMs
  - heuristic tuning method explained in previous publications
  - heuristic tuning is essential to make outlier detection work properly
- "Elbow" curves for indicating the number of outliers
- Pharmaplots justify/validate detection from different methods
- Pharmaplots extended to PLS, K-PCA, and K-PLS pharmaplots
42
One-Class SVM: Brief Theory
Well-known method for outlier & novelty detection in the SVM literature (e.g., see Suykens).
LibSVM, a publicly available SVM code for general use, has a one-class SVM option built in (see Chih-Chung Chang and Chih-Jen Lin). Analyze has operators to interface with LibSVM.
Theory:
- One-class SVM ignores the response (assumes all responses are zero)
- Maximizes the spread and subtracts a regularization term
- Suykens (p. 203) has the corresponding formulation, with a regularization parameter; Analyze has a proprietary way to determine it
Application:
- Analyze combines one-class SVMs with pharmaplots to see whether outliers can be explained and make sense
- Analyze has elbow curves to assist the user in determining the number of outliers
- The combination of one-class SVMs with pharmaplots gave excellent results on several industrial (non-pharmaceutical) data sets
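A rough stand-in for this setup using scikit-learn's OneClassSVM (which wraps LibSVM); nu and gamma below are placeholder settings, not the proprietary auto-tuned regularization used in Analyze:

```python
import numpy as np
from sklearn.svm import OneClassSVM

def flag_outliers(X_train, nu=0.1, gamma="scale"):
    """One-class SVM outlier flagging: the response is ignored and the model
    only sees the descriptors."""
    ocsvm = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train)
    scores = ocsvm.decision_function(X_train)    # lower score = more outlier-like
    flags = ocsvm.predict(X_train) == -1         # -1 marks flagged outliers
    return flags, scores

# Sorting the scores (np.sort(scores)) gives the kind of "elbow" curve
# used on the following slides to pick the number of outliers.
```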
43
NIPALS ALGORITHM FOR PLS (with just one response variable y)
Start for a PLS component: calculate the weight w = X^T y / ||X^T y||
- Calculate the score t: t = X w
- Calculate c': c' = t^T y / (t^T t)
- Calculate the loading p: p = X^T t / (t^T t)
- Store t in T, store p in P, store w in W
- Deflate the data matrix and the response variable: X ← X - t p^T, y ← y - c' t
Do for h latent variables.
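A compact NIPALS-PLS1 sketch following these steps, assuming X and y are already centered (e.g., Mahalanobis scaled):

```python
import numpy as np

def nipals_pls1(X, y, h):
    """NIPALS for PLS with one response y: weight w -> score t -> coefficient c'
    -> loading p -> deflation, repeated for h latent variables."""
    X, y = X.astype(float).copy(), y.astype(float).copy()
    T, P, W, C = [], [], [], []
    for _ in range(h):
        w = X.T @ y
        w /= np.linalg.norm(w)            # weight vector (normalized)
        t = X @ w                         # score t
        c = (t @ y) / (t @ t)             # c': regress y on t
        p = X.T @ t / (t @ t)             # loading p
        X -= np.outer(t, p)               # deflate the data matrix
        y -= c * t                        # deflate the response
        T.append(t); P.append(p); W.append(w); C.append(c)
    T, P, W, C = np.column_stack(T), np.column_stack(P), np.column_stack(W), np.array(C)
    beta = W @ np.linalg.solve(P.T @ W, C)   # regression vector: y_hat = X_new @ beta
    return T, P, W, C, beta
```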
44
Outlier/Novelty Detection Methods in Analyze
Outlier detection methods were extensively tested:
- on a variety of different UCI data sets
- models sometimes showed significant improvement after removal of outliers
- models were rarely worse
- outliers could be validated on pharmaplots and led to enhanced insight
The pharmaplots confirm the validity of outlier detection with the one-class SVM.
Prediction on the test set for the albumin data improves the model.
A non-pharmaceutical (medical) data set actually shows two data points in the training set that were probably given wrong labels (Appendix A).
45
[Figure: cardiac cycle trace with the P, Q, R, S, T waves labeled]
46
Innovations in Analyze for Outlier Detection
- User-friendly procedure with automated processes
- Interface for one-class SVM from LibSVM
- Automated tuning for regularization parameters
- Elbow plots to determine the number of outliers
- Combination of LibSVM outliers with pharmaplots:
  - efficient visualization of outliers
  - facilitates interpretation of outliers
- Extended pharmaplots: PCA, K-PCA, PLS, K-PLS
- User-friendly and efficient SOM with outlier identification
- Direct-kernel-based outlier detection as an alternative to LibSVM
47
KERNEL PLS (K-PLS)
Invented by Rosipal and Trejo (Journal of Machine Learning Research, December 2001). Can be considered as the poor man's support vector machine (SVM).
Linear PLS:
- w_1 is the eigenvector of X^T Y Y^T X; t_1 is the eigenvector of X X^T Y Y^T
- the w's and t's of the deflations: w's are orthonormal, t's are orthogonal, p's are not orthogonal but are orthogonal to earlier w's
Kernel PLS:
- They first altered linear PLS to deal with eigenvectors of X X^T, and made the NIPALS PLS formulation resemble PCA more
- The trick is a different normalization: now the t's rather than the w's are normalized
- t_1 is the eigenvector of K(X X^T) Y Y^T, and the w's and t's come from deflations of the kernel rather than X X^T
- The nonlinear kernel-based correlation matrix K(X X^T) is used rather than X X^T; it contains nonlinear similarities of data points rather than linear correlations
- An example is the Gaussian kernel similarity measure: K_ij = exp(-||x_i - x_j||^2 / (2σ^2))
48
Principal Component Analysis (PCA)
[Diagram: inputs x1, x2, x3 → principal components t1, t2 → y]
- We introduce a modest set of the h most important principal components, T_{n x h}
- Replace the data X_{n x m} by the most important principal components T_{n x h}
- The most important T's are the ones corresponding to the largest eigenvalues of X^T X
- The B's are the eigenvectors of X^T X, ordered from largest to smallest eigenvalue
- In practice the calculation of the B's and T's proceeds iteratively with the NIPALS algorithm
- NIPALS: Nonlinear Iterative Partial Least Squares (Herman Wold)
49
Partial Least Squares (PLS)
[Diagram: inputs x1, x2, x3 → latent variables t1, t2 → y]
- Similar to PCA
- PLS: Partial Least Squares / Projection to Latent Structures / "Please Listen to Svante"
- The T's are now called scores or latent variables, and the p's are the loading vectors
- The loading vectors are no longer orthogonal and are influenced by the y vector
- A special version of NIPALS is also used to build up the t's
50
Kernel PLS (K-PLS)
[Diagram: inputs x1, x2, x3 → latent variables t1, t2 → y]
- Invented by Rosipal and Trejo (Journal of Machine Learning Research, December 2001)
- Consider K-PLS as a better, nonlinear PLS
- K-PLS gives almost identical results to SVMs for the QSAR data we tried
- K-PLS is a lot faster than SVMs
51
[Figure: cardiac cycle trace with the P, Q, R, S, T waves labeled]
53
Validation Model: 100x leave 10% out validations
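A sketch of such a validation loop, assuming scikit-learn; PLSRegression stands in for whichever model (PLS, K-PLS, SVM, ANN) is being validated:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.cross_decomposition import PLSRegression

def leave_10pct_out(X, y, n_splits=100, n_components=5, seed=0):
    """100x leave-10%-out validation: 100 random splits, each holding out 10%
    of the data, with the hold-out predictions scored every time."""
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.10, random_state=seed)
    rmses = []
    for train_idx, test_idx in splitter.split(X):
        model = PLSRegression(n_components=n_components).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx]).ravel()
        rmses.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))
    return np.mean(rmses), np.std(rmses)
```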
54
PLS, K-PLS, SVM, ANN Feature Selection (data strip mining)