1
Presented at the Albany Chapter of the ASA, February 25, 2004, Washington DC
2
Magnetocardiography at CardioMag Imaging, Inc., with Bolek Szymanski and Karsten Sternickel
3
Left: Filtered and averaged temporal MCG traces for one cardiac cycle in 36 channels (the 6x6 grid). Right Upper: Spatial map of the cardiac magnetic field, generated at an instant within the ST interval. Right Lower: T3-T4 sub-cycle in one MCG signal trace
4
Classical (Linear) Regression Analysis: predict y from X_{n x m} using the pseudo-inverse (prediction model: y_hat = X w, with w = pinv(X) y). Can we apply this "wisdom" to data and forecast them correctly? Here n = 19 and m = 7: 19 data records and 7 attributes (1 response).
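A minimal sketch of this prediction model, assuming NumPy and using random placeholder data rather than the 19x7 data set from the slide:

```python
import numpy as np

# Illustrative sketch (not the original data): fit a linear model y ~ X w
# for n = 19 records and m = 7 attributes via the Penrose pseudo-inverse.
rng = np.random.default_rng(0)
n, m = 19, 7
X = rng.standard_normal((n, m))                                  # placeholder attribute matrix X_{n x m}
y = X @ rng.standard_normal(m) + 0.1 * rng.standard_normal(n)    # placeholder response

w = np.linalg.pinv(X) @ y            # the "wisdom" vector from the pseudo-inverse
y_hat = X @ w                        # prediction model
print("training RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))
```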
5
Fundamental Machine Learning Paradox
Learning occurs because of redundancy (patterns) in the data.
Machine Learning Paradox: if the data contain redundancies, (i) we can learn from the data, but (ii) the "feature kernel matrix" K_F = X^T X is ill-conditioned.
How to resolve the Machine Learning Paradox?
- (i) fix the rank deficiency of K_F with principal components (PCA)
- (ii) regularization: use K_F + λI instead of K_F (ridge regression)
- (iii) local learning
6
Principal Component Regression (PCR): replace X_{n x m} by T_{n x h}, where T_{n x h} holds the principal components: the projection of the n data records on the h "most important" eigenvectors of the feature kernel K_F.
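A short PCR sketch, assuming X and y are already Mahalanobis scaled; it keeps the h eigenvectors of the feature kernel X^T X with the largest eigenvalues:

```python
import numpy as np

def pcr_fit(X, y, h):
    """Principal Component Regression: regress y on the h leading
    principal components T_{n x h} instead of on X_{n x m} directly."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)     # feature kernel K_F = X^T X
    order = np.argsort(eigvals)[::-1][:h]          # h largest eigenvalues
    B = eigvecs[:, order]                          # m x h eigenvector directions
    T = X @ B                                      # n x h principal-component scores
    coef = np.linalg.pinv(T) @ y                   # well-conditioned regression on T
    return B, coef

def pcr_predict(X_new, B, coef):
    return (X_new @ B) @ coef
```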
7
Ridge Regression in Data Space
- The "wisdom" is now obtained from the right-hand (Penrose) inverse: w = X^T (X X^T)^{-1} y
- A ridge term is added to resolve the learning paradox: w = X^T (K_D + λI)^{-1} y
- Data kernel: K_D = X X^T
- Needs kernels only
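A minimal sketch of ridge regression in data space, assuming a linear data kernel K_D = X X^T and a small ridge parameter lam chosen by the user:

```python
import numpy as np

def data_space_ridge_fit(X, y, lam=1e-3):
    """Ridge regression in data space: alpha = (K_D + lam*I)^{-1} y,
    where K_D = X X^T is the (linear) data kernel; w = X^T alpha."""
    K_D = X @ X.T
    alpha = np.linalg.solve(K_D + lam * np.eye(K_D.shape[0]), y)
    return alpha                               # dual weights (one per data record)

def data_space_ridge_predict(X_train, alpha, X_new):
    # prediction needs kernels only: k(x_new, x_i) = x_new . x_i
    return (X_new @ X_train.T) @ alpha
```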
8
Implementing Direct Kernel Methods
Linear Model:
- PCA model
- PLS model
- Ridge Regression
- Self-Organizing Map...
9
What have we learned so far?
There is a "learning paradox" because of redundancies in the data.
We resolved this paradox by "regularization":
- In the case of PCA we used the eigenvectors of the feature kernel
- In the case of ridge regression we added a ridge to the data kernel
So far prediction models involved only linear algebra and were strictly linear.
What is in a kernel? The data kernel contains linear similarity measures (correlations) of data records: K_ij = x_i · x_j
10
Kernels: What is a kernel?
- The data kernel expresses a similarity measure between data records
- So far, the kernel contains linear similarity measures: linear kernel k(x_i, x_j) = x_i · x_j
- We can actually make up nonlinear similarity measures as well
- Radial Basis Function kernel (nonlinear, based on the distance or difference between records): k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))
11
Review: What is in a Kernel?
A kernel can be considered as a (nonlinear) data transformation.
- Many different choices for the kernel are possible
- The Radial Basis Function (RBF) or Gaussian kernel is an effective nonlinear kernel
The RBF or Gaussian kernel is a symmetric matrix.
- Entries reflect nonlinear similarities amongst data descriptions
- As defined by: K_ij = exp(-||x_i - x_j||^2 / (2σ^2))
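A small helper that builds such an RBF kernel matrix; sigma is a user-chosen width, not a value prescribed by the slides:

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    """Gaussian / RBF kernel matrix: K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)).
    Entries are nonlinear similarity measures between data records."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    sq_dist = np.maximum(sq_dist, 0.0)       # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * sigma**2))
```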
12
Direct Kernel Methods for Nonlinear Regression/Classification
Consider the kernel as a (nonlinear) data transformation:
- This is the so-called "kernel trick" (Hilbert, early 1900s)
- The Radial Basis Function (RBF) or Gaussian kernel is an efficient nonlinear kernel
Linear regression models can be "tricked" into nonlinear models by applying them to kernel-transformed data (sketched below):
- PCA → DK-PCA
- PLS → DK-PLS (Partial Least Squares Support Vector Machines)
- (Direct) Kernel Ridge Regression → Least Squares Support Vector Machines
- Direct Kernel Self-Organizing Maps (DK-SOM)
These methods work in the same space as SVMs:
- DK models can usually also be derived from an optimization formulation (similar to SVMs)
- Unlike the original SVMs, DK methods are not sparse (i.e., all data are support vectors)
- Unlike SVMs, there is no patent on direct kernel methods
- Performance on hundreds of benchmark problems compares favorably with SVMs
Classification can be considered as a special case of regression.
Data pre-processing: data are usually Mahalanobis scaled first.
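A sketch of the direct-kernel recipe (Mahalanobis scaling, kernel transform, then an ordinary linear learner), here a DK-PLS built on scikit-learn's PLSRegression and rbf_kernel as assumed stand-ins for the Analyze/StripMiner implementation:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics.pairwise import rbf_kernel

def mahalanobis_scale(X, mean=None, std=None):
    # column-wise centering and unit-variance scaling ("Mahalanobis scaling" in the slides)
    mean = X.mean(axis=0) if mean is None else mean
    std = X.std(axis=0, ddof=1) if std is None else std
    return (X - mean) / std, mean, std

def dk_pls_fit(X_train, y_train, sigma=1.0, n_components=5):
    Xs, mean, std = mahalanobis_scale(X_train)
    K = rbf_kernel(Xs, Xs, gamma=1.0 / (2.0 * sigma**2))   # kernel transform as pre-processing
    model = PLSRegression(n_components=n_components).fit(K, y_train)
    return model, (mean, std, Xs, sigma)

def dk_pls_predict(model, state, X_new):
    mean, std, Xs, sigma = state
    Xn, _, _ = mahalanobis_scale(X_new, mean, std)          # reuse the training scaling factors
    return model.predict(rbf_kernel(Xn, Xs, gamma=1.0 / (2.0 * sigma**2)))
```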
13
Nonlinear PCA in Kernel Space
Like PCA, but consider a nonlinear data kernel transformation up front (the data kernel), and derive principal components for that kernel (e.g., with NIPALS).
Examples:
- Haykin's Spiral
- Cherkassky's nonlinear function model
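A minimal kernel-PCA sketch; it centers the data kernel and takes its leading eigenvectors directly, a small-scale shortcut for the NIPALS iteration mentioned on the slide:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_pca_scores(X, n_components=2, gamma=1.0):
    """Nonlinear PCA in kernel space: build the data kernel up front,
    center it, and keep the leading eigenvectors."""
    K = rbf_kernel(X, X, gamma=gamma)
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = J @ K @ J
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    # component scores: eigenvectors scaled by the square roots of their eigenvalues
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
```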
14
PCA Example: Haykin's Spiral (demo: haykin1)
15
Linear PCR Example: Haykin’s Spiral (demo: haykin2)
16
K-PCR Example: Haykin's Spiral, with 3 PCAs and 12 PCAs (demo: haykin3)
17
Scaling, centering & making the test kernel centering consistent (flow chart):
- Training data → Mahalanobis-scaled training data → kernel-transformed training data → centered direct kernel (training data)
- The Mahalanobis scaling factors and the vertical kernel centering factors from the training data are stored and reused on the test data
- Test data → Mahalanobis-scaled test data → kernel-transformed test data → centered direct kernel (test data)
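A sketch of the centering bookkeeping implied by this flow chart: the training kernel's centering factors are stored and reused so the test kernel is centered consistently (this is standard kernel centering; the slides' exact recipe may differ in detail):

```python
import numpy as np

def center_train_kernel(K_train):
    """Center the direct (training) kernel and keep the centering factors."""
    col_means = K_train.mean(axis=0)                 # vertical kernel centering factors
    row_means = K_train.mean(axis=1, keepdims=True)
    total_mean = K_train.mean()
    Kc = K_train - col_means - row_means + total_mean
    return Kc, col_means, total_mean

def center_test_kernel(K_test, col_means, total_mean):
    """Center the test kernel consistently: subtract the *training* column means
    and each test row's own mean, then add back the training grand mean."""
    row_means = K_test.mean(axis=1, keepdims=True)
    return K_test - col_means - row_means + total_mean
```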
18
36 MCG T3-T4 Traces
Preprocessing:
- horizontal Mahalanobis scaling
- D4 wavelet transform
- vertical Mahalanobis scaling (features and response)
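A hedged sketch of this preprocessing chain using PyWavelets; 'db4' is an assumption for the slide's D4 wavelet (swap in the variant actually used), and the response would be scaled in the same vertical pass:

```python
import numpy as np
import pywt  # PyWavelets

def preprocess_traces(traces):
    """traces: array of shape (n_traces, n_samples)."""
    # horizontal Mahalanobis scaling: each trace to zero mean, unit variance
    traces = (traces - traces.mean(axis=1, keepdims=True)) / traces.std(axis=1, keepdims=True)
    # wavelet transform of every trace, coefficients concatenated into a feature vector
    feats = np.array([np.concatenate(pywt.wavedec(t, "db4")) for t in traces])
    # vertical Mahalanobis scaling: each wavelet feature to zero mean, unit variance
    return (feats - feats.mean(axis=0)) / feats.std(axis=0)
```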
19
[Comparison panels: SVMLib, Linear PCA, Direct Kernel PLS, SVMLib]
20
Direct Kernel PLS with 3 Latent Variables
21
Predictions on Test Cases with K-PLS
22
K-PLS Predictions After Removing 14 Outliers
23
Benchmark Predictions on Test Cases
24
Direct Kernel with Robert Bress and Thanakorn Naenna
25
www.drugmining.com Kristin Bennett and Mark Embrechts
26
Docking Ligands is a Nonlinear Problem
27
[Slide background: a long genomic DNA sequence (GATCAATGAGGTGGACACCAGAGGCGGGGACTTGTAAATAACACTGGG...), shown twice] WORK IN PROGRESS
30
Direct Kernel Partial Least Squares (K-PLS)
[Diagram: inputs x1, x2, x3 → latent variables t1, t2 → response y]
Direct Kernel PLS is PLS with the kernel transform as a pre-processing step.
- Consider K-PLS as a "better" nonlinear PLS
- Consider PLS as a "better" PCA
K-PLS gives almost identical (but more stable) results to SVMs:
- PLS is the method of choice for chemometrics and QSAR drug design
- hyper-parameters are easy to tune (5 latent variables)
- unlike SVMs, there is no patent on K-PLS
31
What have we learned so far?
There is a "learning paradox" because of redundancies in the data.
We resolved this paradox by "regularization":
- In the case of PCA we used the eigenvectors of the feature kernel
- In the case of ridge regression we added a ridge to the data kernel
So far prediction models involved only linear algebra and were strictly linear.
What is in a kernel? The data kernel contains linear similarity measures (correlations) of data records: K_ij = x_i · x_j
32
Kernels: What is a kernel?
- The data kernel expresses a similarity measure between data records
- So far, the kernel contains linear similarity measures: linear kernel k(x_i, x_j) = x_i · x_j
- We can actually make up nonlinear similarity measures as well
- Radial Basis Function kernel (nonlinear, based on the distance or difference between records): k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))
33
PCR in Feature Space (network view)
[Diagram: inputs x1 ... xm feeding layers of summation nodes]
- Weights correspond to the H eigenvectors with the largest eigenvalues of X^T X; the projections on the eigenvectors are divided by the corresponding variance (cf. Mahalanobis scaling)
- Weights correspond to the scores (PCAs) for the entire training set; this layer gives a weighted similarity score with each data point
- Weights correspond to the dependent variable for the entire training data: a kind of nearest-neighbor weighted prediction score
34
PCR in Feature Space (continued)
[Diagram: inputs x1 ... xm → weights w1 ... wh → latent variables t1 ... th → y]
- Weights correspond to the H eigenvectors with the largest eigenvalues of X^T X
- Principal components can be thought of as a data pre-processing step
- Rather than building a model for an m-dimensional input vector x, we now have an h-dimensional vector t
35
Predictions on Test Cases with DK-SOM: use of a direct kernel self-organizing map in testing mode for the detection of patients with ischemia (red patient IDs). The darker hexagons, colored during a separate training phase, represent nodes corresponding to ischemia cases.
36
Outlier/Novelty Detection Methods in Analyze/StripMiner
- One-class SVM with LibSVM, with auto-tuning for regularization
- Outliers flagged on Self-Organizing Maps (SOMs and DK-SOMs)
- Extended pharmaplots:
  - PCA-based pharmaplot
  - PLS-based pharmaplot
  - K-PLS-based pharmaplot
  - K-PCA-based pharmaplot
Will explore outlier detection options with CardioMag data:
- 1152 mixed wavelet descriptors
- 74 training data and 10 test data
37
Outlier Detection Procedure in Analyze
- Start
- One-class SVM on training data (proprietary regularization mechanism)
- Determine the number of outliers from the elbow plot
- Eliminate the outliers from the training set (output: list of outlier pattern IDs)
- Run K-PLS for the new training/test data
- See whether the outliers make sense on pharmaplots (outliers are flagged in the pharmaplots)
- Inspect outlier clusters on SOMs
- End
38
Tagging Outliers on Pharmaplot with Analyze Code
39
"Elbow" Plot for Specifying # Outliers: the "elbows" suggest 7-14 outliers
40
One-Class SVM Results for MCG Data
41
Outlier/Novelty Detection Methods in Analyze: Hypotheses
- One-class SVMs are commonly cited for outlier detection (e.g., Suykens)
  - used publicly available SVM code (LibSVM)
  - Analyze has user-friendly interface operators for using LibSVM
- Proprietary heuristic tuning for C in SVMs
  - heuristic tuning method explained in previous publications
  - heuristic tuning is essential to make outlier detection work properly
- "Elbow" curves for indicating the number of outliers
- Pharmaplots justify/validate detection from different methods
- Pharmaplots extended to PLS, K-PCA, and K-PLS pharmaplots
42
One-Class SVM: Brief Theory
Well-known method for outlier & novelty detection in the SVM literature (e.g., see Suykens).
LibSVM, a publicly available SVM code for general use, has a one-class SVM option built in (see Chih-Chung Chang and Chih-Jen Lin). Analyze has operators to interface with LibSVM.
Theory:
- One-class SVM ignores the response (assumes all responses are zero)
- Maximizes the spread and subtracts a regularization term
- Suykens (p. 203) has the corresponding formulation, with a regularization parameter; Analyze has a proprietary way to determine it
Application:
- Analyze combines one-class SVMs with pharmaplots to see whether outliers can be explained and make sense
- Analyze has elbow curves to assist the user in determining the number of outliers
- The combination of one-class SVMs with pharmaplots gave excellent results on several industrial (non-pharmaceutical) data sets
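A rough stand-in for this setup using scikit-learn's OneClassSVM (which wraps LibSVM); nu and gamma below are placeholder settings, not the proprietary auto-tuned regularization used in Analyze:

```python
import numpy as np
from sklearn.svm import OneClassSVM

def flag_outliers(X_train, nu=0.1, gamma="scale"):
    """One-class SVM outlier flagging: the response is ignored and the model
    only sees the descriptors."""
    ocsvm = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train)
    scores = ocsvm.decision_function(X_train)    # lower score = more outlier-like
    flags = ocsvm.predict(X_train) == -1         # -1 marks flagged outliers
    return flags, scores

# Sorting the scores (np.sort(scores)) gives the kind of "elbow" curve
# used on the following slides to pick the number of outliers.
```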
43
NIPALS ALGORITHM FOR PLS (with just one response variable y)
Start for a PLS component: calculate the weight w = X^T y / ||X^T y||
- Calculate the score t: t = X w
- Calculate c': c' = t^T y / (t^T t)
- Calculate the loading p: p = X^T t / (t^T t)
- Store t in T, store p in P, store w in W
- Deflate the data matrix and the response variable: X ← X - t p^T, y ← y - c' t
Do for h latent variables.
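A compact NIPALS-PLS1 sketch following these steps, assuming X and y are already centered (e.g., Mahalanobis scaled):

```python
import numpy as np

def nipals_pls1(X, y, h):
    """NIPALS for PLS with one response y: weight w -> score t -> coefficient c'
    -> loading p -> deflation, repeated for h latent variables."""
    X, y = X.astype(float).copy(), y.astype(float).copy()
    T, P, W, C = [], [], [], []
    for _ in range(h):
        w = X.T @ y
        w /= np.linalg.norm(w)            # weight vector (normalized)
        t = X @ w                         # score t
        c = (t @ y) / (t @ t)             # c': regress y on t
        p = X.T @ t / (t @ t)             # loading p
        X -= np.outer(t, p)               # deflate the data matrix
        y -= c * t                        # deflate the response
        T.append(t); P.append(p); W.append(w); C.append(c)
    T, P, W, C = np.column_stack(T), np.column_stack(P), np.column_stack(W), np.array(C)
    beta = W @ np.linalg.solve(P.T @ W, C)   # regression vector: y_hat = X_new @ beta
    return T, P, W, C, beta
```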
44
Outlier/Novelty Detection Methods in Analyze
Outlier detection methods were extensively tested:
- on a variety of different UCI data sets
- models sometimes showed significant improvement after removal of outliers
- models were rarely worse
- outliers could be validated on pharmaplots and led to enhanced insight
The pharmaplots confirm the validity of outlier detection with the one-class SVM.
Prediction on the test set for the albumin data improves the model.
A non-pharmaceutical (medical) data set actually shows two data points in the training set that were probably given wrong labels (Appendix A).
45
[Figure: cardiac cycle trace with the P, Q, R, S, T waves labeled]
46
Innovations in Analyze for Outlier Detection
- User-friendly procedure with automated processes
- Interface for one-class SVM from LibSVM
- Automated tuning for regularization parameters
- Elbow plots to determine the number of outliers
- Combination of LibSVM outliers with pharmaplots:
  - efficient visualization of outliers
  - facilitates interpretation of outliers
- Extended pharmaplots: PCA, K-PCA, PLS, K-PLS
- User-friendly and efficient SOM with outlier identification
- Direct-kernel-based outlier detection as an alternative to LibSVM
47
KERNEL PLS (K-PLS)
Invented by Rosipal and Trejo (Journal of Machine Learning Research, December 2001). Can be considered as the poor man's support vector machine (SVM).
Linear PLS:
- w_1 is the eigenvector of X^T Y Y^T X; t_1 is the eigenvector of X X^T Y Y^T
- the w's and t's of the deflations: w's are orthonormal, t's are orthogonal, p's are not orthogonal but are orthogonal to earlier w's
Kernel PLS:
- They first altered linear PLS to deal with eigenvectors of X X^T, and made the NIPALS PLS formulation resemble PCA more
- The trick is a different normalization: now the t's rather than the w's are normalized
- t_1 is the eigenvector of K(X X^T) Y Y^T, and the w's and t's come from deflations of the kernel rather than X X^T
- The nonlinear kernel-based correlation matrix K(X X^T) is used rather than X X^T; it contains nonlinear similarities of data points rather than linear correlations
- An example is the Gaussian kernel similarity measure: K_ij = exp(-||x_i - x_j||^2 / (2σ^2))
48
Principal Component Analysis (PCA)
[Diagram: inputs x1, x2, x3 → principal components t1, t2 → y]
- We introduce a modest set of the h most important principal components, T_{n x h}
- Replace the data X_{n x m} by the most important principal components T_{n x h}
- The most important T's are the ones corresponding to the largest eigenvalues of X^T X
- The B's are the eigenvectors of X^T X, ordered from largest to smallest eigenvalue
- In practice the calculation of the B's and T's proceeds iteratively with the NIPALS algorithm
- NIPALS: Nonlinear Iterative Partial Least Squares (Herman Wold)
49
Partial Least Squares (PLS)
[Diagram: inputs x1, x2, x3 → latent variables t1, t2 → y]
- Similar to PCA
- PLS: Partial Least Squares / Projection to Latent Structures / "Please Listen to Svante"
- The T's are now called scores or latent variables, and the p's are the loading vectors
- The loading vectors are no longer orthogonal and are influenced by the y vector
- A special version of NIPALS is also used to build up the t's
50
Kernel PLS (K-PLS)
[Diagram: inputs x1, x2, x3 → latent variables t1, t2 → y]
- Invented by Rosipal and Trejo (Journal of Machine Learning Research, December 2001)
- Consider K-PLS as a better, nonlinear PLS
- K-PLS gives almost identical results to SVMs for the QSAR data we tried
- K-PLS is a lot faster than SVMs
51
[Figure: cardiac cycle trace with the P, Q, R, S, T waves labeled]
53
Validation Model: 100x leave 10% out validations
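A sketch of such a validation loop, assuming scikit-learn; PLSRegression stands in for whichever model (PLS, K-PLS, SVM, ANN) is being validated:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.cross_decomposition import PLSRegression

def leave_10pct_out(X, y, n_splits=100, n_components=5, seed=0):
    """100x leave-10%-out validation: 100 random splits, each holding out 10%
    of the data, with the hold-out predictions scored every time."""
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.10, random_state=seed)
    rmses = []
    for train_idx, test_idx in splitter.split(X):
        model = PLSRegression(n_components=n_components).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx]).ravel()
        rmses.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))
    return np.mean(rmses), np.std(rmses)
```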
54
PLS, K-PLS, SVM, ANN Feature Selection (data strip mining)