Slide 1: Direct Kernel Methods
Slide 2: Data mining is the process of automatically extracting valid, novel, potentially useful, and ultimately comprehensible information from very large databases.
Slide 3: The Data Mining Process
database → data prospecting and surveying → select → selected data → preprocess & transform → transformed data → make model → interpretation & rule formulation
Slide 4: How is Data Mining Different?
Emphasis on large data sets:
- Not all data fit in memory (necessarily)
- Outlier detection, rare events, errors, missing data, minority classes
- Scaling of computation time with data size is an issue
- Large data sets: large numbers of records and/or attributes; fusion of databases
Emphasis on finding interesting, novel, non-obvious information:
- It is not necessarily known what exactly one is looking for
- Models can be highly nonlinear
- Information nuggets can be valuable
Different methods:
- Statistics
- Association rules & pattern recognition
- AI
- Computational intelligence (neural nets, genetic algorithms, fuzzy logic)
- Support vector machines and kernel-based methods
- Visualization (SOM, pharmaplots)
Emphasis on explaining and feedback
Interdisciplinary nature of data mining
Slide 5: Data Mining Challenges
Large data sets:
- Data sets can be rich in the number of records
- Data sets can be rich in the number of attributes
Data preprocessing and feature definition:
- Data representation
- Attribute/feature selection
- Transforms and scaling
Scientific data mining:
- Classification, multiple classes, regression
- Continuous and binary attributes
- Large data sets
- Nonlinear problems
Erroneous data, outliers, novelty, and rare events:
- Erroneous data
- Outliers
- Rare events
- Novelty detection
Smart visualization techniques
Feature selection & rule formulation
Slide 6: DATA → INFORMATION → KNOWLEDGE → UNDERSTANDING → WISDOM
Slide 7: A Brief History of Data Mining: Pascal → Bayes → Fisher → Werbos → Vapnik
The meaning of "data mining" has changed over time:
- Pre-1993: "Data mining is the art of torturing the data into a confession"
- Post-1993: "Data mining is the art of charming the data into a confession"
From the supermarket scanner to the human genome:
- Pre-1998: database marketing and marketing-driven applications
- Post-1998: the emergence of scientific data mining
From AI expert systems to data-driven expert systems:
- Pre-1990: the experts speak (AI systems)
- Post-1995: attempts to let the data speak for themselves
- 2000+: the data speak …
A brief history of statistics and statistical learning theory:
- From the calculus of chance to the calculus of probabilities (Pascal → Bayes)
- From probabilities to statistics (Bayes → Fisher)
- From statistics to machine learning (Fisher & Tukey → Werbos → Vapnik)
From theory to application
Slide 8: Data Mining Applications and Operations
Operations:
- Data preparation: missing data, data cleansing, visualization, data transformation
- Clustering/classification
- Statistics
- Factor analysis/feature selection
- Associations
- Regression models
- Data-driven expert systems
- Meta-visualization/interpretation
Applications:
- Database marketing
- Finance
- Health insurance
- Medicine
- Bioinformatics
- Manufacturing
- WWW agents
- Text retrieval
- "Homeland Security"
- BioDefense
Slide 9: Direct Kernel Methods for Data Mining: Outline
- Classical (linear) regression analysis and the learning paradox
- Resolving the learning paradox by:
  - resolving the rank deficiency (e.g., PCA)
  - regularization (e.g., ridge regression)
- Linear and nonlinear kernels
- Direct kernel methods for nonlinear regression:
  - Direct Kernel Principal Component Analysis (DK-PCA)
  - (Direct) Kernel Ridge Regression ≡ Least-Squares SVM (LS-SVM)
  - Direct Kernel Partial Least Squares ≡ Partial Least-Squares SVM
  - Direct Kernel Self-Organizing Maps (DK-SOM)
- Feature selection, memory requirements, hyperparameter selection
- Examples:
  - nonlinear toy examples (DK-PCA on Haykin's spiral, LS-SVM on Cherkassky's data)
  - K-PLS for time-series data
  - K-PLS for QSAR drug design
  - LS-SVM nerve-agent classification with an electronic nose
  - K-PLS with feature selection on microarray gene-expression data (leukemia)
  - Direct Kernel SOM and DK-PLS for magnetocardiogram data
  - Direct Kernel SOM for substance identification from spectrograms
Slide 11: Review: What is in a Kernel?
A kernel can be considered a (nonlinear) data transformation:
- Many different choices for the kernel are possible
- The Radial Basis Function (RBF) or Gaussian kernel is an effective nonlinear kernel
The RBF or Gaussian kernel is a symmetric matrix:
- Entries reflect nonlinear similarities among the data descriptions
- As defined by: K_ij = exp(-||x_i - x_j||² / (2σ²))
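The direct kernel idea on this slide is easy to make concrete. Below is a minimal numpy sketch (not the authors' code): build the Gaussian kernel matrix just defined, treat its entries as nonlinear features, and fit ordinary ridge regression on them, which is the kernel ridge regression / LS-SVM form from the outline. The toy data and the values of sigma and lam are illustrative assumptions.

    import numpy as np

    def rbf_kernel(X, Z, sigma=1.0):
        # K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2)), as defined above
        sq = (X**2).sum(1)[:, None] - 2.0 * X @ Z.T + (Z**2).sum(1)[None, :]
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

    # Toy regression problem (illustrative, not from the talk)
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(60, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)

    # Direct kernel trick: kernel entries become the features of a linear model.
    # Solving (K + lam*I) alpha = y is ridge regression on those features
    # (kernel ridge regression / LS-SVM without a bias term).
    sigma, lam = 1.0, 0.1
    K = rbf_kernel(X, X, sigma)                  # symmetric train-train kernel
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

    # New points are kerneled against the training data, then predicted linearly
    X_new = np.linspace(-3, 3, 5)[:, None]
    y_hat = rbf_kernel(X_new, X, sigma) @ alpha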
Slide 12: Docking Ligands is a Nonlinear Problem
Slide 13: Electron Density-Derived TAE-Wavelet Descriptors
Surface properties are encoded on the 0.002 e/au³ electron-density surface.
Breneman, C.M. and Rhem, M. [1997] J. Comp. Chem., Vol. 18 (2), pp. 182-197
Histogram or wavelet encodings of the surface properties give Breneman's TAE property descriptors (10x16 wavelet descriptors).
PIP (Local Ionization Potential) histograms; wavelet coefficients.
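As a rough illustration of the histogram encoding (a sketch of the general idea only; the function name, bin count, and value range below are hypothetical, not Breneman's actual TAE pipeline):

    import numpy as np

    def histogram_descriptor(surface_values, n_bins=16, value_range=(0.0, 1.0)):
        # Encode a property sampled at arbitrarily many surface points as a
        # fixed-length histogram, so molecules with different surface sizes
        # become comparable descriptor vectors. Bin count and range are
        # illustrative assumptions.
        counts, _ = np.histogram(surface_values, bins=n_bins, range=value_range)
        return counts / max(counts.sum(), 1)  # normalize away surface size

    # e.g., local ionization potential (PIP) sampled at 5000 surface points
    pip_values = np.random.rand(5000)
    descriptor = histogram_descriptor(pip_values)  # 16 numbers per molecule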
Slide 14: Binding Affinities to Human Serum Albumin (HSA): log K'hsa
Gonzalo Colmenarejo, GlaxoSmithKline, J. Med. Chem. 2001, 44, 4370-4378
- 95 molecules, 250-1500+ descriptors
- 84 training, 10 testing (1 left out)
- 551 wavelet + PEST + MOE descriptors
- widely different compounds
Acknowledgements: Sean Ekins (Concurrent), N. Sukumar (Rensselaer)
Slide 15: Validation Model: 100x leave-10%-out validations
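In scikit-learn terms, 100x leave-10%-out validation amounts to 100 random 90/10 splits. A hedged sketch follows; the model, hyperparameters, and random data are placeholders shaped like the HSA set from the previous slide (95 molecules, 551 descriptors).

    import numpy as np
    from sklearn.model_selection import ShuffleSplit
    from sklearn.kernel_ridge import KernelRidge

    # Placeholder data; KernelRidge stands in for the talk's K-PLS / LS-SVM models
    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((95, 551)), rng.standard_normal(95)
    model = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.01)

    scores = []
    for train, test in ShuffleSplit(n_splits=100, test_size=0.1, random_state=0).split(X):
        model.fit(X[train], y[train])
        scores.append(model.score(X[test], y[test]))  # R^2 on the held-out 10%

    # The spread over 100 splits indicates how stable the model is
    print(f"R^2 = {np.mean(scores):.3f} +/- {np.std(scores):.3f}")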
Slide 16: Feature Selection (data strip mining) with PLS, K-PLS, SVM, and ANN
Slide 17: K-PLS pharmaplots: 511 features vs. 32 features
Slide 18: Microarray Gene-Expression Data for Detecting Leukemia
- 38 samples for training, 36 samples for testing
- Challenge: select ~10 out of 6000 genes
- Used sensitivity analysis for feature selection (with Kristin Bennett); see the sketch below
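One common recipe for sensitivity-based feature selection, sketched here under stated assumptions (the authors' exact procedure may differ, and KernelRidge stands in for their kernel model): fit on all genes, nudge each gene around its mean while holding the others fixed, rank genes by how much the prediction moves, keep the top ~10, and refit.

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    # Placeholder data shaped like the leukemia set: 38 training samples, 6000 genes
    rng = np.random.default_rng(0)
    X = rng.standard_normal((38, 6000))
    y = rng.integers(0, 2, 38).astype(float)

    model = KernelRidge(kernel="rbf", alpha=0.1, gamma=1e-4).fit(X, y)

    base = X.mean(axis=0, keepdims=True)   # probe point: every gene at its mean
    step = X.std(axis=0)                   # per-gene perturbation size
    sensitivity = np.empty(X.shape[1])
    for j in range(X.shape[1]):            # plain loop: clear, if not fast
        up, down = base.copy(), base.copy()
        up[0, j] += step[j]
        down[0, j] -= step[j]
        sensitivity[j] = abs(model.predict(up)[0] - model.predict(down)[0])

    top_genes = np.argsort(sensitivity)[-10:]   # keep the ~10 most influential genes
    model_small = KernelRidge(kernel="rbf", alpha=0.1).fit(X[:, top_genes], y)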
Slide 19: Direct Kernel Methods
Slide 21
with Wunmi Osadik and Walker Land (Binghamton University)
Acknowledgement: NSF
Slide 22: Magnetocardiography at CardioMag Imaging, Inc.
Slide 23
Left: filtered and averaged temporal MCG traces for one cardiac cycle in 36 channels (the 6x6 grid).
Upper right: spatial map of the cardiac magnetic field, generated at an instant within the ST interval.
Lower right: T3-T4 sub-cycle in one MCG signal trace.
Slide 24: Magnetocardiogram Data
with Karsten Sternickel (CardioMag Inc.) and Boleslaw Szymanski (Rensselaer)
Acknowledgement: NSF SBIR Phase I project
Slide 25: Comparison figures: SVMLib, Linear PCA, Direct Kernel PLS, SVMLib
Slide 26: Direct Kernel PLS with 3 Latent Variables
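"Direct kernel PLS" means ordinary PLS run on the kernel matrix instead of the raw descriptors. A minimal scikit-learn sketch, assuming an RBF kernel and the 3 latent variables named on this slide (the toy data and sigma are illustrative, and the full method typically also centers the kernel, which is omitted here):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    def rbf_kernel(X, Z, sigma=1.0):
        # K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2))
        sq = (X**2).sum(1)[:, None] - 2.0 * X @ Z.T + (Z**2).sum(1)[None, :]
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

    # Toy data (illustrative only)
    rng = np.random.default_rng(0)
    X_tr, X_te = rng.standard_normal((80, 10)), rng.standard_normal((20, 10))
    y_tr = np.sin(X_tr).sum(axis=1)

    # The train-train kernel becomes the input matrix for plain PLS
    K_tr = rbf_kernel(X_tr, X_tr, sigma=3.0)
    pls = PLSRegression(n_components=3).fit(K_tr, y_tr)   # 3 latent variables

    # Test rows are kerneled against the SAME training columns
    K_te = rbf_kernel(X_te, X_tr, sigma=3.0)
    y_hat = pls.predict(K_te).ravel()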
Slide 27: Direct Kernel, with Robert Bress and Thanakorn Naenna
Slide 28: WORK IN PROGRESS
Raw genomic sequence (excerpt): GATCAATGAGGTGGACACCAGAGGCGGGGACTTGTAAATAACACTGGGCTGTAGGAGTGA ...
Slide 29: Santa Fe Time Series Prediction Competition
1994 Santa Fe Institute competition: 1000 points of chaotic laser data; predict the next 100 points.
The competition is described in Time Series Prediction: Forecasting the Future and Understanding the Past, A. S. Weigend & N. A. Gershenfeld, eds., Addison-Wesley, 1994.
Method:
- K-PLS with σ = 3 and 24 latent variables
- trained on records of 40 past points to predict the next point
- predictions bootstrap on each other for the 100 real test points
The entry "would have won" the competition.
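The bootstrapped prediction scheme is easy to sketch: slide a 40-point window over the training series, then roll the model's own outputs back into the window for 100 steps. In the sketch below, KernelRidge stands in for K-PLS and a synthetic series stands in for the laser data (both assumptions); the window size of 40 comes from the slide.

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    # Synthetic stand-in for the 1000-point chaotic laser series
    t = np.arange(1000)
    series = np.sin(0.3 * t) * np.cos(0.05 * t)

    # Training records: 40 past points -> next point
    W = 40
    X = np.array([series[i:i + W] for i in range(len(series) - W)])
    y = series[W:]
    model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.1).fit(X, y)

    # Iterated forecast: each prediction feeds the next window
    window = list(series[-W:])
    forecast = []
    for _ in range(100):
        nxt = model.predict(np.array(window)[None, :])[0]
        forecast.append(nxt)
        window = window[1:] + [nxt]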
Slide 30: www.drugmining.com (Kristin Bennett and Mark Embrechts)