Presented by: Isabelle Guyon Machine Learning Research
BIOwulf 1 - People 2 - Technology 3 - Results
BIOwulf Technologies 1 - People
Research people + Isabelle Guyon + Vladimir Vapnik (c) + Peter Bartlett + Bernhard Schölkopf + Asa Ben Hur + André Elisseeff + Nello Cristianini + Olivier Chapelle + René Doursat - Olivier Bousquet + David Lewis (c) + Jason Weston + Ed Reiss - Alex Smola (c) + Shelia Guberman + Hong Zhang
BIOwulf Technologies 2 - Technology
Technology: SVM Kernel Machines: $F(x) = \sum_k \alpha_k K(x_k, x)$ Sparsity: the sum runs only over support vectors. Boser-Guyon-Vapnik (1992)
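The kernel expansion on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the BIOwulf implementation: the Gaussian kernel, the names `rbf_kernel`, `decision_function`, and `gamma`, and the toy support vectors and coefficients are all assumptions made for the example. The bias term is omitted to match the formula as shown, and the class label is folded into the sign of each coefficient.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel K(a, b) = exp(-gamma * ||a - b||^2) (assumed choice)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def decision_function(x, support_vectors, alphas, kernel=rbf_kernel):
    """F(x) = sum_k alpha_k K(x_k, x), summed over the support vectors only."""
    return sum(a * kernel(sv, x) for a, sv in zip(alphas, support_vectors))

# Toy model (hypothetical values): two support vectors, one per class.
# The sign of each alpha_k encodes the class of its support vector.
support_vectors = np.array([[0.0, 0.0], [2.0, 2.0]])
alphas = np.array([1.0, -1.0])

f_near_first = decision_function([0.1, 0.0], support_vectors, alphas)
f_near_second = decision_function([2.0, 1.9], support_vectors, alphas)
f_midpoint = decision_function([1.0, 1.0], support_vectors, alphas)
```

In this toy setup a point near the first support vector gets F(x) > 0, a point near the second gets F(x) < 0, and the midpoint lies on the boundary F(x) = 0. Prediction cost grows with the number of support vectors, not with the size of the training set, which is the practical payoff of sparsity.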
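SVM: Universality & Generalization [Figure: input space x = (x_1, x_2) with axes x_1 and x_2; the decision boundary F(x) = 0 separates the region F(x) > 0 from the region F(x) < 0]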
Neural Networks: Local Optima
SVM key properties
Core problems SVMs / Kernel Methods / Statistical Learning Theory: Classification, Clustering, Regression, Feature/Pattern Selection, Causality Inference, Control Problems, Model selection, Novelty Detection
BIOwulf Technologies 3 - Results
Scope Life Sciences Imaging & Signal Processing Financial Seismic Geological Telecom Internet Security Fraud & Abuse Military BIOwulf Technologies
Strategy Data Analysis Result validation Data Collection
Medical Images, Medical & Biology Literature, Medical & Demographic Records, Genomic Sequences, Microarray data, Spectra Data
Internet Discovery Platform [Diagram labels: Information Center (IR), Numerical Lab (DA), numerical results, raw data, structured info, researcher, prospects, demo, scientists, tool, data analyst, customers, service]
Microarray Data Prostate cancer, Stamey-Guyon, Dec Preprocessing - Gene selection - Data cleaning BPH G4 Outlier
Two best genes Prostate cancer, Stamey-Guyon, Dec Golub SVM
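The slide names only "Golub" and "SVM" as the two gene-ranking approaches being compared. As a rough sketch of the first, the code below ranks features with the signal-to-noise score usually attributed to Golub et al. (1999), (mu+ - mu-) / (sigma+ + sigma-), and keeps the two best. The data are synthetic, and the function name `golub_score`, the planted gene indices, and the sample sizes are assumptions for illustration, not values from the prostate cancer study. The SVM-based selection is not shown here.

```python
import numpy as np

def golub_score(X, y):
    """Per-feature score (mu_pos - mu_neg) / (sigma_pos + sigma_neg)."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0))

# Synthetic expression matrix (hypothetical): 40 samples x 50 genes of pure noise.
rng = np.random.default_rng(0)
n_per_class, n_genes = 20, 50
X = rng.normal(size=(2 * n_per_class, n_genes))
y = np.array([1] * n_per_class + [-1] * n_per_class)

# Plant two informative genes: one over-expressed, one under-expressed in class +1.
X[y == 1, 3] += 5.0
X[y == 1, 7] -= 5.0

scores = golub_score(X, y)
# Rank by absolute score so both up- and down-regulated genes are considered.
top_two = np.argsort(-np.abs(scores))[:2]
```

A univariate score like this ranks each gene independently, so it can pick redundant genes; that is one common argument for multivariate selection with a classifier such as an SVM.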
Tree Explorer, Guyon-Doursat-Reiss, 2000 [Figure: tree of gene accession numbers] H64807 R55310 T62947 H08393 T62947 U09564 R88740 M59040 R88740 T94579 H81558 T64012 T86444 H06524 H81558 H06524 U19969 H06524 T94579 T58861 M59040 L08069 H08393 M82919 L03840 U19969 D14812 M82919 L
Spectroscopy [Figure: example spectra f(t) and g(t) plotted against t, for Class 1 and Class 2] Alignment kernel: $K(f,g) = \iint f(t)\, g(t-x)\, \exp(-x^2)\, dt\, dx$ Simple kernel: $K(f,g) = \int f(t)\, g(t)\, dt$ Infrared spectra, Elisseeff-Bartlett, Feb. 2001
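To illustrate the difference between the two kernels, here is a minimal numerical sketch that approximates both integrals with Riemann sums on a uniform grid. The Gaussian penalty is taken exactly as printed, exp(-x^2), with no extra width parameter. The grid, the Gaussian test peaks, the shift range `max_shift`, and the function names are assumptions made for the example, not details from the infrared study.

```python
import numpy as np

def simple_kernel(f, g, dt):
    """K(f, g) = integral of f(t) g(t) dt, approximated by a Riemann sum."""
    return float(np.sum(f * g) * dt)

def alignment_kernel(f, g, dt, max_shift):
    """K(f, g) = double integral of f(t) g(t - x) exp(-x^2) dt dx.

    Shifts x are limited to [-max_shift * dt, max_shift * dt]; g is treated
    as zero outside the sampled range (both are assumptions of this sketch).
    """
    n = len(f)
    padded = np.pad(g, max_shift)  # zero-padding on both sides
    total = 0.0
    for s in range(-max_shift, max_shift + 1):
        x = s * dt
        # shifted[i] = g(t_i - x) = g[i - s], read from the padded array
        shifted = padded[max_shift - s : max_shift - s + n]
        total += np.sum(f * shifted) * np.exp(-x ** 2)
    return float(total * dt * dt)

def normalized(k_fg, k_ff, k_gg):
    """Cosine-style normalization so that a spectrum has similarity 1 with itself."""
    return k_fg / np.sqrt(k_ff * k_gg)

# Two identical peaks, offset by 0.5 along t (hypothetical spectra).
t = np.linspace(0.0, 6.0, 601)
dt = t[1] - t[0]
f = np.exp(-((t - 3.0) ** 2) / (2 * 0.1 ** 2))
g = np.exp(-((t - 3.5) ** 2) / (2 * 0.1 ** 2))

sim_simple = normalized(simple_kernel(f, g, dt),
                        simple_kernel(f, f, dt),
                        simple_kernel(g, g, dt))
sim_align = normalized(alignment_kernel(f, g, dt, 100),
                       alignment_kernel(f, f, dt, 100),
                       alignment_kernel(g, g, dt, 100))
```

With these toy peaks the simple kernel sees almost no overlap between f and g, while the alignment kernel still rates them as similar because it integrates over shifts and only mildly penalizes the small offset. This is the sense in which the alignment kernel builds tolerance to peak drift into the similarity measure.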
Prostate cancer, Elisseeff-Guyon-Weston, May Ciphergen Spectra: 299 features (peak values); 385 examples (325 training, 60 test); 4 classes (15 test examples per class): A = BPH, B and C = cancer (B < C), D = ref.; ordering D < A < B < C. SVM multi-class error rate: 15% (9/60). 59 peaks separate the training set perfectly.
Conclusions SVM advantages in pattern recognition: Superior prediction performance on test data. Unique, easy-to-interpret solution. Better feature selection (only 2-7 genes in array experiments). Use all the data, automatic data cleaning. Incorporate knowledge about the task in the kernel. Can be combined with other methods.