Class 23, 2001 CBCl/AI MIT Bioinformatics Applications and Feature Selection for SVMs S. Mukherjee.

Slides:



Advertisements
Similar presentations
Yinyin Yuan and Chang-Tsun Li Computer Science Department
Advertisements

CHAPTER 13: Alpaydin: Kernel Machines
Predictive Analysis of Gene Expression Data from Human SAGE Libraries Alexessander Alves* Nikolay Zagoruiko + Oleg Okun § Olga Kutnenko + Irina Borisova.
SVM - Support Vector Machines A new classification method for both linear and nonlinear data It uses a nonlinear mapping to transform the original training.
Microarray technology and analysis of gene expression data Hillevi Lindroos.
Support Vector Machines (SVMs) Chapter 5 (Duda et al.)
Principal Component Analysis
Predictive Automatic Relevance Determination by Expectation Propagation Yuan (Alan) Qi Thomas P. Minka Rosalind W. Picard Zoubin Ghahramani.
4 th NETTAB Workshop Camerino, 5 th -7 th September 2004 Alberto Bertoni, Raffaella Folgieri, Giorgio Valentini
Reduced Support Vector Machine
‘Gene Shaving’ as a method for identifying distinct sets of genes with similar expression patterns Tim Randolph & Garth Tan Presentation for Stat 593E.
Classification with reject option in gene expression data Blaise Hanczar and Edward R Dougherty BIOINFORMATICS Vol. 24 no , pages
Principle of Locality for Statistical Shape Analysis Paul Yushkevich.
Darlene Goldstein 29 January 2003 Receiver Operating Characteristic Methodology.
Support Vector Machines
Bioinformatics Challenge  Learning in very high dimensions with very few samples  Acute leukemia dataset: 7129 # of gene vs. 72 samples  Colon cancer.
CISC667, F05, Lec24, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) DNA Microarray, 2d gel, MSMS, yeast 2-hybrid.
Inferring the nature of the gene network connectivity Dynamic modeling of gene expression data Neal S. Holter, Amos Maritan, Marek Cieplak, Nina V. Fedoroff,
Feature Selection and Its Application in Genomic Data Analysis March 9, 2004 Lei Yu Arizona State University.
Artificial Intelligence Term Project #3 Kyu-Baek Hwang Biointelligence Lab School of Computer Science and Engineering Seoul National University
Generate Affy.dat file Hyb. cRNA Hybridize to Affy arrays Output as Affy.chp file Text Self Organized Maps (SOMs) Functional annotation Pathway assignment.
3 rd Summer School in Computational Biology September 10, 2014 Frank Emmert-Streib & Salissou Moutari Computational Biology and Machine Learning Laboratory.
Face Processing System Presented by: Harvest Jang Group meeting Fall 2002.
CBCl/AI MIT Class 19: Bioinformatics S. Mukherjee, R. Rifkin, G. Yeo, and T. Poggio.
Why microarrays in a bioinformatics class? Design of chips Quantitation of signals Integration of the data Extraction of groups of genes with linked expression.
Analysis of microarray data
Pictorial Demonstration Rescale features to minimize the LOO bound R 2 /M 2 x2x2 x1x1 R 2 /M 2 >1 M R x2x2 R 2 /M 2 =1 M = R.
Whole Genome Expression Analysis
Efficient Model Selection for Support Vector Machines
CSE182 L14 Mass Spec Quantitation MS applications Microarray analysis.
From motif search to gene expression analysis
Evaluation of Supervised Learning Algorithms on Gene Expression Data CSCI 6505 – Machine Learning Adan Cosgaya Winter 2006 Dalhousie University.
Integration II Prediction. Kernel-based data integration SVMs and the kernel “trick” Multiple-kernel learning Applications – Protein function prediction.
The Broad Institute of MIT and Harvard Classification / Prediction.
Central dogma of biology DNA  RNA  pre-mRNA  mRNA  Protein Central dogma.
Controlling FDR in Second Stage Analysis Catherine Tuglus Work with Mark van der Laan UC Berkeley Biostatistics.
Scenario 6 Distinguishing different types of leukemia to target treatment.
CZ5225: Modeling and Simulation in Biology Lecture 8: Microarray disease predictor-gene selection by feature selection methods Prof. Chen Yu Zong Tel:
+ Get Rich and Cure Cancer with Support Vector Machines (Your Summer Projects)
Evolutionary Algorithms for Finding Optimal Gene Sets in Micro array Prediction. J. M. Deutsch Presented by: Shruti Sharma.
Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.
By: Amira Djebbari and John Quackenbush BMC Systems Biology 2008, 2: 57 Presented by: Garron Wright April 20, 2009 CSCE 582.
Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell. Measuring protein might be more direct, but is currently.
DNA Microarray Data Analysis using Artificial Neural Network Models. by Venkatanand Venkatachalapathy (‘Venkat’) ECE/ CS/ ME 539 Course Project.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
University of Texas at Austin Machine Learning Group Department of Computer Sciences University of Texas at Austin Support Vector Machines.
Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics
Artificial Intelligence Project #3 : Diagnosis Using Bayesian Networks May 19, 2005.
Guest lecture: Feature Selection Alan Qi Dec 2, 2004.
A Short and Simple Introduction to Linear Discriminants (with almost no math) Jennifer Listgarten, November 2002.
Final Exam Review CS479/679 Pattern Recognition Dr. George Bebis 1.
Disease Diagnosis by DNAC MEC seminar 25 May 04. DNA chip Blood Biopsy Sample rRNA/mRNA/ tRNA RNA RNA with cDNA Hybridization Mixture of cell-lines Reference.
Classification Course web page: vision.cis.udel.edu/~cv May 14, 2003  Lecture 34.
Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae Vu, T. T.,
Case Study: Characterizing Diseased States from Expression/Regulation Data Tuck et al., BMC Bioinformatics, 2006.
Feature Selction for SVMs J. Weston et al., NIPS 2000 오장민 (2000/01/04) Second reference : Mark A. Holl, Correlation-based Feature Selection for Machine.
CSE182 L14 Mass Spec Quantitation MS applications Microarray analysis.
Machine Learning and Data Mining: A Math Programming- Based Approach Glenn Fung CS412 April 10, 2003 Madison, Wisconsin.
1 CISC 841 Bioinformatics (Fall 2008) Review Session.
A Brief Introduction to Support Vector Machine (SVM) Most slides were from Prof. A. W. Moore, School of Computer Science, Carnegie Mellon University.
STAT115 STAT215 BIO512 BIST298 Introduction to Computational Biology and Bioinformatics Spring 2016 Xiaole Shirley Liu.
David Amar, Tom Hait, and Ron Shamir
Bioinformatics Overview
Classification with Gene Expression Data
Statistical Applications in Biology and Genetics
Glenn Fung, Murat Dundar, Bharat Rao and Jinbo Bi
Alan Qi Thomas P. Minka Rosalind W. Picard Zoubin Ghahramani
1 Department of Engineering, 2 Department of Mathematics,
1 Department of Engineering, 2 Department of Mathematics,
1 Department of Engineering, 2 Department of Mathematics,
Presentation transcript:

Class 23, 2001 CBCl/AI MIT Bioinformatics Applications and Feature Selection for SVMs S. Mukherjee

Class 23, 2001 CBCl/AI MIT Outline I. Basic Molecular biology II. Some Bioinformatics problems III. Microarray technology a. Purpose b. cDNA and Oligonucleotide arrays c. Yeast experiment IV. Cancer classification using SVMs V. Rejects and Confidence of classification VI. Feature Selection for SVMs a. Leave-one-out bounds b. The algorithm VII Results on several datasets

Class 23, 2001 CBCl/AI MIT What is Bioinformatics Application of computing technology to providing statistical and database solutions to problems in molecular biology. Defining and addressing problems in molecular biology using methodologies from statistics and computer science. The genome project, genome wide analysis/screening of disease, genetic regulatory networks, analysis of expression data. Pre 1995 Post 1995

Class 23, 2001 CBCl/AI MIT Some Basic Molecular Biology CGAACAAACCTCGAACCTGCT DNA: mRNA: GCU UGU UUA CGA Polypeptide: Ala Cys Leu Arg Translation Transcription

Class 23, 2001 CBCl/AI MIT Examples of Problems Gene sequence problems: Given a DNA sequence state which sections are coding or noncoding regions. Which sections are promoters etc... Protein Structure problems: Given a DNA or amino acid sequence state what structure the resulting protein takes. Gene expression problems: Given DNA/gene microarray expression data infer either clinical or biological class labels or genetic machinery that gives rise to the expression data. Protein expression problems: Study expression of proteins and their function.

Class 23, 2001 CBCl/AI MIT Microarray Technology Basic idea: The state of the cell is determined by proteins. A gene codes for a protein which is assembled via mRNA. Measuring amount particular mRNA gives measure of amount of corresponding protein. Copies of mRNA is expression of a gene. Microarray technology allows us to measure the expression of thousands of genes at once. Measure the expression of thousands of genes under different experimental conditions and ask what is different and why.

Class 23, 2001 CBCl/AI MIT Oligo vs cDNA arrays Lockhart and Winzler 2000

Class 23, 2001 CBCl/AI MIT A DNA Microarray Experiment

Class 23, 2001 CBCl/AI MIT Cancer Classification 38 examples of Myeloid and Lymphoblastic leukemias Affymetrix human 6800, (7128 genes including control genes) 34 examples to test classifier Results: 33/34 correct d perpendicular distance from hyperplane Test data d

Class 23, 2001 CBCl/AI MIT Gene expression and Coregulation

Class 23, 2001 CBCl/AI MIT Nonlinear classifier

Class 23, 2001 CBCl/AI MIT Nonlinear SVM Nonlinear SVM does not help when using all genes but does help when removing top genes, ranked by Signal to Noise (Golub et al).

Class 23, 2001 CBCl/AI MIT Rejections Golub et al classified 29 test points correctly, rejected 5 of which 2 were errors using 50 genes Need to introduce concept of rejects to SVM g1 g2 Normal Cancer Reject

Class 23, 2001 CBCl/AI MIT Rejections

Class 23, 2001 CBCl/AI MIT Estimating a CDF

Class 23, 2001 CBCl/AI MIT The Regularized Solution

Class 23, 2001 CBCl/AI MIT Rejections for SVM 1/d P(c=1 | d).95 95% confidence or p =.05d =.107

Class 23, 2001 CBCl/AI MIT Results with rejections Results: 31 correct, 3 rejected of which 1 is an error Test data d

Class 23, 2001 CBCl/AI MIT Why Feature Selection SVMs as stated use all genes/features Molecular biologists/oncologists seem to be conviced that only a small subset of genes are responsible for particular biological properties, so they want which genes are are most important in discriminating Practical reasons, a clinical device with thousands of genes is not financially practical Possible performance improvement

Class 23, 2001 CBCl/AI MIT Results with Gene Selection AML vs ALL: 40 genes 34/34 correct, 0 rejects. 5 genes 31/31 correct, 3 rejects of which 1 is an error. B vs T cells for AML: 10 genes 33/33 correct, 0 rejects. d Test data d

Class 23, 2001 CBCl/AI MIT Leave-one-out Procedure

Class 23, 2001 CBCl/AI MIT The Basic Idea Use leave-one-out (LOO) bounds for SVMs as a criterion to select features by searching over all possible subsets of n features for the ones that minimizes the bound. When such a search is impossible because of combinatorial explosion, scale each feature by a real value variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables. The rescaling can be done in the input space or in a “Principal Components” space.

Class 23, 2001 CBCl/AI MIT Pictorial Demonstration Rescale features to minimize the LOO bound R 2 /M 2 x2x2 x1x1 R 2 /M 2 >1 M R x2x2 R 2 /M 2 =1 M = R

Class 23, 2001 CBCl/AI MIT SVM Functional To the SVM classifier we add an extra scaling parameters for feature selection: where the parameters , b are computed by maximizing the the following functional, which is equivalent to maximizing the margin:

Class 23, 2001 CBCl/AI MIT Radius Margin Bound

Class 23, 2001 CBCl/AI MIT Jaakkola-Haussler Bound

Class 23, 2001 CBCl/AI MIT Span Bound

Class 23, 2001 CBCl/AI MIT The Algorithm

Class 23, 2001 CBCl/AI MIT Computing Gradients

Class 23, 2001 CBCl/AI MIT Toy Data Linear problem with 6 relevant dimensions of 202 Nonlinear problem with 2 relevant dimensions of 52

Class 23, 2001 CBCl/AI MIT Face Detection On the CMU testset consisting of 479 faces and 57,000,000 non-faces we compare ROC curves obtained for different number of selected features. We see that using more than 60 features does not help.

Class 23, 2001 CBCl/AI MIT Molecular Classification of Cancer

Class 23, 2001 CBCl/AI MIT Morphology Classification

Class 23, 2001 CBCl/AI MIT Outcome Classification

Class 23, 2001 CBCl/AI MIT Outcome Classification Error rates ignore temporal information such as when a patient dies. Survival analysis takes temporal information into account. The Kaplan-Meier survival plots and statistics for the above predictions show significance. p-val = p-val = Lymphoma Medulloblastoma