CBCl/AI MIT Class 19: Bioinformatics S. Mukherjee, R. Rifkin, G. Yeo, and T. Poggio

CBCl/AI MIT What is bioinformatics? Pre-1995: the application of computing technology to provide statistical and database solutions to problems in molecular biology. Post-1995: defining and addressing problems in molecular biology using methodologies from statistics and computer science. Examples: the genome project, genome-wide analysis/screening of disease, genetic regulatory networks, analysis of expression data.

CBCl/AI MIT Central dogma of biology: DNA → RNA (pre-mRNA → mRNA) → Protein.

CBCl/AI MIT Basic molecular biology: [figure] a DNA sequence (CGAACAAACCTCGAACCTGCT) is transcribed into mRNA (GCU UGU UUA CGA), which is translated into a polypeptide (Ala Cys Leu Arg).

CBCl/AI MIT Less basic molecular biology: transcription → end modification → splicing → transport → translation.

CBCl/AI MIT Biological information: [figure] sequence information follows the path DNA → pre-RNA → mRNA → protein via transcription, splicing, translation/stability, and transport/localization; quantitative information comes from microarrays, RT-PCR, protein arrays, yeast two-hybrid assays, and chemical screens.

CBCl/AI MIT Sequence problems: splice sites and branch points in eukaryotic pre-mRNA; gene finding in prokaryotes and eukaryotes; promoter recognition (transcription and termination); protein structure prediction; protein function prediction; protein family classification.

CBCl/AI MIT Gene expression problems: predict tissue morphology; predict treatment/drug effect; infer metabolic pathways; infer disease pathways; infer developmental pathways.

CBCl/AI MIT Microarray technology. Basic idea: the state of the cell is determined by proteins, and a gene codes for a protein which is assembled via mRNA. Measuring the amount of a particular mRNA therefore gives a measure of the amount of the corresponding protein; the number of mRNA copies is the expression level of a gene. Microarray technology allows us to measure the expression of thousands of genes at once (unlike a Northern blot, which measures one gene at a time). We measure the expression of thousands of genes under different experimental conditions and ask what is different and why.

CBCl/AI MIT Microarray technology: [figure] two platforms: cDNA arrays, in which Cy3- and Cy5-labeled cDNA from reference and test samples is hybridized to PCR products from a cDNA clone library, and oligonucleotide arrays, in which a PE-labeled test sample is hybridized to synthesized oligonucleotides; in both cases RNA from a biological sample is the starting material (Ramaswamy and Golub, JCO).

CBCl/AI MIT Microarray technology: [figure] oligonucleotide and cDNA array platforms (Lockhart and Winzeler, 2000).

CBCl/AI MIT Microarray experiment: [figure] a yeast microarray experiment.

CBCl/AI MIT Analytic challenge. When the science is not well understood, resort to statistics. Ultimate goal: discover the genetic pathways of cancers by inferring cancer genetics from microarray data measured on tumors. Immediate goal: build models that discriminate tumor types or treatment outcomes and determine the genes used in the model. Basic difficulty: few examples and high dimensionality, with 7,000-16,000 genes measured for each sample, make this an ill-posed problem; the curse of dimensionality means there are far too few examples for so many dimensions to predict accurately.

CBCl/AI MIT Cancer classification: 38 training examples of myeloid (AML) and lymphoblastic (ALL) leukemias on the Affymetrix human 6800 array (7,128 genes including control genes), with 34 held-out examples to test the classifier. Results: 33/34 correct. [figure] Test data plotted by d, the perpendicular distance from the separating hyperplane.
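As an illustration of this train/test setup (not the authors' original code), the following Python sketch uses scikit-learn with synthetic expression matrices standing in for the leukemia data; the sample and gene counts are taken from the slide, everything else is a placeholder.

```python
# Sketch of the leukemia-style train/test experiment on synthetic data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_train, n_test, n_genes = 38, 34, 7128           # sizes from the slide

# Expression matrices: rows = samples, columns = genes (synthetic placeholders).
X_train = rng.normal(size=(n_train, n_genes))
y_train = rng.choice([-1, 1], size=n_train)       # -1 = ALL, +1 = AML (arbitrary coding)
X_test = rng.normal(size=(n_test, n_genes))
y_test = rng.choice([-1, 1], size=n_test)

clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# Signed perpendicular distance of each test point from the hyperplane,
# i.e. the quantity d plotted on the slide.
d = clf.decision_function(X_test) / np.linalg.norm(clf.coef_)

print("test accuracy:", np.mean(clf.predict(X_test) == y_test))
print("first few distances d:", np.round(d[:5], 3))
```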

CBCl/AI MIT Coregulation and kernels. Coregulation: the expression of two genes must be correlated for a protein to be made, so we need to look at pairwise correlations as well as individual expression levels. Size of feature space: with 7,000 genes, the feature space of all pairwise products has about 24 million features, so the fact that the feature space is never computed explicitly is important. Two-gene example: two genes measuring Sonic Hedgehog and TrkC.
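A quick sanity check of that feature-space count, together with a degree-2 polynomial-kernel SVM that works in the pairwise-product space implicitly via the kernel trick; this is a generic scikit-learn sketch on synthetic data, not the lecture's code.

```python
# Degree-2 polynomial feature space for d genes: d linear terms plus d*(d+1)/2
# quadratic and pairwise terms; for d = 7,000 that is roughly 24.5 million features.
import numpy as np
from sklearn.svm import SVC

d = 7000
print(f"{d + d * (d + 1) // 2:,} features in the degree-2 feature space")

# The kernel K(x, z) = (gamma * <x, z> + 1)^2 evaluates inner products in that
# space without ever constructing it.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))                    # small synthetic example
y = rng.choice([-1, 1], size=40)
clf = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```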

CBCl/AI MIT Gene coregulation: a nonlinear SVM helps when the most informative genes are removed, where informativeness is ranked using signal-to-noise (Golub et al.). [table] Errors for 1st-, 2nd-, and 3rd-order polynomial kernels as a function of the number of genes removed.
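The signal-to-noise statistic of Golub et al. for gene j is the class-wise mean difference divided by the sum of the class-wise standard deviations. A minimal numpy sketch on synthetic data (shapes are assumptions):

```python
# Signal-to-noise (S2N) gene ranking, as in Golub et al.:
#   s2n_j = (mean_j(class +1) - mean_j(class -1)) / (std_j(class +1) + std_j(class -1))
import numpy as np

def signal_to_noise(X, y):
    """X: (samples, genes) expression matrix; y: labels in {-1, +1}."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0))

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 7128))          # synthetic stand-in for the leukemia data
y = rng.choice([-1, 1], size=72)

s2n = signal_to_noise(X, y)
ranked = np.argsort(-np.abs(s2n))        # most informative genes first
print("top 10 genes by |S2N|:", ranked[:10])
```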

CBCl/AI MIT Rejecting samples: using 50 genes, Golub et al. classified 29 test points correctly and rejected 5, of which 2 were errors; we need to introduce the concept of rejects into the SVM. [figure] Two genes g1 and g2 with Normal, Cancer, and Reject regions.

CBCl/AI MIT Rejecting samples

CBCl/AI MIT Estimating a CDF

CBCl/AI MIT The regularized solution

CBCl/AI MIT Rejections for SVMs: [figure] the estimated P(c=1 | d) plotted against 1/d, the inverse distance from the hyperplane; requiring 95% confidence (p = .05) corresponds to a threshold of d = .107.
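One simple way to turn the SVM output d into an estimate of P(c = 1 | d) is a Platt-style logistic fit on the decision values; the 95% rule then defines the reject region. This sketch shows that general idea on synthetic data and is not necessarily the exact estimator used in the lecture.

```python
# Reject test samples whose estimated class posterior never reaches 95%.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-1, 1, (30, 50)), rng.normal(1, 1, (30, 50))])
y_train = np.array([-1] * 30 + [1] * 30)
X_test = np.vstack([rng.normal(-1, 1, (17, 50)), rng.normal(1, 1, (17, 50))])

svm = SVC(kernel="linear").fit(X_train, y_train)
d_train = svm.decision_function(X_train).reshape(-1, 1)

# Platt-style sigmoid: a 1-D logistic regression mapping d to P(c = 1 | d).
platt = LogisticRegression().fit(d_train, y_train)

d_test = svm.decision_function(X_test).reshape(-1, 1)
p_pos = platt.predict_proba(d_test)[:, 1]

confidence = np.maximum(p_pos, 1 - p_pos)
reject = confidence < 0.95                  # the 95% confidence rule from the slide
print("rejected samples:", np.where(reject)[0])
print("accepted predictions:", np.where(p_pos[~reject] > 0.5, 1, -1))
```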

CBCl/AI MIT Results with rejections: 31 correct, 3 rejected of which 1 is an error. [figure] Test data plotted by distance d from the hyperplane.

CBCl/AI MIT Gene selection. SVMs as stated use all genes/features, but molecular biologists and oncologists are convinced that only a small subset of genes is responsible for particular biological properties, so they want the genes that are most important for discrimination. There are also practical reasons: a clinical device measuring thousands of genes is not financially practical. Gene selection may also improve performance. Approach: a wrapper method for gene/feature selection.

CBCl/AI MIT Results with gene selection. AML vs. ALL: with 40 genes, 34/34 correct, 0 rejects; with 5 genes, 31/31 correct, 3 rejects of which 1 is an error. B-cells vs. T-cells for ALL: with 10 genes, 33/33 correct, 0 rejects. [figure] Test data plotted by distance d from the hyperplane.

CBCl/AI MIT Two feature selection algorithms. Recursive feature elimination (RFE): based on perturbation analysis, eliminate the genes that perturb the margin the least. Optimize leave-one-out (LOO): based on optimizing a leave-one-out error bound for the SVM; the leave-one-out error is an almost unbiased estimate of generalization error.

CBCl/AI MIT Recursive feature elimination
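A minimal sketch of the RFE idea described above, using a linear SVM: train, rank genes by the magnitude of their weight (the genes with the smallest |w_j| perturb the margin the least), and drop a fraction of them each round. Synthetic data; not the authors' implementation.

```python
# Recursive feature elimination with a linear SVM (sketch).
import numpy as np
from sklearn.svm import SVC

def rfe_linear_svm(X, y, n_keep, drop_fraction=0.5):
    """Iteratively drop the genes with the smallest |w_j| until n_keep remain."""
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_keep:
        w = SVC(kernel="linear").fit(X[:, remaining], y).coef_.ravel()
        n_next = max(n_keep, int(len(remaining) * (1 - drop_fraction)))
        keep = np.argsort(-np.abs(w))[:n_next]     # keep the largest-weight genes
        remaining = remaining[keep]
    return remaining

rng = np.random.default_rng(0)
X = rng.normal(size=(38, 1000))                    # synthetic expression data
y = rng.choice([-1, 1], size=38)
print("selected genes:", rfe_linear_svm(X, y, n_keep=40)[:10], "...")
```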

CBCl/AI MIT Optimizing the LOO. Use leave-one-out (LOO) bounds for SVMs as a criterion to select features, by searching over all possible subsets of n features for the one that minimizes the bound. When such a search is impossible because of combinatorial explosion, scale each feature by a real-valued variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables. The rescaling can be done in the input space or in a “Principal Components” space.

CBCl/AI MIT Pictorial demonstration: rescale features to minimize the LOO bound R²/M². [figure] In the original feature space (x1, x2), the radius R of the smallest sphere enclosing the data is larger than the margin M, so R²/M² > 1; after rescaling, M = R and R²/M² = 1.

CBCl/AI MIT Three LOO bounds. Radius-margin bound: simple to compute and continuous; very loose, but often tracks the LOO error well. Jaakkola-Haussler bound: somewhat tighter and simple to compute; discontinuous, so it needs to be smoothed; valid only for SVMs with no b term. Span bound: tight as a Britney Spears outfit; complicated to compute; discontinuous, so it needs to be smoothed.
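As an illustration of the radius-margin bound (the simplest of the three), the sketch below computes R²/M² for a trained linear SVM, taking M = 1/||w|| and approximating the enclosing-sphere radius R by the largest distance to the data centroid; the true bound uses the exact minimal enclosing sphere in feature space, so treat this only as a rough proxy.

```python
# Approximate radius-margin quantity R^2 / M^2 for a linear SVM (sketch).
import numpy as np
from sklearn.svm import SVC

def radius_margin(X, y, C=100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)              # M = 1 / ||w||
    center = X.mean(axis=0)
    radius = np.max(np.linalg.norm(X - center, axis=1))   # crude stand-in for R
    return (radius / margin) ** 2

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 30)), rng.normal(1, 1, (20, 30))])
y = np.array([-1] * 20 + [1] * 20)
print("R^2 / M^2 ~", radius_margin(X, y))
```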

CBCl/AI MIT Classification function with scaling. We add a scaling parameter σ to the SVM which scales the genes; genes corresponding to small σ_j are removed. The SVM function has the form f(x) = Σ_i α_i y_i K(σ∘x, σ∘x_i) + b, where ∘ denotes elementwise multiplication of the scaling vector with the expression vectors.
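A sketch of how the scaling vector σ enters the decision function: each gene is multiplied by σ_j before the kernel is evaluated, so genes whose σ_j shrinks toward zero effectively drop out. The RBF kernel, the support vectors, and the coefficients below are placeholders, not values learned by the lecture's algorithm.

```python
# Decision function with per-gene scaling: f(x) = sum_i a_i y_i K(s*x, s*x_i) + b
import numpy as np

def rbf_kernel(A, B, gamma=0.01):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def scaled_decision_function(x, X_sv, alpha_y, b, sigma):
    """Evaluate f(x) with every gene rescaled elementwise by sigma."""
    K = rbf_kernel((x * sigma)[None, :], X_sv * sigma)     # K(s*x, s*x_i)
    return (K @ alpha_y)[0] + b

rng = np.random.default_rng(0)
X_sv = rng.normal(size=(10, 100))          # placeholder support vectors
alpha_y = rng.normal(size=10)              # placeholder alpha_i * y_i coefficients
sigma = rng.uniform(0, 1, size=100)        # per-gene scaling; small sigma_j ~ gene removed
x = rng.normal(size=100)

print("f(x) =", scaled_decision_function(x, X_sv, alpha_y, b=0.0, sigma=sigma))
print("genes kept (sigma_j > 0.9):", int((sigma > 0.9).sum()))
```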

CBCl/AI MIT SVM and other functionals

CBCl/AI MIT Algorithm

CBCl/AI MIT Computing gradients

CBCl/AI MIT Toy data: [figure] error rate vs. number of samples for a linear problem with 6 relevant dimensions out of 202 and a nonlinear problem with 2 relevant dimensions out of 52.

CBCl/AI MIT Molecular classification of cancer. Hierarchy of difficulty: 1. Histological differences: normal vs. malignant, skin vs. brain. 2. Morphologies: different leukemia types, ALL vs. AML. 3. Lineage: B-cell vs. T-cell, follicular vs. large B-cell lymphoma. 4. Outcome: treatment outcome, relapse, or drug sensitivity.

CBCl/AI MIT Morphology classification

CBCl/AI MIT Outcome classification

CBCl/AI MIT Outcome classification. Error rates ignore temporal information such as when a patient dies; survival analysis takes this temporal information into account. The Kaplan-Meier survival plots and statistics for the above predictions show significance. [figure] Kaplan-Meier curves and p-values for lymphoma and medulloblastoma.
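To make the survival-analysis point concrete, here is a self-contained numpy sketch of the Kaplan-Meier estimator applied to two predicted-outcome groups; the follow-up times and censoring indicators are invented for illustration and assume no tied event times.

```python
# Kaplan-Meier survival estimate for two predicted-outcome groups (sketch).
import numpy as np

def kaplan_meier(times, events):
    """times: follow-up times; events: 1 = death observed, 0 = censored."""
    order = np.argsort(times)
    times, events = times[order], events[order]
    at_risk, surv, curve = len(times), 1.0, []
    for t, e in zip(times, events):
        if e:                                   # a death at time t
            surv *= (at_risk - 1) / at_risk
        curve.append((float(t), round(surv, 2)))
        at_risk -= 1
    return curve

# Invented data: follow-up time in months and event indicator per patient.
good = kaplan_meier(np.array([12, 20, 31, 40, 52, 60]), np.array([0, 1, 0, 0, 0, 0]))
poor = kaplan_meier(np.array([3, 5, 8, 11, 14, 22]), np.array([1, 1, 1, 0, 1, 1]))

print("predicted good outcome:", good)
print("predicted poor outcome:", poor)
```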

CBCl/AI MIT Multi-tumor classification: [table] 14 tumor classes with total, training, and test sample counts per class: Breast (B), Prostate (P), Lung (L), Colorectal (CR), Lymphoma (Ly), Bladder (Bl), Melanoma (M), Uterus (U), Leukemia (Le), Renal (R), Pancreas (PA), Ovary (Ov), Mesothelioma (MS), Brain (C). Note that most of these tumors came from secondary sources and were not at the tissue of origin.

CBCl/AI MIT Clustering is not accurate: the CNS, lymphoma, and leukemia tumors separate, but the adenocarcinomas do not.

CBCl/AI MIT Multi-tumor classification. Combination approaches for building a multiclass classifier from binary SVMs: all pairs (one-versus-one) and one-versus-all (OVA).
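A sketch of the one-versus-all combination with scikit-learn: one binary SVM per tumor class, with the class whose classifier returns the largest decision value winning; the synthetic data, class sizes, and confidence score below are illustrative assumptions.

```python
# One-versus-all (OVA) multiclass SVM over 14 tumor classes (sketch).
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, n_genes = 14, 1000
X_train = rng.normal(size=(140, n_genes))          # 10 synthetic samples per class
y_train = np.repeat(np.arange(n_classes), 10)
X_test = rng.normal(size=(54, n_genes))

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X_train, y_train)

# Each row holds 14 decision values; the winner is the argmax, and the gap
# between the top two values can serve as a prediction-confidence score.
scores = ova.decision_function(X_test)
pred = scores.argmax(axis=1)
top2 = np.sort(scores, axis=1)[:, -2:]
print("predictions:", pred[:10])
print("confidence (top-1 minus top-2):", np.round(top2[:, 1] - top2[:, 0], 3)[:5])
```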

CBCl/AI MIT Supervised methodology

CBCl/AI MIT Well differentiated tumors

CBCl/AI MIT Feature selection hurts performance

CBCl/AI MIT Poorly differentiated tumors: [table] a test set of 20 poorly differentiated samples evaluated by train/test validation gave 30% total accuracy; the table breaks accuracy and the fraction of calls down by high vs. low prediction confidence and by first / top-2 / top-3 prediction calls.

CBCl/AI MIT Predicting sample sizes. For a given number of existing samples, how significant is the performance of a classifier? Does the classifier work at all, i.e., are the results better than what one would expect by chance? Assuming we know the answers to these questions for a number of sample sizes, what would the performance of the classifier be when trained with additional samples? Will its accuracy improve significantly, and is the effort to collect additional samples worthwhile?

CBCl/AI MIT Predicting sample sizes
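One common way to extrapolate classifier performance to larger cohorts is to fit an inverse power law e(n) = a·n^(-α) + b to the error rates measured at the available sample sizes and read off the predicted error for larger n. The sketch below uses scipy, and the functional form and data points are assumptions for illustration, not the numbers from the lecture.

```python
# Fit an inverse power-law learning curve and extrapolate to larger sample sizes.
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, alpha, b):
    return a * n ** (-alpha) + b

# Hypothetical (sample size, test error) pairs from small pilot experiments.
n_obs = np.array([10, 15, 20, 25, 30], dtype=float)
err_obs = np.array([0.38, 0.31, 0.27, 0.25, 0.23])

params, _ = curve_fit(learning_curve, n_obs, err_obs, p0=[1.0, 0.5, 0.1], maxfev=10000)
for n in (50, 100, 200):
    print(f"predicted error with {n} samples: {learning_curve(n, *params):.3f}")
```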

CBCl/AI MIT Statistical significance: the statistical significance of the tumor vs. non-tumor classification for a) 15 samples and b) 30 samples.
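The "better than chance?" question is typically answered with a permutation test: repeat the cross-validated accuracy estimate many times with shuffled labels and see how often chance does as well as the real labels. A minimal sketch, with synthetic data standing in for the 15- or 30-sample tumor vs. non-tumor sets:

```python
# Permutation test for the significance of a classifier's cross-validated accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_genes = 500
X = np.vstack([rng.normal(-0.3, 1, (15, n_genes)), rng.normal(0.3, 1, (15, n_genes))])
y = np.array([0] * 15 + [1] * 15)                    # tumor vs. non-tumor labels

clf = SVC(kernel="linear")
observed = cross_val_score(clf, X, y, cv=5).mean()

n_perm = 200
null_scores = np.empty(n_perm)
for i in range(n_perm):
    y_perm = rng.permutation(y)                      # break the gene-label association
    null_scores[i] = cross_val_score(clf, X, y_perm, cv=5).mean()

p_value = (np.sum(null_scores >= observed) + 1) / (n_perm + 1)
print(f"observed accuracy: {observed:.2f}, permutation p-value: {p_value:.3f}")
```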

CBCl/AI MIT Learning curves Tumor vs. normal

CBCl/AI MIT Learning curves Tumor vs. normal

CBCl/AI MIT Learning curves AML vs. ALL

CBCl/AI MIT Learning curves Colon cancer

CBCl/AI MIT Learning curves Brain treatment outcome