1
CBCl/AI MIT
Class 19: Bioinformatics
S. Mukherjee, R. Rifkin, G. Yeo, and T. Poggio
2
What is bioinformatics?
Pre-1995: the application of computing technology to provide statistical and database solutions to problems in molecular biology.
Post-1995: defining and addressing problems in molecular biology using methodologies from statistics and computer science. Examples: the genome project, genome-wide analysis and screening of disease, genetic regulatory networks, and analysis of expression data.
3
Central dogma of biology
DNA → RNA (pre-mRNA → mRNA) → Protein
4
Basic molecular biology
DNA: CGAACAAACCTCGAACCTGCT
(transcription)
mRNA: GCU UGU UUA CGA
(translation)
Polypeptide: Ala Cys Leu Arg
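The two steps on this slide can be sketched in a few lines of Python. This is only an illustration: the codon table below is partial (it holds just the four codons shown on the slide, using the standard genetic code), and the function names `transcribe` and `translate` are made up for this sketch.

```python
# Partial codon table: only the four codons that appear on the slide.
CODON_TABLE = {"GCU": "Ala", "UGU": "Cys", "UUA": "Leu", "CGA": "Arg"}

# Base pairing used when RNA is synthesized from a DNA template strand.
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_dna):
    """Return the RNA complementary to a DNA template strand, base by base."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_dna)


def translate(mrna):
    """Translate an mRNA string into amino-acid names, one codon at a time."""
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[codon] for codon in codons]


# The mRNA from the slide, written without spaces.
polypeptide = translate("GCUUGUUUACGA")
```

A full implementation would need all 64 codons, including start and stop signals, which this sketch leaves out.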
5
Less basic molecular biology
Transcription → end modification → splicing → transport → translation
6
Biological information
Sequence information: DNA → RNA (pre-mRNA → mRNA) → Protein, through transcription, splicing, translation, stability, transport, and localization.
Quantitative information: microarray, RT-PCR, protein arrays, yeast two-hybrid, chemical screens.
7
Sequence problems
Splice sites and branch points in eukaryotic pre-mRNA
Gene finding in prokaryotes and eukaryotes
Promoter recognition (transcription and termination)
Protein structure prediction
Protein function prediction
Protein family classification
8
Gene expression problems
Predict tissue morphology
Predict treatment/drug effect
Infer metabolic pathway
Infer disease pathways
Infer developmental pathway
9
Microarray technology
Basic idea: the state of the cell is determined by proteins. A gene codes for a protein, which is assembled via mRNA. Measuring the amount of a particular mRNA gives a measure of the amount of the corresponding protein. The number of copies of mRNA is the expression of a gene.
Microarray technology allows us to measure the expression of thousands of genes at once (compare the Northern blot).
Measure the expression of thousands of genes under different experimental conditions and ask what is different and why.
10
Microarray technology
[Figure: microarray workflow. Labels: Cy3, Cy5, reference, test sample, cDNA clone (library), PCR product, PE, oligonucleotide synthesis, biological sample, RNA, array. Source: Ramaswamy and Golub, JCO.]
11
Microarray technology
[Figure: oligonucleotide and cDNA arrays. Source: Lockhart and Winzeler 2000.]
12
Microarray experiment
[Figure: yeast experiment.]
13
Analytic challenge
When the science is not well understood, resort to statistics.
Ultimate goal: discover the genetic pathways of cancers.
Infer cancer genetics by analyzing microarray data from tumors.
Curse of dimensionality: far too few examples for so many dimensions to predict accurately.
Immediate goal: models that discriminate tumor types or treatment outcomes, and determining the genes used in the model.
Basic difficulty: few examples (20-100) and high dimensionality (7,000-16,000 genes measured for each sample), which makes this an ill-posed problem.
14
Cancer classification
Training data: 38 examples of myeloid and lymphoblastic leukemias, measured on the Affymetrix human 6800 array (7128 genes including control genes).
Test data: 34 examples to test the classifier.
Results: 33/34 correct.
[Figure: test data plotted by d, the perpendicular distance from the hyperplane.]
15
Coregulation and kernels
Coregulation: the expression of two genes must be correlated for a protein to be made, so we need to look at pairwise correlations as well as individual expression.
Size of feature space: with 7,000 genes, the feature space has about 24 million features, so it is important that the feature space is never computed explicitly.
Two-gene example: two genes measuring Sonic Hedgehog and TrkC.
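To make the feature-space argument concrete, the sketch below counts the gene pairs for 7,000 genes and checks, on a three-gene toy vector, that a degree-2 polynomial kernel gives the same value as an explicit map into single-gene and pairwise features. The kernel form (1 + x·z)^2 is an assumption for illustration; the slides do not state which polynomial kernel was used.

```python
import math


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed from a single dot product."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** 2


def poly2_features(x):
    """Explicit feature map whose inner product equals poly2_kernel."""
    n = len(x)
    feats = [1.0]
    feats += [math.sqrt(2.0) * xi for xi in x]   # individual genes
    feats += [xi * xi for xi in x]               # squared terms
    feats += [math.sqrt(2.0) * x[i] * x[j]       # pairwise products
              for i in range(n) for j in range(i + 1, n)]
    return feats


n_genes = 7000
n_pairs = n_genes * (n_genes - 1) // 2  # distinct gene pairs

# Toy check with three "genes": both routes give the same inner product.
x = [0.5, -1.0, 2.0]
z = [1.5, 0.25, -0.5]
explicit = sum(a * b for a, b in zip(poly2_features(x), poly2_features(z)))
implicit = poly2_kernel(x, z)
```

The kernel route costs one dot product over 7,000 values, while the explicit route would require building vectors with roughly 24 million entries per sample.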
16
Gene coregulation
Nonlinear SVM helps when the most informative genes are removed. Informativeness is ranked using signal to noise (Golub et al.).
Errors by polynomial order:
Genes removed   1st order   2nd order   3rd order
0               1           1           1
10              2           1           1
20              3           2           1
30              3           3           2
40              3           3           2
50              3           2           2
100             3           3           2
200             3           3           3
1500            7           7           8
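The signal-to-noise ranking mentioned above can be sketched as follows, using the statistic from Golub et al.: the difference of the class means divided by the sum of the class standard deviations. The use of the sample standard deviation and the helper names are assumptions of this sketch, and the expression values are invented toy numbers.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """(mean_a - mean_b) / (std_a + std_b) for one gene.

    Undefined (division by zero) if the gene is constant in both classes.
    """
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))


def rank_genes(expr_a, expr_b):
    """Return gene indices ordered from most to least informative.

    expr_a[g] and expr_b[g] hold the expression values of gene g in the
    samples of class A and class B.
    """
    scores = [signal_to_noise(a, b) for a, b in zip(expr_a, expr_b)]
    return sorted(range(len(scores)), key=lambda g: abs(scores[g]), reverse=True)


# Toy data: three genes, three samples per class (made-up values).
class_a = [[10, 11, 12], [5, 6, 7], [1, 3, 5]]
class_b = [[1, 2, 3], [5, 7, 6], [2, 6, 4]]
ranking = rank_genes(class_a, class_b)
```

Removing the top k genes of such a ranking before training is the kind of experiment the table summarizes.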
17
Rejecting samples
Golub et al. classified 29 test points correctly and rejected 5, of which 2 were errors, using 50 genes.
We need to introduce the concept of rejects to the SVM.
[Figure: axes g1 and g2, with Normal, Cancer, and Reject regions.]
18
Rejecting samples
19
Estimating a CDF
20
The regularized solution
21
Rejections for SVMs
[Figure: P(c=1 | d) versus 1/d, with the .95 level marked.]
95% confidence, or p = .05, corresponds to d = .107.
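A rejection rule of this kind can be sketched as a threshold on the distance d: samples that fall too close to the hyperplane are not called. The threshold below is the d = .107 value quoted on the slide; the function name and the example distances are hypothetical.

```python
REJECT_THRESHOLD = 0.107  # value of d quoted on the slide for 95% confidence


def classify_with_reject(d, threshold=REJECT_THRESHOLD):
    """Return +1 or -1 when |d| reaches the threshold, otherwise None (reject)."""
    if abs(d) < threshold:
        return None
    return 1 if d > 0 else -1


# Made-up distances for four test samples.
calls = [classify_with_reject(d) for d in (0.8, -0.05, -0.3, 0.1)]
```

Raising the threshold trades more rejected samples for fewer errors among the samples that are called.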
22
Results with rejections
Results: 31 correct, 3 rejected, of which 1 is an error.
[Figure: test data plotted by d.]
23
Gene selection
SVMs as stated use all genes/features.
Molecular biologists and oncologists seem to be convinced that only a small subset of genes is responsible for particular biological properties, so they want the genes that are most important in discriminating.
Practical reasons: a clinical device with thousands of genes is not financially practical.
Possible performance improvement.
Wrapper method for gene/feature selection.
24
Results with gene selection
AML vs. ALL: with 40 genes, 34/34 correct, 0 rejects. With 5 genes, 31/31 correct, 3 rejects, of which 1 is an error.
B vs. T cells for AML: with 10 genes, 33/33 correct, 0 rejects.
[Figures: test data plotted by d.]
25
Two feature selection algorithms
Recursive feature elimination (RFE): based on perturbation analysis; eliminate the genes that perturb the margin the least.
Optimize leave-one-out (LOO): based on optimizing the leave-one-out error of an SVM; the leave-one-out error is an (almost) unbiased estimate of the generalization error.
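A minimal sketch of recursive feature elimination for a linear classifier follows. For a linear SVM, a common choice is to drop the feature with the smallest squared weight, since removing it changes the margin the least. Two things here are assumptions of the sketch and not taken from the slides: the trainer is a plain subgradient-descent stand-in for a real SVM solver (no bias term), and one feature is removed per round.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularized hinge loss.

    A deliberately simple stand-in for a real SVM solver; no bias term.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:  # this point violates the margin
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def rfe(X, y, n_keep):
    """Retrain and drop the feature with the smallest squared weight
    until only n_keep features remain. Returns the surviving indices."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        w = train_linear_svm(X_active, y)
        weakest = min(range(len(active)), key=lambda k: w[k] ** 2)
        del active[weakest]
    return active


# Toy data: only feature 0 separates the classes; features 1 and 2 are noise.
X = [[2.0, 0.1, -0.2], [1.5, -0.1, 0.1], [-2.0, 0.2, 0.1], [-1.5, -0.2, -0.1]]
y = [1, 1, -1, -1]
selected = rfe(X, y, n_keep=1)
```

With thousands of genes it is usual to remove a fraction of the features per round instead of one at a time, which this sketch does not do.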
26
Recursive feature elimination
27
Optimizing the LOO
Use leave-one-out (LOO) bounds for SVMs as a criterion to select features, by searching over all possible subsets of n features for the subset that minimizes the bound.
When such a search is impossible because of combinatorial explosion, scale each feature by a real-valued variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables.
The rescaling can be done in the input space or in a “Principal Components” space.
28
Pictorial demonstration
Rescale features to minimize the LOO bound R^2/M^2.
[Figure: two panels. First panel: axes x1 and x2, with radius R and margin M marked, R^2/M^2 > 1. Second panel: axis x2 only, R^2/M^2 = 1, M = R.]
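The effect in the picture can be reproduced on a four-point toy problem. The sketch below makes several simplifying assumptions that are not from the slides: the separating direction w is given instead of learned, the hyperplane passes through the origin, and R is approximated by the largest distance from the centroid (exact for this symmetric example, but not the smallest enclosing sphere in general).

```python
import math


def radius_margin_ratio(points, labels, w, scale):
    """Return R^2 / M^2 after multiplying each feature by its scale factor."""
    scaled = [[s * v for s, v in zip(scale, p)] for p in points]
    dim = len(scale)
    centroid = [sum(p[k] for p in scaled) / len(scaled) for k in range(dim)]
    # Squared radius: largest squared distance from the centroid.
    r_squared = max(sum((p[k] - centroid[k]) ** 2 for k in range(dim))
                    for p in scaled)
    # Margin: smallest signed distance of a point from the hyperplane w.x = 0.
    w_norm = math.sqrt(sum(wk * wk for wk in w))
    margin = min(y * sum(wk * pk for wk, pk in zip(w, p)) / w_norm
                 for p, y in zip(scaled, labels))
    return r_squared / margin ** 2


# x2 separates the classes; x1 only spreads the points out.
points = [[3.0, 1.0], [-3.0, 1.0], [3.0, -1.0], [-3.0, -1.0]]
labels = [1, 1, -1, -1]
w = [0.0, 1.0]

before = radius_margin_ratio(points, labels, w, scale=[1.0, 1.0])
after = radius_margin_ratio(points, labels, w, scale=[0.0, 1.0])
```

Scaling the uninformative feature x1 to zero shrinks R while leaving M unchanged, so the ratio drops to 1, matching the M = R case in the second panel.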
29
Three LOO bounds
Radius-margin bound: simple to compute and continuous; very loose, but often tracks LOO well.
Jaakkola-Haussler bound: somewhat tighter and simple to compute; discontinuous, so it needs smoothing; valid only for SVMs with no b term.
Span bound: tight; complicated to compute; discontinuous, so it needs smoothing.
30
Classification function with scaling
We add a scaling parameter σ to the SVM, which scales the genes; genes corresponding to small σ_j are removed. The SVM function has the form:
[Equation: the SVM classification function with scaled inputs.]
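The formula itself did not survive in this transcript. One plausible reading, offered here only as an assumption, is that each input is multiplied elementwise by the scaling vector before the kernel is applied, so that f(x) = sum_i alpha_i y_i K(sigma * x_i, sigma * x) + b. The sketch below uses a linear kernel for brevity; all names and numbers are hypothetical.

```python
def scaled_linear_kernel(x, z, sigma):
    """Linear kernel on inputs multiplied elementwise by sigma."""
    return sum((s * a) * (s * b) for s, a, b in zip(sigma, x, z))


def decision_function(x, support_vectors, alphas, labels, b, sigma):
    """f(x) = sum_i alpha_i * y_i * K(sigma * x_i, sigma * x) + b."""
    return b + sum(
        alpha * y * scaled_linear_kernel(sv, x, sigma)
        for alpha, y, sv in zip(alphas, labels, support_vectors)
    )


# Made-up model with two support vectors and two genes.
support_vectors = [[1.0, 5.0], [-1.0, 5.0]]
alphas = [0.5, 0.5]
labels = [1, -1]
sigma = [1.0, 0.0]  # the second gene is scaled to zero, i.e. removed

f_first = decision_function([2.0, 100.0], support_vectors, alphas, labels, 0.0, sigma)
f_second = decision_function([2.0, -7.0], support_vectors, alphas, labels, 0.0, sigma)
```

Because the second scaling value is zero, the two inputs differ only in a removed gene and receive the same output.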
31
SVM and other functionals
32
Algorithm
33
Computing gradients
34
Toy data
Linear problem with 6 relevant dimensions of 202.
Nonlinear problem with 2 relevant dimensions of 52.
[Figures: error rate versus number of samples.]
35
Molecular classification of cancer
Hierarchy of difficulty:
1. Histological differences: normal vs. malignant, skin vs. brain
2. Morphologies: different leukemia types, ALL vs. AML
3. Lineage: B-cell vs. T-cell, follicular vs. large B-cell lymphoma
4. Outcome: treatment outcome, relapse, or drug sensitivity
36
Morphology classification
37
Outcome classification
38
Outcome classification
Error rates ignore temporal information such as when a patient dies. Survival analysis takes temporal information into account. The Kaplan-Meier survival plots and statistics for the above predictions show significance.
[Figure: Kaplan-Meier plots for lymphoma and medulloblastoma, with p-val = 0.0015 and p-val = 0.00039.]
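For readers unfamiliar with Kaplan-Meier curves, the estimator can be sketched in a few lines: at each observed event time the survival probability is multiplied by the fraction of at-risk patients who survive that time, and censored patients simply leave the at-risk set. The follow-up data below are invented, and the significance test behind the quoted p-values is not shown on the slide and is not reproduced here.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with an observed event.

    times:  follow-up time for each patient
    events: 1 if the event (for example death) was observed, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Four made-up patients; the second one is censored at time 2.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

In an outcome-classification setting, one such curve would be computed for each predicted group and the two curves compared.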
39
Multi tumor classification
Tumor type     Abbrev   Total   Train   Test
Breast         B        11      8       3
Prostate       P        10      8       2
Lung           L        11      8       3
Colorectal     CR       13      8       5
Lymphoma       Ly       22      16      6
Bladder        Bl       11      8       3
Melanoma       M        10      8       2
Uterus         U        11      8       3
Leukemia       Le       30      24      6
Renal          R        11      8       3
Pancreas       PA       11      8       3
Ovary          Ov       11      8       3
Mesothelioma   MS       11      8       3
Brain          C        20      16      4
Note that most of these tumors came from secondary sources and were not at the tissue of origin.
40
Clustering is not accurate
CNS, lymphoma, and leukemia tumors separate.
Adenocarcinomas do not separate.
41
Multi tumor classification
Combination approaches:
All pairs
One versus all (OVA)
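The two combination schemes can be sketched as follows. The decision rules shown (largest output wins for OVA, majority vote for all pairs) are the usual choices and are assumed here; the slides do not spell them out. The scores and the pairwise winner function are hypothetical placeholders for trained binary classifiers.

```python
from itertools import combinations


def ova_predict(scores):
    """scores[c] is the real-valued output of the 'class c vs. rest' classifier.
    Predict the class with the largest output."""
    return max(scores, key=scores.get)


def all_pairs_predict(pairwise_winner, classes):
    """pairwise_winner(a, b) returns whichever of a, b the pair classifier picks.
    Predict the class with the most votes (ties go to the earliest class)."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[pairwise_winner(a, b)] += 1
    return max(votes, key=votes.get)


# Number of binary classifiers each scheme needs for 14 tumor classes.
n_classes = 14
n_ova = n_classes
n_all_pairs = n_classes * (n_classes - 1) // 2

# Toy usage with three classes and made-up outputs.
ova_call = ova_predict({"B": -0.4, "Ly": 1.2, "Le": 0.3})
pair_call = all_pairs_predict(
    lambda a, b: "Ly" if "Ly" in (a, b) else a, ["B", "Ly", "Le"]
)
```

All pairs trains many more classifiers, but each one sees only the samples of two classes.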
42
Supervised methodology
43
Well differentiated tumors
44
Feature selection hurts performance
45
Poorly differentiated tumors
[Figure: accuracy and fraction of calls by confidence (low to high), with correct calls and errors marked; accuracy for first, top 2, and top 3 prediction calls.]
Dataset: test. Sample type: poorly differentiated. Validation method: train/test. Sample number: 20. Total accuracy: 30%.
High confidence: fraction 50%, accuracy 50%.
Low confidence: fraction 50%, accuracy 10%.
46
Predicting sample sizes
For a given number of existing samples, how significant is the performance of a classifier?
Does the classifier work at all? Are the results better than what one would expect by chance?
Assuming we know the answers to the previous questions for a number of sample sizes, what would be the performance of the classifier when trained with additional samples?
Will the accuracy of the classifier improve significantly?
Is the effort to collect additional samples worthwhile?
47
Predicting sample sizes
48
Statistical significance
The statistical significance in the tumor vs. non-tumor classification for (a) 15 samples and (b) 30 samples.
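One common way to ask whether a classifier beats chance at a given sample size is a permutation test: shuffle the labels many times and count how often the shuffled data reach the accuracy observed with the true labels. The slides do not say which test was used, so the sketch below is an assumption, and it swaps in a nearest-centroid classifier on one-dimensional toy data in place of an SVM on expression profiles.

```python
import random


def loo_accuracy(values, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier on 1-D data."""
    correct = 0
    for i in range(len(values)):
        rest = [(v, l) for j, (v, l) in enumerate(zip(values, labels)) if j != i]
        centroids = {}
        for cls in set(l for _, l in rest):
            members = [v for v, l in rest if l == cls]
            centroids[cls] = sum(members) / len(members)
        predicted = min(centroids, key=lambda c: abs(values[i] - centroids[c]))
        correct += int(predicted == labels[i])
    return correct / len(values)


def permutation_p_value(values, labels, n_perm=200, seed=0):
    """Return (observed accuracy, permutation p-value)."""
    rng = random.Random(seed)
    observed = loo_accuracy(values, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if loo_accuracy(values, shuffled) >= observed:
            at_least_as_good += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (at_least_as_good + 1) / (n_perm + 1)


# Made-up, well-separated data: five samples per class.
values = [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]
labels = [0] * 5 + [1] * 5
observed, p_value = permutation_p_value(values, labels)
```

Repeating this at several sample sizes gives the kind of significance-by-sample-size comparison the slide describes.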
49
Learning curves: tumor vs. normal
50
Learning curves: tumor vs. normal
51
Learning curves: AML vs. ALL
52
Learning curves: colon cancer
53
Learning curves: brain treatment outcome