1
Computational Biology Algorithmic Techniques & Medical Applications CSE 590YA August 15, 2001
2
2 Outline Overview Biology Technology Algorithms & Applications Low tech: String algorithms High tech: Class discovery/prediction Treatments & clinical outcomes Conclusions
3
3 Overview Human Genome Project Why is it important? Sequence functionality Prevention & treatment of disease Where is there computation in it? Lab hardware/software Analysis: assembly, element discovery Could not accomplish w/o computers
4
4 Bigger Picture Biology of the (not so) past Isolated Low level (one X at a time) Slow accumulation of knowledge Biology of the present Global High level (organismal/theoretical) Rapid accumulation of knowledge Rapid generation of open questions
5
5 Example: S. cerevisiae (yeast) Yeast: before expression arrays Model organism for experiments Easy to grow, modify, and study Genetics similar to higher organisms Yeast: after expression arrays Immensely more useful Now know most gene functions New results every month that used to take five years Results are directly applicable to higher organisms
6
6 A good beginning … The genome is not the end Code to be deciphered Human road map Greater need for computational tools and power Example: dbSNP Data exists Need help finding and relating it all
7
7 Computers – not just for analysis Role reversal Before: Biologists generate data, computers analyze it Now: Computers generate experiments, biologists perform them Cycle New future for CMBists Biotech has greatest opportunity for real science to be done, and CS is crucial!
8
8 CB is good for CS Old research revisited and applied Clustering Expired in the 70s, reborn 3 years ago New papers → reacceptance as research topic Data mining, web statistics, e-commerce Machine learning Well-studied over the past couple decades New needs in CB → new research on tuning
9
9 Outline Overview Biology Technology Algorithms & Applications Low tech: String algorithms High tech: Class discovery/prediction Treatments & clinical outcomes Conclusions
10
10 Biochemistry 101 Cells Basic building blocks of life * Proteins Key to functionality Catalyze reactions * Store and release energy Build cells and cell components Process-specific, yet resource-efficient
11
11 The genetics of proteins DNA Four-base alphabet * Genes are instructions for building proteins Cell cycle * Extensive regulatory mechanism Construct proteins at right time and place Break down proteins and reuse components Incredibly complex series of steps
12
12 Transcription & translation DNA → RNA Transcription factors * RNA polymerase RNA → protein Translation at ribosome * Amino acid chains Protein degradation
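The two steps on this slide can be sketched as string operations. This is a minimal illustration, not a biological model: the codon table below is deliberately partial (the full genetic code has 64 codons), the function names are hypothetical, and unknown codons are rendered as "?".

```python
# Partial codon table for illustration only; the real table has 64 entries.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GGC": "G", "AAA": "K", "GCU": "A",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(coding_strand):
    """DNA -> RNA: copy the coding strand, writing U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """RNA -> protein: read three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "?")  # "?" = not in partial table
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)
```

For example, `translate(transcribe("ATGTTTGGCTAA"))` reads the codons AUG, UUU, GGC and then stops at UAA.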
13
13 Outline Overview Biology Technology Algorithms & Applications Low tech: String algorithms High tech: Class discovery/prediction Treatments & clinical outcomes Conclusions
14
14 Technology DNA microarrays Consensus RNAs adhered to slide Test and control cDNAs produced * Fluorescently labeled Hybridized with RNAs on slide Scan fluorescence with computer Results: how much RNA present! * What does this signify?
15
15 Example uses Timepoints in the cell cycle Which genes are always “on”? Which genes are responsible for certain events in the cycle? Differential expression in experiment Which genes are responsible for a particular cell response? What is the response pattern over time?
16
16 Outline Overview Biology Technology Algorithms & Applications Low tech: String algorithms High tech: Class discovery/prediction Treatments & clinical outcomes Conclusions
17
17 “Low tech” algorithms 90s: DNA is just a bunch of strings Questions became answerable! Are there gross similarities in the genome? What do they imply? Are there smaller recurring elements in the genome? What is their function? I know what Gene A does. Can I use that to figure out what Gene B does?
18
18 String and sequence matching String matching Find exact replicas of DNA sequence elsewhere in the genome Are they statistically unlikely? Sequence matching Regions of DNA that look similar: allows for evolution Also applied to proteins In reality, sequences are more important
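The difference between the two kinds of matching can be sketched in a few lines. This is a simplified illustration under an assumption: "similar" here means a bounded number of substitutions (Hamming distance), whereas real sequence-matching tools also handle insertions and deletions. Function names are hypothetical.

```python
def exact_matches(genome, pattern):
    """String matching: every start index where pattern occurs exactly."""
    hits = []
    start = genome.find(pattern)
    while start != -1:
        hits.append(start)
        start = genome.find(pattern, start + 1)  # step by 1 to allow overlaps
    return hits


def approximate_matches(genome, pattern, max_mismatches):
    """Sequence matching (simplified): allow up to max_mismatches substitutions."""
    hits = []
    for i in range(len(genome) - len(pattern) + 1):
        window = genome[i:i + len(pattern)]
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits
```

On the toy genome `"ACGTACGAACGT"`, the pattern `"ACGT"` has exact copies at positions 0 and 8, and allowing one substitution also picks up the diverged copy `"ACGA"` at position 4.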
19
19 Computer tools Biological questions could be answered better by a computer than by a biologist GenBank, FASTA, BLAST, GAP Not trivial developments, even for CS Required novel approaches to NP-hard problems Web proliferation (ongoing) www.cs.jhu.edu/~salzberg/appendixa.html
20
20 High tech: expression arrays Use active gene data to classify a cell Example: Cancer type prediction Subtypes appear very similar histologically Very different clinical courses Diagnoses: biologists’ insight rather than systematic/unbiased approaches
21
21 Classifying cancer ALL vs. AML Two kinds of leukemia (only recently separated) Must be treated very differently Distinguishable in clinic, but not 100% reliable Golub (1999) Goal: Determine cancer type by overall gene expression; build an automated classifier By-product: One of earliest quantitative uses of DNA microarrays
22
22 Strategy Get expression data for 6800 genes from 27 ALL and 11 AML patients Feature selection: Find genes with expression levels that are strongly correlated with the ALL-AML class distinction Give each such gene a weighted predictive vote for its class Let important genes vote on test cases
23
23 Determining correlation w/ class Idealized expression patterns Neighborhood analysis * Correlation metric Euclidean distance, regression, TNOM Significance Q: Is gene more highly correlated with IEP than would be expected by chance? A: Examine correlation w/ random IEP permutations Results: 1100 genes more highly correlated with ALL-AML class distinction than expected by chance
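The significance question on this slide can be sketched as a permutation test. The slide lists several possible correlation metrics; the sketch below assumes a signal-to-noise score (difference of class means divided by the sum of class standard deviations), and the function names, the permutation count, and the seed are illustrative choices rather than anything stated in the slides.

```python
import math
import random
import statistics


def signal_to_noise(values, labels):
    """(mean_ALL - mean_AML) / (sd_ALL + sd_AML) for one gene."""
    all_vals = [v for v, c in zip(values, labels) if c == "ALL"]
    aml_vals = [v for v, c in zip(values, labels) if c == "AML"]
    diff = statistics.mean(all_vals) - statistics.mean(aml_vals)
    spread = statistics.pstdev(all_vals) + statistics.pstdev(aml_vals)
    if spread == 0:
        # No within-class variation: perfectly flat or perfectly separated.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / spread


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Fraction of random label shuffles scoring at least as high as the real labels."""
    rng = random.Random(seed)
    observed = abs(signal_to_noise(values, labels))
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(signal_to_noise(values, shuffled)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations
```

A gene whose score is rarely matched under shuffled labels (a small returned fraction) is more correlated with the class distinction than chance would explain.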
24
24 Making a class predictor Subset of informative genes will elect the class of a new sample Each casts weighted vote for its class: * Expression level of gene in test sample Original correlation of gene w/ class distinction Prediction strength (PS) Margin of victory after all genes vote If less than threshold, then uncertain
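A weighted vote of this kind might look like the following sketch. Each gene's vote is its correlation weight times how far the test sample's expression sits from a per-gene midpoint; the midpoint field, the dictionary layout, and the 0.3 default threshold are assumptions added here for illustration. A positive weight is taken to mean "high expression favors ALL".

```python
def predict_class(sample, genes, threshold=0.3):
    """Return (predicted class or 'uncertain', prediction strength)."""
    votes_all = 0.0
    votes_aml = 0.0
    for gene in genes:
        # Vote = correlation weight * distance of expression from the midpoint.
        vote = gene["weight"] * (sample[gene["name"]] - gene["midpoint"])
        if vote > 0:
            votes_all += vote
        else:
            votes_aml += -vote
    total = votes_all + votes_aml
    if total == 0:
        return "uncertain", 0.0
    # Prediction strength: margin of victory relative to all votes cast.
    strength = abs(votes_all - votes_aml) / total
    if strength < threshold:
        return "uncertain", strength
    return ("ALL" if votes_all > votes_aml else "AML"), strength
```

When all genes agree, the prediction strength is 1.0; when the two sides cancel out, it falls toward 0 and the sample is reported as uncertain.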
25
25 Validation of the model (a) Initial data set: cross-validation For each patient sample: Build a classifier without it (i.e. w/ 37 others) Predict class of left-out sample Calculate cumulative error rate Results Used top 50 genes 36/38 samples classified correctly, 2 uncertain
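The leave-one-out loop described above can be sketched generically. The nearest-centroid classifier used to exercise it is a stand-in chosen for brevity, not the weighted-voting predictor from the slides, and all function names are hypothetical.

```python
def leave_one_out_accuracy(samples, labels, train, predict):
    """Hold out each sample in turn, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)  # built without sample i
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def train_centroids(samples, labels):
    """Stand-in classifier: one mean expression vector per class."""
    centroids = {}
    for label in set(labels):
        members = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict_nearest(centroids, sample):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(center, sample))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))
```

Because the held-out sample never influences the model that classifies it, the resulting accuracy is a less optimistic estimate than scoring on the training data.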
26
26 Validation of the model (b) Independent data set: test validation 34 samples from diverse tissues 29/34 “strong” predictions; 100% accuracy PS values quite high for both: .77 in cross-validation; .73 in independent Mean PS lower for samples from one particular laboratory: importance of standardization in clinical setting
27
27 Further results of clinical importance Voting gene sets of 10 to 200 genes had the same accuracy Voter gene function: not just lineage markers Surface receptors, anti-apoptotic agents, cell cycle regulators, DNA manipulators, known oncogenes These genes provide insight into cancer causes New biological knowledge as a result of computational methods! Other applications of class prediction & feature selection Response to chemotherapy Eventual outcome of disease
28
28 Other array-based classifiers (a) k-means clustering Select “high-scoring” features like before Pick k points as initial cluster centroids Add each new data point to nearest cluster Move that cluster centroid to new mean Use these centroids to classify test cases
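The steps on this slide describe a sequential (one-pass) form of k-means, which can be sketched as follows. Using the first k points as the initial centroids is an assumption made here for determinism; the slide only says to pick k points. Function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(centroids, point):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(centroids[j], point))


def sequential_kmeans(points, k):
    """Seed with the first k points, then fold each later point into its nearest cluster."""
    centroids = [list(p) for p in points[:k]]
    counts = [1] * k
    for point in points[k:]:
        j = nearest_centroid(centroids, point)
        counts[j] += 1
        # Incremental mean update: move the centroid to the new cluster mean.
        centroids[j] = [c + (x - c) / counts[j] for c, x in zip(centroids[j], point)]
    return centroids
```

A test case is then classified by calling `nearest_centroid` with the learned centroids. The more common batch variant repeats the assign-and-update steps until assignments stop changing.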
29
29 Other array-based classifiers (b) Support Vector Machines Goal: find a plane that separates data points If not separable Boost the data points into a higher dimensional space using some well-behaved kernel function Try to find a separating hyperplane there Key benefits of SVM version Kernel avoids explicit representation of higher-dim space Finding the maximum margin separating classes avoids overfitting
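The claim that a kernel avoids explicitly representing the higher-dimensional space can be checked numerically. The sketch below assumes a quadratic kernel on 2-D points, chosen here only because its feature map is small enough to write out; the slides do not name a specific kernel.

```python
import math


def dot(a, b):
    """Ordinary inner product."""
    return sum(x * y for x, y in zip(a, b))


def explicit_features(v):
    """Map a 2-D point into the 3-D space induced by the quadratic kernel."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def quadratic_kernel(a, b):
    """Same inner product as in the 3-D space, computed without leaving 2-D."""
    return dot(a, b) ** 2
```

For any two points, `quadratic_kernel(a, b)` equals `dot(explicit_features(a), explicit_features(b))` up to rounding. The mapping also shows why lifting helps: the XOR-style points (1, 1) and (-1, -1) versus (1, -1) and (-1, 1) cannot be split by a line in 2-D, but the sign of the middle feature separates them in the lifted space.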
30
30 Class discovery What if we don’t know how many clusters we want? The discovery of finer-grained subtypes of cancer has been arduous and slow How can microarrays help here? Golub (1999) again … Automatic class discovery based solely on gene expression
31
31 Self-organizing maps (SOMs) Very much like k-means clustering However, we don’t know the discriminating features in advance Cluster based on all gene expression levels Results for 27 ALL/11 AML data set Class A: 24/25 samples were ALL Class B: 10/13 samples were AML Quite effective, but not perfect
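A very small self-organizing map can be sketched as follows. This is a toy 1-D map under several assumptions made for illustration: nodes are initialized at randomly chosen samples, the learning rate and neighborhood radius decay linearly, and a neighbor's update is scaled by 1 / (1 + grid distance). None of these settings come from the slides.

```python
import random


def best_matching_unit(nodes, sample):
    """Index of the node whose weight vector is closest to the sample."""
    return min(
        range(len(nodes)),
        key=lambda j: sum((w - x) ** 2 for w, x in zip(nodes[j], sample)),
    )


def train_som(samples, n_nodes, n_epochs=50, seed=0):
    """Train a 1-D SOM on full expression vectors (no feature selection)."""
    rng = random.Random(seed)
    nodes = [list(s) for s in rng.sample(samples, n_nodes)]
    for epoch in range(n_epochs):
        decay = 1.0 - epoch / n_epochs
        rate = 0.5 * decay                # learning rate shrinks over time
        radius = (n_nodes / 2.0) * decay  # so does the neighborhood
        order = list(samples)
        rng.shuffle(order)
        for sample in order:
            winner = best_matching_unit(nodes, sample)
            for j, node in enumerate(nodes):
                grid_distance = abs(j - winner)
                if grid_distance <= radius:
                    step = rate / (1 + grid_distance)
                    for d in range(len(node)):
                        node[d] += step * (sample[d] - node[d])
    return nodes
```

After training, each sample is assigned to the class of its best matching node. The main difference from k-means is the neighborhood update, which also pulls nodes adjacent to the winner toward each sample early in training.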
32
32 SOMs (cont’d) How can we evaluate the “learned” clusters w/o knowing the true classes? Test by class prediction – accuracy should be high if classes reflect true structure Results Predictors w/ variety of genes did well in cross-validation Exception: the one AML in class A was often predicted to be in class B This suggests an iterative method for class discovery: discover, predict, refine
33
33 Independent model validation Cannot assess “accuracy” on test data Instead, assess prediction strength High PS indicates that structure in initial data is also present in test data Results Median PS=.61, 74% of samples above threshold Compared w/ random clusters, PS’s were highly statistically significant We have discovered ALL-AML distinction! Even lower-level distinctions also discovered
34
34 Other CS w/ expression arrays Regulatory element detection Correlate expression data with frequency of DNA motifs Taxing even for fastest processors today Discovery of regulatory pathways Treat expression arrays over time as a graph Establish a Bayesian network model for regulatory pathways over the array graph structure Infer network parameters → pathway structure
35
35 Problems with DNA arrays Different companies, different types Even within one company Different products over time Different binding efficiencies Much time spent on normalization Even then, different groups’ results are hard to compare Biggest worry: RNA levels in cells do not accurately reflect current protein content Perhaps limits our discovery potential
36
36 Proteomics If protein is most important, why not study it directly? Much work is done on proteins already But difficult to purify, prepare, quantify Results are very coarse Emerging technologies More efficient protein purification and protein arrays are being developed! Lots of discoveries to come
37
37 Outline Overview Biology Technology Algorithms & Applications Low tech: String algorithms High tech: Class discovery/prediction Treatments & clinical outcomes Conclusions
38
38 Looking to the future Biology is becoming a more theoretical, unified science The problem w/ biology has always been that there are too many layers Work has always been somewhere in the middle Now research is beginning to focus on processes and pathways and networks in general This is the proper path to developing theories Along the way … Lots of hard computational problems to be solved!