Published by Neal Cunningham. Modified over 9 years ago.
1 Machine Learning for Functional Genomics I Matt Hibbs http://cbfg.jax.org
2 Central Dogma Gene Expression DNA Proteins Phenotypes
3 Functional Genomics Identify the roles played by genes/proteins Sealfon et al., 2006.
4 Gene Expression Microarrays Simultaneous measurements of mRNA abundance levels for every gene in a genome Genes Conditions
5 Gene Expression Microarrays
Simultaneous measurements of mRNA abundance levels for every gene in a genome, in thousands of conditions
Rich functional information in these data, but how can we utilize the entire compendium?
6 Biological Data Explosion
Huge repositories of biological data…
…are not directly translating into knowledge
[Figure, two panels: mouse genes with known process association (# of genes vs. year); publicly available microarrays in GEO (# of measurements vs. year)]
7 Why is there a Data-Knowledge Gap?
Many datasets are analyzed only once
–Initial publication looks for hypothesis
–Need standards for naming, formats, collection
Data should be aggregated and integrated
–Modestly significant clues seen repeatedly can become convincing
–“a preponderance of circumstantial evidence”
Scale of this problem overwhelms traditional biology
8 Scalable Artificial Intelligence Computer science is really a study in scalability Use machine learning and data mining techniques to quickly identify important patterns
9 Amazon Recommendations
10 Amazon Recommendations
[Diagram: Observe Browsing Patterns and Account Activity; Purchase History, Item Rankings; Machine Learning (Bayesian networks); Recommendations]
Compare your purchase history to all other customers
Find commonalities between profiles
Predict potential purchases
11 Gene Function Prediction
[Diagram, Amazon: Observe Browsing Patterns and Account Activity; Purchase History, Item Rankings; Machine Learning (Bayesian networks); Recommendations]
≈
[Diagram, genomics: Laboratory Experiments; Genome Scale Data, MGI Annotations; Machine Learning (Bayesian networks); Predictions]
12 Challenges for AI from Biology
Input data is noisy, heterogeneous, constantly evolving
Current knowledge is incomplete and biased
Can be difficult to determine accuracy
13 Promise of Computational Functional Genomics Data & Existing Knowledge Computational Approaches Predictions Laboratory Experiments
14 Reality of Computational Functional Genomics Data & Existing Knowledge Computational Approaches Predictions Laboratory Experiments
15 Computational Solutions
Machine learning & data mining
–Use existing data to make new predictions
  Similarity search algorithms
  Bayesian networks
  Support vector machines
  etc.
–Validate predictions with follow-up lab work
Visualization & exploratory analysis
–Seeing and interacting with data important
–Show data so that questions can be answered
Scalability, incorporate statistics, etc.
17 Similarity Search Approach
Re-frame analysis as exploratory search
[Diagram: Data Collection, Query Genes; Search Algorithm (SPELL); Relevant Datasets, Related Genes]
18 Context-Sensitive Search Process
Key Insights
Signal Balancing (X = U Σ V^T)
Correlation Comparability
20 Dataset relevance weighting
Query Genes: Q = {YQG1, YQG2, YQG3}
Calculate a correlation measure among the query genes within each dataset; this is each dataset's weight
[Figure: example weights for four datasets: 0.15, 0.82, 0.05, 0.55]
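The weighting step above can be sketched in a few lines. This is only an illustration under stated assumptions: the slide says "a correlation measure" without giving a formula, so the sketch uses the mean pairwise Pearson correlation among the query genes, with negative correlations clipped to zero. The function names and the toy expression values are hypothetical and are not taken from SPELL.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / sqrt(vx * vy)

def dataset_weight(dataset, query):
    """Assumed weight: mean positive correlation over all query gene pairs."""
    pairs = list(combinations(query, 2))
    total = sum(max(0.0, pearson(dataset[a], dataset[b])) for a, b in pairs)
    return total / len(pairs)

# Toy datasets (gene -> expression profile); the values are made up.
coherent = {"YQG1": [1, 2, 3, 4], "YQG2": [2, 4, 6, 8], "YQG3": [1, 3, 5, 7]}
unrelated = {"YQG1": [1, 2, 3, 4], "YQG2": [4, 3, 2, 1], "YQG3": [1, -1, 1, -1]}

query = ["YQG1", "YQG2", "YQG3"]
weights = [dataset_weight(d, query) for d in (coherent, unrelated)]
```

A dataset in which the query genes move together gets a weight near 1; a dataset in which they do not gets a weight near 0 and contributes little to the search.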
21 Identify Novel Partners
Query Genes: Q = {YQG1, YQG2, YQG3}
Dataset weights: 0.15, 0.82, 0.05, 0.55
Calculate weighted distance score for all other genes (geneA, geneB, geneC) to the query set
22 Identify Novel Partners
Query Genes: Q = {YQG1, YQG2, YQG3}
Dataset weights: 0.15, 0.82, 0.05, 0.55
Calculate weighted distance score for all other genes (geneA, geneB, geneC) to the query set, ordered from best score to worst score
+ Takes advantage of functional diversity
+ Addresses statistical concerns
+ Fast running times [O(GDQ^2)] (ms per query)
+ Top results are candidates for investigation
+ Search process is iterative to refine results
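A minimal sketch of the scoring step, assuming that a gene's score is the weight-averaged mean Pearson correlation to the query genes across datasets. The slide does not spell out the formula, so this form, the function names, and the toy data are assumptions for illustration only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def score_genes(datasets, weights, query):
    """Rank non-query genes by weighted mean correlation to the query set."""
    candidates = set().union(*(d.keys() for d in datasets)) - set(query)
    scores = {}
    for gene in candidates:
        acc, weight_sum = 0.0, 0.0
        for data, w in zip(datasets, weights):
            present = [q for q in query if q in data]
            if gene not in data or not present:
                continue  # this dataset says nothing about the gene
            mean_r = sum(pearson(data[gene], data[q]) for q in present) / len(present)
            acc += w * mean_r
            weight_sum += w
        scores[gene] = acc / weight_sum if weight_sum > 0 else 0.0
    # Best score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Two toy datasets; the weights reuse two of the example values on the slide.
d1 = {"YQG1": [1, 2, 3, 4], "YQG2": [2, 4, 6, 8],
      "geneA": [4, 3, 2, 1], "geneB": [1, 2, 3, 5]}
d2 = {"YQG1": [1, 3, 2, 4], "YQG2": [2, 6, 4, 8],
      "geneA": [1, 3, 2, 4], "geneB": [4, 2, 3, 1]}

ranked = score_genes([d1, d2], [0.15, 0.82], ["YQG1", "YQG2"])
```

In this toy case geneA ranks first because it tracks the query genes in the heavily weighted dataset, even though it is anti-correlated with them in the lightly weighted one.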
23 Context-Sensitive Search Process
Key Insights
Signal Balancing (X = U Σ V^T)
Correlation Comparability
24 Singular Value Decomposition (SVD)
Signal Balancing
Projects data into another orthonormal basis
Correlations in U (rather than X) often contain better biological signals
[Diagram: Data, SVD]
25 Signal Balancing SVD Signal Balancing
26 Signal Balancing
Use correlations among left singular vectors
–Downweights dominant patterns, amplifies subtle patterns
Top eigengenes dominate data
–Sometimes correspond to systematic bias
–Often correspond to common biological processes
  e.g. ribosome biogenesis, etc.
Accuracy of signal balancing improved over re-projection
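A sketch of signal balancing with NumPy, assuming a genes x conditions matrix X with more genes than conditions. Gene-gene correlations are computed between rows of U from X = U Σ V^T rather than between rows of X. The synthetic data, the tolerance, and the function name are assumptions for illustration; this is not the authors' implementation.

```python
import numpy as np

def balanced_correlations(X, tol=1e-10):
    """Pearson correlations between genes, computed on U instead of X."""
    # Thin SVD: X = U @ diag(s) @ Vt, with U of shape (genes, components).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Drop components with numerically zero singular values,
    # because their singular vectors are arbitrary.
    keep = s > tol * s[0]
    U = U[:, keep]
    # Rows of U are genes; np.corrcoef correlates rows by default.
    return np.corrcoef(U)

# Synthetic example: every gene shares one strong pattern, and two groups
# of genes differ only in the sign of a much weaker pattern.
rng = np.random.default_rng(0)
dominant = np.array([1.0, -1.0, 1.0, -1.0])
subtle = np.array([1.0, 1.0, -1.0, -1.0])
group = np.array([1, 1, 1, 1, -1, -1, -1, -1])
X = 10.0 * dominant + np.outer(group, subtle) + 0.05 * rng.standard_normal((8, 4))

raw = np.corrcoef(X)                 # driven almost entirely by the dominant pattern
balanced = balanced_correlations(X)  # each singular vector has unit norm
```

Because every column of U has unit norm, the dominant pattern no longer outweighs the subtle one by the ratio of their singular values. Comparing `raw[0, 4]` with `balanced[0, 4]` is one way to inspect the effect on two genes from different groups.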
27 Context-Sensitive Search Process
Key Insights
Signal Balancing (X = U Σ V^T)
Correlation Comparability
28 Between-dataset normalization
Commonly used Pearson correlation yields greatly different distributions of correlation across datasets
These differences complicate comparisons
[Figure: histograms of Pearson correlations between all pairs of genes; panels: DeRisi et al., 97 and Primig et al., 00]
29 Between-dataset normalization
Fisher Z-transform, Z-score equalizes distributions
Increases comparability between datasets
[Figure: histograms of Z-scores between all pairs of genes]
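The normalization above can be sketched as follows. The Fisher transform is z = atanh(r) = 0.5 * ln((1 + r) / (1 - r)); the resulting values are then standardized within each dataset. The clipping constant, the function names, and the example correlations are assumptions for illustration.

```python
from math import atanh
from statistics import mean, pstdev

def fisher_z(r, eps=1e-6):
    """Fisher z-transform; r is clipped so that atanh stays finite at +/-1."""
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return atanh(r)

def normalize_correlations(corrs):
    """Fisher-transform one dataset's correlations, then convert to Z-scores."""
    z = [fisher_z(r) for r in corrs]
    mu, sigma = mean(z), pstdev(z)
    if sigma == 0:
        return [0.0] * len(z)
    return [(v - mu) / sigma for v in z]

# Two hypothetical datasets with very different correlation distributions.
narrow = [0.0, 0.05, -0.05, 0.1, -0.1]
broad = [0.2, 0.6, 0.9, 0.4, 0.8]

narrow_z = normalize_correlations(narrow)
broad_z = normalize_correlations(broad)
```

After this step both datasets have mean 0 and standard deviation 1, so a score of 2.0 means "two standard deviations above typical" in either dataset.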
30 SPELL Algorithm Overview Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics, 2007.
31 Web Interface http://spell.princeton.edu
32 Evaluation of Performance
Leave-k-in cross validation / bootstrapping
Results averaged across 125 diverse GO biological process terms (defined in the GRIFn system, Myers et al., 2006)
Many predictions also verified through experimental validations in other studies
–Hibbs et al., Bioinf, 2007
–Hess et al., PLoS Gen, 2009
–Hibbs*, Myers*, Huttenhower*, et al., PLoS Comp Biol, 2009
33 Search Accuracy
Perform “leave-k-in” cross-validation
[Diagram: Genes with common function … For all pairs: Order Genome; Rank Average; Master List]
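One possible reading of the leave-k-in procedure, sketched under assumptions: each k-gene subset of a known functional group is used as a query, the search orders the remaining genes, and each gene's ranks are averaged across queries to form the master list. The `search` callable, the toy genome, and the tie-breaking rule are hypothetical.

```python
from itertools import combinations

def leave_k_in_master_list(known_genes, search, k=2):
    """Average each gene's rank over all k-gene queries from known_genes."""
    rank_sums, counts = {}, {}
    for query in combinations(known_genes, k):
        ranking = search(list(query))  # ordered genes, query genes excluded
        for rank, gene in enumerate(ranking, start=1):
            rank_sums[gene] = rank_sums.get(gene, 0) + rank
            counts[gene] = counts.get(gene, 0) + 1
    average = {g: rank_sums[g] / counts[g] for g in rank_sums}
    # Master list: best (lowest) average rank first, ties broken by name.
    return sorted(average, key=lambda g: (average[g], g))

# Toy stand-in for a real search: returns the non-query genes in a fixed order.
GENOME = ["a", "b", "c", "x", "y"]

def toy_search(query):
    return [g for g in GENOME if g not in query]

master = leave_k_in_master_list(["a", "b", "c"], toy_search, k=2)
```

With k=2 the loop covers all pairs of known genes. A search that works well places the held-out members of the group near the top of the master list.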
34 Search Accuracy
Precision-Recall Curve
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
[Figure: precision-recall curve computed from the master list; both axes run from 0 to 1]
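A small sketch of how the precision-recall points can be computed from a ranked master list and a set of known positives. The gene names are hypothetical, and one point is emitted at each true positive, which is a common convention but an assumption here.

```python
def precision_recall_curve(ranked_genes, positives):
    """Return (recall, precision) at each rank where a true positive occurs."""
    positives = set(positives)
    true_positives = 0
    curve = []
    for k, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            true_positives += 1
            recall = true_positives / len(positives)  # TP / (TP + FN)
            precision = true_positives / k            # TP / (TP + FP)
            curve.append((recall, precision))
    return curve

ranked = ["g1", "g2", "g3", "g4", "g5"]
known = {"g1", "g3", "g6"}  # g6 is never retrieved, so recall stays below 1
curve = precision_recall_curve(ranked, known)
```

Here the first hit at rank 1 gives precision 1.0 at recall 1/3, and the second hit at rank 3 gives precision 2/3 at recall 2/3.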
35 Accuracy of Context-Sensitive Search
36 Sample & Query Size Effects
Even relatively small sample sizes produce similar results (1000 samples used for all other tests)
Significant performance gain between 2 and 3 query genes, little change beyond (5 query genes used for all other tests)
37 Effect of Signal Balancing Signal balancing further improves context-specific search performance Improvement is robust to missing value imputation method
38 Effects of Signal Balancing
[Figure legend: n% re-projection; n% balanced; signal balanced]
39 Effects of Signal Balancing
[Figure legend: n% re-projection; n% balanced]
40 Specific Performance
41 Computational Solutions
Machine learning & data mining
–Use existing data to make new predictions
  Similarity search algorithms
  Bayesian networks
  Support vector machines
  etc.
–Validate predictions with follow-up lab work
Visualization & exploratory analysis
–Seeing and interacting with data important
–Show data so that questions can be answered
Scalability, incorporate statistics, etc.
42 Function Prediction Evaluation
Cross-validation based on known biology
–Most often used method in literature
–Results are useful, but can be biased
Laboratory evaluation
–More accurate, more difficult
–Ultimate goal of functional genomics
–Identify novel biology
–Publish biological corpus
Huttenhower C*, Hibbs MA*, Myers CL* et al. The impact of incomplete knowledge on gene function prediction. Bioinformatics, 2009.
43 Promise of Computational Functional Genomics Data & Existing Knowledge Computational Approaches Predictions Laboratory Experiments
44
45 Petite Frequency Assay
46 Petite Frequency Phenotypes for Predictions
47 Overall Result Summary
48 Double mutant petite freq.
49 Mitochondrial Motility
50 Respiratory Growth Rate
51 Biological Benefits of Computational Direction
Effective candidate prioritization
–6 months of work vs. 8 years for whole genome screen
“Unbiased” (actually, just less biased)
–Both uncharacterized genes and genes with known function predicted and verified
  40 of 75 (53%) for genes with known function
  60 of 118 (51%) for uncharacterized genes
–Testing only mitochondrial localized proteins would miss 43% of our discoveries
  59% accuracy among mitochondria localized
  44% accuracy among non-mitochondria localized
52 Computational Expectations
[Figure panels: Original Gold Standard; Experimental Results]
53 Complementary Computational Approaches
54 Computational Reality
[Figure panels: Original Gold Standard; Experimental Results]
55 Method Comparison
Input Data | Microarrays Only | Microarrays Only | Diverse Data
Algorithmic Approach | Context-specific search | Bayesian integration | Bayesian integration
Details | heavily cross-validated, only pos. correlation, uses signal balanced data | naïve Bayes inference after training, pairwise correlations binned | naïve Bayes inference after training, each data type converted to pairwise scores
56 Method Accuracy is Biologically Diverse
57 Underlying Data Changes Predictions
58 Methods Converge During Iteration
59 Computational Lessons
Underlying data, choice of algorithm important
–Data affects which biological areas can be studied
–Algorithm affects biological context, nature of results
–Possible for many combinations to be accurate
Utilizing an ensemble of methods broadens scope and reliability
–Iteration in an ensemble can lead to converging predictions
Evaluating the results of computational prediction methods is not as simple as recapitulating GO
60 Conclusions
Microarray search system (& Bayesian data integration) produce good predictions of gene function
Experimental verification of predictions is important
109 novel gene functions discovered
Subtle phenotypes important to consider
Big challenge: Make this work in mammals
61 Acknowledgements
Hibbs Lab
–Karen Dowell
–Tongjun Gu
–Al Simons
Olga Troyanskaya Lab
–Patrick Bradley
–Maria Chikina
–Yuanfang Guan
Chad Myers
David Hess
Florian Markowetz
Edo Airoldi
Curtis Huttenhower
Kai Li Lab
–Grant Wallace
Amy Caudy
Maitreya Dunham
Botstein, Kruglyak, Broach, Rose labs
Kyuson Yun
Carol Bult