1 Machine Learning for Functional Genomics I Matt Hibbs

1 Machine Learning for Functional Genomics I Matt Hibbs http://cbfg.jax.org

2 Central Dogma Gene Expression DNA Proteins Phenotypes

3 Functional Genomics Identify the roles played by genes/proteins Sealfon et al., 2006.

4 Gene Expression Microarrays Simultaneous measurements of mRNA abundance levels for every gene in a genome Genes Conditions

5 Simultaneous measurements of mRNA abundance levels for every gene in a genome – in thousands of conditions Gene Expression Microarrays Rich functional information in these data, but how can we utilize the entire compendia?

6 Biological Data Explosion Huge repositories of biological data are not directly translating into knowledge [Two plots: publicly available microarrays in GEO (# of measurements vs. year); mouse genes with known process association (# of genes vs. year)]

7 Why is there a Data-Knowledge Gap? Many datasets are analyzed only once –Initial publication looks for hypothesis –Need standards for naming, formats, collection Data should be aggregated and integrated –Modestly significant clues seen repeatedly can become convincing –“a preponderance of circumstantial evidence” Scale of this problem overwhelms traditional biology

8 Scalable Artificial Intelligence Computer science is really a study in scalability Use machine learning and data mining techniques to quickly identify important patterns

9 Amazon Recommendations

10 Amazon Recommendations Compare your purchase history to all other customers; find commonalities between profiles; predict potential purchases [Diagram labels: Purchase History, Item Rankings; Machine Learning (Bayesian networks); Recommendations; Observe Browsing Patterns and Account Activity]

11 Gene Function Prediction [Diagram labels, Amazon loop: Purchase History, Item Rankings; Machine Learning (Bayesian networks); Recommendations; Observe Browsing Patterns and Account Activity ≈ gene function loop: Genome Scale Data, MGI Annotations; Machine Learning (Bayesian networks); Predictions; Laboratory Experiments]

12 Challenges for AI from Biology Input data is noisy, heterogeneous, constantly evolving Current knowledge is incomplete and biased Can be difficult to determine accuracy

13 Promise of Computational Functional Genomics Data & Existing Knowledge Computational Approaches Predictions Laboratory Experiments

14 Reality of Computational Functional Genomics Data & Existing Knowledge Computational Approaches Predictions Laboratory Experiments

15 Computational Solutions Machine learning & data mining –Use existing data to make new predictions Similarity search algorithms Bayesian networks Support vector machines etc. –Validate predictions with follow-up lab work Visualization & exploratory analysis –Seeing and interacting with data important –Show data so that questions can be answered Scalability, incorporate statistics, etc.

17 Similarity Search Approach Re-frame analysis as exploratory search Data Collection Query Genes Search Algorithm (SPELL) Relevant Datasets Related Genes

18 Context-Sensitive Search Process Key Insights: Signal Balancing (X = U Σ V^T), Correlation Comparability

20 Dataset relevance weighting For each dataset, calculate a correlation measure among the query genes; this is that dataset's weight Query Genes: Q = {YQG1, YQG2, YQG3} Example dataset weights: 0.15, 0.82, 0.05, 0.55
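The weighting step can be sketched in a few lines. This is a minimal illustration, not the published SPELL implementation: it assumes the "correlation measure" is the mean pairwise Pearson correlation among the query genes, clipped at zero, and the function name `dataset_weight` and the toy matrices are hypothetical.

```python
from itertools import combinations

import numpy as np


def dataset_weight(expr, query_rows):
    """Mean pairwise Pearson correlation among the query genes in one dataset.

    expr: genes x conditions array. query_rows: row indices of the query genes.
    Clipping negative averages to 0 is an assumption of this sketch.
    """
    corr = np.corrcoef(expr[query_rows])
    pairs = combinations(range(len(query_rows)), 2)
    mean_r = np.mean([corr[i, j] for i, j in pairs])
    return max(float(mean_r), 0.0)


# Toy data: rows 0-2 are the query genes, row 3 is an unrelated gene.
dataset_a = np.array([[1, 2, 3, 4, 5],
                      [2, 4, 6, 8, 10.5],
                      [1, 2, 3, 5, 5],
                      [5, 1, 4, 2, 3]], dtype=float)   # query genes co-vary
dataset_b = np.array([[1, 2, 3, 4, 5],
                      [5, 4, 3, 2, 1],
                      [1, -1, 1, -1, 1],
                      [5, 1, 4, 2, 3]], dtype=float)   # query genes disagree

query_rows = [0, 1, 2]
weights = [dataset_weight(d, query_rows) for d in (dataset_a, dataset_b)]
print(weights)
```

A dataset in which the query genes move together receives a weight near 1, while one in which they disagree receives a weight of 0 and contributes nothing to the later search.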

21 Identify Novel Partners Query Genes: Q = {YQG1, YQG2, YQG3} Dataset weights: 0.15, 0.82, 0.05, 0.55 Calculate weighted distance score for all other genes (geneA, geneB, geneC) to the query set

22 Identify Novel Partners Query Genes: Q = {YQG1, YQG2, YQG3} Dataset weights: 0.15, 0.82, 0.05, 0.55 Calculate weighted distance score for all other genes (geneA, geneB, geneC) to the query set, then order them from best score to worst score + Takes advantage of functional diversity + Addresses statistical concerns + Fast running times [O(GDQ^2)] (ms per query) + Top results are candidates for investigation + Search process is iterative to refine results
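One way to read the scoring step is as a weighted average, across datasets, of each gene's mean correlation to the query genes. The sketch below makes that assumption explicit; the function name `rank_genes`, the toy matrices, and the weights are hypothetical, and the exact score used by SPELL may differ.

```python
import numpy as np


def rank_genes(datasets, weights, query_rows):
    """Rank non-query genes by weighted mean correlation to the query set.

    datasets: list of genes x conditions arrays sharing one gene order.
    weights: one non-negative relevance weight per dataset (sum must be > 0).
    """
    n_genes = datasets[0].shape[0]
    total = np.zeros(n_genes)
    for expr, w in zip(datasets, weights):
        corr = np.corrcoef(expr)                        # genes x genes
        total += w * corr[:, query_rows].mean(axis=1)   # mean r to the query genes
    scores = total / np.sum(weights)
    query_set = set(query_rows)
    candidates = [g for g in range(n_genes) if g not in query_set]
    ranked = sorted(candidates, key=lambda g: scores[g], reverse=True)
    return ranked, scores


# Toy data: genes 0 and 1 are the query; gene 2 tracks the query only in
# dataset A, and gene 3 tracks it only in dataset B.
data_a = np.array([[1, 2, 3, 4], [1, 2, 3, 5], [2, 4, 6, 9], [4, 1, 3, 2]], dtype=float)
data_b = np.array([[1, 2, 3, 4], [2, 3, 4, 6], [4, 1, 3, 2], [1, 2, 3, 4.5]], dtype=float)

ranked_a_heavy, _ = rank_genes([data_a, data_b], [0.9, 0.1], [0, 1])
ranked_b_heavy, _ = rank_genes([data_a, data_b], [0.1, 0.9], [0, 1])
print(ranked_a_heavy, ranked_b_heavy)
```

Changing only the dataset weights flips which candidate ranks first, which is the point of weighting datasets by their relevance to the query.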

24 Singular Value Decomposition (SVD) Projects data into another orthonormal basis Correlations in U (rather than X) often contain better biological signals Signal Balancing Data - SVD

25 Signal Balancing SVD Signal Balancing

26 Signal Balancing Use correlations among left singular vectors –Downweights dominant patterns, amplifies subtle patterns Top eigengenes dominate data –Sometimes correspond to systematic bias –Often correspond to common biological processes, e.g., ribosome biogenesis Signal balancing improves accuracy over re-projection
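A minimal sketch of signal balancing, assuming it means correlating genes by their rows of U from X = U Σ V^T instead of their rows of X. Because the singular values are dropped, every retained component counts equally, so dominant patterns no longer swamp subtle ones. The function name `balanced_correlations`, the tolerance, and the random toy matrix are assumptions for illustration.

```python
import numpy as np


def balanced_correlations(expr, tol=1e-10):
    """Gene-gene Pearson correlations computed on U instead of X.

    expr: genes x conditions array. Components whose singular value is
    numerically zero are dropped, since their columns of U are arbitrary.
    """
    u, s, vt = np.linalg.svd(expr, full_matrices=False)
    keep = s > tol * s[0]
    corr = np.corrcoef(u[:, keep])   # each row of U holds one gene's coordinates
    return corr, (u, s, vt)


rng = np.random.default_rng(0)
expr = rng.normal(size=(6, 4))       # 6 genes, 4 conditions (toy data)
corr_u, (u, s, vt) = balanced_correlations(expr)
print(corr_u.shape, s)
```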

28 Between-dataset normalization Commonly used Pearson correlation yields greatly different distributions of correlation These differences complicate comparisons [Histograms of Pearson correlations between all pairs of genes: DeRisi et al., 97; Primig et al., 00]

29 Between-dataset normalization Fisher Z-transform, Z-score equalizes distributions Increases comparability between datasets [Histograms of Z-scores between all pairs of genes]
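The normalization can be sketched as follows: apply the Fisher z-transform (arctanh) to every pairwise Pearson correlation in a dataset, then standardize the result to zero mean and unit variance so that scores from different datasets share a scale. The function name `normalized_correlations`, the clipping bound, and the random toy matrix are assumptions of this sketch.

```python
import numpy as np


def normalized_correlations(expr):
    """Fisher z-transformed, z-scored correlations for all gene pairs.

    expr: genes x conditions array. Returns one value per unordered gene pair.
    """
    corr = np.corrcoef(expr)
    upper = np.triu_indices_from(corr, k=1)        # each pair once, no diagonal
    r = np.clip(corr[upper], -0.999999, 0.999999)  # keep arctanh finite
    z = np.arctanh(r)                              # Fisher z-transform
    return (z - z.mean()) / z.std()                # standardize within the dataset


rng = np.random.default_rng(1)
expr = rng.normal(size=(6, 10))                    # 6 genes, 10 conditions (toy data)
scores = normalized_correlations(expr)
print(scores.round(2))
```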

30 SPELL Algorithm Overview Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics, 2007.

31 Web Interface http://spell.princeton.edu

32 Evaluation of Performance Leave-k-in cross validation / bootstrapping Results averaged across 125 diverse GO biological process terms (defined in the GRIFn system, Myers et al., 2006) Many predictions also verified through experimental validations in other studies –Hibbs et al., Bioinf, 2007 –Hess et al., PLoS Gen, 2009 –Hibbs*, Myers*, Huttenhower*, et al., PLoS Comp Biol, 2009

33 Search Accuracy Perform “leave-k-in” cross-validation [Diagram labels: Genes with common function …; For all pairs; Order Genome; Rank Average; Master List]
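One possible reading of the leave-k-in procedure: every subset of k genes from a set with a common function is used as a query, the rest of the genome is ordered for each query, and the per-gene ranks are averaged into a master list. The sketch below follows that reading. The function name `leave_k_in_master_list`, the stand-in `toy_search`, and the similarity values are all hypothetical.

```python
from itertools import combinations


def leave_k_in_master_list(gene_set, all_genes, search, k=2):
    """Average each gene's rank over all k-gene queries drawn from gene_set.

    search(query) must return the non-query genes ordered best to worst.
    """
    rank_sums = {g: 0.0 for g in all_genes}
    counts = {g: 0 for g in all_genes}
    for query in combinations(sorted(gene_set), k):
        for rank, gene in enumerate(search(query), start=1):
            rank_sums[gene] += rank
            counts[gene] += 1
    averages = {g: rank_sums[g] / counts[g] for g in all_genes if counts[g]}
    return sorted(averages, key=averages.get)      # best average rank first


# Stand-in search: orders genes by a fixed, made-up similarity value.
similarity = {"a": 0.9, "b": 0.8, "c": 0.7, "x": 0.2, "y": 0.1}
all_genes = list(similarity)


def toy_search(query):
    others = [g for g in all_genes if g not in query]
    return sorted(others, key=similarity.get, reverse=True)


master = leave_k_in_master_list({"a", "b", "c"}, all_genes, toy_search, k=2)
print(master)
```

Genes from the held-out part of the functional set should rise to the top of the master list when the search works, which is what the precision-recall evaluation then measures.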

34 Search Accuracy Precision-Recall Curve computed from the Master List: Precision = TP / (TP + FP); Recall = TP / (TP + FN) [Plot axes: precision and recall, each from 0 to 1]
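As a small worked example, precision and recall can be computed at every cutoff of a ranked list. The function name `precision_recall_curve` and the gene labels are hypothetical, and the sketch assumes every known positive appears somewhere in the ranked list.

```python
def precision_recall_curve(ranked_genes, positives):
    """Return (precision, recall) after each position in a ranked gene list."""
    positives = set(positives)
    true_positives = 0
    curve = []
    for k, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            true_positives += 1
        precision = true_positives / k              # TP / (TP + FP)
        recall = true_positives / len(positives)    # TP / (TP + FN)
        curve.append((precision, recall))
    return curve


ranked = ["geneA", "geneB", "geneC", "geneD"]
known = {"geneA", "geneC"}
curve = precision_recall_curve(ranked, known)
print(curve)
```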

35 Accuracy of Context-Sensitive Search

36 Sample & Query Size Effects Even relatively small sample sizes produce similar results (1000 samples used for all other tests) Significant performance gain between 2 and 3 query genes, little change beyond (5 query genes used for all other tests)

37 Effect of Signal Balancing Signal balancing further improves context-specific search performance Improvement is robust to missing value imputation method

38 Effects of Signal Balancing [Figure panels: n% re-projection; n% balanced; signal balanced]

39 Effects of Signal Balancing [Figure panels: n% re-projection; n% balanced]

40 Specific Performance

42 Function Prediction Evaluation Cross-validation based on known biology –Most often used method in literature –Results are useful, but can be biased Laboratory evaluation –More accurate, more difficult –Ultimate goal of functional genomics –Identify novel biology –Publish biological corpus Huttenhower C*, Hibbs MA*, Myers CL* et al. The impact of incomplete knowledge on gene function prediction. Bioinformatics, 2009.

43 Promise of Computational Functional Genomics Data & Existing Knowledge Computational Approaches Predictions Laboratory Experiments

45 Petite Frequency Assay

46 Petite Frequency Phenotypes for Predictions

47 Overall Result Summary

48 Double mutant petite freq.

49 Mitochondrial Motility

50 Respiratory Growth Rate

51 Biological Benefits of Computational Direction Effective candidate prioritization –6 months of work vs. 8 years for whole genome screen “Unbiased” (actually, just less biased) –Both uncharacterized genes and genes with known function predicted and verified: 40 of 75 (53%) for genes with known function; 60 of 118 (51%) for uncharacterized genes –Testing only mitochondria-localized proteins would miss 43% of our discoveries: 59% accuracy among mitochondria-localized; 44% accuracy among non-mitochondria-localized

52 Computational Expectations [Figure panels: Original Gold Standard; Experimental Results]

53 Complementary Computational Approaches

54 Computational Reality [Figure panels: Original Gold Standard; Experimental Results]

55 Method Comparison (1) Input data: microarrays only; algorithmic approach: context-specific search; details: heavily cross-validated, only pos. correlation, uses signal balanced data (2) Input data: microarrays only; algorithmic approach: Bayesian integration; details: naïve Bayes inference after training, pairwise correlations binned (3) Input data: diverse data; algorithmic approach: Bayesian integration; details: naïve Bayes inference after training, each data type converted to pairwise scores
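The "pairwise correlations binned" variant of naïve Bayes can be illustrated with a small sketch. It assumes each dataset contributes one correlation per gene pair, that correlations are discretized with fixed bin edges, and that per-class bin frequencies are learned with add-one smoothing. The function names, bin edges, training values, and prior are hypothetical and are not taken from the talk.

```python
import numpy as np


def train_bin_model(values, labels, edges):
    """Estimate P(bin | class) for one dataset with add-one smoothing.

    values: correlation per training gene pair. labels: 1 if the pair is
    functionally related in the gold standard, else 0.
    """
    bins = np.digitize(np.asarray(values, dtype=float), edges)
    labels = np.asarray(labels)
    n_bins = len(edges) + 1
    model = {}
    for cls in (0, 1):
        counts = np.bincount(bins[labels == cls], minlength=n_bins) + 1.0
        model[cls] = counts / counts.sum()
    return model


def posterior(pair_values, models, edges, prior=0.5):
    """P(related | binned evidence), treating datasets as independent."""
    log_odds = np.log(prior / (1.0 - prior))
    for value, model in zip(pair_values, models):
        b = np.digitize(value, edges)
        log_odds += np.log(model[1][b] / model[0][b])
    return float(1.0 / (1.0 + np.exp(-log_odds)))


edges = [0.0, 0.5]                                   # three bins: <0, [0, 0.5), >=0.5
train_r = [0.8, 0.7, 0.9, 0.6, -0.2, 0.1, -0.6, 0.2]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]
model = train_bin_model(train_r, train_y, edges)

p_high = posterior([0.9], [model], edges)            # strongly correlated pair
p_low = posterior([-0.3], [model], edges)            # anti-correlated pair
print(p_high, p_low)
```

Each additional dataset adds one log-likelihood-ratio term, so evidence that is individually modest can add up across a compendium.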

56 Method Accuracy is Biologically Diverse

57 Underlying Data Changes Predictions

58 Methods Converge During Iteration

59 Computational Lessons Underlying data, Choice of algorithm important –Data affects which biological areas can be studied –Algorithm affects biological context, nature of results –Possible for many combinations to be accurate Utilizing an ensemble of methods broadens scope and reliability –Iteration in an ensemble can lead to converging predictions Evaluating the results of computational prediction methods is not as simple as recapitulating GO

60 Conclusions Microarray search system (& Bayesian data integration) produce good predictions of gene function Experimental verification of predictions is important 109 novel gene functions discovered Subtle phenotypes important to consider Big challenge: Make this work in mammals

61 Acknowledgements Hibbs Lab –Karen Dowell –Tongjun Gu –Al Simons Olga Troyanskaya Lab –Patrick Bradley –Maria Chikina –Yuanfang Guan Chad Myers David Hess Florian Markowetz Edo Airoldi Curtis Huttenhower Kai Li Lab –Grant Wallace Amy Caudy Maitreya Dunham Botstein, Kruglyak, Broach, Rose labs Kyuson Yun Carol Bult
