1 Machine Learning for Functional Genomics I Matt Hibbs

Presentation transcript:

1 Machine Learning for Functional Genomics I Matt Hibbs

2 Central Dogma Gene Expression DNA Proteins Phenotypes

3 Functional Genomics Identify the roles played by genes/proteins Sealfon et al., 2006.

4 Gene Expression Microarrays
Simultaneous measurements of mRNA abundance levels for every gene in a genome
[Figure: expression matrix; axes: genes, conditions]

5 Gene Expression Microarrays
Simultaneous measurements of mRNA abundance levels for every gene in a genome, in thousands of conditions
There is rich functional information in these data, but how can we use the entire compendium?

6 Biological Data Explosion
Huge repositories of biological data…
[Figure: publicly available microarrays in GEO; axes: year, # of measurements]
…are not directly translating into knowledge
[Figure: mouse genes with known process association; axes: year, # of genes]

7 Why is there a Data-Knowledge Gap?
Many datasets are analyzed only once
–Initial publication looks for hypothesis
–Need standards for naming, formats, collection
Data should be aggregated and integrated
–Modestly significant clues seen repeatedly can become convincing
–“a preponderance of circumstantial evidence”
Scale of this problem overwhelms traditional biology

8 Scalable Artificial Intelligence Computer science is really a study in scalability Use machine learning and data mining techniques to quickly identify important patterns

9 Amazon Recommendations

10 Amazon Recommendations
[Diagram labels: Purchase History and Item Rankings; Machine Learning (Bayesian networks); Recommendations; Observe Browsing Patterns and Account Activity]
Compare your purchase history to all other customers
Find commonalities between profiles
Predict potential purchases

11 Gene Function Prediction
[Diagram labels, Amazon: Purchase History and Item Rankings; Machine Learning (Bayesian networks); Recommendations; Observe Browsing Patterns and Account Activity]
≈
[Diagram labels, gene function: Genome Scale Data and MGI Annotations; Machine Learning (Bayesian networks); Predictions; Laboratory Experiments]

12 Challenges for AI from Biology
Input data is noisy, heterogeneous, constantly evolving
Current knowledge is incomplete and biased
Can be difficult to determine accuracy

13 Promise of Computational Functional Genomics Data & Existing Knowledge Computational Approaches Predictions Laboratory Experiments

14 Reality of Computational Functional Genomics Data & Existing Knowledge Computational Approaches Predictions Laboratory Experiments

15 Computational Solutions
Machine learning & data mining
–Use existing data to make new predictions: similarity search algorithms, Bayesian networks, support vector machines, etc.
–Validate predictions with follow-up lab work
Visualization & exploratory analysis
–Seeing and interacting with data important
–Show data so that questions can be answered
Scalability, incorporate statistics, etc.

16 Computational Solutions
Machine learning & data mining
–Use existing data to make new predictions: similarity search algorithms, Bayesian networks, support vector machines, etc.
–Validate predictions with follow-up lab work
Visualization & exploratory analysis
–Seeing and interacting with data important
–Show data so that questions can be answered
Scalability, incorporate statistics, etc.

17 Similarity Search Approach
Re-frame analysis as exploratory search
[Diagram labels: Data Collection; Query Genes; Search Algorithm (SPELL); Relevant Datasets; Related Genes]

18 Context-Sensitive Search Process
Key Insights: Signal Balancing; Correlation Comparability
$X = U \Sigma V^{T}$

19 Context-Sensitive Search Process
Key Insights: Signal Balancing; Correlation Comparability
$X = U \Sigma V^{T}$

20 Dataset relevance weighting
Query Genes: Q = {YQG1, YQG2, YQG3}
Calculate a correlation measure among the query genes within each dataset; this is each dataset’s weight
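The weighting step on this slide can be illustrated with a minimal sketch. It assumes the "correlation measure" is the mean pairwise Pearson correlation among the query genes and that negative means are clipped to zero; both choices, the toy expression values, and the names `pearson` and `dataset_weight` are illustrative assumptions rather than the published SPELL implementation.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def dataset_weight(dataset, query):
    """Mean pairwise correlation among the query genes in one dataset.

    Assumption: a negative mean is clipped to 0 so that datasets in which
    the query genes disagree contribute nothing to later scoring.
    """
    pairs = list(combinations(query, 2))
    mean_r = sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)
    return max(mean_r, 0.0)


# Hypothetical toy compendium: dataset name -> gene -> expression profile.
datasets = {
    "dataset_1": {  # query genes rise together: relevant dataset
        "YQG1": [1.0, 2.0, 3.0, 4.0],
        "YQG2": [2.0, 4.1, 5.9, 8.0],
        "YQG3": [0.5, 1.1, 1.4, 2.1],
    },
    "dataset_2": {  # query genes show no shared pattern: irrelevant dataset
        "YQG1": [1.0, 2.0, 3.0, 4.0],
        "YQG2": [4.0, 1.0, 3.0, 2.0],
        "YQG3": [2.0, 3.0, 1.0, 4.0],
    },
}
query = ["YQG1", "YQG2", "YQG3"]
weights = {name: dataset_weight(d, query) for name, d in datasets.items()}
print(weights)
```

In this toy case the first dataset receives a weight close to 1 and the second a weight of 0, which is the behaviour the slide describes: datasets where the query genes co-vary are treated as more relevant to the query.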

21 Identify Novel Partners
Query Genes: Q = {YQG1, YQG2, YQG3}
Calculate weighted distance score for all other genes (e.g. geneA, geneB, geneC) to the query set

22 Identify Novel Partners
Query Genes: Q = {YQG1, YQG2, YQG3}
Calculate weighted distance score for all other genes (e.g. geneA, geneB, geneC) to the query set, ordered from best score to worst score
+ Takes advantage of functional diversity
+ Addresses statistical concerns
+ Fast running times [$O(GDQ^{2})$] (ms per query)
+ Top results are candidates for investigation
+ Search process is iterative to refine results
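The ranking step can be sketched in the same spirit. This is a hedged illustration, not the SPELL code: it assumes each candidate gene is scored by its mean Pearson correlation to the query genes in every dataset, and that per-dataset scores are combined as a weighted average using the dataset weights. The function name `rank_genes`, the weights, and the expression values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def rank_genes(datasets, weights, query):
    """Rank non-query genes by weighted average correlation to the query set."""
    candidates = set()
    for profiles in datasets.values():
        candidates.update(g for g in profiles if g not in query)

    scores = {}
    for gene in candidates:
        num = den = 0.0
        for name, profiles in datasets.items():
            w = weights[name]
            present = [q for q in query if q in profiles]
            if w == 0.0 or gene not in profiles or not present:
                continue  # dataset carries no information for this gene
            mean_r = sum(pearson(profiles[gene], profiles[q]) for q in present) / len(present)
            num += w * mean_r
            den += w
        if den > 0:
            scores[gene] = num / den
    # Best score first, worst score last.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data and weights (higher weight = more relevant dataset).
datasets = {
    "dataset_1": {
        "YQG1": [1.0, 2.0, 3.0, 4.0],
        "YQG2": [2.0, 4.0, 6.0, 8.5],
        "geneA": [1.0, 2.0, 3.0, 5.0],
        "geneB": [2.0, 1.0, 2.0, 1.0],
        "geneC": [4.0, 3.0, 2.0, 1.0],
    },
    "dataset_2": {
        "YQG1": [1.0, 3.0, 2.0],
        "YQG2": [2.0, 1.0, 3.0],
        "geneA": [3.0, 1.0, 2.0],
        "geneB": [1.0, 2.0, 3.0],
        "geneC": [1.0, 3.0, 2.0],
    },
}
weights = {"dataset_1": 0.9, "dataset_2": 0.1}
ranking = rank_genes(datasets, weights, ["YQG1", "YQG2"])
for gene, score in ranking:
    print(f"{gene}: {score:.3f}")
```

Because the first dataset dominates the weights, geneA (which tracks the query genes there) ranks first and geneC (which runs opposite to them) ranks last, even though the low-weight second dataset suggests a different ordering.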

23 Context-Sensitive Search Process
Key Insights: Signal Balancing; Correlation Comparability
$X = U \Sigma V^{T}$

24 Singular Value Decomposition (SVD)
Projects data into another orthonormal basis
Correlations in U (rather than X) often contain better biological signals
[Diagram labels: Data; SVD; Signal Balancing]
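A small NumPy sketch may help make the decomposition concrete. It assumes a genes × conditions matrix X filled with random toy values, uses `numpy.linalg.svd` (a third-party dependency, not named in the slides), and computes gene-gene Pearson correlations once on the rows of X and once on the rows of U. Treating "correlations in U" as row-wise Pearson correlation is an assumption about how the slide's idea would be implemented.

```python
import numpy as np

# Hypothetical toy data: 6 genes (rows) x 4 conditions (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Thin SVD: X = U @ diag(s) @ Vt, with orthonormal columns in U and rows in Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_rebuilt = U @ np.diag(s) @ Vt

# Gene-gene correlations in the original space and in the U basis.
# In U the singular values are dropped, so every component counts equally
# instead of the largest patterns dominating the correlation.
corr_X = np.corrcoef(X)
corr_U = np.corrcoef(U)

print("singular values:", np.round(s, 3))
print("reconstruction ok:", np.allclose(X, X_rebuilt))
print("corr_X[0, 1] =", round(float(corr_X[0, 1]), 3))
print("corr_U[0, 1] =", round(float(corr_U[0, 1]), 3))
```

The singular values are returned in descending order, so the first columns of U correspond to the dominant patterns in the data; dropping the scaling by `s` is what removes their dominance when correlations are computed on U.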

25 Signal Balancing SVD Signal Balancing

26 Signal Balancing
Use correlations among left singular vectors
–Downweights dominant patterns, amplifies subtle patterns
Top eigengenes dominate data
–Sometimes correspond to systematic bias
–Often correspond to common biological processes, e.g. ribosome biogenesis
Accuracy of signal balancing improved over re-projection

27 Context-Sensitive Search Process
Key Insights: Signal Balancing; Correlation Comparability
$X = U \Sigma V^{T}$

28 Between-dataset normalization
The commonly used Pearson correlation yields greatly different distributions of correlation in different datasets
These differences complicate comparisons
[Figure: histograms of Pearson correlations between all pairs of genes; panels: DeRisi et al., 97; Primig et al., 00]

29 Between-dataset normalization
Fisher Z-transform, Z-score equalizes distributions
Increases comparability between datasets
[Figure: histograms of Z-scores between all pairs of genes]
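The normalization named on this slide can be sketched as two steps: apply the Fisher z-transform, $z = \tfrac{1}{2}\ln\frac{1+r}{1-r} = \operatorname{atanh}(r)$, to every correlation in a dataset, then standardize the transformed values to zero mean and unit variance within that dataset. The clipping constant `eps`, the function names, and the two toy correlation lists are assumptions made for illustration.

```python
from math import atanh
from statistics import mean, pstdev


def fisher_z(r, eps=1e-9):
    """Fisher z-transform; r is clipped away from +/-1 to keep atanh finite."""
    r = max(min(r, 1.0 - eps), -1.0 + eps)
    return atanh(r)


def normalize_correlations(correlations):
    """Fisher-transform a dataset's correlations, then z-score them."""
    z = [fisher_z(r) for r in correlations]
    mu, sigma = mean(z), pstdev(z)
    return [(v - mu) / sigma for v in z]


# Hypothetical pairwise correlations from two datasets with very different
# distributions: one skewed towards high values, one centred near zero.
dataset_a = [0.90, 0.80, 0.95, 0.70, 0.85]
dataset_b = [-0.10, 0.00, 0.20, 0.10, -0.20]

norm_a = normalize_correlations(dataset_a)
norm_b = normalize_correlations(dataset_b)
print([round(v, 2) for v in norm_a])
print([round(v, 2) for v in norm_b])
```

After normalization both lists have mean 0 and standard deviation 1, so a score of, say, 1.5 means "unusually high for this dataset" in either case, which is what makes scores comparable between datasets.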

30 SPELL Algorithm Overview Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics, 2007.

31 Web Interface

32 Evaluation of Performance
Leave-k-in cross validation / bootstrapping
Results averaged across 125 diverse GO biological process terms (defined in the GRIFn system, Myers et al., 2006)
Many predictions also verified through experimental validations in other studies
–Hibbs et al., Bioinf, 2007
–Hess et al., PLoS Gen, 2009
–Hibbs*, Myers*, Huttenhower*, et al., PLoS Comp Biol, 2009

33 Search Accuracy
Perform “leave-k-in” cross-validation
[Diagram labels: Genes with common function; For all pairs; Order Genome; Rank Average; Master List]

34 Search Accuracy
Precision-Recall Curve
[Diagram label: Master List]
$\text{Precision} = \dfrac{TP}{TP + FP}$
$\text{Recall} = \dfrac{TP}{TP + FN}$
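As a worked illustration of these two definitions, the sketch below walks down a ranked gene list and records recall and precision at every cutoff. The gene names, the set of known positives, and the function name are hypothetical; the sketch also assumes the set of positives is non-empty.

```python
def precision_recall_curve(ranked_genes, positives):
    """Return (recall, precision) after each position in a ranked gene list."""
    positives = set(positives)
    tp = 0
    curve = []
    for k, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            tp += 1
        fp = k - tp               # retrieved so far but not known positives
        fn = len(positives) - tp  # known positives not yet retrieved
        curve.append((tp / (tp + fn), tp / (tp + fp)))
    return curve


# Hypothetical ranked list (best first) and held-out genes with the function.
ranked = ["g1", "g2", "g3", "g4", "g5"]
known = {"g1", "g3", "g5"}

curve = precision_recall_curve(ranked, known)
for recall, precision in curve:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Plotting precision against recall for these points gives the precision-recall curve; a ranking that places all known positives at the top keeps precision at 1 until recall reaches 1.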

35 Accuracy of Context-Sensitive Search

36 Sample & Query Size Effects
Even relatively small sample sizes produce similar results (1000 samples used for all other tests)
Significant performance gain between 2 and 3 query genes, little change beyond (5 query genes used for all other tests)

37 Effect of Signal Balancing
Signal balancing further improves context-specific search performance
Improvement is robust to missing value imputation method

38 Effects of Signal Balancing
[Figure labels: n% re-projection; n% balanced; signal balanced]

39 Effects of Signal Balancing
[Figure labels: n% re-projection; n% balanced]

40 Specific Performance

41 Computational Solutions
Machine learning & data mining
–Use existing data to make new predictions: similarity search algorithms, Bayesian networks, support vector machines, etc.
–Validate predictions with follow-up lab work
Visualization & exploratory analysis
–Seeing and interacting with data important
–Show data so that questions can be answered
Scalability, incorporate statistics, etc.

42 Function Prediction Evaluation
Cross-validation based on known biology
–Most often used method in literature
–Results are useful, but can be biased
Laboratory evaluation
–More accurate, more difficult
–Ultimate goal of functional genomics
–Identify novel biology
–Publish biological corpus
Huttenhower C*, Hibbs MA*, Myers CL* et al. The impact of incomplete knowledge on gene function prediction. Bioinformatics, 2009.

43 Promise of Computational Functional Genomics Data & Existing Knowledge Computational Approaches Predictions Laboratory Experiments

44

45 Petite Frequency Assay

46 Petite Frequency Phenotypes for Predictions

47 Overall Result Summary

48 Double mutant petite freq.

49 Mitochondrial Motility

50 Respiratory Growth Rate

51 Biological Benefits of Computational Direction
Effective candidate prioritization
–6 months of work vs. 8 years for whole genome screen
“Unbiased” (actually, just less biased)
–Both uncharacterized genes and genes with known function predicted and verified: 40 of 75 (53%) for genes with known function; 60 of 118 (51%) for uncharacterized genes
–Testing only mitochondrial localized proteins would miss 43% of our discoveries: 59% accuracy among mitochondria localized; 44% accuracy among non-mitochondria localized

52 Computational Expectations
[Figure panels: Original Gold Standard; Experimental Results]

53 Complementary Computational Approaches

54 Computational Reality
[Figure panels: Original Gold Standard; Experimental Results]

55 Method Comparison
Input data: microarrays only | microarrays only | diverse data
Algorithmic approach: context-specific search | Bayesian integration | Bayesian integration
Details: heavily cross-validated, only pos. correlation, uses signal balanced data | naïve Bayes inference after training, pairwise correlations binned | naïve Bayes inference after training, each data type converted to pairwise scores

56 Method Accuracy is Biologically Diverse

57 Underlying Data Changes Predictions

58 Methods Converge During Iteration

59 Computational Lessons
Underlying data, choice of algorithm important
–Data affects which biological areas can be studied
–Algorithm affects biological context, nature of results
–Possible for many combinations to be accurate
Utilizing an ensemble of methods broadens scope and reliability
–Iteration in an ensemble can lead to converging predictions
Evaluating the results of computational prediction methods is not as simple as recapitulating GO

60 Conclusions
Microarray search system (& Bayesian data integration) produce good predictions of gene function
Experimental verification of predictions is important
109 novel gene functions discovered
Subtle phenotypes important to consider
Big challenge: Make this work in mammals

61 Acknowledgements
Hibbs Lab
–Karen Dowell
–Tongjun Gu
–Al Simons
Olga Troyanskaya Lab
–Patrick Bradley
–Maria Chikina
–Yuanfang Guan
Chad Myers
David Hess
Florian Markowetz
Edo Airoldi
Curtis Huttenhower
Kai Li Lab
–Grant Wallace
Amy Caudy
Maitreya Dunham
Botstein, Kruglyak, Broach, Rose labs
Kyuson Yun
Carol Bult