Presentation is loading. Please wait.

Presentation is loading. Please wait.

Large scale genomic data integration for functional genomics and metagenomics Curtis Huttenhower 05-21-10 Harvard School of Public Health Department of.

Similar presentations


Presentation on theme: "Large scale genomic data integration for functional genomics and metagenomics Curtis Huttenhower 05-21-10 Harvard School of Public Health Department of."— Presentation transcript:

1 Large scale genomic data integration for functional genomics and metagenomics Curtis Huttenhower 05-21-10 Harvard School of Public Health Department of Biostatistics

2 Greatest Biological Discoveries? 2

3 Are We There Yet? 3 How much biology is out there? How much have we found? How fast are we finding it? Human Proteins with Annotated Biological Roles Age-Adjusted Citation Rates for Major Sequencing Projects Species Diversity of Environmental Samples Fierer 2008 # Distinct Roles Matt Hibbs

4 # Distinct Roles Matt Hibbs Are We There Yet? 4 How much biology is out there? How much have we found? How fast are we finding it? Human Proteins with Annotated Biological Roles Age-Adjusted Cost per Citation for Major Sequencing Projects Species Diversity of Environmental Samples Fierer 2008 Lots! Not nearly all Not fast enough Our job is to create computational microscopes: To ask and answer specific biomedical questions using millions of experimental results Our job is to create computational microscopes: To ask and answer specific biomedical questions using millions of experimental results

5 Outline 5 1. Data mining: Algorithms for integrating very large data compendia 2. Metagenomics: Network models of microbial communities

6 A framework for functional genomics 6 High Similarity Low Similarity High Correlation Low Correlation G1 G2 + G4 G9 + … G3 G6 - G7 G8 - … G2 G5 ? 0.90.7…0.10.2…0.8 +-…--…+ 0.5…0.050.1…0.6 High Correlation Low Correlation Frequency Coloc.Not coloc. Frequency SimilarDissim. Frequency P(G2-G5|Data) = 0.85 100Ms gene pairs → ← 1Ks datasets + =

7 Functional network prediction and analysis 7 Global interaction network Metabolism networkSignaling networkGut community network Currently includes data from 30,000 human experimental results, 15,000 expression conditions + 15,000 diverse others, analyzed for 200 biological functions and 150 diseases HEFalMp

8 HEFalMp: Predicting human gene function 8 HEFalMp

9 HEFalMp: Predicting human genetic interactions 9 HEFalMp

10 HEFalMp: Analyzing human genomic data 10 HEFalMp

11 HEFalMp: Understanding human disease 11 HEFalMp

12 Meta-analysis for unsupervised functional data integration 12 Evangelou 2007 Huttenhower 2006 Hibbs 2007 Simple regression: All datasets are equally accurate Random effects: Variation within and among datasets and interactions

13 Meta-analysis for unsupervised functional data integration 13 Following up with semi- supervised approach Evangelou 2007 Huttenhower 2006 Hibbs 2007 + =

14 Functional mapping: mining integrated networks 14 Predicted relationships between genes High Confidence Low Confidence The strength of these relationships indicates how cohesive a process is. Chemotaxis

15 Functional mapping: mining integrated networks 15 Predicted relationships between genes High Confidence Low Confidence Chemotaxis

16 Functional mapping: mining integrated networks 16 Flagellar assembly The strength of these relationships indicates how associated two processes are. Predicted relationships between genes High Confidence Low Confidence Chemotaxis

17 Functional Mapping: Functional Associations Between Processes 17 Edges Associations between processes Very Strong Moderately Strong Hydrogen Transport Electron Transport Cellular Respiration Protein Processing Peptide Metabolism Cell Redox Homeostasis Aldehyde Metabolism Energy Reserve Metabolism Vacuolar Protein Catabolism Negative Regulation of Protein Metabolism Organelle Fusion Protein Depolymerization Organelle Inheritance

18 Functional Mapping: Functional Associations Between Processes 18 Edges Associations between processes Very Strong Moderately Strong Borders Data coverage of processes Well Covered Sparsely Covered Hydrogen Transport Electron Transport Cellular Respiration Protein Processing Peptide Metabolism Cell Redox Homeostasis Aldehyde Metabolism Energy Reserve Metabolism Vacuolar Protein Catabolism Negative Regulation of Protein Metabolism Organelle Fusion Protein Depolymerization Organelle Inheritance

19 Functional Mapping: Functional Associations Between Processes 19 Edges Associations between processes Very Strong Moderately Strong Nodes Cohesiveness of processes Below Baseline (genomic background) Very Cohesive Borders Data coverage of processes Well Covered Sparsely Covered Hydrogen Transport Electron Transport Cellular Respiration Protein Processing Peptide Metabolism Cell Redox Homeostasis Aldehyde Metabolism Energy Reserve Metabolism Vacuolar Protein Catabolism Negative Regulation of Protein Metabolism Organelle Fusion Protein Depolymerization Organelle Inheritance

20 Functional Mapping: Functional Associations Between Processes 20 Edges Associations between processes Very Strong Moderately Strong Nodes Cohesiveness of processes Below Baseline (genomic background) Very Cohesive Borders Data coverage of processes Well Covered Sparsely Covered

21 Functional Maps: Focused Data Summarization 21 ACGGTGAACGTACA GTACAGATTACTAG GACATTAGGCCGTA TCCGATACCCGATA Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

22 Functional Maps: Focused Data Summarization 22 ACGGTGAACGTACA GTACAGATTACTAG GACATTAGGCCGTA TCCGATACCCGATA How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information? Functional mapping Very large collections of genomic data Specific predicted molecular interactions Pathway, process, or disease associations Underlying experimental results and functional activities in data

23 Outline 23 1. Data mining: Algorithms for integrating very large data compendia 2. Metagenomics: Network models of microbial communities

24 Microbial Communities and Functional Metagenomics Metagenomics: data analysis from environmental samples –Microflora: environment includes us! Pathogen collections of “single” organisms form similar communities Another data integration problem –Must include datasets from multiple organisms What questions can we answer? –What pathways/processes are present/over/under- enriched in a newly sequences microbe/community? –What’s shared within community X? What’s different? What’s unique? –How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, … –Current functional methods annotate ~50% of synthetic data, <5% of environmental data 24 With Jacques Izard, Wendy Garrett

25 Data Integration for Microbial Communities 25 ~300 available expression datasets ~30 species Weskamp et al 2004 Flannick et al 2006 Kanehisa et al 2008 Tatusov et al 1997 Data integration works just as well in microbes as it does in yeast and humans We know an awful lot about some microorganisms and almost nothing about others Sequence-based and network-based tools for function transfer both work in isolation We can use data integration to leverage both and mine out additional biology

26 Functional network prediction from diverse microbial data 26 486 bacterial expression experiments 876 raw datasets 310 postprocessed datasets 304 normalized coexpression networks in 27 species Integrated functional interaction networks in 15 species 307 bacterial interaction experiments 154796 raw interactions 114786 postprocessed interactions E. Coli Integration ← Precision ↑, Recall ↓

27 Functional maps for cross-species knowledge transfer 27 G17 G16 G15 G10 G6 G9 G8 G5 G11 G7 G12 G13 G14 G2 G1 G4 G3 O8 O4 O5 O7 O9 O6 O2 O3 O1 O1: G1, G2, G3 O2: G4 O3: G6 … ECG1, ECG2 BSG1 ECG3, BSG2 …

28 Functional maps for cross-species knowledge transfer 28 ← Precision ↑, Recall ↓ Following up with unsupervised and partially anchored network alignment

29 Functional maps for functional metagenomics 29 GOS 4441599.3 Hypersaline Lagoon, Ecuador KEGG Pathways Organisms Pathogens Env. Mapping genes into pathways Mapping pathways into organisms + Integrated functional interaction networks in 27 species Mapping organisms into phyla =

30 Functional maps for functional metagenomics 30 Nodes Process cohesiveness in obesity Very Downregulated Baseline (no change) Very Upregulated Edges Process association in obesity More Coregulated Less Coregulated Baseline (no change)

31 Sleipnir C++ library for computational functional genomics Data types for biological entities Microarray data, interaction data, genes and gene sets, functional catalogs, etc. etc. Network communication, parallelization Efficient machine learning algorithms Generative (Bayesian) and discriminative (SVM) And it’s fully documented! Efficient Computation For Biological Discovery Massive datasets and genomes require efficient algorithms and implementations. 31 It’s also speedy: microbial data integration computation takes <3hrs.

32 Outline 32 1. Data mining: Algorithms for integrating very large data compendia 2. Metagenomics: Network models of microbial communities Bayesian and unsupervised methods for data integration HEFalMp system for human data analysis and integration Functional mapping to statistically summarize large data collections Integration for microbial communities and metagenomics Accurate cross-species interactome transfer Sleipnir software for efficient large scale data mining

33 Thanks! 33 http://function.princeton.edu/hefalmp http://huttenhower.sph.harvard.edu/sleipnir Olga Troyanskaya Chris Park David Hess Matt Hibbs Chad Myers Ana Pop Aaron Wong Hilary Coller Erin Haley Jacques Izard Wendy Garrett Sarah Fortune Tracy Rosebrock

34


Download ppt "Large scale genomic data integration for functional genomics and metagenomics Curtis Huttenhower 05-21-10 Harvard School of Public Health Department of."

Similar presentations


Ads by Google