Scalable data mining for functional genomics and metagenomics Curtis Huttenhower 01-06-1011 Harvard School of Public Health Department of Biostatistics.

1 Scalable data mining for functional genomics and metagenomics Curtis Huttenhower 01-06-1011 Harvard School of Public Health Department of Biostatistics

2 What tools enable biological discoveries? Our job is to create computational microscopes: to ask and answer specific biomedical questions using millions of experimental results.

3 Outline: 1. Data mining: Integrating very large genomic data compendia. 2. Metagenomics: Modeling microbial communities for public health.

4 A computational definition of functional genomics. Diagram labels: Genomic data; Prior knowledge; Data ↓ Function ↓ Function Gene ↓ Gene ↓ Function

5 A framework for functional genomics. [Figure: gene pairs with known labels (G1-G2 +, G4-G9 +, …, G3-G6 -, G7-G8 -, …) and an unlabeled pair (G2-G5 ?) are scored in each dataset (high/low similarity, high/low correlation, let./not let.). Per-dataset frequency histograms for the + and - pairs are combined across 1Ks of datasets and 100Ms of gene pairs to give, for example, P(G2-G5|Data) = 0.85.]
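
The figure on this slide suggests a Bayesian combination of per-dataset evidence into one posterior per gene pair. The slide does not state the exact model, so the following is only a minimal sketch that assumes naive Bayes over binned scores. The function name `posterior_related`, the prior, and all likelihood values are hypothetical and chosen for illustration.

```python
from math import prod


def posterior_related(prior, likelihoods):
    """Naive Bayes posterior that a gene pair is functionally related.

    prior: assumed P(related) before looking at any dataset.
    likelihoods: one (P(bin | related), P(bin | unrelated)) tuple per dataset,
        for the score bin the pair fell into in that dataset. In practice
        these would come from frequency histograms of known + and - pairs.
    """
    p_rel = prior * prod(l[0] for l in likelihoods)
    p_unrel = (1.0 - prior) * prod(l[1] for l in likelihoods)
    return p_rel / (p_rel + p_unrel)


# Hypothetical evidence for one unlabeled pair across three datasets.
evidence = [(0.6, 0.2), (0.5, 0.3), (0.7, 0.4)]
p = posterior_related(prior=0.1, likelihoods=evidence)
print(f"P(related | data) = {p:.3f}")
```

Because each dataset contributes one multiplicative term, a loop of this shape scales linearly in the number of datasets per pair, which is what makes the 100Ms-of-pairs by 1Ks-of-datasets setting tractable.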

6 Functional network prediction and analysis. Global interaction network; Carbon metabolism network; Extracellular signaling network; Gut community network. HEFalMp: currently includes data from 30,000 human experimental results, 15,000 expression conditions + 15,000 diverse others, analyzed for 200 biological functions and 150 diseases.

7 Meta-analysis for unsupervised functional data integration (Evangelou 2007; Huttenhower 2006; Hibbs 2007). Simple regression: all datasets are equally accurate. Random effects: variation within and among datasets and interactions.
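
To make the contrast between the two models concrete, here is a hedged sketch for a single gene pair. It assumes Pearson correlations are Fisher z-transformed and that between-dataset variance is estimated with the DerSimonian-Laird method; the slide names neither choice, and `combine_correlations` is a hypothetical helper, not the method from the cited papers.

```python
import math


def combine_correlations(rs, ns):
    """Combine one gene pair's correlations across datasets.

    rs: Pearson correlation of the pair in each dataset.
    ns: number of conditions in each dataset (each must exceed 3).
    Returns (equal-weight estimate, random-effects estimate, tau^2).
    """
    zs = [math.atanh(r) for r in rs]          # Fisher z-transform
    vs = [1.0 / (n - 3) for n in ns]          # within-dataset variance of z

    # "All datasets are equally accurate": plain unweighted mean.
    z_equal = sum(zs) / len(zs)

    # DerSimonian-Laird estimate of between-dataset variance tau^2.
    w = [1.0 / v for v in vs]
    z_fixed = sum(wi * zi for wi, zi in zip(w, zs)) / sum(w)
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, zs))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(zs) - 1)) / c) if c > 0 else 0.0

    # Random effects: weights reflect variation within and among datasets.
    w_re = [1.0 / (v + tau2) for v in vs]
    z_random = sum(wi * zi for wi, zi in zip(w_re, zs)) / sum(w_re)
    return math.tanh(z_equal), math.tanh(z_random), tau2


# Hypothetical pair measured in a large and a small dataset that disagree.
print(combine_correlations([0.9, 0.1], [103, 13]))
```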

8 Meta-analysis for unsupervised functional data integration (Evangelou 2007; Huttenhower 2006; Hibbs 2007).

9 Unsupervised data integration: TB virulence and ESX-1 secretion. With Sarah Fortune. Graphle: http://huttenhower.sph.harvard.edu/graphle/

10 Unsupervised data integration: TB virulence and ESX-1 secretion. With Sarah Fortune. Graphle: http://huttenhower.sph.harvard.edu/graphle/ [Figure annotations: X, ?]

11 Outline: 1. Data mining: Integrating very large genomic data compendia. 2. Metagenomics: Modeling microbial communities for public health.

12 What to do with your metagenome? (x10^10) Diagnostic or prognostic biomarker for host disease. Public health tool monitoring population health and interactions. Comprehensive snapshot of microbial ecology and evolution. Reservoir of gene and protein functional information. Who’s there? What are they doing? What do functional genomic data tell us about microbiomes?* What can our microbiomes tell us about us?* (*Using terabases of sequence and thousands of experimental results.)

13 The Human Microbiome Project. 2007 - ongoing. 300 “normal” adults, 18-40. 16S rDNA + WGS. 5 sites/18 samples + blood. Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth. Skin: ears, inner elbows. Nasal cavity. Gut: stool. Vagina: introitus, mid, fornix. Reference genomes (~200+800). All healthy subjects; followup projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, antibiotic resistant infection… (Hamady, 2009; Kolenbrander, 2010)

14 HMP Organisms: Everyone and everywhere is different. [Figure axes: ← Body sites + individuals →, ← Organisms (taxa) →. Site labels: ear, gut, nose, mouth, vagina, arm, mucosa, palate, gingiva, tonsils, saliva, sub. plaq., sup. plaq., throat, tongue.] Every microbiome is surprisingly different. Most organisms are rare in most places. Even common organisms vary tremendously in abundance among individuals. Aerobicity, interaction with the immune system, and extracellular medium appear to be major determinants. There are few, if any, organismal biotypes in health.

15 HMP: Metabolic reconstruction. Data: 300 subjects; 1-3 visits/subject; ~6 body sites/visit; 10-200M reads/sample; 100bp reads. Diagram labels: WGS reads; Functional seq. (KEGG + MetaCYC; CAZy, TCDB, VFDB, MEROPS…); Genes (KOs); Pathways (KEGGs); Pathways/modules. Steps: BLAST → Genes (BLAST ?). Genes → Pathways: MinPath (Ye 2009). Taxonomic limitation: rem. paths in taxa < ave. Smoothing: Witten-Bell. Gap filling: c(g) = max( c(g), median ). Xipe: distinguish zero/low (Rodriguez-Mueller in review).
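
The gap-filling rule c(g) = max( c(g), median ) can be illustrated in a few lines. This sketch assumes the median is taken over the genes of a single pathway, which the slide does not spell out; the gene labels and abundances are hypothetical placeholders.

```python
from statistics import median


def gap_fill(counts):
    """Raise every gene abundance in one pathway to at least the median.

    counts: dict mapping gene identifier -> abundance c(g) within a pathway.
    Returns a new dict with c(g) = max(c(g), median of the pathway's genes).
    """
    m = median(counts.values())
    return {gene: max(c, m) for gene, c in counts.items()}


# Hypothetical pathway in which one gene was missed entirely.
pathway = {"gene_a": 10.0, "gene_b": 0.0, "gene_c": 12.0,
           "gene_d": 8.0, "gene_e": 11.0}
print(gap_fill(pathway))
```

Under this assumption, a gene that went undetected is treated as present at the pathway's typical level, so one missing gene does not by itself make an otherwise well-supported pathway look absent.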

16 HMP: Metabolic reconstruction. Panels: Pathway coverage; Pathway abundance.

17 HMP: Metabolic reconstruction. Pathway abundance. [Figure axes: ← Samples →, ← Pathways →]

18 HMP: Metabolic reconstruction. Pathway coverage. [Figure axes: ← Samples →, ← Pathways →] Groups: Aerobic body sites; Gastrointestinal body sites; All body sites (“core”).

19 Metagenomic biomarker discovery. Flowchart labels: Gene expression; SNP genotypes; Healthy/IBD; BMI; Diet; Taxa & pathways; Batch effects?; Population structure?; Niches & Phylogeny; Test for correlates; Multiple hypothesis correction; Feature selection, p >> n; Confounds/stratification/environment; Cross-validate; Biological story?; Independent sample; Intervention/perturbation.
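
The "Multiple hypothesis correction" step matters because thousands of taxa and pathways are tested at once. The slide does not name a method; the sketch below assumes the Benjamini-Hochberg false discovery rate procedure, and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from testing four features against Healthy/IBD.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```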

20 LEfSe: Metagenomic class comparison and explanation. LEfSe (LDA + Effect Size): http://huttenhower.sph.harvard.edu/lefse Nicola Segata

21 LEfSe: The TRUC murine colitis microbiota. With Wendy Garrett

22 MetaHIT: The gut microbiome and IBD. 124 subjects: 99 healthy, 21 UC + 4 CD (Qin 2010). Diagram labels: WGS reads; Taxa (Phymm, Brady 2009); Genes (KOs); Pathways (KEGGs); Pathways/modules. ReBLASTed against KEGG since published data obfuscates read counts. With Ramnik Xavier, Joshua Korzenik

23 MetaHIT: Taxonomic CD biomarkers. Figure labels: Firmicutes; Enterobacteriaceae; Up in CD; Down in CD; UC.

24 MetaHIT: Functional CD biomarkers. Panels: Subset of enriched modules in CD patients; Subset of enriched pathways in CD patients. Figure labels: Motility; Transporters; Sugar metabolism; Growth/replication; Down in CD; Up in CD.

25 MetaHIT: Enzymes and metabolites over/under-enriched in the CD microbiome. Figure labels: Transporters; Growth/replication; Motility; Sugar metabolism; Down in CD; Up in CD; Inferred metabolites; Enzyme families.

26 Outline: 1. Data mining: Integrating very large genomic data compendia. Network framework for scalable data integration; HEFalMp: human data integration; Meta-analysis for unsupervised functional network integration. 2. Metagenomics: Modeling microbial communities for public health. HMP: microbiome in health, 18 body sites in 300 subjects; HUMAnN: metagenomic metabolic and functional pathway reconstruction; LEfSe: biologically relevant community differences.

27 Thanks! Jacques Izard, Wendy Garrett, Pinaki Sarder, Nicola Segata, Levi Waldron, Larisa Miropolsky. http://huttenhower.sph.harvard.edu Interested? We’re recruiting students and postdocs! Human Microbiome Project; HMP Metabolic Reconstruction: George Weinstock, Jennifer Wortman, Owen White, Makedonka Mitreva, Erica Sodergren, Vivien Bonazzi, Jane Peterson, Lita Proctor, Sahar Abubucker, Yuzhen Ye, Beltran Rodriguez-Mueller, Jeremy Zucker, Qiandong Zeng, Mathangi Thiagarajan, Brandi Cantarel, Maria Rivera, Barbara Methe, Bill Klimke, Daniel Haft. Ramnik Xavier, Dirk Gevers, Bruce Birren, Mark Daly, Doyle Ward, Eric Alm, Ashlee Earl, Lisa Cosimi. Sarah Fortune. http://huttenhower.sph.harvard.edu/sleipnir

29 Functional network prediction from diverse microbial data. 486 bacterial expression experiments: 876 raw datasets; 310 postprocessed datasets; 304 normalized coexpression networks in 27 species. 307 bacterial interaction experiments: 154796 raw interactions; 114786 postprocessed interactions. Integrated functional interaction networks in 15 species. [Figure: E. coli integration; axes: Precision, Recall]

30 Predicting gene function. Cell cycle genes. Legend: predicted relationships between genes (High Confidence, Low Confidence).

31 Predicting gene function. Cell cycle genes. Legend: predicted relationships between genes (High Confidence, Low Confidence).

32 Predicting gene function. Legend: predicted relationships between genes (High Confidence, Low Confidence). These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
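
One way to turn edges into a per-gene score is sketched below. The slide does not give a formula, so this assumes a simple ratio: the candidate gene's mean edge weight to the process genes divided by its mean edge weight to all genes. The name `process_specificity`, the `frozenset`-keyed weight dictionary, and all values are hypothetical.

```python
def process_specificity(gene, process_genes, weights, all_genes):
    """Score how specifically `gene` connects to a process in a network.

    weights: dict mapping frozenset({a, b}) -> edge confidence in [0, 1];
        missing pairs are treated as weight 0.
    Returns mean weight to process genes / mean weight to all other genes.
    A value above 1 means stronger-than-background connectivity.
    """
    def w(a, b):
        return weights.get(frozenset((a, b)), 0.0)

    in_process = [w(gene, g) for g in process_genes if g != gene]
    background = [w(gene, g) for g in all_genes if g != gene]
    mean_in = sum(in_process) / len(in_process)
    mean_bg = sum(background) / len(background)
    return mean_in / mean_bg if mean_bg > 0 else 0.0


# Hypothetical network: X is strongly linked to process genes A and B.
edges = {
    frozenset(("X", "A")): 0.9, frozenset(("X", "B")): 0.8,
    frozenset(("X", "C")): 0.1, frozenset(("X", "D")): 0.2,
}
print(process_specificity("X", {"A", "B"}, edges, {"A", "B", "C", "D", "X"}))
```

Dividing by the background keeps highly connected hub genes from scoring well for every process, which is one reading of "specifically" on the slide.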

33 Comprehensive validation of computational predictions. Diagram labels: Genomic data; Prior knowledge; Computational predictions of gene function (MEFIT; SPELL, Hibbs et al 2007; bioPIXIE, Myers et al 2005); Genes predicted to function in mitochondrion organization and biogenesis; Laboratory experiments (petite frequency, growth curves, confocal microscopy); New known functions for correctly predicted genes; Retraining. With David Hess, Amy Caudy

34 Evaluating the performance of computational predictions. Genes involved in mitochondrion organization and biogenesis: 106 Original GO Annotations; 135 Under-annotations; 82 Novel Confirmations, First Iteration; 17 Novel Confirmations, Second Iteration. 340 total: >3x previously known genes in ~5 person-months.

35 Evaluating the performance of computational predictions. Genes involved in mitochondrion organization and biogenesis: 106 Original GO Annotations; 95 Under-annotations; 40 Confirmed Under-annotations; 80 Novel Confirmations, First Iteration; 17 Novel Confirmations, Second Iteration. 340 total: >3x previously known genes in ~5 person-months. Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

36 Functional mapping: mining integrated networks. Chemotaxis. Legend: predicted relationships between genes (High Confidence, Low Confidence). The strength of these relationships indicates how cohesive a process is.

37 Functional mapping: mining integrated networks. Chemotaxis. Legend: predicted relationships between genes (High Confidence, Low Confidence).

38 Functional mapping: mining integrated networks. Chemotaxis; Flagellar assembly. Legend: predicted relationships between genes (High Confidence, Low Confidence). The strength of these relationships indicates how associated two processes are.
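
Both quantities described on these slides, how cohesive a process is and how associated two processes are, can be sketched as mean edge weights compared with a genome-wide baseline. The slides do not give the exact statistic, so the ratio used here is an assumption, and the function names, gene labels, and weights are hypothetical.

```python
from itertools import combinations, product


def mean_weight(pairs, weights):
    """Mean edge weight over gene pairs; missing edges count as 0."""
    pairs = list(pairs)
    return sum(weights.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


def cohesiveness(process, weights, all_genes):
    """Mean within-process edge weight relative to the genomic background."""
    baseline = mean_weight(combinations(sorted(all_genes), 2), weights)
    within = mean_weight(combinations(sorted(process), 2), weights)
    return within / baseline


def association(proc1, proc2, weights, all_genes):
    """Mean between-process edge weight relative to the genomic background."""
    baseline = mean_weight(combinations(sorted(all_genes), 2), weights)
    between = mean_weight(
        ((a, b) for a, b in product(proc1, proc2) if a != b), weights
    )
    return between / baseline


# Hypothetical four-gene network with two tight, weakly linked processes.
genes = {"a", "b", "c", "d"}
net = {
    frozenset(("a", "b")): 0.9, frozenset(("c", "d")): 0.8,
    frozenset(("a", "c")): 0.1, frozenset(("a", "d")): 0.1,
    frozenset(("b", "c")): 0.1, frozenset(("b", "d")): 0.1,
}
print(cohesiveness({"a", "b"}, net, genes))
print(association({"a", "b"}, {"c", "d"}, net, genes))
```

Values above 1 indicate stronger connectivity than the genomic background and values below 1 indicate weaker connectivity.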

39 Functional mapping: Associations among processes. Edges: associations between processes (Very Strong, Moderately Strong). Processes shown: Hydrogen Transport; Electron Transport; Cellular Respiration; Protein Processing; Peptide Metabolism; Cell Redox Homeostasis; Aldehyde Metabolism; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Organelle Fusion; Protein Depolymerization; Organelle Inheritance.

40 Functional mapping: Associations among processes. Edges: associations between processes (Very Strong, Moderately Strong). Borders: data coverage of processes (Well Covered, Sparsely Covered). Processes shown: Hydrogen Transport; Electron Transport; Cellular Respiration; Protein Processing; Peptide Metabolism; Cell Redox Homeostasis; Aldehyde Metabolism; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Organelle Fusion; Protein Depolymerization; Organelle Inheritance.

41 Functional mapping: Associations among processes. Edges: associations between processes (Very Strong, Moderately Strong). Nodes: cohesiveness of processes (Below Baseline (genomic background), Very Cohesive). Borders: data coverage of processes (Well Covered, Sparsely Covered). Processes shown: Hydrogen Transport; Electron Transport; Cellular Respiration; Protein Processing; Peptide Metabolism; Cell Redox Homeostasis; Aldehyde Metabolism; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Organelle Fusion; Protein Depolymerization; Organelle Inheritance.

42 Functional mapping: Associations among processes. Edges: associations between processes (Very Strong, Moderately Strong). Nodes: cohesiveness of processes (Below Baseline (genomic background), Very Cohesive). Borders: data coverage of processes (Well Covered, Sparsely Covered).

