Scalable data mining for functional genomics and metagenomics Curtis Huttenhower 09-16-10 Harvard School of Public Health Department of Biostatistics
Greatest discoveries in biology? Our job is to create computational microscopes: To ask and answer specific biological questions using millions of experimental results
Outline 1. Data mining: Integrating very large genomic data compendia 2. Metagenomics: Network models of microbial communities
A computational definition of functional genomics Prior knowledge Genomic data Gene ↓ Function Gene ↓ Data ↓ Function Function ↓
A framework for functional genomics. 100Ms of gene pairs × 1Ks of datasets. [Figure: gene pairs marked +, - or ? with per-dataset scores; per-dataset frequency distributions over correlation (high to low), let./not let., and similar/dissim. are combined to give P(G2-G5|Data) = 0.85.]
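One way to read the framework above is as a naive Bayes combination of per-dataset evidence. The sketch below is a minimal illustration under that assumption; the slide does not spell out the model, and the dataset names, probability tables, and prior are hypothetical values chosen only for the example.

```python
def posterior_related(observations, likelihoods, prior=0.05):
    """Naive Bayes posterior that a gene pair is functionally related.

    observations: {dataset: discretized value observed for the pair}
    likelihoods:  {dataset: {value: (P(value | related), P(value | unrelated))}}
    prior:        P(related) before seeing any data (hypothetical default)
    """
    p_rel = prior
    p_unrel = 1.0 - prior
    for dataset, value in observations.items():
        l_rel, l_unrel = likelihoods[dataset][value]
        # Datasets are treated as conditionally independent (the "naive" step).
        p_rel *= l_rel
        p_unrel *= l_unrel
    return p_rel / (p_rel + p_unrel)


# Hypothetical conditional probability tables for three datasets.
likelihoods = {
    "coexpression": {"high": (0.7, 0.2), "low": (0.3, 0.8)},
    "physical_interaction": {"observed": (0.4, 0.05), "not_observed": (0.6, 0.95)},
    "sequence_similarity": {"similar": (0.5, 0.1), "dissimilar": (0.5, 0.9)},
}
pair = {
    "coexpression": "high",
    "physical_interaction": "not_observed",
    "sequence_similarity": "similar",
}
print(posterior_related(pair, likelihoods))
```

Each dataset contributes one likelihood ratio, so adding a dataset only adds one table lookup per gene pair, which is what makes this kind of integration scale to thousands of datasets.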
Functional network prediction and analysis. Global interaction network. HEFalMp currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases. Carbon metabolism network; extracellular signaling network; gut community network.
Functional network prediction from diverse microbial data. 486 bacterial expression experiments: 876 raw datasets; 310 postprocessed datasets; 304 normalized coexpression networks in 27 species. 307 bacterial interaction experiments: 154796 raw interactions; 114786 postprocessed interactions. Integrated functional interaction networks in 15 species. E. coli integration: topmost individual E. coli datasets are heat shock, which only activate stress response/translation/ribosome. [Plot axes: precision, recall]
Meta-analysis for unsupervised functional data integration Huttenhower 2006 Hibbs 2007 Evangelou 2007 Simple regression: All datasets are equally accurate Random effects: Variation within and among datasets and interactions
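The contrast between weighting all datasets equally and a random-effects model can be sketched as follows. This is a minimal illustration, assuming per-dataset correlations are Fisher z-transformed and pooled with the standard DerSimonian-Laird estimator; the slide does not state which estimator was used, and the function names are hypothetical.

```python
import math


def fisher_z(r):
    """Fisher z-transform of a correlation coefficient."""
    return math.atanh(r)


def random_effects(correlations, sample_sizes):
    """DerSimonian-Laird random-effects pooling of per-dataset correlations.

    correlations: one correlation per dataset for the same gene pair
    sample_sizes: number of conditions in each dataset (each must be > 3)
    Returns (pooled correlation, between-dataset variance tau^2).
    """
    z = [fisher_z(r) for r in correlations]
    v = [1.0 / (n - 3) for n in sample_sizes]    # within-dataset variance of z
    w = [1.0 / vi for vi in v]                   # fixed-effect weights
    z_fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    # Cochran's Q measures heterogeneity among datasets.
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, z))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(z) - 1)) / c) if c > 0 else 0.0
    # Random-effects weights include both within- and among-dataset variance.
    w_star = [1.0 / (vi + tau2) for vi in v]
    z_random = sum(wi * zi for wi, zi in zip(w_star, z)) / sum(w_star)
    return math.tanh(z_random), tau2


pooled, tau2 = random_effects([0.9, 0.1, -0.3], [50, 50, 50])
print(pooled, tau2)
```

When the datasets agree, tau^2 shrinks to zero and the result reduces to inverse-variance weighting; when they disagree, the weights move toward equal and the pooled estimate becomes more conservative.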
Unsupervised data integration: TB virulence and ESX-1 secretion With Sarah Fortune Graphle http://huttenhower.sph.harvard.edu/graphle/
Predicting gene function. Predicted relationships between genes (confidence: high to low). These edges provide a measure of how likely a gene is to specifically participate in the process of interest. Cell cycle genes.
Comprehensive validation of computational predictions. With David Hess, Amy Caudy. Genomic data and prior knowledge; computational predictions of gene function (MEFIT; SPELL, Hibbs et al 2007; bioPIXIE, Myers et al 2005); genes predicted to function in mitochondrion organization and biogenesis; laboratory experiments (petite frequency, growth curves, confocal microscopy); new known functions for correctly predicted genes; retraining.
Evaluating the performance of computational predictions. Genes involved in mitochondrion organization and biogenesis: 106 original GO annotations; 135 under-annotations; 82 novel confirmations, first iteration; 17 novel confirmations, second iteration. 340 total: >3x previously known genes in ~5 person-months.
Evaluating the performance of computational predictions. Genes involved in mitochondrion organization and biogenesis. Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated. 106 original GO annotations; 95 under-annotations; 40 confirmed under-annotations; 80 novel confirmations, first iteration; 17 novel confirmations, second iteration. 340 total: >3x previously known genes in ~5 person-months.
Functional mapping: mining integrated networks Predicted relationships between genes High Confidence Low The strength of these relationships indicates how cohesive a process is. Chemotaxis
Functional mapping: mining integrated networks Predicted relationships between genes High Confidence Low The strength of these relationships indicates how associated two processes are. Chemotaxis Flagellar assembly
Functional mapping: Associations among processes. Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance. Legend: Edges: associations between processes (moderately strong to very strong). Nodes: cohesiveness of processes (below baseline; baseline, i.e. genomic background; very cohesive). Borders: data coverage of processes (sparsely covered to well covered).
Cross-species knowledge transfer using functional data Pinaki Sarder TaFTan
TaFTan: Cross-species knowledge transfer using functional data. Panels: E. coli, P. aeruginosa, B. subtilis, M. tuberculosis; axes: log(precision/random) vs. log(recall). Curves: species-specific data; species’ data excluded; all species’ data. Important to take advantage of all available data for any one organism. Important to take advantage of all available data for every organism. Scalable to dozens of organisms with hundreds of functional datasets. Currently working on making this more context-specific.
Outline 1. Data mining: Integrating very large genomic data compendia 2. Metagenomics: Network models of microbial communities
So what does all of this have to do with microbial communities? ~2000: AML/ALL, Survival, Mutation, Batch effects, Gene expression, Functional modules
~2005: Healthy/Diabetes, BMI, M/F, Population structure, LD, SNP genotypes
2010: Healthy/IBD, Temperature, Location, Intervention/perturbation, Taxa & Orthologs, Niches & Phylogeny. Test for correlates; confounds/stratification/environment; feature selection (p >> n); multiple hypothesis correction; cross-validate; independent sample. Biological story? ???
What’s metagenomics? Total collection of microorganisms within a community (also microbial community or microbiota). Total genomic potential of a microbial community. Study of uncultured microorganisms from the environment, which can include humans or other living hosts. Total biomolecular repertoire of a microbial community.
The Human Microbiome Project 300 “normal” adults, 18-40 16S rDNA + WGS 5 sites/18 samples + blood Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth Skin: ears, inner elbows Nasal cavity Gut: stool Vagina: introitus, mid, fornix Reference genomes (~200-800) Hamady, 2009 All healthy subjects; followup projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, resistant infection… 2006 - ongoing
What features to test? Microbiome data; genomic data (reference genomes); functional data (experimental models). 16S reads: binning, taxa. WGS reads: clustering, orthologous clusters, functional roles, pathways/modules, pathway activity.
HMP: Data features 16S reads Taxa Genes (KOs) Pathways (KEGGs) Orthologous clusters Genes (KOs) Pathways/ modules Pathways (KEGGs)
HMP: Body sites Vanilla linear SVM Taxa KOs KEGGs Then full confusion matrix KEGGs
HMP: Subjects. We can tell who you are by the bugs in your mouth! Taxa, KEGGs
HMP: Metabolic reconstruction. 300 subjects; 1-3 visits/subject; 15-18 body sites/visit; 10-20M reads/sample; 100bp reads. Functional seq.: KEGG + MetaCyc, CAZy, TCDB, VFDB, MEROPS… BLAST: WGS reads → Genes (KOs). Genes → Pathways/modules (KEGGs): MinPath (Ye 2009). Smoothing: Witten-Bell? Gap filling.
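The smoothing step above names Witten-Bell with a question mark, so the following is only a sketch of the textbook Witten-Bell estimator applied to per-sample KO read counts. How (or whether) the HMP pipeline applied it is not stated on the slide; the function name and the toy counts are hypothetical.

```python
def witten_bell(counts, vocabulary):
    """Witten-Bell smoothed probabilities over a fixed vocabulary of KOs.

    counts:     {ko: observed read count}; zero or missing means unseen
    vocabulary: iterable of every KO that could occur in the sample
    """
    vocabulary = list(vocabulary)
    n = sum(counts.get(k, 0) for k in vocabulary)         # total observations
    t = sum(1 for k in vocabulary if counts.get(k, 0) > 0)  # distinct seen KOs
    z = len(vocabulary) - t                               # unseen KOs
    if n == 0:
        # Nothing observed: fall back to a uniform distribution.
        return {k: 1.0 / len(vocabulary) for k in vocabulary}
    if z == 0:
        # Everything observed: no mass needs to be reserved.
        return {k: counts[k] / n for k in vocabulary}
    probs = {}
    for k in vocabulary:
        c = counts.get(k, 0)
        # Seen KOs are discounted; the reserved mass t / (n + t) is
        # shared equally among the z unseen KOs.
        probs[k] = c / (n + t) if c > 0 else t / (z * (n + t))
    return probs


print(witten_bell({"K00001": 8, "K00002": 2}, ["K00001", "K00002", "K00003", "K00004"]))
```

The reserved mass grows with the number of distinct KOs seen, so shallowly sequenced samples with many singletons give unseen KOs more weight than deeply sequenced ones.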
HMP: Metabolic reconstruction Pathway abundance ← Samples → ← Pathways→ All body sites (“core”) Aerobic body sites Gastrointestinal body sites Pathway coverage
MetaHIT: Data features. 85 healthy, 15 IBD + 12 healthy, 12 IBD. ReBLASTed against KEGG since the published data obfuscate read counts. 10x bootstrap within training cohort, test on 12+12 as validation. WGS reads; Taxa (Phymm, Brady 2009); Genes (KOs); Pathways/modules; Pathways (KEGGs).
MetaHIT: Taxonomic CD biomarkers Bacteroidetes Methanomicrobia Enterobacteriaceae Firmicutes Chromatiales Desulfobacterales Bradyrhizobiaceae iTOL Letunic 2007 Rhodobacteraceae Oxalobacteraceae
MetaHIT: Taxonomic CD biomarkers Down in CD Up in CD
MetaHIT: Functional CD biomarkers Down in CD Growth/replication Motility Transporters Sugar metabolism Up in CD
MetaHIT: KO IBD biomarkers (LEfSe, Nicola Segata). Down in IBD / Up in IBD: growth/replication, motility, transporters, sugar metabolism. The same analysis can be performed using KOs instead of KEGGs. The same four processes appear, organized into several fairly specific complexes (PTS systems, flagellar assembly, etc.), in addition to transport up in IBD (some metal ion, some sugar). As an aside, essentially all of the metagenomic data I’ve looked at is tremendously rich in transporters and signaling molecules.
Metagenomic differential analysis: LEfSe. 1. Is there a statistically significant difference? (t-tests, ANOVA, MANOVA, Friedman, Kruskal–Wallis…) 2. Is the difference biologically significant? (expert supervision, specific post-hoc tests…) 3. How large is the difference? (PCA, LDA, mean difference, class or cluster distance…) LEfSe example: p(ANOVA) < 0.05; pairwise post-hoc Wilcoxon OK; Log(Score(LDA)) = 3.68
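The first of the three questions above (is there a statistically significant difference?) can be illustrated with one of the listed tests. The sketch below computes a Kruskal-Wallis H statistic for one feature across classes and estimates its p-value by permutation rather than the usual chi-square approximation; that substitution, the function names, and the toy abundances are assumptions made for the example. It does not reproduce LEfSe itself, and the post-hoc Wilcoxon and LDA effect-size steps are not shown.

```python
import random


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of groups."""
    values = [v for g in groups for v in g]
    ranks = average_ranks(values)
    n = len(values)
    total, start = 0.0, 0
    for g in groups:
        r = sum(ranks[start:start + len(g)])
        total += r * r / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


def permutation_pvalue(groups, n_perm=2000, seed=0):
    """Fraction of label permutations with H at least as large as observed."""
    rng = random.Random(seed)
    observed = kruskal_h(groups)
    values = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)
        shuffled, start = [], 0
        for s in sizes:
            shuffled.append(values[start:start + s])
            start += s
        if kruskal_h(shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical relative abundances of one feature in two classes.
healthy = [0.01, 0.02, 0.03, 0.04, 0.05]
disease = [0.06, 0.07, 0.08, 0.09, 0.10]
print(kruskal_h([healthy, disease]), permutation_pvalue([healthy, disease]))
```

A rank-based test is a natural fit here because relative abundances are rarely normally distributed; features that pass this filter would then go on to the pairwise and effect-size steps.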
LEfSe: A non-human example. Viromes vs. bacterial metagenomes (Dinsdale 2008). Metastats (White 2009): p < 0.001. ANOVA: p < 0.05. LEfSe: DIFF! LEfSe: NO DIFF! Hi-level functional categories: Nucleosides and Nucleotides; Transporters; Carbohydrates. Microbial vs. viral.
Sleipnir: Software for scalable functional genomics. Massive datasets require efficient algorithms and implementations. Sleipnir: C++ library for computational functional genomics. Data types for biological entities: microarray data, interaction data, genes and gene sets, functional catalogs, etc. Network communication, parallelization. Efficient machine learning algorithms: generative (Bayesian) and discriminative (SVM). And it’s fully documented! It’s also speedy: microbial data integration computation takes <3hrs.
Outline 1. Data mining: Integrating very large genomic data compendia. Network framework for scalable data integration; HEFalMp: human data integration; TaFTan: cross-species knowledge transfer from functional data. 2. Metagenomics: Network models of microbial communities. 16S and WGS community metabolic reconstruction; LEfSe: biologically relevant community differences. Sleipnir: software for scalable genomic data mining.
Thanks! Sarah Fortune, Pinaki Sarder, Nicola Segata, Levi Waldron, Larisa Miropolsky, Willythssa Pierre-Louis, Olga Troyanskaya, Chris Park, David Hess, Matt Hibbs, Chad Myers, Ana Pop, Aaron Wong, Hilary Coller, Erin Haley, Jacques Izard, Wendy Garrett. Interested? We’re looking for postdocs! http://huttenhower.sph.harvard.edu http://huttenhower.sph.harvard.edu/sleipnir
HEFalMp: Predicting human gene function
HEFalMp: Predicting human genetic interactions
HEFalMp: Analyzing human genomic data
HEFalMp: Understanding human disease
Validating Human Predictions With Erin Haley, Hilary Coller Autophagy 5½ of 7 predictions currently confirmed Predicted novel autophagy proteins Luciferase (Negative control) ATG5 (Positive control) LAMP2 RAB11A Not Starved Starved (Autophagic)
Functional Mapping: Scoring Functional Associations. How can we formalize these relationships? Any sets of genes G1 and G2 in a network can be compared using four measures: edges between their genes; edges within each set; the background edges incident to each set; the baseline of all edges in the network. Stronger connections between the sets increase association. Stronger within self-connections or nonspecific background connections decrease association.
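The four measures above can be combined in more than one way, and the slide does not give the formula. The sketch below assumes a simple ratio form, (between / background) × (baseline / within), using mean edge weights; it is chosen only because it moves in the directions the slide describes and equals 1 in a uniform network. The helper names and the dictionary-based network representation are hypothetical.

```python
from itertools import combinations, product


def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs) if xs else 0.0


def make_weight(edges):
    """Wrap {frozenset({a, b}): weight} as a symmetric lookup; missing edges are 0."""
    def weight(a, b):
        return edges.get(frozenset((a, b)), 0.0)
    return weight


def functional_association(g1, g2, genes, weight):
    """Assumed ratio-form association score between gene sets g1 and g2.

    genes:  all genes in the network
    weight: symmetric function giving the edge weight between two genes
    """
    g1, g2 = set(g1), set(g2)
    union = g1 | g2
    # Edges between the two sets.
    between = mean(weight(a, b) for a, b in product(g1, g2) if a != b)
    # Edges within each set (each set needs at least two genes).
    within = mean(
        [weight(a, b) for a, b in combinations(sorted(g1), 2)]
        + [weight(a, b) for a, b in combinations(sorted(g2), 2)]
    )
    # Background: every edge incident to either set.
    background = mean(weight(a, b) for a in union for b in genes if a != b)
    # Baseline: every edge in the network.
    baseline = mean(weight(a, b) for a, b in combinations(sorted(genes), 2))
    if within == 0.0 or background == 0.0:
        raise ValueError("score undefined when within or background weight is zero")
    return (between / background) * (baseline / within)
```

With this form, raising the between-set weights raises the score, while raising within-set or background weights lowers it, matching the qualitative behavior stated on the slide.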
Functional Mapping: Bootstrap p-values. Scoring functional associations is great… how do you interpret an association score? For gene sets of arbitrary sizes? In arbitrary graphs? Each with its own bizarre distribution of edges? For any graph, compute FA scores for many randomly chosen gene sets of different sizes. Empirically, the null distribution is approximately normal with mean 1. Standard deviation is asymptotic in the sizes of both gene sets. Maps FA scores to p-values for any gene sets and underlying graph. [Figures: histograms of FAs for random sets (# genes: 1, 5, 10, 50); null distribution σs for one graph]
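The procedure above (score many random gene sets, then read a p-value off an approximately normal null) can be sketched generically. The code below is a minimal illustration that takes any scoring function as an argument; the function names, the sample count, and the toy score are hypothetical, and the normal approximation is the empirical observation reported on the slide rather than something the code verifies.

```python
import math
import random


def null_parameters(score, genes, size1, size2, n_samples=1000, seed=0):
    """Mean and standard deviation of scores for random gene sets of given sizes."""
    rng = random.Random(seed)
    genes = list(genes)
    scores = []
    for _ in range(n_samples):
        g1 = rng.sample(genes, size1)
        g2 = rng.sample(genes, size2)
        scores.append(score(g1, g2))
    mu = sum(scores) / len(scores)
    var = sum((s - mu) ** 2 for s in scores) / (len(scores) - 1)
    return mu, math.sqrt(var)


def upper_tail_pvalue(observed, mu, sigma):
    """One-sided p-value under a normal null: P(X >= observed). Requires sigma > 0."""
    return 0.5 * math.erfc((observed - mu) / (sigma * math.sqrt(2.0)))


# Toy stand-in for an association score; any callable taking two gene sets works.
def toy_score(g1, g2):
    return (sum(g1) / len(g1)) / (sum(g2) / len(g2))


mu, sigma = null_parameters(toy_score, range(1, 101), 5, 5)
print(upper_tail_pvalue(2.0, mu, sigma))
```

Because the null depends on the sizes of both gene sets, the parameters would be estimated once per size combination (or interpolated) and reused for every query against the same graph.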
Functional maps for cross-species knowledge transfer O1: G1, G2, G3 O2: G4 O3: G6 … ECG1, ECG2 BSG1 ECG3, BSG2 … G17 G16 G15 G10 G6 G9 G8 G5 G11 G7 G12 G13 G14 G2 G1 G4 G3 O8 O4 O5 O7 O9 O6 O2 O3 O1
Functional maps for functional metagenomics GOS 4441599.3 Hypersaline Lagoon, Ecuador + KEGG Pathways Integrated functional interaction networks in 27 species Mapping organisms into phyla Env. Organisms Pathogens = Mapping genes into pathways Five query genes from carbon fixation Four query pathways from chemotaxis Five most abundant query organisms Mapping pathways into organisms
Functional Maps: Focused Data Summarization ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?
Functional Maps: Focused Data Summarization ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information? Functional mapping Very large collections of genomic data Specific predicted molecular interactions Pathway, process, or disease associations Underlying experimental results and functional activities in data
Functional maps for cross-species knowledge transfer. Project each of ~12 species with good data into KEGG orthology. Reproject into Bacillus subtilis genome, weighting by functional similarity. Correlation between normalized pathway occurrences: if a pathway’s all there in two organisms, they’re more similar; only partially there, less similar. Following up with unsupervised and partially anchored network alignment. [Plot axes: precision, recall]
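The similarity measure described above can be sketched as a correlation between per-pathway occurrence vectors. This is a minimal illustration assuming that "normalized pathway occurrence" means the fraction of a pathway's KOs present in an organism and that the correlation is Pearson's; neither detail is stated on the slide, and the pathway names and KO sets are hypothetical.

```python
import math


def normalized_occurrence(present_kos, pathways):
    """Fraction of each pathway's KOs present in an organism, in pathway order."""
    present = set(present_kos)
    return [len(set(kos) & present) / len(kos) for kos in pathways.values()]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


# Hypothetical pathways and organisms.
pathways = {
    "chemotaxis": ["K1", "K2", "K3", "K4"],
    "flagellar_assembly": ["K5", "K6"],
    "carbon_fixation": ["K7", "K8", "K9"],
}
organism_a = {"K1", "K2", "K3", "K4", "K5", "K6"}
organism_b = {"K1", "K2", "K3", "K5", "K6", "K7"}
similarity = pearson(
    normalized_occurrence(organism_a, pathways),
    normalized_occurrence(organism_b, pathways),
)
print(similarity)
```

Such a similarity could then serve as the weight when reprojecting data from one organism into another, so that functionally closer species contribute more.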
LEfSe: A non-human example. Viromes vs. bacterial metagenomes. Metastats (White 2009): p < 0.001. ANOVA: p < 0.05. LEfSe: DIFF! LEfSe: NO DIFF! Hi-level functional categories: Nucleosides and Nucleotides; Membrane Transport; Nitrogen Metabolism; Carbohydrates. Microbial vs. viral.