
1 Scalable data mining for functional genomics and metagenomics
Curtis Huttenhower, Harvard School of Public Health, Department of Biostatistics

2 Greatest discoveries in biology?
Our job is to create computational microscopes: To ask and answer specific biological questions using millions of experimental results

3 Outline
1. Data mining: Integrating very large genomic data compendia
2. Metagenomics: Network models of microbial communities

4 A computational definition of functional genomics
Figure labels: Prior knowledge, Genomic data, Gene, Data, Function

5 A framework for functional genomics
Figure: 100Ms of gene pairs (G1 to G9) are scored across 1Ks of datasets. Each dataset contributes evidence with its own frequency distribution (correlation: high vs. low; let. vs. not let.; similarity: similar vs. dissim.), and the evidence is combined into one probability per gene pair, e.g. P(G2-G5|Data) = 0.85.
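As a rough illustration of how per-dataset evidence could be combined into a posterior such as P(G2-G5|Data), here is a minimal sketch assuming a naive Bayes model. The slide does not state the model; the dataset names, likelihood tables, and prior below are hypothetical.

```python
from math import prod

def posterior_related(observations, likelihoods, prior=0.05):
    """Naive Bayes posterior that a gene pair is functionally related.

    observations: {dataset: discretized value observed for the pair}
    likelihoods:  {dataset: {value: (P(value | related), P(value | unrelated))}}
    prior:        P(related) before seeing any data (hypothetical value)
    """
    p_rel = prior * prod(likelihoods[d][v][0] for d, v in observations.items())
    p_unrel = (1 - prior) * prod(likelihoods[d][v][1] for d, v in observations.items())
    return p_rel / (p_rel + p_unrel)

# Hypothetical likelihood tables for three datasets (made-up numbers).
LIKELIHOODS = {
    "correlation_dataset": {"high": (0.6, 0.1), "low": (0.4, 0.9)},
    "similarity_dataset": {"similar": (0.7, 0.2), "dissimilar": (0.3, 0.8)},
    "interaction_dataset": {"yes": (0.3, 0.05), "no": (0.7, 0.95)},
}

# A gene pair supported by all three datasets.
pair_evidence = {
    "correlation_dataset": "high",
    "similarity_dataset": "similar",
    "interaction_dataset": "yes",
}
p = posterior_related(pair_evidence, LIKELIHOODS)
```

Because each dataset only contributes a pair of likelihoods, adding another dataset is one more factor in the product, which is what makes this kind of integration cheap to scale.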

6 Functional network prediction and analysis
HEFalMp: currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases
Networks shown: global interaction network, carbon metabolism network, extracellular signaling network, gut community network

7 Functional network prediction from diverse microbial data
Expression: 486 bacterial expression experiments; 876 raw datasets; 310 postprocessed datasets; 304 normalized coexpression networks in 27 species
Interactions: 307 bacterial interaction experiments; raw interactions; postprocessed interactions
Integrated functional interaction networks in 15 species
E. coli integration: topmost individual E. coli datasets are heat shock, which only activate stress response/translation/ribosome
Axes: ← Precision ↑, Recall ↓

8 Meta-analysis for unsupervised functional data integration
Huttenhower 2006; Hibbs 2007; Evangelou 2007
Simple regression: All datasets are equally accurate
Random effects: Variation within and among datasets and interactions
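To make the contrast concrete, here is a minimal sketch of a random-effects combination of one gene pair's correlations across datasets. It assumes Fisher z-transformed correlations and the standard DerSimonian-Laird estimator, which is swapped in for illustration; the slide does not say which estimator the talk used, and the function name is hypothetical.

```python
import math

def random_effects_correlation(correlations, sample_sizes):
    """Pool one gene pair's correlations across datasets (DerSimonian-Laird).

    correlations: per-dataset Pearson correlations, each strictly in (-1, 1)
    sample_sizes: per-dataset number of conditions, each greater than 3
    Returns (pooled correlation, between-dataset variance tau^2).
    """
    z = [math.atanh(r) for r in correlations]      # Fisher z-transform
    v = [1.0 / (n - 3) for n in sample_sizes]      # within-dataset variance of z
    w = [1.0 / vi for vi in v]                     # fixed-effect weights
    fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    # Cochran's Q measures disagreement among datasets.
    q = sum(wi * (zi - fixed) ** 2 for wi, zi in zip(w, z))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(z) - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add the between-dataset variance.
    w_star = [1.0 / (vi + tau2) for vi in v]
    pooled_z = sum(wi * zi for wi, zi in zip(w_star, z)) / sum(w_star)
    return math.tanh(pooled_z), tau2
```

When the datasets agree, tau^2 collapses to zero and the result matches the fixed-effect average; when they disagree, large datasets lose some of their dominance over small ones.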

9 Meta-analysis for unsupervised functional data integration
Huttenhower 2006; Hibbs 2007; Evangelou 2007

10 Unsupervised data integration: TB virulence and ESX-1 secretion
With Sarah Fortune Graphle

11 Unsupervised data integration: TB virulence and ESX-1 secretion
With Sarah Fortune X ? Graphle

12 Predicting gene function
Predicted relationships between genes (confidence scale: high to low). Example: cell cycle genes

14 Predicting gene function
Predicted relationships between genes (confidence scale: high to low). Example: cell cycle genes
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
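One way to turn those edges into a per-gene score is sketched below. It assumes the score is a gene's mean edge confidence to the known process genes divided by its mean edge confidence to all other genes. This ratio, the function name, and the toy network are illustrative assumptions, not the formula used in the talk.

```python
def process_specificity(gene, process_genes, network):
    """Ratio of a gene's mean edge confidence to a process vs. to everything else.

    network: {frozenset({a, b}): confidence in [0, 1]}; missing edges count as 0.
    """
    def weight(a, b):
        return network.get(frozenset((a, b)), 0.0)

    all_genes = set().union(*network)
    members = [g for g in process_genes if g != gene]
    others = [g for g in all_genes if g != gene and g not in process_genes]
    to_process = sum(weight(gene, g) for g in members) / len(members)
    to_others = sum(weight(gene, g) for g in others) / len(others)
    return to_process / to_others if to_others > 0 else float("inf")

# Toy network with made-up confidences; G1-G3 stand in for known process genes.
TOY_NETWORK = {
    frozenset(("G1", "G2")): 0.9,
    frozenset(("G1", "G3")): 0.8,
    frozenset(("G2", "G3")): 0.7,
    frozenset(("G4", "G1")): 0.8,
    frozenset(("G4", "G2")): 0.6,
    frozenset(("G4", "G3")): 0.7,
    frozenset(("G4", "G5")): 0.1,
    frozenset(("G4", "G6")): 0.1,
    frozenset(("G5", "G6")): 0.2,
}
score_g4 = process_specificity("G4", {"G1", "G2", "G3"}, TOY_NETWORK)
```

Dividing by the gene's connectivity to everything else is what makes the score specific: a hub gene with strong edges everywhere scores near 1 rather than looking like a member of every process.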

15 Comprehensive validation of computational predictions
With David Hess, Amy Caudy
Inputs: genomic data, prior knowledge
Computational predictions of gene function: SPELL (Hibbs et al 2007), bioPIXIE (Myers et al 2005), MEFIT
Genes predicted to function in mitochondrion organization and biogenesis
Laboratory experiments: petite frequency, growth curves, confocal microscopy
Retraining: new known functions for correctly predicted genes

16 Evaluating the performance of computational predictions
Genes involved in mitochondrion organization and biogenesis
106 Original GO Annotations
135 Under-annotations
82 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months

17 Evaluating the performance of computational predictions
Genes involved in mitochondrion organization and biogenesis
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.
106 Original GO Annotations
95 Under-annotations
40 Confirmed Under-annotations
80 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months

18 Functional mapping: mining integrated networks
Predicted relationships between genes (confidence scale: high to low). Example: chemotaxis
The strength of these relationships indicates how cohesive a process is.

19 Functional mapping: mining integrated networks
Predicted relationships between genes (confidence scale: high to low). Example: chemotaxis

20 Functional mapping: mining integrated networks
Predicted relationships between genes (confidence scale: high to low). Example: chemotaxis and flagellar assembly
The strength of these relationships indicates how associated two processes are.

21 Functional mapping: Associations among processes
Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance
Edges: associations between processes (moderately strong to very strong)

22 Functional mapping: Associations among processes
Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance
Edges: associations between processes (moderately strong to very strong)
Borders: data coverage of processes (sparsely covered to well covered)

23 Functional mapping: Associations among processes
Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance
Edges: associations between processes (moderately strong to very strong)
Nodes: cohesiveness of processes (below baseline; baseline, i.e. genomic background; very cohesive)
Borders: data coverage of processes (sparsely covered to well covered)

24 Functional mapping: Associations among processes
Edges: associations between processes (moderately strong to very strong)
Nodes: cohesiveness of processes (below baseline; baseline, i.e. genomic background; very cohesive)
Borders: data coverage of processes (sparsely covered to well covered)

25 Cross-species knowledge transfer using functional data
Pinaki Sarder TaFTan

26 TaFTan: Cross-species knowledge transfer using functional data
Species shown: E. coli, P. aeruginosa, B. subtilis, M. tuberculosis
Conditions compared: species-specific data; species' data excluded; all species' data
Axes: log(precision/random) vs. log(recall)
Important to take advantage of all available data for any one organism
Important to take advantage of all available data for every organism
Scalable to dozens of organisms with hundreds of functional datasets
Currently working on making this more context-specific

27 Outline
1. Data mining: Integrating very large genomic data compendia
2. Metagenomics: Network models of microbial communities

28 So what does all of this have to do with microbial communities?
~2000: AML/ALL, Survival, Mutation, Batch effects, Gene expression, Functional modules

29 ~2005
Healthy/Diabetes, BMI, M/F, Population structure, LD, SNP genotypes

30 2010
Intervention/perturbation, Healthy/IBD, Temperature, Location
Taxa & Orthologs, Niches & Phylogeny
Test for correlates; Confounds/stratification/environment; Feature selection (p >> n); Multiple hypothesis correction
Cross-validate; Independent sample
Biological story? ???

31 What's metagenomics?
Total collection of microorganisms within a community (also microbial community or microbiota)
Total genomic potential of a microbial community
Study of uncultured microorganisms from the environment, which can include humans or other living hosts
Total biomolecular repertoire of a microbial community

32 The Human Microbiome Project
300 "normal" adults, 18-40
16S rDNA + WGS
5 sites/18 samples + blood
Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth
Skin: ears, inner elbows
Nasal cavity
Gut: stool
Vagina: introitus, mid, fornix
Reference genomes (~ )
All healthy subjects; followup projects in psoriasis, Crohn's, colitis, obesity, acne, cancer, resistant infection… ongoing
Hamady, 2009

33 What features to test?
Data sources: Microbiome data; Genomic data (Reference genomes); Functional data (Experimental models)
Feature labels: 16S reads, Binning, Taxa, WGS reads, Clustering, Orthologous clusters, Functional roles, Pathways/modules, Pathway activity

34 HMP: Data → features
16S reads: Taxa
Orthologous clusters: Genes (KOs)
Pathways/modules: Pathways (KEGGs)

35 HMP: Body sites
Vanilla linear SVM on Taxa, KOs, and KEGGs; then full confusion matrix for KEGGs

36 HMP: Subjects
We can tell who you are by the bugs in your mouth! (Taxa, KEGGs)

37 HMP: Metabolic reconstruction
300 subjects, 1-3 visits/subject, 15-18 body sites/visit, 10-20M reads/sample, 100bp reads
Functional seq. databases: KEGG + MetaCyc; CAZy, TCDB, VFDB, MEROPS…
WGS reads → Genes (KOs): BLAST
Genes → Pathways: MinPath (Ye 2009)
Smoothing: Witten-Bell?
Gap filling
Pathways/modules: Pathways (KEGGs)

38 HMP: Metabolic reconstruction
Pathway coverage Pathway abundance

39 HMP: Metabolic reconstruction
Figure: pathway coverage and pathway abundance, samples by pathways. Annotated pathway groups: all body sites ("core"), aerobic body sites, gastrointestinal body sites

40 MetaHIT: Data → features
85 healthy, 15 IBD + 12 healthy, 12 IBD
ReBLASTed against KEGG since published data obfuscates read counts
10x bootstrap within training cohort, test on as validation
Features: Taxa (Phymm, Brady 2009); WGS reads; Genes (KOs); Pathways/modules; Pathways (KEGGs)

41 MetaHIT: Taxonomic CD biomarkers
Taxa: Bacteroidetes, Methanomicrobia, Enterobacteriaceae, Firmicutes, Chromatiales, Desulfobacterales, Bradyrhizobiaceae, Rhodobacteraceae, Oxalobacteraceae
iTOL (Letunic 2007)

42 MetaHIT: Taxonomic CD biomarkers
Down in CD Up in CD

43 MetaHIT: Functional CD biomarkers
Processes: Growth/replication, Motility, Transporters, Sugar metabolism
Legend: Down in CD, Up in CD

44 MetaHIT: KO IBD biomarkers
LEfSe (Nicola Segata)
Processes: Growth/replication, Motility, Transporters, Sugar metabolism
Legend: Down in IBD, Up in IBD
The same analysis can be performed using KOs instead of KEGGs.
The same four processes appear, organized into several fairly specific complexes (PTS systems, flagellar assembly, etc.).
In addition, transport is up in IBD (some metal ion, some sugar).
As an aside, essentially all of the metagenomic data I've looked at is tremendously rich in transporters and signaling molecules.

45 Metagenomic differential analysis: LEfSe
1. Is there a statistically significant difference? t-tests, ANOVA, MANOVA, Friedman, Kruskal–Wallis…
2. Is the difference biologically significant? expert supervision, specific post-hoc tests…
3. How large is the difference? PCA, LDA, mean difference, class or cluster distance…
LEfSe example: p(ANOVA) < 0.05; pairwise post-hoc Wilcoxon OK; Log(Score(LDA)) = 3.68
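The questions above can be sketched for a single feature and two classes. This is a simplified illustration and not LEfSe itself: it uses a Kruskal-Wallis test without tie correction for question 1, skips the pairwise post-hoc checks of question 2, and swaps in a plain mean difference for the LDA score in question 3. The function names are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def kruskal_wallis_two_classes(a, b):
    """H statistic (no tie correction) and its chi-square p-value, 1 d.o.f."""
    r = average_ranks(list(a) + list(b))
    n = len(r)
    rank_a, rank_b = sum(r[:len(a)]), sum(r[len(a):])
    h = 12.0 / (n * (n + 1)) * (rank_a ** 2 / len(a) + rank_b ** 2 / len(b)) - 3 * (n + 1)
    h = max(h, 0.0)  # guard against tiny negative rounding error
    # For one degree of freedom, P(chi2 > h) = erfc(sqrt(h / 2)).
    return h, math.erfc(math.sqrt(h / 2))

def differential_feature(a, b, alpha=0.05):
    """Question 1 (significance) and question 3 (size) for one feature."""
    h, p = kruskal_wallis_two_classes(a, b)
    mean_difference = sum(a) / len(a) - sum(b) / len(b)
    return {"h": h, "p": p, "significant": p < alpha, "mean_difference": mean_difference}
```

Reporting the effect size only for features that pass the significance test keeps the two questions separate: a tiny but consistent shift can be significant without being large.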

46 LEfSe: A non-human example Viromes vs. bacterial metagenomes
Data: Dinsdale 2008 (microbial vs. viral)
Hi-level functional categories shown: Nucleosides and Nucleotides; Transporters; Carbohydrates
Results shown: Metastats (White 2009): p < 0.001; ANOVA: p < 0.05; LEfSe: DIFF!; LEfSe: NO DIFF!

47 Sleipnir: Software for scalable functional genomics
Massive datasets require efficient algorithms and implementations.
Sleipnir: C++ library for computational functional genomics
Data types for biological entities: microarray data, interaction data, genes and gene sets, functional catalogs, etc.
Network communication, parallelization
Efficient machine learning algorithms: generative (Bayesian) and discriminative (SVM)
And it's fully documented!
It's also speedy: microbial data integration computation takes <3hrs.

48 Outline
1. Data mining: Integrating very large genomic data compendia
Network framework for scalable data integration
HEFalMp: human data integration
TaFTan: cross-species knowledge transfer from functional data
2. Metagenomics: Network models of microbial communities
16S and WGS community metabolic reconstruction
LEfSe: biologically relevant community differences
Sleipnir: software for scalable genomic data mining

49 Thanks!
Sarah Fortune, Pinaki Sarder, Nicola Segata, Levi Waldron, Larisa Miropolsky, Willythssa Pierre-Louis, Olga Troyanskaya, Chris Park, David Hess, Matt Hibbs, Chad Myers, Ana Pop, Aaron Wong, Hilary Coller, Erin Haley, Jacques Izard, Wendy Garrett
Interested? We're looking for postdocs!

51 HEFalMp: Predicting human gene function

52 HEFalMp: Predicting human genetic interactions

53 HEFalMp: Analyzing human genomic data

54 HEFalMp: Understanding human disease

55 Validating Human Predictions
With Erin Haley, Hilary Coller
Autophagy: 5½ of 7 predictions currently confirmed
Predicted novel autophagy proteins: LAMP2, RAB11A
Controls: Luciferase (negative control), ATG5 (positive control)
Conditions: not starved, starved (autophagic)

56 Functional Mapping: Scoring Functional Associations
How can we formalize these relationships?
Any sets of genes G1 and G2 in a network can be compared using four measures:
Edges between their genes
Edges within each set
The background edges incident to each set
The baseline of all edges in the network
Stronger connections between the sets increase association.
Stronger within self-connections or nonspecific background connections decrease association.
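The four measures can be combined in more than one way. The sketch below assumes one form that matches the behaviour described above, FA = (between / background) × (baseline / within), with each measure taken as a mean edge weight. The slide does not give the exact formula, so both the formula and the function names should be read as assumptions.

```python
from itertools import combinations, product

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def functional_association(set1, set2, genes, weight):
    """Assumed FA form: (between / background) * (baseline / within).

    weight(a, b) returns the edge confidence between two genes.
    At least one of the two sets must hold two or more genes so that the
    "within" mean is defined.
    """
    members = set(set1) | set(set2)
    # Edges between the two sets.
    between = mean(weight(a, b) for a, b in product(set1, set2) if a != b)
    # Edges within each set.
    within = mean(weight(a, b) for s in (set1, set2)
                  for a, b in combinations(sorted(s), 2))
    # Edges incident to either set (member-to-member edges count once per endpoint).
    background = mean(weight(a, b) for a in members for b in genes if a != b)
    # All edges in the network.
    baseline = mean(weight(a, b) for a, b in combinations(sorted(genes), 2))
    return (between / background) * (baseline / within)
```

With this form a network of uniform edge weights gives a score of exactly 1 for any pair of sets, so values above 1 indicate more cross-set connectivity than the surrounding network would suggest.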

57 Functional Mapping: Bootstrap p-values
Scoring functional associations is great… how do you interpret an association score? For gene sets of arbitrary sizes? In arbitrary graphs? Each with its own bizarre distribution of edges?
For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
Null distribution is approximately normal with mean 1. Empirically!
Standard deviation is asymptotic in the sizes of both gene sets.
Maps FA scores to p-values for any gene sets and underlying graph.
Figure panels: histograms of FAs for random sets (# genes: 1, 5, 10, 50); null distribution σs for one graph
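A minimal sketch of that procedure: estimate the null standard deviation from random gene sets of the same sizes, then convert an observed score to a one-sided p-value under a normal null with mean 1. The scoring function is passed in as a parameter, and the function names, the choice of disjoint random sets, and the trial count are assumptions made for illustration.

```python
import math
import random
import statistics

def null_sigma(score, genes, size1, size2, trials=200, seed=0):
    """Standard deviation of score(set1, set2) over random disjoint gene sets."""
    rng = random.Random(seed)
    genes = sorted(genes)
    samples = []
    for _ in range(trials):
        picked = rng.sample(genes, size1 + size2)
        samples.append(score(picked[:size1], picked[size1:]))
    return statistics.stdev(samples)

def fa_p_value(observed, sigma, null_mean=1.0):
    """One-sided p-value for an observed score under a normal null distribution."""
    return 0.5 * math.erfc((observed - null_mean) / (sigma * math.sqrt(2)))
```

Since the standard deviation depends only on the two set sizes for a given graph, it can be estimated once per size pair and reused for every query on that graph.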

58 Functional maps for cross-species knowledge transfer
Figure: network of genes G1 to G17 mapped into groups O1 to O9 (e.g. O1: G1, G2, G3; O2: G4; O3: G6), with species gene labels ECG1, ECG2, ECG3, BSG1, BSG2

59 Functional maps for functional metagenomics
Inputs: GOS Hypersaline Lagoon, Ecuador; KEGG Pathways; integrated functional interaction networks in 27 species
Mapping genes into pathways
Mapping pathways into organisms
Mapping organisms into phyla
Queries: five query genes from carbon fixation; four query pathways from chemotaxis; five most abundant query organisms
Legend: Env. organisms, Pathogens

60 Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

61 Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping:
Very large collections of genomic data
Specific predicted molecular interactions
Pathway, process, or disease associations
Underlying experimental results and functional activities in data

62 Functional maps for cross-species knowledge transfer
Project each of ~12 species with good data into KEGG orthology
Reproject into Bacillus subtilis genome, weighting by functional similarity
Correlation between normalized pathway occurrences
So if a pathway's all there in two organisms, they're more similar; only partially there, less similar
Following up with unsupervised and partially anchored network alignment
Axes: ← Precision ↑, Recall ↓
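The "correlation between normalized pathway occurrences" step could look roughly like the sketch below. It assumes that normalization means dividing each pathway's count by the organism's total and that the correlation is Pearson's; neither detail is stated on the slide, and the function names are hypothetical.

```python
import math

def normalize(counts):
    """Convert per-pathway gene counts into fractions of the organism's total."""
    total = sum(counts.values())
    return {pathway: count / total for pathway, count in counts.items()}

def pathway_similarity(counts_a, counts_b):
    """Pearson correlation between two organisms' normalized pathway profiles.

    Pathways missing from one organism count as zero. Both profiles must
    vary across pathways, otherwise the correlation is undefined.
    """
    a, b = normalize(counts_a), normalize(counts_b)
    pathways = sorted(set(a) | set(b))
    xs = [a.get(p, 0.0) for p in pathways]
    ys = [b.get(p, 0.0) for p in pathways]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)
```

Normalizing first means an organism with twice as many annotated genes but the same pathway proportions is treated as identical, so the score reflects which pathways are present rather than genome size.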

63 LEfSe: A non-human example Viromes vs. bacterial metagenomes
Hi-level functional categories shown (microbial vs. viral): Nucleosides and Nucleotides; Membrane Transport; Nitrogen Metabolism; Carbohydrates
Results shown: Metastats (White 2009): p < 0.001; ANOVA: p < 0.05; LEfSe: DIFF!; LEfSe: NO DIFF!

