Scalable data mining for functional genomics and metagenomics

Presentation transcript:

Scalable data mining for functional genomics and metagenomics Curtis Huttenhower 09-16-10 Harvard School of Public Health Department of Biostatistics

Greatest discoveries in biology? Our job is to create computational microscopes: To ask and answer specific biological questions using millions of experimental results

Outline
1. Data mining: Integrating very large genomic data compendia
2. Metagenomics: Network models of microbial communities

A computational definition of functional genomics
Prior knowledge: Gene → Function
Genomic data: Gene → Data → Function

A framework for functional genomics
[Figure: a matrix of 100Ms of gene pairs (G1-G2, G4-G9, G3-G6, G7-G8, G2-G5, …, carrying labels +, -, and ?) by 1Ks of datasets, with example values 0.9, 0.7, 0.1, 0.2, 0.8, 0.5, 0.05, 0.6. Per-dataset frequency histograms (correlation, high to low; let. vs. not let.; similarity, high to low, similar vs. dissim.) are combined (+ =) into a probability such as P(G2-G5|Data) = 0.85.]
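The slide does not spell out how the per-dataset histograms become P(G2-G5|Data). One common way to do it is naive Bayesian integration, sketched below as an illustration rather than as the method actually used. The function names `learn_cpt` and `posterior`, the two-bin discretization, and the Laplace smoothing are all assumptions made for this example.

```python
from collections import Counter

def learn_cpt(bins, related, n_bins):
    """Estimate Laplace-smoothed P(bin | related) and P(bin | unrelated)
    for one dataset from labeled gene pairs."""
    counts = {True: Counter(), False: Counter()}
    for b, is_related in zip(bins, related):
        counts[is_related][b] += 1
    cpt = {}
    for label in (True, False):
        total = sum(counts[label].values()) + n_bins
        cpt[label] = [(counts[label][b] + 1) / total for b in range(n_bins)]
    return cpt

def posterior(observed, cpts, prior):
    """P(related | data) for one gene pair. observed[i] is the pair's bin in
    dataset i, or None when that dataset has no value for the pair."""
    p_rel, p_unrel = prior, 1.0 - prior
    for b, cpt in zip(observed, cpts):
        if b is None:
            continue  # a missing dataset contributes no evidence
        p_rel *= cpt[True][b]
        p_unrel *= cpt[False][b]
    return p_rel / (p_rel + p_unrel)

# Toy dataset: bin 1 = high correlation, bin 0 = low correlation.
cpt = learn_cpt([1, 1, 0, 0], [True, True, False, False], n_bins=2)
score = posterior([1], [cpt], prior=0.5)  # 0.75 for this toy input
```

Each dataset contributes one multiplicative term, so adding a dataset costs one more lookup per gene pair, which is what makes this style of integration scale to thousands of datasets.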

Functional network prediction and analysis
HEFalMp global interaction network: currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases.
Example networks: carbon metabolism network, extracellular signaling network, gut community network.

Functional network prediction from diverse microbial data
486 bacterial expression experiments: 876 raw datasets, 310 postprocessed datasets, 304 normalized coexpression networks in 27 species.
307 bacterial interaction experiments: 154796 raw interactions, 114786 postprocessed interactions.
Integrated functional interaction networks in 15 species.
E. coli integration (← Precision ↑, Recall ↓): the topmost individual E. coli datasets are heat shock, which only activate stress response/translation/ribosome.

Meta-analysis for unsupervised functional data integration (Huttenhower 2006, Hibbs 2007, Evangelou 2007)
Simple regression: all datasets are equally accurate.
Random effects: variation within and among datasets and interactions.
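A minimal sketch of the contrast between the two models, for one gene pair's correlation across several datasets. The slide does not name an estimator. The Fisher z-transform, the 1/(n - 3) variance approximation, and the DerSimonian-Laird estimate of between-dataset variance are standard meta-analysis choices assumed here for illustration, and the function names are hypothetical.

```python
import math

def simple_mean(zs):
    """Treat all datasets as equally accurate: unweighted mean."""
    return sum(zs) / len(zs)

def random_effects(zs, variances):
    """DerSimonian-Laird estimate.
    Returns (combined z, between-dataset variance tau^2)."""
    w = [1.0 / v for v in variances]
    fixed = sum(wi * zi for wi, zi in zip(w, zs)) / sum(w)
    q = sum(wi * (zi - fixed) ** 2 for wi, zi in zip(w, zs))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(zs) - 1)) / c) if c > 0 else 0.0
    # Within-dataset variance plus between-dataset variance sets each weight.
    w_star = [1.0 / (v + tau2) for v in variances]
    combined = sum(wi * zi for wi, zi in zip(w_star, zs)) / sum(w_star)
    return combined, tau2

# One gene pair's correlation in three datasets with n conditions each.
rs, ns = [0.9, 0.2, 0.1], [13, 53, 103]
zs = [math.atanh(r) for r in rs]          # Fisher z-transform
variances = [1.0 / (n - 3) for n in ns]   # approximate variance of z
combined_z, tau2 = random_effects(zs, variances)
combined_r = math.tanh(combined_z)        # back to a correlation
```

When the datasets agree, tau2 shrinks to zero and the estimate reduces to the usual inverse-variance weighted mean. When they disagree, the weights move toward equality, so no single large dataset dominates.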

Unsupervised data integration: TB virulence and ESX-1 secretion With Sarah Fortune Graphle http://huttenhower.sph.harvard.edu/graphle/

Predicting gene function
[Figure: predicted relationships between genes (confidence, high to low) around the cell cycle genes.]
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.

Comprehensive validation of computational predictions (with David Hess, Amy Caudy)
Genomic data + prior knowledge → computational predictions of gene function: SPELL (Hibbs et al 2007), bioPIXIE (Myers et al 2005), MEFIT
→ genes predicted to function in mitochondrion organization and biogenesis
→ laboratory experiments: petite frequency, growth curves, confocal microscopy
→ new known functions for correctly predicted genes → retraining

Evaluating the performance of computational predictions
Genes involved in mitochondrion organization and biogenesis:
106 original GO annotations
135 under-annotations
82 novel confirmations, first iteration
17 novel confirmations, second iteration
340 total: >3x previously known genes in ~5 person-months

Evaluating the performance of computational predictions
Genes involved in mitochondrion organization and biogenesis:
106 original GO annotations
95 under-annotations
40 confirmed under-annotations
80 novel confirmations, first iteration
17 novel confirmations, second iteration
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Functional mapping: mining integrated networks
[Figure: predicted relationships between genes (confidence, high to low) within chemotaxis.]
The strength of these relationships indicates how cohesive a process is.

Functional mapping: mining integrated networks
[Figure: predicted relationships between genes (confidence, high to low) linking chemotaxis and flagellar assembly.]
The strength of these relationships indicates how associated two processes are.

Functional mapping: Associations among processes
Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance
Edges: associations between processes (moderately strong to very strong)
Nodes: cohesiveness of processes (below baseline, baseline (genomic background), very cohesive)
Borders: data coverage of processes (sparsely covered to well covered)

Cross-species knowledge transfer using functional data Pinaki Sarder TaFTan

TaFTan: Cross-species knowledge transfer using functional data
[Figure: log(precision/random) vs. log(recall) for E. coli, P. aeruginosa, B. subtilis, and M. tuberculosis, comparing species-specific data, species’ data excluded, and all species’ data.]
Important to take advantage of all available data for any one organism.
Important to take advantage of all available data for every organism.
Scalable to dozens of organisms with hundreds of functional datasets.
Currently working on making this more context-specific.

Outline
1. Data mining: Integrating very large genomic data compendia
2. Metagenomics: Network models of microbial communities

So what does all of this have to do with microbial communities?
~2000: AML/ALL, survival, mutation; batch effects; gene expression; functional modules

~2005: Healthy/diabetes, BMI, M/F; population structure; LD; SNP genotypes

2010: Healthy/IBD, temperature, location; confounds/stratification/environment; taxa & orthologs; niches & phylogeny. Test for correlates: feature selection (p >> n), multiple hypothesis correction. Cross-validate; independent sample; intervention/perturbation. Biological story? ???

What’s metagenomics?
Total collection of microorganisms within a community (also microbial community or microbiota).
Total genomic potential of a microbial community.
Study of uncultured microorganisms from the environment, which can include humans or other living hosts.
Total biomolecular repertoire of a microbial community.

The Human Microbiome Project (2006 - ongoing)
300 “normal” adults, 18-40; 16S rDNA + WGS; 5 sites/18 samples + blood (Hamady, 2009)
Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth
Skin: ears, inner elbows
Nasal cavity
Gut: stool
Vagina: introitus, mid, fornix
Reference genomes (~200-800)
All healthy subjects; followup projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, resistant infection…

What features to test?
Microbiome data, alongside genomic data (reference genomes) and functional data (experimental models).
16S reads: binning into taxa.
WGS reads: clustering into orthologous clusters and functional roles, then pathways/modules and pathway activity.

HMP: Data → features
16S reads: taxa
Orthologous clusters: genes (KOs)
Pathways/modules: pathways (KEGGs)

HMP: Body sites
Vanilla linear SVM on taxa, KOs, and KEGGs; then full confusion matrix for KEGGs.

HMP: Subjects
We can tell who you are by the bugs in your mouth! (Taxa, KEGGs)

HMP: Metabolic reconstruction
300 subjects, 1-3 visits/subject, 15-18 body sites/visit, 10-20M reads/sample, 100bp reads
WGS reads → BLAST against functional seq. (KEGG + MetaCyc; CAZy, TCDB, VFDB, MEROPS…) → genes (KOs) → pathways/modules (KEGGs)
Genes → pathways: MinPath (Ye 2009); smoothing (Witten-Bell?); gap filling

HMP: Metabolic reconstruction
[Figure: heatmaps of pathway coverage and pathway abundance (samples by pathways), highlighting pathways found in all body sites (“core”), aerobic body sites, and gastrointestinal body sites.]
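To make the two quantities concrete, here is a minimal sketch that turns per-sample KO abundances into a coverage and an abundance value per pathway. The definitions (coverage as the fraction of a pathway's KOs observed, abundance as the mean over its KOs) are simplifying assumptions, and the function name is hypothetical. The pipeline on the slide also applies MinPath, smoothing, and gap filling, none of which is reproduced here.

```python
def pathway_profile(ko_abundance, pathways):
    """ko_abundance: dict KO id -> abundance in one sample.
    pathways: dict pathway id -> set of KO ids belonging to it.
    Returns dict pathway id -> (coverage, abundance)."""
    profile = {}
    for name, kos in pathways.items():
        values = [ko_abundance.get(ko, 0.0) for ko in sorted(kos)]
        present = sum(1 for v in values if v > 0)
        coverage = present / len(values)      # how much of the pathway is there
        abundance = sum(values) / len(values) # how much of it there is
        profile[name] = (coverage, abundance)
    return profile

# Toy sample with two observed KOs and two hypothetical pathways.
sample = {"K1": 4.0, "K2": 2.0}
toy_pathways = {"pA": {"K1", "K2"}, "pB": {"K2", "K3", "K4", "K5"}}
profile = pathway_profile(sample, toy_pathways)
```

Coverage answers presence or absence and is what a "core" pathway analysis would use, while abundance captures differences in dosage between body sites.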

MetaHIT: Data → features
85 healthy, 15 IBD + 12 healthy, 12 IBD. ReBLASTed against KEGG, since the published data obfuscates read counts.
10x bootstrap within the training cohort; test on 12+12 as validation.
WGS reads: taxa (Phymm, Brady 2009), genes (KOs), pathways/modules (KEGGs)

MetaHIT: Taxonomic CD biomarkers
[Figure: taxonomic tree drawn with iTOL (Letunic 2007), labeling Bacteroidetes, Methanomicrobia, Enterobacteriaceae, Firmicutes, Chromatiales, Desulfobacterales, Bradyrhizobiaceae, Rhodobacteraceae, Oxalobacteraceae.]

MetaHIT: Taxonomic CD biomarkers Down in CD Up in CD

MetaHIT: Functional CD biomarkers Down in CD Growth/replication Motility Transporters Sugar metabolism Up in CD

MetaHIT: KO IBD biomarkers (LEfSe, Nicola Segata)
[Figure labels: down in IBD, up in IBD; growth/replication, motility, transporters, sugar metabolism.]
The same analysis can be performed using KOs instead of KEGGs. The same four processes appear, organized into several fairly specific complexes (PTS systems, flagellar assembly, etc.), in addition to transport up in IBD (some metal ion, some sugar). As an aside, essentially all of the metagenomic data I’ve looked at is tremendously rich in transporters and signaling molecules.

Metagenomic differential analysis: LEfSe
1. Is there a statistically significant difference? t-tests, ANOVA, MANOVA, Friedman, Kruskal–Wallis…
2. Is the difference biologically significant? Expert supervision, specific post-hoc tests…
3. How large is the difference? PCA, LDA, mean difference, class or cluster distance…
LEfSe example: p(ANOVA) < 0.05; pairwise post-hoc Wilcoxon OK; Log(Score(LDA)) = 3.68
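The three questions above can be sketched as a three-step screen for a single feature. This is an illustration of the idea, not the LEfSe implementation. It uses a rank-sum test for both the class-level and the pairwise steps, and it substitutes a log-scaled mean difference for the LDA score. The function names, the normal approximation, and the effect-size stand-in are all assumptions.

```python
import math
from statistics import mean, median

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation,
    without a tie correction to the variance)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return math.erfc(abs(z) / math.sqrt(2.0))

def screen_feature(class_a, class_b, alpha=0.05):
    """class_a, class_b: dict of subclass name -> abundances of one feature.
    Returns an effect size, or None if the feature fails either test."""
    all_a = [v for vals in class_a.values() for v in vals]
    all_b = [v for vals in class_b.values() for v in vals]
    # Step 1: is there a statistically significant class difference?
    if rank_sum_p(all_a, all_b) >= alpha:
        return None
    # Step 2: does every pair of subclasses agree with it?
    up_in_a = median(all_a) > median(all_b)
    for sub_a in class_a.values():
        for sub_b in class_b.values():
            same_direction = (median(sub_a) > median(sub_b)) == up_in_a
            if not same_direction or rank_sum_p(sub_a, sub_b) >= alpha:
                return None
    # Step 3: how large is it? (stand-in for the LDA score)
    return math.log10(1.0 + abs(mean(all_a) - mean(all_b)))

healthy = {"cohort1": [1, 2, 3, 4], "cohort2": [2, 3, 4, 5]}
disease = {"cohort1": [10, 11, 12, 13], "cohort2": [11, 12, 13, 14]}
effect = screen_feature(healthy, disease)
```

The pairwise step is what separates this from a plain significance test: a feature that differs between classes only because of one subclass is discarded even when the pooled p-value is small.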

LEfSe: A non-human example
Viromes vs. bacterial metagenomes (Dinsdale 2008); microbial vs. viral.
Hi-level functional categories shown: Nucleosides and Nucleotides; Transporters; Carbohydrates.
Calls shown: Metastats (White 2009): p < 0.001; ANOVA: p < 0.05; LEfSe: DIFF!; LEfSe: NO DIFF!

Sleipnir: Software for scalable functional genomics
Massive datasets require efficient algorithms and implementations.
Sleipnir: C++ library for computational functional genomics
Data types for biological entities: microarray data, interaction data, genes and gene sets, functional catalogs, etc.
Network communication, parallelization
Efficient machine learning algorithms: generative (Bayesian) and discriminative (SVM)
And it’s fully documented!
It’s also speedy: microbial data integration computation takes <3hrs.

Outline
1. Data mining: Integrating very large genomic data compendia
Network framework for scalable data integration
HEFalMp: human data integration
TaFTan: cross-species knowledge transfer from functional data
2. Metagenomics: Network models of microbial communities
16S and WGS community metabolic reconstruction
LEfSe: biologically relevant community differences
Sleipnir: software for scalable genomic data mining

Thanks!
Sarah Fortune, Pinaki Sarder, Nicola Segata, Levi Waldron, Larisa Miropolsky, Willythssa Pierre-Louis, Olga Troyanskaya, Chris Park, David Hess, Matt Hibbs, Chad Myers, Ana Pop, Aaron Wong, Hilary Coller, Erin Haley, Jacques Izard, Wendy Garrett
Interested? We’re looking for postdocs!
http://huttenhower.sph.harvard.edu
http://huttenhower.sph.harvard.edu/sleipnir

HEFalMp: Predicting human gene function

HEFalMp: Predicting human genetic interactions

HEFalMp: Analyzing human genomic data

HEFalMp: Understanding human disease

Validating Human Predictions (with Erin Haley, Hilary Coller)
Autophagy: 5½ of 7 predictions currently confirmed.
[Figure: “Predicted novel autophagy proteins”, with panels for Luciferase (negative control), ATG5 (positive control), LAMP2, and RAB11A, each not starved vs. starved (autophagic).]

Functional Mapping: Scoring Functional Associations
How can we formalize these relationships? Any sets of genes G1 and G2 in a network can be compared using four measures:
Edges between their genes
Edges within each set
The background edges incident to each set
The baseline of all edges in the network
Stronger connections between the sets increase association. Stronger within self-connections or nonspecific background connections decrease association.
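One way to combine the four measures so that they behave as described is a ratio of mean edge weights, FA = (between / background) × (baseline / within). The sketch below assumes that form, along with a fully connected weighted network and disjoint gene sets. The slide does not give the exact formula, so treat the function as an illustration with hypothetical names.

```python
from itertools import combinations

def _mean(values):
    values = list(values)
    return sum(values) / len(values)

def functional_association(weights, g1, g2):
    """weights: dict frozenset({a, b}) -> edge weight for every gene pair.
    g1, g2: disjoint gene sets with at least two genes each."""
    def edge(a, b):
        return weights[frozenset((a, b))]
    between = _mean(edge(a, b) for a in g1 for b in g2)
    within = _mean(edge(a, b) for s in (g1, g2)
                   for a, b in combinations(sorted(s), 2))
    union = g1 | g2
    # Every edge with at least one endpoint in either set.
    background = _mean(w for pair, w in weights.items() if pair & union)
    baseline = _mean(weights.values())
    return (between / background) * (baseline / within)

# Toy network: weak edges everywhere, strong edges between {a, b} and {c, d}.
genes = "abcdef"
weights = {frozenset(p): 0.1 for p in combinations(genes, 2)}
for a in "ab":
    for b in "cd":
        weights[frozenset((a, b))] = 0.9
fa = functional_association(weights, {"a", "b"}, {"c", "d"})
```

In a network where every edge has the same weight, all four means coincide and the score is 1, which gives a natural reference point for "no association".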

Functional Mapping: Bootstrap p-values
Scoring functional associations is great… how do you interpret an association score? For gene sets of arbitrary sizes? In arbitrary graphs? Each with its own bizarre distribution of edges?
For any graph, compute FA scores for many randomly chosen gene sets of different sizes. Empirically, the null distribution is approximately normal with mean 1, and its standard deviation is asymptotic in the sizes of both gene sets. This maps FA scores to p-values for any gene sets and underlying graph.
[Figures: histograms of FAs for random sets (# genes: 1, 5, 10, 50); null distribution σs for one graph.]
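A minimal sketch of that procedure: estimate the null standard deviation from random gene sets of the sizes in question, then read a p-value off a normal distribution centered at 1. The scoring function is passed in as a parameter. The names, the use of disjoint random sets, the one-sided upper tail, and the toy score in the usage lines are assumptions for illustration.

```python
import math
import random
import statistics

def null_sd(score, genes, size1, size2, trials=200, seed=0):
    """Standard deviation of score(g1, g2) over random disjoint gene sets.
    genes must be a list; score takes two sets and returns a float."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        picked = rng.sample(genes, size1 + size2)
        scores.append(score(set(picked[:size1]), set(picked[size1:])))
    return statistics.stdev(scores)

def fa_p_value(observed, sd, null_mean=1.0):
    """One-sided p-value for an association at least as strong as observed,
    under a normal null with the given mean and standard deviation."""
    z = (observed - null_mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Toy stand-in for an FA score so the example runs on its own.
def toy_score(g1, g2):
    return 1.0 + (sum(g1) / len(g1) - sum(g2) / len(g2)) / 100.0

sd = null_sd(toy_score, list(range(40)), size1=5, size2=10)
p = fa_p_value(1.3, sd)
```

Because the standard deviation depends only on the two set sizes for a given graph, it can be estimated once per size pair and reused, which is what makes p-values cheap for arbitrary queries.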

Functional maps for cross-species knowledge transfer
[Figure: a network of genes G1 to G17 grouped into orthologous clusters O1 to O9 (O1: G1, G2, G3; O2: G4; O3: G6; …), with example gene labels ECG1, ECG2, BSG1, ECG3, BSG2, …]

Functional maps for functional metagenomics
GOS 4441599.3 (Hypersaline Lagoon, Ecuador) + KEGG pathways + integrated functional interaction networks in 27 species =
Mapping genes into pathways: five query genes from carbon fixation
Mapping pathways into organisms: four query pathways from chemotaxis
Mapping organisms into phyla: five most abundant query organisms (env. organisms, pathogens)

Functional Maps: Focused Data Summarization ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping: very large collections of genomic data; specific predicted molecular interactions; pathway, process, or disease associations; underlying experimental results and functional activities in data.

Functional maps for cross-species knowledge transfer (← Precision ↑, Recall ↓)
Project each of ~12 species with good data into KEGG orthology.
Reproject into the Bacillus subtilis genome, weighting by functional similarity.
Functional similarity: correlation between normalized pathway occurrences. So if a pathway’s all there in two organisms, they’re more similar; if it’s only partially there, less similar.
Following up with unsupervised and partially anchored network alignment.
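The similarity weighting can be sketched as a Pearson correlation between per-organism pathway vectors. Here "normalized pathway occurrence" is assumed to mean the fraction of each pathway's KOs found in the organism. That reading, the function names, and the toy pathways are assumptions, not details given on the slide.

```python
import math

def pathway_occurrence(organism_kos, pathways):
    """Fraction of each pathway's KOs present in an organism,
    in sorted pathway order so vectors are comparable."""
    return [len(organism_kos & pathways[name]) / len(pathways[name])
            for name in sorted(pathways)]

def pearson(x, y):
    """Pearson correlation; undefined if either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical pathways and organisms.
toy_pathways = {"p1": {"K1", "K2"}, "p2": {"K3", "K4"}, "p3": {"K5"}}
org_a = pathway_occurrence({"K1", "K2", "K3"}, toy_pathways)
org_b = pathway_occurrence({"K1", "K2", "K4"}, toy_pathways)
org_c = pathway_occurrence({"K3", "K4", "K5"}, toy_pathways)
similar = pearson(org_a, org_b)
dissimilar = pearson(org_a, org_c)
```

Organisms a and b carry different KOs but the same pathway profile, so they score as highly similar, which matches the intent that pathways, not individual genes, drive the weighting.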

LEfSe: A non-human example
Viromes vs. bacterial metagenomes; microbial vs. viral.
Hi-level functional categories shown: Nucleosides and Nucleotides; Membrane Transport; Nitrogen Metabolism; Carbohydrates.
Calls shown: Metastats (White 2009): p < 0.001; ANOVA: p < 0.05; LEfSe: DIFF!; LEfSe: NO DIFF!