Scalable data mining for functional genomics and metagenomics Curtis Huttenhower 01-06-1011 Harvard School of Public Health Department of Biostatistics.

Presentation transcript:

Scalable data mining for functional genomics and metagenomics Curtis Huttenhower Harvard School of Public Health Department of Biostatistics

What tools enable biological discoveries? Our job is to create computational microscopes: to ask and answer specific biomedical questions using millions of experimental results.

Outline: 1. Data mining: Integrating very large genomic data compendia. 2. Metagenomics: Modeling microbial communities for public health.

A computational definition of functional genomics. [Diagram: genomic data and prior knowledge, linked by mappings among data, genes, and functions.]

A framework for functional genomics. [Figure: millions of gene pairs across thousands of datasets. Gold-standard related pairs (+: G1-G2, G4-G9, …), unrelated pairs (-: G3-G6, G7-G8, …), and an unlabeled pair (?: G2-G5) each carry per-dataset values. Frequency histograms per dataset (high vs. low correlation, let. vs. not let., similar vs. dissim.) are combined to give P(G2-G5|Data).]
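The framework on this slide learns, for each dataset, how often known related (+) and unrelated (-) gene pairs fall into each value bin, and then combines the datasets into P(G2-G5|Data). A minimal sketch of one way to do that follows, assuming a naive Bayes combination over discretized values; the slide does not name the classifier, and every function name, bin, and number here is hypothetical.

```python
from collections import Counter

def learn_likelihoods(observations, labels, n_bins, pseudocount=1.0):
    """Estimate P(bin | related) and P(bin | unrelated) for one dataset.

    observations: bin index for each gold-standard gene pair
    labels: True for related (+) pairs, False for unrelated (-) pairs
    """
    counts = {True: Counter(), False: Counter()}
    for b, y in zip(observations, labels):
        counts[y][b] += 1
    likelihoods = {}
    for y in (True, False):
        total = sum(counts[y].values()) + pseudocount * n_bins
        likelihoods[y] = [(counts[y][b] + pseudocount) / total
                          for b in range(n_bins)]
    return likelihoods

def posterior_related(pair_bins, dataset_likelihoods, prior=0.5):
    """P(related | data) for one gene pair, treating datasets as independent."""
    p_pos, p_neg = prior, 1.0 - prior
    for b, lik in zip(pair_bins, dataset_likelihoods):
        if b is None:  # the pair was not measured in this dataset
            continue
        p_pos *= lik[True][b]
        p_neg *= lik[False][b]
    return p_pos / (p_pos + p_neg)

# Toy gold standard (assumed): bin 1 = high correlation, bin 0 = low correlation.
lik_a = learn_likelihoods([1, 1, 0, 0], [True, True, False, False], n_bins=2)
lik_b = learn_likelihoods([1, 0, 0, 1], [True, True, False, False], n_bins=2)

# Unlabeled pair G2-G5: high correlation in dataset A, missing in dataset B.
p_g2_g5 = posterior_related([1, None], [lik_a, lik_b])
```

With thousands of datasets the products would underflow, so a practical version would accumulate log-likelihoods instead.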

Functional network prediction and analysis. HEFalMp: global interaction network, carbon metabolism network, extracellular signaling network, gut community network. Currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases.

Meta-analysis for unsupervised functional data integration (Evangelou 2007; Huttenhower 2006; Hibbs 2007). Simple regression: all datasets are equally accurate. Random effects: variation within and among datasets and interactions.
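The random-effects idea on this slide can be illustrated for a single gene pair whose correlation was measured in several datasets. This is a sketch under stated assumptions: the Fisher z-transform and the DerSimonian-Laird estimator are standard meta-analysis choices picked for illustration, not methods named on the slide, and the correlations and sample sizes are made up.

```python
import math

def random_effects(effects, variances):
    """DerSimonian-Laird pooled estimate.

    Returns (pooled effect, between-dataset variance tau^2).
    """
    w = [1.0 / v for v in variances]
    sw = sum(w)
    # Fixed-effect estimate: weights reflect only within-dataset variance.
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sw
    # Cochran's Q measures how much the datasets disagree.
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    k = len(effects)
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add the among-dataset variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * e for wi, e in zip(w_star, effects)) / sum(w_star)
    return pooled, tau2

# Hypothetical correlations of one gene pair in three datasets.
correlations = [0.8, 0.6, 0.1]
sample_sizes = [20, 50, 30]

z = [math.atanh(r) for r in correlations]        # Fisher z-transform
var = [1.0 / (n - 3) for n in sample_sizes]      # approximate variance of z
pooled_z, tau2 = random_effects(z, var)
pooled_r = math.tanh(pooled_z)                   # back to a correlation
```

When the datasets agree, tau^2 shrinks to zero and the estimate reduces to the inverse-variance weighted mean.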

Meta-analysis for unsupervised functional data integration (continued). Evangelou 2007; Huttenhower 2006; Hibbs 2007.

Unsupervised data integration: TB virulence and ESX-1 secretion. With Sarah Fortune. Graphle.

Unsupervised data integration: TB virulence and ESX-1 secretion (continued). With Sarah Fortune. Graphle.

Outline: 1. Data mining: Integrating very large genomic data compendia. 2. Metagenomics: Modeling microbial communities for public health.

What to do with your metagenome? (x10^10) Diagnostic or prognostic biomarker for host disease. Public health tool monitoring population health and interactions. Comprehensive snapshot of microbial ecology and evolution. Reservoir of gene and protein functional information. Who’s there? What are they doing? What do functional genomic data tell us about microbiomes?* What can our microbiomes tell us about us?* (* Using terabases of sequence and thousands of experimental results.)

The Human Microbiome Project (ongoing). 300 “normal” adults; 16S rDNA + WGS; 5 sites/18 samples + blood. Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth. Skin: ears, inner elbows. Nasal cavity. Gut: stool. Vagina: introitus, mid, fornix. Reference genomes (~ ). All healthy subjects; follow-up projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, antibiotic-resistant infection… (Hamady, 2009; Kolenbrander, 2010)

HMP Organisms: Everyone and everywhere is different. [Heatmap: organisms (taxa) by body sites + individuals; sites: ear, gut, nose, mouth, vagina, arm, mucosa, palate, gingiva, tonsils, saliva, sub. plaq., sup. plaq., throat, tongue.] Every microbiome is surprisingly different. Most organisms are rare in most places. Even common organisms vary tremendously in abundance among individuals. Aerobicity, interaction with the immune system, and extracellular medium appear to be major determinants. There are few, if any, organismal biotypes in health.

HMP: Metabolic reconstruction. Input: 300 subjects, 1-3 visits/subject, ~6 body sites/visit, M reads/sample, 100bp reads. BLAST → Genes: WGS reads are searched against functional sequence databases (KEGG + MetaCYC; CAZy, TCDB, VFDB, MEROPS…) to give genes (KOs). Genes → Pathways: MinPath (Ye 2009) gives pathways/modules (KEGGs). Taxonomic limitation: rem. paths in taxa < ave. Smoothing: Witten-Bell. Gap filling: c(g) = max( c(g), median ). Xipe: distinguish zero/low (Rodriguez-Mueller, in review).
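The gap-filling rule on this slide, c(g) = max( c(g), median ), can be sketched in a few lines. This assumes the median is taken over the gene abundances within a single pathway, which the slide does not specify; the gene identifiers and abundances are hypothetical.

```python
from statistics import median

def gap_fill(gene_abundance, pathway_genes):
    """Apply c(g) = max(c(g), median) to the genes of one pathway.

    Genes missing from gene_abundance are treated as having abundance 0.
    """
    counts = [gene_abundance.get(g, 0.0) for g in pathway_genes]
    med = median(counts)
    return {g: max(c, med) for g, c in zip(pathway_genes, counts)}

# Hypothetical gene families (KOs) of one pathway and their abundances.
abundance = {"ko1": 10.0, "ko2": 12.0, "ko4": 11.0, "ko5": 9.0}
pathway = ["ko1", "ko2", "ko3", "ko4", "ko5"]   # ko3 was never observed

filled = gap_fill(abundance, pathway)
```

Note that the rule as written raises every gene below the pathway median to the median, not only the genes that were entirely absent.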

HMP: Metabolic reconstruction. [Figure panels: pathway coverage; pathway abundance.]

HMP: Metabolic reconstruction. [Heatmap: pathway abundance; pathways by samples.]

HMP: Metabolic reconstruction. [Heatmap: pathway coverage; pathways by samples, with groups for aerobic body sites, gastrointestinal body sites, and all body sites (“core”).]

Metagenomic biomarker discovery. Data: gene expression; SNP genotypes; Healthy/IBD, BMI, diet; taxa & pathways. Steps and concerns: Batch effects? Population structure? Niches & phylogeny. Test for correlates. Multiple hypothesis correction. Feature selection (p >> n). Confounds/stratification/environment. Cross-validate. Biological story? Independent sample. Intervention/perturbation.
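After testing many taxa and pathways for correlates, the multiple hypothesis correction step might look like the sketch below. It uses the Benjamini-Hochberg false discovery rate procedure as one common choice; the slide does not say which correction was used, and the feature names and p-values are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-feature p-values from a test against Healthy/IBD status.
p = {"taxon_A": 0.001, "taxon_B": 0.02, "pathway_C": 0.03, "pathway_D": 0.5}
names = list(p)
q = dict(zip(names, benjamini_hochberg([p[n] for n in names])))

# Features passing a 5% false discovery rate threshold.
significant = [n for n in names if q[n] <= 0.05]
```

Features that survive this step would still need the cross-validation and independent-sample checks listed on the slide.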

LEfSe: Metagenomic class comparison and explanation. LEfSe (Nicola Segata): LDA + Effect Size.

LEfSe: The TRUC murine colitis microbiota. With Wendy Garrett.

MetaHIT: The gut microbiome and IBD. 124 subjects: 99 healthy, 21 UC + 4 CD (Qin 2010). WGS reads → taxa (Phymm, Brady 2009); WGS reads → genes (KOs) → pathways/modules (KEGGs). ReBLASTed against KEGG since published data obfuscates read counts. With Ramnik Xavier, Joshua Korzenik.

MetaHIT: Taxonomic CD biomarkers. [Figure: taxa up in CD and down in CD, with UC; labeled clades: Firmicutes, Enterobacteriaceae.]

MetaHIT: Functional CD biomarkers. [Figure panels: subset of enriched modules in CD patients; subset of enriched pathways in CD patients. Up in CD / down in CD; labeled groups: motility, transporters, sugar metabolism, growth/replication.]

MetaHIT: Enzymes and metabolites over/under-enriched in the CD microbiome. [Figure: enzyme families and inferred metabolites, up in CD / down in CD; labeled groups: transporters, growth/replication, motility, sugar metabolism.]

Outline: 1. Data mining: Integrating very large genomic data compendia. Network framework for scalable data integration. HEFalMp: human data integration. Meta-analysis for unsupervised functional network integration. 2. Metagenomics: Modeling microbial communities for public health. HMP: microbiome in health, 18 body sites in 300 subjects. HUMAnN: metagenomic metabolic and functional pathway reconstruction. LEfSe: biologically relevant community differences.

Thanks! Jacques Izard, Wendy Garrett, Pinaki Sarder, Nicola Segata, Levi Waldron, Larisa Miropolsky. Interested? We’re recruiting students and postdocs! Human Microbiome Project and HMP Metabolic Reconstruction: George Weinstock, Jennifer Wortman, Owen White, Makedonka Mitreva, Erica Sodergren, Vivien Bonazzi, Jane Peterson, Lita Proctor, Sahar Abubucker, Yuzhen Ye, Beltran Rodriguez-Mueller, Jeremy Zucker, Qiandong Zeng, Mathangi Thiagarajan, Brandi Cantarel, Maria Rivera, Barbara Methe, Bill Klimke, Daniel Haft. Ramnik Xavier, Dirk Gevers, Bruce Birren, Mark Daly, Doyle Ward, Eric Alm, Ashlee Earl, Lisa Cosimi, Sarah Fortune.

Functional network prediction from diverse microbial data. Bacterial expression experiments: 876 raw datasets, 310 postprocessed datasets, 304 normalized coexpression networks in 27 species. 307 bacterial interaction experiments: raw interactions, postprocessed interactions. Integrated functional interaction networks in 15 species. [Figure: E. coli integration; axes: precision, recall.]

Predicting gene function. [Network figure: cell cycle genes; predicted relationships between genes, from high confidence to low confidence.]

Predicting gene function (continued). [Network figure: cell cycle genes; predicted relationships between genes, from high confidence to low confidence.]

Predicting gene function. [Network figure: predicted relationships between genes, from high confidence to low confidence.] These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
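One way to turn the edges described on this slide into a per-gene score is to compare a gene's average connection to the process with its average connection to everything else. The sketch below assumes that ratio as the scoring rule, which the slide does not state; the network, gene names, and confidences are all hypothetical.

```python
def process_score(gene, process_genes, network, all_genes):
    """Mean edge confidence to the process divided by mean confidence genome-wide.

    network maps frozenset({gene_a, gene_b}) to a confidence in [0, 1];
    missing edges count as 0. Scores above 1 suggest specific participation.
    """
    def mean_weight(targets):
        targets = [t for t in targets if t != gene]
        if not targets:
            return 0.0
        total = sum(network.get(frozenset((gene, t)), 0.0) for t in targets)
        return total / len(targets)

    background = mean_weight(all_genes)
    return mean_weight(process_genes) / background if background > 0 else 0.0

# Toy network: cc1 and cc2 are known cell cycle genes, x and y are candidates.
network = {
    frozenset(("cc1", "cc2")): 0.9,
    frozenset(("x", "cc1")): 0.9,
    frozenset(("x", "cc2")): 0.8,
    frozenset(("x", "y")): 0.1,
    frozenset(("y", "cc1")): 0.1,
    frozenset(("y", "cc2")): 0.2,
}
all_genes = ["cc1", "cc2", "x", "y"]
cell_cycle = ["cc1", "cc2"]

score_x = process_score("x", cell_cycle, network, all_genes)
score_y = process_score("y", cell_cycle, network, all_genes)
```

Dividing by the genome-wide mean is what makes the score about specific participation: a gene connected strongly to everything would not stand out.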

Comprehensive validation of computational predictions. Genomic data and prior knowledge → computational predictions of gene function (MEFIT; SPELL, Hibbs et al 2007; bioPIXIE, Myers et al 2005) → genes predicted to function in mitochondrion organization and biogenesis → laboratory experiments (petite frequency, growth curves, confocal microscopy) → new known functions for correctly predicted genes → retraining. With David Hess, Amy Caudy.

Evaluating the performance of computational predictions. Genes involved in mitochondrion organization and biogenesis: original GO annotations; 135 under-annotations; 82 novel confirmations, first iteration; 17 novel confirmations, second iteration. 340 total: >3x previously known genes in ~5 person-months.

Evaluating the performance of computational predictions. Genes involved in mitochondrion organization and biogenesis: original GO annotations; 95 under-annotations; 40 confirmed under-annotations; 80 novel confirmations, first iteration; 17 novel confirmations, second iteration. 340 total: >3x previously known genes in ~5 person-months. Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Functional mapping: mining integrated networks. [Network figure: chemotaxis; predicted relationships between genes, from high confidence to low confidence.] The strength of these relationships indicates how cohesive a process is.

Functional mapping: mining integrated networks (continued). [Network figure: chemotaxis; predicted relationships between genes, from high confidence to low confidence.]

Functional mapping: mining integrated networks. [Network figure: chemotaxis and flagellar assembly; predicted relationships between genes, from high confidence to low confidence.] The strength of these relationships indicates how associated two processes are.
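The two quantities described on the functional mapping slides, how cohesive a process is and how associated two processes are, can both be read off an integrated network as average edge strengths. The sketch below assumes each is a mean edge confidence normalized by a genome-wide baseline; the normalization, gene names, and confidences are illustrative assumptions rather than the method behind the slides.

```python
from itertools import combinations, product

def mean_edge(pairs, network):
    """Average confidence over gene pairs; missing edges count as 0."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(network.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

def cohesiveness(process, network, baseline):
    """Mean within-process edge confidence relative to the baseline."""
    return mean_edge(combinations(sorted(process), 2), network) / baseline

def association(proc_a, proc_b, network, baseline):
    """Mean between-process edge confidence relative to the baseline."""
    pairs = [(a, b) for a, b in product(proc_a, proc_b) if a != b]
    return mean_edge(pairs, network) / baseline

# Toy network with hypothetical genes in two processes plus one other gene.
chemotaxis = ["che1", "che2"]
flagellar = ["fla1", "fla2"]
all_genes = chemotaxis + flagellar + ["other"]
network = {
    frozenset(("che1", "che2")): 0.9,
    frozenset(("fla1", "fla2")): 0.8,
    frozenset(("che1", "fla1")): 0.6,
    frozenset(("che2", "fla2")): 0.6,
    frozenset(("che1", "fla2")): 0.4,
    frozenset(("che2", "fla1")): 0.4,
    frozenset(("che1", "other")): 0.1,
    frozenset(("che2", "other")): 0.1,
    frozenset(("fla1", "other")): 0.1,
    frozenset(("fla2", "other")): 0.1,
}

# Genomic background: mean confidence over every gene pair.
baseline = mean_edge(combinations(sorted(all_genes), 2), network)

chemotaxis_cohesiveness = cohesiveness(chemotaxis, network, baseline)
chemotaxis_flagellar = association(chemotaxis, flagellar, network, baseline)
```

Values above 1 mean stronger than the genomic background, and values below 1 fall below that baseline.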

Functional mapping: Associations among processes. Edges: associations between processes (very strong to moderately strong). Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance.

Functional mapping: Associations among processes. Edges: associations between processes (very strong to moderately strong). Borders: data coverage of processes (well covered to sparsely covered). Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance.

Functional mapping: Associations among processes. Edges: associations between processes (very strong to moderately strong). Nodes: cohesiveness of processes (below baseline, i.e. genomic background, to very cohesive). Borders: data coverage of processes (well covered to sparsely covered). Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance.

Functional mapping: Associations among processes. Edges: associations between processes (very strong to moderately strong). Nodes: cohesiveness of processes (below baseline, i.e. genomic background, to very cohesive). Borders: data coverage of processes (well covered to sparsely covered).