Computational Methodology for Microbial and Metagenomic Characterization using Large Scale Functional Genomic Data Integration Curtis Huttenhower 03-08-10.


Computational Methodology for Microbial and Metagenomic Characterization using Large Scale Functional Genomic Data Integration Curtis Huttenhower Harvard School of Public Health Department of Biostatistics

Outline
1. Network models of functional data
2. Network models of microbes
3. Network models of microbiomes

Meta-analysis for unsupervised functional data integration
Following up with round-robin and semi-supervised evaluations
Huttenhower 2006; Hibbs

Functional network prediction from diverse microbial data
– bacterial expression experiments
– 876 raw datasets
– 310 postprocessed datasets
– 304 normalized coexpression networks in 27 species
– Integrated functional interaction networks in 15 species
– 307 bacterial interaction experiments
– raw interactions
– postprocessed interactions

Functional maps for cross-species knowledge transfer
Following up with unsupervised and partially anchored network alignment
Huttenhower 2008; Huttenhower 2009

Functional maps for functional metagenomics
– Mapping genes into pathways
– Mapping pathways into organisms
– Mapping organisms into phyla
+ Integrated functional interaction networks in 27 species
= GOS Hypersaline Lagoon, Ecuador

Functional maps for functional metagenomics
Nodes: process cohesiveness in obesity (Very Downregulated, Baseline (no change), Very Upregulated)
Edges: process association in obesity (More Coregulated, Less Coregulated, Baseline (no change))
Summarizes information from ~10M metagenomic reads and ~500 genome-scale microbial experiments.

Sleipnir: C++ library for computational functional genomics
Efficient Computation For Biological Discovery: massive datasets and genomes require efficient algorithms and implementations.
– Data types for biological entities: microarray data, interaction data, genes and gene sets, functional catalogs, etc.
– Network communication, parallelization
– Efficient machine learning algorithms: generative (Bayesian) and discriminative (SVM)
– And it’s fully documented!
It’s also speedy: microbial data integration computation takes <3 hrs.

Thanks!
Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Chris Park, Ana Pop, Aaron Wong, Hilary Coller, Erin Haley, Jacques Izard, Wendy Garrett, Sarah Fortune, Tracy Rosebrock

Functional mapping: Functional associations between processes
Edges: associations between processes (Very Strong, Moderately Strong)
Nodes: cohesiveness of processes (Below Baseline (genomic background), Very Cohesive)
Borders: data coverage of processes (Well Covered, Sparsely Covered)
Information mapped from ~100 E. coli experiments

Meta-analysis for unsupervised functional data integration
Following up with round-robin and semi-supervised evaluations
Evangelou 2007; Huttenhower 2006; Hibbs

Functional mapping: mining integrated networks
Chemotaxis
Predicted relationships between genes (High Confidence, Low Confidence)
The strength of these relationships indicates how cohesive a process is.

Functional mapping: mining integrated networks
Chemotaxis
Predicted relationships between genes (High Confidence, Low Confidence)

Functional mapping: mining integrated networks
Chemotaxis; Flagellar assembly
Predicted relationships between genes (High Confidence, Low Confidence)
The strength of these relationships indicates how associated two processes are.

Functional maps for cross-species knowledge transfer
[Figure: network of genes G1–G17 and groups O1–O9]
O1: G1, G2, G3
O2: G4
O3: G6
…
ECG1, ECG2 BSG1 ECG3, BSG2 …
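The figure's grouping of genes into O1, O2, O3, … suggests collapsing a gene-level network onto groups of related genes so that edges can be compared across species. The following is a minimal Python sketch of that idea, not the method used in the talk. It assumes the O labels denote orthologous groups and that group-level edges are formed by averaging gene-level weights. The function name `project_to_groups` and all edge weights are hypothetical.

```python
from collections import defaultdict


def project_to_groups(gene_edges, group_members):
    """Collapse gene-gene edge weights onto group-group edges by averaging.

    gene_edges: dict mapping (gene_a, gene_b) tuples to weights.
    group_members: dict mapping a group label to the genes it contains.
    Averaging is an assumption made for illustration only.
    """
    gene_to_group = {g: grp for grp, members in group_members.items() for g in members}
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (a, b), weight in gene_edges.items():
        ga, gb = gene_to_group.get(a), gene_to_group.get(b)
        # Skip genes without a group and edges that stay inside one group.
        if ga is None or gb is None or ga == gb:
            continue
        key = tuple(sorted((ga, gb)))
        sums[key] += weight
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Group membership as listed on the slide; the edge weights are made up.
groups = {"O1": ["G1", "G2", "G3"], "O2": ["G4"], "O3": ["G6"]}
edges = {
    ("G1", "G4"): 0.8,
    ("G2", "G4"): 0.4,
    ("G4", "G6"): 0.5,
    ("G1", "G2"): 0.9,  # within O1, so it is dropped by the projection
}
projected = project_to_groups(edges, groups)
```

Here the two gene-level edges between O1 and O2 are averaged into a single group-level edge, while the O2 to O3 edge passes through unchanged.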

Functional network prediction from diverse microbial data
– bacterial expression experiments
– 876 raw datasets
– 310 postprocessed datasets
– 304 normalized coexpression networks in 27 species
– Integrated functional interaction networks in 15 species
– 307 bacterial interaction experiments
– raw interactions
– postprocessed interactions
E. coli integration: ← Precision ↑, Recall ↓

Functional maps for functional metagenomics
GOS Hypersaline Lagoon, Ecuador
KEGG Pathways; Organisms; Pathogens; Env.
– Mapping genes into pathways
– Mapping pathways into organisms
– Mapping organisms into phyla
+ Integrated functional interaction networks in 27 species
=

Functional maps for cross-species knowledge transfer
← Precision ↑, Recall ↓
Following up with unsupervised and partially anchored network alignment

Functional network prediction from diverse microbial data
– bacterial expression experiments
– 876 raw datasets
– 310 postprocessed datasets
– 304 normalized coexpression networks in 27 species
– Integrated functional interaction networks in 15 species
– 307 bacterial interaction experiments
– raw interactions
– postprocessed interactions
E. coli integration

Functional Maps: Focused Data Summarization
ACGGTGAACGTACA GTACAGATTACTAG GACATTAGGCCGTA TCCGATACCCGATA
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Functional Maps: Focused Data Summarization
ACGGTGAACGTACA GTACAGATTACTAG GACATTAGGCCGTA TCCGATACCCGATA
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping:
– Very large collections of genomic data
– Specific predicted molecular interactions
– Pathway, process, or disease associations
– Underlying experimental results and functional activities in data

Functional Mapping: Scoring Functional Associations
How can we formalize these relationships?
Any sets of genes G1 and G2 in a network can be compared using four measures:
– Edges between their genes
– Edges within each set
– The background edges incident to each set
– The baseline of all edges in the network
Stronger connections between the sets increase association. Stronger within self-connections or nonspecific background connections decrease association.
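The four measures can be combined in more than one way, and the slide does not give the exact formula. The Python sketch below uses one plausible form, assumed here for illustration: FA = (between × baseline) / (within × background), with each measure taken as a mean edge weight. This form rises with between-set weight, falls with within-set and background weight, and is close to 1 when all four measures are similar. The function names and the toy network are hypothetical.

```python
from itertools import combinations, product


def mean_weight(weights, pairs):
    """Mean edge weight over distinct unordered gene pairs (absent edges count as 0)."""
    unique = {frozenset(p) for p in pairs if p[0] != p[1]}
    if not unique:
        return 0.0
    return sum(weights.get(p, 0.0) for p in unique) / len(unique)


def functional_association(weights, genes, set1, set2):
    """Assumed FA score: (between * baseline) / (within * background)."""
    set1, set2 = set(set1), set(set2)
    # Edges running from one gene set to the other.
    between = mean_weight(weights, product(set1, set2))
    # Edges inside either gene set.
    within = mean_weight(
        weights, list(combinations(set1, 2)) + list(combinations(set2, 2))
    )
    # All edges touching at least one gene of either set.
    background = mean_weight(weights, product(set1 | set2, genes))
    # All edges in the network.
    baseline = mean_weight(weights, combinations(genes, 2))
    denominator = within * background
    if denominator == 0:
        return float("nan")
    return between * baseline / denominator


# Toy network: weak edges everywhere, two cohesive sets, strong links between them.
genes = ["A", "B", "C", "D", "E", "F"]
weights = {frozenset(p): 0.1 for p in combinations(genes, 2)}
weights[frozenset(("A", "B"))] = 0.9
weights[frozenset(("C", "D"))] = 0.9
for pair in product("AB", "CD"):
    weights[frozenset(pair)] = 0.8

linked = functional_association(weights, genes, {"A", "B"}, {"C", "D"})
unlinked = functional_association(weights, genes, {"A", "B"}, {"E", "F"})
```

In this toy network the strongly connected pair of sets scores higher than the pair joined only by weak edges, which matches the qualitative behavior described on the slide.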

Functional Mapping: Bootstrap p-values
Scoring functional associations is great… …how do you interpret an association score?
– For gene sets of arbitrary sizes?
– In arbitrary graphs?
– Each with its own bizarre distribution of edges?
Empirically!
– For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
– Null distribution is approximately normal with mean 1.
– Standard deviation is asymptotic in the sizes of both gene sets.
– Maps FA scores to p-values for any gene sets and underlying graph.
[Plots: histograms of FAs for random sets, by # genes; null distribution σs for one graph]
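The empirical procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the talk's implementation: the null standard deviation is estimated by scoring random disjoint gene sets, the null mean is fixed at 1 as the slide states, and a one-sided upper-tail p-value is read from the normal approximation. The names `estimate_null_sigma`, `fa_p_value`, and `toy_score` are hypothetical, and `toy_score` is a simplified stand-in for a real FA score.

```python
import math
import random
import statistics
from itertools import product


def estimate_null_sigma(score_fn, genes, size1, size2, n_samples=1000, seed=0):
    """Standard deviation of scores for random disjoint gene sets of the given sizes.

    genes must be a list so that random sampling is well defined.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        drawn = rng.sample(genes, size1 + size2)
        scores.append(score_fn(drawn[:size1], drawn[size1:]))
    return statistics.stdev(scores)


def fa_p_value(score, sigma, mean=1.0):
    """One-sided upper-tail p-value under a normal null (tail choice is an assumption)."""
    z = (score - mean) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Toy graph with random weights; toy_score is mean between-set weight over the baseline.
genes = [f"g{i}" for i in range(40)]
_rng = random.Random(1)
weights = {
    frozenset((a, b)): _rng.random()
    for i, a in enumerate(genes)
    for b in genes[i + 1:]
}
baseline = sum(weights.values()) / len(weights)


def toy_score(set1, set2):
    between = [weights[frozenset((a, b))] for a, b in product(set1, set2)]
    return (sum(between) / len(between)) / baseline


sigma_small = estimate_null_sigma(toy_score, genes, 5, 5, n_samples=500)
sigma_large = estimate_null_sigma(toy_score, genes, 10, 10, n_samples=500)
p = fa_p_value(1.5, sigma_small)
```

Larger gene sets average over more edges, so their null scores spread less around 1. This is why the standard deviation has to be estimated as a function of both set sizes before a score can be converted to a p-value.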

Microbial Communities and Functional Metagenomics
With Jacques Izard, Wendy Garrett
Metagenomics: data analysis from environmental samples
– Microflora: environment includes us!
Pathogen collections of “single” organisms form similar communities
Another data integration problem
– Must include datasets from multiple organisms
What questions can we answer?
– What pathways/processes are present/over/under-enriched in a newly sequenced microbe/community?
– What’s shared within community X? What’s different? What’s unique?
– How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
– Current functional methods annotate ~50% of synthetic data, <5% of environmental data

Data Integration for Microbial Communities
~350 available expression datasets, ~25 species
Weskamp et al 2004; Flannick et al 2006; Kanehisa et al 2008; Tatusov et al 1997
– Data integration works just as well in microbes as it does in yeast and humans
– We know an awful lot about some microorganisms and almost nothing about others
– Sequence-based and network-based tools for function transfer both work in isolation
– We can use data integration to leverage both and mine out additional biology

Functional Maps for Functional Metagenomics

Validating Orthology-Based Functional Mapping
– Does unweighted data integration predict functional relationships?
– What is the effect of “projecting” through an orthologous space?
[Plots: Recall vs. log(Precision/Random), evaluated against KEGG and GO, comparing unsupervised integration with individual datasets]
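The plot axes, recall against log(Precision/Random), can be computed from a ranked list of predicted gene pairs and a gold standard such as KEGG or GO co-annotation. The Python sketch below shows one way to do this. It assumes base-2 logarithms and takes "Random" to be the fraction of positives among all scored pairs, since the slide specifies neither. The function name `recall_vs_log_precision` and the example pairs are hypothetical.

```python
import math


def recall_vs_log_precision(scored_pairs, positives):
    """Return (recall, log2(precision / random precision)) at each rank cutoff.

    scored_pairs: list of (pair, score) tuples; higher scores are more confident.
    positives: set of pairs that the gold standard marks as related.
    Ties in score are broken arbitrarily by the sort.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    total_positives = sum(1 for pair, _ in ranked if pair in positives)
    if total_positives == 0:
        raise ValueError("gold standard contains none of the scored pairs")
    # Precision expected from a random ordering of the same pairs.
    random_precision = total_positives / len(ranked)
    curve = []
    true_positives = 0
    for rank, (pair, _) in enumerate(ranked, start=1):
        if pair in positives:
            true_positives += 1
        precision = true_positives / rank
        recall = true_positives / total_positives
        if precision > 0:
            log_ratio = math.log2(precision / random_precision)
        else:
            log_ratio = float("-inf")
        curve.append((recall, log_ratio))
    return curve


# Made-up predictions: the two true pairs are ranked above the two false ones.
predictions = [
    (("g1", "g2"), 0.9),
    (("g3", "g4"), 0.8),
    (("g1", "g3"), 0.3),
    (("g2", "g4"), 0.1),
]
gold = {("g1", "g2"), ("g3", "g4")}
curve = recall_vs_log_precision(predictions, gold)
```

A value of 0 on the vertical axis means the predictions are no better than a random ordering, so curves for integrated and individual datasets can be compared on a common scale.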

Validating Orthology-Based Functional Mapping
[Figure: network of yeast genes YG1–YG17]
Holdout set, uncharacterized “genome”
Random subsets, characterized “genomes”

Validating Orthology-Based Functional Mapping

Validating Orthology-Based Functional Mapping
– Can subsets of the yeast genome predict a held-out subset’s functional maps?
– Can subsets of the yeast genome predict a held-out subset’s interactome?
[Plot panels: KEGG, GO]
What have we learned?
– Yeast is incredibly well-curated
– KEGG tends to be more specific than GO
– Predicting interactomes by projecting through functional maps works decently in the absolute best case