Large scale functional data mining: What can we find in the data we have? Curtis Huttenhower 08-12-10 Harvard School of Public Health Department of Biostatistics.

Presentation transcript:

Large scale functional data mining: What can we find in the data we have? Curtis Huttenhower Harvard School of Public Health Department of Biostatistics

Greatest biological discoveries? Our job is to create computational microscopes: to ask and answer specific biological questions using millions of experimental results.

A computational definition of functional genomics. [Diagram labels: genomic data; prior knowledge; arrows (↓) relating Data, Gene, and Function]

A framework for functional genomics. [Figure: gene pairs labeled as related (G1 G2 +, G4 G9 +, …) or unrelated (G3 G6 -, G7 G8 -, …), with G2 G5 unknown (?); per-dataset frequency histograms for high vs. low correlation, lethal vs. not lethal, and similar vs. dissimilar; millions of gene pairs across thousands of datasets are combined to give P(G2-G5|Data).]
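
The slide shows per-dataset histograms being combined into P(G2-G5|Data) but does not spell out the combination rule. As a minimal sketch, assuming a naive Bayes style combination over discretized datasets (an assumption, not something the slide states), the posterior for one gene pair could be computed as below. The function name, the likelihood tables, and every number are hypothetical and chosen only for illustration.

```python
def posterior_related(prior, likelihoods, observations):
    """Naive Bayes posterior that a gene pair is functionally related.

    prior: P(related) before seeing any data.
    likelihoods: one dict per dataset with keys "related" and "unrelated",
        each mapping a discretized bin to P(bin | class).
    observations: the observed bin per dataset, or None if the pair is missing.
    """
    p_rel, p_unrel = prior, 1.0 - prior
    for table, obs in zip(likelihoods, observations):
        if obs is None:  # a dataset without this pair contributes no evidence
            continue
        p_rel *= table["related"][obs]
        p_unrel *= table["unrelated"][obs]
    return p_rel / (p_rel + p_unrel)


# Hypothetical likelihood tables, as if learned from labeled gene pairs.
coexpression = {
    "related": {"high": 0.8, "low": 0.2},
    "unrelated": {"high": 0.1, "low": 0.9},
}
sequence_similarity = {
    "related": {"similar": 0.6, "dissim": 0.4},
    "unrelated": {"similar": 0.5, "dissim": 0.5},
}

# Posterior for an unlabeled pair that is highly coexpressed and similar.
p = posterior_related(0.1, [coexpression, sequence_similarity], ["high", "similar"])
```

Because each dataset only multiplies in one likelihood ratio, a scheme like this scales linearly in the number of datasets, which fits the "millions of gene pairs, thousands of datasets" setting the slide describes.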

Functional network prediction and analysis. Global interaction network; carbon metabolism network; extracellular signaling network; gut community network. HEFalMp: currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases.

Functional network prediction from diverse microbial data. Bacterial expression experiments: 876 raw datasets, 310 postprocessed datasets, 304 normalized coexpression networks in 27 species. 307 bacterial interaction experiments: raw interactions, postprocessed interactions. Integrated functional interaction networks in 15 species. [Plot: E. coli integration; ← Precision ↑, Recall ↓]

Cross-species knowledge transfer using functional data 7 Pinaki Sarder TaFTan

TaFTan: Cross-species knowledge transfer using functional data. [Plot: log(precision/random) vs. log(recall) for E. coli, B. subtilis, P. aeruginosa, and M. tuberculosis, comparing species-specific data, species' data excluded, and all species' data.] Important to take advantage of all available data for any one organism. Important to take advantage of all available data for every organism. Scalable to dozens of organisms with hundreds of functional datasets. Currently working on making this more context-specific.

Meta-analysis for unsupervised functional data integration. Evangelou 2007; Huttenhower 2006; Hibbs 2007. Simple regression: all datasets are equally accurate. Random effects: variation within and among datasets and interactions.
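
The contrast between treating all datasets as equally accurate and modeling variation within and among datasets can be illustrated with a standard random-effects estimate. The sketch below assumes each dataset contributes one correlation for a given gene pair, applies Fisher's z-transform, and uses the DerSimonian-Laird estimator for the between-dataset variance. Both the choice of estimator and the function name are assumptions for illustration; the slide does not say which model was used.

```python
import math


def random_effects_correlation(correlations, sample_sizes):
    """Pool per-dataset correlations with a DerSimonian-Laird random-effects model.

    Returns (pooled correlation, tau^2), where tau^2 is the estimated
    between-dataset variance on the Fisher z scale.
    """
    zs = [math.atanh(r) for r in correlations]          # Fisher z-transform
    variances = [1.0 / (n - 3) for n in sample_sizes]   # within-dataset variance of z
    w = [1.0 / v for v in variances]                    # fixed-effect weights

    z_fixed = sum(wi * zi for wi, zi in zip(w, zs)) / sum(w)
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, zs))  # heterogeneity statistic
    k = len(zs)
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Random-effects weights add the between-dataset variance to each dataset.
    w_star = [1.0 / (v + tau2) for v in variances]
    z_random = sum(wi * zi for wi, zi in zip(w_star, zs)) / sum(w_star)
    return math.tanh(z_random), tau2


# Hypothetical correlations for one gene pair across three datasets.
pooled, tau2 = random_effects_correlation([0.9, -0.2, 0.5], [50, 50, 50])
```

When the datasets agree, tau^2 shrinks to zero and the estimate reduces to the fixed-effect (inverse-variance weighted) average. When they disagree, the weights flatten so that no single large dataset dominates.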

Meta-analysis for unsupervised functional data integration 10 Evangelou 2007 Huttenhower 2006 Hibbs =

[Diagram labels: AML/ALL; temperature; DNA damage; gene expression; batch effects; functional modules] So what does all of this have to do with microbial communities?

[Diagram labels: Healthy/IBD; temperature; location; taxa & orthologs; ???; niches & phylogeny] Test for correlates; multiple hypothesis correction; feature selection (p >> n); confounds/stratification/environment; cross-validate; biological story?; independent sample; intervention/perturbation.
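
The slide lists multiple hypothesis correction as one step when testing many features for correlates, without naming a method. As one common option, here is a minimal sketch of the Benjamini-Hochberg false discovery rate adjustment. The choice of this procedure and the example p-values are assumptions for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, one per tested feature (for example, one per taxon).
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q <= 0.05 for q in adjusted]
```

With many more features than samples (p >> n), a correction of this kind keeps the expected fraction of false positives among the reported features under control, which per-feature thresholds alone do not.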

What features to test? [Diagram labels: microbiome data; 16S reads; WGS reads; taxa; orthologous clusters; pathways/modules; functional roles; pathway activity; genomic data (reference genomes); functional data (experimental models); binning; clustering]

MetaHIT: Data → features. WGS reads; pathways/modules; KO clusters; KEGG pathways; taxa (Phymm, Brady 2009). 85 healthy, 15 IBD + 12 healthy, 12 IBD. ReBLASTed against KEGG since published data obfuscates read counts. 10x bootstrap within training cohort, test on as validation.

MetaHIT: Taxonomic CD biomarkers. Bacteroidetes, Firmicutes, Methanomicrobia, Enterobacteriaceae, Chromatiales, Desulfobacterales, Oxalobacteraceae, Rhodobacteraceae, Bradyrhizobiaceae. (iTOL, Letunic 2007)

MetaHIT: Taxonomic CD biomarkers 16 Down in CD Up in CD

MetaHIT: Functional CD biomarkers 17 Growth/replication Motility Transporters Sugar metabolism Down in CD Up in CD

MetaHIT: KO IBD biomarkers 18 Transporters Growth/ replication Motility Sugar metabolism Down in IBD Up in IBD LEfSe Nicola Segata

Metagenomic differential analysis: LEfSe
1. Is there a statistically significant difference? (t-tests, ANOVA, MANOVA, Friedman, Kruskal–Wallis…)
2. Is the difference biologically significant? (expert supervision, specific post-hoc tests…)
3. How large is the difference? (PCA, LDA, mean difference, class or cluster distance…)
LEfSe: p(ANOVA) < 0.05; pairwise post-hoc Wilcoxon OK; Log(Score(LDA)) = 3.68

LEfSe: A non-human example. Viromes vs. bacterial metagenomes (Microbial vs. Viral; Dinsdale 2008). Metastats (White 2009): p <; ANOVA: p < 0.05; LEfSe: DIFF! Hi-level functional category: Carbohydrates. Hi-level functional category: Transporters. Hi-level functional category: Nucleosides and Nucleotides. LEfSe: NO DIFF!

Sleipnir: Software for scalable functional genomics. Massive datasets require efficient algorithms and implementations. Sleipnir is a C++ library for computational functional genomics. Data types for biological entities: microarray data, interaction data, genes and gene sets, functional catalogs, etc. Network communication, parallelization. Efficient machine learning algorithms: generative (Bayesian) and discriminative (SVM). And it's fully documented! It's also speedy: microbial data integration computation takes <3hrs.

Recap. TaFTan; meta-analytic integration; LEfSe. Unsupervised system for data mining without curated prior knowledge. Comparative microbiome analysis by taxa, orthologs, and pathways. Sleipnir software for scalable functional genomics. Network framework for scalable data integration. Cross-species knowledge transfer from functional data.

Thanks! Jacques Izard, Wendy Garrett, Sarah Fortune, Pinaki Sarder, Nicola Segata, Levi Waldron, Larisa Miropolsky, Willythssa Pierre-Louis

Predicting Gene Function 25 Cell cycle genes Predicted relationships between genes High Confidence Low Confidence

Predicting Gene Function 26 Predicted relationships between genes High Confidence Low Confidence Cell cycle genes

Predicting Gene Function 27 Predicted relationships between genes High Confidence Low Confidence These edges provide a measure of how likely a gene is to specifically participate in the process of interest.

Comprehensive Validation of Computational Predictions 28 Genomic data Computational Predictions of Gene Function MEFIT SPELL Hibbs et al 2007 bioPIXIE Myers et al 2005 Genes predicted to function in mitochondrion organization and biogenesis Laboratory Experiments Petite frequency Growth curves Confocal microscopy New known functions for correctly predicted genes Retraining With David Hess, Amy Caudy Prior knowledge

Evaluating the Performance of Computational Predictions Original GO Annotations Genes involved in mitochondrion organization and biogenesis 135 Under-annotations 82 Novel Confirmations, First Iteration 17 Novel Confirmations, Second Iteration 340 total: >3x previously known genes in ~5 person-months

Evaluating the Performance of Computational Predictions Original GO Annotations Genes involved in mitochondrion organization and biogenesis 95 Under-annotations 40 Confirmed Under-annotations 80 Novel Confirmations First Iteration 17 Novel Confirmations Second Iteration 340 total: >3x previously known genes in ~5 person-months Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Validating Human Predictions 31 Autophagy Luciferase (Negative control) ATG5 (Positive control) LAMP2RAB11A Not Starved (Autophagic) Predicted novel autophagy proteins 5½ of 7 predictions currently confirmed With Erin Haley, Hilary Coller

Functional mapping: mining integrated networks 32 Predicted relationships between genes High Confidence Low Confidence The strength of these relationships indicates how cohesive a process is. Chemotaxis

Functional mapping: mining integrated networks 33 Predicted relationships between genes High Confidence Low Confidence Chemotaxis

Functional mapping: mining integrated networks 34 Flagellar assembly The strength of these relationships indicates how associated two processes are. Predicted relationships between genes High Confidence Low Confidence Chemotaxis

Functional Mapping: Scoring Functional Associations. How can we formalize these relationships? Any sets of genes G1 and G2 in a network can be compared using four measures: edges between their genes; edges within each set; the background edges incident to each set; and the baseline of all edges in the network. Stronger connections between the sets increase association. Stronger self-connections within a set or nonspecific background connections decrease association.
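
The slide names four measures and the direction in which each moves the association, but gives no formula. The sketch below assumes one plausible combination, (between / background) × (baseline / within), using mean edge weights. This form is an assumption chosen to match the qualitative description, and the helper names and toy network are hypothetical.

```python
from itertools import combinations, product
from statistics import mean


def make_weight(edges):
    """Build a symmetric lookup w(a, b) from a dict {(a, b): weight}."""
    table = {frozenset(pair): w for pair, w in edges.items()}
    return lambda a, b: table.get(frozenset((a, b)), 0.0)


def functional_association(genes, weight, set1, set2):
    """Assumed score: (between / background) * (baseline / within).

    Each set needs at least two genes so that within-set edges exist.
    """
    between = mean(weight(a, b) for a, b in product(set1, set2) if a != b)
    within = mean(
        weight(a, b)
        for s in (set1, set2)
        for a, b in combinations(sorted(s), 2)
    )
    union = set1 | set2
    background = mean(weight(a, b) for a in union for b in genes if a != b)
    baseline = mean(weight(a, b) for a, b in combinations(sorted(genes), 2))
    return (between / background) * (baseline / within)


genes = set("ABCDEF")

# Toy network 1: every edge has the same weight, so nothing stands out.
uniform = {pair: 0.5 for pair in combinations(sorted(genes), 2)}
score_uniform = functional_association(genes, make_weight(uniform), {"A", "B"}, {"C", "D"})

# Toy network 2: weak edges everywhere except strong edges between the two sets.
linked = {pair: 0.2 for pair in combinations(sorted(genes), 2)}
for pair in product("AB", "CD"):
    linked[pair] = 0.9
score_linked = functional_association(genes, make_weight(linked), {"A", "B"}, {"C", "D"})
```

In the uniform network the score is 1, while strengthening only the edges between the two sets pushes it above 1. Under this assumed form, raising within-set or background weights would lower it, matching the directions described on the slide.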

Functional Mapping: Bootstrap p-values. Scoring functional associations is great, but how do you interpret an association score for gene sets of arbitrary sizes, in arbitrary graphs, each with its own bizarre distribution of edges? Empirically! For any graph, compute FA scores for many randomly chosen gene sets of different sizes. The null distribution is approximately normal with mean 1. The standard deviation is asymptotic in the sizes of both gene sets. This maps FA scores to p-values for any gene sets and underlying graph. [Figures: histograms of FAs for random sets, by number of genes; null distribution σs for one graph]
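
The empirical procedure described on this slide can be sketched as follows: draw many random gene sets of the sizes of interest, score them, summarize the null by its mean and standard deviation, and read a p-value from the fitted normal. The function names, the toy network, and the simplified stand-in score are hypothetical. Drawing disjoint random sets and estimating the standard deviation separately for each size pair (rather than fitting a curve across sizes) are simplifying assumptions.

```python
import random
from statistics import NormalDist, mean, stdev


def null_parameters(genes, score, size1, size2, n_random=200, seed=0):
    """Estimate the null mean and standard deviation of a set-pair score."""
    rng = random.Random(seed)
    pool = sorted(genes)
    samples = []
    for _ in range(n_random):
        drawn = rng.sample(pool, size1 + size2)  # two disjoint random gene sets
        samples.append(score(set(drawn[:size1]), set(drawn[size1:])))
    return mean(samples), stdev(samples)


def association_p_value(observed, null_mean, null_sd):
    """Upper-tail p-value of an observed score under a normal null."""
    return 1.0 - NormalDist(null_mean, null_sd).cdf(observed)


# Hypothetical toy network with random edge weights on 30 genes.
rng = random.Random(1)
genes = [f"g{i}" for i in range(30)]
weights = {
    frozenset((a, b)): rng.random()
    for i, a in enumerate(genes)
    for b in genes[i + 1:]
}
baseline = mean(weights.values())


def toy_score(set1, set2):
    """Stand-in score: mean between-set edge weight relative to the baseline."""
    return mean(weights[frozenset((a, b))] for a in set1 for b in set2) / baseline


null_mean, null_sd = null_parameters(genes, toy_score, 5, 5)
p = association_p_value(null_mean + 3 * null_sd, null_mean, null_sd)
```

For this stand-in score the null mean comes out close to 1, in line with the slide's description. A real application would substitute the actual association score and the actual network.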

Functional Mapping: Functional Associations Between Processes 37 Edges Associations between processes Very Strong Moderately Strong Nodes Cohesiveness of processes Below Baseline (genomic background) Very Cohesive Borders Data coverage of processes Well Covered Sparsely Covered Hydrogen Transport Electron Transport Cellular Respiration Protein Processing Peptide Metabolism Cell Redox Homeostasis Aldehyde Metabolism Energy Reserve Metabolism Vacuolar Protein Catabolism Negative Regulation of Protein Metabolism Organelle Fusion Protein Depolymerization Organelle Inheritance

Functional Mapping: Functional Associations Between Processes 38 Edges Associations between processes Very Strong Moderately Strong Nodes Cohesiveness of processes Below Baseline (genomic background) Very Cohesive Borders Data coverage of processes Well Covered Sparsely Covered

Functional maps for cross-species knowledge transfer. [Figure: genes G1 to G17 grouped into ortholog clusters O1 to O9 (O1: G1, G2, G3; O2: G4; O3: G6; …), with example cluster members ECG1, ECG2; BSG1; ECG3, BSG2; …]

Functional maps for functional metagenomics 40 GOS Hypersaline Lagoon, Ecuador KEGG Pathways Organisms Pathogens Env. Mapping genes into pathways Mapping pathways into organisms + Integrated functional interaction networks in 27 species Mapping organisms into phyla =

Functional Maps: Focused Data Summarization 41 ACGGTGAACGTACA GTACAGATTACTAG GACATTAGGCCGTA TCCGATACCCGATA Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Functional Maps: Focused Data Summarization 42 ACGGTGAACGTACA GTACAGATTACTAG GACATTAGGCCGTA TCCGATACCCGATA How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information? Functional mapping Very large collections of genomic data Specific predicted molecular interactions Pathway, process, or disease associations Underlying experimental results and functional activities in data

Functional maps for cross-species knowledge transfer 43 ← Precision ↑, Recall ↓ Following up with unsupervised and partially anchored network alignment

LEfSe: A non-human example. Viromes vs. bacterial metagenomes (Microbial vs. Viral). Metastats (White 2009): p <; ANOVA: p < 0.05; LEfSe: DIFF! Hi-level functional category: Carbohydrates. Hi-level functional category: Membrane Transport. Hi-level functional category: Nitrogen Metabolism. Hi-level functional category: Nucleosides and Nucleotides. LEfSe: NO DIFF!