Network integration and function prediction: Putting it all together Curtis Huttenhower 04-13-11 Harvard School of Public Health Department of Biostatistics.

Presentation transcript:

Network integration and function prediction: Putting it all together Curtis Huttenhower Harvard School of Public Health Department of Biostatistics

Outline
Functional network integration
– Bayes nets and LR
– The human genome, tissues, and disease
Network meta-analysis
– Pathogens and MTb
– Quantifying progress in yeast
Networks to pathways
– Functional mapping: networks of networks
– Hierarchical integration
– Pathway prediction
Regulatory network integration
– Network motifs

A computational definition of functional genomics
[Figure: genomic data and prior knowledge feed mappings among data, genes, and functions (Data → Function, Gene → Gene, Gene → Function).]

A framework for functional genomics
[Figure: gene pairs with known positive (+: G1–G2, G4–G9, …) or negative (−: G3–G6, G7–G8, …) relationships are used to build, for each dataset, frequency distributions over its values (high vs. low correlation, colocalized vs. not colocalized, similar vs. dissimilar). Combining these across 1Ks of datasets and Ms of gene pairs gives P(G2–G5 | Data) for an unlabeled pair such as G2–G5.]
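The integration step pictured above can be sketched as a naive Bayes classifier over discretized dataset values. This is only an illustration under stated assumptions: the function names, the binning, the pseudocount, and the independence assumption across datasets are hypothetical and are not taken from the slides.

```python
def learn_likelihoods(binned_values, labels, n_bins, pseudocount=1.0):
    """Estimate P(bin | related) and P(bin | unrelated) for one dataset.

    binned_values: bin index observed for each training gene pair.
    labels: True if the pair is a known positive, False if a known negative.
    The pseudocount (an assumption) keeps every probability above zero.
    """
    counts = {True: [pseudocount] * n_bins, False: [pseudocount] * n_bins}
    for b, related in zip(binned_values, labels):
        counts[related][b] += 1
    return {lab: [c / sum(cs) for c in cs] for lab, cs in counts.items()}


def posterior(observations, likelihoods, prior):
    """P(related | data) for one gene pair, assuming datasets are independent.

    observations: {dataset name: bin index observed for this pair}.
    likelihoods: {dataset name: output of learn_likelihoods}.
    prior: P(related) before seeing any data.
    """
    p_rel, p_unrel = prior, 1.0 - prior
    for dataset, b in observations.items():
        p_rel *= likelihoods[dataset][True][b]
        p_unrel *= likelihoods[dataset][False][b]
    return p_rel / (p_rel + p_unrel)


# Toy training data (made up): bin 1 = high correlation, bin 0 = low.
coexpr = learn_likelihoods(
    [1, 1, 1, 0, 0, 0, 0, 1],
    [True, True, True, True, False, False, False, False],
    n_bins=2,
)
print(posterior({"coexpr": 1}, {"coexpr": coexpr}, prior=0.5))
```

Each additional dataset multiplies in one more likelihood term, which is why this form scales to many datasets.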

MEFIT: A Framework for Functional Genomics
[Figure: datasets (Golub 1999, Butte 2000, Whitfield 2002, Hansen 1998) linked to a Functional Relationship node within a Biological Context (functional area, tissue, disease, …).]

Functional network prediction and analysis
HEFalMp
[Figure panels: global interaction network; metabolism network; signaling network; gut community network.]
Currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases.

HEFalMp: Predicting human gene function

HEFalMp: Predicting human genetic interactions

HEFalMp: Analyzing human genomic data

HEFalMp: Understanding human disease

Validating Human Predictions
Autophagy
[Figure: panels for Luciferase (negative control), ATG5 (positive control), LAMP2, and RAB11A, under not starved and starved (autophagic) conditions.]
Predicted novel autophagy proteins: 5½ of 7 predictions currently confirmed.
With Erin Haley, Hilary Coller

Meta-analysis for unsupervised functional data integration
Evangelou 2007; Huttenhower 2006; Hibbs 2007
– Simple regression: all datasets are equally accurate
– Random effects: variation within and among datasets and interactions
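As an illustration of the random-effects idea, the sketch below pools one gene pair's correlation across datasets using Fisher's z-transform and the DerSimonian-Laird estimate of between-dataset variance. The slides do not state which estimator was used, so the estimator choice and all names here are assumptions.

```python
import math


def random_effects_correlation(correlations, sample_sizes):
    """Pool correlations across datasets; returns (pooled r, tau^2).

    Fisher z = atanh(r) has approximate within-dataset variance 1 / (n - 3).
    tau^2 is the DerSimonian-Laird between-dataset variance (assumed here).
    """
    zs = [math.atanh(r) for r in correlations]
    within = [1.0 / (n - 3) for n in sample_sizes]
    w = [1.0 / v for v in within]
    z_fixed = sum(wi * zi for wi, zi in zip(w, zs)) / sum(w)
    # Cochran's Q measures disagreement among datasets.
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, zs))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(zs) - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add the between-dataset variance.
    w_re = [1.0 / (v + tau2) for v in within]
    z_re = sum(wi * zi for wi, zi in zip(w_re, zs)) / sum(w_re)
    return math.tanh(z_re), tau2


# Toy example (made-up values): three datasets that agree exactly.
print(random_effects_correlation([0.5, 0.5, 0.5], [10, 20, 30]))
```

When datasets agree, tau^2 is zero and the result matches a fixed-effect pooling; when they disagree, large datasets are down-weighted relative to the fixed-effect case.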

Unsupervised data integration: TB virulence and ESX-1 secretion
[Figure: Graphle network view.]
With Sarah Fortune

Predicting gene function
[Figure: network around cell cycle genes; edges are predicted relationships between genes, ranging from low to high confidence.]
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
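One way to turn edge confidences into a per-gene score is to compare a gene's average connection to the process with its average connection to the whole genome. The ratio used below is an assumption chosen for illustration; the slides do not give the actual scoring formula, and all names are hypothetical.

```python
def process_specificity(gene, process_genes, all_genes, weights):
    """Ratio of mean edge weight to the process vs. mean edge weight to all genes.

    weights: {frozenset({gene_a, gene_b}): confidence in [0, 1]};
    missing pairs are treated as weight 0 (an assumption).
    """
    def w(a, b):
        return weights.get(frozenset((a, b)), 0.0)

    members = [g for g in process_genes if g != gene]
    others = [g for g in all_genes if g != gene]
    to_process = sum(w(gene, g) for g in members) / len(members)
    to_all = sum(w(gene, g) for g in others) / len(others)
    return to_process / to_all if to_all > 0 else 0.0


# Toy network (made-up confidences).
weights = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.8,
    frozenset(("A", "D")): 0.1,
    frozenset(("A", "E")): 0.1,
}
genes = ["A", "B", "C", "D", "E"]
print(process_specificity("A", {"B", "C"}, genes, weights))
```

A score above 1 means the gene is more strongly connected to the process than to the genomic background.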

Comprehensive validation of computational predictions
[Workflow figure: genomic data and prior knowledge → computational predictions of gene function (MEFIT; SPELL, Hibbs et al 2007; bioPIXIE, Myers et al 2005) → genes predicted to function in mitochondrion organization and biogenesis → laboratory experiments (petite frequency, growth curves, confocal microscopy) → new known functions for correctly predicted genes → retraining.]
With David Hess, Amy Caudy

Evaluating the performance of computational predictions
Genes involved in mitochondrion organization and biogenesis:
– Original GO annotations
– 135 under-annotations
– 82 novel confirmations, first iteration
– 17 novel confirmations, second iteration
340 total: >3x previously known genes in ~5 person-months

Evaluating the performance of computational predictions
Genes involved in mitochondrion organization and biogenesis:
– Original GO annotations
– 95 under-annotations
– 40 confirmed under-annotations
– 80 novel confirmations, first iteration
– 17 novel confirmations, second iteration
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Functional mapping: mining integrated networks
[Figure: predicted relationships between genes, ranging from low to high confidence, among chemotaxis genes.]
The strength of these relationships indicates how cohesive a process is.

Functional mapping: mining integrated networks
[Figure: predicted relationships between chemotaxis genes and flagellar assembly genes.]
The strength of these relationships indicates how associated two processes are.
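A minimal sketch of the two quantities described above, assuming cohesiveness is the mean edge weight within a process and association is the mean edge weight between two processes. The use of a plain mean (rather than a statistic normalized against genomic background) and all names and weights are assumptions for illustration.

```python
from itertools import combinations, product


def mean_weight(pairs, weights):
    """Average confidence over gene pairs; missing edges count as 0."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(weights.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


def cohesiveness(process, weights):
    """Mean edge weight among all gene pairs inside one process."""
    return mean_weight(combinations(sorted(process), 2), weights)


def association(process_a, process_b, weights):
    """Mean edge weight over gene pairs that span the two processes."""
    pairs = [(a, b) for a, b in product(sorted(process_a), sorted(process_b)) if a != b]
    return mean_weight(pairs, weights)


# Toy confidences (made up for illustration).
weights = {
    frozenset(("cheA", "cheY")): 0.9,
    frozenset(("cheA", "cheW")): 0.8,
    frozenset(("cheY", "cheW")): 0.7,
    frozenset(("cheY", "fliC")): 0.6,
}
chemotaxis = {"cheA", "cheY", "cheW"}
flagellar = {"fliC", "flgE"}
print(cohesiveness(chemotaxis, weights), association(chemotaxis, flagellar, weights))
```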

Functional mapping: Associations among processes
[Figure legend:]
– Edges: associations between processes (moderately strong to very strong)
– Nodes: cohesiveness of processes (below baseline, i.e. genomic background, to very cohesive)
– Borders: data coverage of processes (sparsely covered to well covered)
Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance

How do functional interactions become pathways?
Gene expression + physical PPIs + genetic interactions + colocalization + sequence + protein domains + regulatory binding sites + … = ?

Functional genomic data
Simultaneous inference of physical, genetic, regulatory, and functional networks
Interaction types: functional interactions, regulatory interactions, post-transcriptional regulation, metabolic interactions, phosphorylation, protein complexes
With Chris Park, Olga Troyanskaya

Learning a compendium of interaction networks
– Train one SVM per interaction type
– Resolve consistency using hierarchical Bayes net

Learning a compendium of interaction networks
[Figure: AUC.]
Both presence/absence and directionality of interactions are accurately inferred.
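For readers unfamiliar with the AUC metric reported here, it can be computed as the probability that a randomly chosen true interaction is scored above a randomly chosen non-interaction. The sketch below uses that pairwise definition; the function name and toy scores are hypothetical, and nothing here reproduces the evaluation actually run for the slide.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy predictions (made up): perfectly ranked positives give AUC 1.0.
print(auc([0.9, 0.8, 0.3, 0.1], [True, True, False, False]))
```

An AUC of 0.5 corresponds to random ranking, so each interaction type's classifier is judged by how far above 0.5 it lands.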

Using network compendia to predict complete pathways
[Figure: predicted interactions marked as confirmed or unconfirmed.]
An additional 20 novel synthetic lethality predictions were tested and 14 confirmed (>100x better than random).
With David Hess

Interactive aligned network viewer – Graphle

Human Regulatory Networks
Quiescence: reversible exit from the cell cycle
[Figure: expression of 6,829 genes across serum starved (hrs) and serum re-stimulated (hrs) conditions, with labels G0 and I–X; annotated groups include Development, Cholesterol, Protein localization, Cell cycle, RNA processing, and Metabolism. Motifs from FIRE (Elemento et al): Elk-1, Sp1, NF-Y, YY1.]
– Of only five regulators found, four have generic cell cycle/proliferation targets
– Just five basic regulators for ~7,000 genes?
– These motifs only appear upstream of ~half of the genes

COALESCE: Combinatorial Algorithm for Expression and Sequence-based Cluster Extraction
Inputs: gene expression; DNA sequence (5’ UTR, 3’ UTR, upstream flank, downstream flank); evolutionary conservation; nucleosome positions
Steps:
– Create a new module
– Identify conditions where genes coexpress
– Identify motifs enriched in genes’ sequences
– Select genes based on conditions and motifs
– Subtract mean from all data
Feature selection: tests for differential expression/frequency
Bayesian integration
Output: regulatory modules
– Coregulated genes
– Conditions where they’re coregulated
– Putative regulating motifs

COALESCE: Selecting Coexpressed Conditions
For each gene expression condition…
– Compare distributions of values for genes in the module versus genes not in the module
– If significantly different, include the condition
Preserving data structure: if multiple conditions derive from the same dataset, they can be included/excluded as a unit (for example, time course vs. deletion collection)
– Test using multivariate z-test
– Precalculate covariance matrix; still very efficient
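The per-condition test can be sketched with a univariate two-sample z-test. This is a simplification under stated assumptions: it tests each condition separately rather than using the multivariate test with a precalculated covariance matrix mentioned above, and the function names and the 0.05 threshold are hypothetical.

```python
import math
from statistics import mean, variance


def z_statistic(inside, outside):
    """Two-sample z statistic; assumes at least one group has nonzero variance."""
    se = math.sqrt(variance(inside) / len(inside) + variance(outside) / len(outside))
    return (mean(inside) - mean(outside)) / se


def two_sided_p(z):
    """Two-sided p-value under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def select_conditions(expression, module, alpha=0.05):
    """Return indices of conditions where module genes differ from the rest.

    expression: {gene: [value per condition]}; module: set of gene names.
    """
    n_conditions = len(next(iter(expression.values())))
    selected = []
    for c in range(n_conditions):
        inside = [v[c] for g, v in expression.items() if g in module]
        outside = [v[c] for g, v in expression.items() if g not in module]
        if two_sided_p(z_statistic(inside, outside)) < alpha:
            selected.append(c)
    return selected


# Toy data (made up): module genes are elevated only in condition 0.
expression = {
    "g1": [2.0, 0.1], "g2": [2.1, -0.1], "g3": [1.9, 0.0], "g4": [2.2, 0.1],
    "g5": [0.1, 0.0], "g6": [-0.1, 0.1], "g7": [0.0, -0.1],
    "g8": [0.2, 0.05], "g9": [-0.2, -0.05],
}
print(select_conditions(expression, {"g1", "g2", "g3", "g4"}))
```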

COALESCE: Selecting Significant Motifs
COALESCE looks for three kinds of motifs:
– K-mers (e.g. ACGACGT)
– Reverse complement pairs (e.g. ACGACAT | ATGTCGT)
– Probabilistic Suffix Trees (PSTs)
For every possible motif…
– Compare distributions of values for genes in the module versus genes not in the module
– If significantly different, include the motif
This can distinguish flanks from UTRs.
Fast! Efficient enough to search coding sequence (e.g. exons/introns).
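The first two motif types can be illustrated by counting k-mers in a sequence and then merging each k-mer with its reverse complement. This is a sketch only; the function names are hypothetical, and PSTs and the enrichment test itself are not shown.

```python
from collections import Counter

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def rc_pair_counts(seq, k):
    """Count k-mers with each reverse complement pair merged under one key.

    The key is the lexicographically smaller member of the pair.
    """
    merged = Counter()
    for kmer, n in kmer_counts(seq, k).items():
        merged[min(kmer, reverse_complement(kmer))] += n
    return merged


print(reverse_complement("ACGACAT"))
print(rc_pair_counts("ACGACATGTCGT", 7))
```

Per-gene counts like these are the values whose distributions would be compared between genes in the module and genes outside it.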

COALESCE: Selecting Probable Genes
For each gene in the genome…
– For each significant condition… / For each significant motif…
– What’s the probability the gene came from the module’s distribution?
– What’s the probability that it came from outside the module?
Distributions of each feature in and out of the developing module are observed from the data.
Prior is used to stabilize module convergence; genes already in the module are more likely to stay there next iteration.
The probability of a gene being in the module given some data…
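The slide's equation did not survive extraction. As a generic reference only, Bayes' rule for this question can be written as below, where g is a gene, M the module, and D_i the gene's value for the i-th selected condition or motif. Treating the features as independent given module membership is an assumption made here for illustration, not a statement of the exact formula used.

```latex
P(g \in M \mid D) =
  \frac{P(D \mid g \in M)\, P(g \in M)}
       {P(D \mid g \in M)\, P(g \in M) + P(D \mid g \notin M)\, P(g \notin M)},
\qquad
P(D \mid g \in M) = \prod_{i} P(D_i \mid g \in M)
```

The prior P(g ∈ M) is where the stabilization described above would enter: a gene already in the module would get a higher prior on the next iteration.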

COALESCE: Integrating Additional Data Types
Nucleosome placement and evolutionary conservation:
– Can be included as additional datasets and feature selected just like expression conditions/motifs.
– Or can be used as a prior or weight on the values of individual motifs.
[Figure: example sequence TCCGGTAGAACTACTGGTATTGTTTTGGATTCCGGTGATG with annotation tracks.]

COALESCE Results: S. cerevisiae Modules
The haystack: ~6,000 genes, ~2,200 conditions
A needle: 100 genes, 80 conditions

COALESCE Results: S. cerevisiae Modules
– … genes, 144 conditions: Conjugation
– 33 genes, 434 conditions: Budding
– 112 genes, 82 conditions: Mitosis and DNA replication
Labels: Swi5, Stb1/Swi6, Ste

COALESCE Results: S. cerevisiae Modules
– … genes, 775 conditions: Iron transport
– 11 genes, 844 conditions: Phosphate transport
– 126 genes, 660 conditions: Glycolysis, iron and phosphate transport, amino acid metabolism…
Labels: Pho4, Helix-Loop-Helix, Tye7/Cbf1/Pho4, Aft1/2

COALESCE Results: S. cerevisiae Modules
– … genes, 319 conditions: Mitochondrial translation
Labels: Puf3, 822
…plus more ribosome clusters than you can shake a stick at!

COALESCE Results: Yeast TF/Target Accuracy

COALESCE Results: TF/Targets Influenced by Supporting Data
[Figure panels:]
– Improved by any addl. data, mainly conservation
– Decreased by addl. data
– Improved by conservation
– Improved only by both

COALESCE Results: Yeast Clustering Accuracy
~2,200 yeast conditions
– Recapitulation of known biology from Gene Ontology

[Example modules from other organisms:]
– M. musculus: up in callosal and motor neurons; ASCL1 in 5’ flank, unch. sequences underenriched in 3’ UTR
– C. elegans: up in larvae, down in adults; GATA in 5’ flank, miR-788 seed in 3’ UTR
– H. sapiens: up in normal muscle, down in diabetic; AAGGGGC (zf?) and enriched in 5’ flank

COALESCE: Coregulated Quiescence Modules
Predicts regulatory modules from genomic data:
– Coregulated genes
– Conditions under which coregulation occurs
– Putative regulatory motifs
5 quiescence-related microarray datasets, 60 conditions:
– Quiescence program (Coller et al. 2006)
– Adenoviral infection (Miller et al. 2007)
– let-7 response (Legesse-Miller et al. unpub.)
– Contact inhibition (Scarino et al. unpub.)
– Serum withdrawal (Legesse-Miller et al. unpub.)

COALESCE: Coregulated Quiescence Modules
– Down during quiescence entry, up during quiescence exit, down with adenoviral infection: specific predicted uncharacterized reverse complement motif
– Up during quiescence entry, down during quiescence exit: many known related (proliferation) motifs: Pax4, Staf, NFKB1, Gfi, ESR1, Runx1, Su(H)
– Down during quiescence entry, enriched for transport/trafficking: miR-297 motif predicted in 3’ UTR (CACATAC)
– Down with let-7 exposure: let-7 motifs predicted in 3’ UTR (UACCUC)

Network Motifs
– Coherent feed-forward: filter
– Incoherent feed-forward: pulse
– Bi-fan: WGD and evolvability
– Positive auto-regulation: delay
– Negative auto-regulation: speed + stability
– Feedback: memory
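To make two of these motifs concrete, the sketch below finds feed-forward loops (X → Y, Y → Z, and X → Z) and auto-regulated nodes (self-loops) in a directed regulatory graph given as an edge list. It ignores edge signs, so it cannot tell coherent from incoherent loops or positive from negative auto-regulation; the function names and toy edges are hypothetical.

```python
def feed_forward_loops(edges):
    """Return sorted (x, y, z) triples with edges x->y, y->z, and x->z."""
    successors = {}
    for u, v in edges:
        if u != v:  # self-loops are handled separately
            successors.setdefault(u, set()).add(v)
    loops = []
    for x, targets in successors.items():
        for y in targets:
            for z in successors.get(y, ()):
                if z != x and z in targets:
                    loops.append((x, y, z))
    return sorted(loops)


def auto_regulated(edges):
    """Return sorted nodes that regulate themselves."""
    return sorted({u for u, v in edges if u == v})


# Toy regulatory graph (made up).
edges = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "Z"), ("A", "B")]
print(feed_forward_loops(edges))
print(auto_regulated(edges))
```

Counting such subgraphs and comparing against randomized networks is the usual way to decide whether a pattern is a motif, though that comparison is not shown here.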

From Milo, et al., Science, 2002

1:1 Lewis Carroll Map
"… And then came the grandest idea of all! We actually made a map of the country, on the scale of a mile to the mile!"
"Have you used it much?" I enquired.
"It has never been spread out, yet," said Mein Herr: "the farmers objected: they said it would cover the whole country, and shut out the sunlight! So we now use the country itself, as its own map, and I assure you it does nearly as well."
Sylvie and Bruno Concluded by Lewis Carroll