Joint analysis of regulatory networks and expression profiles
Ron Shamir, School of Computer Science, Tel Aviv University
April
Sources:
Igor Ulitsky and Ron Shamir. Identification of Functional Modules using Network Topology and High-Throughput Data. BMC Systems Biology 1:8 (2007).
Igor Ulitsky and Ron Shamir. Identifying functional modules using expression profiles and confidence-scored protein interactions. Bioinformatics Vol. 25 (2009).
Outline
Background
Joint network and expression profiles
–Matisse
–Cezanne

Background

DNA → (transcription) → RNA → (translation) → protein
DNA: the hard disk
RNA: one program
Protein: its output
DNA Microarrays / RNA-seq
Simultaneous measurement of expression levels of all genes / transcripts
Perform measurements in one experiment
Allow a global view of cellular processes
Among the most important biotechnological breakthroughs of the last / current decade
The Raw Data
A genes × experiments matrix
Entries of the raw data matrix: expression levels (ratios / absolute values / …)
Expression pattern for each gene
Profile for each experiment / condition / sample / chip
Needs normalization!

EXPression ANalyzer and DisplayER (EXPANDER)
Clustering: identify clusters of co-expressed genes (CLICK, KMeans, SOM, hierarchical)
Functional enrichment: GO, TANGO
Visualization
Promoter analysis: analyze TF binding sites of co-regulated genes (PRIMA)
Biclustering: identify homogeneous submatrices (SAMBA)
microRNA function inference: FAME
References: A. Maron, R. Sharan Bioinformatics 03; A. Maron-Katz, A. Tanay, C. Linhart, I. Steinfeld, R. Sharan, Y. Shiloh, R. Elkon BMC Bioinformatics 05; Ulitsky et al. Nature Protocols 10

Networks of Protein-protein interactions (PPIs)
Large, readily available resource
Representation: network with nodes = proteins/genes, edges = interactions
Analysis methods:
–Global properties
–Motif content analysis
–Complex extraction
–Cross-species comparison
The hairball syndrome

Potential inroad into pathways and function
Can the network help to improve the analysis?

Analysis of gene expression profiles + a network

Goal
Challenge: detect active functional modules, i.e. connected subnetworks of proteins whose genes are co-expressed
“Where is the action in the network in a particular experiment?”
Ron Shamir, RNA Antalia, April 08 13
MATISSE: Modular Analysis for Topology of Interactions and Similarity SEts
Ulitsky & Shamir BMC Systems Biology 07
Input: expression data and a PPI network
Output: a collection of modules
–Connected PPI subnetworks
–Correlated expression profiles
[Figure legend: interaction; high expression similarity]
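Probabilistic model
Event $M_{ij}$: genes $i,j$ are mates = highly co-expressed
Similarity of mates: $P(S_{ij} \mid M_{ij}) \sim N(\mu_m, \sigma_m^2)$
Similarity of non-mates: $P(S_{ij} \mid \neg M_{ij}) \sim N(\mu_n, \sigma_n^2)$
$H_0$: $U$ is a set of unrelated genes
$H_1$: $U$ is a module = connected subnetwork with high internal similarity
$R_i$: gene $i$ is transcriptionally regulated
$\beta_m$: fraction of mates out of module gene pairs that are transcriptionally regulated, $\beta_m = P(M_{ij} \mid R_i \wedge R_j, H_1)$
$p_m$: fraction of mates out of all gene pairs that are transcriptionally regulated

Probabilistic model (2)
Is the connected gene set $U$ a module?
Assuming pair independence, define
$\beta^{m}_{ij} = \beta_m P(R_i) P(R_j)$
$\beta^{n}_{ij} = p_m P(R_i) P(R_j)$
Likelihood ratio, with $f_m$ and $f_n$ the densities of $N(\mu_m, \sigma_m^2)$ and $N(\mu_n, \sigma_n^2)$:
$$\frac{\Pr(\text{Data} \mid H_1)}{\Pr(\text{Data} \mid H_0)} = \prod_{i<j \in U} \frac{\beta^{m}_{ij} f_m(S_{ij}) + (1-\beta^{m}_{ij}) f_n(S_{ij})}{\beta^{n}_{ij} f_m(S_{ij}) + (1-\beta^{n}_{ij}) f_n(S_{ij})}$$
Taking log: sum of terms $w_{ij}$:
$$w_{ij} = \log \frac{\beta^{m}_{ij} f_m(S_{ij}) + (1-\beta^{m}_{ij}) f_n(S_{ij})}{\beta^{n}_{ij} f_m(S_{ij}) + (1-\beta^{n}_{ij}) f_n(S_{ij})}$$

Probabilistic model - summary
Similarities: mixture of two Gaussians
For a candidate group $U$, the log-likelihood ratio of originating from a module or from the background is
$$W_U = \sum_{i<j \in U} w_{ij}$$
Module score = gene group log-likelihood ratio = sum over all the gene pairs
Find connected subgraphs $U$ with high $W_U$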
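To make the module score concrete, here is a minimal Python sketch of a pairwise log-likelihood ratio under a two-Gaussian mixture, summed over all gene pairs of a candidate set. The Gaussian parameters, the mate probabilities `beta` and `p`, and the toy similarity values are made up for illustration. The sketch also uses one constant mate probability per hypothesis, rather than scaling by each gene's probability of being transcriptionally regulated.

```python
import math
from itertools import combinations

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def pair_weight(s, beta, p, mates=(0.8, 0.2), non_mates=(0.0, 0.3)):
    """Log-likelihood ratio of one gene pair with similarity s.

    beta: mate probability of the pair under H1 (module)
    p:    mate probability of the pair under H0 (background)
    mates / non_mates: (mean, std) of the two Gaussians (illustrative values)
    """
    f_m = normal_pdf(s, *mates)
    f_n = normal_pdf(s, *non_mates)
    return math.log((beta * f_m + (1 - beta) * f_n) /
                    (p * f_m + (1 - p) * f_n))

def module_score(genes, similarity, beta=0.9, p=0.1):
    """Sum of pair weights over all gene pairs in the candidate set."""
    return sum(pair_weight(similarity[frozenset(pair)], beta, p)
               for pair in combinations(genes, 2))

# Toy similarities: a, b, c are mutually co-expressed; d is unrelated.
similarity = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.8,
    frozenset(("b", "c")): 0.85,
    frozenset(("a", "d")): -0.1,
    frozenset(("b", "d")): 0.0,
    frozenset(("c", "d")): 0.05,
}
tight = module_score(["a", "b", "c"], similarity)
loose = module_score(["a", "b", "c", "d"], similarity)
```

With these toy numbers the co-expressed triple scores positively, and adding the unrelated gene `d` lowers the score, which is the behaviour the score is meant to reward.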
Complexity
Finding the heaviest connected subgraph: NP-hard even without connectivity constraints (+/- edge weights)
Devised a heuristic for the problem

MATISSE workflow
Seed generation
Greedy optimization
Significance filtering

Finding seeds
Three seeding alternatives tested
All alternatives build a seed and delete it from the network
Building small seeds around single nodes:
–Best neighbors
–All neighbors
Approximating the heaviest subgraph:
–Delete low-degree nodes and record the heaviest subnetwork found
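The third seeding alternative, deleting low-degree nodes while recording the heaviest subnetwork seen, can be sketched as a greedy peeling loop. This is only an illustration under stated assumptions: "degree" is taken to mean weighted degree with respect to the pair weights, the function name `peel_heaviest` and the toy weights are hypothetical, and network connectivity is not checked.

```python
def peel_heaviest(nodes, weight):
    """Repeatedly drop the node with the lowest weighted degree and
    remember the heaviest node set seen along the way.

    weight maps frozenset({u, v}) -> pair weight (may be negative);
    missing pairs count as 0. Connectivity is NOT checked in this sketch.
    """
    remaining = set(nodes)

    def w(u, v):
        return weight.get(frozenset((u, v)), 0.0)

    degree = {u: sum(w(u, v) for v in remaining if v != u) for u in remaining}
    total = sum(degree.values()) / 2.0  # every pair was counted twice
    best_nodes, best_total = set(remaining), total
    while len(remaining) > 1:
        # sorted() only makes tie-breaking deterministic
        u = min(sorted(remaining), key=degree.get)
        remaining.remove(u)
        total -= degree.pop(u)
        for v in remaining:
            degree[v] -= w(u, v)
        if total > best_total:
            best_nodes, best_total = set(remaining), total
    return best_nodes, best_total

# Toy weights: a, b, c form a heavy triangle; d and e pull the score down.
weight = {
    frozenset(("a", "b")): 2.0, frozenset(("a", "c")): 2.0,
    frozenset(("b", "c")): 2.0, frozenset(("a", "d")): -1.0,
    frozenset(("b", "d")): -1.0, frozenset(("c", "d")): -1.0,
    frozenset(("a", "e")): -1.0, frozenset(("d", "e")): 1.0,
}
best_nodes, best_total = peel_heaviest("abcde", weight)
```

On this toy input the loop first peels `d`, then `e`, and records the triangle `{a, b, c}` as the heaviest set.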
Greedy optimization
Simultaneous optimization of all the seeds
The following steps are considered:
–Node addition
–Node removal
–Assignment change
–Module merge
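As a rough illustration of the greedy step, the sketch below improves a single module by repeatedly applying the best node addition or node removal that keeps the module connected, and stops when no move raises the score. It is a simplification: assignment changes and module merges are left out, only one module is optimized, and the names `greedy_improve`, `adj` and `weight` are hypothetical.

```python
def is_connected(nodes, adj):
    """Depth-first search restricted to `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes

def gain(node, module, weight):
    """Score change from adding `node` to `module` (negate for removal)."""
    return sum(weight.get(frozenset((node, m)), 0.0)
               for m in module if m != node)

def greedy_improve(seed, adj, weight):
    module = set(seed)
    while True:
        moves = []
        # Node addition: any network neighbour of the module.
        frontier = {v for u in module for v in adj.get(u, ())} - module
        for v in frontier:
            moves.append((gain(v, module, weight), "add", v))
        # Node removal: only if the rest of the module stays connected.
        for u in module:
            rest = module - {u}
            if rest and is_connected(rest, adj):
                moves.append((-gain(u, module, weight), "remove", u))
        if not moves:
            return module
        delta, kind, node = max(moves)
        if delta <= 0:  # no improving move left
            return module
        if kind == "add":
            module.add(node)
        else:
            module.discard(node)

# Toy network: a path a-b-c-d-e; node e has no similarity values.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
       "d": {"c", "e"}, "e": {"d"}}
weight = {
    frozenset(("a", "b")): 1.0, frozenset(("a", "c")): 1.0,
    frozenset(("b", "c")): 1.0, frozenset(("a", "d")): -1.0,
    frozenset(("b", "d")): -1.0, frozenset(("c", "d")): -1.0,
}
result = greedy_improve({"c", "d"}, adj, weight)
```

Starting from the poor seed `{c, d}`, the loop removes `d`, then adds `b` and `a`. Every accepted move strictly increases the score, so the loop terminates.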
Front vs. Back nodes
Only a fraction of the genes (front nodes) have meaningful similarity values
MATISSE can link them using other genes (back nodes)
Back nodes correspond to:
–Unmeasured transcripts
–Post-translational regulation
–Partially regulated pathways

Advantages of MATISSE
No p-values needed for measurements
Works when only a fraction of the genes' expression patterns are informative
Can handle any similarity data
No prespecified number of modules
Test case: Yeast osmotic shock
Network: 65,990 PPIs & protein-DNA interactions among 6,246 genes
Expression: 133 experimental conditions, response of perturbed strains to osmotic shock (O’Rourke & Herskowitz 04)
Front nodes: 2,000 genes with the highest variance

Pheromone response subnetwork
[Figure legend: back nodes; front nodes]

Performance comparison
[Figure axes: % of modules with category enrichment at p< ; % of annotations enriched at $p<10^{-3}$ in modules]

GO and promoter analysis
Application to stem cells
~150 human stem cell lines of diverse types profiled using microarrays
Clustered profiles into groups
Adjusted Matisse to seek subnetworks that are characteristic of each group
Focused analysis on pluripotent stem cells
F. Müller, L. Laurent, D. Kostka, I. Ulitsky, R. Williams, C. Lu, I. Park, M. Rao, P. Schwartz, N. Schmidt, J. Loring Nature 08

Pluripotent stem cells network
Highlights the key protein machinery underlying pluripotency
Ulitsky & Shamir Bioinformatics
Accounting for PPI confidence
PPI-based analysis is made difficult by abundant false positive / negative interactions
Various methods can assign confidence (probability) to individual edges
Idea: seek modules that are connected with high probability
What is a confidently connected module?
With high probability, any two parts of the module are connected by an edge
▫Accommodates both sparse and dense pathways
▫Accommodates genes with low-confidence connectivity with many module genes
Confidently-connected modules can be found efficiently

Connected with high probability?
Every two genes are connected by a confident path
▫Bias to dense pathways
There is a minimum spanning tree with high-confidence edges
▫Same as ignoring low-confidence edges
An edge connects any two parts of the module with high probability
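CEZANNE (Co-Expression Zone ANalysis using NEtworks)
Edge probability: $p(e)$
Edge weight: $-\log(1-p(e))$
Requirement: for any $W \subset U$, at least one edge connects $W$ with $U \setminus W$ with probability at least $q$ (e.g. 0.95)
Assuming independent edges, at least one edge crosses a cut $C$ with probability $1-\prod_{e \in C}(1-p(e))$, so the requirement holds exactly when the weight of the minimum cut of $U$ is at least $-\log(1-q)$
Algorithm: among the subnetworks whose minimum cut weight is at least $-\log(1-q)$, find the one with the maximum co-expression score
Example (nodes A, B, C, D):
P({A},{B,C,D}) = 1-0.3*0.3 = 0.91 (minimum cut)
P({A,C,D},{B}) = 0.94
P({A,B},{C,D}) = 0.94
P({A,B,D},{C}) = 0.994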
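The confident-connectivity test can be illustrated on a tiny network. The sketch below assumes independent edges with probabilities strictly below 1, and enumerates every bipartition by brute force to find the minimum cut, so it only suits toy inputs. The four-node network and its edge probabilities are hypothetical and chosen so that node A has two edges of probability 0.7, which gives the 1-0.3*0.3 = 0.91 value for the cut around A.

```python
import math
from itertools import combinations

def min_cut(nodes, prob):
    """Brute-force minimum cut under weights -log(1 - p(e)).

    prob maps (u, v) tuples to edge probabilities (assumed < 1).
    Returns (cut weight, one side of the cut). Exponential time.
    """
    node_set = set(nodes)
    ordered = sorted(node_set)
    anchor, others = ordered[0], ordered[1:]
    best_w, best_side = math.inf, None
    # Fix the anchor on one side so each bipartition is visited once;
    # k < len(others) guarantees the other side is never empty.
    for k in range(len(others)):
        for extra in combinations(others, k):
            side = {anchor, *extra}
            w = sum(-math.log(1.0 - p) for (u, v), p in prob.items()
                    if u in node_set and v in node_set
                    and (u in side) != (v in side))
            if w < best_w:
                best_w, best_side = w, side
    return best_w, best_side

def confidently_connected(nodes, prob, q=0.95):
    """True if every cut is crossed by an edge with probability >= q."""
    w, _ = min_cut(nodes, prob)
    return w >= -math.log(1.0 - q)

prob = {("A", "B"): 0.7, ("A", "C"): 0.7, ("B", "C"): 0.8,
        ("B", "D"): 0.7, ("C", "D"): 0.9}
w, side = min_cut("ABCD", prob)
crossing_probability = 1.0 - math.exp(-w)  # about 0.91 for this toy network
```

Here the weakest cut isolates A, so the toy module passes the test at q = 0.9 but fails at q = 0.95.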
How to find confidently connected modules?
Seed identification: run MATISSE ignoring edge weights, then “slice” the modules using minimum cuts until all subnetworks are “legal”
Greedy optimization (how to find legal moves?):
▫Adding nodes is easy to test (positive edge weights)
▫Merging modules is easy to test
▫(Re)moving nodes: requires maintaining the set of ‘crucial’ nodes in each module
Solvable in minutes on real-world examples
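The slicing step can be sketched as a recursion: if the minimum cut of a module is too light, split the module along that cut and repeat on both sides. This is a hedged illustration, not the published implementation. The function name `slice_module`, the `min_size` cutoff and the toy network are assumptions, and the brute-force minimum cut is exponential, so real networks would need a polynomial algorithm such as Stoer-Wagner.

```python
import math
from itertools import combinations

def min_cut(nodes, prob):
    """Brute-force minimum cut under weights -log(1 - p(e)); toy sizes only."""
    node_set = set(nodes)
    ordered = sorted(node_set)
    anchor, others = ordered[0], ordered[1:]
    best_w, best_side = math.inf, None
    for k in range(len(others)):
        for extra in combinations(others, k):
            side = {anchor, *extra}
            w = sum(-math.log(1.0 - p) for (u, v), p in prob.items()
                    if u in node_set and v in node_set
                    and (u in side) != (v in side))
            if w < best_w:
                best_w, best_side = w, side
    return best_w, best_side

def slice_module(nodes, prob, q=0.95, min_size=2):
    """Split a module along minimum cuts until every piece is 'legal'.

    Pieces smaller than min_size are dropped (an assumption of this sketch).
    """
    nodes = set(nodes)
    if len(nodes) < max(min_size, 1):
        return []
    if len(nodes) == 1:
        return [nodes]
    w, side = min_cut(nodes, prob)
    if w >= -math.log(1.0 - q):
        return [nodes]  # already confidently connected
    return (slice_module(side, prob, q, min_size)
            + slice_module(nodes - side, prob, q, min_size))

# Two confident triangles joined by one low-confidence edge C-D.
prob = {("A", "B"): 0.9, ("A", "C"): 0.9, ("B", "C"): 0.9,
        ("D", "E"): 0.9, ("D", "F"): 0.9, ("E", "F"): 0.9,
        ("C", "D"): 0.5}
pieces = slice_module("ABCDEF", prob)
```

On this input the weak bridge is cut first, and each triangle then passes the q = 0.95 test on its own.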
DNA damage response in S. cerevisiae
47 DNA Damage Response expression profiles (Gasch et al., 01)
Front nodes: 2,074 genes with at least two-fold expression change
Network and confidence values: purification enrichment (PE) scores (Collins et al. 07)
Table columns: Module size | GO biological process | p-value | GO-slim protein complexes | p-value
Entries (term: p-value) in their original order; the first module has size 346
ribosome biogenesis and assembly: 1.2·
ribosome: 5.9·
translation: 1.0·
eukaryotic 43S preinitiation complex: 3.8·
rRNA processing: 7.5·
small nucleolar ribonucleoprotein complex: 1.5·
S primary transcript processing: 4.6·
DNA-directed RNA polymerase III complex: 3.1·
ribosome assembly: 4.3·
exosome (RNase complex): 4.4·
ribosomal large subunit biogenesis: 9.2·
DNA-directed RNA polymerase I complex: 5.7·
rRNA modification: 4.4·
Noc complex: 3.2·
protein catabolism: 1.8·
proteasome complex (sensu Eukaryota): 5.7·
proteolysis: 9.0·
proteasome core complex (sensu Eukaryota): 9.4·
ubiquitin cycle: 1.1·
histone acetylation: 3.6·
histone acetyltransferase complex: 2.1·
chromatin modification: 5.9·
transcription from RNA polymerase II promoter: 1.4·
translation: 1.1·
ribosome: 1.4·
nuclear mRNA splicing, via spliceosome: 3.5·
spliceosome complex: 3.5·
small nuclear ribonucleoprotein complex: 2.5·
barbed-end actin filament capping: $4.8\cdot10^{-6}$
F-actin capping protein complex: $4.8\cdot10^{-6}$
endocytosis: $1.1\cdot10^{-5}$
cytoskeleton organization and biogenesis: 2.8·
establishment and/or maintenance of chromatin architecture: $1.1\cdot10^{-5}$
chromatin remodeling complex: 4.6·
glycogen metabolism: $3.0\cdot10^{-8}$
protein phosphatase type 1 complex: $3.3\cdot10^{-5}$
sporulation (sensu Fungi): 2.0·
translation: $1.1\cdot10^{-7}$
ribosome: 4.0·
tRNA processing: 2.5·
ribonuclease P complex: $9.2\cdot10^{-8}$
rRNA processing: 2.2·
trehalose biosynthesis: 6.8·
alpha,alpha-trehalose-phosphate synthase complex (UDP-forming): 6.8·
ubiquitin-dependent protein catabolism: 5.2·
pseudohyphal growth: $9.8\cdot10^{-7}$
cAMP-dependent protein kinase complex: 9.6·
proteasome assembly: $3.2\cdot10^{-6}$
protein folding: $3.9\cdot10^{-6}$

DNA damage response modules
[Figure labels: Cytoplasmic ribosome biogenesis; Proteasome; Mitochondrial ribosome – small subunit; Mitochondrial ribosome – large subunit; Spliceosome; Novel actin-localized pathway?; Hsp90; PKA; Trehalose biosynthesis; Ribonuclease P]
Suggests SWS2 as a novel member
Novel pathway enriched with actin-localized proteins; supported in other datasets; similar deletion phenotypes
Comparison with prior work
Combined measure of sensitivity (% of annotations enriched) and specificity (% of modules enriched) with p<0.001
Methods compared:
–Clustering of only expression data
–Clustering expression & network (Hanisch et al., 2002)
–Expression similarity + network connectivity (MATISSE)
–Expression similarity + confident network connectivity (CEZANNE)
Summary
Algorithms using co-expression + networks to detect functionally coherent modules
Accommodate both sparse and dense subnetworks
Subnetworks linked to osmotic shock and DNA damage
A general framework for confident connectivity in PPI networks
The next steps:
▫Co-expression is not the only interesting way to utilize GE data
▫Scaling to complex human datasets