Presentation is loading. Please wait.

Presentation is loading. Please wait.

Networks of Protein Interactions Network Alignment Antal Novak CS 374 Lecture 6 10/13/2005 Nuke: Scalable and General Pairwise and Multiple Network Alignment.

Similar presentations


Presentation on theme: "Networks of Protein Interactions Network Alignment Antal Novak CS 374 Lecture 6 10/13/2005 Nuke: Scalable and General Pairwise and Multiple Network Alignment."— Presentation transcript:

1 Networks of Protein Interactions Network Alignment Antal Novak CS 374 Lecture 6 10/13/2005 Nuke: Scalable and General Pairwise and Multiple Network Alignment Flannick, Novak, Srinivasan, McAdams, Batzoglou (2005)

2 Recap Network Integration Combine data from multiple sources to obtain robust probabilities of interaction Can be performed in a high-throughput manner “Whatcha gonna do with it?” Network alignment!

3 Sequence alignment seeks to identify conserved DNA or protein sequence Intuition: conservation implies functionality EFTPPVQAAYQKVVAGV (human) DFNPNVQAAFQKVVAGV (pig) EFTPPVQAAYQKVVAGV (rabbit) Motivation

4 By similar intuition, subnetworks conserved across species are likely functional modules Motivation

5 Network Alignment “Conserved” means two subgraphs contain proteins serving similar functions, having similar interaction profiles Key word is similar, not identical mismatch/substitution

6 Earlier approaches: interologs Interactions conserved in orthologs Orthology is a fuzzy notion Sequence similarity not necessary for conservation of function

7 Goal: identify conserved pathways (chains) Idea: can be done efficiently by dynamic programming if networks are DAGs Kelley et al (2003) D D’ + match Earlier approaches: PathBLAST C X’X’ + mismatch B + gap A A’A’ Score: match

8 Problem: Networks are neither acyclic nor directed Solution: eliminate cycles by imposing random ordering on nodes, perform DP; repeat many times In expectation, finds conserved paths of length L within networks of size n in O(L!n) time Drawbacks Computationally expensive Restricts search to specific topology Kelley et al (2003) Earlier approaches: PathBLAST 142 35 214 53 521 34

9 Goal: identify conserved multi-protein complexes (clique-like structures) Idea: such structures will likely contain at least one hub (high-degree node) Koyuturk et al (2004) Earlier approaches: MaWISh

10 Algorithm: start by aligning a pair of homologous hubs, extend greedily Koyuturk et al (2004) Efficient running time, but also only solves a specific case Earlier approaches: MaWISh

11 A General Network Aligner: Goals Solve restrictions of existing approaches Should extend gracefully to multiple alignment PathBLAST was extended to 3-way alignment, but extension scales exponentially in number of species Should not restrict search to specific network topologies (cliques/pathways) Must be efficient in running time

12 A General Network Aligner: Goals Useful application for biologists: given a candidate module, align to a database of networks (“query-to-database”) Query:Database:

13 Earlier approaches aligned pairs of nodes Instead, alignment as an equivalence relation: equivalence class consists of proteins evolved from a common ancestral protein Can contain multiple proteins in same species (paralogs) Handles multiple alignment in an obvious way { paralog A General Network Aligner: Model

14 Example: hypothetical ancestral module descendants equivalence classes A General Network Aligner: Model

15 Probabilistic scoring of alignments: M : Alignment model (network evolved from a common ancestor) R : Random model (nodes and edges picked at random) Nodes and edges scored independently 2.5 4.01.5 3.0 0.8 0.4 -0.4 0.8 1.2 -0.3 0.60.5 0.6 -0.2 A General Network Aligner: Scoring

16 Node scores: simple Weighted Sum-Of-Pairs (SOP) Each equivalence class scored as sum (over pairs n i, n j ) of, where is weight on phylogenetic tree H. pyloriM. tuberculosisC. crescentus 231 E. coli 4 A General Network Aligner: Scoring

17 Alignment model Based on BLAST pairwise sequence alignment scores S ij Intuition: most proteins descended from common ancestor have sequence similarity Random model Nodes picked at random A General Network Aligner: Scoring

18 Edge scores: more complicated Edge scores in earlier aligners rewarded high edge weights But this biases towards clique-like topology! Don’t want solely conservation either This alignment has highly conserved (zero-weight) edges: Non-trivial tradeoff in pairwise alignment of full networks A General Network Aligner: Scoring

19 Idea: assign each node a label from a finite alphabet ∑, and define edge likelihood in terms of labels it connects During alignment, assign labels which maximize score E: Symmetric matrix of probability distributions, E(x, y) is distribution of edge weights between nodes labeled x and y ESMs: A New Edge-Scoring Paradigm

20 Idea: assign each node a label from a finite alphabet ∑, and define edge likelihood in terms of labels it connects During alignment, assign labels which maximize score E: Symmetric matrix of probability distributions, E(x, y) is distribution of edge weights between nodes labeled x and y Simplest case is clique ESM 1x1 matrix: ∑ contains a single label Duplicates edge-scoring of aligners which search for cliques ESMs: A New Edge-Scoring Paradigm

21 For query-to-database alignment, use a module ESM One label for each node in query module Tractable because queries are usually small (~10-40 nodes) For each pair of nodes (n i, n j ) in query, let E(i, j) be a Gaussian centered at c ij = weight of (n i, n j ) edge ESMs: A New Edge-Scoring Paradigm

22 Multiple alignment gives us more information about conservation Can iteratively improve ESM to adjust mean and deviation based on weights of edges between aligned pairs of query nodes Easily implemented using kernel density estimation (KDE) ESMs: A New Edge-Scoring Paradigm

23 Given this model of network alignment and scoring framework, how to efficiently find alignments between a pair of networks (N 1, N 2 )? Constructing every possible set of equivalence classes clearly prohibitive A General Network Aligner: Algorithm

24 Idea: seeded alignment Inspired by seeded sequence alignment (BLAST) Identify regions of network in which “good” alignments likely to be found MaWISh does this, using high-degree nodes for seeds Can we avoid such strong topological constraints? Seed Extend A General Network Aligner: Algorithm

25 d-Clusters: Intuition “Good” alignments typically have: a significant number of nodes with high sequence similarity Implied by the node scoring function, which prefers aligning nodes with high BLAST scores with mostly conserved connected components Implied by the edge scoring function which prefers conserved edge weights

26 d-Clusters Define D(n), the d-cluster of node n as the d “closest” nodes to n Distance defined in terms of edge weights n d = 4

27 d-Clusters Expect the majority of high-scoring alignments to contain a pair of d-clusters (D(n i ), D(n j )) such that a greedy matching scores at least T for suitably chosen d and T Can optimize d and T for user-specified expected sensitivity nini njnj d = 4 T = 7 3.5 1.6 2.8 Matching score: 3.56.37.9

28 d-Clusters Seeding algorithm: for each n i  N 1 and n j  N 2, emit (n i, n j ) as a seed if matching score exceeds T njnj 3.5 1.6 2.8 Seed: nini

29 Extending seeds Given a pair of d-cluster seeds (D(n i ), D(n j )), want to find highest-scoring alignment containing this seed Start by forming an equivalence class consisting of x  D(n i ) and y  D(n j ) maximizing S N (x, y) All other m  N 1  N 2 are singleton equivalence classes njnj 3.5 1.6 2.8 nini yx

30 Extending seeds Extend greedily: Define the frontier (F) as the set of all already-aligned nodes and their neighbors in each network Picking nodes s, t  F, and label L  ∑, which maximally increase alignment score: Merge equivalence classes [s] and [t] Relabel the resulting equivalence class to L

31 Multiple Alignment Progressive alignment technique Used by most multiple sequence aligners Simple modification of implementation to align alignments rather than networks Node scoring already uses weighted SOP Edge scoring remains unchanged M. tuberculosisE. coliC. crescentus

32 Resulting Alignments

33 Pairwise alignments Cell division Polysaccharide transport DNA uptake

34 Multiple alignments DNA uptake DNA replication

35 11-way Ribosome

36 Comparison to Extant Methods

37 Pairwise Full Network KEGGs Hit Enrichment # Nodes Aligned CPU Time (sec) MaWISh1381.3%34320.2 NetworkBLAST2482.4%7011,353.6 Nuke2586.7%78612.4 KEGGs Hit Enrichment # Nodes Aligned CPU Time (sec) MaWISh983.3%2184.7 NetworkBLAST1482.9%443329.5 Nuke1480.0%69810.7 E. coli versus C. jejuni E. coli versus H. pylori

38 Pairwise Query-to-Database KEGGs Hit Coverage CPU Time (sec) MaWISh2047%22.9 NetworkBLAST1748%3,674.8 Nuke2150%14.9 KEGGs Hit Coverage CPU Time (sec) MaWISh1065%20.3 NetworkBLAST951%3,515.0 Nuke1060%11.6 E. coli versus C. jejuni E. coli versus H. pylori

39 E. coli versus C. jejuni E. coli versus H. pylori Multiple Alignment (3-way) NetworkBLASTNuke CPU Time58,801.6191.8 KEGGs Hit2024 Enrichment88.2%89.6% # Nodes Aligned1018950 KEGGs Hit1415 Enrichment90.4%88.4% # Nodes Aligned979881


Download ppt "Networks of Protein Interactions Network Alignment Antal Novak CS 374 Lecture 6 10/13/2005 Nuke: Scalable and General Pairwise and Multiple Network Alignment."

Similar presentations


Ads by Google