
Networks of Protein Interactions: Network Alignment
Antal Novak, CS 374, Lecture 6, 10/13/2005
Nuke: Scalable and General Pairwise and Multiple Network Alignment. Flannick, Novak, Srinivasan, McAdams, Batzoglou (2005)

Recap: Network Integration
Combine data from multiple sources to obtain robust probabilities of interaction.
Can be performed in a high-throughput manner.
“Whatcha gonna do with it?” Network alignment!

Motivation
Sequence alignment seeks to identify conserved DNA or protein sequence.
Intuition: conservation implies functionality.
EFTPPVQAAYQKVVAGV (human)
DFNPNVQAAFQKVVAGV (pig)
EFTPPVQAAYQKVVAGV (rabbit)

Motivation
By similar intuition, subnetworks conserved across species are likely functional modules.

Network Alignment
“Conserved” means that two subgraphs contain proteins serving similar functions and having similar interaction profiles.
The key word is similar, not identical.
[Figure label: mismatch/substitution]

Earlier approaches: interologs
Interologs: interactions conserved in orthologs.
Orthology is a fuzzy notion.
Sequence similarity is not necessary for conservation of function.

Earlier approaches: PathBLAST (Kelley et al., 2003)
Goal: identify conserved pathways (chains).
Idea: this can be done efficiently by dynamic programming if the networks are DAGs.
[Figure: example path alignment of A, B, C, D against A’, X’, D’, with score = match (A, A’) + gap (B) + mismatch (C, X’) + match (D, D’)]

Earlier approaches: PathBLAST (Kelley et al., 2003)
Problem: networks are neither acyclic nor directed.
Solution: eliminate cycles by imposing a random ordering on the nodes, then perform DP; repeat many times.
In expectation, finds conserved paths of length L within networks of size n in O(L!n) time.
Drawbacks: computationally expensive; restricts the search to a specific topology.
[Figure: three random orderings of a five-node network]
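
The random-ordering trick can be sketched as follows. This is a minimal illustration on a single weighted network, not PathBLAST itself (which runs on a combined alignment graph); the function names, the toy graph, and the number of trials are assumptions made for the example. A given path of L nodes survives a random ordering, in one direction or the other, with probability 2/L!, which is where the L! factor comes from.

```python
import random

def best_path_one_ordering(nodes, edges, L, rng):
    """Best-scoring path with L nodes in the DAG induced by one random ordering."""
    order = list(nodes)
    rng.shuffle(order)
    rank = {v: i for i, v in enumerate(order)}
    # Orient every undirected edge from lower to higher rank: no cycles remain.
    preds = {v: [] for v in nodes}
    for (u, v), w in edges.items():
        a, b = (u, v) if rank[u] < rank[v] else (v, u)
        preds[b].append((a, w))
    # layer[v] = (score, path) of the best path with k nodes ending at v
    layer = {v: (0.0, (v,)) for v in nodes}
    for _ in range(L - 1):
        nxt = {}
        for v in order:
            cands = [(layer[u][0] + w, layer[u][1] + (v,))
                     for u, w in preds[v] if u in layer]
            if cands:
                nxt[v] = max(cands)
        layer = nxt
    return max(layer.values(), default=None)

def best_path(nodes, edges, L, trials=200, seed=0):
    """Repeat the DP over many random orderings and keep the best path found."""
    rng = random.Random(seed)
    found = [best_path_one_ordering(nodes, edges, L, rng) for _ in range(trials)]
    found = [r for r in found if r is not None]
    return max(found, default=None)

# Hypothetical toy network containing a cycle (A-B-C).
nodes = ["A", "B", "C", "D"]
edges = {("A", "B"): 1.0, ("B", "C"): 2.0, ("C", "D"): 1.5, ("A", "C"): 0.5}
score, path = best_path(nodes, edges, L=4)
```

Because the orientation follows the random ranks, every path the DP considers is automatically simple; the price is that many orderings are needed before a specific path is likely to be seen.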

Earlier approaches: MaWISh (Koyuturk et al., 2004)
Goal: identify conserved multi-protein complexes (clique-like structures).
Idea: such structures will likely contain at least one hub (high-degree node).

Earlier approaches: MaWISh (Koyuturk et al., 2004)
Algorithm: start by aligning a pair of homologous hubs, then extend greedily.
Efficient running time, but it also solves only a specific case.

A General Network Aligner: Goals
Overcome the restrictions of existing approaches:
Should extend gracefully to multiple alignment. PathBLAST was extended to 3-way alignment, but the extension scales exponentially in the number of species.
Should not restrict the search to specific network topologies (cliques/pathways).
Must be efficient in running time.

A General Network Aligner: Goals
Useful application for biologists: given a candidate module, align it to a database of networks (“query-to-database”).
[Figure: a query module and a database of networks]

A General Network Aligner: Model
Earlier approaches aligned pairs of nodes.
Instead, treat alignment as an equivalence relation: each equivalence class consists of proteins evolved from a common ancestral protein.
A class can contain multiple proteins in the same species (paralogs).
Handles multiple alignment in an obvious way.
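
One way to hold such an equivalence relation in memory is a union-find structure. The sketch below is only an illustration: the class name, the (species, protein) tuples, and the protein names are hypothetical, and the talk does not say which data structure the aligner uses.

```python
class EquivalenceClasses:
    """Union-find over proteins; each set is one equivalence class."""

    def __init__(self, proteins):
        # Every protein starts as a singleton class.
        self.parent = {p: p for p in proteins}

    def find(self, p):
        # Follow parent links to the class representative (with path halving).
        while self.parent[p] != p:
            self.parent[p] = self.parent[self.parent[p]]
            p = self.parent[p]
        return p

    def merge(self, p, q):
        # Declare p and q descendants of the same ancestral protein.
        rp, rq = self.find(p), self.find(q)
        if rp != rq:
            self.parent[rq] = rp

    def classes(self):
        groups = {}
        for p in self.parent:
            groups.setdefault(self.find(p), set()).add(p)
        return list(groups.values())

# Hypothetical proteins, written as (species, name).
proteins = [("E. coli", "a1"), ("E. coli", "a2"), ("H. pylori", "a"),
            ("E. coli", "b"), ("H. pylori", "b")]
alignment = EquivalenceClasses(proteins)
alignment.merge(("E. coli", "a1"), ("H. pylori", "a"))
alignment.merge(("E. coli", "a2"), ("H. pylori", "a"))   # paralog joins the same class
alignment.merge(("E. coli", "b"), ("H. pylori", "b"))
```

Nothing here limits a class to one protein per species or to two species, which is why paralogs and multiple alignment need no special handling.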

A General Network Aligner: Model
Example: [Figure: a hypothetical ancestral module, its descendants, and the resulting equivalence classes]

A General Network Aligner: Scoring
Probabilistic scoring of alignments:
M: alignment model (network evolved from a common ancestor).
R: random model (nodes and edges picked at random).
Nodes and edges are scored independently.
[Figure: example alignment annotated with node and edge scores]
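
One common way to read a two-model score like this is as a log-odds ratio, which also explains why individual scores can be negative. The formula below is a sketch under that assumption; the symbol A for the alignment and the name S_E for the edge score are introduced here for illustration, while S_N is the node score used later in the talk.

```latex
S(A) \;=\; \log \frac{\Pr(A \mid M)}{\Pr(A \mid R)}
     \;=\; \sum_{[x] \in A} S_N([x]) \;+\; \sum_{\{[x],[y]\} \subseteq A} S_E([x],[y])
```

The split into a node sum and an edge sum follows from scoring nodes and edges independently.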

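A General Network Aligner: Scoring
Node scores: simple weighted Sum-Of-Pairs (SOP).
Each equivalence class is scored as a weighted sum, over pairs (n_i, n_j) of its members, of the pairwise node score, where the weight of a pair is taken from the phylogenetic tree.
[Figure: phylogenetic tree over H. pylori, M. tuberculosis, C. crescentus, and E. coli]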
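
A weighted sum-of-pairs score for one equivalence class can be sketched as below. The pairwise scores, the species-pair weights, and the protein names are all made up for the example; how the weights are actually derived from the tree is not specified in the slides.

```python
from itertools import combinations

def weighted_sop(members, pair_score, pair_weight):
    """Weighted sum-of-pairs over the members of one equivalence class.

    members: list of (species, protein) tuples.
    pair_score(a, b): pairwise node score for two proteins.
    pair_weight(sp_a, sp_b): weight for the two species (hypothetical lookup).
    """
    total = 0.0
    for (sp_a, a), (sp_b, b) in combinations(members, 2):
        total += pair_weight(sp_a, sp_b) * pair_score(a, b)
    return total

# Hypothetical numbers, for illustration only.
scores = {frozenset(("ec_p", "hp_p")): 2.0,
          frozenset(("ec_p", "cc_p")): 1.0,
          frozenset(("hp_p", "cc_p")): 3.0}
weights = {frozenset(("E. coli", "H. pylori")): 0.5,
           frozenset(("E. coli", "C. crescentus")): 0.25,
           frozenset(("H. pylori", "C. crescentus")): 0.25}

members = [("E. coli", "ec_p"), ("H. pylori", "hp_p"), ("C. crescentus", "cc_p")]
class_score = weighted_sop(
    members,
    pair_score=lambda a, b: scores[frozenset((a, b))],
    pair_weight=lambda s, t: weights[frozenset((s, t))],
)
# class_score == 0.5 * 2.0 + 0.25 * 1.0 + 0.25 * 3.0 == 2.0
```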

A General Network Aligner: Scoring
Alignment model: based on BLAST pairwise sequence alignment scores S_ij. Intuition: most proteins descended from a common ancestor have sequence similarity.
Random model: nodes picked at random.

A General Network Aligner: Scoring
Edge scores: more complicated.
Edge scores in earlier aligners rewarded high edge weights, but this biases towards clique-like topology!
We don't want to reward solely conservation either. [Figure: an alignment with highly conserved (zero-weight) edges]
Non-trivial tradeoff in pairwise alignment of full networks.


ESMs: A New Edge-Scoring Paradigm
Idea: assign each node a label from a finite alphabet Σ, and define edge likelihood in terms of the labels it connects.
During alignment, assign the labels which maximize the score.
E: symmetric matrix of probability distributions; E(x, y) is the distribution of edge weights between nodes labeled x and y.
The simplest case is the clique ESM, a 1x1 matrix: Σ contains a single label. This duplicates the edge scoring of aligners which search for cliques.
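
A small sketch of the idea, assuming each entry E(x, y) is a Gaussian given by a (mean, std) pair. The class name, the labels "core" and "periph", the parameters, and the exhaustive search over labelings are all illustrative assumptions; exhaustive search is exponential in the number of nodes and is used here only to keep the example short.

```python
import math
from itertools import product

def gaussian_logpdf(w, mean, std):
    return -0.5 * math.log(2 * math.pi * std ** 2) - (w - mean) ** 2 / (2 * std ** 2)

class ESM:
    """Symmetric matrix of edge-weight distributions, indexed by label pairs."""

    def __init__(self, params):
        # params maps a label pair (x, y) to Gaussian (mean, std); order is ignored.
        self.params = {frozenset(k): v for k, v in params.items()}
        self.alphabet = sorted({x for k in params for x in k})

    def edge_loglik(self, x, y, w):
        mean, std = self.params[frozenset((x, y))]
        return gaussian_logpdf(w, mean, std)

def best_labeling(esm, nodes, weights):
    """Try every labeling and keep the one with the highest total edge log-likelihood."""
    best = None
    for labels in product(esm.alphabet, repeat=len(nodes)):
        assign = dict(zip(nodes, labels))
        score = sum(esm.edge_loglik(assign[u], assign[v], w)
                    for (u, v), w in weights.items())
        if best is None or score > best[0]:
            best = (score, assign)
    return best

# Hypothetical two-label ESM: strong edges touch the core, weak edges join peripherals.
esm = ESM({("core", "core"): (0.9, 0.2),
           ("core", "periph"): (0.9, 0.2),
           ("periph", "periph"): (0.1, 0.2)})
weights = {("c", "l1"): 0.9, ("c", "l2"): 0.9, ("l1", "l2"): 0.1}
score, labels = best_labeling(esm, ["c", "l1", "l2"], weights)

# The clique ESM is the 1x1 special case: a single label, a single distribution.
clique_esm = ESM({("any", "any"): (1.0, 0.2)})
```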

ESMs: A New Edge-Scoring Paradigm
For query-to-database alignment, use a module ESM: one label for each node in the query module.
Tractable because queries are usually small (~10-40 nodes).
For each pair of nodes (n_i, n_j) in the query, let E(i, j) be a Gaussian centered at c_ij, the weight of the (n_i, n_j) edge.
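
Building a module ESM from a query can be sketched as follows. The fixed standard deviation and the choice to center absent query edges at weight 0 are assumptions made for the example, as are the function and node names.

```python
import math
from itertools import combinations

def gaussian_logpdf(w, mean, std):
    return -0.5 * math.log(2 * math.pi * std ** 2) - (w - mean) ** 2 / (2 * std ** 2)

def module_esm(query_nodes, query_weights, std=0.1):
    """One label per query node; E(i, j) is a Gaussian centered at c_ij.

    Pairs with no edge in the query are centered at 0.0 (an assumption).
    """
    esm = {}
    for ni, nj in combinations(query_nodes, 2):
        c_ij = query_weights.get(frozenset((ni, nj)), 0.0)
        esm[frozenset((ni, nj))] = (c_ij, std)
    return esm

def edge_loglik(esm, label_x, label_y, w):
    """Log-likelihood of weight w on an edge between nodes labeled x and y."""
    mean, std = esm[frozenset((label_x, label_y))]
    return gaussian_logpdf(w, mean, std)

# Hypothetical three-node query module.
query_nodes = ["q1", "q2", "q3"]
query_weights = {frozenset(("q1", "q2")): 0.8, frozenset(("q2", "q3")): 0.6}
esm = module_esm(query_nodes, query_weights)
```

A database edge scores best when its weight matches the corresponding query edge, so the query's own topology is rewarded rather than a fixed shape such as a clique.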

ESMs: A New Edge-Scoring Paradigm
Multiple alignment gives us more information about conservation.
Can iteratively improve the ESM, adjusting the mean and deviation based on the weights of edges between aligned pairs of query nodes.
Easily implemented using kernel density estimation (KDE).
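
A hand-rolled Gaussian KDE is enough to illustrate the update. In this sketch the sample weights and the bandwidth are invented; the samples stand for the weights observed on one query edge and on the edges aligned to it in other species.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function: the average of Gaussians centered on the samples."""
    n = len(samples)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(w):
        return sum(math.exp(-0.5 * ((w - s) / bandwidth) ** 2) for s in samples) / norm

    return density

# Hypothetical weights: the query edge (0.8) plus the edges aligned to it so far.
observed = [0.8, 0.7, 0.75]
density = gaussian_kde(observed, bandwidth=0.1)

# Numerical sanity check that the estimate is a proper density on [-1, 2].
step = 0.001
total_mass = sum(density(-1.0 + k * step) * step for k in range(3000))
```

Each newly aligned edge adds one sample, so the estimated distribution for E(i, j) tightens or widens as more species are aligned.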

A General Network Aligner: Algorithm
Given this model of network alignment and scoring framework, how do we efficiently find alignments between a pair of networks (N_1, N_2)?
Constructing every possible set of equivalence classes is clearly prohibitive.

A General Network Aligner: Algorithm
Idea: seeded alignment, inspired by seeded sequence alignment (BLAST).
Identify regions of the network in which “good” alignments are likely to be found.
MaWISh does this, using high-degree nodes for seeds.
Can we avoid such strong topological constraints?
Two stages: seed, then extend.

d-Clusters: Intuition
“Good” alignments typically have:
a significant number of nodes with high sequence similarity (implied by the node scoring function, which prefers aligning nodes with high BLAST scores);
mostly conserved connected components (implied by the edge scoring function, which prefers conserved edge weights).

d-Clusters
Define D(n), the d-cluster of node n, as the d “closest” nodes to n.
Distance is defined in terms of edge weights.
[Figure: the d-cluster of a node n for d = 4]
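
A d-cluster can be computed with a truncated Dijkstra search. Two assumptions are made in this sketch, since the slides do not give the exact distance: an edge of weight w (an interaction probability) is given length -log(w), and n itself counts as one of the d closest nodes. The graph and names are hypothetical.

```python
import heapq
import math

def d_cluster(adj, n, d):
    """Return the d nodes closest to n (n included), nearest first.

    adj: dict mapping node -> {neighbor: edge weight in (0, 1]}.
    Edge length is -log(weight), an assumption made for this sketch.
    """
    dist = {}
    heap = [(0.0, n)]
    while heap and len(dist) < d:
        cost, u = heapq.heappop(heap)
        if u in dist:
            continue            # already settled via a shorter route
        dist[u] = cost
        for v, w in adj[u].items():
            if v not in dist and w > 0:
                heapq.heappush(heap, (cost - math.log(w), v))
    return list(dist)

# Hypothetical network around node "n".
adj = {
    "n": {"a": 0.9, "b": 0.5},
    "a": {"n": 0.9, "c": 0.9},
    "b": {"n": 0.5},
    "c": {"a": 0.9, "e": 0.2},
    "e": {"c": 0.2},
}
cluster = d_cluster(adj, "n", d=4)
```

Note that "c" is not adjacent to "n" but still enters the cluster ahead of the direct neighbor "b", because two high-confidence edges are closer than one weak edge.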

d-Clusters
Expect the majority of high-scoring alignments to contain a pair of d-clusters (D(n_i), D(n_j)) such that a greedy matching scores at least T, for suitably chosen d and T.
Can optimize d and T for a user-specified expected sensitivity.
[Figure: d-clusters of n_i and n_j with d = 4 and T = 7; matched pairs score 3.5, 2.8, and 1.6, giving running matching scores of 3.5, 6.3, and 7.9]

d-Clusters
Seeding algorithm: for each n_i ∈ N_1 and n_j ∈ N_2, emit (n_i, n_j) as a seed if the matching score of (D(n_i), D(n_j)) exceeds T.
[Figure: seed (n_i, n_j) with matched pair scores 3.5, 1.6, and 2.8]
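
The greedy matching and the seeding loop can be sketched as below, reusing the numbers from the figure (3.5, 2.8, 1.6, T = 7). The function names, the node names, the extra 1.0 score, and the rule of ignoring non-positive pairs are assumptions for the example; the threshold is applied as "at least T".

```python
def greedy_matching_score(cluster_a, cluster_b, node_score):
    """Greedily pair up nodes of two d-clusters, best pair first, and sum the scores."""
    pairs = sorted(((node_score(x, y), x, y) for x in cluster_a for y in cluster_b),
                   reverse=True)
    used_a, used_b, total = set(), set(), 0.0
    for s, x, y in pairs:
        if s <= 0:
            break                       # remaining pairs cannot help
        if x not in used_a and y not in used_b:
            used_a.add(x)
            used_b.add(y)
            total += s
    return total

def find_seeds(clusters_1, clusters_2, node_score, T):
    """Emit (n_i, n_j) whenever the d-clusters of n_i and n_j match with score >= T."""
    return [(ni, nj)
            for ni, ca in clusters_1.items()
            for nj, cb in clusters_2.items()
            if greedy_matching_score(ca, cb, node_score) >= T]

# Hypothetical pairwise node scores; unlisted pairs score 0.
scores = {("a1", "b1"): 3.5, ("a2", "b2"): 2.8, ("a3", "b3"): 1.6, ("a1", "b2"): 1.0}
node_score = lambda x, y: scores.get((x, y), 0.0)

clusters_1 = {"a1": ["a1", "a2", "a3", "a4"]}   # D(n_i) for each n_i in N_1
clusters_2 = {"b1": ["b1", "b2", "b3", "b4"]}   # D(n_j) for each n_j in N_2
total = greedy_matching_score(clusters_1["a1"], clusters_2["b1"], node_score)
seeds = find_seeds(clusters_1, clusters_2, node_score, T=7)
```

Looping over all pairs of nodes is quadratic in network size; the sketch makes no attempt to prune that loop.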

Extending seeds
Given a pair of d-cluster seeds (D(n_i), D(n_j)), we want to find the highest-scoring alignment containing this seed.
Start by forming an equivalence class consisting of x ∈ D(n_i) and y ∈ D(n_j) maximizing S_N(x, y).
All other m ∈ N_1 ∪ N_2 are singleton equivalence classes.
[Figure: the seed d-clusters of n_i and n_j, with the best-scoring pair (x, y) highlighted]

Extending seeds
Extend greedily:
Define the frontier (F) as the set of all already-aligned nodes and their neighbors in each network.
Pick nodes s, t ∈ F and a label L ∈ Σ which maximally increase the alignment score:
merge equivalence classes [s] and [t];
relabel the resulting equivalence class to L.
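
The greedy loop can be sketched in a deliberately simplified form. Assumptions in this sketch: the alignment is pairwise and one-to-one (so already-aligned nodes are not merged again and paralogs are not handled), there is a single label (so relabeling is trivial), and the node and edge score functions are made up. It is meant to show the frontier-and-merge structure, not the actual aligner.

```python
def extend_seed(adj1, adj2, node_score, edge_score, seed_pair):
    """Greedily grow a one-to-one alignment outward from a seed pair."""
    aligned = [seed_pair]
    used1, used2 = {seed_pair[0]}, {seed_pair[1]}
    while True:
        # Frontier: unaligned neighbors of already-aligned nodes in each network.
        front1 = {v for s, _ in aligned for v in adj1[s]} - used1
        front2 = {v for _, t in aligned for v in adj2[t]} - used2
        best_gain, best_pair = 0.0, None
        for s in sorted(front1):
            for t in sorted(front2):
                # Gain = node score plus edge scores against every aligned pair.
                gain = node_score(s, t) + sum(
                    edge_score(adj1[s].get(s2, 0.0), adj2[t].get(t2, 0.0))
                    for s2, t2 in aligned)
                if gain > best_gain:
                    best_gain, best_pair = gain, (s, t)
        if best_pair is None:
            return aligned              # no merge increases the score
        aligned.append(best_pair)
        used1.add(best_pair[0])
        used2.add(best_pair[1])

def edge_score(w1, w2):
    # Hypothetical: reward conserved weights, ignore pairs with no edge on either side.
    if w1 == 0.0 and w2 == 0.0:
        return 0.0
    return 1.0 - 2.0 * abs(w1 - w2)

# Hypothetical networks: a-b-c (plus a stray node d) versus x-y-z.
adj1 = {"a": {"b": 0.9}, "b": {"a": 0.9, "c": 0.8, "d": 0.5},
        "c": {"b": 0.8}, "d": {"b": 0.5}}
adj2 = {"x": {"y": 0.9}, "y": {"x": 0.9, "z": 0.8}, "z": {"y": 0.8}}
sim = {("a", "x"): 2.0, ("b", "y"): 1.5, ("c", "z"): 1.0}
node_score = lambda s, t: sim.get((s, t), -1.0)

aligned = extend_seed(adj1, adj2, node_score, edge_score, ("a", "x"))
```

The stray node "d" is never aligned because no merge involving it increases the score, so it stays a singleton class.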

Multiple Alignment
Progressive alignment technique, used by most multiple sequence aligners.
Simple modification of the implementation to align alignments rather than networks.
Node scoring already uses weighted SOP.
Edge scoring remains unchanged.
[Figure: progressive alignment of M. tuberculosis, E. coli, and C. crescentus]

Resulting Alignments

Pairwise alignments
[Figures: cell division; polysaccharide transport; DNA uptake]

Multiple alignments
[Figures: DNA uptake; DNA replication]

11-way alignment
[Figure: ribosome]

Comparison to Extant Methods

Pairwise Full Network

E. coli versus C. jejuni
                KEGGs Hit   Enrichment   # Nodes Aligned   CPU Time (sec)
MaWISh          13          81.3%        343               20.2
NetworkBLAST    24          82.4%        701               1,353.6
Nuke            25          86.7%        786               12.4

E. coli versus H. pylori
                KEGGs Hit   Enrichment   # Nodes Aligned   CPU Time (sec)
MaWISh          9           83.3%        218               4.7
NetworkBLAST    14          82.9%        443               329.5
Nuke            14          80.0%        698               10.7

Pairwise Query-to-Database

E. coli versus C. jejuni
                KEGGs Hit   Coverage   CPU Time (sec)
MaWISh          20          47%        22.9
NetworkBLAST    17          48%        3,674.8
Nuke            21          50%        14.9

E. coli versus H. pylori
                KEGGs Hit   Coverage   CPU Time (sec)
MaWISh          10          65%        20.3
NetworkBLAST    9           51%        3,515.0
Nuke            10          60%        11.6

Multiple Alignment (3-way)

                   NetworkBLAST   Nuke
CPU Time           58,801.6       191.8

E. coli versus C. jejuni
KEGGs Hit          20             24
Enrichment         88.2%          89.6%
# Nodes Aligned    1018           950

E. coli versus H. pylori
KEGGs Hit          14             15
Enrichment         90.4%          88.4%
# Nodes Aligned    979            881
