Networks of Protein Interactions Network Alignment Antal Novak CS 374 Lecture 6 10/13/2005 Nuke: Scalable and General Pairwise and Multiple Network Alignment.

Presentation transcript:

Networks of Protein Interactions: Network Alignment. Antal Novak, CS 374 Lecture 6, 10/13/2005. Nuke: Scalable and General Pairwise and Multiple Network Alignment. Flannick, Novak, Srinivasan, McAdams, Batzoglou (2005).

Recap: Network Integration. Combine data from multiple sources to obtain robust probabilities of interaction. Can be performed in a high-throughput manner. “Whatcha gonna do with it?” Network alignment!

Motivation: Sequence alignment seeks to identify conserved DNA or protein sequence. Intuition: conservation implies functionality. Example: EFTPPVQAAYQKVVAGV (human), DFNPNVQAAFQKVVAGV (pig), EFTPPVQAAYQKVVAGV (rabbit).

Motivation: By similar intuition, subnetworks conserved across species are likely functional modules.

Network Alignment: “Conserved” means two subgraphs contain proteins serving similar functions and having similar interaction profiles. The key word is similar, not identical: mismatches/substitutions can occur.

Earlier approaches: interologs. Interactions conserved in orthologs. Orthology is a fuzzy notion. Sequence similarity is not necessary for conservation of function.

Earlier approaches: PathBLAST (Kelley et al., 2003). Goal: identify conserved pathways (chains). Idea: can be done efficiently by dynamic programming if networks are DAGs. [Figure: two aligned paths, A with A', B with a gap, C with X', D with D'. Score: match + gap + mismatch + match.]

Earlier approaches: PathBLAST (Kelley et al., 2003). Problem: networks are neither acyclic nor directed. Solution: eliminate cycles by imposing a random ordering on nodes, perform DP; repeat many times. In expectation, finds conserved paths of length L within networks of size n in O(L!n) time. Drawbacks: computationally expensive; restricts search to a specific topology.
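The random-ordering trick can be illustrated on a single weighted graph. The sketch below is not PathBLAST itself (which runs on an alignment graph of two networks); the function name, the edge-score dictionary, and the trial count are assumptions made for illustration. Each trial shuffles the nodes, keeps only edges that go from lower to higher rank (which makes the graph acyclic), and runs a layered DP for the best path with a fixed number of nodes.

```python
import random

def best_path_random_orderings(nodes, edges, length, trials=200, seed=0):
    """Search for a high-scoring simple path with `length` nodes.

    edges maps frozenset({u, v}) -> score (undirected, no self-loops).
    Returns (score, path) for the best path found over all trials.
    """
    rng = random.Random(seed)
    adj = {n: [] for n in nodes}
    for pair, w in edges.items():
        u, v = tuple(pair)
        adj[u].append((v, w))
        adj[v].append((u, w))

    best = (float("-inf"), None)
    for _ in range(trials):
        # A random ordering orients every edge from lower to higher rank.
        order = list(nodes)
        rng.shuffle(order)
        rank = {n: i for i, n in enumerate(order)}

        # layer[n] = (score, path) of the best rank-increasing path ending at n.
        layer = {n: (0.0, (n,)) for n in nodes}
        for _ in range(1, length):
            nxt = {}
            for n in order:
                for m, w in adj[n]:
                    if rank[m] < rank[n] and m in layer:
                        s, p = layer[m]
                        cand = (s + w, p + (n,))
                        if n not in nxt or cand[0] > nxt[n][0]:
                            nxt[n] = cand
            layer = nxt
        for cand in layer.values():
            if cand[0] > best[0]:
                best = cand
    return best
```

Because paths must increase in rank, they never revisit a node; a given path is only visible in the trials whose ordering happens to agree with it, which is why many repetitions are needed.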

Earlier approaches: MaWISh (Koyuturk et al., 2004). Goal: identify conserved multi-protein complexes (clique-like structures). Idea: such structures will likely contain at least one hub (high-degree node).

Earlier approaches: MaWISh (Koyuturk et al., 2004). Algorithm: start by aligning a pair of homologous hubs, then extend greedily. Efficient running time, but it also solves only a specific case.

A General Network Aligner: Goals. Address the restrictions of existing approaches. Should extend gracefully to multiple alignment: PathBLAST was extended to 3-way alignment, but the extension scales exponentially in the number of species. Should not restrict the search to specific network topologies (cliques/pathways). Must be efficient in running time.

A General Network Aligner: Goals. A useful application for biologists: given a candidate module, align it to a database of networks (“query-to-database”). [Figure: a query module and a database of networks.]

A General Network Aligner: Model. Earlier approaches aligned pairs of nodes. Instead, treat alignment as an equivalence relation: an equivalence class consists of proteins evolved from a common ancestral protein. A class can contain multiple proteins in the same species (paralogs). This handles multiple alignment in an obvious way.

A General Network Aligner: Model. Example [figure]: a hypothetical ancestral module, its descendants, and the resulting equivalence classes.
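One way to picture alignment as an equivalence relation is a union-find structure over proteins. This is only an illustrative sketch: the class name `Alignment` and the (species, protein) tuple encoding are assumptions, not something the slides specify.

```python
class Alignment:
    """Equivalence classes over proteins, stored as a union-find forest."""

    def __init__(self, proteins):
        # Every protein starts in its own singleton class.
        self.parent = {p: p for p in proteins}

    def find(self, p):
        # Follow parent links to the class representative (with path halving).
        while self.parent[p] != p:
            self.parent[p] = self.parent[self.parent[p]]
            p = self.parent[p]
        return p

    def merge(self, a, b):
        # Declare that a and b descend from the same ancestral protein.
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def classes(self):
        groups = {}
        for p in self.parent:
            groups.setdefault(self.find(p), set()).add(p)
        return list(groups.values())
```

Nothing in this representation limits a class to one protein per species, so paralogs and alignments of more than two species need no special handling.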

A General Network Aligner: Scoring. Probabilistic scoring of alignments uses two models. M: alignment model (network evolved from a common ancestor). R: random model (nodes and edges picked at random). Nodes and edges are scored independently.

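A General Network Aligner: Scoring. Node scores: simple weighted Sum-Of-Pairs (SOP). Each equivalence class is scored as a sum, over pairs (n_i, n_j), of the pairwise node score multiplied by a weight taken from the phylogenetic tree. [Figure: phylogenetic tree over H. pylori, M. tuberculosis, C. crescentus and E. coli, with numeric labels 2, 3, 1, 4.]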
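A weighted sum-of-pairs score can be sketched as follows. The slide's exact formula did not survive in the transcript, so this shows only the generic weighted-SOP pattern; the dictionary-based inputs, the default weight of 1.0 for pairs without a listed weight (for example paralogs within one species), and the default score of 0.0 are all assumptions.

```python
from itertools import combinations

def weighted_sop(eq_class, pair_score, species_weight):
    """Weighted sum-of-pairs score of one equivalence class.

    eq_class: iterable of (species, protein) tuples.
    pair_score: dict frozenset({protein_a, protein_b}) -> pairwise node score.
    species_weight: dict frozenset({species_a, species_b}) -> tree-derived weight.
    """
    total = 0.0
    for (sp_a, prot_a), (sp_b, prot_b) in combinations(sorted(eq_class), 2):
        # Assumption: unlisted species pairs get weight 1.0, unlisted
        # protein pairs get score 0.0.
        weight = species_weight.get(frozenset((sp_a, sp_b)), 1.0)
        score = pair_score.get(frozenset((prot_a, prot_b)), 0.0)
        total += weight * score
    return total
```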

A General Network Aligner: Scoring. Alignment model: based on BLAST pairwise sequence alignment scores S_ij. Intuition: most proteins descended from a common ancestor have sequence similarity. Random model: nodes picked at random.

A General Network Aligner: Scoring. Edge scores: more complicated. Edge scores in earlier aligners rewarded high edge weights, but this biases towards clique-like topology! Scoring solely by conservation is not wanted either: [figure] this alignment has highly conserved (zero-weight) edges. Non-trivial tradeoff in pairwise alignment of full networks.

ESMs: A New Edge-Scoring Paradigm. Idea: assign each node a label from a finite alphabet Σ, and define the edge likelihood in terms of the labels it connects. During alignment, assign the labels which maximize the score. E: symmetric matrix of probability distributions; E(x, y) is the distribution of edge weights between nodes labeled x and y. The simplest case is the clique ESM, a 1x1 matrix: Σ contains a single label. This duplicates the edge scoring of aligners which search for cliques.

ESMs: A New Edge-Scoring Paradigm. For query-to-database alignment, use a module ESM: one label for each node in the query module. Tractable because queries are usually small (~10-40 nodes). For each pair of nodes (n_i, n_j) in the query, let E(i, j) be a Gaussian centered at c_ij = weight of the (n_i, n_j) edge.
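A module ESM of this kind can be sketched as a symmetric table of Gaussian parameters. The function names and the fixed standard deviation `sd` are assumptions; the slide states only that each E(i, j) is a Gaussian centered at the query edge weight.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution N(mean, sd^2) at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def build_module_esm(query_edges, sd=0.1):
    """Build E from query edges {(i, j): weight}; sd is an assumed spread."""
    esm = {}
    for (i, j), c_ij in query_edges.items():
        # E is symmetric: E(i, j) and E(j, i) share one distribution.
        esm[(i, j)] = esm[(j, i)] = (c_ij, sd)
    return esm

def edge_log_likelihood(esm, label_x, label_y, observed_weight):
    """Log-likelihood of an observed edge weight between labels x and y."""
    mean, sd = esm[(label_x, label_y)]
    return gaussian_logpdf(observed_weight, mean, sd)
```

An aligned database edge whose weight is close to the corresponding query edge weight gets a high likelihood, whether that query weight is high or low, which is what separates this from scoring that simply rewards heavy edges.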

ESMs: A New Edge-Scoring Paradigm. Multiple alignment gives us more information about conservation. Can iteratively improve the ESM, adjusting mean and deviation based on the weights of edges between aligned pairs of query nodes. Easily implemented using kernel density estimation (KDE).
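Kernel density estimation itself is short enough to show directly. The sketch below is a plain Gaussian KDE in one dimension; how the lecture's implementation chooses the bandwidth or folds the estimate back into the ESM is not stated, so the `bandwidth` parameter and the function name are assumptions.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from observed edge weights."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Average of Gaussian bumps, one centered on each observed weight.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density
```

Here `samples` would be the weights of the edges aligned to one query edge across species; the resulting density could stand in for the single Gaussian initially placed at that entry of E.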

A General Network Aligner: Algorithm. Given this model of network alignment and this scoring framework, how can we efficiently find alignments between a pair of networks (N_1, N_2)? Constructing every possible set of equivalence classes is clearly prohibitive.

A General Network Aligner: Algorithm. Idea: seeded alignment (seed, then extend), inspired by seeded sequence alignment (BLAST). Identify regions of the network in which “good” alignments are likely to be found. MaWISh does this, using high-degree nodes for seeds. Can we avoid such strong topological constraints?

d-Clusters: Intuition. “Good” alignments typically have a significant number of nodes with high sequence similarity (implied by the node scoring function, which prefers aligning nodes with high BLAST scores), with mostly conserved connected components (implied by the edge scoring function, which prefers conserved edge weights).

d-Clusters. Define D(n), the d-cluster of node n, as the d “closest” nodes to n. Distance is defined in terms of edge weights. [Figure: the d-cluster of a node n for d = 4.]
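A d-cluster can be computed with a truncated Dijkstra search. The slide says only that distance is defined in terms of edge weights; the sketch assumes weights are interaction probabilities in (0, 1], takes -log(weight) as edge length, and counts n itself as its own closest node. All three choices, and the function name, are assumptions.

```python
import heapq
import math

def d_cluster(adj, start, d):
    """Return the d closest nodes to `start`, nearest first.

    adj: dict node -> dict neighbor -> weight in (0, 1].
    """
    dist = {}
    heap = [(0.0, start)]
    while heap and len(dist) < d:
        cost, node = heapq.heappop(heap)
        if node in dist:
            continue  # already settled with a shorter distance
        dist[node] = cost
        for nbr, w in adj[node].items():
            if nbr not in dist:
                # Assumed edge length: -log(weight), so reliable edges are short.
                heapq.heappush(heap, (cost - math.log(w), nbr))
    return list(dist)
```

The search stops as soon as d nodes are settled, so the cost per node depends on the local neighborhood rather than on the size of the whole network.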

d-Clusters. Expect the majority of high-scoring alignments to contain a pair of d-clusters (D(n_i), D(n_j)) such that a greedy matching scores at least T, for suitably chosen d and T. Can optimize d and T for a user-specified expected sensitivity. [Figure: d-clusters of n_i and n_j with d = 4 and their greedy matching score.]

d-Clusters. Seeding algorithm: for each n_i ∈ N_1 and n_j ∈ N_2, emit (n_i, n_j) as a seed if the matching score exceeds T. [Figure: a seed pair (n_i, n_j).]
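The seeding step can be sketched as a greedy matching between two d-clusters followed by a threshold test. The helper names, the `node_score` callback, and the rule of ignoring non-positive pair scores are assumptions; the slides do not spell out the matching procedure.

```python
def greedy_matching_score(cluster_a, cluster_b, node_score):
    """Greedily pair nodes across two d-clusters and sum the pair scores."""
    pairs = sorted(
        ((node_score(a, b), a, b) for a in cluster_a for b in cluster_b),
        reverse=True,
    )
    used_a, used_b, total = set(), set(), 0.0
    for score, a, b in pairs:
        if score <= 0:
            break  # assumption: non-positive pairs never help the matching
        if a in used_a or b in used_b:
            continue  # each node is matched at most once
        used_a.add(a)
        used_b.add(b)
        total += score
    return total

def find_seeds(clusters1, clusters2, node_score, threshold):
    """Emit (n_i, n_j) whenever the matching score of D(n_i), D(n_j) exceeds T."""
    return [
        (ni, nj)
        for ni, ca in clusters1.items()
        for nj, cb in clusters2.items()
        if greedy_matching_score(ca, cb, node_score) > threshold
    ]
```

A greedy matching is not guaranteed to be the maximum-weight matching, but it is cheap, which matters because the test runs for every pair of nodes across the two networks.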

Extending seeds. Given a pair of d-cluster seeds (D(n_i), D(n_j)), we want to find the highest-scoring alignment containing this seed. Start by forming an equivalence class consisting of the x ∈ D(n_i) and y ∈ D(n_j) that maximize S_N(x, y). All other m ∈ N_1 ∪ N_2 are singleton equivalence classes.

Extending seeds. Extend greedily. Define the frontier (F) as the set of all already-aligned nodes and their neighbors in each network. Pick the nodes s, t ∈ F and the label L ∈ Σ which maximally increase the alignment score, then merge the equivalence classes [s] and [t] and relabel the resulting equivalence class to L.
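The greedy extension loop can be sketched in a deliberately simplified form. In this sketch each step pairs one unaligned frontier node from network 1 with one from network 2, whereas the slide allows merging the classes of any two frontier nodes (including already-aligned ones and paralogs). The `gain` callback, which stands in for the change in alignment score, and all function names are assumptions.

```python
def frontier(aligned, adj):
    """Aligned nodes plus all of their neighbors in one network."""
    nodes = set(aligned)
    for n in aligned:
        nodes.update(adj.get(n, ()))
    return nodes

def extend_seed(seed, adj1, adj2, labels, gain):
    """Greedily grow an alignment from seed = (x, y, label).

    gain(pairs, s, t, label) returns the score change of adding (s, t, label)
    to the current list of aligned triples; extension stops when no
    candidate has positive gain.
    """
    pairs = [seed]
    while True:
        used1 = {s for s, _, _ in pairs}
        used2 = {t for _, t, _ in pairs}
        # Simplification: only not-yet-aligned frontier nodes are candidates.
        cand1 = frontier(used1, adj1) - used1
        cand2 = frontier(used2, adj2) - used2
        best = None
        for s in sorted(cand1):
            for t in sorted(cand2):
                for label in labels:
                    g = gain(pairs, s, t, label)
                    if g > 0 and (best is None or g > best[0]):
                        best = (g, s, t, label)
        if best is None:
            return pairs
        pairs.append(best[1:])
```

Restricting candidates to the frontier keeps each step local to the region around the seed instead of scanning both networks.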

Multiple Alignment. Progressive alignment technique, as used by most multiple sequence aligners. Simple modification of the implementation to align alignments rather than networks. Node scoring already uses weighted SOP. Edge scoring remains unchanged. [Figure: M. tuberculosis, E. coli, C. crescentus.]

Resulting Alignments

Pairwise alignments: cell division, polysaccharide transport, DNA uptake.

Multiple alignments: DNA uptake, DNA replication.

11-way alignment: ribosome.

Comparison to Extant Methods

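Pairwise Full Network. Columns: KEGGs Hit, Enrichment, # Nodes Aligned, CPU Time (sec).

E. coli versus C. jejuni
MaWISh: KEGGs Hit 13, Enrichment 81.3%
NetworkBLAST: KEGGs Hit 24, Enrichment 82.4%, # Nodes Aligned and CPU Time (sec) 7011,353.6
Nuke: KEGGs Hit 25, Enrichment 86.7%

E. coli versus H. pylori
MaWISh: KEGGs Hit 9, Enrichment 83.3%
NetworkBLAST: KEGGs Hit 14, Enrichment 82.9%
Nuke: KEGGs Hit 14, Enrichment 80.0%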

Pairwise Query-to-Database

E. coli versus C. jejuni
Method         KEGGs Hit   Coverage   CPU Time (sec)
MaWISh         20          47%        22.9
NetworkBLAST   17          48%        3,674.8
Nuke           21          50%        14.9

E. coli versus H. pylori
Method         KEGGs Hit   Coverage   CPU Time (sec)
MaWISh         10          65%        20.3
NetworkBLAST   9           51%        3,515.0
Nuke           10          60%        11.6

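Multiple Alignment (3-way): NetworkBLAST versus Nuke. Panel labels: E. coli versus C. jejuni, E. coli versus H. pylori.

CPU Time: 58,

First block
KEGGs Hit: NetworkBLAST 20, Nuke 24
Enrichment: NetworkBLAST 88.2%, Nuke 89.6%
# Nodes Aligned:

Second block
KEGGs Hit: NetworkBLAST 14, Nuke 15
Enrichment: NetworkBLAST 90.4%, Nuke 88.4%
# Nodes Aligned: 979881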