Inferring Functional Information from Domain co-evolution Yohan Kim, Mehmet Koyuturk, Umut Topkara, Ananth Grama and Shankar Subramaniam Gaurav Chadha.

Presentation transcript:

Inferring Functional Information from Domain co-evolution Yohan Kim, Mehmet Koyuturk, Umut Topkara, Ananth Grama and Shankar Subramaniam Gaurav Chadha Deepak Desore

Layout: Motivation; Computational Methods and Algorithms; Results; Conclusion; Questions

Motivation (1 of 2) Prior work focused on understanding protein function at the level of entire protein sequences. Assumption: a complete sequence follows a single evolutionary trajectory. However, it is well known that a domain can exist in various contexts, which invalidates this assumption for multi-domain protein sequences.

Motivation (2 of 2) Our approach: an improvement on the multiple profile method. It constructs a co-evolutionary matrix to assign phylogenetic similarity scores to each protein pair, and it identifies co-evolving regions using residue-level conservation.

Computational Methods & Algorithms Constructing phylogenetic profiles: protein (single) phylogenetic profiles, segment (multiple) phylogenetic profiles, residue phylogenetic profiles. Computing co-evolutionary matrices. Deriving phylogenetic similarity scores.

Protein phylogenetic profiles A phylogenetic profile is a vector that records whether a protein is present in each genome. Let $P = \{P_1, P_2, \dots, P_n\}$ be the set of proteins and $G = \{G_1, G_2, \dots, G_m\}$ be the set of genomes. Every row represents the binary phylogenetic profile of a protein.

Protein phylogenetic profiles (contd.) The single phylogenetic profile $\psi_i$ for protein $P_i$ is $\psi_i(j) = -\frac{1}{\log(E_{ij})}$, $1 \le j \le m$, where $E_{ij}$ is the minimum BLAST E-value of a local alignment between $P_i$ and $G_j$. Advantage: it gives the degree of sequence divergence.
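
The profile definition on this slide can be sketched in a few lines. This is a minimal sketch, not the authors' code: the function name `profile_from_evalues` is hypothetical, the logarithm is assumed to be base 10 (the slide does not state the base), and genomes with no hit or with E >= 1 are assumed to receive the value 1.0, with all entries capped at 1.0.

```python
import math

def profile_from_evalues(evalues, absent=1.0):
    """Map per-genome minimum BLAST E-values to profile entries -1/log10(E).

    evalues: one entry per genome; None means no hit was found.
    absent:  assumed value for genomes without a significant hit.
    """
    profile = []
    for e in evalues:
        if e is None or e >= 1.0:
            # Assumption: no hit, or a non-significant hit, counts as absent.
            profile.append(absent)
            continue
        e = max(e, 1e-300)  # BLAST may report 0.0; avoid log10(0)
        # Smaller E-values (stronger hits) give entries closer to 0.
        profile.append(min(absent, -1.0 / math.log10(e)))
    return profile

if __name__ == "__main__":
    print(profile_from_evalues([1e-10, 1e-100, None, 0.5]))
```

Under these assumptions a strong hit maps to a value near 0 and a missing hit maps to 1, so the entry shrinks as sequence conservation grows.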

Protein phylogenetic profiles (contd.) Mutual information $I(X,Y)$ is defined as $I(X,Y) = H(X) + H(Y) - H(X,Y)$, where $H(X)$, the Shannon entropy of $X$, is defined as $H(X) = -\sum_{x \in X} p_x \log(p_x)$ with $p_x = P[X = x]$. The phylogenetic similarity between $\psi_i$ and $\psi_j$ is $\mu_S(P_i, P_j) = I(\psi_i, \psi_j)$.
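
To make the mutual information formula concrete, here is a small sketch that estimates it from two real-valued profiles. The names `entropy`, `discretize`, and `mutual_information` are illustrative, and the discretization into 10 equal-width bins over [0, 1] is an assumption; the slides do not say how profile values are binned.

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy in bits: H(X) = -sum_x p_x * log2(p_x)."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def discretize(profile, bins=10):
    """Assumed binning: equal-width bins over [0, 1]."""
    return [min(bins - 1, int(v * bins)) for v in profile]

def mutual_information(x, y, bins=10):
    """I(X,Y) = H(X) + H(Y) - H(X,Y), estimated across genomes."""
    bx, by = discretize(x, bins), discretize(y, bins)
    return entropy(bx) + entropy(by) - entropy(list(zip(bx, by)))

if __name__ == "__main__":
    psi_i = [0.05, 0.05, 0.95, 0.95]
    psi_j = [0.05, 0.95, 0.05, 0.95]
    print(mutual_information(psi_i, psi_i))  # identical profiles
    print(mutual_information(psi_i, psi_j))  # independent-looking profiles
```

Two identical profiles share all of their entropy, while profiles whose bins vary independently across genomes score zero.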

Segment phylogenetic profiles Single-profile methods can miss significant interactions. Domain $D_1^2$ of $P_2$ follows an evolutionary trajectory similar to those of $P_1$ and $P_3$, which the single profile method did not capture.

Segment phylogenetic profiles (contd.) Each protein $P_i$ is divided into fixed-size segments $S_1^i, S_2^i, \dots, S_k^i$. The phylogenetic similarity between two proteins is $\mu_M(P_i, P_j) = \max_{s,t} I(\psi_s^i, \psi_t^j)$, where $\psi_s^i$ is the phylogenetic profile of segment $S_s^i$ of protein $P_i$.
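
The multiple profile score is a maximum over all segment pairs, which the following sketch spells out. It assumes the segment profiles have already been computed and lie in [0, 1]; `segment_bounds` and `multiple_profile_score` are hypothetical helper names, the segments are assumed non-overlapping, and the 10-bin discretization is an assumption carried over for illustration.

```python
import math
from collections import Counter

def entropy(symbols):
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(x, y, bins=10):
    # Assumed equal-width binning of profile values in [0, 1].
    bx = [min(bins - 1, int(v * bins)) for v in x]
    by = [min(bins - 1, int(v * bins)) for v in y]
    return entropy(bx) + entropy(by) - entropy(list(zip(bx, by)))

def segment_bounds(length, size):
    """1-based inclusive (start, end) intervals of fixed-size segments."""
    return [(s, min(s + size - 1, length)) for s in range(1, length + 1, size)]

def multiple_profile_score(segs_i, segs_j, bins=10):
    """mu_M(P_i, P_j): best mutual information over all segment pairs."""
    return max(mutual_information(s, t, bins) for s in segs_i for t in segs_j)

if __name__ == "__main__":
    print(segment_bounds(130, 60))
    segs_i = [[0.05, 0.05, 0.95, 0.95]]
    segs_j = [[0.05, 0.95, 0.05, 0.95], [0.95, 0.95, 0.05, 0.05]]
    print(multiple_profile_score(segs_i, segs_j))
```

Taking the maximum means one well-matched segment pair is enough to link two proteins, even if the remaining segments evolved differently.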

Residue phylogenetic profiles Problem with multiple phylogenetic profiles: both domains are covered together by the segment $S_2^2$, which overrides their individual phylogenetic profiles. A significant local alignment between two proteins corresponds to the residues covered by the alignment rather than to the whole sequences.

Residue phylogenetic profiles (contd.) $\mathcal{A}(P_i, G_j)$ is the set of significant local alignments between protein $P_i$ and genome $G_j$. $T(A) = [r_b, r_e]$ is the interval of residues on $P_i$ corresponding to each alignment $A \in \mathcal{A}(P_i, G_j)$. For each residue $r$ on $P_i$, the phylogenetic profile is $\psi_r^i(j) = \min_{A \in \mathcal{A}_r} -\frac{1}{\log(E(A))}$, $1 \le j \le m$, where $\mathcal{A}_r = \{A \in \mathcal{A}(P_i, G_j) : r \in T(A)\}$ is the set of local alignments that contain $r$.
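
A short sketch of the residue-level profile may help here. It assumes each alignment is given as a hypothetical `(start, end, evalue)` tuple with 1-based inclusive residue coordinates on the protein, uses a base-10 logarithm, and assigns the assumed value 1.0 to residues that no significant alignment covers; none of these conventions come from the slides.

```python
import math

def residue_profiles(length, alignments_by_genome, absent=1.0):
    """Return profiles[r][j] = min over alignments covering residue r
    (0-based index r) in genome j of -1/log10(E), or `absent` if none."""
    m = len(alignments_by_genome)
    profiles = [[absent] * m for _ in range(length)]
    for j, alignments in enumerate(alignments_by_genome):
        for start, end, evalue in alignments:
            evalue = max(evalue, 1e-300)  # guard against a reported 0.0
            if evalue >= 1.0:
                continue  # assumption: not a significant alignment
            score = min(absent, -1.0 / math.log10(evalue))
            # Only residues inside T(A) = [start, end] are updated.
            for r in range(max(start, 1) - 1, min(end, length)):
                profiles[r][j] = min(profiles[r][j], score)
    return profiles

if __name__ == "__main__":
    hits = [[(1, 3, 1e-10), (2, 5, 1e-50)], []]  # two genomes, made-up hits
    for r, row in enumerate(residue_profiles(5, hits), start=1):
        print(r, row)
```

Residues 2 and 3 are covered by both made-up alignments and take the value of the stronger one, which is how overlapping hits are resolved by the minimum in the formula.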

Computing co-evolutionary matrices For each protein pair $P_i$ and $P_j$ with lengths $l_i$ and $l_j$, the co-evolutionary matrix entry $M_{ij}(r,s)$ is $M_{ij}(r,s) = I(\psi_r^i, \psi_s^j)$, where $1 \le r \le l_i$ and $1 \le s \le l_j$. The co-evolutionary matrix contains information about which regions of the two proteins co-evolved. The co-evolved domain(s) appear as a block of high mutual information scores in the matrix.
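
The matrix construction amounts to one mutual information evaluation per residue pair, as the sketch below illustrates. The function name `coevolution_matrix` is hypothetical, residue profiles are assumed to be precomputed with values in [0, 1], and the 10-bin discretization inside `mutual_information` is an assumption rather than the method's documented choice.

```python
import math
from collections import Counter

def entropy(symbols):
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(x, y, bins=10):
    # Assumed equal-width binning of profile values in [0, 1].
    bx = [min(bins - 1, int(v * bins)) for v in x]
    by = [min(bins - 1, int(v * bins)) for v in y]
    return entropy(bx) + entropy(by) - entropy(list(zip(bx, by)))

def coevolution_matrix(profiles_i, profiles_j, bins=10):
    """M[r][s] = I(psi_r^i, psi_s^j) for every residue pair (0-based)."""
    return [[mutual_information(pr, ps, bins) for ps in profiles_j]
            for pr in profiles_i]

if __name__ == "__main__":
    profiles_i = [[0.05, 0.05, 0.95, 0.95], [0.05, 0.95, 0.05, 0.95]]
    profiles_j = [[0.95, 0.95, 0.05, 0.05]]
    for row in coevolution_matrix(profiles_i, profiles_j):
        print(row)
```

The cost is proportional to $l_i \cdot l_j \cdot m$, which is presumably why the Results slide mentions a down-sampling factor.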

Deriving phylogenetic similarity scores The phylogenetic similarity score between two proteins $P_i$ and $P_j$ is $\mu_C(P_i, P_j) = \max_{1 \le r \le l_i,\; 1 \le s \le l_j} \; \min_{r \le a \le r + W,\; s \le b \le s + W} M_{ij}(a,b)$, where $W$ is the window parameter that quantifies the minimum size of the region on a protein to be considered a conserved domain.
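
The max-min score can be computed directly by sliding a square window over the matrix. The following is a naive sketch with a hypothetical function name `coevolution_score`; it assumes the matrix is a non-empty list of equal-length rows and only considers windows that fit inside the matrix (clipping the window if the matrix is smaller than $W + 1$ in either dimension), a boundary convention the slide does not specify.

```python
def coevolution_score(M, W):
    """mu_C: maximum over window origins (r, s) of the minimum entry in the
    (W+1) x (W+1) window covering rows r..r+W and columns s..s+W."""
    rows, cols = len(M), len(M[0])
    best = float("-inf")
    for r in range(max(1, rows - W)):
        for s in range(max(1, cols - W)):
            window_min = min(
                M[a][b]
                for a in range(r, min(r + W + 1, rows))
                for b in range(s, min(s + W + 1, cols))
            )
            best = max(best, window_min)
    return best

if __name__ == "__main__":
    M = [
        [0.1, 0.2, 0.1],
        [0.2, 0.9, 0.8],
        [0.1, 0.7, 0.9],
    ]
    print(coevolution_score(M, W=1))  # best 2x2 block
    print(coevolution_score(M, W=0))  # single best entry
```

Using the minimum inside each window rewards contiguous blocks of high mutual information, so a single isolated high entry cannot dominate the score once $W > 0$.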

Results Implemented and tested on 4311 E. coli proteins and 152 genomes (131 Bacteria, 17 Archaea, 4 Eukaryota). Down-sampling factor f = 30 and W = 2; these values translate into overlapping segments 60 residues long. Homologous proteins were excluded from the analysis. Define p-value as fraction of non-homologous protein pairs (N)

Results (contd.) MIS: mutual information score. PP: number of predicted protein pairs. PPV = TP / (TP + FP). For all $\mu^*$, coverage = TP + FP. TN and FN are the numbers of protein pairs that do not meet the threshold.
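
As an illustration of these evaluation quantities, the sketch below thresholds a set of scores and counts predictions against a reference set. Everything here is hypothetical: the function name `evaluate`, the pair labels, the scores, and the reference set of known interacting pairs are made up for the example and are not data from the study.

```python
def evaluate(scores, known_pairs, threshold):
    """Count predictions at or above `threshold` against a reference set.

    scores:      dict mapping a protein pair to its similarity score mu*.
    known_pairs: set of pairs treated as true interactions (assumed given).
    """
    predicted = {pair for pair, mu in scores.items() if mu >= threshold}
    tp = len(predicted & known_pairs)
    fp = len(predicted) - tp
    ppv = tp / (tp + fp) if predicted else float("nan")
    return {"TP": tp, "FP": fp, "PPV": ppv, "coverage": tp + fp}

if __name__ == "__main__":
    scores = {("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.2, ("C", "D"): 0.7}
    known = {("A", "B"), ("C", "D"), ("B", "C")}
    print(evaluate(scores, known, threshold=0.5))
```

Raising the threshold typically trades coverage for PPV, which is the comparison the following slides make between the two methods.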

Results (contd.) The co-evolutionary matrix method has 1.5 times greater coverage at PPV = 0.7 than the single profile method. At the same number of predicted pairs (PP), the co-evolutionary matrix method has better PPV and sensitivity values than the single profile method.

Results (contd.) Mutual information score distribution for interacting and non-interacting protein pairs. At an MIS of 0, SP (single profile) shows a peak while CM (co-evolutionary matrix) does not. In other words, at low MIS values the SP distribution lies above the CM distribution.

Results (contd.) Shows p-values of the single profile method versus the co-evolutionary matrix method. The scattered circles show that the two methods can give very different predictions.

Results (contd.) – Phosphotransferase system Domain IIA (residues 1-170) and domain IIB (residue ). The darker region shows that the domains have co-evolved, so we can conclude that IIB evolved with IIC rather than IIA. Top-20 predicted interacting partners of protein IIAB for both methods.

Results (contd.) - Chemotaxis The N-terminus of CheA (residues 1-200) and the C-terminus of CheA (residues ) co-evolved with the C-terminus region of CheB (residues ). Top-20 predicted interacting partners of protein CheA using both methods.

Results (contd.) – Kdp System The N-terminal domain of KdpD (residues 1-395) co-evolved with KdpC. Top-10 predicted interacting partners of protein KdpD using both methods.

Conclusion The results in this paper strongly suggest that co-evolution of proteins should be captured at the domain level, because domains with conflicting evolutionary histories can co-exist in a single protein sequence. Regions that are important for supporting both functional and physical interactions between proteins can be detected.

Questions Thank You !!