Protein Domain Finding Problem
Olga Russakovsky, Eugene Fratkin, Phuong Minh Tu, Serafim Batzoglou

Algorithm

Step 1: Creating a graph of k-mers
First, we break the sequences into overlapping k-mers (subsequences of length k). These k-mers are the vertices of our graph. For each pair of k-mers, we compute a similarity score using the BLOSUM matrix and, if that score exceeds a chosen threshold, we add an edge between the two k-mers. The result is a sparse graph G = (V, E), which we represent as an adjacency matrix A.

Step 2: Clustering the k-mers
The first step is straightforward; the main work of the algorithm begins here. We cluster the graph using the Markov Clustering Algorithm. We then assume that each resulting cluster is a set of anchoring points such that there is a unique isomorphism between these points and the repeated occurrences of a domain. In other words, each occurrence of a domain contains an instance of an anchoring point. The number of such clusters is much greater than the number of domains, so each instance of a domain contains many anchoring points from separate sets.

Step 3: Creating a graph of clusters
In this step our goal is to identify the sets that belong to the same domain. We compute the pairwise similarity between clusters and build the corresponding adjacency matrix. This resembles Step 1, except that we are now interested in the structural similarity of sets rather than in k-mer similarity.

Step 4: Clustering the clusters
This step is identical to Step 2, except that the adjacency matrix describes clusters rather than individual k-mers. The Markov Clustering Algorithm now produces sets of clusters. We assume that clusters within the same set contain anchoring points in repetitions of the same domain.

Step 5: Combining clusters to form domains
By combining the sets of clusters obtained in Step 4 and adding the amino acids that lie between two similar anchoring points, we obtain an outline of the domain.
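As a rough illustration of Step 1 above, the sketch below builds the k-mer graph for the short two-letter example sequence AABABBABA. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, and `toy_score` (+1 for a match, -1 for a mismatch) is only a stand-in for a real BLOSUM lookup, which the poster does not specify further.

```python
from itertools import combinations


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_score(a, b):
    """Assumed stand-in for a BLOSUM entry: +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def kmer_graph(seq, k, threshold, score=toy_score):
    """Return the k-mer vertices and a symmetric 0/1 adjacency matrix.

    Two k-mers are joined by an edge when the sum of their
    position-by-position scores is greater than the threshold.
    """
    verts = kmers(seq, k)
    n = len(verts)
    adj = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        total = sum(score(a, b) for a, b in zip(verts[i], verts[j]))
        if total > threshold:
            adj[i][j] = adj[j][i] = 1
    return verts, adj


verts, adj = kmer_graph("AABABBABA", k=4, threshold=0)
for name, row in zip(verts, adj):
    print(name, row)
```

With this toy scoring and a threshold of 0, two 4-mers are connected only when they agree in at least three positions. For real protein sequences the `score` argument would be replaced by a lookup into an actual BLOSUM matrix, and comparing every pair of k-mers would likely need to be replaced by something cheaper on large inputs.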
We can then extend these outlines using dynamic programming. This extension has not yet been implemented.

[Figure: the example sequence -AABABBABA- broken into overlapping 4-mers 1) AABA, 2) ABAB, 3) BABB, 4) ABBA, 5) BBAB, 6) BABA, with the adjacency matrix A over k-mers 1) to 6); the matrix entries are not recoverable.]
[Figure panels: amino acid sequences with the protein domains to be discovered; clusters of k-mers; domain outlines.]

Many thanks to Marina Sirota for assistance in poster preparation.

References:
Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht.

Biological Background
Proteins are the basic mechanism through which cell processes are initiated, and understanding their structure and function is key to understanding how cells operate. Proteins consist of protein domains: sequences of amino acids that are self-stabilizing and that fold independently of the rest of the protein. Protein domains figure prominently in the biological function of the protein they belong to. Furthermore, many domains are not unique to one protein family but appear in many different proteins. Studying individual protein domains therefore provides insight into the geometric structure and function of many different proteins. This information, in turn, gives us a key to understanding and inferring cell behavior. Correctly identifying these domains in the amino acid sequence is the first step.

Problem Formulation
From the computer science point of view, we are given several sequences (2 to 100), each 500 or more amino acids long. We try to identify the protein domains by looking for regions that are statistically over-represented. These regions can be 30 to 3000 letters long. The trivial approach of looking for exact sequence matches does not work, for two reasons. First, we do not know the exact length of the domain, and the input is too large to try every possible length.
Second, domains contain mutations and insertions, so searching for exact matches will not yield good results. We therefore need a cleverer and more efficient algorithm.

Markov Clustering Algorithm
This is the key algorithm in our approach. Given the adjacency matrix A of a graph, the Markov Clustering Algorithm (MCA) produces a forest (a collection of trees) in which each tree corresponds to a cluster of highly interconnected vertices. The algorithm works by alternately applying Expansion and Inflation steps to the matrix.

Expansion raises the matrix to a power: $A \leftarrow A^{x}$, where x is an appropriately chosen coefficient based on graph connectivity.

Inflation raises every entry of A to some power y and then normalizes each column:
$\forall a_{ij} \in A: \; a_{ij} \leftarrow a_{ij}^{y}$
$\forall a_{ij} \in A: \; a_{ij} \leftarrow a_{ij} / \sum_{n} a_{nj}$
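The alternation of Expansion and Inflation described above can be sketched in a few lines of pure Python. This is a minimal sketch under stated assumptions rather than the authors' code: the function names, the default values x = 2 and y = 2, the added self-loops, the initial column normalization, and the way clusters are read off the rows of the converged matrix are all assumptions based on common practice with Markov clustering, not details given in the poster.

```python
def normalize_columns(m):
    """Divide every entry by its column sum (all-zero columns stay zero)."""
    n = len(m)
    sums = [sum(m[r][c] for r in range(n)) for c in range(n)]
    return [[m[r][c] / sums[c] if sums[c] else 0.0 for c in range(n)]
            for r in range(n)]


def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expand(m, x):
    """Expansion: raise the matrix to the integer power x."""
    result = m
    for _ in range(x - 1):
        result = matmul(result, m)
    return result


def inflate(m, y):
    """Inflation: raise each entry to the power y, then normalize columns."""
    return normalize_columns([[v ** y for v in row] for row in m])


def markov_cluster(adj, x=2, y=2.0, max_iter=50, tol=1e-9):
    """Alternate Expansion and Inflation until the matrix stops changing."""
    n = len(adj)
    # Assumption: add self-loops and start from a column-stochastic matrix.
    m = normalize_columns(
        [[float(adj[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iter):
        nxt = inflate(expand(m, x), y)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Assumption: each row that keeps some mass marks one cluster, whose
    # members are the columns with non-negligible entries in that row.
    clusters = {frozenset(j for j, v in enumerate(row) if v > 1e-6)
                for row in m}
    return sorted(sorted(c) for c in clusters if c)


# Two disconnected triangles: vertices 0-2 and vertices 3-5.
triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_cluster(triangles))  # [[0, 1, 2], [3, 4, 5]]
```

Larger values of y make Inflation sharper and tend to produce more, smaller clusters, while Expansion spreads flow along longer paths. A sparse-matrix representation would be the natural next step for graphs of realistic size.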