Finding Compact Structural Motifs Presented By: Xin Gao Authors: Jianbo Qian, Shuai Cheng Li, Dongbo Bu, Ming Li, and Jinbo Xu University of Waterloo,

Presentation transcript:

Finding Compact Structural Motifs Presented By: Xin Gao Authors: Jianbo Qian, Shuai Cheng Li, Dongbo Bu, Ming Li, and Jinbo Xu University of Waterloo, Ontario, Canada

Outline
- Introduction to Structural Motifs
- Related Work
- Compact Motif-Finding Problem Formulation
- NP-Hardness of the Compact Motif-Finding Problem
- A Polynomial-Time Approximation Scheme

Introduction
A protein is a sequence of amino acids that folds into a specific 3-D shape. Structure matters to proteins:
- The functional properties of a protein depend on its 3-D structure.
- Structure is more conserved than sequence during protein evolution.

Structural Motif
A structural motif is a frequently occurring substructure of proteins. Motifs are thought to be closely tied to protein function, so identifying the motifs shared by a set of proteins can help reveal their evolutionary history and functions.

Structural Motif Finding Problem
Given a set of protein structures, find their frequently occurring substructure. Informally: choose one substructure from each protein so that the chosen substructures exhibit the highest degree of similarity.

How to measure the similarity of two substructures?
Two popular measurements:
- cRMSD: the root mean square Euclidean distance between corresponding residues of the two structures.
- dRMSD: compute the internal distance matrix of each structure and compare the two matrices.
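
As a rough illustration of the two kinds of measure listed above, the sketch below computes an RMS distance over corresponding points and an RMS difference between internal distance matrices. The function names are hypothetical, and the point-based function assumes the two structures are already superimposed; it does not search for the best rotation or translation.

```python
import math

def coordinate_rmsd(a, b):
    """RMS Euclidean distance between corresponding points (no superposition search)."""
    assert len(a) == len(b)
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))

def distance_matrix_rmsd(a, b):
    """RMS difference between the internal pairwise-distance matrices of a and b."""
    assert len(a) == len(b)
    total, pairs = 0.0, 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            total += (math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2
            pairs += 1
    return math.sqrt(total / pairs)

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
b = [(x + 5.0, y, z) for x, y, z in a]  # the same shape, shifted 5 units along x

print(coordinate_rmsd(a, b))       # 5.0
print(distance_matrix_rmsd(a, b))  # 0.0
```

The shifted copy shows the practical difference: the point-based value changes with placement unless the structures are superimposed first, while the distance-matrix value depends only on internal geometry.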

Related Work
- L. P. Chew proposed an iterative algorithm to compute the conserved shape and proved its convergence (2002).
- D. Bandyopadhyay applied graph-based data-mining tools to find family-specific fingerprints (2006).
- M. Shatsky presented an algorithm to uncover the binding pattern (2006).
- DALI and CE attempt to identify the structural alignment with minimal dRMSD.
- STRUCTAL and TM-align employ heuristics to detect the alignment with minimal cRMSD.

Related Work (continued)
However, these methods are all heuristic; their solutions are not guaranteed to be optimal or near-optimal.
The first PTAS for pairwise structural alignment:
- R. Kolodny explored the Lipschitz property of the scoring function (2004).
Although this algorithm can be extended to multiple structure alignment, the simple extension has a time complexity exponential in the number of proteins.
Is there a PTAS for multiple-structure motif finding?

We focus on the (R, C)-compact motif.
What is an (R, C)-compact motif?
- The motif is bounded by a minimum ball of radius R.
- In this ball, at most C residues do not belong to the motif.
The (R, C)-compact motif is biologically meaningful because:
- We focus on globular proteins.
- We allow at most C exceptions.

(R, C)-Compact Motif Finding Problem
Input: protein structures $S_1, \ldots, S_n$ and a length $l$.
Output:
- a consensus $q = (q_1, \ldots, q_l)$ consisting of $l$ 3D points;
- a substructure $u_i$ from each protein $S_i$.
Objective: $\min \left( \sum_{1 \le i \le n} d^2(q, u_i) \right)^{1/2}$.
Here, the distance function measures the Euclidean distance between corresponding points after the best rigid superposition, i.e., $d(q, u_i) = \min_{\rho} \| q - \rho(u_i) \|_2$, where $\rho$ consists of a rotation and a translation and $\| \cdot \|_2$ is the Euclidean metric.
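
To make the objective concrete, here is a minimal sketch that evaluates it for substructures assumed to be already placed by their rigid transformations, so the minimization over rotation and translation is skipped. Under that assumption the per-position mean of the substructures minimizes the summed squared distance and can serve as the consensus. The helper names are illustrative only.

```python
import math

def consensus(substructures):
    """Per-position mean of the substructures (each a list of l 3D points)."""
    n = len(substructures)
    length = len(substructures[0])
    return [
        tuple(sum(u[k][axis] for u in substructures) / n for axis in range(3))
        for k in range(length)
    ]

def squared_distance(q, u):
    """Squared Euclidean distance between two point lists, without superposition."""
    return sum(math.dist(p, x) ** 2 for p, x in zip(q, u))

def objective(q, substructures):
    """Square root of the summed squared distances from q to every substructure."""
    return math.sqrt(sum(squared_distance(q, u) for u in substructures))

u1 = [(0, 0, 0), (1, 0, 0)]
u2 = [(0, 2, 0), (1, 2, 0)]
q = consensus([u1, u2])
print(q)                       # [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
print(objective(q, [u1, u2]))  # 2.0
```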

(R, C)-compact motif finding is still NP-hard.
Reduction from the Sequence Consensus Problem:
- Input: $n$ binary strings $S_1, \ldots, S_n$, each of length $m$.
- Output: a substring $t_i$ of length $l$ from each string $S_i$, $1 \le i \le n$.
- Objective: minimize $\sum_{1 \le i < i' \le n} d_H(t_i, t_{i'})$, where $d_H$ is the Hamming distance.
Basic idea: design the reduction so that the RMSD between the constructed substructures equals the Hamming distance between the original substrings.
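
The Sequence Consensus Problem stated above can be solved on tiny inputs by trying every combination of substrings. The sketch below does exactly that; it is exponential in the number of strings and is meant only to pin down the problem statement, not to suggest a practical method. The function names are made up for this example.

```python
from itertools import combinations, product

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def sequence_consensus(strings, l):
    """Pick one length-l substring per string, minimizing the pairwise Hamming sum."""
    windows = [[s[i:i + l] for i in range(len(s) - l + 1)] for s in strings]
    best_cost, best_choice = None, None
    for choice in product(*windows):
        cost = sum(hamming(a, b) for a, b in combinations(choice, 2))
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, choice
    return best_cost, best_choice

print(sequence_consensus(["0110", "1101", "0011"], 2))  # (0, ('01', '01', '01'))
```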

(R, C)-compact motif finding is still NP-hard.
Each l-mer is transformed into 6l 3D points; in the slide's example (the string 110), bit $i$ maps to $0 \to (0, 2i, 0)$ and $1 \to (1, 2i, 0)$.
- The centroid will be $(1/2, 2i, 0)$ (easy translation).
- A large “tail” rules out rotation.
- RMSD = Hamming distance.
A small distortion is applied to each point to make the structure protein-like.
The Sequence Consensus Problem reduces to the (1, 0)-Compact Motif Finding Problem.
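
The core of the encoding can be checked in a few lines. This sketch covers only the bit-to-point mapping; it leaves out the "tail" points and the small distortion mentioned on the slide, so it produces l points per l-mer rather than 6l. With that simplification, the squared Euclidean distance between two encoded strings equals their Hamming distance, which is presumably the relationship the slide abbreviates as "RMSD = Hamming distance". All names are illustrative.

```python
from itertools import product

def encode(bits):
    """Map bit i (1-based) of a binary string to the 3D point (bit, 2i, 0)."""
    return [(int(b), 2 * i, 0) for i, b in enumerate(bits, start=1)]

def squared_distance(p, q):
    """Sum of squared coordinate differences between two point lists."""
    return sum((x - y) ** 2 for a, b in zip(p, q) for x, y in zip(a, b))

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

# Check the identity on every pair of 3-bit strings.
all_strings = ["".join(bits) for bits in product("01", repeat=3)]
identity_holds = all(
    squared_distance(encode(s), encode(t)) == hamming(s, t)
    for s in all_strings
    for t in all_strings
)
print(identity_holds)  # True
```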

The Basic Idea of Our PTAS
There are always a few “important” substructures whose consensus holds most of the “secrets” of the true optimal motif. Therefore, if we find these few substructures by exhaustive search, the trivial optimal solution for them is a good approximation of the real optimal solution.

Technique 1: Sampling
We sample only r proteins and consider every motif in each sampled protein; from these alone we almost know the optimal solution.

Sampling will introduce only a small error.
There is at least one selection scheme whose consensus has a cost less than (1 + 1/r) OPT, so we can find this scheme by simple enumeration.
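
The enumeration step might look like the sketch below. It makes several simplifying assumptions that the slides do not state: candidate substructures are already superimposed (no rotation search), the r sampled proteins are distinct, and the consensus for a sample is the per-position mean of the picked candidates. It illustrates the search pattern only and proves nothing about the approximation ratio. All names are hypothetical.

```python
import math
from itertools import combinations, product

def mean_structure(structs):
    """Per-position mean of equally long lists of 3D points."""
    n = len(structs)
    return [tuple(sum(s[k][a] for s in structs) / n for a in range(3))
            for k in range(len(structs[0]))]

def sq_dist(q, u):
    """Squared Euclidean distance between two point lists."""
    return sum(math.dist(p, x) ** 2 for p, x in zip(q, u))

def sampled_motif_search(proteins, r):
    """proteins[i] lists the candidate substructures of protein i.
    Try every choice of r proteins and one candidate from each, use the mean of
    the picked candidates as consensus, then let every protein take its closest
    candidate. Returns (cost, consensus, chosen candidate indices)."""
    best = None
    for sample in combinations(range(len(proteins)), r):
        for picks in product(*(proteins[i] for i in sample)):
            q = mean_structure(picks)
            chosen = [min(range(len(c)), key=lambda j: sq_dist(q, c[j]))
                      for c in proteins]
            cost = math.sqrt(sum(sq_dist(q, c[j]) for c, j in zip(proteins, chosen)))
            if best is None or cost < best[0]:
                best = (cost, q, chosen)
    return best

# Three toy "proteins", each with two one-point candidates.
proteins = [
    [[(0, 0, 0)], [(10, 0, 0)]],
    [[(0, 1, 0)], [(20, 0, 0)]],
    [[(1, 0, 0)], [(30, 0, 0)]],
]
cost, q, chosen = sampled_motif_search(proteins, 2)
print(chosen)  # [0, 0, 0]
```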

Technique 2: Discretize the Rotation Space
Each rotation is parameterized by three angles $\theta_1, \theta_2, \theta_3 \in [0, 2\pi)$. Discretizing the angles with step size $\epsilon'$ gives an $\epsilon'$-rotation net.
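
A rotation net of this kind can be enumerated as follows. The sketch assumes one particular Euler-angle convention (rotation about x, then y, then z); the slides do not say which convention is used, and the function names are illustrative.

```python
import math
from itertools import product

def rotation_matrix(t1, t2, t3):
    """Rz(t3) * Ry(t2) * Rx(t1) as a 3x3 nested list (an assumed convention)."""
    c1, s1 = math.cos(t1), math.sin(t1)
    c2, s2 = math.cos(t2), math.sin(t2)
    c3, s3 = math.cos(t3), math.sin(t3)
    return [
        [c3 * c2, c3 * s2 * s1 - s3 * c1, c3 * s2 * c1 + s3 * s1],
        [s3 * c2, s3 * s2 * s1 + c3 * c1, s3 * s2 * c1 - c3 * s1],
        [-s2, c2 * s1, c2 * c1],
    ]

def rotate(R, p):
    """Apply the 3x3 matrix R to the 3D point p."""
    return tuple(sum(R[i][k] * p[k] for k in range(3)) for i in range(3))

def rotation_net(step):
    """All angle triples on a grid of the given step size over [0, 2*pi)."""
    count = math.ceil(2 * math.pi / step)
    grid = [i * step for i in range(count)]
    return list(product(grid, repeat=3))

net = rotation_net(math.pi / 2)
print(len(net))  # 64
```

The net size grows as the cube of the number of grid steps per angle, which is why the number of rotations per motif depends on the step size.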

Discretized rotation will not introduce a large error, either.
Reference: J. Xu, F. Jiao, and B. Berger. A parameterized algorithm for protein structure alignment. RECOMB 2006.

PTAS

Performance Ratio Analysis

Running Time
- Each protein contains M motifs; M is polynomial in the protein length.
- Each motif can adopt W rotations; W depends on the constant $\epsilon$.
- So the number of candidate consensuses is at most $O(n^r (MW)^r) = O((nMW)^r)$.
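
For a feel of how this bound grows, the following small calculation plugs in made-up values. The numbers are purely illustrative and are not taken from the presentation; the function name is hypothetical.

```python
def consensus_candidates_bound(n, M, W, r):
    """Upper bound (n * M * W) ** r on the number of candidate consensuses."""
    return (n * M * W) ** r

# Hypothetical sizes: 10 proteins, 100 motifs each, 64 rotations, sample size 2.
print(consensus_candidates_bound(10, 100, 64, 2))  # 4096000000
```

The bound is polynomial in n, M and W for a fixed sample size r, but the exponent r rises as the target approximation gets tighter.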

Conclusion and Future Work
- We prove that the (R, C)-compact motif finding problem is NP-hard.
- We obtain a PTAS for this problem.
Future work:
- Further reduce the time complexity.
- Design practical algorithms.
- Solve a more general case.

Thank You. Questions…