A data-mining approach for multiple structural alignment of proteins WY Siu, N Mamoulis, SM Yiu, HL Chan The University of Hong Kong Sep 9, 2009.

2 Aligning protein structures
- Important step for understanding protein functions
- Sequencing proteins and determining 3D structures is easy
  - X-ray crystallography, NMR spectroscopy
- Testing functions of proteins is hard
- One useful observation
  - Mutations change sequences
  - Structures are conserved
  - Structural similarity => functional similarity
- A good structural alignment algorithm => predicting the functions of proteins

3 Our focus
- We propose studying the problem with less information and fewer assumptions
- Sequence order independence
  - Sequence order (arrangement of amino acids) is unknown
  - Reduces the need to find sequence information
- Subset alignment
  - Find a large alignment for every subset
  - Extract similar structures from a mixture with dissimilar ones
- Bottleneck metric
  - In an alignment, every pair of aligned points has a small distance

4 Related work
- Pairwise alignment
  - Dali [Holm & Sander 93]
  - VAST [Gibrat, Madej, & Bryant 96]
  - CE [Shindyalov & Bourne 98]
- Techniques to obtain multiple alignments
  - Center-star [Akutsu & Kim 99]
  - Tree-progress [Taylor, Flores, & Orengo 94]
- Multiple alignment
  - MultiProt [Shatsky et al. 02] (seq. order)
  - MultiBind [Shatsky et al. 06] (all align.)
  - MUSTA [Leibowitz et al. 01] (all align.)
  - MASS [Dror et al. 03] (seq. order)
  - POSA [Ye & Godzik 05] (seq. order)

5 In the following
- Model
- Algorithms
  - SOIL
- Experimental results

6 Model
- A protein is a set of amino acids in 3D, and an amino acid = 3 points in space
  - The Cα atom, C, and N
  - Substructure = subset of amino acids
- Transformation T(S)
  - For each s ∈ S, T(s) = Rs + t, where
    - R is a 3 × 3 rotation matrix
    - t is a 3 × 1 translation vector
- Similarity
  - C = {c1, …, cn} is a set of substructures and T = {T1, …, Tn} is a set of transformations
  - C is ε-congruent w.r.t. T if we can transform each structure in C and align the amino acids such that the Cα atoms of every aligned pair are close (distance <= ε)
- An ε-congruent alignment
  - For a set S of structures, an alignment is a set of substructures C and transformations T
[Figure: structures S1 and S2 brought together by rotation and translation]
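
The model on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the function names `apply_transform` and `is_eps_congruent` are hypothetical, rotations are given as lists of rows, and an alignment is passed as a list of "columns", each holding one Cα coordinate per structure.

```python
import math

def apply_transform(R, t, s):
    """Return T(s) = R s + t for a 3x3 rotation R (list of rows) and a 3-vector t."""
    return tuple(sum(R[r][c] * s[c] for c in range(3)) + t[r] for r in range(3))

def is_eps_congruent(aligned_columns, transforms, eps):
    """Check the bottleneck condition for an alignment.

    aligned_columns: list of columns; each column holds one C-alpha
        coordinate per structure (assumed input layout, not from the slides).
    transforms: one (R, t) pair per structure.
    Returns True if, after transforming, every pair of aligned
    C-alpha atoms in every column is within distance eps.
    """
    for column in aligned_columns:
        moved = [apply_transform(R, t, p) for (R, t), p in zip(transforms, column)]
        for a in range(len(moved)):
            for b in range(a + 1, len(moved)):
                if math.dist(moved[a], moved[b]) > eps:
                    return False
    return True
```

Because the check takes the worst pair in every column, a single distant pair rejects the whole alignment, which is what distinguishes the bottleneck metric from an average such as RMSD.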

7 Problem definition
- Size of an alignment: number of aligned amino acids of each protein
- Cardinality: number of structures involved
- Input
  - A set of structures S = {S1, S2, …, Sm}
  - A distance threshold ε
  - A subset size threshold min_cardinality
  - An alignment length threshold min_size
- Output
  - For each subset S' ⊆ S with |S'| ≥ min_cardinality, the maximal-length ε-congruent alignment whose length is at least min_size

8 The SOIL Algorithm
- Sequence Order Independent aLignment
  - Step 1. Geometric hashing
  - Step 2. Frequent pattern mining
  - Step 3. Generating alignments

9 Geometric Hashing
- Purpose
  - Take each amino acid as a base (reference) and store the relative locations of the other amino acids in a hash table
[Figure: structures S1 and S2 hashed into boxes of side length ε; each box stores the base]
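
The hashing step can be sketched as follows. Several details here are assumptions rather than the authors' implementation: the local reference frame is built from the three atoms of the base residue (origin at Cα, x-axis towards C, z-axis normal to the Cα-C-N plane), and the names `local_frame` and `build_hash_table` are hypothetical. Each box of side ε collects the bases (k, i), meaning residue i of structure k, that see some other amino acid in that box.

```python
import math
from collections import defaultdict

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)

def local_frame(ca, c, n):
    """Assumed frame: origin at C-alpha, x towards C, z normal to the CA-C-N plane."""
    x = unit(sub(c, ca))
    z = unit(cross(x, sub(n, ca)))
    y = cross(z, x)
    return ca, (x, y, z)

def build_hash_table(structures, eps):
    """structures[k][i] = (ca, c, n) coordinates of residue i in structure k.

    Returns a dict mapping each box (integer triple) to the set of
    bases (k, i) that place another residue's C-alpha in that box.
    """
    table = defaultdict(set)
    for k, residues in enumerate(structures):
        for i, (ca, c, n) in enumerate(residues):
            origin, axes = local_frame(ca, c, n)
            for j, (other_ca, _, _) in enumerate(residues):
                if j == i:
                    continue  # the base itself always sits at the origin
                rel = sub(other_ca, origin)
                local = tuple(dot(rel, axis) for axis in axes)
                box = tuple(math.floor(v / eps) for v in local)
                table[box].add((k, i))  # store the base in the box
    return table
```

Since the coordinates are expressed relative to the base, two structures that differ only by a rigid motion hash corresponding bases into the same boxes, up to rounding at box boundaries.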

10 Mining Frequent Patterns
- Main observation. Assume that a pair of bases {(k1, i1), (k2, i2)} appears in x boxes. Then, if structures S_k1 and S_k2 are transformed using the bases S_k1^i1 and S_k2^i2, at least x+1 pairs of points lie close to each other (distance at most √3ε, i.e., the diagonal length of a box).
- Proof. Why is (k1, i1) in a box?
  - When S_k1 is transformed using the base S_k1^i1, an amino acid is located in that box
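
The √3ε bound can be spelled out in one line. Assuming p and q are two points that fall into the same box of side ε (the symbols p and q are introduced here only for illustration), each coordinate differs by at most ε, so:

```latex
\[
\|p - q\|
  = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2 + (p_z - q_z)^2}
  \le \sqrt{\varepsilon^2 + \varepsilon^2 + \varepsilon^2}
  = \sqrt{3}\,\varepsilon .
\]
```

The extra "+1" in the count comes from the two bases themselves, which both sit at the origin of the common reference frame.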

11 Mining Frequent Patterns
- Let each hash box be a coincidence group, or transaction
- Consider all bases as items
- Find all sets of items that appear frequently in the coincidence groups
- This is the "frequent pattern mining problem", a well-studied problem in the database area
- Efficient algorithms, such as FP-tree, are known
- Efficient: all possible transformations can be considered at the same time
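
To make the transaction view concrete, here is a deliberately naive sketch that counts only pairs of bases. It is not the FP-tree method mentioned on the slide, it does not mine larger itemsets, and the function name `frequent_base_pairs` as well as the restriction to bases from different structures are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_base_pairs(transactions, min_support):
    """Count how many hash boxes (transactions) contain each pair of bases.

    transactions: iterable of sets of bases (k, i).
    Returns {(base_a, base_b): count} for pairs that come from
    different structures and occur in at least min_support boxes.
    """
    counts = Counter()
    for items in transactions:
        for a, b in combinations(sorted(items), 2):
            if a[0] != b[0]:  # assumption: only pair bases of different structures
                counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}
```

A pair that is returned with count x corresponds to the situation in the main observation: the two bases share x boxes, so transforming by them brings at least x+1 pairs of points close together.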

12 Generating Alignments
- Given a frequent pattern
  - E.g., (S_1^2, S_2^1)
  - Use the bases in a tuple to transform the structures involved
  - Generate a matching of points: bipartite matching for pairwise, greedy for multiple
  - Output the largest alignment
[Figure: transformed S1 and S3 and the resulting alignment]
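
The matching step can be illustrated with a simple greedy pass over two already transformed point sets. This is a sketch under assumptions: the slide uses bipartite matching in the pairwise case and a greedy strategy only for multiple structures, whereas the hypothetical `greedy_match` below applies a closest-pair-first greedy rule to just two sets for brevity.

```python
import math

def greedy_match(points_a, points_b, eps):
    """Greedily pair points of two transformed structures.

    Candidate pairs are those within distance eps; they are accepted
    in order of increasing distance, each point being used at most once.
    Returns a list of index pairs (i, j).
    """
    candidates = sorted(
        (math.dist(p, q), i, j)
        for i, p in enumerate(points_a)
        for j, q in enumerate(points_b)
        if math.dist(p, q) <= eps
    )
    used_a, used_b, matching = set(), set(), []
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matching.append((i, j))
    return matching
```

A greedy pass may return fewer pairs than a maximum bipartite matching would, which is presumably why the exact method is preferred when only two structures are involved.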

13 Experimental evaluation
- Implemented in C++
- Test cases run on an Intel® Core™ 2 Duo with a 2.66GHz CPU and 4GB main memory
- Default settings
  - ε: 3Å
  - min_size: 2
  - LRF: 3 atoms
  - Coincidence group: Bin
  - max_trans: 30Avg

14 Pairwise alignment
- 10 pairs of proteins used before, e.g., by MultiProt
- SCOP and PRINT families
- Comparison of running time
  - C-alpha match: within a few seconds (from web)
  - MultiProt: 0.211s
  - MultiBind: 1.968s
  - SOIL: 0.235s

15 Multiple alignments
- 10 groups of proteins
- Various superfamilies in SCOP, protein interfaces from PRINT

16 Multiple alignments
[Figure: results by level (axis label "Levels") for Calcium Binding, 4-helix Bundle, Superhelix, Supersandwich, Concanavalin, tRNA synthetase, G-proteins, PTB domain, PRINT 45, PRINT 8158]

17 Multiple alignment

18 Conclusion
- Proposed a more difficult problem
  - Sequence order independence: modeled as the largest common point set problem
  - Subset alignment: automatically detect subsets of similar structures
  - Similarity measurement: adopt the bottleneck metric
- Developed the SOIL algorithm
  - Combination of Geometric Hashing and Frequent Itemset Mining
  - Simultaneous alignment
- Evaluated the algorithm with experiments
  - Can be combined with other methods by simply taking the maximum

19 Future Work
- Variations of the problem
- Scoring functions
- Disk-based solution
- Other applications