DDPIn Distance and Density Based Protein Indexing David Hoksza Charles University in Prague Department of Software Engineering Czech Republic.

Presentation transcript:


CIBCB Presentation Outline
- Biological background
- Similarity search in protein structure databases
- DDPIn
  - feature vector extraction
  - metrics
  - querying
    - one-step approach
    - multi-step approach
- Experimental results
- Conclusion

Biological Background
- Proteins
  - molecules
  - translated from mRNA in ribosomes (DNA → RNA → protein)
  - sequence of amino acids (20 AAs)
  - each AA is coded by a codon (a triplet of nucleotides)
- The function of a protein is derived from its three-dimensional structure
  - → similar proteins have similar functions
  - similar proteins have a common ancestor
- Identifying protein structure → finding similar proteins → getting a clue to the function

Similarity Search in Protein Databases
- Similarity between a pair of proteins
  - alignment + similarity score (RMSD, TM-score, …)
  - visual inspection
  - DALI, CE, SAP, VAST, …
- Classification
  - SCOP (Structural Classification of Proteins)
  - no need for an alignment
  - indexing various features
  - PSI, PSIST, ProGreSS, CTSS, …, DDPIn

DDPIn - Overview
- Distance and Density based Protein Indexing
- Classification method
- Indexing of protein features
  - distances among Cα atoms used
  - each AA represents a feature → protein p consists of |p| features
  - various semantics used
  - based on clustering Cα atoms into rings
  - metric indexing employed (M-tree)
- kNN querying
  - outcomes of several searches are merged to obtain final results

DDPIn - Feature Extraction
- Features
  - n-dimensional vectors of real numbers
  - AA ≈ viewpoint → VPT (viewpoint tag)
- sDens
  - density of AAs in rings with a predefined width
  - sDensSSE: enhanced with SSE information
- sRad
  - widths of rings containing a predefined percentage of AAs
  - sRadSSE: enhanced with SSE information
- sDir
  - number of AAs in a ring pointing from the viewpoint
  - sDens enhanced with direction information
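The sDens and sRad semantics can be illustrated with a minimal Python sketch. It assumes Cα coordinates are given as 3D tuples, that "density" means the raw count of atoms per ring, and that rings are concentric spherical shells around the viewpoint; the slides do not state these details, and the function names `s_dens` and `s_rad` are hypothetical.

```python
import math

def s_dens(coords, idx, ring_width, n_rings):
    """Count C-alpha atoms in each concentric ring around coords[idx].

    Assumption: density is the raw atom count per ring of fixed width.
    """
    counts = [0] * n_rings
    for j, atom in enumerate(coords):
        if j == idx:
            continue  # the viewpoint itself is not counted
        ring = int(math.dist(coords[idx], atom) // ring_width)
        if ring < n_rings:
            counts[ring] += 1
    return counts

def s_rad(coords, idx, fraction, n_rings):
    """Widths of successive rings that each hold `fraction` of the other atoms.

    Assumption: rings are filled in order of increasing distance.
    """
    dists = sorted(math.dist(coords[idx], a)
                   for j, a in enumerate(coords) if j != idx)
    if not dists:
        return [0.0] * n_rings
    per_ring = max(1, round(fraction * len(dists)))
    widths, inner = [], 0.0
    for r in range(n_rings):
        last = min((r + 1) * per_ring, len(dists)) - 1
        outer = dists[last]
        widths.append(outer - inner)
        inner = outer
    return widths

# Toy chain of five atoms on a line; atom 0 is the viewpoint.
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (5, 0, 0), (9, 0, 0)]
dens = s_dens(coords, 0, ring_width=3.0, n_rings=3)
rad = s_rad(coords, 0, fraction=0.5, n_rings=2)
```

Each amino acid yields one such vector, so a protein with |p| residues produces |p| viewpoint tags.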

DDPIn - Similarity of VPTs
- Metrics
  - L2
  - weighted L2: the close neighborhood of VPs is more important
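A weighted L2 distance between two VPTs might look like the sketch below. The slides only say that the close neighborhood of a viewpoint matters more; the geometric decay of the weights and the names `weighted_l2` and `decaying_weights` are assumptions made for illustration.

```python
import math

def weighted_l2(u, v, weights):
    """Weighted Euclidean distance: sqrt(sum_i w_i * (u_i - v_i)^2)."""
    if not (len(u) == len(v) == len(weights)):
        raise ValueError("vectors and weights must have equal length")
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))

def decaying_weights(n, decay=0.5):
    """Hypothetical weighting: ring 0 (closest) gets 1, later rings get less."""
    return [decay ** i for i in range(n)]

w = decaying_weights(3)
d = weighted_l2([2, 1, 0], [0, 1, 4], w)
```

With non-negative weights this remains a metric on the rings whose weight is positive, which is what a metric index such as the M-tree requires.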

DDPIn – Indexing Structure
- M-tree (Metric tree)
- Dynamic, hierarchical indexing structure
- Data space divided into ball-shaped data regions (hyper-spheres)
  - root node represents a data region covering all data
  - children nodes represent regions covering parts of the space, …
  - data regions form a balanced hierarchical structure
- inner nodes → routing entries
- leaf nodes → ground entries
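How ball-shaped regions let a search skip parts of the data can be shown with a simplified sketch. This is not the full M-tree (no insertion, node splitting, or stored parent distances); the `Node` class and `range_search` function are hypothetical and only demonstrate pruning by the triangle inequality.

```python
from dataclasses import dataclass, field
import math

@dataclass
class Node:
    center: tuple                                  # routing object
    radius: float                                  # covering radius of the region
    children: list = field(default_factory=list)   # sub-regions (inner node)
    objects: list = field(default_factory=list)    # ground entries (leaf node)

def range_search(node, query, r, dist=math.dist):
    """Return all stored objects within distance r of query."""
    # If the query ball and the region ball do not intersect,
    # nothing inside this region can qualify.
    if dist(query, node.center) > node.radius + r:
        return []
    hits = [o for o in node.objects if dist(query, o) <= r]
    for child in node.children:
        hits.extend(range_search(child, query, r, dist))
    return hits

leaf_a = Node((0, 0), 1.5, objects=[(0, 0), (1, 0)])
leaf_b = Node((10, 0), 1.0, objects=[(10, 0), (10.5, 0)])
root = Node((5, 0), 7.0, children=[leaf_a, leaf_b])
found = range_search(root, (0.5, 0), 1.0)
```

In this toy tree the region of `leaf_b` is discarded after a single distance computation, without touching its objects.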

Querying / Classification
- One-step
  - extracting VPTs from query → n queries
  - ranking scheme
- Two-step
  - healing
  - reclassification with Smith-Waterman algorithm on sequences
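The one-step approach can be sketched as one kNN query per VPT of the query protein, followed by merging the answers. The slides do not describe the ranking scheme, so the plain majority vote below is an assumption, and the brute-force `knn` stands in for the M-tree search; both function names are hypothetical.

```python
import math
from collections import Counter

def knn(query_vec, index, k, dist=math.dist):
    """Brute-force k nearest neighbours; index holds (vector, label) pairs."""
    return sorted(index, key=lambda entry: dist(query_vec, entry[0]))[:k]

def classify(query_vpts, index, k):
    """Run one kNN query per VPT and return the most frequent label.

    Assumption: every retrieved neighbour casts one equal vote.
    """
    votes = Counter()
    for vpt in query_vpts:
        for _, label in knn(vpt, index, k):
            votes[label] += 1
    return votes.most_common(1)[0][0]

# Toy index: two superfamilies "A" and "B" with two VPTs each.
index = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((5, 6), "B")]
label = classify([(0, 0.5), (0.2, 0.2), (5, 5.5)], index, k=2)
```

A rank-aware variant could weight each vote by the neighbour's position or distance instead of counting all votes equally.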

Experimental Results
- SCOP 1.65 dataset
  - class → fold → superfamily → family
  - 1810 proteins
  - 181 superfamilies, at least 10 proteins each
  - all α, all β, α + β and α/β classes
- query set
  - reduced queries
  - full
  - used also by PSI, ProGreSS, PSIST methods
- Testing of
  - superfamily classification accuracy
  - fold classification accuracy

Finding Optimal k for kNN Queries

Accuracy of VPT Semantics

Accuracy for Increasing Dimension

Accuracy of Various Metrics

Suitability of Pairs of VPT Semantics for Healing
- identical correct classification
- identical wrong classification

Comparison of Classification Methods

Conclusion
- We have proposed
  - new representation of protein structures: distance and density of Cα atoms
  - ranking scheme
  - two-step classification
- We implemented
  - M-tree indexing for the proposed representation
  - classification against SCOP
- Experimental results
  - best results among methods using identical classification
    - 98.9% superfamily classification accuracy
    - 100% fold classification accuracy
  - comparable run time