Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Presentation transcript:

Data and Knowledge Engineering Laboratory
Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences
Minkoo Seo, Sanghyun Park, JungIm Won
{mkseo, sanghyun,
Department of Computer Science, Yonsei University

Outline
Introduction
Related Works
Basic Concept
 – Segment
 – Cluster and Look Ahead
Selectivity Analysis
Algorithm Description
 – Exact Match
 – Range Match
Experiments
Conclusion and Future Work

Introduction
Similarities in Protein Structure
 – The functional properties of proteins usually depend on their structures.
 – Many proteins are structurally similar even though their sequences are not similar at the amino acid level.
[Figure: protein hierarchy diagram with the labels Protein, Polymer Chain, Secondary Structure Element (SSE): Helices and Sheets, Loop Regions, and Amino Acids (AA); one level is marked "Our target".]

Introduction (cont'd)
Secondary Structure of Protein Sequences
 – A linear sequence of amino acids folds into a three-dimensional structure.
 – There are three basic types of folds:
   h: alpha helices
   e: beta sheets
   l: turns or loops
 – These types normally occur in runs. For example, 'lhhheeeelll' is more likely to occur than 'helhehhll'.
Primary:     GQISDSIEEKRGFFSTKR..
Secondary:   HLLLLLLLLLLHHHEEEE..
Probability:

Related Works
BLAST and FASTA
 – These exhaustive search techniques are often parallelized or adapted to run on specialized high-end hardware.
 – They do not solve the fundamental problem: the query must still be compared against every sequence in the collection.

Related Works (cont'd)
Segment-Based Indexing
L. Hammel and J. M. Patel, "Searching on the Secondary Structure of Protein Sequences," In Proc. VLDB Conference, 2002
 – The indexing scheme is based on segmenting the secondary structure of protein sequences.
 – Index selectivity is not uniform.
 – Gaps can be specified only as whole segments.

Segment
Sample Protein Sequence
Protein ID:  1
Primary:     GQISDSIEEKRGFFSTKR..
Secondary:   HLLLLLLLLLLHHHEEEE..
Probability:

Segmentation Process
Segments:   H    LLLLLLLLLL  HHH   EEEE
Type, Len:  H,1  L,10        H,3   E,4
Loc:        0    1           11    14

ID  Type  Length  Start Loc.
1   H     1       0
1   L     10      1
1   H     3       11
1   E     4       14
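The segmentation step amounts to run-length encoding the secondary-structure string. The following is a minimal Python sketch, assuming each row is (protein ID, type, length, start location) as in the segment table; the function name `segment` is illustrative and does not come from the slides.

```python
from itertools import groupby


def segment(protein_id, secondary):
    """Run-length encode a secondary-structure string into segment rows.

    Each row is (protein_id, type, length, start_loc), with start_loc 0-based.
    """
    rows = []
    loc = 0
    for sse_type, run in groupby(secondary):
        length = len(list(run))
        rows.append((protein_id, sse_type, length, loc))
        loc += length
    return rows


# The sample sequence from the slide: one H, ten L, three H, four E.
sample = "H" + "L" * 10 + "HHH" + "EEEE"
rows = segment(1, sample)
for row in rows:
    print(row)
```

On the sample sequence this yields the four rows of the segment table, with start locations 0, 1, 11 and 14.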

Segment (cont'd)
Selectivity of Type+Length
 – Index selectivity is not uniform.

Cluster and Look Ahead

Segment table
ID  Type  Length  Start Loc.
1   h     1       0
1   l     10      1
1   h     3       11
1   e     4       14

Cluster table, k=0
ID  Type Str  Length  Start Loc.  Look Ahead
1   h         1       0           lhe
1   l         10      1           he
1   h         3       11          e
1   e         4       14

Cluster table, k=1
ID  Type Str  Length  Start Loc.  Look Ahead
1   hl        11      0           he
1   lh        13      1           e
1   he        7       11

Cluster table, k=2
ID  Type Str  Length  Start Loc.  Look Ahead
1   hlhe      18      0

 – 2^k consecutive segments are combined and stored in the Cluster table, so each index entry covers several type characters instead of exactly one.
 – The 'Look Ahead' field contains the 'Type' fields of the next n segments.
 – The maximum value of k can be limited according to the query length.
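The cluster tables can be reproduced with a sliding window over the segment rows. This is a rough sketch only: the name `build_clusters` is hypothetical, and the look-ahead width n=3 is an assumption chosen because it reproduces the k=0 table on this slide; the slides do not state the value of n.

```python
def build_clusters(segments, k, look_ahead=3):
    """Combine every run of 2**k consecutive segments into one cluster row.

    segments: rows (protein_id, type, length, start_loc) of a single protein,
    ordered by start_loc. Returns rows
    (protein_id, type_str, total_length, start_loc, look_ahead_types).
    """
    width = 2 ** k
    rows = []
    for i in range(len(segments) - width + 1):
        window = segments[i:i + width]
        following = segments[i + width:i + width + look_ahead]
        rows.append((
            window[0][0],                       # protein ID
            "".join(s[1] for s in window),      # concatenated types
            sum(s[2] for s in window),          # summed lengths
            window[0][3],                       # start of the first segment
            "".join(s[1] for s in following),   # types of the next segments
        ))
    return rows


segments = [(1, "h", 1, 0), (1, "l", 10, 1), (1, "h", 3, 11), (1, "e", 4, 14)]
for k in range(3):
    print(k, build_clusters(segments, k))
```

Limiting k, as the slide suggests, simply means calling this for k = 0 up to the chosen maximum.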

Index Selectivity
 – The cluster table is searched by TypeStr+TypeLen+LookAhead.
 – The segment table is searched by Type+TypeLen.
 – Cluster indices are 30~3,000 times more selective than the segment index, though the exact length information is lost.
Number of proteins = 320,000

LookAhead Selectivity
 – LookAhead improves selectivity 2~34 times.
Number of proteins = 320,000

Exact Match
Query → Compute k → Clustering → Searching → Sort Merge → Post Processing
 – Clustering: ① k=2 ② k=2
 – Searching: ① TypeStr=hlhe, TypeLen= , LookAhead=l ② TypeStr=lhel, TypeLen=
 – Sort Merge: ID must be equal. Loc difference must be 1(= ).
 – Post Processing: Validate the result using the actual sequences.
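The sort-merge step joins the candidate rows returned by the two cluster searches. The sketch below is one way to express it in Python, assuming each search returns (protein ID, start location) pairs that are unique within its list; `sort_merge` and its `offset` parameter are illustrative names, not taken from the slides.

```python
def sort_merge(first, second, offset):
    """Join two candidate lists of (protein_id, start_loc) rows.

    A pair survives when both rows share a protein ID and the second
    cluster starts exactly `offset` positions after the first.
    Returns the surviving rows of `first`.
    """
    left = sorted(first)
    # Shift the second list back by the offset so matching rows compare equal.
    right = sorted((pid, loc - offset) for pid, loc in second)
    matches = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            matches.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return matches


# Hypothetical candidates: cluster 1 hits and cluster 2 hits.
hits_1 = [(3, 7), (1, 0), (2, 5)]
hits_2 = [(2, 9), (3, 8), (1, 1)]
joined = sort_merge(hits_1, hits_2, offset=1)
print(joined)
```

Rows that survive the join are still only candidates; as the slide notes, they are validated against the actual sequences in post-processing.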

Range Match
Clustering → Searching
 – Clustering: TypeStr=elhe, TypeLen=64( )~77( )
 – The range of TypeLen becomes too wide.
Clustering → Comparing Histograms → Searching
 – Candidate clusters:
   ① k=2, TypeLen=64~77
   ② k=1, TypeLen=53~63
   ③ k=1, TypeLen=8~9
   ④ k=1, TypeLen=11~14
 – Search the cluster that is expected to return the smallest number of rows.
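Choosing the cluster to search can be sketched as a minimum over histogram estimates. Everything below is an assumption for illustration: the histogram layout ((TypeStr, TypeLen) mapped to a row count), the function names, the k=1 type strings (obtained by splitting "elhe" into adjacent pairs), and the counts, which are made up and are not measurements from the slides.

```python
def estimated_rows(histogram, type_str, lo, hi):
    """Sum the histogram counts for one TypeStr over an inclusive TypeLen range."""
    return sum(histogram.get((type_str, length), 0) for length in range(lo, hi + 1))


def pick_cluster(histogram, candidates):
    """Return the candidate (type_str, lo, hi) with the smallest estimated row count."""
    return min(candidates, key=lambda c: estimated_rows(histogram, *c))


# TypeLen ranges from the slide; the k=1 type strings are assumed.
candidates = [("elhe", 64, 77), ("el", 53, 63), ("lh", 8, 9), ("he", 11, 14)]

# Toy histogram with made-up counts: (TypeStr, TypeLen) -> number of rows.
histogram = {
    ("elhe", 70): 40,
    ("el", 60): 500,
    ("lh", 8): 900,
    ("lh", 9): 700,
    ("he", 12): 1200,
}

best = pick_cluster(histogram, candidates)
print(best)
```

With these toy counts the k=2 cluster is selected, but a different histogram could just as well favor one of the k=1 clusters, which is the point of comparing them before searching.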

Environment
Oracle Database on Linux
 – Stores the cluster table and cluster indices.
JDK 1.4.2_04 on Windows XP
 – Runs the search program.
Protein Data Used
 – 80,000 amino acid sequences were downloaded from PDB (Protein Data Bank).
 – The secondary structures of the protein sequences were predicted using PREDATOR.
 – 320,000 sequences were populated by duplicating the 80,000 sequences 4 times.

Exact Match
 – Our method is 3~20 times faster than MISS(2) and 8~49 times faster than SSS and ISS.

Range Match
[Figure panels: Range Match, First; Range Match, Middle; Range Match, End]
 – SCRM is 1.2~6 times faster than MISS(2).

Scalability
[Figure panels: Exact Match, |Q|=5; Range Match (Middle), |Q|=5]
 – Our methods are 2~4 times faster than MISS(2).

Index Size
 – Index size increases linearly.
 – The maximum value of k can be set according to the expected length of queries.
 – In most cases, a maximum k of 2 will suffice.
Number of proteins = 320,000

Conclusion and Future Work
 – We have proposed clustered indexing, which indexes a fixed number of characters, and look ahead, which enhances index selectivity.
 – Our experiments show that the proposed techniques are more efficient than the previous algorithms.
 – In the future, we would like to study and propose approximate searching algorithms for the 3-D structures of proteins.

Q & A