Indexing DNA Sequences Using q-Grams

Indexing DNA Sequences Using q-Grams Adriano Galati & Bram Raats

Indexing DNA Sequences Using q-Grams A method for efficiently indexing DNA sequences based on q-grams, to facilitate similarity search in a DNA database and to sidestep a linear scan of the entire database. Proposed: a hash table and c-trees, both based on the q-grams. These data structures allow quick detection of sequences similar to a query.

Introduction Two sequences share a certain number of q-grams if the edit distance (ed) between them is within a certain threshold. Since the DNA alphabet has 4 letters, there are 4^q possible q-grams. A two-level index is used to prune data sequences.

Introduction(2) Two-level index to prune data sequences. First level: clusters of similar q-grams (qClusters) in the DNA are generated, and a typical hash table is built on the segments with respect to the qClusters. Second level: the segments are transformed into c-signatures based on their q-grams, and a new index called the c-signature tree (c-tree) is proposed to organize the c-signatures of all segments of a DNA sequence for search efficiency.

Edit distance To process approximate matching, one common and simple approximation metric is the edit distance. Definition: the edit distance between two sequences is the minimum number of single-character edit operations (insertions, deletions and substitutions) needed to transform the first sequence into the second.
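
The definition above can be made concrete with the classic dynamic-programming computation. This is a minimal sketch of the textbook algorithm, not code from the presentation; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (textbook dynamic programming)."""
    # prev[j] is the distance between the already processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]


# The two segments differ by a single substitution (G -> T at the fourth letter).
print(edit_distance("ACGGTACT", "ACGTTACT"))  # 1
```

The table costs O(|a|·|b|) time per pair of sequences, which is why a filtering index is attractive before any such verification step.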

Preliminaries Intuition: two sequences have a large number of q-grams in common when the edit distance (ed) between them is within a certain threshold. Given a sequence S, its q-grams are obtained by sliding a window of length q over the characters of S, which yields |S| - q + 1 q-grams.
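
The sliding-window construction can be sketched in a few lines. The helper names `qgrams` and `common_qgrams` are hypothetical, and counting shared q-grams with multiplicity is an assumption made here for illustration; the slides do not say how common q-grams are counted.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list:
    """All overlapping q-grams of s, obtained by sliding a window of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def common_qgrams(s1: str, s2: str, q: int) -> int:
    """Number of q-grams shared by s1 and s2, counted with multiplicity
    (an assumption of this sketch)."""
    c1, c2 = Counter(qgrams(s1, q)), Counter(qgrams(s2, q))
    return sum((c1 & c2).values())  # multiset intersection


grams = qgrams("ACGGTACT", 2)
print(grams)       # ['AC', 'CG', 'GG', 'GT', 'TA', 'AC', 'CT']
print(len(grams))  # 7, i.e. |S| - q + 1 = 8 - 2 + 1
print(common_qgrams("ACGGTACT", "ACGTTACT", 2))  # 6
```

The last line illustrates the intuition: one substitution leaves six of the seven 2-grams shared.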

Question (Bogdan) 1. I have noticed that the segments of the database text that are considered in this method are disjoint (see page 4, Introduction). I understand that for each segment all the consecutive, non-disjoint, q-grams are taken into consideration when computing the q-cluster and the c-signature of the segment. However, I am a bit puzzled that at the border between two adjacent segments nothing is done, which means that (q-1) q-grams are disregarded at each border. Since each segment contains w-q+1 q-grams, it means that overall a ratio of approximately (q-1)/(w-q+1) of all q-grams are disregarded (if we ignore the difference of 1 between the nr. of segments and the nr. of borders between adjacent segments). For common values of q=3 and w=30, this means about 7% of the q-grams. Do you see a solution for overcoming this problem?
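
The arithmetic in this question can be checked with a short sketch. The function name `qgram_counts` and the sequence length of 3000 are illustrative assumptions; the sketch only counts the q-grams that straddle segment borders and does not propose a fix.

```python
def qgram_counts(n: int, w: int, q: int):
    """Return (q-grams in the whole sequence, q-grams inside disjoint segments)
    for a sequence of length n cut into segments of length w.
    Assumes n is a multiple of w."""
    segments = n // w
    whole = n - q + 1                # sliding window over the full sequence
    inside = segments * (w - q + 1)  # sliding window restarted in each segment
    return whole, inside


whole, inside = qgram_counts(n=3000, w=30, q=3)
lost = whole - inside            # q-grams that straddle a border: (q-1) per border
print(lost)                      # 198 = 99 borders * 2
print(round(lost / inside, 4))   # 0.0707
print(round((3 - 1) / (30 - 3 + 1), 4))  # 0.0714, the approximation in the question
```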

Answer (Bogdan) The method is an effort to improve efficiency by discarding (filtering) the regions with low sequence similarity. Approximate sequence matching is preferred to exact matching in genomic databases because of evolutionary mutations in the genomic sequences and the presence of noisy data in a real sequence database.

q-gram Signature There are 4^q kinds of q-grams. Each possible q-gram is denoted r_i. The q-gram signature is a bitmap with 4^q bits, where the i-th bit corresponds to the presence or absence of r_i. For a sequence S, the i-th bit is set to '1' if r_i occurs at least once in S, else '0'.

c-signature q-gram signature where where and when

Example c-signature P = "ACGGTACT": the q-gram signature is (01 00 00 11 00 11 10 00), with 4^2 = 16 dimensions when q = 2.
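
The q-gram signature in this example can be reproduced with the sketch below. It assumes the q-grams are ordered lexicographically over the alphabet A, C, G, T, which is consistent with the bitmap shown on the slide; the function name `qgram_signature` is hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed ordering of the four letters


def qgram_signature(s: str, q: int) -> str:
    """Bitmap with 4**q bits; bit i is '1' if the i-th q-gram (in
    lexicographic order over ALPHABET) occurs at least once in s."""
    present = {s[i:i + q] for i in range(len(s) - q + 1)}
    all_qgrams = ("".join(p) for p in product(ALPHABET, repeat=q))
    return "".join("1" if g in present else "0" for g in all_qgrams)


sig = qgram_signature("ACGGTACT", 2)
print(sig)       # 0100001100111000
print(len(sig))  # 16
```

Read in pairs, the printed bitmap is 01 00 00 11 00 11 10 00, matching the slide.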

Hash table Any DNA segment s can be encoded into a λ-bit bitmap by a coding function. A hash table of size 2^λ is built with respect to the qClusters.

Question (Jacob) I can't get my hands on the c-Trees (mentioned first on page 9). Could you please explain how such a tree is built up, because I can't figure it out.

c-Trees A group of rooted dynamic trees built for indexing c-signatures. The height l is set by the user. Given the trees T_i, each path from the root to a leaf in T_i corresponds to a c-signature string, and each internal node has children.

Example c-Trees Consider the five DNA segments: If we get trees

Example c-Trees(2)

Query Processing The hash table (HT) and the c-trees (c-T) are built on the DNA segments. The query sequence Q is also partitioned into sliding query patterns. Two-level filtering: first-level filter (FLF), hash-table-based similarity search; second-level filter (SLF), c-trees-based similarity search.

Hash Table Based Similarity Search The query pattern q_i is encoded to a λ-bit hash key h_i. The neighbors (ngbr) of h_i are enumerated: these are the λ-bit encodings of the segments that are within a given ed from q_i. Once a neighbor is enumerated, the segments in its bucket are retrieved and stored as candidates.

c-Trees Based Similarity Search Candidates are further verified by the c-trees. The c-signature of the query q is divided into c-signature strings. The algorithm retrieves the segments s that satisfy the range constraint. During query processing, for each leaf in the tree T1 are computed

Space and Time complexity Space complexity HT is for the table head for the bucket of the table Thus the total space complexity for the Hash structure is Time complexity for query Space complexity of each tree is

Question (Bogdan) I have trouble understanding the graphic in Fig. 2(a). My intuition would tell me that the more common q-grams exist in the 2 sequences, the higher the probability of finding a high score alignment between them. However, the figure seems to show the opposite: as the nr. of q-grams increases, the probability decreases. I've obviously got something mixed up here, but I can't figure out what it is. Could you please explain?

The Sensitivity vs The Number of Common q-grams

Answer (Bogdan) Sensitivity can be measured by the probability that a high-score alignment is found by the algorithm. The graph starts with a probability of almost 1 when only 1 common q-gram is required; as the required number of common q-grams increases, the probability (sensitivity) of finding the alignment decreases.