COMP 555 Bioalgorithms (Fall 2012), 10/30/2012
Lecture 16: Combinatorial Pattern Matching
Study Chapter 9.1 – 9.5


Repeat Finding
Example of repeats:
– ATGGTCTAGGTCCTAGTGGTC
Motivation to find them:
– Phenotypes arise from copy-number variations
– Genomic rearrangements are often associated with repeats
– Trace evolutionary secrets
– Many tumors are characterized by an explosion of repeats

Repeat Finding
Near matches are more difficult:
– ATGGTCTAGGACCTAGTGTTC
Motivation to find them:
– Phenotypes arise from copy-number variations
– Genomic rearrangements are often associated with repeats
– Trace evolutionary secrets
– Many tumors are characterized by an explosion of repeats

l-mer Repeats
Long repeats are difficult to find
Short repeats are easy to find
Strategy for finding long repeats:
– Find exact repeats of short l-mers (l is usually 10 to 13)
– Extend l-mer repeated seeds into longer, maximal repeats

l-mer Repeats (cont'd)
There are typically many locations where an l-mer is repeated:
GCTTACAGATTCAGTCTTACAGATGGT
The 4-mer TTAC starts at locations 3 and 17 (counting from 1)

Extending l-mer Repeats
GCTTACAGATTCAGTCTTACAGATGGT
Extend these 4-mer matches in both directions: the two copies of TTAC (locations 3 and 17) grow into the maximal repeat CTTACAGAT (locations 2–10 and 16–24)
– Maximal repeats cannot be extended in either direction
– To find maximal repeats in this way, we need ALL start locations of all l-mers in the genome
– Hashing lets us find repeats quickly in this manner
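The extension step can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function name `extend_seed` and its 0-based arguments are assumptions made for the example.

```python
def extend_seed(genome, i, j, l):
    """Extend an exact l-mer match at 0-based starts i < j into a maximal repeat.

    Returns (start of first copy, start of second copy, repeated string).
    Hypothetical helper, named here for illustration only.
    """
    # Grow to the left while the preceding characters agree.
    left = 0
    while i - left - 1 >= 0 and genome[i - left - 1] == genome[j - left - 1]:
        left += 1
    # Grow to the right while the following characters agree.
    right = 0
    while j + l + right < len(genome) and genome[i + l + right] == genome[j + l + right]:
        right += 1
    length = left + l + right
    return i - left, j - left, genome[i - left:i - left + length]


genome = "GCTTACAGATTCAGTCTTACAGATGGT"
# TTAC starts at 1-based locations 3 and 17, i.e. 0-based 2 and 16.
print(extend_seed(genome, 2, 16, 4))  # (1, 15, 'CTTACAGAT')
```

The sketch lets the two copies overlap; a stricter version could stop the right-hand extension once the first copy reaches the start of the second.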

Hashing: What is it?
How hashing works:
– Generate an integer "key" from an arbitrary record
– Store the record in a data structure indexed by this integer key
Hashing is a very efficient way to store and retrieve data
– e.g., Python dictionaries are hashes

Hashing: Definitions
– Hash table: array used in hashing
– Records: data stored in a hash table
– Keys: identify sets of records
– Hash function: uses a key to generate the index at which a record is inserted into the hash table
– Collision: when more than one record is mapped to the same index in the hash table

Hashing: Example
Where do the animals eat?
Key: each animal

Hashing DNA sequences
– Each l-mer can be translated into a base-4 integer, key = quaternary(seq), where A, T, C, G are represented as 0, 1, 2, 3 (equivalently, a binary string with two bits per base)
– After assigning a unique integer to each l-mer, it is easy to store the starting locations of every occurrence of each l-mer in a genome of length n in O(ln) time
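As a minimal sketch of the key computation, assuming the digit assignment A=0, T=1, C=2, G=3 given on the slide (the dictionary name `CODE` is an assumption of this example):

```python
# Digit assigned to each base, as on the slide: A, T, C, G -> 0, 1, 2, 3.
CODE = {"A": 0, "T": 1, "C": 2, "G": 3}


def quaternary(seq):
    """Read an l-mer as a base-4 number, most significant base first."""
    key = 0
    for base in seq:
        key = key * 4 + CODE[base]
    return key


print(quaternary("TTAC"))  # 82, since 1*64 + 1*16 + 0*4 + 2 = 82
```

Each call costs O(l), so computing the key of all n - l + 1 windows independently gives the O(ln) bound quoted above.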

Hashing: Maximal Repeats
To find repeats in a genome:
– For every l-mer in the genome, note its starting position and its sequence
– Generate a hash table index for each unique l-mer sequence
– At each index of the hash table, store all genome start locations of the l-mer that generated that index
– Extend l-mer repeats to maximal repeats
Problem: as l gets big, the number of possible patterns becomes much larger than the genome's length (4^l >> n)
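The indexing steps can be sketched with a Python dictionary standing in for the hash table. This is an illustration under that assumption; the function names `index_lmers` and `repeated_lmers` are hypothetical, and the l-mer string itself is used as the dictionary key instead of an explicit integer.

```python
from collections import defaultdict


def index_lmers(genome, l):
    """Map every l-mer to the list of its 1-based start locations."""
    table = defaultdict(list)
    for start in range(len(genome) - l + 1):
        table[genome[start:start + l]].append(start + 1)
    return table


def repeated_lmers(genome, l):
    """Keep only the l-mers that occur more than once (the repeat seeds)."""
    return {lmer: locs for lmer, locs in index_lmers(genome, l).items() if len(locs) > 1}


genome = "GCTTACAGATTCAGTCTTACAGATGGT"
print(index_lmers(genome, 4)["TTAC"])  # [3, 17]
print(sorted(repeated_lmers(genome, 4)))
```

The repeated 4-mers found this way are the seeds that the extension step then grows into maximal repeats.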

Hashing: Collisions
– Generate hash keys from a reduced space, e.g. key = quaternary(seq) % (N / l)
– Leads to possible collisions
Dealing with collisions:
– "Chain" tuples of (l-mer, start location) pairs in a linked list
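A small sketch of chaining, under stated assumptions: the number of slots is passed in as a parameter (`num_slots`, a name chosen for this example) rather than fixed to N / l, and each chain is a Python list rather than a linked list.

```python
CODE = {"A": 0, "T": 1, "C": 2, "G": 3}


def quaternary(seq):
    """Read an l-mer as a base-4 number."""
    key = 0
    for base in seq:
        key = key * 4 + CODE[base]
    return key


def build_chained_table(genome, l, num_slots):
    """Hash every l-mer into one of num_slots chains of (l-mer, location) tuples."""
    table = [[] for _ in range(num_slots)]
    for start in range(len(genome) - l + 1):
        lmer = genome[start:start + l]
        table[quaternary(lmer) % num_slots].append((lmer, start + 1))
    return table


def lookup(table, lmer):
    """Scan one chain and keep only the entries whose l-mer really matches."""
    chain = table[quaternary(lmer) % len(table)]
    return [loc for stored, loc in chain if stored == lmer]


genome = "GCTTACAGATTCAGTCTTACAGATGGT"
table = build_chained_table(genome, 4, len(genome) // 4)
print(lookup(table, "TTAC"))  # [3, 17]
```

Storing the l-mer next to each location is what makes the collision harmless: a lookup compares the stored l-mer before reporting a location.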

Hashing: Summary
When finding genomic repeats from l-mers:
– Generate a hash table index for each l-mer sequence
– At each index, store all genome start locations of the l-mer that generated that index
– Extend l-mer repeats to maximal repeats

Pattern Matching
What if, instead of finding repeats in a genome, we want to find all positions of a particular sequence in a given sequence?
This leads us to a different problem, the Pattern Matching Problem

Pattern Matching Problem
Goal: Find all occurrences of a pattern in a text
Input: Pattern p = p_1 … p_n and text t = t_1 … t_m
Output: All positions 1 ≤ i ≤ m – n + 1 such that the n-letter substring of t starting at i matches p
Motivation: Searching a database for a known pattern

Exact Pattern Matching: A Brute-Force Algorithm
PatternMatching(p, t)
1 n ← length of pattern p
2 m ← length of text t
3 for i ← 1 to (m – n + 1)
4     if t_i … t_(i+n–1) = p
5         output i
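A direct Python rendering of the pseudocode might look as follows. The function name `pattern_matching` is an assumption of this sketch, and positions are reported 1-based to match the slides.

```python
def pattern_matching(p, t):
    """Brute-force exact matching: return every 1-based position where p occurs in t."""
    n, m = len(p), len(t)
    # Python slices are 0-based, so window i covers t[i:i+n]; report i + 1.
    return [i + 1 for i in range(m - n + 1) if t[i:i + n] == p]


print(pattern_matching("GCAT", "CGCATC"))  # [2]
```

If the pattern is longer than the text, the range is empty and the function returns an empty list.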

Exact Pattern Matching: An Example
PatternMatching algorithm for:
– Pattern GCAT
– Text CGCATC
Slide the pattern along the text and compare at each position:
– i = 1: CGCA vs. GCAT, no match
– i = 2: GCAT vs. GCAT, match, output 2
– i = 3: CATC vs. GCAT, no match

Exact Pattern Matching: Running Time
– PatternMatching runtime: O(nm) in the worst case
– On typical inputs it behaves more like O(m): the comparison in line 4 rarely examines close to n characters before it finds a mismatch
– Worst case: find "AAAAT" in "AAAAAAAAAAAAAAAT"
Better solution: suffix trees
– Can solve the problem in O(m) time
– Conceptually related to keyword trees
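To make the gap between the typical and the worst case concrete, the sketch below counts single-character comparisons, stopping each window at its first mismatch. The helper name `count_comparisons` is hypothetical, and the worst-case text is written as fifteen A's followed by a T.

```python
def count_comparisons(p, t):
    """Count character comparisons made by brute-force matching of p against t."""
    n, m = len(p), len(t)
    comparisons = 0
    for i in range(m - n + 1):
        for k in range(n):
            comparisons += 1
            if t[i + k] != p[k]:
                break  # first mismatch ends this window
    return comparisons


print(count_comparisons("GCAT", "CGCATC"))          # 6
print(count_comparisons("AAAAT", "A" * 15 + "T"))   # 60
```

In the second call every one of the 12 windows needs all 5 comparisons, which is the n(m - n + 1) behaviour behind the O(nm) bound.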

Keyword Trees: Example
Keyword tree, built up one word at a time:
– Apple
– Apropos
– Banana
– Bandana
– Orange

Keyword Trees: Properties
– Stores a set of keywords in a rooted labeled tree
– Each edge is labeled with a letter from an alphabet
– Any two edges coming out of the same vertex have distinct labels
– Every keyword stored can be spelled on a path from the root to some leaf
– Searches are performed by "threading" the target pattern through the tree
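One way to sketch these properties in Python is with nested dictionaries, where each key is an edge label and each value is the child vertex. This representation and the names `build_keyword_tree` and `thread` are assumptions of the example, not the lecture's implementation.

```python
def build_keyword_tree(keywords):
    """Build a keyword tree as nested dicts: one edge (dict key) per letter."""
    root = {}
    for word in keywords:
        node = root
        for letter in word:
            # Reuse the edge if it exists, so labels out of a vertex stay distinct.
            node = node.setdefault(letter, {})
    return root


def thread(tree, pattern):
    """Return how many leading letters of pattern can be followed from the root."""
    node = tree
    depth = 0
    for letter in pattern:
        if letter not in node:
            break
        node = node[letter]
        depth += 1
    return depth


tree = build_keyword_tree(["apple", "apropos", "banana", "bandana", "orange"])
print(thread(tree, "apple"))   # 5
print(thread(tree, "appeal"))  # 3
```

Because shared prefixes reuse the same edges, "apple" and "apropos" branch only after "ap", and "banana" and "bandana" only after "ban".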

Keyword Trees: Threading
Thread "appeal" through the tree one letter at a time:
– appeal
– The letters a, p, p can be followed from the root, but the vertex reached by "app" has no outgoing edge labeled e, so threading stops there

Keyword Trees: Threading (cont'd)
Thread "apple" through the tree one letter at a time:
– apple
Now thread "band", "or", and the nonsense word "apro"
How do you tell "real" words from nonsense? (Homework problem)

Multiple Pattern Matching Problem
Goal: Given a set of patterns and a text, find all occurrences of any of the patterns in the text
Input: k patterns p_1, …, p_k, and text t = t_1 … t_m
Output: Positions 1 ≤ i ≤ m where the substring of t starting at i matches p_j for some 1 ≤ j ≤ k
Motivation: Searching a database for multiple known patterns

Multiple Pattern Matching: Straightforward Approach
Can solve as k "Pattern Matching Problems"
– Runtime: O(kmn), using the PatternMatching algorithm k times
– m: length of the text
– n: average length of the patterns

Multiple Pattern Matching: Keyword Tree Approach
Or, we could use keyword trees:
– Build the keyword tree in O(N) time, where N is the total length of all patterns
– With naive threading: O(N + nm)
– Aho-Corasick algorithm: O(N + m)

Keyword Trees: Threading
To match patterns in a text using a keyword tree:
– Build the keyword tree of the patterns
– "Thread" the text through the keyword tree
– Threading is "complete" when we reach a leaf in the keyword tree
– When threading is "complete," we've found a pattern in the text
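The naive threading approach can be sketched as below. It is the O(N + nm) method, not Aho-Corasick, and it assumes that no pattern is a prefix of another, so that every pattern ends at a leaf. The nested-dictionary tree and the function names are assumptions of this example.

```python
def build_keyword_tree(patterns):
    """Keyword tree as nested dicts; an empty dict is a leaf."""
    root = {}
    for pattern in patterns:
        node = root
        for letter in pattern:
            node = node.setdefault(letter, {})
    return root


def multiple_pattern_matching(patterns, text):
    """Thread the text through the tree from every start position (naive threading)."""
    tree = build_keyword_tree(patterns)
    hits = []
    for i in range(len(text)):
        node = tree
        j = i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if not node:  # reached a leaf: threading is complete
                hits.append((i + 1, text[i:j]))
                break
    return hits


print(multiple_pattern_matching(["GCAT", "ATC"], "CGCATC"))  # [(2, 'GCAT'), (4, 'ATC')]
```

Each start position restarts at the root, which is where the nm term comes from; Aho-Corasick avoids those restarts with failure links.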

Suffix Trees = Collapsed Keyword Trees
A suffix tree holds all suffixes of a given sequence. It is similar to a keyword tree, except that vertices of degree 2 are removed and the strings on either side are merged:
– Each edge is labeled with a substring of the text
– All internal vertices have at least three edges
– Terminal vertices (leaves) are labeled by the index of the suffix they spell, i.e. its start position in the text
All suffixes of ATCATG:
– ATCATG
– TCATG
– CATG
– ATG
– TG
– G

Suffix Tree of a Text
– Construct a keyword tree from all suffixes of a text
– Collapse non-branching paths into an edge (path compression)
Suffixes: ATCATG, TCATG, CATG, ATG, TG, G
[Figure: the keyword tree of these suffixes and the suffix tree obtained by collapsing it]
How much time does it take? Quadratic: the time is linear in the total size of all suffixes, which is quadratic in the length of the text
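The size difference between the two trees can be checked with a short sketch. It builds the keyword tree of all suffixes as nested dictionaries (the quadratic construction) and then counts how many vertices would survive path compression, without actually merging edge labels. All function names are assumptions of this example.

```python
def suffix_keyword_tree(text):
    """Keyword tree (nested dicts) of all suffixes of text: the quadratic construction."""
    root = {}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.setdefault(letter, {})
    return root


def count_nodes(node):
    """Number of vertices in the uncompressed keyword tree."""
    return 1 + sum(count_nodes(child) for child in node.values())


def count_collapsed_nodes(node, is_root=True):
    """Vertices left after path compression: the root, branching vertices, and leaves."""
    keep = 1 if (is_root or len(node) != 1) else 0
    return keep + sum(count_collapsed_nodes(child, False) for child in node.values())


tree = suffix_keyword_tree("ATCATG")
print(count_nodes(tree))            # 19
print(count_collapsed_nodes(tree))  # 9
```

For ATCATG the collapsed tree keeps the root, the two branching vertices reached by "AT" and "T", and the six leaves.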

Suffix Trees: Advantages
– With careful bookkeeping, a text's suffix tree can be constructed in a single pass over the text
– Thus, a suffix tree can be built faster than by building the keyword tree of all suffixes and then collapsing it
– Keyword tree of suffixes, then collapse: quadratic
– Direct suffix tree construction: linear (Weiner, McCreight, and Ukkonen suffix tree algorithms)

Suffix Tree Construction
Few books, including ours, delve into the details of suffix tree construction algorithms, because of their reputation for being overly complicated.
Weiner's and McCreight's original linear algorithms for constructing a suffix tree had some disadvantages. Principal among them was the requirement that the tree be built in reverse order, meaning that the tree was grown incrementally by adding characters from the end of the input. This ruled them out for on-line processing.

Ukkonen's Clever Bookkeeping
Esko Ukkonen's construction works from left to right and is incremental. Each step transforms the suffix tree of the prefix ending at the i-th character into the suffix tree of the prefix ending at the (i+1)-th character.
[Figure: suffix trees for the successive prefixes B, BO, BOO, and BOOK. Example from Mark Nelson, Dr. Dobb's Journal, August 1996]

Tree Properties
Extensions are done by threading each new prefix through the tree and visiting each of the suffixes of the current tree. At each step we start at the longest suffix (BOOK) and work our way down to the shortest (the empty string).
Each suffix ends at one of three types of node:
– A leaf node (1, 2, 4, 5)
– An explicit node (0, 3)
– An implicit node (between characters of a substring labeling an edge, such as BO, BOO, and OO)
[Figure: the suffix tree for BOOK with numbered nodes]

Observations
– There are 5 suffixes in the tree (including the empty string) after adding BOOK
– They are represented by the root and 4 leaves
– Adding the next letter, another 'K', requires visiting each of the suffixes in the existing tree, in order of decreasing length, and adding the letter 'K' to its end
– Adding a character to a leaf node never creates a new explicit node, regardless of the letter
– If the root already has an edge labeled 'K', we just extend it
[Figure: the suffix tree for BOOKK]

Split and Add Update
– The next step is to add an 'E' to our tree
– As before, add 'E' to each suffix in order of decreasing length: BOOKK, OOKK, OKK, KK, K
– The first suffix that does not terminate at a leaf is called the "active point" of the suffix tree
[Figure: the tree after adding 'E' to BOOKK, OOKK, OKK, and KK; then the split and add steps for the suffix K; the empty string remains to be updated]

Updating an Explicit Node
After updating suffix K, we still have to update the next shorter suffix, which is the empty string.
[Figure: the suffix tree for BOOKKE after the final update]

Generalizing
– Once a leaf node, always a leaf node
– Additional characters only extend the edge leading to the leaf (leaves are easy)
– When adding a new leaf, its edge will represent all characters from the i-th suffix's starting point to the end of the text at step i+1. Because of this, once a leaf is created, we can just forget about it. If the edge is later split, its start may change, but it will still extend to the end.
– This means that we only need to keep track of the active point in each tree, and update from there

One Last Detail
– The algorithm sketch so far glosses over one detail: at each step of an update we need to keep track of the next smaller suffix from the i-th update
– To do this, a suffix pointer is kept at each internal node
– For pseudocode: Mark Nelson, "Fast String Searching with Suffix Trees", Dr. Dobb's Journal, August 1996
– For proofs of linear space/time performance: E. Ukkonen, "On-line construction of suffix trees", Algorithmica, 14(3): , September 1995
[Figure: the suffix tree for ABABABC with suffix pointers shown as dashed lines]

Use of Suffix Trees
Suffix trees hold all suffixes of a text T
– e.g., ATCGC: ATCGC, TCGC, CGC, GC, C
– Builds in O(m) time for a text of length m
To find any pattern P in a text:
– Build the suffix tree for the text, O(m), m = |T|
– Thread the pattern through the suffix tree
– Can find the pattern in O(n) time! (n = |P|)
O(|T| + |P|) time for the "Pattern Matching Problem" (better than the naive O(|P||T|))
– Build the suffix tree and look up the pattern
Multiple Pattern Matching in O(|T| + k|P|)

Pattern Matching with Suffix Trees
SuffixTreePatternMatching(p, t)
1 Build suffix tree for text t
2 Thread pattern p through suffix tree
3 if threading is complete
4     traverse all paths from the threading's endpoint to leaves and output their positions
5 else
6     output "Pattern does not appear in text"
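The same control flow can be sketched in Python. To keep the example short it uses an uncompressed keyword tree of suffixes built in quadratic time, not a linear-time suffix tree, so it shows the threading and leaf traversal but not the O(m) construction. It assumes the character "$" appears in neither text nor pattern, since "$" is used here as the key that stores each leaf's start position; an empty result list stands in for the "does not appear" message.

```python
def build_suffix_trie(text):
    """Naive (quadratic) tree of all suffixes; '$' maps to the suffix's 1-based start."""
    root = {}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.setdefault(letter, {})
        node["$"] = start + 1  # leaf label: where this suffix begins
    return root


def leaf_positions(node):
    """Collect the start positions stored at all leaves below node."""
    positions = []
    for key, child in node.items():
        if key == "$":
            positions.append(child)
        else:
            positions.extend(leaf_positions(child))
    return positions


def suffix_tree_pattern_matching(p, t):
    """Thread p through the suffix structure of t; return sorted 1-based match positions."""
    node = build_suffix_trie(t)
    for letter in p:
        if letter not in node:
            return []  # threading failed: pattern does not appear in text
        node = node[letter]
    return sorted(leaf_positions(node))


print(suffix_tree_pattern_matching("ATG", "ATGCATACATGG"))  # [1, 9]
```

Every leaf below the threading endpoint is a suffix that starts with the pattern, which is why collecting their labels yields all occurrences.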

Suffix Trees: Example
The suffix tree for ATGCATACATGG, threading the pattern ATG
Suffixes and their start positions:
1 ATGCATACATGG
2 TGCATACATGG
3 GCATACATGG
4 CATACATGG
5 ATACATGG
6 TACATGG
7 ACATGG
8 CATGG
9 ATGG
10 TGG
11 GG
12 G
What is the shortest pattern that occurs only once in the string?
What letter occurs most frequently?

Multiple Pattern Matching: Summary
Keyword and suffix trees are useful data structures supporting various pattern-finding problems
Keyword trees:
– Build the keyword tree of the patterns, and thread the text through it
Suffix trees:
– Build the suffix tree of the text, and thread the patterns through it

Suffix Trees: Theory vs. Practice
In concept, suffix trees are extremely powerful for making a variety of queries concerning a sequence
– What is the shortest unique substring?
– How many times does a given string appear in a text?
Despite the existence of linear-time construction algorithms and O(m) search times, suffix trees are still rarely used for genome-scale searching
– Large storage overhead
Close cousins of the suffix tree (suffix arrays and Burrows-Wheeler transforms) are more common
– Next lecture