String Data Structures and Algorithms


1 String Data Structures and Algorithms
David Fernández-Baca UNAM (Mexico) (based on notes by Srinivas Aluru) slightly modified by Benny Chor

2 Why Strings?
Biological sequences can be viewed as strings, or finite series of characters, over an alphabet Σ.
There is a wealth of algorithmic theory developed for general strings that we can apply to specific biological problems.
February 5, 2019 BBSI Summer School - Iowa State University

3 Suffix Trees
S = M A L A Y A L A M $
[Figure: the suffix tree of S. The root has five outgoing edges (A, LA, M, YALAM$ and $), and the ten leaves are numbered 1 to 10 by the starting position of the suffix they represent.]

4 Suffix tree properties
For a string S of length n, the suffix tree has n leaves and at most n internal nodes, and therefore requires only linear space.
Each leaf represents a unique suffix.
Concatenating the edge labels from the root to a leaf spells out that suffix.
Each internal node represents a distinct common prefix of at least two suffixes.

5 Edge Encoding
S = M A L A Y A L A M $
[Figure: the same suffix tree with every edge label replaced by a pair (start, end) of positions in S, for example (2, 2) for A, (3, 4) for LA, (5, 10) for YALAM$, (9, 10) for M$ and (10, 10) for $.]

6 Naïve Suffix Tree Construction
1 MALAYALAM$
2 ALAYALAM$
3 LAYALAM$
4 AYALAM$
5 YALAM$
6 ALAM$
7 LAM$
8 AM$
9 M$
10 $
Before starting: Why exactly do we need this $, which is not part of the alphabet?

7 Naïve Suffix Tree Construction
[Figure: the partial tree after inserting suffixes 1 to 4. Leaf 1 (MALAYALAM$) and leaf 3 (LAYALAM$) hang directly off the root; an internal node reached by the edge A has two leaves, 2 (LAYALAM$) and 4 (YALAM$).]
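The naïve construction can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the `Node` class, the function names, and the 0-based half-open edge encoding `s[start:end]` are assumptions made for the example. Each suffix is walked down from the root, and an edge is split at the first mismatch.

```python
class Node:
    """A suffix tree node; the edge entering it is labeled s[start:end]."""

    def __init__(self, start=0, end=0, suffix=None):
        self.start = start      # 0-based, inclusive
        self.end = end          # 0-based, exclusive
        self.children = {}      # first character of edge label -> child Node
        self.suffix = suffix    # 1-based suffix number for leaves, else None


def build_suffix_tree(s):
    """Insert the suffixes of s one by one; quadratic time in the worst case."""
    assert s.count(s[-1]) == 1, "s must end with a unique terminator such as $"
    n = len(s)
    root = Node()
    for i in range(n):
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:
                # No edge starts with s[j]: hang the rest of the suffix here.
                node.children[s[j]] = Node(j, n, suffix=i + 1)
                break
            k = child.start
            while k < child.end and s[k] == s[j]:
                k += 1
                j += 1
            if k == child.end:
                node = child    # consumed the whole edge, keep descending
                continue
            # Mismatch inside the edge: split it and add a new leaf.
            mid = Node(child.start, k)
            node.children[s[child.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, n, suffix=i + 1)
            break
    return root


def leaves(node, prefix, s):
    """Yield (suffix number, root-to-leaf label) for every leaf below node."""
    label = prefix + s[node.start:node.end]
    if node.suffix is not None:
        yield node.suffix, label
    for child in node.children.values():
        yield from leaves(child, label, s)
```

Calling `dict(leaves(build_suffix_tree("MALAYALAM$"), "", "MALAYALAM$"))` maps each leaf number to the suffix spelled on its root-to-leaf path. The unique terminator guarantees that no suffix is a prefix of another, so every suffix ends at its own leaf.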

8 Finding a (short) Pattern in a (long) String
Build a suffix tree of the string.
Starting from the root, traverse a path matching the characters of the pattern.
If the traversal gets stuck, the pattern is not present in the string.
Otherwise, each leaf below the point reached gives a position of the pattern in the string.
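The search procedure can be illustrated with a short Python sketch. To keep it small, an uncompressed suffix trie (one node per character, quadratic space) stands in for the suffix tree; the traversal logic is the same. The function names and the use of the key `None` to mark a leaf are assumptions of this example.

```python
def build_suffix_trie(s):
    """Uncompressed suffix trie as nested dicts; key None stores the leaf number."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        node[None] = i + 1          # 1-based starting position of this suffix
    return root


def find_pattern(root, pattern):
    """Return the sorted 1-based positions where pattern occurs."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []               # stuck: the pattern is not in the string
        node = node[ch]
    positions = []
    stack = [node]
    while stack:                    # collect every leaf below the matched point
        current = stack.pop()
        for key, value in current.items():
            if key is None:
                positions.append(value)
            else:
                stack.append(value)
    return sorted(positions)
```

For example, `find_pattern(build_suffix_trie("MALAYALAM$"), "ALA")` walks the path A, L, A and then gathers the leaves underneath it.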

9 Finding a Pattern in a String
Find “ALA” in S = MALAYALAM$.
[Figure: the suffix tree of S with the path A, LA followed from the root; the two leaves below the point reached are 6 and 2.]
Two matches: at positions 6 and 2.

10 Finding Common Substrings
Construct a generalized suffix tree for two strings (each suffix of each string is represented).
Label each leaf with the suffix number and string label.
Each internal node with a leaf from both strings in its subtree gives a common substring.
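A small Python sketch of this idea follows. It uses an uncompressed generalized suffix trie rather than a true suffix tree, and it marks every node with the set of strings passing through it instead of labeling leaves, so terminators are not needed. These simplifications and all names are assumptions of the example. The deepest node reached by both strings spells a longest common substring.

```python
def longest_common_substring(a, b):
    """Return one longest substring shared by a and b ('' if none)."""
    root = {"ids": set(), "next": {}}
    for string_id, s in enumerate((a, b)):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["next"].setdefault(ch, {"ids": set(), "next": {}})
                node["ids"].add(string_id)   # this node lies on a suffix of s
    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["next"].items():
            if len(child["ids"]) == 2:       # subtree has suffixes of both strings
                label = prefix + ch
                if len(label) > len(best):
                    best = label
                stack.append((child, label))
    return best
```

The trie takes quadratic space, so this is only suitable for short strings; a compressed generalized suffix tree gives the same answer in linear space.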

11 Generalized Suffix Tree
String 1: WINDOW$
String 2: INDIGO$
[Figure: the generalized suffix tree of the two strings. Each of the 14 leaves is labeled with a pair (string number, suffix number), for example (1, 1) for WINDOW$ and (2, 1) for INDIGO$.]

12 Lowest Common Ancestors
The lowest common ancestor (lca) of two nodes x and y in a rooted tree is the deepest node (farthest away from the root) that is an ancestor of both x and y.
Concatenating the edge labels from the root to the lca of two leaves spells out the longest common prefix (lcp) of the two corresponding suffixes.
lca(x, y) can be found in constant time after linear preprocessing [Bender00].

13 A Useful Property
String depth(lca(i, j)) = lcp(suffix i, suffix j)
[Figure: the suffix tree of MALAYALAM$ with one lca marked. The path from the root to it spells ALA, so its string depth is 3; the leaves below it are 6 and 2.]

14 Longest Common Extension
Example strings: RAILWAY$ and GRAINY$, which share the substring RAI.
lce(1,1) = 0
lce(2,1) = 3
We’ll soon find lce’s useful in reconstructing phylogenetic trees based on whole genome/proteome sequences.

15 lce’s and lca’s
To compute lce’s for two strings S1 and S2:
Build the generalized suffix tree, T, of S1 and S2.
Compute the string depth of each node in T.
Preprocess T for lca queries.
lce(i, j) = string depth of the lca of suffix i of S1 and suffix j of S2.
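As a reference for what an lce query returns, here is a naive Python sketch that compares characters directly. It is not the constant-time method described above (generalized suffix tree plus lca preprocessing); it only pins down the definition. The function name and the 1-based indexing are assumptions of the example. The transcript does not say which example string is S1; taking S1 = GRAINY$ and S2 = RAILWAY$ reproduces the values lce(1,1) = 0 and lce(2,1) = 3.

```python
def lce(s1, i, s2, j):
    """Longest common extension of suffix i of s1 and suffix j of s2.

    i and j are 1-based. Runs in time proportional to the answer,
    unlike the constant-time suffix tree approach.
    """
    a, b = i - 1, j - 1
    length = 0
    while (a + length < len(s1) and b + length < len(s2)
           and s1[a + length] == s2[b + length]):
        length += 1
    return length
```

A function like this is handy as a test oracle when implementing the suffix tree version.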

16 Example
String 1: WINDOW$
String 2: INDIGO$
[Figure: the generalized suffix tree of WINDOW$ and INDIGO$, with leaves labeled (string number, suffix number), shown again as an example for lce queries.]

17 lce’s, revisited
Given two strings S1 and S2, we are now interested in finding, for each i, the index j such that lce(i, j) is maximal.
What is the meaning of this task?
How do we accomplish it efficiently?
Notice that computing the values lce(i, j) for all j is inefficient!

18 Palindromes
A palindrome is a string that reads the same in both directions.
E.g., CATGTAC; “red rum, sir, is murder”
Palindrome problem: find all maximal palindromes in a string S.

19 Finding Palindromes in S
Construct the reverse S’ of S.
Build the generalized suffix tree T of S and S’.
Preprocess T for lce queries.
Now what? Left as homework.
Requirement: linear time (constant time per query).
[Figure: the string S with positions q and q + 1 marked.]

20 Palindromes in DNA sequences
We sometimes need to deal with complemented palindromes.
Complementary pairs: A ↔ T, C ↔ G
E.g., ATCATGAT is a complemented palindrome.
All complemented palindromes in S can be found using a GST of S and the complement of S’.
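The definition can be checked with a few lines of Python: a string is a complemented palindrome when it equals the complement of its own reverse. The function names and the `COMPLEMENT` table are assumptions of this sketch, which only tests one string and does not find all complemented palindromes inside a longer sequence.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(s):
    """Complement every base of the reversed string."""
    return "".join(COMPLEMENT[base] for base in reversed(s))


def is_complemented_palindrome(s):
    """True if s reads the same as the complement of its reverse."""
    return s == reverse_complement(s)
```

The slide's example passes this check, while the ordinary palindrome CATGTAC does not.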

21 Suffix Array – Reducing Space
Sorted suffixes of S = M A L A Y A L A M $ (here $ sorts after every letter):
Suffix array   lcp with next   Suffix
6              3               ALAM$
2              1               ALAYALAM$
8              1               AM$
4              0               AYALAM$
7              2               LAM$
3              0               LAYALAM$
1              1               MALAYALAM$
9              0               M$
5              0               YALAM$
10             -               $
Suffixes 6 and 2 share “ALA”. Suffixes 2 and 8 share just “A”.
Each lcp value is taken between adjacent entries of the suffix array.
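The two arrays can be reproduced with a deliberately simple Python sketch. It sorts the suffixes directly, which is far from the linear-time constructions, and it treats the terminator as larger than every letter so that the order matches the slide. The function names and the `terminator` parameter are assumptions of the example.

```python
def suffix_array(s, terminator="$"):
    """1-based suffix array of s, with the terminator sorting after all letters."""
    def key(i):
        # (1, "") for the terminator compares greater than (0, ch) for any letter.
        return [(1, "") if ch == terminator else (0, ch) for ch in s[i:]]
    return [i + 1 for i in sorted(range(len(s)), key=key)]


def lcp_array(s, sa):
    """lcp[k] = length of the common prefix of suffixes sa[k] and sa[k + 1]."""
    def common_prefix(a, b):
        length = 0
        while (a + length < len(s) and b + length < len(s)
               and s[a + length] == s[b + length]):
            length += 1
        return length
    return [common_prefix(sa[k] - 1, sa[k + 1] - 1) for k in range(len(sa) - 1)]
```

For MALAYALAM$ this yields the suffix array 6 2 8 4 7 3 1 9 5 10 and one lcp value for each pair of neighbours.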

22 Pattern Search in Suffix Array
All suffixes that share a common prefix appear in consecutive positions in the array.
Pattern P can be located in the string using a binary search on the suffix array.
Naïve run-time = O(|P| · log n).
Improved to O(|P| + log n) [Manber&Myers93], and to O(|P|) [Abouelhoda et al. 02].
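The naïve O(|P| · log n) search can be sketched in Python with two binary searches that bracket the block of suffixes starting with P. For brevity the suffix array is built here with Python's default string order, in which $ sorts before the letters; that differs from the order used in the slide's table but does not affect the search. The function name and the 0-based array are assumptions of the example.

```python
def find_occurrences(s, sa, pattern):
    """Return sorted 1-based positions of pattern, given a 0-based suffix array."""
    m = len(pattern)

    def prefix(k):
        return s[sa[k]:sa[k] + m]       # first m characters of the k-th suffix

    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                      # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(i + 1 for i in sa[first:lo])
```

Each of the O(log n) steps compares at most |P| characters, which gives the naïve bound; the improved bounds need the lcp information to avoid re-comparing characters.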

23 Computing longest common prefix Values
Here S1, S2, ... denote suffix 1, suffix 2, ... of the text.
Find where S1 is in the suffix array. Compute the lcp value of S1.
Find S2 in the suffix array. Compute the lcp value of S2.
Repeat for all suffixes.
Run-time is linear (why?)
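One way to realise this procedure in linear time is sketched below in Python. It is my reading of the slide, and it corresponds to the technique usually attributed to Kasai et al.: suffixes are processed in text order, and the match length carried from suffix i to suffix i + 1 drops by at most one, so the total number of character comparisons is O(n). The function name, the 0-based indices, and the choice to store each value against the next entry in the array are assumptions of the example.

```python
def lcp_from_suffix_array(s, sa):
    """lcp[k] = common prefix length of suffixes sa[k] and sa[k + 1] (0-based sa)."""
    n = len(s)
    rank = [0] * n
    for k, start in enumerate(sa):
        rank[start] = k                 # where each suffix sits in the array
    lcp = [0] * (n - 1)
    h = 0                               # match length carried between suffixes
    for i in range(n):                  # suffixes in text order: S1, S2, ...
        if rank[i] == n - 1:
            h = 0                       # last entry has no successor
            continue
        j = sa[rank[i] + 1]             # the suffix that follows suffix i
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1                      # the next suffix keeps at least h - 1
    return lcp
```

The answer to the slide's "why?" lies in the variable `h`: it increases at most n times more than it decreases, and it decreases at most once per suffix.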

24 Example
Text           M A L A Y A L A M $
Position       1 2 3 4 5 6 7 8 9 10
Suffix array   6 2 8 4 7 3 1 9 5 10
lcp array      3 1 1 0 2 0 1 0 0 -
Sorted suffixes:
6 ALAM$
2 ALAYALAM$
8 AM$
4 AYALAM$
7 LAM$
3 LAYALAM$
1 MALAYALAM$
9 M$
5 YALAM$
10 $

25 Suffix Trees vs. Suffix Arrays
Suffix Array = lexicographic order of the leaves of the Suffix Tree
Suffix Tree = Suffix Array + lcp Array (why? Wait for the next slide)

26 Building a ST from a SA and lcp
SA    6 2 8 4 7 3 1 9 5 10
lcp   3 1 1 0 2 0 1 0 0 -
[Figure: the suffix tree being rebuilt left to right from SA and lcp. Internal nodes are annotated with their string depths, D = 1 (A), D = 3 (ALA) and D = 2 (LA), and leaves 6, 2, 8, 4, 7 and 3 have been attached so far.]
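The reconstruction can be sketched in Python with a stack that holds the rightmost path of the tree built so far. This is one common way to do it, offered as an illustration rather than as the slides' exact method. Nodes are plain dicts that record only string depths and leaf numbers, not edge labels; those choices and all names are assumptions of the example. It assumes 1-based suffix numbers and a unique terminator.

```python
def tree_from_sa_lcp(n, sa, lcp):
    """Rebuild the suffix tree shape from a suffix array and its lcp array.

    sa holds 1-based suffix numbers; lcp[k] relates sa[k] and sa[k + 1].
    """
    root = {"depth": 0, "children": [], "suffix": None}
    stack = [root]                          # rightmost path, root at the bottom
    for k, suffix in enumerate(sa):
        shared = lcp[k - 1] if k > 0 else 0
        last = None
        while stack[-1]["depth"] > shared:  # climb to string depth <= shared
            last = stack.pop()
        if stack[-1]["depth"] < shared:
            # No node at this depth yet: split the edge leading to `last`.
            inner = {"depth": shared, "children": [last], "suffix": None}
            stack[-1]["children"][-1] = inner
            stack.append(inner)
        leaf = {"depth": n - suffix + 1, "children": [], "suffix": suffix}
        stack[-1]["children"].append(leaf)
        stack.append(leaf)
    return root


def walk(node):
    """Yield node and everything below it, left to right."""
    yield node
    for child in node["children"]:
        yield from walk(child)
```

Every node is pushed and popped at most once, so the construction is linear in n. Reading the leaves of the result left to right gives back the suffix array.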

27 Some Results
Suffix trees can be constructed in O(n) time and O(n · |Σ|) space [Weiner73, McCreight76, Ukkonen92].
Suffix arrays can be constructed without using suffix trees in O(n) time [Pang&Aluru03].

28 More Applications
Suffix-prefix overlaps in fragment assembly
Maximal and tandem repeats
Shortest unique substrings
Maximal unique matches [MUMmer]
Approximate matching

29 Dealing with errors
The basic string data structures can only extract information in the absence of errors.
To deal with errors, decompose the problem into parts that do not involve errors.

30 The k-mismatch problem
Given a pattern P, a text T, and a number k, find all occurrences of P in T with at most k mismatches.
Example: P = bend, T = abentbananaend, k = 2
Match 1: bent
Match 2: bana
Match 3: aend

31 Solution
Build the GST of P and T and preprocess it for lce queries.
For each starting index i in T, do at most k lce queries to determine whether there is a k-mismatch occurrence beginning at i.
Time = O(k |T|)
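The jumping strategy can be sketched in Python as follows. To stay self-contained, the lce query is answered by direct character comparison, so this version does not achieve the O(k |T|) bound; with constant-time lce queries from a preprocessed GST, the same loop would. The function names and the 1-based result positions are assumptions of the example.

```python
def lce(p, i, t, j):
    """Naive longest common extension of p[i:] and t[j:] (0-based)."""
    length = 0
    while i + length < len(p) and j + length < len(t) and p[i + length] == t[j + length]:
        length += 1
    return length


def k_mismatch(p, t, k):
    """Return 1-based start positions in t where p matches with <= k mismatches."""
    hits = []
    m = len(p)
    for start in range(len(t) - m + 1):
        pos = 0                 # how much of p has been accounted for
        mismatches = 0
        while True:
            pos += lce(p, pos, t, start + pos)   # jump over a matching run
            if pos >= m:
                break           # reached the end of the pattern
            mismatches += 1     # position pos is a mismatch
            if mismatches > k:
                break
            pos += 1            # step over the mismatch and jump again
        if mismatches <= k:
            hits.append(start + 1)
    return hits
```

On the slide's example, `k_mismatch("bend", "abentbananaend", 2)` reports the starts of bent, bana and aend. Each starting index needs only a small number of jumps bounded in terms of k, because every jump after the first is preceded by a counted mismatch.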

32 References
M. I. Abouelhoda, S. Kurtz and E. Ohlebusch, The enhanced suffix array and its applications to genome analysis, 2nd Workshop on Algorithms in Bioinformatics, 2002.
M. A. Bender and M. Farach-Colton, The LCA Problem Revisited, LATIN, pages 88-94, 2000.
P. Ko and S. Aluru, Linear time suffix sorting, CPM, 2003.
U. Manber and G. Myers, Suffix arrays: a new method for on-line search, SIAM J. Comput., 22, 1993.
E. M. McCreight, A space-economical suffix tree construction algorithm, J. ACM, 23(2), 1976.
E. Ukkonen, Constructing suffix trees on-line in linear time, Intern. Federation of Information Processing, 1992. Also in Algorithmica, 14(3), 1995.
P. Weiner, Linear pattern matching algorithms, Proc. of the 14th IEEE Annual Symp. on Switching and Automata Theory, pp. 1-11, 1973.

