Email: 1 y.chen@uwinnipeg.ca, 2wyj1128@yahoo.com BWT Arrays and Mismatching Trees: A New Way for String Matching with k Mismatches 1Yangjun Chen, 2Yujia.

Slides:



Advertisements
Similar presentations
Bar Ilan University And Georgia Tech Artistic Consultant: Aviya Amir.
Advertisements

Genome-scale disk-based suffix tree indexing Benjarath Phoophakdee Mohammed J. Zaki Compiled by: Amit Mahajan Chaitra Venus.
1 CSC 421: Algorithm Design & Analysis Spring 2013 Space vs. time  space/time tradeoffs  examples: heap sort, data structure redundancy, hashing  string.
Multithreaded FPGA Acceleration of DNA Sequence Mapping Edward Fernandez, Walid Najjar, Stefano Lonardi, Jason Villarreal UC Riverside, Department of Computer.
1 ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases Xiaochun Yang, Honglei Liu, Bin Wang Northeastern University, China.
Goodrich, Tamassia String Processing1 Pattern Matching.
Advisor: Prof. R. C. T. Lee Speaker: Y. L. Chen
1 The Colussi Algorithm Advisor: Prof. R. C. T. Lee Speaker: Y. L. Chen Correctness and Efficiency of Pattern Matching Algorithms Information and Computation,
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
Boyer-Moore string search algorithm Book by Dan Gusfield: Algorithms on Strings, Trees and Sequences (1997) Original: Robert S. Boyer, J Strother Moore.
String Matching COMP171 Fall String matching 2 Pattern Matching * Given a text string T[0..n-1] and a pattern P[0..m-1], find all occurrences of.
Ultrafast and memory-efficient alignment of short reads to the human genome Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg Center for Bioinformatics.
Faster Algorithm for String Matching with k Mismatches Amihood Amir, Moshe Lewenstin, Ely Porat Journal of Algorithms, Vol. 50, 2004, pp Date.
1 A Fast IP Lookup Scheme for Longest-Matching Prefix Authors: Lih-Chyau Wuu, Shou-Yu Pin Reporter: Chen-Nien Tsai.
Pattern Matching COMP171 Spring Pattern Matching / Slide 2 Pattern Matching * Given a text string T[0..n-1] and a pattern P[0..m-1], find all occurrences.
Efficient algorithms for the scaled indexing problem Biing-Feng Wang, Jyh-Jye Lin, and Shan-Chyun Ku Journal of Algorithms 52 (2004) 82–100 Presenter:
Linear Time Algorithms for Finding and Representing all Tandem Repeats in a String Dan Gusfield and Jens Stoye Journal of Computer and System Science 69.
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg Center.
Compressed Index for a Dynamic Collection of Texts H.W. Chan, W.K. Hon, T.W. Lam The University of Hong Kong.
Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University.
A Fast Algorithm for Multi-Pattern Searching Sun Wu, Udi Manber May 1994.
Presented by Mario Flores, Xuepo Ma, and Nguyen Nguyen.
String Matching. Problem is to find if a pattern P[1..m] occurs within text T[1..n] Simple solution: Naïve String Matching –Match each position in the.
1 Speeding up on two string matching algorithms Advisor: Prof. R. C. T. Lee Speaker: Kuei-hao Chen, CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK,
Aligning Reads Ramesh Hariharan Strand Life Sciences IISc.
Optimizing multi-pattern searches for compressed suffix arrays Kalle Karhu Department of Computer Science and Engineering Aalto University, School of Science,
String Matching with k Mismatches Moshe Lewenstein Bar Ilan University Modified by Ariel Rosenfeld.
Improved string matching with k mismatches (The Kangaroo Method) Galil, R. Giancarlo SIGACT News, Vol. 17, No. 4, 1986, pp. 52–54 Original: Moshe Lewenstein.
MCS 101: Algorithms Instructor Neelima Gupta
Short Read Mapper Evan Zhen CS 124. Introduction Find a short sequence in a very long DNA sequence Motivation – It is easy to sequence everyone’s genome,
MCS 101: Algorithms Instructor Neelima Gupta
PatternHunter: A Fast and Highly Sensitive Homology Search Method Bin Ma Department of Computer Science University of Western Ontario.
On the Intersection of Inverted Lists Yangjun Chen and Weixin Shen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba,
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Huidan Liu, Minghua Nuo, Longlong Ma, Jian Wu, Yeping He
Generic Trees—Trie, Compressed Trie, Suffix Trie (with Analysi
RNAseq: a Closer Look at Read Mapping and Quantitation
Burrows-Wheeler Transformation Review
FastHASH: A New Algorithm for Fast and Comprehensive Next-generation Sequence Mapping Hongyi Xin1, Donghyuk Lee1, Farhad Hormozdiari2, Can Alkan3, Onur.
An Efficient Algorithm for Read Matching in DNA Databases
Indexing Graphs for Path Queries with Applications in Genome Research
Succinct Data Structures
BWT-Transformation What is BWT-transformation? BWT string compression
Reducing the Space Requirement of LZ-index
Introduction to Algorithms
Genomic Data Clustering on FPGAs for Compression
The short-read alignment in distributed memory environment
Byung Joon Park, Sung Hee Kim
13 Text Processing Hongfei Yan June 1, 2016.
Yangjun Chen, Yujia Wu Department of Applied Computer Science
Yangjun Chen, Yujia Wu Department of Applied Computer Science
CSCE350 Algorithms and Data Structure
Objective of This Course
CSC2431 February 3rd 2010 Alecia Fowler
Next-generation sequencing - Mapping short reads
Pattern Matching 12/8/ :21 PM Pattern Matching Pattern Matching
Pattern Matching 1/14/2019 8:30 AM Pattern Matching Pattern Matching.
KMP String Matching Donald Knuth Jim H. Morris Vaughan Pratt 1997.
Pattern Matching 2/15/2019 6:17 PM Pattern Matching Pattern Matching.
On the Designing of Popular Packages
A Small and Fast IP Forwarding Table Using Hashing
Space-for-time tradeoffs
BIOINFORMATICS Fast Alignment
Chap 3 String Matching 3 -.
Pattern Matching Pattern Matching 5/1/2019 3:53 PM Spring 2007
Pattern Matching 4/27/2019 1:16 AM Pattern Matching Pattern Matching
Next-generation sequencing - Mapping short reads
CS 6293 Advanced Topics: Translational Bioinformatics
Hashing.
Sequences 5/17/ :43 AM Pattern Matching.
Presentation transcript:

email: 1 y.chen@uwinnipeg.ca, 2wyj1128@yahoo.com BWT Arrays and Mismatching Trees: A New Way for String Matching with k Mismatches 1Yangjun Chen, 2Yujia Wu, Dept. Applied Computer Science, University of Winnipeg, Canada email: 1 y.chen@uwinnipeg.ca, 2wyj1128@yahoo.com Introduction Methods Experiments Conclusions By the string matching with k mismatches, we mean a problem to find all the occurrences of a pattern string r in a target string s with each occurrence having up to k positions different between r and s. This problem is important for DNA databases to support the biological research, where we need to locate all the appearances of a read (a short DNA sequence) in a genome (a very long NDA sequence) for disease diagnosis or some other purposes. Due to polymorphisms or mutations among individuals or even sequencing errors, the read may disagree in some positions at any of its occurrences in the genome. As an example, consider a target s = ccacacagaagcc, and a pattern r = aaaaacaaac. Assume that k = 4. Let us see whether there is an occurrence of r with k mismatches that starts at the third position in s. a a a a a c a a a c c c a c a c a g a a g c c At only four locations s and r have different characters, implying an occurrence of r starting at the third position of s. Note that the case k = 0 is the extensively studied string matching problem. This topic has received much attention in the research community and many efficient algorithms have been proposed, such as [1, 2, 4, 6, 7, 9, 10]. Among them, [6] and [7] are two on-line algorithms (using no indexes) with the worst-case time complexities bounded by O(kn + mlogm), where n = |s| and m = |r|. By these two methods, the mismatch information among substrings of r is used to speed up the working process. The methods discussed in [2] and [10] are also on-line strategies, but with a slightly better time complexity O(nlogk). By these two methods, the periodicity within r is utilized. Only the algorithms discussed in [4, 9] are index-based. By the method discussed in [4], a (compressed) suffix tree over s is created. Then, a brute-force tree searching is conducted to find all the possible string matchings with k mismatches. Its time complexity is bounded by O(m + n + (clogn)k/k!), where c is a very large constant. For DNA databases, this time complexity can be much worse than O(nk) since n tends to be very large and k is often set to be larger than 10. By the method discussed in [9], s is transformed to a BWT-array (denoted BWT(s)) as an index [8]. In comparison with suffix trees, BWT(s) uses much less space [3, 5]. However, the time complexity of [9] is bounded by O(mn + n), where n is the number of leaf nodes of a tree (forest) produced during the search of BWT(s). Again, this time requirement can also be much worse than the best on-line algorithm for large patterns. Thus, simply indexing s is not always helpful for k mismatches. The reason for this is that in both the above index-based methods neither mismatch information nor periodicity within r is employed, leading to a lot of redundancy, which shadows the benefits brought by indexes. However, to use such information efficiently and effectively in an indexing environment is very challenging since in this case s will no longer be scanned character by character and the auxiliary information extracted from r cannot be simply integrated into an index searching process. In this paper, we address this issue, and propose a new method for the k-mismatch problem, based on a BWT-transformation, 1. BWT Transformation In our experiments, we have tested altogether four different methods: BWT-based [9] (BWT for short), Amir’s method [2] (Amir for short), Cole’s method [4] (Cole for short), Algorithm A discussed in this paper (A( ) for short) By the Cole’s, a suffix tree for a target is constructed. (The code for constructing suffix trees is taken from the gsuffix package: http:://gsuffix.Sourceforge.net/). All the four methods are implemented in C++, compiled by GNU make utility with optimization of level 2. In addition, all of our experiments are performed on a 64-bit Ubuntu operating system, run on a single core of a 2.40GHz Intel Xeon E5-2630 processor with 32GB RAM. For the tests, five reference genomes are used: In this paper, a new method to do the string matching with k mismatches is proposed. Its main idea is to transform the reverse of target string s to BWT( ) and use the mismatch information over a pattern string r to speed up the computation. Its time complexity is bounded by O(kn + n + mlogm), where m = |r|, n = |s|, and n is the number of leaf nodes of a tree structure produced during the search of a BWT(s). Our experiments show that it has a better running time than any existing on-line and index-based algorithms. s = a1 c1 a2 g2 a3 c2 a4 $ F: $ a4 a3 a1 a2 c2 c1 g1 L: a4 c2 g1 $ c1 a3 a1 a2 L[i] = $, if J[i] = 0; L[i] = s[J[i] – 1], otherwise. LF mapping: F[i] = L[i]’s successor Rank correspondence: rankF(e) = rankL(e) where J is the suffix array of s. 2. Search Sequences Bibliography 8 7 5 1 3 6 2 4 Suffix Array F L $ a4 a4 c2 a3 g1 a1 $ a2 c1 c2 a3 c1 a1 g1 a2 search(c, <a, [2, 5]>) <a, [2, 5]> <c, [1, 2]> <a, [3, 4]> search(a, <c, [1, 2]>) Search sequence: [1] A.V. Aho and M.J. Corasick, Efficient string matching: an aid to bibliographic search, Communication of the ACM, Vol. 23, No. 1, pp. 333 -340, June 1975. [2] A. Amir, M. Lewenstein and E. Porat, Faster algorithms for string matching with k mismatches, Journal of Algorithms, Vol. 50, No. 2, Feb.2004, pp. 257-275. [3] M. Burrows and D.J. Wheeler, A block-sorting lossless data compression algorithm, 1994. [4] R. Cole, L. Gottlieb, and M. Lewenstein, Dictionary Matching and Indexing with Errors and Don’t Cares, STOC’04, pp. 91 – 100, 2004. [5] P. Ferragina and G. Manzini, Opportunistic data structures with applications. In Proc. 41st Annual Symposium on Foundations of Computer Science, pp. 390 - 398. IEEE, 2000. [6] Z. Galil and R. Giancarlo, Improved string matching with k mismatches, ACM SIGACT News, Vol. 17, Issue 4, Spring 1986, pp. 52b- 54. [7] G.M. Landau and U. Vishkin, Efficient string matching with k mismatches, Theoretical Computer Science, Vol. 43, pp. 239 – 249, 1986. [8] G.M. Landau and U. Vishkin, Efficient string matching with k mismatches, Theoretical Computer Science, Vol. 43, pp. 239 – 249, 1986. [9] H. Li and R. Durbin, Fast and accurate short read alignment with Burrows–Wheeler Transform, Bioinformatics, Vol. 25 no. 14 2009, pp. 1754–1760. [10] M. Nicolas and S. Rajasekarian, On string matching with k mismatches, https://arxiv.org/pdf/1307.1406, 2013. [11] H. Li, wgsim: a small tool for simulating sequence reads from a reference genome, https://github.com/lh3/wgsim/, 2014. Genomes Genome sizes (bp) Rat chr1 (Rnor_6.0) 290,094,217 C. merolae (ASM9120v1) 16,728,967 C. elegans (WBcel235) 103,022,290 Zebra fish (GRCz10) 1,464,443,456 Rat (Rnor_6.0) 2,909,701,677 All the pattern strings are created by simulating reads from the five genomes shown in the above table, with varying lengths and amounts. It is done by using the wgsim program included in the SAMtools package [11] with default model for single read simulation. To store BWT( ), we use 2 bits to represent a character  {a, c, g, t} and store 4 rankAll values (respectively in Aa, Ac, Ag, and At) for every 4 elements (in L) with each taking 32 bits. In the following figures, we report the average time of testing the Rat (Rnor_6.0) for matching 100 reads of length 100 to 300 bps. From this figure, we can see that Algorithm A( ) outperforms all the other three methods. But the Amir’s method is better than the other two methods. The BWT-based and the Cole’s method are comparable. However, for small k, the Cole’s is a little bit better than the BWT-based method while for large k their performances are reversed. 3. Search Trees pattern: r = tcaca; target: s = acagaca; k = 2. <-, [1, 8]> <a, [1, 4]> <c, [1, 2]> <a, [2, 3]> <g, [1, 1]> <a, [4, 4]> <c, [2, 2]> v0 v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v12 v13 P1 P2 P3 P4 <a, [3, 3]> v16 v17 r[2] = c r[3] = a r[4] = c r[5] = a r[1] = t r: <a, [4, 4>] v14 v18 <$, [-, -]> v15 v19 v11 T: 4. Mismatching Trees pattern: r = tcaca; target: s = acagaca; k = 2. <-, 0> <a, 1> <g, 4> <g, 2> <c, 1> <a, 2> <g, 1> v0 u1 u2 u3 u4 u5 u6 u7 u9 u12 P1 P2 P3 P4 u13 r[2] = c r[3] = a r[4] = c r[5] = a r[1] = t r: T: u10 <g, 3> u10 <g, 3> varying values of k varying length of reads