Published by Erick Gray. Modified over 9 years ago.
1
Compressed Pattern Matching in DNA Sequences
BARNA SAHA
2
Motivation
Large size of DNA sequences
Human genome: three billion characters over twenty-three pairs of chromosomes
Usual flow: Secondary Storage -> bring to memory -> Main Memory -> decompress -> search
Compressed pattern matching:
1. Saves main memory space
2. Reduces time complexity
3
Works on Compressed Pattern Matching
BB (Boyer-Moore on byte pair encoding) [SHIBATA:CPM00]
SASO (Super-alphabet shift-or) [FRED:IPL03]
d-BM (Derivative Boyer-Moore) [Chen:CSB04]
LZ-Grep (Boyer-Moore on LZ compressed text) [NAVARRO:SPE05]
FED (Fast matching with encoded DNA sequences) [KIM:COMB07]
Project Goal: To implement Derivative Boyer-Moore and compare it with the original Boyer-Moore and AGREP
4
Project Goals
Explore the d-BM algorithm by implementing it and comparing its time complexity with BM, AGREP (fastest algorithm so far), and LZ-Grep.
If possible, extend the method to matching multiple patterns and to inexact matching.
5
Two Steps of Implementation
1. Compression of the DNA strings (Text, Pattern)
2. Searching the compressed Pattern in the compressed Text
6
Compression Technique for Text
A=00, C=01, T=10, G=11
Four characters represented by 1 byte
Don’t-care symbols appended to make the sequence length divisible by 4
TACCGT = 10 00 01 01 11 10 xx xx
S = TACCGT, S’ = 10 00 01 01 11 10, A = 2 (number of appended don’t-care symbols)
Replace S by (S’, A)
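The encoding on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pack_dna` is hypothetical, and storing the don't-care cells as 00 is a choice made here, since the slide leaves their bit pattern unspecified.

```python
# 2-bit codes taken from the slide: A=00, C=01, T=10, G=11
CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}

def pack_dna(seq):
    """Pack a DNA string into bytes, four bases per byte.

    Returns (packed, pad), where pad is the number of don't-care
    cells appended to fill the last byte (0 to 3).
    """
    pad = (-len(seq)) % 4
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Assumption: don't-care cells are stored as 00 bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out), pad

packed, pad = pack_dna("TACCGT")
print([format(b, "08b") for b in packed], pad)
```

For S = TACCGT the first byte holds 10 00 01 01, the second holds 11 10 followed by two padding cells, and the returned pad count is 2, which corresponds to the pair (S’, A) on the slide.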
7
Compression Technique for Pattern
Text: TACT GATx
Pattern: ACTG Axxx
The byte boundaries of the pattern and the text need not line up
4 Patterns:
1. ACTG Axxx
2. xACT GAxx
3. xxAC TGAx
4. xxxA CTGA
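The four shifted layouts can be generated mechanically. The sketch below is illustrative only; `pattern_alignments` is a hypothetical helper name, and `x` stands for one don't-care cell as on the slide.

```python
def pattern_alignments(pattern):
    """Return the four byte-aligned layouts of `pattern`.

    Layout k (k = 0..3) has k leading don't-care cells ('x') and as many
    trailing ones as needed to reach a multiple of four cells.
    """
    layouts = []
    for lead in range(4):
        trail = (-(lead + len(pattern))) % 4
        cells = "x" * lead + pattern + "x" * trail
        # Split into groups of four cells, one group per byte.
        layouts.append([cells[i:i + 4] for i in range(0, len(cells), 4)])
    return layouts

for number, layout in enumerate(pattern_alignments("ACTGA"), start=1):
    print(number, " ".join(layout))
```

For the pattern ACTGA this prints the same four layouts listed on the slide.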
8
Searching
Adapt the Boyer-Moore algorithm
Needs to handle don’t-care symbols
9
Handling Don't Care
Search (P', A1, A2) in (T', B1): four cases
1. Matching P'[1] to T'[j], 1<=j<=m-1: mask A1 cells of P'[1]
2. Matching P'[i], 2<=i<=n-1, to T'[j], 1<=j<=m-1: normal search
3. Matching P'[n] to T'[j], 1<=j<=m-1: mask A2 end cells of P'[n]
4. Matching P'[n] to T'[m]: mask A2 end cells of P'[n] and B1 cells of T'[m]
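Masking don't-care cells amounts to a bitwise AND before comparing two bytes. The following is a minimal sketch, assuming the 2-bit codes from the compression slide and don't-care cells stored as 00; `cell_mask` and `bytes_match` are hypothetical names, not taken from the d-BM paper. It covers the first, second, and third cases; the fourth case also has to account for the B1 padding cells of the text and is not shown.

```python
def cell_mask(lead=0, trail=0):
    """8-bit mask with 1-bits over real cells.

    `lead` leading cells and `trail` trailing cells of the byte are
    treated as don't-care (each cell is 2 bits wide).
    """
    mask = 0xFF
    mask &= 0xFF >> (2 * lead)            # clear the leading cells
    mask &= (0xFF << (2 * trail)) & 0xFF  # clear the trailing cells
    return mask

def bytes_match(p_byte, t_byte, mask):
    """True if the two bytes agree on every cell that the mask keeps."""
    return ((p_byte ^ t_byte) & mask) == 0

# First pattern byte with A1 = 2: xxTA (0x08) against text byte GCTA (0xD8)
print(bytes_match(0x08, 0xD8, cell_mask(lead=2)))
# Last pattern byte with A2 = 1: GGAx (0xF0) against text byte GGAT (0xF2)
print(bytes_match(0xF0, 0xF2, cell_mask(trail=1)))
# Inner bytes use the full mask 0xFF, which is an ordinary comparison
print(bytes_match(0xF0, 0xF2, cell_mask()))
```

The first two comparisons succeed because the differing bits fall under masked cells; the unmasked comparison fails.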
10
Handling Don't Care
Significant overhead if these four cases have to be considered separately during both preprocessing and searching
P'=, T'=
Search Pm in Tf
If found, extend the search with Pf and Pe in T' using cases 1, 3, and 4.
11
Example
Pattern = TACTTTGGA
Text = GCTACTTTGGATGCT
P'3 = xxTA CTTT GGAx
T' = GCTA CTTT GGAT GCTx
(x marks one don't-care cell)
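The example can be checked end to end with a small script. This is a sketch under stated assumptions, not the d-BM algorithm itself: it uses a plain left-to-right scan instead of Boyer-Moore shifts, stores don't-care cells as 00, and all function names are hypothetical. It compares packed bytes only and tries each of the four pattern layouts.

```python
# 'x' is a don't-care cell; storing it as 00 is an assumption of this sketch.
CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11, "x": 0b00}

def pack_cells(cells):
    """Pack a cell string whose length is a multiple of 4 into bytes."""
    out = bytearray()
    for i in range(0, len(cells), 4):
        byte = 0
        for c in cells[i:i + 4]:
            byte = (byte << 2) | CODE[c]
        out.append(byte)
    return bytes(out)

def mask_cells(cells):
    """One mask byte per 4 cells: 11 over real cells, 00 over 'x'."""
    out = bytearray()
    for i in range(0, len(cells), 4):
        mask = 0
        for c in cells[i:i + 4]:
            mask = (mask << 2) | (0b00 if c == "x" else 0b11)
        out.append(mask)
    return bytes(out)

def compressed_find(text, pattern):
    """Return sorted start positions of `pattern` in `text`.

    Only packed bytes are compared. Naive scan, no Boyer-Moore shifts.
    """
    t_bytes = pack_cells(text + "x" * ((-len(text)) % 4))
    hits = []
    for lead in range(4):  # the four layouts of the pattern
        trail = (-(lead + len(pattern))) % 4
        cells = "x" * lead + pattern + "x" * trail
        p_bytes, p_mask = pack_cells(cells), mask_cells(cells)
        for j in range(len(t_bytes) - len(p_bytes) + 1):
            window = t_bytes[j:j + len(p_bytes)]
            if all(((p ^ t) & m) == 0
                   for p, t, m in zip(p_bytes, window, p_mask)):
                start = 4 * j + lead
                # Reject matches that run into the text's padding cells.
                if start + len(pattern) <= len(text):
                    hits.append(start)
    return sorted(hits)

print(compressed_find("GCTACTTTGGATGCT", "TACTTTGGA"))
```

The only hit comes from the layout with two leading don't-care cells (P'3), at text position 2 counting from 0.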
12
Two Reasons for Speed-Up
Shortened length of text and pattern
Number of different symbols = 256, hence larger shifts
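The effect of the 256-symbol alphabet shows up in the bad-character table. The sketch below uses the Horspool simplification of Boyer-Moore over Python `bytes`, which is a stand-in chosen here for brevity and not the d-BM algorithm; `horspool_find` is a hypothetical name. It applies to fully specified bytes only, so on packed DNA it would suit the inner pattern bytes that carry no don't-care cells.

```python
def horspool_find(text, pattern):
    """Boyer-Moore-Horspool over bytes: index of the first match, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-character table over all 256 byte values. A byte that does not
    # occur in the pattern (ignoring its last position) allows a full
    # shift of m bytes.
    shift = [m] * 256
    for i, b in enumerate(pattern[:-1]):
        shift[b] = m - 1 - i
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Shift according to the text byte under the pattern's last position.
        pos += shift[text[pos + m - 1]]
    return -1

print(horspool_find(b"GCTACTTTGGATGCT", b"CTTT"))
```

On packed data each byte covers four bases, so a shift of m bytes skips 4m bases, and with 256 possible byte values a given text byte is less likely to occur in the pattern than one of only four bases would be.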
13
References
[Chen:CSB04] Chen, L., Lu, S., Ram, J.: Compressed Pattern Matching in DNA Sequences. In: CSB '04: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (2004).
[FRED:IPL03] Fredriksson, K.: Shift-Or String Matching with Super-Alphabets. Information Processing Letters 87(4), 201–204 (2003).
[SHIBATA:CPM00] Shibata, Y., Matsumoto, T., Takeda, M., Shinohara, A., Arikawa, S.: A Boyer-Moore Type Algorithm for Compressed Pattern Matching. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 181–194. Springer, Heidelberg (2000).
[KIM:COMB07] Kim, J.W., Kim, E., Park, K.: Fast Matching Method for DNA Sequences. In: Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, pp. 271–281 (2007).
[NAVARRO:SPE05] Navarro, G., Tarhio, J.: LZgrep: a Boyer-Moore string matching tool for Ziv-Lempel compressed text. Softw. Pract. Exper. 35 (2005).
14
THANK YOU !