Compressed Pattern Matching in DNA Sequences BARNA SAHA.


Motivation
- DNA sequences are large. The human genome: three billion characters over twenty-three pairs of chromosomes.
- Conventional search: bring the compressed sequence from secondary storage into main memory, decompress it, then search.
- Searching the compressed sequence directly:
  1. Saves main memory space
  2. Reduces time complexity

Works on Compressed Pattern Matching
- BB (Boyer-Moore on byte pair encoding) [SHIBATA:CPM00]
- SASO (Super-alphabet shift-or) [FRED:IPL03]
- d-BM (Derivative Boyer-Moore) [Chen:CSB04]
- LZ-Grep (Boyer-Moore on LZ compressed text) [NAVARRO:SPE05]
- FED (Fast matching with encoded DNA sequences) [KIM:COMB07]
Project goal: implement Derivative Boyer-Moore and compare it with the original Boyer-Moore and AGREP.

Project Goals
- Explore the d-BM algorithm by implementing it and comparing its time complexity with BM, AGREP (fastest algorithm so far), and LZ-Grep.
- If possible, extend the method to matching multiple patterns and to inexact matching.

Two Steps of Implementation
- Compress the DNA strings (text and pattern).
- Search for the compressed pattern in the compressed text.

Compression Technique for Text
- Each base gets a 2-bit code: A=00, C=01, T=10, G=11.
- Four characters are represented by 1 byte.
- Don't-care symbols (x) are appended to make the sequence length divisible by 4.
- Example: TACCGT = 10 00 01 01 11 10 xx xx
- With S = TACCGT: S' = 10 00 01 01 11 10 and A = 2, the number of don't-care symbols appended.
- Replace S by (S', A).
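The packing step can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the names `CODE` and `pack_dna` are hypothetical, and filling the don't-care cells with zero bits is an assumption, since the slide does not say which bits occupy them.

```python
# 2-bit codes as given on the slide.
CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}

def pack_dna(seq):
    """Pack a DNA string into bytes, four bases per byte.

    Returns (packed, pad), where pad is the number of don't-care
    cells appended to fill the last byte (the slide's A).
    """
    pad = (-len(seq)) % 4
    packed = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Assumption: don't-care cells are zero-filled.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed), pad

packed, pad = pack_dna("TACCGT")
print([format(b, "08b") for b in packed], pad)
```

For `TACCGT` the first byte holds `10 00 01 01`, the second holds `11 10` followed by two padded cells, and `pad` is 2.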

Compression Technique for Pattern
- Text: TACT GATx
- Pattern: ACTG Axxx
- The byte boundaries of the pattern and of its occurrence in the text need not match, so the pattern is encoded in 4 alignments:
  1. ACTG Axxx
  2. xACT GAxx
  3. xxAC TGAx
  4. xxxA CTGA
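The four alignments can be generated mechanically. The sketch below works on the character level, before packing, and the function name `alignments` is hypothetical.

```python
def alignments(pattern):
    """Return the four byte alignments of a pattern.

    Each entry is (lead, trail, groups): the number of leading and
    trailing don't-care cells, and the padded pattern split into
    4-character groups (one group per packed byte).
    """
    result = []
    for lead in range(4):
        trail = (-(lead + len(pattern))) % 4
        padded = "x" * lead + pattern + "x" * trail
        groups = [padded[i:i + 4] for i in range(0, len(padded), 4)]
        result.append((lead, trail, groups))
    return result

for lead, trail, groups in alignments("ACTGA"):
    print(lead, trail, " ".join(groups))
```

For the pattern `ACTGA` this reproduces the four encodings listed on the slide.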

Searching
- Adapt the Boyer-Moore algorithm.
- The adaptation needs to handle don't-care symbols.

Handling Don't Care
Search (P', A1, A2) in (T', B1), where P' has n bytes and T' has m bytes. Four cases:
- Matching P'[1] to T'[j], 1<=j<=m-1: mask the first A1 cells of P'[1].
- Matching P'[i], 2<=i<=n-1, to T'[j], 1<=j<=m-1: normal search.
- Matching P'[n] to T'[j], 1<=j<=m-1: mask the last A2 cells of P'[n].
- Matching P'[n] to T'[m]: mask the last A2 cells of P'[n] and the B1 cells of T'[m].
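One way to read the four cases is as a choice of bit mask before comparing two bytes. The sketch below is an illustration under assumptions: the helper names are hypothetical, indices are 1-based as on the slide, each cell is 2 bits, and don't-care cells sit at the start of the first pattern byte and at the end of the last pattern byte and last text byte.

```python
def lead_mask(a1):
    """Mask that ignores the first a1 two-bit cells of a byte."""
    return 0xFF >> (2 * a1)

def trail_mask(a2):
    """Mask that ignores the last a2 two-bit cells of a byte."""
    return (0xFF << (2 * a2)) & 0xFF

def case_mask(i, n, j, m, a1, a2, b1):
    """Mask for comparing P'[i] with T'[j] (1-based indices)."""
    mask = 0xFF
    if i == 1:                # case 1: first pattern byte
        mask &= lead_mask(a1)
    if i == n:                # case 3: last pattern byte
        mask &= trail_mask(a2)
        if j == m:            # case 4: also the last text byte
            mask &= trail_mask(b1)
    return mask               # case 2: 0xFF, plain byte comparison

def bytes_match(p_byte, t_byte, mask=0xFF):
    """True if the two bytes agree on every unmasked bit."""
    return ((p_byte ^ t_byte) & mask) == 0
```

Note that case 4 as written ignores the padded cells of T'[m] even when B1 is larger than A2, so a caller would still need to check that the pattern does not run past the real end of the text.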

Handling Don't Care
- Considering these four cases separately during searching and preprocessing adds significant overhead.
- P'=, T'=
- Search Pm in Tf.
- If found, extend the match with Pf and Pe in T' using cases 1, 3, and 4.

Example
- Pattern = TACTTTGGA
- Text = GCTACTTTGGATGCT
- P'3 = xxTA CTTT GGAx
- T' = GCTA CTTT GGAT GCTx
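The search on packed bytes can be sketched end to end as below. Several things here are assumptions rather than the d-BM method itself: the names `pack` and `find_all` are hypothetical, don't-care cells are packed as zero bits, the middle bytes are located with a plain left-to-right scan where the real algorithm would use Boyer-Moore shifts, and the last-byte-of-text case is handled by a bounds check on the match position.

```python
CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}

def pack(padded):
    """Pack a string over A/C/T/G/x (length divisible by 4) into bytes.
    Assumption: the don't-care symbol x is packed as 00."""
    out = bytearray()
    for i in range(0, len(padded), 4):
        byte = 0
        for ch in padded[i:i + 4]:
            byte = (byte << 2) | CODE.get(ch, 0)
        out.append(byte)
    return bytes(out)

def find_all(text, pattern):
    """Return the sorted 0-based base offsets of pattern in text."""
    if not pattern:
        return []
    t = pack(text + "x" * ((-len(text)) % 4))
    hits = set()
    for a1 in range(4):                      # the four pattern alignments
        a2 = (-(a1 + len(pattern))) % 4
        p = pack("x" * a1 + pattern + "x" * a2)
        n = len(p)
        masks = [0xFF] * n
        masks[0] &= 0xFF >> (2 * a1)             # ignore leading cells
        masks[-1] &= (0xFF << (2 * a2)) & 0xFF   # ignore trailing cells
        middle = p[1:n - 1]
        for j in range(len(t) - n + 1):
            # Step 1: the middle bytes must match exactly.
            if t[j + 1:j + n - 1] != middle:
                continue
            # Step 2: extend to the first and last bytes under their masks.
            if (p[0] ^ t[j]) & masks[0]:
                continue
            if (p[-1] ^ t[j + n - 1]) & masks[-1]:
                continue
            pos = 4 * j + a1
            # Reject matches that would run into the text's padding.
            if pos + len(pattern) <= len(text):
                hits.add(pos)
    return sorted(hits)

print(find_all("GCTACTTTGGATGCT", "TACTTTGGA"))  # [2]
```

With the text and pattern from the example, the only hit is at offset 2, found through the alignment with two leading don't-care cells.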

Two Reasons for Speed-Up
- The text and pattern are shortened.
- The number of different symbols is 256, hence larger shifts.
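The second point can be illustrated with a bad-character shift table built over byte values. The sketch uses the Horspool simplification of Boyer-Moore, not the d-BM preprocessing, and the name `horspool_shifts` is hypothetical.

```python
def horspool_shifts(pattern):
    """Bad-character shift table over the 256 possible byte values.

    Any byte value that does not occur in pattern[:-1] allows a
    shift by the full pattern length.
    """
    n = len(pattern)
    shifts = [n] * 256
    for i in range(n - 1):
        shifts[pattern[i]] = n - 1 - i
    return shifts

shifts = horspool_shifts(bytes([0x85, 0xE0, 0x1E]))
print(sum(1 for s in shifts if s == 3))  # 254
```

With a 3-byte pattern, 254 of the 256 byte values give the maximum shift of 3 bytes, that is, 12 bases. Over the plain 4-letter alphabet most patterns contain every letter, so full-length shifts are rare.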

References
- [Chen:CSB04] Lei Chen, Shiyong Lu, and Jeffrey Ram. Compressed Pattern Matching in DNA Sequences. CSB '04: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference.
- [FRED:IPL03] Fredriksson, K.: Shift-Or String Matching with Super-Alphabets. Information Processing Letters 87(4), 201–204 (2003).
- [SHIBATA:CPM00] Shibata, Y., Matsumoto, T., Takeda, M., Shinohara, A., Arikawa, S.: A Boyer-Moore Type Algorithm for Compressed Pattern Matching. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 181–194. Springer, Heidelberg (2000).
- [KIM:COMB07] Jin Wook Kim, Eunsang Kim, and Kunsoo Park. Fast Matching Method for DNA Sequences. Book of Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, 271–281, 2007.
- [NAVARRO:SPE05] Gonzalo Navarro and Jorma Tarhio. LZgrep: a Boyer-Moore string matching tool for Ziv-Lempel compressed text. Softw. Pract. Exper., 35, 2005.

THANK YOU !
