Variations of Forward-SBNDM. Hannu Peltola and Jorma Tarhio, Aalto University, Finland.

Aug. 29, 2011

Aims
- Tuning algorithms for exact string matching.
- Studying the effect of simultaneous 2-byte read.

SBNDM (Simple Backward Nondeterministic DAWG Matching)
SBNDM [18] is a simplification of BNDM [17]. Both are bit-parallel algorithms. Let the text be T = t_1...t_n and the pattern P = p_1...p_m. At each alignment window of P in T, T is scanned from right to left until the suffix of the window read so far is not a factor of P or an occurrence of P is found.

Shift of SBNDM
- No factor found: shift by m.
- P found: shift by 1.
- Otherwise: the next alignment starts at the last factor.
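The scan and the three shift cases can be sketched in Python as follows. This is a minimal illustration, not the authors' tuned implementation: the function name `sbndm`, the dictionary-based occurrence vectors, and the returned list of window starts are assumptions made for readability, and positions are 0-based.

```python
def sbndm(pattern, text):
    """Return (occurrences, window_starts) of pattern in text, 0-based."""
    m, n = len(pattern), len(text)
    # B[c] has bit (m-1-i) set when pattern[i] == c, so the last
    # pattern character corresponds to the lowest bit.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))

    hits, windows = [], []
    s = 0                                   # start of the current window
    while s <= n - m:
        windows.append(s)
        D = (1 << m) - 1                    # every factor is still possible
        k = 0                               # characters read in this window
        while k < m:
            k += 1
            D &= B.get(text[s + m - k], 0)  # read one character leftwards
            if D == 0:                      # suffix of length k is not a factor
                break
            D <<= 1
        if D != 0:                          # whole window read: occurrence
            hits.append(s)
            s += 1
        else:                               # restart right after the failing character
            s += m - k + 1
    return hits, windows


hits, windows = sbndm("banana", "antanabadbanana")
print(hits, windows)   # [9] [0, 3, 9]
```

When the very first character read is not a factor, k = 1 and the shift is m; when the mismatch happens later, the new window begins at the start of the last factor read.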

SBNDM, example
P = banana, T = antanabadbanana...
First alignment: [antana]badbanana. Scanning from the right, the suffixes a, na, and ana are factors of P; tana is not a factor.
Next alignment: ant[anabad]banana. The suffix d is not a factor.
Next alignment: antanabad[banana].

SBNDMq
SBNDMq [6] is a tuned version of SBNDM. Processing of an alignment starts with checking a q-gram. Let q = 4 and consider the alignment at antana. Instead of testing the four suffixes a, na, ana, and tana one by one, only tana is tested. Testing is done in a fast loop.
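As a rough illustration of the q-gram test, the sketch below computes the state vector of a whole q-gram in one expression, using the SBNDM bit layout (last pattern character in the lowest bit). The helper names `occurrence_vectors` and `qgram_state` are hypothetical, and the shift m - q + 1 for a zero vector is the FSB(q,f) shift with f = 0.

```python
def occurrence_vectors(pattern):
    """B[c] has bit (m-1-i) set when pattern[i] == c."""
    m = len(pattern)
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    return B


def qgram_state(B, gram):
    """State vector after reading the q-gram c1 c2 ... cq at once."""
    D = -1                                  # all bits set
    for shift, c in enumerate(gram):
        D &= B.get(c, 0) << shift           # B[c1] & (B[c2] << 1) & ...
    return D


pattern, window, q = "banana", "antana", 4
B = occurrence_vectors(pattern)
D = qgram_state(B, window[-q:])             # tests only "tana"
shift = len(pattern) - q + 1 if D == 0 else None
print(D, shift)                             # 0 3
```

A nonzero result, as for the 4-gram nana, means the q-gram is a factor of P and the scan would continue leftwards one character at a time.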

Forward-SBNDM
Forward-SBNDM (FSB for short) by Faro & Lecroq [7] is a lookahead version of SBNDM2. Both FSB and SBNDM2 read a 2-gram x_1x_2 before a factor test. In SBNDM2, x_1x_2 is matched with the end of P. In FSB, only x_1 is matched with the end of P, and x_2 is a lookahead character following the current alignment. FSB is faster than SBNDM2 for large alphabets.

Generalization of FSB: FSB(q,f)
FSB(q,f) (= Forward-SBNDM(q,f)) is SBNDMq with f lookahead characters, f = 0, 1, ..., q - 1. In particular, FSB(2,1) = FSB and FSB(q,0) = SBNDMq. Motivation: SBNDMq works well on modern processors also for q > 2.

FSB(q,f)
Let UV be a q-gram, where |V| = f. After reading UV there are three alternatives:
i. If U is a suffix of P, reading continues leftwards.
ii. Else if UV is a factor of P, reading continues leftwards.
iii. Else the state vector is zero and P is shifted m - q + f + 1 positions (f positions more than in SBNDMq).
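The three alternatives can be put together into a small search routine. The sketch below is one possible reading of FSB(q,f), not the authors' code: the name `fsb_search` is hypothetical, every occurrence vector carries f extra low bits that act as wildcards for the lookahead positions, and lookahead positions past the end of the text are treated like characters that do not occur in P.

```python
def fsb_search(pattern, text, q=4, f=2):
    """Return 0-based occurrences of pattern in text (sketch of FSB(q,f))."""
    m, n = len(pattern), len(text)
    assert 0 <= f < q and q - f <= m
    extra = (1 << f) - 1                     # f wildcard bits for the lookahead
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, extra) | (1 << (m + f - 1 - i))

    def vec(pos):                            # vector of text[pos]; padding past the end
        return B.get(text[pos], extra) if pos < n else extra

    hits = []
    s = 0                                    # start of the current window
    while s <= n - m:
        pos = s + m + f - 1                  # last lookahead position
        first = s + m + f - q                # leftmost position of the q-gram UV
        D = vec(pos)
        while pos > first:                   # read the whole q-gram first
            pos -= 1
            D = (D << 1) & vec(pos)
        while D != 0 and pos > s:            # cases i and ii: continue leftwards
            pos -= 1
            D = (D << 1) & vec(pos)
        if D != 0:                           # window start reached: occurrence
            hits.append(s)
            s += 1
        else:                                # case iii gives s + m - q + f + 1
            s = pos + 1
    return hits


print(fsb_search("banana", "antanabadbanana", q=4, f=2))   # [9]
```

With f = 0 the routine degenerates to an SBNDMq-style search, which matches the identity FSB(q,0) = SBNDMq.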

Occurrence vectors in FSB(q,2)
Example: P = banana (x denotes a character that does not occur in P)

SBNDMq:    B[n] = 00001010
FSB(q,2):  B[n] = 00101011
           B[a] = 01010111
           B[x] = 00000011

In FSB(q,2) the two rightmost bits of each vector are the extra bits.

State vectors in FSB(q,2) for q = 4
4-gram nanx:
  x       00000011
  n       00101011
  a       01010111
  n       00101011
  state   00001000

4-gram   State vector   Conclusion
nanx     00001000       na is a suffix of P
xana     00000000       not a factor
anan     01000000       factor of P

Note that nanx itself is not a factor of P.
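The vectors and states of this example can be recomputed with a few lines of Python. This is a sketch under the assumption that the state of a 4-gram c1c2c3c4 is B[c1] & (B[c2] << 1) & (B[c3] << 2) & (B[c4] << 3), which reproduces the value shown for nanx; the helper names `fsb_vectors` and `state` are hypothetical.

```python
def fsb_vectors(pattern, f):
    """Occurrence vectors with f extra low bits set in every vector."""
    m = len(pattern)
    extra = (1 << f) - 1
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, extra) | (1 << (m + f - 1 - i))
    return B, extra


def state(B, extra, gram):
    """State vector after reading the whole q-gram."""
    D = -1                                   # all bits set
    for shift, c in enumerate(gram):
        D &= B.get(c, extra) << shift
    return D


B, extra = fsb_vectors("banana", 2)
for c in "nax":
    print(c, format(B.get(c, extra), "08b"))
# n 00101011
# a 01010111
# x 00000011
for gram in ("nanx", "xana"):
    print(gram, format(state(B, extra, gram), "08b"))
# nanx 00001000
# xana 00000000
```

For anan this sketch also yields a nonzero vector with the factor bit set. It additionally sets a lower bit, for the suffix ana followed by one lookahead position, so its output for anan may show more bits than the table lists.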

Benefits / drawbacks of lookahead characters and extra bits
Benefits:
- Longer shifts => more speed
- Combined suffix / factor test
Drawback:
- More q-grams accepted => less speed

Greedy skip loop for SBNDM2 (GSB2 = Greedy-SBNDM2)
Factor tests of two 2-grams are done in one round. Let B2[x,y] denote the combined occurrence vector of characters x and y:

    B2[x,y] = B[x] & (B[y] << 1)

Skip loop:

    next: D <- B2[t_i, t_{i+1}]
          if D = 0 then
              if B2[t_{i+m-1}, t_{i+m}] = 0 then
                  i <- i + 2*m - 2
                  goto next
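A complete search built around this skip loop might look like the sketch below. It is an interpretation rather than the authors' code: the name `gsb2_search` is hypothetical, the B2 table is a dictionary over pairs of pattern characters, a 2-gram reaching past the end of the text is treated as a non-factor, and the case where only the first 2-gram fails (left open in the slide's pseudocode) is handled by a single shift of m - 1.

```python
def gsb2_search(pattern, text):
    """Return 0-based occurrences of pattern in text (sketch of Greedy-SBNDM2)."""
    m, n = len(pattern), len(text)
    assert m >= 2
    B = {}
    for k, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - k))
    # Combined vectors for all 2-grams over pattern characters; all others are 0.
    B2 = {(x, y): B[x] & (B[y] << 1) for x in B for y in B}

    def b2(i):                        # vector of text[i], text[i+1]; 0 past the end
        return B2.get((text[i], text[i + 1]), 0) if i + 1 < n else 0

    hits = []
    i = m - 2                         # left character of the window's last 2-gram
    while i + 1 < n:
        D = b2(i)
        if D == 0:
            # greedy step: also test the 2-gram that ends the next window
            i += 2 * m - 2 if b2(i + m - 1) == 0 else m - 1
            continue
        s = i - (m - 2)               # start of the current window
        pos = i
        while D != 0 and pos > s:     # ordinary SBNDM scan leftwards
            pos -= 1
            D = (D << 1) & B.get(text[pos], 0)
        if D != 0:
            hits.append(s)
            i += 1
        else:
            i = pos + m - 1           # next window starts at pos + 1
    return hits


print(gsb2_search("banana", "antanabadbanana"))   # [9]
```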

2-byte read
Read two characters (= 2 bytes = 16 bits) in one instruction (in a skip loop). This is well suited to q-gram algorithms with even q. For the experiments we made two versions of the algorithms:
- Standard (1-byte read)
- b-version using 2-byte read

2-byte read (cont.)
Advantage: a part of the computation can be moved to the preprocessing phase. Example: B2[x,y] = B[x] & (B[y] << 1). The speed-up factor can be even more than 2.
Drawback: an extra 0.1 ms for preprocessing.
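The idea of moving the combination of two occurrence vectors into preprocessing can be illustrated with a 65536-entry table indexed by a 16-bit value. The sketch below only demonstrates the indexing; Python itself gains no speed from it. The names `build_b2_table` and `read16` are hypothetical, and little-endian byte order is assumed, so the first character ends up in the low byte.

```python
import struct


def build_b2_table(pattern: bytes):
    """Return (B, table) where table[x | (y << 8)] == B[x] & (B[y] << 1)."""
    m = len(pattern)
    B = [0] * 256
    for k, c in enumerate(pattern):
        B[c] |= 1 << (m - 1 - k)
    table = [0] * 65536
    for x in range(256):
        if B[x] == 0:                 # the combined vector is 0 anyway
            continue
        for y in range(256):
            table[x | (y << 8)] = B[x] & (B[y] << 1)
    return B, table


def read16(text: bytes, i: int) -> int:
    """Read text[i] and text[i+1] as one little-endian 16-bit value."""
    return struct.unpack_from("<H", text, i)[0]


B, table = build_b2_table(b"banana")
text = b"antanabadbanana"
print(format(table[read16(text, 4)], "06b"))   # 2-gram "na": 001010
```

In a compiled language the same lookup would typically be a single 16-bit load followed by one table access, which replaces two byte loads, a shift, and an AND in the skip loop.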

4-byte read?
- Many border crossings happen => slowdown
- Tables of size 2^32 are too big in practice

Experimental results / KJV Bible
In the recent comparison by S. Faro and T. Lecroq, The Exact String Matching Problem: a Comprehensive Experimental Evaluation (2010), the algorithms EBOM and Hash3 were the fastest on the Bible text for m = 4, ..., 20.

m        4      8      16
Hash3    14.6   5.42   2.79
EBOM     6.53   3.87   2.91

[Figure slides: KJV Bible results, run on a ThinkPad X61s; the plots are not reproduced in this transcript]
- KJV: EBOM & Hash3
- KJV: EBOMb & Hash3b (with 2-byte read) added
- KJV: SBNDM2b = FSB(2,0)b added
- KJV: GSB2b added
- KJV: FSB(4,i)b added, i = 0, 1, 2

KJV: Speed-up factors of 2-byte read

Algorithm   Speed-up
GSB2        1.32
FSB(2,0)    1.34
FSB(2,1)    1.24
FSB(4,0)    1.72
FSB(4,1)    2.15
FSB(4,2)    2.03
Hash3       1.05
EBOM        1.17

Other experiments
DNA and binary data were also tested. The gain from lookahead characters or the greedy loop was smaller than with the Bible data. The gain from 2-byte read was smaller with 64-bit code than with 32-bit code.

Conclusions
Two new algorithms were presented:
- FSB(q,f)
- GSB2
The new algorithms are faster than earlier algorithms on English data:
- GSB2 for m = 4, ..., 8
- FSB(q,f) for m = 8, ..., 20
2-byte read makes most string algorithms faster.

Web site for practical speed comparison: cse.aalto.fi/stringmatching
