
1 Łódź, 2008 Intelligent Text Processing lecture 1 Intro to string matching Szymon Grabowski sgrabow@kis.p.lodz.pl http://szgrabowski.kis.p.lodz.pl/IPT08/

2 Text is everywhere: natural language (NL) texts, web documents (e.g. HTML), DNA and protein sequences, computer program sources, XML databases, log files (web, system, database logs, etc.), config files, music data (e.g. MIDI), ...

3 Research fields: string matching and information retrieval (IR). In short: in string matching we know what we're looking for (the main problem is how to do it fast, or possibly in little space). In information retrieval we don't know what we're looking for... A typical IR problem: given a human query (say, a couple of keywords), present the 10 most relevant web pages matching the query (ranked from the most relevant).

4 The field of string matching (aka stringology). Myriads of problems of the type: find a string (more generally: a pattern) in a given text. Applications: text editors, pdf / ebook readers, file systems, computational biology (the text is DNA, or proteins, or RNA...), web search engines, compression algorithms, template matching (in images), music retrieval (e.g., query-by-humming handling), AV scanners, ...

5 Algorithm – a formal description of a way to achieve a specific goal (e.g., to obtain a sorted sequence of integers, to find the greatest common divisor, etc.). Features of any algorithm [Knu1, pp. 5-6 (Polish edition)]: it is finite, it is well-defined (each step must be defined precisely), it has input data, it has output data, and it is effective (each step should be simple enough to be carried out, at least in principle, in finite time with merely a piece of paper and pencil).

6 Many algorithms work faster with some kinds of data than with others. For example, a compression alg may compress text faster than audio data; a sorting alg may run faster on partially sorted data than on random data (while some other sorting algorithm does not exhibit this property, or shows just the opposite behavior!). Basic criteria for algorithm evaluation: speed in the average case, speed in the worst case, memory utilization, flexibility (modification possibilities, e.g. in problem generalizations), simplicity.

7 How to measure / estimate the speed of algorithms? Experimental tests (data of varied kinds / parameters and varied size), or computational complexity analysis. Empirical test cons: dependence on the particular hardware architecture, low-level implementation details, the compiler used... Sometimes it is hard to choose "representative" data (e.g., for universal lossless data compression). Comp. complexity analysis cons: constant factors usually neglected, tight analysis often very hard, theoretical models fit modern hardware rather weakly.

8 Computational complexity. (Asymptotic) computational complexity – how fast the running time of an algorithm grows as the volume of data grows to infinity. Θ (theta) notation. Assume that T(n) = Θ(g(n)). Θ(g(n)) = { f(n): there exist constants c1 > 0, c2 > 0 and n0 > 0, such that 0 ≤ c1·g(n) ≤ f(n) ≤ c2·g(n) for all n ≥ n0 }
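A small worked example (not from the slides) making the constants concrete: f(n) = 3n² + 5n is Θ(n²). Take c1 = 3, c2 = 4, n0 = 5; since 5n ≤ n² for all n ≥ 5, we get 0 ≤ 3n² ≤ 3n² + 5n ≤ 4n² for all n ≥ n0, which is exactly the condition 0 ≤ c1·g(n) ≤ f(n) ≤ c2·g(n) with g(n) = n².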

9 Notation Θ – example. Max searching in an array of n items: trivial alg of Θ(n) time complexity (linear complexity). O notation (asymptotic upper bound): O(g(n)) = { f(n): there exist constants c > 0 and n0 > 0, such that 0 ≤ f(n) ≤ c·g(n) for all n ≥ n0 }

10 How fast typical functions grow

n        lg n     n lg n     n^2       n^3       2^n
10       3        30         100       1000      10^3
100      7        700        10^4      10^6      10^30
1000     10       10^4       10^6      10^9      10^301
10^6     20       2·10^7     10^12     10^18     10^301030

11 String matching: basic notation and terminology. Text T, pattern P, alphabet Σ. Characters of T and P are taken from the alphabet Σ. n = |T|, m = |P|, σ = |Σ| (alphabet size). $ – text terminator; an abstract symbol, either lex. lowest or lex. greatest (depending on the convention). A string x is a prefix of string z iff z = xy, for some string y (y may be empty, denoted by ε). If z = xy and y ≠ ε, then x is a proper prefix of z. Similarly, x is a suffix of z iff z = yx for some string y. We also say that x is a factor of z iff there exist strings a, b such that z = axb.
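As a quick illustration of these terms (not part of the original slides), a few checks in Python using built-in string operations:

z = "abaab"
print(z.startswith("aba"))   # True: "aba" is a (proper) prefix of z
print(z.endswith("aab"))     # True: "aab" is a suffix of z
print("baa" in z)            # True: "baa" is a factor of z, since z = "a" + "baa" + "b"
print(z.startswith(z))       # True: z is a prefix of itself, though not a proper one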

12 Exact string matching problem. Problem: find all occurrences of P[1..m] in text T[1..n]. We say that P occurs in T with shift s if P[1..m] = T[s+1..s+m] (what is the largest possible s?). T is presented on-line; there is no time to preprocess it. On the other hand, P is usually much shorter than T, so we can (and should!) preprocess P. A fundamental problem, considered for 30+ years, and still new algorithms appear...

13 More advanced string (pattern) searching tasks: approximate search (several mismatches between P and a substring of T allowed); multiple search (looking for several patterns "at once"); extended search (classes of characters, regular expressions, etc.); global (as opposed to local) measure of similarity between strings; 2D search (in images) – very hard if combined with rotation and/or scaling and/or lightness invariance.

14 On-line vs. off-line string search. On-line (Boyer-Moore, KMP, etc.) – the whole T must be scanned (even if skipping some symbols is possible); off-line – preprocessing space and time are involved, but the search itself is truly sublinear in |T|. Off-line searching = indexed searching. Two types of text indexes: word-based indexes (e.g., Glimpse (Manber & Wu, 1994)); full-text indexes (e.g., suffix trees, suffix arrays). Don't worry, next lecture...

15 Exact string matching problem, example: http://www.cs.ust.hk/~dekai/271/notes/L16/L16.ps

16 Exact string matching problem, cont'd. The naïve (brute-force) algorithm tries to match P against each position of T. That is, for each i, i = 1..n–m+1, P[1..m] is matched against T[i..i+m–1]. More precisely: compare P[1..m] to T[startpos..startpos+m–1] from left to right; if a mismatch occurs, slide P by 1 char and start anew. The worst case complexity is O(mn), but in practice it is close to O(n). Usually not too bad, but there are much faster algorithms (a sketch follows below).
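A minimal Python sketch of the brute-force search described above (0-based indexing; the function name and test string are illustrative, not from the slides):

def naive_search(text, pattern):
    """Try every shift s and compare pattern to the window left to right."""
    n, m = len(text), len(pattern)
    occurrences = []
    for s in range(n - m + 1):           # all n-m+1 possible shifts
        if text[s:s + m] == pattern:     # up to m character comparisons
            occurrences.append(s)        # P occurs in T with shift s
        # on a mismatch, the pattern simply slides by one position
    return occurrences

print(naive_search("abracadabra", "abra"))   # -> [0, 7]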

17 The naïve algorithm in action: http://www.cs.ust.hk/~dekai/271/notes/L16/L16.ps Worst case example for the naïve algorithm: let T = aaaaaaaaa...b (e.g., one b preceded by 999,999 a's) and P = aaaaab. At each position in T, a mismatch is found only after as many as |P| = 6 char comparisons.

18 The idea of the Knuth-Morris-Pratt (KMP) alg (1977). Let's start with a simple observation: if a mismatch occurs at position j with the naïve algorithm, we know that the j–1 previous chars do match. KMP exploits this fact: after a mismatch at P's position j it shifts P by (j–1) minus the length of the longest proper prefix of P[1..j–1] that is also a suffix of P[1..j–1]. Complicated? Not really. See an example: P = abababc, T = ababad... Mismatch at position 6 (b ≠ d). KMP, in O(1) time (thanks to its preprocessing), finds that P can safely be shifted by 5 – 3. Why 5, why 3? 5 = |ababa|, 3 = |aba|. A sketch of the preprocessing and search follows below.
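A short Python sketch of this idea, assuming 0-based indexing (fail[j] holds the length of the longest proper prefix of P[1..j] that is also its suffix); this is a sketch, not the authors' original formulation:

def kmp_search(text, pattern):
    m = len(pattern)
    # failure function: fail[j] = length of the longest proper prefix of
    # pattern[:j] that is also a suffix of pattern[:j]
    fail = [0] * (m + 1)
    k = 0
    for j in range(2, m + 1):
        while k > 0 and pattern[j - 1] != pattern[k]:
            k = fail[k]
        if pattern[j - 1] == pattern[k]:
            k += 1
        fail[j] = k
    # search: never move backwards in the text, only shift the pattern
    hits, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k]
        if c == pattern[k]:
            k += 1
        if k == m:
            hits.append(i - m + 1)       # 0-based start of an occurrence
            k = fail[k]
    return hits

print(kmp_search("ababadabababc", "abababc"))   # -> [6]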

19 KMP properties. Linear time in the worst case (but also in the best case). O(m) extra space (for a table built in the preprocessing phase). Not practical: about 2x slower than the naïve one (acc. to: R. Baeza-Yates & G. Navarro, Text Searching: Theory and Practice). Still, to some degree it depends on the alphabet size: with a small alphabet (e.g., DNA) KMP runs relatively fast.

20 The Boyer-Moore algorithm (1977). The first search algorithm with skips. KMP is optimal in the worst case but never skips characters in T; that is, Θ(n) time also in the best case. Skipping chars in T?! You gotta be kidding. How can it be possible..? Idea: compare P against T from right to left. If, e.g., the char of T aligned with the rightmost char of P does not appear anywhere in P, we can shift P by its whole length! But how can we quickly check that some symbol does not appear in P?

21 The Boyer-Moore idea. The key observation is that, usually, m << n and σ << n. Consequently, any preprocessing in O(m + σ) time is practically free. The BM preprocessing involves a σ-sized table telling the rightmost position of each alphabet symbol in P (or zero, if a given symbol does not occur in P). Thanks to this table, the question "how far can we shift P after a char mismatch?" can be answered in O(1) time (a sketch of such a table follows below).
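A possible Python sketch of this σ-sized table (the helper name and the 1-based convention are illustrative assumptions, not taken from the original BM paper):

def rightmost_positions(pattern, alphabet):
    """last[c] = rightmost 1-based position of c in pattern, or 0 if c does not occur."""
    last = {c: 0 for c in alphabet}
    for i, c in enumerate(pattern, start=1):
        last[c] = i          # later occurrences overwrite earlier ones
    return last

print(rightmost_positions("abcab", "abcd"))   # -> {'a': 4, 'b': 5, 'c': 3, 'd': 0}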

22 Why we don't like the original BM that much... Boyer & Moore tried to be too smart. They used not one but two heuristics intended to maximize the pattern shift (skip); in the Cormen et al. terminology, these are the bad-character heuristic and the good-suffix heuristic. The skip is the maximum of the skips suggested by the two heuristics. In practice, however, it does not pay to complicate things: the bad-character heuristic alone is good enough. Using both heuristics makes the skip longer on average, but the extra calculations cost, too...

23 Boyer-Moore-Horspool (1980) – a very simple and practical BM variant. From: R. Baeza-Yates & G. Navarro, Text Searching: Theory and Practice, to appear in Formal Languages and Applications, Physica-Verlag, Heidelberg.
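A compact Python sketch of the Horspool variant in its usual formulation (the shift is determined by the text character aligned with the last pattern position); names and the test string are illustrative, not from the cited book:

def horspool_search(text, pattern):
    n, m = len(text), len(pattern)
    # shift table d built from all pattern chars except the last one;
    # any character absent from the table yields the maximal shift m
    d = {}
    for i in range(m - 1):
        d[pattern[i]] = m - 1 - i
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:   # check the current window
            hits.append(pos)
        # shift by the distance from the window's last char
        # to its rightmost occurrence within pattern[0..m-2]
        pos += d.get(text[pos + m - 1], m)
    return hits

print(horspool_search("the hippopotamus", "potamus"))   # -> [9]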

24 BMH example. Miss, miss... For technical reasons (relatively large alphabet), we do not present the whole table d. Text T: from T.S. Eliot's The Hippopotamus.

25 BMH example. Miss... Hit! What then? We read that d['s'] is 12, hence we shift P by its whole length (there is no other 's' in P, right?). And the search continues...

26 Worst and average case time complexities. Assumptions for the avg case analysis: uniformly random char distribution, characters in T independent of each other; same assumptions for P. Naturally, we assume m ≤ n and σ = O(n), so instead of e.g. O(n+m) we're going to write just O(n). Naïve alg: O(n) avg case, O(mn) worst case. BM: O(n / min(m, σ)) avg case and O(mn) worst case (alas). BMH: same complexities as BM. Shift-Or: O(n) avg and worst case, as long as m ≤ w.

27 Worst and average case time complexities, cont'd. The lower bounds on the avg and worst case time complexities are Ω(n·⌈log_σ(m)⌉ / m) and Ω(n), respectively. Note that n·⌈log_σ(m)⌉ / m is close to n / m in practice (they are equal in complexity terms as long as m = O(σ^O(1))). The Backward DAWG Matching (BDM) alg (Crochemore et al., 1994) reaches the average case complexity lower bound. Some of its variants, e.g., TurboBDM and TurboRF (Crochemore et al., 1994), reach O(n) worst case without losing on avg.

28 Bit-parallelism. CPUs work on registers; nowadays, they have (at least) 32 bits each. Bitwise operations like <<, >>, &, |, ^, ~ actually perform many operations on single bits in parallel. The general idea of exploiting this simple hardware property is called bit-parallelism.
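A toy Python illustration of the idea (not from the slides): a single bitwise operation updates 32 independent one-bit "slots" at once, instead of a 32-step loop:

flags = 0b00000000000000000000000000001010   # bits 1 and 3 are set
mask  = 0b00000000000000000000000011110000   # we want to set bits 4..7
flags |= mask                                # one OR touches all 32 bit positions at once
print(bin(flags))                            # -> 0b11111010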

29 Bit-parallelism, cont'd. Assumption: a machine word has w bits. In practice, usually w = 32 or 64. But SSE in the Pentium 4 works on 128-bit words! More precisely, SSE2 actually supports integer data types in the 128-bit vector registers. In theory, the common assumption is that w = Ω(log n) (or w ≥ lg n). Bit-parallelism has been successfully used for a plethora of string matching problems.

30 Shift-Or (Baeza-Yates & Gonnet, 1992) http://www.cs.joensuu.fi/~kfredrik/slides.pdf

31 Shift-Or, cont'd http://www.cs.joensuu.fi/~kfredrik/slides.pdf

32 Shift-Or pseudo code (~0 means all bits set to 1) http://www.cs.joensuu.fi/~kfredrik/slides.pdf

33 Shift-Or, simple example [ http://www.egeen.ee/u/vilo/edu/200506/Text_Algorithms/index.cgi?f=L1_Exact ]. T = lasteaed, P = aste, n = |T| = 8, m = |P| = 4, w ≥ m. Obtained in the preprocessing: B[a] = 0111, B[s] = 1011, B[t] = 1101, B[e] = 1110, B[b] = B[c] = B[d] = ... = B[z] = 1111. Search phase, step-by-step (= column-by-column)... Found a match!
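A minimal Python sketch of Shift-Or for a single-word pattern (m ≤ w). Note that here bit i of B[c] corresponds to P[i+1], so the masks come out in the reverse bit order of the B values shown above; names are illustrative:

def shift_or_search(text, pattern):
    m = len(pattern)
    all_ones = (1 << m) - 1
    # B[c]: bit i is 0 iff pattern[i] == c, 1 otherwise
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, all_ones) & ~(1 << i)
    D = all_ones                          # state vector: all 1s = no partial match yet
    hits = []
    for j, c in enumerate(text):
        D = ((D << 1) | B.get(c, all_ones)) & all_ones
        if D & (1 << (m - 1)) == 0:       # bit m-1 cleared: occurrence ending at j
            hits.append(j - m + 1)        # 0-based start position
    return hits

print(shift_or_search("lasteaed", "aste"))   # -> [1]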

34 Shift-Or, another example http://www-igm.univ-mlv.fr/~lecroq/string/images/sotab1.png The columns are the state vectors (D) after processing each of the 24 characters of T.

