Author: Ricardo A. Baeza-Yates, Gaston H. Gonnet. Publisher: Communications of the ACM, 1992. Presenter: Yuen-Shuo Li. Date: 2013/08/14


1  Author: Ricardo A. Baeza-Yates, Gaston H. Gonnet  Publisher: 1992 Communications of the ACM  Presenter: Yuen-Shuo Li  Date: 2013/08/14 1

2  String searching is a very important component of many problems, including text editing, bibliographic retrieval, and symbol manipulation.

3 T[a] = 11010, T[b] = 10101, T[c] = 01111, T[d] = 11111. With the pattern written right to left as cbaba, T[a] = 11010 has a 0 under each a.

4 State: 50301. Text: cbbabababcaba… T[a] = 11010, T[b] = 10101, T[c] = 01111, T[d] = 11111.

5 State: 0301. Text and table as on slide 4.

6 State: 14020. Text and table as on slide 4.

7 State: 14020. Text and table as on slide 4.

8 State: 14020. Text and table as on slide 4.

9 State: 4020. Text and table as on slide 4.

10 State: 50301. Text and table as on slide 4.

11 State: 04121. Text and table as on slide 4.

12 To update the state after reading a new character of the text, we must: (1) shift the state vector b bits to the left, to reflect that we have advanced one position in the text; (2) update the individual states according to the new character.
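The two steps above can be sketched in Python as a shift followed by an addition. This is an illustration under assumptions, not the presenter's code: it assumes each individual state counts mismatches, and the names `FIELD_BITS`, `build_table`, `step`, and `digits` are hypothetical. A width of 3 bits per position is chosen only so that counts up to 5 fit in one field.

```python
FIELD_BITS = 3  # assumed width b of one individual state; holds counts 0..7


def build_table(pattern, alphabet, b=FIELD_BITS):
    """For each character, put a 1 in field i when pattern[i] differs from it."""
    return {
        ch: sum(1 << (i * b) for i, p in enumerate(pattern) if p != ch)
        for ch in alphabet
    }


def step(state, ch, table, m, b=FIELD_BITS):
    """Shift the state b bits left, then add the table entry of the new character."""
    mask = (1 << (m * b)) - 1  # keep only the m fields of the state
    return ((state << b) + table[ch]) & mask


def digits(state, m, b=FIELD_BITS):
    """Render the packed state as one decimal digit per field, leftmost first."""
    field = (1 << b) - 1
    return "".join(str((state >> (i * b)) & field) for i in reversed(range(m)))


pattern, text = "ababc", "cbbabababc"
table = build_table(pattern, "abcd")
state, trace = 0, []
for ch in text:
    state = step(state, ch, table, len(pattern))
    trace.append(digits(state, len(pattern)))

# States after the last four characters of the text (b, a, b, c)
print(trace[-4:])  # ['50301', '14020', '50301', '04121']
```

The leftmost digit of the last state is 0, which corresponds to an occurrence of the pattern ending at that position with no mismatches.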

13 The number of mismatches

14 For exact matching, each individual state is 0 or 1, so b = 1.

15 Let {a, b, c, d} be the alphabet, and ababc the pattern. T[a] = 11010, T[b] = 10101, T[c] = 01111, T[d] = 11111.
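As an illustration, the table above can be reproduced with a few lines of Python. This is a sketch under the assumption that bit i of T[x] (counting from the right, starting at 1) is 0 exactly when the i-th pattern character equals x. The helper name `build_table` is hypothetical.

```python
def build_table(pattern, alphabet):
    """Return {char: bit string}; the rightmost bit is the first pattern position."""
    m = len(pattern)
    table = {}
    for ch in alphabet:
        bits = 0
        for i, p in enumerate(pattern):
            if p != ch:  # a 1 marks a position where ch does not match
                bits |= 1 << i
        table[ch] = format(bits, f"0{m}b")
    return table


table = build_table("ababc", "abcd")
for ch, bits in table.items():
    print(f"T[{ch}] = {bits}")
```

A character that never occurs in the pattern, such as d here, gets an entry of all ones.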

16 The initial state is 11111. State: 11111. Text: abdabababc. T[a] = 11010, T[b] = 10101, T[c] = 01111, T[d] = 11111.

17 State: 11111. Text and table as on slide 16.

18 State: 1111. Text and table as on slide 16.

19 State: 11110. Text and table as on slide 16.

20 State: 11101. Text and table as on slide 16.

21 State: 11111. Text and table as on slide 16.

22 State: 11110. Text and table as on slide 16.

23 State: 11101. Text and table as on slide 16.

24 State: 11010. Text and table as on slide 16.

25 State: 10101. Text and table as on slide 16.

26 State: 11010. Text and table as on slide 16.

27 State: 10101. Text and table as on slide 16.

28 State: 01111. Text and table as on slide 16. The match at the end of the text is indicated by the value 0 in the leftmost bit of the state.
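The walkthrough on slides 16 to 28 can be replayed with a short Python sketch. It assumes the update is a one-bit left shift followed by a bitwise OR with the table entry of the character just read, which is consistent with the states listed on those slides. The function name `shift_or_trace` is hypothetical.

```python
def shift_or_trace(pattern, text, alphabet):
    """Return the state after each character and the start index of each match."""
    m = len(pattern)
    mask = (1 << m) - 1          # keep m bits of state
    top_bit = 1 << (m - 1)       # leftmost bit: 0 means a full match
    table = {
        ch: sum(1 << i for i, p in enumerate(pattern) if p != ch)
        for ch in alphabet
    }
    state = mask                 # initial state: all ones (11111 for m = 5)
    trace = [format(state, f"0{m}b")]
    matches = []
    for j, ch in enumerate(text):
        state = ((state << 1) | table[ch]) & mask
        trace.append(format(state, f"0{m}b"))
        if (state & top_bit) == 0:
            matches.append(j - m + 1)
    return trace, matches


trace, matches = shift_or_trace("ababc", "abdabababc", "abcd")
print(trace)
print(matches)  # [5]
```

The single reported match starts at index 5 (counting from 0), which is the occurrence of ababc at the end of the text.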

29

30 m: pattern size; w: word size

31

32 T[a] = 11000, T[b] = 10011, T[c] = 11101, T[d] = 01101
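One way to read this table is to list, for each pattern position, the characters whose bit is 0 there. The sketch below assumes the same convention as the earlier tables (the rightmost bit is pattern position 1, and a 0 means the character is accepted at that position). The name `accepted_sets` is hypothetical, and the slide itself does not state which pattern produced the table.

```python
def accepted_sets(table, m):
    """For each position (index 0 = rightmost bit), collect characters with bit 0."""
    return [
        {ch for ch, bits in table.items() if bits[m - 1 - i] == "0"}
        for i in range(m)
    ]


table = {"a": "11000", "b": "10011", "c": "11101", "d": "01101"}
sets = accepted_sets(table, 5)
for position, chars in enumerate(sets, start=1):
    print(position, sorted(chars))
```

Under this reading, some positions accept more than one character, so one table can describe a pattern whose positions are sets of characters.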

33

34 We allow up to k characters of the pattern to mismatch with the corresponding text. For example, if k = 2 and the pattern is mismatch: mismatch (match), dispatch (match), respatch (mismatch).
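The example above can be checked by counting differing positions directly. This is a plain position-by-position comparison for illustration, not the bit-parallel method of the presentation. The name `count_mismatches` is hypothetical.

```python
def count_mismatches(pattern, candidate):
    """Count positions where two equal-length strings differ."""
    if len(pattern) != len(candidate):
        raise ValueError("strings must have equal length")
    return sum(p != c for p, c in zip(pattern, candidate))


k = 2
results = {
    word: count_mismatches("mismatch", word)
    for word in ("mismatch", "dispatch", "respatch")
}
for word, n in results.items():
    verdict = "match" if n <= k else "mismatch"
    print(f"{word}: {n} mismatches ({verdict})")
```

Here dispatch differs from the pattern in 2 positions and is accepted, while respatch differs in 3 and is rejected.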

35

36 At each step we record the overflow bits in an overflow state, and we reset the overflow bits of all individual states.

37 We want to search for all occurrences of ababc with at most 2 mismatches. Because the value of b is 3 for 2 mismatches, every position in the state is represented by a number in the range 0-4. Initial state: 00000. Initial overflow: 44444. We report a match when the sum of the leftmost digits of the state and the overflow is less than 3.
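The setup above can be sketched in Python as follows. Several details are assumptions rather than statements from the slides: the text abdabababc is borrowed from the earlier exact-matching example, the width is computed as `k.bit_length() + 1` (which gives b = 3 for k = 2), and the top bit of each field is treated as the overflow bit that is moved into the overflow state after every step. The name `search_with_mismatches` is hypothetical.

```python
def search_with_mismatches(pattern, text, k):
    """Return (start index, mismatch count) for windows with at most k mismatches."""
    m = len(pattern)
    b = k.bit_length() + 1                 # assumed field width; 3 when k = 2
    full = (1 << (m * b)) - 1              # mask covering all m fields
    high = 1 << (b - 1)                    # overflow bit of one field (4 when b = 3)
    overflow_mask = sum(high << (i * b) for i in range(m))
    top = (m - 1) * b                      # bit offset of the leftmost field

    # Field i of table[ch] is 1 when pattern[i] differs from ch.
    table = {
        ch: sum(1 << (i * b) for i, p in enumerate(pattern) if p != ch)
        for ch in set(pattern) | set(text)
    }

    state = 0                              # initial state 00000
    overflow = overflow_mask               # initial overflow 44444
    hits = []
    for j, ch in enumerate(text):
        state = ((state << b) + table[ch]) & full       # each field is now 0..4
        overflow = ((overflow << b) | (state & overflow_mask)) & full
        state &= ~overflow_mask                          # reset the overflow bits
        leftmost = (state + overflow) >> top             # state digit + overflow digit
        if leftmost <= k:
            hits.append((j - m + 1, leftmost))
    return hits


hits = search_with_mismatches("ababc", "abdabababc", 2)
print(hits)  # [(3, 1), (5, 0)]
```

The initial overflow keeps the leftmost digit at 4 or more until a full window of m characters has been read, so no match is reported for an incomplete window.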

38

39

40

41 Experimental results for searching 100 times for all possible matches of a pattern in a 50,000-character English text (a legal document).

42 BMH: Boyer-Moore, as suggested by Horspool

43 The execution time when searching for 1,000 words chosen at random from the same English text.

