CSC-305 Design and Analysis of Algorithms BS(CS) -6 Fall-2014
Design and Analysis of Algorithms
Khawaja Mohiuddin, Assistant Professor, Department of Computer Sciences, Bahria University, Karachi Campus
Contact:
Lecture # 10 – String Algorithms
String Algorithms
Topics To Cover
- Pattern Matching
- DFAs
- Building DFAs for Regular Expressions
- NFAs
- String Searching
String Algorithms
String operations are common in many programs, and many programming libraries have good string tools. These tools probably use the best algorithms available, so you are unlikely to beat them with your own code.
For example, the Boyer-Moore algorithm finds the first occurrence of one string within another. Because this is such a common operation, most high-level programming languages have tools for doing it. In fact, many libraries are written in assembly language or at some other very low level, so they may give better performance even if you use the same algorithm in your code.
The algorithms explained in this chapter are presented because:
- They are interesting
- They are an important part of a solid algorithmic education
- They provide examples of useful techniques that you may be able to adapt for other purposes
Pattern Matching
Parsing is a very common task in computer programming, and it would be nice to have a general approach that could be used to parse all kinds of text. Regular expressions are one such approach: a regular expression is a string that a program can use to represent a pattern for matching in a string.
Programmers have defined several different regular expression languages. To keep this discussion reasonably simple, this section uses a language that defines the following symbols:
- An alphabetic character such as A or Q represents that letter.
- The + symbol represents concatenation. For the sake of readability, this symbol is often omitted, so ABC is the same as A + B + C.
- The * symbol means the previous expression can be repeated any number of times (including zero).
Pattern Matching (contd.)
- The | symbol means the text must match either the previous or the following expression.
- Parentheses determine the order of operations.
For example, with this restricted language, the regular expression AB*A matches strings that begin with an A, contain any number of Bs (possibly none), and then end with an A. That pattern matches ABA, ABBBBA, and AA.
More generally, a program might want to find the first occurrence of a pattern within a string. For example, the string AABBA contains the match ABBA for the pattern AB*A starting at the second letter. (Because zero Bs are allowed, the shorter match AA also begins at the first letter.)
To understand the algorithms described here for regular expression matching, it helps to understand deterministic finite automata (DFAs) and nondeterministic finite automata (NFAs).
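The restricted language above corresponds closely to a subset of Python's standard `re` syntax, where concatenation is implicit and `*`, `|`, and parentheses have the same meaning. The following is a minimal sketch under that assumption; the helper names `matches_whole` and `first_match` are made up for illustration.

```python
import re

PATTERN = "AB*A"  # an A, then any number of Bs, then an A


def matches_whole(text):
    """Return True when the entire text matches the pattern."""
    return re.fullmatch(PATTERN, text) is not None


def first_match(text, start=0):
    """Return (index, matched substring) of the leftmost match at or after start."""
    m = re.compile(PATTERN).search(text, start)
    return (m.start(), m.group()) if m else None


for s in ["ABA", "ABBBBA", "AA", "AB"]:
    print(s, matches_whole(s))

print(first_match("AABBA"))     # leftmost match in the whole string
print(first_match("AABBA", 1))  # search again from the second letter
```

A leftmost search reports the zero-B match at the first letter, while starting the search at the second letter finds ABBA.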
DFAs
A deterministic finite automaton, also known as a deterministic finite state machine, is basically a virtual computer that uses a set of states to keep track of what it is doing. At each step, it reads some input and, based on that input and its current state, moves into a new state.
One state is the initial state in which the machine starts. One or more states can be marked as accepting states. If the machine ends its computation in an accepting state, the machine accepts the input. In terms of regular expression processing, if the machine ends in an accepting state, the input text matches the regular expression.
You can represent a DFA with a state transition diagram, which is basically a network in which circles represent states and directed links represent transitions to new states.
DFAs (contd.)
Each link is labeled with the inputs that make the machine move into the new state. If the machine encounters an input that has no corresponding link, then it halts in a non-accepting state.
For example, the figure shows the state transitions for a DFA that recognizes the pattern AB*A. The DFA starts in state 0. If it reads an A character, it moves to state 1. If it sees any other character, the machine halts in a non-accepting state.
Next, if the DFA is in state 1 and reads a B, it follows the loop and returns to state 1. If the DFA is in state 1 and reads an A, it moves to state 2. State 2 is marked with a double circle to indicate that it is an accepting state.
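The DFA for AB*A described above can be simulated with a small transition table. This is a minimal sketch; the dictionary layout and the name `dfa_accepts` are assumptions chosen for illustration, while the state numbers follow the description on the slide.

```python
# Transition table for the DFA that recognizes AB*A.
# Keys are (current state, input character); a missing key means
# "no corresponding link", so the machine halts without accepting.
TRANSITIONS = {
    (0, "A"): 1,  # the first A moves from the start state to state 1
    (1, "B"): 1,  # each B loops back to state 1
    (1, "A"): 2,  # the closing A moves to the accepting state
}
START_STATE = 0
ACCEPTING_STATES = {2}


def dfa_accepts(text):
    """Return True if the DFA ends in an accepting state after reading text."""
    state = START_STATE
    for ch in text:
        key = (state, ch)
        if key not in TRANSITIONS:
            return False  # no link for this input: halt in a non-accepting state
        state = TRANSITIONS[key]
    return state in ACCEPTING_STATES


for s in ["ABA", "ABBBBA", "AA", "AB", "ABAA"]:
    print(s, dfa_accepts(s))
```

Because each (state, input) pair maps to at most one next state, the simulation never has to choose between links, which is what makes the machine deterministic.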
Building DFAs for Regular Expressions
You can translate simple regular expressions into transition diagrams and transition tables easily enough by using intuition, but for complicated regular expressions, it's nice to have a methodical approach. Then you can apply this approach to let a program do the work for you.
The figure below shows the transition diagrams for the simple literal patterns A and B on the left and the combined pattern A + B on the right.
Building DFAs for Regular Expressions (contd.)
To implement the * operator, make the single sub-expression's accepting state coincide with the sub-expression's starting state. The figure (right) shows the transition diagram for the pattern A + B on the left and the pattern (A + B)* on the right.
To implement the | operator, make the starting and ending states of the left and right sub-expressions' transition diagrams coincide. The figure (left) shows the transition diagrams for the patterns A + B and B + A on the left and the combined pattern (A + B) | (B + A) on the right.
Building DFAs for Regular Expressions (contd.)
What happens to the | operator if the two sub-expressions start with the same input transitions? For example, suppose the two sub-expressions are A + A and A + B.
In that case, blindly following the previous discussion leads to the transition diagram on the left in the given figure. It has two links labeled A that leave state 0. If the DFA is in state 0 and encounters input character A, which link should it follow?
One solution is to restructure the diagram so that the diagrams for the two sub-expressions share their first state (state 1). If the sub-expressions were more complicated, finding a similar solution might be difficult, at least for a program.
Another solution to this problem is to use an NFA instead of a DFA.
NFAs
A deterministic finite automaton is called deterministic because its behavior is completely determined by its current state and the input it sees. A DFA moves into state 2 from state 1, if the input is correct, without question.
A nondeterministic finite automaton (NFA) is similar to a DFA, except that multiple links may leave a state for the same input, as shown on the left in the previous figure. When that situation occurs during processing, the NFA is allowed to guess which path it should follow to eventually reach an accepting state.
Of course, in practice a computer cannot really guess which state it should move into to eventually find an accepting state. What it can do is try all the possible paths. To do that, a program can keep a list of states it might be in. When it sees an input, the program updates each of those states, possibly creating a larger number of states.
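The idea of keeping a list of states the machine might be in can be sketched as follows for the pattern (A + A) | (A + B), built naively so that two links labeled A leave state 0. The state numbering here is hypothetical, since the figure is not reproduced, and the name `nfa_accepts` is made up for illustration.

```python
# Each (state, input) pair maps to the SET of states the NFA may move to.
# Two links labeled A leave state 0: one toward the A + A branch (state 1)
# and one toward the A + B branch (state 2). State 3 is accepting.
NFA_TRANSITIONS = {
    (0, "A"): {1, 2},
    (1, "A"): {3},
    (2, "B"): {3},
}
NFA_START = 0
NFA_ACCEPTING = {3}


def nfa_accepts(text):
    """Track every state the NFA could be in instead of guessing one path."""
    current = {NFA_START}
    for ch in text:
        next_states = set()
        for state in current:
            # Follow every link that matches this input from this state.
            next_states |= NFA_TRANSITIONS.get((state, ch), set())
        current = next_states
        if not current:
            return False  # every possible path has died out
    return bool(current & NFA_ACCEPTING)


for s in ["AA", "AB", "A", "BA", "AAB"]:
    print(s, nfa_accepts(s))
```

After reading the first A the program is "in" states 1 and 2 at once; the second character then decides which branch survives.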
NFAs (contd.)
To make NFAs slightly easier to implement, you can introduce a new kind of null transition that occurs without any input. If the NFA encounters a null transition, it immediately follows it.
The figure below shows how you can combine state transition machines for sub-expressions to produce more-complex expressions. Here the Ø character indicates a null transition, and a box indicates a possibly complicated network of states representing a sub-expression.
NFAs (contd.)
The first part of the figure shows a set of states representing some sub-expression. This could be as simple as a single transition that matches a single input, or it could be a complicated set of states and transitions. The only important feature of this construct from the point of view of the rest of the states is that it has a single input state and a single output state.
The second part of the figure shows how you can combine two machines, M1 and M2, by using the + operator. The output state from M1 is connected by a null transition to the input state of M2.
NFAs (contd.)
The third part of the figure shows how you can add the * operator to M1. M1's output state is connected to its input state by a null transition. The * operator allows whatever it follows to occur any number of times, including zero times, so another null transition allows the NFA to jump to the accept state without matching whatever is inside M1.
The final part shows how you can combine two machines, M1 and M2, by using the | operator. The resulting machine uses a new input state, connected by null transitions to the input states of M1 and M2, and a new final output state, reached by null transitions from the output states of M1 and M2.
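The combination rules above can be sketched in code by representing each machine as a tuple (input state, output state, list of links), with `None` standing for a null transition. This is a minimal sketch, not a definitive implementation, and all function names are made up for illustration. One deliberate difference from the slide: `star` wraps M1 in a fresh input state and a fresh output state (the usual Thompson-style variant) rather than linking M1's own states directly, because that keeps nested expressions such as (A*B)* correct.

```python
from itertools import count

_new_state = count()  # hands out fresh state numbers


def literal(ch):
    """Machine that matches the single input ch."""
    s, e = next(_new_state), next(_new_state)
    return s, e, [(s, ch, e)]


def concat(m1, m2):
    """M1 + M2: null transition from M1's output state to M2's input state."""
    s1, e1, t1 = m1
    s2, e2, t2 = m2
    return s1, e2, t1 + t2 + [(e1, None, s2)]


def star(m1):
    """M1*: loop back to repeat, plus a bypass so zero repetitions are allowed."""
    s1, e1, t1 = m1
    s, e = next(_new_state), next(_new_state)
    return s, e, t1 + [(s, None, s1), (e1, None, e), (e1, None, s1), (s, None, e)]


def alternate(m1, m2):
    """M1 | M2: new input and output states joined to both machines by null links."""
    s1, e1, t1 = m1
    s2, e2, t2 = m2
    s, e = next(_new_state), next(_new_state)
    return s, e, t1 + t2 + [(s, None, s1), (s, None, s2), (e1, None, e), (e2, None, e)]


def closure(states, links):
    """All states reachable from `states` by following null transitions only."""
    seen, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for src, sym, dst in links:
            if src == state and sym is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(machine, text):
    """Simulate the NFA by tracking the set of states it might be in."""
    start, end, links = machine
    current = closure({start}, links)
    for ch in text:
        moved = {dst for src, sym, dst in links if src in current and sym == ch}
        current = closure(moved, links)
    return end in current


ab_star_a = concat(literal("A"), concat(star(literal("B")), literal("A")))
ab_or_ba = alternate(concat(literal("A"), literal("B")),
                     concat(literal("B"), literal("A")))

print(accepts(ab_star_a, "ABBBBA"), accepts(ab_star_a, "AB"))
print(accepts(ab_or_ba, "BA"), accepts(ab_or_ba, "AA"))
```

Scanning the whole link list on every step keeps the sketch short; a practical implementation would index links by source state.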
String Searching
The methods of using DFAs and NFAs to search for patterns in a string are quite flexible, but they're also relatively slow. To search for a complicated pattern, an NFA might need to track a large number of states as it examines each character in an input string one at a time.
If you want to search a piece of text for a target substring instead of a pattern, there are faster approaches.
Brute-force Approach
The most obvious strategy is to loop over all the characters in the text and see if the target is at each position. The pseudo-code on the next slide shows this brute-force approach:
String Searching
Brute-force Approach (contd.)
// Return the position of the target in the text.
Integer: FindTarget(String: text, String: target)
    For i = 0 To <length of text> - <length of target>
        // See if the target begins at position i.
        Boolean: found_it = True
        For j = 0 To <length of target> - 1
            If (text[i + j] != target[j]) Then found_it = False
        Next j

        // See if we found the target.
        If (found_it) Then Return i
    Next i

    // If we got here, the target isn't present.
    Return -1
End FindTarget
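The brute-force search can be sketched in Python as follows. The loop bounds are an assumption chosen so that `text[i + j]` never runs past the end of the text, and the early `break` is a small addition that the pseudo-code does not have; the function name `find_target` simply mirrors FindTarget.

```python
def find_target(text, target):
    """Return the index of the first occurrence of target in text, or -1."""
    n, m = len(text), len(target)
    # The target can only start where at least m characters remain.
    for i in range(n - m + 1):
        # See if the target begins at position i.
        found_it = True
        for j in range(m):
            if text[i + j] != target[j]:
                found_it = False
                break  # stop at the first mismatch instead of scanning on
        if found_it:
            return i
    # If we got here, the target isn't present.
    return -1


print(find_target("AABBA", "ABBA"))
print(find_target("hello world", "xyz"))
```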
String Searching
Brute-force Approach (contd.)
In this algorithm, variable i loops over the positions in the text. For each value of i, the variable j loops over the positions in the target.
If the text has length N and the target has length M, the total run time is O(N × M) in the worst case.
This is simpler than using an NFA, but it's still not very efficient.
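To see where the O(N × M) bound comes from, the following sketch counts character comparisons for a brute-force search that stops each attempt at the first mismatch. The function name `count_comparisons` and the test strings are made up for illustration. With a text of N identical letters and a target that differs only in its last letter, every one of the N - M + 1 start positions costs M comparisons.

```python
def count_comparisons(text, target):
    """Count character comparisons made by a brute-force search."""
    comparisons = 0
    n, m = len(text), len(target)
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != target[j]:
                break  # mismatch: give up on this start position
            j += 1
        if j == m:
            break  # every character matched: the target was found
    return comparisons


n, m = 20, 4
worst = count_comparisons("A" * n, "A" * (m - 1) + "B")
print(worst, (n - m + 1) * m)  # both values are 68
```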
String Searching
Boyer-Moore Algorithm
The Boyer-Moore algorithm uses a quicker approach to search for target substrings. Instead of looping through the target's characters from the beginning, it examines the target's characters starting at the end and works backwards.
The easiest way to understand the algorithm is to imagine the target substring sitting below the text at a position where a match might occur. The algorithm compares characters starting at the target's rightmost character. If it finds a position where the target and text don't match, the algorithm slides the target to the right to the next position where a match might be possible.
String Searching
Boyer-Moore Algorithm (contd.)
The brute-force algorithm described earlier would have required at least 27 comparisons to decide that the target wasn't present. The Boyer-Moore algorithm required only three comparisons in this example.
String Searching
Boyer-Moore Algorithm (contd.)
Things don't always work out this smoothly. Consider a more complicated example:
String Searching
Boyer-Moore Algorithm (contd.)
The following steps describe the basic Boyer-Moore algorithm at a high level:
1. Align the target and text on the left.
2. Repeat until the target's last character is aligned beyond the end of the text:
   a) Compare the characters in the target with the corresponding characters in the text, starting from the end of the target and moving backwards toward the beginning.
   b) If all the characters match, you've found a match.
   c) Suppose character X in the text doesn't match the corresponding character in the target. Slide the target to the right until the X aligns with the next character with the same value X in the target to the left of the current position. If no such character X exists to the left of the position in the target, slide the target to the right until its first character sits just past the X. This is a shift by the target's full length when the mismatch occurs at the target's last character.
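The steps above can be sketched in Python as follows. This is a minimal sketch of the basic sliding rule only, not a full Boyer-Moore implementation: it looks up the nearest X with `str.rfind` on each mismatch, whereas a practical version would precompute a lookup table from the target. The name `boyer_moore_find` is made up for illustration.

```python
def boyer_moore_find(text, target):
    """Return the index of the first occurrence of target in text, or -1."""
    n, m = len(text), len(target)
    if m == 0:
        return 0
    i = 0  # position in the text aligned with target[0]
    # Repeat until the target's last character passes the end of the text.
    while i + m <= n:
        # Compare from the end of the target, moving backwards.
        j = m - 1
        while j >= 0 and text[i + j] == target[j]:
            j -= 1
        if j < 0:
            return i  # all the characters matched
        x = text[i + j]  # the mismatched character X in the text
        # Nearest occurrence of X in the target to the left of position j.
        k = target.rfind(x, 0, j)
        # If X was found, this aligns it with the X in the text. If not,
        # k is -1 and the shift of j + 1 moves the target just past X
        # (a full-length shift only when the mismatch is at the last character).
        i += j - k
    return -1


print(boyer_moore_find("AABBA", "ABBA"))
print(boyer_moore_find("ABXABCAB", "ABCAB"))
print(boyer_moore_find("hello world", "xyz"))
```

The second call shows why the shift after a mismatch in the middle of the target must stop just past X: the match at index 3 overlaps the characters that were already compared.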