Extending Finite Automata to Efficiently Match Perl-Compatible Regular Expressions
Publisher: Conference on emerging Networking EXperiments and Technologies (CoNEXT), 2008
Authors: Michela Becchi and Patrick Crowley
Presenter: Yu-Hsiang Wang
Date: 2011/02/16

Outline
  Introduction
  Counting Constraints
  Back-References
  Combining Multiple Reg-Ex
  Architecture
  Experimental Evaluation

Introduction
As of November 2007, 5,549 of the 8,536 Snort rules contain at least one Perl-Compatible Regular Expression (PCRE). Among these, 905 (16.3%) and 2,445 (44%) contain unbounded and bounded repetitions of large character classes, respectively, and 338 (6%) include back-references.
This paper shows how the proposed extended automaton can be combined with the hybrid-FA proposed in [20].

Counting Constraints - NFA
When the counting constraint n is large, the number of NFA states is linear in n.
The basic problem of an NFA representation is that, during operation, many states can be active in parallel, leading to a high memory bandwidth requirement and/or processing time (e.g. on the input aaaaaaaaaaaa...aaaabc).
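
The effect of a long run of repeated characters on the number of active NFA states can be sketched as follows. This is a minimal illustration, assuming the pattern `.*a.{n}bc`; the pattern, the state numbering, and the function name `max_active_states` are assumptions made for this example and do not come from the slides.

```python
def max_active_states(text, n):
    """Simulate an NFA for the assumed pattern .*a.{n}bc and return the
    largest number of simultaneously active states.

    Assumed state numbering (n + 4 states in total, i.e. linear in n):
      0        start state with a self-loop on every character
      1..n+1   chain entered on 'a'; each further character advances by one
      n+2      reached from n+1 on 'b'
      n+3      reached from n+2 on 'c' (accepting)
    """
    active = {0}
    peak = len(active)
    for ch in text:
        nxt = {0}                      # the start state never deactivates
        for s in active:
            if s == 0:
                if ch == "a":
                    nxt.add(1)         # a new 'a' opens a new counting run
            elif s <= n:
                nxt.add(s + 1)         # any character advances the chain
            elif s == n + 1:
                if ch == "b":
                    nxt.add(n + 2)
            elif s == n + 2:
                if ch == "c":
                    nxt.add(n + 3)
        active = nxt
        peak = max(peak, len(active))
    return peak


# A long run of 'a' keeps the whole chain active at once.
print(max_active_states("a" * 50 + "bc", 10))
```

With this numbering, every state of the chain is active at the same time on such an input, so the work per input character grows with n.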

Counting Constraints - DFA
For large n, the number of DFA states is exponential in n, e.g. n=40 => about 1000 billion states.
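
The exponential growth can be observed on small values of n with a subset construction. The sketch below assumes the simplified pattern `.*a.{n}` over the two-symbol alphabet {a, b}; the pattern, the alphabet, and the function name `count_dfa_states` are assumptions chosen to keep the example small.

```python
def count_dfa_states(n):
    """Count the reachable DFA states for the assumed pattern .*a.{n}
    over the alphabet {a, b}, using the subset construction.

    Assumed NFA: state 0 loops on every symbol, 'a' leads from 0 to 1,
    and every symbol advances state s to s + 1 for 1 <= s <= n.
    State n + 1 is accepting and has no outgoing transitions.
    """
    def step(subset, ch):
        nxt = {0}
        for s in subset:
            if s == 0 and ch == "a":
                nxt.add(1)
            elif 1 <= s <= n:
                nxt.add(s + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    frontier = [start]
    while frontier:
        subset = frontier.pop()
        for ch in "ab":
            nxt = step(subset, ch)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)


for n in range(1, 9):
    print(n, count_dfa_states(n))
```

For this assumed pattern the count doubles with each increment of n. For comparison, 2**40 is 1,099,511,627,776, which is the order of magnitude quoted above for n=40.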

Counting NFA
This basic concept is complicated by the observation that, to preserve functional equivalence between the original NFA and the counting-NFA, one instance of the counter is not enough (e.g. on the input axaybzbc).
[Figure: counting-NFA with states 0-4 and ∑ self-loops; transition labels as extracted: "a, cnt", "cnt++", "b | cnt=n", "c | cnt n"]
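
Why a single counter instance is not enough can be sketched on the input axaybzbc. The example assumes the pattern `a.{n}bc` with n = 3, which is a guess that fits the input and the figure labels; the function `counting_nfa_match` and its `max_instances` parameter are hypothetical and exist only to compare one counter against several.

```python
def counting_nfa_match(text, n, max_instances=None):
    """Search for the assumed pattern a.{n}bc with a counting automaton.

    counters      one counter instance per 'a' that is still being tracked
    max_instances hypothetical cap on the number of instances; 1 emulates
                  an automaton with a single counter
    """
    counters = []        # values of the active counter instances, oldest first
    after_b = False      # True when the state reached on 'b' is active
    for ch in text:
        if after_b and ch == "c":
            return True
        # Conditional transition "b | cnt=n": taken if some instance equals n.
        next_after_b = ch == "b" and n in counters
        # "cnt++" acts on every instance; instances beyond n are dropped.
        counters = [c + 1 for c in counters if c + 1 <= n]
        if ch == "a" and (max_instances is None or len(counters) < max_instances):
            counters.append(0)           # a new 'a' starts a new instance
        after_b = next_after_b
    return False


print(counting_nfa_match("axaybzbc", 3))                   # several instances
print(counting_nfa_match("axaybzbc", 3, max_instances=1))  # single instance
```

In this sketch the match starts at the second a, so an automaton that only tracks the first a with one counter misses it.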

Counting NFA
Differential representation: since the increment operation acts in parallel on all the counter instances, the difference between them remains constant over execution.
Store the oldest (and largest) instance $c_i'$ and, for $j>i$, $\Delta c_j = c_j - c_{j-1}$.
[Figure: stored counter values c', Δc_i, Δc_i for n=10]
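
The differential representation can be sketched as a small data structure in which an increment touches only the oldest instance. The class name `DiffCounters`, its methods, and the rule for expiring instances above n are assumptions made for this illustration; the deltas follow the definition above, so they are never positive.

```python
class DiffCounters:
    """Counter instances stored as the oldest value plus differences."""

    def __init__(self, n):
        self.n = n
        self.oldest = None    # value of the oldest (largest) instance
        self.deltas = []      # deltas[j] = c_j - c_{j-1} for younger instances

    def values(self):
        """Reconstruct every counter value from the stored differences."""
        if self.oldest is None:
            return []
        vals = [self.oldest]
        for d in self.deltas:
            vals.append(vals[-1] + d)
        return vals

    def add_instance(self):
        """Start a new instance at 0, stored relative to the youngest one."""
        if self.oldest is None:
            self.oldest = 0
        else:
            self.deltas.append(0 - self.values()[-1])

    def increment(self):
        """Increment all instances by updating only the oldest value."""
        if self.oldest is None:
            return
        self.oldest += 1
        # Assumed expiry rule: an instance above n is discarded, and the
        # next instance becomes the oldest one.
        while self.oldest is not None and self.oldest > self.n:
            self.oldest = self.oldest + self.deltas.pop(0) if self.deltas else None


c = DiffCounters(n=10)
c.add_instance()
for _ in range(3):
    c.increment()
c.add_instance()
for _ in range(2):
    c.increment()
print(c.oldest, c.deltas, c.values())
```

Because the differences never change, the cost of an increment does not depend on how many instances are active.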

Back-References
A back-reference in a regular expression refers to a sub-expression enclosed within capturing parentheses, and indicates that the string matched by that sub-expression must be matched again later within the regular expression.
e.g. .*(abc|bcd).\1y - matches abcdabcdy, does not match abcdaabcy
     .*a([a-z]+)a\1y - matches babacabacy
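
The examples above can be checked with Python's standard `re` module, which supports back-references. This is only a convenience check with a backtracking engine, not the automaton-based matcher discussed in the talk; the variable names are illustrative.

```python
import re

examples = [
    (r".*(abc|bcd).\1y", "abcdabcdy"),
    (r".*(abc|bcd).\1y", "abcdaabcy"),
    (r".*a([a-z]+)a\1y", "babacabacy"),
]

# fullmatch requires the whole input string to match the pattern.
results = [bool(re.fullmatch(pattern, text)) for pattern, text in examples]
print(results)

# The captured group shows which substring the back-reference had to repeat.
captured = re.fullmatch(r".*(abc|bcd).\1y", "abcdabcdy").group(1)
print(captured)
```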

Back-References
( ) Defines a marked sub-expression. The string matched within the parentheses can be recalled later. A marked sub-expression is also called a block or capturing group.
\num Matches what the num-th marked sub-expression matched.
e.g. HTML ( abcde ) /^ (.*) |\s+\/>)$/

Back-References
Each active state can be associated with a set of matched substrings MS_k for each back-reference \k. This is performed as follows:
(i) When a transition S_x → S_y is taken, the set MS_k associated with S_x is moved to S_y.
(ii) If the taken transition is tagged k, the current input character is appended to the strings in MS_k.
If a back-reference \k originates from state S_j, then S_j is consuming: when it is active, all the strings in its MS_k are processed and shortened (one character at a time).
Two special conditional transitions representing the back-reference are created. If the input character matches some string in MS_k, then:
(i) transition S_j → S_{j+1} is taken if the corresponding string is consumed completely;
(ii) transition S_j → S_j is taken otherwise.
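
The two operations on MS_k described above, appending on tagged transitions and shortening in a consuming state, can be sketched as set operations. The function names `capture_step` and `consume_step`, the use of a Python set of strings, and the choice to start from a set holding the empty string are assumptions made for this illustration.

```python
def capture_step(ms_k, ch):
    """Tagged transition: append the input character to every string in MS_k."""
    return {s + ch for s in ms_k}


def consume_step(ms_k, ch):
    """One input character processed by a consuming state S_j.

    Returns (remaining, advance):
      remaining  strings still to be consumed (self-loop S_j -> S_j)
      advance    True if some string was consumed completely (S_j -> S_{j+1})
    Strings whose next character differs from ch are dropped.
    """
    remaining, advance = set(), False
    for s in ms_k:
        if s.startswith(ch):
            if len(s) == 1:
                advance = True
            else:
                remaining.add(s[1:])
    return remaining, advance


# Capture "bcd" on transitions tagged 1, then consume it again for \1.
ms = {""}                      # assumed starting point: one empty string
for ch in "bcd":
    ms = capture_step(ms, ch)
print(ms)

for ch in "bcd":
    ms, advance = consume_step(ms, ch)
    print(ch, ms, advance)
```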

Back-References
RegEx: .*(abc|bcd).\1y
Input: abcdabcdy

Combining Multiple Reg-Ex
States representing dot-star conditions are called special states.
If a regular expression RE1 containing a dot-star is compiled together with a regular expression RE2, the sub-DFA representing the match of RE2 is duplicated: one instance starts at state 0 and one instance starts at the special state.

Combining Multiple Reg-Ex
When several regular expressions containing .* conditions are compiled together, the number of possible special-state combinations affects the complexity and the size of the resulting DFA.
The same considerations hold for extended automata: counting and consuming states behave like special states (they have an auto-loop on a large character class).
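
How the special-state combinations inflate a combined DFA can be sketched with a subset construction over k toy patterns. The patterns `x_i.*y_i` over 2k distinct symbols, the state numbering, and the function name `count_combined_dfa_states` are assumptions made for this illustration, not the rule sets evaluated in the paper.

```python
def count_combined_dfa_states(k):
    """Count reachable DFA states when k assumed patterns x_i.*y_i are
    compiled together (unanchored search, 2k distinct symbols).

    Assumed NFA numbering:
      0          start state, active at every step
      1 + i      special (dot-star) state of pattern i
      1 + k + i  accepting state of pattern i
    Symbols: i stands for x_i, k + i stands for y_i.
    """
    def step(subset, sym):
        nxt = {0}
        for s in subset:
            if 1 <= s <= k:
                nxt.add(s)                 # special state loops on any symbol
                if sym == k + (s - 1):
                    nxt.add(k + s)         # y_i after x_i: pattern i matches
        if sym < k:
            nxt.add(1 + sym)               # x_i enters special state i
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    frontier = [start]
    while frontier:
        subset = frontier.pop()
        for sym in range(2 * k):
            nxt = step(subset, sym)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)


for k in range(1, 7):
    print(k, count_combined_dfa_states(k))
```

Every subset of special states is reachable in this toy setting, so the state count is at least 2**k even though each pattern on its own needs only three states.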

Combining Multiple Reg-Ex
[Figure: head-DFA, tail-DFA 1, tail-DFA 2, ..., tail-DFA k]
When processing the input text, the head-DFA is always active: each input character triggers a state transition on it.
The operation of distinct tail-DFA machines is, in principle, independent. Furthermore, once the head-DFA has activated a tail-DFA, the two can execute in sequence or in parallel threads.
[20] M. Becchi and P. Crowley, “A Hybrid Finite Automaton for Practical Deep Packet Inspection”, in CoNEXT
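
The interplay between an always-active head-DFA and a tail-DFA that is activated on demand can be sketched for a single pattern. The example assumes the pattern `ab.*cd`, split into a head that finds `ab` and a tail that handles `.*cd`; the split, the state numbering, and the functions `head_step`, `tail_step`, and `hybrid_match` are assumptions for this illustration, executed in sequence rather than in parallel threads.

```python
def head_step(state, ch):
    """Head-DFA for an unanchored 'ab'. Returns (next_state, activate_tail)."""
    if ch == "a":
        return 1, False
    if state == 1 and ch == "b":
        return 0, True         # 'ab' seen: the tail-DFA must be activated
    return 0, False


def tail_step(state, ch):
    """Tail-DFA for '.*cd'. State 0 is the dot-star state, 2 is accepting."""
    if ch == "c":
        return 1
    if state == 1 and ch == "d":
        return 2
    return 0


def hybrid_match(text):
    """Return the positions at which the assumed pattern ab.*cd matches."""
    head = 0
    tail = None                # None means the tail-DFA is not active
    matches = []
    for pos, ch in enumerate(text):
        if tail is not None:   # the tail only runs once it has been activated
            tail = tail_step(tail, ch)
            if tail == 2:
                matches.append(pos)
        head, activate = head_step(head, ch)   # the head runs on every character
        if activate and tail is None:
            tail = 0
    return matches


print(hybrid_match("xxabyycd"))
print(hybrid_match("cdab"))
```

In this sketch a second activation is ignored while the tail is already running, because the tail starts in a dot-star state that any running instance already covers.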

Combining Multiple Reg-Ex
The tail-DFA is activated every time the head-state 0-3 is traversed.

Combining Multiple Reg-Ex
A distinct activation of the tail-DFA is required each time head-state 1-3 is reached. Therefore, the tail-DFA may be active in parallel on different states.
A new activation of the tail-DFA always begins from the counting state 3. This ensures that, if the tail-DFA is already active, the new activation is covered by the current active state.

Architecture
[Figure: architecture with Head-DFA and Tail-DFA components]

Experimental Evaluation