Notes on Python Regular Expressions and parser generators (by D. Parson) These are the Python supplements to the author’s slides for Chapter 1 and Section C310Spring2010.html has a link to the author’s slides, which are password protected by your K.U. Windows login / password used to access your student account. C310Spring2010.html
Regular Expressions in Python re module in the optional Python text. A RE is a pattern in the form of a string. compile(pattern [, flags]) compiles an RE expression into a finite automaton object. Return value can be used by other functions. Flags are for case, multiline, and meta-character options. search(pattern, string [, flags) searches string for the first match of pattern. match(pattern, string [, flags) checks at string’s beginning. Both return a MatchObject or None.
Regular Expressions in Python split(pattern, string [, maxsplit = 0]) splits string into occurrences of pattern. Returns a list of strings sub(pattern, repl, string [, count = 0]) performs substitutions of repl for pattern occurrences. String and sequence operations are related. >>> s = "abcde" >>> dir(s) ['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_formatter_field_name_split', '_formatter_parser', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
Python Regular Expression Examples >>> m1 = search('a+z*(b.d)', 'abcdefghi') >>> m1 >>> m1.groups() ('bcd',) >>> m1.start() 0 >>> m1.end() 4 >>> m1.start(0) 0 >>> m1.start(1) 1 # Group 0 is the entire match, 1 is the first parenthesized subexpression, etc.
Learn the major Meta-characters! Text – verbatim text. – any character except newline ^ – matches start of the string (anchor) $ – matches end of the string * – Kleene start, 0 or more subpattern repetitions + – Kleene plus, 1 or more subpattern repetitions ? – optional, 0 or 1 subpattern occurrence | – alternation, either left or right subpattern () – group a subexpression inside parentheses \ – escape a meta-character (make it normal) [set of chars], [^set of chars not matched]
More Python RE Examples >>> m2 = search('a+z*(b.d)', 'Abcde') >>> m2 >>> print m2 None >>> split(':', "abc:cd:e:f") ['abc', 'cd', 'e', 'f'] >>> split('[:]', "abc:cd:e:f") ['abc', 'cd', 'e', 'f'] >>> split('[^:]', "abc:cd:e:f") ['', '', '', ':', '', ':', ':', '']
More Python RE Examples (sub) >>> sub('a([^b]+)b', 'A\\1B', 'a123b45ab67a9b aab') 'A123B45ab67A9B AaB' The parenthesized subexpression matches one or more occurrences of anything except for b. The matched substring of the first parenthesized subexpression is group 1. The replacement pattern \1 says “insert group 1 at this point.” Effect is to re-insert characters between a and b.
Finite State Automata A regular expression compiler translates a regular expression into a finite state automaton. This could be a linked data structure or code. It looks like a graph of mapping steps needed for the regular expression. There are nondeterministic and deterministic flavors. (a|b)c+d is a simple example expression. a b start c d accept c c ε s1 s2 s3s4
Lookahead 1 types of parsers. LL(1) and LR(1) grammars require a parser to get at most 1 look-ahead terminal from the scanner. LL(1) cannot handle left-recursive grammar productions. It can handle other recursion. LR(1) and its variants can handle left, right and nested recursion; left is the most efficient. A generated parser is essentially a deterministic finite state automaton that uses a stack to keep track of nested syntactic structures. This topic is covered exhaustively in compiler design.
Parser generators in Python. YAPPS2 is an LL(1) parser generator. PLY is a Python LALR(1) (subset of LR(1)) equivalent to UNIX YACC and GNU Bison that are used to generate compilers for C code. Both generate Python executable parsers from stylized Python code.