Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 i206: Lecture 18: Regular Expressions Marti Hearst Spring 2012.

Similar presentations


Presentation on theme: "1 i206: Lecture 18: Regular Expressions Marti Hearst Spring 2012."— Presentation transcript:

1 1 i206: Lecture 18: Regular Expressions Marti Hearst Spring 2012

2 2 Formal Languages Bits & Bytes Binary Numbers Number Systems Gates Boolean Logic Circuits CPU Machine Instructions Assembly Instructions Program Algorithms Application Memory Data compression Compiler/ Interpreter Operating System Data Structures Analysis I/O Memory hierarchy Design Methodologies/ Tools Process Truth table Venn Diagram DeMorgan ’ s Law Numbers, text, audio, video, image, … Decimal, Hexadecimal, Binary AND, OR, NOT, XOR, NAND, NOR, etc. Register, Cache Main Memory, Secondary Storage Op-code, operands Instruction set arch Lossless v. lossy Info entropy & Huffman code Adders, decoders, Memory latches, ALUs, etc. Data Representation Data storage Principles ALUs, Registers, Program Counter, Instruction Register Network Distributed Systems Security Cryptography Standards & Protocols Inter-process Communication Searching, sorting, etc. Stacks, queues, maps, trees, graphs, … Big-O Formal models Finite automata regex

3 3

4 4 Theories of Computation What are the fundamental capabilities and limitations of computers? –Complexity theory: what makes some problems computationally hard and others easy? –Automata theory: definitions and properties of mathematical models of computation –Computability theory: what makes some problems solvable and others unsolvable?

5 5 Finite Automata Also known as finite state automata (FSA) or finite state machine (FSM) A simple mathematical model of a computer Applications: hardware design, compiler design, text processing

6 6 A First Example Touch-less hand-dryer DRYER OFF DRYER ON

7 7 A First Example Touch-less hand-dryer hand DRYER OFF DRYER ON hand

8 8 Finite Automata State Graphs The start state An accepting state A transition x A state

9 9 Finite Automata Transition:s 1  x s 2 –In state s 1 on input “ x ” go to state s 2 If end of input –If in accepting state => accept –Else => reject If no transition possible => reject

10 10 Language of a FA Language of finite automaton M: set of all strings accepted by M Example: Which of the following are in the language? –x, tmp2, 123, a?, 2apples A language is called a regular language if it is recognized by some finite automaton letter letter | digit S A

11 11 Example What is the language of this FA? + digit S B A -

12 12 Regular Expressions Regular expressions are used to describe regular languages Arithmetic expression example: (8+2)*3 Regular expression example: (\+|-)?[0-9]+

13 13 Example What is the language of this FA? Regular expression: (\+|-)?[0-9]+ + digit S B A -

14 14 Three Equivalent Representations Finite automata Regular expressions Regular languages Each can describe the others Theorem: For every regular expression, there is a deterministic finite-state automaton that defines the same language, and vice versa. Adapted from Jurafsky & Martin 2000

15 15 Regex Rules ? Zero or one occurrences of the preceding character/regex woodchucks? “ how much wood does a woodchuck chuck? ” behaviou?r “ behaviour is the British spelling of behavior ” * Zero or more occurrences of the preceding character/regex baa* ba, baa, baaa, baaaa … ba* b, ba, baa, baaa, baaaa … [ab]* , a, b, ab, ba, baaa, aaabbb, … [0-9][0-9]* any positive integer, or zero cat.*cat A string where “ cat ” appears twice anywhere + One or more occurrences of the preceding character/regex ba+ ba, baa, baaa, baaaa … {n} Exactly n occurrences of the preceding character/regex ba{3}baaa

16 16 Regex Rules * is greedy: Home Lazy (non-greedy) quantifier: Home

17 17 Regex Rules [] Disjunction (Union) [wW]ood “ how much wood does a Woodchuck chuck? ” [abcd]* “ you are a good programmer ” [A-Za-z0-9] (any letter or digit) [A-Za-z]* (any letter sequence) | Disjunction (Union) (cats?|dogs?)+ “ It ’ s raining cats and a dog. ” ( ) Grouping (gupp(y|ies))* “ His guppy is the king of guppies. ”

18 18 Regex Rules ^ $ \b Anchors (start/end of input string; word boundary) ^The “ The cat in the hat. ” ^The end\.$ “ The end. ” ^The.* end\.$ “ The bitter end. ” (the)* “ I saw him the other day. ” (\bthe\b)* “ I saw him the other day. ” Special rule: when ^ is FIRST WITHIN BRACKETS it means NOT [^A-Z]* (anything not an upper case letter)

19 19 Regex Rules \ Escape characters \. “ The + and \ characters are missing. ” \+ “ The + and \ characters are missing. ” \\ “ The + and \ characters are missing. ”

20 20 Operator Precedence What is the difference? –[a-z][a-z]|[0-9]* –[a-z]([a-z]|[0-9])* OperatorPrecedence ()highest * + ? {} sequences, anchors |lowest

21 21 Regex for Dollars No commas \$[0-9]+(\.[0-9][0-9])? With commas \$[0-9][0-9]?[0-9]?(,[0-9][0-9][0-9])*(\.[0-9][0-9])? With or without commas \$[0-9][0-9]?[0-9]?((,[0-9][0-9][0-9])*| [0-9]*) (\.[0-9][0-9])?

22 22 Regex in Python import re result = re.search(pattern, string) result = re.findall(pattern, string) result = re.match(pattern, string) Python documentation on regular expressions –http://docs.python.org/release/3.1.3/library/re.html –Some useful flags like IGNORECASE, MULTILINE, DOTALL, VERBOSE


Download ppt "1 i206: Lecture 18: Regular Expressions Marti Hearst Spring 2012."

Similar presentations


Ads by Google