Regular Expressions and Finite State Automata. Introduction Regular expressions are equivalent to Finite State Automata in recognizing regular languages,

Slides:



Advertisements
Similar presentations
Recognising Languages We will tackle the problem of defining languages by considering how we could recognise them. Problem: Is there a method of recognising.
Advertisements

CSC 361NFA vs. DFA1. CSC 361NFA vs. DFA2 NFAs vs. DFAs NFAs can be constructed from DFAs using transitions: Called NFA- Suppose M 1 accepts L 1, M 2 accepts.
Regular Expressions and DFAs COP 3402 (Summer 2014)
Lexical Analysis Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!
Finite Automata CPSC 388 Ellen Walker Hiram College.
1 1 CDT314 FABER Formal Languages, Automata and Models of Computation Lecture 3 School of Innovation, Design and Engineering Mälardalen University 2012.
Regular Expressions (RE) Used for specifying text search strings. Standarized and used widely (UNIX: vi, perl, grep. Microsoft Word and other text editors…)
1 Regular Expressions and Automata September Lecture #2-2.
Winter 2007SEG2101 Chapter 81 Chapter 8 Lexical Analysis.
Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.
PZ02B Programming Language design and Implementation -4th Edition Copyright©Prentice Hall, PZ02B - Regular grammars Programming Language Design.
CSC 361Finite Automata1. CSC 361Finite Automata2 Formal Specification of Languages Generators Grammars Context-free Regular Regular Expressions Recognizers.
CS Chapter 2. LanguageMachineGrammar RegularFinite AutomatonRegular Expression, Regular Grammar Context-FreePushdown AutomatonContext-Free Grammar.
Regular Expressions and Automata Chapter 2. Regular Expressions Standard notation for characterizing text sequences Used in all kinds of text processing.
Languages and Machines Unit two: Regular languages and Finite State Automata.
Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.
1 Scanning Aaron Bloomfield CS 415 Fall Parsing & Scanning In real compilers the recognizer is split into two phases –Scanner: translate input.
CPSC 388 – Compiler Design and Construction
Chapter 2: Finite-State Machines Heshaam Faili University of Tehran.
Language Recognizer Connecting Type 3 languages and Finite State Automata Copyright © – Curt Hill.
CPSC 388 – Compiler Design and Construction Scanners – Finite State Automata.
Finite-State Machines with No Output Longin Jan Latecki Temple University Based on Slides by Elsa L Gunter, NJIT, and by Costas Busch Costas Busch.
Finite-State Machines with No Output
Chapter 2. Regular Expressions and Automata From: Chapter 2 of An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition,
1 Outline Informal sketch of lexical analysis –Identifies tokens in input string Issues in lexical analysis –Lookahead –Ambiguities Specifying lexers –Regular.
PZ02B Programming Language design and Implementation -4th Edition Copyright©Prentice Hall, PZ02B - Regular grammars Programming Language Design.
1 Regular Expressions. 2 Regular expressions describe regular languages Example: describes the language.
Automating Construction of Lexers. Example in javacc TOKEN: { ( | | "_")* > | ( )* > | } SKIP: { " " | "\n" | "\t" } --> get automatically generated code.
CS412/413 Introduction to Compilers Radu Rugina Lecture 4: Lexical Analyzers 28 Jan 02.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 3 Mälardalen University 2010.
Regular Expressions CIS 361. Need finite descriptions of infinite sets of strings. Discover and specify “regularity”. The set of languages over a finite.
CS 321 Programming Languages and Compilers Lectures 16 & 17 Introduction to Formal Languages Regular Languages Lexical Analysis.
2. Regular Expressions and Automata 2007 년 3 월 31 일 인공지능 연구실 이경택 Text: Speech and Language Processing Page.33 ~ 56.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
CS 3813: Introduction to Formal Languages and Automata Chapter 2 Deterministic finite automata These class notes are based on material from our textbook,
CMSC 330: Organization of Programming Languages Theory of Regular Expressions Finite Automata.
CS 203: Introduction to Formal Languages and Automata
Donghyun (David) Kim Department of Mathematics and Physics North Carolina Central University 1 Chapter 1 Regular Languages Some slides are in courtesy.
Modeling Computation: Finite State Machines without Output
UNIT - I Formal Language and Regular Expressions: Languages Definition regular expressions Regular sets identity rules. Finite Automata: DFA NFA NFA with.
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 3 Mälardalen University 2007.
using Deterministic Finite Automata & Nondeterministic Finite Automata
1 Topic 2: Lexing and Flexing COS 320 Compiling Techniques Princeton University Spring 2016 Lennart Beringer.
CS 154 Formal Languages and Computability February 9 Class Meeting Department of Computer Science San Jose State University Spring 2016 Instructor: Ron.
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
Chapter 5 Finite Automata Finite State Automata n Capable of recognizing numerous symbol patterns, the class of regular languages n Suitable for.
BİL711 Natural Language Processing1 Regular Expressions & FSAs Any regular expression can be realized as a finite state automaton (FSA) There are two kinds.
CSCI 4325 / 6339 Theory of Computation Zhixiang Chen.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
LECTURE 5 Scanning. SYNTAX ANALYSIS We know from our previous lectures that the process of verifying the syntax of the program is performed in two stages:
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 3 Mälardalen University 2006.
Set, Alphabets, Strings, and Languages. The regular languages. Clouser properties of regular sets. Finite State Automata. Types of Finite State Automata.
Deterministic Finite Automata Nondeterministic Finite Automata.
CS412/413 Introduction to Compilers Radu Rugina Lecture 3: Finite Automata 25 Jan 02.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
Compilers Lexical Analysis 1. while (y < z) { int x = a + b; y += x; } 2.
1 Regular grammars Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Section
Topic 3: Automata Theory 1. OutlineOutline Finite state machine, Regular expressions, DFA, NDFA, and their equivalence, Grammars and Chomsky hierarchy.
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
More Details about Regular Expressions
G. Pullaiah College of Engineering and Technology
Two issues in lexical analysis
Chapter 2 FINITE AUTOMATA.
Some slides by Elsa L Gunter, NJIT, and by Costas Busch
Compiler Design 2. Regular Expressions & Finite State Automata (FSA)
Lecture 5 Scanning.
Presentation transcript:

Regular Expressions and Finite State Automata

Introduction Regular expressions are equivalent to Finite State Automata in recognizing regular languages, the first step in the Chomsky hierarchy of formal languages The term regular expressions is also used to mean the extended set of string matching expressions used in many modern languages –Some people use the term regexp to distinguish this use Some parts of regexps are just syntactic extensions of regular expressions and can be implemented as a regular expression – other parts are significant extensions of the power of the language and are not equivalent to finite automata

Concepts and Notations Set : An unordered collection of unique elements S 1 = { a, b, c } S 2 = { 0, 1, …, 19 } empty set :  membership : x  S union : S 1  S 2 = { a, b, c, 0, 1, …, 19 } universe of discourse : U subset : S 1  U complement : if U = { a, b, …, z }, then S 1 ' = { d, e, …, z } = U - S 1 Alphabet : A finite set of symbols –Examples: Character sets: ASCII, ISO , UnicodeASCIIISO    = { a, b }  2  = { Spring, Summer, Autumn, Winter } String : A sequence of zero or more symbols from an alphabet –The empty string: 

Concepts and Notations Language : A set of strings over an alphabet –Also known as a formal language; may not bear any resemblance to a natural language, but could model a subset of one. –The language comprising all strings over an alphabet  is written as:  * Graph : A set of nodes (or vertices), some or all of which may be connected by edges. – An example: – A directed graph example: a b c

Regular Expressions A regular expression defines a regular language over an alphabet  : –  is a regular language: // – Any symbol from  is a regular language:  = { a, b, c} /a/ /b/ /c/ – Two concatenated regular languages is a regular language:  = { a, b, c} /ab/ /bc/ /ca/

Regular Expressions Regular language (continued): –The union (or disjunction) of two regular languages is a regular language:  = { a, b, c} /ab|bc/ /ca|bb/ –The Kleene closure (denoted by the Kleene star: *) of a regular language is a regular language:  = { a, b, c} /a*/ /(ab|ca)*/ –Parentheses group a sub-language to override operator precedence (and, we’ll see later, for “memory”).

Finite Automata Finite State Automaton a.k.a. Finite Automaton, Finite State Machine, FSA or FSM –An abstract machine which can be used to implement regular expressions (etc.). –Has a finite number of states, and a finite amount of memory (i.e., the current state). –Can be represented by directed graphs or transition tables

Finite-state Automata (1/23) Representation –An FSA may be represented as a directed graph; each node (or vertex) represents a state, and the edges (or arcs) connecting the nodes represent transitions. –Each state is labeled. –Each transition is labeled with a symbol from the alphabet over which the regular language represented by the FSA is defined, or with , the empty string. –Among the FSA’s states, there is a start state and at least one final state (or accepting state).

Finite-state Automata (2/23) q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca transition final state start state state Representation (continued) –An FSA may also be represented with a state-transition table. The table for the above FSA: Input State abc 01  1  2  2  3 34  4 

Finite-state Automata (3/23) Given an input string, an FSA will either accept or reject the input. –If the FSA is in a final (or accepting) state after all input symbols have been consumed, then the string is accepted (or recognized). –Otherwise (including the case in which an input symbol cannot be consumed), the string is rejected.

Finite-state Automata (3/23) q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca Input State abc 01  1  2  2  3 34  4  abca ccba abcac IS 1 : IS 2 : IS 3 :

Finite-state Automata (4/23) q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca Input State abc 01  1  2  2  3 34  4  abca ccba abcac IS 1 : IS 2 : IS 3 :

Finite-state Automata (5/23) q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca Input State abc 01  1  2  2  3 34  4  abca ccba abcac IS 1 : IS 2 : IS 3 :

Finite-state Automata (6/23) q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca Input State abc 01  1  2  2  3 34  4  abca ccba abcac IS 1 : IS 2 : IS 3 :

Finite-state Automata (7/23) q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca Input State abc 01  1  2  2  3 34  4  abca ccba abcac IS 1 : IS 2 : IS 3 :

Finite-state Automata (8/23) q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca Input State abc 01  1  2  2  3 34  4  abca ccba abcac IS 1 : IS 2 : IS 3 :

Finite-state Automata (9/23) q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca Input State abc 01  1  2  2  3 34  4  abca ccba abcac IS 1 : IS 2 : IS 3 :

Finite-state Automata (10/23) q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca Input State abc 01  1  2  2  3 34  4  abca ccba abcac IS 1 : IS 2 : IS 3 :

Finite-state Automata (11/23) q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca Input State abc 01  1  2  2  3 34  4  abca ccba abcac IS 1 : IS 2 : IS 3 :

Finite-state Automata (12/23) q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca Input State abc 01  1  2  2  3 34  4  abca ccba abcac IS 1 : IS 2 : IS 3 :

Finite-state Automata (13/23) q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca Input State abc 01  1  2  2  3 34  4  abca ccba abcac IS 1 : IS 2 : IS 3 :

Finite-state Automata (14/23) q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca Input State abc 01  1  2  2  3 34  4  abca ccba abcac IS 1 : IS 2 : IS 3 :

Finite-state Automata (22/23) An FSA defines a regular language over an alphabet  : –  is a regular language: – Any symbol from  is a regular language:  = { a, b, c} – Two concatenated regular languages is a regular language:  = { a, b, c} q0q0 b q0q0 q1q1 q0q0 b q1q1 q0q0 c q1q1 q1q1 c q2q2 q0q0 b

Finite-state Automata (23/23) regular language (continued): –The union (or disjunction) of two regular languages is a regular language:  = { a, b, c} –The Kleene closure (denoted by the Kleene star: *) of a regular language is a regular language:  = { a, b, c} q0q0 b q1q1 q0q0 c q1q1 q2q2 c q3q3 q0q0 b q1q1   q0q0 b q1q1 

Finite-state Automata (15/23) Determinism –An FSA may be either deterministic (DFSA or DFA) or non-deterministic (NFSA or NFA). An FSA is deterministic if its behavior during recognition is fully determined by the state it is in and the symbol to be consumed. –I.e., given an input string, only one path may be taken through the FSA. Conversely, an FSA is non-deterministic if, given an input string, more than one path may be taken through the FSA. –One type of non-determinism is  -transitions, i.e. transitions which consume the empty string (no symbols).

Finite-state Automata (16/23) An example NFA: q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca   c State Input abc  01  1  2  2 2  3,4  34  4  –The above NFA is equivalent to the regular expression /ab*ca?/.

Finite-state Automata (17/23) String recognition with an NFA: –Backup (or backtracking): remember choice points and revisit choices upon failure –Look-ahead: choose path based on foreknowlege about the input string and available paths –Parallelism: examine all choices simultaneously

Finite-state Automata (18/23) Recognition as search –Recognition can be viewed as selection of the correct path from all possible paths through an NFA (this set of paths is called the state-space) –Search strategy can affect efficiency: in what order should the paths be searched? Depth-first (LIFO [last in, first out]; stack) Breadth-first (FIFO [first in, first out]; queue) Depth-first uses memory more efficiently, but may enter into an infinite loop under some circumstances

Finite-state Automata (19/23) Conversion of NFAs to DFAs –Every NFA can be expressed as a DFA. /ab*ca?/ q0q0 q1q1 q2q2 q3q3 q4q4  = { a, b, c } abca   c State Input abc  01  1  2  2 2  3,4  34  4F  New StateState Input abc 0'01  1'1  2{3,4} 2'2  2{3,4} 3'F{3,4}F4  4'F4F  5  q0'q0' q1'q1' q2'q2' q3'q3' q4'q4' q5q5 a,b,c ac b a b,c b a c a Subset construction

Finite-state Automata (20/23) DFA minimization –Every regular language has a unique minimum-state DFA. –The basic idea: two states s and t are equivalent if for every string w, the transitions T(s, w) and T(t, w) are both either final or non-final. –An algorithm: Begin by enumerating all possible pairs of both final or both non- final states, then iteratively removing those pairs the transition pair for which (for any symbol) are either not equal or are not on the list. The list is complete when an iteration does not remove any pairs from the list. The minimum set of states is the partition resulting from the unions of the remaining members of the list, along with any original states not on the list.

Finite-state Automata (21/23) The minimum-state DFA for the DFA converted from the NFA for /ab*ca?/, without the “failure” state (labeled “5”), and with the states relabeled to the set Q = { q 0", q 1", q 2", q 3" }: q 0" q 1" q 2" q 3" ac b a

Finite Automata with Output Finite Automata may also have an output alphabet and an action at every state that may output an item from the alphabet Useful for lexical analyzers –As the FSA recognizes a token, it outputs the characters –When the FSA reaches a final state and the token is complete, the lexical analyzer can use Token value – output so far Token type – label of the output state

RegExps –The extended use of regular expressions is in many modern languages: Perl, php, Java, python, … –Can use regexps to specify the rules for any set of possible strings you want to match Sentences, addresses, ads, dialogs, etc –``Does this string match the pattern?'', or ``Is there a match for the pattern anywhere in this string?'' –Can also define operations to do something with the matched string, such as extract the text or substitute for it –Regular expression patterns are compiled into a executable code within the language

Regular Expressions Regexp syntax is a superset of the notation required to express a regular language. –Some examples and shortcuts: 1. /[abc]/ = /a|b|c/ Character class; disjunction 2. /[b-e]/ = /b|c|d|e/ Range in a character class 3. /[\012\015]/ = /\n|\r/ Octal characters; special escapes 4. /./ = /[\x00-\xFF]/ Wildcard; hexadecimal characters 5. /[^b-e]/ = /[\x00-af-\xFF]/ Complement of character class 6. /a*/ /[af]*/ /(abc)*/ Kleene star: zero or more 7. /a?/ = /a|/ /(ab|ca)?/ Zero or one 8. /a+/ /([a-zA-Z]1|ca)+/ Kleene plus: one or more 9. /a{8}/ /b{1,2}/ /c{3,}/ Counters: exact repeat quantification

Regular Expressions Anchors –Constrain the position(s) at which a pattern may match –Think of them as “extra” alphabet symbols, though they actually consume  (the zero-length string): – / ^ a/ Pattern must match at beginning of string – /a $ / Pattern must match at end of string – /\bword23\b/ “Word” boundary: /[a-zA-Z0-9_][^a-zA-Z0-9_]/ or /[^a-zA-Z0-9_][a-zA-Z0-9_]/ – /\B23\B/ “Word” non-boundary

Regular Expressions Escapes –A backslash “ \ ” placed before a character is said to “escape” (or “quote”) the character. There are six classes of escapes: 1. Numeric character representation: the octal or hexadecimal position in a character set: “ \012 ” = “ \xA ” 2. Meta-characters: The characters which are syntactically meaningful to regular expressions, and therefore must be escaped in order to represent themselves in the alphabet of the regular expression: “ [](){}|^$.?+*\ ” (note the inclusion of the backslash). 3. “Special” escapes (from the “C” language): newline: “ \n ” = “ \xA ” carriage return: “ \r ” = “ \xD ” tab: “ \t ” = “ \x9 ” formfeed: “ \f ” = “ \xC ”

Regular Expressions Escapes (continued) –Classes of escapes (continued): 4. Aliases: shortcuts for commonly used character classes. (Note that the capitalized version of these aliases refer to the complement of the alias’s character class): –whitespace: “ \s ” = “ [ \t\r\n\f\v] ” –digit: “ \d ” = “ [0-9] ” –word: “ \w ” = “ [a-zA-Z0-9_] ” –non-whitespace: “ \S ” = “ [^ \t\r\n\f] ” –non-digit: “ \D ” = “ [^0-9] ” –non-word: “ \W ” = “ [^a-zA-Z0-9_] ” 5. Memory/registers/back-references: “ \1 ”, “ \2 ”, etc. 6. Self-escapes: any character other than those which have special meaning can be escaped, but the escaping has no effect: the character still represents the regular language of the character itself.

Regular Expressions Memory/Registers/Back-references –Many regular expression languages include a memory/register/back-reference feature, in which sub- matches may be referred to later in the regular expression, and/or when performing replacement, in the replacement string: Perl: /(\w+)\s+\1\b/ matches a repeated word Python: re.sub(”(the\s+)the(\s+|\b)”,”\1”,string) removes the second of a pair of ‘the’s –Note: finite automata cannot be used to implement the memory feature.

Regular Expression Examples Character classes and Kleene symbols [A-Z] = one capital letter [0-9] = one numerical digit = s, ! or 9 [A-Z] matches G or W or E does not match GW or FA or h or fun [A-Z]+ = one or more consecutive capital letters matches GW or FA or CRASH [A-Z]? = zero or one capital letter [A-Z]* = zero, one or more consecutive capital letters matches on eat or EAT or I so, [A-Z]ate matches Gate, Late, Pate, Fate, but not GATE or gate and [A-Z]+ate matches: Gate, GRate, HEate, but not Grate or grate or STATE and [A-Z]*ate matches: Gate, GRate, and ate, but not STATE, grate or Plate

Regular Expression Examples (cont’d) [A-Za-z] = any single letter so [A-Za-z]+ matches on any word composed of only letters, but will not match on “words”: bi-weekly, or IBM325 they will match on bi, weekly, yes, SU and IBM a shortcut for [A-Za-z] is \w, which in Perl also includes _ so (\w)+ will match on Information, ZANY, rattskellar and jeuvbaew \s will match whitespace so (\w)+(\s)(\w+) will match real estate or Gen Xers

Regular Expression Examples (cont’d) Some longer examples: ([A-Z][a-z]+)\s([a-z0-9]+) matches: Intel c09yt745 but not IBM series5000 [A-Z]\w+\s\w+\s\w+[!] matches: The dog died! It also matches that portion of “ he said, “ The dog died! “ [A-Z]\w+\s\w+\s\w+[!]$ matches: The dog died! But does not match “he said, “ The dog died! “ because the $ indicates end of Line, and there is a quotation mark before the end of the line (\w+ats?\s)+ parentheses define a pattern as a unit, so the above expression will match: Fat cats eat Bats that Splat

Regular Expression Examples (cont’d) To match on part of speech tagged data: (\w+[-]?\w+\|[A-Z]+) will match on: bi-weekly|RB camera|NN announced|VBD (\w+\|V[A-Z]+) will match on: ruined|VBD singing|VBG Plant|VB says|VBZ (\w+\|VB[DN]) will match on: coddled|VBN Rained|VBD But not changing|VBG

Regular Expression Examples (cont’d) Phrase matching: a\|DT ([a-z]+\|JJ[SR]?) (\w+\|N[NPS]+) matches: a|DT loud|JJ noise|NN a|DT better|JJR Cheerios|NNPS (\w+\|DT) (\w+\|VB[DNG])* (\w+\|N[NPS]+)+ matches: the|DT singing|VBG elephant|NN seals|NNS an|DT apple|NN an|DT IBM|NP computer|NN the|DT outdated|VBD aging|VBG Commodore|NNNP computer|NN hardware|NN

Conclusion Both regular expressions and finite-state automata represent regular languages. The basic regular expression operations are: concatenation, union/disjunction, and Kleene closure. The regular expression language is a powerful pattern- matching tool. Any regular expression can be automatically compiled into an NFA, to a DFA, and to a unique minimum-state DFA. An FSA can use any set of symbols for its alphabet, including letters and words.