Published by Benjamin Harrington. Modified over 8 years ago.
1
Regular Expressions Finite State Machines Lexical Analysis
2
Processing English
Consider the following two sentences:
    Hi, I am 22 years old. I come from Alabama.
    22 come Alabama I, old from am. Hi years I.
Are they both correct? How do you know?
Both use the same words, numbers and punctuation.
What did you do first?
1. Find words, numbers and punctuation
2. Then, check order (grammar rules)
3
Finding Words and Numbers
How did you find words, numbers and punctuation?
You have a definition of what each is, or looks like.
For example, what is a number? A word?
Although you are a bit more agile, the process was:
1. Start with the first character
2. If letter, assume word; if digit, assume number
3. Scan left to right, 1 character at a time, until a punctuation mark (space, comma, etc.)
4. Recognize the word or number
5. If no more characters, done; otherwise return to 1
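The five steps above can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `scan` and the kind labels WORD, NUMBER and PUNCT are assumptions, and instead of scanning until a punctuation mark it keeps reading while the characters are of the same kind as the first one.

```python
def scan(text):
    """Split text into (kind, value) pairs, reading left to right."""
    items = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1                       # spaces only separate items
            continue
        if ch.isalpha():
            kind, same_kind = "WORD", str.isalpha      # letter: assume word
        elif ch.isdigit():
            kind, same_kind = "NUMBER", str.isdigit    # digit: assume number
        else:
            items.append(("PUNCT", ch))  # a single punctuation mark
            i += 1
            continue
        start = i
        while i < len(text) and same_kind(text[i]):
            i += 1                       # one character at a time
        items.append((kind, text[start:i]))
    return items

print(scan("Hi, I am 22 years old."))
```

Running `scan` on both example sentences yields the same items in a different order, which is why the second step (checking grammar) is needed to tell them apart.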
4
Processing Code
How do you process the following? What are the main parts into which to break the input?

    void quote() {
      print(
        "To iterate is human, to recurse divine." +
        " - L. Peter Deutsch"
      );
    }

    Schemes:
      childOf(X,Y)
      marriedTo(X,Y)
    Facts:
      marriedTo('Zed','Bea').
      marriedTo('Jack','Jill').
      childOf('Jill','Zed').
      childOf('Sue','Jack').
    Rules:
      childOf(X,Y) :- childOf(X,Z), marriedTo(Y,Z).
      marriedTo(X,Y) :- marriedTo(Y,X).
    Queries:
      marriedTo('Bea','Zed')?
      childOf('Jill','Bea')?

    def addABC(x):
        s = "ABC"
        return x + s
    addABC(input("String: "))
5
Example

    def addABC ( x ) :
        s = "ABC"
        return x + s
    addABC ( input ( "String: " ) )
6
What are the Parts?
They are called TOKENS.
The process is similar to English processing.
Lexical Analysis
    Input: a program in some language
    Output: a list of tokens (type, value, location)
7
Example Revisited

Sample Input:

    def addABC(x):
        s = "ABC"
        return x + s
    addABC(input("String: "))

Sample Output:

    (FUNDEF,"def",1)
    (ID,"addABC",1)
    (LEFT_PAREN,"(",1)
    (ID,"x",1)
    (RIGHT_PAREN,")",1)
    (COLON,":",1)
    (ID,"s",2)
    (ASSIGN,"=",2)
    (STRING,"'ABC'",2)
    (FUNRET,"return",3)
    (ID,"x",3)
    (OPERATOR,"+",3)
    (ID,"s",3)
    (ID,"addABC",4)
    (LEFT_PAREN,"(",4)
    …
8
Program Compilation
Lexical analysis is the first step of the process.
[Diagram: Program -> Lexical Analyzer -> Tokens (keywords, string literals, variables, …) -> Parser (syntax analysis) -> Internal Data -> Code Generator -> Code, or Interpreter (executed directly). The compiler stages also produce error messages.]
9
Token Specification
Regular Expressions: pattern description for strings
    Concatenation: abc -> "abc"
    Boolean OR: ab|ac -> "ab", "ac"
    Kleene closure: ab* -> "a", "ab", "abbb", etc.
    Optional: ab?c -> "ac", "abc"
    One or more: ab+ -> "ab", "abbb"
    Group using ():
        (a|b)c -> "ac", "bc"
        (a|b)*c -> "c", "ac", "bc", "bac", "abaaabbbabbaaaaac", etc.
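The operators above can be checked against the slide's own examples with Python's standard `re` module. This is only a sketch: the helper name `accepts` is an assumption, and `re.fullmatch` is used so that a pattern must describe the whole string rather than a substring.

```python
import re

# Pattern -> strings from the slide that the pattern should describe.
examples = {
    r"abc": ["abc"],
    r"ab|ac": ["ab", "ac"],
    r"ab*": ["a", "ab", "abbb"],
    r"ab?c": ["ac", "abc"],
    r"ab+": ["ab", "abbb"],
    r"(a|b)c": ["ac", "bc"],
    r"(a|b)*c": ["c", "ac", "bc", "bac", "abaaabbbabbaaaaac"],
}

def accepts(pattern, s):
    """True when the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

for pattern, strings in examples.items():
    for s in strings:
        assert accepts(pattern, s), (pattern, s)

# A few strings the patterns reject.
print(accepts(r"ab+", "a"))       # False: '+' needs at least one b
print(accepts(r"ab?c", "abbc"))   # False: '?' allows at most one b
```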
10
RegEx Extensions
    Exactly n: a³b+ -> "aaab", "aaabb", …
    [A-Z] = A|B|…|Z
    [ABC] = A|B|C
    [~aA] = any character but "a" or "A"
    \ = escape character (e.g., \* -> "*")
    Whitespace characters: \s, \t, \n, \v
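For comparison, here is how the same extensions are spelled in Python's `re` module. The notation differs slightly from the slide: Python writes "exactly n" as `{n}` and a negated character class as `[^aA]`. The helper `accepts` is an illustrative name, not something from the slides.

```python
import re

def accepts(pattern, s):
    """True when the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(accepts(r"a{3}b+", "aaabb"))   # True: exactly three a's, then one or more b's
print(accepts(r"a{3}b+", "aab"))     # False: only two a's
print(accepts(r"[A-Z]", "Q"))        # True: any one uppercase letter
print(accepts(r"[ABC]", "D"))        # False: D is not in the set
print(accepts(r"[^aA]", "b"))        # True: anything but a or A
print(accepts(r"[^aA]", "A"))        # False
print(accepts(r"\*", "*"))           # True: the backslash escapes the star
print(accepts(r"\s+", " \t\n\v"))    # True: space, tab, newline, vertical tab
```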
11
Practice Exercises

PE1
You are the security officer at your company. A new system of usernames and passwords is introduced to protect your systems. You have been tasked to specify what constitutes valid usernames and passwords. You choose to use regular expressions for that purpose. You decide that valid usernames are any sequence of letters or digits, but they must start with a letter. You also decide that valid passwords are any sequence of letters, digits or special characters (=, !, ?, $, &), but they must contain at least one digit and at least one special character. You may denote by L the set of letters and by D the set of digits. Special characters stand for themselves. Write one regular expression for usernames and one for passwords.
------------------
PE2
In C++, a line comment consists of two '/' followed by any string of characters (letters, numbers and spaces). An example would be: '// This is a 123 comment for 007'. Using D to denote any digit from 0 to 9, L to denote any letter in the English alphabet, and S to denote a space, write a regular expression for line comments in C++ (let '/' denote itself).
12
Token Recognition
Deterministic Finite State Machine
A DFSM is a 5-tuple (Σ, S, s₀, δ, F)
    Σ: finite, non-empty set of symbols (input alphabet)
    S: finite, non-empty set of states
    s₀: member of S designated as start state
    δ: state-transition function, δ: S × Σ → S
    F: subset of S (final states, may be empty)
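A DFSM of this form can be simulated in a few lines of Python. The sketch below is illustrative only: the state names `q0` and `q1` and the function `run_dfsm` are assumptions, and δ is stored as a partial dictionary, with a missing entry treated as an implicit dead state, whereas the definition above makes δ total.

```python
def run_dfsm(delta, start, finals, text):
    """Return True when the machine ends in a final state after reading text."""
    state = start
    for symbol in text:
        if (state, symbol) not in delta:
            return False          # missing transition: treated as a dead state
        state = delta[(state, symbol)]
    return state in finals

# Machine for the regular expression ab*: one 'a', then any number of 'b'.
delta = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",
}

print(run_dfsm(delta, "q0", {"q1"}, "abbb"))  # True
print(run_dfsm(delta, "q0", {"q1"}, "ba"))    # False
```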
13
FSM & RegEx
State diagrams for the expressions: abc, a(b|c), ab*, (a(b?c))+
Note the special double-circle designation of a final/accepting state.
[State diagrams not reproduced.]
14
Practice Exercises

PE1
Recall your job as the security officer at your company. Now that valid usernames and passwords have been specified, you are tasked to ensure that only valid usernames and passwords are used. You choose to use finite state machines for that purpose. Design one finite state machine to accept only valid usernames and one to accept only valid passwords.
------------------
PE2
Assume that you have characters A and B. Design a finite state machine that recognizes/accepts any string that contains at least three A's.
------------------
PE3
Assume that your language is composed of binary strings. Design a finite state machine with two accept states, one that is chosen if the input string is even, the other if it is odd.
15
CS 236 Coolness Factor!
Design our own language
    Subset of Datalog (LP-like)
Build an interpreter for our language
    Lexical Analyzer (Project 1)
    Parser (Project 2)
    Interpreter (Projects 3 and 4)
    Optimization (Project 5)
16
Designing a Language
Define the tokens
    Elements of the language, punctuation, etc.
    For example, what are they in C++?
Recognize the tokens (lexical analysis)
Define the grammar
    Forms of correct sentences
    For example, what are they in C++?
Recognize the grammar (parsing)
Interpret and execute the program
C++ is a bit too complicated for us…
17
Varied World Views

View 1:
    fct boolean succeeds(person x) { if studies(x) return T else return F }
    fct personlist siblings(person x) { return x's siblings }
    fct int square(int x) { return x * x }

View 2:
    fct boolean succeeds(person x) { if studies(x) return T else return F }
    fct boolean sibling(person x, person y) { if y is x's sibling return T else return F }
    fct boolean square(int x, int y) { if y == x * x return T else return F }

Look-up table or oracle
No concerns with efficiency
18
Logic Programming (I)
Assume: all functions are Boolean
Compute using facts and rules
    Facts are the known true values of the functions
    Rules express relations among functions
Example: studies(x), succeeds(x)
    Facts: studies(Matt), studies(Jenny)
    Rule: succeeds(x) :- studies(x)
Closed-world assumption
19
Logic Programming (II)
Computing is like issuing queries
    First, check if the query can be answered with facts
    Second, check if rules can be applied
Examples
    studies(Alex)? NO (neither facts nor rules establish it)
    studies(Matt)? YES (there is a fact about that)
    succeeds(Jenny)? YES (no fact, but there is a rule that if Jenny studies then she succeeds, and a fact that Jenny studies)
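The two-step query process can be sketched as follows. This is a deliberately tiny model, not the course interpreter: it assumes one-argument predicates and rules with a single body predicate, and the names `facts`, `rules` and `query` are illustrative.

```python
# Facts: the known true values of each Boolean function.
facts = {("studies", "Matt"), ("studies", "Jenny")}

# Rules: head predicate -> body predicate, i.e. succeeds(x) :- studies(x).
rules = {"succeeds": "studies"}

def query(predicate, person):
    """Answer YES or NO, checking facts first and then rules."""
    if (predicate, person) in facts:
        return "YES"
    body = rules.get(predicate)
    if body is not None and query(body, person) == "YES":
        return "YES"
    return "NO"   # closed-world assumption: what cannot be established is false

print(query("studies", "Alex"))    # NO
print(query("studies", "Matt"))    # YES
print(query("succeeds", "Jenny"))  # YES
```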
20
Functions of Several Arguments
Examples
    loves(x,y), parent(x,y), grade(s,c,l)
    loves(x,y) :- married(x,y)
Computing
    parent(Christophe, Samuel)? Yes, if there is a fact that matches.
    parent(Christophe, X)? Yes, if there is a value of X that would cause it to match a fact; return the value of X.
    loves(X, Y)? Yes, if there are values of X and Y that would make this true, either by matching a fact or via rules (e.g., married(Christophe, Isabelle)); return the values of X and Y.
21
When We Are Done

Sample Program:

    Schemes:
      snap(S,N,A,P)
      csg(C,S,G)
      cn(C,N)
      ncg(N,C,G)
    Facts:
      snap('12345','C. Brown','12 Apple St.','555-1234').
      snap('22222','P. Patty','56 Grape Blvd','555-9999').
      snap('33333','Snoopy','12 Apple St.','555-1234').
      csg('CS101','12345','A').
      csg('CS101','22222','B').
      csg('CS101','33333','C').
      csg('EE200','12345','B+').
      csg('EE200','22222','B').
    Rules:
      cn(C,N) :- snap(S,N,A,P),csg(C,S,G).
      ncg(N,C,G) :- snap(S,N,A,P),csg(C,S,G).
    Queries:
      cn('CS101',Name)?
      ncg('Snoopy',Course,Grade)?

Sample Execution:

    cn('CS101',Name)? Yes(3)
      Name='C. Brown'
      Name='P. Patty'
      Name='Snoopy'
    ncg('Snoopy',Course,Grade)? Yes(1)
      Course='CS101', Grade='C'

Slide annotations: From Rules, From Facts
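To see where the sample answers come from, the two rules can be read as a join of the snap and csg facts on the shared variable S. The Python below is a rough sketch under that reading, not the interpreter built in the projects; the function names `ncg` and `cn` simply mirror the rule heads.

```python
# Facts from the sample program, stored as tuples.
snap = [
    ("12345", "C. Brown", "12 Apple St.", "555-1234"),
    ("22222", "P. Patty", "56 Grape Blvd", "555-9999"),
    ("33333", "Snoopy", "12 Apple St.", "555-1234"),
]
csg = [
    ("CS101", "12345", "A"),
    ("CS101", "22222", "B"),
    ("CS101", "33333", "C"),
    ("EE200", "12345", "B+"),
    ("EE200", "22222", "B"),
]

def ncg():
    """ncg(N,C,G) :- snap(S,N,A,P),csg(C,S,G): join the facts on S."""
    return {(n, c, g)
            for (s, n, _a, _p) in snap
            for (c, s2, g) in csg
            if s == s2}

def cn():
    """cn(C,N) :- snap(S,N,A,P),csg(C,S,G): same join, keeping C and N."""
    return {(c, n) for (n, c, _g) in ncg()}

# cn('CS101',Name)?
names = sorted(n for (c, n) in cn() if c == "CS101")
# ncg('Snoopy',Course,Grade)?
snoopy = sorted((c, g) for (n, c, g) in ncg() if n == "Snoopy")

print(names)   # ['C. Brown', 'P. Patty', 'Snoopy']
print(snoopy)  # [('CS101', 'C')]
```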
22
Project 1: Lexical Analyzer
Define and find the tokens

Sample Input:

    Queries:
    IsInRoomAtDH('Snoopy',R,'M',H)
    #SchemesFactsRules
    .

Sample Output:

    (QUERIES,"Queries",1)
    (COLON,":",1)
    (ID,"IsInRoomAtDH",2)
    (LEFT_PAREN,"(",2)
    (STRING,"'Snoopy'",2)
    (COMMA,",",2)
    (ID,"R",2)
    (COMMA,",",2)
    (STRING,"'M'",2)
    (COMMA,",",2)
    (ID,"H",2)
    (RIGHT_PAREN,")",2)
    (COMMENT,"#SchemesFactsRules",3)
    (PERIOD,".",4)
    Total Tokens = 14
23
Finite State Transducer
Extended FSM:
    Γ: finite, non-empty set of symbols (output alphabet)
    δ: state-transition function, δ: S × Σ → S × Γ
An FST consumes input symbols and emits output symbols.
A lexical analyzer consumes raw characters and emits tokens.
Note: the textbook uses 2 functions.
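As a small illustration of δ: S × Σ → S × Γ, the sketch below runs a two-state transducer over a binary string. The example machine (it emits E or O for the running parity of the 1s read so far) and the names `run_fst`, `even` and `odd` are assumptions chosen for brevity; they are not taken from the slides or the textbook.

```python
def run_fst(delta, start, text):
    """Feed text through the transducer; return (final state, emitted output)."""
    state, out = start, []
    for symbol in text:
        state, emitted = delta[(state, symbol)]   # one step of S x Sigma -> S x Gamma
        out.append(emitted)
    return state, "".join(out)

# Emits 'E' while the number of 1s seen so far is even, 'O' while it is odd.
delta = {
    ("even", "0"): ("even", "E"),
    ("even", "1"): ("odd", "O"),
    ("odd", "0"): ("odd", "O"),
    ("odd", "1"): ("even", "E"),
}

print(run_fst(delta, "even", "1101"))  # ('odd', 'OEEO')
```

A lexical analyzer has the same shape, with characters as the input alphabet and tokens as the output alphabet.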
24
Practice Exercises

PE1
Assume we define two types of comments.
A line comment starts with # and ends at the next newline (EOL) or EOF. For example:
    # This is a line comment.
A block comment starts with a #| and ends with a |#. If EOF is found before the end of the block comment, it is UNDEFINED. For example:
    #| This is a fancy
    multiline comment |#
    #| This is an illegal block comment because it ends with EOF.
Design an FST to handle comments. It should emit "LC" when a line comment is encountered, "BC" when a block comment is encountered, and "UNDEF" when something unexpected happens.
25
First Glance at Project 1

TOKEN TYPE    DESCRIPTION
COMMA         The ',' character
Q_MARK        The '?' character
LEFT_PAREN    The '(' character
COLON         The ':' character
C_DASH        The string ":-"
SCHEMES       The string "Schemes"
ID            A letter followed by 0 or more letters or digits that is not a keyword (Schemes, Facts, Rules, Queries)
STRING        Any sequence of characters enclosed in single quotes. Two single quotes denote an apostrophe within the string. For line-number counts, count all \n's within a string. A string token's line number is the line where the string starts. If EOF is found before the end of the string, it is undefined.

This is only a partial list of tokens.
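The ID and STRING descriptions translate fairly directly into regular expressions. The patterns and helper names below (`ID_PATTERN`, `STRING_PATTERN`, `is_id`, `is_string`) are one possible reading of the table, written with Python's `re` module, and are not the project's required implementation.

```python
import re

KEYWORDS = {"Schemes", "Facts", "Rules", "Queries"}

# A letter followed by 0 or more letters or digits.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# Single quotes around any characters; '' inside stands for an apostrophe.
STRING_PATTERN = re.compile(r"'(?:[^']|'')*'")

def is_id(text):
    """True when text is a well-formed identifier and not a keyword."""
    return ID_PATTERN.fullmatch(text) is not None and text not in KEYWORDS

def is_string(text):
    """True when text is a complete, properly closed string token."""
    return STRING_PATTERN.fullmatch(text) is not None

print(is_id("IsInRoomAtDH"))    # True
print(is_id("Schemes"))         # False: keyword
print(is_string("'it''s'"))     # True: '' is an apostrophe
print(is_string("'unclosed"))   # False: no closing quote before the end
print("'two\nlines'".count("\n"))  # 1 newline to add to the line count
```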
26
Partial FST for Project 1 Note: STRING is incomplete (e.g., \n, ‘’’’, EOF not handled) (No Line Numbers)
27
Scanner Algorithm
Given FSMs D_1, …, D_n

    while input is not empty do
        s_i := the longest prefix that some D_i accepts;
        k := |s_i|;
        if k > 0 then
            j := min {i : |s_i| = k and D_i reads s_i};
            remove s_j from input;
            perform the j-th action
        else
            move one character from input to output (bad)
    end
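The loop above (longest match first, lowest index on ties) can be sketched in Python. Two assumptions to flag: compiled regular expressions stand in for the machines D_1, …, D_n, which works here because each pattern's match at a position is also its longest accepted prefix, and the token set, the WHITESPACE entry and the UNDEFINED label are illustrative choices rather than the project specification.

```python
import re

# Ordered list of (token type, pattern); earlier entries win ties.
MACHINES = [
    ("C_DASH", re.compile(r":-")),
    ("COLON", re.compile(r":")),
    ("QUERIES", re.compile(r"Queries")),
    ("ID", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("WHITESPACE", re.compile(r"\s+")),
]

def scan(text):
    """Return (type, value) tokens using longest match, lowest index on ties."""
    tokens = []
    pos = 0
    while pos < len(text):                      # while input is not empty
        best_len, best_type = 0, None
        for token_type, pattern in MACHINES:
            m = pattern.match(text, pos)
            length = m.end() - pos if m else 0
            if length > best_len:               # strict '>' keeps the lowest index on ties
                best_len, best_type = length, token_type
        if best_len == 0:
            tokens.append(("UNDEFINED", text[pos]))   # the "bad" branch
            pos += 1
        else:
            if best_type != "WHITESPACE":       # action: emit, or skip whitespace
                tokens.append((best_type, text[pos:pos + best_len]))
            pos += best_len
    return tokens

print(scan("Queries: Queriesx :- $"))
```

In this run "Queries" becomes a QUERIES token because the keyword machine is listed before ID and both match seven characters, while "Queriesx" becomes an ID because the ID machine matches a longer prefix.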