4a Lexical analysis — CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.

Concepts
–Overview of syntax and semantics
–Step one: lexical analysis
  –Lexical scanning
  –Regular expressions
  –DFAs and FSAs
  –Lex

This is an overview of the standard process of turning a text file into an executable program.

Lexical analysis in perspective
LEXICAL ANALYZER: transforms the character stream of the source program into a token stream. Also called scanner, lexer, or linear analysis. The parser requests tokens one at a time ("get next token").
–Scans input
–Removes whitespace, newlines, …
–Identifies tokens
–Creates symbol table and inserts tokens into it
–Generates errors
–Sends tokens to parser
PARSER
–Performs syntax analysis
–Actions dictated by token order
–Updates symbol table entries
–Creates abstract representation of source
–Generates errors

Where we are
Total=price+tax;
The lexical analyzer turns this character stream into the token stream id = id + id ; the parser then builds an assignment expression (Expr) whose left side is the id Total and whose right side is the sum of the ids price and tax.

Basic lexical analysis terms
Token
–A classification for a common set of strings
–Examples: identifier, integer, open paren, etc.
Pattern
–The rules which characterize the set of strings for a token
–Typically defined via regular expressions
Lexeme
–A character sequence that matches the pattern for a token
–Identifiers: x, count, name, foo32, etc.
–Integers: -12, 101, 0, …
–Open paren: (
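To make the token/pattern/lexeme distinction concrete, here is a minimal Python sketch: each token is a classification, each pattern is a regular expression, and a lexeme is a string matched against those patterns. The token classes and patterns below are illustrative, not a fixed standard.

```python
import re

# Token classes paired with the patterns that characterize their lexemes.
TOKEN_PATTERNS = [
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),
    ("INTEGER",    r"-?[0-9]+"),
    ("LPAREN",     r"\("),
]

def classify(lexeme):
    """Return the first token class whose pattern matches the whole lexeme."""
    for name, pattern in TOKEN_PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return name
    return "UNKNOWN"

print(classify("foo32"))   # IDENTIFIER
print(classify("-12"))     # INTEGER
print(classify("("))       # LPAREN
```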

Examples: token, lexeme, pattern
if (price + gst – rebate <= 10.00) gift := false

Token       Lexeme   Informal description of pattern
if          if       if
lparen      (        (
identifier  price    string of letters and digits starting with a letter
operator    +        +
identifier  gst      string of letters and digits starting with a letter
operator    -        -
identifier  rebate   string of letters and digits starting with a letter
operator    <=       less than or equal to
constant    10.00    any numeric constant
rparen      )        )
identifier  gift     string of letters and digits starting with a letter
operator    :=       assignment symbol
identifier  false    string of letters and digits starting with a letter

Regular expressions (REs)
Scanners are based on regular expressions that define simple patterns. REs are simpler and less expressive than BNF.
Examples of regular expressions:
letter: a|b|c|...|z|A|B|C|...|Z
digit: 0|1|2|3|4|5|6|7|8|9
identifier: letter (letter | digit)*
Basic operations are (1) set union, (2) concatenation, and (3) Kleene closure. Plus: parentheses and naming patterns. No recursion!

Regular expressions (REs)
Example:
letter: a|b|c|...|z|A|B|C|...|Z
digit: 0|1|2|3|4|5|6|7|8|9
identifier: letter (letter | digit)*
Reading identifier's definition, letter (letter | digit)*:
–concatenation: one pattern followed by another
–set union (|): one pattern or another
–Kleene closure (*): zero or more repetitions of a pattern
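This identifier pattern translates almost verbatim into Python's re syntax; a small sketch (letter, digit, and identifier are just local names):

```python
import re

# The identifier pattern from the slide: one letter, then zero or more
# letters or digits (Kleene closure).
letter = "[a-zA-Z]"
digit = "[0-9]"
identifier = re.compile(f"{letter}({letter}|{digit})*")

for s in ["x", "count", "foo32", "32foo", ""]:
    print(s, bool(identifier.fullmatch(s)))
```

Running this shows x, count, and foo32 accepted, while 32foo (starts with a digit) and the empty string are rejected.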

Regular expressions are extremely useful in many applications. Mastering them will serve you well.

Another view… "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems.” -- Jamie Zawinski (1997) alt.religion.emacs

Formal language operations (examples use L = {a, b}, M = {0, 1})

Operation               Notation  Definition                              Example
union of L and M        L ∪ M     {s | s is in L or s is in M}            {a, b, 0, 1}
concatenation of L, M   LM        {st | s is in L and t is in M}          {a0, a1, b0, b1}
Kleene closure of L     L*        zero or more concatenations of L        all strings of a's and b's, plus ε: {ε, a, b, aa, bb, ab, ba, aaa, …}
positive closure of L   L+        one or more concatenations of L         all strings of a's and b's: {a, b, aa, bb, ab, ba, aaa, …}

Regular expression
Let Σ be an alphabet and r a regular expression; then L(r) is the language characterized by the rules of r.
Definition of regular expression (an inductive definition!):
–ε is a regular expression that denotes the language {ε}
–If a is in Σ, a is a regular expression that denotes {a}
–Let r and s be regular expressions with languages L(r) and L(s):
  »(r) | (s) is a regular expression → L(r) ∪ L(s)
  »(r)(s) is a regular expression → L(r) L(s)
  »(r)* is a regular expression → (L(r))*
A regular language is a language that can be defined by a regular expression.

RE example revisited
Examples of regular expressions:
letter: a|b|c|...|z|A|B|C|...|Z
digit: 0|1|2|3|4|5|6|7|8|9
identifier: letter (letter | digit)*
Q: Why is identifier a regular expression?
–Because it uses only the operations of union, concatenation, and Kleene closure.
Being able to name patterns is just syntactic sugar. Using parentheses to group things is also just syntactic sugar, provided we specify the precedence and associativity of the operators (i.e., |, * and concatenation).

+: Another common operator
The + operator is commonly used to mean "one or more repetitions" of a pattern. For example, letter+ means one or more letters. We can always do without it: letter+ is equivalent to letter letter*. So the + operator is just syntactic sugar.
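This equivalence can be spot-checked by brute force; the sketch below uses the small alphabet {a, b} in place of letter and compares the sugared and desugared patterns on every short string:

```python
import re
from itertools import product

# r+ is equivalent to r r*: check both patterns agree on all short strings.
plus = re.compile("[ab]+")
desugared = re.compile("[ab][ab]*")

for n in range(4):
    for tup in product("abc", repeat=n):   # include 'c' so rejection is tested too
        s = "".join(tup)
        assert bool(plus.fullmatch(s)) == bool(desugared.fullmatch(s))
print("equivalent on all strings over {a,b,c} up to length 3")
```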

Precedence of operators In interpreting a regular expression Parens scope sub-expressions * and + have the highest precedence Concatenation comes next | is lowest. All the operators are left associative Example –(A) | ((B)* (C)) is equivalent to A | B * C –What strings does this generate or match? Either an A or any number of Bs followed by a C

Epsilon: more syntactic sugar
Sometimes we'd like a token that represents nothing. This makes regular-expression matching more complex, but can be useful. We use the lower-case Greek letter epsilon (ε) for this special token.
Example:
digit: 0|1|2|3|4|5|6|7|8|9
sign: +|-|ε
int: sign digit+
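Python's re notation has no explicit ε token; the usual encoding of sign = (+|-|ε) is an optional group, as in this sketch of the int pattern:

```python
import re

# int: sign digit+, where sign is +, - or ε.
# '?' (zero or one) plays the role of the ε alternative.
sign = "[+-]?"
digit = "[0-9]"
int_pattern = re.compile(sign + digit + "+")

for s in ["42", "+7", "-105", "--3", ""]:
    print(s, bool(int_pattern.fullmatch(s)))
```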

Properties of regular expressions
We can easily establish some basic properties of the operators involved in building regular expressions:

Property            Description
r|s = s|r           | is commutative
r|(s|t) = (r|s)|t   | is associative
(rs)t = r(st)       concatenation is associative
r(s|t) = rs | rt    concatenation distributes over |
(s|t)r = sr | tr    concatenation distributes over |

RE: Still more syntactic sugar
Zero or one instance
–L? = L|ε
–Examples:
  »optional_fraction → . digits | ε
  »optional_fraction → (. digits)?
Character classes
–[abc] = a|b|c
–[a-z] = a|b|c|...|z
Systems having RE support (e.g., Java, Python, Lex, Emacs) vary in the features supported and often in the notation, but tend to be very similar.
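The optional_fraction example, sketched in Python's re notation, where ? marks the zero-or-one group and [0-9] is a character class:

```python
import re

# optional_fraction → (. digits)?  — a number with an optional fractional part.
digits = "[0-9]+"
optional_fraction = rf"(\.{digits})?"
num = re.compile(digits + optional_fraction)

for s in ["3", "3.14", "3.", ".5"]:
    print(s, bool(num.fullmatch(s)))
```

Note that 3. and .5 are rejected: the pattern requires digits on both sides of the dot when a fraction is present.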

Regular grammars / expressions
Regular grammars and regular expressions are equivalent:
–Every regular expression can be expressed by a regular grammar
–Every regular grammar can be expressed by a regular expression
Example: an identifier must begin with a letter and can be followed by any number of letters and digits.

Regular expression:
ID: LETTER (LETTER | DIGIT)*

Regular grammar:
ID → LETTER ID_REST
ID_REST → LETTER ID_REST | DIGIT ID_REST | EMPTY

Formal definition of tokens
A set of tokens is a set of strings over an alphabet: {read, write, +, -, *, /, :=, 1, 2, …, 10, …, 3.45e-3, …}
A set of tokens is a regular set that can be defined by a regular expression.
For every regular set there is a finite automaton (FA), also known as a deterministic finite state machine (FSM), that can recognize it, i.e., determine whether a string belongs to the set or not.
Scanners extract tokens from source code in the same way DFAs determine membership.

FSM = FA
Finite state machine and finite automaton are different names for the same concept. The concept is important and useful in almost every aspect of computer science. It provides an abstract way to define a process that:
–Has a finite set of states it can be in, with a special start state and a set of accepting states
–Gets a sequence of inputs
–Each input causes the process to go from its current state to a new state (which might be the same!)
–If, after the input ends, the process is in one of the accepting states, the input is accepted by the FA

Example
An FA that determines whether a binary number has an odd or even number of 0's, where S1 is an accepting state.
–An incoming arrow identifies the start state
–A double circle identifies accepting state(s)
–A transition label is the input that triggers it
–State names (e.g., S1, S2) are for convenience
–For this FA, inputs are expected to be 0 or 1
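A minimal simulation of this FA in Python. State names follow the slide: S1 (the accepting start state) means an even number of 0's seen so far, S2 means an odd number; reading a 0 flips the state and reading a 1 leaves it alone:

```python
# FA from the slide: accept binary strings with an even number of 0's.
def accepts_even_zeros(bits):
    state = "S1"                       # start state, also accepting
    trans = {("S1", "0"): "S2", ("S1", "1"): "S1",
             ("S2", "0"): "S1", ("S2", "1"): "S2"}
    for ch in bits:
        state = trans[(state, ch)]
    return state == "S1"               # accepted iff we end in S1

print(accepts_even_zeros("1001"))   # True: two 0's
print(accepts_even_zeros("10"))     # False: one 0
```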

Deterministic finite automaton (DFA) A DFA has only one choice for a given input in every state No states with two arcs matching same input Is this a DFA?

Deterministic finite automaton (DFA) If an input symbol matches no arc for current state, input is not accepted This FA accepts only binary numbers that are multiples of three Is this a DFA?
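A sketch of this multiples-of-three recognizer: instead of naming three states, it tracks the remainder mod 3 of the bits read so far, which is exactly what the three DFA states encode. Appending a bit b to a number n gives 2n + b, so reading b takes remainder r to (2r + b) mod 3:

```python
# Recognize binary numbers that are multiples of three.
def multiple_of_three(bits):
    if not bits:
        return False          # treat the empty string as not a number
    r = 0                     # r is the DFA state: value-so-far mod 3
    for ch in bits:
        r = (2 * r + int(ch)) % 3
    return r == 0             # accepting state: remainder zero

print(multiple_of_three("110"))   # True: 6 is a multiple of 3
print(multiple_of_three("101"))   # False: 5 is not
```

Whether the empty string should be accepted depends on how the start state is drawn; here it is rejected, an assumption rather than something fixed by the slide.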

REs can be represented as DFAs
Regular expression for a simple identifier:
letter: a|b|c|...|z|A|B|C|...|Z
digit: 0|1|2|3|4|5|6|7|8|9
identifier: letter (letter | digit)*
The corresponding DFA has a start state with a transition on letter to an accepting state, which loops to itself on either letter or digit. Marking a state with a * is another way to identify an accepting state. This DFA recognizes identifiers.

RE < CFG
Every language that can be described by a RE can be described by a CFG. Some languages can be described by a CFG but not by a RE — for example, the set of palindromes made up of a's and b's:
S -> a S a | b S b | a | aa | b | bb

Token definition
Numeric literals in Pascal, e.g. 1, 123, 10e-3, 3.14e4
Definition of token unsignedNum:
DIG → 0|1|2|3|4|5|6|7|8|9
unsignedInt → DIG DIG*
unsignedNum → unsignedInt ((. unsignedInt) | ε) ((e (+ | - | ε) unsignedInt) | ε)
Note:
–Recursion is restricted to the leftmost or rightmost position of a rule's right-hand side
–Parentheses are used to avoid ambiguity
–FAs with ε-transitions are NFAs; NFAs are harder to implement directly (they require backtracking)
–Every NFA can be rewritten as an equivalent DFA (which may get larger, though)
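The unsignedNum definition transcribes directly into Python's re notation, where each (… | ε) alternative becomes an optional ? group:

```python
import re

# unsignedNum → unsignedInt ((. unsignedInt) | ε) ((e (+|-|ε) unsignedInt) | ε)
DIG = "[0-9]"
unsignedInt = f"{DIG}+"
unsignedNum = re.compile(
    f"{unsignedInt}"             # integer part
    rf"(\.{unsignedInt})?"       # (. unsignedInt) | ε
    f"(e[+-]?{unsignedInt})?"    # (e (+|-|ε) unsignedInt) | ε
)

for s in ["1", "123", "10e-3", "3.14e4", "e4", "1."]:
    print(s, bool(unsignedNum.fullmatch(s)))
```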

Simple problem
Read characters consisting of a's and b's, one at a time. If the input contains a double aa, print accepted; else rejected.
An abstract solution can be expressed as a DFA: state 1 is the start state; on an a it moves to state 2 and on a b it stays in state 1. State 2 moves to state 3 on a second a and back to state 1 on a b. State 3 is the accepting state and absorbs all further a's and b's.
The DFA state transitions can be encoded as a table that specifies the new state for a given current state and input:

current state    a   b
1                2   1
2                3   1
3 (accepting)    3   3

State transition table
The state transition table, initial state, and set of accepting states fully represent the DFA:

import sys

state = 1
ok = [3]
trans = {1: {'a': 2, 'b': 1},
         2: {'a': 3, 'b': 1},
         3: {'a': 3, 'b': 3}}

for char in sys.argv[1]:
    state = trans[state][char]

print('accepted' if state in ok else 'rejected')

Scanner generators
E.g. lex, flex. They take a table of patterns and actions as input and return a scanner program that extracts tokens from a character stream. A useful programming utility, especially when coupled with a parser generator (e.g., yacc). Standard on Unix.

Lex
Lexical analyzer generator
–It writes a lexical analyzer
Assumes each token matches a regular expression
Needs
–a set of regular expressions
–for each expression, an action
Produces a highly optimized C program
Automatically handles many tricky problems
flex is the GNU version of the venerable Unix tool lex

Lex example
The pipeline: flex turns the specification foo.l into foolex.c; cc compiles that into the scanner foolex, which turns an input stream into tokens.

> flex -ofoolex.c foo.l
> cc -ofoolex foolex.c -lfl
> more input
begin if size>10 then size * end
> foolex < input
Keyword: begin
Keyword: if
Identifier: size
Operator: >
Integer: 10 (10)
Keyword: then
Identifier: size
Operator: *
Operator: -
Float: (3.1415)
Keyword: end

Examples
The examples that follow can be accessed on gl. See /afs/umbc.edu/users/f/i/finin/pub/lex

% ls -l /afs/umbc.edu/users/f/i/finin/pub/lex
total 8
drwxr-xr-x 2 finin faculty 2048 Sep 27 13:31 aa
drwxr-xr-x 2 finin faculty 2048 Sep 27 13:32 defs
drwxr-xr-x 2 finin faculty 2048 Sep 27 11:35 footranscanner
drwxr-xr-x 2 finin faculty 2048 Sep 27 11:34 simplescanner

A Lex program
A Lex program has three sections, separated by %%:
… definitions …
%%
… rules …
%%
… subroutines …

DIG   [0-9]
ID    [a-z][a-z0-9]*
%%
{DIG}+           printf("Integer\n");
{DIG}+"."{DIG}*  printf("Float\n");
{ID}             printf("Identifier\n");
[ \t\n]+         /* skip whitespace */
.                printf("Huh?\n");
%%
main(){yylex();}

Simplest example
%%
.|\n   ECHO;
%%
main() { yylex(); }

No definitions, one rule, a minimal wrapper: it simply echoes the input.

Strings containing aa
%%
(a|b)*aa(a|b)*  {printf("Accept %s\n", yytext);}
[ab]+           {printf("Reject %s\n", yytext);}
.|\n            ECHO;
%%
main() {yylex();}

Rules
Each rule has a pattern and an action
–Patterns are regular expressions
–Only one action is performed: the action corresponding to the matched pattern
–If several patterns match, the one corresponding to the longest sequence is chosen
–Among the rules whose patterns match the same number of characters, the first rule is preferred
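These matching rules (longest match wins, rule order breaks ties) can be sketched in a few lines of Python. The rule set below is illustrative, not lex's own:

```python
import re

# Lex-style matching discipline: at each position, try every rule, keep the
# longest match, and prefer the earliest rule among equal-length matches.
RULES = [
    ("KEYWORD",    r"if|then"),
    ("IDENTIFIER", r"[a-z][a-z0-9]*"),
    ("NUMBER",     r"[0-9]+"),
    ("SKIP",       r"[ \t\n]+"),
]

def scan(text):
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:             # earlier rules win ties ...
            m = re.match(pattern, text[pos:])
            if m and pos + m.end() > best_end:  # ... a strictly longer match wins outright
                best_name, best_end = name, pos + m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "SKIP":
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

# 'if' is a KEYWORD (tie with IDENTIFIER, first rule wins),
# but 'ifx' is an IDENTIFIER (longest match beats the keyword prefix).
print(scan("if ifx then 42"))
```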

Definitions
The definitions block allows you to name a RE. If the name appears in curly braces in a rule, the RE will be substituted:

DIG [0-9]
%%
{DIG}+           printf("int: %s\n", yytext);
{DIG}+"."{DIG}*  printf("float: %s\n", yytext);
.                /* skip anything else */
%%
main(){yylex();}

/* scanner for a toy Pascal-like language */
%{
#include <stdlib.h>   /* needed for call to atof() */
%}
DIG [0-9]
ID  [a-z][a-z0-9]*
%%
{DIG}+             printf("Integer: %s (%d)\n", yytext, atoi(yytext));
{DIG}+"."{DIG}*    printf("Float: %s (%g)\n", yytext, atof(yytext));
if|then|begin|end  printf("Keyword: %s\n", yytext);
{ID}               printf("Identifier: %s\n", yytext);
"+"|"-"|"*"|"/"    printf("Operator: %s\n", yytext);
"{"[^}\n]*"}"      /* skip one-line comments */
[ \t\n]+           /* skip whitespace */
.                  printf("Unrecognized: %s\n", yytext);
%%
main(){yylex();}

Flex RE syntax
x           the character 'x'
.           any character except newline
[xyz]       character class; here, matches an 'x', a 'y', or a 'z'
[abj-oZ]    character class with a range: 'a', 'b', any letter from 'j' through 'o', or 'Z'
[^A-Z]      negated character class: any character but those in the class, e.g. any character except an uppercase letter
[^A-Z\n]    any character except an uppercase letter or a newline
r*          zero or more r's, where r is any regular expression
r+          one or more r's
r?          zero or one r (i.e., an optional r)
{name}      expansion of the "name" definition
"[xy]\"foo" the literal string [xy]"foo (note the escaped ")
\x          if x is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', the ANSI-C interpretation of \x; otherwise a literal 'x' (e.g., escape)
rs          RE r followed by RE s (concatenation)
r|s         either an r or an s
<<EOF>>     end-of-file