CS 426 Compiler Construction
2. Lexical scanning
Outline
1. Introduction: What is a lexical scanner and why build it?
2. How to build a lexical scanner
2.1 Manually
2.2 Automatically
2.2.1 Regular expressions
2.2.2 From REs to NFAs to DFAs
1. Introduction What is a lexical scanner and why build it?
Lexical scanning (1) Although grammars can go all the way down to individual characters, parsers are typically built assuming terminals like <identifier>, <integer constant>, <character constant>, and <floating point constant>. When a keyword, such as if or for, appears in the grammar, the terminal (i.e., the element of VT) is the whole keyword, rather than each of its letters. Also, spaces and tabs are typically ignored.
Lexical scanning (2) Because of this, during parsing the compiler makes use of a subroutine called the lexical analyzer, lexical scanner, or simply lexer. It reads the character string forming the program and, on each call, returns: one terminal of the grammar, and, when needed, the lexeme (or a pointer to the symbol table entry containing the lexeme). For example, 123.5E10 is a lexeme that is interpreted as an instance of the terminal <floating point constant>. For the keyword if, the lexeme does not have to be returned: it is the same as the name of the terminal.
Lexical scanning (3) Consider the program
#include <stdio.h>
pp1()
{
  float a[100][100];
  int i, j;
  for (i=1; i<=50; i++)
    for (j=1; j<=50; j++)
      a[i][j] = 0;
}
It is stored as the following sequence of characters:
# i n c l u d e   < s t d i o . h > \n p p 1 ( ) \n { \n f l o a t   a [ 1 0 0 ] [ 1 0 0 ] ; \n i n t   i , j ; \n f o r   ( i = 1 ; i < = 5 0 ; i + + ) \n f o r   ( j = 1 ; j < = 5 0 ; j + + ) \n a [ i ] [ j ] = 0 ; \n } \n
We want the lexical scanner to produce something like the following sequence of <tokentype, lexeme> pairs:
<sharp,>, <include,>, <less_than,>, <id,"stdio">, <dot,>, <id,"h">, <greater_than,>, <id,"pp1">, <leftparen,>, <rightparen,>, <leftcurbracket,>, <float,>, <id,"a">, <leftsquarebracket,>, <number,100>, …
Lexical scanning (4) Notice that not all pairs have a lexeme (the actual characters that form the token), because for some tokens the lexeme is constant. Tokentype is usually an integer constant.
Lexical analyzer in context
[Diagram: the parser calls getNextToken; the lexical analyzer reads the source program and returns a token; both components consult the symbol table]
Why lexical scanning as a separate module?
Not only to simplify parser construction, but also to accelerate parsing.
2. How to build a lexical scanner
How to build a lexical scanner
By hand. Not conceptually difficult, but laborious.
Using a scanner generator (flex, lex). This works most of the time, but some languages have complications. In the absence of complications, regular expressions can be used to describe token types and then to define scanners.
Complications (1) In FORTRAN spaces are ignored.
For example, the following is a valid program:
S U B R O U T I N E FOO(X,Y)
R E A L A12(100),B12(100)
D O 10 I = 1, 100
...
Early programs were written by hand and transcribed by card-punch "typists". It seems that introducing unwanted spaces was frequent; that is why FORTRAN was designed this way.
Complications (2) The problem with ignoring spaces is that the token type depends on the context:
DO I = 1, 9   ! Here "DO" is a keyword and "I" an index variable
DO I = 9      ! Here "DOI" is a variable.
This also introduces impediments to extending FORTRAN. Thus, DOALL cannot be a valid keyword to represent a different type of DO loop: it would be impossible to distinguish between "DOALL" I=1,100 and "DO" ALLI=1,100.
Complications (3) In PL/1 there were no reserved words, so the following is a valid PL/1 statement: IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN
2.1 A manually written scanner
A manually written lexer
An example of a manually written scanner is presented next. Afterwards, we will formalize the description of tokens and show how to automate the creation of lexers by taking advantage of the formalization. For the example, we assume a language with identifiers and integer constants. Keywords are reserved, so an identifier cannot have the name of a keyword. All other tokens are single characters (i.e., no tokens of the form "<=").
Token getNextToken() {
  /* skip spaces */
  for (;; peek = getchar()) {
    if (isBlankOrTab(peek)) /* do nothing */ ;
    else if (peek == newLine) line++;
    else break;
  }
  /* get constants */
  if (isDigit(peek)) {
    v = 0;
    while (isDigit(peek)) { v = v*10 + value(peek); peek = getchar(); }
    return token(<num,v>);
  }
  /* get ids and keywords */
  if (isLetter(peek)) {
    s = "";
    while (isLetter(peek) || isDigit(peek)) { s = s || peek; peek = getchar(); }
    w = words.get(s);            /* look up the keyword/identifier table */
    if (w != null) return w;     /* it is a keyword */
    words.store(key=s, value=<id,s>);
    return token(<id,s>);
  }
  /* peek itself is a single-character token */
  t = token(peek); peek = getchar();
  return t;
}
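As a cross-check, here is a minimal, runnable sketch of the same hand-written scanner in Python. The names Token, KEYWORDS, and tokenize are our own illustration, not part of the slide's pseudocode; the structure mirrors it: skip whitespace, accumulate integer constants and identifiers, consult a keyword table, and fall back to single-character tokens.

```python
# Hand-written scanner sketch (assumed names: Token, KEYWORDS, tokenize).
from collections import namedtuple

Token = namedtuple("Token", ["type", "lexeme"])
KEYWORDS = {"if", "for", "while"}  # reserved words for this toy language

def tokenize(text):
    tokens, i, n = [], 0, len(text)
    while i < n:
        c = text[i]
        if c in " \t\n":                 # skip whitespace
            i += 1
        elif c.isdigit():                # integer constant
            v = 0
            while i < n and text[i].isdigit():
                v = v * 10 + int(text[i])
                i += 1
            tokens.append(Token("num", v))
        elif c.isalpha():                # identifier or keyword
            s = ""
            while i < n and text[i].isalnum():
                s += text[i]
                i += 1
            tokens.append(Token(s if s in KEYWORDS else "id", s))
        else:                            # single-character token
            tokens.append(Token(c, c))
            i += 1
    return tokens
```

For example, tokenize("if x1 = 42") yields the keyword if, the identifier x1, the one-character token =, and the number 42.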
2.2 Automatic construction
2.2.1 Formal specification
The foundations There are three ways to describe the elements of the language that form the tokens:
Type 3 or regular grammars (whose productions are of the form A → aB or A → a, where A, B ∈ VN and a ∈ VT)
Regular expressions
Deterministic (DFA) and nondeterministic (NFA) finite automata
These three formalisms are equivalent in power, and it is possible to convert automatically from any form to any other.
Regular expressions (1)
We start by showing how to define new languages using two operations on existing languages:
Union: L ∪ D = { s | s ∈ L or s ∈ D }
Catenation: LD = { st | s ∈ L and t ∈ D }
Catenation can be repeated zero or more times:
L^i = { s1 s2 … si | sk ∈ L for 1 ≤ k ≤ i }
L^0 = {ε}
L* = ∪_{i=0}^{∞} L^i
L+ = ∪_{i=1}^{∞} L^i
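These operations are easy to experiment with on finite sets of strings. The sketch below is our own illustration (the function names are assumptions, not from the slides); it computes L ∪ D, LD, and a truncated closure of L restricted to at most k catenated factors.

```python
# Language operations on finite sets of strings (illustrative sketch).
def union(L, D):
    return L | D

def cat(L, D):
    return {s + t for s in L for t in D}

def star(L, k):
    """Approximate L*: all words made of at most k catenated factors of L."""
    result, power = {""}, {""}      # L^0 = {epsilon}
    for _ in range(k):
        power = cat(power, L)       # L^(i+1) = L^i . L
        result |= power
    return result

L, D = {"a", "b"}, {"c"}
print(union(L, D))   # {'a', 'b', 'c'}
print(cat(L, D))     # {'ac', 'bc'}
print(star(L, 2))    # {'', 'a', 'b', 'aa', 'ab', 'ba', 'bb'}
```

Note that ε (the empty string) is always in L*, since L^0 = {ε}.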
Regular expressions (2)
Regular expressions (RE) are defined over an alphabet (denoted Σ). We say that L(r) is the language represented by a regular expression r. Then, given regular expressions r and s:
ε is a regular expression with L(ε) = {ε}
∀ a ∈ Σ, a is a regular expression with L(a) = {a}
r|s represents L(r) ∪ L(s)
rs represents L(r)L(s)
r* represents L(r)*
The order of precedence is *, then catenation, then |. Parentheses can be used as in arithmetic expressions.
Example: Keyword
Keyword: 'else' or 'if' or 'begin' or …
'else' | 'if' | 'begin' | …
Note: 'else' abbreviates the catenation 'e''l''s''e'
CS 143 Lecture 3
Example: Integers
Integer: a non-empty string of digits
digit = '0' | '1' | … | '9'
integer = digit digit*
Example: Identifier
Identifier: a string of letters or digits, starting with a letter
letter = 'A' | … | 'Z' | 'a' | … | 'z'
identifier = letter (letter | digit)*
Is (letter* | digit*) the same?
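The question can be settled mechanically. This sketch (ours, not from the slides) checks both candidate patterns with Python's re module:

```python
# Compare two candidate definitions of "identifier" using Python's re module.
import re

identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*\Z")   # letter (letter|digit)*
candidate  = re.compile(r"([A-Za-z]*|[0-9]*)\Z")     # (letter* | digit*)

for s in ["x1", "letter", "123", ""]:
    print(s, bool(identifier.match(s)), bool(candidate.match(s)))
# "x1" matches the first but not the second (it mixes letters and digits),
# while "123" and "" match the second but not the first -- so the two
# regular expressions denote different languages.
```

The second pattern accepts the empty string and all-digit strings, but rejects any string that mixes letters and digits, so it is not a correct identifier definition.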
Example: Whitespace
Whitespace: a non-empty sequence of blanks, newlines, and tabs
whitespace = (' ' | '\n' | '\t')+
Example: Phone Numbers
Regular expressions are all around you! Consider a phone number with area code (650): such numbers match '(' digit digit digit ')' digit digit digit '-' digit digit digit digit.
Example: Email Addresses
Consider an email address: a user name, an '@' sign, and a dot-separated domain, e.g. name = letter+ and address = name '@' name ('.' name)+.
Example: Unsigned Pascal Numbers
digit = '0' | '1' | … | '9'
digits = digit digit*
opt_fraction = ('.' digits) | ε
opt_exponent = ('E' ('+' | '-' | ε) digits) | ε
num = digits opt_fraction opt_exponent
2.2.2 Automatic implementation of Lexical Analysis From REs to NFAs to DFAs
Key properties
For every RE, there is a DFA that recognizes the language defined by that RE.
For every NFA M1 there is a DFA M2 for which L(M2) = L(M1).
Thompson's construction: systematically generates an NFA for a given regular expression.
Subset construction: converts an NFA to an equivalent DFA. Key idea: identify sets of NFA states that behave alike, and make each set a single state of the DFA.
Scanner generator
[Diagram: regular expression RE → (Thompson's construction) → NFA → (subset construction) → DFA → (minimization algorithm) → minimum-state DFA. The resulting DFA transition table drives the lexical scanner, which turns the source program into tokens]
Simulate DFA
Linear scan of the input file
Two buffer pointers: start of current token, and current input symbol
Remember the symbol position of the last accepting state
Execute the code for the matched pattern (multiple patterns may match the same token)
[Diagram: input buffer → DFA simulator ← transition table]
Regular Expressions → Lexical Spec. (1)
Write a regular expression for each type of token:
Number = digit+
Keyword = 'if' | 'else' | …
Identifier = letter (letter | digit)*
OpenPar = '('
…
Regular Expressions → Lexical Spec. (2)
Construct R matching all types of tokens R = Keyword | Identifier | Number | … = R1 | R2 | …
Regular Expressions → Lexical Spec. (3)
Let the input be x1…xn
For 1 ≤ i ≤ n, check x1…xi ∈ L(R)
If success, then we know that x1…xi ∈ L(Rj) for some j
Remove x1…xi from the input and go to (3)
Ambiguities (1) There are ambiguities in the algorithm.
How much input is used? What if x1…xi ∈ L(R) and also x1…xk ∈ L(R)?
Rule: pick the longest possible string in L(R) ("maximal munch")
Ambiguities (2)
Which token is used? What if x1…xi ∈ L(Rj) and also x1…xi ∈ L(Rk)?
Rule: use the rule listed first (j if j < k)
This treats "if" as a keyword, not an identifier
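Both disambiguation rules are easy to demonstrate. In this sketch (the rule names and patterns are our own illustration), rules are tried in listed order; among all matches at the current position the longest one wins, and ties are broken in favor of the earlier rule, so "if" is a keyword while "ifs" is an identifier.

```python
# Longest-match + first-listed-rule disambiguation (illustrative sketch).
import re

RULES = [                 # listed order = priority order
    ("keyword", r"if|else"),
    ("id",      r"[A-Za-z][A-Za-z0-9]*"),
    ("num",     r"[0-9]+"),
    ("op",      r"<=|<|="),
    ("ws",      r"[ \t\n]+"),
]

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # Longest match wins; max() keeps the first (earlier-listed) rule on ties.
        name, m = max(
            ((name, re.compile(p).match(text, i)) for name, p in RULES),
            key=lambda nm: nm[1].end() if nm[1] else -1,
        )
        if m is None:
            raise SyntaxError(f"no rule matches at position {i}")
        if name != "ws":                  # discard whitespace tokens
            tokens.append((name, m.group()))
        i = m.end()
    return tokens

print(tokenize("if ifs <= 42"))
# [('keyword', 'if'), ('id', 'ifs'), ('op', '<='), ('num', '42')]
```

Note how "ifs" is longer than the keyword match "if", so the longest-match rule makes it an identifier, while the exact string "if" ties between the keyword and identifier rules and the first-listed rule (keyword) wins.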
Error Handling
What if no rule matches a prefix of the input?
Problem: the scanner can't just get stuck …
Solution: write a rule matching all "bad" strings, and put it last (lowest priority)
Summary
Regular expressions provide a concise notation for string patterns.
Their use in lexical analysis requires small extensions: to resolve ambiguities and to handle errors.
Good algorithms are known: they require only a single pass over the input and few operations per character (table lookup).
Finite Automata
Regular expressions = specification; finite automata = implementation.
A finite automaton consists of:
An input alphabet Σ
A set of states S
A start state n
A set of accepting states F ⊆ S
A set of transitions: state →input state
Finite Automata: transitions
The transition s1 →a s2 is read: in state s1, on input "a", go to state s2.
If the input is exhausted and the automaton is in an accepting state => accept.
Otherwise => reject.
Finite Automata State Graphs
[Notation: an arrow into a circle marks the start state; a double circle marks an accepting state; an edge labeled a is a transition on input a]
A Simple Example
A finite automaton that accepts only the string "1" [diagram omitted: the start state has a single transition on 1 to an accepting state]
Another Simple Example
A finite automaton accepting any number of 1's followed by a single 0. Alphabet: {0,1}. [Diagram omitted: the start state has a self-loop on 1 and a transition on 0 to an accepting state]
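A direct simulation of this automaton makes the language concrete (the state names are our own, not from the slide's diagram):

```python
# DFA for "any number of 1's followed by a single 0" over {0,1}.
def accepts(s):
    state = "start"                   # start state: self-loop on 1
    for c in s:
        if state == "start" and c == "1":
            state = "start"           # stay: still reading leading 1's
        elif state == "start" and c == "0":
            state = "done"            # the single 0: accepting state
        else:
            return False              # no transition defined -> reject
    return state == "done"

print([s for s in ["0", "10", "1110", "101", "11"] if accepts(s)])
# ['0', '10', '1110']
```

Any symbol after the 0, or an input with no 0 at all, falls off the defined transitions and is rejected.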
And Another Example
Alphabet {0,1}. What language does this recognize? [Diagram omitted]
Epsilon Moves
Another kind of transition: ε-moves. A →ε B: the machine can move from state A to state B without reading input.
Deterministic and Nondeterministic Automata
Deterministic Finite Automata (DFA): one transition per input per state; no ε-moves.
Nondeterministic Finite Automata (NFA): can have multiple transitions for one input in a given state; can have ε-moves.
Execution of Finite Automata
A DFA can take only one path through the state graph, completely determined by the input.
NFAs can choose: whether to make ε-moves, and which of multiple transitions to take for a single input.
Acceptance of NFAs
An NFA can be in multiple states at once while reading the input.
Rule: an NFA accepts if it can get to a final state.
NFA vs. DFA (1) NFAs and DFAs recognize the same set of languages (regular languages) DFAs are faster to execute There are no choices to consider
NFA vs. DFA (2)
For a given language, an NFA can be simpler than a DFA; conversely, the DFA can be exponentially larger than the NFA. [Diagrams omitted]
Regular Expressions to Finite Automata
High-level sketch: Lexical Specification → Regular expressions → NFA → DFA → Table-driven implementation of DFA
Regular Expressions to NFA (1)
For each kind of rexp, define an NFA. Notation: a box labeled M stands for the NFA for rexp M.
For ε: a start state with an ε-transition to an accepting state.
For input a: a start state with a transition on a to an accepting state.
Regular Expressions to NFA (2)
For AB: connect the accepting state of A's NFA by an ε-transition to the start state of B's NFA.
For A | B: a new start state with ε-transitions to the start states of A and B, and ε-transitions from their accepting states to a new accepting state.
Regular Expressions to NFA (3)
For A*: a new start state with ε-transitions to A's start state and to a new accepting state; A's accepting state has ε-transitions back to A's start state and to the new accepting state.
Example of RegExp → NFA conversion
Consider the regular expression (1 | 0)*1. Thompson's construction yields an NFA with states A through J. [Diagram omitted]
NFA to DFA: The Trick
Simulate the NFA. Each state of the DFA = a non-empty subset of states of the NFA.
Start state = the set of NFA states reachable through ε-moves from the NFA start state.
Add a transition S →a S' to the DFA iff S' is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well.
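The subset construction fits in a few lines of code. The sketch below is our own encoding of a small NFA for (1|0)*1 (three states, not the slides' A–J diagram); it computes ε-closures and builds the DFA transition table exactly as described above.

```python
# Subset construction: NFA -> DFA (illustrative sketch).
# NFA for (1|0)*1: (state, symbol) -> set of next states; "eps" = epsilon-move.
NFA = {
    ("s0", "eps"): {"s1"},
    ("s1", "0"): {"s1"},
    ("s1", "1"): {"s1", "s2"},
}
START, ACCEPT = "s0", {"s2"}
ALPHABET = ["0", "1"]

def eps_closure(states):
    """All NFA states reachable via epsilon-moves alone."""
    stack, closure = list(states), set(states)
    while stack:
        for t in NFA.get((stack.pop(), "eps"), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction():
    start = eps_closure({START})
    dfa, work = {}, [start]
    while work:
        S = work.pop()
        for a in ALPHABET:
            # States reachable from any q in S on input a, then epsilon-close.
            T = eps_closure(set().union(*[NFA.get((q, a), set()) for q in S]))
            dfa[(S, a)] = T
            if T and (T, "0") not in dfa and T not in work:
                work.append(T)   # newly discovered DFA state
    return start, dfa

start, dfa = subset_construction()

def accepts(s):
    S = start
    for c in s:
        S = dfa.get((S, c), frozenset())   # missing entry = dead state
    return bool(S & ACCEPT)
```

The DFA states are frozensets of NFA states; a string is accepted when the final subset contains an NFA accepting state.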
NFA to DFA: Remark
An NFA may be in many states at any time. How many different subsets are there? If the NFA has N states, there are 2^N − 1 non-empty subsets, i.e., finitely many.
NFA -> DFA Example
Applying the subset construction to the NFA for (1|0)*1 (states A–J) yields a DFA whose states are subsets such as FGHIABCD, ABCDHI, and EJGHIABCD. [Diagram omitted]
Implementation
A DFA can be implemented by a 2D table T: one dimension is the states, the other is the input symbols. For every transition Si →a Sk, define T[i,a] = k.
DFA "execution": if in state Si on input a, read T[i,a] = k and go to state Sk. Very efficient.
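A table-driven simulator for the earlier "1's followed by a single 0" automaton illustrates this scheme (the state numbering is our own):

```python
# Table-driven DFA execution: T[state][symbol] -> next state.
# States: 0 = start (self-loop on '1'), 1 = accepting; missing entry = dead.
T = {
    0: {"1": 0, "0": 1},   # on 1 stay in 0; on 0 go to the accepting state
    1: {},                  # no transitions out of the accepting state
}
ACCEPTING = {1}

def run(s):
    state = 0
    for c in s:
        state = T.get(state, {}).get(c)   # table lookup per input symbol
        if state is None:                 # undefined entry: reject
            return False
    return state in ACCEPTING

print(run("1110"), run("101"))   # True False
```

Each input character costs one table lookup, which is what makes DFA-based scanners fast.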
Table Implementation of a DFA
[Transition table omitted: rows are the states S, T, U; columns are the input symbols 0 and 1; each entry gives the next state]
Implementation (Cont.)
NFA -> DFA conversion is at the heart of tools such as flex But, DFAs can be huge In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations