Lecture 2 Lexical Analysis

CSCE 531 Compiler Construction
Lecture 2 Lexical Analysis
Topics
  Sample Simple Compiler
  Operations on strings
  Regular expressions
  Finite Automata
Readings:
January 18, 2018

Overview
Last Time
  A little History
  Compilers vs Interpreter
  Data-Flow View of Compilers
  Regular Languages
  Course Pragmatics
Today’s Lecture
  Why Study Compilers?
  xx
References
  Chapter 2, Chapter 3
Assignment
  Due Wednesday Jan 18: 3.3a; 3.5a,b; 3.6a,b,c; 3.7a; 3.8b

A Simple Compiler for Expressions
Chapter Two Overview
Structure of the simple compiler, really just a translator for infix expressions → postfix
  Grammars
  Parse Trees
  Syntax directed Translation
  Predictive Parsing
Translator for Simple Expressions
  Grammar
  Rewritten grammar (equivalent one better for pred. parsing)
  Parsing modules fig 2.24
  Specification of Translator fig 2.35
  Structure of translator fig 2.36

Grammars
A grammar (more correctly, a context-free grammar) has
  A set of tokens, also known as terminals
  A set of nonterminals
  A set of productions of the form nonterminal → sequence of tokens and/or nonterminals
  A special nonterminal, the start symbol
Example
  E → E + E
  E → E * E
  E → digit

Derivations
A derivation is a sequence of rewritings of a string of grammar symbols using the productions in a grammar.
We use the symbol ⇒ to denote that one string of grammar symbols is obtained by rewriting another using a production.
X ⇒ Y if there is a production N → β where
  The nonterminal N occurs in the sequence X of grammar symbols, and
  Y is the same as X except that β replaces one occurrence of N
Example
  E ⇒ E+E ⇒ d+E ⇒ d+E*E ⇒ d+E+E*E ⇒ d+d+E*E ⇒ d+d+d*E ⇒ d+d+d*d

Language generated by a grammar
⇒* is formally the reflexive, transitive closure of the relation ⇒
s1 ⇒* s2 informally means that s2 can be derived from s1 in zero or more steps
The language generated by the grammar G is denoted L(G) and is defined by
  L(G) = {w ∈ T* | S ⇒* w}
where T* is the set of all finite strings of terminals and S is the start symbol.
Example: S → aS | b
  L(G) = {b, ab, aab, …}
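As a rough illustration of L(G), the sketch below enumerates the short strings derivable from a start symbol by repeatedly rewriting the leftmost nonterminal. The function name `generate`, the dictionary encoding of productions, and the length cutoff are all assumptions made for this example; it also assumes single-character symbols and a grammar in which sentential forms cannot grow without adding terminals.

```python
from collections import deque

def generate(productions, start, max_len):
    """Enumerate terminal strings of length <= max_len derivable from `start`.

    `productions` maps a single-character nonterminal to a list of
    right-hand sides (plain strings; "" stands for the empty string).
    Assumption: every symbol is one character, and sentential forms
    cannot keep growing without gaining terminals.
    """
    seen = {start}
    results = []
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, if any remains.
        idx = next((i for i, c in enumerate(form) if c in productions), None)
        if idx is None:
            results.append(form)  # only terminals left: a string of L(G)
            continue
        for rhs in productions[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            terminals = sum(1 for c in new if c not in productions)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(results, key=lambda s: (len(s), s))

# The example grammar S -> aS | b, cut off at length 3.
print(generate({"S": ["aS", "b"]}, "S", 3))
```

Raising `max_len` lists more members of the infinite set {b, ab, aab, …}.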

Parse Trees
A graphical presentation of a derivation, satisfying
  Root is the start symbol
  Each leaf is a token or ε (note different font from text)
  Each interior node is a nonterminal
  If A is a parent with children X1, X2 … Xn then A → X1X2 … Xn is a production
Note for each derivation there is a unique parse tree. However, for any parse tree there are many corresponding derivations.
A leftmost derivation is a derivation in which at each step the leftmost nonterminal is replaced.
  E ⇒*lm E + E ⇒*lm id + E ⇒*lm id + E * E ⇒*lm id + id * E ⇒*lm id + id * id

Ambiguity
  E → E + E | E * E | ‘(’ E ‘)’ | id

The Empty String ε
ε = the string with no characters
  S → Sa | a
  S → Sa | ε

Equivalent Grammars

Syntax directed Translation
Frequently the rewriting by a production will be called a reduction, or reducing by the particular production.
Syntax-directed translation attaches actions (code) that are executed when the reductions are performed.
Example
  E → E + T {print(‘+’);}
  E → E - T {print(‘-’);}
  E → T
  T → 0 {print(‘0’);}
  T → 1 {print(‘1’);}
  …
  T → 9 {print(‘9’);}

Fig 2.1

Quadruples
Result    Left operand    Operator    Right operand

Fig 2.3 Dataflow model of compiler

Fig 2.4 Intermediate forms for a loop
Source: do i = i + 1; while ( a[i] < v);
Forms shown: Parse Tree, Quadruples

Left Factoring

Specification of the translator
figure 2.38
  S → L eof
  L → E ; L
  L → ε
  E → T E’
  E’ → + T { print(‘+’); } E’
  E’ → - T { print(‘-’); } E’
  E’ → ε
  T → F T’
  T’ → * F { print(‘*’); } T’
  T’ → / F { print(‘/’); } T’
  T’ → ε
  F → ( E )
  F → id { print(id.lexeme); }
  F → num { print(num.value); }

Translating to code
  E → T E’
  E’ → + T { print(‘+’); } E’

  Expr() {
      int t;
      term();
      while (1)
          switch (lookahead) {
          case '+': case '-':
              t = lookahead; match(lookahead); term(); emit(t, NONE);
              continue;
          …

Recursive-descent parsing is a top-down method of syntax analysis in which a set of recursive procedures is used to process the input. One procedure is associated with each nonterminal of a grammar. Here, we consider a simple form of recursive-descent parsing, called predictive parsing, in which the lookahead symbol unambiguously determines the flow of control through the procedure body for each nonterminal. The sequence of procedure calls during the analysis of an input string implicitly defines a parse tree for the input, and can be used to build an explicit parse tree, if desired.

stmt → for ( optexpr ; optexpr ; optexpr ) stmt
Each nonterminal leads to a call of its procedure, in the following sequence of calls:
  match(for); match(‘(’); optexpr(); match(‘;’); optexpr(); match(‘;’); optexpr(); match(‘)’); stmt();

First(alpha); nullable
FIRST(α) is defined to be the set of terminals that appear as the first symbol of one or more strings of terminals generated from α; ε is included when α ⇒* ε.
  S → aS | b | ε
  FIRST(a) = {a}
  FIRST(S) = {a, b, ε}
Note if X → X1X2 … Xn then
  FIRST(X) contains FIRST(X1)
  If X1 ⇒* ε then FIRST(X) contains FIRST(X2)
  If X1X2 … Xi ⇒* ε then FIRST(X) contains FIRST(Xi+1)
A string w is nullable if w ⇒* ε
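The FIRST rules above can be turned into a small fixed-point computation. The following is only a sketch: the grammar encoding (a dict from nonterminal to lists of symbols) and the names `first_sets` and `first_of_string` are assumptions for illustration, not code from the course.

```python
EPS = "ε"  # marker for the empty string inside FIRST sets

def first_of_string(symbols, first):
    """FIRST of a sequence of grammar symbols, given FIRST of each nonterminal."""
    result = set()
    for sym in symbols:
        if sym not in first:            # a terminal starts the string
            result.add(sym)
            return result
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:       # sym is not nullable: stop here
            return result
    result.add(EPS)                     # every symbol was nullable (or the string is empty)
    return result

def first_sets(grammar):
    """Iterate the FIRST rules until no set grows any further."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                before = len(first[nt])
                first[nt] |= first_of_string(rhs, first)
                changed = changed or len(first[nt]) != before
    return first

# S -> aS | b | ε, with [] standing for the ε production.
grammar = {"S": [["a", "S"], ["b"], []]}
first = first_sets(grammar)
print(sorted(first["S"]))
```

For this grammar the result contains a, b and ε: b comes from the production S → b, and ε is present because S is nullable.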

Using First to direct parsing
Lookahead = next_token
If A → α is a production and the lookahead is in FIRST(α), then reduce by the production A → α
For predictive parsing to work, if there are two (or more) productions
  A → α
  A → β
then FIRST(α) ∩ FIRST(β) must be empty

Overview of the Code Figure 2.36
~matthews/public/csce531

Semantic Actions – Translate To Postfix
  expr → expr + term { print(‘+’) }
       | expr - term { print(‘-’) }
       | term
  term → term * factor { print(‘*’) }
       | term / factor { print(‘/’) }
       | factor
  factor → ( expr )
         | num { print(num.value) }
         | id { print(id.lexeme) }
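A predictive parser cannot use the left-recursive productions directly, so the sketch below uses the usual rewritten form in which each left-recursive rule becomes a loop. It is a minimal Python illustration, not the course's C code; the tokenizer, the class name `PostfixTranslator`, and the helper `to_postfix` are assumptions introduced here.

```python
import re

# One token per match: a number, an identifier, or any other single character.
TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")

def tokenize(text):
    tokens = []
    for num, ident, op in TOKEN.findall(text):
        if num:
            tokens.append(("num", num))
        elif ident:
            tokens.append(("id", ident))
        elif op.strip():                 # skip stray whitespace
            tokens.append((op, op))
    tokens.append(("eof", ""))
    return tokens

class PostfixTranslator:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0
        self.out = []                    # emitted postfix symbols

    @property
    def lookahead(self):
        return self.tokens[self.pos][0]

    def match(self, kind):
        if self.lookahead != kind:
            raise SyntaxError(f"expected {kind!r}, found {self.lookahead!r}")
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def expr(self):                      # expr -> term { (+|-) term print(op) }
        self.term()
        while self.lookahead in ("+", "-"):
            op = self.match(self.lookahead)
            self.term()
            self.out.append(op)

    def term(self):                      # term -> factor { (*|/) factor print(op) }
        self.factor()
        while self.lookahead in ("*", "/"):
            op = self.match(self.lookahead)
            self.factor()
            self.out.append(op)

    def factor(self):                    # factor -> ( expr ) | num | id
        if self.lookahead == "(":
            self.match("(")
            self.expr()
            self.match(")")
        elif self.lookahead in ("num", "id"):
            self.out.append(self.match(self.lookahead))
        else:
            raise SyntaxError(f"unexpected {self.lookahead!r}")

def to_postfix(text):
    translator = PostfixTranslator(text)
    translator.expr()
    translator.match("eof")
    return " ".join(translator.out)

print(to_postfix("x * 2 + z"))
```

Each operator is appended only after both of its operands have been translated, which is exactly what the print actions at the end of the productions express.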

Trace x * 2 + z
Parse Tree
Leftmost derivation
Production    Action
E ⇒*lm F

Fig 2.46

Operations on Strings
A language over an alphabet is a set of strings of characters from the alphabet.
Operations on strings: let x = x1x2…xn and t = t1t2…tm, then
  Concatenation: xt = x1x2…xnt1t2…tm
  Alternation: x | t = either x1x2…xn or t1t2…tm

Operations on Sets of Strings
For these let S = {s1, s2, … sm} and R = {r1, r2, … rn}
  Alternation: S | R = S ∪ R = {s1, s2, … sm, r1, r2, … rn}
  Concatenation: SR = {sr | s ∈ S and r ∈ R} = {s1r1, s1r2, … s1rn, s2r1, … s2rn, … smr1, … smrn}
  Power: S^2 = S S, S^3 = S^2 S, S^n = S^(n-1) S
  What is S^0?
  Kleene Closure: S* = ∪ (i = 0 to ∞) S^i; note S^0 = {ε} is contained in S*

Operations cont. Kleene Closure
Powers:
  S^2 = S S
  S^3 = S^2 S
  …
  S^n = S^(n-1) S
  What is S^0?
Kleene Closure: S* = ∪ (i = 0 to ∞) S^i; note S^0 = {ε} is contained in S*

Examples of Operations on Sets of Strings
For these let S = {a,b,c} and R = {t,u}
  Alternation: S | R = S ∪ R = {a,b,c,t,u}
  Concatenation: SR = {sr | s ∈ S and r ∈ R} = {at, au, bt, bu, ct, cu}
  Power: S^2 = {aa, ab, ac, ba, bb, bc, ca, cb, cc}
         S^3 = {aaa, aab, aac, … ccc}, 27 elements
  Kleene closure: S* = {any string of any length of a’s, b’s and c’s}
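These set operations map directly onto Python sets of strings. The sketch below is illustrative only: the function names are made up for this example, and because S* is infinite the closure is truncated at a chosen maximum power.

```python
def concat(S, R):
    """SR = { sr | s in S and r in R }."""
    return {s + r for s in S for r in R}

def power(S, n):
    """S^n, built as S^(n-1) S, starting from S^0 = {ε} (the empty string)."""
    result = {""}
    for _ in range(n):
        result = concat(result, S)
    return result

def kleene_up_to(S, max_power):
    """Finite approximation of S*: the union of S^0 through S^max_power."""
    closure = set()
    for i in range(max_power + 1):
        closure |= power(S, i)
    return closure

S = {"a", "b", "c"}
R = {"t", "u"}
print(sorted(S | R))            # alternation is plain set union
print(sorted(concat(S, R)))
print(len(power(S, 3)))
print(len(kleene_up_to(S, 2)))
```

The last line counts 1 + 3 + 9 strings: the empty string from S^0, the three strings of S^1, and the nine strings of S^2.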


Regular Expressions
For a given alphabet Σ the following are regular expressions:
  If a ∈ Σ then a is a regular expression and L(a) = {a}
  ε is a regular expression and L(ε) = {ε}
  ∅ is a regular expression and L(∅) = ∅
And if s and t are regular expressions denoting languages L(s) and L(t) respectively, then
  st is a regular expression and L(st) = L(s) L(t)
  s | t is a regular expression and L(s | t) = L(s) ∪ L(t)
  s* is a regular expression and L(s*) = L(s)*

Why Regular Expressions?
We use regular expressions to describe the tokens.
Example: a regular expression for C identifiers
  C identifiers? Any string of letters, underscores and digits that starts with a letter or underscore
  ID reg expr = (letter | underscore) (letter | underscore | digit)*
  Or more explicitly
  ID reg expr = (a|b|…|z|A|B|…|Z|_)(a|b|…|z|A|B|…|Z|_|0|1|…|9)*
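The identifier pattern can be tried out with Python's `re` module, whose character classes are a shorthand for the long alternations. This is just a quick check of the pattern; the name `is_identifier` is an assumption for the example, and the check ignores C keywords.

```python
import re

# (letter | underscore)(letter | underscore | digit)*
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

def is_identifier(text):
    """True if the whole string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None

for candidate in ["fort", "_tmp1", "x", "9lives", "a-b", ""]:
    print(repr(candidate), is_identifier(candidate))
```

`fullmatch` is used so that the entire string must match; `match` alone would accept a valid prefix of an invalid string such as `a-b`.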

Pop Quiz
Given r and s are regular expressions, then
  What is rε? r | ε?
  Describe the language denoted by 0*110*
  Describe the language denoted by (0|1)*110*
  Give a regular expression for the language of strings of 0’s and 1’s that end in a 1
  Give a regular expression for the language of strings of 0’s and 1’s such that every 0 is followed by a 1

Recognizers of Regular Languages
To develop efficient lexical analyzers (scanners) we will rely on a mathematical model called finite automata, similar to the state machines that you have probably seen. In particular we will use deterministic finite automata, DFAs.
The construction of a lexical analyzer will then proceed as:
  1. Identify all tokens
  2. Develop regular expressions for each
  3. Convert the regular expressions to finite automata
  4. Use the transition table for the finite automata as the basis for the scanner
We will actually use the tools lex and/or flex for steps 3 and 4.

Transition Diagram for a DFA
Start in state s0; if the input is “f”, make the transition to state s1. Then from state s1, if the input is “o”, make the transition to state s2. And from state s2, if the input is “r”, make the transition to state s3. The double circle denotes an “accepting state”, which means we recognized the token. Actually there is a missing state and transition.

Now what about “fort”?
The string “fort” is an identifier, not the keyword “for” followed by “t”. Thus we can’t really recognize the token until we see a terminator: whitespace or a special symbol ( one of ,;(){}[]

Deterministic Finite Automata
A deterministic finite automaton (DFA) is a mathematical model that consists of
  1. a set of states S
  2. a set of input symbols Σ, the input alphabet
  3. a transition function δ: S × Σ → S that maps each state and each input symbol to the next state
  4. a state s0 that is distinguished as the start state
  5. a set of states SF ⊆ S distinguished as accepting (or final) states

DFA to recognize keyword “for”
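Σ = {a,b,c …z, A,B,…Z,0,…9,’,’, ‘;’, …}
S = {s0, s1, s2, s3, sdead}
s0 is the start state
SF = {s3}
δ given by the table below (every entry not on the f, o, r path goes to sdead)

          f       o       r       Others
  s0      s1      sdead   sdead   sdead
  s1      sdead   s2      sdead   sdead
  s2      sdead   sdead   s3      sdead
  s3      sdead   sdead   sdead   sdead
  sdead   sdead   sdead   sdead   sdead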
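A table-driven recognizer for this DFA might look like the sketch below. The dictionary encoding and the function name `accepts` are assumptions for illustration; any (state, input) pair missing from the table is treated as a transition to sdead.

```python
DEAD = "sdead"
START = "s0"
ACCEPTING = {"s3"}

# Only the transitions on the f, o, r path are listed explicitly.
DELTA = {
    ("s0", "f"): "s1",
    ("s1", "o"): "s2",
    ("s2", "r"): "s3",
}

def accepts(text):
    """Run the DFA over `text` and report whether it ends in an accepting state."""
    state = START
    for ch in text:
        state = DELTA.get((state, ch), DEAD)   # missing entry means sdead
    return state in ACCEPTING

for word in ["for", "fo", "fort", "far"]:
    print(word, accepts(word))
```

Note that this machine accepts exactly the string “for”; telling the keyword apart from an identifier such as “fort” inside a longer input still needs the terminator check described earlier.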

Language Accepted by a DFA
A string x0x1…xn is accepted by a DFA M = (Σ, S, s0, δ, SF) if si+1 = δ(si, xi) for i = 0, 1, …, n and sn+1 ∈ SF,
i.e. if x0x1…xn determines a path through the state diagram for the DFA that ends in an accepting state.
The language accepted by the DFA M = (Σ, S, s0, δ, SF), denoted L(M), is the set of all strings accepted by M.

What is the Language Accepted by…

Non-Deterministic Finite Automata
What does deterministic mean?
In a non-deterministic finite automaton (NFA) we relax the restriction that the transition function δ maps every state and every element of the alphabet to a unique state, i.e. δ: S × Σ → S
An NFA can:
  Have multiple transitions from a state for the same input
  Have ε transitions, where a transition from one state to another can be accomplished without consuming an input character
  Not have transitions defined for every state and every input
Note for NFAs δ: S × (Σ ∪ {ε}) → 2^S, where 2^S is the power set of S

Language Accepted by an NFA
A string x0x1…xn is accepted by an NFA M = (Σ, S, s0, δ, SF) if there are states with si+1 ∈ δ(si, xi) for i = 0, 1, …, n and sn+1 ∈ SF,
i.e. if x0x1…xn can determine a path through the state diagram for the NFA that ends in an accepting state, taking ε transitions wherever necessary.
The language accepted by the NFA M = (Σ, S, s0, δ, SF), denoted L(M), is the set of all strings accepted by M.
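One common way to check NFA acceptance is to track the set of all states the machine could be in, closing that set under ε transitions after every step. The sketch below follows that idea; the three-state example NFA (strings of 0's and 1's ending in 1), the use of the empty string as the ε label, and the function names are all assumptions made for illustration.

```python
EPS = ""  # label used here for ε transitions

def eps_closure(states, delta):
    """All states reachable from `states` using only ε transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in delta.get((state, EPS), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure

def nfa_accepts(text, delta, start, accepting):
    """Simulate the NFA by following every possible path at once."""
    current = eps_closure({start}, delta)
    for ch in text:
        moved = set()
        for state in current:
            moved |= set(delta.get((state, ch), ()))
        current = eps_closure(moved, delta)
    return bool(current & accepting)

# Hypothetical NFA: q0 loops on 0 and 1, guesses the final 1, then takes an ε move.
delta = {
    ("q0", "0"): {"q0"},
    ("q0", "1"): {"q0", "q1"},
    ("q1", EPS): {"q2"},
}
for word in ["1", "01", "10", ""]:
    print(repr(word), nfa_accepts(word, delta, "q0", {"q2"}))
```

A missing table entry simply contributes no successor states, which matches the point that an NFA need not define a transition for every state and input.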


Thompson Construction
For any regular expression R construct an NFA, M, that accepts the language denoted by R, i.e., L(M) = L(R).
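The slide states only the goal of the construction, so the following is a sketch of one common textbook formulation rather than the lecture's own presentation. Each builder returns a fragment with one start and one accepting state, and fragments are glued together with ε edges. The class name `NFA`, its method names, and the choice to join concatenated fragments with an ε edge (instead of merging states) are assumptions made here.

```python
from itertools import count

EPS = None  # label for ε edges

class NFA:
    def __init__(self):
        self.edges = {}                  # (state, label) -> set of target states
        self._ids = count()

    def _state(self):
        return next(self._ids)

    def _edge(self, src, label, dst):
        self.edges.setdefault((src, label), set()).add(dst)

    # Each builder returns a fragment (start, accept).
    def symbol(self, a):
        s, f = self._state(), self._state()
        self._edge(s, a, f)
        return s, f

    def concat(self, x, y):              # xy: accept of x flows into start of y
        self._edge(x[1], EPS, y[0])
        return x[0], y[1]

    def union(self, x, y):               # x | y: new start and accept around both
        s, f = self._state(), self._state()
        for start, accept in (x, y):
            self._edge(s, EPS, start)
            self._edge(accept, EPS, f)
        return s, f

    def star(self, x):                   # x*: allow skipping x or repeating it
        s, f = self._state(), self._state()
        self._edge(s, EPS, x[0])
        self._edge(s, EPS, f)
        self._edge(x[1], EPS, x[0])
        self._edge(x[1], EPS, f)
        return s, f

    def _closure(self, states):
        seen, stack = set(states), list(states)
        while stack:
            for t in self.edges.get((stack.pop(), EPS), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    def accepts(self, fragment, text):
        current = self._closure({fragment[0]})
        for ch in text:
            moved = set()
            for s in current:
                moved |= self.edges.get((s, ch), set())
            current = self._closure(moved)
        return fragment[1] in current

# Build an NFA for (0|1)*1 by composing fragments.
n = NFA()
r = n.concat(n.star(n.union(n.symbol("0"), n.symbol("1"))), n.symbol("1"))
for word in ["1", "0101", "10", ""]:
    print(repr(word), n.accepts(r, word))
```

Because every operator adds at most two states and a handful of ε edges, the size of the resulting NFA grows linearly with the size of the regular expression.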
