Lecture 2 Lexical Analysis


Lecture 2 Lexical Analysis
CSCE 531 Compiler Construction, January 18, 2018
Topics: Sample Simple Compiler; Operations on Strings; Regular Expressions; Finite Automata
Readings:

Overview
Last Time: A little History; Compilers vs. Interpreters; Data-Flow View of Compilers; Regular Languages; Course Pragmatics
Today's Lecture: Why Study Compilers? xx
References: Chapter 2, Chapter 3
Assignment due Wednesday, Jan 18: 3.3a; 3.5a,b; 3.6a,b,c; 3.7a; 3.8b

A Simple Compiler for Expressions (Chapter Two Overview)
Structure of the simple compiler, really just a translator from infix expressions to postfix
Grammars
Parse Trees
Syntax-Directed Translation
Predictive Parsing
Translator for Simple Expressions:
Grammar
Rewritten grammar (an equivalent one better suited to predictive parsing)
Parsing modules (Fig 2.24)
Specification of the translator (Fig 2.35)
Structure of the translator (Fig 2.36)

Grammars
A grammar (more precisely, a context-free grammar) has:
A set of tokens, also known as terminals
A set of nonterminals
A set of productions of the form nonterminal → sequence of tokens and/or nonterminals
A special nonterminal, the start symbol
Example:
E → E + E
E → E * E
E → digit

Derivations
A derivation is a sequence of rewritings of a string of grammar symbols using the productions of a grammar. We use the symbol ⇒ to denote that one string of grammar symbols is obtained by rewriting another using a production: X ⇒ Y if there is a production N → β where the nonterminal N occurs in the string X of grammar symbols, and Y is the same as X except that β replaces that occurrence of N.
Example: E ⇒ E+E ⇒ d+E ⇒ d+E*E ⇒ d+E+E*E ⇒ d+d+E*E ⇒ d+d+d*E ⇒ d+d+d*d

Language Generated by a Grammar
⇒* is formally the reflexive, transitive closure of the relation ⇒: s1 ⇒* s2 informally means that s2 can be derived from s1 in zero or more steps.
The language generated by a grammar G, denoted L(G), is defined by L(G) = { w ∈ T* | S ⇒* w }, where T* is the set of all finite strings of terminals.
Example: S → aS | b gives L(G) = { b, ab, aab, … }

Parse Trees
A parse tree is a graphical presentation of a derivation, satisfying:
The root is the start symbol
Each leaf is a token or ε (note: different font from the text)
Each interior node is a nonterminal
If A is a parent with children X1, X2, …, Xn then A → X1X2…Xn is a production
Note: for each derivation there is a unique parse tree. However, for any parse tree there are many corresponding derivations.
A leftmost derivation is a derivation in which at each step the leftmost nonterminal is replaced:
E ⇒lm E + E ⇒lm id + E ⇒lm id + E * E ⇒lm id + id * E ⇒lm id + id * id

Ambiguity
E → E + E | E * E | '(' E ')' | id
This grammar is ambiguous: the string id + id * id has two distinct parse trees, one in which + is grouped first and one in which * is grouped first.

The Empty String ε
ε = the string with no characters
S → Sa | a
S → Sa | ε

Equivalent Grammars

Syntax-Directed Translation
Frequently the rewriting by a production is called a reduction, or reducing by that production. Syntax-directed translation attaches actions (code) that are executed when the reductions are performed.
Example:
E → E + T { print('+'); }
E → E - T { print('-'); }
E → T
T → 0 { print('0'); }
T → 1 { print('1'); }
…
T → 9 { print('9'); }

Fig 2.1

Quadruples
A quadruple is a record with four fields: Result, Left operand, Operator, Right operand.

Fig 2.3 Dataflow model of compiler

Fig 2.4 Intermediate forms for a loop: parse tree and quadruples for
do i = i + 1; while ( a[i] < v );

Left Factoring

Specification of the Translator (Figure 2.38)
S → L eof
L → E ; L
L → ε
E → T E'
E' → + T { print('+'); } E'
E' → - T { print('-'); } E'
E' → ε
T → F T'
T' → * F { print('*'); } T'
T' → / F { print('/'); } T'
T' → ε
F → ( E )
F → id { print(id.lexeme); }
F → num { print(num.value); }

Translating to Code
E → T E'
E' → + T { print('+'); } E'

Expr() {
    int t;
    term();
    while (1)
        switch (lookahead) {
        case '+': case '-':
            t = lookahead;
            match(lookahead);
            term();
            emit(t, NONE);
            continue;
…

Recursive-descent parsing is a top-down method of syntax analysis in which a set of recursive procedures is used to process the input. One procedure is associated with each nonterminal of a grammar. Here, we consider a simple form of recursive-descent parsing, called predictive parsing, in which the lookahead symbol unambiguously determines the flow of control through the procedure body for each nonterminal. The sequence of procedure calls during the analysis of an input string implicitly defines a parse tree for the input, and can be used to build an explicit parse tree, if desired.

stmt  for ( optexpr ; optexpr ; optexpr ) stmt each nonterminal leads to a call of its procedure, in the following sequence of calls: match(for); match( ‘(‘ ); optexpr (); match( ‘;’); optexpr (); match(‘;’); optexpr (); match( ‘)’ ); stmt ();

FIRST(α) and Nullable
Define FIRST(α) to be the set of terminals that appear as the first symbol of one or more strings of terminals generated from α.
Example: for S → aS | b | ε, FIRST(a) = {a} and FIRST(S) = {a, b, ε}.
Note: if X → X1X2…Xn then FIRST(X) contains FIRST(X1); if X1 ⇒* ε then FIRST(X) also contains FIRST(X2); and in general, if X1X2…Xi ⇒* ε then FIRST(X) contains FIRST(Xi+1).
A string w is nullable if w ⇒* ε.

Using FIRST to Direct Parsing
lookahead = next_token
If A → α is a production and the lookahead is in FIRST(α), then expand A using the production A → α.
For predictive parsing to work, if there are two (or more) productions A → α and A → β, then FIRST(α) ∩ FIRST(β) must be empty.

Overview of the Code Figure 2.36 ~matthews/public/csce531

Semantic Actions: Translate to Postfix
expr → expr + term { print('+') }
     | expr - term { print('-') }
     | term
term → term * factor { print('*') }
     | term / factor { print('/') }
     | factor
factor → ( expr )
       | num { print(num.value) }
       | id { print(id.lexeme) }

Trace of x * 2 + z: parse tree, leftmost derivation, production, and action at each step (trace table not reproduced; the derivation begins E ⇒lm …)

Fig 2.46

Operations on Strings
A language over an alphabet is a set of strings of characters from the alphabet.
Operations on strings: let x = x1x2…xn and t = t1t2…tm. Then:
Concatenation: xt = x1x2…xnt1t2…tm
Alternation: x | t = either x1x2…xn or t1t2…tm

Operations on Sets of Strings
For these let S = {s1, s2, …, sm} and T = {t1, t2, …, tn}.
Alternation: S | T = S ∪ T = {s1, s2, …, sm, t1, t2, …, tn}
Concatenation: ST = { st | s ∈ S and t ∈ T } = { s1t1, s1t2, …, s1tn, s2t1, …, s2tn, …, smt1, …, smtn }
Power: S2 = S S, S3 = S2 S, Sn = Sn-1 S. What is S0? S0 = {ε}.
Kleene Closure: S* = ∪i≥0 Si; note S0 = {ε} is in S*


Examples of Operations on Sets of Strings
For these let S = {a, b, c} and T = {t, u}.
Alternation: S | T = S ∪ T = {a, b, c, t, u}
Concatenation: ST = { st | s ∈ S and t ∈ T } = { at, au, bt, bu, ct, cu }
Power: S2 = { aa, ab, ac, ba, bb, bc, ca, cb, cc }; S3 = { aaa, aab, aac, …, ccc } (27 elements)
Kleene closure: S* = { any string of any length of a's, b's, and c's }


Regular Expressions
For a given alphabet Σ, the following are regular expressions:
If a ∈ Σ then a is a regular expression and L(a) = { a }
ε is a regular expression and L(ε) = { ε }
∅ is a regular expression and L(∅) = ∅
And if s and t are regular expressions denoting languages L(s) and L(t) respectively, then:
st is a regular expression and L(st) = L(s) L(t)
s | t is a regular expression and L(s | t) = L(s) ∪ L(t)
s* is a regular expression and L(s*) = L(s)*

Why Regular Expressions?
We use regular expressions to describe the tokens.
Example: a regular expression for C identifiers. What is a C identifier? Any string of letters, underscores, and digits that starts with a letter or underscore.
ID reg expr = (letter | underscore) (letter | underscore | digit)*
Or more explicitly:
ID reg expr = (a|b|…|z|_)(a|b|…|z|_|0|1|…|9)*

Pop Quiz
Given that r and s are regular expressions:
What is rε? r | ε?
Describe the language denoted by 0*110*.
Describe the language denoted by (0|1)*110*.
Give a regular expression for the language of 0's and 1's that end in a 1.
Give a regular expression for the language of 0's and 1's in which every 0 is followed by a 1.

Recognizers of Regular Languages
To develop efficient lexical analyzers (scanners) we will rely on a mathematical model called finite automata, similar to the state machines that you have probably seen. In particular we will use deterministic finite automata, DFAs.
The construction of a lexical analyzer will then proceed as:
1. Identify all tokens.
2. Develop regular expressions for each.
3. Convert the regular expressions to finite automata.
4. Use the transition table for the finite automata as the basis for the scanner.
We will actually use the tools lex and/or flex for steps 3 and 4.

Transition Diagram for a DFA
Start in state s0; if the input is "f", make a transition to state s1. Then from state s1, if the input is "o", make a transition to state s2. And from state s2, if the input is "r", make a transition to state s3. The double circle denotes an "accepting state," which means we have recognized the token.
Actually, there is a missing state and transition.

Now What About "fort"?
The string "fort" is an identifier, not the keyword "for" followed by "t". Thus we can't really recognize the token until we see a terminator: whitespace or a special symbol (one of , ; ( ) { } [ ]).

Deterministic Finite Automata
A deterministic finite automaton (DFA) is a mathematical model that consists of:
1. a set of states S
2. a set of input symbols Σ, the input alphabet
3. a transition function δ: S × Σ → S that maps each state and each input to the next state
4. a state s0 distinguished as the start state
5. a set of states F distinguished as accepting (or final) states

DFA to Recognize the Keyword "for"
Σ = {a, b, c, …, z, A, B, …, Z, 0, …, 9, ',', ';', …}
S = {s0, s1, s2, s3, sdead}
s0 is the start state
SF = {s3}
δ is given by the table below:
        f      o      r      others
s0      s1     sdead  sdead  sdead
s1      sdead  s2     sdead  sdead
s2      sdead  sdead  s3     sdead
s3      sdead  sdead  sdead  sdead

Language Accepted by a DFA
A string x0x1…xn is accepted by a DFA M = (Σ, S, s0, δ, SF) if si+1 = δ(si, xi) for i = 0, 1, …, n and sn+1 ∈ SF, i.e., if x0x1…xn determines a path through the state diagram of the DFA that ends in an accepting state. The language accepted by the DFA M = (Σ, S, s0, δ, SF), denoted L(M), is the set of all strings accepted by M.

What is the Language Accepted by…

Non-Deterministic Finite Automata
What does deterministic mean? In a nondeterministic finite automaton (NFA) we relax the restriction that the transition function δ maps every state and every element of the alphabet to a unique state, i.e., δ: S × Σ → S. An NFA can:
Have multiple transitions from a state on the same input
Have ε-transitions, where a transition from one state to another can be made without consuming an input character
Not have transitions defined for every state and every input
Note: for NFAs, δ: S × Σ → 2S, where 2S is the power set of S.

Language Accepted by an NFA
A string x0x1…xn is accepted by an NFA M = (Σ, S, s0, δ, SF) if si+1 ∈ δ(si, xi) for i = 0, 1, …, n and sn+1 ∈ SF, i.e., if x0x1…xn can determine a path through the state diagram of the NFA that ends in an accepting state, taking ε-transitions wherever necessary. The language accepted by the NFA M = (Σ, S, s0, δ, SF), denoted L(M), is the set of all strings accepted by M.

Language Accepted by an NFA

Thompson Construction
For any regular expression R, construct an NFA M that accepts the language denoted by R, i.e., L(M) = L(R).