Module 2: Compilers and Their Working. Software Construction, Lectures 10, 11 and 12.


What Are Compilers?
- Translate information from one representation to another
- Usually the information is a program
- Typical compilers: VC, VC++, GCC, JavaC, FORTRAN, Pascal, VB
- Other translators: Word to PDF, PDF to PostScript

Source Code
- Optimized for human readability
- Matches human notions of grammar
- Uses named constructs such as variables and procedures

How to Translate
- Translation is a complex process
- The source language and the generated code are very different
- We need to structure the translation

Two-pass Compiler
[Diagram: source code → Front End → IR → Back End → machine code; both ends report errors]
- Use an intermediate representation (IR)
- The front end maps legal source code into IR
- The back end maps IR into target machine code

The Front-End Modules
- Scanner (also called lexical analyzer)
- Parser
[Diagram: source code → scanner → tokens → parser → IR; both modules report errors]

Scanner
- Maps the character stream into words, the basic unit of syntax
- Produces pairs: a word and its part of speech

Scanner
- Example: x = x + y becomes <id,x> <assign,=> <id,x> <op,+> <id,y>
- We call the pair <token type, word> a “token”
- Typical tokens: number, identifier, +, -, new, while, if
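As a rough illustration of how a scanner produces <token type, word> pairs, here is a minimal Python sketch. The token categories (`number`, `id`, `assign`, `op`) and the function name `scan` are assumptions made for this example; they are not taken from any particular compiler.

```python
import re

# Hypothetical token categories, chosen for illustration only.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("assign", r"="),
    ("op",     r"[+\-]"),
    ("skip",   r"\s+"),          # whitespace is recognized but not emitted
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(text):
    """Return a list of (token type, word) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "skip":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(scan("x = x + y"))
# [('id', 'x'), ('assign', '='), ('id', 'x'), ('op', '+'), ('id', 'y')]
```

Each pattern is tried at the current position, and the name of the group that matched becomes the token type. A production scanner would normally be generated from the regular expressions rather than written by hand.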

Parser
- Recognizes context-free syntax and reports errors
- Guides context-sensitive (“semantic”) analysis
- Builds IR for the source program

What Is Context-Free Syntax?
- To understand it, we need a basic grounding in context-free grammars
- A context-free grammar is a set of rewrite rules

Context-Free Grammars
- Context-free syntax is specified with a grammar G = (S, N, T, P)
- S is the start symbol
- N is a set of non-terminal symbols
- T is a set of terminal symbols or words
- P is a set of productions or rewrite rules

Context-Free Grammars
Grammar for expressions:
    1. goal → expr
    2. expr → expr op term
    3.      | term
    4. term → number
    5.      | id
    6. op   → +
    7.      | -

The Front End
For this CFG:
- S = goal
- T = { number, id, +, - }
- N = { goal, expr, term, op }
- P = { 1, 2, 3, 4, 5, 6, 7 }

Context-Free Grammars
- Given a CFG, we can derive sentences by repeated substitution
- Consider the sentence (expression) x + 2 – y

Derivation
    Production  Result
    -           goal
    1           expr
    2           expr op term
    5           expr op y
    7           expr – y
    2           expr op term – y
    4           expr op 2 – y
    6           expr + 2 – y
    3           term + 2 – y
    5           x + 2 – y
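The derivation can be replayed mechanically. The sketch below is only an illustration: it numbers the productions as in the expression grammar, always rewrites the rightmost non-terminal, and works with the token categories `id` and `number` instead of the concrete words x, 2 and y. The names `PRODUCTIONS`, `rightmost_step` and `derive` are made up for this example.

```python
# Productions of the expression grammar, numbered as on the slide.
PRODUCTIONS = {
    1: ("goal", ["expr"]),
    2: ("expr", ["expr", "op", "term"]),
    3: ("expr", ["term"]),
    4: ("term", ["number"]),
    5: ("term", ["id"]),
    6: ("op", ["+"]),
    7: ("op", ["-"]),
}
NONTERMINALS = {"goal", "expr", "term", "op"}

def rightmost_step(form, rule):
    """Rewrite the rightmost non-terminal in `form` using production `rule`."""
    lhs, rhs = PRODUCTIONS[rule]
    for i in range(len(form) - 1, -1, -1):
        if form[i] in NONTERMINALS:
            if form[i] != lhs:
                raise ValueError(f"rule {rule} does not apply to {form[i]}")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no non-terminal left to rewrite")

def derive(rules):
    """Apply the rules in order, starting from the goal symbol."""
    form = ["goal"]
    steps = [form]
    for rule in rules:
        form = rightmost_step(form, rule)
        steps.append(form)
    return steps

for step in derive([1, 2, 5, 7, 2, 4, 6, 3, 5]):
    print(" ".join(step))
# last line printed: id + number - id
```

The rule sequence 1, 2, 5, 7, 2, 4, 6, 3, 5 is the one listed in the table, so the printed sentential forms follow the same shape with categories in place of words.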

The Front End
- To recognize a valid sentence in some CFG, we reverse this process and build up a parse
- A parse can be represented by a tree: a parse tree or syntax tree

Parse
    Production  Result
    -           goal
    1           expr
    2           expr op term
    5           expr op y
    7           expr – y
    2           expr op term – y
    4           expr op 2 – y
    6           expr + 2 – y
    3           term + 2 – y
    5           x + 2 – y

Syntax Tree
[Figure: parse tree for x + 2 – y, rooted at goal, with interior nodes expr, term and op]

Abstract Syntax Trees
- The parse tree contains a lot of unneeded information
- Compilers often use an abstract syntax tree (AST)

Abstract Syntax Trees
[Figure: AST for x + 2 – y, with operator nodes – and +]
- This is much more concise
- An AST summarizes grammatical structure without the details of the derivation
- ASTs are one kind of intermediate representation (IR)
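To make the idea of an AST concrete, here is a small Python sketch for x + 2 – y. The class names `Leaf` and `BinOp` and the `evaluate` helper are assumptions for illustration; real compilers use richer node types.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    value: str          # an identifier name or a number literal

@dataclass
class BinOp:
    op: str             # "+" or "-" in this sketch
    left: "Node"
    right: "Node"

Node = Union[Leaf, BinOp]

def evaluate(node, env):
    """Evaluate the tree, looking identifiers up in `env`."""
    if isinstance(node, Leaf):
        return env[node.value] if node.value in env else int(node.value)
    left = evaluate(node.left, env)
    right = evaluate(node.right, env)
    return left + right if node.op == "+" else left - right

# AST for x + 2 - y: only operators and operands, no goal/expr/term nodes.
ast = BinOp("-", BinOp("+", Leaf("x"), Leaf("2")), Leaf("y"))
print(evaluate(ast, {"x": 10, "y": 3}))  # (10 + 2) - 3 = 9
```

The tree has five nodes, whereas the parse tree for the same sentence also carries the goal, expr, term and op nodes used during derivation.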

Three-pass Compiler
[Diagram: source code → Front End → IR → Middle End → IR → Back End → machine code; each stage reports errors]
- The middle end is an intermediate stage for code improvement or optimization
- Analyzes IR and rewrites (or transforms) IR
- The primary goal is to reduce the running time of the compiled code
- May also improve space usage, power consumption, ...
- Must preserve the “meaning” of the code

Lexical Analysis: Scanner
[Diagram: source code → scanner → tokens → parser → IR; both modules report errors]

Lexical Analysis
- The task of the scanner is to take a program, written in some programming language, as a stream of characters and break it into a stream of tokens
- This activity is called lexical analysis
- The lexical analyzer partitions the input string into substrings, called words, and classifies them according to their role
- The output of lexical analysis is a stream of tokens

Tokens
- Example: if( i == j ) z = 0; else z = 1;
- The input is just a sequence of characters: if( \b i == j \n\t....

Tokens
Goal:
- Partition the input string into substrings
- Classify them according to their role
A token is a syntactic category.
- Natural language: “He wrote the program”
  Words: “He”, “wrote”, “the”, “program”
- Programming language: “if(b == 0) a = b”
  Words: “if”, “(”, “b”, “==”, “0”, “)”, “a”, “=”, “b”

Tokens
- Identifiers: x y11 maxsize
- Keywords: if else while for
- Integers: 2 1000 -44 5L
- Floats: 2.0 0.0034 1e5
- Symbols: ( ) + * / { } ==
- Strings: “enter x” “error”

How to Describe Tokens?
- Regular languages are the most popular for specifying tokens
  - Simple and useful theory
  - Easy to understand
  - Efficient implementations

Example of Languages
- Alphabet = English characters; Language = English sentences
- Alphabet = ASCII; Language = C++ programs, Java, C#

Recap
- Tokens: strings of characters representing lexical units of programs, such as identifiers, numbers, operators
- Regular expressions: a concise description of tokens; a regular expression describes a set of strings
- Language L(R): the set of strings represented by a regular expression R; L(R) is the language denoted by R

Regular Expression
    R|S    = either R or S
    RS     = R followed by S (concatenation)
    R*     = concatenation of R zero or more times (R* = ε | R | RR | RRR ...)
    R?     = ε | R (zero or one R)
    R+     = RR* (one or more R)
    [abc]  = a|b|c (any of listed)
    [a-z]  = a|b|...|z (range)
    [^ab]  = c|d|... (anything but ‘a’ or ‘b’)
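These operators map directly onto Python's `re` module, which makes them easy to try out. In the sketch below, `in_language` is a hypothetical helper that tests whether a whole word belongs to L(R) by using `re.fullmatch`.

```python
import re

def in_language(pattern, word):
    """True if `word` belongs to L(pattern), i.e. the whole word matches."""
    return re.fullmatch(pattern, word) is not None

print(in_language(r"a|b", "b"))            # True:  R|S
print(in_language(r"ab", "ab"))            # True:  RS
print(in_language(r"a*", ""))              # True:  R* includes the empty string
print(in_language(r"a?", "aa"))            # False: R? allows at most one a
print(in_language(r"a+", ""))              # False: R+ needs at least one a
print(in_language(r"[a-z]+", "maxsize"))   # True:  range
print(in_language(r"[^ab]", "c"))          # True:  anything but a or b
```

Note that `re.fullmatch` is used rather than `re.search`, because membership in L(R) is about the entire string, not a substring.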

How to Use REs
- We need a mechanism to determine whether an input string w belongs to L(R), the language denoted by regular expression R

Acceptor
- Such a mechanism is called an acceptor
- Given an input string w and a language L, the acceptor answers yes if w ∈ L and no if w ∉ L

Finite Automata (FA)
- Specification: regular expressions
- Implementation: finite automata
- A finite automaton accepts a string if we can follow transitions labelled with the characters of the string from the start state to some accepting state
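A finite automaton can be written down as a transition table plus a set of accepting states. The following Python sketch is an assumed example for identifiers (a letter followed by letters or digits); the state names `start` and `in_id` are invented for illustration.

```python
def char_class(ch):
    """Map a character to the input class used by the transition table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# Transition table: (state, input class) -> next state.
TRANSITIONS = {
    ("start", "letter"): "in_id",
    ("in_id", "letter"): "in_id",
    ("in_id", "digit"): "in_id",
}
ACCEPTING = {"in_id"}

def accepts(word):
    """Follow transitions from the start state; accept if we end in an accepting state."""
    state = "start"
    for ch in word:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:          # no transition defined: reject
            return False
    return state in ACCEPTING

print(accepts("y11"))   # True
print(accepts("11y"))   # False
```

This automaton plays the role of the acceptor for the language of the regular expression letter (letter | digit)*.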

SYNTACTIC VS SEMANTIC ANALYSIS

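Syntactic Analysis
- Natural language analogy: consider the sentence “He wrote the program”
- Words: He (noun), wrote (verb), the (article), program (noun)
- Structure: subject (He), predicate (wrote) and object (the program) form a sentence

Syntactic Analysis
- Programming language: if( b <= 0 ) a = b
- Structure: b <= 0 is a bool expr and a = b is an assignment; together they form an if-statement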

Syntactic Analysis
Example containing syntax errors:
    int* foo(int i, int j))        (extra parenthesis)
    {
        for(k=0; i j; )            (missing expression)
            fi( i > j )            (not a keyword)
                return j;
    }

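Semantic Analysis
- Grammatically correct: “He wrote the computer”
- Words: He (noun), wrote (verb), the (article), computer (noun)
- Structure: subject, predicate and object still form a sentence

Semantic Analysis
Example that is syntactically valid but contains semantic errors:
    int* foo(int i, int j)
    {
        for(k=0; i < j; j++ )      (undeclared var)
            if( i < j-2 )
                sum = sum+i;       (undeclared var)
        return sum;                (return type mismatch)
    }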

Role of the Parser
- Not all sequences of tokens are programs
- The parser must distinguish between valid and invalid sequences of tokens
What we need:
- An expressive way to describe the syntax
- An acceptor mechanism that determines whether the input token stream satisfies the syntax
Parsing is the process of discovering a derivation for some sentence. It requires:
- A mathematical model of syntax: a grammar G
- An algorithm for testing membership in L(G)

Backus-Naur Form (BNF)
- Context-free grammars are (often) given by BNF expressions
- Grammar rules in a similar form were first used in the description of the Algol60 language
- The notation was developed by John Backus and adapted by Peter Naur for the Algol60 report, hence the term Backus-Naur Form (BNF)
- The meta-symbols of BNF are:
  - ::= meaning "is defined as"
  - | meaning "or"
  - < > angle brackets, used to surround category names

Meta-symbols of BNF
- Optional items are enclosed in the meta-symbols [ and ], for example:
    <if_statement> ::= if <boolean_expression> then <statement_sequence> [ else <statement_sequence> ] end if ;
- Repetitive items (zero or more times) are enclosed in the meta-symbols { and }, for example:
    <identifier> ::= <letter> { <letter> | <digit> }
- Terminals of only one character are surrounded by quotes (") to distinguish them from meta-symbols, for example:
    <statement_sequence> ::= <statement> { ";" <statement> }
- In recent textbooks, terminal and non-terminal symbols are distinguished by using boldface for terminals and suppressing < and > around non-terminals, which greatly improves readability
- The example then becomes:
    if_statement ::= if boolean_expression then
        statement_sequence
    [ else
        statement_sequence ]
    end if ";"

More Useful Grammar
    1. expr → expr op expr
    2.      | num
    3.      | id
    4. op   → +
    5.      | –
    6.      | *
    7.      | /

Derivation: x – 2 * y
    Rule  Sentential Form
    -     expr
    1     expr op expr
    3     <id,x> op expr
    5     <id,x> – expr
    1     <id,x> – expr op expr
    2     <id,x> – <num,2> op expr
    6     <id,x> – <num,2> * expr
    3     <id,x> – <num,2> * <id,y>

Derivation
- Such a process of rewrites is called a derivation
- The process of discovering a derivation is called parsing
- At each step, we choose a non-terminal to replace
- Different choices can lead to different derivations
- Two derivations are of interest:
  1. Leftmost derivation
  2. Rightmost derivation

Derivations
- Leftmost derivation: replace the leftmost non-terminal (NT) at each step
- Rightmost derivation: replace the rightmost NT at each step
- The example on the preceding slide was a leftmost derivation
- There is also a rightmost derivation

Rightmost Derivation
    Rule  Sentential Form
    -     expr
    1     expr op expr
    3     expr op <id,y>
    6     expr * <id,y>
    1     expr op expr * <id,y>
    2     expr op <num,2> * <id,y>
    5     expr – <num,2> * <id,y>
    3     <id,x> – <num,2> * <id,y>

Derivations
- The two derivations produce different parse trees
- The parse trees imply different evaluation orders!

Parse Trees
[Figure: parse tree from the leftmost derivation of x – 2 * y; evaluation order x – ( 2 * y )]

Parse Trees
[Figure: parse tree from the rightmost derivation of x – 2 * y; evaluation order ( x – 2 ) * y]
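The two trees really do compute different values. The sketch below encodes each tree as nested tuples of the form (operator, left, right) and evaluates both with sample values x = 10 and y = 3; the encoding, the names and the sample values are all assumptions chosen for illustration.

```python
import operator

OPS = {"-": operator.sub, "*": operator.mul}

def evaluate(tree, env):
    """Evaluate a tree given as (op, left, right), a variable name, or a number."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, env), evaluate(right, env))
    return env.get(tree, tree)   # look up names, pass numbers through

leftmost_tree = ("-", "x", ("*", 2, "y"))     # x - (2 * y)
rightmost_tree = ("*", ("-", "x", 2), "y")    # (x - 2) * y

env = {"x": 10, "y": 3}
print(evaluate(leftmost_tree, env))    # 10 - (2 * 3) = 4
print(evaluate(rightmost_tree, env))   # (10 - 2) * 3 = 24
```

Only the first result agrees with the usual rules of arithmetic, which is why the grammar needs a notion of precedence.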

Precedence
- These two derivations point out a problem with the grammar
- It has no notion of precedence, or implied order of evaluation
To add precedence:
- Create a non-terminal for each level of precedence
- Isolate the corresponding part of the grammar
- Force the parser to recognize high-precedence subexpressions first

Precedence
For algebraic expressions:
- Multiplication and division first (level one)
- Subtraction and addition next (level two)

Precedence
This grammar is larger:
- It takes more rewriting to reach some of the terminal symbols
- But it encodes the expected precedence
- It produces the same parse tree under leftmost and rightmost derivations
- Let’s see how it parses x – 2 * y

Precedence
Grammar:
    1. Goal   → expr
    2. expr   → expr + term
    3.        | expr – term
    4.        | term
    5. term   → term * factor
    6.        | term / factor
    7.        | factor
    8. factor → number
    9.        | id

The rightmost derivation of x – 2 * y:
    Rule  Sentential Form
    -     Goal
    1     expr
    3     expr – term
    5     expr – term * factor
    9     expr – term * <id,y>
    7     expr – factor * <id,y>
    8     expr – <num,2> * <id,y>
    4     term – <num,2> * <id,y>
    7     factor – <num,2> * <id,y>
    9     <id,x> – <num,2> * <id,y>

Parse Trees
[Figure: parse tree for x – 2 * y under the precedence grammar; evaluation order x – ( 2 * y )]

Precedence
- Both leftmost and rightmost derivations give the same expression
- This is because the grammar directly encodes the desired precedence

Parsing Techniques

Top-down parsers
- Start at the root of the parse tree and grow towards the leaves
- Pick a production and try to match the input
- A bad “pick” means the parser may need to backtrack
- Some grammars are backtrack-free

Top-down parsers
- Also called LL parsing
- The first L means that tokens are read left to right
- The second L means that the parser constructs a leftmost derivation

Parsing Techniques: Bottom-up parsers
- Start at the leaves and grow toward the root
- As input is consumed, encode possibilities in an internal state
- Start in a state valid for legal first tokens
- Bottom-up parsers handle a large class of grammars
- Preferred method in practice

Bottom-up Parsing
- Also called LR parsing
- L means that tokens are read left to right
- R means that the parser constructs a rightmost derivation (in reverse)

Top-Down Parser
- A top-down parser starts with the root of the parse tree
- The root node is labeled with the goal symbol of the grammar

Top-Down Parsing Algorithm
- Construct the root node of the parse tree
- Repeat until the fringe (leaves) of the parse tree matches the input string:
  1. At a node labeled A, select a production with A on its lhs and, for each symbol on its rhs, construct the appropriate child
  2. When a terminal symbol is added to the fringe and it does not match the input, backtrack
  3. Find the next node to be expanded

Top-Down Parsing
- The key is picking the right production in step 1
- That choice should be guided by the input string

Top-down parse of x – 2 * y (↑ marks the current position in the input):
    P   Sentential Form   Input
    -   Goal              ↑x – 2 * y
    1   expr              ↑x – 2 * y
    2   expr + term       ↑x – 2 * y
    4   term + term       ↑x – 2 * y
    7   factor + term     ↑x – 2 * y
    9   <id,x> + term     ↑x – 2 * y
    -   <id,x> + term     x ↑– 2 * y
- This worked well, except that “–” in the input does not match “+” in the sentential form
- The parser must backtrack to the step where production 2 (expr + term) was chosen

After backtracking, the parser tries production 3 (expr – term):
    P   Sentential Form   Input
    -   Goal              ↑x – 2 * y
    1   expr              ↑x – 2 * y
    3   expr – term       ↑x – 2 * y
    4   term – term       ↑x – 2 * y
    7   factor – term     ↑x – 2 * y
    9   <id,x> – term     ↑x – 2 * y
    -   <id,x> – term     x ↑– 2 * y
    -   <id,x> – term     x – ↑2 * y
- This time the “–” in the sentential form and the “–” in the input match
- We can advance past “–” to look at “2”
- Now we need to expand “term”

Expanding term with production 7:
    P   Sentential Form       Input
    -   <id,x> – term         x – ↑2 * y
    7   <id,x> – factor       x – ↑2 * y
    8   <id,x> – <num,2>      x – ↑2 * y
    -   <id,x> – <num,2>      x – 2 ↑* y
- “2” matches “2”
- We have more input but no non-terminals left to expand
- The expansion terminated too soon, so the parser needs to backtrack

Expanding term with production 5 instead:
    P   Sentential Form                Input
    -   <id,x> – term                  x – ↑2 * y
    5   <id,x> – term * factor         x – ↑2 * y
    7   <id,x> – factor * factor       x – ↑2 * y
    8   <id,x> – <num,2> * factor      x – ↑2 * y
    -   <id,x> – <num,2> * factor      x – 2 ↑* y
    -   <id,x> – <num,2> * factor      x – 2 * ↑y
    9   <id,x> – <num,2> * <id,y>      x – 2 * ↑y
    -   <id,x> – <num,2> * <id,y>      x – 2 * y ↑
- Success! We matched and consumed all the input

Another Possible Parse
    P   Sentential Form                    Input
    -   Goal                               ↑x – 2 * y
    1   expr                               ↑x – 2 * y
    2   expr + term                        ↑x – 2 * y
    2   expr + term + term                 ↑x – 2 * y
    2   expr + term + term + term          ↑x – 2 * y
    2   expr + term + term + term + ....   ↑x – 2 * y
- The parser keeps expanding while consuming no input!
- A wrong choice of expansion leads to non-termination
- The parser must make the right choice

Left Recursion Top-down parsers cannot handle left-recursive grammars

Left Recursion
- Our expression grammar is left recursive
- This can lead to non-termination in a top-down parser
- Non-termination is bad in any part of a compiler
- For a top-down parser, any recursion must be right recursion
- We would like to convert left recursion to right recursion
- To remove left recursion, we transform the grammar

Eliminating Left Recursion
Consider a grammar fragment:
    A → A α
      | β
where neither α nor β starts with A.

Eliminating Left Recursion
We can rewrite this as:
    A  → β A'
    A' → α A'
       | ε
where A' is a new non-terminal.

Eliminating Left Recursion
- The rewritten fragment accepts the same language but uses only right recursion

Eliminating Left Recursion
The expression grammar we have been using contains two cases of left recursion:
    expr → expr + term
         | expr – term
         | term
    term → term * factor
         | term ∕ factor
         | factor

Eliminating Left Recursion
Applying the transformation to expr yields:
    expr  → term expr'
    expr' → + term expr'
          | – term expr'
          | ε

Eliminating Left Recursion
Applying the transformation to term yields:
    term  → factor term'
    term' → * factor term'
          | ∕ factor term'
          | ε
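The transformation is mechanical enough to automate. The Python sketch below handles only immediate left recursion of a single non-terminal, with each alternative given as a list of symbols; the function name `eliminate_left_recursion` and the list-based grammar encoding are assumptions for this example.

```python
EPSILON = "ε"

def eliminate_left_recursion(nonterminal, alternatives):
    """Rewrite A -> A a | b as A -> b A' and A' -> a A' | ε (immediate recursion only)."""
    new_nt = nonterminal + "'"
    # Alternatives of the form A a: keep the tail a.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    # Alternatives b that do not start with A.
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not recursive:
        return {nonterminal: alternatives}
    return {
        nonterminal: [beta + [new_nt] for beta in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [[EPSILON]],
    }

grammar = eliminate_left_recursion(
    "expr", [["expr", "+", "term"], ["expr", "-", "term"], ["term"]]
)
for lhs, alts in grammar.items():
    print(lhs, "→", " | ".join(" ".join(alt) for alt in alts))
# expr → term expr'
# expr' → + term expr' | - term expr' | ε
```

Indirect left recursion, where A reaches itself through other non-terminals, needs a more general algorithm and is not covered by this sketch.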

Eliminating Left Recursion
- These fragments use only right recursion
- A top-down parser will terminate using them
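With the right-recursive grammar, a top-down parser can be written as one function per non-terminal (a recursive-descent parser). The sketch below is one possible shape, not a definitive implementation: the `Parser` class, the tuple-based tree and the simple `tokenize` helper are all assumptions, and the primed non-terminals are written as `expr_tail` and `term_tail`.

```python
import re

OPERATORS = ("+", "-", "*", "/")

def tokenize(text):
    """Split the text into numbers, identifiers and operators (other characters are ignored)."""
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):                     # expr -> term expr'
        return self.expr_tail(self.term())

    def expr_tail(self, left):          # expr' -> + term expr' | - term expr' | ε
        if self.peek() in ("+", "-"):
            op = self.advance()
            return self.expr_tail((op, left, self.term()))
        return left                     # ε: consume nothing

    def term(self):                     # term -> factor term'
        return self.term_tail(self.factor())

    def term_tail(self, left):          # term' -> * factor term' | / factor term' | ε
        if self.peek() in ("*", "/"):
            op = self.advance()
            return self.term_tail((op, left, self.factor()))
        return left                     # ε: consume nothing

    def factor(self):                   # factor -> number | id
        token = self.advance()
        if token is None or token in OPERATORS:
            raise SyntaxError(f"expected number or id, got {token!r}")
        return token

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree

print(parse("x - 2 * y"))   # ('-', 'x', ('*', '2', 'y'))
```

Every call either consumes a token or returns, so the parser cannot loop forever the way the left-recursive grammar allowed. Passing the left operand into the tail functions keeps subtraction and division left-associative even though the grammar is right recursive.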

Predictive Parsing
- If a top-down parser picks the wrong production, it may need to backtrack
- The alternative is to look ahead in the input and use context to pick correctly
- How much lookahead is needed? In general, an arbitrarily large amount
- Fortunately, large classes of CFGs can be parsed with limited lookahead
- Most programming language constructs fall in those subclasses

LL(1) ... LL(k) Parsing
- Scans the input from Left to right
- Does a Leftmost derivation
- Uses 1 to k symbols of lookahead
- Is a top-down parsing technique

Covered further in the advanced course: Compiler Construction, 7th semester.
