Structure of programming languages


Structure of programming languages Syntax and Semantics

Lexical Analysis

Lexical Analysis
A scanner groups input characters into tokens.
Input: x = x * (acc+123)
    Token        Value
    identifier   x
    equal        =
    identifier   x
    star         *
    left-paren   (
    identifier   acc
    plus         +
    integer      123
    right-paren  )
Tokens are typically represented by numbers.
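The grouping on this slide can be sketched in a few lines of Python. The token names are taken from the slide; the regular expressions and the function name scan are assumptions made for illustration, not part of the original deck.

```python
import re

# Token names follow the slide; the patterns are illustrative assumptions.
TOKEN_SPEC = [
    (name, re.compile(pattern))
    for name, pattern in [
        ("integer", r"\d+"),
        ("identifier", r"[A-Za-z_]\w*"),
        ("equal", r"="),
        ("star", r"\*"),
        ("plus", r"\+"),
        ("left-paren", r"\("),
        ("right-paren", r"\)"),
        ("skip", r"\s+"),  # whitespace is recognized but not reported
    ]
]


def scan(text):
    """Group the characters of text into (token, value) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, regex in TOKEN_SPEC:
            match = regex.match(text, pos)
            if match:
                if name != "skip":
                    tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            # No pattern matched at this position.
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
    return tokens


for token, value in scan("x = x * (acc+123)"):
    print(f"{token:12} {value}")
```

A production scanner would normally return small integer codes rather than strings, as the slide notes; strings are used here only to keep the output readable.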

Example
    Token        Informal description                     Sample lexemes
    if           characters i, f                          if
    else         characters e, l, s, e                    else
    comparison   < or > or <= or >= or == or !=           <=, !=
    id           letter followed by letters and digits    pi, score, D2
    number       any numeric constant                     3.14159, 0, 6.02e23
    literal      anything but ", surrounded by "          "core dumped"
Sample statement: printf("total = %d\n", score);

Lexical Analysis
The process of converting a character stream into a corresponding sequence of meaningful symbols (called tokens or lexemes) is called tokenizing, lexing, or lexical analysis. A program that performs this process is called a tokenizer, lexer, or scanner.
In Scheme, we tokenize (set! x (+ x 1)) as
    ( set! x ( + x 1 ) )
Similarly, in Java, we tokenize System.out.println("Hello World!"); as
    System . out . println ( "Hello World!" ) ;

Lexical Analysis
The lexical analyzer splits the source text into tokens.
Token = sequence of characters (with a symbolic name) representing a single terminal symbol
    Identifiers: myVariable …
    Literals: 123 5.67 true …
    Keywords: char sizeof …
    Operators: + - * / …
    Punctuation: ; , } { …
The lexical analyzer discards whitespace and comments.

Tasks of a Scanner
A typical scanner:
    recognizes the keywords of the language (the reserved words that have a special meaning in the language, such as the word class in Java)
    recognizes special characters, such as ( and ), or groups of special characters, such as := and ==
    recognizes identifiers, integers, reals, decimals, strings, etc.
    ignores whitespace (tabs, blanks, etc.) and comments
    recognizes and processes special directives (such as the #include "file" directive in C) and macros

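Examples of Tokens in C
    Token               Lexemes
    identifier          Age, grade, Temp, zone, q1
    number              3.1416, -498127, 987.76412097
    string              "A cat sat on a mat.", "90183654"
    open parenthesis    (
    close parenthesis   )
    semicolon           ;
    reserved word if    if (C keywords are case-sensitive, so IF, If, and iF are ordinary identifiers; only a case-insensitive language would treat them as the reserved word)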

Examples of Tokens in C
The lexical analyzer usually represents each token by a unique integer code:
    "+"    { return(PLUS); }     // PLUS = 401
    "-"    { return(MINUS); }    // MINUS = 402
    "*"    { return(MULT); }     // MULT = 403
    "/"    { return(DIV); }      // DIV = 404
Some tokens require regular expressions:
    [a-zA-Z_][a-zA-Z0-9_]*    { return(ID); }    // identifier
    [1-9][0-9]*               { return(DECIMALINT); }
    0[0-7]*                   { return(OCTALINT); }
    (0x|0X)[0-9a-fA-F]+       { return(HEXINT); }
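To see how these patterns separate decimal, octal, and hexadecimal literals, the rules can be tried out in Python. This is a minimal sketch: the patterns and the codes 401 to 404 are copied from the slide, while the function name classify and the idea of classifying one already-isolated lexeme (instead of scanning a stream with longest-match, as lex does) are simplifying assumptions.

```python
import re

# Integer codes for the operator tokens, as given on the slide.
TOKEN_CODES = {"PLUS": 401, "MINUS": 402, "MULT": 403, "DIV": 404}

# Patterns copied from the slide's rules; tried in this order.
RULES = [
    ("ID", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("DECIMALINT", r"[1-9][0-9]*"),
    ("OCTALINT", r"0[0-7]*"),
    ("HEXINT", r"(0x|0X)[0-9a-fA-F]+"),
    ("PLUS", r"\+"),
    ("MINUS", r"-"),
    ("MULT", r"\*"),
    ("DIV", r"/"),
]


def classify(lexeme):
    """Return the name of the first rule matching the whole lexeme, else None."""
    for name, pattern in RULES:
        if re.fullmatch(pattern, lexeme):
            return name
    return None


for lexeme in ["q1", "42", "017", "0", "0x1F", "089", "+"]:
    print(lexeme, "->", classify(lexeme))
```

Note that a lone 0 is classified as OCTALINT by these rules, and 089 matches no rule at all because 8 and 9 are not octal digits.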

Reserved Keywords in C
auto, break, case, char, const, continue, default, do, double, else, enum, extern, float, for, goto, if, int, long, register, return, short, signed, sizeof, static, struct, switch, typedef, union, unsigned, void, volatile, while
C++ added a bunch: bool, catch, class, dynamic_cast, inline, private, protected, public, static_cast, template, this, virtual, wchar_t (a library typedef in C, but a keyword in C++), and others
Each keyword is mapped to its own token

Whitespace
Whitespace is any space, tab, end-of-line character (or characters), or character sequence inside a comment.
No token may contain embedded whitespace (unless it is a character or string literal).
Example:
    >=     one token
    > =    two tokens
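The >= example can be reproduced with a tiny operator scanner that always takes the longest operator available at the current position. The operator list and the function name split_operators are assumptions chosen for this sketch.

```python
# Longer operators come first so that ">=" wins over ">" followed by "=".
OPERATORS = [">=", "<=", "==", "!=", ">", "<", "="]


def split_operators(text):
    """Split a string of operators and whitespace into operator tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace separates tokens but is not a token itself
            continue
        for op in OPERATORS:
            if text.startswith(op, pos):
                tokens.append(op)
                pos += len(op)
                break
        else:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
    return tokens


print(split_operators(">="))   # one token
print(split_operators("> ="))  # two tokens
```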

Redefining identifiers can be dangerous
    program confusing;
    const true = false;
    begin
      if (a<b) = true then f(a) else …

LEXICAL ANALYSIS
Lexical Errors
Possible error-recovery actions:
    Deleting an extraneous character
    Inserting a missing character
    Replacing an incorrect character by a correct character
    Transposing two adjacent characters (such as fi => if)
Pre-scanning

Finite Automata for the Lexical Tokens
[Figure: finite automata for the tokens IF, ID, NUM, REAL, white space (and comments starting with '--'), and error (Appel, p. 21)]
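Since the automata themselves did not survive in this transcript, here is a small hand-written stand-in that shows the idea for three of the tokens. It is an assumption-laden sketch, not Appel's exact automaton: it merges the IF, ID, and NUM machines into one deterministic automaton over lowercase letters and digits, and the state names and the function recognize are invented for illustration.

```python
import string

LETTERS = set(string.ascii_lowercase)
DIGITS = set(string.digits)

# Token reported when the input ends in the given state.
ACCEPTING = {"i": "ID", "if": "IF", "id": "ID", "num": "NUM"}


def step(state, ch):
    """Return the next state, or None if there is no transition."""
    if state == "start":
        if ch == "i":
            return "i"
        if ch in LETTERS:
            return "id"
        if ch in DIGITS:
            return "num"
    elif state == "i":
        if ch == "f":
            return "if"
        if ch in LETTERS or ch in DIGITS:
            return "id"
    elif state in ("if", "id"):
        if ch in LETTERS or ch in DIGITS:
            return "id"
    elif state == "num":
        if ch in DIGITS:
            return "num"
    return None


def recognize(lexeme):
    """Run the automaton over lexeme and name the token it accepts."""
    state = "start"
    for ch in lexeme:
        state = step(state, ch)
        if state is None:
            return "error"
    return ACCEPTING.get(state, "error")


for lexeme in ["if", "ifx", "i", "x9", "42", "4x"]:
    print(lexeme, "->", recognize(lexeme))
```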

Scanning Pictorial representation of a Pascal scanner as a finite automaton

Parsing Process
    Call the scanner to get tokens
    Build a parse tree from the stream of tokens (a parse tree shows the syntactic structure of the source program)
    Add information about identifiers to the symbol table
    Report errors when found, and recover from them

Parsing
Parsing is a process that constructs a syntactic structure (i.e., a parse tree) from the stream of tokens.
We have already learned how to describe the syntactic structure of a language using a (context-free) grammar.
So, does a parser only need to do this?
    stream of tokens + context-free grammar -> Parser -> parse tree

Example
Decaf: 4*(2+3)
Parser input: NUM(4) TIMES LPAR NUM(2) PLUS NUM(3) RPAR
Parser output (AST):
          *
         / \
    NUM(4)  +
           / \
      NUM(2)  NUM(3)
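A parser that turns this token stream into the AST can be sketched as a recursive-descent routine. The grammar below (expr, term, factor) is an assumption, since the Decaf grammar is not given in the slides; the token names are the ones from the example, and the tuple encoding of tokens and tree nodes is purely illustrative.

```python
def parse(tokens):
    """Parse a list of (kind, value) tokens into a nested-tuple AST.

    Assumed grammar:
        expr   -> term (PLUS term)*
        term   -> factor (TIMES factor)*
        factor -> NUM | LPAR expr RPAR
    """
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else "EOF"

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expr():
        node = term()
        while peek() == "PLUS":
            advance()
            node = ("+", node, term())
        return node

    def term():
        node = factor()
        while peek() == "TIMES":
            advance()
            node = ("*", node, factor())
        return node

    def factor():
        if peek() == "NUM":
            return ("NUM", advance()[1])
        if peek() == "LPAR":
            advance()
            node = expr()
            if peek() != "RPAR":
                raise SyntaxError("expected RPAR")
            advance()  # parentheses leave no node in the AST
            return node
        raise SyntaxError(f"unexpected token {peek()}")

    tree = expr()
    if peek() != "EOF":
        raise SyntaxError(f"unexpected token {peek()}")
    return tree


tokens = [("NUM", 4), ("TIMES", None), ("LPAR", None), ("NUM", 2),
          ("PLUS", None), ("NUM", 3), ("RPAR", None)]
print(parse(tokens))
```

The parentheses appear in the token stream and in the parse tree, but not in the AST: grouping is already captured by the shape of the tree.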

Parse tree for the example
[Figure: parse tree with EXPR interior nodes over the leaves NUM(4) TIMES LPAR NUM(2) PLUS NUM(3) RPAR; the leaves are tokens]

Another example
Decaf: if (x == y) { a=1; }
Parser input: IF LPAR ID EQ ID RPAR LBR ID AS INT SEMI RBR
Parser output (AST): IF-THEN with two children, == (over ID and ID) and = (over ID and INT)

Parse tree for the example
[Figure: parse tree with STMT, BLOCK, and EXPR interior nodes over the leaves IF LPAR ID == ID RPAR LBR ID = INT SEMI RBR; the leaves are tokens]

Context-Free Grammars Expression grammar with precedence and associativity

Context-Free Grammars Parse tree for expression grammar (with precedence) for 3 + 4 * 5

Top-Down Parsing
    A parse tree is created from root to leaves
    Traces a leftmost derivation
    Two types: backtracking parser, predictive parser
Bottom-Up Parsing
    A parse tree is created from leaves to root
    Traces a rightmost derivation in reverse
    More powerful than top-down parsing

Top-down Parsing
What does a parser need to decide?
    Which production rule is to be used at each point in time?
How to guess?
What is the guess based on?
    What is the next token? Reserved word if, open parenthesis, etc.
    What is the structure to be built? If statement, expression, etc.

Top-down Parsing
Why is it difficult?
Cannot decide until later
    Next token: if
    Structure to be built: St
        St → MatchedSt | UnmatchedSt
        UnmatchedSt → if (E) St | if (E) MatchedSt else UnmatchedSt
        MatchedSt → if (E) MatchedSt else MatchedSt | ...
Production with empty string
    Next token: id
    Structure to be built: par
        par → parList | ε
        parList → exp , parList | exp

LL Parsing
Example (average program):
    read A
    read B
    sum := A + B
    write sum
    write sum / 2
We start at the top and predict needed productions on the basis of the current leftmost non-terminal in the tree and the current input token.

LL Parsing Parse tree for the average program

LL Parsing Table-driven LL parsing: you have a big loop in which you repeatedly look up an action in a two-dimensional table based on current leftmost non-terminal and current input token. The actions are (1) match a terminal (2) predict a production (3) announce a syntax error

Concept of LL(1) Parsing
Simulate a leftmost derivation of the input.
Keep part of the sentential form on the stack.
If the symbol on top of the stack is a terminal, try to match it with the next input token and pop it off the stack.
If the symbol on top of the stack is a nonterminal X, replace it with Y if we have a production rule X → Y.
Which production will be chosen if there are both X → Y and X → Z?
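The usual way to answer the last question is to compare the next input token with the FIRST sets of the competing right-hand sides (and, for productions that can derive the empty string, with the FOLLOW set of the nonterminal). The sketch below computes FIRST sets by fixed-point iteration for the expression grammar used in the LL(1) example of this deck; the dictionary encoding and the function names are assumptions made for illustration.

```python
EPS = "ε"

# E -> T X,  X -> A T X | ε,  A -> + | -,  T -> F N,
# N -> M F N | ε,  M -> *,  F -> ( E ) | n
GRAMMAR = {
    "E": [["T", "X"]],
    "X": [["A", "T", "X"], []],
    "A": [["+"], ["-"]],
    "T": [["F", "N"]],
    "N": [["M", "F", "N"], []],
    "M": [["*"]],
    "F": [["(", "E", ")"], ["n"]],
}


def first_of_sequence(symbols, first):
    """FIRST of a sequence of symbols, given FIRST of each nonterminal."""
    result = set()
    for symbol in symbols:
        symbol_first = first[symbol] if symbol in GRAMMAR else {symbol}
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)  # every symbol can vanish, or the sequence is empty
    return result


def compute_first():
    """Iterate until no FIRST set grows any further."""
    first = {nonterminal: set() for nonterminal in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for nonterminal, productions in GRAMMAR.items():
            for rhs in productions:
                new = first_of_sequence(rhs, first)
                if not new <= first[nonterminal]:
                    first[nonterminal] |= new
                    changed = True
    return first


for nonterminal, symbols in compute_first().items():
    print(nonterminal, sorted(symbols))
```

For example, with F on top of the stack the parser picks F → ( E ) when the next token is ( and F → n when it is n, because the FIRST sets of the two right-hand sides do not overlap.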

Example of LL(1) Parsing E TX FNX (E)NX (TX)NX (FNX)NX (nNX)NX (nX)NX (nATX)NX (n+TX)NX (n+FNX)NX (n+(E)NX)NX (n+(TX)NX)NX (n+(FNX)NX)NX (n+(nNX)NX)NX (n+(nX)NX)NX (n+(n)NX)NX (n+(n)X)NX (n+(n))NX (n+(n))MFNX (n+(n))*FNX (n+(n))*nNX (n+(n))*nX (n+(n))*n ( T N ( n + ) * $ E X F + F A n ) E  T X X  A T X |  A  + | - T  F N N  M F N |  M  * F  ( E ) | n N T T ( N X M E * X Finished F ) F n N N T X E $

LL(1) Parsing Algorithm
Push $ and then the start symbol onto the stack; append $ to the stream of tokens
LOOP
    SWITCH (top of stack, next token)
        CASE (terminal a, a):
            Pop stack; get next token
        CASE (nonterminal A, token a):
            IF the parsing table entry M[A, a] is not empty THEN
                Get A → X1 X2 ... Xn from the parsing table entry M[A, a]
                Pop stack; push Xn ... X2 X1 onto the stack in that order (so X1 ends up on top)
            ELSE Error
        CASE ($, $):
            Accept
        OTHER:
            Error
The loop ends only on Accept or Error: nonterminals left on the stack may still need ε-productions after the last real token has been consumed.
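The algorithm can be exercised on the expression grammar from the LL(1) example. In this sketch the parsing table is written out by hand from that grammar's FIRST and FOLLOW sets rather than generated; the table encoding and the name ll1_parse are assumptions for illustration.

```python
NONTERMINALS = {"E", "X", "A", "T", "N", "M", "F"}

# M[A, a]: right-hand side to predict for nonterminal A on lookahead a.
# An empty list stands for an ε-production.
TABLE = {
    ("E", "("): ["T", "X"], ("E", "n"): ["T", "X"],
    ("X", "+"): ["A", "T", "X"], ("X", "-"): ["A", "T", "X"],
    ("X", ")"): [], ("X", "$"): [],
    ("A", "+"): ["+"], ("A", "-"): ["-"],
    ("T", "("): ["F", "N"], ("T", "n"): ["F", "N"],
    ("N", "*"): ["M", "F", "N"],
    ("N", "+"): [], ("N", "-"): [], ("N", ")"): [], ("N", "$"): [],
    ("M", "*"): ["*"],
    ("F", "("): ["(", "E", ")"], ("F", "n"): ["n"],
}


def ll1_parse(tokens):
    """Return the productions predicted, in order, or raise SyntaxError."""
    tokens = list(tokens) + ["$"]
    stack = ["$", "E"]  # the top of the stack is the end of the list
    pos = 0
    predictions = []
    while True:
        top, lookahead = stack[-1], tokens[pos]
        if top == "$" and lookahead == "$":
            return predictions  # accept
        if top in NONTERMINALS:
            rhs = TABLE.get((top, lookahead))
            if rhs is None:
                raise SyntaxError(f"no rule for {top} on {lookahead!r}")
            stack.pop()
            stack.extend(reversed(rhs))  # leftmost symbol ends up on top
            predictions.append((top, rhs))
        elif top == lookahead:
            stack.pop()  # match a terminal
            pos += 1
        else:
            raise SyntaxError(f"expected {top!r}, found {lookahead!r}")


for lhs, rhs in ll1_parse("(n+(n))*n"):
    print(lhs, "->", " ".join(rhs) or "ε")
```

The printed productions follow the same order as the steps of the leftmost derivation of (n+(n))*n.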

LL Parsing
Better yet, languages (since Pascal) generally employ explicit end-markers, which eliminate the dangling-else problem.
In Modula-2, for example, one says:
    if A = B then
        if C = D then E := F end
    else
        G := H
    end
Ada says 'end if'; other languages say 'fi'.

LL Parsing
One problem with end markers is that they tend to bunch up. In Pascal you say
    if A = B then …
    else if A = C then …
    else if A = D then …
    else if A = E then …
    else ...;
With end markers this becomes
    if A = B then …
    else if A = C then …
    else if A = D then …
    else if A = E then …
    else ...;
    end; end; end; end;

Consider the context-free grammar
    S → a X
    X → b X | b Y
    Y → c
Detail how an LL parser will parse the sentence abbc
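One observation that may help with this exercise: both alternatives for X begin with b, so a single token of lookahead cannot choose between them. The sketch below assumes two tokens of lookahead (left-factoring the grammar would be the other common remedy) and records each prediction as it is made; the function and helper names are illustrative.

```python
def parse_abbc(text):
    """Predictive parse for S -> a X, X -> b X | b Y, Y -> c.

    Returns the productions in the order they are predicted.
    """
    steps = []
    pos = 0

    def peek(k=0):
        return text[pos + k] if pos + k < len(text) else "$"

    def match(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected!r}, found {peek()!r}")
        pos += 1

    def parse_s():
        steps.append("S -> a X")
        match("a")
        parse_x()

    def parse_x():
        # The second lookahead token decides between the two b-alternatives.
        if peek() == "b" and peek(1) == "b":
            steps.append("X -> b X")
            match("b")
            parse_x()
        elif peek() == "b" and peek(1) == "c":
            steps.append("X -> b Y")
            match("b")
            parse_y()
        else:
            raise SyntaxError(f"cannot expand X at offset {pos}")

    def parse_y():
        steps.append("Y -> c")
        match("c")

    parse_s()
    if peek() != "$":
        raise SyntaxError("trailing input")
    return steps


for step in parse_abbc("abbc"):
    print(step)
```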

Example
    <string exp> ::= <partial string> | <string exp> & <partial string>
    <partial string> ::= <partial string> - <nested string> | <nested string>
    <nested string> ::= <basic string> | <nested string> * <basic string>
    <basic string> ::= <basic string> <letter> | ε
    <letter> ::= A|B|C|D|E|F|G|H|I|J|Y|Z
Use one LL method to draw the parse tree for the expression F * A & G - Y

Postfix, Prefix, Mixfix in Java and C
Increment and decrement: x++, --y
    x = ++x + x++ is legal syntax, but its semantics are undefined in C (Java fixes the evaluation order, so the result there is defined, though still confusing)
Ternary conditional: (conditional-expr) ? (then-expr) : (else-expr);
    Example: int min(int a, int b) { return (a<b) ? a : b; }
    This is an expression, NOT an if-then-else command
    What is the type of this expression?