Structure of programming languages

Structure of programming languages
Syntax and Semantics

Lexical Analysis

Lexical Analysis A scanner groups input characters into tokens
input: x = x * (acc+123) token value identifier x equal = star * left-paren ( identifier acc plus + integer 123 right-paren ) Tokens are typically represented by numbers

Example Token Informal description Sample lexemes
if Characters i, f if else Characters e, l, s, e else comparison <=, != < or > or <= or >= or == or != id Letter followed by letter and digits pi, score, D2 number Any numeric constant , 0, 6.02e23 literal Anything but “ sorrounded by “ “core dumped” printf(“total = %d\n”, score);

Lexical Analysis The process of converting a character stream into a corresponding sequence of meaningful symbols (called tokens or lexemes) is called tokenizing, lexing or lexical analysis. A program that performs this process is called a tokenizer, lexer, or scanner. In Scheme, we tokenize (set! x (+ x 1)) as ( set! x ( x ) )‏ Similarly, in Java, we tokenize System.out.println("Hello World!"); as System . out . println ( "Hello World!" ) ;

Lexical Analysis Lexical analyzer splits it into tokens
Token = sequence of characters (symbolic name) representing a single terminal symbol Identifiers: myVariable … Literals: true … Keywords: char sizeof … Operators: * / … Punctuation: ; , } { … Discards whitespace and comments

Tasks of a Scanner A typical scanner:
recognizes the keywords of the language these are the reserved words that have a special meaning in the language, such as the word class in Java recognizes special characters, such as ( and ), or groups of special characters, such as := and == recognizes identifiers, integers, reals, decimals, strings, etc ignores whitespaces (tabs, blanks, etc) and comments recognizes and processes special directives (such as the #include "file" directive in C) and macros

Examples of Tokens in C Tokens Lexemes identifier
Age, grade,Temp, zone, q1 number 3.1416, , string “A cat sat on a mat.”, “ ” open parentheses ( close parentheses ) Semicolon ; reserved word if IF, if, If, iF

Examples of Tokens in C Lexical analyzer usually represents each token by a unique integer code “+” { return(PLUS); } // PLUS = 401 “-” { return(MINUS); } // MINUS = 402 “*” { return(MULT); } // MULT = 403 “/” { return(DIV); } // DIV = 404 Some tokens require regular expressions [a-zA-Z_][a-zA-Z0-9_]* { return (ID); } // identifier [1-9][0-9]* { return(DECIMALINT); } 0[0-7]* { return(OCTALINT); } (0x|0X)[0-9a-fA-F] { return(HEXINT); }

Reserved Keywords in C auto, break, case, char, const, continue, default, do, double, else, enum, extern, float, for, goto, if, int, long, register, return, short, signed, sizeof, static, struct, switch, typedef, union, unsigned, void, volatile, wchar_t, while C++ added a bunch: bool, catch, class, dynamic_cast, inline, private, protected, public, static_cast, template, this, virtual and others Each keyword is mapped to its own token

Whitespace Whitespace is any space, tab, end-of-line character (or characters), or character sequence inside a comment No token may contain embedded whitespace (unless it is a character or string literal) Example: >= one token > = two tokens

Redefining Identifiers can be dangerous
program confusing; const true = false; begin if (a<b) = true then f(a) else …

LEXICAL ANALYSIS Lexical Errors Deleting an extraneous character
Inserting a missing character Replacing an incorrect character by a correct character Transposing two adjacent characters(such as , fi=>if) Pre-scanning

Finite Automata for the Lexical Tokens
1 2 a- z 0 - 9 1 2 i f 3 1 2 0 - 9 IF ID NUM 0 - 9 1 2 3 4 5 1 2 4 3 5 a- z \n - blank, etc. . 2 1 any but \n . REAL White space (and comment starting with ‘- -’) error (Appel, pp. 21)

Scanning Pictorial representation of a Pascal scanner as a finite automaton

Parsing Process Call the scanner to get tokens
Build a parse tree from the stream of tokens A parse tree shows the syntactic structure of the source program. Add information about identifiers in the symbol table Report error, when found, and recover from the error

Parsing Parsing is a process that constructs a syntactic structure (i.e. parse tree) from the stream of tokens. We already learn how to describe the syntactic structure of a language using (context-free) grammar. So, a parser only need to do this? Stream of tokens Parser Parse tree Context-free grammar

NUM(4) TIMES LPAR NUM(2) PLUS NUM(3) RPAR
Example Decaf 4*(2+3) Parser input NUM(4) TIMES LPAR NUM(2) PLUS NUM(3) RPAR Parser output (AST): * NUM(4) + NUM(2) NUM(3)

Parse tree for the example
EXPR EXPR EXPR NUM(4) TIMES LPAR NUM(2) PLUS NUM(3) RPAR leaves are tokens

IF LPAR ID EQ ID RPAR LBR ID AS INT SEMI RBR
Another example Decaf if (x == y) { a=1; } Parser input IF LPAR ID EQ ID RPAR LBR ID AS INT SEMI RBR Parser output (AST): IF-THEN == ID = INT

Parse tree for the example
STMT BLOCK STMT EXPR EXPR IF LPAR ID == ID RPAR LBR ID = INT SEMI RBR leaves are tokens

Context-Free Grammars
Expression grammar with precedence and associativity

Context-Free Grammars
Parse tree for expression grammar (with precedence) for * 5

Top–Down Parsing Bottom–Up Parsing
A parse tree is created from root to leaves Tracing leftmost derivation Two types: Backtracking parser Predictive parser A parse tree is created from leaves to root Tracing rightmost derivation More powerful than top-down parsing

Top-down Parsing What does a parser need to decide? How to guess?
Which production rule is to be used at each point of time ? How to guess? What is the guess based on? What is the next token? Reserved word if, open parentheses, etc. What is the structure to be built? If statement, expression, etc.

Top-down Parsing Why is it difficult? Cannot decide until later
Next token: if Structure to be built: St St  MatchedSt | UnmatchedSt UnmatchedSt  if (E) St| if (E) MatchedSt else UnmatchedSt MatchedSt  if (E) MatchedSt else MatchedSt |... Production with empty string Next token: id Structure to be built: par par  parList |  parList  exp , parList | exp

LL Parsing Example (average program) read A read B sum := A + B
write sum write sum / 2 We start at the top and predict needed productions on the basis of the current left-most non-terminal in the tree and the current input token

LL Parsing Parse tree for the average program

LL Parsing Table-driven LL parsing: you have a big loop in which you repeatedly look up an action in a two-dimensional table based on current leftmost non-terminal and current input token. The actions are (1) match a terminal (2) predict a production (3) announce a syntax error

Concept of LL(1) Parsing
Simulate leftmost derivation of the input. Keep part of sentential form in the stack. If the symbol on the top of stack is a terminal, try to match it with the next input token and pop it out of stack. If the symbol on the top of stack is a nonterminal X, replace it with Y if we have a production rule X  Y. Which production will be chosen, if there are both X  Y and X  Z ?

Example of LL(1) Parsing
E TX FNX (E)NX (TX)NX (FNX)NX (nNX)NX (nX)NX (nATX)NX (n+TX)NX (n+FNX)NX (n+(E)NX)NX (n+(TX)NX)NX (n+(FNX)NX)NX (n+(nNX)NX)NX (n+(nX)NX)NX (n+(n)NX)NX (n+(n)X)NX (n+(n))NX (n+(n))MFNX (n+(n))*FNX (n+(n))*nNX (n+(n))*nX (n+(n))*n ( T N ( n + ) * $ E X F + F A n ) E  T X X  A T X |  A  + | - T  F N N  M F N |  M  * F  ( E ) | n N T T ( N X M E * X Finished F ) F n N N T X E $

LL(1) Parsing Algorithm
Push the start symbol into the stack WHILE stack is not empty ($ is not on top of stack) and the stream of tokens is not empty (the next input token is not $) SWITCH (Top of stack, next token) CASE (terminal a, a): Pop stack; Get next token CASE (nonterminal A, terminal a): IF the parsing table entry M[A, a] is not empty THEN Get A X1 X2 ... Xn from the parsing table entry M[A, a] Pop stack; Push Xn ... X2 X1 into stack in that order ELSE Error CASE ($,$): Accept OTHER: Error

LL Parsing Better yet, languages (since Pascal) generally employ explicit end-markers, which eliminate this problem In Modula-2, for example, one says: if A = B then if C = D then E := F end else G := H end Ada says 'end if'; other languages say 'fi'

LL Parsing One problem with end markers is that they tend to bunch up. In Pascal you say if A = B then … else if A = C then … else if A = D then … else if A = E then … else ...; With end markers this becomes if A = B then … else if A = C then … else if A = D then … else if A = E then … else ...; end; end; end; end;

Consider the context-free grammar
S  a X X  b X | b Y Y  c Detail how LL parser will parse the sentence abbc

Example <string exp> ::= <partial string> | <string exp> & <partial string> <partial string> ::= <partial string> - <nested string> | <nested string> <nested string> ::= <basic string> | < nested string> * <basic string> <basic string> ::= <basic string> <letter> |  <letter> ::= A|B|C|D|E|F|G|H|I|J|Y|Z & * -

Consider the context-free grammar
S  a X X  b X | b Y Y  c Detail how an LL parser will parse the sentence abbc

Example <string exp> ::= <partial string> | <string exp> & <partial string> <partial string> ::= <partial string> - <nested string> | <nested string> <nested string> ::= <basic string> | < nested string> * <basic string> <basic string> ::= <basic string> <letter> |  <letter> ::= A|B|C|D|E|F|G|H|I|J|Y|Z & * -

Use one LL method to draw the parse tree for expression
Example <string exp> ::= <partial string> | <string exp> & <partial string> <partial string> ::= <partial string> - <nested string> | <nested string> <nested string> ::= <basic string> | < nested string> * <basic string> <basic string> ::= <basic string> <letter> |  <letter> ::= A|B|C|D|E|F|G|H|I|J|Y|Z Use one LL method to draw the parse tree for expression F * A & G - Y

Postfix, Prefix, Mixfix in Java and C
Increment and decrement: x++, --y x = ++x + x++ legal syntax, undefined semantics! Ternary conditional (conditional-expr) ? (then-expr) : (else-expr); Example: int min(int a, int b) { return (a<b) ? a : b; } This is an expression, NOT an if-then-else command What is the type of this expression?

Structure of programming languages

Similar presentations

Presentation on theme: "Structure of programming languages"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Structure of programming languages

Similar presentations

Presentation on theme: "Structure of programming languages"— Presentation transcript:

Similar presentations

About project

Feedback