Download presentation
Presentation is loading. Please wait.
1
Structure of programming languages
Syntax and Semantics
2
Lexical Analysis
3
Lexical Analysis A scanner groups input characters into tokens
input: x = x * (acc+123) token value identifier x equal = star * left-paren ( identifier acc plus + integer 123 right-paren ) Tokens are typically represented by numbers
4
Example Token Informal description Sample lexemes
if Characters i, f if else Characters e, l, s, e else comparison <=, != < or > or <= or >= or == or != id Letter followed by letter and digits pi, score, D2 number Any numeric constant , 0, 6.02e23 literal Anything but “ sorrounded by “ “core dumped” printf(“total = %d\n”, score);
5
Lexical Analysis The process of converting a character stream into a corresponding sequence of meaningful symbols (called tokens or lexemes) is called tokenizing, lexing or lexical analysis. A program that performs this process is called a tokenizer, lexer, or scanner. In Scheme, we tokenize (set! x (+ x 1)) as ( set! x ( x ) ) Similarly, in Java, we tokenize System.out.println("Hello World!"); as System . out . println ( "Hello World!" ) ;
6
Lexical Analysis Lexical analyzer splits it into tokens
Token = sequence of characters (symbolic name) representing a single terminal symbol Identifiers: myVariable … Literals: true … Keywords: char sizeof … Operators: * / … Punctuation: ; , } { … Discards whitespace and comments
7
Tasks of a Scanner A typical scanner:
recognizes the keywords of the language these are the reserved words that have a special meaning in the language, such as the word class in Java recognizes special characters, such as ( and ), or groups of special characters, such as := and == recognizes identifiers, integers, reals, decimals, strings, etc ignores whitespaces (tabs, blanks, etc) and comments recognizes and processes special directives (such as the #include "file" directive in C) and macros
8
Examples of Tokens in C Tokens Lexemes identifier
Age, grade,Temp, zone, q1 number 3.1416, , string “A cat sat on a mat.”, “ ” open parentheses ( close parentheses ) Semicolon ; reserved word if IF, if, If, iF
9
Examples of Tokens in C Lexical analyzer usually represents each token by a unique integer code “+” { return(PLUS); } // PLUS = 401 “-” { return(MINUS); } // MINUS = 402 “*” { return(MULT); } // MULT = 403 “/” { return(DIV); } // DIV = 404 Some tokens require regular expressions [a-zA-Z_][a-zA-Z0-9_]* { return (ID); } // identifier [1-9][0-9]* { return(DECIMALINT); } 0[0-7]* { return(OCTALINT); } (0x|0X)[0-9a-fA-F] { return(HEXINT); }
10
Reserved Keywords in C auto, break, case, char, const, continue, default, do, double, else, enum, extern, float, for, goto, if, int, long, register, return, short, signed, sizeof, static, struct, switch, typedef, union, unsigned, void, volatile, wchar_t, while C++ added a bunch: bool, catch, class, dynamic_cast, inline, private, protected, public, static_cast, template, this, virtual and others Each keyword is mapped to its own token
11
Whitespace Whitespace is any space, tab, end-of-line character (or characters), or character sequence inside a comment No token may contain embedded whitespace (unless it is a character or string literal) Example: >= one token > = two tokens
12
Redefining Identifiers can be dangerous
program confusing; const true = false; begin if (a<b) = true then f(a) else …
13
LEXICAL ANALYSIS Lexical Errors Deleting an extraneous character
Inserting a missing character Replacing an incorrect character by a correct character Transposing two adjacent characters(such as , fi=>if) Pre-scanning
14
Finite Automata for the Lexical Tokens
1 2 a- z 0 - 9 1 2 i f 3 1 2 0 - 9 IF ID NUM 0 - 9 1 2 3 4 5 1 2 4 3 5 a- z \n - blank, etc. . 2 1 any but \n . REAL White space (and comment starting with ‘- -’) error (Appel, pp. 21)
15
Scanning Pictorial representation of a Pascal scanner as a finite automaton
16
Parsing Process Call the scanner to get tokens
Build a parse tree from the stream of tokens A parse tree shows the syntactic structure of the source program. Add information about identifiers in the symbol table Report error, when found, and recover from the error
17
Parsing Parsing is a process that constructs a syntactic structure (i.e. parse tree) from the stream of tokens. We already learn how to describe the syntactic structure of a language using (context-free) grammar. So, a parser only need to do this? Stream of tokens Parser Parse tree Context-free grammar
18
NUM(4) TIMES LPAR NUM(2) PLUS NUM(3) RPAR
Example Decaf 4*(2+3) Parser input NUM(4) TIMES LPAR NUM(2) PLUS NUM(3) RPAR Parser output (AST): * NUM(4) + NUM(2) NUM(3)
19
Parse tree for the example
EXPR EXPR EXPR NUM(4) TIMES LPAR NUM(2) PLUS NUM(3) RPAR leaves are tokens
20
IF LPAR ID EQ ID RPAR LBR ID AS INT SEMI RBR
Another example Decaf if (x == y) { a=1; } Parser input IF LPAR ID EQ ID RPAR LBR ID AS INT SEMI RBR Parser output (AST): IF-THEN == ID = INT
21
Parse tree for the example
STMT BLOCK STMT EXPR EXPR IF LPAR ID == ID RPAR LBR ID = INT SEMI RBR leaves are tokens
22
Context-Free Grammars
Expression grammar with precedence and associativity
23
Context-Free Grammars
Parse tree for expression grammar (with precedence) for * 5
24
Top–Down Parsing Bottom–Up Parsing
A parse tree is created from root to leaves Tracing leftmost derivation Two types: Backtracking parser Predictive parser A parse tree is created from leaves to root Tracing rightmost derivation More powerful than top-down parsing
25
Top-down Parsing What does a parser need to decide? How to guess?
Which production rule is to be used at each point of time ? How to guess? What is the guess based on? What is the next token? Reserved word if, open parentheses, etc. What is the structure to be built? If statement, expression, etc.
26
Top-down Parsing Why is it difficult? Cannot decide until later
Next token: if Structure to be built: St St MatchedSt | UnmatchedSt UnmatchedSt if (E) St| if (E) MatchedSt else UnmatchedSt MatchedSt if (E) MatchedSt else MatchedSt |... Production with empty string Next token: id Structure to be built: par par parList | parList exp , parList | exp
27
LL Parsing Example (average program) read A read B sum := A + B
write sum write sum / 2 We start at the top and predict needed productions on the basis of the current left-most non-terminal in the tree and the current input token
28
LL Parsing Parse tree for the average program
29
LL Parsing Table-driven LL parsing: you have a big loop in which you repeatedly look up an action in a two-dimensional table based on current leftmost non-terminal and current input token. The actions are (1) match a terminal (2) predict a production (3) announce a syntax error
30
Concept of LL(1) Parsing
Simulate leftmost derivation of the input. Keep part of sentential form in the stack. If the symbol on the top of stack is a terminal, try to match it with the next input token and pop it out of stack. If the symbol on the top of stack is a nonterminal X, replace it with Y if we have a production rule X Y. Which production will be chosen, if there are both X Y and X Z ?
31
Example of LL(1) Parsing
E TX FNX (E)NX (TX)NX (FNX)NX (nNX)NX (nX)NX (nATX)NX (n+TX)NX (n+FNX)NX (n+(E)NX)NX (n+(TX)NX)NX (n+(FNX)NX)NX (n+(nNX)NX)NX (n+(nX)NX)NX (n+(n)NX)NX (n+(n)X)NX (n+(n))NX (n+(n))MFNX (n+(n))*FNX (n+(n))*nNX (n+(n))*nX (n+(n))*n ( T N ( n + ) * $ E X F + F A n ) E T X X A T X | A + | - T F N N M F N | M * F ( E ) | n N T T ( N X M E * X Finished F ) F n N N T X E $
32
LL(1) Parsing Algorithm
Push the start symbol into the stack WHILE stack is not empty ($ is not on top of stack) and the stream of tokens is not empty (the next input token is not $) SWITCH (Top of stack, next token) CASE (terminal a, a): Pop stack; Get next token CASE (nonterminal A, terminal a): IF the parsing table entry M[A, a] is not empty THEN Get A X1 X2 ... Xn from the parsing table entry M[A, a] Pop stack; Push Xn ... X2 X1 into stack in that order ELSE Error CASE ($,$): Accept OTHER: Error
33
LL Parsing Better yet, languages (since Pascal) generally employ explicit end-markers, which eliminate this problem In Modula-2, for example, one says: if A = B then if C = D then E := F end else G := H end Ada says 'end if'; other languages say 'fi'
34
LL Parsing One problem with end markers is that they tend to bunch up. In Pascal you say if A = B then … else if A = C then … else if A = D then … else if A = E then … else ...; With end markers this becomes if A = B then … else if A = C then … else if A = D then … else if A = E then … else ...; end; end; end; end;
35
Consider the context-free grammar
S a X X b X | b Y Y c Detail how LL parser will parse the sentence abbc
36
Example <string exp> ::= <partial string> | <string exp> & <partial string> <partial string> ::= <partial string> - <nested string> | <nested string> <nested string> ::= <basic string> | < nested string> * <basic string> <basic string> ::= <basic string> <letter> | <letter> ::= A|B|C|D|E|F|G|H|I|J|Y|Z & * -
37
Consider the context-free grammar
S a X X b X | b Y Y c Detail how an LL parser will parse the sentence abbc
38
Example <string exp> ::= <partial string> | <string exp> & <partial string> <partial string> ::= <partial string> - <nested string> | <nested string> <nested string> ::= <basic string> | < nested string> * <basic string> <basic string> ::= <basic string> <letter> | <letter> ::= A|B|C|D|E|F|G|H|I|J|Y|Z & * -
39
Use one LL method to draw the parse tree for expression
Example <string exp> ::= <partial string> | <string exp> & <partial string> <partial string> ::= <partial string> - <nested string> | <nested string> <nested string> ::= <basic string> | < nested string> * <basic string> <basic string> ::= <basic string> <letter> | <letter> ::= A|B|C|D|E|F|G|H|I|J|Y|Z Use one LL method to draw the parse tree for expression F * A & G - Y
40
Postfix, Prefix, Mixfix in Java and C
Increment and decrement: x++, --y x = ++x + x++ legal syntax, undefined semantics! Ternary conditional (conditional-expr) ? (then-expr) : (else-expr); Example: int min(int a, int b) { return (a<b) ? a : b; } This is an expression, NOT an if-then-else command What is the type of this expression?
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.