Download presentation
1
Structure of programming languages
Syntax and Semantics
2
Introduction Syntax: the form or structure of the expressions, statements, and program units Semantics: the meaning of the expressions, statements, and program units
3
Introduction Syntax and semantics provide a language’s definition
Users of a language definition Other language designers Implementers Programmers (the users of the language)
4
Syntax of a Languages Motivation for using a subset of C: Grammar
Language (pages) Reference Pascal 5 Jensen & Wirth C Kernighan & Richie C Stroustrup Java 14 Gosling, et. al. The grammar fits on one page, so it’s a far better tool for studying language design.
5
C Grammar: Statements Program int main ( ) { Declarations Statements } Declarations { Declaration } Declaration Type Identifier [ [ Integer ] ] { , Identifier [ [ Integer ] ] } ; Type int | bool | float | char Statements { Statement } Statement ; | Block | Assignment | IfStatement | WhileStatement Block { Statements } Assignment Identifier [ [ Expression ] ] = Expression ; IfStatement if ( Expression ) Statement [ else Statement ] WhileStatement while ( Expression ) Statement
6
C Grammar: Expressions
Expression Conjunction { || Conjunction } Conjunction Equality { && Equality } Equality Relation [ EquOp Relation ] EquOp == | != Relation Addition [ RelOp Addition ] RelOp < | <= | > | >= Addition Term { AddOp Term } AddOp + | - Term Factor { MulOp Factor } MulOp * | / | % Factor [ UnaryOp ] Primary UnaryOp - | ! Primary Identifier [ [ Expression ] ] | Literal | ( Expression ) | Type ( Expression )
7
Describing Syntax: Terminology
A sentence is a string of characters over some alphabet A language is a set of sentences A lexeme is the lowest level syntactic unit of a language (e.g., *, sum, begin) A token is a category of lexemes (e.g., identifier)
8
Formal Definition of Languages
Recognizers A recognition device reads input strings of the language and decides whether the input strings belong to the language Example: syntax analysis part of a compiler Generators A device that generates sentences of a language One can determine if the syntax of a particular sentence is correct by comparing it to the structure of the generator
9
Lexical Analysis The process of converting a character stream into a corresponding sequence of meaningful symbols (called tokens or lexemes) is called tokenizing, lexing or lexical analysis. A program that performs this process is called a tokenizer, lexer, or scanner. In Scheme, we tokenize (set! x (+ x 1)) as ( set! x ( x ) ) Similarly, in Java, we tokenize System.out.println("Hello World!"); as System . out . println ( "Hello World!" ) ;
10
Lexical Analysis Lexical analyzer splits it into tokens
Token = sequence of characters (symbolic name) representing a single terminal symbol Identifiers: myVariable … Literals: true … Keywords: char sizeof … Operators: * / … Punctuation: ; , } { … Discards whitespace and comments
11
Examples of Tokens in C Tokens Lexemes identifier
Age, grade,Temp, zone, q1 number 3.1416, , string “A cat sat on a mat.”, “ ” open parentheses ( close parentheses ) Semicolon ; reserved word if IF, if, If, iF
12
Examples of Tokens in C Lexical analyzer usually represents each token by a unique integer code “+” { return(PLUS); } // PLUS = 401 “-” { return(MINUS); } // MINUS = 402 “*” { return(MULT); } // MULT = 403 “/” { return(DIV); } // DIV = 404 Some tokens require regular expressions [a-zA-Z_][a-zA-Z0-9_]* { return (ID); } // identifier [1-9][0-9]* { return(DECIMALINT); } 0[0-7]* { return(OCTALINT); } (0x|0X)[0-9a-fA-F] { return(HEXINT); }
13
Reserved Keywords in C auto, break, case, char, const, continue, default, do, double, else, enum, extern, float, for, goto, if, int, long, register, return, short, signed, sizeof, static, struct, switch, typedef, union, unsigned, void, volatile, wchar_t, while C++ added a bunch: bool, catch, class, dynamic_cast, inline, private, protected, public, static_cast, template, this, virtual and others Each keyword is mapped to its own token
14
Whitespace Whitespace is any space, tab, end-of-line character (or characters), or character sequence inside a comment No token may contain embedded whitespace (unless it is a character or string literal) Example: >= one token > = two tokens
15
Redefining Identifiers can be dangerous
program confusing; const true = false; begin if (a<b) = true then f(a) else …
16
Parsing Process Call the scanner to get tokens
Build a parse tree from the stream of tokens A parse tree shows the syntactic structure of the source program. Add information about identifiers in the symbol table Report error, when found, and recover from the error
17
Grammars A meta-language is a language used to define other languages
A grammar is a meta-language used to define the syntax of a language. It consists of: Finite set of terminal symbols Finite set of non-terminal symbols Finite set of production rules Start symbol Language = (possibly infinite) set of all sequences of symbols that can be derived by applying production rules starting from the start symbol Backus-Naur Form (BNF)
18
Describing Syntax Backus-Naur Form and Context-Free Grammars (BNF)
Most widely known method for describing programming language syntax Extended BNF Improves readability and writability of BNF
19
Backus-Naur Form (BNF)
Invented by John Backus to describe Algol 58 BNF is equivalent to context-free grammars BNF is a meta-language used to describe another language In BNF, abstractions are used to represent classes of syntactic structures--they act like syntactic variables (also called non-terminal symbols)
20
Example: Binary Digits
Consider the grammar: binaryDigit 0 binaryDigit 1 or equivalently: binaryDigit 0 | 1 Here, | is a metacharacter that separates alternatives.
21
Example: Decimal Numbers
Grammar for unsigned decimal integers Terminal symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 Non-terminal symbols: Digit, Integer Production rules: Integer Digit | Integer Digit Digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Start symbol: Integer Can derive any unsigned integer using this grammar Language = set of all unsigned decimal integers
22
Derivation of 352 as an Integer
Production rules: Integer Digit | Integer Digit Digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Integer Integer Digit Integer 2 Integer Digit 2 Integer 5 2 Digit 5 2 3 5 2 Rightmost derivation At each step, the rightmost non-terminal is replaced
23
Parse Tree A labeled tree in which
the interior nodes are labeled by nonterminals leaf nodes are labeled by terminals the children of an interior node represent a replacement of the associated nonterminal in a derivation corresponding to a derivation E + id E F *
24
Parse Tree for 352 as an Integer
25
Abstract Syntax Tree for If Statement
) exp true ( if else elsePart return if true st return
26
Arithmetic Expression Grammar
The following grammar defines the language of arithmetic expressions with 1-digit integers, addition, and subtraction. Expr Expr + Term | Expr – Term | Term Term 0 | ... | 9 | ( Expr )
27
Parse of the String 5-4+3
28
BNF Fundamentals Non-terminals: BNF abstractions
Terminals: lexemes and tokens Grammar: a collection of rules Examples of BNF rules: <ident_list> → identifier | identifier, <ident_list> <if_stmt> → if <logic_expr> then <stmt>
29
Regular Expressions x character x \x escaped character, e.g., \n
{ name } reference to a name M | N M or N M N M followed by N M* 0 or more occurrences of M M+ 1 or more occurrences of M [x1 … xn] One of x1 … xn Example: [aeiou] – vowels, [0-9] - digits
30
An Example Grammar <program> <stmts>
<stmts> <stmt> | <stmt> ; <stmts> <stmt> <var> = <expr> <var> a | b | c | d <expr> <term> + <term> | <term> - <term> <term> <var> | const
31
Derivation A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence (all terminal symbols), e.g., <program> => <stmts> => <stmt> => <var> = <expr> => a = <expr> => a = <term> + <term> => a = <var> + <term> => a = b + <term> => a = b + const
32
Examples S SS | (S)S | () | S SS (S)SS (S)S(S)S (S)S(())S
((())())(()) E E O E | (E) | id O + | - | * | / E E O E (E) O E (E O E) O E * ((E O E) O E) O E ((id O E)) O E) O E ((id + E)) O E) O E ((id + id)) O E) O E * ((id + id)) * id) + id
33
Leftmost Derivation Rightmost Derivation
Each step of the derivation is a replacement of the leftmost nonterminals in a sentential form. E E O E (E) O E (E O E) O E (id O E) O E (id + E) O E (id + id) O E (id + id) * E (id + id) * id Each step of the derivation is a replacement of the rightmost nonterminals in a sentential form. E E O E E O id E * id (E) * id (E O E) * id (E O id) * id (E + id) * id (id + id) * id
34
Parse Tree A hierarchical representation of a derivation
<program> <stmts> <stmt> <var> = <expr> a <term> + <term> <var> const b
35
CFG For Floating Point Numbers
::= stands for production rule; <…> are non-terminals; | represents alternatives for the right-hand side of a production rule Sample parse tree:
36
Ambiguity in Expressions
Which operation is to be done first? solved by precedence An operator with higher precedence is done before one with lower precedence. An operator with higher precedence is placed in a rule (logically) further from the start symbol. solved by associativity If an operator is right-associative (or left-associative), an operand in between 2 operators is associated to the operator to the right (left). Right-associated : W + (X + (Y + Z)) Left-associated : ((W + X) + Y) + Z
37
Associativity and Precedence
A grammar can be used to define associativity and precedence among the operators in an expression. E.g., + and - are left-associative operators in mathematics; * and / have higher precedence than + and - . Consider the grammar G1: Expr -> Expr + Term | Expr – Term | Term Term -> Term * Factor | Term / Factor | Term % Factor | Factor Factor -> Primary ** Factor | Primary Primary -> 0 | ... | 9 | ( Expr )
38
Precedence (cont’d) E E + E | E - E | F F F * F | F / F | X
X ( E ) | id (id1 + id2) * id3 * id4 * F X + E id4 id3 ) ( id1 id2
39
Precedence E E + E E E - E E E * E E E / E E id E F
F F * F F F / F F id + E * id3 id2 id1 E + * F id3 id2 id1
40
Associativity and Precedence
for Grammar Precedence Associativity Operators 3 right ** left * / % \ left Note: These relationships are shown by the structure of the parse tree: highest precedence at the bottom, and left-associativity on the left at each level.
41
Associativity of Operators
Operator associativity can also be indicated by a grammar <expr> -> <expr> + <expr> | const (ambiguous) <expr> -> <expr> + const | const (unambiguous) <expr> <expr> <expr> + const <expr> + const const
42
Associativity Left-associative operators E E + F | E - F | F
/ F X + E id3 ) ( * id1 id4 id2 Left-associative operators E E + F | E - F | F F F * X | F / X | X X ( E ) | id (id1 + id2) * id3 / id4 = (((id1 + id2) * id3) / id4)
43
Ambiguity in Grammars A grammar is ambiguous if and only if it generates a sentential form that has two or more distinct parse trees
44
Example of Dangling Else
With which ‘if’ does the following ‘else’ associate? if (x < 0) if (y < 0) y = y - 1; else y = 0; Answer: either one!
45
An Ambiguous Expression Grammar
<expr> <expr> <op> <expr> | const <op> / | - <expr> <expr> <expr> <op> <expr> <expr> <op> <op> <expr> <expr> <op> <expr> <expr> <op> <expr> const - const / const const - const / const
46
Example: Ambiguous Grammar
E E + E E E - E E E * E E E / E E id + E * id3 id2 id1 * E id2 + id1 id3
47
Solving the dangling else ambiguity
Algol 60, C, C++: associate each else with closest if; use {} or begin…end to override. Algol 68, Modula, Ada: use explicit delimiter to end every conditional (e.g., if…fi) Java: rewrite the grammar to limit what can appear in a conditional: IfThenStatement -> if ( Expression ) Statement IfThenElseStatement -> if ( Expression ) StatementNoShortIf else Statement The category StatementNoShortIf includes all except IfThenStatement.
48
Ambiguity in Expressions
Which operation is to be done first? solved by precedence An operator with higher precedence is done before one with lower precedence. An operator with higher precedence is placed in a rule (logically) further from the start symbol. solved by associativity If an operator is right-associative (or left-associative), an operand in between 2 operators is associated to the operator to the right (left). Right-associated : W + (X + (Y + Z)) Left-associated : ((W + X) + Y) + Z
49
An Unambiguous Expression Grammar
If we use the parse tree to indicate precedence levels of the operators, we cannot have ambiguity <expr> <expr> - <term> | <term> <term> <term> / const| const <expr> <expr> - <term> <term> <term> / const const const
50
Extended BNF Optional parts are placed in brackets [ ]
<proc_call> -> ident [(<expr_list>)] Alternative parts of RHSs are placed inside parentheses and separated via vertical bars <term> → <term> (+|-) const Repetitions (0 or more) are placed inside braces { } <ident> → letter {letter|digit}
51
BNF and EBNF BNF <expr> <expr> + <term> EBNF
<term> <term> * <factor> | <term> / <factor> | <factor> EBNF <expr> <term> {(+ | -) <term>} <term> <factor> {(* | /) <factor>}
52
Extended Backus-Naur Form (EBNF)
Kleene’s Star/ Kleene’s Closure Seq ::= St {; St} Seq ::= {St ;} St Optional Part IfSt ::= if ( E ) St [else St] E ::= F [+ E ] | F [- E]
53
Finite State Automata Set of states
Usually represented as graph nodes Input alphabet + unique “end of program” symbol State transition function Usually represented as directed graph edges (arcs) Automaton is deterministic if, for each state and each input symbol, there is at most one outgoing arc from the state labeled with the input symbol Unique start state One or more final (accepting) states
54
Syntax Diagrams Graphical representation of EBNF rules Examples
nonterminals: terminals: sequences and choices: Examples X ::= (E) | id Seq ::= {St ;} St E ::= F [+ E ] IfSt id ( ) E id St ; F + E
55
DFA for C Identifiers
56
Syntax Diagram for Expressions with Addition
57
Quiz Convert EBNF to BNF S A { b A} A a [b] A
58
Quiz Prove that the grammar is ambiguous <S> <A>
<A> <A> * <A> | <id> <id> x | y | z
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.