Structure of programming languages Syntax and Semantics
Introduction Syntax: the form or structure of the expressions, statements, and program units Semantics: the meaning of the expressions, statements, and program units
Introduction Syntax and semantics provide a language’s definition Users of a language definition Other language designers Implementers Programmers (the users of the language)
Syntax of a Languages Motivation for using a subset of C: Grammar Language (pages) Reference Pascal 5 Jensen & Wirth C 6 Kernighan & Richie C++ 22 Stroustrup Java 14 Gosling, et. al. The grammar fits on one page, so it’s a far better tool for studying language design.
C Grammar: Statements Program int main ( ) { Declarations Statements } Declarations { Declaration } Declaration Type Identifier [ [ Integer ] ] { , Identifier [ [ Integer ] ] } ; Type int | bool | float | char Statements { Statement } Statement ; | Block | Assignment | IfStatement | WhileStatement Block { Statements } Assignment Identifier [ [ Expression ] ] = Expression ; IfStatement if ( Expression ) Statement [ else Statement ] WhileStatement while ( Expression ) Statement
C Grammar: Expressions Expression Conjunction { || Conjunction } Conjunction Equality { && Equality } Equality Relation [ EquOp Relation ] EquOp == | != Relation Addition [ RelOp Addition ] RelOp < | <= | > | >= Addition Term { AddOp Term } AddOp + | - Term Factor { MulOp Factor } MulOp * | / | % Factor [ UnaryOp ] Primary UnaryOp - | ! Primary Identifier [ [ Expression ] ] | Literal | ( Expression ) | Type ( Expression )
Describing Syntax: Terminology A sentence is a string of characters over some alphabet A language is a set of sentences A lexeme is the lowest level syntactic unit of a language (e.g., *, sum, begin) A token is a category of lexemes (e.g., identifier)
Formal Definition of Languages Recognizers A recognition device reads input strings of the language and decides whether the input strings belong to the language Example: syntax analysis part of a compiler Generators A device that generates sentences of a language One can determine if the syntax of a particular sentence is correct by comparing it to the structure of the generator
Lexical Analysis The process of converting a character stream into a corresponding sequence of meaningful symbols (called tokens or lexemes) is called tokenizing, lexing or lexical analysis. A program that performs this process is called a tokenizer, lexer, or scanner. In Scheme, we tokenize (set! x (+ x 1)) as ( set! x ( + x 1 ) ) Similarly, in Java, we tokenize System.out.println("Hello World!"); as System . out . println ( "Hello World!" ) ;
Lexical Analysis Lexical analyzer splits it into tokens Token = sequence of characters (symbolic name) representing a single terminal symbol Identifiers: myVariable … Literals: 123 5.67 true … Keywords: char sizeof … Operators: + - * / … Punctuation: ; , } { … Discards whitespace and comments
Examples of Tokens in C Tokens Lexemes identifier Age, grade,Temp, zone, q1 number 3.1416, -498127,987.76412097 string “A cat sat on a mat.”, “90183654” open parentheses ( close parentheses ) Semicolon ; reserved word if IF, if, If, iF
Examples of Tokens in C Lexical analyzer usually represents each token by a unique integer code “+” { return(PLUS); } // PLUS = 401 “-” { return(MINUS); } // MINUS = 402 “*” { return(MULT); } // MULT = 403 “/” { return(DIV); } // DIV = 404 Some tokens require regular expressions [a-zA-Z_][a-zA-Z0-9_]* { return (ID); } // identifier [1-9][0-9]* { return(DECIMALINT); } 0[0-7]* { return(OCTALINT); } (0x|0X)[0-9a-fA-F]+ { return(HEXINT); }
Reserved Keywords in C auto, break, case, char, const, continue, default, do, double, else, enum, extern, float, for, goto, if, int, long, register, return, short, signed, sizeof, static, struct, switch, typedef, union, unsigned, void, volatile, wchar_t, while C++ added a bunch: bool, catch, class, dynamic_cast, inline, private, protected, public, static_cast, template, this, virtual and others Each keyword is mapped to its own token
Whitespace Whitespace is any space, tab, end-of-line character (or characters), or character sequence inside a comment No token may contain embedded whitespace (unless it is a character or string literal) Example: >= one token > = two tokens
Redefining Identifiers can be dangerous program confusing; const true = false; begin if (a<b) = true then f(a) else …
Parsing Process Call the scanner to get tokens Build a parse tree from the stream of tokens A parse tree shows the syntactic structure of the source program. Add information about identifiers in the symbol table Report error, when found, and recover from the error
Grammars A meta-language is a language used to define other languages A grammar is a meta-language used to define the syntax of a language. It consists of: Finite set of terminal symbols Finite set of non-terminal symbols Finite set of production rules Start symbol Language = (possibly infinite) set of all sequences of symbols that can be derived by applying production rules starting from the start symbol Backus-Naur Form (BNF)
Describing Syntax Backus-Naur Form and Context-Free Grammars (BNF) Most widely known method for describing programming language syntax Extended BNF Improves readability and writability of BNF
Backus-Naur Form (BNF) Invented by John Backus to describe Algol 58 BNF is equivalent to context-free grammars BNF is a meta-language used to describe another language In BNF, abstractions are used to represent classes of syntactic structures--they act like syntactic variables (also called non-terminal symbols)
Example: Binary Digits Consider the grammar: binaryDigit 0 binaryDigit 1 or equivalently: binaryDigit 0 | 1 Here, | is a metacharacter that separates alternatives.
Example: Decimal Numbers Grammar for unsigned decimal integers Terminal symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 Non-terminal symbols: Digit, Integer Production rules: Integer Digit | Integer Digit Digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Start symbol: Integer Can derive any unsigned integer using this grammar Language = set of all unsigned decimal integers
Derivation of 352 as an Integer Production rules: Integer Digit | Integer Digit Digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Integer Integer Digit Integer 2 Integer Digit 2 Integer 5 2 Digit 5 2 3 5 2 Rightmost derivation At each step, the rightmost non-terminal is replaced
Parse Tree A labeled tree in which the interior nodes are labeled by nonterminals leaf nodes are labeled by terminals the children of an interior node represent a replacement of the associated nonterminal in a derivation corresponding to a derivation E + id E F *
Parse Tree for 352 as an Integer
Abstract Syntax Tree for If Statement ) exp true ( if else elsePart return if true st return
Arithmetic Expression Grammar The following grammar defines the language of arithmetic expressions with 1-digit integers, addition, and subtraction. Expr Expr + Term | Expr – Term | Term Term 0 | ... | 9 | ( Expr )
Parse of the String 5-4+3
BNF Fundamentals Non-terminals: BNF abstractions Terminals: lexemes and tokens Grammar: a collection of rules Examples of BNF rules: <ident_list> → identifier | identifier, <ident_list> <if_stmt> → if <logic_expr> then <stmt>
Regular Expressions x character x \x escaped character, e.g., \n { name } reference to a name M | N M or N M N M followed by N M* 0 or more occurrences of M M+ 1 or more occurrences of M [x1 … xn] One of x1 … xn Example: [aeiou] – vowels, [0-9] - digits
An Example Grammar <program> <stmts> <stmts> <stmt> | <stmt> ; <stmts> <stmt> <var> = <expr> <var> a | b | c | d <expr> <term> + <term> | <term> - <term> <term> <var> | const
Derivation A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence (all terminal symbols), e.g., <program> => <stmts> => <stmt> => <var> = <expr> => a = <expr> => a = <term> + <term> => a = <var> + <term> => a = b + <term> => a = b + const
Examples S SS | (S)S | () | S SS (S)SS (S)S(S)S (S)S(())S ((())())(()) E E O E | (E) | id O + | - | * | / E E O E (E) O E (E O E) O E * ((E O E) O E) O E ((id O E)) O E) O E ((id + E)) O E) O E ((id + id)) O E) O E * ((id + id)) * id) + id
Leftmost Derivation Rightmost Derivation Each step of the derivation is a replacement of the leftmost nonterminals in a sentential form. E E O E (E) O E (E O E) O E (id O E) O E (id + E) O E (id + id) O E (id + id) * E (id + id) * id Each step of the derivation is a replacement of the rightmost nonterminals in a sentential form. E E O E E O id E * id (E) * id (E O E) * id (E O id) * id (E + id) * id (id + id) * id
Parse Tree A hierarchical representation of a derivation <program> <stmts> <stmt> <var> = <expr> a <term> + <term> <var> const b
CFG For Floating Point Numbers ::= stands for production rule; <…> are non-terminals; | represents alternatives for the right-hand side of a production rule Sample parse tree:
Ambiguity in Expressions Which operation is to be done first? solved by precedence An operator with higher precedence is done before one with lower precedence. An operator with higher precedence is placed in a rule (logically) further from the start symbol. solved by associativity If an operator is right-associative (or left-associative), an operand in between 2 operators is associated to the operator to the right (left). Right-associated : W + (X + (Y + Z)) Left-associated : ((W + X) + Y) + Z
Associativity and Precedence A grammar can be used to define associativity and precedence among the operators in an expression. E.g., + and - are left-associative operators in mathematics; * and / have higher precedence than + and - . Consider the grammar G1: Expr -> Expr + Term | Expr – Term | Term Term -> Term * Factor | Term / Factor | Term % Factor | Factor Factor -> Primary ** Factor | Primary Primary -> 0 | ... | 9 | ( Expr )
Precedence (cont’d) E E + E | E - E | F F F * F | F / F | X X ( E ) | id (id1 + id2) * id3 * id4 * F X + E id4 id3 ) ( id1 id2
Precedence E E + E E E - E E E * E E E / E E id E F F F * F F F / F F id + E * id3 id2 id1 E + * F id3 id2 id1
Associativity and Precedence for Grammar Precedence Associativity Operators 3 right ** 2 left * / % \ 1 left + - Note: These relationships are shown by the structure of the parse tree: highest precedence at the bottom, and left-associativity on the left at each level.
Associativity of Operators Operator associativity can also be indicated by a grammar <expr> -> <expr> + <expr> | const (ambiguous) <expr> -> <expr> + const | const (unambiguous) <expr> <expr> <expr> + const <expr> + const const
Associativity Left-associative operators E E + F | E - F | F / F X + E id3 ) ( * id1 id4 id2 Left-associative operators E E + F | E - F | F F F * X | F / X | X X ( E ) | id (id1 + id2) * id3 / id4 = (((id1 + id2) * id3) / id4)
Ambiguity in Grammars A grammar is ambiguous if and only if it generates a sentential form that has two or more distinct parse trees
Example of Dangling Else With which ‘if’ does the following ‘else’ associate? if (x < 0) if (y < 0) y = y - 1; else y = 0; Answer: either one!
An Ambiguous Expression Grammar <expr> <expr> <op> <expr> | const <op> / | - <expr> <expr> <expr> <op> <expr> <expr> <op> <op> <expr> <expr> <op> <expr> <expr> <op> <expr> const - const / const const - const / const
Example: Ambiguous Grammar E E + E E E - E E E * E E E / E E id + E * id3 id2 id1 * E id2 + id1 id3
Solving the dangling else ambiguity Algol 60, C, C++: associate each else with closest if; use {} or begin…end to override. Algol 68, Modula, Ada: use explicit delimiter to end every conditional (e.g., if…fi) Java: rewrite the grammar to limit what can appear in a conditional: IfThenStatement -> if ( Expression ) Statement IfThenElseStatement -> if ( Expression ) StatementNoShortIf else Statement The category StatementNoShortIf includes all except IfThenStatement.
Ambiguity in Expressions Which operation is to be done first? solved by precedence An operator with higher precedence is done before one with lower precedence. An operator with higher precedence is placed in a rule (logically) further from the start symbol. solved by associativity If an operator is right-associative (or left-associative), an operand in between 2 operators is associated to the operator to the right (left). Right-associated : W + (X + (Y + Z)) Left-associated : ((W + X) + Y) + Z
An Unambiguous Expression Grammar If we use the parse tree to indicate precedence levels of the operators, we cannot have ambiguity <expr> <expr> - <term> | <term> <term> <term> / const| const <expr> <expr> - <term> <term> <term> / const const const
Extended BNF Optional parts are placed in brackets [ ] <proc_call> -> ident [(<expr_list>)] Alternative parts of RHSs are placed inside parentheses and separated via vertical bars <term> → <term> (+|-) const Repetitions (0 or more) are placed inside braces { } <ident> → letter {letter|digit}
BNF and EBNF BNF <expr> <expr> + <term> EBNF <term> <term> * <factor> | <term> / <factor> | <factor> EBNF <expr> <term> {(+ | -) <term>} <term> <factor> {(* | /) <factor>}
Extended Backus-Naur Form (EBNF) Kleene’s Star/ Kleene’s Closure Seq ::= St {; St} Seq ::= {St ;} St Optional Part IfSt ::= if ( E ) St [else St] E ::= F [+ E ] | F [- E]
Finite State Automata Set of states Usually represented as graph nodes Input alphabet + unique “end of program” symbol State transition function Usually represented as directed graph edges (arcs) Automaton is deterministic if, for each state and each input symbol, there is at most one outgoing arc from the state labeled with the input symbol Unique start state One or more final (accepting) states
Syntax Diagrams Graphical representation of EBNF rules Examples nonterminals: terminals: sequences and choices: Examples X ::= (E) | id Seq ::= {St ;} St E ::= F [+ E ] IfSt id ( ) E id St ; F + E
DFA for C Identifiers
Syntax Diagram for Expressions with Addition
Quiz Convert EBNF to BNF S A { b A} A a [b] A
Quiz Prove that the grammar is ambiguous <S> <A> <A> <A> * <A> | <id> <id> x | y | z