Chapter 3 Describing Syntax and Semantics

The syntax of a programming language is the part of the language definition that says how programs look: their form and structure, such as expressions, statements, and so on.

The semantics of a programming language is the part of the language definition that says what programs do: the behavior and meaning of those expressions, statements, and so on.

For example, the syntax of a C if statement is

    if (<expr>) <statement>

The semantics of this statement form is: if the current value of the expression <expr> is true, the embedded statement <statement> is selected for execution.
2. A Grammar Example for English

2.1 <A> for an article (a or the):
    <A> → a | the
2.2 <N> for a noun (dog, cat, or rat):
    <N> → dog | cat | rat
2.3 <NP> for a noun phrase (an article followed by a noun):
    <NP> → <A> <N>
2.4 <V> for a verb (loves, hates, or eats):
    <V> → loves | hates | eats
2.5 <S> for a sentence (a noun phrase, followed by a verb, followed by another noun phrase):
    <S> → <NP> <V> <NP>
2.6 Grammar defining a small subset of unpunctuated English

    <S> → <NP> <V> <NP>
    <NP> → <A> <N>
    <V> → loves | hates | eats
    <N> → dog | cat | rat
    <A> → a | the

How does such a grammar define a language? Think of the grammar as a set of rules that say how to build a tree.
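One way to see how the grammar defines a language is to enumerate every sentence it generates. The sketch below is only an illustration: the dictionary encoding and the `expand` helper are assumptions of this example, not part of the grammar notation. It relies on this particular grammar not being recursive, so the enumeration terminates.

```python
from itertools import product

# Each non-terminal maps to a list of alternatives; each alternative is a
# sequence of symbols. Symbols that are keys of GRAMMAR are non-terminals.
GRAMMAR = {
    "<S>": [["<NP>", "<V>", "<NP>"]],
    "<NP>": [["<A>", "<N>"]],
    "<V>": [["loves"], ["hates"], ["eats"]],
    "<N>": [["dog"], ["cat"], ["rat"]],
    "<A>": [["a"], ["the"]],
}

def expand(symbol):
    """Return every list of words that `symbol` can generate."""
    if symbol not in GRAMMAR:          # a terminal generates only itself
        return [[symbol]]
    results = []
    for alternative in GRAMMAR[symbol]:
        # Combine every expansion of each symbol in the alternative.
        for parts in product(*(expand(s) for s in alternative)):
            results.append([word for part in parts for word in part])
    return results

sentences = [" ".join(words) for words in expand("<S>")]
print(len(sentences))   # 6 noun phrases * 3 verbs * 6 noun phrases = 108
```

Every string in `sentences` is in the language; a string such as "dog loves cat" is not, because a noun phrase must start with an article.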
[Figure: parse trees rooted at <S> for the sentences "the dog loves the cat" and "a cat eats the rat"]
3 The Basic Concepts of Describing Syntax

3.1 Sentences (statements): the strings of a language. The syntax rules of the language specify which strings of characters are in the language.
3.2 Lexemes: the smallest syntactic units (identifiers, literals, operators, and special words).
3.3 Token: a category of the language's lexemes.
3.4 For example, the statement

    index = 2 * count + 17;

consists of the following lexemes and tokens:

    Lexemes    Tokens
    index      identifier
    =          equal_sign
    2          int_literal
    *          mult_op
    count      identifier
    +          plus_op
    17         int_literal
    ;          semicolon
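The lexeme/token table above can be reproduced with a small scanner. This is a minimal sketch using Python's `re` module; the regular expressions chosen for each token category (for instance, what counts as an identifier) are assumptions made for the example.

```python
import re

# Token categories from the table above, tried in this order.
TOKEN_SPEC = [
    ("int_literal", r"\d+"),
    ("identifier",  r"[A-Za-z_]\w*"),
    ("equal_sign",  r"="),
    ("mult_op",     r"\*"),
    ("plus_op",     r"\+"),
    ("semicolon",   r";"),
    ("skip",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return (lexeme, token) pairs; raise ValueError on an unknown character."""
    pairs = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "skip":      # whitespace separates lexemes
            pairs.append((match.group(), match.lastgroup))
        pos = match.end()
    return pairs

for lexeme, token in tokenize("index = 2 * count + 17;"):
    print(f"{lexeme:<8}{token}")
```

Several lexemes (index, count) map to the same token (identifier), which is why the grammar can be written over tokens rather than individual lexemes.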
4 Formal Methods of Describing Syntax

4.1 A Grammar Example for a Programming Language

    <exp> → <exp> + <exp> | <exp> * <exp> | ( <exp> ) | a | b | c

This is a recursive grammar: <exp> appears on the right-hand side of its own rules. Expressions it generates include:

    a
    a + b
    a + b * c
    ( ( a + b ) * c )

4.2 Parse trees: Hierarchical syntactic structures

[Figure: parse tree for ( ( a + b ) * c ), built from the recursive grammar for <exp>]
4.3 Backus-Naur Form and Context-Free Grammars

In the middle to late 1950s, Noam Chomsky and John Backus, working independently, developed what became the standard method (grammar) for describing syntax.

    <S> → <NP> <V> <NP>          (<S> is the start symbol)
    <NP> → <A> <N>               (each line is a production)
    <V> → loves | hates | eats
    <N> → dog | cat | rat
    <A> → a | the

The names in angle brackets are non-terminal symbols; the tokens (lexemes) such as loves, dog, and the are terminal symbols.
The grammar has four important parts:

1. The non-terminal symbols are strings enclosed in angle brackets, such as <NP>. The non-terminal symbols of a grammar often correspond to different kinds of language constructs.
2. The start symbol is the non-terminal symbol that the grammar designates as the root of the parse tree.
3. A production (rule) consists of a left-hand side (LHS), which is a single non-terminal symbol; an arrow (→); and a right-hand side (RHS), which is a sequence of one or more things, each of which can be either a token or a non-terminal symbol. The special symbol | separates alternative right-hand sides.
4. The tokens and lexemes are the terminal symbols (the smallest units of syntax).
Example: the grammar for a simple language of expressions with three variables.

    <exp> → <exp> + <exp> | <exp> * <exp> | ( <exp> ) | a | b | c

This grammar can be written in a different way, without the | notation:

    <exp> → <exp> + <exp>
    <exp> → <exp> * <exp>
    <exp> → ( <exp> )
    <exp> → a
    <exp> → b
    <exp> → c

The special non-terminal symbol <empty> generates the empty string, a string of no tokens. An if-then statement with an optional else part might be defined like this:

    <if-stmt> → if <expr> then <stmt> <else-part>
    <else-part> → else <stmt> | <empty>
Context-free grammars: The grammars described here are context-free grammars. The children of a node in the parse tree depend only on that node's non-terminal symbol; they do not depend on the context of neighboring nodes in the tree.

Metalanguage: a language that is used to describe another language. BNF is a metalanguage.
4.4 An Example of Writing a Grammar

The declarations to describe:

    float a;
    boolean b, c, d;        (bool in some languages)
    int e = 1, f, g = 1 + 2;

Step 1: Divide the problem into smaller pieces. The major components are:
    the type name
    the list of variables
    the final semicolon

Non-terminal symbols: <var-dec>, <type-name>, <declarator-list>

    <var-dec> → <type-name> <declarator-list> ;
Step 2: List the primitive types.

    <type-name> → boolean | byte | short | int | long | char | float | double
    (boolean is spelled bool in some languages)

Step 3: Define the declarator list. A declarator list is either a single declarator, or a declarator followed by a comma, followed by a (smaller) declarator list.

    <declarator-list> → <declarator> | <declarator> , <declarator-list>

Step 4: Define a declarator. It is a variable name followed, optionally, by an equal sign and an expression.

    <declarator> → <variable-name> | <variable-name> = <expr>
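Steps 1 to 4 translate almost rule for rule into a recursive-descent recognizer, with one function per non-terminal. The following is a sketch under stated assumptions: the source does not define <expr> or <variable-name>, so this example assumes an expression is a sequence of names or integer literals joined by +, and a variable name is any identifier that is not a type name. All function names are hypothetical.

```python
import re

TYPE_NAMES = {"boolean", "byte", "short", "int", "long", "char", "float", "double"}

def is_var_dec(text):
    """Recognize <var-dec> -> <type-name> <declarator-list> ;"""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[=,;+]", text)
    if "".join(tokens) != re.sub(r"\s+", "", text):
        return False                    # the text contains an unknown character
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, found {tok!r}")
        pos += 1
        return tok

    def variable_name():
        tok = take()
        if not re.fullmatch(r"[A-Za-z_]\w*", tok) or tok in TYPE_NAMES:
            raise SyntaxError(f"bad variable name {tok!r}")

    def operand():
        tok = take()
        if not re.fullmatch(r"\w+", tok):
            raise SyntaxError(f"bad operand {tok!r}")

    def expr():
        # Assumed stand-in for <expr>: operands separated by '+'.
        operand()
        while peek() == "+":
            take("+")
            operand()

    def declarator():                   # <variable-name> [ = <expr> ]
        variable_name()
        if peek() == "=":
            take("=")
            expr()

    def declarator_list():
        declarator()
        if peek() == ",":
            take(",")
            declarator_list()           # the (smaller) list after the comma

    try:
        if take() not in TYPE_NAMES:
            return False
        declarator_list()
        take(";")
        return pos == len(tokens)
    except SyntaxError:
        return False

for line in ["float a;", "boolean b, c, d;", "int e=1, f, g = 1+2;", "int a, ;"]:
    print(line, is_var_dec(line))
```

The recursion in `declarator_list` mirrors the recursive production in Step 3: each call handles one declarator and hands the rest of the list to a smaller instance of the same rule.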
4.5 Example: A Grammar for a Small Language

    <program> → begin <stmt_list> end
    <stmt_list> → <stmt> | <stmt> ; <stmt_list>
    <stmt> → <var> = <expression>
    <var> → A | B | C
    <expression> → <var> + <var> | <var> - <var> | <var>

Remarks:
    There is only one kind of statement: assignment.
    A program consists of begin, followed by a list of assignment statements separated by semicolons, followed by end.
    An expression is a single variable, the sum of two variables, or the difference of two variables.
    There are only three variable names: A, B, and C.
Leftmost derivation:

    <program> => begin <stmt_list> end
              => begin <stmt> ; <stmt_list> end
              => begin <var> = <expression> ; <stmt_list> end
              => begin A = <expression> ; <stmt_list> end
              => begin A = <var> + <var> ; <stmt_list> end
              => begin A = B + <var> ; <stmt_list> end
              => begin A = B + C ; <stmt_list> end
              => begin A = B + C ; <stmt> end
              => begin A = B + C ; <var> = <expression> end
              => begin A = B + C ; B = <expression> end
              => begin A = B + C ; B = <var> end
              => begin A = B + C ; B = C end

In this derivation, the replaced non-terminal is always the leftmost non-terminal in the sentential form. A derivation may also be rightmost (always replacing the rightmost non-terminal), or it may follow an order that is neither leftmost nor rightmost.
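The mechanics of a leftmost derivation can be replayed in code: at each step, find the leftmost non-terminal and replace it with one alternative of its rule. In this sketch the dictionary encoding, the `leftmost_derivation` helper, and the `CHOICES` list (the index of the alternative used at each step) are assumptions of the example.

```python
GRAMMAR = {
    "<program>": [["begin", "<stmt_list>", "end"]],
    "<stmt_list>": [["<stmt>"], ["<stmt>", ";", "<stmt_list>"]],
    "<stmt>": [["<var>", "=", "<expression>"]],
    "<var>": [["A"], ["B"], ["C"]],
    "<expression>": [["<var>", "+", "<var>"], ["<var>", "-", "<var>"], ["<var>"]],
}

def leftmost_derivation(start, choices):
    """Yield each sentential form, always rewriting the leftmost non-terminal.

    `choices` gives, for each step, the index of the alternative to apply.
    """
    form = [start]
    yield form
    for choice in choices:
        # Index of the leftmost symbol that is a non-terminal.
        i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
        form = form[:i] + GRAMMAR[form[i]][choice] + form[i + 1:]
        yield form

# Alternative indices that derive: begin A = B + C ; B = C end
CHOICES = [0, 1, 0, 0, 0, 1, 2, 0, 0, 1, 2, 2]
steps = [" ".join(form) for form in leftmost_derivation("<program>", CHOICES)]
print("\n=> ".join(steps))
```

Changing the line that computes `i` to pick the last non-terminal instead of the first would turn this into a rightmost derivation; the set of productions used stays the same, only their order changes.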
A Grammar for Simple Assignment Statements

    <assign> → <id> = <expr>
    <id> → A | B | C
    <expr> → <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>

The statement A = B * ( A + C ) is generated by the leftmost derivation:

    <assign> => <id> = <expr>
             => A = <expr>
             => A = <id> * <expr>
             => A = B * <expr>
             => A = B * ( <expr> )
             => A = B * ( <id> + <expr> )
             => A = B * ( A + <expr> )
             => A = B * ( A + <id> )
             => A = B * ( A + C )
A parse tree for the simple statement A = B * (A + C)
Ambiguity: A grammar that generates a sentence for which there are two or more distinct parse trees is said to be ambiguous.

An ambiguous grammar for simple assignment statements:

    <assign> → <id> = <expr>
    <id> → A | B | C
    <expr> → <expr> + <expr> | <expr> * <expr> | ( <expr> ) | <id>

Under this grammar, the statement A = B + C * A has two distinct parse trees. The ambiguity is resolved by rewriting the grammar so that it encodes operator precedence and associativity.
Two distinct parse trees for the same sentence, A = B + C * A
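The ambiguity can be checked mechanically by enumerating every parse tree the ambiguous <expr> rules allow for a token sequence. The sketch below tries every operator as the root of the tree; the nested-tuple representation and the function name `parse_trees` are assumptions of this example.

```python
from functools import lru_cache

IDS = {"A", "B", "C"}

def parse_trees(tokens):
    """Return every parse tree of `tokens` as an <expr> of the ambiguous grammar."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def trees(lo, hi):
        # All trees deriving tokens[lo:hi] from <expr>.
        found = []
        if hi - lo == 1 and tokens[lo] in IDS:
            found.append(tokens[lo])                              # <expr> -> <id>
        if hi - lo >= 3 and tokens[lo] == "(" and tokens[hi - 1] == ")":
            found.extend(("()", t) for t in trees(lo + 1, hi - 1))  # ( <expr> )
        for k in range(lo + 1, hi - 1):
            if tokens[k] in ("+", "*"):                           # <expr> op <expr>
                for left in trees(lo, k):
                    for right in trees(k + 1, hi):
                        found.append((left, tokens[k], right))
        return tuple(found)

    return list(trees(0, len(tokens)))

for tree in parse_trees("B + C * A".split()):
    print(tree)
```

For B + C * A the function finds two trees, one with + at the root and one with * at the root, which correspond to the two parse trees in the figure.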
5.1 Operator Precedence

An operator that is generated lower in the parse tree (and therefore must be evaluated first) has precedence over an operator produced higher up in the tree.

In the left parse tree, the multiplication operator has precedence over the addition operator.
In the right parse tree, the addition operator has precedence over the multiplication operator.

A grammar can be written so that different operators (here, addition and multiplication) are generated at different levels of the parse tree, which fixes their precedence from higher to lower.
An unambiguous grammar for simple assignment statements:

    <assign> → <id> = <expr>
    <id> → A | B | C
    <expr> → <expr> + <term> | <term>
    <term> → <term> * <factor> | <factor>
    <factor> → ( <expr> ) | <id>

Derive the statement A = B + C * A by using this grammar.
The leftmost derivation:

    <assign> => <id> = <expr>
             => A = <expr>
             => A = <expr> + <term>
             => A = <term> + <term>
             => A = <factor> + <term>
             => A = <id> + <term>
             => A = B + <term>
             => A = B + <term> * <factor>
             => A = B + <factor> * <factor>
             => A = B + <id> * <factor>
             => A = B + C * <factor>
             => A = B + C * <id>
             => A = B + C * A
The rightmost derivation:

    <assign> => <id> = <expr>
             => <id> = <expr> + <term>
             => <id> = <expr> + <term> * <factor>
             => <id> = <expr> + <term> * <id>
             => <id> = <expr> + <term> * A
             => <id> = <expr> + <factor> * A
             => <id> = <expr> + <id> * A
             => <id> = <expr> + C * A
             => <id> = <term> + C * A
             => <id> = <factor> + C * A
             => <id> = <id> + C * A
             => <id> = B + C * A
             => A = B + C * A
A unique parse tree for A = B + C * A using an unambiguous grammar
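The layered <expr>/<term>/<factor> rules map directly onto a recursive-descent parser with one function per non-terminal. This is a minimal sketch, not a definitive implementation: the left-recursive rules are implemented with loops (a standard substitution, since a recursive-descent function cannot call itself first), tokens are assumed to be separated by spaces, and the nested-tuple tree is an assumption of this example.

```python
def parse_assign(text):
    """Parse '<id> = <expr>' and return a nested-tuple parse tree."""
    tokens = text.split()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, found {tok!r}")
        pos += 1
        return tok

    def ident():                  # <id> -> A | B | C
        tok = take()
        if tok not in ("A", "B", "C"):
            raise SyntaxError(f"expected an identifier, found {tok!r}")
        return tok

    def factor():                 # <factor> -> ( <expr> ) | <id>
        if peek() == "(":
            take("(")
            node = expr()
            take(")")
            return node
        return ident()

    def term():                   # <term> -> <term> * <factor> | <factor>
        node = factor()
        while peek() == "*":
            take("*")
            node = (node, "*", factor())
        return node

    def expr():                   # <expr> -> <expr> + <term> | <term>
        node = term()
        while peek() == "+":
            take("+")
            node = (node, "+", term())
        return node

    target = ident()
    take("=")
    tree = (target, "=", expr())
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

print(parse_assign("A = B + C * A"))
# ('A', '=', ('B', '+', ('C', '*', 'A')))
```

Because `expr` can only reach `*` by going through `term`, every multiplication ends up lower in the tree than any addition outside parentheses, which is exactly how the grammar gives * precedence over +.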
5.2 Associativity of Operators

When an expression contains two or more adjacent occurrences of operators with equal precedence, the grammar should place those occurrences in the proper hierarchical order in the parse tree.

Example: Derive the statement A = B + C + A by using the unambiguous grammar (leftmost derivation) and generate its parse tree. The parse tree shows the left addition lower than the right addition, so addition is left associative: the statement is grouped as A = (B + C) + A.
The leftmost derivation:

    <assign> => <id> = <expr>
             => A = <expr>
             => A = <expr> + <term>
             => A = <expr> + <term> + <term>
             => A = <term> + <term> + <term>
             => A = <factor> + <term> + <term>
             => A = <id> + <term> + <term>
             => A = B + <term> + <term>
             => A = B + <factor> + <term>
             => A = B + <id> + <term>
             => A = B + C + <term>
             => A = B + C + <factor>
             => A = B + C + <id>
             => A = B + C + A
A parse tree for A = B + C + A, grouped as (B + C) + A, illustrating the left associativity of addition (from the leftmost derivation)
The rightmost derivation:

    <assign> => <id> = <expr>
             => <id> = <expr> + <term>
             => <id> = <expr> + <factor>
             => <id> = <expr> + <id>
             => <id> = <expr> + A
             => <id> = <expr> + <term> + A
             => <id> = <expr> + <factor> + A
             => <id> = <expr> + <id> + A
             => <id> = <expr> + C + A
             => <id> = <term> + C + A
             => <id> = <factor> + C + A
             => <id> = <id> + C + A
             => <id> = B + C + A
             => A = B + C + A
A parse tree for A = B + C + A, grouped as (B + C) + A, illustrating the left associativity of addition (from the rightmost derivation, which yields the same tree)
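The grouping chosen by the grammar can matter even for addition. The sketch below builds the left-associative tree that the derivations above produce and, for contrast, the right-associative grouping of the same operands. The helper names are hypothetical, and the floating-point operands are an illustration chosen for this example (computer floating-point addition is not associative), not something taken from the grammar itself.

```python
from functools import reduce

def left_assoc(operands, op="+"):
    """Group operands the way <expr> -> <expr> + <term> does."""
    return reduce(lambda tree, nxt: (tree, op, nxt), operands)

def right_assoc(operands, op="+"):
    """Group operands the way a right-recursive rule would."""
    return reduce(lambda tree, prev: (prev, op, tree), reversed(operands))

def evaluate(tree):
    """Add up a nested-tuple tree, evaluating lower nodes first."""
    if not isinstance(tree, tuple):
        return tree
    left, _, right = tree
    return evaluate(left) + evaluate(right)

print(left_assoc(["B", "C", "A"]))     # (('B', '+', 'C'), '+', 'A')
print(right_assoc(["B", "C", "A"]))    # ('B', '+', ('C', '+', 'A'))

values = [1e20, -1e20, 1.0]
print(evaluate(left_assoc(values)))    # (1e20 + -1e20) + 1.0 gives 1.0
print(evaluate(right_assoc(values)))   # 1e20 + (-1e20 + 1.0) gives 0.0
```

The two groupings contain the same operands and operators; only the shape of the tree differs, and that shape is what the choice between left and right recursion in the grammar decides.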
Left recursive: a BNF rule is left recursive when its LHS also appears at the beginning of its RHS. Left recursion specifies left associativity.

An unambiguous grammar for simple assignment statements:

    <assign> → <id> = <expr>
    <id> → A | B | C
    <expr> → <expr> + <term> | <term>          (left associative)
    <term> → <term> * <factor> | <factor>      (left associative)
    <factor> → ( <expr> ) | <id>
Right recursive: a BNF rule is right recursive when its LHS also appears at the end of its RHS. Right recursion specifies right associativity.

Rules for exponentiation as a right-associative operator:

    <factor> → <exp> ** <factor> | <exp>       (right associative)
    <exp> → ( <expr> ) | <id>
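A right-recursive rule translates into a function that recurses on everything to the right of the operator. The sketch below handles only the <factor> rule and assumes, for brevity, that each <exp> is a single identifier token; `parse_power` is a hypothetical name.

```python
def parse_power(tokens):
    """<factor> -> <exp> ** <factor> | <exp>, with <exp> limited to one token."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    base = tokens[0]
    if len(tokens) == 1:
        return base
    if tokens[1] != "**":
        raise SyntaxError(f"expected '**', found {tokens[1]!r}")
    # The recursive call consumes everything to the right of '**',
    # so later operators end up lower in the tree.
    return (base, "**", parse_power(tokens[2:]))

print(parse_power("A ** B ** C".split()))   # ('A', '**', ('B', '**', 'C'))
print(2 ** 3 ** 2)                          # 512: Python's ** groups as 2 ** (3 ** 2)
```

The rightmost ** is lowest in the tree and is therefore evaluated first, which is what right associativity means.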
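5.3 An Unambiguous Grammar for if-then-else

The following straightforward rules for if-then-else are ambiguous:

    <if_stmt> → if <logic_expr> then <stmt>
              | if <logic_expr> then <stmt> else <stmt>

Together with a rule <stmt> → <if_stmt>, they allow the sentential form

    if <logic_expr> then if <logic_expr> then <stmt> else <stmt>

to be derived in two ways, because the else can be attached to either if.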
Two distinct parse trees for the same sentential form
The unambiguous grammar for if-then-else:

    <stmt> → <matched> | <unmatched>
    <matched> → if <logic_expr> then <matched> else <matched>
              | any non-if statement
    <unmatched> → if <logic_expr> then <stmt>
                | if <logic_expr> then <matched> else <unmatched>

With these rules, the sentential form

    if <logic_expr> then if <logic_expr> then <stmt> else <stmt>

has just one parse tree: the statement between a then and its else must be matched, so the else belongs to the nearest preceding if.
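Parsers usually obtain the same "else goes with the nearest if" structure with a greedy rule rather than by coding the <matched>/<unmatched> productions literally. The sketch below uses that greedy technique: after parsing a then-part, the innermost if takes a following else if there is one. It assumes a simplified token stream in which each condition and each non-if statement is a single word; `parse_stmt` and the tuple layout are hypothetical.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] in ("then", "else"):
        raise SyntaxError(f"unexpected {tokens[pos]!r} at position {pos}")
    if tokens[pos] != "if":
        return tokens[pos], pos + 1            # any non-if statement
    if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
        raise SyntaxError(f"expected 'then' at position {pos + 2}")
    cond = tokens[pos + 1]
    then_part, pos = parse_stmt(tokens, pos + 3)
    # An else is claimed by the innermost if that is still without one.
    if pos < len(tokens) and tokens[pos] == "else":
        else_part, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_part, "else", else_part), pos
    return ("if", cond, then_part), pos

tree, end = parse_stmt("if e1 then if e2 then s1 else s2".split())
print(tree)
# ('if', 'e1', ('if', 'e2', 's1', 'else', 's2'))
```

The outer if has no else part in the resulting tree, so it is an unmatched statement whose then-part is a matched if-then-else, the same shape the unambiguous grammar produces.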