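Syntax of Programming Languages (Grammar)
Author: 陳鍾誠
Affiliation: Department of Information Management, 金門技術學院 (Kinmen Institute of Technology)
URL:
Date: 2016/6/4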
Grammar
Language
Recursive Definition
Mathematical Expression
Structure of Expressions
Formal Language
Backus Naur Form (BNF), by J. Backus and P. Naur
EBNF (Extended BNF)
BNF and EBNF
Formalism (Formal notation): N. Chomsky, the father of modern structural linguistics
Differing structural trees for the same expression
Problem of Different structural trees
No Ambiguous Sentence
Context Free Language
Syntactic equations of the form defined in EBNF generate context-free languages. The term "context-free" is due to Chomsky and stems from the fact that substitution of the symbol left of = by a sequence derived from the expression to the right of = is always permitted, regardless of the context in which the symbol is embedded within the sentence. It has turned out that this restriction to context freedom (in the sense of Chomsky) is quite acceptable for programming languages, and that it is even desirable. Context dependence in another sense, however, is indispensable. We will return to this topic in a later chapter.
Regular Expression
A language is regular if its syntax can be expressed by a single EBNF expression. The requirement that a single equation suffices also implies that only terminal symbols occur in the expression. Such an expression is called a regular expression.
Syntax Analysis vs. Regular Expression
The reason for our interest in regular languages lies in the fact that programs for the recognition of regular sentences are particularly simple and efficient. By "recognition" we mean the determination of the structure of the sentence, and thereby naturally the determination of whether the sentence is well formed, that is, whether it belongs to the language. Sentence recognition is called syntax analysis.
Regular Expression vs. State Machine
For the recognition of regular sentences a finite automaton, also called a state machine, is necessary and sufficient. In each step the state machine reads the next symbol and changes state. The resulting state is solely determined by the previous state and the symbol read. If the resulting state is unique, the state machine is deterministic, otherwise nondeterministic. If the state machine is formulated as a program, the state is represented by the current point of program execution.
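
The state-machine description above can be sketched as a small table-driven recognizer. This is only an illustration: the identifier syntax (letter { letter | digit }), the state names and the helper `char_class` are assumptions, not taken from the slides.

```python
# Deterministic state machine for identifiers: letter { letter | digit }.
# State names and the identifier syntax are illustrative assumptions.

def char_class(ch: str) -> str:
    """Map a character to the symbol class the machine reads."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# (previous state, symbol read) -> resulting state; one entry per pair,
# so the resulting state is unique and the machine is deterministic.
TRANSITIONS = {
    ("start", "letter"): "in_ident",
    ("in_ident", "letter"): "in_ident",
    ("in_ident", "digit"): "in_ident",
}
ACCEPTING = {"in_ident"}

def accepts(text: str) -> bool:
    """Return True if the whole text is a well-formed identifier."""
    state = "start"
    for ch in text:
        key = (state, char_class(ch))
        if key not in TRANSITIONS:
            return False  # no transition: the sentence is not in the language
        state = TRANSITIONS[key]
    return state in ACCEPTING
```

Here the state is held in a variable; a hand-written scanner would more typically encode it in the current point of program execution, as the slide notes.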
EBNF → Program
The analyzing program can be derived directly from the defining syntax in EBNF. For each EBNF construct K there exists a translation rule which yields a program fragment Pr(K). The translation rules from EBNF to program text are shown below. Therein sym denotes a global variable always representing the symbol last read from the source text by a call to procedure next. Procedure error terminates program execution, signaling that the symbol sequence read so far does not belong to the language.
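
As a rough sketch of how such a translation can look, the recognizer below is written by hand for one hypothetical rule, `list = "(" [ "x" { "," "x" } ] ")"`. The rule, the class name `Analyzer` and the helper `expect` are assumptions for illustration; `sym`, `next` and `error` follow the names used above. It assumes the usual correspondences: a terminal becomes a test on `sym`, `[ ... ]` becomes an `if`, and `{ ... }` becomes a `while`.

```python
class ParseError(Exception):
    """Raised by error(): the symbols read so far are not in the language."""

class Analyzer:
    def __init__(self, text: str):
        self.text = text
        self.pos = 0
        self.sym = ""      # symbol last read; "" marks end of input
        self.next()

    def next(self) -> None:
        """Read the next symbol (here: one character) into sym."""
        if self.pos < len(self.text):
            self.sym = self.text[self.pos]
            self.pos += 1
        else:
            self.sym = ""

    def error(self) -> None:
        raise ParseError(f"unexpected {self.sym!r}")

    def expect(self, x: str) -> None:
        # Terminal "x": if sym == "x" then next else error
        if self.sym == x:
            self.next()
        else:
            self.error()

    def list_(self) -> None:
        # list = "(" [ "x" { "," "x" } ] ")"
        self.expect("(")
        if self.sym == "x":            # [ ... ] becomes an if on the first symbol
            self.expect("x")
            while self.sym == ",":     # { ... } becomes a while on the first symbol
                self.expect(",")
                self.expect("x")
        self.expect(")")

def recognize(text: str) -> bool:
    """True if text is a complete sentence of the hypothetical list rule."""
    analyzer = Analyzer(text)
    try:
        analyzer.list_()
    except ParseError:
        return False
    return analyzer.sym == ""          # all input must be consumed
```

For example, `recognize("(x,x)")` succeeds, while `recognize("(x,)")` fails at the point where an `x` is expected but `)` is read.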
Analyzing program
EBNF with only 1 rule
First()
Precondition
Lexical Analysis for Identifier
Lexical Analysis for Integer
Scanner
The process of syntax analysis is based on a procedure to obtain the next symbol. This procedure in turn is based on the definition of symbols in terms of sequences of one or more characters. This latter procedure is called a scanner, and syntax analysis on this second, lower level is called lexical analysis.
Lexical Analysis vs. Syntax Analysis
A Scanner Example
As an example we show a scanner for a parser of EBNF. Its terminal symbols and their definition in terms of characters are
Procedure GetSym() (1)
Procedure GetSym() (2)
Procedure GetSym() (3)
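
A scanner of this kind might be sketched as follows. The symbol set is an assumption (identifiers, quoted literals, and the single-character symbols = ( ) [ ] { } | .), as are the symbol names and the function `scan`; the original GetSym procedure is not reproduced here.

```python
from typing import Iterator, Tuple

# Single-character symbols of EBNF (assumed set) and hypothetical symbol names.
SINGLE = {
    "=": "eql", "(": "lparen", ")": "rparen", "[": "lbrak", "]": "rbrak",
    "{": "lbrace", "}": "rbrace", "|": "bar", ".": "period",
}

def scan(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (symbol kind, source text) pairs for an EBNF source string."""
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():                      # skip blanks between symbols
            i += 1
        elif ch.isalpha():                    # identifier: letter { letter | digit }
            j = i
            while j < n and text[j].isalnum():
                j += 1
            yield ("ident", text[i:j])
            i = j
        elif ch == '"':                       # literal: characters up to closing quote
            j = text.find('"', i + 1)
            if j < 0:
                raise ValueError("unterminated literal")
            yield ("literal", text[i + 1:j])
            i = j + 1
        elif ch in SINGLE:
            yield (SINGLE[ch], ch)
            i += 1
        else:
            raise ValueError(f"illegal character {ch!r}")
```

A parser would call this once per symbol; `list(scan('expr = term { "+" term } .'))` yields the eight symbols of that rule in order.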
Syntax Analysis Overview
- Goal: determine whether the input token stream satisfies the syntax of the program
- What do we need to do this?
  - An expressive way to describe the syntax
  - A mechanism that determines whether the input token stream satisfies the syntax description
- For lexical analysis
  - Regular expressions describe tokens
  - Finite automata = mechanisms to generate tokens from the input stream
Just Use Regular Expressions?
- REs can expressively describe tokens
  - Easy to implement via DFAs
- So just use them to describe the syntax of a programming language?
  - NO! They don't have enough power to express any non-trivial syntax
- Example: nested constructs (blocks, expressions, statements)
  - Detect balanced braces: {{} {} {{} { }}}, { {{{{ }}}}}, ...
  - We need unbounded counting!
  - FSAs cannot count except in a strictly modulo fashion
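
The need for unbounded counting can be made concrete with a minimal sketch. The function name `balanced` is hypothetical; the point is that `depth` may grow without limit, which is exactly what a machine with a fixed number of states cannot track.

```python
def balanced(text: str) -> bool:
    """Check that braces in text are balanced; other characters are ignored."""
    depth = 0                  # unbounded counter: not available to a finite automaton
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:      # a closing brace with no matching opener
                return False
    return depth == 0
```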
Context-Free Grammars
- Consist of 4 components:
  - Terminal symbols = token or ε
  - Non-terminal symbols = syntactic variables
  - Start symbol S = special non-terminal
  - Productions of the form LHS → RHS
    - LHS = single non-terminal
    - RHS = string of terminals and non-terminals
    - Specify how non-terminals may be expanded
- The language generated by a grammar is the set of strings of terminals derived from the start symbol by repeatedly applying the productions
  - L(G) = language generated by grammar G
- Example grammar:
  S → a S a
  S → T
  T → b T b
  T → ε
CFG - Example
- Grammar for the balanced-parentheses language
  S → ( S ) S
  S → ε
- 1 non-terminal: S
- 2 terminals: "(", ")"
- Start symbol: S
- 2 productions
- If the grammar accepts a string, there is a derivation of that string using the productions
  - "(())": S ⇒ ( S ) S ⇒* ( S ) ⇒ ( ( S ) S ) ⇒* ( ( ) ) = (())
- Why is the final S required?
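
A small recognizer for this grammar, written as a sketch with hypothetical function names `parse_S` and `in_language`, follows the two productions directly:

```python
def parse_S(s: str, i: int = 0) -> int:
    """Match S at position i and return the position after it.

    S -> ( S ) S   is tried when the next character is "(",
    S -> ε         otherwise (nothing is consumed).
    """
    if i < len(s) and s[i] == "(":
        i = parse_S(s, i + 1)              # inner S
        if i >= len(s) or s[i] != ")":
            raise ValueError("expected ')'")
        return parse_S(s, i + 1)           # the final S of ( S ) S
    return i                               # S -> ε

def in_language(s: str) -> bool:
    """True if s can be derived from S using the two productions."""
    try:
        return parse_S(s) == len(s)
    except ValueError:
        return False
```

Note that a string such as `()()` is accepted only through the trailing S in the first production.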
More on CFGs
- Shorthand notation: vertical bar for multiple productions
  S → a S a | T
  T → b T b | ε
- CFGs are powerful enough to express the syntax of most programming languages
- Derivation = successive application of productions starting from S
- Acceptance? Determine whether there is a derivation for an input token stream
A Parser
- Inputs: a context-free grammar G and a token stream s (from the lexer)
- Outputs: yes if s is in L(G), no otherwise, plus error messages
- Syntax analyzers (parsers) = CFG acceptors which also output the corresponding derivation when the token stream is accepted
  - Various kinds: LL(k), LR(k), SLR, LALR
RE is a Subset of CFG
- Can inductively build a grammar for each RE:
  RE         Grammar
  ε          S → ε
  a          S → a
  R1 R2      S → S1 S2
  R1 | R2    S → S1 | S2
  R1*        S → S1 S | ε
- where G1 = grammar for R1, with start symbol S1, and G2 = grammar for R2, with start symbol S2
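
The inductive construction can be sketched in code. The tuple encoding of regular expressions (`("sym", a)`, `("cat", r1, r2)`, `("alt", r1, r2)`, `("star", r1)`) and the function name `re_to_grammar` are assumptions made for this example; an empty right-hand side stands for ε.

```python
from itertools import count

def re_to_grammar(regex):
    """Build (start symbol, productions) for a regular expression.

    Each production is (lhs, rhs) with rhs a list of symbols; [] means ε.
    """
    productions = []
    fresh = count(0)                       # supplies S0, S1, S2, ...

    def build(r):
        s = f"S{next(fresh)}"              # start symbol for this sub-expression
        kind = r[0]
        if kind == "sym":                  # a       : S -> a
            productions.append((s, [r[1]]))
        elif kind == "cat":                # R1 R2   : S -> S1 S2
            s1, s2 = build(r[1]), build(r[2])
            productions.append((s, [s1, s2]))
        elif kind == "alt":                # R1 | R2 : S -> S1 | S2
            s1, s2 = build(r[1]), build(r[2])
            productions.append((s, [s1]))
            productions.append((s, [s2]))
        elif kind == "star":               # R1*     : S -> S1 S | ε
            s1 = build(r[1])
            productions.append((s, [s1, s]))
            productions.append((s, []))
        else:
            raise ValueError(f"unknown regex node {kind!r}")
        return s

    start = build(regex)
    return start, productions
```

For `a*`, encoded as `("star", ("sym", "a"))`, this gives start symbol `S0` with productions S1 → a, S0 → S1 S0 and S0 → ε.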
Grammar for Sum Expression
- Grammar
  S → E + S | E
  E → number | ( S )
- Expanded
  S → E + S
  S → E
  E → number
  E → ( S )
- 4 productions
- 2 non-terminals (S, E)
- 4 terminals: "(", ")", "+", number
- Start symbol: S
Constructing a Derivation
- Start from S (the start symbol)
- Use productions to derive a sequence of tokens
- For arbitrary strings α, β, γ and a production A → β, a single step of the derivation is
  α A γ ⇒ α β γ (substitute β for A)
- Example: with the production S → E + S,
  (S + E) + E ⇒ (E + S + E) + E
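
A single derivation step is just a substitution in a list of symbols. The sketch below assumes a sentential form is stored as a Python list and a production as a `(lhs, rhs)` pair; the name `derive_step` is hypothetical.

```python
def derive_step(form, index, production):
    """Apply production A -> β to the non-terminal at form[index].

    form is α A γ as a list of symbols; the result is α β γ.
    """
    lhs, rhs = production
    if form[index] != lhs:
        raise ValueError(f"symbol at {index} is {form[index]!r}, not {lhs!r}")
    return form[:index] + list(rhs) + form[index + 1:]

# (S + E) + E  =>  (E + S + E) + E   using S -> E + S
before = ["(", "S", "+", "E", ")", "+", "E"]
after = derive_step(before, 1, ("S", ["E", "+", "S"]))
```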
Class Problem
  S → E + S | E
  E → number | ( S )
- Derive: (1 + 2 + (3 + 4)) + 5
Parse Tree
[Figure: parse tree for (1 + 2 + (3 + 4)) + 5, with root S]
- Parse tree = tree representation of the derivation
- Leaves of the tree are terminals
- Internal nodes are non-terminals
- No information about the order of the derivation steps
Parse Tree vs Abstract Syntax Tree
[Figure: parse tree for (1 + 2 + (3 + 4)) + 5]
- The parse tree is also called "concrete syntax"
- The AST discards (abstracts) unneeded information; it is a more compact format
Derivation Order
- Can choose to apply productions in any order: select any non-terminal and substitute the RHS of one of its productions
- Two standard orders: leftmost and rightmost
- Leftmost derivation
  - In the string, find the leftmost non-terminal and apply a production to it
  - E + S ⇒ 1 + S
- Rightmost derivation
  - Same, but find the rightmost non-terminal
  - E + S ⇒ E + E + S
Leftmost/Rightmost Derivation Examples
» S → E + S | E
» E → number | ( S )
» Leftmost derive: (1 + 2 + (3 + 4)) + 5
  S ⇒ E+S ⇒ (S)+S ⇒ (E+S)+S ⇒ (1+S)+S ⇒ (1+E+S)+S ⇒ (1+2+S)+S ⇒ (1+2+E)+S ⇒ (1+2+(S))+S ⇒ (1+2+(E+S))+S ⇒ (1+2+(3+S))+S ⇒ (1+2+(3+E))+S ⇒ (1+2+(3+4))+S ⇒ (1+2+(3+4))+E ⇒ (1+2+(3+4))+5
» Now, rightmost derive the same input string
  S ⇒ E+S ⇒ E+E ⇒ E+5 ⇒ (S)+5 ⇒ (E+S)+5 ⇒ (E+E+S)+5 ⇒ (E+E+E)+5 ⇒ (E+E+(S))+5 ⇒ (E+E+(E+S))+5 ⇒ (E+E+(E+E))+5 ⇒ (E+E+(E+4))+5 ⇒ (E+E+(3+4))+5 ⇒ (E+2+(3+4))+5 ⇒ (1+2+(3+4))+5
» Result: same parse tree; the same productions are chosen, but in a different order
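
The leftmost derivation can also be produced mechanically. The sketch below first builds a parse tree for the sum grammar by recursive descent and then repeatedly expands the leftmost non-terminal. The function names and the tuple encoding of tree nodes are assumptions for illustration, and the tokenizer silently skips characters it does not know.

```python
import re

def tokenize(text):
    """Split into numbers and the symbols ( ) +; other characters are skipped."""
    return re.findall(r"\d+|[()+]", text)

def parse_S(toks, i):
    # S -> E + S | E
    e, i = parse_E(toks, i)
    if i < len(toks) and toks[i] == "+":
        s, i = parse_S(toks, i + 1)
        return ("S", [e, "+", s]), i
    return ("S", [e]), i

def parse_E(toks, i):
    # E -> number | ( S )
    if i < len(toks) and toks[i] == "(":
        s, i = parse_S(toks, i + 1)
        if i >= len(toks) or toks[i] != ")":
            raise ValueError("expected ')'")
        return ("E", ["(", s, ")"]), i + 1
    if i < len(toks) and toks[i].isdigit():
        return ("E", [toks[i]]), i + 1
    raise ValueError("expected a number or '('")

def render(form):
    """Show tree nodes by their non-terminal name and terminals as themselves."""
    return "".join(x[0] if isinstance(x, tuple) else x for x in form)

def leftmost_derivation(text):
    """Return the sentential forms of the leftmost derivation of text."""
    toks = tokenize(text)
    tree, i = parse_S(toks, 0)
    if i != len(toks):
        raise ValueError("trailing input")
    form = [tree]
    steps = [render(form)]
    while True:
        # index of the leftmost non-terminal (tree node) still in the form
        k = next((j for j, x in enumerate(form) if isinstance(x, tuple)), None)
        if k is None:
            return steps
        form[k:k + 1] = form[k][1]         # substitute the RHS for it
        steps.append(render(form))
```

Calling `leftmost_derivation("(1+2+(3+4))+5")` returns the 15 sentential forms of the leftmost derivation listed above, from `S` to `(1+2+(3+4))+5`. Picking the last tree node instead of the first would give the rightmost derivation.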
Class Problem
  S → E + S | E
  E → number | ( S ) | -S
- Do the rightmost derivation of: 1 + (2 + -(3 + 4)) + 5
Ambiguous Grammars
- In the sum expression grammar, leftmost and rightmost derivations produced identical parse trees
- The + operator associates to the right in the parse tree regardless of derivation order, as in (1+2+(3+4))
An Ambiguous Grammar
- + associates to the right because of the right-recursive production: S → E + S
- Consider another grammar
  S → S + S | S * S | number
- Ambiguous grammar = different derivations produce different parse trees
  - More specifically, G is ambiguous if there are 2 distinct leftmost (rightmost) derivations for some sentence
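Ambiguous Grammar - Example
  S → S + S | S * S | number
- Consider the expression: 1 + 2 * 3
- Derivation 1: S ⇒ S+S ⇒ 1+S ⇒ 1+S*S ⇒ 1+2*S ⇒ 1+2*3
- Derivation 2: S ⇒ S*S ⇒ S+S*S ⇒ 1+S*S ⇒ 1+2*S ⇒ 1+2*3
[Figure: the two parse trees, one with + at the root and one with * at the root]
- Obviously not equal!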
Impact of Ambiguity
- Different parse trees correspond to different evaluations!
- Thus, program meaning is not defined!!
  - + at the root: 1 + (2 * 3) = 7
  - * at the root: (1 + 2) * 3 = 9
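
To see the two evaluations side by side, the two parse trees for 1 + 2 * 3 can be written as nested tuples and evaluated bottom-up. The tuple encoding `(operator, left, right)` and the names `evaluate`, `tree_plus_root` and `tree_times_root` are assumptions for this sketch.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def evaluate(tree):
    """Evaluate a tree whose leaves are ints and whose nodes are (op, left, right)."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left), evaluate(right))

tree_plus_root = ("+", 1, ("*", 2, 3))    # from S => S+S => ... : 1 + (2 * 3)
tree_times_root = ("*", ("+", 1, 2), 3)   # from S => S*S => ... : (1 + 2) * 3
```

Both trees have the same leaves in the same order, yet `evaluate` returns 7 for the first and 9 for the second.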
Can We Get Rid of Ambiguity?
- Ambiguity is a function of the grammar, not the language!
- A context-free language L is inherently ambiguous if all grammars for L are ambiguous
- Every deterministic CFL has an unambiguous grammar
  - So, no deterministic CFL is inherently ambiguous
  - No inherently ambiguous programming languages have been invented
- To construct a useful parser, we must devise an unambiguous grammar
Eliminating Ambiguity
- Often can eliminate ambiguity by adding non-terminals and allowing recursion only on the right or the left
  S → S + T | T
  T → T * num | num
  - The T non-terminal enforces precedence
  - Left recursion gives left associativity
[Figure: parse tree for 1 + 2 * 3 under this grammar, with S → S + T at the root and T → T * 3 below it]
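
A hand-written parser for this grammar might look like the sketch below. Because a recursive-descent procedure cannot call itself on a left-recursive rule without looping forever, the left recursion is realized here as a loop that folds operands to the left; this is a common substitution, not something stated on the slide. All function names are hypothetical.

```python
import re

def tokenize(text):
    """Split into numbers and the operators + and *; other characters are skipped."""
    return re.findall(r"\d+|[+*]", text)

def parse_sum(toks, i=0):
    # S -> S + T | T, realized as T { + T } folded to the left
    tree, i = parse_term(toks, i)
    while i < len(toks) and toks[i] == "+":
        right, i = parse_term(toks, i + 1)
        tree = ("+", tree, right)
    return tree, i

def parse_term(toks, i):
    # T -> T * num | num, realized as num { * num } folded to the left
    tree, i = parse_num(toks, i)
    while i < len(toks) and toks[i] == "*":
        right, i = parse_num(toks, i + 1)
        tree = ("*", tree, right)
    return tree, i

def parse_num(toks, i):
    if i < len(toks) and toks[i].isdigit():
        return int(toks[i]), i + 1
    raise ValueError("expected a number")

def parse(text):
    """Return the unique tree for text as nested (op, left, right) tuples."""
    toks = tokenize(text)
    tree, i = parse_sum(toks)
    if i != len(toks):
        raise ValueError("trailing input")
    return tree
```

With this grammar `parse("1+2*3")` can only produce the tree with + at the root, and `parse("1+2+3")` groups to the left.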
A Closer Look at Eliminating Ambiguity
- Precedence is enforced by
  - Introducing a distinct non-terminal for each precedence level
  - Specifying the operators of a given precedence level as the RHS of the production for that level
  - Reaching higher-precedence operators by referencing the next-higher-precedence non-terminal
Associativity
- An operator is either left, right or non-associative
  - Left: a + b + c = (a + b) + c
  - Right: a ^ b ^ c = a ^ (b ^ c)
  - Non: a < b < c is illegal (thus undefined)
- Position of the recursion relative to the operator dictates the associativity
  - Left (right) recursion → left (right) associativity
  - Non: don't be recursive; simply reference the next-higher-precedence non-terminal on both sides of the operator
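
The effect of the recursion's position can be sketched with two tiny parsers over a token list. Both are illustrative assumptions: `parse_right` follows a right-recursive rule A → x op A | x directly, while `parse_left` realizes the left-recursive rule A → A op x | x as a loop.

```python
def parse_right(toks, op):
    """A -> x op A | x : the recursion is to the right of the operator."""
    if not toks:
        raise ValueError("expected an operand")
    if len(toks) == 1:
        return toks[0]
    if toks[1] != op:
        raise ValueError(f"expected {op!r}")
    return (op, toks[0], parse_right(toks[2:], op))

def parse_left(toks, op):
    """A -> A op x | x : the recursion is to the left, written as a loop."""
    if not toks:
        raise ValueError("expected an operand")
    tree = toks[0]
    i = 1
    while i < len(toks):
        if toks[i] != op or i + 1 >= len(toks):
            raise ValueError(f"expected {op!r} followed by an operand")
        tree = (op, tree, toks[i + 1])     # earlier operands end up deeper on the left
        i += 2
    return tree
```

`parse_left(["a", "+", "b", "+", "c"], "+")` groups as (a + b) + c, while `parse_right(["a", "^", "b", "^", "c"], "^")` groups as a ^ (b ^ c).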
Class Problem (Tough)
  S → S + S | S - S | S * S | S / S | ( S ) | -S | S ^ S | number
- Enforce the standard arithmetic precedence rules and remove all ambiguity from the above grammar
- Precedence (high to low)
  - (), unary -
  - ^
  - *, /
  - +, -
- Associativity
  - ^ = right
  - rest are left