Grammars, Languages and Parse Trees
Language
Let V be an alphabet (or vocabulary)
V* is the set of all strings over V
A language L is a subset of V*, i.e., L ⊆ V*
L may be finite or infinite
Programming language
– The set of all possible programs (each valid program is one, possibly very long, string)
– Programs with syntax errors are not in the set
– The number of programs is infinite

Language Representation
Finite language
– Enumerate all sentences
Infinite language
– Cannot be specified by enumeration
– Use a generative device, i.e., a grammar
– A grammar specifies the set of all legal sentences
– A grammar is defined recursively (or inductively)

Sample Grammar
Simple arithmetic expressions (E)
Basis rules:
– A Variable is an E
– An Integer is an E
Inductive rules:
– If E1 and E2 are Es, so is (E1 + E2)
– If E1 and E2 are Es, so is (E1 * E2)
Examples: x, y, 3, 12, (x + y), (z * (x + y)), ((z * (x + y)) + 12)

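The basis and inductive rules above can be read as a recipe for building Es layer by layer. The following is a minimal sketch, assuming expressions are plain strings; the function name `expressions` and the `rounds` parameter are illustrative and not part of the slides.

```python
def expressions(basis, rounds):
    """Collect every E obtainable from `basis` when the two inductive
    rules are applied in at most `rounds` nested layers."""
    known = set(basis)                       # basis rules: variables and integers
    for _ in range(rounds):
        grown = set(known)
        for e1 in known:
            for e2 in known:
                grown.add(f"({e1} + {e2})")  # inductive rule for +
                grown.add(f"({e1} * {e2})")  # inductive rule for *
        known = grown
    return known


es = expressions(["x", "y", "z"], rounds=2)
print("(x + y)" in es)        # True: one layer of the + rule
print("(z * (x + y))" in es)  # True: two layers
print("x + y" in es)          # False: no parentheses, so not an E
```

The set grows very quickly with `rounds`, which is one way to see why an infinite language cannot be given by enumeration.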
Production Rules
Use symbols (aka syntactic categories) and meta-symbols to define basis and inductive rules
For our example:
Basis rules:
E → V
E → I
Inductive rules:
E → (E + E)
E → (E * E)

Formal Definition of a Grammar
G = (V_N, V_T, S, Φ), where
– V_N, V_T are the sets of non-terminal and terminal symbols
– S ∈ V_N is the start symbol
– Φ is a finite set of relations from (V_T ∪ V_N)+ to (V_T ∪ V_N)*
An element (α, β) of Φ is written α → β and is called a production rule or a rewrite rule

Sample Grammar Revisited
1. E → V | I | (E + E) | (E * E)
2. V → L | VL | VD
3. I → D | ID
4. D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
5. L → x | y | z
V_N: E, V, I, D, L
V_T: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, x, y, z, (, ), +, *
S = E
Φ: rules 1-5

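To make the 4-tuple concrete, here is a minimal sketch that stores the sample grammar as data and checks that it is well formed. The names `Grammar` and `check_grammar` are illustrative, and the sketch assumes every symbol is a single character and that spaces inside rule 1 are dropped. The terminal set in the sketch includes the parentheses and the two operators, since rule 1 uses them.

```python
from typing import NamedTuple


class Grammar(NamedTuple):
    nonterminals: frozenset   # V_N
    terminals: frozenset      # V_T
    start: str                # S
    productions: tuple        # pairs (left side, right side), one symbol per character


def check_grammar(g):
    """Return a list of problems; an empty list means the 4-tuple is well formed."""
    problems = []
    if g.nonterminals & g.terminals:
        problems.append("V_N and V_T overlap")
    if g.start not in g.nonterminals:
        problems.append("start symbol is not a non-terminal")
    vocabulary = g.nonterminals | g.terminals
    for lhs, rhs in g.productions:
        if lhs == "":
            problems.append(f"empty left side in rule -> {rhs}")
        for symbol in lhs + rhs:
            if symbol not in vocabulary:
                problems.append(f"unknown symbol {symbol!r} in {lhs} -> {rhs}")
    return problems


RULES = (
    [("E", "V"), ("E", "I"), ("E", "(E+E)"), ("E", "(E*E)")]   # rule 1
    + [("V", "L"), ("V", "VL"), ("V", "VD")]                    # rule 2
    + [("I", "D"), ("I", "ID")]                                 # rule 3
    + [("D", d) for d in "0123456789"]                          # rule 4
    + [("L", c) for c in "xyz"]                                 # rule 5
)
EXPR = Grammar(frozenset("EVIDL"), frozenset("0123456789xyz()+*"), "E", tuple(RULES))
print(check_grammar(EXPR))
```

Leaving the parentheses and operators out of the terminal set makes `check_grammar` report them as unknown symbols.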
Another Simple Grammar
Symbols:
S: sentence
V: verb
O: object
A: article
N: noun
SP: subject phrase
VP: verb phrase
NP: noun phrase
Rules:
S → SP VP
SP → A N
A → a | the
N → monkey | banana | tree
VP → V O
V → ate | climbs
O → NP
NP → A N

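Because none of these rules is recursive, the language is finite and can be enumerated. The sketch below is one way to do that, assuming the rules are stored in a dictionary named `RULES` (an illustrative name) and that any symbol without a rule is a terminal word.

```python
from itertools import product

RULES = {
    "S": [["SP", "VP"]],
    "SP": [["A", "N"]],
    "VP": [["V", "O"]],
    "O": [["NP"]],
    "NP": [["A", "N"]],
    "A": [["a"], ["the"]],
    "N": [["monkey"], ["banana"], ["tree"]],
    "V": [["ate"], ["climbs"]],
}


def expand(symbol):
    """Return every list of words derivable from `symbol`.
    Only terminates for non-recursive grammars such as this one."""
    if symbol not in RULES:                  # no rule, so this is a terminal word
        return [[symbol]]
    results = []
    for alternative in RULES[symbol]:
        parts = [expand(s) for s in alternative]
        for combo in product(*parts):        # one choice per symbol of the alternative
            results.append([word for part in combo for word in part])
    return results


sentences = [" ".join(words) for words in expand("S")]
print(len(sentences))                        # 2 * 3 * 2 * 2 * 3 = 72
print("a banana climbs the tree" in sentences)
```

Every one of the 72 sentences is grammatical, whether or not it is meaningful.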
Context-Free Grammar
A context-free grammar is a grammar with the following restriction:
– The relation Φ is a finite set of relations from V_N to (V_T ∪ V_N)+
– The left-hand side of a production is a single non-terminal
– The right-hand side of any production cannot be empty
Context-free grammars generate context-free languages
With slight variations, essentially all programming languages are context-free languages
We will focus on context-free grammars

More Grammars
G1 = (V_N, V_T, S, Φ), where:
V_N = {S, B}
V_T = {a, b, c}
S = S
Φ = { S → aBSc, S → abc, Ba → aB, Bb → bb }
G2 = (V_N, V_T, S, Φ), where:
V_N = {I, L, D}
V_T = {a, b, …, z, 0, 1, …, 9}
S = I
Φ = { I → L | ID | IL, L → a | b | … | z, D → 0 | 1 | … | 9 }
G3 = (V_N, V_T, S, Φ), where:
V_N = {S, A, B}
V_T = {a, b}
S = S
Φ = { S → aA, A → aA | bB, B → bB | ε }
Which are context-free?

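The context-free restriction is mechanical enough to check in code. The sketch below assumes single-character symbols and represents each production as a pair of strings. The function names are illustrative. It also assumes that the second alternative for B in G3 is the empty string, which the extracted slide leaves blank.

```python
import string


def violations(nonterminals, productions):
    """Return the productions that break the context-free restriction."""
    bad = []
    for lhs, rhs in productions:
        single_nonterminal = lhs in nonterminals   # exactly one symbol, taken from V_N
        if not single_nonterminal or rhs == "":    # right side must not be empty
            bad.append((lhs, rhs))
    return bad


def is_context_free(nonterminals, productions):
    return not violations(nonterminals, productions)


G1 = ({"S", "B"}, [("S", "aBSc"), ("S", "abc"), ("Ba", "aB"), ("Bb", "bb")])
G2 = ({"I", "L", "D"},
      [("I", "L"), ("I", "ID"), ("I", "IL")]
      + [("L", c) for c in string.ascii_lowercase]
      + [("D", d) for d in string.digits])
# Assumption: B's second alternative is the empty string.
G3 = ({"S", "A", "B"}, [("S", "aA"), ("A", "aA"), ("A", "bB"), ("B", "bB"), ("B", "")])

for name, (vn, rules) in [("G1", G1), ("G2", G2), ("G3", G3)]:
    print(name, is_context_free(vn, rules), violations(vn, rules))
```

Under the definition used in these slides, only G2 passes: G1 has two-symbol left sides, and G3 (under the stated assumption) has an empty right side.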
Direct Derivative
Let G = (V_N, V_T, S, Φ) be a grammar
Let α, β ∈ (V_N ∪ V_T)*
β is said to be a direct derivative of α, written α ⇒ β, if there are strings φ1 and φ2 such that:
α = φ1 L φ2, β = φ1 λ φ2, L ∈ V_N and L → λ is a production of G
We go from α to β using a single rule

Examples of Direct Derivatives
G = (V_N, V_T, S, Φ), where:
V_N = {I, L, D}
V_T = {a, b, …, z, 0, 1, …, 9}
S = I
Φ = { I → L | ID | IL, L → a | b | … | z, D → 0 | 1 | … | 9 }

α      β      Rule used   φ1        φ2
I      L      I → L       (empty)   (empty)
Ib     Lb     I → L       (empty)   b
Lb     ab     L → a       (empty)   b
IDD    I0D    D → 0       I         D

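The rows of this example can be reproduced by trying every production at every position of α. The following is a minimal sketch under the assumption that symbols are single characters. The function name `direct_derivatives` and the tuple layout are illustrative.

```python
import string

PRODUCTIONS = (
    [("I", "L"), ("I", "ID"), ("I", "IL")]
    + [("L", c) for c in string.ascii_lowercase]
    + [("D", d) for d in string.digits]
)


def direct_derivatives(alpha, productions):
    """Return every (beta, rule, phi1, phi2) such that
    alpha = phi1 + lhs + phi2 and beta = phi1 + rhs + phi2."""
    results = []
    for lhs, rhs in productions:
        at = alpha.find(lhs)
        while at != -1:                       # try each occurrence of the left side
            phi1, phi2 = alpha[:at], alpha[at + len(lhs):]
            results.append((phi1 + rhs + phi2, f"{lhs} -> {rhs}", phi1, phi2))
            at = alpha.find(lhs, at + 1)
    return results


for row in direct_derivatives("IDD", PRODUCTIONS)[:3]:
    print(row)
```

A single α usually has many direct derivatives. Each one corresponds to one choice of rule and one choice of position.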
Derivation
Let G = (V_N, V_T, S, Φ) be a grammar
A string α produces ω (equivalently, ω reduces to α, or ω is a derivation of α), written α ⇒+ ω, if there are strings φ1, …, φn (n ≥ 1) such that:
α ⇒ φ1 ⇒ φ2 ⇒ … ⇒ φn-1 ⇒ φn ⇒ ω
We go from α to ω using several rules

Example of Derivation
1. E → V | I | (E + E) | (E * E)
2. V → L | VL | VD
3. I → D | ID
4. D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
5. L → x | y | z
Can we derive ( ( z * ( x + y ) ) + 12 ) ?
E ⇒ ( E + E )
⇒ ( ( E * E ) + E )
⇒ ( ( E * ( E + E ) ) + E )
⇒+ ( ( V * ( V + V ) ) + I )
⇒+ ( ( L * ( L + L ) ) + ID )
⇒+ ( ( z * ( x + y ) ) + DD )
⇒+ ( ( z * ( x + y ) ) + 12 )
How about:
– ( x + 2 )
– ( 21 * ( x4 + 7 ) )
– 3 * z
– 2y

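One way to settle questions like these is to search for a derivation mechanically. The sketch below is a simple breadth-first search over leftmost derivations, not an efficient parser. It assumes single-character symbols, strings written without spaces, and a context-free grammar with no empty right sides (so sentential forms never shrink). The name `derivable` is illustrative.

```python
from collections import deque

NONTERMINALS = set("EVIDL")
PRODUCTIONS = (
    [("E", "V"), ("E", "I"), ("E", "(E+E)"), ("E", "(E*E)")]
    + [("V", "L"), ("V", "VL"), ("V", "VD")]
    + [("I", "D"), ("I", "ID")]
    + [("D", d) for d in "0123456789"]
    + [("L", c) for c in "xyz"]
)


def first_nonterminal(form):
    """Index of the leftmost non-terminal, or len(form) if there is none."""
    return next((i for i, s in enumerate(form) if s in NONTERMINALS), len(form))


def derivable(start, target):
    """True if `target` can be reached from `start` by leftmost derivation steps."""
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        pos = first_nonterminal(form)
        if pos == len(form):
            continue                          # all terminals, but not the target
        for lhs, rhs in PRODUCTIONS:
            if lhs != form[pos]:
                continue
            new = form[:pos] + rhs + form[pos + 1:]
            cut = first_nonterminal(new)
            # Prune: forms never shrink, and the terminal prefix is final.
            if len(new) <= len(target) and new[:cut] == target[:cut] and new not in seen:
                seen.add(new)
                queue.append(new)
    return False


for text in ["((z*(x+y))+12)", "(x+2)", "(21*(x4+7))", "3*z", "2y"]:
    print(text, derivable("E", text))
```

With these rules the search accepts the three fully parenthesized strings and rejects `3*z` (no parentheses) and `2y` (a variable cannot start with a digit).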
Grammar-generated Language
If G is a grammar with start symbol S, a sentential form is any derivative of S
The language L generated by a grammar G is the set of all sentential forms whose symbols are all terminals:
L(G) = { σ | S ⇒+ σ and σ ∈ V_T* }

Example of Language
Let G = (V_N, V_T, S, Φ), where:
V_N = {I, L, D}
V_T = {a, b, …, z, 0, 1, …, 9}
S = I
Φ = { I → L | ID | IL, L → a | b | … | z, D → 0 | 1 | … | 9 }
L(G) = {abc12, x, m, a1b2c3, …}
For example, abc12 is in L(G) because:
I ⇒ ID ⇒ IDD ⇒ ILDD ⇒ ILLDD ⇒ LLLDD ⇒ aLLDD ⇒ abLDD ⇒ abcDD ⇒ abc1D ⇒ abc12

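Although L(G) is infinite, the sentences up to a given length can be listed by exploring sentential forms. The sketch below does this for a cut-down version of the grammar. The reduced alphabet (letters a, b and digits 0, 1) is an assumption made only to keep the output small, and the function name `language_up_to` is illustrative. It relies on no rule having an empty right side, so forms longer than the limit can be discarded.

```python
NONTERMINALS = set("ILD")
PRODUCTIONS = (
    [("I", "L"), ("I", "ID"), ("I", "IL")]
    + [("L", c) for c in "ab"]    # reduced alphabet, for illustration only
    + [("D", d) for d in "01"]
)


def language_up_to(start, max_len):
    """Return all terminal strings of length <= max_len derivable from `start`."""
    seen, pending, sentences = {start}, [start], set()
    while pending:
        form = pending.pop()
        if not any(s in NONTERMINALS for s in form):
            sentences.add(form)               # all terminals: a sentence of L(G)
            continue
        for lhs, rhs in PRODUCTIONS:
            at = form.find(lhs)
            while at != -1:
                new = form[:at] + rhs + form[at + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    pending.append(new)
                at = form.find(lhs, at + 1)
    return sentences


print(sorted(language_up_to("I", 2)))
```

With the reduced alphabet there are 10 sentences of length at most 2 and 42 of length at most 3; every one starts with a letter.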
Syntax Analysis: Parsing
The parse of a sentence is the construction of a derivation for that sentence
The parsing of a sentence results in
– acceptance or rejection
– and, if accepted, a parse tree
We are looking for an algorithm to parse a sentence (i.e., to parse a program) and produce a parse tree

Parse Trees
A parse tree is composed of
– interior nodes representing elements of V_N
– leaf nodes representing elements of V_T
For each interior node N, the transition from N to its children represents the application of one production rule

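A parse tree can be held in a small recursive data structure. The sketch below builds, by hand, the tree for the identifier a1 under the rules I → L | ID | IL, L → a | …, D → 0 | 1 | …, and then reads two things off it: the sentence at the leaves and the production applied at each interior node. The class and function names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str
    children: list = field(default_factory=list)   # empty for leaf (terminal) nodes


def frontier(node):
    """Concatenate the leaves from left to right: the parsed sentence."""
    if not node.children:
        return node.symbol
    return "".join(frontier(child) for child in node.children)


def rules_used(node):
    """List the production applied at each interior node, top-down, left to right."""
    if not node.children:
        return []
    used = [(node.symbol, "".join(child.symbol for child in node.children))]
    for child in node.children:
        used.extend(rules_used(child))
    return used


# Tree for "a1":  I -> I D,  I -> L,  L -> a,  D -> 1
tree = Node("I", [
    Node("I", [Node("L", [Node("a")])]),
    Node("D", [Node("1")]),
])
print(frontier(tree))
print(rules_used(tree))
```

Checking that every pair returned by `rules_used` is a production of the grammar is one way to validate a tree.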
Parse Tree Construction
Top-down
– Start with the root (start symbol)
– Proceed downward to the leaves using productions
Bottom-up
– Start from the leaves
– Proceed upward to the root
Although these seem like reasonable approaches to developing a parsing algorithm, we’ll see later that neither is ideal. We’ll find a better way!

Top down
1. A → V | I | (A + A) | (A * A)
2. V → L | VL | VD
3. I → D | ID
4. D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
5. L → x | y | z
Sentence: ( ( z * ( x + y ) ) + 12 )
Levels of the tree, from the root down to the leaves:
A
( A + A )
( ( A * A ) + A )
( ( A * ( A + A ) ) + I )
( ( V * ( V + V ) ) + I D )
( ( L * ( L + L ) ) + D D )
( ( z * ( x + y ) ) + 12 )

Bottom up
1. A → V | I | (A + A) | (A * A)
2. V → L | VL | VD
3. I → D | ID
4. D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
5. L → x | y | z
Sentence: ( ( z * ( x + y ) ) + 12 )
Levels of the tree, from the leaves up to the root:
( ( z * ( x + y ) ) + 12 )
( ( L * ( L + L ) ) + D D )
( ( V * ( V + V ) ) + I D )
( ( A * ( A + A ) ) + I )
( ( A * A ) + A )
( A + A )
A

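The bottom-up direction can be sketched as a search that repeatedly replaces an occurrence of some right side by its left side until only the start symbol remains. This is a naive illustration with backtracking, not the parsing algorithm the course goes on to develop. It assumes single-character symbols and strings without spaces, and the name `reduce_to_start` is illustrative.

```python
PRODUCTIONS = (
    [("A", "V"), ("A", "I"), ("A", "(A+A)"), ("A", "(A*A)")]
    + [("V", "L"), ("V", "VL"), ("V", "VD")]
    + [("I", "D"), ("I", "ID")]
    + [("D", d) for d in "0123456789"]
    + [("L", c) for c in "xyz"]
)


def reduce_to_start(sentence, start="A"):
    """Return the list of forms from `sentence` up to `start`, or None if the
    sentence cannot be reduced to the start symbol."""
    parent = {sentence: None}                 # form -> the form it was reduced from
    pending = [sentence]
    while pending:
        form = pending.pop()
        if form == start:
            path = []
            while form is not None:           # walk back to the original sentence
                path.append(form)
                form = parent[form]
            return path[::-1]
        for lhs, rhs in PRODUCTIONS:
            at = form.find(rhs)
            while at != -1:                   # reduce each occurrence of the right side
                reduced = form[:at] + lhs + form[at + len(rhs):]
                if reduced not in parent:
                    parent[reduced] = form
                    pending.append(reduced)
                at = form.find(rhs, at + 1)
    return None


print(reduce_to_start("(x+2)"))
print(reduce_to_start("3*z"))
```

The search often wanders into dead ends (for instance, reducing both digits of 12 to I leaves II, which matches no rule) and has to try other reductions. That cost is one reason a naive bottom-up strategy is not ideal.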
Lexical Analyzer and Parser
Lexical analyzers
– Input: symbols of length 1
– Output: classified tokens
Parsers
– Input: classified tokens
– Output: parse tree (i.e., a syntactically correct program)
A syntactically correct program will run. Will it do what you want?
– a monkey ate a banana
– a banana climbs the tree

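The split between the two stages can be illustrated with a small lexical analyzer for the arithmetic-expression grammar. This is a sketch under assumptions: the token class names (VAR, INT, OP, LPAREN, RPAREN) are invented for illustration, variables are built from x, y, z and digits as in rules 2 and 5, and whitespace is skipped.

```python
import re

# One named group per token class; exactly one group matches each token.
TOKEN_RE = re.compile(
    r"(?P<VAR>[xyz][xyz0-9]*)|(?P<INT>[0-9]+)|(?P<OP>[+*])|(?P<LPAREN>\()|(?P<RPAREN>\))"
)


def tokenize(text):
    """Turn a string of characters into a list of (class, lexeme) tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():               # whitespace separates tokens
            pos += 1
            continue
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("(x4 + 12)"))
print(tokenize("2y"))
```

The lexer only classifies: `2y` comes out as an INT followed by a VAR, and it is the parser's job to reject that token sequence.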
Backus-Naur Form (BNF)
A traditional meta-language to represent grammars for programming languages
– Every non-terminal is enclosed in < and >
– Instead of the symbol →, we use ::=
Example
I → L | ID | IL
L → a | b | … | z
D → 0 | 1 | … | 9
becomes
<I> ::= <L> | <I><D> | <I><L>
<L> ::= a | b | … | z
<D> ::= 0 | 1 | … | 9
Why?

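The translation into BNF is mechanical, as the following sketch suggests. It assumes the rules are stored in a dictionary from a non-terminal to its list of alternatives, with one character per symbol. The function name `to_bnf` and the shortened alphabets are illustrative.

```python
def to_bnf(rules, nonterminals):
    """Render rules in BNF: non-terminals in angle brackets, '::=' for the arrow."""
    lines = []
    for lhs, alternatives in rules.items():
        rendered = [
            "".join(f"<{s}>" if s in nonterminals else s for s in alternative)
            for alternative in alternatives
        ]
        lines.append(f"<{lhs}> ::= " + " | ".join(rendered))
    return "\n".join(lines)


RULES = {
    "I": ["L", "ID", "IL"],
    "L": ["a", "b"],          # shortened alphabet, for illustration only
    "D": ["0", "1"],
}
print(to_bnf(RULES, {"I", "L", "D"}))
```

One practical effect of the brackets is that a non-terminal name can be a whole word, such as a hypothetical `<identifier>`, without being mistaken for a run of terminal characters.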