Grammar Variation in Compiler Design Carl Wu
Three topics
– Syntax grammar vs. AST
– Component(?)-based grammar
– Aspect-oriented grammar
Grammar vs. AST (I) How to automatically generate a tree from a grammar?
Grammar vs. AST (I) Stmt ::= Block | “if” Exp “then” Stmt | IdUse “:=” Exp
Grammar vs. AST (I)
Stmt ::= Block | “if” Exp “then” Stmt | IdUse “:=” Exp
JastAdd Specification (Tree)
abstract Stmt;
BlockStmt : Stmt ::= Block;
IfStmt : Stmt ::= Exp Stmt;
AssignStmt : Stmt ::= IdUse Exp;
Grammar vs. AST (I) Restricted CFG Definition
A ::= B C D √ => aggregation
A ::= B | C | D √ => inheritance
A ::= B C | D ×
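The restriction above can be checked mechanically. The following is a minimal sketch, not taken from the slides: `RuleKind`, `RcfgCheck`, and `classify` are hypothetical names, and a rule is assumed to be given as a list of alternatives, each a list of symbols.

```java
import java.util.List;

// Hypothetical result type: how a rule maps onto the generated tree classes.
enum RuleKind { AGGREGATION, INHERITANCE, REJECTED }

final class RcfgCheck {
    // Classifies one rule, given as its list of alternatives.
    static RuleKind classify(List<List<String>> alternatives) {
        if (alternatives.isEmpty()) {
            throw new IllegalArgumentException("a rule needs at least one alternative");
        }
        if (alternatives.size() == 1) {
            // A ::= B C D : a single sequence, so the symbols become children of A.
            // (A ::= B is treated as a one-child sequence in this sketch.)
            return RuleKind.AGGREGATION;
        }
        // A ::= B | C | D : allowed only if every alternative is a single symbol,
        // so that each alternative can become a subclass of A.
        boolean allSingle = alternatives.stream().allMatch(alt -> alt.size() == 1);
        return allSingle ? RuleKind.INHERITANCE : RuleKind.REJECTED;
    }
}
```

A rule such as A ::= B C | D mixes a sequence with an alternation, so this sketch rejects it; it would have to be split into two rules first.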
Grammar vs. AST (I) RCFG Specification
Stmt :: Block | IfStmt | AssignStmt
IfStmt :: “if” Exp “then” Stmt
AssignStmt :: IdUse “:=” Exp
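One plausible shape for the classes a generator could derive from this specification is sketched below. The field names are illustrative assumptions, and `Exp`, `IdUse`, and the body of `Block` are empty placeholders because the slide gives no rules for them.

```java
// Alternation rule: Stmt :: Block | IfStmt | AssignStmt  => inheritance
abstract class Stmt {}

// Placeholders: the slide gives no rules for these symbols.
class Block extends Stmt {}
class Exp {}
class IdUse {}

// Sequence rule: IfStmt :: "if" Exp "then" Stmt  => aggregation
class IfStmt extends Stmt {
    final Exp condition;    // hypothetical field name
    final Stmt thenBranch;  // hypothetical field name

    IfStmt(Exp condition, Stmt thenBranch) {
        this.condition = condition;
        this.thenBranch = thenBranch;
    }
}

// Sequence rule: AssignStmt :: IdUse ":=" Exp  => aggregation
class AssignStmt extends Stmt {
    final IdUse target;     // hypothetical field name
    final Exp value;        // hypothetical field name

    AssignStmt(IdUse target, Exp value) {
        this.target = target;
        this.value = value;
    }
}
```

The keyword tokens are dropped here, as in an abstract tree; a concrete parse tree would keep them as additional fields.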
Grammar vs. AST (II) Parse tree vs. IR tree
Grammar vs. AST (II) In an IDE, there are multiple visitors for the same source code (>12!). They place different requirements on the tree structure:
– Syntax vs. semantics
– Immutable vs. transformable (optimization)
– Parse tree vs. IR tree
Grammar vs. AST (II) Generate two tree structures from the same grammar!
– One immutable, strongly typed, concrete parse tree: read only!
– One transformable, untyped, abstract IR tree: read and write!
Grammar vs. AST (II) IfStmt :: “if” Exp “then” Stmt
class ASTNode {
    // Untyped, mutable child list: the IR tree view
    protected ASTNode[] children;
}

class IfStmt extends ASTNode {
    // Typed, final fields: the read-only parse tree view
    protected final Token token_if;
    protected final Exp exp;
    protected final Token token_then;
    protected final Stmt stmt;

    IfStmt(Token token_if, Exp exp, Token token_then, Stmt stmt) {
        // parse tree construction
        this.token_if = token_if;
        this.exp = exp;
        this.token_then = token_then;
        this.stmt = stmt;
        // IR tree construction (assumes Exp and Stmt extend ASTNode)
        children = new ASTNode[] { exp, stmt };
    }
}
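To illustrate how the two views might be used side by side, here is a self-contained sketch under stated assumptions: `Token`, `SkipStmt`, the accessors `getExp`/`getStmt`, and `getChild`/`setChild` are hypothetical additions, not part of the slide. It shows an IR rewrite that leaves the typed parse view untouched.

```java
// Hypothetical token type; the slide does not define it.
class Token {
    final String text;
    Token(String text) { this.text = text; }
}

class ASTNode {
    // IR view: untyped and mutable.
    protected final ASTNode[] children;
    ASTNode(int childCount) { children = new ASTNode[childCount]; }
    ASTNode getChild(int i) { return children[i]; }
    void setChild(int i, ASTNode node) { children[i] = node; }
}

class Exp extends ASTNode {
    final String text;
    Exp(String text) { super(0); this.text = text; }
}

class Stmt extends ASTNode {
    Stmt(int childCount) { super(childCount); }
}

// Hypothetical leaf statement, used only to build an example tree.
class SkipStmt extends Stmt {
    SkipStmt() { super(0); }
}

class IfStmt extends Stmt {
    // Parse view: typed and final, so it can only be read.
    private final Token tokenIf, tokenThen;
    private final Exp exp;
    private final Stmt stmt;

    IfStmt(Token tokenIf, Exp exp, Token tokenThen, Stmt stmt) {
        super(2);
        this.tokenIf = tokenIf;
        this.exp = exp;
        this.tokenThen = tokenThen;
        this.stmt = stmt;
        // The IR view starts out mirroring the parse view.
        children[0] = exp;
        children[1] = stmt;
    }

    Exp getExp() { return exp; }
    Stmt getStmt() { return stmt; }
}
```

After `node.setChild(0, new Exp("true"))`, an optimizer walking `getChild` sees the rewritten condition, while a syntax-oriented visitor calling `getExp()` still sees the expression that was parsed.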
Component(?)-based grammar
Component vs. module What is the difference between a component and a module? What is a modularized grammar? What is an ideal component-based grammar?
Component vs. module [Diagram comparing a modularized grammar (labels: Grammar Module, Grammar, Parser) with a component-based grammar (labels: Grammar Component, Parser)]
Benefits
Benefits of a modularized grammar
– Easy to read, write, and change
– Eliminates naming conflicts
Additional benefits of a component-based grammar
– Each component can be designed, developed, and tested individually.
– A change to one component does not require recompiling the other components.
– Different grammar types/parsing algorithms can be used for different components, e.g., one component can be LL and another LALR.
Difficulty in designing component-based grammar
No clear guards between two components.
– Switch control to a new parser or stay in the same one?
– Suitable for embedded languages, e.g., JScript in HTML
– Not suitable for an integral language, e.g., Java
Too much coupling between two components.
– Reuse is not limited to the component as a whole; internal productions and symbols may be reused as well.
– Not applicable to LR parsers: once the table is built, the internal productions cannot be reused (there is no way to jump into a table).
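For the embedded-language case, where guards are clear, switching control between component parsers can be sketched as follows. This is only an illustration under assumptions: `ComponentParser` and `GuardedHostParser` are hypothetical names, the guards are hard-coded as `<script>` and `</script>`, and each component "parser" is reduced to a function from a text fragment to a result string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical component interface: each component parses its own fragment.
interface ComponentParser {
    String parse(String fragment);
}

final class GuardedHostParser {
    // Guards that mark where control switches to the embedded component.
    private static final String OPEN = "<script>";
    private static final String CLOSE = "</script>";

    private final ComponentParser host;
    private final ComponentParser embedded;

    GuardedHostParser(ComponentParser host, ComponentParser embedded) {
        this.host = host;
        this.embedded = embedded;
    }

    List<String> parse(String input) {
        List<String> results = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int open = input.indexOf(OPEN, pos);
            if (open < 0) {
                // No further guard: the rest belongs to the host language.
                results.add(host.parse(input.substring(pos)));
                break;
            }
            if (open > pos) {
                results.add(host.parse(input.substring(pos, open)));
            }
            int bodyStart = open + OPEN.length();
            int close = input.indexOf(CLOSE, bodyStart);
            if (close < 0) {
                throw new IllegalArgumentException("unterminated " + OPEN);
            }
            // Between the guards, control switches to the embedded component.
            results.add(embedded.parse(input.substring(bodyStart, close)));
            pos = close + CLOSE.length();
        }
        return results;
    }
}
```

The sketch works only because the guards are unambiguous tokens. In an integral language such as Java there is no comparable marker telling the host parser when to hand over control, which is the difficulty the slide describes.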
Ideal vs. reality
Suggestions?
Aspect-oriented grammar
Join point: grammar patterns that crosscut multiple productions, e.g., punctuation, identifiers, modifiers…
Example
“;” appears 25 times in one of the Java grammars.
“.” appears 74 times in one of the Cobol grammars.
Every one of them should be carefully placed!
::= '.' | '.' | '.'
pointcut PreDot(): ;
after PreDot(): '.'
Another example
pointcut Content(): … …
before Content(): “(”;
after Content(): “)”;
Guarantee they match!
Grammar weaving [Diagram: Base Grammar + Grammar Aspect → Result grammar → Parser]
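A weaving step of this kind can be sketched in a few lines. This is a minimal sketch and not the slides' tool: `GrammarAspect` and `Weaver` are hypothetical names, the grammar is assumed to map each nonterminal to a single right-hand side, and the pointcut is assumed to select productions by nonterminal name only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical aspect: a pointcut plus symbols to add before and after.
final class GrammarAspect {
    final Predicate<String> pointcut;  // selects productions by nonterminal name
    final List<String> before;
    final List<String> after;

    GrammarAspect(Predicate<String> pointcut, List<String> before, List<String> after) {
        this.pointcut = pointcut;
        this.before = before;
        this.after = after;
    }
}

final class Weaver {
    // Produces the result grammar; the base grammar is left unchanged.
    static Map<String, List<String>> weave(Map<String, List<String>> base,
                                           GrammarAspect aspect) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> rule : base.entrySet()) {
            boolean matched = aspect.pointcut.test(rule.getKey());
            List<String> rhs = new ArrayList<>();
            if (matched) rhs.addAll(aspect.before);   // before advice
            rhs.addAll(rule.getValue());
            if (matched) rhs.addAll(aspect.after);    // after advice
            result.put(rule.getKey(), rhs);
        }
        return result;
    }
}
```

Because the opening and closing symbols come from one aspect and are always added together, a woven production cannot end up with an unmatched parenthesis, which is the guarantee the earlier example asks for.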
What do you think?