ANTLR v3 Overview (for ANTLR v2 users)


ANTLR v3 Overview (for ANTLR v2 users) Terence Parr University of San Francisco

Topics
- Information flow
- v3 grammars
- Error recovery
- Attributes
- Tree construction
- Tree grammars
- Code generation
- Internationalization
- Runtime support

Block Info Flow Diagram

Grammar Syntax

    header {…}
    /** doc comment */
    kind grammar name;
    options {…}
    tokens {…}
    scopes…
    action
    rules…

    /** doc comment */
    rule[String s, int z] returns [int x, int y] throws E
    options {…}
    scopes
    init {…}
        :  |  ;
        exceptions

Trees: ^(root child1 … childN)
Note: No inheritance

Grammar improvements
- Single element EBNF like ID*
- Combined parser/lexer
- Allows 'c' and "literal" literals
- Multiple parameters, return values
- Labels do not have to be unique: (x=ID|x=INT) {…$x…}
- For combined grammars, warns when tokens are not defined

Example Grammar

    grammar SimpleParser;

    program : variable* method+ ;

    variable: "int" ID ('=' expr)? ';' ;

    method : "method" ID '(' ')' '{' variable* statement+ '}' ;

    statement
        : ID '=' expr ';'
        | "return" expr ';'
        ;

    expr : ID | INT ;

    ID : ('a'..'z'|'A'..'Z')+ ;
    INT : '0'..'9'+ ;
    WS : (' '|'\t'|'\n')+ {channel=99;} ;

Using the parser

    CharStream in = new ANTLRFileStream("inputfile");
    SimpleParserLexer lexer = new SimpleParserLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    SimpleParser p = new SimpleParser(tokens);
    p.program(); // invoke start rule

Improved grammar warnings
- they happen less often ;)
- internationalized (templates again!)
- gives (smallest) sample input sequence
- better recursion warnings

Recursion Warnings

    a : a A | B ;

    t.g:2:5: Alternative 1 discovers infinite left-recursion to a from a

    // with -Im 0 (secret internal parameter)
    a : b | B ;
    b : c ;
    c : B b ;

    t.g:2:5: Alternative 1: after matching input such as B decision cannot predict what comes next due to recursion overflow to c from b

Nondeterminisms

    a : (A B|A B) C ;

    t.g:2:5: Decision can match input such as "A B" using multiple alternatives: 1, 2
    As a result, alternative(s) 2 were disabled for that input
    t.g:2:5: The following alternatives are unreachable: 2

    a : (A+ B|A+ B) C ;

    t.g:2:5: Decision can match input such as "A B" using multiple alternatives: 1, 2

Runtime Objects of Interest
- Lexer passes all tokens to the parser, but the parser listens to only a single "channel"; channel 99, for example, where I place WS tokens, is ignored
- Tokens have start/stop index into single text input buffer
- Token is an abstract class
- TokenSource: anything answering nextToken()
- TokenStream: stream pulling from TokenSource; LT(i), …
- CharStream: source of characters for a lexer; LT(i), …
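The channel idea can be sketched in plain Java without the ANTLR runtime. Everything below is an illustrative assumption: `SketchToken`, `ChannelFilterSketch`, `lex` and `onChannel` are hypothetical names, and the toy lexer only splits on whitespace. The point is that whitespace tokens are kept (on channel 99, with start/stop indices into the one input buffer) rather than thrown away, and the parser side simply never looks at them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a token; not the ANTLR runtime class. */
final class SketchToken {
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 99; // the channel the slides use for WS

    final String text;
    final int channel;
    final int start; // index of the first character in the input buffer
    final int stop;  // index of the last character in the input buffer

    SketchToken(String text, int channel, int start, int stop) {
        this.text = text;
        this.channel = channel;
        this.start = start;
        this.stop = stop;
    }
}

final class ChannelFilterSketch {
    /** Toy lexer: alternating runs of whitespace and non-whitespace become tokens. */
    static List<SketchToken> lex(String input) {
        List<SketchToken> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int start = i;
            boolean ws = Character.isWhitespace(input.charAt(i));
            while (i < input.length() && Character.isWhitespace(input.charAt(i)) == ws) {
                i++;
            }
            int channel = ws ? SketchToken.HIDDEN_CHANNEL : SketchToken.DEFAULT_CHANNEL;
            tokens.add(new SketchToken(input.substring(start, i), channel, start, i - 1));
        }
        return tokens;
    }

    /** What the parser "hears": only the tokens on the requested channel. */
    static List<SketchToken> onChannel(List<SketchToken> all, int channel) {
        List<SketchToken> visible = new ArrayList<>();
        for (SketchToken t : all) {
            if (t.channel == channel) {
                visible.add(t);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<SketchToken> all = lex("int  x");
        System.out.println(all.size());                                         // 3
        System.out.println(onChannel(all, SketchToken.DEFAULT_CHANNEL).size()); // 2
    }
}
```

Because the hidden tokens stay in the full list, a tool that needs the original whitespace can still reach it through the start/stop indices.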

Error Recovery
- ANTLR v3 does what Josef Grosch does in Cocktail
- Does single token insertion or deletion if necessary to keep going
- Computes context-sensitive FOLLOW to do insert/delete
  - proper context is passed to each rule invocation
  - knows precisely what can follow reference to r rather than what could follow any reference to r (per Wirth circa 1970)
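A minimal sketch of single-token deletion and insertion, written as a toy hand-rolled matcher rather than ANTLR's actual runtime code. The class `RecoverySketch`, the string token types, and the error message wording are all assumptions made for illustration. Each call to `match` receives the set of tokens that can follow at that exact spot, which stands in for the context-sensitive FOLLOW set described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy matcher over token type names; hypothetical, not the ANTLR runtime. */
final class RecoverySketch {
    private final List<String> tokens;
    private int p = 0; // index of the next token to consume
    final List<String> errors = new ArrayList<>();

    RecoverySketch(List<String> tokens) {
        this.tokens = tokens;
    }

    /** k-th token of lookahead, or "EOF" past the end of input. */
    private String lt(int k) {
        int i = p + k - 1;
        return i < tokens.size() ? tokens.get(i) : "EOF";
    }

    void match(String expected, Set<String> follow) {
        if (lt(1).equals(expected)) {
            p++; // normal case
            return;
        }
        if (lt(2).equals(expected)) {
            // Single token deletion: skip the extra token, then consume the expected one.
            errors.add("extraneous " + lt(1) + "; expecting " + expected);
            p += 2;
            return;
        }
        if (follow.contains(lt(1))) {
            // Single token insertion: pretend the expected token was there, consume nothing.
            errors.add("missing " + expected + " before " + lt(1));
            return;
        }
        throw new IllegalStateException("cannot recover at " + lt(1) + "; expecting " + expected);
    }

    /** stat : ID '=' INT ';' ; invoked here from the start rule, so EOF follows it. */
    void stat() {
        match("ID", Set.of("="));
        match("=", Set.of("INT"));
        match("INT", Set.of(";"));
        match(";", Set.of("EOF"));
    }

    static List<String> parse(String... tokenTypes) {
        RecoverySketch parser = new RecoverySketch(List.of(tokenTypes));
        parser.stat();
        return parser.errors;
    }

    public static void main(String[] args) {
        System.out.println(parse("ID", "=", "INT", ";"));      // []
        System.out.println(parse("ID", "=", "INT"));           // [missing ; before EOF]
        System.out.println(parse("ID", "=", "=", "INT", ";")); // [extraneous =; expecting INT]
    }
}
```

In both error cases the parse continues to the end of the statement, which is the behavior the slide describes as "keep going".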

Example Error Recovery

One token insertion:

    int i = 0;
    method foo( {
      int j = i;
      i = 4
    }

    [program, method]: line 2:12 mismatched token: [@14,23:23='{',<14>,2:12]; expecting type ')'
    [program, method, statement]: line 5:0 mismatched token: [@31,46:46='}',<15>,5:0]; expecting type ';'

One token deletion:

    int i = 0;
    method foo() ) {
      int j = i;
      i = = 4;
    }

    [program, method]: line 2:13 mismatched token: [@15,24:24=')',<13>,2:13]; expecting type '{'
    [program, method, statement, expr]: line 4:6 mismatched token: [@32,47:47='=',<6>,4:6]; expecting set null

Note: I put in two errors each so you’ll see it continues properly

Attributes
- New label syntax and multiple return values
- Unified token, rule, parameter, return value, tree reference syntax in actions
- Dynamically scoped attributes!

    a[String s] returns [float y]
        : id=ID f=field (ids+=ID)+
          {$s, $y, $id, $id.text, $f.z; $ids.size();}
        ;

    field returns [int x, int z]
        : …
        ;

Label properties
- Token label reference properties: text, type, line, pos, channel, index, tree
- Rule label reference properties:
  - start, stop; indices of token boundaries
  - tree
  - text; text matched for whole rule

Rule Scope Attributes
- A rule may define a scope of attributes visible to any invoked rule; operates like a stacked global variable
- Avoids having to pass a value down

    method
    scope { String name; }
        : "method" ID '(' ')' {$name=$ID.text;} body
        ;

    body: '{' stat* '}' ;
    …
    atom
    init {… $method.name …}
        : ID | INT
        ;

Global Scope Attributes
- Named scopes; rules must explicitly request access

    scope Symbols {
      List names;
    }

    {int level=0;}

    globals
    scope Symbols;
    init {
      level++;
      $Symbols.names = new ArrayList();
    }
        : decl* {level--;}
        ;

    block
    scope Symbols;
    init {
      level++;
      $Symbols.names = new ArrayList();
    }
        : '{' decl* stat* '}' {level--;}
        ;

    decl : "int" ID ';' {$Symbols.names.add($ID);} ;

*What if we want to keep the symbol tables around after parsing?
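The "stacked global variable" behavior can be sketched with an explicit stack in plain Java. This is only an illustration of the idea, not the code ANTLR generates: `ScopeStackSketch`, its `block` and `decl` methods, and the `log` field are hypothetical names. A rule that requests the scope pushes a fresh `Symbols` on entry and pops it on exit; a rule such as `decl` that does not own a scope just uses whatever is on top.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of a dynamically scoped attribute; not generated ANTLR code. */
final class ScopeStackSketch {
    /** One instance per active rule invocation, like: scope Symbols { List names; } */
    static final class Symbols {
        final List<String> names = new ArrayList<>();
    }

    private final Deque<Symbols> symbolsStack = new ArrayDeque<>();
    final List<String> log = new ArrayList<>();

    /** A rule that requests "scope Symbols;": push on entry, pop on exit. */
    void block(Runnable body) {
        symbolsStack.push(new Symbols());
        try {
            body.run();
        } finally {
            Symbols finished = symbolsStack.pop();
            // Nesting level of the scope that just ended (1 = outermost).
            log.add("level " + (symbolsStack.size() + 1) + ": " + finished.names);
        }
    }

    /** A rule with no scope of its own: $Symbols means the top of the stack. */
    void decl(String id) {
        symbolsStack.peek().names.add(id);
    }

    public static void main(String[] args) {
        ScopeStackSketch p = new ScopeStackSketch();
        p.block(() -> {         // plays the role of globals
            p.decl("g");
            p.block(() -> {     // a nested block
                p.decl("x");
                p.decl("y");
            });
            p.decl("h");
        });
        System.out.println(p.log); // [level 2: [x, y], level 1: [g, h]]
    }
}
```

Note that each `Symbols` instance becomes unreachable from the stack once its rule returns; the sketch only keeps a printed summary in `log`.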

Tree Support
- TreeAdaptor; how to create and navigate trees (like ASTFactory from v2); ANTLR assumes tree nodes are Object type
- Tree; used by support code
- BaseTree; list of children, w/o payload (no more child-sibling trees)
- CommonTree; node wrapping Token as payload
- ParseTree; used by interpreter to build trees

Tree Construction
- Automatic mechanism is same as v2 except ^ is now ^^

      expr : atom ( '+'^^ atom )* ;

- ^ implies root of tree for enclosing subrule

      a : ( ID^ INT )* ;

  builds (a 1) (b 2) …
- Token labels are $label not #label and rule invocation tree results are $ruleLabel.tree
- Turn on options {output=AST;} (one can imagine output=text for templates)
- Option: ASTLabelType=CommonTree;

Tree Rewrite Rules
- Maps an input grammar fragment to an output tree grammar fragment

    variable
        : type declarator ';' -> ^(VAR_DEF type declarator)
        ;

    functionHeader
        : type ID '(' ( formalParameter ( ',' formalParameter )* )? ')'
          -> ^(FUNC_HDR type ID formalParameter+)
        ;

    atom
        : …
        | '(' expr ')' -> expr
        ;

Mixed Rewrite/Auto Trees
- Alternatives w/o -> rewrite use automatic mechanism

    b : ID INT -> INT ID
      | INT // implies -> INT
      ;

Rewrites and labels
- Disambiguates element references or used to construct imaginary nodes
- Concatenation += labels useful too:

    forStat
        : "for" '(' start=assignStat ';' expr ';' next=assignStat ')' block
          -> ^("for" $start expr $next block)
        ;

    block
        : lc='{' variable* stat* '}'
          -> ^(BLOCK[$lc] variable* stat*)
        ;

    /** match string representation of tree and build tree in memory */
    tree : '^' '(' root=atom (children+=tree)+ ')' -> ^($root $children)
         | atom
         ;

Loops in Rewrites
- Repeated element

      ID ID -> ^(VARS ID+)

  yields ^(VARS a b)
- Repeated tree

      ID ID -> ^(VARS ID)+

  yields ^(VARS a) ^(VARS b)
- Multiple elements in loop need same size

      ID INT ID INT -> ^( R ID ^( S INT) )+

  yields (R a (S 1)) (R b (S 2))
- Checks cardinality of + and * loops
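The three loop shapes above can be mimicked with a small Java sketch that renders trees as strings. Nothing here is ANTLR API: `RewriteLoopSketch` and its method names are hypothetical, and the size and emptiness checks are a simplified reading of "need same size" and "checks cardinality".

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: trees are rendered as strings; none of this is ANTLR API. */
final class RewriteLoopSketch {
    private static void requireOneOrMore(List<String> elements) {
        if (elements.isEmpty()) {
            throw new IllegalArgumentException("+ loop needs at least one element");
        }
    }

    /** ID ID -> ^(VARS ID+): the loop is inside the tree, so one tree results. */
    static String elementLoop(List<String> ids) {
        requireOneOrMore(ids);
        return "(VARS " + String.join(" ", ids) + ")";
    }

    /** ID ID -> ^(VARS ID)+: the loop wraps the tree, so one tree per ID. */
    static List<String> treeLoop(List<String> ids) {
        requireOneOrMore(ids);
        List<String> trees = new ArrayList<>();
        for (String id : ids) {
            trees.add("(VARS " + id + ")");
        }
        return trees;
    }

    /** ID INT ID INT -> ^(R ID ^(S INT))+: both element lists advance together. */
    static List<String> pairedLoop(List<String> ids, List<String> ints) {
        requireOneOrMore(ids);
        if (ids.size() != ints.size()) {
            throw new IllegalArgumentException(
                "rewrite elements differ in size: " + ids.size() + " vs " + ints.size());
        }
        List<String> trees = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            trees.add("(R " + ids.get(i) + " (S " + ints.get(i) + "))");
        }
        return trees;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("a", "b");
        System.out.println(elementLoop(ids));                   // (VARS a b)
        System.out.println(treeLoop(ids));                      // [(VARS a), (VARS b)]
        System.out.println(pairedLoop(ids, List.of("1", "2"))); // [(R a (S 1)), (R b (S 2))]
    }
}
```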

Preventing cyclic structures
- Repeated elements get duplicated

      a : INT -> INT INT ; // dups INT!
      a : INT INT -> INT+ INT+ ; // 4 INTs!

- Repeated rule references get duplicated

      a : atom -> ^(atom atom) ; // no cycle!

- Duplicates whole tree for all but first ref to an element; here 2nd ref to atom results in a duplicated atom tree
- *Useful example "int x,y" -> "^(int x) ^(int y)"

      decl : type ID (',' ID)* -> ^(type ID)+ ;

  *Just noticed a bug in this one ;)
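Why duplication avoids a cycle can be seen in a tiny Java sketch of `a : atom -> ^(atom atom)`. The node class and method names (`DupNodeSketch`, `dupTree`, `rootWithSelfAsChild`) are made up for this illustration and say nothing about how the ANTLR runtime implements it. Without the copy, the node would be added as its own child.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tree node with a child list; not an ANTLR runtime class. */
final class DupNodeSketch {
    final String text;
    final List<DupNodeSketch> children = new ArrayList<>();

    DupNodeSketch(String text) {
        this.text = text;
    }

    /** Deep copy of this node and everything below it. */
    DupNodeSketch dupTree() {
        DupNodeSketch copy = new DupNodeSketch(text);
        for (DupNodeSketch child : children) {
            copy.children.add(child.dupTree());
        }
        return copy;
    }

    /** atom -> ^(atom atom): the first reference is the node itself, the second a copy. */
    static DupNodeSketch rootWithSelfAsChild(DupNodeSketch atom) {
        DupNodeSketch duplicate = atom.dupTree(); // copy before the original is modified
        atom.children.add(duplicate);
        return atom;
    }

    public static void main(String[] args) {
        DupNodeSketch root = rootWithSelfAsChild(new DupNodeSketch("x"));
        DupNodeSketch child = root.children.get(0);
        System.out.println(child != root);            // true
        System.out.println(child.text);               // x
        System.out.println(child.children.isEmpty()); // true
    }
}
```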

Predicated rewrites
- Use semantic predicate to indicate which rewrite to choose from

    a : ID INT
        -> {p1}? ID
        -> {p2}? INT
        ->
      ;

Misc Rewrite Elements
- Arbitrary actions

      a : atom -> ^({adaptor.createToken(INT,"9")} atom) ;

- rewrite always sets the rule’s AST not subrule’s
- Reference to previous value (useful?)

      b : "int"
          ( ID -> ^(TYPE "int" ID)
          | ID '=' INT -> ^(TYPE "int" ID INT)
          )
        ;

      a : (atom -> atom) (op='+' r=atom -> ^($op $a $r) )* ;

Tree Grammars
- Syntax same as parser grammars, add ^(root children…) tree element
- Uses LL(*) also; even derives from same superclass!
- Tree is serialized to include DOWN, UP imaginary tokens to encode 2D structure for serial parser

    variable
        : ^(VAR_DEF type ID)
        | ^(VAR_DEF type ID ^(INIT expr))
        ;
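The serialization step can be sketched as a preorder walk that brackets each non-empty child list with DOWN and UP. The code below is a stand-alone illustration under that assumption; `TreeNodeSketch` and `serialize` are hypothetical names, not the runtime's tree node stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical child-list tree node; not an ANTLR runtime class. */
final class TreeNodeSketch {
    final String text;
    final List<TreeNodeSketch> children;

    TreeNodeSketch(String text, TreeNodeSketch... children) {
        this.text = text;
        this.children = Arrays.asList(children);
    }

    /** Flatten a tree into a one-dimensional token sequence. */
    static List<String> serialize(TreeNodeSketch root) {
        List<String> out = new ArrayList<>();
        walk(root, out);
        return out;
    }

    private static void walk(TreeNodeSketch node, List<String> out) {
        out.add(node.text);
        if (node.children.isEmpty()) {
            return; // leaves need no DOWN/UP pair
        }
        out.add("DOWN");
        for (TreeNodeSketch child : node.children) {
            walk(child, out);
        }
        out.add("UP");
    }

    public static void main(String[] args) {
        // ^(VAR_DEF int x ^(INIT 3))
        TreeNodeSketch tree = new TreeNodeSketch("VAR_DEF",
            new TreeNodeSketch("int"),
            new TreeNodeSketch("x"),
            new TreeNodeSketch("INIT", new TreeNodeSketch("3")));
        System.out.println(serialize(tree));
        // [VAR_DEF, DOWN, int, x, INIT, DOWN, 3, UP, UP]
    }
}
```

With the structure encoded this way, an ordinary serial parser can match `^(VAR_DEF type ID ^(INIT expr))` as the flat sequence root, DOWN, children, UP.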

Code Generation
- Uses StringTemplate to specify how each abstract ANTLR concept maps to code; wildly successful!
- Separates code gen logic from output; not a single character of output in the Java code
- Java.stg: 140 templates, 1300 lines

Sample code gen templates

    /** Dump the elements one per line and stick in debugging
     *  location() trigger in front.
     */
    element() ::= <<
    <if(debug)>
    dbg.location(<it.line>,<it.pos>);<\n>
    <endif>
    <it.el><\n>
    >>

    /** match a token optionally with a label in front */
    tokenRef(token,label,elementIndex) ::= <<
    <if(label)>
    <label>=input.LT(1);<\n>
    <endif>
    match(input,<token>,FOLLOW_<token>_in_<ruleName><elementIndex>);
    >>

Internationalization
- ANTLR v3 uses StringTemplate to display all errors
- Senses locale to load messages; en.stg: 76 templates
- ErrorManager error number constants map to a template name; e.g.,

    RULE_REDEFINITION(file,line,col,arg) ::=
      "<loc()>rule <arg> redefinition"

    /* This factors out file location formatting; file,line,col inherited from
     * enclosing template; don't manually pass stuff in.
     */
    loc() ::= "<file>:<line>:<col>: "

Runtime Support
- Better organized, separated:
    org.antlr.runtime
    org.antlr.runtime.tree
    org.antlr.runtime.debug
- Clean; Parser has input ptr only (except error recovery FOLLOW stack); Lexer also only has input ptr
- 4500 lines of Java code minus BSD header

Summary
- v3 kicks ass
- it sort of works!
- http://www.antlr.org/download/…
- ANTLRWorks progressing in parallel