Writing Parsers with Ruby


Lexer (Lexical Analyzer)
  Breaks up an input into a series of lexical tokens.
  Each token has a symbol (:IDENTIFIER) and a value ('x').
  Represent each token by a two-element array: [symbol, value]
      [:IDENTIFIER, 'x'], ['+', '+'], [:INTEGER, '10']
  Typically matched using regular expressions:
      token = case input
              when /\A[a-zA-Z_]\w*/
                [:IDENTIFIER, $&]
              when /\A[0-9]+[ULul]*/
                [:INTEGER, $&]
              ...
              end
      input = $'
  The lexer must be able to shift: return the next token and move to the next position.
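A minimal sketch of the case/regexp pattern above, wrapped in a loop so it can be run as is. The method name tokenize, the whitespace rule, and the catch-all rule for single-character operators are assumptions added for illustration; they are not on the slide.

```ruby
# Hypothetical helper: turn a String into [symbol, value] tokens using the
# case/regexp pattern from the slide.
def tokenize(input)
  tokens = []
  until input.empty?
    case input
    when /\A\s+/
      # Whitespace: consume it without producing a token (assumed rule).
    when /\A[a-zA-Z_]\w*/
      tokens << [:IDENTIFIER, $&]
    when /\A[0-9]+[ULul]*/
      tokens << [:INTEGER, $&]
    when /\A./m
      # Anything else becomes a one-character token whose symbol is its text.
      tokens << [$&, $&]
    end
    # $' holds everything after the last successful match: the "shift".
    input = $'
  end
  tokens
end

p tokenize('x + 10')
```

Each pass allocates a new String for the remaining input, which is part of why the later slides recommend StringScanner.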

Parser (Grammatical Structure Analysis)
  Start Symbol <=> Input
  Parse Tree (Concrete Syntax Tree)
    Input tokens are the leaves.
    Start symbol is the root.
    Built using the grammar rules.

                     expr
          ===========|==========
          expr       +       expr
           |         |         |
      [:INT,15], ['+','+'], [:INT,2]

  Context-Free Grammar
    The parse tree can be constructed from the rules of the grammar alone.
  Parser Generators
      $ racc my_parser.y -o my_parser.rb
  How do parsers and parser generators work?
    I am going to gloss over many of the details.
    Wikipedia.org is your friend. Lots of good articles.

Parsing Rules
  Parse Tree (not AST)
      expr : expr '+' expr
           | '-' INT = UMINUS
           | INT
           ;
  The resulting token's symbol == the rule name.
  Typically, shifted tokens already have a type and a value.
  Reduces to a single expr (this is the start symbol).

Defining Rules
  Backus-Naur Form (BNF)
    Most common (understood by YACC, Bison, RACC).
    The rules on the previous slide.
        argument_list : expression
                      | argument_list ',' expression
                      ;
  Extended Backus-Naur Form (EBNF)
    Newer tools use it (Coco/R, Spirit, many others).
    Allows the use of operators: ?, *, +, (, ), [, ], {, }, <, >
    () -> group tokens, * -> zero or more
        argument_list = expression ( ',' expression )* ;
    {} -> zero or more of group
        argument_list ::= expression { ',' expression }
    Some EBNF parsers combine the lexer and parser :-/
  BNF <=> EBNF (EBNF adds some syntactic sugar)
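The EBNF repetition maps naturally onto a loop in a hand-written parser. A small sketch follows, assuming "expression" is simplified to any single non-comma token; parse_argument_list and parse_expression are hypothetical names chosen for this example.

```ruby
# argument_list = expression ( ',' expression )* ;
# Assumption: an "expression" is any single token other than ','.
def parse_expression(rest)
  token = rest.shift
  raise ArgumentError, 'expected an expression' if token.nil? || token[0] == ','
  token
end

def parse_argument_list(tokens)
  rest = tokens.dup
  args = [parse_expression(rest)]
  # The ( ',' expression )* group becomes a while loop.
  while rest.first && rest.first[0] == ','
    rest.shift
    args << parse_expression(rest)
  end
  raise ArgumentError, "unexpected token #{rest.first.inspect}" unless rest.empty?
  args
end

p parse_argument_list([[:IDENTIFIER, 'a'], [',', ','], [:INTEGER, '1']])
```

The left-recursive BNF form accepts the same token sequences; it just spells the repetition as recursion.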

Top-Down or Bottom-Up?
  Top-Down Parsing (aka Top-Bottom Parsing)
    LL (Left-to-right, Leftmost derivation)
    Expand the start symbol from the left to get the input.
    Recursive descent parsers are the most common.
    LL parsers are becoming popular (again).
  Bottom-Up Parsing (aka Shift/Reduce Parsing)
    LR (Left-to-right, Rightmost derivation in reverse)
    Reduce the input from the right to get back to the start symbol.
    LALR (Look-Ahead LR) parsers are the most common; many generators and grammars are available.
    YACC, Bison, and RACC all generate LALR parsers.
  Look ahead: LR(1), LL(1), ...
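For comparison with the generated LALR parsers discussed later, here is a sketch of the top-down approach: a hand-written recursive descent parser that evaluates as it goes. The grammar in the comment and the class name TinyDescentParser are assumptions made for this example.

```ruby
# Grammar (assumed for illustration):
#   expr   : term   ( ('+' | '-') term   )*
#   term   : factor ( ('*' | '/') factor )*
#   factor : INTEGER | '(' expr ')'
class TinyDescentParser
  def initialize(tokens)
    @tokens = tokens
    @pos = 0
  end

  # Start symbol: parse one expr and insist that all input was consumed.
  def parse
    value = expr
    raise ArgumentError, "unexpected token #{peek.inspect}" if peek
    value
  end

  private

  def peek
    @tokens[@pos]
  end

  def peek_symbol
    token = peek
    token && token[0]
  end

  def advance
    token = @tokens[@pos]
    @pos += 1
    token
  end

  def expr
    value = term
    while ['+', '-'].include?(peek_symbol)
      op = advance[0]
      rhs = term
      value = op == '+' ? value + rhs : value - rhs
    end
    value
  end

  def term
    value = factor
    while ['*', '/'].include?(peek_symbol)
      op = advance[0]
      rhs = factor
      value = op == '*' ? value * rhs : value / rhs
    end
    value
  end

  def factor
    sym, text = advance
    case sym
    when :INTEGER
      text.to_i
    when '('
      value = expr
      closing = advance
      raise ArgumentError, "expected ')'" unless closing && closing[0] == ')'
      value
    else
      raise ArgumentError, "unexpected token #{[sym, text].inspect}"
    end
  end
end

tokens = [[:INTEGER, '2'], ['+', '+'], [:INTEGER, '3'], ['*', '*'], [:INTEGER, '4']]
p TinyDescentParser.new(tokens).parse
```

Each grammar rule becomes one method, and the parser only ever looks at the next token, which is the LL(1) idea in miniature.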

The Ruby Tools (http://i.loveruby.net/en/prog/)
  Coco/R and CocoRb
    http://rubyforge.org/projects/coco-ruby/
    New (11/18/04), docs? (http://www.scifac.ru.ac.za/coco/)
    http://www3.sympatico.ca/mark.probert/work/ruby.html
  StringScanner
    For lexing; iterates over a String.
    Much faster than lexing with $', $`, and $&.
    Included with Ruby 1.8+.
  RACC (Ruby yACC)
    An LALR parser generator.
    Pretty fast; allows Strings as token symbols.
    Runtime included with Ruby 1.8+.
    Parser generator (racc) not included with Ruby.
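A sketch of lexing with StringScanner from the standard library, producing the same [symbol, value] pairs as the lexer slide. The method name scan_tokens and the whitespace and single-character fallback rules are assumptions for illustration.

```ruby
require 'strscan'

# Hypothetical helper: tokenize a String with StringScanner instead of
# re-slicing the input with $'.
def scan_tokens(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/)
      next # whitespace produces no token
    elsif (text = scanner.scan(/[a-zA-Z_]\w*/))
      tokens << [:IDENTIFIER, text]
    elsif (text = scanner.scan(/[0-9]+[ULul]*/))
      tokens << [:INTEGER, text]
    else
      # Fallback: one character, used as both symbol and value.
      text = scanner.getch
      tokens << [text, text]
    end
  end
  tokens
end

p scan_tokens('a1 + 42UL')
```

scan and skip only match at the current scan position and advance it on success, so no \A anchor and no copy of the remaining input is needed.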

C's if Statement
  Part of a RACC grammar for parsing C's if statement:
      statement : '{' statement_list '}'
                | if_statement
                | expression ';'
                ;
      if_statement : 'if' '(' expression ')' statement
                   | 'if' '(' expression ')' statement 'else' statement
  1 shift/reduce conflict...

The Infamous Dangling else
      if (batman) if (robin) pow(); else zap();
  Look-ahead issues:
    The parser reduces pow(); to a statement.
    It looks ahead and sees else.
    Now it must choose:
      reduce: turn if (robin) pow(); into a statement
      shift: getting if (robin) pow(); else
    What to do?

...The Infamous Dangling else
  A shift/reduce conflict.
  Be greedy (shift).
    Can't be correct in all cases, but this is the simplest way.
  The parser looks ahead 1 token, but not behind...
    No way to know that 'if (robin) pow();' is nested.
  LALR
    Look ahead 1 token before reducing (if necessary).
  Reducing from the right -> LR
    1. 'if' '(' <expression> ')' 'if' '(' <expression> ')' <statement> 'else' <statement>
    2. 'if' '(' <expression> ')' <statement>
    3. <statement>
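The effect of the greedy choice can be seen in a small sketch. This is a hand-written recursive descent parser, not the LALR machinery RACC generates, but it makes the same decision: when an else is next, take it immediately. The simplified grammar, the token format (plain Strings), and the names parse_statement and expect_token are assumptions for this example.

```ruby
# Simplified grammar (assumed for illustration):
#   statement : 'if' '(' NAME ')' statement [ 'else' statement ]
#             | NAME '(' ')' ';'
def expect_token(tokens, symbol)
  actual = tokens.shift
  raise ArgumentError, "expected #{symbol.inspect}, got #{actual.inspect}" unless actual == symbol
end

def parse_statement(tokens)
  if tokens.first == 'if'
    tokens.shift
    expect_token(tokens, '(')
    condition = tokens.shift
    expect_token(tokens, ')')
    then_branch = parse_statement(tokens)
    else_branch = nil
    if tokens.first == 'else'
      # Greedy: consume the else now, so it binds to the nearest if.
      tokens.shift
      else_branch = parse_statement(tokens)
    end
    [:if, condition, then_branch, else_branch]
  else
    name = tokens.shift
    expect_token(tokens, '(')
    expect_token(tokens, ')')
    expect_token(tokens, ';')
    [:call, name]
  end
end

tokens = %w[if ( batman ) if ( robin ) pow ( ) ; else zap ( ) ;]
tree = parse_statement(tokens)
p tree
```

The else ends up inside the if (robin) node and the outer if (batman) has no else branch, which matches what C specifies.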

Specifying Precedence
  With the Rules
      cast_expression : unary_expression
                      | '(' type_name ')' cast_expression
                      ;
      multiplicative_expression : cast_expression
                                | multiplicative_expression '*' cast_expression
                                | multiplicative_expression '/' cast_expression
                                | multiplicative_expression '%' cast_expression
      additive_expression : multiplicative_expression
                          | additive_expression '+' multiplicative_expression
                          | additive_expression '-' multiplicative_expression
  Other Facilities
      prechigh
        right '!' '~'
        right 'sizeof'
        left '*' '/' '%'
        left '+' '-'
        left '<' '<=' '>' '>='
        left '==' '!='
        left '&' '^' '|'
        left '&&' '||'
        nonassoc POINTER
      preclow
  talk about building rules, rule recursion
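To see what a precedence table buys you, here is a sketch that uses precedence climbing, a different technique from RACC's conflict resolution, driven by a table that mirrors two of the left declarations above. PRECEDENCE and parse_binary are hypothetical names, and operands are assumed to be plain integer tokens.

```ruby
# Higher number binds tighter; mirrors  left '*' '/' '%'  above  left '+' '-'.
PRECEDENCE = {
  '*' => 2, '/' => 2, '%' => 2,
  '+' => 1, '-' => 1
}.freeze

# Precedence climbing over [symbol, value] tokens. Every operator in the
# table is treated as left-associative. Consumes the tokens array.
def parse_binary(tokens, min_prec = 1)
  left = tokens.shift[1].to_i
  while (op = tokens.first && tokens.first[0]) && PRECEDENCE.fetch(op, 0) >= min_prec
    tokens.shift
    # Asking for strictly higher precedence on the right gives left associativity.
    right = parse_binary(tokens, PRECEDENCE[op] + 1)
    left = left.public_send(op, right)
  end
  left
end

tokens = [[:INTEGER, '2'], ['+', '+'], [:INTEGER, '3'], ['*', '*'], [:INTEGER, '4']]
p parse_binary(tokens)
```

The table plays the role of the prechigh ... preclow block: the grammar stays flat and the ordering of operators lives in data.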

Recursive Rules
  Handling a C cast
      cast_expression : unary_expression
                      | '(' type_name ')' cast_expression
                      ;
  Building lists
      argument_list : expression
                    | argument_list ',' expression
  Nested blocks
      statement_list : statement
                     | statement_list statement
                     ;
      statement : '{' statement_list '}'
                | if_statement
                | expression ';'

skeleton.y
      class MyParser # could be MyModule::MyParser
        prechigh
          nonassoc UMINUS
          left '*' '/'
        preclow
        # token symbols created by the lexer
        token IDENTIFIER INTEGER STRING CHARACTER
      rule
        expect 1 # number of expected shift/reduce conflicts
        # bogus rule
        target : /* blank */ { result = nil } ;
      end
      ---- header ----
      # stuff that will come before the definition of MyParser
      ---- inner ----
      # inside the class definition of MyParser
      ---- footer ----
      # stuff that will come after the definition of MyParser

RACC Actions
  Define your own inside curly braces:
      expr : expr '*' expr { result = val[0] * val[2] }
  Constructing the parse tree
    result : the value that the matched symbols are reduced to
      the parent in the parse tree
      its token symbol is the rule name
    val : array of the values of the symbols being reduced
      the children of result
    _values : the parser's value stack, holding the values not reduced by the current action (see next slide)
  The default action is { result = val[0] }

Building the Parse Tree
      val     = [:INT, 5], ['*', '*'], [:INT, 3]
      _values = /* nothing */

                        expr = result
               ============|===========
               expr        +        expr
                |          |          |
      [:INT, 5], ['*', '*'], [:INT, 3], ['+', '+'], [:INT, 2]
  -------------------------------------------------------------
      val     = [:INT, 5], ['*', '*'], [:INT, 3]
      _values = [:INT, 2], ['+', '+']

                            expr
               expr        +        expr
                |          |          |
      [:INT, 2], ['+', '+'], [:INT, 5], ['*', '*'], [:INT, 3]
  If the action is result = val[0] * val[2], then val reduces to [expr, 15].
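The second picture can be replayed by hand. This sketch imitates one reduce step with a plain Array as the value stack; it is an illustration of the idea, not RACC's actual implementation. The helper name reduce is hypothetical, and the values are already-converted Integers rather than [symbol, value] pairs.

```ruby
# Pop the last `length` values off the stack, pass them to the action as val,
# pass what remains as _values, and push the action's result back.
def reduce(value_stack, length)
  val = value_stack.pop(length)
  result = yield(val, value_stack, val[0])
  value_stack.push(result)
end

# Input 2 + 5 * 3, after all five tokens have been shifted.
value_stack = [2, '+', 5, '*', 3]
seen = []

# expr : expr '*' expr { result = val[0] * val[2] }
reduce(value_stack, 3) do |val, _values, _result|
  seen << [val.dup, _values.dup]
  val[0] * val[2]
end

# expr : expr '+' expr { result = val[0] + val[2] }
reduce(value_stack, 3) { |val, _values, _result| val[0] + val[2] }

p seen
p value_stack
```

During the first reduction val holds 5, '*', 3 while _values still holds 2 and '+', which is the situation drawn in the lower half of the slide.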

RACC API
  Generated parser is a subclass of Racc::Parser.
    Can't subclass other classes :( mix-ins OK
  Entry points to the parser
    do_parse()
      next_token() is called on each shift.
    yyparse(receiver, :method)
      For each shift: receiver.method { |token| ... }
      yyparse([[:INTEGER,'2'],['+','+'],[:INTEGER,'3']], :each)
  on_error(err_token_id, err_value, value_stack)
    Called when a parse error occurs (optional override).
    Use token_to_str(err_token_id) to get the name String.
  Parser exits automatically at end of input (if no errors).
    A [false, false] token signals end of input; the symbol must be false.
    Returns the value of the start symbol / root of the parse tree.
  yyaccept()
    Exit parser. Returns val[0], NOT result.
    Error raised if using yyparse() and more tokens in receiver.
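Since mix-ins are OK, the token supply can live in a module. The sketch below shows the shape of a next_token method that hands out [symbol, value] pairs and then the [false, false] end marker. ArrayTokenSource, load_tokens, and FakeParser are hypothetical names; FakeParser merely stands in for a RACC-generated class so the example runs without racc installed.

```ruby
# Hypothetical mix-in: supplies next_token from a preloaded Array of tokens.
module ArrayTokenSource
  def load_tokens(tokens)
    @token_queue = tokens.dup
  end

  # Returns the next [symbol, value] pair, or [false, false] once the
  # queue is empty (the end-of-input signal described on the slide).
  def next_token
    @token_queue.shift || [false, false]
  end
end

# Stand-in for a generated parser class (assumption for this example).
class FakeParser
  include ArrayTokenSource
end

parser = FakeParser.new
parser.load_tokens([[:INTEGER, '2'], ['+', '+'], [:INTEGER, '3']])
4.times { p parser.next_token }
```

In a real grammar file the include line would go in the inner section, and do_parse() would be the caller of next_token.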

Final Tips
  Regexp look-ahead: /[0-9]+(?=[^\.])/
    You shouldn't have to use it much, if at all.
  Put the most common matches first (if you can).
  Need immutable tokens? (#freeze)
  Streaming?
  Limit object creation / destruction
    Use StringScanner.
    Reuse tokens (symbols are nice).
    Use a hash for common tokens / reserved words:
      if, do, while, +, -, *, /, ==
      h.fetch(token_value) { |v| [:IDENTIFIER, v] }
  Single pass / multiple passes
  show hashes in ctokenizer.rb
  using hashes to identify reserved words
  talk about StringScanner API
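The hash tip can be sketched as follows. The table contents and the names RESERVED and classify are assumptions for illustration; this is not the code from ctokenizer.rb.

```ruby
# One frozen, shared token per reserved word or operator, built once.
RESERVED = %w[if do while + - * / ==].each_with_object({}) do |text, table|
  table[text] = [text, text.freeze].freeze
end.freeze

# Reserved words reuse the stored token; anything else becomes a fresh
# IDENTIFIER token. Hash#fetch passes the missing key to the block.
def classify(token_value)
  RESERVED.fetch(token_value) { |v| [:IDENTIFIER, v] }
end

p classify('while')
p classify('batman')
p classify('if').equal?(classify('if'))
```

Because the stored tokens are frozen and shared, repeated keywords allocate nothing new, which addresses both the immutability and the object-creation tips.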

Questions?