Writing Parsers with Ruby


Lexer (Lexical Analyzer)
Breaks up an input into a series of lexical tokens.
Each token has a symbol (:IDENTIFIER) and a value ('x').
Represent each token by a two-element array: [symbol, value]
    [:IDENTIFIER, 'x'], ['+', '+'], [:INTEGER, '10']
Typically matched using regular expressions:
    token = case input
            when /\A[a-zA-Z_]\w*/
              [:IDENTIFIER, $&]
            when /\A[0-9]+[ULul]*/
              [:INTEGER, $&]
            ...
            end
    input = $'
The lexer must be able to shift: return the next token, move to the next position.
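A minimal runnable sketch of the lexing loop described on this slide. The method name tokenize, the whitespace rule, and the catch-all rule for single characters are assumptions added for illustration; they are not from the slides.

```ruby
# Hypothetical helper (not from the slides): turn a String into
# [symbol, value] token pairs, one regex match at a time.
def tokenize(input)
  tokens = []
  until input.empty?
    case input
    when /\A\s+/
      # whitespace: consume it, emit nothing
    when /\A[a-zA-Z_]\w*/
      tokens << [:IDENTIFIER, $&]
    when /\A[0-9]+[ULul]*/
      tokens << [:INTEGER, $&]
    when /\A./m
      # any other character stands for itself, e.g. ['+', '+']
      tokens << [$&, $&]
    end
    input = $' # the rest of the input after the last match
  end
  tokens
end

p tokenize("x + 10")
```

Each pass through the loop is one shift: a token is produced and the input moves on to whatever follows the match.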

Parser (Grammatical Structure Analysis)
Start Symbol <=> Input
Parse Tree (Concrete Syntax Tree)
    Input tokens are the leaves.
    The start symbol is the root.
    Built using the grammar rules.

                  expr
       ===========|==========
      expr                 expr
       |          |          |
    [:INT,15], ['+','+'], [:INT,2]

Context Free Grammar
    The parse tree can be constructed from the rules of the grammar alone.
Parser Generators
    $ racc my_parser.y -o my_parser.rb
How do parsers and parser generators work?
    I am going to gloss over many of the details.
    Wikipedia.org is your friend. Lots of good articles.

Parsing Rules
Parse Tree (not AST)
    expr : expr '+' expr
         | '-' INT = UMINUS
         | INT
         ;
The resulting token's symbol is the rule name (expr).
Typically, shifted tokens already have a type and a value.
Reduces to a single expr (this is the start symbol).

Defining Rules
Backus-Naur Form (BNF)
    Most common (understood by YACC, Bison, RACC).
    The rules on the previous slide.
        argument_list : expression
                      | argument_list ',' expression
                      ;
Extended Backus-Naur Form (EBNF)
    Newer tools use it (Coco/R, Spirit, many others).
    Allows the use of operators: ? * + ( ) [ ] { } < >
    () -> group tokens, * -> zero or more
        argument_list = expression ( ',' expression )* ;
    {} -> zero or more of group
        argument_list ::= expression { ',' expression }
    Some EBNF parsers combine the lexer and parser :-/
BNF <=> EBNF (EBNF adds some syntactic sugar)

Top-Down or Bottom-Up?
Top-Down Parsing (aka Top-Bottom Parsing)
    LL (Left-to-right, Leftmost derivation)
    Expand the start symbol from the left to get the input.
    Recursive descent parsers are the most common.
    LL parsers are becoming popular (again).
Bottom-Up Parsing (aka Shift/Reduce Parsing)
    LR (Left-to-right, Rightmost derivation in reverse)
    Reduce the input from the right to get back to the start symbol.
    LALR parsers (Look-Ahead LR) are the most common.
    Many generators/grammars available.
    YACC, Bison, and RACC all generate LALR parsers.
Look ahead: LR(1), LL(1), ...
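To make the top-down side concrete, here is a small recursive descent sketch over [symbol, value] tokens. The class name TinyParser, its method names, and the two-rule grammar in the comment are assumptions made for this example; it evaluates the expression instead of building a tree to stay short.

```ruby
# Hypothetical recursive descent (LL(1)) evaluator, one method per rule:
#   expr = term ( ('+' | '-') term )* ;
#   term = INT  ( ('*' | '/') INT  )* ;
class TinyParser
  def initialize(tokens)
    @tokens = tokens.dup
  end

  def parse
    value = expr
    raise ArgumentError, "unexpected token #{@tokens.first.inspect}" unless @tokens.empty?
    value
  end

  private

  # Look ahead one token: the symbol of the next token, or nil at the end.
  def peek
    @tokens.first&.first
  end

  # Consume the next token, checking that its symbol is the expected one.
  def shift(expected)
    symbol, value = @tokens.shift
    raise ArgumentError, "expected #{expected.inspect}, got #{symbol.inspect}" unless symbol == expected
    value
  end

  def expr
    value = term
    while ['+', '-'].include?(peek)
      op = shift(peek)
      rhs = term
      value = op == '+' ? value + rhs : value - rhs
    end
    value
  end

  def term
    value = Integer(shift(:INT))
    while ['*', '/'].include?(peek)
      op = shift(peek)
      rhs = Integer(shift(:INT))
      value = op == '*' ? value * rhs : value / rhs
    end
    value
  end
end

p TinyParser.new([[:INT, '15'], ['+', '+'], [:INT, '2']]).parse
```

Precedence falls out of the rule nesting here: term binds tighter than expr because expr calls term.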

The Ruby Tools (http://i.loveruby.net/en/prog/)
Coco/R and CocoRb
    New (11/18/04), docs?
StringScanner
    For lexing, iterates over a String.
    Much faster than lexing with $', $`, and $&.
    Included with Ruby 1.8+.
RACC (Ruby yACC)
    An LALR parser generator.
    Pretty fast, allows Strings as token symbols.
    Runtime included with Ruby 1.8+.
    Parser generator (racc) not included with Ruby.
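A short sketch of lexing with StringScanner instead of $', $`, and $&. The class name ScanLexer and its shift method are hypothetical; the token patterns are the ones from the lexer slide, and only StringScanner#skip, #scan, #getch, and #eos? are used.

```ruby
require 'strscan'

# Hypothetical lexer class; token names follow the earlier lexer slide.
class ScanLexer
  def initialize(input)
    @scanner = StringScanner.new(input)
  end

  # Return the next [symbol, value] pair, or nil at end of input.
  def shift
    @scanner.skip(/\s+/) # skip leading whitespace, if any
    return nil if @scanner.eos?

    if (text = @scanner.scan(/[a-zA-Z_]\w*/))
      [:IDENTIFIER, text]
    elsif (text = @scanner.scan(/[0-9]+[ULul]*/))
      [:INTEGER, text]
    else
      text = @scanner.getch # any other character stands for itself
      [text, text]
    end
  end
end

lexer = ScanLexer.new("count + 10UL")
while (token = lexer.shift)
  p token
end
```

The scanner keeps a position inside the one String, so no new "rest of input" String is created per token.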

C's if Statement
Part of a RACC grammar for parsing C's if statement:
    statement    : '{' statement_list '}'
                 | if_statement
                 | expression ';'
                 ;
    if_statement : 'if' '(' expression ')' statement
                 | 'if' '(' expression ')' statement 'else' statement
1 shift/reduce conflict...

The Infamous Dangling else
    if (batman) if (robin) pow(); else zap();
Look ahead issues:
    The parser reduces pow(); to a statement.
    It looks ahead and sees else.
    Now it must choose:
        reduce if (robin) pow(); to a statement, or
        shift, getting if (robin) pow(); else
What to do?

...The Infamous Dangling else
A shift/reduce conflict. Be greedy (shift).
    Can't be correct in all cases, but this is the simplest way.
The parser looks ahead 1 token, but not behind...
    No way to know that 'if (robin) pow();' is nested.
LALR
    Look ahead 1 token before reducing (if necessary).
    Reducing from the right -> LR
    1. 'if' '(' <expression> ')' 'if' '(' <expression> ')' <statement> 'else' <statement>
    2. 'if' '(' <expression> ')' <statement>
    3. <statement>
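The greedy choice can be seen in a few lines of Ruby. This is a hand-written recursive descent sketch, not the LALR parser RACC would generate; the method name parse_statement, the simplified grammar in the comment, and the nested-array output are all assumptions for illustration. Calls like pow(); are shortened to pow ; to keep the token list small.

```ruby
# Hypothetical sketch: parse a flat token list into nested arrays.
#   statement = 'if' '(' NAME ')' statement [ 'else' statement ]
#             | NAME ';'
def parse_statement(tokens)
  if tokens.first == 'if'
    tokens.shift                  # 'if'
    tokens.shift                  # '('
    condition = tokens.shift      # NAME
    tokens.shift                  # ')'
    body = parse_statement(tokens)
    if tokens.first == 'else'     # greedy: take the else immediately (the "shift" choice)
      tokens.shift
      [:if, condition, body, parse_statement(tokens)]
    else
      [:if, condition, body]
    end
  else
    name = tokens.shift           # NAME
    tokens.shift                  # ';'
    [:call, name]
  end
end

tokens = %w[if ( batman ) if ( robin ) pow ; else zap ;]
p parse_statement(tokens)
```

The else ends up attached to the inner if (robin), which is the same result the greedy shift gives.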

Specifying Precedence
With the Rules
    cast_expression : unary_expression
                    | '(' type_name ')' cast_expression
                    ;
    multiplicative_expression : cast_expression
                    | multiplicative_expression '*' cast_expression
                    | multiplicative_expression '/' cast_expression
                    | multiplicative_expression '%' cast_expression
    additive_expression : multiplicative_expression
                    | additive_expression '+' multiplicative_expression
                    | additive_expression '-' multiplicative_expression
Other Facilities
    prechigh
      right '!' '~'
      right 'sizeof'
      left '*' '/' '%'
      left '+' '-'
      left '<' '<=' '>' '>='
      left '==' '!='
      left '&' '^' '|'
      left '&&' '||'
      nonassoc POINTER
    preclow
talk about building rules, rule recursion

Recursive Rules
Handling a C cast
    cast_expression : unary_expression
                    | '(' type_name ')' cast_expression
                    ;
Building lists
    argument_list : expression
                  | argument_list ',' expression
Nested blocks
    statement_list : statement
                   | statement_list statement
                   ;
    statement : '{' statement_list '}'
              | if_statement
              | expression ';'

skeleton.y
    class MyParser # could be MyModule::MyParser
      prechigh
        nonassoc UMINUS
        left '*' '/'
      preclow
      # token symbols created by the lexer
      token IDENTIFIER INTEGER STRING CHARACTER
      expect 1 # number of expected shift/reduce conflicts
    rule
      # bogus rule
      target : /* blank */ { result = nil }
             ;
    end
    ---- header ----
    # stuff that will come before the definition of MyParser
    ---- inner ----
    # inside the class definition of MyParser
    ---- footer ----
    # stuff that will come after the definition of MyParser

RACC Actions
Define your own inside curly braces:
    expr : expr '*' expr { result = val[0] * val[2] }
Constructing the parse tree
    result : value the left hand side is reduced to
        value of the parent in the parse tree
        its token symbol is the rule name
    val : array of left hand side values
        children of result
        (in standard grammar terms, the values of the symbols to the right of the colon in the rule)
    _values : array of right hand side values
        not reduced by the current action, see next slide
The default action is { result = val[0] }

Building the Parse Tree
LHS = [:INT, 5], ['*', '*'], [:INT, 3] = val
RHS = /* nothing */                    = _values

                  expr                  = result
        ============|===========
       expr                   expr
        |           |          |
    [:INT, 5], ['*', '*'], [:INT, 3], ['+', '+'], [:INT, 2]

LHS = [:INT, 5], ['*', '*'], [:INT, 3]
RHS = [:INT, 2], ['+', '+']

                                         expr
                              expr                   expr
                               |           |          |
    [:INT, 2], ['+', '+'], [:INT, 5], ['*', '*'], [:INT, 3]

If the action is result = val[0]*val[2], then the LHS reduces to [expr, 15].

RACC API
Generated parser is a subclass of Racc::Parser.
    Can't subclass other classes :( mix-ins OK
Entry points to the parser
    do_parse()
        next_token() is called on each shift.
    yyparse(receiver, :method)
        For each shift: receiver.method { |token| ... }
        yyparse([[:INTEGER,'2'],['+','+'],[:INTEGER,'3']], :each)
on_error(err_token_id, err_value, value_stack)
    Called when a parse error occurs (optional override).
    Use token_to_str(err_token_id) to get the name String.
Parser exits automatically at end of input (if no errors).
    [false, false] token signals end of input; the symbol must be false.
    Returns the value of the start symbol / root of the parse tree.
yyaccept()
    Exit parser. Returns val[0], NOT result.
    Error raised if using yyparse() and more tokens in receiver.
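A sketch of the token-source side of do_parse(), with no dependency on racc itself. The class name TokenQueue and the constant END_OF_INPUT are hypothetical; in a real grammar file, next_token would usually be defined in the ---- inner ---- section of the generated parser and would pull from a lexer.

```ruby
# Hypothetical token source showing the next_token protocol described above.
class TokenQueue
  # The end-of-input marker: the symbol must be false.
  END_OF_INPUT = [false, false].freeze

  def initialize(tokens)
    @tokens = tokens.dup
  end

  # Hand out one [symbol, value] pair per call, then [false, false] forever.
  def next_token
    @tokens.shift || END_OF_INPUT
  end
end

queue = TokenQueue.new([[:INTEGER, '2'], ['+', '+'], [:INTEGER, '3']])
4.times { p queue.next_token }
```

Returning the end marker repeatedly is a deliberate choice here, so a caller that asks once too often still gets a well-formed token.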

Final Tips
Regexp look ahead: /[0-9]+(?=[^\.])/
    shouldn't have to use much, if at all
Put most common matches first (if you can)
Need immutable tokens? (#freeze)
Streaming?
Limit object creation / destruction
    Use StringScanner
    Reuse tokens (symbols are nice)
    Use a hash for common tokens / reserved words
        if, do, while, +, -, *, /, ==
        h.fetch(token_value) { |v| [:IDENTIFIER, v] }
Single pass / multiple passes
show hashes in ctokenizer.rb using hashes to identify reserved words
talk about StringScanner API
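The hash tip can be sketched as follows. The names COMMON_TOKENS and token_for are hypothetical, and this is not the code from ctokenizer.rb, which is not shown in the slides; it only illustrates reusing frozen tokens and falling back to an identifier with Hash#fetch.

```ruby
# Hypothetical table of shared, frozen tokens for reserved words and operators.
COMMON_TOKENS = %w[if do while + - * / ==].each_with_object({}) { |text, table|
  table[text] = [text, text].freeze
}.freeze

# Look up a lexeme; anything not in the table becomes an identifier token.
def token_for(token_value)
  COMMON_TOKENS.fetch(token_value) { |v| [:IDENTIFIER, v] }
end

p token_for('while')
p token_for('counter')
```

Every 'while' in the input maps to the same frozen Array, so reserved words and operators cost no new objects after the table is built.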

Questions?
