Design Patterns for Recursive Descent Parsing
Dung Nguyen, Mathias Ricken & Stephen Wong, Rice University
RDP in CS2?

Context: an objects-first intro curriculum that already covers:
- Polymorphism
- Recursion
- Design patterns (visitors, factories, etc.)
- OOD principles

We want a good OOP/OOD example that is also a relevant CS topic. Recursive descent parsing fits: it offers smooth transitions from simple to complex examples while developing an abstract model, where a ∆ change in the grammar means only a ∆ change in the code.
The Problem of Teaching RDP

Mutual recursion! RDP is traditionally treated as "a complex, isolated, advanced topic for upper division only":
- Requires detailed, global analysis of the grammar
- Complex and non-modular; difficult to extract an overall abstraction
- Scaling up to parser generators is problematic
- Less useful from a pedagogical standpoint: a difficult example with which to learn recursion
- The path from a new grammar to new parser code is easy for a computer, but hard for humans
Object-Oriented Approach

The grammar must drive any processing related to it, e.g. parsing. Model the grammar first:
- Terminal symbols (tokens)
- Non-terminal symbols (including the start symbol)
- Rules

Driving forces:
- Decouple intelligent tokens from rules: rules become visitors to tokens
- Extensible system: an open-ended number of tokens requires extended visitors

Then the parsing will come!
- Intelligent tokens vs. switching on dumb tokens
- Rules as visitors vs. the Interpreter pattern on tokens
- Localized decisions
- Expresses the overall abstraction

Pedagogical aspects:
- The tangibility of objects makes the recursion easier to understand
- Easier to see how the grammar creates the parser
- Fits in with an OO curriculum: no new concepts to master
Representing Tokens

Intelligent tokens:
- No type checking!
- Decoupled from processing via the Visitor pattern

For an LL(1) grammar, in any given situation the current token determines the parsing action taken, so parsing is done by visitors to the tokens.
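The intelligent-token idea can be sketched in Java as follows. This is a minimal illustration, not the paper's actual code; all names (`ITokVisitor`, `NumToken`, `IdToken`, `execute`) are assumptions for the sketch:

```java
// Sketch: "intelligent" tokens processed by visitors (names illustrative).
// Each token calls back the visitor case for its own type, so no
// instanceof checks or type switches are ever needed.
interface ITokVisitor<R> {
    R numCase(NumToken t);
    R idCase(IdToken t);
}

abstract class Token {
    // The token itself selects the parsing action: no type checking!
    abstract <R> R execute(ITokVisitor<R> v);
}

class NumToken extends Token {
    final int value;
    NumToken(int value) { this.value = value; }
    <R> R execute(ITokVisitor<R> v) { return v.numCase(this); }
}

class IdToken extends Token {
    final String name;
    IdToken(String name) { this.name = name; }
    <R> R execute(ITokVisitor<R> v) { return v.idCase(this); }
}
```

A rule of the grammar is then just a visitor handed to whatever token is current; the token dispatches to the right case.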
Processing Tokens with Visitors

In the standard Visitor pattern, the visitor has one fixed case per token type (caseA, caseB, ...); each token, when visited, calls back its own case. But we want to be able to add an unbounded number of tokens!
Processing Tokens with Visitors

Modify the Visitor pattern with Chain of Responsibility: each visitor handles the token cases it knows and provides a defaultCase that delegates any unknown token to the next visitor in the chain. The chain as a whole handles any type of token!
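One way to realize this Chain of Responsibility variant is sketched below. The slide's design is diagrammatic, so this is an illustrative reconstruction: token cases are looked up by name, and an unknown token falls through to the successor in the chain (all identifiers here are assumptions):

```java
import java.util.HashMap;
import java.util.Map;

// A token exposes its name so a visitor can look up the matching case.
interface IToken { String name(); }

// Visitor extended with Chain of Responsibility: unknown tokens are
// delegated to the successor visitor; the end of the chain supplies
// the ultimate default case.
class ChainVisitor {
    private final Map<String, java.util.function.Function<IToken, String>> cases = new HashMap<>();
    private final ChainVisitor successor;  // next handler in the chain, or null

    ChainVisitor(ChainVisitor successor) { this.successor = successor; }

    void addCase(String tokenName, java.util.function.Function<IToken, String> handler) {
        cases.put(tokenName, handler);
    }

    String visit(IToken t) {
        java.util.function.Function<IToken, String> h = cases.get(t.name());
        if (h != null) return h.apply(t);                  // this visitor knows the token
        if (successor != null) return successor.visit(t);  // defaultCase: delegate down the chain
        return "unhandled:" + t.name();                    // end-of-chain default
    }
}
```

New token types can now be supported by adding a new visitor link to the chain, with no change to existing visitors.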
Modeling an LL(1) Grammar

Prepare the grammar for predictive parsing by left-factoring:

  E  → F | F + E

becomes

  E  → F E1
  E1 → empty | + E
  F  → num | id
Modeling an LL(1) Grammar

In rules with multiple alternatives (branches), replace sequences and tokens with unique non-terminal symbols, so that branches contain only non-terminals; sequences are separated from branches, and terminals get non-terminal wrappers:

  E   → F E1
  E1  → empty | E1a
  E1a → + E
  F   → F1 | F2
  F1  → num
  F2  → id
Modeling an LL(1) Grammar

- Branches are modeled by inheritance ("is-a"): multiple rules A → B | C form a union, e.g. F → F1 | F2 means F1 is an F and F2 is an F.
- Sequences are modeled by composition ("has-a"): a rule S → X Y is a composite with sequential processing, e.g. E1a → + E means E1a has a + and has an E.
- Each non-terminal needs only a local view of the grammar.
Object Model of Grammar

  E   → F E1
  E1  → empty | E1a
  E1a → + E
  F   → F1 | F2
  F1  → num
  F2  → id

Grammar structure = class structure
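The "grammar structure = class structure" correspondence can be sketched directly in Java: branches (|) become inheritance, sequences become composition. The `toString` methods are an illustrative addition, not part of the grammar model itself:

```java
// Sketch: the grammar mirrored as a class structure.
// Class names follow the slide's non-terminals.
class E {                      // E -> F E1 : sequence, so composition
    final F f; final E1 e1;
    E(F f, E1 e1) { this.f = f; this.e1 = e1; }
    public String toString() { return f + e1.toString(); }
}
abstract class E1 { }          // E1 -> empty | E1a : branch, so inheritance
class Empty extends E1 {
    public String toString() { return ""; }
}
class E1a extends E1 {         // E1a -> + E : has a "+" and has an E
    final E e;
    E1a(E e) { this.e = e; }
    public String toString() { return "+" + e; }
}
abstract class F { }           // F -> F1 | F2 : branch, so inheritance
class F1 extends F {           // F1 -> num : wrapper of a terminal
    final int num;
    F1(int num) { this.num = num; }
    public String toString() { return Integer.toString(num); }
}
class F2 extends F {           // F2 -> id : wrapper of a terminal
    final String id;
    F2(String id) { this.id = id; }
    public String toString() { return id; }
}
```

An instance of `E` is then exactly a parse tree for the grammar, e.g. `new E(new F1(1), new E1a(new E(new F2("x"), new Empty())))` represents `1+x`.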
Modeling an LL(1) Grammar

No predictive parsing table! The model is declarative, not procedural: model the grammar, not the parsing!
Abstract and Local Analysis, not Detailed and Global Analysis

Consider E → F E1. To process E, we must be able to process F and E1, independent of how either F or E1 is processed. A detailed, global analysis says: to process E, we must first know about F and E1; but to process F, we must first know about F1 and F2 (F → F1 | F2); and to process F1, we must first know about num (F1 → num). The processing of one rule would require deep knowledge of the whole grammar, and the rules are interdependent: one rule needs the functionality of another, a circular-relationship problem.

A delegation model breaks this circularity:
- Visitors to tokens determine the parsing that occurs due to the grammar rules, replacing switch statements.
- With visitors, we need to know neither which token we have nor which rules to follow; we can think in terms of abstract behaviors.
- Since parsing is done with visitors to tokens, all E needs in order to parse are the visitors that parse F and E1.

But E doesn't know what it takes to make the F and E1 parsing visitors... or does it? Abstract behaviors call for abstract construction: abstract factories create concrete instances of the abstract behaviors. The solution uses a branching factory and a sequence factory. Abstract factories decouple the rules.
Factory Model of Parser

  E   → F E1
  E1  → empty | E1a
  E1a → + E
  F   → F1 | F2
  F1  → num
  F2  → id

Parser structure = factory structure: the grammar is represented purely with composition.
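The factory idea can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the `IParser`/`IParserFac` interfaces, the token stream as an `Iterator<String>`, and the string result are all inventions for the sketch, not the paper's API. The point it demonstrates is that the factory for a sequence rule like E → F E1 composes the *factories* of its parts and never looks inside them:

```java
import java.util.Iterator;

// Assumed interfaces for the sketch: a parser consumes tokens,
// a factory abstractly constructs a parser.
interface IParser { Object parse(Iterator<String> tokens); }
interface IParserFac { IParser makeParser(); }

// Sequence factory, e.g. E -> F E1: composes two part-factories.
// It needs no knowledge of how the parts parse themselves.
class SeqFac implements IParserFac {
    private final IParserFac first, second;
    SeqFac(IParserFac first, IParserFac second) { this.first = first; this.second = second; }
    public IParser makeParser() {
        return toks -> {
            Object a = first.makeParser().parse(toks);
            Object b = second.makeParser().parse(toks);
            return "(" + a + " " + b + ")";
        };
    }
}

// Terminal-wrapper factory, e.g. F1 -> num: parses one expected token.
class TermFac implements IParserFac {
    private final String expected;
    TermFac(String expected) { this.expected = expected; }
    public IParser makeParser() {
        return toks -> {
            String t = toks.next();
            if (!t.equals(expected))
                throw new IllegalStateException("expected " + expected + ", got " + t);
            return t;
        };
    }
}
```

Because each rule's factory only holds the factories of the rules it references, the circular dependencies between rules never surface at construction time.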
Extending the Grammar

Adding new tokens and rules has a highly localized impact on the code, and no prediction tables need to be recomputed.
Original grammar:

  E   → F E1
  E1  → empty | E1a
  E1a → + E
  F   → F1 | F2
  F1  → num
  F2  → id

Extended grammar (adding parentheses and multiplication):

  E   → S E1
  E1  → empty | E1a
  E1a → + E
  S   → P | T
  P   → ( E )
  T   → F T1
  T1  → empty | T1a
  T1a → * S
  F   → F1 | F2
  F1  → num
  F2  → id
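The localized impact of the extension can be sketched in the same class-per-rule style: the new rules S → P | T, P → ( E ) and T → F T1 become new classes only, with no edits to the classes that already model E1, F, F1 or F2. The sketch below stands alone, using `String` placeholders for the composed parts (all names and the `toString` output are illustrative):

```java
// Sketch: extending the grammar adds classes; it does not modify old ones.
abstract class S { }           // new branch rule S -> P | T
class P extends S {            // new sequence rule P -> ( E )
    final String e;            // String placeholder for the composed E part
    P(String e) { this.e = e; }
    public String toString() { return "(" + e + ")"; }
}
class T extends S {            // new sequence rule T -> F T1
    final String f, t1;        // String placeholders for the F and T1 parts
    T(String f, String t1) { this.f = f; this.t1 = t1; }
    public String toString() { return f + t1; }
}
```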
Parser Demo (if time permits)

We change your grammar in two minutes while you wait!
Automatic Parser Generator

No additional theory is needed for the generalization: no fixed points, no FIRST and FOLLOW sets.

Kooprey parser generator: BNF → Java.
kou·prey (noun): "a rare short-haired ox (Bos sauveli) of forests of Indochina [...]" (Merriam-Webster Online)

Extensions: skip generation of source code and create the parser at runtime.
Conclusion

- Simple enough to introduce in a CS2 course (@Rice: near the end of CS2)
- Teaches an abstraction of grammars and parsing
- Reinforces foundational OO principles: abstract representations, abstract construction, decoupled systems, recursion

http://www.exciton.cs.rice.edu/research/sigcse05