Top-Down Parsing using Regular Expressions A seminar by Brian Westphal.


Questions and comments Please feel free to interrupt at any time throughout this presentation to make comments and ask questions. However, if there is a question that you feel might take more than a minute or two to explain, please wait until I specifically ask for questions.

Part I General Discussion

What is a top-down parser? Starts at the most general case (e.g. the source code for a whole computer program). Tries to reach more specific cases (e.g. variable, loop, init statement) until the string is broken down into its smallest elements. The end result is a tree of parts, each part described by a rule.

An example of a tree in English This tree is used to parse a Sentence, which is made up of an ordered sequence including Article, Noun, Verb, …. It starts with the most general rule, Sentence, and gets more specific.

The grammar (or rules) in BNF (Backus-Naur Form)
English::= Sentence*
Sentence::= (Article S)? Noun S Verb S (Preposition S)? (Article S)? Noun '/.'
Article::= 'a' | 'an' | 'the'
Noun::= 'house' | 'car' | 'person'
Verb::= 'plays' | 'sits' | 'goes'
Preposition::= 'on' | 'over' | 'above'
S::= ' '

Regular Expressions v Grammars Regular expressions compose the most specific elements in a grammar (e.g. 'a', 'car', etc.). BNF notation allows rules to be combined in a regular expression-like manner. The grammar holds a collection of rules which eventually terminate in regular expressions.

The process of parsing, by example We want to parse the following string with our example grammar: “A person sits on the car. A person sits on the house.” (ignore capitalization for now). Start with the most general rule, English (not denoted by the graph).

Try to match the string to rule English.
1. English is equal to Sentence*. In a loop (until all of the string is exhausted or failure is reached), match each Sentence.
2. Sentence is equal to (Article S)? Noun S …. Match each sub-rule, where “(Article S)?” is a single sub-rule, Noun is a single sub-rule, etc.
3. Article is equal to 'a' | 'an' | 'the'.

4. Try to match the string with 'a', 'an', and 'the'.
   – A match is found with 'a' (the longest of the acceptable matches).
5. Try to match the remaining string (chop off 'a') with rule S.
6. S is equal to ' ' (a single space). A match is found with ' '.
7. Try to match the remaining string with rule Noun.
8. Continue the pattern.
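Step 4 can be sketched in Java as follows. This is only an illustration built on `java.util.regex`, not the seminar's own regular expression classes; the class name `LongestAlternative` and the method `longestPrefix` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: tries every alternative at the start of the input
// and keeps the longest one that matches.
class LongestAlternative {
    static String longestPrefix(String input, String... alternatives) {
        String best = null;
        for (String alt : alternatives) {
            // Pattern.quote treats the alternative as a literal string.
            Matcher m = Pattern.compile(Pattern.quote(alt)).matcher(input);
            // lookingAt() anchors the match at the start of the input only.
            if (m.lookingAt() && (best == null || m.group().length() > best.length())) {
                best = m.group();
            }
        }
        return best; // null when no alternative matches
    }

    public static void main(String[] args) {
        String input = "a person sits on the car.";
        String article = longestPrefix(input, "a", "an", "the");
        System.out.println(article);                          // prints "a"
        String rest = input.substring(article.length());      // chop off the match
        System.out.println(longestPrefix(rest, " ") != null); // prints "true"
    }
}
```

The remaining input after each match is simply the substring past the matched prefix, which is what steps 5 to 7 work on.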

What happens if a rule does not match? It depends on the situation. If a rule is followed by a * or a ?, it can never fail, because it is not required to match. Otherwise, there is an error in the data: report an error message or try to fix it.

Failure/Success Examples
“sits on the car.” : failure (noun, space)
“person sits on the car.” : success
“person sits on the car” : failure (period)
“a car goes over person.” : success
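These four cases can be checked with a small hand-written recognizer for the Sentence rule. The sketch below is an assumption-laden illustration: it hard-codes the example grammar as Java methods instead of reading it from a grammar file, and the class name `SentenceRecognizer` and all of its methods are hypothetical.

```java
// Hypothetical recognizer for the rule
// Sentence::= (Article S)? Noun S Verb S (Preposition S)? (Article S)? Noun '/.'
class SentenceRecognizer {
    private final String input;
    private int pos = 0; // index of the next unread character

    SentenceRecognizer(String input) { this.input = input; }

    // Consumes the longest literal that matches at pos; false if none does.
    private boolean oneOf(String... literals) {
        int best = -1;
        for (String lit : literals) {
            if (input.startsWith(lit, pos) && lit.length() > best) best = lit.length();
        }
        if (best < 0) return false;
        pos += best;
        return true;
    }

    private boolean article()     { return oneOf("a", "an", "the"); }
    private boolean noun()        { return oneOf("house", "car", "person"); }
    private boolean verb()        { return oneOf("plays", "sits", "goes"); }
    private boolean preposition() { return oneOf("on", "over", "above"); }
    private boolean space()       { return oneOf(" "); }

    // (Article S)? never fails: on a partial match the position is restored.
    private void optionalArticle() {
        int start = pos;
        if (!(article() && space())) pos = start;
    }

    // (Preposition S)? behaves the same way.
    private void optionalPreposition() {
        int start = pos;
        if (!(preposition() && space())) pos = start;
    }

    boolean sentence() {
        optionalArticle();
        if (!(noun() && space() && verb() && space())) return false;
        optionalPreposition();
        optionalArticle();
        return noun() && oneOf(".");
    }

    // True when the whole string is exactly one Sentence.
    static boolean matches(String s) {
        SentenceRecognizer r = new SentenceRecognizer(s);
        return r.sentence() && r.pos == s.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("sits on the car."));        // false
        System.out.println(matches("person sits on the car.")); // true
        System.out.println(matches("person sits on the car"));  // false
        System.out.println(matches("a car goes over person.")); // true
    }
}
```

The required sub-rules return false on failure, while the optional groups only rewind the position, which mirrors the distinction between required rules and rules followed by ?.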

Part II Goals and Requirements

List of goals Parse any data that conforms to an EBNF grammar, such as source code. Use and extend our existing regular expression classes. Output a syntax tree describing the full structure of the parsed data. For the end result we also wish to develop a tool that builds parsers for us (based on specified EBNF grammar files).

Goal Requirements A class to process EBNF rules (also called productions). A class to process more basic rules. A class to process regular expressions as rules. A class to act as the tree structure.

A BNF grammar describing EBNF grammars Show EBNF.grammar

In other words An EBNF grammar (Extended Backus-Naur Form) supports a subtraction operation on top of regular BNF functionality. Subtraction (A-B) implies that, for the same section of data, the system matches A but does not match B. Clearly this is not a typical regular expression-like operation.
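One way to read the subtraction operation is sketched below in Java with `java.util.regex`. It assumes that "does not match B" means B must not match the entire section that A matched; the class `AndNot` and its `match` method are hypothetical names, not the seminar's classes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of A-B: accept the section matched by A at the start
// of the input unless B also matches that whole section.
class AndNot {
    static String match(Pattern a, Pattern b, String input) {
        Matcher ma = a.matcher(input);
        if (!ma.lookingAt()) return null;              // A does not match at all
        String section = ma.group();
        if (b.matcher(section).matches()) return null; // B matches the same section
        return section;
    }

    public static void main(String[] args) {
        Pattern word = Pattern.compile("[a-z]+");
        Pattern noun = Pattern.compile("house|car|person");
        System.out.println(match(word, noun, "sits on"));  // prints "sits"
        System.out.println(match(word, noun, "car goes")); // prints "null"
    }
}
```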

Part III Basic Class Descriptions

RETree The structure for the syntax tree. Each node in an RETree is either an RETree or an RE (regular expression). Each node may also be associated with a type (e.g. English, Sentence, Noun, etc.).

REProcessor Handles regular expressions in binary or unary operations as parts of productions. Binary modes (operations) are: andnot, or, follow. Unary modes are: follow, star, plus, maybe.

SubProductionProcessor Extension of REProcessor. Handles REProcessors and references to other productions and subproductions. In this way the unary and binary operations can be used for productions in general. Assigns a type in the tree to matching values.

ProductionProcessor Extension of SubProductionProcessor. Contains a reference array to the other productions. Breaks down a single rule into subproductions by splitting the rule into more manageable (unary or binary) pieces.

Part IV Code Overview

Coding RETree Take advantage of Java's polymorphism. RETree has two properties: Object branches and String type. branches can be of type String or LinkedList.

Useful functions in RETree Collapse – Returns a single string of the section matched by the RETree (e.g. for English in our previous example, it would be the two sentences; for Sentence it would be a whole sentence, etc.). Size – Returns the length of that string (so that we can know how much of the input we have used).
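A minimal sketch of such a tree in Java might look like the following. It assumes a leaf stores its matched text as a String and an inner node stores its children in a LinkedList; the class name `TreeSketch` and its constructor are hypothetical and are not the seminar's RETree source.

```java
import java.util.LinkedList;

// Hypothetical stand-in for RETree: branches is either a String (leaf)
// or a LinkedList of child TreeSketch nodes.
class TreeSketch {
    final Object branches;
    final String type; // e.g. "Sentence", "Noun"; may be null

    TreeSketch(String type, Object branches) {
        this.type = type;
        this.branches = branches;
    }

    // Concatenates the text of all leaves, left to right.
    String collapse() {
        if (branches instanceof String) return (String) branches;
        StringBuilder text = new StringBuilder();
        for (Object child : (LinkedList<?>) branches) {
            text.append(((TreeSketch) child).collapse());
        }
        return text.toString();
    }

    // Number of input characters covered by this node.
    int size() {
        return collapse().length();
    }

    public static void main(String[] args) {
        LinkedList<Object> children = new LinkedList<>();
        children.add(new TreeSketch("Article", "a"));
        children.add(new TreeSketch("S", " "));
        children.add(new TreeSketch("Noun", "car"));
        TreeSketch node = new TreeSketch("Sentence", children);
        System.out.println(node.collapse()); // prints "a car"
        System.out.println(node.size());     // prints "5"
    }
}
```

Because the caller knows how many characters a node covers, it can cut that many characters off the input before trying the next rule.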

Coding REProcessor Make some constants for the modes (operations). REProcessor has three properties: Object A, Object B, and byte mode. A and B can be of type REProcessor or RE. Mode will be one of the constant values.

Useful functions in REProcessor beginningMatches –Takes a string of input and returns an RETree if the REProcessor matches the beginning of the input (returns the longest match if multiple matches exist). Returns null if no match is found. evaluate –Calls beginningMatches for either REProcessors or REs

More on beginningMatches Must perform actions for each mode. Calls evaluate on substrings of the input for A and/or B as many times as is necessary to find the longest match or failure. Mode follow has both a unary and binary operation. In unary mode it checks if A matches, in binary mode it checks if A matches then B matches the section immediately following A.
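The mode dispatch described here can be sketched roughly as below. The sketch makes several assumptions that are not in the slides: operands are `java.util.regex.Pattern` objects instead of the seminar's RE class, a match is reported as a length (-1 for failure) instead of an RETree, binary follow does not backtrack into A, and andnot is left out to keep the example short. The class name `ModeSketch` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for REProcessor.
class ModeSketch {
    static final byte FOLLOW = 0, OR = 1, STAR = 2, PLUS = 3, MAYBE = 4;

    final byte mode;
    final Object a; // Pattern or ModeSketch
    final Object b; // Pattern or ModeSketch; null for unary modes

    ModeSketch(byte mode, Object a, Object b) {
        this.mode = mode;
        this.a = a;
        this.b = b;
    }

    // Dispatches on the operand type; returns the matched length or -1.
    static int evaluate(Object operand, String input) {
        if (operand instanceof ModeSketch) {
            return ((ModeSketch) operand).beginningMatches(input);
        }
        Matcher m = ((Pattern) operand).matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    int beginningMatches(String input) {
        int first = evaluate(a, input);
        switch (mode) {
            case FOLLOW: {
                if (first < 0 || b == null) return first; // unary follow: just A
                int second = evaluate(b, input.substring(first));
                return second < 0 ? -1 : first + second;  // A, then B right after it
            }
            case OR:
                return Math.max(first, evaluate(b, input)); // longer of the two
            case MAYBE:
                return Math.max(first, 0);                  // never fails
            case STAR:
            case PLUS: {
                int total = 0;
                int step = first;
                while (step > 0) { // stop on failure or on an empty match
                    total += step;
                    step = evaluate(a, input.substring(total));
                }
                return (mode == PLUS && first < 0) ? -1 : total;
            }
            default:
                throw new IllegalStateException("unknown mode " + mode);
        }
    }

    public static void main(String[] args) {
        Pattern word = Pattern.compile("[a-z]+");
        Pattern space = Pattern.compile(" ");
        ModeSketch wordThenSpace = new ModeSketch(FOLLOW, word, space);
        ModeSketch words = new ModeSketch(STAR, wordThenSpace, null);
        System.out.println(words.beginningMatches("a car goes.")); // prints "6"
    }
}
```

In the usage example the star stops after "a car " because "goes" is followed by a period, not a space, so six characters are consumed.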

Coding SubProductionProcessor SubProductionProcessor has two properties: String type and SubProductionProcessor [] subproduction. Uses the superclass functions for beginningMatches and evaluate (with a few minor changes). Not much code to this class.

Changes to beginningMatches First calls the superclass's beginningMatches function. If it is successful and no type has been assigned to the resulting tree, a type is added.

Changes to evaluate If the automaton passed to evaluate is of type Integer, the function gets the subproduction associated with the value and calls beginningMatches for it. Otherwise, it calls the superclass's evaluate function.

Coding ProductionProcessor ProductionProcessor has three properties: ProductionProcessor [] production, LinkedList subproductionlist, and int numsubproductions. The production array contains references to all other productions and the subproductions for the current production.

subproductionlist holds 5-tuples describing subproductions. These are only copied into actual subproductions during the InitializeProcessors function call. numsubproductions holds the number of subproductions currently in use (incremented during the BuildSubProductionProcessors function call).

A note about ProductionProcessor Pieces to be matched by ProductionProcessor are passed in chunks. This way, special characters are easy to process. For example, instead of passing the rule “(hello)?” as one string, we would pass "(", "hello", ")", "?". Searching for parentheses and the question mark this way is much quicker.
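To make the chunk format concrete, here is a rough Java sketch of how a rule string could be cut into such chunks. The slides only say that chunks are passed in, so the class `RuleChunker` and its `chunk` method are purely hypothetical, and the sketch ignores quoted regular expressions, which would need to be kept as single chunks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical chunker: operators and parentheses become their own chunks,
// runs of other characters become symbol chunks, whitespace is dropped.
class RuleChunker {
    static List<String> chunk(String rule) {
        List<String> chunks = new ArrayList<>();
        StringBuilder symbol = new StringBuilder();
        for (char c : rule.toCharArray()) {
            boolean special = "()?*+|-".indexOf(c) >= 0;
            if (special || Character.isWhitespace(c)) {
                if (symbol.length() > 0) {        // close the symbol in progress
                    chunks.add(symbol.toString());
                    symbol.setLength(0);
                }
                if (special) chunks.add(String.valueOf(c));
            } else {
                symbol.append(c);
            }
        }
        if (symbol.length() > 0) chunks.add(symbol.toString());
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk("(hello)?")); // prints "[(, hello, ), ?]"
    }
}
```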

Useful functions in ProductionProcessor InitializeProcessors –Adds intermediate subproduction 5-tuples to the production array so that subproductions and productions are regarded as inherently similar items. This takes great advantage of Java’s excellent polymorphism.

BuildSubProductionProcessors – Builds the intermediate 5-tuples representing the subproductions. This is the first step in a rather complex series of function calls which break down the production into unary and/or binary pieces.
– It begins by splitting the rule on or operations (the pipe symbol).
– Following order of operations, it then splits on the minus operation, leaving groups that no longer need to worry about or and minus operations.
– No more detail will be provided about this sequence as it is too complicated for one class period.
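The first two splitting steps can be illustrated with a short Java sketch. It assumes the rule has already been cut into chunks and that operators inside parentheses must be left for a later, nested pass; the class `TopLevelSplit` and its `split` method are hypothetical and do not reproduce the 5-tuple format used by the actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits a chunk list on an operator, but only where
// the operator is not nested inside parentheses.
class TopLevelSplit {
    static List<List<String>> split(List<String> chunks, String operator) {
        List<List<String>> parts = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int depth = 0; // current parenthesis nesting level
        for (String chunk : chunks) {
            if (chunk.equals("(")) depth++;
            if (chunk.equals(")")) depth--;
            if (depth == 0 && chunk.equals(operator)) {
                parts.add(current);           // close the piece before the operator
                current = new ArrayList<>();
            } else {
                current.add(chunk);
            }
        }
        parts.add(current);
        return parts;
    }

    public static void main(String[] args) {
        List<String> rule = Arrays.asList("a", "|", "(", "b", "|", "c", ")", "-", "d");
        List<List<String>> alternatives = split(rule, "|");  // first: split on or
        System.out.println(alternatives);
        System.out.println(split(alternatives.get(1), "-")); // then: split on minus
    }
}
```

Splitting on or first and minus second gives minus the tighter binding, which matches the order of operations described above.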

Changes to beginningMatches Calls beginningMatches for the first subproduction (as they are linked together, this will eventually call beginningMatches for all subproductions). A type is also assigned if none is available when the matching RETree is returned.

Part V Building a Parser Generator

Parsing an EBNF grammar To build a parser generator we must first complete our parser by making a class to parse EBNF grammar data. We have already seen the BNF grammar for EBNF grammars.

rule::= symbol whitespace eq whitespace expression
whitespace::= '[\t ]*'
symbol::= '[a-zA-Z0-9]+'
re::= hexCharacter | characterList | standardRE
//re supporting rules:
hexCharacter::= '/#x[0-9A-F]+'
characterList::= lbracket characterList1? rbracket
characterList1::= characterList2 | characterList2 characterList1
characterList2::= '[^/]//]' | escapeSequence
standardRE::= singleQuote standardRE1? singleQuote
standardRE1::= standardRE2 | standardRE2 standardRE1
standardRE2::= '[^/'//]' | escapeSequence
escapeSequence::= '//.'
lbracket::= '/['
rbracket::= '/]'
singleQuote::= '/''
//
eq::= '::='
expression::= group (whitespace or whitespace expression)? (whitespace or whitespace epsilon)?
group::= sequence (whitespace minus whitespace group)?
sequence::= sequenceLHS modifier? whitespace sequence?
sequenceLHS::= symbol | re | lparen whitespace expression whitespace rparen
modifier::= '[*+?]'
lparen::= '/('
rparen::= '/)'
minus::= '/-'
or::= '/|'
epsilon::= '_'

The Tokens
//Production constants
public static int PRODUCTIONS = 0;
public static final Integer RULE = new Integer (PRODUCTIONS++);
public static final Integer WHITESPACE = new Integer (PRODUCTIONS++);
public static final Integer SYMBOL = new Integer (PRODUCTIONS++);
public static final Integer RE = new Integer (PRODUCTIONS++);
public static final Integer HEXCHARACTER = new Integer (PRODUCTIONS++);
public static final Integer CHARACTERLIST = new Integer (PRODUCTIONS++);
…
public static ProductionProcessor [] production = new ProductionProcessor[PRODUCTIONS];

The Productions
Object [] parameters;
parameters = new Object[5];
parameters[0] = SYMBOL;
parameters[1] = WHITESPACE;
parameters[2] = EQ;
parameters[3] = WHITESPACE;
parameters[4] = EXPRESSION;
//rule ::= symbol whitespace eq whitespace expression
production[RULE.intValue ()] = new ProductionProcessor (production, "rule", parameters);
parameters = new Object[1];
parameters[0] = "[\t ]*";
//whitespace ::= '[\t ]*'
production[WHITESPACE.intValue ()] = new ProductionProcessor (production, "whitespace", parameters);
Continue with this process for each rule.

Make sure to call InitializeProcessors
//Initializing production processors
for (int index = 0; index < production.length; index++) {
    production[index].InitializeProcessors ();
}

Continuing On… That is pretty much it for the foundations of the parser generator. Actually writing the parser generator to do what you want is up to you. I will show you my code (or you can download it from but it is too difficult to present line by line.

The End Thank you for coming. Please feel free to ask questions and/or make comments at this time. Be sure to visit to download a copy of the discussed source code.