Recognizers 16-Jan-19.


Parsers and recognizers
Given a grammar (say, in BNF) and a string,
  A recognizer will tell whether the string belongs to the language defined by the grammar
  A parser will try to build a tree corresponding to the string, according to the rules of the grammar
Input string | Recognizer result | Parser result
2 + 3 * 4    | true              | [parse tree]
2 + 3 *      | false             | Error

Building a recognizer
One way of building a recognizer from a grammar is called recursive descent
Recursive descent is pretty easy to implement, once you figure out the basic ideas
Recursive descent is a great way to build a “quick and dirty” recognizer or parser
Production-quality parsers often use more sophisticated and efficient techniques
In the following slides, I’ll talk about how to do recursive descent, and give some examples in Java

Review of BNF and EBNF
“Plain” BNF
  < > indicate a nonterminal that needs to be further expanded, for example, <variable>
  Symbols not enclosed in < > are terminals; they represent themselves, for example, if, while, (
  The symbol ::= means is defined as
  The symbol | means or; it separates alternatives, for example, <addop> ::= + | -
Extended BNF
  [ ] enclose an optional part of the rule
    Example: <if statement> ::= if ( <condition> ) <statement> [ else <statement> ]
  { } mean the enclosed can be repeated zero or more times
    Example: <parameter list> ::= ( ) | ( { <parameter> , } <parameter> )

Recognizing simple alternatives, I
Consider the following BNF rule: <add_operator> ::= + | -
That is, an add operator is a plus sign or a minus sign
To recognize an add operator, we need to get the next token, and test whether it is one of these characters
If it is a plus or a minus, we simply return true
But what if it isn’t?
  We not only need to return false, but we also need to put the token back because it doesn’t belong to us, and some other grammar rule probably wants it
Our tokenizer needs to be able to take back tokens
  Usually, it’s enough to be able to put just one token back
  More complex grammars may require the ability to put back several tokens

Recognizing simple alternatives, II
Our rule is <add_operator> ::= + | -
Our method for recognizing an <add_operator> (which we will simply call addOperator) looks like this:
    public boolean addOperator() {
        Get the next token, call it t
        If t is a “+”, return true
        If t is a “-”, return true
        If t is anything else,
            put the token back
            return false
    }

Java code
    public boolean addOperator() {
        Token t = tokenizer.next();
        if (t.type() == Type.SYMBOL && t.value().equals("+")) {
            return true;
        }
        if (t.type() == Type.SYMBOL && t.value().equals("-")) {
            return true;
        }
        tokenizer.pushBack();
        return false;
    }
While this code isn’t particularly long or hard to read, we are going to have a lot of very similar methods

Helper methods
Remember the DRY principle: Don’t Repeat Yourself
If we turn each BNF production directly into Java, we will be writing a lot of very similar code
We should write some auxiliary or “helper” methods to hide some of the details for us
First helper method: private boolean symbol(String expectedSymbol)
  Gets the next token and tests whether it matches the expectedSymbol
  If it matches, returns true
  If it doesn’t match, puts the symbol back and returns false
We’ll look more closely at this method in a moment

Recognizing simple alternatives, III
Our rule is <add_operator> ::= + | -
Our pseudocode is:
    public boolean addOperator() {
        Get the next token, call it t
        If t is a “+”, return true
        If t is a “-”, return true
        If t is anything else,
            put the token back
            return false
    }
Thanks to our helper method, our actual Java code is:
    public boolean addOperator() {
        return symbol("+") || symbol("-");
    }

First implementation of symbol
Here’s what symbol does:
  Gets a token
  Makes sure that the token is a symbol
  Compares the symbol to the desired symbol (by value)
  If all the above is satisfied, returns true
  Else (if not satisfied) puts the token back, and returns false
    private boolean symbol(String value) {
        Token t = tokenizer.next();
        if (t.type() == Type.SYMBOL && value.equals(t.value())) {
            return true;
        } else {
            tokenizer.pushBack();
            return false;
        }
    }

Implementing symbol
We can implement methods name, number, and maybe eol the same way
All this code will look pretty much alike
  The main difference is in checking for the type
The DRY principle suggests we should use a helper method for symbol
    private boolean symbol(String expectedValue) {
        return nextTokenMatches(Type.SYMBOL, expectedValue);
    }

nextTokenMatches #1
The nextTokenMatches method should:
  Get a token
  Compare types and values
  Return true if the token is as expected
  Put the token back and return false if it doesn’t match
    private boolean nextTokenMatches(Type type, String value) {
        Token t = tokenizer.next();
        if (type == t.type() && value.equals(t.value())) {
            return true;
        } else {
            tokenizer.pushBack();
            return false;
        }
    }

nextTokenMatches #2
The previous method is fine for symbols, but what if we only care about the type?
For example, we want to get a number, any number
We need to compare only type, not value
Starting from the previous version, omit the value parameter and omit the test on the value:
    private boolean nextTokenMatches(Type type) {
        Token t = tokenizer.next();
        if (type == t.type()) return true;
        else tokenizer.pushBack();
        return false;
    }
It’s easier to overload nextTokenMatches than to combine the two versions, and both versions are fairly short, so we are probably better off with the code duplication

addOperator reprise
    public boolean addOperator() {
        return symbol("+") || symbol("-");
    }
    private boolean symbol(String expectedValue) {
        return nextTokenMatches(Type.SYMBOL, expectedValue);
    }
    private boolean nextTokenMatches(Type type, String value) {
        Token t = tokenizer.next();
        if (type == t.type() && value.equals(t.value())) return true;
        else tokenizer.pushBack();
        return false;
    }
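The pieces above can be assembled into something that actually runs. The following is only a sketch: the Token, Type, and Tokenizer classes here are hypothetical stand-ins (the slides do not show the real ones), and the tokenizer simply splits its input on whitespace and supports a one-token pushBack(). It includes both overloads of nextTokenMatches.

```java
class AddOperatorDemo {
    enum Type { NAME, NUMBER, SYMBOL, END }

    /** Hypothetical token class; the real Token used in the course may differ. */
    static final class Token {
        private final Type type;
        private final String value;
        Token(Type type, String value) { this.type = type; this.value = value; }
        Type type() { return type; }
        String value() { return value; }
    }

    /** Hypothetical tokenizer: whitespace-separated tokens, one-token push-back. */
    static final class Tokenizer {
        private final String[] words;
        private int position = 0;

        Tokenizer(String input) {
            String trimmed = input.trim();
            words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        Token next() {
            // Advance even at end of input so that pushBack() stays symmetric.
            String word = position < words.length ? words[position] : null;
            position++;
            if (word == null) return new Token(Type.END, "");
            if (word.matches("\\d+")) return new Token(Type.NUMBER, word);
            if (word.matches("[A-Za-z_]\\w*")) return new Token(Type.NAME, word);
            return new Token(Type.SYMBOL, word);
        }

        void pushBack() { position--; }
    }

    private final Tokenizer tokenizer;

    AddOperatorDemo(String input) { tokenizer = new Tokenizer(input); }

    public boolean addOperator() { return symbol("+") || symbol("-"); }

    public boolean number() { return nextTokenMatches(Type.NUMBER); }

    private boolean symbol(String expectedValue) {
        return nextTokenMatches(Type.SYMBOL, expectedValue);
    }

    // Version #1: both the type and the value must match.
    private boolean nextTokenMatches(Type type, String value) {
        Token t = tokenizer.next();
        if (type == t.type() && value.equals(t.value())) return true;
        tokenizer.pushBack();
        return false;
    }

    // Version #2: only the type must match.
    private boolean nextTokenMatches(Type type) {
        Token t = tokenizer.next();
        if (type == t.type()) return true;
        tokenizer.pushBack();
        return false;
    }

    public static void main(String[] args) {
        System.out.println(new AddOperatorDemo("+").addOperator()); // true
        System.out.println(new AddOperatorDemo("-").addOperator()); // true
        AddOperatorDemo r = new AddOperatorDemo("5");
        System.out.println(r.addOperator()); // false, and the 5 is put back
        System.out.println(r.number());      // true, because the 5 is still there
    }
}
```

The last two calls show why putting the token back matters: addOperator rejects the 5, and number can still find it.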

Sequences, I
Suppose we want to recognize a grammar rule in which one thing follows another, for example,
  <empty_list> ::= “[” “]”
  (I put quotes around these brackets to distinguish them from the EBNF metasymbols for “optional”)
Here’s some code we might try:
    public boolean emptyList() {
        return symbol("[") && symbol("]");
    }

Sequences, I
Here’s the grammar rule again: <empty_list> ::= “[” “]”
The code for this would be fairly simple...
    public boolean emptyList() {
        return symbol("[") && symbol("]");
    }
...except for one thing...
What happens if we get [ 5 ]?
  We recognize and accept the [
  We reject (and put back) the 5
  We cannot also put back the [, because we can only put back one thing
  Being able to put back two things isn’t enough, either: what about [ 1, 2, 3 ]?
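The failure described above can be reproduced with a small sketch. To keep it short, this version skips the Token class and treats tokens as plain strings held in an array; that simplification, the class name, and the position() accessor are assumptions made for illustration only.

```java
class EmptyListDemo {
    private final String[] tokens;
    private int position = 0;

    EmptyListDemo(String... tokens) { this.tokens = tokens; }

    private String next() {
        // Advance even at end of input so that pushBack() stays symmetric.
        String t = position < tokens.length ? tokens[position] : "<end>";
        position++;
        return t;
    }

    // Puts back exactly one token, like the tokenizer in the slides.
    private void pushBack() { position--; }

    private boolean symbol(String expected) {
        if (next().equals(expected)) return true;
        pushBack();
        return false;
    }

    public boolean emptyList() {
        return symbol("[") && symbol("]");
    }

    /** Number of tokens consumed so far (exposed only for the demonstration). */
    int position() { return position; }

    public static void main(String[] args) {
        EmptyListDemo ok = new EmptyListDemo("[", "]");
        System.out.println(ok.emptyList() + " " + ok.position());   // true 2

        EmptyListDemo bad = new EmptyListDemo("[", "5", "]");
        System.out.println(bad.emptyList() + " " + bad.position()); // false 1
    }
}
```

In the second case emptyList returns false but one token has been consumed: the "[" is gone, so no other rule can see it.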

Sequences, II
The grammar rule is <empty_list> ::= “[” “]”
And the token string contains [ 5 ]
Solution #1: Write a pushBack method that can push back more than one token at a time
  This will allow you to put back both the “[” and the “5”
  You have to be very careful of the order in which you return tokens
  This is a good use for a Stack
  But you never know when to quit!
Solution #2: Call it an error
  You might be able to get away with this, depending on the grammar
  For example, for any reasonable grammar, (2 + 3 +) is clearly an error
Solution #3: Change the grammar
  Tricky, and may not be possible
Solution #4: Combine rules
  See the next slide

Implementing a fancier pushBack()
To push back more than one token, you need to either:
  Make your tokenizer keep track of the last several tokens (and have a pushBack(int n) method), or
  Expect the calling program to tell you what tokens to push back (with a pushBack(Token t) method)
I’ve had you implement your own Tokenizer
  This was so you would understand state machines
  In practice, you would probably use Java’s built-in StreamTokenizer

Extending StreamTokenizer
java.io.StreamTokenizer does almost everything you need in a tokenizer
Its pushBack() method only “puts back” a single token
If you need more than that, you have to extend StreamTokenizer
To push back more than one token, you need to either:
  Make your extended tokenizer keep track of the last several tokens (and have a pushBack(int n) method), or
  Expect the calling program to tell you what tokens to push back (with a pushBack(Token t) method)
Plus, you will have to override nextToken()
  Inside your nextToken() method, you can call super.nextToken() to get the next never-before-seen token
  Your nextToken() method will also have to do something about nval and sval, such as provide methods to get these values
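One possible shape for the first option (keeping track of previously read tokens) is sketched below. The class name ReplayTokenizer and its pushBack(int n) method are hypothetical; only nextToken(), pushBack(), ttype, nval, and sval come from java.io.StreamTokenizer itself. This sketch keeps every token it has ever read, which is simple but would need trimming for large inputs.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class ReplayTokenizer extends StreamTokenizer {
    /** Snapshot of the three public fields that describe one token. */
    private static final class Saved {
        final int ttype;
        final double nval;
        final String sval;
        Saved(int ttype, double nval, String sval) {
            this.ttype = ttype; this.nval = nval; this.sval = sval;
        }
    }

    private final List<Saved> history = new ArrayList<>();
    private int position = 0; // index in history of the next token to hand out

    ReplayTokenizer(Reader reader) { super(reader); }

    @Override
    public int nextToken() throws IOException {
        if (position == history.size()) {
            super.nextToken(); // a never-before-seen token
            history.add(new Saved(ttype, nval, sval));
        }
        Saved s = history.get(position++);
        ttype = s.ttype; // restore the fields so callers can read nval and sval
        nval = s.nval;
        sval = s.sval;
        return ttype;
    }

    @Override
    public void pushBack() { pushBack(1); }

    /** Un-reads the last n tokens (hypothetical extension, not part of the JDK). */
    public void pushBack(int n) { position = Math.max(0, position - n); }

    public static void main(String[] args) throws IOException {
        ReplayTokenizer t = new ReplayTokenizer(new StringReader("[ 5 ]"));
        t.nextToken(); // [
        t.nextToken(); // 5
        t.pushBack(2); // put back both the [ and the 5
        System.out.println((char) t.nextToken()); // [
        t.nextToken();
        System.out.println(t.nval);               // 5.0
    }
}
```

Because pushBack() is overridden as well, the single-token push-back built into StreamTokenizer is never used, so the two mechanisms cannot get out of step.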

Sequences, III
Suppose the grammar really says <list> ::= “[” “]” | “[” <number> “]”
Now your pseudocode should look something like this:
    public boolean list() {
        if first token is “[” {
            if second token is “]” return true
            else if second token is a number {
                if third token is “]” return true
                else error
            }
            else error
        }
        else {
            put back first token
            return false
        }
    }

Sequences, IV
Another possibility is to revise the grammar (but make sure the new grammar is equivalent to the old one!)
Old grammar:
  <list> ::= “[” “]” | “[” <number> “]”
New grammar:
  <list> ::= “[” <rest_of_list>
  <rest_of_list> ::= “]” | <number> “]”
New pseudocode:
    public boolean list() {
        if first token is “[” {
            if restOfList() return true
            else error
        }
        else {
            put back first token
            return false
        }
    }
    private boolean restOfList() {
        if first token is “]”, return true
        if first token is a number and second token is a “]”, return true
        else return false
    }
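The revised grammar might translate into Java roughly as follows. This is a sketch under stated assumptions: tokens are plain strings in an array, a number is any run of digits, and once the “[” has been accepted any further mismatch is reported with error (Solution #2 from the earlier slide) rather than by pushing back several tokens. The class name and the position() accessor are hypothetical.

```java
class ListRecognizer {
    private final String[] tokens;
    private int position = 0;

    ListRecognizer(String... tokens) { this.tokens = tokens; }

    private String next() {
        // Advance even at end of input so that pushBack() stays symmetric.
        String t = position < tokens.length ? tokens[position] : "<end>";
        position++;
        return t;
    }

    private void pushBack() { position--; }

    private boolean symbol(String expected) {
        if (next().equals(expected)) return true;
        pushBack();
        return false;
    }

    private boolean number() {
        if (next().matches("\\d+")) return true;
        pushBack();
        return false;
    }

    private void error(String message) {
        throw new IllegalStateException(message);
    }

    // <list> ::= "[" <rest_of_list>
    public boolean list() {
        if (!symbol("[")) return false; // nothing consumed; some other rule may apply
        if (!restOfList()) error("Expected ] or a number after [");
        return true;
    }

    // <rest_of_list> ::= "]" | <number> "]"
    private boolean restOfList() {
        if (symbol("]")) return true;
        if (number()) {
            if (!symbol("]")) error("Missing ] after number");
            return true;
        }
        return false;
    }

    /** Number of tokens consumed so far (exposed only for the demonstration). */
    int position() { return position; }

    public static void main(String[] args) {
        System.out.println(new ListRecognizer("[", "]").list());      // true
        System.out.println(new ListRecognizer("[", "5", "]").list()); // true
        System.out.println(new ListRecognizer("5").list());           // false
    }
}
```

With this arrangement no method ever needs to put back more than one token: each method either fails before consuming anything or commits to its rule.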

Simple sequences in Java
Suppose you have this rule: <factor> ::= ( <expression> )
A good way to do this is often to test whether the grammar rule is not met
    public boolean factor() {
        if (symbol("(")) {
            if (!expression()) {
                error("Error in parenthesized expression");
            }
            if (!symbol(")")) {
                error("Unclosed parenthetical expression");
            }
            return true;
        }
        return false;
    }
To do this, you need to be careful that the “(” is not the start of some other production that can be used where a factor can be used
  In other words, be sure that if you get a “(” it must start a factor
Also, error(String) must throw an Exception. Why?

false vs. error
When should a method return false, and when should it report an error?
  false means that this method did not recognize its input
  Report an error if you know that something has gone wrong
    In other words, you know that no other method will recognize the input, either
    public boolean ifStatement() {
        if you don’t see “if”, return false
            // could be some other kind of statement
        if you don’t see a condition, return an error
            // “if” is a keyword that must start an if statement
        ...
    }
If you see if, and it isn’t followed by a condition, there is nothing else that it could be
This isn’t completely mechanical; you have to decide

Sequences and alternatives
Here’s the real grammar rule for <factor>:
  <factor> ::= <name> | <number> | ( <expression> )
And here’s the actual code:
    public boolean factor() {
        if (name()) return true;
        if (number()) return true;
        if (symbol("(")) {
            if (!expression()) error("Error in parenthesized expression");
            if (!symbol(")")) error("Unclosed parenthetical expression");
            return true;
        }
        return false;
    }

Recursion, I
Here’s an unfortunate (but legal!) grammar rule:
  <expression> ::= <term> | <expression> + <term>
Here’s some code for it:
    public boolean expression() {
        if (term()) return true;
        if (!expression()) return false;
        if (!addOperator()) return false;
        if (!term()) error("Error in expression after '+' ");
        return true;
    }
Do you see the problem?

Recursion, I
Here’s the rule again:
  <expression> ::= <term> | <expression> + <term>
And the code:
    public boolean expression() {
        if (term()) return true;
        if (!expression()) return false;
        if (!addOperator()) return false;
        if (!term()) error("Error in expression after '+' ");
        return true;
    }
If term() fails, expression() calls itself without having consumed any input
We aren’t recurring with a simpler case; therefore, we have an infinite recursion
Our grammar rule is left recursive (the recursive part is the leftmost thing in the definition)

Recursion, II
Here’s our unfortunate grammar rule again:
  <expression> ::= <term> | <expression> + <term>
Here’s an equivalent, right recursive rule:
  <expression> ::= <term> [ + <expression> ]
Here’s some (much happier!) code for it:
    public boolean expression() {
        if (!term()) return false;
        if (!addOperator()) return true;
        if (!expression()) error("Error in expression after '+'");
        return true;
    }
This works for the Recognizer, but will cause problems later
  We’ll cross that bridge when we come to it

Extended BNF: optional parts
Extended BNF uses brackets to indicate optional parts of rules
  Example: <if_statement> ::= if <condition> <statement> [ else <statement> ]
Pseudocode for this example:
    public boolean ifStatement() {
        if you don’t see “if”, return false
        if you don’t see a condition, return an error
        if you don’t see a statement, return an error
        if you see an “else” {
            if you see a “statement”, return true
            else return an error
        }
        else return true;
    }
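The pseudocode for the optional else part could be turned into Java along these lines. The slides do not define <condition> or <statement>, so this sketch invents minimal stand-ins purely for illustration: a condition is “(” name “)”, and a statement is either another if statement or a name followed by “;”. Tokens are plain strings, and all class and helper names are hypothetical.

```java
class IfStatementDemo {
    private final String[] tokens;
    private int position = 0;

    IfStatementDemo(String... tokens) { this.tokens = tokens; }

    private String next() {
        // Advance even at end of input so that pushBack() stays symmetric.
        String t = position < tokens.length ? tokens[position] : "<end>";
        position++;
        return t;
    }

    private void pushBack() { position--; }

    private boolean symbol(String expected) {
        if (next().equals(expected)) return true;
        pushBack();
        return false;
    }

    private boolean name() {
        String t = next();
        boolean isKeyword = t.equals("if") || t.equals("else");
        if (!isKeyword && t.matches("[A-Za-z_]\\w*")) return true;
        pushBack();
        return false;
    }

    private void error(String message) { throw new IllegalStateException(message); }

    // Assumed stand-in: <condition> ::= "(" <name> ")"
    private boolean condition() {
        if (!symbol("(")) return false;
        if (!name()) error("Missing name in condition");
        if (!symbol(")")) error("Unclosed condition");
        return true;
    }

    // Assumed stand-in: <statement> ::= <if_statement> | <name> ";"
    private boolean statement() {
        if (ifStatement()) return true;
        if (name()) {
            if (!symbol(";")) error("Missing ; after statement");
            return true;
        }
        return false;
    }

    // <if_statement> ::= if <condition> <statement> [ else <statement> ]
    public boolean ifStatement() {
        if (!symbol("if")) return false; // could be some other kind of statement
        if (!condition()) error("Missing condition after 'if'");
        if (!statement()) error("Missing statement after condition");
        if (symbol("else")) {            // the optional part
            if (!statement()) error("Missing statement after 'else'");
        }
        return true;
    }

    int position() { return position; }

    public static void main(String[] args) {
        System.out.println(new IfStatementDemo("if", "(", "x", ")", "y", ";").ifStatement());
        System.out.println(new IfStatementDemo(
                "if", "(", "x", ")", "y", ";", "else", "z", ";").ifStatement());
        System.out.println(new IfStatementDemo("y", ";").ifStatement());
    }
}
```

The three calls in main print true, true, and false. An input such as if y ; throws instead of returning false, because after the keyword if nothing but a condition can follow.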

Extended BNF: zero or more
Extended BNF uses braces to indicate parts of a rule that can be repeated
  Example: <expression> ::= <term> { + <term> }
Pseudocode for this example:
    public boolean expression() {
        if you don’t see a term, return false
        while you see a “+” {
            if you don’t see a term, return an error
        }
        return true
    }
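Putting the loop together with the earlier <factor> code gives a small but complete recognizer. In this sketch the rule <term> ::= <factor> { * <factor> } is an assumption (the slides never define <term>), tokens must be separated by spaces, and the recognizes method, which also checks that all input was used, is a hypothetical convenience.

```java
class ExpressionRecognizer {
    private final String[] tokens;
    private int position = 0;

    ExpressionRecognizer(String input) {
        String trimmed = input.trim();
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    private String next() {
        // Advance even at end of input so that pushBack() stays symmetric.
        String t = position < tokens.length ? tokens[position] : "<end>";
        position++;
        return t;
    }

    private void pushBack() { position--; }

    private boolean matches(String regex) {
        if (next().matches(regex)) return true;
        pushBack();
        return false;
    }

    private boolean symbol(String expected) {
        if (next().equals(expected)) return true;
        pushBack();
        return false;
    }

    private boolean name() { return matches("[A-Za-z_]\\w*"); }

    private boolean number() { return matches("\\d+"); }

    private void error(String message) { throw new IllegalStateException(message); }

    // <expression> ::= <term> { + <term> }
    public boolean expression() {
        if (!term()) return false;
        while (symbol("+")) {
            if (!term()) error("Error in expression after '+'");
        }
        return true;
    }

    // Assumed rule: <term> ::= <factor> { * <factor> }
    public boolean term() {
        if (!factor()) return false;
        while (symbol("*")) {
            if (!factor()) error("Error in term after '*'");
        }
        return true;
    }

    // <factor> ::= <name> | <number> | ( <expression> )
    public boolean factor() {
        if (name()) return true;
        if (number()) return true;
        if (symbol("(")) {
            if (!expression()) error("Error in parenthesized expression");
            if (!symbol(")")) error("Unclosed parenthetical expression");
            return true;
        }
        return false;
    }

    /** True if the whole input is exactly one expression. */
    public boolean recognizes() {
        return expression() && position == tokens.length;
    }

    public static void main(String[] args) {
        System.out.println(new ExpressionRecognizer("2 + 3 * 4").recognizes());     // true
        System.out.println(new ExpressionRecognizer("( 2 + x ) * 4").recognizes()); // true
        System.out.println(new ExpressionRecognizer("+").recognizes());             // false
    }
}
```

Note that an input such as 2 + 3 * is reported through error (an exception) rather than by returning false: once the * has been accepted, nothing but a factor can legally follow.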

Back to parsers
A parser is like a recognizer
The difference is that, when a parser recognizes something, it does something about it
Usually, what a parser does is build a tree
If the thing that is being parsed is a program, then
  You can write another program that “walks” the tree and executes the statements and expressions as it finds them
    Such a program is called an interpreter
  You can write another program that “walks” the tree and produces code in some other language (usually assembly language) that does the same thing
    Such a program is called a compiler
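To make the recognizer/parser distinction concrete, here is a deliberately tiny sketch of a parser and an interpreter for sums of numbers. It is not the approach these slides go on to use: for brevity each parse method here returns a tree node (or null where a recognizer would return false). The grammar <term> ::= <number>, the Node class, and the evaluate and show methods are all assumptions made for illustration.

```java
class TinyParser {
    /** A binary tree node: a number leaf, or a "+" with two children. */
    static final class Node {
        final String value;
        final Node left;
        final Node right;
        Node(String value, Node left, Node right) {
            this.value = value; this.left = left; this.right = right;
        }
    }

    private final String[] tokens;
    private int position = 0;

    TinyParser(String input) {
        String trimmed = input.trim();
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    private String next() {
        // Advance even at end of input so that pushBack() stays symmetric.
        String t = position < tokens.length ? tokens[position] : "";
        position++;
        return t;
    }

    private void pushBack() { position--; }

    private boolean symbol(String expected) {
        if (next().equals(expected)) return true;
        pushBack();
        return false;
    }

    // Assumed rule: <term> ::= <number>. Returns a leaf, or null if no number is next.
    Node term() {
        String t = next();
        if (t.matches("\\d+")) return new Node(t, null, null);
        pushBack();
        return null;
    }

    // <expression> ::= <term> { + <term> }, building a left-leaning tree.
    Node expression() {
        Node tree = term();
        if (tree == null) return null;
        while (symbol("+")) {
            Node right = term();
            if (right == null) throw new IllegalStateException("Error in expression after '+'");
            tree = new Node("+", tree, right);
        }
        return tree;
    }

    /** A minimal "interpreter": walks the tree and computes its value. */
    static int evaluate(Node n) {
        if (n.left == null) return Integer.parseInt(n.value);
        return evaluate(n.left) + evaluate(n.right);
    }

    /** Renders the tree with explicit parentheses so its shape is visible. */
    static String show(Node n) {
        if (n.left == null) return n.value;
        return "(" + show(n.left) + " " + n.value + " " + show(n.right) + ")";
    }

    public static void main(String[] args) {
        Node tree = new TinyParser("2 + 3 + 4").expression();
        System.out.println(show(tree));     // ((2 + 3) + 4)
        System.out.println(evaluate(tree)); // 9
    }
}
```

The control flow of expression is the same as in the recognizer; the only change is that each successful match contributes a node to the tree.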

Conclusions
If you start with a BNF definition of a language,
  You can write a recursive descent recognizer to tell you whether an input string “belongs to” that language (is a valid program in that language)
    Writing such a recognizer is a “cookbook” exercise: you just follow the recipe and it works (hopefully)
  You can write a recursive descent parser to create a parse tree representing the program
    The parse tree can later be used to execute the program
BNF is purely syntactic
  BNF tells you what is legal, and how things are put together
  BNF has nothing to say about what things actually mean

The End