COMP261 2015 Parsing 2 of 4 Lecture 22


COMP261 Parsing 2 of 4 Lecture 22

How do we write programs to do this? The process of getting from the input string to the parse tree consists of two steps:
1. Lexical analysis: converting the sequence of characters into a sequence of tokens. Note that java.util.Scanner allows us to do lexical analysis with great ease!
2. Syntactic analysis, or parsing: analysing the sequence of tokens to determine its grammatical structure with respect to a given grammar.
The assignment will require you to write a recursive descent parser, discussed in the next lecture!
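Step 1 with the default Scanner can be sketched as below (the class name and input string are invented for illustration; the Scanner's default delimiter is whitespace):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class LexDemo {
    // Break an input string into whitespace-separated tokens.
    public static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Scanner s = new Scanner(input);   // default delimiter: whitespace
        while (s.hasNext()) {
            result.add(s.next());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("add ( 1 , 2 )"));
    }
}
```

This only works when every token is surrounded by spaces, which is exactly the restriction the next slide addresses.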

Using a Scanner for Lexical Analysis
We need to separate the text into a sequence of tokens. The Java Scanner, by default, separates at white space:
figure.walk(45,Math.min(Figure.stepSize,figure.curSpeed));
⇒ white space is not good enough!!
The Java Scanner can use a more complicated pattern to separate the tokens. It can use a “Regular Expression” string with “wild cards”:
–* + ? : specifying repetitions
–[-+*/] \d \s : specifying sets of possible characters
–| : specifying alternatives
–(?=end) (?<=begin) : specifying pre- and post-context
eg: scan.useDelimiter("(?<=>)\\s*|\\s*(?=<)");
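A regular-expression delimiter solves the figure.walk problem above. The sketch below uses a delimiter of my own devising (the character class [(),.;] is an assumption, not from the lecture) that treats each punctuation character as a token in its own right and discards surrounding whitespace:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {
    // Split at the punctuation characters ( ) , . ; keeping each of
    // them as a token, and discard any whitespace around them.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Scanner s = new Scanner(input);
        s.useDelimiter("\\s*(?=[(),.;])|(?<=[(),.;])\\s*");
        while (s.hasNext()) {
            tokens.add(s.next());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
            "figure.walk(45, Math.min(Figure.stepSize, figure.curSpeed));"));
    }
}
```

The lookahead (?=…) and lookbehind (?<=…) are zero-width, so the punctuation characters themselves are not consumed by the delimiter and come back as tokens.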

Parsing text
Given some text and a grammar (a specification of the non-terminals of the grammar):
First, Lexical analysis: break the text up into a sequence of tokens.
Second, Parsing: (a) check whether the text meets the grammar rules, or (b) construct the parse tree for the text, according to the grammar.

Lexical Analysis
The simplest approach (spaces between tokens):
–Use the standard Java Scanner class
–Make sure that all the tokens are separated by white space (and don't contain any white space)
⇒ the Scanner will return the sequence of tokens
–Very restricted: eg, couldn't separate the tokens in html
More powerful approach:
–Use the standard Java Scanner class
–Define a delimiter that separates all the tokens
the delimiter is a Java regular expression
text matching the delimiter will not be returned in the tokens
eg scan.useDelimiter("(?<=>)\\s*|\\s*(?=<)"); would separate the tokens for the html grammar:

Delimiter: "(?<=>)\\s*|\\s*(?=<)"
Given:
<title>Something</title>
<h1>My Header</h1>
<ul> <li>Item 1</li> <li>Item 42</li> </ul>
<em>Something really important</em>
the scanner would generate the tokens:
<title>  Something  </title>  <h1>  My Header  </h1>  <ul>  <li>  Item 1  </li>  <li>  Item 42  </li>  </ul>  <em>  Something really important  </em>
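The html delimiter can be sketched and checked directly (the input fragment here is my own small example): whitespace before a "<" or after a ">" is treated as a separator, so each tag and each run of text becomes a token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class HtmlTokens {
    // Delimit at tag boundaries: the delimiter matches any whitespace
    // immediately after a ">" or immediately before a "<", so the
    // tags themselves survive as tokens.
    public static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Scanner s = new Scanner(html);
        s.useDelimiter("(?<=>)\\s*|\\s*(?=<)");
        while (s.hasNext()) {
            tokens.add(s.next());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<h1>My Header</h1> <li>Item 1</li>"));
    }
}
```

Note that the space inside "My Header" survives: it is not adjacent to a "<" or ">", so the delimiter does not match there.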

Lexical Analysis
Defining delimiters can be very tricky. Some languages (such as Lisp, html, xml) are designed to make it easy.
Better approach: define a pattern matching the tokens (instead of a pattern matching the separators between the tokens):
–Make a method that will search for and return the next token, based on the token pattern.
–The pattern is typically built from a combination of the patterns for each kind of token.
–The patterns are generally regular expressions ⇒ use a finite state automaton to match/recognise them.
There are tools to make this easier: eg LEX, JFLEX, ANTLR, …
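The "pattern matching the tokens" approach can be sketched with java.util.regex (the class name and the particular token pattern are my own illustration, not from the lecture): Matcher.find() searches for the next match of the token pattern, automatically skipping whitespace because whitespace matches no token.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tokenizer {
    // Pattern matching the tokens themselves (not the separators):
    // a name, a number, or a single operator/punctuation character.
    private static final Pattern TOKEN =
        Pattern.compile("[a-zA-Z_]\\w*|\\d+|[-+*/=(),;]");

    private final Matcher matcher;

    public Tokenizer(String input) {
        this.matcher = TOKEN.matcher(input);
    }

    // Search for and return the next token, or null at end of input.
    public String nextToken() {
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        Tokenizer t = new Tokenizer("size = (width + 1) * length;");
        for (String tok = t.nextToken(); tok != null; tok = t.nextToken()) {
            System.out.println(tok);
        }
    }
}
```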

Lexical analysis
It is desirable to return the type of each token, in addition to its text. If the lexical analyser is made up of patterns for each type of token, it can tell which type a token was from:
eg: size = (width + 1) * length;
⇒ 〈Name: “size”〉 〈Equals: “=”〉 〈OpenParen: “(”〉 〈Name: “width”〉 〈Operator: “+”〉 〈Number: “1”〉 〈CloseParen: “)”〉 〈Operator: “*”〉 〈Name: “length”〉
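One way to get typed tokens is to build the token pattern from named groups, one per token type, and report which group matched. This sketch collapses the slide's Equals/OpenParen/Operator types into a single Punct type for brevity (the class, group names, and "Type:text" output format are my own):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TypedLexer {
    // One named group per token type; which group is non-null
    // after a match tells us the type of the token.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<NAME>[a-zA-Z_]\\w*)|(?<NUMBER>\\d+)|(?<PUNCT>[-+*/=(),;])");

    public static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String type = m.group("NAME") != null ? "Name"
                        : m.group("NUMBER") != null ? "Number" : "Punct";
            out.add(type + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("size = (width + 1) * length;"));
    }
}
```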

Parsing text? Consider this example grammar:
Expr ::= Num | Add | Sub | Mul | Div
Add ::= “add” “(” Expr “,” Expr “)”
Sub ::= “sub” “(” Expr “,” Expr “)”
Mul ::= “mul” “(” Expr “,” Expr “)”
Div ::= “div” “(” Expr “,” Expr “)”
Num ::= an optional sign followed by a sequence of digits: [-+]?[0-9]+
Check the following texts:
add(div( 56, 8), mul(sub(0, 10 ), mul (-1, 3)))
div(div(86, 5), 67)
50
add(-5, sub(50, 50), 4)
div(100, 0)

Idea: Write a Program to Mimic the Rules!
Naïve top down recursive descent parsers have a method corresponding to each nonterminal, which calls the method for each nonterminal on the right-hand side and calls a scanner for each terminal!
For example, given a grammar:
FOO ::= “a” BAR | “b” BAZ
BAR ::= ….
the parser would have a method such as:

public boolean parseFOO(Scanner s) {
    if (!s.hasNext()) { return false; }     // PARSE ERROR
    String token = s.next();
    if (token.equals("a")) {
        return parseBAR(s);
    } else if (token.equals("b")) {
        return parseBAZ(s);
    } else {
        return false;                       // PARSE ERROR
    }
}

Top Down Recursive Descent Parser
A top down recursive descent parser is:
–built from a set of mutually recursive procedures
–each procedure usually implements one of the production rules of the grammar
⇒ the structure of the resulting program closely mirrors that of the grammar it recognises.
A naive parser:
–looks at the next token
–checks what the token is to decide which branch of the rule to follow
–fails if the token is missing or is of a non-matching type
–requires the grammar rules to be highly constrained: it must always be able to choose the next path given the current state and the next token.

A parser for arithmetic expressions
Expr ::= Num | Add | Sub | Mul | Div
Add ::= “add” “(” Expr “,” Expr “)”
Sub ::= “sub” “(” Expr “,” Expr “)”
Mul ::= “mul” “(” Expr “,” Expr “)”
Div ::= “div” “(” Expr “,” Expr “)”
mul(sub(mul(65, 74), add(68, 25) ), add(div(5, 3), 15))
(Parse tree diagram: the Expr at the root is a Mul; its children are a Sub (containing a Mul and an Add) and an Add (containing a Div and a Num).)

Using the Scanner
Break the input into tokens, using a Scanner with a delimiter:

public void parse(String input) {
    Scanner s = new Scanner(input);
    s.useDelimiter("\\s*(?=[(),])|(?<=[(),])\\s*");
    if (parseExpr(s)) {
        System.out.println("That is a valid expression");
    }
}

This breaks the input into a sequence of tokens: spaces are separator characters and not part of the tokens, and the tokens are also delimited at round brackets and commas, which become tokens in their own right.
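Putting the two phases together, here is a minimal recursive descent recogniser for the expression grammar above, using that delimiter (a sketch under my own class and method names, not the assignment's required structure):

```java
import java.util.Scanner;

public class ExprParser {
    // Recogniser for: Expr ::= Num | Add | Sub | Mul | Div
    public static boolean parse(String input) {
        Scanner s = new Scanner(input);
        // Tokens are separated at whitespace, round brackets and commas;
        // brackets and commas are tokens in their own right.
        s.useDelimiter("\\s*(?=[(),])|(?<=[(),])\\s*");
        return parseExpr(s) && !s.hasNext();   // must consume all input
    }

    private static boolean parseExpr(Scanner s) {
        if (!s.hasNext()) { return false; }              // parse error
        String token = s.next();
        if (token.matches("[-+]?[0-9]+")) { return true; }   // Num
        if (token.matches("add|sub|mul|div")) {   // Add | Sub | Mul | Div
            return s.hasNext() && s.next().equals("(")
                && parseExpr(s)
                && s.hasNext() && s.next().equals(",")
                && parseExpr(s)
                && s.hasNext() && s.next().equals(")");
        }
        return false;                                    // parse error
    }

    public static void main(String[] args) {
        System.out.println(parse("add(div( 56, 8), mul(sub(0, 10 ), mul (-1, 3)))"));
        System.out.println(parse("add(-5, sub(50, 50), 4)"));
    }
}
```

Running it against the five test texts from the earlier slide: the fourth, add(-5, sub(50, 50), 4), fails because Add takes exactly two Expr arguments, while div(100, 0) is syntactically valid (the grammar says nothing about division by zero).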