Copyright © 2006 The McGraw-Hill Companies, Inc. Programming Languages 2nd edition Tucker and Noonan Chapter 3 Lexical and Syntactic Analysis

"Syntactic sugar causes cancer of the semicolon." A. Perlis

Contents
3.1 Chomsky Hierarchy
3.2 Lexical Analysis
3.3 Syntactic Analysis

Lexical Analysis
3.1 Chomsky Hierarchy of Languages
3.2 Purpose of Lexical Analysis
Regular Expressions
  – regular expressions for the Clite lexicon
Finite State Automata (FSA)
  – FSA as a basis for a lexical analyzer
Lexical Analyzer (Lexer) Code

3.1 Chomsky Hierarchy
Each grammar class corresponds to a language class:
  – Regular grammars: lexical grammars
  – Context-free grammars: programming language syntax
  – Context-sensitive grammars: able to express some type rules
  – Unrestricted grammars: the most powerful; can express all features of languages such as C/C++

Chomsky Hierarchy
Context-sensitive and unrestricted grammars are not appropriate for developing translators:
  – Given a terminal string ω and an unrestricted grammar G, it is undecidable whether ω is in L(G), the language defined by G. For a context-sensitive grammar, membership is decidable but in general far too expensive for a practical translator.
  – For both classes, it is undecidable whether L(G) contains any strings at all.
A problem is decidable if you can write an algorithm that is guaranteed to solve the problem in a finite number of steps.

Regular Grammars (for Lexical Analysis)
In terms of expressive power, equivalent to:
  – Regular expressions
  – Finite-state automata

Context-Free Grammars
Capable of expressing the concrete syntax of programming languages
Equivalent to:
  – a pushdown automaton
Other grammar levels also correspond to theoretical machines (beyond the scope of this course; see CS 403 or 603)

3.2 Lexical Analysis
Input: a sequence of characters (the program)
Discard: whitespace, comments
Output: tokens
Define: A token is a logically cohesive sequence of characters representing a single symbol; e.g.
  – Identifiers: numberVal
  – Literals: 123, 5.67, 'x', true
  – Keywords: bool | char ...
  – Operators: + - * / ...
  – Punctuation: ; , ( ) { }

Character Sequences to Be Recognized by Clite Lexer (tokens + other)
Tokens:
  – Identifiers
  – Literals
  – Keywords
  – Operators
  – Punctuation
Other:
  – Whitespace: space or tab
  – Comments: // to end-of-line
  – End-of-line
  – End-of-file

Ways to Describe Lexical Elements
Natural language descriptions
Regular grammars
Regular expressions
Context-free grammars

Regular Expressions
Regular expressions (regexps) are patterns that describe a particular class of strings
  – Used for pattern matching
  – One regexp can describe or match many strings
Used in many text-processing applications
  – Python, Perl, Tcl, and UNIX utilities such as grep all use regular expressions

Using Regular Expressions
An alternative to regular grammars for expressing lexical syntax
Lexical-analyzer generator programs (e.g. Lex) take regular expressions as input and produce C/C++ programs that tokenize text.

With Regular Expressions You Can
Test for a pattern within a string (data validation).
  – For example, you can test an input string to see if a telephone number pattern or a credit card number pattern occurs within the string.
Replace text.
  – Use a regular expression to identify specific text in a document and either remove it completely or replace it with other text.
Extract a substring from a string based upon a pattern match.
Find specific text within a document or input field.

Regular Expression Notation
RegExpr      Meaning
x            a character x
\x           an escape character, e.g., \n or \t
{ name }     a reference to a name
M | N        M or N
M N          M followed by N
M*           zero or more occurrences of M
(On the original slide, metacharacters are shown in red.)

RegExpr      Meaning
M+           one or more occurrences of M
M?           zero or one occurrence of M
[aeiou]      the set of vowels; matches any one of them
[0-9]        the set of digits; matches any one of them ('-' is a metacharacter)
.            any single character (one-character wildcard)
\d           same as [0-9]
\w           same as [a-zA-Z0-9_]
\s           whitespace: [ \t\n]
Notation differs somewhat between regular-expression implementations.

Simple Example
gr[ae]y, (gray|grey) and gr(a|e)y are equivalent regexps. All three match either "gray" or "grey".

Pattern To Match a Date
In the form yyyy-mm-dd, yyyy.mm.dd, or yyyy/mm/dd (a blank is also accepted as a separator):
    (19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])
  – (19|20)\d\d : matches "19" or "20" followed by two digits
  – [- /.] : matches '-' or ' ' or '/' or '.'
  – (0[1-9]|1[012]) : the first option matches 01 through 09, the second matches 10, 11 or 12
  – (0[1-9]|[12][0-9]|3[01]) : the first option matches 01 through 09, the second 10 through 29, and the third 30 or 31
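The date pattern above can be tried out directly with `java.util.regex`. The following is a minimal sketch, not part of the original slides; the class name `DatePatternDemo` and the helper `isDate` are made up for illustration. It shows that the pattern checks only the shape of the string: it does not require the two separators to agree, and it does not know how many days each month has.

```java
import java.util.regex.Pattern;

class DatePatternDemo {
    // The slide's pattern; each backslash is doubled inside a Java string literal.
    static final Pattern DATE = Pattern.compile(
        "(19|20)\\d\\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])");

    // matches() succeeds only if the whole input fits the pattern.
    static boolean isDate(String s) {
        return DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDate("2006-03-15")); // well-formed date
        System.out.println(isDate("2006-13-15")); // month 13 is rejected
        System.out.println(isDate("2006-02-31")); // accepted: the pattern is not calendar-aware
    }
}
```

Full date validation would need a check beyond the regular expression, for example a calendar lookup after the match.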

Clite Lexical Syntax: Ancillary Definitions
Category Name   Definition
anyChar         [ -~]       // all printable ASCII chars; blank through tilde
letter          [a-zA-Z]
digit           [0-9]
whitespace      [ \t]       // blank or tab
eol             \n
eof             \004

Clite Lexical Syntax (regexp metacharacters are shown in red on the original slide)
Category      Definition
keyword       bool | char | else | false | float | if | int | main | true | while
identifier    {letter}({letter} | {digit})*
integerLit    {digit}+
floatLit      {digit}+\.{digit}+
charLit       '{anyChar}'
operator      = | || | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ]
separator     ; | . | { | } | ( | )
comment       // ({anyChar} | {whitespace})* {eol}
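As a rough illustration of how these definitions compose, the sketch below builds three of the categories in Java by string concatenation, since `java.util.regex` has no `{name}` reference syntax. The class `CliteLexicon` and the method `classify` are hypothetical names, not from the textbook. Keywords are deliberately not separated out here: every keyword also matches the identifier pattern.

```java
import java.util.regex.Pattern;

class CliteLexicon {
    // Ancillary definitions, reused by name below.
    static final String LETTER = "[a-zA-Z]";
    static final String DIGIT = "[0-9]";

    static final Pattern IDENTIFIER =
        Pattern.compile(LETTER + "(" + LETTER + "|" + DIGIT + ")*");
    static final Pattern INTEGER_LIT = Pattern.compile(DIGIT + "+");
    static final Pattern FLOAT_LIT = Pattern.compile(DIGIT + "+\\." + DIGIT + "+");

    // Classifies a complete lexeme; returns "unknown" if no category fits.
    static String classify(String lexeme) {
        if (IDENTIFIER.matcher(lexeme).matches()) return "identifier";
        if (FLOAT_LIT.matcher(lexeme).matches()) return "floatLit";
        if (INTEGER_LIT.matcher(lexeme).matches()) return "integerLit";
        return "unknown";
    }

    public static void main(String[] args) {
        for (String s : new String[] { "numberVal", "123", "5.67", "5.", "2x" }) {
            System.out.println(s + " : " + classify(s));
        }
    }
}
```

Note that `5.` is rejected because floatLit requires at least one digit on each side of the point.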

Lexical Analyzer Generators
Input: regular expressions
Output: a lexical analyzer
  – C/C++: Lex, Flex
  – Java: JLex
Regular grammars or regular expressions are converted to a deterministic finite state automaton (DFSA) and then to a lexical analyzer.

Elements of a Finite State Automaton
1. Set of states: represented by graph nodes
2. Input alphabet + unique end-of-input symbol
3. State transition function: represented as labelled, directed edges (arcs) connecting graph nodes
4. A unique start state
5. One or more final states

Deterministic FSA
Definition: A finite state automaton is deterministic if for each state and each input symbol, there is at most one outgoing arc from the state labelled with the input symbol.

A Finite State Automaton for Identifiers
Figure 3.2 (p. 64)

Use a DFSA to recognize (accept) or reject a string
Process the string, one character at a time, by making a series of moves:
  – Follow the exit arc that corresponds to the leftmost input symbol, thereby consuming it.
  – If there is no such arc, then either the input is accepted (if you are in a final state) or there is an error.
An input is accepted if, beginning from the start state, the automaton consumes all the input and halts in a final state.

Example
(S, a2i$) ├ (I, 2i$) ├ (I, i$) ├ (I, $) ├ (F, )
Thus: (S, a2i$) ├* (F, )
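The sequence of moves above can be simulated in a few lines of Java. This is a sketch under assumptions: Figure 3.2 is not reproduced in this transcript, so the transition function below is inferred from the trace (states S, I and F, with `$` as the end-of-input symbol), and the `ERROR` state and all names are added for illustration.

```java
class IdentifierDfsa {
    enum State { S, I, F, ERROR }

    static boolean isLetter(char c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }
    static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    // One move: at most one arc leaves each state for a given symbol (determinism).
    static State step(State s, char c) {
        switch (s) {
            case S:
                return isLetter(c) ? State.I : State.ERROR;
            case I:
                if (isLetter(c) || isDigit(c)) return State.I;
                return c == '$' ? State.F : State.ERROR;
            default:
                return State.ERROR; // F has no exit arcs; ERROR is a trap state
        }
    }

    // Accept only if all input is consumed and the automaton halts in F.
    static boolean accepts(String input) {
        State s = State.S;
        for (int i = 0; i < input.length(); i++) {
            s = step(s, input.charAt(i));
        }
        return s == State.F;
    }

    public static void main(String[] args) {
        System.out.println(accepts("a2i$")); // the trace from the slide
        System.out.println(accepts("2ai$")); // starts with a digit
    }
}
```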

Practical Issues
An explicit terminator (end-of-input symbol) is used only at the end of the program, not after each token.
The symbols l and d represent an arbitrary letter and digit, respectively.
An unlabelled arc represents any valid input symbol (other than those on labelled arcs leaving the same state).

Practical Issues
When a token is recognized, move to a final state (one with no exit arc).
When a non-token is recognized, move back to the start state.
Recognizing EOF means the end of the source code.
The automaton must be deterministic.
Recognize keywords as identifiers; then do a table look-up.

How It's Used
The lexer is called from the parser.
Parser:
  – Get next token
  – Parse next token
The lexer enters the Start state each time the parser calls for a new token.
The lexer enters a "Final" state when a legal token has been recognized. The character that causes the transition to the final state may be white space, or it may be the first character of the next token.

Figure 3.3 (p. 66): DFSA token recognizer

Lexer Code
The parser calls the lexer when it needs a new token.
The lexer must remember where it left off.
  – Sometimes the lexer gets one character ahead in the input; compare ab=13; to ab = 13 ;
  – In the first case, the identifier ab isn't recognized until the first character of the next token, =, is read.
  – In the second case, blanks signify the ends of tokens.

Lexer Code
Solutions:
  – peek function
  – pushback function
  – no symbol consumed by moving out of the start state; i.e., when the parser calls the lexer, the lexer already has the first character of the next token, probably in a variable ch
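The pushback solution can be illustrated with the standard `java.io.PushbackReader`, whose `unread` method returns a character to the stream. This is only a sketch of that one option, not the approach the textbook's lexer takes (it keeps the lookahead in the variable ch); the class and method names are invented for the example.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class PushbackDemo {
    // Reads one identifier from the front of src, pushes back the character
    // that ended it, and returns { identifier, next character still in the stream }.
    static String[] identThenNext(String src) {
        try (PushbackReader in = new PushbackReader(new StringReader(src))) {
            StringBuilder ident = new StringBuilder();
            int c = in.read();
            while (c != -1 && Character.isLetterOrDigit((char) c)) {
                ident.append((char) c);
                c = in.read();
            }
            if (c != -1) in.unread(c); // one character too far: give it back
            int next = in.read();
            return new String[] { ident.toString(), next == -1 ? "" : String.valueOf((char) next) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", identThenNext("ab=13;")));
        System.out.println(String.join("|", identThenNext("ab = 13 ;")));
    }
}
```

In both inputs the identifier `ab` ends only when a non-identifier character has been read; pushing that character back leaves the stream positioned at the start of whatever follows.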

From Design to Code
    private char ch = ' ';

    public Token next() {
        do {
            switch (ch) {
                ...
            }
        } while (true);
    }
Figure 3.4: Outline of Next Token Routine

Remarks
The do-while loop is exited only when a token is found.
The loop is exited via a return statement, which returns control to the parser.
The variable ch must be initialized to a space character; thereafter it always holds the next character to be processed.

Translation Rules
Pages 67 and 68 give rules for translating the DFSA into code.
A Java tokenizer method for Clite is shown on page 69 (Figure 3.5).
Auxiliary functions are described on pages 68 and 70.

    private boolean isLetter(char c) {
        // test the parameter c, not the field ch
        return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z';
    }

    private String concat(String set) {
        StringBuffer r = new StringBuffer("");
        do {
            r.append(ch);
            ch = nextChar();
        } while (set.indexOf(ch) >= 0);
        return r.toString();
    }

    // isLetter, isDigit, and concat are auxiliary methods
    public Token next() {
        do {
            if (isLetter(ch)) { // ident or keyword
                String spelling = concat(letters + digits);
                return Token.keyword(spelling);
            } else if (isDigit(ch)) { // numeric literal
                String number = concat(digits);
                if (ch != '.') // int literal
                    return Token.mkIntLiteral(number);
                number += concat(digits);
                return Token.mkFloatLiteral(number);
            }
            // continued on the next slide

            else switch (ch) {
                case ' ': case '\t': case '\r': case eolnCh:
                    ch = nextChar();
                    break;
                // omitted: '/', comments, '\'
                case eofCh:
                    return Token.eofTok;
                case '+':
                    ch = nextChar();
                    return Token.plusTok;
                …
                case '&':
                    check('&');
                    return Token.andTok;
                case '=':
                    return chkOpt('=', Token.assignTok, Token.eqeqTok);

Source:
    // a first program
    // with 3 comments
    int main ( ) {
        char c;
        int i;
        c = 'h';
        i = c + 3;
    } // main
Tokens:
    Token Type     Token
    Keyword        int
    Keyword        main
    Punctuation    (
    Punctuation    )
    Punctuation    {
    Keyword        char
    Identifier     c
    Punctuation    ;
    etc.
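To see the pieces working together, here is a self-contained sketch of a lexer in the style of the outline above. It is not the textbook's Figure 3.5: tokens are plain strings of the form `Type:text`, only a handful of operators and punctuation marks are handled, and names such as `MiniLexer` and `tokenize` are assumptions made for this example. It does keep two ideas from the slides: ch always holds the next unprocessed character, and keywords are first scanned as identifiers and then looked up in a table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MiniLexer {
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "bool", "char", "else", "false", "float", "if", "int", "main", "true", "while"));
    static final char EOF = '\004';

    private final String src;
    private int pos = 0;
    private char ch = ' '; // starts as a space; afterwards always the next unprocessed character

    MiniLexer(String src) { this.src = src; }

    private char nextChar() { return pos < src.length() ? src.charAt(pos++) : EOF; }
    private static boolean isLetter(char c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }
    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    // Returns the next token as "Type:text", or "Eof" at end of input.
    String next() {
        while (true) {
            if (isLetter(ch)) { // identifier or keyword
                StringBuilder sb = new StringBuilder();
                do { sb.append(ch); ch = nextChar(); } while (isLetter(ch) || isDigit(ch));
                String s = sb.toString();
                return (KEYWORDS.contains(s) ? "Keyword:" : "Identifier:") + s; // table look-up
            } else if (isDigit(ch)) { // numeric literal
                StringBuilder sb = new StringBuilder();
                do { sb.append(ch); ch = nextChar(); } while (isDigit(ch));
                if (ch != '.') return "IntLiteral:" + sb;
                do { sb.append(ch); ch = nextChar(); } while (isDigit(ch));
                return "FloatLiteral:" + sb;
            } else if (ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n') {
                ch = nextChar(); // non-token: stay in the start state
            } else if (ch == EOF) {
                return "Eof";
            } else if (ch == '/') {
                ch = nextChar();
                if (ch != '/') return "Operator:/";
                while (ch != '\n' && ch != EOF) ch = nextChar(); // skip comment to end of line
            } else if (ch == '\'') { // char literal: quote, one character, quote
                char c = nextChar();
                if (nextChar() != '\'') throw new IllegalStateException("bad char literal");
                ch = nextChar();
                return "CharLiteral:" + c;
            } else {
                char c = ch;
                ch = nextChar();
                if ("+-*=".indexOf(c) >= 0) return "Operator:" + c;
                if (";,(){}".indexOf(c) >= 0) return "Punctuation:" + c;
                throw new IllegalStateException("illegal character: " + c);
            }
        }
    }

    // Convenience wrapper: all tokens up to, but not including, Eof.
    static List<String> tokenize(String src) {
        MiniLexer lexer = new MiniLexer(src);
        List<String> out = new ArrayList<>();
        for (String t = lexer.next(); !t.equals("Eof"); t = lexer.next()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("// a first program\nint main ( ) {\n char c;\n c = 'h';\n}"));
    }
}
```

Because ch already holds the character after each token when `next()` returns, `ab=13;` and `ab = 13 ;` produce the same token sequence without any pushback.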

Contents
3.1 Chomsky Hierarchy
3.2 Lexical Analysis
3.3 Syntactic Analysis

Syntactic Analysis (The Parser)
Purpose: to recognize source code structure
Input: tokens
Output: parse tree or abstract syntax tree

Parsing Algorithms – two types
Top-down (recursive descent, LL):
  – Begin with the most general grammar rule (start symbol)
  – Expand downward using more specific rules
  – Leaves of the parse tree should match program tokens
  – Equivalent to a left-most derivation
Bottom-up (LR):
  – Start with the leaves (tokens)
  – Group them together to form interior tree nodes
  – End up at the root of the parse tree
  – Equivalent to a right-most derivation, traced in reverse

Partial Example: to parse x*y + z
Top down: Exp: Exp + term
Bottom up: x * y = term, then exp

Grammar for Parsing Example (left recursion removed for recursive descent parsing)
    Assignment → Identifier = Expression
    Expression → Term { AddOp Term }
    AddOp → + | -
    Term → Factor { MulOp Factor }
    MulOp → * | /
    Factor → [ UnaryOp ] Primary
    UnaryOp → - | !
    Primary → Identifier | Literal | ( Expression )

Recursive Descent Parsing
A recursive descent parser "builds" the parse tree in a top-down manner.
It defines a method/function for each nonterminal to recognize input derivable from that nonterminal.
Each method should:
  – Recognize the longest sequence of tokens (in the input stream) derivable from the nonterminal
  – Return an object which is the root of a subtree

Token Implementation
Tokens have two parts:
  – a type (e.g., Identifier, Literal)
  – a value (e.g., xyz, 3.45)
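One simple way to represent such a two-part token in Java is an immutable class holding a type and a value. The sketch below is an assumption for illustration and is not the textbook's Token class; the enum constants are chosen to mirror the token categories listed earlier.

```java
class Token {
    enum Type { IDENTIFIER, LITERAL, KEYWORD, OPERATOR, PUNCTUATION, EOF }

    private final Type type;    // the category of the token
    private final String value; // the characters that were matched

    Token(Type type, String value) {
        this.type = type;
        this.value = value;
    }

    Type type() { return type; }
    String value() { return value; }

    @Override
    public String toString() { return type + "(" + value + ")"; }

    public static void main(String[] args) {
        System.out.println(new Token(Type.IDENTIFIER, "xyz"));
        System.out.println(new Token(Type.LITERAL, "3.45"));
    }
}
```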

Auxiliary Functions for the Parser
match( ) compares the current token to the expected token t:
  – If they match, get the next token and return
  – Else display a syntax error message
error( ) displays the error message and exits.

    private String match(TokenType t) {
        String value = token.value();
        if (token.type().equals(t))
            token = lexer.next(); // token is a global variable
        else
            error(t);
        return value;
    }

    private void error(TokenType tok) {
        System.err.println("Syntax error: expecting: " + tok + "; saw: " + token);
        System.exit(1);
    }

Building the Parser - 1
General idea: begin with the start symbol.
For simplicity, assume a parser for assignments:
    Assignment → Identifier = Expression
Skeleton method:
    private Assignment assignment() {
        ...
        return new Assignment(...);
    }

Building the Parser - 2
    Assignment → Identifier = Expression
1. Get next token; call Match: If (Token ≠ identifier) then error, else get next token
2. If (Token ≠ '=') then error, else get next token
3. Call method to identify Expression
    Expression → Term { (+ | -) Term }

Building the Parser – 3 (see code on page 79)
The Expression method immediately calls Term:
    Term → Factor { (* | /) Factor }
The Term method then calls Factor:
    Factor → [ UnaryOp ] Primary
Factor processes UnaryOp (if needed) and then calls Primary:
    Primary → Identifier | Literal | ( Expression )
If (Token ≠ Identifier) and (Token ≠ Literal) and (Token ≠ '(') then error
Else take appropriate action based on the value of Token
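The three steps above translate almost mechanically into one method per nonterminal. The following sketch is a recognizer only: it accepts or rejects a token sequence and builds no tree. It is not the code from page 79; tokens are simplified to plain strings, keywords are not distinguished from identifiers, and every name in it is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.List;

class ExprRecognizer {
    private final List<String> tokens;
    private int pos = 0;

    ExprRecognizer(List<String> tokens) { this.tokens = tokens; }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : "<eof>"; }

    // Consume the expected token or report a syntax error.
    private void match(String expected) {
        if (!peek().equals(expected))
            throw new IllegalStateException("Syntax error: expecting " + expected + "; saw " + peek());
        pos++;
    }

    // Assignment -> Identifier = Expression
    void assignment() {
        if (!isIdentifier(peek()))
            throw new IllegalStateException("Syntax error: expecting identifier; saw " + peek());
        pos++;
        match("=");
        expression();
    }

    // Expression -> Term { AddOp Term }
    void expression() {
        term();
        while (peek().equals("+") || peek().equals("-")) { pos++; term(); }
    }

    // Term -> Factor { MulOp Factor }
    void term() {
        factor();
        while (peek().equals("*") || peek().equals("/")) { pos++; factor(); }
    }

    // Factor -> [ UnaryOp ] Primary
    void factor() {
        if (peek().equals("-") || peek().equals("!")) pos++;
        primary();
    }

    // Primary -> Identifier | Literal | ( Expression )
    void primary() {
        String t = peek();
        if (t.equals("(")) { match("("); expression(); match(")"); }
        else if (isIdentifier(t) || isLiteral(t)) pos++;
        else throw new IllegalStateException("Syntax error: unexpected " + t);
    }

    private static boolean isIdentifier(String t) { return t.matches("[a-zA-Z][a-zA-Z0-9]*"); }
    private static boolean isLiteral(String t) { return t.matches("[0-9]+(\\.[0-9]+)?"); }

    // True if the tokens form exactly one assignment with nothing left over.
    static boolean accepts(String... toks) {
        ExprRecognizer p = new ExprRecognizer(Arrays.asList(toks));
        try {
            p.assignment();
            return p.pos == toks.length;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("x", "=", "3", "*", "z"));
        System.out.println(accepts("x", "=", "*", "z"));
    }
}
```

The `while` loops correspond to the EBNF braces and the `if` in `factor()` to the brackets, which is why the grammar had to be written without left recursion first.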

Building the Parser - Summary
EBNF concrete syntax rules determine the structure of the parser.
Parser functions return objects (Term, Assignment, etc.) which are defined according to the abstract syntax rules.
The sequence of function calls plus the returned objects represent the abstract syntax tree.
Nodes in the abstract syntax tree are abstract syntax chunks of intermediate code.

Abstract Syntax Example
    Assignment = Variable target; Expression source
    Expression = Variable | Value | Binary | Unary
    Binary = Operator op; Expression term1, term2
    Unary = Operator op; Expression term
    Variable = String id
    Value = Integer value
    Operator = + | - | * | / | !

    abstract class Expression { }

    class Binary extends Expression {
        Operator op;
        Expression term1, term2;
    }

    class Unary extends Expression {
        Operator op;
        Expression term;
    }

    private Assignment assignment() {
        // Assignment → Identifier = Expression ;
        Variable target = new Variable(match(Token.Identifier));
        match(Token.Assign);
        Expression source = expression();
        match(Token.Semicolon);
        return new Assignment(target, source);
    }

Building The Abstract Syntax Tree
The Assignment method returns an Assignment object which has 2 data members:
  – Target: a Variable object (the identifier on the LHS)
  – Source: an Expression object
The source object is obtained by calling the Expression method.

    private Expression expression() {
        // Expression → Term { AddOp Term }
        Expression e = term();
        while (isAddOp()) {
            Operator op = new Operator(match(token.type()));
            Expression term2 = term();
            e = new Binary(op, e, term2);
        }
        return e;
    }

Building The Abstract Syntax Tree
The Expression method returns an Expression object.
The Expression object is generated by calling Term( ) one or more times (depending on the expression).
Term( ) generates further levels in the tree.

    private Expression term() {
        // Term → Factor { MultiplyOp Factor }
        Expression e = factor();
        while (isMultiplyOp()) {
            Operator op = new Operator(match(token.type()));
            Expression term2 = factor();
            e = new Binary(op, e, term2); // combine, as in expression()
        }
        return e;
    }

Summary
If the program is syntactically correct, parsing terminates when the eof symbol is read.
The parser will have generated objects that correspond to the nodes in an abstract syntax tree.
These nodes are input to the semantic analysis phase.

Example: Parse X = 3 * Z
Abstract syntax for assignment:
    Assignment = Variable target; Expression source
    Expression = Variable | Value | Binary | Unary
    Binary = Operator op; Expression term1, term2
    Unary = Operator op; Expression term
    Variable = String id
    Value = Integer value
    Operator = + | - | * | / | !
[Figure: tree with nodes assignment, variable, expr]
The expression object is a Binary generated by calling term( ) once (to get 3*Z), and factor( ) and primary( ) twice each (to get 3 and Z).

Example: X = A + B + C
Three calls to term( ) from expression( ), three calls to factor( ) from term( ), three calls to primary( ) from factor( )
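The call counts in these two examples can be checked with a small tree-building parser. The sketch below handles only the right-hand side expression, omits unary operators and parentheses, and uses string tokens; the class `AstDemo`, its counters, and the `toString` rendering are assumptions added for illustration, loosely following the abstract syntax classes shown earlier.

```java
import java.util.Arrays;
import java.util.List;

class AstDemo {
    static abstract class Expression { }

    static class Variable extends Expression {
        final String id;
        Variable(String id) { this.id = id; }
        @Override public String toString() { return id; }
    }

    static class Value extends Expression {
        final int value;
        Value(int value) { this.value = value; }
        @Override public String toString() { return String.valueOf(value); }
    }

    static class Binary extends Expression {
        final String op;
        final Expression term1, term2;
        Binary(String op, Expression term1, Expression term2) {
            this.op = op; this.term1 = term1; this.term2 = term2;
        }
        @Override public String toString() { return "(" + term1 + " " + op + " " + term2 + ")"; }
    }

    private final List<String> tokens;
    private int pos = 0;
    int termCalls = 0, factorCalls = 0, primaryCalls = 0; // instrumentation only

    AstDemo(String... tokens) { this.tokens = Arrays.asList(tokens); }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }

    // Expression -> Term { AddOp Term }
    Expression expression() {
        Expression e = term();
        while (peek().equals("+") || peek().equals("-")) {
            String op = tokens.get(pos++);
            Expression term2 = term();
            e = new Binary(op, e, term2); // earlier result becomes the left operand
        }
        return e;
    }

    // Term -> Factor { MulOp Factor }
    Expression term() {
        termCalls++;
        Expression e = factor();
        while (peek().equals("*") || peek().equals("/")) {
            String op = tokens.get(pos++);
            Expression term2 = factor();
            e = new Binary(op, e, term2);
        }
        return e;
    }

    // Factor -> Primary (unary operators omitted in this sketch)
    Expression factor() { factorCalls++; return primary(); }

    // Primary -> Identifier | Literal (parentheses omitted in this sketch)
    Expression primary() {
        primaryCalls++;
        String t = tokens.get(pos++);
        return Character.isDigit(t.charAt(0)) ? new Value(Integer.parseInt(t)) : new Variable(t);
    }

    public static void main(String[] args) {
        AstDemo p = new AstDemo("A", "+", "B", "+", "C");
        System.out.println(p.expression());
        System.out.println(p.termCalls + " " + p.factorCalls + " " + p.primaryCalls);
    }
}
```

Because each new Binary node takes the tree built so far as its left operand, A + B + C groups to the left, which is the usual meaning of addition and subtraction chains.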

QUESTIONS?