©SoftMoore ConsultingSlide 1 Lexical Analysis (a.k.a. Scanning)

Slides:



Advertisements
Similar presentations
Lecture 15: I/O and Parsing
Advertisements

Formal Language, chapter 4, slide 1Copyright © 2007 by Adam Webber Chapter Four: DFA Applications.
Chapter 1 Writing a Program Fall Class Overview Course Information –On the web page and Blackboard –
Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) –Given tokens specified as regular expressions, Lex automatically generates a routine.
1 JavaCUP JavaCUP (Construct Useful Parser) is a parser generator Produce a parser written in java, itself is also written in Java; There are many parser.
Chapter 1: Computer Systems
Compilation Encapsulation Or: Why Every Component Should Just Do Its Damn Job.
Chapter 2 Syntax. Syntax The syntax of a programming language specifies the structure of the language The lexical structure specifies how words can be.
1 Pass Compiler 1. 1.Introduction 1.1 Types of compilers 2.Stages of 1 Pass Compiler 2.1 Lexical analysis 2.2. syntactical analyzer 2.3. Code generation.
George Blank University Lecturer. CS 602 Java and the Web Object Oriented Software Development Using Java Chapter 4.
The Java Programming Language
1 CMPSC 160 Translation of Programming Languages Fall 2002 slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon Lecture-Module #4 Lexical.
Chapter 2 Chang Chi-Chung Lexical Analyzer The tasks of the lexical analyzer:  Remove white space and comments  Encode constants as tokens.
Outline Java program structure Basic program elements
Chapter 2 Chang Chi-Chung Lexical Analyzer The tasks of the lexical analyzer:  Remove white space and comments  Encode constants as tokens.
Strings and File I/O. Strings Java String objects are immutable Common methods include: –boolean equalsIgnoreCase(String str) –String toLowerCase() –String.
CS 153: Concepts of Compiler Design August 25 Class Meeting Department of Computer Science San Jose State University Fall 2014 Instructor: Ron Mak
Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Java Software Solutions Foundations of Program Design Sixth Edition by Lewis.
1 Scanning Aaron Bloomfield CS 415 Fall Parsing & Scanning In real compilers the recognizer is split into two phases –Scanner: translate input.
CSC 8310 Programming Languages Meeting 2 September 2/3, 2014.
CS 536 Spring Learning the Tools: JLex Lecture 6.
1 Identifiers  Identifiers are the words a programmer uses in a program  An identifier can be made up of letters, digits, the underscore character (
1 Week 4 Questions / Concerns Comments about Lab1 What’s due: Lab1 check off this week (see schedule) Homework #3 due Wednesday (Define grammar for your.
Chapter 1 Introduction Dr. Frank Lee. 1.1 Why Study Compiler? To write more efficient code in a high-level language To provide solid foundation in parsing.
CS 153: Concepts of Compiler Design September 2 Class Meeting Department of Computer Science San Jose State University Fall 2015 Instructor: Ron Mak
Compiler Construction Lexical Analysis. The word lexical means textual or verbal or literal. The lexical analysis implemented in the “SCANNER” module.
Lexical Analysis - An Introduction. The Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source.
Introduction to Programming David Goldschmidt, Ph.D. Computer Science The College of Saint Rose Java Fundamentals (Comments, Variables, etc.)
The Java Programming Language
1 Top Down Parsing. CS 412/413 Spring 2008Introduction to Compilers2 Outline Top-down parsing SLL(1) grammars Transforming a grammar into SLL(1) form.
JAVA Tokens. Introduction A token is an individual element in a program. More than one token can appear in a single line separated by white spaces.
CS 614: Theory and Construction of Compilers Lecture 10 Fall 2002 Department of Computer Science University of Alabama Joel Jones.
Review: Regular expression: –How do we define it? Given an alphabet, Base case: – is a regular expression that denote { }, the set that contains the empty.
D. M. Akbar Hussain: Department of Software & Media Technology 1 Compiler is tool: which translate notations from one system to another, usually from source.
COP 4620 / 5625 Programming Language Translation / Compiler Writing Fall 2003 Lecture 3, 09/11/2003 Prof. Roy Levow.
Lexical and Syntax Analysis
Chapter 2: Java Fundamentals
Scanning & FLEX CPSC 388 Ellen Walker Hiram College.
Unit-1 Introduction Prepared by: Prof. Harish I Rathod
1 Languages and Compilers (SProg og Oversættere) Lexical analysis.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
Algorithms  Problem: Write pseudocode for a program that keeps asking the user to input integers until the user enters zero, and then determines and outputs.
CPS 506 Comparative Programming Languages Syntax Specification.
IN LINE FUNCTION AND MACRO Macro is processed at precompilation time. An Inline function is processed at compilation time. Example : let us consider this.
Chapter 9 1 Chapter 9 – Part 2 l Overview of Streams and File I/O l Text File I/O l Binary File I/O l File Objects and File Names Streams and File I/O.
Syntax (2).
Compiler Construction By: Muhammad Nadeem Edited By: M. Bilal Qureshi.
The Role of Lexical Analyzer
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
ICS3U_FileIO.ppt File Input/Output (I/O)‏ ICS3U_FileIO.ppt File I/O Declare a file object File myFile = new File("billy.txt"); a file object whose name.
©SoftMoore ConsultingSlide 1 Structure of Compilers.
©SoftMoore ConsultingSlide 1 Context-Free Grammars.
Overview of Compilation Prepared by Manuel E. Bermúdez, Ph.D. Associate Professor University of Florida Programming Language Principles Lecture 2.
© 2011 Pearson Education, publishing as Addison-Wesley Chapter 1: Computer Systems Presentation slides for Java Software Solutions for AP* Computer Science.
Compiler Chapter 4. Lexical Analysis Dept. of Computer Engineering, Hansung University, Sung-Dong Kim.
1 Problem Solving  The purpose of writing a program is to solve a problem  The general steps in problem solving are: Understand the problem Dissect the.
Definition of the Programming Language CPRL
Lexical Analysis (a.k.a. Scanning)
Compiler Designs and Constructions (Page 83 – 92)
Working with Java.
Chapter 4 Assignment Statement
Overview of Compilation The Compiler Front End
Overview of Compilation The Compiler Front End
Algorithms Problem: Write pseudocode for a program that keeps asking the user to input integers until the user enters zero, and then determines and outputs.
Computer Programming Methodology File Input
Introduction to Java Programming
Chapter 1: Computer Systems
Lecture 4: Lexical Analysis & Chomsky Hierarchy
CMPE 152: Compiler Design August 21/23 Lab
CMPE 152: Compiler Design September 3 Class Meeting
Presentation transcript:

©SoftMoore ConsultingSlide 1 Lexical Analysis (a.k.a. Scanning)

©SoftMoore ConsultingSlide 2 Class Position Class Position encapsulates the concept of a position in a source file. –used primarily for error reporting The position is characterized by an ordered pair of integers –line number relative to the source file –character number relative to that line Note: Position objects are immutable – once created they can’t be modified. Key methods public int getLineNumber() public int getCharNumber()

©SoftMoore ConsultingSlide 3 Class Source Class Source is essentially an iterator that steps through the characters in a source file one character at a time. At any point during the iteration you can examine the current character and its position within the source file before advancing to the next character. Class Source –Encapsulates the source file reader –Maintains the position of each character in the source file –Input: a Reader (usually a FileReader ) –Output: individual characters and their position within the file

©SoftMoore ConsultingSlide 4 Class Source : Key Methods /** * Returns the current character (as an int) in the source * file. Returns EOF if the end of file has been reached. */ public int getChar() /** * Returns the position (line number, char number) of the * current character in the source file. */ public Position getCharPosition() /** * Advance to the next character in the source file. */ public void advance() throws IOException

©SoftMoore ConsultingSlide 5 Testing Class Source String fileName = args[0]; FileReader fileReader = new FileReader(fileName); Source source = new Source(fileReader); while (source.getChar() != Source.EOF) { int c = source.getChar(); if (c == '\n') System.out.print("\\n"); else if (c != '\r') System.out.print((char) c); System.out.println("\t" + source.getCharPosition()); source.advance(); }

©SoftMoore ConsultingSlide 6 Results of Testing Class Source (Input File is Source.java ) p Line 1, Character 1 a Line 1, Character 2 c Line 1, Character 3 k Line 1, Character 4 a Line 1, Character 5 g Line 1, Character 6 e Line 1, Character 7 Line 1, Character 8 e Line 1, Character 9 d Line 1, Character 10 u Line 1, Character 11. Line 1, Character 12 c Line 1, Character 13 i Line 1, Character 14 t Line 1, Character 15 a Line 1, Character 16...

©SoftMoore ConsultingSlide 7 Symbol (a.k.a. Token Type) The term symbol will be used to refer to the basic lexical units returned by the scanner. From the perspective of the parser, these are the terminal symbols. Symbols include –reserved words (“ while ”, “ if ”, …) –operators and punctuation (“ := ”, “ + ”, “ ; ”, …), –identifier –intLiteral –special symbols (EOF, unknown)

©SoftMoore ConsultingSlide 8 Enum Symbol public enum Symbol { // reserved words BooleanRW("Reserved word: Boolean"), IntegerRW("Reserved word: Integer"),... whileRW("Reserved word: while"), writeRW("Reserved word: write"), writelnRW("Reserved word: writeln"), // arithmetic operator symbols plus("+"), minus("-"), times("*"), divide("/"), (continued on next slide)

©SoftMoore ConsultingSlide 9 Enum Symbol (continued)... // literal values and identifier symbols intLiteral("Integer Literal"), charLiteral("Character Literal"), stringLiteral("String Literal"), identifier("Identifier"), // special scanning symbols EOF("End-of-File"), unknown("Unknown");... } See source file for details.

©SoftMoore ConsultingSlide 10 Token The term token will be used to refer to a symbol together with additional information including –the position (line number and character number) of the symbol in the source file –the text associated with the symbol The additional information provided by a token is used for error reporting and code generation, but not to determine if the program is syntactically correct.

©SoftMoore ConsultingSlide 11 Examples: Text Associated with Symbols “average” for an identifier “100” for an integer literal “Hello, world.” for a string literal “while” for the reserved word “while” “<=” for the operator “<=” The text associated with user-defined symbols such as identifiers or literals is more significant than the text associated with language-defined symbols such as reserved words or operators.

©SoftMoore ConsultingSlide 12 Class Token : Key Methods /** * Returns the token's symbol. */ public Symbol getSymbol() /** * Returns the token's position within the source file. */ public Position getPosition() /** * Returns the string representation for the token. */ public String getText()

Implementing Class Token Class Token is implemented in two separate classes: An abstract, generic class that can be instantiated with any Symbol enum class public abstract class AbstractToken > implements Cloneable A concrete class that instantiates the generic class using the Symbol enum class for CPRL public class Token extends AbstractToken ©SoftMoore ConsultingSlide 13 Class AbstractToken is reusable on compiler projects other than a compiler for CPRL.

Scanner (Lexical Analyzer) Class Scanner is essentially an iterator that steps through the tokens in a source file one token at a time. At any point during the iteration you can examine the current token, its text, and its position within the source file before advancing to the next token. Class Scanner –Consumes characters from the source code file as it constructs the tokens –Removes extraneous white space and comments –Reports any errors –Input: Individual characters (from class Source ) –Output: Tokens (to be consumed by the parser) ©SoftMoore ConsultingSlide 14

©SoftMoore ConsultingSlide 15 Class Scanner : Key Methods /** * Returns a copy (clone) of the current token in the * source file. */ public Token getToken() /** * Returns a reference to the current symbol in the * source file. */ public Symbol getSymbol() /** * Advance to the next token in the source file. */ public void advance() throws IOException

©SoftMoore ConsultingSlide 16 Classes Source and Scanner Scanner 'y' ' ' ':' '=' ' ' 'x' ' ' '+' ' ' '1' '0' id [“ y ”, (1, 1)] := [(1, 3)]id [“ x ”, (1, 6)] + [(1, 8)]literal [(“ 10 ”, (1, 10)] Source source code file y := x + 10

©SoftMoore ConsultingSlide 17 Method advance()... try { skipWhiteSpace(); // currently at starting character of next token currentToken.setPosition(source.getCharPosition()); currentToken.setText(null); if (source.getChar() == Source.EOF) { // set symbol but don't advance currentToken.setSymbol(Symbol.EOF); } (continued on next page)

©SoftMoore ConsultingSlide 18 Method advance() (continued) else if (Character.isLetter((char) source.getChar())) { String idString = scanIdentifier(); Symbol scannedSymbol = getIdentifierSymbol(idString); currentToken.setSymbol(scannedSymbol); if (scannedSymbol == Symbol.identifier) currentToken.setText(idString); } else if (Character.isDigit((char) source.getChar())) { currentToken.setText(scanIntegerLiteral()); currentToken.setSymbol(Symbol.intLiteral); } (continued on next page)

©SoftMoore ConsultingSlide 19 Method advance() (continued – scanning “ + ” and “ - ” symbols) else { switch((char) source.getChar()) { case '+': currentToken.setSymbol(Symbol.plus); source.advance(); break; case '-': currentToken.setSymbol(Symbol.minus); source.advance(); break;... (continued on next page)

©SoftMoore ConsultingSlide 20 Method advance() (continued – scanning “ > ” and “ >= ” symbols) case '>': source.advance(); if ((char) source.getChar() == '=') { currentToken.setSymbol(Symbol.greaterOrEqual); source.advance(); } else currentToken.setSymbol(Symbol.greaterThan); break;...

©SoftMoore ConsultingSlide 21 Example: Scanning an Integer Literal protected String scanIntegerLiteral() throws IOException { // assumes that source.getChar() is the first digit // of the integer literal clearScanBuffer(); do { scanBuffer.append((char) source.getChar()); source.advance(); } while (Character.isDigit((char) source.getChar())); return scanBuffer.toString(); }

©SoftMoore ConsultingSlide 22 Tips on Scanning an Identifier Use a single method to scan all identifiers, including reserved words. /** * Scans characters in the source file for a valid * identifier. */ protected String scanIdentifier() throws IOException Use an “efficient” search routine to determine if the identifier is a user-defined identifier or a reserved word. /** * Returns the symbol associated with an identifier * (Symbol.arrayRW, Symbol.ifRW, Symbol.identifier, etc.) */ protected Symbol getIdentifierSymbol(String idString)

©SoftMoore ConsultingSlide 23 Testing Class Scanner String fileName = args[0]; FileReader fileReader = new FileReader(fileName); Source source = new Source(fileReader); Scanner scanner = new Scanner(source); Token token; do { token = scanner.getToken(); printToken(token); scanner.advance(); } while (token.getSymbol() != Symbol.EOF);

©SoftMoore ConsultingSlide 24 Testing Class Scanner (continued) public static void printToken(Token token) { System.out.printf("line: %2d char: %2d token: ", token.getPosition().getLineNumber(), token.getPosition().getCharNumber()); if ( token.getSymbol() == Symbol.identifier || token.getSymbol() == Symbol.intLiteral || token.getSymbol() == Symbol.stringLiteral || token.getSymbol() == Symbol.charLiteral) System.out.print(token.getSymbol().toString() + " -> "); System.out.println(token.getText());

©SoftMoore ConsultingSlide 25 Results of Testing Class Scanner (Input File is TestScanner.cprl ) line: 2 char: 1 token: Reserved word: and line: 2 char: 11 token: Reserved word: array line: 2 char: 21 token: Reserved word: begin line: 2 char: 31 token: Reserved word: Boolean... line: 9 char: 31 token: Reserved word: while line: 9 char: 41 token: Reserved word: write line: 10 char: 1 token: Reserved word: writeln line: 13 char: 1 token: + line: 13 char: 6 token: - line: 13 char: 11 token: * line: 13 char: 16 token: / line: 16 char: 1 token: = line: 16 char: 5 token: != line: 16 char: 10 token: < line: 16 char: 14 token: <=...