Lexical Analysis (a.k.a. Scanning)

Presentation transcript:

Lexical Analysis (a.k.a. Scanning) ©SoftMoore Consulting

Class Position

Class Position encapsulates the concept of a position in a source file.
- used primarily for error reporting

The position is characterized by an ordered pair of integers:
- line number relative to the source file
- character number relative to that line

Note: Position objects are immutable; once created they can’t be modified.

Key methods:

    public int getLineNumber()
    public int getCharNumber()
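The description above can be sketched as a small immutable class. Only the two getters come from the slides. The constructor, the field names, and the toString() format (modeled on the "Line 1, Character 1" output shown in the Source test results) are assumptions, so this is an illustration and not the actual CPRL source.

```java
// A minimal sketch of an immutable position; not the actual CPRL source.
final class Position {
    private final int lineNumber;   // line number relative to the source file
    private final int charNumber;   // character number relative to that line

    // Assumed constructor: the slides do not show how a Position is created.
    Position(int lineNumber, int charNumber) {
        this.lineNumber = lineNumber;
        this.charNumber = charNumber;
    }

    public int getLineNumber() {
        return lineNumber;
    }

    public int getCharNumber() {
        return charNumber;
    }

    // Assumed format, chosen to resemble the test output in the slides.
    @Override
    public String toString() {
        return "Line " + lineNumber + ", Character " + charNumber;
    }
}
```

Because both fields are final and there are no setters, a Position can be shared between a token and an error message without defensive copying.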

Class Source

Class Source is essentially an iterator that steps through the characters in a source file one character at a time. At any point during the iteration you can examine the current character and its position within the source file before advancing to the next character.

Class Source
- Encapsulates the source file reader
- Maintains the position of each character in the source file
- Input: a Reader (usually a FileReader)
- Output: individual characters and their position within the file

Class Source: Key Methods

    /**
     * Returns the current character (as an int) in the source
     * file.  Returns EOF if the end of file has been reached.
     */
    public int getChar()

    /**
     * Returns the position (line number, char number) of the
     * current character in the source file.
     */
    public Position getCharPosition()

    /**
     * Advance to the next character in the source file.
     */
    public void advance() throws IOException
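One possible way to implement these three methods is sketched below. The method names and Source.EOF come from the slides. The fields, the constructor, the value of EOF, the compact stand-in for Position, and the helper class SourceDemo are all assumptions made so that the example compiles on its own.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Compact stand-in for class Position so the sketch is self-contained.
final class Position {
    private final int lineNumber;
    private final int charNumber;

    Position(int lineNumber, int charNumber) {
        this.lineNumber = lineNumber;
        this.charNumber = charNumber;
    }

    public int getLineNumber() { return lineNumber; }
    public int getCharNumber() { return charNumber; }
}

// A minimal sketch of class Source; fields and constructor are assumptions.
final class Source {
    // Assumed value: Reader.read() returns -1 at end of input.
    public static final int EOF = -1;

    private final Reader reader;
    private int currentChar;        // 0 until the first call to advance()
    private int lineNumber = 1;
    private int charNumber = 0;

    Source(Reader reader) throws IOException {
        this.reader = reader;
        advance();                  // position on the first character
    }

    public int getChar() {
        return currentChar;
    }

    public Position getCharPosition() {
        return new Position(lineNumber, charNumber);
    }

    public void advance() throws IOException {
        if (currentChar == '\n') {  // the previous character ended a line
            lineNumber++;
            charNumber = 1;
        } else {
            charNumber++;
        }
        currentChar = reader.read();
    }
}

// Hypothetical helper: lists the position of every character as "line:char".
final class SourceDemo {
    static String positions(String text) {
        StringBuilder sb = new StringBuilder();
        try {
            Source source = new Source(new StringReader(text));
            while (source.getChar() != Source.EOF) {
                Position p = source.getCharPosition();
                sb.append(p.getLineNumber()).append(':')
                  .append(p.getCharNumber()).append(' ');
                source.advance();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString().trim();
    }
}
```

In this sketch the newline character itself is counted as the last character of its line, and the line number is only incremented when advancing past it.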

Testing Class Source

    String fileName = args[0];
    FileReader fileReader = new FileReader(fileName);
    Source source = new Source(fileReader);

    while (source.getChar() != Source.EOF) {
        int c = source.getChar();

        if (c == '\n')
            System.out.print("\\n");
        else if (c != '\r')
            System.out.print((char) c);

        System.out.println("\t" + source.getCharPosition());
        source.advance();
    }

Results of Testing Class Source (Input File is Source.java)

    p    Line 1, Character 1
    a    Line 1, Character 2
    c    Line 1, Character 3
    k    Line 1, Character 4
    a    Line 1, Character 5
    g    Line 1, Character 6
    e    Line 1, Character 7
         Line 1, Character 8
    e    Line 1, Character 9
    d    Line 1, Character 10
    u    Line 1, Character 11
    .    Line 1, Character 12
    c    Line 1, Character 13
    i    Line 1, Character 14
    t    Line 1, Character 15
    a    Line 1, Character 16
    ...

Symbol (a.k.a. Token Type)

The term symbol will be used to refer to the basic lexical units returned by the scanner. From the perspective of the parser, these are the terminal symbols.

Symbols include
- reserved words (“while”, “if”, …)
- operators and punctuation (“:=”, “+”, “;”, …)
- identifier
- intLiteral
- special symbols (EOF, unknown)

Enum Symbol

    public enum Symbol {
        // reserved words
        BooleanRW("Reserved word: Boolean"),
        IntegerRW("Reserved word: Integer"),
        ...
        whileRW("Reserved word: while"),
        writeRW("Reserved word: write"),
        writelnRW("Reserved word: writeln"),

        // arithmetic operator symbols
        plus("+"),
        minus("-"),
        times("*"),
        divide("/"),

(continued on next slide)

Enum Symbol (continued)

        ...
        // literal values and identifier symbols
        intLiteral("Integer Literal"),
        charLiteral("Character Literal"),
        stringLiteral("String Literal"),
        identifier("Identifier"),

        // special scanning symbols
        EOF("End-of-File"),
        unknown("Unknown");
    }

See source file for details.
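The slides leave out the part of the enum that stores each constant's label. One plausible shape, shown here with only a handful of the constants, is an enum with a private label field and an overridden toString(). The field name, the constructor, and the toString() override are assumptions; the constants and their labels are taken from the slides.

```java
// A reduced sketch of the Symbol enum; the real enum has many more constants.
enum Symbol {
    // reserved words
    whileRW("Reserved word: while"),

    // arithmetic operator symbols
    plus("+"),
    minus("-"),

    // literal values and identifier symbols
    intLiteral("Integer Literal"),
    identifier("Identifier"),

    // special scanning symbols
    EOF("End-of-File"),
    unknown("Unknown");

    private final String label;   // assumed field name

    // Assumed constructor: stores the label given with each constant.
    Symbol(String label) {
        this.label = label;
    }

    // Assumed override: returns the label instead of the constant's name.
    @Override
    public String toString() {
        return label;
    }
}
```

With this shape, printing a symbol gives readable text such as "Reserved word: while" or "+", which matches the style of the scanner test output later in the slides.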

Token

The term token will be used to refer to a symbol together with additional information including
- the position (line number and character number) of the symbol in the source file
- the text associated with the symbol

The additional information provided by a token is used for error reporting and code generation, but not to determine if the program is syntactically correct.

Examples: Text Associated with Symbols

- “average” for an identifier
- “100” for an integer literal
- “Hello, world.” for a string literal
- “while” for the reserved word “while”
- “<=” for the operator “<=”

The text associated with user-defined symbols such as identifiers or literals is more significant than the text associated with language-defined symbols such as reserved words or operators.

Class Token: Key Methods

    /**
     * Returns the token's symbol.
     */
    public Symbol getSymbol()

    /**
     * Returns the token's position within the source file.
     */
    public Position getPosition()

    /**
     * Returns the string representation for the token.
     */
    public String getText()

Implementing Class Token

Class Token is implemented in two separate classes:

- An abstract, generic class that can be instantiated with any Symbol enum class

      public abstract class AbstractToken<Symbol extends Enum<Symbol>>
          implements Cloneable

- A concrete class that instantiates the generic class using the Symbol enum class for CPRL

      public class Token extends AbstractToken<Symbol>

Class AbstractToken is reusable on compiler projects other than a compiler for CPRL.
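The split between a generic base class and a CPRL-specific subclass might look roughly like the sketch below. The class names, the Cloneable interface, and the getters and setters used elsewhere in the slides are from the source. Everything else is an assumption: the type parameter is named S here (the slides name it Symbol) to avoid shadowing the enum, the constructor is invented, Position and Symbol are reduced stand-ins, and the fallback in getText() is a guess based on the scanner test output.

```java
// Reduced stand-ins so the sketch compiles on its own.
final class Position {
    private final int lineNumber;
    private final int charNumber;

    Position(int lineNumber, int charNumber) {
        this.lineNumber = lineNumber;
        this.charNumber = charNumber;
    }

    public int getLineNumber() { return lineNumber; }
    public int getCharNumber() { return charNumber; }
}

enum Symbol {
    plus("+"), identifier("Identifier"), EOF("End-of-File");

    private final String label;
    Symbol(String label) { this.label = label; }
    @Override public String toString() { return label; }
}

// Generic base class: usable with any enum of symbols.
abstract class AbstractToken<S extends Enum<S>> implements Cloneable {
    private S symbol;
    private Position position;
    private String text;

    // Assumed constructor; the slides do not show one.
    protected AbstractToken(S symbol, Position position, String text) {
        this.symbol = symbol;
        this.position = position;
        this.text = text;
    }

    public S getSymbol() { return symbol; }
    public Position getPosition() { return position; }

    // Assumption: when no text was set, fall back to the symbol's label.
    public String getText() {
        return text != null ? text : symbol.toString();
    }

    public void setSymbol(S symbol) { this.symbol = symbol; }
    public void setPosition(Position position) { this.position = position; }
    public void setText(String text) { this.text = text; }

    // A shallow copy is enough because Position and String are immutable.
    @Override
    public AbstractToken<S> clone() {
        try {
            @SuppressWarnings("unchecked")
            AbstractToken<S> copy = (AbstractToken<S>) super.clone();
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);   // cannot happen: class is Cloneable
        }
    }
}

// Concrete class for CPRL: fixes the type parameter to the Symbol enum.
final class Token extends AbstractToken<Symbol> {
    Token(Symbol symbol, Position position, String text) {
        super(symbol, position, text);
    }

    @Override
    public Token clone() {
        return (Token) super.clone();
    }
}
```

Cloning matters because the scanner reuses one mutable current token; handing out a copy keeps a caller's token from changing when the scanner advances.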

Scanner (Lexical Analyzer)

Class Scanner is essentially an iterator that steps through the tokens in a source file one token at a time. At any point during the iteration you can examine the current token, its text, and its position within the source file before advancing to the next token.

Class Scanner
- Consumes characters from the source code file as it constructs the tokens
- Removes extraneous white space and comments
- Reports any errors
- Input: Individual characters (from class Source)
- Output: Tokens (to be consumed by the parser)

Class Scanner: Key Methods

    /**
     * Returns a copy (clone) of the current token in the
     * source file.
     */
    public Token getToken()

    /**
     * Returns a reference to the current symbol in the source file.
     */
    public Symbol getSymbol()

    /**
     * Advance to the next token in the source file.
     */
    public void advance() throws IOException

Classes Source and Scanner

[Figure: the source code file containing “y := x + 10” is read by Source, which passes the characters 'y' ' ' ':' '=' ' ' 'x' ' ' '+' ' ' '1' '0' to Scanner. Scanner produces the tokens listed below.]

    id       [“y”, (1, 1)]
    :=       [(1, 3)]
    id       [“x”, (1, 6)]
    +        [(1, 8)]
    literal  [“10”, (1, 10)]

Method advance()

    ...
    try {
        skipWhiteSpace();

        // currently at starting character of next token
        currentToken.setPosition(source.getCharPosition());
        currentToken.setText(null);

        if (source.getChar() == Source.EOF) {
            // set symbol but don't advance
            currentToken.setSymbol(Symbol.EOF);
        }

(continued on next page)

Method advance() (continued)

        else if (Character.isLetter((char) source.getChar())) {
            String idString = scanIdentifier();
            Symbol scannedSymbol = getIdentifierSymbol(idString);
            currentToken.setSymbol(scannedSymbol);

            if (scannedSymbol == Symbol.identifier)
                currentToken.setText(idString);
        }
        else if (Character.isDigit((char) source.getChar())) {
            currentToken.setText(scanIntegerLiteral());
            currentToken.setSymbol(Symbol.intLiteral);
        }

(continued on next page)

Method advance() (continued: scanning “+” and “-” symbols)

        else {
            switch ((char) source.getChar()) {
                case '+':
                    currentToken.setSymbol(Symbol.plus);
                    source.advance();
                    break;
                case '-':
                    currentToken.setSymbol(Symbol.minus);
                    ...

(continued on next page)

Method advance() (continued: scanning “>” and “>=” symbols)

                case '>':
                    source.advance();
                    if ((char) source.getChar() == '=') {
                        currentToken.setSymbol(Symbol.greaterOrEqual);
                        source.advance();   // consume the '=' as well
                    }
                    else
                        currentToken.setSymbol(Symbol.greaterThan);
                    break;
                ...
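The one-character lookahead used for “>” and “>=” can be tried in isolation. The sketch below is not the CPRL scanner: it works on a String with an index instead of class Source, returns symbol names as strings, and skips every character other than '>'. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone demo of one-character lookahead.
final class RelationalOperatorDemo {
    // Returns "greaterThan" or "greaterOrEqual" for each '>' token in text.
    static List<String> scan(String text) {
        List<String> symbols = new ArrayList<>();
        int i = 0;

        while (i < text.length()) {
            char c = text.charAt(i);
            i++;                                // consume the first character

            if (c == '>') {
                // look at the next character without consuming it yet
                if (i < text.length() && text.charAt(i) == '=') {
                    symbols.add("greaterOrEqual");
                    i++;                        // '=' belongs to this token
                } else {
                    symbols.add("greaterThan");
                }
            }
            // any other character is simply skipped in this sketch
        }

        return symbols;
    }
}
```

The key point is that the second character is consumed only when it actually extends the token; otherwise it is left in place to start the next token.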

Example: Scanning an Integer Literal

    protected String scanIntegerLiteral() throws IOException {
        // assumes that source.getChar() is the first digit
        // of the integer literal
        clearScanBuffer();

        do {
            scanBuffer.append((char) source.getChar());
            source.advance();
        } while (Character.isDigit((char) source.getChar()));

        return scanBuffer.toString();
    }

Tips on Scanning an Identifier

Use a single method to scan all identifiers, including reserved words.

    /**
     * Scans characters in the source file for a valid identifier.
     */
    protected String scanIdentifier() throws IOException

Use an “efficient” search routine to determine if the identifier is a user-defined identifier or a reserved word.

    /**
     * Returns the symbol associated with an identifier
     * (Symbol.arrayRW, Symbol.ifRW, Symbol.identifier, etc.)
     */
    protected Symbol getIdentifierSymbol(String idString)
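One way to follow both tips is to scan the word first and then look it up in a hash map of reserved words. The sketch below makes several assumptions: a hash map is only one of several possible “efficient” search routines, the identifier rule (a letter followed by letters or digits) is assumed rather than taken from the slides, the lookup is case-sensitive, the Symbol enum is reduced to four constants, and scanIdentifier takes a String and a start index instead of reading from class Source.

```java
import java.util.HashMap;
import java.util.Map;

// Reduced stand-in for the Symbol enum.
enum Symbol { arrayRW, ifRW, whileRW, identifier }

// Hypothetical helper class; in the slides these are Scanner methods.
final class IdentifierLookup {
    private static final Map<String, Symbol> RESERVED_WORDS = new HashMap<>();

    static {
        RESERVED_WORDS.put("array", Symbol.arrayRW);
        RESERVED_WORDS.put("if", Symbol.ifRW);
        RESERVED_WORDS.put("while", Symbol.whileRW);
    }

    // Returns the reserved-word symbol, or Symbol.identifier if not reserved.
    static Symbol getIdentifierSymbol(String idString) {
        return RESERVED_WORDS.getOrDefault(idString, Symbol.identifier);
    }

    // Assumes text.charAt(start) is a letter; returns the longest run of
    // letters and digits beginning there.
    static String scanIdentifier(String text, int start) {
        int end = start;
        while (end < text.length()
                && Character.isLetterOrDigit(text.charAt(end))) {
            end++;
        }
        return text.substring(start, end);
    }
}
```

Scanning reserved words with the same method as identifiers keeps advance() simple: it only needs one branch for anything that starts with a letter.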

Testing Class Scanner

    String fileName = args[0];
    FileReader fileReader = new FileReader(fileName);
    Source source = new Source(fileReader);
    Scanner scanner = new Scanner(source);
    Token token;

    do {
        token = scanner.getToken();
        printToken(token);
        scanner.advance();
    } while (token.getSymbol() != Symbol.EOF);

Testing Class Scanner (continued)

    public static void printToken(Token token) {
        System.out.printf("line: %2d char: %2d token: ",
            token.getPosition().getLineNumber(),
            token.getPosition().getCharNumber());

        if (   token.getSymbol() == Symbol.identifier
            || token.getSymbol() == Symbol.intLiteral
            || token.getSymbol() == Symbol.stringLiteral
            || token.getSymbol() == Symbol.charLiteral)
            System.out.print(token.getSymbol().toString() + " -> ");

        System.out.println(token.getText());
    }

Results of Testing Class Scanner (Input File is Correct_01.cprl)

    line:  2 char:  1 token: Reserved word: and
    line:  2 char: 11 token: Reserved word: array
    line:  2 char: 21 token: Reserved word: begin
    line:  2 char: 31 token: Reserved word: Boolean
    ...
    line:  9 char: 31 token: Reserved word: while
    line:  9 char: 41 token: Reserved word: write
    line: 10 char:  1 token: Reserved word: writeln
    line: 13 char:  1 token: +
    line: 13 char:  6 token: -
    line: 13 char: 11 token: *
    line: 13 char: 16 token: /
    line: 16 char:  1 token: =
    line: 16 char:  5 token: !=
    line: 16 char: 10 token: <
    line: 16 char: 14 token: <=