CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire Lecture 2 Phases of a Compiler Lexical Analysis.

Announcements
Discussion session Monday: the TA presented JLex, a lexical-analyzer generator for Java.
–Read the JLex user manual (it is available at the JLex website; the link is on the class webpage)
–Read the following chapters from the textbook:
  –Chapter 1: Introduction
  –Chapter 2: A translator from infix expressions to postfix expressions
  –Chapter 3: Lexical analysis
–Homework 1 is due next Tuesday. Drop it in the homework box before the lecture or hand it to me at the beginning of the lecture.

High-Level View of a Compiler
–Must recognize legal (and illegal) programs
–Must generate correct code
–Must manage storage of all variables (and code)
–Must agree with the OS and linker on the format for object code
[Diagram: Source code → Compiler → Machine code, with Errors reported]

A Higher-Level View: How Does the Compiler Fit In?
skeletal source program → Preprocessor → source program → Compiler → target assembly program → Assembler → relocatable machine code → Loader/Linker (with library routines and relocatable object files) → absolute machine code
–Preprocessor: collects a source program that is divided into separate files; performs macro expansion
–Assembler: generates machine code from the assembly code
–Loader/Linker: links the library routines and other object modules; generates absolute addresses

Traditional Two-Pass Compiler
Source code → Front End → IR → Back End → Machine code (errors reported along the way)
–Uses an intermediate representation (IR)
–The front end maps legal source code into IR
–The back end maps IR into target machine code
–Admits multiple front ends and multiple passes
  –Typically, the front end is O(n) or O(n log n), while the back end contains NP-complete problems
–The different phases of the compiler also interact through the symbol table

The Front End
Source code → Scanner → (tokens) → Parser → (IR) → Type Checker → IR (errors reported)
Responsibilities:
–Recognize legal programs
–Report errors for illegal programs in a useful way
–Produce IR and construct the symbol table
Much of front-end construction can be automated.

The Front End: Scanner
–Maps the character stream into words—the basic units of syntax
–Produces tokens and stores lexemes when necessary
  –x = x + y ; becomes ID EQ ID PLUS ID SEMICOLON (the lexemes x, x, and y are recorded for the identifier tokens)
  –Typical tokens include number, identifier, +, -, while, if
–The scanner eliminates white space and comments

The Front End: Parser
–Uses the scanner as a subroutine (repeatedly requests the next token)
–Recognizes context-free syntax and reports errors
–Guides context-sensitive analysis (type checking)
–Builds the IR for the source program
Scanning and parsing can be grouped into one pass.

The Front End: Context-Sensitive Analysis
–Check that all variables are declared before they are used
–Type checking
  –Catch type errors such as adding a procedure and an array
–Add the necessary type conversions
  –int-to-float, float-to-double, etc.

The Back End
IR → Instruction Selection → Register Allocation → Instruction Scheduling → Machine code (errors reported)
Responsibilities:
–Translate IR into target machine code
–Choose instructions to implement each IR operation
–Decide which values to keep in registers
–Schedule the instructions for the instruction pipeline
Automation has been much less successful in the back end.

The Back End: Instruction Selection
–Produce fast, compact code
–Take advantage of target-language features such as addressing modes
–Usually viewed as a pattern-matching problem
  –ad hoc methods, pattern matching, dynamic programming
–This was "the problem of the future" in the late 70s, when instruction sets were complex
  –RISC architectures simplified this problem

The Back End: Instruction Scheduling
–Avoid hardware stalls (keep the pipeline moving)
–Use all functional units productively
–Optimal scheduling is NP-complete

The Back End: Register Allocation
–Have each value in a register when it is used
–Manage a limited set of registers
–Can change instruction choices and insert LOADs and STOREs
–Optimal allocation is NP-complete
Compilers approximate solutions to NP-complete problems.

Traditional Three-Pass Compiler
Source code → Front End → IR → Middle End → IR → Back End → Machine code (errors reported)
Code optimization:
–Analyzes and transforms the IR
–Primary goal is to reduce the running time of the compiled code
  –May also improve space or power consumption (mobile computing)
–Must preserve the "meaning" of the code

The Optimizer (or Middle End)
Modern optimizers are structured as a series of passes: IR → Opt 1 → Opt 2 → ... → Opt n → IR (errors reported)
Typical transformations:
–Discover and propagate constant values
–Move a computation to a less frequently executed place
–Discover a redundant computation and remove it
–Remove unreachable code

First Phase: Lexical Analysis (Scanning)
Scanner:
–Maps the stream of characters into words
  –the basic units of syntax
–The characters that form a word are its lexeme
–Its syntactic category is called its token
–The scanner discards white space and comments

Why Lexical Analysis?
By separating context-free syntax from lexical analysis:
–We can develop efficient scanners
–We can automate efficient scanner construction
–We can write simple specifications for tokens
A scanner generator takes specifications (regular expressions) and produces tables or code; the generated scanner turns source code into tokens.

What are Tokens?
Token: basic unit of syntax
–Keywords: if, while, ...
–Operators: +, *, <=, ||, ...
–Identifiers (names of variables, arrays, procedures, classes): i, i1, j1, count, sum, ...
–Numbers: 12, 3.14, 7.2E-2, ...

What are Tokens?
Tokens are the terminal symbols for the parser.
–Tokens are treated as indivisible units in the grammar defining the source language.
1. S → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op → +
7.    | -
number, id, +, and - are tokens passed from the scanner to the parser; they form the terminal symbols of this simple grammar.

Lexical Concepts
–Token: basic unit of syntax; the syntactic output of the scanner
–Pattern: the rule that describes the set of strings that correspond to a token; the specification of the token
–Lexeme: a sequence of input characters that matches a pattern and generates the token

Token | Lexeme                   | Pattern
WHILE | while                    | while
IF    | if                       | if
ID    | i1, length, count, sqrt  | letter followed by letters and digits

Tokens can have Attributes
A problem: if we send this output to the parser, is it enough?

if (i == j) z = 0; else z = 1;

becomes

IF, LPAREN, ID, EQEQ, ID, RPAREN, ID, EQ, NUM, SEMICOLON, ELSE, ID, EQ, NUM, SEMICOLON

Where are the variable names, procedure names, etc.? All identifiers look the same. Tokens can have attributes that they pass to the parser (using the symbol table):

IF, LPAREN, <ID,i>, EQEQ, <ID,j>, RPAREN, <ID,z>, EQ, <NUM,0>, SEMICOLON, ELSE, <ID,z>, EQ, <NUM,1>, SEMICOLON
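The attribute-passing idea can be sketched as a tiny tokenizer. This is a minimal illustration in Python, not the JLex tooling the course actually uses; the pattern list and the `tokenize` helper are hypothetical, though the token names match the slide:

```python
import re

# Token patterns in priority order; keywords come before ID so that
# "if"/"else" are not scanned as identifiers. (A sketch: a real scanner
# would also enforce longest match, e.g. for "ifx".)
TOKEN_SPEC = [
    ("IF", r"if"), ("ELSE", r"else"),
    ("EQEQ", r"=="), ("EQ", r"="),
    ("LPAREN", r"\("), ("RPAREN", r"\)"),
    ("SEMICOLON", r";"),
    ("NUM", r"[0-9]+"), ("ID", r"[A-Za-z][A-Za-z0-9]*"),
    ("WS", r"\s+"),
]

def tokenize(source):
    """Return (token, attribute) pairs; the attribute is the lexeme
    for ID and NUM tokens and None otherwise."""
    tokens, pos = [], 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                if name != "WS":  # the scanner discards whitespace
                    attr = m.group(0) if name in ("ID", "NUM") else None
                    tokens.append((name, attr))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"illegal character at position {pos}")
    return tokens

print(tokenize("if (i == j) z = 0; else z = 1;"))
```

The output interleaves the attribute-free tokens (IF, LPAREN, ...) with `("ID", "i")`-style pairs, which is exactly the information the parser was missing above.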

How do we specify lexical patterns?
Some patterns are easy. Keywords and operators:
–Specified as literal patterns: if, then, else, while, =, +, ...

Specifying Lexical Patterns
Some patterns are more complex:
–Identifiers: a letter followed by letters and digits
–Numbers
  –Integer: 0, or a digit between 1 and 9 followed by digits between 0 and 9
  –Decimal: an optional sign ("+" or "-"), followed by "0" or a nonzero digit followed by an arbitrary number of digits, followed by a decimal point, followed by an arbitrary number of digits
GOAL: We want concise descriptions of patterns, and we want to automatically construct the scanner from these descriptions.

Specifying Lexical Patterns: Regular Expressions
Regular expressions (REs) describe regular languages.
Regular Expression (over alphabet Σ):
–ε (the empty string) is a RE denoting the set {ε}
–If a is in Σ, then a is a RE denoting {a}
–If x and y are REs denoting languages L(x) and L(y), then
  –(x) is a RE denoting L(x)
  –x | y is a RE denoting L(x) ∪ L(y)
  –xy is a RE denoting L(x)L(y)
  –x* is a RE denoting L(x)*
Precedence is closure, then concatenation, then alternation; all are left-associative.
x | y* z is equivalent to x | ((y*) z)
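The precedence rule can be checked directly with Python's `re` module, which uses the same closure > concatenation > alternation precedence as the slides' notation (a small illustrative check, not part of the course tooling):

```python
import re

# Closure (*) binds tighter than concatenation, which binds tighter
# than alternation (|), so a|b*c parses as a | ((b*) c).
pattern = re.compile(r"a|b*c")

assert pattern.fullmatch("a")         # the 'a' alternative
assert pattern.fullmatch("bbbc")      # b* then c
assert pattern.fullmatch("c")         # zero b's then c
assert not pattern.fullmatch("ab")    # would need explicit (a|b)-style grouping
```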

Operations on Languages
Operation                            | Definition
Union of L and M, written L ∪ M      | L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation of L and M, written LM | LM = { st | s ∈ L and t ∈ M }
Exponentiation of L, written L^i     | L^i = {ε} if i = 0; L^(i-1) L if i > 0
Kleene closure of L, written L*      | L* = ∪ over 0 ≤ i of L^i
Positive closure of L, written L+    | L+ = ∪ over 1 ≤ i of L^i
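For finite languages these operations can be computed directly as set operations; the following Python sketch (the language names L and M are just the slide's examples made concrete) mirrors the definitions above:

```python
from itertools import product

L = {"a", "ab"}
M = {"c"}

# Union: L ∪ M
assert L | M == {"a", "ab", "c"}

# Concatenation: LM = { st | s in L, t in M }
LM = {s + t for s, t in product(L, M)}
assert LM == {"ac", "abc"}

# Exponentiation: L^0 = {ε}, L^i = L^(i-1) L
def power(lang, i):
    if i == 0:
        return {""}
    return {s + t for s, t in product(power(lang, i - 1), lang)}

assert power(L, 2) == {"aa", "aab", "aba", "abab"}

# Kleene closure is infinite in general; approximate by bounding i
def closure_upto(lang, n):
    result = set()
    for i in range(n + 1):
        result |= power(lang, i)
    return result

assert "" in closure_upto(L, 3) and "abab" in closure_upto(L, 3)
```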

Examples of Regular Expressions
–All strings of 1s and 0s: ( 0 | 1 )*
–All strings of 1s and 0s beginning with a 1: 1 ( 0 | 1 )*
–All strings of 0s and 1s containing at least two consecutive 1s: ( 0 | 1 )* 1 1 ( 0 | 1 )*
–All strings of alternating 0s and 1s: ( ε | 1 ) ( 0 1 )* ( ε | 0 )
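These four expressions can be sanity-checked with Python's `re` module (translating the slide notation into Python syntax, with ε handled via `?`):

```python
import re

all_01        = re.compile(r"(0|1)*")           # all strings of 1s and 0s
starts_with_1 = re.compile(r"1(0|1)*")          # beginning with a 1
two_ones      = re.compile(r"(0|1)*11(0|1)*")   # at least two consecutive 1s
alternating   = re.compile(r"(1?)(01)*(0?)")    # alternating 0s and 1s

assert all_01.fullmatch("010011")
assert starts_with_1.fullmatch("1001") and not starts_with_1.fullmatch("01")
assert two_ones.fullmatch("0110") and not two_ones.fullmatch("0101")
assert alternating.fullmatch("10101") and not alternating.fullmatch("100")
```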

Extensions to Regular Expressions (a la JLex)
–x+ = x x*, denotes L(x)+
–x? = x | ε, denotes L(x) ∪ {ε}
–[abc] = a | b | c, matches one character in the square brackets
–a-z = a | b | c | ... | z, a range
–[0-9a-z] = 0 | 1 | 2 | ... | 9 | a | b | c | ... | z
–[^abc]: ^ means negation; matches any character except a, b, or c
–. (dot) matches any character except the newline; \n means newline, so dot is equivalent to [^\n]
–"[" matches a left square bracket; metacharacters in double quotes become plain characters
–\[ matches a left square bracket; a metacharacter after a backslash becomes a plain character

Regular Definitions
We can define macros using regular expressions and use them in other regular expressions:
Letter → (a | b | c | ... | z | A | B | C | ... | Z)
Digit → (0 | 1 | 2 | ... | 9)
Identifier → Letter ( Letter | Digit )*
Important: We should be able to order these definitions so that every definition uses only the definitions defined before it (i.e., no recursion). Regular definitions can then be converted to basic regular expressions with macro expansion.
In JLex, enclose definitions in curly braces:
Identifier → {Letter} ( {Letter} | {Digit} )*
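Macro expansion can be mimicked with plain string substitution before compiling the pattern, which is essentially what JLex-style `{Letter}` macros do. A small Python sketch (the variable names simply mirror the definitions above):

```python
import re

# Each "macro" is just a string that later definitions splice in.
Letter = "[a-zA-Z]"
Digit = "[0-9]"
Identifier = f"{Letter}({Letter}|{Digit})*"   # Letter (Letter|Digit)*

ident = re.compile(Identifier)
assert ident.fullmatch("sum1")
assert not ident.fullmatch("1sum")   # must start with a letter
```

Because each definition only references earlier ones, expansion is a single substitution pass with no recursion, exactly as the ordering requirement above guarantees.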

Examples of Regular Expressions
Digit → (0 | 1 | 2 | ... | 9)
Integer → (+ | -)? (0 | (1 | 2 | 3 | ... | 9) Digit* )
Decimal → Integer "." Digit*
Real → ( Integer | Decimal ) E (+ | -)? Digit*
Complex → "(" Real , Real ")"
Numbers can get even more complicated.
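As a sanity check, the first four definitions translate into Python `re` patterns as follows (a sketch; a real lexer would be generated from such a specification, and the Complex pattern is omitted here):

```python
import re

Digit   = r"[0-9]"
Integer = rf"[+-]?(0|[1-9]{Digit}*)"          # optional sign, no leading zeros
Decimal = rf"{Integer}\.{Digit}*"             # Integer "." Digit*
Real    = rf"({Integer}|{Decimal})E[+-]?{Digit}*"

assert re.fullmatch(Integer, "-42")
assert not re.fullmatch(Integer, "007")   # leading zeros excluded by the spec
assert re.fullmatch(Decimal, "3.14")
assert re.fullmatch(Real, "7.2E-2")
```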

From Regular Expressions to Scanners
–Regular expressions are useful for specifying the patterns that correspond to tokens
–However, we also want to construct programs that recognize these patterns
–How do we do it? Use finite automata!

Example
Consider the problem of recognizing register names in an assembler:
Register → R (0 | 1 | 2 | ... | 9) (0 | 1 | 2 | ... | 9)*
–Allows registers of arbitrary number
–Requires at least one digit
The RE corresponds to a recognizer (or DFA):
[Diagram: initial state S0 --R--> S1 --(0|1|...|9)--> S2 (accepting), with S2 looping on (0|1|...|9); all other transitions lead to the error state Se, which loops on (R|0|1|...|9)]

Deterministic Finite Automata (DFA)
–A set of states S = { s0, s1, s2, se }
–A set of input symbols (an alphabet) Σ = { R, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
–A transition function δ : S × Σ → S, mapping (state, symbol) pairs to states:
  δ = { (s0, R) → s1, (s0, 0-9) → se, (s1, 0-9) → s2, (s1, R) → se, (s2, 0-9) → s2, (s2, R) → se, (se, R | 0-9) → se }
–A start state: s0
–A set of final (or accepting) states: Final = { s2 }
A DFA accepts a word x iff there exists a path in the transition graph from the start state to a final state such that the edge labels along the path spell out x.
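This five-tuple translates directly into a table-driven recognizer. A Python sketch of the register DFA, in which a missing table entry stands in for the explicit error state se (state names and the `classify` helper are illustrative):

```python
# Transition function δ for the register DFA, with all digits collapsed
# into one symbol class. Missing entries mean "go to the error state".
DELTA = {
    ("s0", "R"): "s1",
    ("s1", "digit"): "s2",
    ("s2", "digit"): "s2",
}
FINAL = {"s2"}

def classify(ch):
    """Collapse the ten digit symbols into one character class."""
    return "digit" if ch.isdigit() else ch

def accepts(word):
    state = "s0"
    for ch in word:
        # Any undefined transition (including symbols outside Σ) falls
        # into the error state se, which traps.
        state = DELTA.get((state, classify(ch)), "se")
    return state in FINAL

assert accepts("R17")
assert not accepts("R") and not accepts("A") and not accepts("R17R")
```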

DFA Simulation
Start in state s0 and follow transitions on each input character. The DFA accepts a word x iff x leaves it in a final state (s2). So:
–"R17" takes it through s0, s1, s2 and is accepted
–"R" takes it through s0, s1 and fails
–"A" takes it straight to se
–"R17R" takes it through s0, s1, s2, se and is rejected
[Diagram: the register recognizer, as on the previous slide]

Simulating a DFA
The recognizer translates directly into code:

state = s0;
char = get_next_char();
while (char != EOF) {
  state = δ(state, char);
  char = get_next_char();
}
if (state ∈ Final) report acceptance;
else report failure;

We can store the transition table δ in a two-dimensional array (and the final states, Final = { s2 }, in another array):

δ  | R  | 0,1,2,...,9 | other
s0 | s1 | se          | se
s1 | se | s2          | se
s2 | se | s2          | se
se | se | se          | se

To change DFAs, just change the arrays. The simulation takes O(|x|) time for an input string x.

Recognizing the Longest Accepted Prefix
Given an input string, this simulation algorithm returns the longest accepted prefix:

accepted = false;
current_string = ε;  // empty string
state = s0;          // initial state
if (state ∈ Final) {
  accepted_string = current_string;
  accepted = true;
}
char = get_next_char();
while (char != EOF) {
  state = δ(state, char);
  current_string = current_string + char;
  if (state ∈ Final) {
    accepted_string = current_string;
    accepted = true;
  }
  char = get_next_char();
}
if (accepted) return accepted_string;
else report error;

Using the same transition table (Final = { s2 }), given the input "R17R" this algorithm returns "R17".
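The same longest-prefix (maximal-munch) algorithm as a runnable Python sketch of the register recognizer (state names and the digit character class are illustrative, as before):

```python
def longest_accepted_prefix(word):
    """Return the longest prefix of `word` accepted by the register DFA,
    or None if no prefix is accepted."""
    delta = {("s0", "R"): "s1", ("s1", "d"): "s2", ("s2", "d"): "s2"}
    final = {"s2"}
    state, accepted = "s0", None
    for i, ch in enumerate(word):
        # Undefined transitions trap in the error state "se".
        state = delta.get((state, "d" if ch.isdigit() else ch), "se")
        if state in final:
            accepted = word[: i + 1]   # remember the latest accepting point
    return accepted

assert longest_accepted_prefix("R17R") == "R17"
assert longest_accepted_prefix("R") is None
```

Note that the scan keeps going after reaching the error state; a production scanner would instead stop early and restart at the accepted boundary, but the returned prefix is the same.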