SCANNING, I.E. LEXICAL ANALYSIS, I.E. LINEAR ANALYSIS


SCANNING (i.e. lexical analysis, i.e. linear analysis)

Interaction of Parser and Scanner
The source program, written in a high-level language (HLL), is a stream of characters read from left to right. The scanner reads characters from that stream and groups them into tokens, which it returns to the parser. Whitespace (blanks, tabs, returns) and comments are eliminated.

Benefits of Modularity (Scanning is a separate module)
Simplicity: input handling is separated from the rest of the program, which deals only with tokens and not with characters.

Benefits of Modularity (Scanning is a separate module)
Simplicity
Efficiency: scanning is I/O intensive, so buffering is used to improve efficiency.

Benefits of Modularity (Scanning is a separate module)
Simplicity
Efficiency
Portability (historical): different machines might have different keyboards, e.g. in Pascal, @ and up-arrow.

Benefits of Modularity (Scanning is a separate module)
Internationalisation, e.g.:
ALGOL 68 in many languages: Russian, German, French, Bulgarian, Japanese
LOGO (educational): English, French, Italian
Mama (educational): English, Hebrew, Yiddish, Chinese
M4 macro processor: English, German
Scratch (educational): 40+ languages
Perl: Klingon (different grammar as well)

What are tokens?
The most basic meaningful objects in a computer program: sequences of characters which form the low-level constructs of the HLL (e.g. variable names, keywords, labels, operators).
Examples:
A[12,index2] = getstuff(23);
B[66,88] = getmorestuff(11);

Token Structure
Early on (in the parser), we need to recognize the validity of the structure, so the value of identifiers does not matter.
Tokens have 2 fields: type (compulsory) and value (optional).
Vocabulary: the type is often called the “token”; the string being scanned (the instance) is called a “lexeme”.
Example:
Token:  25        hello       if  +     -      <  >
Type:   constant  identifier  IF  PLUS  MINUS  Comparison
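The two-field structure above could be represented as follows. This is a minimal sketch: the class name `Token`, the field names, and the example values are assumptions made for illustration, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """A token with a compulsory type and an optional value."""
    type: str                    # e.g. "constant", "identifier", "IF", "PLUS"
    value: Optional[str] = None  # e.g. the lexeme, kept only when later phases need it


# Hypothetical tokens for the lexemes 25, hello, if and +.
tokens = [
    Token("constant", "25"),
    Token("identifier", "hello"),
    Token("IF"),    # a keyword needs no value: its type says everything
    Token("PLUS"),
]

for tok in tokens:
    print(tok.type, tok.value)
```

The parser only needs to look at `type` to check structure; `value` matters to later phases.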

How to write a scanner
Brute-force scanning (e.g. the section that handles identifiers starting with ‘c’)

How to write a scanner
Brute-force scanning (e.g. the section that handles identifiers starting with ‘c’)
Problems?
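The code shown on the slide is not preserved in this transcript. As a rough idea of what brute-force scanning can look like, here is a hypothetical sketch for words starting with 'c', assuming the keywords are "class" and "case" and identifiers contain only letters. The function name `scan_c_word` and the token type names are invented for illustration.

```python
def scan_c_word(text, pos=0):
    """Brute-force scan of a word that starts with 'c' at text[pos].

    Returns (token_type, lexeme). Every keyword is spelled out
    character by character in nested tests.
    """
    i = pos + 1  # text[pos] is assumed to be 'c'
    kind = "identifier"
    if text[i:i + 1] == "l":
        if text[i + 1:i + 2] == "a":
            if text[i + 2:i + 3] == "s":
                if text[i + 3:i + 4] == "s":
                    i += 4
                    kind = "CLASS"
    elif text[i:i + 1] == "a":
        if text[i + 1:i + 2] == "s":
            if text[i + 2:i + 3] == "e":
                i += 3
                kind = "CASE"
    # Any further letters turn a would-be keyword into an ordinary identifier.
    while i < len(text) and text[i].isalpha():
        i += 1
        kind = "identifier"
    return kind, text[pos:i]


print(scan_c_word("class"))
print(scan_c_word("classy"))
print(scan_c_word("cat"))
```

The nesting grows with every keyword and every shared prefix, and a similar section is needed for each starting letter, which is one way to answer the "Problems?" question.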

How to write a scanner
Hand-coded Finite State Automaton (FSA) or Transition Diagram
EXERCISE: Draw the FSA for the example above. Assume the FSA recognizes "class", "case", and other identifiers which contain only letters.

Finite State Automata
A Finite State Automaton (FSA) A consists of 5 objects:
A set I, called the input alphabet, of input symbols;
A set S of states the automaton can be in;
A designated state s0, called the initial state;
A designated set of states, called the set of accepting states, or final states;
A next-state function N: S×I → S that associates a “next state” with each ordered pair consisting of a “current state” and a “current input”.
For each state s in S and input symbol m in I, N(s,m) is called the state to which A goes if m is input to A when A is in state s.
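To make the definition concrete, the components can be written down directly as data. The sketch below is one possible encoding of an automaton for "class", "case" and other all-letter identifiers. The state names, the lowercase-only alphabet, and the helper `_KEYWORD_EDGES` are assumptions chosen for illustration.

```python
import string

I = set(string.ascii_lowercase)   # input alphabet (lowercase letters only, by assumption)
S = {"start", "c", "cl", "cla", "clas", "class", "ca", "cas", "case", "id"}  # states
s0 = "start"                      # initial state
ACCEPTING = S - {"start"}         # every non-empty word of letters is some token

# Only the keyword paths are listed; every other (state, letter) pair leads to "id".
_KEYWORD_EDGES = {
    ("start", "c"): "c",
    ("c", "l"): "cl", ("cl", "a"): "cla", ("cla", "s"): "clas", ("clas", "s"): "class",
    ("c", "a"): "ca", ("ca", "s"): "cas", ("cas", "e"): "case",
}


def N(s, m):
    """Next-state function N: S x I -> S, defined for every state and symbol."""
    if s not in S or m not in I:
        raise ValueError(f"N is only defined on S x I, got ({s!r}, {m!r})")
    return _KEYWORD_EDGES.get((s, m), "id")


print(N("start", "c"), N("clas", "s"), N("class", "y"))
```

Because `N` falls back to "id" for every unlisted pair, it is a total function on S×I, as the definition requires.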

FSA: Transition Diagrams
The operation of an FSA is commonly described by a diagram called a (state-)transition diagram. In a transition diagram, states are represented by circles, and accepting states by double circles. There is one arrow that points to the initial state, and other arrows between states as follows: there is an arrow from state s to state t labeled m (m∈I) iff N(s,m)=t.

FSA: Next State and Eventual-State
The next-state table is a tabular representation of the next-state function. In the annotated next-state table, the initial state is indicated by an arrow and the accepting states by double circles.
The eventual-state function of A is the function N*: S×I* → S defined as follows: for any state s of S and any input string w in I*, N*(s,w) = the state to which A goes if the symbols of w are input into A in sequence, starting when A is in state s.
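The eventual-state function is just the next-state function applied once per input symbol. A small sketch, using a hypothetical two-state automaton over {0, 1} that tracks whether it has seen an even or odd number of 1s:

```python
# Next-state table for a hypothetical automaton over the alphabet {"0", "1"}.
NEXT_STATE = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}


def eventual_state(s, w):
    """N*(s, w): feed the symbols of w, in order, starting from state s."""
    for m in w:
        s = NEXT_STATE[(s, m)]
    return s


print(eventual_state("even", ""))      # the empty string leaves the state unchanged
print(eventual_state("even", "1101"))
```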

How to write a scanner
Hand-coded Finite State Automaton (FSA) or Transition Diagram
How to code an FSA?
EXERCISE: Code the FSA.

How to write a scanner
Need:
isFinal(state) function
NextState[state,c] table
TokenState[state]
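One way these three pieces can fit together is sketched below. It assumes, for simplicity, that tokens are delimited by whitespace and that the language has only all-letter identifiers and all-digit constants. The Python names mirror isFinal, NextState and TokenState, but the table contents are invented for illustration.

```python
import string

START, IN_ID, IN_NUM = "start", "in_id", "in_num"

# NextState[state, c]: a missing entry means "no transition".
NEXT_STATE = {}
for ch in string.ascii_letters:
    NEXT_STATE[(START, ch)] = IN_ID
    NEXT_STATE[(IN_ID, ch)] = IN_ID
for ch in string.digits:
    NEXT_STATE[(START, ch)] = IN_NUM
    NEXT_STATE[(IN_NUM, ch)] = IN_NUM

# TokenState[state]: the token type recognized in each final state.
TOKEN_STATE = {IN_ID: "identifier", IN_NUM: "constant"}


def is_final(state):
    return state in TOKEN_STATE


def scan(text):
    """Scan whitespace-delimited lexemes; return a list of (type, lexeme)."""
    tokens = []
    for lexeme in text.split():
        state = START
        for c in lexeme:
            state = NEXT_STATE.get((state, c))
            if state is None:  # the automaton is stuck on this character
                break
        if state is not None and is_final(state):
            tokens.append((TOKEN_STATE[state], lexeme))
        else:
            tokens.append(("error", lexeme))
    return tokens


print(scan("count 42 x9"))
```

In this sketch a lexeme on which the automaton gets stuck is reported as an "error" token, which is only one of several possible policies for lexical errors.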

How to write a scanner
Question 1: What about lexical errors?


How to write a scanner
Question 2: What if tokens are not delimited by whitespace?
EXERCISE: Add the “<”, “<=”, and “=” tokens to the language.

Returning a character to the stream
Need: buffering to read and put back characters.
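A minimal sketch of such a buffer, together with the kind of code that needs it: after reading "<", the scanner must read one more character to decide between "<" and "<=", and put that character back if it is not "=". The class `PushbackReader`, the function `scan_operator`, and the token names LT, LE and EQ are assumptions made for this example.

```python
class PushbackReader:
    """Character source that lets the scanner put characters back."""

    def __init__(self, text):
        self._text = text
        self._pos = 0
        self._pushed = []  # characters returned to the stream, most recent last

    def read(self):
        if self._pushed:
            return self._pushed.pop()
        if self._pos < len(self._text):
            c = self._text[self._pos]
            self._pos += 1
            return c
        return ""  # empty string signals end of input

    def unread(self, c):
        if c:  # nothing to put back at end of input
            self._pushed.append(c)


def scan_operator(reader):
    """Return "LE", "LT" or "EQ", or None if no operator starts here."""
    c = reader.read()
    if c == "<":
        nxt = reader.read()
        if nxt == "=":
            return "LE"
        reader.unread(nxt)  # the lookahead belongs to the next token
        return "LT"
    if c == "=":
        return "EQ"
    reader.unread(c)
    return None


reader = PushbackReader("<=<a")
print(scan_operator(reader), scan_operator(reader), reader.read())
```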

How to write a scanner
Question 3: How to indicate whether to consume the last character?

How to write a scanner
Question 4: How to make NextState efficient?

How to write a scanner
Question 5: How to optimize for groups of characters? E.g. groups of letters for identifiers, groups of digits for numeric constants, groups of characters for string constants.
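One common answer is to map each character to a character class first and index the next-state table by class rather than by character. The sketch below assumes identifiers of the form letter (letter | digit)* and constants made of digits. All names are illustrative.

```python
import string

# Map every character to its class once; unknown characters are "other".
CLASS_OF = {ch: "letter" for ch in string.ascii_letters}
CLASS_OF.update({ch: "digit" for ch in string.digits})


def char_class(c):
    return CLASS_OF.get(c, "other")


# The table now has one column per class instead of one per character.
NEXT_STATE = {
    ("start", "letter"): "in_id",
    ("start", "digit"): "in_num",
    ("in_id", "letter"): "in_id",
    ("in_id", "digit"): "in_id",
    ("in_num", "digit"): "in_num",
}


def run(word):
    """Return the state reached on `word`, or None if the automaton gets stuck."""
    state = "start"
    for c in word:
        state = NEXT_STATE.get((state, char_class(c)))
        if state is None:
            return None
    return state


print(run("x9"), run("42"), run("4x"))
```

Five table entries replace what would otherwise be one entry per state and individual letter or digit.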

How to write a scanner
Question 6: Who is going to create such FSAs and associated tables?

Kleene’s Theorem
A language is accepted by an FSA iff it can be described by a regular expression. Such a language is called a regular language.

Formal Languages – Alphabets and Strings
An alphabet Σ is a finite set of characters (or symbols).
A word, or sequence, or string over Σ is any sequence of 0 or more characters of Σ.
The length of a word is the number of characters in the word.
The null string is the string of length 0. It is denoted ε or λ.
A string of length n is really an ordered n-tuple of characters written without parentheses or commas.
Given two strings x and y over Σ, the concatenation of x and y is the string xy obtained by putting all the characters of y right after those of x.

Formal Languages – Languages over an alphabet
Let Σ be an alphabet.
A formal language over Σ is a set of strings over Σ.
∅ is the empty language (over Σ).
Σ^n = {all strings over Σ that have length n}, where n∈N
Σ+ = the positive closure of Σ = {all strings over Σ that have length ≥ 1}
Σ* = the Kleene closure of Σ = {all strings over Σ}

Formal Languages – Operations on Languages
Let Σ be an alphabet. Let L and L′ be two languages defined over Σ. The following operations define new languages over Σ:
The concatenation of L and L′, denoted LL′, is LL′ = {xy | x∈L ∧ y∈L′}
The union of L and L′, denoted L∪L′, is L∪L′ = {x | x∈L ∨ x∈L′}
The Kleene closure of L, denoted L*, is L* = {x | x is a concatenation of any finite number of strings in L}. Note that ε∈L*.
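For finite languages, these operations can be tried out directly on Python sets of strings. Since L* is infinite whenever L contains a non-empty string, the sketch below only computes the part of L* built from at most k strings. The function names are illustrative.

```python
from itertools import product


def concat(L1, L2):
    """LL' = {xy | x in L and y in L'}."""
    return {x + y for x, y in product(L1, L2)}


def union(L1, L2):
    """L ∪ L' = {x | x in L or x in L'}."""
    return L1 | L2


def kleene_up_to(L, k):
    """Finite slice of L*: concatenations of at most k strings of L."""
    result = {""}  # the concatenation of zero strings is the null string
    layer = {""}
    for _ in range(k):
        layer = concat(layer, L)
        result |= layer
    return result


print(sorted(concat({"a", "b"}, {"c"})))
print(sorted(kleene_up_to({"ab"}, 2)))
```

Note that the null string is in the result even when L is empty, matching ε∈L*.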

Regular Expressions - Definition
Let Σ be an alphabet. The following are regular expressions (r.e.) over Σ:
I. BASE: ε and each individual symbol of Σ are regular expressions.
II. RECURSION: if r and s are regular expressions over Σ, then the following are also regular expressions over Σ:
(rs), the concatenation of r and s
(r | s), r or s
(r*), the Kleene closure of r
III. RESTRICTION: The only regular expressions over Σ are the ones defined by I and II above.

Regular Expressions – Operator Precedence
The order of precedence of r.e. operators is, from highest to lowest:
Highest: ()
*
concatenation
Lowest: |

REs – Languages defined by REs
Let Σ be an alphabet. Define a function L as follows:
L: {all r.e.'s over Σ} → {all languages over Σ}
L(r) = the language defined by r
I. BASE: L(ε) = {ε}, and L(a) = {a} for every a∈Σ.
II. RECURSION: If L(r) and L(s) are the languages defined by the regular expressions r and s over Σ, then
L(rs) = L(r)L(s)
L(r|s) = L(r) ∪ L(s)
L(r*) = (L(r))*
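As a worked example (not from the slides), the rules can be applied step by step to the expression a|bc* over Σ = {a, b, c}. By the usual precedence, * binds tighter than concatenation, which binds tighter than |, so the expression reads as a | (b(c*)):

```latex
\begin{align*}
L(a \mid bc^*) &= L(a) \cup L(bc^*) \\
               &= \{a\} \cup L(b)\,L(c^*) \\
               &= \{a\} \cup \{b\}\,(L(c))^* \\
               &= \{a\} \cup \{b\}\,\{c\}^* \\
               &= \{a,\ b,\ bc,\ bcc,\ bccc,\ \dots\}
\end{align*}
```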

REs – Languages defined by REs: Variations
Some definitions of regular expressions and regular languages define ∅ to be an r.e. with L(∅)=∅.

Properties of REs
Regular expressions can be simplified by applying the following properties. For any regular expressions r, s, t:
Axiom                                   Description
r | s = s | r                           | is commutative
r | (s | t) = (r | s) | t = r | s | t   | is associative
(rs)t = r(st) = rst                     Concatenation is associative
r(s|t) = rs | rt and (s|t)r = sr | tr   Concatenation is distributive over |
rε = εr = r                             ε is the identity element for concatenation
r** = r*                                * is idempotent
r* = (r|ε)*                             ε is guaranteed to be in a closure

REs – Notational Shorthands
Here are some frequent constructs which have their own notation:
(r)+ means one or more instances of r: L((r)+) = (L(r))+
(r)? means 0 or 1 instances of r, i.e. (r)? = r|ε: L((r)?) = L(r|ε) = L(r) ∪ L(ε) = L(r) ∪ {ε}
Character classes: [abc] = a|b|c and [a-z] = a|b|…|z
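These shorthands exist in most regular-expression libraries, so the equalities can be spot-checked. The sketch below uses Python's standard `re` module and compares the strings each pattern fully matches, over a small alphabet and up to a bounded length. This is a sanity check on examples, not a proof, and the helper `language` is invented for the purpose.

```python
import re
from itertools import product


def language(pattern, alphabet="abc", max_len=4):
    """All strings over `alphabet`, up to length max_len, fully matched by `pattern`."""
    compiled = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if compiled.fullmatch("".join(chars))
    }


# (r)+ versus r r*, with r = ab
print(language("(ab)+") == language("(ab)(ab)*"))
# (r)? versus r|ε: an empty alternative plays the role of ε in Python's syntax
print(language("(ab)?") == language("(ab)|"))
# Character class versus explicit alternation
print(language("[abc]") == language("a|b|c"))
```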

REs – Regular Definitions
Regular expressions can be broken down into regular definitions: sequences of definitions of the form
d1 → r1
…
dn → rn
where each di is a distinct name and each ri is a regular expression over the symbols in Σ ∪ {d1, d2, …, d(i-1)}.

REs – Examples
The regular expression
identifier = [A-Za-z][A-Za-z0-9]*
can be broken down into the regular definitions
letter → [A-Za-z]
digit → [0-9]
identifier → letter (letter | digit)*
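The same decomposition can be mimicked in code by building the larger pattern out of named pieces. A small sketch using Python's `re` module follows. The variable names follow the regular definitions above, and `is_identifier` is a helper added for illustration.

```python
import re

# Each definition may only use the names defined before it.
letter = r"[A-Za-z]"
digit = r"[0-9]"
identifier = rf"{letter}(?:{letter}|{digit})*"

IDENTIFIER = re.compile(identifier)


def is_identifier(s):
    """True if the whole string s matches the identifier definition."""
    return IDENTIFIER.fullmatch(s) is not None


for word in ["index2", "getstuff", "2index", ""]:
    print(repr(word), is_identifier(word))
```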

REs and Scanning
Why regular expressions for scanning?

Regular Languages and FSA
Let A be an FSA with set of input symbols I. Let w be a string of I*. Then w is accepted by A iff N*(s0,w) is an accepting state.
The language accepted by A, denoted L(A), is the set of all strings that are accepted by A:
L(A) = {w∈I* | N*(s0,w) is an accepting state of A}
Kleene’s Theorem: A language is accepted by an FSA iff it can be described by a regular expression. Such a language is called a regular language.
Theorem 1: Some languages are not regular.
Theorem 2: The set of regular languages over an alphabet I is closed under the complement, union and intersection operators.
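Acceptance, and the complement part of Theorem 2, can be illustrated with a small automaton. Because the next-state function is defined for every state and symbol, swapping accepting and non-accepting states yields an automaton for the complement language. The two-state automaton below (even number of 1s over {0, 1}) is a hypothetical example.

```python
from itertools import product

STATES = {"even", "odd"}
START = "even"
ACCEPTING = {"even"}  # accepts strings with an even number of 1s
NEXT_STATE = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}


def accepts(w, accepting=ACCEPTING):
    """w is accepted iff N*(s0, w) is an accepting state."""
    s = START
    for m in w:
        s = NEXT_STATE[(s, m)]
    return s in accepting


# Same states and transitions, complemented accepting set.
COMPLEMENT_ACCEPTING = STATES - ACCEPTING

for n in range(3):
    for chars in product("01", repeat=n):
        w = "".join(chars)
        print(repr(w), accepts(w), accepts(w, COMPLEMENT_ACCEPTING))
```

Every string is accepted by exactly one of the two automata.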

How to write a scanner
Question 7: Practically speaking, how to translate r.e.'s into FSAs?

RE → Transition Diagram
EXAMPLES


JavaCC
Web tab 1: Assignment 1. Look at the assignment description.
OV2-5 JavaCC Scanning
Web tab 2: The handout
ssh: cd ~cps710/public_html/term/A1/Lecture
Structure: directory structure
ssh: ls
HL.jj
Page 2 of handout: look at the makefile
ssh: run javacc (1st command in makefile); look at the .java files
ssh + Page 1 of handout: finish compilation (2nd command in makefile)
Use the run program to run interactively
Use the run program to redirect input from testfile
Look at the .jj structure: Pages 3-4 of handout
Special states: 2 scanners, one for java and one for javadoc + 2 parsers
Sometimes regular expressions are not enough: a PDA is needed in addition to the FSA (e.g. comments), so you will need to write extra code.
More: Explain it with strings
Conflict Resolution Rules