Lexical Analysis CSE 340 – Principles of Programming Languages Fall 2015 Adam Doupé Arizona State University

Slides:



Advertisements
Similar presentations
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Advertisements

COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
CS 3240 – Chuck Allison.  A model of computation  A very simple, manual computer (we draw pictures!)  Our machines: automata  1) Finite automata (“finite-state.
Lexical Analysis Recognize tokens and ignore white spaces, comments
Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.
CPSC 388 – Compiler Design and Construction
Regular Language & Expressions. Regular Language A regular language is one that a finite state machine (fsm) will accept. ‘Alphabet’: {a, b} ‘Rules’:
Language Recognizer Connecting Type 3 languages and Finite State Automata Copyright © – Curt Hill.
1 Syntax Specification Regular Expressions. 2 Phases of Compilation.
Module 2 How to design Computer Language Huma Ayub Software Construction Lecture 7 1.
CSC312 Automata Theory Lecture # 2 Languages.
CSC312 Automata Theory Lecture # 2 Languages.
Formal Methods in SE Theory of Automata Qasiar Javaid Assistant Professor Lecture # 06.
Lecture Two: Formal Languages Formal Languages, Lecture 2, slide 1 Amjad Ali.
Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Machine-independent code improvement Target code generation Machine-specific.
REGULAR EXPRESSIONS. Lexical Analysis Lexical analysers can be constructed by programs such as LEX These programs employ as input a description of the.
Two examples English-Words English-Sentences alphabet S ={a,b,c,d,…}
1 Regular Expressions. 2 Regular expressions describe regular languages Example: describes the language.
Lecture # 3 Chapter #3: Lexical Analysis. Role of Lexical Analyzer It is the first phase of compiler Its main task is to read the input characters and.
1 Language Definitions Lecture # 2. Defining Languages The languages can be defined in different ways, such as Descriptive definition, Recursive definition,
1 Chapter 1 Introduction to the Theory of Computation.
Lecture # 3 Regular Expressions 1. Introduction In computing, a regular expression provides a concise and flexible means to "match" (specify and recognize)
Languages, Grammars, and Regular Expressions Chuck Cusack Based partly on Chapter 11 of “Discrete Mathematics and its Applications,” 5 th edition, by Kenneth.
Grammars CPSC 5135.
Lexical Analysis (I) Compiler Baojian Hua
COMP 3438 – Part II - Lecture 2: Lexical Analysis (I) Dr. Zili Shao Department of Computing The Hong Kong Polytechnic Univ. 1.
Lexical Analysis I Specifying Tokens Lecture 2 CS 4318/5531 Spring 2010 Apan Qasem Texas State University *some slides adopted from Cooper and Torczon.
Module 2 How to design Computer Language Huma Ayub Software Construction Lecture 8.
L ECTURE 3 Chapter 4 Regular Expressions. I MPORTANT T ERMS Regular Expressions Regular Languages Finite Representations.
COMP313A Programming Languages Lexical Analysis. Lecture Outline Lexical Analysis The language of Lexical Analysis Regular Expressions.
1 November 1, November 1, 2015November 1, 2015November 1, 2015 Azusa, CA Sheldon X. Liang Ph. D. Computer Science at Azusa Pacific University Azusa.
Introduction to Theory of Automata By: Wasim Ahmad Khan.
1 Module 14 Regular languages –Inductive definitions –Regular expressions syntax semantics.
Syntax Analysis CSE 340 – Principles of Programming Languages Fall 2015 Adam Doupé Arizona State University
1 Course Overview PART I: overview material 1Introduction 2Language processors (tombstone diagrams, bootstrapping) 3Architecture of a compiler PART II:
Review: Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Intermediate code generator Code optimizer Code generator Symbol.
MA/CSSE 474 Theory of Computation Regular Expressions Intro.
Mathematical Notions and Terminology Lecture 2 Section 0.2 Fri, Aug 24, 2007.
CPS 506 Comparative Programming Languages Syntax Specification.
Syntax The Structure of a Language. Lexical Structure The structure of the tokens of a programming language The scanner takes a sequence of characters.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
Compiler Construction By: Muhammad Nadeem Edited By: M. Bilal Qureshi.
CSC312 Automata Theory Lecture # 3 Languages-II. Formal Language A formal language is a set of words—that is, strings of symbols drawn from a common alphabet.
Recursive Definations Regular Expressions Ch # 4 by Cohen
Formal Languages and Grammars
Lecture # 4.
Mathematical Foundations of Computer Science Chapter 3: Regular Languages and Regular Grammars.
Chapter 2 Scanning. Dr.Manal AbdulazizCS463 Ch22 The Scanning Process Lexical analysis or scanning has the task of reading the source program as a file.
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
1 Strings and Languages Lecture 2-3 Ref. Handout p12-17.
Lecture 03: Theory of Automata:2014 Asif Nawaz Theory of Automata.
Deterministic Finite Automata Nondeterministic Finite Automata.
Languages and Strings Chapter 2. (1) Lexical analysis: Scan the program and break it up into variable names, numbers, etc. (2) Parsing: Create a tree.
Theory of Computation Lecture #
Lecture # 2.
CS510 Compiler Lecture 2.
Regular Languages, Regular Operations, Closure
Lecture 2 Lexical Analysis
Chapter 3 Lexical Analysis.
Lexical Analysis CSE 340 – Principles of Programming Languages
Formal Language Theory
Review: Compiler Phases:
Recap lecture 29 Example of prefixes of a language, Theorem: pref(Q in R) is regular, proof, example, Decidablity, deciding whether two languages are equivalent.
Lexical Analysis Lecture 3-4 Prof. Necula CS 164 Lecture 3.
Specification of tokens using regular expressions
CSE 340 Recitation Week 3 : Sept 1st – 7th Regular Expressions
Compiler Construction
COMPILERS LECTURE(6-Aug-13)
CSC312 Automata Theory Lecture # 2 Languages.
Presentation transcript:

Lexical Analysis CSE 340 – Principles of Programming Languages Fall 2015 Adam Doupé Arizona State University

Adam Doupé, Principles of Programming Languages Language Syntax Programming Language must have a clearly specified syntax Programmers can learn the syntax and know what is allowed and what is not allowed Compiler writers can understand programs and enforce the syntax 2

Adam Doupé, Principles of Programming Languages Language Syntax Input is a series of bytes –How to get from a string of characters to program execution? We must first assemble the string of characters into something that a program can understand Output is a series of tokens 3

Adam Doupé, Principles of Programming Languages Language Syntax In English, we have an alphabet – a … z,,,., !, ?, … However, we also have a higher abstraction than letter in the alphabet Words –Defined in a dictionary –Categorized into Nouns Verbs Adverbs Articles Sentences Paragraphs 4

Adam Doupé, Principles of Programming Languages Language Syntax In a programming language, we also have an alphabet (the symbols that are important in the specific language) – a, z,,,., !, ?,, ;, }, {, (, ), … Just as in English, we create abstractions of the low-level alphabet Tokens – == – <= – while – if Tokens are precisely specified using patterns 5

Adam Doupé, Principles of Programming Languages Strings Alphabet symbols together make a string We define a string over an alphabet as a finite sequence of symbols from is the empty string, an empty sequence of symbols Concatenating with a string s gives s –s = s = s In our examples, strings will be stylized differently, either –"in between double quotes" –italic and dark blue 6

Adam Doupé, Principles of Programming Languages Languages represents the set of all symbols in an alphabet We define * as the set of all strings over – * contains all the strings that can be created by combining the alphabet symbols into a string A language L over alphabet is a set of strings over –A language L is a subset of * Is infinite? Is * infinite? Is L infinite? 7

Adam Doupé, Principles of Programming Languages Regular Expressions Tokens are typically specified using regular expressions Regular expressions are –Compact –Expressive –Precise –Widely used –Easy to generate an efficient program to match a regular expression 8

Adam Doupé, Principles of Programming Languages Regular Expressions We must first define the syntax of regular expressions A regular expression is either 1. ∅ 2. 3.a, where a is an element of the alphabet 4.R 1 | R 2, where R 1 and R 2 are regular expressions 5.R 1. R 2, where R 1 and R 2 are regular expressions 6.(R), where R is a regular expression 7.R* where R is a regular expression 9

Adam Doupé, Principles of Programming Languages Regular Expressions A regular expression defines a language (the set of all strings that the regular expression describes) The language L(R) of regular expression R is given by: 1.L( ∅ ) = ∅ 2.L() = {} 3.L(a) = {a} 4.L(R 1 | R 2 ) = L(R 1 ) ∪ L(R 2 ) 5.L(R 1. R 2 ) = L(R 1 ). L(R 2 ) 10

Adam Doupé, Principles of Programming Languages L(R 1 | R 2 ) = L(R 1 ) ∪ L(R 2 ) Examples: L(a | b) = L(a) ∪ L(b) = {a} ∪ {b} = {a, b} L(a | b | c) = L(a | b) ∪ L(c) = {a, b} ∪ {c} = {a, b, c} L(a | ) = L(a) ∪ L() = {a} ∪ {} = {a, } L( | ) = {} {} ?= {} 11

Adam Doupé, Principles of Programming Languages L(R 1 | R 2 ) = L(R 1 ) ∪ L(R 2 ) Examples: L(a | b) = L(a) ∪ L(b) = {a} ∪ {b} = {a, b} L(a | b | c) = L(a | b) ∪ L(c) = {a, b} ∪ {c} = {a, b, c} L(a | ) = L(a) ∪ L() = {a} ∪ {} = {a, } L( | ) = {} {} ?= {} 12

Adam Doupé, Principles of Programming Languages L(R 1. R 2 ) = L(R 1 ). L(R 2 ) Definition For two sets A and B of strings: A. B = {xy : x ∈ A and y ∈ B} Examples: A = {aa, b }, B = {a, b} A. B = {aaa, aab, ba, bb} ab ? ∈ A. B A = {aa, b, }, B = {a, b} A. B = {aaa, aab, ba, bb, a, b} 13

Adam Doupé, Principles of Programming Languages L(R 1. R 2 ) = L(R 1 ). L(R 2 ) Definition For two sets A and B of strings: A. B = {xy : x ∈ A and y ∈ B} Examples: A = {aa, b }, B = {a, b} A. B = {aaa, aab, ba, bb} ab ? ∈ A. B A = {aa, b, }, B = {a, b} A. B = {aaa, aab, ba, bb, a, b} 14

Adam Doupé, Principles of Programming Languages Operator Precedence L( a | b. c ) What does this mean? (a | b). c or a | (b. c) Just like in math or a programming language, we must define the operator precedence (* higher precedence than +) a + b * c (a + b) * c or a + (b * c)?. has higher precedence than | L( a | b. c) = L(a) ∪ L(b. c) = {a} ∪ {bc} = {a, bc} 15

Adam Doupé, Principles of Programming Languages Regular Expressions L( (R) ) = L(R) L( (a | b). c ) = L (a | b). L (c) = {a, b}. {c} = {ac, bc} 16

Adam Doupé, Principles of Programming Languages Kleene Star L(R*) = ? L (R*) = {} ∪ L(R) ∪ L(R). L(R) ∪ L(R). L(R). L(R) ∪ L(R). L( R). L(R). L(R) … Definition L 0 (R) = {} L i (R) = L i-1 (R). L(R) L(R*) = ∪ i≥0 L i (R) 17

Adam Doupé, Principles of Programming Languages L(R*) = ∪ i≥0 L i (R) Examples L(a | b*) = {a,, b, bb, bbb, bbbb, …} L((a | b)*) = {} ∪ {a, b} ∪ {aa, ab, ba, bb} ∪ {aaa, aab, aba, abb, baa, bab, bba, bbb} ∪ … 18

Adam Doupé, Principles of Programming Languages L(R*) = ∪ i≥0 L i (R) Examples L(a | b*) = {a,, b, bb, bbb, bbbb, …} L((a | b)*) = {} ∪ {a, b} ∪ {aa, ab, ba, bb} ∪ {aaa, aab, aba, abb, baa, bab, bba, bbb} ∪ … 19

Adam Doupé, Principles of Programming Languages Tokens letter = a | b | c | d | e | … | A | B | C | D | E… digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ID = letter(letter | digit | _ )* a891_jksdbed 12ajkdfjb 20 Note that we've left out the. regular expression operator. It is implied when two regular expressions are next to each other, similar to x*y=xy in math.

Adam Doupé, Principles of Programming Languages Tokens How to define a number? NUM = digit* 132 NUM = digit(digit)* pdigit = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 NUM = pdigit(digit)*

Adam Doupé, Principles of Programming Languages Tokens NUM = pdigit. (digit)* | adb 22

Adam Doupé, Principles of Programming Languages Tokens How to define a decimal number? DECIMAL = NUM. \.. NUM DECIMAL = NUM. \.. digit* DECIMAL = NUM. \.. digit. digit* Note that here we mean a regular expression that matches the one- character string dot. However, to differentiate between the regular expression concatenation operator. and the character., we escape the. with a \ (similar to strings where \n represents the newline character in a string). This means that we also need to escape \ with a \ so that the regular expression \\ matches the string containing the single character \

Adam Doupé, Principles of Programming Languages Lexical Analysis The job of the lexer is to turn a series of bytes (composed from the alphabet) into a sequence of tokens –The API that we will discuss in this class will refer to the lexer as having a function called getToken (), which returns the next token from the input steam each time it is called Tokens are specified using regular expressions 24 Source Lexer Tokens Bytes NUM, ID, NUM, OPERATOR, ID, DECIMAL, …

Adam Doupé, Principles of Programming Languages Lexical Analysis Given these tokens: ID = letter. (letter | digit | _ )* DOT = \. NUM = pdigit. (digit)* | 0 DECIMAL = NUM. DOT. digit. digit* What token does getToken () return on this string: 1.1abc1.2 NUM? DECIMAL? ID? 25

Adam Doupé, Principles of Programming Languages Longest Matching Prefix Rule Starting from the next input symbol, find the longest string that matches a token Break ties by giving preference to token listed first in the list 26

Adam Doupé, Principles of Programming Languages StringMatchingPotentialLongest Match 1.1abc1.2 All 1.1abc1.2 NUMDECIMAL, NUM 1.1abc1.2 DECIMAL NUM, 1 1.1abc1.2 DECIMAL NUM, 1 1.1abc1.2 DECIMAL, 3 1.1abc1.2 All abc1.2 ID abc1.2 ID abc1.2 ID abc1.2 ID abc1.2 ID, 4 abc1.2 All.2 DOT.2 DOT, 1.2 All 2 NUM 2 NUM, 1 27 ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^

Adam Doupé, Principles of Programming Languages StringMatchingPotentialLongest Match 1.1abc1.2 All 1.1abc1.2 NUMDECIMAL, NUM 1.1abc1.2 DECIMAL NUM, 1 1.1abc1.2 DECIMAL NUM, 1 1.1abc1.2 DECIMAL, 3 1.1abc1.2 All abc1.2 ID abc1.2 ID abc1.2 ID abc1.2 ID abc1.2 ID, 4 abc1.2 All.2 DOT.2 DOT, 1.2 All 2 NUM 2 NUM, 1 28 ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^

Adam Doupé, Principles of Programming Languages Mariner 1 29

Adam Doupé, Principles of Programming Languages Lexical Analysis In some programming languages, whitespace is not significant at all –In most programming language, whitespace is not always significant ( ) vs. (5+10) In Fortran, whitespace is ignored DO 15 I = 1,100 DO 15 I = –Variable assignment instead of a loop! 30