Lexical Analysis (I) Compiler Baojian Hua



Compiler: source program -> compiler -> target program

Front and Back Ends: source program -> front end -> IR -> back end -> target program

Front End: source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> semantic analyzer -> IR

Lexical Analyzer
The lexical analyzer translates the source program into a stream of lexical tokens.
Source program: a stream of characters; the character set varies from language to language (ASCII, Unicode, or …).
Lexical token: a compiler-internal data structure that represents the occurrence of a terminal symbol; its representation varies from compiler to compiler.

Conceptually: character sequence -> lexical analyzer -> token sequence

Example
Source: if (x > 5) y = "hello"; else z = 1;
After lexical analysis: IF LPAREN IDENT(x) GT INT(5) RPAREN IDENT(y) ASSIGN STRING("hello") SEMICOLON ELSE IDENT(z) ASSIGN INT(1) SEMICOLON EOF
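As a rough illustration of a token as a "compiler internal data structure", the token stream in the example could be held in memory as below. This is only a sketch: the Token class, its field names, and the show helper are assumptions made for illustration, not the representation used in the course's Tiger compiler.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Token:
    """Hypothetical token record: a kind plus an optional attached value."""
    kind: str
    value: Optional[Union[str, int]] = None

    def __str__(self) -> str:
        # Tokens without a value print as their kind alone, e.g. IF.
        return self.kind if self.value is None else f"{self.kind}({self.value})"


# The token stream of the example, written out by hand (no lexing happens here).
EXAMPLE_TOKENS = [
    Token("IF"), Token("LPAREN"), Token("IDENT", "x"), Token("GT"),
    Token("INT", 5), Token("RPAREN"), Token("IDENT", "y"), Token("ASSIGN"),
    Token("STRING", "hello"), Token("SEMICOLON"), Token("ELSE"),
    Token("IDENT", "z"), Token("ASSIGN"), Token("INT", 1),
    Token("SEMICOLON"), Token("EOF"),
]


def show(tokens) -> str:
    """Render a token list as one space-separated line."""
    return " ".join(str(t) for t in tokens)
```

Keeping the kind separate from the value lets a parser dispatch on the kind alone and consult the value only for identifiers and literals.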

Lexer Implementation
Options:
Write a lexer by hand, from scratch. This is boring, error-prone, and a lot of work (see dragon-book section 3.4 for the algorithm). Nevertheless, many compilers use this approach: gcc, llvm, and the Tiger compiler you'll build, …
Use an automatic lexer generator. This is quick and easy, and good for fast prototyping.
We start with the first approach and come to the second one later.

Token, Pattern, and Lexeme
Token: think "kind". E.g., in C, we have identifiers, integers, floats, …
Pattern: think "form". E.g., in C, an identifier starts with a letter, followed by zero or more letters or digits.
Lexeme: think "one instance". E.g., "tiger" is a valid C identifier.

So, to hack a lexer, one must:
Specify all possible tokens (dozens in a practical language).
Specify the pattern for each kind of token (a little hard).
Write code to recognize them (standard algorithms).
For the second task, one needs a little math: regular expressions.

Basic Definitions
Alphabet: the character set, e.g., ASCII or Unicode, or …
String: a finite sequence of characters from the alphabet, e.g., hello.
Language: a set of strings, finite or infinite, e.g., the C language.

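Regular Expression (RE)
Construction by induction:
each c ∈ alphabet is an RE denoting {c}
ε is an RE denoting {ε}, the language containing only the empty string
if M and N are REs, then so is M|N, e.g., (a|b) = {a, b}
if M and N are REs, then so is MN, e.g., (a|b)(c|d) = {ac, ad, bc, bd}
if M is an RE, then so is M* (Kleene closure), e.g., (a|b)* = {ε, a, aa, b, ab, abb, baa, …}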
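The inductive construction can be mirrored directly in code. The sketch below is an illustration only: the class names (Eps, Char, Alt, Cat, Star) and the language function are hypothetical, and because M* may denote an infinite set, the function enumerates only strings up to a chosen length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Eps:          # the empty-string RE
    pass


@dataclass(frozen=True)
class Char:         # a single character c
    c: str


@dataclass(frozen=True)
class Alt:          # M|N
    left: object
    right: object


@dataclass(frozen=True)
class Cat:          # MN
    left: object
    right: object


@dataclass(frozen=True)
class Star:         # M*
    body: object


def language(e, max_len):
    """Return the strings of length <= max_len in the language denoted by e."""
    if isinstance(e, Eps):
        return {""}
    if isinstance(e, Char):
        return {e.c} if max_len >= 1 else set()
    if isinstance(e, Alt):
        return language(e.left, max_len) | language(e.right, max_len)
    if isinstance(e, Cat):
        return {x + y
                for x in language(e.left, max_len)
                for y in language(e.right, max_len)
                if len(x + y) <= max_len}
    if isinstance(e, Star):
        body = language(e.body, max_len)
        result = {""}            # zero repetitions
        frontier = {""}
        while frontier:          # add one more repetition until nothing new fits
            frontier = {x + y for x in frontier for y in body
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {e!r}")
```

For instance, with a_or_b = Alt(Char("a"), Char("b")), calling language(Cat(a_or_b, Alt(Char("c"), Char("d"))), 2) yields the four strings ac, ad, bc, bd listed on the slide.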

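Or more formally: e -> ε | c | e "|" e | e e | e*
Here the quoted "|" is the RE alternation operator; the unquoted | separates the alternatives of the grammar rule.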

Example
C's identifier: starts with a letter ("_" counts as a letter), followed by zero or more letters or digits.
Written in a stepwise way:
(…)
(_|a|b|…|z|A|B|…|Z) (…)
(_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)
(_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)*
It's tedious and error-prone …

Syntax Sugar
We introduce some abbreviations:
[a-z] == a|b|…|z
e+ == one or more of e
e? == zero or one of e
"a*" == the string a* itself, not a's Kleene closure
e{i, j} == at least i and at most j of e
. == any character except '\n'
All of these can be represented in the core RE, i.e., they are derived forms.
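To see that the abbreviations really are derived forms, one can compare each sugared pattern against a core-RE spelling on all short strings. The sketch below uses Python's re module as a stand-in for the lex-style notation on the slide, which is an assumption: the two syntaxes differ in details (for example, Python writes the bounded repetition as a{1,2} without a space and quotes literals with re.escape rather than double quotes). The helper same_language is hypothetical.

```python
import re
from itertools import product


def same_language(sugar, core, alphabet, max_len):
    """Check that two patterns accept the same strings up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(sugar, s)) != bool(re.fullmatch(core, s)):
                return False
    return True


# (abbreviation, equivalent spelling using only |, concatenation, * and epsilon)
EQUIVALENCES = [
    ("[a-c]", "a|b|c"),
    ("a+", "aa*"),
    ("a?", "(?:a|)"),        # the empty alternative plays the role of epsilon
    ("a{1,2}", "a|aa"),
]

ALL_EQUIVALENT = all(same_language(sugar, core, "abc", 3)
                     for sugar, core in EQUIVALENCES)

# A quoted pattern matches its characters literally, not as operators.
LITERAL_A_STAR = re.escape("a*")
```

The check is exhaustive only up to the chosen length, so it is evidence for the equivalences rather than a proof of them.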

Example Revisited
C's identifier: starts with a letter ("_" counts as a letter), followed by zero or more letters or digits.
(…)
(_|a|b|…|z|A|B|…|Z) (…)
(_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)
[_a-zA-Z][_a-zA-Z0-9]*
What about the keyword "if"?

Ambiguous Rule
A single RE is not ambiguous. But a language may have many REs:
[_a-zA-Z][_a-zA-Z0-9]*
(i)(f)
So, for a given string "if", which RE should match?

Ambiguous Rule
Two conventions:
Longest match: the regular expression that matches the longest string takes precedence.
Rule priority: associate a priority with each RE. If two regular expressions match the same (longest) string, the RE with the higher priority takes precedence.
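The two conventions can be combined in a small rule-driven scanner. This is a minimal sketch under stated assumptions: the rule table, the SKIP pseudo-token for whitespace, and the tokenize function are made up for illustration, priority is taken to be the order of the rules in the list, and Python's re module does the per-rule matching.

```python
import re

# Rules in priority order: an earlier rule wins when match lengths tie.
RULES = [
    ("IF", r"if"),
    ("ELSE", r"else"),
    ("IDENT", r"[_a-zA-Z][_a-zA-Z0-9]*"),
    ("INT", r"[0-9]+"),
    ("STRING", r'"[^"\n]*"'),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("GT", r">"),
    ("ASSIGN", r"="),
    ("SEMICOLON", r";"),
    ("SKIP", r"[ \t\n]+"),   # whitespace, dropped from the output
]
COMPILED = [(kind, re.compile(pattern)) for kind, pattern in RULES]


def tokenize(text):
    """Split text into (kind, lexeme) pairs by longest match, then priority."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, pattern in COMPILED:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier (higher-priority) rule on a tie.
            if m and m.end() > best_end:
                best_kind, best_end = kind, m.end()
        if best_kind is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if best_kind != "SKIP":
            tokens.append((best_kind, text[pos:best_end]))
        pos = best_end
    tokens.append(("EOF", ""))
    return tokens
```

On the input "if" both the IF and IDENT rules match two characters, so rule priority picks IF; on "iffy" the IDENT rule matches four characters against two for IF, so longest match picks IDENT.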

Transition Diagram algorithm
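The diagram itself is not reproduced in this transcript. As a rough sketch of the general idea, a transition diagram can be hand-coded with one branch per start-state edge and one loop per self-loop. Everything below is an assumption for illustration: the function name next_token, the token kinds (including GE for ">=", which does not appear in the slides), and the small set of tokens covered.

```python
import string

LETTERS = string.ascii_letters + "_"   # "_" counts as a letter
DIGITS = string.digits


def next_token(text, pos):
    """Scan one token starting at pos; return (kind, lexeme, new_pos)."""
    n = len(text)
    # Start state: loop on whitespace.
    while pos < n and text[pos] in " \t\n":
        pos += 1
    if pos == n:
        return ("EOF", "", pos)
    start = pos
    c = text[pos]
    if c in LETTERS:
        # Identifier state: loop on letters and digits.
        pos += 1
        while pos < n and (text[pos] in LETTERS or text[pos] in DIGITS):
            pos += 1
        return ("IDENT", text[start:pos], pos)
    if c in DIGITS:
        # Integer state: loop on digits.
        pos += 1
        while pos < n and text[pos] in DIGITS:
            pos += 1
        return ("INT", text[start:pos], pos)
    if c == ">":
        # After '>', one character of lookahead separates ">=" from ">".
        pos += 1
        if pos < n and text[pos] == "=":
            return ("GE", ">=", pos + 1)
        return ("GT", ">", pos)
    raise SyntaxError(f"unexpected character {c!r} at offset {pos}")
```

Calling next_token repeatedly, each time passing the returned position back in, walks through the input one token at a time until EOF is returned.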

Lab 1 of Tiger
In lab 1 of the Tiger compiler, you're required to write a lexer by hand, using the transition-diagram algorithm described above. Start early.