COMP313A Programming Languages Lexical Analysis. Lecture Outline Lexical Analysis The language of Lexical Analysis Regular Expressions.

COMP313A Programming Languages Lexical Analysis

Lecture Outline
–Lexical Analysis
–The language of Lexical Analysis
–Regular Expressions

Lexical Analysis
Why split it from parsing?
–Simplifies design
  Parsers that must also handle whitespace and comments are more awkward
–Efficiency
  Use only as powerful a technique as the job needs, and nothing more
  No parsing sledgehammers for lexical nuts
–Portability
  More modular code
  More code re-use

Source Code Characteristics
Code
–Identifiers: count, max, get_num, printf (a library function name, not a keyword)
–Language keywords: switch, if .. then .. else, return, void
–Operators: +, *, >> … and <=, =, != …
–Literals: "Hello World"
Comments
Whitespace

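Language of Lexical Analysis
–Tokens: the categories the scanner reports to the parser (e.g. identifier, number, keyword)
–Patterns: the rules describing which character strings belong to each token
–Lexemes: the actual character strings in the source that match a pattern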

Tokens are not enough…
Clearly, if we replaced every occurrence of a variable with just a token, we would lose other valuable information (for example, which variable it was)
–Other data items are attributes of the tokens
–They are stored in the symbol table
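The idea of a token carrying an attribute can be sketched as follows. This is a minimal illustration, not the lecture's own design: the names `Token`, `SymbolTable` and `intern` are assumptions introduced here, and a real symbol table would hold far more than names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str            # token class, e.g. "ID"
    lexeme: str          # the characters actually matched
    attr: object = None  # attribute; here, an index into the symbol table


class SymbolTable:
    """Hypothetical table mapping each identifier name to a stable index."""

    def __init__(self):
        self._index = {}
        self.names = []

    def intern(self, name):
        # Add the name on first sight; always return the same index afterwards.
        if name not in self._index:
            self._index[name] = len(self.names)
            self.names.append(name)
        return self._index[name]


table = SymbolTable()
# Every identifier gets the same token kind; the attribute tells them apart.
tokens = [Token("ID", name, table.intern(name)) for name in ["count", "max", "count"]]
```

Both occurrences of `count` share one table entry, so the parser sees the same kind of token each time while later phases can still recover which variable was meant.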

Token delimiters
When does a token/lexeme end?
–e.g. xtemp=ytemp

Ambiguity in identifying tokens
A programming language definition will state how to resolve uncertain token assignment
–<> Is it 1 or 2 tokens?
Disambiguating rules state what to do
–Reserved keywords (e.g. if) take precedence over identifiers
–'Principle of longest substring': take the longest string of characters that forms a valid token
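The two disambiguating rules can be sketched in a small tokenizer. This is an illustrative sketch under stated assumptions: the token names, the keyword set and the operator list are invented for the example. Note also that Python's `re` alternation picks the first alternative that matches rather than the longest, so the longer operators are listed before their prefixes to get longest-match behaviour.

```python
import re

KEYWORDS = {"if", "then", "else", "return", "void"}  # assumed keyword set

TOKEN_SPEC = [
    ("NUM", r"[0-9]+"),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
    # Longer operators first, so "<>" is one token rather than "<" then ">".
    ("OP", r"<>|<=|>=|!=|<|>|="),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, lexeme) pairs, skipping whitespace."""
    pos = 0
    out = []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "WS":
            continue
        # Reserved words take precedence over identifiers, but only when the
        # whole lexeme is a keyword: "ifx" stays an identifier.
        if kind == "ID" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        out.append((kind, lexeme))
    return out
```

For the delimiter example, `tokenize("xtemp=ytemp")` yields an identifier, an operator and another identifier: the `=` ends the first lexeme because it cannot extend an identifier.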

Regular Expressions
To represent patterns of strings of characters
REs
–Alphabet – set of legal symbols
–Meta-characters – characters with special meanings
ε is the empty string
3 basic operations
–Choice – choice1|choice2
  a|b matches either a or b
–Concatenation – firstthing secondthing
  (a|b)c matches the strings { ac, bc }
–Repetition (Kleene closure) – repeatme*
  a* matches { ε, a, aa, aaa, aaaa, … }
Precedence: * is highest, | is lowest
–Thus a|bc* is a|(b(c*))
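The three basic operations and the precedence rule can be tried out directly. The sketch below uses Python's `re` module, whose syntax for `|`, concatenation and `*` happens to coincide with the notation above; `matches` is a helper name introduced here, and `re.fullmatch` is used so that the whole string must belong to the language.

```python
import re


def matches(pattern, s):
    """True if the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None


# Choice: a|b matches either a or b.
choice = [s for s in ["a", "b", "ab", ""] if matches(r"a|b", s)]

# Concatenation: (a|b)c matches exactly { ac, bc }.
concat = [s for s in ["ac", "bc", "c", "abc"] if matches(r"(a|b)c", s)]

# Repetition: a* matches the empty string, a, aa, ...
star = [s for s in ["", "a", "aaa", "ab"] if matches(r"a*", s)]

# Precedence: a|bc* is read as a|(b(c*)), so "ac" and "bcbc" are rejected.
prec = [s for s in ["a", "b", "bccc", "ac", "bcbc"] if matches(r"a|bc*", s)]
```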

Regular Expressions…
We can add in regular definitions
–digit = 0|1|2 …|9
And then use them:
–digit digit*
  A sequence of 1 or more digits
One or more repetitions: +
–(a|b)(a|b)* is written (a|b)+
Any character in the alphabet: .
–.*b.* matches strings containing at least one b
Ranges: [a-z], [a-zA-Z], [0-9] (assumes a character set ordering)
Not: ~a or [^a]
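Regular definitions map naturally onto named pattern fragments. The following sketch assumes Python's `re` syntax, which supports `+`, `.`, ranges and `[^a]` but not the `~a` form; the constant and helper names are invented for the example. `same_language` only spot-checks short strings, so it is evidence of equivalence rather than a proof, and `.` in Python does not match a newline by default.

```python
import re
from itertools import product

DIGIT = r"[0-9]"              # digit = 0|1|2 ...|9, written as a range
NUMBER = f"{DIGIT}{DIGIT}*"   # digit digit*: one or more digits
NUMBER_PLUS = f"{DIGIT}+"     # the same language, written with +
CONTAINS_B = r".*b.*"         # strings containing at least one b
NOT_A = r"[^a]"               # any single character except a


def matches(pattern, s):
    """True if the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None


def same_language(p, q, alphabet, max_len):
    """Compare two patterns on every string over alphabet up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if matches(p, s) != matches(q, s):
                return False
    return True
```

For instance, `same_language(r"(a|b)(a|b)*", r"(a|b)+", "abc", 4)` compares the two forms on every string of length at most 4 over {a, b, c}.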

Some exercises
Describe the languages denoted by the following regular expressions
1. 0 ( 0 | 1 )* 0
2. ( ( ε | 0 )* )*
3. 0* 1 0* 1 0* 1 0*
Write regular definitions for the following languages
1. All strings that contain the five vowels in order
2. All strings of letters in which the letters are in ascending lexicographic order
3. All strings of 0's and 1's that do not contain the substring 011

Some exercises
–Write a regular expression for C/C++ integers
–Write a regular expression for C/C++ identifiers
–Write a regular expression for C/C++ numbers

Limitations of REs
REs can describe many language constructs but not all
For example, with Alphabet = {a,b}, describe the set of strings consisting of a single a surrounded by an equal number of b's
–S = {a, bab, bbabb, bbbabbb, …}
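The limitation can be made concrete. The closest regular expression, `b*ab*`, cannot force the two runs of b's to have the same length, whereas a recognizer that counts can. The sketch below is illustrative only; the function name `in_S` is an assumption introduced here.

```python
import re


def in_S(s):
    """True if s is n b's, then a single a, then exactly n b's (n >= 0)."""
    n = 0
    while n < len(s) and s[n] == "b":
        n += 1                      # count the leading b's
    if n >= len(s) or s[n] != "a":
        return False                # the single a is missing
    return s[n + 1:] == "b" * n     # the trailing b's must match the count


# b*ab* accepts every string in S, but also unbalanced ones such as "bba".
too_generous = re.fullmatch(r"b*ab*", "bba") is not None
```

The counter `n` is exactly what a regular expression lacks: it has no way to remember an unbounded number.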

Lookahead, <
When we read a token delimiter to establish the end of a token, we must make sure the delimiter is still available afterwards
–It is the start of the next token!
This is lookahead
–Decide what to do based on a character we 'haven't read'
Sometimes implemented by reading from a buffer and then pushing the character back into the buffer
–Recognition of the next token then starts from the pushed-back character
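Reading one character too far and pushing it back can be sketched as follows. This is a minimal illustration assuming the whole input is held in a string; `CharStream`, `read`, `unread` and `scan_identifier` are hypothetical names, not part of any particular scanner generator.

```python
class CharStream:
    """Input buffer with one-character push-back (hypothetical helper)."""

    def __init__(self, text):
        self._text = text
        self._pos = 0

    def read(self):
        # Return the next character, or "" at end of input.
        if self._pos >= len(self._text):
            return ""
        ch = self._text[self._pos]
        self._pos += 1
        return ch

    def unread(self, ch):
        # Push the last character back; nothing to undo at end of input.
        if ch:
            self._pos -= 1


def scan_identifier(stream):
    """Collect identifier characters; leave the delimiter in the buffer."""
    lexeme = []
    while True:
        ch = stream.read()
        if ch.isalnum() or ch == "_":
            lexeme.append(ch)
        else:
            stream.unread(ch)  # the delimiter starts the next token
            return "".join(lexeme)


stream = CharStream("xtemp=ytemp")
first = scan_identifier(stream)
delimiter = stream.read()
second = scan_identifier(stream)
```

The `=` is consumed to discover that `xtemp` has ended, then pushed back, so the next read still finds it.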

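Classic Fortran example
Fortran ignores blanks, so DO 99 I=1,10 becomes DO99I=1,10
–DO99I=1,10 is a DO loop (label 99, loop variable I running from 1 to 10)
–DO99I=1.10 is an assignment of 1.10 to a variable named DO99I
When can the lexical analyzer assign a token?
–Not until it has seen the comma or the period
Push back into the input buffer
–or 'backtracking'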
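The Fortran case can be sketched by scanning the whole blank-free statement before committing to a token. This is a deliberately simplified illustration: the pattern `DO_LOOP` and the function `classify` are invented here, and the sketch ignores complications such as commas nested inside parentheses.

```python
import re

# DO <label> <variable> = <start> , <limit>, with blanks already removed.
DO_LOOP = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)")


def classify(statement):
    """Decide between a DO loop and an assignment (simplified sketch)."""
    squeezed = statement.replace(" ", "").upper()  # Fortran ignores blanks
    m = DO_LOOP.fullmatch(squeezed)
    if m:
        label, var, start, limit = m.groups()
        return ("DO_LOOP", label, var, start, limit)
    # No comma after the '=': the whole left-hand side is one identifier.
    name, _, value = squeezed.partition("=")
    return ("ASSIGN", name, value)
```

The decision between the keyword DO and the identifier DO99I depends on a character that appears several positions later, which is why the scanner needs more than one character of lookahead or must backtrack.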