CSc 453 Lexical Analysis (Scanning)

Slides:



Advertisements
Similar presentations
COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
Advertisements

From Cooper & Torczon1 The Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source language?
Regular Expressions Finite State Automaton. Programming Languages2 Regular expressions  Terminology on Formal languages: –alphabet : a finite set of.
CSc 453 Lexical Analysis (Scanning)
Lexical Analysis (4.2) Programming Languages Hiram College Ellen Walker.
 Lex helps to specify lexical analyzers by specifying regular expression  i/p notation for lex tool is lex language and the tool itself is refered to.
Winter 2007SEG2101 Chapter 81 Chapter 8 Lexical Analysis.
176 Formal Languages and Applications: We know that Pascal programming language is defined in terms of a CFG. All the other programming languages are context-free.
1 The scanning process Main goal: recognize words/tokens Snapshot: At any point in time, the scanner has read some input and is on the way to identifying.
Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.
1 Scanning Aaron Bloomfield CS 415 Fall Parsing & Scanning In real compilers the recognizer is split into two phases –Scanner: translate input.
A brief [f]lex tutorial Saumya Debray The University of Arizona Tucson, AZ
Topic #3: Lexical Analysis
CPSC 388 – Compiler Design and Construction Scanners – Finite State Automata.
1 Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
Lexical Analysis - An Introduction. The Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source.
Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
어휘분석 (Lexical Analysis). Overview Main task: to read input characters and group them into “ tokens. ” Secondary tasks: –Skip comments and whitespace;
Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
Lecture # 3 Chapter #3: Lexical Analysis. Role of Lexical Analyzer It is the first phase of compiler Its main task is to read the input characters and.
Lexical Analysis Hira Waseem Lecture
Lexical Analysis I Specifying Tokens Lecture 2 CS 4318/5531 Spring 2010 Apan Qasem Texas State University *some slides adopted from Cooper and Torczon.
Lexical Analyzer (Checker)
SCRIBE SUBMISSION GROUP 8 Date: 7/8/2013 By – IKHAR SUSHRUT MEGHSHYAM 11CS10017 Lexical Analyser Constructing Tokens State-Transition Diagram S-T Diagrams.
COP 4620 / 5625 Programming Language Translation / Compiler Writing Fall 2003 Lecture 3, 09/11/2003 Prof. Roy Levow.
Unit-1 Introduction Prepared by: Prof. Harish I Rathod
TRANSITION DIAGRAM BASED LEXICAL ANALYZER and FINITE AUTOMATA Class date : 12 August, 2013 Prepared by : Karimgailiu R Panmei Roll no. : 11CS10020 GROUP.
1 November 1, November 1, 2015November 1, 2015November 1, 2015 Azusa, CA Sheldon X. Liang Ph. D. Computer Science at Azusa Pacific University Azusa.
1 Languages and Compilers (SProg og Oversættere) Lexical analysis.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
Review: Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Intermediate code generator Code optimizer Code generator Symbol.
By Neng-Fa Zhou Lexical Analysis 4 Why separate lexical and syntax analyses? –simpler design –efficiency –portability.
IN LINE FUNCTION AND MACRO Macro is processed at precompilation time. An Inline function is processed at compilation time. Example : let us consider this.
CSc 453 Lexical Analysis (Scanning)
Compiler Construction By: Muhammad Nadeem Edited By: M. Bilal Qureshi.
The Role of Lexical Analyzer
Lexical Analysis (Scanning) Lexical Analysis (Scanning)
CSC3315 (Spring 2009)1 CSC 3315 Lexical and Syntax Analysis Hamid Harroud School of Science and Engineering, Akhawayn University
1st Phase Lexical Analysis
using Deterministic Finite Automata & Nondeterministic Finite Automata
Lecture 6 Lexical Analysis Xiaoyin Wang CS5363 Programming Languages and Compilers.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
CC410: System Programming Dr. Manal Helal – Fall 2014 – Lecture 12–Compilers.
Department of Software & Media Technology
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
CS 3304 Comparative Languages
Lecture 2 Lexical Analysis
Chapter 3 Lexical Analysis.
Chapter 2 :: Programming Language Syntax
Chapter 2 Scanning – Part 1 June 10, 2018 Prof. Abdelaziz Khamis.
Lexical Analysis (Sections )
Finite-State Machines (FSMs)
CSc 453 Lexical Analysis (Scanning)
Finite-State Machines (FSMs)
Deterministic Finite Automata
Compiler Construction
Lexical analysis Jakub Yaghob
Deterministic Finite Automata
Chapter 3: Lexical Analysis
Lexical Analysis - An Introduction
Review: Compiler Phases:
CS 3304 Comparative Languages
CS 3304 Comparative Languages
Compiler Construction
Chapter 2 :: Programming Language Syntax
Lexical Analysis - An Introduction
Lexical Analysis.
Chapter 2 :: Programming Language Syntax
CMPE 152: Compiler Design March 19 Class Meeting
CSc 453 Lexical Analysis (Scanning)
Presentation transcript:

CSc 453 Lexical Analysis (Scanning) Saumya Debray The University of Arizona Tucson

Overview CSc 453: Lexical Analysis

Overview tokens lexical analyzer (scanner) syntax analyzer (parser) source program symbol table manager Main task: to read input characters and group them into “tokens.” Secondary tasks: Skip comments and whitespace; Correlate error messages with source program (e.g., line number of error). CSc 453: Lexical Analysis

Overview (cont’d) . . . Token Input file sequence lexical analyzer keywd_int identifier: “main” left_paren keywod_int identifier: “argc” comma keywd_char star identifier: “argv” right_paren left_brace … / * p g m . c \n i n t a ( r , h v ) { \t x Y ; f l o w lexical analyzer . . . CSc 453: Lexical Analysis

Implementing Lexical Analyzers Different approaches: Using a scanner generator, e.g., lex or flex. This automatically generates a lexical analyzer from a high-level description of the tokens. (easiest to implement; least efficient) Programming it in a language such as C, using the I/O facilities of the language. (intermediate in ease, efficiency) Writing it in assembly language and explicitly managing the input. (hardest to implement, but most efficient) CSc 453: Lexical Analysis

Lexical Analysis: Terminology token: a name for a set of input strings with related structure. Example: “identifier,” “integer constant” pattern: a rule describing the set of strings associated with a token. Example: “a letter followed by zero or more letters, digits, or underscores.” lexeme: the actual input string that matches a pattern. Example: count CSc 453: Lexical Analysis

Examples Input: count = 123 Tokens: identifier : Rule: “letter followed by …” Lexeme: count assg_op : Rule: = Lexeme: = integer_const : Rule: “digit followed by …” Lexeme: 123 CSc 453: Lexical Analysis

Attributes for Tokens If a token can have multiple lexemes, the scanner must indicate the actual lexeme that matched. This information is given using an attribute associated with the token. Example: The program statement count = 123 yields the following token-attribute pairs: identifier, pointer to the string “count” assg_op, - integer_const, the integer value 123 Project Note: In flex-generated scanners, attributes are communicated to the parser using the variable yylval CSc 453: Lexical Analysis

Lexical analysis in compilers Language specification Language implementation Specifies the tokens in the language: pattern(s) defining each token specified using regular expressions any attributes associated with a token regular expressions are matched using finite state machines (FSMs) Compiler writer: maps regular expressions in language spec to FSMs Compiler: uses FSMs to recognize tokens in the input character stream CSc 453: Lexical Analysis

Regular expressions Regular expressions are a pattern notation for describing (certain kinds of) sets of strings over a finite alphabet Pattern matches Strings  the empty string  a (an input symbol) the symbol a r | s x : x matches r or x matches s rs x1x2 : x1 matches r and x2 matches s (r)* x1x2…xn (n  0) : each xi matches r CSc 453: Lexical Analysis

Finite state machines An abstract computational device that pattern- matches strings against regular expressions. A finite state machine consists of: a finite set of states one initial state  0 final states a state transition function T T(q, a) gives the next state from a state q on input symbol a CSc 453: Lexical Analysis

Finite state machines: an example A finite state machine to match C-style comments: initial state final state CSc 453: Lexical Analysis

Pattern matching using FSMs A FSM M accepts a string x iff, starting from M’s initial state, the state transitions of M on the symbols of x cause M to end in a final state. A string x matches a regular expression r iff a finite state machine1 for r accepts x. 1There may be more than one such machine. CSc 453: Lexical Analysis

Example: decimal integer constants Regular expression: (0|1|2|3|4|5|6|7|8|9) (0|1|2|3|4|5|6|7|8|9)* Or, using extended notation: [0-9]+ Finite state machine: CSc 453: Lexical Analysis

Finite Automata and Lexical Analysis The tokens of a language are specified using regular expressions. A scanner is a big FSM, essentially the “aggregate” of the FSMs for the individual tokens. Issues: What does the scanner automaton look like? How much should we match? (When do we stop?) What do we do when a match is found? Buffer management (for efficiency reasons). CSc 453: Lexical Analysis

Structure of a Scanner Automaton CSc 453: Lexical Analysis

How much should we match? In general, find the longest match possible. E.g., on input 123.45, match this as num_const(123.45) rather than num_const(123), “.”, num_const(45). CSc 453: Lexical Analysis

How much should we match? Need to be careful when specifying tokens under “longest match possible” policy: Pattern Input program COMMENT ″/*″(.|\n)*″*/″ (.|\n)* float area() { /* area of a circle */ float r, a; scanf(“%”, &r); /* radius */ a = 3.1416*r*r; return a; } CSc 453: Lexical Analysis

Implementing finite state machines Table-driven FSMs (e.g., lex, flex): Use a table to encode transitions: next_state = T(curr_state, next_char); Use one bit in state no. to indicate whether it’s a final (or error) state. If so, consult a separate table for what action to take. T next input character Current state CSc 453: Lexical Analysis

Table-driven FSMs: Example int scanner() { char ch; int currState = 1; while (TRUE) { ch = NextChar( ); if (ch == EOF) return 0; /* fail */ currState = T [currState, ch]; if (IsFinal(currState)) { return 1; /* success */ } } /* while */ T input a b state 1 2 3 3(final) CSc 453: Lexical Analysis

What do we do on finding a match? A match is found when: The current automaton state is a final state; and No transition is enabled on the next input character. Actions on finding a match: if appropriate, copy lexeme (or other token attribute) to where the parser can access it; save any necessary scanner state so that scanning can subsequently resume at the right place; return a value indicating the token found. CSc 453: Lexical Analysis