Lexical Analysis: Regular Expressions CS 671 January 22, 2008

Last Time …
A compiler is a program that translates a program in one language to another language – the essential interface between applications & architectures.
It typically lowers the level of abstraction, and it analyzes and reasons about the program & the architecture.
We expect the compiled program to be optimized, i.e., better than the original – ideally exploiting architectural strengths and hiding weaknesses.
[Diagram: High-Level Programming Languages → Compiler → Machine Code, Error Messages]

Phases of a Compiler
[Diagram: Source program → Lexical analyzer → Syntax analyzer → Semantic analyzer → Intermediate code generator → Code optimizer → Code generator → Target program]

Lexical Analyzer
Groups sequences of characters into lexemes – the smallest meaningful entities in a language (keywords, identifiers, constants)
Characters read from a file are buffered – this helps decrease latency due to I/O; the lexical analyzer manages the buffer
Makes use of the theory of regular languages and finite state machines
Lex and Flex are tools that construct lexical analyzers from regular expression specifications

Phases of a Compiler
[Diagram: compilation phases, as above]

Parser
Converts a linear structure – a sequence of tokens – into a hierarchical tree-like structure – an AST
The parser imposes the syntax rules of the language
Work should be linear in the size of the input (else unusable) → type consistency cannot be checked in this phase
Deterministic context-free languages and pushdown automata form the basis
Bison and yacc allow a user to construct parsers from CFG specifications
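To make the AST idea concrete, here is a minimal sketch of a tree-node type in C; the node kinds and field names are illustrative assumptions, not part of any particular compiler.

    /* Minimal AST node sketch – the kinds and fields are illustrative assumptions. */
    enum NodeKind { NODE_NUM, NODE_ID, NODE_BINOP, NODE_ASSIGN };

    struct ASTNode {
        enum NodeKind   kind;    /* which construct this node represents         */
        int             value;   /* literal value, when kind == NODE_NUM         */
        const char     *name;    /* identifier name, when kind == NODE_ID        */
        char            op;      /* operator character, when kind == NODE_BINOP  */
        struct ASTNode *left;    /* child subtrees (may be NULL)                 */
        struct ASTNode *right;
    };

For a + b * c, a parser would build a NODE_BINOP('+') node whose right child is a NODE_BINOP('*') node – the tree, not the token order, records the grouping.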

Phases of a Compiler
[Diagram: compilation phases, as above]

Semantic Analysis
Calculates the program's "meaning"
Rules of the language are checked (variable declaration, type checking)
Type checking is also needed for code generation (the code generated for a + b depends on the types of a and b)
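As a hedged illustration of why code generation needs type information, the sketch below picks an integer add or a floating-point add based on a type tag; the instruction strings and the emit() helper are assumptions for illustration, not a real back end.

    #include <stdio.h>

    enum Type { TYPE_INT, TYPE_FLOAT };

    static void emit(const char *insn) { printf("%s\n", insn); }

    /* Choose the add instruction for a + b from the operand types. */
    void gen_add(enum Type ta, enum Type tb) {
        if (ta == TYPE_INT && tb == TYPE_INT)
            emit("ADD  r1, r2, r3");   /* integer addition */
        else
            emit("FADD f1, f2, f3");   /* floating-point addition (after converting any int operand) */
    }

    int main(void) {
        gen_add(TYPE_INT, TYPE_INT);     /* prints: ADD  r1, r2, r3 */
        gen_add(TYPE_FLOAT, TYPE_INT);   /* prints: FADD f1, f2, f3 */
        return 0;
    }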

Phases of a Compiler
[Diagram: compilation phases, as above]

Intermediate Code Generation
Makes it easy to port the compiler to other architectures (e.g., Pentium to MIPS)
Can also be the basis for interpreters (such as in Java)
Enables optimizations that are not machine-specific
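For example, a statement such as a = b + c * d; might be lowered to a sequence of simple three-address instructions (the temporary names t1 and t2 are illustrative):

    t1 = c * d
    t2 = b + t1
    a  = t2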

Phases of a Compiler
[Diagram: compilation phases, as above]

Intermediate Code Optimization
Constant propagation, dead code elimination, common sub-expression elimination, strength reduction, etc.
Based on dataflow analysis – properties that are independent of execution paths
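A small before/after sketch of constant propagation followed by dead code elimination; the fragment is illustrative (expensive() stands for any call), not taken from the course:

    /* before */
    int x = 4;
    int y = x * 2;     /* constant propagation: y is known to be 8    */
    if (y > 100)       /* the condition is provably false ...         */
        expensive();   /* ... so this call is dead and can be removed */
    return y;

    /* after */
    return 8;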

Phases of a Compiler
[Diagram: compilation phases, as above]

Native Code Generation
Intermediate code is translated into native code
Register allocation, instruction selection

Native Code Optimization
Peephole optimizations – a small window of instructions is optimized at a time
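A classic peephole example: a store immediately followed by a load of the same location can drop the load (the mnemonics are generic, not a particular instruction set):

    store r1, [a]
    load  r1, [a]     ; redundant – r1 already holds the value of a
        becomes
    store r1, [a]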

Administration
1. Compiling to assembly
   – HW1 on website: Fun with Lex/Yacc
2. Questionnaire results …

Useful Tools!
tar – archiving program
gzip/bzip2 – compression
svn – version control
make / SCons – build/run utility
Other useful tools:
– man
– which
– locate
– diff (or sdiff)

Makefiles
General form of a rule:
    target: dependent source file(s)
            command        (the command line must start with a tab)
Example dependence structure: proj1 is built from main.o, io.o, and data.o, which in turn depend on main.c, io.h, io.c, data.h, and data.c.

First Step: Lexical Analysis (Tokenizing)
Breaking the program down into words or "tokens"
Input: stream of characters
Output: stream of names, keywords, punctuation marks
Side effect: discards white space, comments

Source code:   if (b==0) a = "Hi";
[Diagram: source code → Lexical Analysis → token stream → Parsing]
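As a hedged illustration (the slide leaves the token stream blank, and names such as EQEQ, ASSIGN, and SEMI are assumptions), the source line above might tokenize to:

    IF  LPAREN  ID(b)  EQEQ  NUM(0)  RPAREN  ID(a)  ASSIGN  STRING(Hi)  SEMI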

Lexical Tokens
Identifiers:     x  y11  elsex  _i00
Keywords:        if  else  while  break
Integers:        L
Floating point:  e5  0.e-10
Symbols:         +  *  {  }  ++  =
Strings:         "x"  "He said, \"Are you?\""
Comments:        /** ignore me **/

Lexical Tokens
    float match0(char *s) /* find a zero */
    {
      if (!strncmp(s, "0.0", 3))
        return 0.;
    }

    FLOAT ID(match0) _______ CHAR STAR ID(s) RPAREN LBRACE
    IF LPAREN BANG _______ LPAREN ID(s) COMMA STRING(0.0) ______ NUM(3) RPAREN RPAREN
    RETURN REAL(0.0) ______ RBRACE EOF

Ad-hoc Lexer
Hand-write code to generate tokens
How to read identifier tokens?

    Token readIdentifier() {
      String id = "";
      while (true) {
        char c = input.read();                     // read the next input character
        if (!identifierChar(c))                    // stop at the first non-identifier character
          return new Token(ID, id, lineNumber);    // (a real lexer would also push c back onto the input)
        id = id + c;                               // append the character to the lexeme
      }
    }

Problems
We don't know what kind of token we are going to read from seeing the first character:
– if a token begins with "i", is it an identifier?
– if a token begins with "2", is it an integer? a constant?
– interleaved tokenizer code is hard to write correctly, and harder to maintain
More principled approach: a lexer generator that generates an efficient tokenizer automatically (e.g., lex, flex)

Issues
How to describe tokens unambiguously:
    2.e0  20.e    ""  "x"  "\\"  "\"\'"
How to break text down into tokens:
    if (x == 0) a = x<<1;
    if (x == 0) a = x<1;
How to tokenize efficiently:
– tokens may have similar prefixes
– want to look at each character ~1 time

How To Describe Tokens
Programming language tokens can be described using regular expressions
A regular expression R describes some set of strings L(R)
L(R) is the language defined by R
– L(abc) = { abc }
– L(hello|goodbye) = { hello, goodbye }
– L([1-9][0-9]*) = the set of positive integers written without leading zeros
Idea: define each kind of token using an RE
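As a small, hedged sketch of the membership question "is string s in L(R)?", POSIX's regex library can answer it directly (this is just a membership test for a single RE, not how generated lexers work internally):

    #include <regex.h>
    #include <stdio.h>

    /* Return 1 if the whole string s is in L(pattern), 0 otherwise. */
    int in_language(const char *pattern, const char *s) {
        regex_t re;
        char anchored[256];
        /* Anchor the pattern so it must match the entire string. */
        snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
        if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
            return 0;                          /* treat a bad pattern as "no match" */
        int ok = (regexec(&re, s, 0, NULL, 0) == 0);
        regfree(&re);
        return ok;
    }

    int main(void) {
        printf("%d\n", in_language("[1-9][0-9]*", "412"));    /* 1: in the language       */
        printf("%d\n", in_language("[1-9][0-9]*", "0412"));   /* 0: leading zero rejected */
        return 0;
    }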

Regular expressions
Language – a set of strings
String – a finite sequence of symbols
Symbols – taken from a finite alphabet
Specify languages using regular expressions:
    Symbol          a        one instance of a
    Epsilon         ε        the empty string
    Alternation     R | S    a string from either L(R) or L(S)
    Concatenation   R ∙ S    a string from L(R) followed by a string from L(S)
    Repetition      R*       zero or more strings from L(R), concatenated

Convenient Shorthand
    [abcd]       one of the listed characters (a | b | c | d)
    [b-g]        [bcdefg]
    [b-gM-Qkr]   [bcdefgMNOPQkr]
    [^ab]        anything but one of the listed chars
    [^a-f]       anything but a, b, c, d, e, f
    M?           zero or one M
    M+           one or more M
    M*           zero or more M
    "a.+*"       literally a.+*
    .            any single character (except \n)

Examples
    Regular Expression               Strings in L(R)
    digit  = [0-9]                   "0" "1" "2" "3" …
    posint = digit+                  "8" "412" …
    int    = -? posint               "-42" "1024" …
    real   = int (ε | (. posint))    "-1.56" "12" "1.0"
    [a-zA-Z_][a-zA-Z0-9_]*           C identifiers
Lexer generators support abbreviations – but they cannot be recursive

More Examples
Whitespace:
Integers:
Hex numbers:
Valid UVa User Ids:
Loop keywords in C:
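The slide leaves these as exercises; hedged, plausible answers for most of them might be as follows (the UVa user-id format is institution-specific and is not guessed here):

    Whitespace:           [ \t\n\r]+
    Integers:             -?[0-9]+
    Hex numbers:          0[xX][0-9a-fA-F]+
    Loop keywords in C:   for | while | do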

Breaking up Text
    elsex=0;        vs.        else  x  =  0  ;
REs alone are not enough: we need rules for choosing among possible matches
Most languages: the longest matching token wins
– even if a shorter token is the only way to tokenize the rest of the input
Ties in length are resolved by prioritizing tokens
REs + priorities + longest-matching-token rule = lexer definition
(Under the longest-match rule, elsex is a single identifier, not the keyword else followed by the identifier x.)

Lexer Generator Specification
Input to the lexer generator:
– a list of regular expressions, in priority order
– an associated action for each RE (generates the appropriate kind of token, other bookkeeping)
Output:
– a program that reads an input stream and breaks it up into tokens according to the REs (or reports a lexical error – "Unexpected character")

Lex: A Lexical Analyzer Generator
Lex produces a C program from a lexical specification:

    DIGITS       [0-9]+
    ALPHA        [A-Za-z]
    CHARACTER    {ALPHA}|_
    IDENTIFIER   {ALPHA}({CHARACTER}|{DIGITS})*
    %%
    if            { return IF; }
    {IDENTIFIER}  { return ID; }
    {DIGITS}      { return NUM; }
    ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)   { return ____; }
    .             { error(); }
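A minimal, hedged sketch of how such a specification might be driven: yylex() is the scanner entry point that Lex generates, while the token codes (IF, ID, NUM) and the error() helper are assumptions that would normally come from the parser's header and your own code.

    #include <stdio.h>

    enum { IF = 256, ID, NUM };        /* assumed token codes (> 255 so they don't clash with characters) */

    int yylex(void);                   /* generated by Lex from the specification above */
    extern char *yytext;               /* text of the current lexeme (char * with flex; classic lex uses a char array) */

    void error(void) { fprintf(stderr, "Unexpected character\n"); }

    int main(void) {
        int tok;
        while ((tok = yylex()) != 0)   /* yylex() returns 0 at end of input */
            printf("token %d: %s\n", tok, yytext);
        return 0;
    }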

Lexer Generator
Reads in a list of regular expressions R1, …, Rn, one per token, with attached actions:

    -?[1-9][0-9]*   { return new Token(Tokens.IntConst,
                                       Integer.parseInt(yytext())); }

Generates scanning code that decides:
1. whether the input is lexically well-formed
2. the corresponding token sequence
Problem 1 is equivalent to deciding whether the input is in the language of the regular expression.
How can we efficiently test membership in L(R) for an arbitrary R?

Regular Expression Matching
Sketch of an efficient implementation:
– start in some initial state
– look at each input character in sequence, updating the scanner state accordingly
– if the state at the end of the input is an accepting state, the input string matches the RE
For tokenizing, only a finite amount of state is needed: a (deterministic) finite automaton (DFA), or finite state machine
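A minimal sketch of this simulation loop in C, hard-coding the small IF/ID automaton that appears on the following slides; the state numbering and the transition function are my own encoding, not generated code.

    #include <stdio.h>
    #include <ctype.h>

    /* States: 0 = start, 1 = seen "i", 2 = seen "if" (accept IF), 3 = other identifier (accept ID). */
    #define REJECT -1

    static int step(int state, char c) {
        int id_char = islower((unsigned char)c) || isdigit((unsigned char)c);
        switch (state) {
        case 0: return (c == 'i') ? 1 : (islower((unsigned char)c) ? 3 : REJECT);
        case 1: return (c == 'f') ? 2 : (id_char ? 3 : REJECT);
        case 2:                        /* "if" followed by more letters/digits is an ordinary identifier */
        case 3: return id_char ? 3 : REJECT;
        default: return REJECT;
        }
    }

    /* Run the DFA over the whole string; return the final state if accepting, else REJECT. */
    static int run(const char *s) {
        int state = 0;
        for (; *s && state != REJECT; s++)
            state = step(state, *s);
        return (state == 2 || state == 3) ? state : REJECT;
    }

    int main(void) {
        printf("%d\n", run("if"));    /*  2: the keyword IF */
        printf("%d\n", run("if0"));   /*  3: an identifier  */
        printf("%d\n", run("x9"));    /*  3: an identifier  */
        printf("%d\n", run("9x"));    /* -1: rejected       */
        return 0;
    }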

High Level View
Regular expressions = specification
Finite automata = implementation
Every regex has an FSA that recognizes its language
[Diagram: at design time, a specification is fed to a Scanner Generator to produce a scanner; at compile time, source code is fed to the scanner to produce tokens]

Finite Automata
Takes an input string and determines whether it is a valid sentence of a language
– A finite automaton has a finite set of states
– Edges lead from one state to another
– Edges are labeled with a symbol
– One state is the start state
– One or more states are the final (accepting) states
[Diagrams: an automaton for IF with states 0 →(i)→ 1 →(f)→ 2, and an automaton for ID that reads a letter a-z and then loops on a-z and 0-9 (26 edges out of the start state)]

Language
Each string is accepted or rejected:
1. starting in the start state,
2. the automaton follows one edge for every character (the edge must match the character);
3. after n transitions for an n-character string, accept if the automaton is in a final state.
Language: the set of strings that the FSA accepts
[Diagram: combined automaton – state 0 goes to 1 on i and to 3 on [a-hj-z]; state 1 goes to 2 on f; state 2 accepts IF; state 3 loops on [a-z0-9] and accepts ID]