Lexical Analyzer in Perspective


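[Figure: source program → lexical analyzer → token → parser. The parser sends "get next token" requests back to the lexical analyzer, and both consult the symbol table.]
Important issue: what are the responsibilities of each box? The focus here is on the lexical analyzer and the parser.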

Why separate lexical analysis and parsing?
- Simplicity of design
- Improved compiler efficiency
- Enhanced compiler portability

Tokens, Patterns, and Lexemes
- A token is a pair consisting of a token name and an optional attribute value.
- A pattern is a description of the form that the lexemes of a token may take.
- A lexeme is a sequence of characters in the source program that matches the pattern for a token.

Example
Token      Informal description                     Sample lexemes
if         characters i, f                          if
else       characters e, l, s, e                    else
relation   < or > or <= or >= or == or !=           <=, !=
id         letter followed by letters and digits    pi, score, D2
number     any numeric constant                     3.14159, 0, 6.02e23
literal    anything but ", surrounded by "s         "core dumped"
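The table above can be turned into a small classifier. This is only a sketch: the regular expressions are my own reading of the informal descriptions (for instance, the exact shape of a numeric constant is an assumption), and `TOKEN_PATTERNS` and `classify` are hypothetical names.

```python
import re

# One pattern per token, read off the informal descriptions in the table.
# Keywords come first so that "if" and "else" are not classified as id.
TOKEN_PATTERNS = [
    ("if",       r"if"),
    ("else",     r"else"),
    ("relation", r"<=|>=|==|!=|<|>"),
    ("number",   r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"),   # assumed numeric syntax
    ("id",       r"[A-Za-z][A-Za-z0-9]*"),
    ("literal",  r'"[^"]*"'),
]

def classify(lexeme):
    """Return the name of the first token whose pattern matches the whole lexeme."""
    for name, pattern in TOKEN_PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return name
    return None   # no pattern matches: not a lexeme of any token

for lexeme in ["if", "pi", "D2", "<=", "6.02e23", '"core dumped"']:
    print(lexeme, "->", classify(lexeme))
```

Each sample lexeme from the table should map back to its token name, which is the pattern/lexeme relationship the slide describes.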

Using a Buffer to Enhance Efficiency
[Figure: an input buffer in two halves, each filled by block I/O, containing the characters E, =, M, *, C, 2 and eof. lexeme_beginning marks the start of the current token; forward scans ahead to find a pattern match.]

if forward at end of first half then begin
    reload second half ;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half ;
    move forward to beginning of first half
end
else forward := forward + 1 ;

Algorithm: Buffered I/O with Sentinels
[Figure: the same two-half buffer, each half filled by block I/O and terminated by an eof sentinel. lexeme_beginning marks the start of the current token; forward scans ahead to find a pattern match. A second eof inside a half means there is no more input.]

forward := forward + 1 ;
if forward is at eof then begin
    if forward at end of first half then begin
        reload second half ;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half ;
        move forward to beginning of first half
    end
    else /* eof within buffer signifying end of input */
        terminate lexical analysis
end
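A minimal sketch of the sentinel scheme follows. The half size `N`, the choice of `"\0"` as the eof sentinel, and the names `SentinelBuffer`, `next_char` and `read_all` are all assumptions made for illustration; the lexeme_beginning pointer is left out to keep the example short.

```python
import io

EOF = "\0"   # sentinel; assumed never to occur in real source text
N = 4        # half-buffer size, kept tiny so reloads actually happen

class SentinelBuffer:
    def __init__(self, stream):
        self.stream = stream
        # Layout: [0..N-1] first half, [N] sentinel,
        #         [N+1..2N] second half, [2N+1] sentinel.
        self.buf = [EOF] * (2 * N + 2)
        self._reload(0)
        self.forward = -1          # advanced before every read

    def _reload(self, start):
        chunk = self.stream.read(N)            # one block read per half
        for i in range(N):
            # A short read leaves an EOF inside the half: end of input.
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Advance forward; return the character there, or None at end of input."""
        self.forward += 1
        if self.buf[self.forward] == EOF:      # the only test in the common case
            if self.forward == N:              # end of first half
                self._reload(N + 1)
                self.forward += 1
            elif self.forward == 2 * N + 1:    # end of second half
                self._reload(0)
                self.forward = 0
            else:                              # eof within buffer: input is done
                return None
            if self.buf[self.forward] == EOF:  # the reloaded half was empty
                return None
        return self.buf[self.forward]

def read_all(text):
    buf = SentinelBuffer(io.StringIO(text))
    out = []
    while (c := buf.next_char()) is not None:
        out.append(c)
    return "".join(out)
```

The point of the sentinel is visible in `next_char`: ordinary characters cost a single comparison, and the end-of-half checks run only when an eof is actually seen.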

Chomsky Hierarchy
Type   Name                Restriction on productions
0      Unrestricted        α → β
1      Context-sensitive   |LHS| ≤ |RHS|
2      Context-free        |LHS| = 1
3      Regular             |RHS| = 1 or 2: A → a | aB, or A → a | Ba

Formal Language Operations
Operation                   Written   Definition
union of L and M            L ∪ M     L ∪ M = {s | s is in L or s is in M}
concatenation of L and M    LM        LM = {st | s is in L and t is in M}
Kleene closure of L         L*        L* = L^0 ∪ L^1 ∪ L^2 ∪ ...
positive closure of L       L+        L+ = L^1 ∪ L^2 ∪ L^3 ∪ ...
L* denotes "zero or more concatenations of" L.
L+ denotes "one or more concatenations of" L.

Formal Language Operations: Examples
L = {A, B, C, D}    D = {1, 2, 3}
L ∪ D = {A, B, C, D, 1, 2, 3}
LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}
L^2 = {AA, AB, AC, AD, BA, BB, BC, BD, CA, ..., DD}
L^4 = L^2 L^2 = ??
L* = {all possible strings of L, plus ε}
L+ = L* − {ε}
L (L ∪ D) = ??
L (L ∪ D)* = ??
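These operations are easy to experiment with using Python sets of strings. The helper names (`concat`, `power`, `closure_up_to`) are hypothetical, and since L* is infinite the closure is only approximated up to a chosen power.

```python
def concat(L, M):
    """LM = {st | s is in L and t is in M}."""
    return {s + t for s in L for t in M}

def power(L, n):
    """L^n: L concatenated with itself n times; L^0 = {ε}, written "" here."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result

def closure_up_to(L, n):
    """Finite approximation of L*: the union of L^0 through L^n."""
    out = set()
    for i in range(n + 1):
        out |= power(L, i)
    return out

L = {"A", "B", "C", "D"}
D = {"1", "2", "3"}

print(sorted(L | D))                 # union
print(sorted(concat(L, D)))          # LD
print(len(power(L, 2)))              # number of strings in L^2
print(power(L, 4) == concat(power(L, 2), power(L, 2)))
```

The same helpers can be used to explore the open questions on the slide, for example by evaluating `concat(L, L | D)`.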

Language & Regular Expressions
A regular expression is a set of rules (techniques) for constructing sequences of symbols (strings) from an alphabet.
Let Σ be an alphabet and r a regular expression. Then L(r) is the language characterized by the rules of r.

Rules for Specifying Regular Expressions
Fix an alphabet Σ.
- ε is a regular expression denoting {ε}.
- If a is in Σ, then a is a regular expression denoting {a}.
- Let r and s be regular expressions with languages L(r) and L(s). Then:
  (a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
  (b) (r)(s) is a regular expression denoting L(r) L(s)
  (c) (r)* is a regular expression denoting (L(r))*
  (d) (r) is a regular expression denoting L(r)
All operators are left-associative. Precedence, from highest to lowest, is *, concatenation, |. Parentheses are dropped as allowed by these precedence rules.

Examples of Regular Expressions
L = {A, B, C, D}    D = {1, 2, 3}
A | B | C | D = L
(A | B | C | D) (A | B | C | D) = L^2
(A | B | C | D)* = L*
(A | B | C | D) ((A | B | C | D) | (1 | 2 | 3)) = L (L ∪ D)
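The same expressions can be tried out with Python's `re` module, whose `|`, grouping and `*` behave like the formal operators for patterns this simple. The dictionary keys and the `in_language` helper are hypothetical names.

```python
import re

L_RE = "(A|B|C|D)"
D_RE = "(1|2|3)"

EXAMPLES = {
    "L":      L_RE,
    "L^2":    L_RE + L_RE,
    "L*":     L_RE + "*",
    "L(L|D)": L_RE + "(" + L_RE + "|" + D_RE + ")",
}

def in_language(name, s):
    """True if the whole string s belongs to the language of the named expression."""
    return re.fullmatch(EXAMPLES[name], s) is not None

print(in_language("L^2", "AB"))      # a two-letter string
print(in_language("L*", ""))         # the empty string
print(in_language("L(L|D)", "A1"))   # letter followed by a digit
print(in_language("L(L|D)", "1A"))   # digit first
```

`re.fullmatch` is used rather than `re.match` so that the whole string, not just a prefix, must be in the language.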

Algebraic Properties of Regular Expressions
AXIOM                                  DESCRIPTION
r | s = s | r                          | is commutative
r | (s | t) = (r | s) | t              | is associative
(r s) t = r (s t)                      concatenation is associative
r (s | t) = r s | r t
(s | t) r = s r | t r                  concatenation distributes over |
ε r = r
r ε = r                                ε is the identity element for concatenation
r* = (r | ε)*                          relation between * and ε
r** = r*                               * is idempotent
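One way to build intuition for these axioms is to compare the languages of both sides on all short strings. The sketch below does that with Python's `re`; it is a bounded check on sample instances (r = a, s = b, t = ab), not a proof, and `language` and `same_language` are hypothetical helpers.

```python
import re
from itertools import product

def language(regex, alphabet="ab", max_len=4):
    """All strings over the alphabet, up to max_len, that the regex matches in full."""
    words = ("".join(p)
             for n in range(max_len + 1)
             for p in product(alphabet, repeat=n))
    return {w for w in words if re.fullmatch(regex, w)}

def same_language(r1, r2):
    """Bounded equivalence test: equal on every string of length <= 4."""
    return language(r1) == language(r2)

# Sample instances of the axioms with r = a, s = b, t = ab.
print(same_language("a|b", "b|a"))                 # | is commutative
print(same_language("a|(b|ab)", "(a|b)|ab"))       # | is associative
print(same_language("a(b|ab)", "ab|aab"))          # left distribution
print(same_language("(b|ab)a", "ba|aba"))          # right distribution
print(same_language("(a*)*", "a*"))                # * is idempotent
```

A `False` result would be a counterexample; a `True` result only says no counterexample exists among the strings tried.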

Token Recognition
How can we use the concepts developed so far to assist in recognizing the tokens of a source language?
Assume the following tokens: if, then, else, relop, id, num. Given the tokens, what are the patterns?

Grammar:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id | num

Patterns:
if    → if
then  → then
else  → else
relop → < | <= | > | >= | = | <>
id    → letter ( letter | digit )*
num   → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

Overall
Regular expression   Token   Attribute value
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
id                   id      pointer to table entry
num                  num     pointer to table entry
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
>                    relop   GT
>=                   relop   GE
Note: each token has a unique token identifier that defines its category of lexemes.
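The table can be read as a specification for a small scanner that returns (token, attribute) pairs. The sketch below assumes a list-based symbol table whose indices stand in for "pointer to table entry", and recognizes keywords by first matching them as identifiers; `tokenize`, `MASTER` and the other names are hypothetical.

```python
import re

KEYWORDS = {"if", "then", "else"}
RELOPS = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}

# One named alternative per row group of the table. Longer relops are listed
# first so that "<=" is not split into "<" and "=".
MASTER = re.compile(r"""
    (?P<ws>\s+)
  | (?P<num>\d+(?:\.\d+)?(?:E[+-]?\d+)?)
  | (?P<id>[A-Za-z][A-Za-z0-9]*)
  | (?P<relop><=|<>|>=|<|=|>)
""", re.VERBOSE)

def tokenize(source):
    """Return (tokens, symtab); tokens is a list of (token, attribute) pairs."""
    symtab = []                      # assumed symbol table: a list of lexemes
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"no token pattern matches at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                 # whitespace yields no token
        if kind == "id" and lexeme in KEYWORDS:
            tokens.append((lexeme, None))
        elif kind == "relop":
            tokens.append(("relop", RELOPS[lexeme]))
        else:                        # id or num: attribute is the table index
            if lexeme not in symtab:
                symtab.append(lexeme)
            tokens.append((kind, symtab.index(lexeme)))
    return tokens, symtab

tokens, symtab = tokenize("if x <= 10 then y else x")
print(tokens)
print(symtab)
```

Treating keywords as reserved identifiers keeps the patterns simple; a scanner could instead give each keyword its own pattern placed ahead of id.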

Transition diagrams
[Figure: transition diagram for relop]
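The diagram itself is not reproduced in this transcript, so the following is a sketch of how a transition diagram for the relop pattern (< | <= | > | >= | = | <>) is typically coded, with one branch per state and a "retract" when the extra character read is not part of the lexeme. The function name `relop` and its return convention are assumptions.

```python
def relop(text, pos=0):
    """Return (attribute, next_pos) for the relop starting at pos, or None."""
    def peek(i):
        return text[i] if i < len(text) else ""   # "" stands for end of input

    c = peek(pos)
    if c == "<":                       # state after reading "<"
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("LE", pos + 2)
        if nxt == ">":
            return ("NE", pos + 2)
        return ("LT", pos + 1)         # "other" edge: retract one character
    if c == "=":
        return ("EQ", pos + 1)
    if c == ">":                       # state after reading ">"
        if peek(pos + 1) == "=":
            return ("GE", pos + 2)
        return ("GT", pos + 1)         # "other" edge: retract one character
    return None                        # no relop starts here

print(relop("<= 5"))
print(relop("<>"))
print(relop("<x"))
```

The retract cases correspond to accepting states reached on an "other" character: the lookahead is consumed only to decide, then handed back to the input.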

Transition diagrams (cont.)
[Figure: transition diagram for reserved words and identifiers]

Transition diagrams (cont.)
[Figure: transition diagram for unsigned numbers]
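Since the figure is missing from the transcript, here is a table-driven sketch of an automaton for unsigned numbers, built from the pattern digit+ (. digit+)? (E (+ | -)? digit+)?. The state names and the helper `match_number` are assumptions, not the labels used on the slide.

```python
def char_class(c):
    if c in "0123456789":
        return "digit"
    if c == ".":
        return "."
    if c == "E":
        return "E"
    if c in "+-":
        return "sign"
    return "other"

# (state, character class) -> next state; missing entries mean "no transition".
TRANSITIONS = {
    ("start", "digit"):      "int",
    ("int", "digit"):        "int",
    ("int", "."):            "dot",
    ("int", "E"):            "exp",
    ("dot", "digit"):        "frac",
    ("frac", "digit"):       "frac",
    ("frac", "E"):           "exp",
    ("exp", "sign"):         "exp_sign",
    ("exp", "digit"):        "exp_digits",
    ("exp_sign", "digit"):   "exp_digits",
    ("exp_digits", "digit"): "exp_digits",
}
ACCEPTING = {"int", "frac", "exp_digits"}

def match_number(text, pos=0):
    """Longest prefix of text[pos:] that is an unsigned number, or None."""
    state = "start"
    last_accept = None               # end of the longest match seen so far
    i = pos
    while i < len(text):
        state = TRANSITIONS.get((state, char_class(text[i])))
        if state is None:
            break
        i += 1
        if state in ACCEPTING:
            last_accept = i
    return text[pos:last_accept] if last_accept is not None else None

print(match_number("6.02E23"))
print(match_number("12.x"))
```

Remembering the last accepting position plays the role of the retract edges in a diagram: input such as `12.x` falls back to the lexeme `12`.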

Transition diagrams (cont.)
[Figure: transition diagram for whitespace]

Lexical Analyzer Generator - Lex
Lex source program (lex.l) → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream → a.out → sequence of tokens

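Lexical errors
- Some errors are beyond the power of the lexical analyzer to recognize, for example: fi (a == f(x)) ... Here fi is a valid identifier, so the misspelled keyword cannot be detected at this stage.
- However, it may be able to recognize errors such as: d = 2r
- Such errors are recognized when no token pattern matches a character sequence.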

Error recovery
- Panic mode: successive characters are ignored until a well-formed token is reached
- Delete one character from the remaining input
- Insert a missing character into the remaining input
- Replace a character by another character
- Transpose two adjacent characters
- Minimal distance
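Panic mode, the first strategy in the list, can be sketched in a few lines. The token set below (integers, identifiers and a few one-character operators) is a made-up example, and `scan_with_panic_mode` is a hypothetical name.

```python
import re

# Assumed toy token set: whitespace, integers, identifiers, some operators.
TOKEN = re.compile(r"\s+|\d+|[A-Za-z][A-Za-z0-9]*|[=+\-*/()]")

def scan_with_panic_mode(source):
    """Return (tokens, errors); errors holds (position, skipped text) pairs."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m:
            if not m.group().isspace():
                tokens.append(m.group())
            pos = m.end()
        else:
            start = pos
            # Panic mode: ignore characters until some token pattern matches.
            while pos < len(source) and not TOKEN.match(source, pos):
                pos += 1
            errors.append((start, source[start:pos]))
    return tokens, errors

tokens, errors = scan_with_panic_mode("d = 2 @# r")
print(tokens)
print(errors)
```

The scanner reports the skipped characters and carries on, so a single bad character sequence does not stop the rest of the input from being tokenized.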