Scanning & Regular Expressions CPSC 388 Ellen Walker Hiram College.

Presentation transcript:

Scanning & Regular Expressions CPSC 388 Ellen Walker Hiram College

Scanning
Input: characters from the source code
Output: tokens
–Keywords: IF, THEN, ELSE, FOR …
–Symbols: PLUS, LBRACE, SEMI …
–Variable tokens: ID, NUM (augmented with a string or numeric value)

TokenType
Enumerated type (a C++ construct)
typedef enum {IF, THEN, ELSE, …} TokenType;
IF, THEN, ELSE (etc.) are now constants of type TokenType

Using TokenType
void someFun(TokenType tt){
  …
  switch (tt){
    case IF: … break;
    case THEN: … break;
    …
  }
}

Token Class (partial)
class Token {
public:
  TokenType tokenval;
  string tokenchars;
  double numval;
};
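The enumerated type, the switch, and the Token class fit together roughly as follows. This is only a sketch: the list of token kinds is an illustrative subset (the slides elide the full set), the default member values are an addition, and describe() is a hypothetical helper that does not appear in the slides.

```cpp
#include <string>

// Illustrative subset of token kinds; the slides do not list the full set.
enum TokenType { IF, THEN, ELSE, ID, NUM };

// Same three members as the partial Token class, with default values added.
class Token {
public:
    TokenType tokenval = ID;
    std::string tokenchars;
    double numval = 0.0;
};

// Hypothetical helper: dispatches on the token type with a switch and
// returns a printable description of the token.
std::string describe(const Token &t) {
    switch (t.tokenval) {
    case IF:   return "keyword if";
    case THEN: return "keyword then";
    case ELSE: return "keyword else";
    case ID:   return "identifier " + t.tokenchars;
    case NUM:  return "number " + std::to_string(t.numval);
    }
    return "unknown";
}
```

For a Token whose tokenval is ID and whose tokenchars is "count", describe() returns "identifier count".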

Interlude: References and Pointers
Java has primitives and references
–Primitives are int, char, double, etc.
–References “point to” objects
C++ has only primitives
–But one of the primitives is “address” (a pointer), which serves the purpose of a reference

Interlude: References and Pointers
To declare a pointer, put * after the type
char x;    // a character
char *y;   // a pointer to a character
Using pointers:
x = 'a';
y = &x;    // y gets the address of x
*y = 'b';  // thing pointed at by y becomes 'b'
           // note that x is now also 'b'!

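Interlude: References and Pointers
Continuing the example…
cout << x << endl;           // prints b
cout << *y << endl;          // prints b
cout << (void *)y << endl;   // prints an address (usually in hex)
cout << (void *)&x << endl;  // same as above
cout << &y << endl;          // a different address: where the pointer is stored
Note: the casts to void * are needed because cout treats a plain char * as a C string and prints the characters it points at, not the address.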
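The pointer example can be packaged into a small compilable sketch. The struct DemoResult and the function pointerDemo() are hypothetical names introduced only for illustration; the statements in the body are the ones from the slides.

```cpp
// Hypothetical result record: what the slide's example leaves behind.
struct DemoResult {
    char x;            // value of x after the write through y
    char viaPointer;   // value read back through *y
    bool sameAddress;  // whether y holds the address of x
};

DemoResult pointerDemo() {
    char x;       // a character
    char *y;      // a pointer to a character
    x = 'a';
    y = &x;       // y gets the address of x
    *y = 'b';     // writes through the pointer, so x changes too
    return DemoResult{x, *y, y == &x};
}
```

Calling pointerDemo() yields x == 'b', viaPointer == 'b', and sameAddress == true, which matches the comments on the slide.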

GetToken(): A scanning function
Token *GetToken(istream &sin)
–Reads characters from sin until a complete token is extracted, then returns (a pointer to) the token
–Usually called by the parser
–Note: the version in the book uses global variables and returns only the token type

Using GetToken
Token *myToken = GetToken(cin);
while (myToken != NULL){
  // process the token
  switch (myToken->tokenval){
    // cases for each token type
  }
  myToken = GetToken(cin);
}

Result of GetToken

Tokens and Languages
The set of valid tokens of a particular type is a language (in the formal sense).
More specifically, it is a regular language.

Language Formalities
Language: set of strings
String: sequence of symbols
Alphabet: set of legal symbols for strings
–Generally Σ is used to denote an alphabet

Example Languages
L1 = {aa, ab, bb}, Σ = {a, b}
L2 = {ε, ab, abab, …}, Σ = {a, b}
L3 = {strings of N a’s where N is an odd integer}, Σ = {a}
L4 = {ε} (one string with no symbols)
L5 = { } (no strings at all), i.e. L5 = Ø

Denoting Languages
Expressions (regular languages only)
Grammars
–Set of rewrite rules that express all and only the strings in the language
Automata
–Machines that “accept” all and only the strings in the language
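As one illustration of a machine that accepts all and only the strings of a language, here is a sketch of a two-state automaton for the example language of strings made of an odd number of a's. The function name and state names are invented for the example.

```cpp
#include <string>

// Two-state automaton over the alphabet {a}: accepts exactly the strings
// consisting of an odd number of a's.
bool acceptsOddAs(const std::string &s) {
    enum State { EVEN, ODD } state = EVEN;  // start state: zero a's seen
    for (char c : s) {
        if (c != 'a') return false;         // symbol outside the alphabet
        state = (state == EVEN) ? ODD : EVEN;
    }
    return state == ODD;                    // ODD is the only accepting state
}
```

The machine accepts "a" and "aaa" and rejects the empty string, "aa", and anything containing a symbol other than a.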

Primitive Regular Expressions
Ø
–L(Ø) = { } (no strings)
ε
–L(ε) = {ε} (one string, no symbols)
a, where a is a member of Σ
–L(a) = {a} (one string, one symbol)

Combining Regular Expressions
Choice: r | s (sometimes r+s)
–L(r | s) = L(r) ∪ L(s)
Concatenation: rs
–L(rs) = L(r)L(s)
–All combinations of one string from L(r) followed by one string from L(s)
Repetition: r*
–L(r*) = {ε} ∪ L(r) ∪ L(rr) ∪ L(rrr) ∪ …
–0 or more strings from L(r) concatenated
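The three operations can be tried out directly on small finite languages represented as sets of strings. This is a sketch under stated assumptions: the type alias Language and the function names are invented here, and because r* is infinite whenever L(r) contains a non-empty string, starUpTo() only builds the strings made of at most maxReps pieces.

```cpp
#include <set>
#include <string>

using Language = std::set<std::string>;

// Choice: L(r | s) is the union of L(r) and L(s).
Language unionOf(const Language &r, const Language &s) {
    Language out = r;
    out.insert(s.begin(), s.end());
    return out;
}

// Concatenation: every string of r followed by every string of s.
Language concat(const Language &r, const Language &s) {
    Language out;
    for (const auto &x : r)
        for (const auto &y : s)
            out.insert(x + y);
    return out;
}

// Repetition, truncated: strings built from 0..maxReps pieces of r.
Language starUpTo(const Language &r, int maxReps) {
    Language out = {""};    // zero repetitions gives the empty string
    Language power = {""};  // strings made of exactly i pieces
    for (int i = 0; i < maxReps; ++i) {
        power = concat(power, r);
        out = unionOf(out, power);
    }
    return out;
}
```

With r = {"ab"}, starUpTo(r, 2) gives the empty string, "ab", and "abab", which are the first three strings of (ab)*.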

Precedence
Repetition before concatenation
Concatenation before choice
Use parentheses to override
–aa* vs. (aa)*
–ab|c vs. a(b|c)

Example Languages
L1 = {aa, ab, bb}, Σ = {a, b}
L2 = {ε, ab, abab, …}, Σ = {a, b}
L3 = {strings of N a’s where N is an odd integer}, Σ = {a}
L4 = {ε} (one string with no symbols)
L5 = { } (no strings at all), i.e. L5 = Ø

R.E.’s for Examples
L1 = aa | ab | bb
L1 = a(a|b) | bb
L1 = aa | (a|b)b
L2 = (ab)*, not ab*!
L3 = a(aa)*
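These expressions can be checked against sample strings with the standard library's std::regex, whose default (ECMAScript) grammar uses the same |, *, and parentheses notation. This is only a sketch: matchesWhole() and the RE_ constants are names made up for the example, and std::regex_match is used because it requires the whole string to match, which corresponds to membership in the language.

```cpp
#include <regex>
#include <string>

// The slide's expressions, written without the optional spaces.
const std::string RE_L1     = "aa|ab|bb";
const std::string RE_L1_ALT = "a(a|b)|bb";
const std::string RE_L2     = "(ab)*";
const std::string RE_L3     = "a(aa)*";

// True when the whole string s is in the language of pattern.
bool matchesWhole(const std::string &pattern, const std::string &s) {
    return std::regex_match(s, std::regex(pattern));
}
```

For instance, matchesWhole(RE_L2, "abab") is true while matchesWhole(RE_L2, "abb") is false. The string "abb" does match ab*, which is why the parentheses in (ab)* matter.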

What are these languages?
a* | b* | c*
a*b*c*
(a*b*)*
a(a|b)*c
(a|b|c)*bab(a|b|c)*

What are the RE’s?
In the alphabet {a,b,c}:
–All strings that are in alphabetical order
–All strings that have the first a before the first b, and the first b before the first c, e.g. ababbabca
–All strings that contain “abc”
–All strings that do not contain “abc”

Extended Reg. Exp’s
Additional operations for convenience
r+ = rr* (one or more repetitions)
. = any character in the alphabet
.* = any possible string from the alphabet
[a-z] = a|b|c|…|z
[^aeiou] = b|c|d|f|g|h|j|…
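The extended operators are available in std::regex with the same spelling, so they can be tried out directly. A small sketch follows; the three function names are invented for illustration. One difference from the slide is worth noting: in std::regex the class [^aeiou] is the complement over all characters, not only the letters a to z, so digits and punctuation match it as well.

```cpp
#include <regex>
#include <string>

// [a-z]+ : one or more lower-case letters (r+ is the same as rr*).
bool isLowercaseWord(const std::string &s) {
    static const std::regex re("[a-z]+");
    return std::regex_match(s, re);
}

// [^aeiou]* : any string with no lower-case vowel. The complement is
// taken over all characters, so digits and punctuation are allowed too.
bool hasNoLowercaseVowel(const std::string &s) {
    static const std::regex re("[^aeiou]*");
    return std::regex_match(s, re);
}

// ... : exactly three characters, each matched by the dot.
bool isThreeChars(const std::string &s) {
    static const std::regex re("...");
    return std::regex_match(s, re);
}
```

For example, isLowercaseWord("scan") is true and isLowercaseWord("") is false, since + requires at least one repetition.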