What is a language? An alphabet is a well defined set of characters. The character ∑ is typically used to represent an alphabet. A string : a finite.

Presentation transcript:

What is a language? An alphabet is a well-defined set of characters. The character Σ is typically used to represent an alphabet. A string: a finite sequence of alphabet symbols; it can be ε, the empty string (some texts use λ for the empty string). A language, L, is simply any set of strings (infinite or finite) over a fixed alphabet; it can be { }, the empty language.

What is a language? (cont’d) Examples: Alphabet: A-Z Language: English Alphabet: ASCII Language: C++

Suppose Σ = {a,b,c}. Some languages over Σ could be: {aa,ab,ac,bb,bc,cc} {ab,abc,abcc,abccc,...} { ε } { } {a,b,c,ε} …

What is a language? (cont’d)
Alphabet: {0,1}. Languages: {0,10,100,1000,10000,…} and {0,1,00,11,000,111,…}
Alphabet: {a,b,c}. Languages: {abc, aabbcc, aaabbbccc, …}

Regular Languages Formally describe tokens in the language Regular Expressions NFA DFA

Regular Expressions A regular expression is a set of rules (techniques) for constructing sequences of symbols (strings) from an alphabet. If A is a regular expression, then L(A) is the language defined by that regular expression. L(“c”) is the language with the single word “c”. L(“i” “f”) is the language with just “if” in it.

Regular Expressions (cont’d) L(“if” | “then” | “else”) is the language with just the words “if”, “then”, and “else”. L((“0” | “1”)(“0” | “1”)) is the language consisting of “00”, “01”, “10” and “11”.
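To make L(r) concrete, the two examples above can be checked by brute force. This is only a sketch: the helper name `language`, the length bound and the choice of Python's `re` module are assumptions for illustration, not part of the slides. It tries every string over a small alphabet up to a given length and keeps the ones that the whole pattern matches.

```python
import re
from itertools import product

def language(pattern, alphabet, max_len):
    """Hypothetical helper: all strings over `alphabet` with length <= max_len
    that are matched in full by `pattern`."""
    regex = re.compile(pattern)
    found = set()
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if regex.fullmatch(s):
                found.add(s)
    return found

# L(("0" | "1")("0" | "1")) over the alphabet {0, 1}
two_bits = language(r"(0|1)(0|1)", "01", 3)

# L("if" | "then" | "else") over just the letters those words use
keywords = language(r"if|then|else", "efhilnst", 4)

print(sorted(two_bits))
print(sorted(keywords))
```

The enumeration only works for tiny alphabets and short strings, but it shows that a regular expression is just a compact description of a set of strings.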

Regular Expressions (cont’d) Let Σ be an alphabet and r a regular expression. Then L(r) is the language characterized by the rules of r.

Rules Fix an alphabet Σ.
ε is a regular expression (denotes the language {ε}).
If a is in Σ, a is a regular expression (denotes the language {a}).
If r and s are regular expressions denoting L(r) and L(s) respectively, then so are:
(r) | (s), which denotes the language L(r) ∪ L(s)
(r)(s), which denotes the language L(r)L(s)
(r)*, which denotes the language (L(r))*

Example

Regular Expression Operation There are three basic operations in regular expressions: Alternation (union) RE1 | RE2; Concatenation RE1 RE2; Repetition (closure) RE* (zero or more RE’s).

Regular Expression Operation If P and Q are regular expressions over Σ, then so are:
P | Q (union): if P denotes the set {a,…,e} and Q denotes the set {0,…,9}, then P | Q denotes the set {a,…,e,0,…,9}
PQ (concatenation): if P denotes the set {a,…,e} and Q denotes the set {0,…,9}, then PQ denotes the set {a0,…,e0,a1,…,e9}
Q* (closure): if Q denotes the set {0,…,9}, then Q* denotes the set {ε,0,…,9,00,…,99,…}
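The three operations can be sketched directly on Python sets of strings. The function names and the repetition bound `max_reps` are assumptions made for this illustration; the true closure is infinite, so the sketch only builds the strings obtained from at most `max_reps` repetitions.

```python
def union(p, q):
    """L(P | Q): every string that is in P or in Q."""
    return p | q

def concat(p, q):
    """L(PQ): every string x + y with x in P and y in Q."""
    return {x + y for x in p for y in q}

def closure(q, max_reps):
    """Finite part of L(Q*): zero up to max_reps concatenated strings from Q.
    Zero repetitions gives the empty string, so "" is always included."""
    result = {""}
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, q)
        result |= layer
    return result

P = set("abcde")        # {a,...,e}
Q = set("0123456789")   # {0,...,9}

print(len(union(P, Q)))    # 5 letters plus 10 digits
print(len(concat(P, Q)))   # one string per (letter, digit) pair
print(len(closure(Q, 2)))  # empty string, 10 one-digit and 100 two-digit strings
```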

Examples If Σ = {a,b}: (a | b)*b denotes all strings that end in b; b(a | b)* denotes all strings that begin with b.

Regular Expression Overview
Expression : Meaning
ε : Empty pattern
a : Any pattern represented by ‘a’
ab : Strings with pattern ‘a’ followed by ‘b’
a|b : Strings consisting of pattern ‘a’ or ‘b’
a* : Zero or more occurrences of patterns in ‘a’
a+ : One or more occurrences of patterns in ‘a’
a³ : Patterns in ‘a’ repeated exactly 3 times

L(R) = the language defined by R A regular expression R describes a set of strings of characters, denoted L(R). L(abc) = { abc } L(hello|goodbye) = { hello, goodbye } L(1(0|1)*) = all binary numbers that start with a 1. Each token can be defined using a regular expression.
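Membership in L(R) can be tested by asking whether the whole string matches R. The sketch below uses Python's `re.fullmatch` for L(1(0|1)*); the function name `in_language` is a hypothetical helper, not something from the slides.

```python
import re

# R = 1(0|1)*  : binary numbers that start with a 1
BINARY = re.compile(r"1(0|1)*")

def in_language(s):
    """True if the whole string s belongs to L(1(0|1)*)."""
    return BINARY.fullmatch(s) is not None

for candidate in ["1", "101", "011", "", "12"]:
    print(repr(candidate), in_language(candidate))
```

Using `fullmatch` rather than `search` matters here: `search` would accept "011" because it contains a substring that starts with 1.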

RE Notational Shorthand
R+ : one or more strings of R: R(R*)
R? : optional R: (R|ε)
[abcd] : one of the listed characters: (a|b|c|d)
[a-z] : one character from this range: (a|b|c|d|...|z)
[^ab] : any character except the listed ones
[^a-z] : any character not from this range
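Each shorthand is only an abbreviation for an expression built from the three basic operations. As a rough check, assuming Python's `re` syntax and a hypothetical helper `same_language`, the sketch below compares a shorthand with its expansion on every string up to a small length bound. Agreement on a bounded set is evidence, not a proof of equivalence.

```python
import re
from itertools import product

def same_language(p1, p2, alphabet, max_len):
    """Hypothetical helper: True if p1 and p2 accept exactly the same strings
    over `alphabet` among all strings of length <= max_len."""
    r1, r2 = re.compile(p1), re.compile(p2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(r1.fullmatch(s)) != bool(r2.fullmatch(s)):
                return False
    return True

print(same_language(r"(ab)+", r"(ab)(ab)*", "ab", 6))      # R+  versus R(R*)
print(same_language(r"a?b", r"(a|)b", "ab", 4))            # R?  versus (R|ε)
print(same_language(r"[abcd]", r"(a|b|c|d)", "abcde", 2))  # character class
print(same_language(r"[^ab]", r"(c|d|e)", "abcde", 2))     # negation, relative to {a,...,e}
print(same_language(r"a+", r"a*", "a", 2))                 # differ on the empty string
```

In Python the empty alternative in `(a|)` plays the role of ε.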

Regular Expression, R : Strings in L(R)
a : “a”
ab : “ab”
a|b : “a”, “b”
(ab)* : “”, “ab”, “abab”, ...
(a|ε)b : “ab”, “b”
digit = [0-9] : “0”, “1”, “2”, ...
posint = digit+ : “8”, “412”, ...
… : “23”, “34”, ...

More Examples
All strings that start with “tab” or end with “bat”: tab{A,…,Z,a,…,z}* | {A,…,Z,a,…,z}*bat
All strings in which 1, 2 and 3 appear in ascending order: {A,…,Z}*1{A,…,Z}*2{A,…,Z}*3{A,…,Z}*
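Both examples translate almost symbol for symbol into Python's `re` syntax, where the set {A,…,Z,a,…,z} is written as the character class `[A-Za-z]`. The constant names below are made up for this sketch.

```python
import re

# tab{A..Z,a..z}* | {A..Z,a..z}*bat
TAB_OR_BAT = re.compile(r"tab[A-Za-z]*|[A-Za-z]*bat")

# {A..Z}*1{A..Z}*2{A..Z}*3{A..Z}*
ASCENDING_123 = re.compile(r"[A-Z]*1[A-Z]*2[A-Z]*3[A-Z]*")

for word in ["table", "combat", "tab", "batch"]:
    print(word, TAB_OR_BAT.fullmatch(word) is not None)

for word in ["A1B2C3", "123", "A2B1C3"]:
    print(word, ASCENDING_123.fullmatch(word) is not None)
```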

Defining Our Language The first things we can define in our language are keywords. These are easy: if | else | while | find | … When we scan a file, we can either have a single token represent all keywords, or else break them down into groups, such as “commands”, “types”, etc.

Language Def (cont’d) Next we will define integers in a language: digit = “0” | “1” | “2” | “3” | “4” | “5” | “6” | “7” | “8” | “9” integer = {digit}+ Note that we can abbreviate ranges using the dash (“-”). Thus, digit = 0-9. Relational operators: relation = ‘<’ | ‘<=’ | ‘>’ | ‘>=’ | ‘<>’ | ‘=’ Floating point numbers are not much more complicated: float = {digit}+ “.” {digit}+

Language Def (cont’d) Identifiers are strings of letters, underscores, or digits beginning with a non-digit. letter = a-z | A-Z digit = 0-9 identifier = ({letter} | “_”)({letter} | “_” | {digit})*
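The token definitions above can be combined into a small scanner. What follows is a minimal sketch under several assumptions: the names `TOKEN_SPEC`, `MASTER` and `tokenize` are invented here, whitespace is skipped, keywords are treated as one token class, and identifiers may start with an underscore (following the "beginning with a non-digit" description). The float pattern is listed before the integer pattern so that "3.14" is not split at the dot, and two-character relations come before their one-character prefixes.

```python
import re

KEYWORDS = {"if", "else", "while", "find"}

# Order matters: Python tries the alternatives from left to right.
TOKEN_SPEC = [
    ("FLOAT",      r"[0-9]+\.[0-9]+"),          # {digit}+ "." {digit}+
    ("INTEGER",    r"[0-9]+"),                  # {digit}+
    ("RELATION",   r"<=|>=|<>|<|>|="),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("SKIP",       r"\s+"),                     # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Split text into (kind, lexeme) pairs; raise ValueError on unknown input."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "SKIP":
            continue
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"   # keywords look like identifiers, so reclassify them
        tokens.append((kind, lexeme))
    return tokens

print(tokenize("while x1 <= 3.14"))
print(tokenize("_n <> 42"))
```

Recognizing keywords by looking up matched identifiers in a set is one common design; the alternative of giving each keyword its own pattern would also work but needs care so that "iffy" is not read as "if" followed by "fy".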

Real-world example What is the regular expression that defines all phone numbers? Σ = { 0-9, “(”, “)” } Area = {digit}³ Exchange = {digit}³ Local = {digit}⁴ Phone_number = “(” {Area} “)” {Exchange} {Local}
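The phone number definition can be written as a single pattern. In this sketch, which assumes Python's `re` syntax, exact repetition is spelled `{3}` and `{4}`, the parentheses of the phone number are escaped, and the named groups `area`, `exchange` and `local` mirror the definitions above. The function name `parse_phone` is hypothetical.

```python
import re

PHONE = re.compile(r"\((?P<area>[0-9]{3})\)(?P<exchange>[0-9]{3})(?P<local>[0-9]{4})")

def parse_phone(s):
    """Return (area, exchange, local) if s is a phone number in this format, else None."""
    m = PHONE.fullmatch(s)
    if m is None:
        return None
    return m.group("area"), m.group("exchange"), m.group("local")

print(parse_phone("(512)5551234"))
print(parse_phone("512-555-1234"))
```

Only the exact format from the definition is accepted; other common ways of writing a phone number, such as with dashes or spaces, are outside this language.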