Copyright © Cengage Learning. All rights reserved.

Slides:



Advertisements
Similar presentations
Chapter 2 Revision of Mathematical Notations and Techniques
Advertisements

COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
Regular Expressions, Backus-Naur Form and Reverse Polish Notation.
YES-NO machines Finite State Automata as language recognizers.
Week 13 - Wednesday.  What did we talk about last time?  Exam 3  Before review:  Graphing functions  Rules for manipulating asymptotic bounds  Computing.
176 Formal Languages and Applications: We know that Pascal programming language is defined in terms of a CFG. All the other programming languages are context-free.
Regular Languages Sequential Machine Theory Prof. K. J. Hintz Department of Electrical and Computer Engineering Lecture 3 Comments, additions and modifications.
Regular Languages Sequential Machine Theory Prof. K. J. Hintz Department of Electrical and Computer Engineering Lecture 3 Comments, additions and modifications.
Languages and Machines Unit two: Regular languages and Finite State Automata.
CPSC 388 – Compiler Design and Construction
CMSC 330 Exercise: Write a Ruby function that takes an array of names in “Last, First Middle” format and returns the same list in “First Middle Last” format.
Regular Language & Expressions. Regular Language A regular language is one that a finite state machine (fsm) will accept. ‘Alphabet’: {a, b} ‘Rules’:
Copyright © Cengage Learning. All rights reserved. CHAPTER 2 THE LOGIC OF COMPOUND STATEMENTS THE LOGIC OF COMPOUND STATEMENTS.
Lecture 21: Languages and Grammars. Natural Language vs. Formal Language.
Chapter 4 Context-Free Languages Copyright © 2011 The McGraw-Hill Companies, Inc. Permission required for reproduction or display. 1.
Language Recognizer Connecting Type 3 languages and Finite State Automata Copyright © – Curt Hill.
CSI 3120, Grammars, page 1 Language description methods Major topics in this part of the course: –Syntax and semantics –Grammars –Axiomatic semantics (next.
CS/IT 138 THEORY OF COMPUTATION Chapter 1 Introduction to the Theory of Computation.
CSC312 Automata Theory Lecture # 2 Languages.
CSC312 Automata Theory Lecture # 2 Languages.
1 Chapter 1 Automata: the Methods & the Madness Angkor Wat, Cambodia.
Lecture Two: Formal Languages Formal Languages, Lecture 2, slide 1 Amjad Ali.
Introduction to Theory of Automata
Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Machine-independent code improvement Target code generation Machine-specific.
CMSC 330: Organization of Programming Languages Theory of Regular Expressions.
REGULAR EXPRESSIONS. Lexical Analysis Lexical analysers can be constructed by programs such as LEX These programs employ as input a description of the.
1 Regular Expressions. 2 Regular expressions describe regular languages Example: describes the language.
Lecture # 3 Chapter #3: Lexical Analysis. Role of Lexical Analyzer It is the first phase of compiler Its main task is to read the input characters and.
1 INFO 2950 Prof. Carla Gomes Module Modeling Computation: Language Recognition Rosen, Chapter 12.4.
1 Chapter 1 Introduction to the Theory of Computation.
Languages, Grammars, and Regular Expressions Chuck Cusack Based partly on Chapter 11 of “Discrete Mathematics and its Applications,” 5 th edition, by Kenneth.
Grammars CPSC 5135.
Lexical Analysis I Specifying Tokens Lecture 2 CS 4318/5531 Spring 2010 Apan Qasem Texas State University *some slides adopted from Cooper and Torczon.
LING 388: Language and Computers Sandiway Fong Lecture 6: 9/15.
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 2 Mälardalen University 2006.
Introduction to Theory of Automata By: Wasim Ahmad Khan.
Review: Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Intermediate code generator Code optimizer Code generator Symbol.
Regular Expressions and Languages A regular expression is a notation to represent languages, i.e. a set of strings, where the set is either finite or contains.
CPS 506 Comparative Programming Languages Syntax Specification.
ICS312 LEX Set 25. LEX Lex is a program that generates lexical analyzers Converting the source code into the symbols (tokens) is the work of the C program.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
Specifying Languages Our aim is to be able to specify languages for use in the computer. The sketch of an FSA is easy for us to understand, but difficult.
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 2 Mälardalen University 2010.
Recursive Definations Regular Expressions Ch # 4 by Cohen
CS 208: Computing Theory Assoc. Prof. Dr. Brahim Hnich Faculty of Computer Sciences Izmir University of Economics.
Programming Languages and Design Lecture 2 Syntax Specifications of Programming Languages Instructor: Li Ma Department of Computer Science Texas Southern.
Theory of computation Introduction theory of computation: It comprises the fundamental mathematical properties of computer hardware, software,
Mathematical Foundations of Computer Science Chapter 3: Regular Languages and Regular Grammars.
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
CSE 311 Foundations of Computing I Lecture 19 Recursive Definitions: Context-Free Grammars and Languages Autumn 2012 CSE
Week 13 - Wednesday.  What did we talk about last time?  Exam 3  Before review:  Graphing functions  Rules for manipulating asymptotic bounds  Computing.
Set, Alphabets, Strings, and Languages. The regular languages. Clouser properties of regular sets. Finite State Automata. Types of Finite State Automata.
Lecture 03: Theory of Automata:2014 Asif Nawaz Theory of Automata.
ICS611 Lex Set 3. Lex and Yacc Lex is a program that generates lexical analyzers Converting the source code into the symbols (tokens) is the work of the.
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
Lecture 17: Theory of Automata:2014 Context Free Grammars.
Regular Expressions, Backus-Naur Form and Reverse Polish Notation
Theory of Computation Lecture #
Context-Free Grammars: an overview
Lecture 1 Theory of Automata
CIS Automata and Formal Languages – Pei Wang
Week 14 - Friday CS221.
Review: Compiler Phases:
CHAPTER 2 Context-Free Languages
Copyright © Cengage Learning. All rights reserved.
Introduction Reading: Sections 1.5 – 1.7.
Chapter 1 Introduction to the Theory of Computation
CSC312 Automata Theory Lecture # 2 Languages.
Presentation transcript:

Copyright © Cengage Learning. All rights reserved. CHAPTER 12 REGULAR EXPRESSIONS AND FINITE-STATE AUTOMATA Copyright © Cengage Learning. All rights reserved.

Formal Languages and Regular Expressions SECTION 12.1 Formal Languages and Regular Expressions Copyright © Cengage Learning. All rights reserved.

Formal Languages and Regular Expressions An English sentence can be regarded as a string of words, and an English word can be regarded as a string of letters. Not every string of letters is a legitimate word, and not every string of words is a grammatical sentence. We could say that a word is legitimate if it can be found in an unabridged English dictionary and that a sentence is grammatical if it satisfies the rules in a standard English grammar book.

Formal Languages and Regular Expressions Computer languages are similar to English in that certain strings of characters are legitimate words of the language and certain strings of words can be put together according to certain rules to form syntactically correct programs. A compiler for a computer language analyzes the stream of characters in a program—first to recognize individual word and sentence units (this part of the compiler is called a lexical scanner), then to analyze the syntax, or grammar, of the sentences (this part is called a syntactic analyzer), and finally to translate the sentences into machine code (this part is called a code generator).

Formal Languages and Regular Expressions In computer science it has proved useful to look at languages from a very abstract point of view as strings of certain fundamental units and allow any finite set of symbols to be used as an alphabet. It is common to denote an alphabet by a capital Greek sigma: . (This just happens to be the same symbol as the one used for summation, but the two concepts have no other connection.)

Formal Languages and Regular Expressions The definition of a string of characters of an alphabet  (or a string over ) is a generalization of the definition of string introduced earlier. A formal language over an alphabet is any set of strings of characters of the alphabet.

Formal Languages and Regular Expressions These definitions are given formally as: Note that the empty set satisfies the criteria for being a formal language. Allowing the empty set to be a formal language turns out to be convenient in certain technical situations.

Example 1 – Examples of Formal Languages Let the alphabet  = {a, b}. a. Define a language L1 over  to be the set of all strings that begin with the character a and have length at most three characters. Find L1. b. A palindrome is a string that looks the same if the order of its characters is reversed. For instance, aba and baab are palindromes. Define a language L2 over  to be the set of all palindromes obtained using the characters of . Write ten elements of L2.

Example 1 – Solution a. L1 = {a, aa, ab, aaa, aab, aba, abb} b. L2 contains the following ten strings (among infinitely many others): , a, b, aa, bb, aaa, bab, abba, babaabab, abaabbbbbaaba

Formal Languages and Regular Expressions Note that n is essentially the Cartesian product of n copies of . The language ∗ is called the Kleene closure of , in honor of Stephen C. Kleene (pronounced CLAY-knee). + is the set of all strings over  except for  and is called the positive closure of .

In prefix notation, a b is written ab. Example 3 – Polish Notation: A Language Consisting of Postfix Expressions An expression such as a b, in which a binary operator such as sits between the two quantities on which it acts, is said to be written in infix notation. Alternative notations are called prefix notation (in which the binary operator precedes the quantities on which it acts) and postfix notation (in which the binary operator follows the quantities on which it acts). In prefix notation, a b is written ab. In postfix notation, a b is written ab .

Example 3 – Polish Notation: A Language Consisting of Postfix Expressions cont’d Prefix and postfix notations were introduced in 1920 by the Polish mathematician Jan Łukasiewicz (pronounced Wu-cash-AY-vich). In his honor—and because some people have difficulty pronouncing his name—they are often referred to as Polish notation and reverse Polish notation, respectively. A great advantage of these notations is that they eliminate the need for parentheses in writing arithmetic expressions.

Example 3 – Polish Notation: A Language Consisting of Postfix Expressions cont’d For instance, in postfix (or reverse Polish) notation, the expression 8 4 6 / is evaluated from left to right as follows: Add 8 and 4 to obtain 12, and then divide 12 by 6 to obtain 2. As another example, if the expression (a b)  c in infix notation is converted to postfix notation, the result is ab c .

Example 3 – Polish Notation: A Language Consisting of Postfix Expressions cont’d a. If the expression ab  cd  in postfix notation is converted to infix notation, what is the result? b. Let  = {4, 1, , −}, and let L = the set of all strings over  obtained by writing either a 4 or a 1 first, then either a 4 or a 1, and finally either a or a –. List all elements of L between braces, and evaluate the resulting expressions.

Example 3 – Solution a. b.

Formal Languages and Regular Expressions

The Language Defined by a Regular Expression

The Language Defined by a Regular Expression

The Language Defined by a Regular Expression As an example, one regular expression over  = {a, b, c} is If the alphabet happens to include symbols—such as ( | ) ∗—special provisions have to be made to avoid ambiguity. An escape character, usually a backslash, is added before the potentially ambiguous symbol. For instance, a left parenthesis would be written as \( and the backslash itself would be written as \\.

The Language Defined by a Regular Expression To eliminate parentheses, an order of precedence for the operations used to define regular expressions has been introduced. The highest is ∗, concatenation is next, and | is the lowest. It is also customary to eliminate the outer set of parentheses in a regular expression, because doing so does not produce ambiguity. Thus

Example 5 – Order of Precedence for the Operations in a Regular Expression a. Add parentheses to make the order of precedence clear in the following expression: b. Use the convention about order of precedence to eliminate the parentheses in the following expression: Solution: a. b.

The Language Defined by a Regular Expression Given a finite alphabet, every regular expression r over the alphabet defines a formal language L(r). The function L is defined recursively.

The Language Defined by a Regular Expression Note that any finite language can be defined by a regular expression. For instance, the language {cat, dog, bird} is defined by the regular expression (cat | dog | bird).

Use set notation to find this language, and describe it in words. Example 6 – Using Set Notation to Describe the Language Defined by a Regular Expression Let  = {a, b}, and consider the language defined by the regular expression (a | b)∗. Use set notation to find this language, and describe it in words. Solution: The language defined by (a | b)∗ is

Example 6 – Solution cont’d

The Language Defined by a Regular Expression Note that concatenating strings and taking unions of sets are both associative operations. Thus for any regular expressions r, s and t, Moreover,

The Language Defined by a Regular Expression Because of these relationships, it is customary to drop the parentheses in “associative” situations and write and

Practical Uses of Regular Expressions

Practical Uses of Regular Expressions Many applications of computers involve performing operations on pieces of text. For instance, word and text processing programs allow us to find certain words or phrases in a document and possibly replace them with others. A compiler for a computer language analyzes an incoming stream of characters to find groupings that represent aspects of the computer language such as keywords, constants, identifiers, and operators.

Practical Uses of Regular Expressions And in bioinformatics, pattern matching and flexible searching techniques are used extensively to analyze the long sequences of the characters A, C, G, and T that occur in DNA. Through their connection with finite-state automata, regular expressions provide an extremely useful way to describe a pattern in order to identify a string or a collection of strings within a piece of text. Regular expressions make it possible to replace a long, complicated set of if-then-else statements with code that is easy both to produce and to understand.

Practical Uses of Regular Expressions Because of their convenience, regular expressions were introduced into a number of UNIX utilities, such as grep (short for globally search for regular expression and print) and egrep (extended grep), in text editors, such as QED (short for Quick EDitor, the first text editor to use regular expressions), vi (short for visual interface), sed (short for stream editor and originally developed for UNIX but now used by many systems), and Emacs (short for Editor macros), and in the lexical scanner component of a compiler.

Practical Uses of Regular Expressions The computer language Perl has a particularly powerful implementation for regular expressions, which has become a de facto standard. The implementations used in Java and .NET are similar. A number of shorthand notations have been developed to facilitate working with regular expressions in text processing.

Practical Uses of Regular Expressions When characters in an alphabet or in a part of an alphabet are understood to occur in a standard order, the notation [beginning character– ending character] is commonly used to represent the regular expression that consists of a single character in the range from the beginning to the ending character. It is called a character class. Thus and

Practical Uses of Regular Expressions Character classes are also allowed to include more than one range of characters. For instance, As an example, consider the language defined by the regular expression The following are some strings in the language: Account Number, z23, jsmith109, Draft2rev.

Practical Uses of Regular Expressions In general, the language is the set of all strings that start with a letter followed by a sequence of digits or letters. This set is the same as the set of allowable identifiers in a number of computer languages. Other commonly used shorthands are and a single dot . to stand for an arbitrary character. Thus, for instance, if  = {A, B,C}, then

Practical Uses of Regular Expressions When the symbol ^ is placed at the beginning of a character class, it indicates that a character of the same type as those in the range of the class is to occur at that point in the string, except for one of the specific characters indicated after the ^ sign. For instance stands for any string starting with a letter of the alphabet different from D to Z, followed by any positive number of digits from 0 to 9. Examples are B3097,C0046, and so forth. If r is a regular expression, the notation denotes the concatenation of r with itself any positive finite number of times.

Practical Uses of Regular Expressions In symbols, For example, represents any nonempty string of capital letters. If r is a regular expression, then That is, r? denotes either zero occurrences or exactly one occurrence of r.

Practical Uses of Regular Expressions Finally, if m and n are positive integers with m ≤ n, r {n} denotes the concatenation of r with itself exactly n times. and r {m, n} denotes the concatenation of r with itself anywhere from m through n times.

Practical Uses of Regular Expressions Thus a check to help determine whether a given string is a local telephone number in the United States is to see whether it has the form or, equivalently, whether it has the form

Example 11 – A Regular Expression for a Date People often write dates in a variety of formats. For instance, in the United States the following all represent the fifth of February of 2050: Write a regular expression that would help check whether a given string might be a valid date written in one of these forms.

Example 11 – Solution The language defined by the following regular expression consists of all strings that begin with one or two digits followed by either a hyphen or a slash, followed by either one or two digits, followed by either a hyphen or a slash, followed by four digits. All valid dates of the given format are elements of the language defined by this expression, but the language also includes strings that are not valid dates.

Example 11 – Solution cont’d For instance, 09/54/1978 is in the language, but it is not a valid date because September does not have 54 days, and 38/12/2184 is not valid because there is no 38th month. It is possible to write a more complicated regular expression that could be used to check all aspects of the validity of a date, but the kind of simpler expression given above is nonetheless useful. For instance, it provides an easy way to notify a user of an interactive program that a certain kind of mistake was made and that information should be reentered.