Languages, Grammars, and Regular Expressions. Chuck Cusack. Based partly on Chapter 11 of “Discrete Mathematics and its Applications,” 5th edition, by Kenneth Rosen.



Alphabets and Languages
Definition: A vocabulary (or alphabet) V is a finite, nonempty set of symbols.
Definition: A word or sentence over V is a finite string of symbols from V.
Definition: The empty string or null string, denoted by λ, is the string containing no symbols.
Definition: The set of all words over V is denoted by V*.
Definition: A language over V is a subset of V*.

Language Examples
Let V={0,1}
00110, 11111, 00, and 11 are words over V
012, a234, and 222 are not words over V
V*={λ,0,1,00,01,10,11,000,…}. In other words, V* is the set of all binary strings
The set of strings consisting of only 0s is a language over V
{1,10,100,1000,10000,…} is a language over V

Concatenation
Definition: Let V be a vocabulary, and A and B be subsets of V*. The concatenation of A and B, denoted by AB, is the set of all strings of the form xy, where x ∈ A and y ∈ B.
Example: Let A={0, 10}, and B={1, 12}. Then
–AB={01, 012, 101, 1012}
–BA={10, 110, 120, 1210}
–AA={00, 010, 100, 1010}
–AAA=A(AA)={000, 0010, 0100, 01010, 1000, 10010, 10100, 101010}
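
The concatenation example can be checked mechanically. The following is a minimal sketch, assuming languages are represented as Python sets of strings; the helper name concat is ours, not part of the slides.

```python
def concat(A, B):
    """Return the concatenation AB = {xy | x in A and y in B}."""
    return {x + y for x in A for y in B}

A = {"0", "10"}
B = {"1", "12"}

print(sorted(concat(A, B)))             # AB
print(sorted(concat(B, A)))             # BA
print(sorted(concat(A, concat(A, A))))  # AAA = A(AA)
```

Because the result is a set, strings that can be formed in more than one way are counted once.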

Concatenation: A^n
Definition: Let V be a vocabulary, and A a subset of V*. Then A^0={λ}, and for n>0, we define A^n=A^(n-1)A.
Example: Let A={0, 10}. Then
–A^0={λ}
–A^1=A^0A={λ}A=A={0, 10}
–A^2=A^1A={00, 010, 100, 1010}
–A^3=A^2A={000, 0010, 0100, 01010, 1000, 10010, 10100, 101010}
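
The recursive definition of A^n translates directly into a loop. This is a sketch under the assumption that the empty string λ is represented by the Python string ""; the names concat and power are illustrative.

```python
def concat(A, B):
    """Concatenation of two sets of strings."""
    return {x + y for x in A for y in B}

def power(A, n):
    """Return A^n, using A^0 = {λ} and A^n = A^(n-1) A."""
    result = {""}  # A^0 contains only the empty string
    for _ in range(n):
        result = concat(result, A)
    return result

A = {"0", "10"}
for n in range(4):
    print(n, sorted(power(A, n)))
```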

Kleene Closure
Definition: Let V be a vocabulary, and A a subset of V*. The Kleene closure of A, denoted by A*, is the set consisting of concatenations of an arbitrary number of strings from A. That is, A* = A^0 ∪ A^1 ∪ A^2 ∪ …
Definition: A+ is the set of nonempty strings over A. In other words, A+ = A^1 ∪ A^2 ∪ A^3 ∪ …

Kleene Closure Example
Example: Let A={0, 1}. Then
–A^0={λ}
–A^1={0, 1}
–A^2={00, 01, 10, 11}
–A^3={000, 001, 010, 011, 100, 101, 110, 111}
–A*={0,1}*={All binary strings}
Example: Let B={111}. Then
–B^0={λ}, B^1={111}, B^2={111111}
–B^3={111111111}
–B* is the set of strings with 3n 1s, for every n ≥ 0
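
A* is usually infinite, so a program can only list part of it. The sketch below enumerates the strings of A* up to a chosen length; the length bound and the name kleene_up_to are assumptions made for illustration.

```python
def kleene_up_to(A, max_len):
    """Return every string of A* whose length is at most max_len."""
    result = {""}    # A^0 = {λ} is always part of A*
    frontier = {""}  # strings found in the previous round
    while frontier:
        # Extend each new string by one more element of A, keeping only
        # strings that are short enough and not already known.
        frontier = {x + y for x in frontier for y in A
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result

print(sorted(kleene_up_to({"0", "1"}, 2), key=lambda w: (len(w), w)))
print(sorted(kleene_up_to({"111"}, 9), key=len))
```

The loop stops once a round produces no new strings, which happens as soon as every extension exceeds the bound.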

Regular Sets
Definition: A regular set is a set that can be generated starting from the empty set, the empty string, and single elements from the vocabulary, using concatenations, unions, and Kleene closures in arbitrary order.
We will give a more precise definition after we define a regular expression.

Regular Expressions
Definition: The regular expressions over a set I are defined recursively by:
–∅ (the empty set) is a regular expression,
–λ (the set containing the empty string) is a regular expression,
–x is a regular expression for all x ∈ I,
–(AB), (A ∪ B), and A* are regular expressions if A and B are regular expressions
Definition: A regular set is a set represented by a regular expression.
Examples: 001*, 1(0 ∪ (0 ∪ 1)*11, and AB*C are regular expressions

Regular Expression Example
The regular set defined by the regular expression 01* is the set of strings starting with a 0 followed by 0 or more 1s.
The regular set defined by (10)* is the set of strings containing 0 or more copies of 10.
The regular set defined by 0(0 ∪ 1)*1 is the set of all binary strings beginning with 0 and ending with 1.
The regular set defined by (0 ∪ 1)1(0 ∪ 1) is the set of strings {010, 011, 110, 111}.
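
These four regular sets can be explored with Python's re module. This is only a sketch: it assumes the union symbol ∪ is written as | in Python's syntax, and it lists members up to a length bound because most of the sets are infinite. The helper names are ours.

```python
import re
from itertools import product

def binary_strings(max_len):
    """Yield every binary string of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def regular_set(pattern, max_len):
    """Binary strings up to max_len that the whole pattern matches."""
    return {w for w in binary_strings(max_len) if re.fullmatch(pattern, w)}

print(sorted(regular_set("01*", 3)))
print(sorted(regular_set("(10)*", 4)))
print(sorted(regular_set("0(0|1)*1", 3)))
print(sorted(regular_set("(0|1)1(0|1)", 5)))
```

re.fullmatch is used rather than re.search so that the entire string, not just a substring, must belong to the set.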

Regular Expression Applications
Regular expressions are used quite often in computer science.
For instance, if you are editing a file with vi, and want to see if it contains the string blah followed by a number followed by any character followed by the letter Q, you can use the regular expression
blah[0-9][0-9]*.Q
This works because vi uses regular expressions for searching.
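
The same pattern can be tried outside vi. The sketch below uses Python's re module, which happens to accept this particular pattern unchanged; the sample strings and the name contains_pattern are made up for illustration.

```python
import re

# "blah", then one or more digits ([0-9][0-9]*), then any one character (.), then Q
PATTERN = re.compile(r"blah[0-9][0-9]*.Q")

def contains_pattern(text):
    """True if some substring of text matches the pattern."""
    return PATTERN.search(text) is not None

for sample in ["xx blah7xQ yy", "blah2024-Q", "blahxQ", "blah7Q"]:
    print(sample, contains_pattern(sample))
```

Note that the last sample fails: after the digit 7 the pattern still needs one arbitrary character before the Q.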

Grammars and Languages
Many languages can be defined by grammars.
We are particularly interested in phrase-structure grammars.
Before we can define phrase-structure grammars, we need to define a few more terms.

Special Symbols
Definition: A nonterminal symbol (or just nonterminal) is a symbol which can be replaced by other symbols.
Definition: A terminal symbol (or just terminal) is a symbol which cannot be replaced by other symbols.
Definition: The start symbol is a special symbol, usually denoted by S.
The set of terminals is denoted by T, and the set of nonterminals by N.
S is a nonterminal.

Productions
Definition: A production is a rule which tells how to replace one string from V* with another string.
Productions are denoted by a → b, which denotes that a can be replaced by b.
Example
–Let S → A0, A → A1, and A → 0 be productions
–Then I can replace S with A0
–Since I can replace A with A1, A0 can become A10
–Since I can replace A with 0, A10 can become 010
–Thus, I can replace S with 010

Phrase-Structure Grammars
Definition: A phrase-structure grammar is a 4-tuple G=(V,T,S,P), where
–V is a vocabulary
–T ⊆ V is a set of terminals
–S ∈ V is a start symbol
–P is a set of productions
N=V-T is the set of nonterminals.
Each production contains at least one nonterminal on its left side.
We will always use S as the start symbol.

Direct Derivations
Let G=(V,T,S,P) be a phrase-structure grammar.
Let A=lar and B=lbr, where l, a, b, r ∈ V*.
Let a → b be a production. Then we can derive B from A.
Thus we say that B is directly derivable from A.
We write this as A ⇒ B
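
A direct derivation replaces one occurrence of a production's left side with its right side. The sketch below computes every string directly derivable from a given string; it assumes productions are stored as (left, right) pairs of Python strings with nonempty left sides, and the name one_step is ours.

```python
def one_step(w, productions):
    """Return all strings directly derivable from w."""
    results = set()
    for left, right in productions:
        start = w.find(left)
        while start != -1:
            # w = l + left + r  becomes  l + right + r
            results.add(w[:start] + right + w[start + len(left):])
            start = w.find(left, start + 1)
    return results

# Productions S -> A0, A -> A1, A -> 0 from the Productions slide.
P = [("S", "A0"), ("A", "A1"), ("A", "0")]
print(one_step("S", P))
print(one_step("A0", P))
print(one_step("A10", P))
```

Following one choice at each step (S, then A0, then A10, then 010) reproduces the replacement sequence described for these productions.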

Derivations
Let G=(V,T,S,P) be a phrase-structure grammar.
Let A_1, A_2, …, A_n ∈ V* be such that A_1 ⇒ A_2 ⇒ … ⇒ A_n.
Then we say that A_n is derivable from A_1.
We write A_1 ⇒* A_n
The sequence of productions used is called a derivation.

Generating Languages
Let G=(V,T,S,P) be a grammar.
Definition: The language generated by G, denoted L(G), is the set of all strings of terminals that are derivable from S.
Put another way, L(G)={w ∈ T* | S ⇒* w}

Example 1
Let G be the grammar with
–V={S,0,1}
–T={0,1}
–P={S → S0, S → 0}
Clearly S ⇒ 0, so 0 ∈ L(G)
Also, S ⇒ S0 ⇒ 00, so 00 ∈ L(G)
And, S ⇒ S0 ⇒ S00 ⇒ 000, so 000 ∈ L(G)
It is not hard to see that L(G) is the language consisting of all strings with 1 or more 0s.

Example 2
Let G be the grammar with V={S,0,1}, T={0,1}, and P={S → SS, S → 1, S → 0}
Clearly S ⇒ 0, so 0 ∈ L(G)
Also, S ⇒ 1, so 1 ∈ L(G)
Since S ⇒ SS ⇒ S1 ⇒ 01, we have 01 ∈ L(G)
In general, we can get a sequence of Ss, and replace each with either 0 or 1.
Given this fact, it is easy to see that L(G)={0,1}+, the set of all non-empty binary strings

Example 3
Let G be the grammar with V={S,A,B,0,1}, T={0,1}, and P={S → AB, B → BB, A → AA, A → 0, B → 1}
Clearly S ⇒ AB ⇒ 0B ⇒ 01, so 01 ∈ L(G)
Also, S ⇒ AB ⇒ AAB ⇒ 0AB ⇒ 00B ⇒ 001, so 001 ∈ L(G)
Similarly, we can get 011, 0011, 0001, etc.
In general, we can get a sequence of n 0s followed by m 1s, where n>0, m>0.
Thus L(G)={0^n 1^m | m and n are positive integers}
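
The three example languages can be checked by brute force. The following sketch searches all derivations from S breadth-first and collects the terminal strings up to a length bound. It assumes that no production shortens a string (true for Examples 1 to 3), so any string longer than the bound can be discarded; the function name generate and the (left, right) pair representation are ours.

```python
from collections import deque

def generate(productions, terminals, max_len, start="S"):
    """Terminal strings of length <= max_len derivable from start.

    Assumes no production shortens a string, so longer strings are pruned.
    """
    seen = {start}
    queue = deque([start])
    language = set()
    while queue:
        w = queue.popleft()
        if all(c in terminals for c in w):
            language.add(w)  # only terminals left: w is in L(G)
            continue
        for left, right in productions:
            i = w.find(left)
            while i != -1:
                new = w[:i] + right + w[i + len(left):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = w.find(left, i + 1)
    return language

example1 = [("S", "S0"), ("S", "0")]
example2 = [("S", "SS"), ("S", "1"), ("S", "0")]
example3 = [("S", "AB"), ("B", "BB"), ("A", "AA"), ("A", "0"), ("B", "1")]

print(sorted(generate(example1, "01", 4), key=len))
print(len(generate(example2, "01", 3)))
print(sorted(generate(example3, "01", 4), key=lambda w: (len(w), w)))
```

A bounded search like this does not prove what L(G) is, but it is a quick way to catch a wrong guess.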

Type 0 Grammars
Type 0 grammars have no restrictions on the types of productions that are allowed.
Thus type 0 grammars are just phrase-structure grammars.
This is not too exciting, so we will move on to type 1 grammars.

Type 1 Grammars
In a type 1 grammar, productions are of the form
–aXb → acb, where X ∈ N and a,b,c ∈ V* with c ≠ λ
–(or S → λ, but ignore this for now)
Thus, a production can only be applied if the symbol X is surrounded by a and b.
In other words, the production can only be applied in a certain context.
This is why type 1 grammars are also called context-sensitive grammars.

Type 2 Grammars
Productions are of the form
–X → a, where X ∈ N and a ∈ V*.
Thus, if X is in a string, we can replace X with a no matter what surrounds X.
In other words, the context in which X appears does not matter.
This is why type 2 grammars are called context-free grammars.
Context-free grammars produce context-free languages.

Type 3 Grammars
Productions are of the form
–X → a, where X ∈ N and a ∈ T
–X → aY, where X,Y ∈ N and a ∈ T
–S → λ
Type 3 grammars are called regular grammars.
Regular grammars produce regular languages.
It is easy to see that a type 3 grammar is a type 2 grammar.

Types of Grammars
Type   Productions allowed
0      Almost any kind allowed
1      aXb → acb, where X ∈ N, a,b,c ∈ V*, c ≠ λ
       S → λ
2      X → a, where X ∈ N and a ∈ V*
3      X → a, where X ∈ N and a ∈ T
       X → aY, where X,Y ∈ N and a ∈ T
       S → λ

Types of Grammars
The following summarizes the relationships between the types of grammars
Type 0: phrase-structure
Type 1: context-sensitive
Type 2: context-free
Type 3: regular
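
The production forms for types 2 and 3 are simple enough to test mechanically. The sketch below is one possible check, assuming every symbol is a single character, productions are (left, right) string pairs, and λ is the empty string; the function names are ours, and type 1 is left out because matching the context a and b takes more work.

```python
def is_type2(productions, nonterminals):
    """Every left side is a single nonterminal (X -> a with a in V*)."""
    return all(left in nonterminals for left, _ in productions)

def is_type3(productions, nonterminals, terminals, start="S"):
    """Every production is X -> a, X -> aY, or S -> λ."""
    for left, right in productions:
        if left not in nonterminals:
            return False
        if right == "":
            if left != start:  # only S -> λ is allowed
                return False
        elif len(right) == 1:
            if right not in terminals:
                return False
        elif len(right) == 2:
            if right[0] not in terminals or right[1] not in nonterminals:
                return False
        else:
            return False
    return True

N, T = {"S", "A", "B"}, {"0", "1"}
context_free = [("S", "AB"), ("B", "BB"), ("A", "AA"), ("A", "0"), ("B", "1")]
regular = [("S", "0A"), ("A", "0A"), ("A", "1A"), ("A", "1")]

print(is_type2(context_free, N), is_type3(context_free, N, T))
print(is_type2(regular, N), is_type3(regular, N, T))
```

The second line of output illustrates the containment stated on the Type 3 slide: a grammar that passes the type 3 check also passes the type 2 check.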

Regular Grammar Example
Let G be the grammar with
–V={S,A,0,1},
–T={0,1}, and
–P={S → 0A, A → 0A, A → 1A, A → 1}
We can determine what the language is by constructing a few words.
–S ⇒ 0A ⇒ 01
–S ⇒ 0A ⇒ 00A ⇒ 001
–S ⇒ 0A ⇒ 01A ⇒ 011
–S ⇒ 0A ⇒ 00A ⇒ 000A ⇒ 0001
–S ⇒ 0A ⇒ 00A ⇒ 001A ⇒ 0011
–S ⇒ 0A ⇒ 01A ⇒ 010A ⇒ 0101
–S ⇒ 0A ⇒ 01A ⇒ 011A ⇒ 0111
We can see that in general, L(G) is the set of binary strings beginning with 0 and ending with 1.

Regular Languages and Sets
Theorem: Let A be a subset of V*. Then A is a regular language if and only if A is a regular set.
In other words, a language defined by a regular grammar can also be defined by a regular expression, and vice-versa.
Example: We just saw that the grammar with V={S,A,0,1}, T={0,1}, and P={S → 0A, A → 0A, A → 1A, A → 1} generates the set of binary strings beginning with 0 and ending with 1.
Recall that the regular set defined by 0(0 ∪ 1)*1 is also the set of all binary strings beginning with 0 and ending with 1.
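
The agreement between this grammar and the regular expression can be spot-checked on short strings. The sketch below reads a string left to right while tracking which nonterminals could still be pending, then compares the result with Python's re module (where ∪ is written |). It assumes productions of the forms X → a and X → aY only, ignores S → λ, and uses names of our own choosing.

```python
import re
from itertools import product

PRODUCTIONS = [("S", "0A"), ("A", "0A"), ("A", "1A"), ("A", "1")]

def derives(productions, w, start="S"):
    """True if the regular grammar derives the terminal string w."""
    current = {start}  # nonterminals that may remain after the prefix read so far
    for i, symbol in enumerate(w):
        is_last = i == len(w) - 1
        # A production X -> a can only finish the derivation on the last symbol.
        if is_last and any(l in current and r == symbol for l, r in productions):
            return True
        # A production X -> aY consumes the symbol and leaves Y pending.
        current = {r[1] for l, r in productions
                   if l in current and len(r) == 2 and r[0] == symbol}
    return False

def agrees(max_len):
    """Compare grammar and regular expression on all binary strings up to max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            w = "".join(bits)
            if derives(PRODUCTIONS, w) != bool(re.fullmatch("0(0|1)*1", w)):
                return False
    return True

print(agrees(8))
```

This only tests finitely many strings, so it supports the example rather than proving the theorem.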

Grammar Applications
Context-free grammars are used to define the syntax of most programming languages.
Regular grammars are used in several applications, including the following
–Searching text for patterns
–Lexical analysis (during program compilation)
Efficient algorithms exist to determine if a string is in a context-free or regular language.
This is important for tasks like determining whether or not a program is syntactically valid.

Backus-Naur Form
Backus-Naur form (BNF) is a more compact representation of productions in a type 2 grammar.
All productions with the same left hand side are combined into one production.
The symbol → is replaced with ::=
All nonterminals are enclosed in angle brackets, < >
The right hand sides of the various productions are combined, and separated by |

Backus-Naur Form Example
Consider the set of productions
–S → AB
–B → BB
–A → AA
–A → 0
–B → 1
In BNF, they are represented by
–<S> ::= <A><B>
–<B> ::= <B><B> | 1
–<A> ::= <A><A> | 0
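
The conversion to BNF is mechanical, as the following sketch suggests. It assumes single-character symbols, a single nonterminal on each left side (a type 2 grammar), and productions stored as (left, right) pairs; the function name to_bnf is ours.

```python
def to_bnf(productions, nonterminals):
    """Group productions by left side and render them as BNF rules."""
    grouped = {}
    for left, right in productions:
        # Wrap nonterminals in < >, leave terminals as they are.
        rhs = "".join(f"<{s}>" if s in nonterminals else s for s in right)
        grouped.setdefault(left, []).append(rhs)
    return [f"<{left}> ::= " + " | ".join(alternatives)
            for left, alternatives in grouped.items()]

P = [("S", "AB"), ("B", "BB"), ("A", "AA"), ("A", "0"), ("B", "1")]
for rule in to_bnf(P, {"S", "A", "B"}):
    print(rule)
```

Rules appear in the order in which each left side is first seen, since Python dictionaries preserve insertion order.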

Backus-Naur Form Example 2
The Backus-Naur form for the production of a signed integer is
–<signed integer> ::= <sign><integer>
–<sign> ::= + | -
–<integer> ::= <digit> | <digit><integer>
–<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
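
A BNF description like this maps naturally onto a small recognizer with one function per rule. The sketch below assumes the four rules are named signed integer, sign, integer, and digit, with a signed integer being a sign followed by an integer and an integer being one or more digits; the Python function names are ours.

```python
DIGITS = set("0123456789")

def is_integer(s):
    """<integer> ::= <digit> | <digit><integer>"""
    if len(s) == 0 or s[0] not in DIGITS:
        return False
    # Either a single digit, or a digit followed by another integer.
    return len(s) == 1 or is_integer(s[1:])

def is_signed_integer(s):
    """<signed integer> ::= <sign><integer>, where <sign> ::= + | -"""
    return len(s) >= 2 and s[0] in "+-" and is_integer(s[1:])

for sample in ["+42", "-0", "42", "+", "-4a"]:
    print(sample, is_signed_integer(sample))
```

Under these rules the sign is mandatory, so a bare 42 is rejected.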

Backus-Naur Form Applications
Specifying the syntax for programming languages including
–Java
–LISP
Specifying database languages
–SQL
Specifying markup languages
–XML