Introduction to Language Theory Programming Language Translators Prepared by Manuel E. Bermúdez, Ph.D. Associate Professor University of Florida
Introduction to Language Theory Definition: An alphabet (or vocabulary) Σ is a finite set of symbols. Example: Alphabet of Pascal: + - * / < … (operators) begin end if var (keywords) <identifier> (identifiers) <string> (strings) <integer> (integers) ; : , ( ) [ ] (punctuators) Note: All identifiers are represented by one symbol, because Σ must be finite.
Introduction to Language Theory Definition: A sequence t = t₁t₂…tₙ of symbols from an alphabet Σ is a string. Definition: The length of a string t = t₁t₂…tₙ (denoted |t|) is n. If n = 0, the string is ε, the empty string. Definition: Given strings s = s₁s₂…sₙ and t = t₁t₂…tₘ, the concatenation of s and t, denoted st, is the string s₁s₂…sₙt₁t₂…tₘ.
Introduction to Language Theory Note: εu = u = uε, uεv = uv, for any strings u,v (including ε) Definition: Σ* is the set of all strings of symbols from Σ. Note: Σ* is called the reflexive, transitive closure of Σ. Σ* is described by the graph (Σ*, ·), where “·” denotes concatenation, and there is a designated “start” node, ε.
Introduction to Language Theory Example: Σ = {a, b}. Σ* is countably infinite, so we cannot compute all of Σ*; we can compute only finite subsets of Σ*, but we can decide whether a given string is in Σ*.
[Figure: the graph (Σ*, ·) with start node ε; each edge is labeled a or b and leads from a string u to ua or ub. Nodes shown: ε, a, b, aa, ab, ba, bb, aba, abb.]
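The two claims on this slide (only finite portions of Σ* can be listed, yet membership in Σ* is decidable) can be illustrated with a minimal Python sketch. The names sigma_star and in_sigma_star are illustrative and not part of the course material, and the sketch assumes every symbol is a single character.

```python
from itertools import count, islice, product

def sigma_star(alphabet):
    """Yield every string over `alphabet`, shortest first ("" plays the role of ε)."""
    symbols = sorted(alphabet)
    for n in count(0):                       # n = 0, 1, 2, ...; this loop never ends
        for tup in product(symbols, repeat=n):
            yield "".join(tup)

def in_sigma_star(s, alphabet):
    """Membership in Σ* is decidable: every symbol must belong to the alphabet."""
    return all(ch in alphabet for ch in s)

# Only a finite prefix of the enumeration can ever be computed.
print(list(islice(sigma_star({"a", "b"}), 7)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

The generator visits strings in the same order as a breadth-first walk of the graph (Σ*, ·) starting at ε.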
Introduction to Language Theory Example: Σ = Pascal vocabulary. Σ* = all possible alleged Pascal programs, i.e. all possible inputs to a Pascal compiler. We need to specify L ⊆ Σ*, the set of correct Pascal programs. Definition: A language L over an alphabet Σ is a subset of Σ*.
Introduction to Language Theory Example: Σ = {a, b}.
  L₁ = ø is a language
  L₂ = {ε} is a language
  L₃ = {a} is a language
  L₄ = {a, ba, bbab} is a language
  L₅ = {aⁿbⁿ | n ≥ 0} is a language, where aⁿ = aa…a (n times)
  L₆ = {a, aa, aaa, …} is a language
Note: L₅ is an infinite language, but described finitely.
Introduction to Language Theory THIS IS THE MAIN GOAL OF LANGUAGE SPECIFICATION : To describe (infinite) programming languages finitely, and to provide corresponding finite inclusion-test algorithms.
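As a small illustration of a finite inclusion test for an infinite language, here is a hedged Python sketch for {aⁿbⁿ | n ≥ 0}. The function name in_anbn is an assumption made for this example.

```python
def in_anbn(s):
    """Decide whether s is in {a^n b^n | n >= 0} in time linear in len(s)."""
    n, remainder = divmod(len(s), 2)
    # The length must be even, and s must be n a's followed by n b's.
    return remainder == 0 and s == "a" * n + "b" * n

print(in_anbn("aabb"), in_anbn("abab"))  # True False
```

The description (one set-builder expression) and the test (a few lines of code) are both finite, even though the language itself is not.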
Language Constructors Definition: The catenation (or product) of two languages L₁ and L₂, denoted L₁L₂, is the set {uv | u ∈ L₁, v ∈ L₂}. Example: L₁ = {ε, a, bb}, L₂ = {ac, c}. L₁L₂ = {ac, c, aac, ac, bbac, bbc} = {ac, c, aac, bbac, bbc}
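The catenation of two finite languages can be sketched directly with Python sets; the duplicate "ac" in the example disappears automatically because sets hold each string once. The function name catenate is illustrative.

```python
def catenate(lang1, lang2):
    """Return {uv | u in lang1, v in lang2} for finite languages given as sets."""
    return {u + v for u in lang1 for v in lang2}

L1 = {"", "a", "bb"}      # "" stands for ε
L2 = {"ac", "c"}
print(sorted(catenate(L1, L2)))
# ['aac', 'ac', 'bbac', 'bbc', 'c']
```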
Language Constructors Definition: Lⁿ = LL…L (n times), and L⁰ = {ε}. Example: L = {a, bb}. L³ = {aaa, aabb, abba, abbbb, bbaa, bbabb, bbbba, bbbbbb}
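The power of a language follows from repeated catenation, starting from {ε}. A minimal sketch (the name power is an assumption for illustration) that reproduces the example above:

```python
def power(lang, n):
    """Return L^n for a finite language L given as a set; L^0 is {""}, i.e. {ε}."""
    result = {""}
    for _ in range(n):
        result = {u + v for u in result for v in lang}
    return result

print(sorted(power({"a", "bb"}, 3), key=lambda w: (len(w), w)))
# ['aaa', 'aabb', 'abba', 'bbaa', 'abbbb', 'bbabb', 'bbbba', 'bbbbbb']
```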
Language Constructors Definition: The union of two languages L₁ and L₂ is the set L₁ ∪ L₂ = {u | u ∈ L₁} ∪ {v | v ∈ L₂}. Definition: The Kleene star (L*) of a language L is the set L* = ∪ Lⁿ, n ≥ 0. Example: L = {a, bb}. L* = {any string composed of a’s and bb’s}. Definition: The transitive closure (L+) of a language L is the set L+ = ∪ Lⁿ, n ≥ 1.
Language Constructors Note: L* = L+ ∪ {ε} always holds, but L+ = L* − {ε} does not hold in general. For example, consider L = {ε}. Then {ε} = L+ ≠ L* − {ε} = {ε} − {ε} = ø.
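Since L* and L+ are usually infinite, a program can only compute the part of each closure up to some string length. The sketch below does that with a fixed-point loop and then checks the counterexample L = {ε}. The names plus and star and the max_len cutoff are assumptions of this example.

```python
def plus(lang, max_len):
    """Strings of length <= max_len in L+ = L^1 ∪ L^2 ∪ ... (lang is a finite set)."""
    base = {w for w in lang if len(w) <= max_len}
    result = set(base)
    frontier = set(base)
    while frontier:                                   # stop when nothing new appears
        step = {u + v for u in frontier for v in base if len(u + v) <= max_len}
        frontier = step - result
        result |= frontier
    return result

def star(lang, max_len):
    """Strings of length <= max_len in L* = L+ ∪ {ε}."""
    return plus(lang, max_len) | {""}

L = {""}                                              # L = {ε}
print(plus(L, 3), star(L, 3) - {""})                  # {''} set()
```

For L = {a, bb} and max_len = 3, star returns the seven strings ε, a, aa, bb, aaa, abb, bba.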
Grammars Goal: To provide a means for describing languages finitely. Method: Provide a subgraph (Σ*, →*) of (Σ*, ·), and a start node S, such that the nodes reachable from S are the strings in the language.
Grammars Example: Σ = {a, b}, L = {aⁿbⁿ | n ≥ 0}.
[Figure: a graph over strings in Σ* with edges labeled a and b; node labels in the source include ε, aa, aaa, aaba, aabb, ba, bb, bba, bbaa, bbab, bbb.]
Grammars “=>” (derives) is a relation defined by a finite set of rewrite rules known as productions. Definition: Given a vocabulary V, a production is a pair (u, v) ∈ V* × V*, denoted u → v. u is called the left-part; v is called the right-part.
Grammars Example: Pseudo-English.
V = {Sentence, NP, VP, Adj, N, V, boy, girl, the, tall, jealous, hit, bit}
  Sentence → NP VP (one production)
  NP → N
  NP → Adj NP
  N → boy
  N → girl
  Adj → the
  Adj → tall
  Adj → jealous
  VP → V NP
  V → hit
  V → bit
Note: English is much too complicated to be described this way.
Grammars Definition: Given a finite set of productions P ⊆ V* × V*, the relation => is defined such that, for all α, β, u, v ∈ V*, αuβ => αvβ iff u → v ∈ P.
Example:
  Sentence → NP VP
  NP → N
  NP → Adj NP
  N → boy
  N → girl
  Adj → the
  Adj → tall
  Adj → jealous
  VP → V NP
  V → hit
  V → bit
Grammars
  Sentence => NP VP
           => Adj NP VP
           => the NP VP
           => the Adj NP VP
           => the jealous NP VP
           => the jealous N VP
           => the jealous girl VP
           => the jealous girl V NP
           => the jealous girl hit NP
           => the jealous girl hit Adj NP
           => the jealous girl hit the NP
           => the jealous girl hit the N
           => the jealous girl hit the boy
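The "derives" relation can be checked mechanically: one step replaces a single occurrence of some left part u by its right part v, leaving the surrounding context untouched. The sketch below encodes the Pseudo-English productions and verifies the derivation of "the jealous girl hit the boy" step by step. Representing sentential forms as tuples of symbols, and the names step and is_derivation, are choices made for this example only.

```python
PRODUCTIONS = [
    (("Sentence",), ("NP", "VP")),
    (("NP",), ("N",)), (("NP",), ("Adj", "NP")),
    (("N",), ("boy",)), (("N",), ("girl",)),
    (("Adj",), ("the",)), (("Adj",), ("tall",)), (("Adj",), ("jealous",)),
    (("VP",), ("V", "NP")),
    (("V",), ("hit",)), (("V",), ("bit",)),
]

def step(form):
    """Return every form w such that form => w (exactly one production applied)."""
    results = set()
    for u, v in PRODUCTIONS:
        for i in range(len(form) - len(u) + 1):
            if form[i:i + len(u)] == u:
                results.add(form[:i] + v + form[i + len(u):])
    return results

def is_derivation(forms):
    """Check that each form derives the next one in a single step."""
    return all(b in step(a) for a, b in zip(forms, forms[1:]))

DERIVATION = [tuple(line.split()) for line in [
    "Sentence", "NP VP", "Adj NP VP", "the NP VP", "the Adj NP VP",
    "the jealous NP VP", "the jealous N VP", "the jealous girl VP",
    "the jealous girl V NP", "the jealous girl hit NP",
    "the jealous girl hit Adj NP", "the jealous girl hit the NP",
    "the jealous girl hit the N", "the jealous girl hit the boy",
]]
print(is_derivation(DERIVATION))  # True
```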
Grammars Definition: A grammar is a 4-tuple G = (Φ, Σ, P, S) where Φ is a finite set of nonterminals, Σ is a finite set of terminals, V = Φ ∪ Σ is the grammar’s vocabulary, S ∈ Φ is called the start or goal symbol, and P ⊆ V* × V* is a finite set of productions. Example: Grammar for {aⁿbⁿ | n ≥ 0}. G = (Φ, Σ, P, S), where Φ = {S}, Σ = {a, b}, and P = {S → aSb, S → ε}
Grammars Derivations:
  S => aSb => aaSbb => aaaSbbb => aaaaSbbbb => …
Applying S → ε to each of these sentential forms:
  S => ε
  aSb => ab
  aaSbb => aabb
  aaaSbbb => aaabbb
  aaaaSbbbb => aaaabbbb
Note: Normally, grammars are given by simply listing the productions.
Grammar Conventions
TWS convention:
  Upper case letter (identifier): nonterminal
  Lower case letter (string): terminal
  Lower case Greek letter: string in V*
  The left part of the first production is assumed to be the start symbol, e.g.
    S → aSb
    S → ε
  The left part is omitted if it is the same as for the preceding production, e.g.
    S → aSb
      → ε
Grammars Example: Grammar for identifiers.
  Identifier → Letter
             → Identifier Digit
  Letter → ‘a’
         → ‘A’
         → ‘b’
         → ‘B’
         .
         → ‘z’
         → ‘Z’
  Digit → ‘0’
        → ‘1’
        → ‘9’
Grammars Definition: The language generated by a grammar G is the set L(G) = {α ∈ Σ* | S =>* α}. Definition: A sentential form generated by a grammar G is any string α such that S =>* α. Definition: A sentence generated by a grammar G is any sentential form α such that α ∈ Σ*.
Grammars Example:
  Sentential forms: S => aSb => aaSbb => aaaSbbb => aaaaSbbbb => …
  Sentences (each obtained from the sentential form above by S → ε): ε, ab, aabb, aaabbb, aaaabbbb
Lemma: L(G) = {α | α is a sentence}. Proof: Trivial.
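The distinction between sentential forms and sentences can be made concrete by exploring the grammar S → aSb, S → ε for a bounded number of derivation steps and then keeping only the forms that lie in Σ*. This is a rough sketch; the step bound max_steps and the function names are assumptions of the example, and sentential forms are plain Python strings.

```python
PRODUCTIONS = [("S", "aSb"), ("S", "")]      # "" stands for ε
TERMINALS = {"a", "b"}

def step(form):
    """All forms reachable from `form` by applying one production once."""
    out = set()
    for u, v in PRODUCTIONS:
        i = form.find(u)
        while i != -1:
            out.add(form[:i] + v + form[i + len(u):])
            i = form.find(u, i + 1)
    return out

def sentential_forms(start="S", max_steps=4):
    """Forms derivable from `start` in at most max_steps steps."""
    seen = {start}
    frontier = {start}
    for _ in range(max_steps):
        frontier = {w for f in frontier for w in step(f)} - seen
        seen |= frontier
    return seen

forms = sentential_forms()
sentences = {w for w in forms if set(w) <= TERMINALS}
print(sorted(sentences, key=len))  # ['', 'ab', 'aabb', 'aaabbb']
```

Raising max_steps yields more of L(G); no finite bound yields all of it.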
Grammars Example:
  A → aABC
    → aBC
  aB → ab
  bB → bb
  bC → bc
  CB → BC
  cC → cc
Grammars Derivations:
  A => aABC => aaABCBC => …
Applying A → aBC to each of these sentential forms and continuing:
  A => aBC => abC => abc
  aABC => aaBCBC => aabCBC => aabBCC => aabbCC => aabbcC => aabbcc
  aaABCBC => aaaBCBCBC =>* aaaBBCBCC => aaaBBBCCC => aaabBBCCC =>* aaabbbCCC (2 steps) => aaabbbcCC =>* aaabbbccc
L(G) = {aⁿbⁿcⁿ | n ≥ 1}
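Because no production in this grammar shortens a sentential form, any form longer than the target string can be discarded, which turns a breadth-first search over derivations into a terminating membership test. The following is a sketch under that observation, not an efficient parser; the name derives is an assumption of this example.

```python
from collections import deque

PRODUCTIONS = [
    ("A", "aABC"), ("A", "aBC"),
    ("aB", "ab"), ("bB", "bb"), ("bC", "bc"),
    ("CB", "BC"), ("cC", "cc"),
]

def derives(target, start="A"):
    """Return True iff start =>* target, searching breadth-first."""
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for u, v in PRODUCTIONS:
            i = form.find(u)
            while i != -1:
                new = form[:i] + v + form[i + len(u):]
                # Right parts are never shorter than left parts, so forms
                # longer than the target can never shrink back to it.
                if len(new) <= len(target) and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(u, i + 1)
    return False

print(derives("aabbcc"), derives("aabbc"))  # True False
```

The search terminates because only finitely many strings over the vocabulary have length at most len(target).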
The Chomsky Hierarchy A hierarchy of grammars, the languages they generate, and the machines that accept those languages.
The Chomsky Hierarchy
Type  Language Name               Grammar Name                    Restrictions on Grammar                       Accepting Machine
0     Recursively Enumerable      Unrestricted re-writing system  None                                          Turing Machine
1     Context-Sensitive Language  Context-Sensitive Grammar       For all α → β, |α| ≤ |β|                      Linear Bounded Automaton
2     Context-Free Language       Context-Free Grammar            For all α → β, α ∈ Φ                          Push-Down Automaton (parser)
3     Regular                     Regular                         For all α → β, α ∈ Φ, β ∈ Σ ∪ ΣΦ ∪ {ε}        Finite-State Automaton
24/04/2017 Language Hierarchy
  0: Recursively Enumerable Languages
  1: Context-Sensitive Languages, e.g. {aⁿbⁿcⁿ | n > 0}
  2: Context-free Languages, e.g. {aⁿbⁿ | n > 0}
  3: Regular Languages, e.g. {aⁿ | n > 0}
  English?
We will deal with type 2 (syntax) and type 3 (lexicon) languages.