Lecture 4: Lexical Analysis & Chomsky Hierarchy
(Revised, based on Tucker’s slides)
12/10/2018 Lecture 4: Lexical Analysis & Chomsky Grammar

Revisit Expression Grammar
Let us consider the following grammar for Assignment:
  Assignment -> ID '=' Exp
  Exp -> Exp + Term | Term
  Term -> Term * Integer | Integer | ID
  Integer -> 0 | 1 | … | 9 | 0 Integer | 1 Integer | … | 9 Integer
  ID -> a | b | … | z | a ID | b ID | … | z ID
Build a parse tree for: abc = x + y

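The parse-tree exercise can be checked mechanically. The following is a minimal sketch, not the lecture's own method: it assumes the ID and Integer rules are handled at the token level (each becomes a leaf carrying its lexeme) and unrolls the left-recursive Exp and Term rules into loops that still build left-leaning trees. The function names and the tuple layout of the tree are illustrative choices.

```python
import re

# Token-level sketch: ID and Integer are leaves carrying their lexeme,
# instead of expanding the character-level rules (ID -> a ID, ...).
TOKEN = re.compile(r"\s*(?:(?P<ID>[a-z]+)|(?P<Integer>[0-9]+)|(?P<SYM>[=+*]))")

def tokenize(text):
    """Split text into (kind, lexeme) pairs; raise on an unexpected character."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def parse_assignment(tokens):
    """Assignment -> ID '=' Exp"""
    if len(tokens) < 3 or tokens[0][0] != "ID" or tokens[1] != ("SYM", "="):
        raise SyntaxError("expected ID '=' Exp")
    exp, pos = parse_exp(tokens, 2)
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return ("Assignment", tokens[0], "=", exp)

def parse_exp(tokens, pos):
    """Exp -> Exp + Term | Term (left recursion unrolled into a loop)."""
    term, pos = parse_term(tokens, pos)
    tree = ("Exp", term)
    while pos < len(tokens) and tokens[pos] == ("SYM", "+"):
        term, pos = parse_term(tokens, pos + 1)
        tree = ("Exp", tree, "+", term)      # the older tree becomes the left child
    return tree, pos

def parse_term(tokens, pos):
    """Term -> Term * Integer | Integer | ID"""
    if pos >= len(tokens) or tokens[pos][0] not in ("ID", "Integer"):
        raise SyntaxError("expected Integer or ID")
    tree = ("Term", tokens[pos])
    pos += 1
    while pos < len(tokens) and tokens[pos] == ("SYM", "*"):
        if pos + 1 >= len(tokens) or tokens[pos + 1][0] != "Integer":
            raise SyntaxError("expected Integer after '*'")
        tree = ("Term", tree, "*", tokens[pos + 1])
        pos += 2
    return tree, pos

tree = parse_assignment(tokenize("abc = x + y"))
```

Here `tree` nests an inner `Exp` node (for `x`) as the left child of the outer `Exp` node, which mirrors the left recursion in `Exp -> Exp + Term`.
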
Levels of Syntax
  Lexical syntax = all the basic symbols of the language (names, values, operators, etc.)
  Concrete syntax = rules for writing expressions, statements and programs.
  Abstract syntax = internal representation of the program, favoring content over form.

Expression Grammar
For the following grammar:
  Concrete syntax:
    Assignment -> ID '=' Exp
    Exp -> Exp + Term | Term
    Term -> Term * Integer | Integer
  Lexical syntax:
    Integer -> 0 | 1 | … | 9 | 0 Integer | 1 Integer | … | 9 Integer
    ID -> a | b | … | z | a ID | b ID | … | z ID

Regular Grammar
  Simplest; least powerful
  Concentrates on the lexical syntax
  Right regular grammar (ω ∈ T*, B ∈ N, a ∈ T):
    A → ωB
    A → ω
    A → ε   (ε is the empty string)
    A → a

Regular Grammar
  Left regular grammar (ω ∈ T*, B ∈ N, a ∈ T):
    A → Bω
    A → ε
    A → a
  A regular grammar is either a left regular grammar or a right regular grammar.
  Consider the following grammar:
    S → aA
    A → Sb
    S → ε

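The grammar S → aA, A → Sb, S → ε mixes a right-linear rule with a left-linear rule, so it is neither a left regular nor a right regular grammar. Substituting A away gives S → aSb | ε, which generates { aⁿ bⁿ | n ≥ 0 }. The sketch below (the function name `derive` is illustrative) replays the derivation step by step:

```python
def derive(n):
    """Apply S -> aA and A -> Sb n times, then S -> epsilon.

    Returns the list of sentential forms, starting from "S".
    """
    form = "S"
    steps = [form]
    for _ in range(n):
        form = form.replace("S", "aA", 1)   # S -> aA  (right-linear rule)
        steps.append(form)
        form = form.replace("A", "Sb", 1)   # A -> Sb  (left-linear rule)
        steps.append(form)
    form = form.replace("S", "", 1)         # S -> epsilon
    steps.append(form)
    return steps

steps = derive(2)   # S, aA, aSb, aaAb, aaSbb, aabb
```

Every derived word has as many a's as b's, which is why allowing both rule shapes in one grammar goes beyond the regular languages.
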
Regular Grammars
  Equivalent to:
    Regular expression
    Finite-state automaton
  Used in construction of tokenizers
  Less powerful than context-free grammars
  Not regular languages: { aⁿ bⁿ | n ≥ 1 } and { aᵐ bⁿ | 1 ≤ m ≤ n }
  i.e., cannot balance: ( ), { }, begin end

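To illustrate the equivalence, here is a small sketch that describes the same language three ways: the right regular grammar ID -> a | … | z | a ID | … | z ID from the expression grammar, the regular expression `[a-z]+`, and a two-state finite automaton. The function names and the state numbering are assumptions made for this example.

```python
import re
import string

ID_REGEX = re.compile(r"[a-z]+")

def dfa_accepts(text):
    """Two-state DFA: state 0 = start, state 1 = accepting (one or more letters seen)."""
    state = 0
    for ch in text:
        if ch in string.ascii_lowercase:
            state = 1
        else:
            return False          # no transition defined for this character
    return state == 1

def grammar_accepts(text):
    """Right regular grammar ID -> c | c ID, for each lowercase letter c."""
    if not text or text[0] not in string.ascii_lowercase:
        return False
    return len(text) == 1 or grammar_accepts(text[1:])

samples = ["abc", "x", "", "ab1", "Abc", "zzzz"]
agreement = [
    (ID_REGEX.fullmatch(s) is not None) == dfa_accepts(s) == grammar_accepts(s)
    for s in samples
]
```

The automaton needs only a fixed number of states, which is also why it cannot recognize { aⁿ bⁿ }: that would require remembering an unbounded count.
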
Compilers & Interpreters
  [Figure: compilation pipeline]
  Source Program → Lexical Analyzer → Tokens → Syntactic Analyzer → Abstract Syntax → Semantic Analyzer → Intermediate Code (IC) → Code Optimizer → Intermediate Code (IC) → Code Generator → Machine Code
  Syntactic Analyzer: find syntax errors
  Semantic Analyzer: find semantic errors

Lexical Analyzer
  Purpose: transform program representation
  Input: printable ASCII characters
  Output: tokens (terminals T)
  Discard: whitespace, comments
  Defn: A token is a logically cohesive sequence of characters representing a single symbol.
  A token corresponds to a terminal symbol in the CFG.

Example Tokens
  Identifiers
  Literals: 123, 5.67, 'x', true
  Keywords: bool char ...
  Operators: + - * / ...
  Punctuation: ; , ( ) { }

Other Sequences
  Whitespace: space tab
  Comments: // any-char* end-of-line
  End-of-line
  End-of-file
All of the above languages can be defined by a context-free grammar (CFG).

Regular Expressions
  RegExpr     Meaning
  x           a character x
  \x          an escaped character, e.g., \n
  { name }    a reference to a name
  M | N       M or N
  M N         M followed by N
  M*          zero or more occurrences of M

  RegExpr     Meaning
  M+          one or more occurrences of M
  M?          zero or one occurrence of M
  [aeiou]     the set of vowels
  [0-9]       the set of digits
  .           any single character

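Most of this notation carries over to Python's `re` module, so the operators can be tried directly. The pattern/input pairs below are illustrative choices, and `re.fullmatch` is used so that the whole input must match:

```python
import re

# (pattern, input, expected full match?)
examples = [
    (r"a|b",     "b",    True),    # M | N
    (r"ab",      "ab",   True),    # M N
    (r"a*",      "",     True),    # zero or more occurrences
    (r"a+",      "",     False),   # one or more occurrences
    (r"ab?",     "a",    True),    # zero or one occurrence of b
    (r"[aeiou]", "e",    True),    # the set of vowels
    (r"[0-9]+",  "2018", True),    # one or more digits
    (r".",       "x",    True),    # any single character
    (r"\.",      "x",    False),   # escaped: matches a literal dot only
]

results = [(p, s, re.fullmatch(p, s) is not None) for p, s, _ in examples]
```

Two differences are worth noting: `{ name }` references are a Lex-style convention with no direct counterpart in Python's `re`, and `.` in Python does not match a newline unless `re.DOTALL` is given.
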
Clite Lexical Syntax
  Category      Definition
  anyChar       [ -~]
  Letter        [a-zA-Z]
  Digit         [0-9]
  Whitespace    [ \t]
  Eol           \n
  Eof           \004

  Category      Definition
  Identifier    {Letter}({Letter} | {Digit})*
  integerLit    {Digit}+
  floatLit      {Digit}+\.{Digit}+
  charLit       '{anyChar}'

  Category      Definition
  Operator      = | || | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ]
  Separator     : | . | { | } | ( | )
  Comment       // ({anyChar} | {Whitespace})* {Eol}

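The Clite categories translate almost directly into a hand-written tokenizer. The following is a minimal sketch under several assumptions, none of which come from the slides: the separator set is taken from the punctuation listed among the example tokens (`; , ( ) { }`), only the two keywords named there (`bool`, `char`) are recognized, newlines are skipped together with whitespace, and a comment may also end at end of input.

```python
import re

LETTER, DIGIT, ANYCHAR = "[a-zA-Z]", "[0-9]", "[ -~]"

# Order matters: longer or more specific alternatives come first.
TOKEN_SPEC = [
    ("Comment",    r"//[ -~\t]*(?:\n|$)"),        # ({anyChar}|{Whitespace})* as one class
    ("floatLit",   rf"{DIGIT}+\.{DIGIT}+"),
    ("integerLit", rf"{DIGIT}+"),
    ("charLit",    rf"'{ANYCHAR}'"),
    ("Identifier", rf"{LETTER}(?:{LETTER}|{DIGIT})*"),
    ("Operator",   r"\|\||&&|==|!=|<=|>=|[=<>+\-*/!\[\]]"),
    ("Separator",  r"[;,{}()]"),                  # assumption: punctuation from the example tokens
    ("Whitespace", r"[ \t\n]+"),                  # assumption: Eol is skipped like whitespace
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

KEYWORDS = {"bool", "char"}   # only the keywords named in the slides; the full list is elided there
SKIP = {"Whitespace", "Comment"}

def tokenize(source):
    """Return a list of (category, lexeme) pairs, discarding whitespace and comments."""
    tokens, pos = [], 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"illegal character {source[pos]!r} at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind in SKIP:
            continue
        if kind == "Identifier" and lexeme in KEYWORDS:
            kind = "Keyword"
        tokens.append((kind, lexeme))
    return tokens

tokens = tokenize("bool ok = x1 >= 5.67; // done\n")
```

Listing `floatLit` before `integerLit` and the two-character operators before the one-character ones is a simple way to prefer the longer lexeme, which generated lexers apply as the longest-match rule.
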
Generators
  Input: usually regular expressions
  Output: table (slow), code
  C/C++: Lex, Flex
  Java: JLex

Chomsky Hierarchy
  Regular grammar -- least powerful
  Context-free grammar (BNF)
  Context-sensitive grammar
  Unrestricted grammar

Context-free Grammars
  BNF is a stylized form of CFG
  Equivalent to a pushdown automaton
  For a wide class of unambiguous CFGs, there are table-driven, linear-time parsers

Context-Sensitive Grammars
  Production: α → β
  |α| ≤ |β|
  α, β ∈ (N ∪ T)*
  i.e., the left-hand side can be composed of strings of terminals and nonterminals

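As an illustration of productions whose left-hand side is longer than one symbol, consider a common textbook grammar for { aⁿ bⁿ cⁿ | n ≥ 1 }. It is not taken from these slides: S → aSBc | abc, cB → Bc, bB → bb. Every rule satisfies |α| ≤ |β|, so sentential forms never shrink, and a breadth-first search can safely discard forms longer than the length bound. The names `RULES` and `generate` are illustrative.

```python
from collections import deque

# Non-contracting rules: each right-hand side is at least as long as its left-hand side.
RULES = [("S", "aSBc"), ("S", "abc"), ("cB", "Bc"), ("bB", "bb")]

def generate(max_len):
    """Return all terminal strings of length <= max_len derivable from S."""
    seen, queue, words = {"S"}, deque(["S"]), set()
    while queue:
        form = queue.popleft()
        if form.islower():                      # only terminals a, b, c remain
            words.add(form)
            continue
        for lhs, rhs in RULES:
            start = form.find(lhs)
            while start != -1:                  # rewrite every occurrence of lhs
                new = form[:start] + rhs + form[start + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                start = form.find(lhs, start + 1)
    return sorted(words, key=len)

words = generate(9)
```

The rules cB → Bc and bB → bb rewrite a nonterminal only in a particular neighborhood, which is what a context-free production cannot express.
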
Regular Expression Exercise
Describe the languages denoted by the following REs:
  0(0|1)*0
  ((ε|0)1*)*
  (0|1)*0(0|1)(0|1)

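One way to explore such an exercise is to enumerate every short binary string and compare what each expression accepts with a proposed description. In the sketch below, the empty alternative in the second expression is written as `0?`, and the three descriptions are this sketch's proposed readings, not answers given in the slides. The bound `N` is arbitrary.

```python
import re
from itertools import product

def all_strings(max_len):
    """Every string over {0, 1} of length 0..max_len."""
    return {"".join(b) for n in range(max_len + 1) for b in product("01", repeat=n)}

def language(pattern, max_len):
    """The strings of length <= max_len that fully match pattern."""
    regex = re.compile(pattern)
    return {s for s in all_strings(max_len) if regex.fullmatch(s)}

N = 8
universe = all_strings(N)

first = language(r"0(0|1)*0", N)
second = language(r"(0?1*)*", N)              # the empty alternative written as 0?
third = language(r"(0|1)*0(0|1)(0|1)", N)

# Proposed descriptions, checked against the enumerated languages.
starts_and_ends_with_0 = {s for s in universe if len(s) >= 2 and s[0] == "0" and s[-1] == "0"}
third_from_last_is_0 = {s for s in universe if len(s) >= 3 and s[-3] == "0"}

checks = {
    "first: starts and ends with 0, length >= 2": first == starts_and_ends_with_0,
    "second: every binary string": second == universe,
    "third: third symbol from the end is 0": third == third_from_last_is_0,
}
```

Agreement up to length `N` is evidence rather than proof, but a mismatch immediately yields a counterexample string to study.
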
Regular Expression Exercise
Consider a small language using only the letters “z” and “o”, and the slash char “/”. A comment in this language starts with “/o” and ends after the very next “o/”. Comments do not nest. The regular notations that can be used include A|B, AB, A*, and A+.
  Valid: /o/zzzz/oo/, /ozz/oz////o/
  Invalid: /o/, /ozzzooo/zzzo/

Regular Expression Exercise
Consider a small language using only the letters “z”, “o”, and the slash char “/”. A comment in this language starts with “/o” and ends after the very next “o/”. Comments do not nest. The regular notations that can be used include A|B, AB, A*, and A+.
  /o(o*z|/)*o+/
  /o(o|z|/)*o/
  /o/*(o*z/*)*o+/
  /o(/|oz|oo)*o+/
  /o(/*o*z)*/*o+/

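The candidates can be compared against a direct reading of the rule. The sketch below assumes that a string is a comment exactly when it starts with "/o" and the first "o/" after that start is the end of the string, so the "o" of the opening "/o" cannot double as the "o" of the closing "o/". The helper names are illustrative, and the check is exhaustive only up to `max_len`.

```python
import re
from itertools import product

CANDIDATES = [
    r"/o(o*z|/)*o+/",
    r"/o(o|z|/)*o/",
    r"/o/*(o*z/*)*o+/",
    r"/o(/|oz|oo)*o+/",
    r"/o(/*o*z)*/*o+/",
]

def is_comment(text):
    """Reference check: starts with "/o", and the first "o/" after it ends the string."""
    if not text.startswith("/o"):
        return False
    rest = text[2:]
    return len(rest) >= 2 and rest.find("o/") == len(rest) - 2

def agrees(pattern, max_len=8):
    """True if pattern and is_comment agree on every string over {z, o, /} up to max_len."""
    regex = re.compile(pattern)
    for n in range(max_len + 1):
        for chars in product("zo/", repeat=n):
            text = "".join(chars)
            if (regex.fullmatch(text) is not None) != is_comment(text):
                return False
    return True

verdicts = {pattern: agrees(pattern) for pattern in CANDIDATES}
```

`verdicts` maps each candidate to whether it matches the reference on all tested strings. More than one candidate can agree, since different expressions may denote the same language.
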
Regular Expression Exercise
All strings of 0’s and 1’s that satisfy the following condition:
  all binary strings except the empty string
  contains at least three 1s
  does not contain the substring 110
  length is at least 1 and at most 3
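A candidate expression for each condition can be validated by brute force against a plain Python predicate. The expressions below are one possible set of answers proposed for this sketch, not solutions given in the slides. The `verify` helper and the length bound are assumptions.

```python
import re
from itertools import product

# condition name -> (candidate regular expression, predicate stating the condition)
CONDITIONS = {
    "non-empty":          (r"(0|1)+",                      lambda s: len(s) >= 1),
    "at least three 1s":  (r"(0|1)*1(0|1)*1(0|1)*1(0|1)*", lambda s: s.count("1") >= 3),
    "no substring 110":   (r"(0|10)*1*",                   lambda s: "110" not in s),
    "length 1 to 3":      (r"(0|1)(0|1)?(0|1)?",           lambda s: 1 <= len(s) <= 3),
}

def verify(pattern, predicate, max_len=10):
    """True if pattern and predicate agree on every binary string up to max_len."""
    regex = re.compile(pattern)
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if (regex.fullmatch(s) is not None) != predicate(s):
                return False
    return True

report = {name: verify(pattern, pred) for name, (pattern, pred) in CONDITIONS.items()}
```

The expression for "no substring 110" relies on the observation that once two consecutive 1s appear, only 1s may follow. Before that point, every 1 is immediately followed by a 0.
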