Compiler Construction Sohail Aslam Lecture 6 compiler: intro
How to Describe Tokens? Regular languages are the most popular formalism for specifying tokens: simple and useful theory, easy to understand, efficient implementations.
Languages Let Σ be a set of characters. Σ is called the alphabet. A language over Σ is a set of strings of characters drawn from Σ.
Example of Languages Alphabet = English characters; Language = English sentences. Alphabet = ASCII; Language = C++ programs, Java, C#.
Notation Languages are sets of strings (finite sequences of characters). We need some notation for specifying which sets we want.
Notation For lexical analysis we care about regular languages. Regular languages can be described using regular expressions.
Regular Languages Each regular expression is a notation for a regular language (a set of words). If A is a regular expression, we write L(A) to refer to the language denoted by A.
Regular Expression A regular expression (RE) is defined inductively. Base cases: a = an ordinary character from Σ; ε = the empty string.
Regular Expression R|S = either R or S; RS = R followed by S (concatenation); R* = concatenation of R zero or more times (R* = ε | R | RR | RRR | ...)
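A minimal Python sketch of the three operators, treating a language as a finite set of strings. The function names `union`, `concat`, and `star` are illustrative and not from the lecture. Because R* denotes an infinite set, `star` is cut off at an assumed `max_reps` repetitions.

```python
def union(lang_r, lang_s):
    """L(R|S): every string that is in L(R) or in L(S)."""
    return lang_r | lang_s


def concat(lang_r, lang_s):
    """L(RS): a string from L(R) followed by a string from L(S)."""
    return {x + y for x in lang_r for y in lang_s}


def star(lang_r, max_reps):
    """Finite approximation of L(R*): zero up to max_reps copies of R."""
    result = {""}          # zero repetitions gives the empty string
    power = {""}
    for _ in range(max_reps):
        power = concat(power, lang_r)   # one more copy of R
        result |= power
    return result


if __name__ == "__main__":
    a, b = {"a"}, {"b"}
    print(sorted(union(a, b)))
    print(sorted(concat(a, b)))
    print(sorted(star(a, 3)))
```

The empty string `""` plays the role of ε here: it is the identity for concatenation, which is why `star` starts from `{""}`.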
RE Extensions R? = ε | R (zero or one R); R+ = RR* (one or more R); (R) = R (grouping)
RE Extensions [abc] = a|b|c (any of listed); [a-z] = a|b|...|z (range); [^ab] = c|d|... (anything but ‘a’ or ‘b’)
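Python's standard `re` module uses the same notation for these extensions, so they can be tried out directly. The sketch below is only an illustration; the helper name `matches` is an assumption, and `re.fullmatch` is used so that the whole string must belong to the language.

```python
import re


def matches(pattern, text):
    """True if the entire text is in the language of the pattern."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    # R? : zero or one R
    print(matches("ab?", "a"), matches("ab?", "ab"), matches("ab?", "abb"))
    # R+ : one or more R
    print(matches("a+", "aaa"), matches("a+", ""))
    # [abc], [a-z], [^ab] : character classes
    print(matches("[abc]", "b"), matches("[a-z]", "q"), matches("[^ab]", "c"))
```

Note that in Python `[^ab]` matches any single character other than `a` or `b`, not just letters.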
Regular Expression
RE        Strings in L(R)
a         “a”
ab        “ab”
a|b       “a”, “b”
(a|ε)b    “ab”, “b”
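The table can be checked by brute force: enumerate every string over a small alphabet up to some length and keep those the expression matches. This is a sketch under assumptions: the helper `language`, the alphabet `"ab"`, and the length bound are all chosen for illustration. Python's `re` has no symbol for ε, so (a|ε)b is written here as `(a|)b`, with an empty alternative.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=2):
    """Strings over the alphabet, up to max_len characters, that are in L(pattern)."""
    found = []
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            if re.fullmatch(pattern, candidate):
                found.append(candidate)
    return found


if __name__ == "__main__":
    for pattern in ["a", "ab", "a|b", "(a|)b"]:
        print(pattern, language(pattern))
```

Strings are listed shortest first, so `(a|)b` yields “b” before “ab”.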
Example: integers integer: a non-empty string of digits integer = digit digit*
Example: identifiers identifier: a string of letters or digits, starting with a letter. C identifier: [a-zA-Z_][a-zA-Z0-9_]*
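Both token definitions can be tried with Python's `re` module. In this sketch the lecture's `digit` is assumed to mean `[0-9]`, and the names `INTEGER`, `C_IDENTIFIER`, `is_integer`, and `is_identifier` are illustrative.

```python
import re

# integer = digit digit*, with digit assumed to be [0-9]
INTEGER = re.compile(r"[0-9][0-9]*")

# C identifier, exactly as written on the slide
C_IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_integer(text):
    """True if the whole text is a non-empty string of digits."""
    return INTEGER.fullmatch(text) is not None


def is_identifier(text):
    """True if the whole text is a C identifier."""
    return C_IDENTIFIER.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_integer("2024"), is_integer(""), is_integer("12a"))
    print(is_identifier("_tmp1"), is_identifier("2count"))
```

`[0-9]` is used rather than `\d` because `\d` in Python also matches non-ASCII digits, which the slide's definition does not suggest.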
Recap Tokens: strings of characters representing lexical units of programs such as identifiers, numbers, operators.
Recap Regular Expressions: concise description of tokens. A regular expression describes a set of strings.
Recap Language L(R): set of strings represented by a regular expression R. L(R) is the language denoted by regular expression R.
How to Use REs We need a mechanism to determine whether an input string w belongs to L(R), the language denoted by regular expression R.
Acceptor Such a mechanism is called an acceptor. Given an input string w and a language L, the acceptor answers yes if w ∈ L and no if w ∉ L.
Finite Automata (FA) Specification: Regular Expressions Implementation: Finite Automata
Finite Automata A finite automaton consists of: an input alphabet (Σ); a set of states; a start (initial) state; a set of transitions; a set of accepting (final) states.
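The five components map naturally onto a small record type. The following is a sketch, not the lecture's own representation: the class `FiniteAutomaton`, its field names, the state names, and the choice of a dictionary from (state, character) to next state are all assumptions. The example instance is a hypothetical automaton for the earlier definition integer = digit digit*.

```python
from dataclasses import dataclass


@dataclass
class FiniteAutomaton:
    alphabet: frozenset     # input alphabet (Sigma)
    states: frozenset       # set of states
    start: str              # start (initial) state
    transitions: dict       # (state, character) -> next state
    accepting: frozenset    # accepting (final) states

    def is_well_formed(self):
        """Check that every component refers only to known states and characters."""
        return (
            self.start in self.states
            and self.accepting <= self.states
            and all(
                src in self.states and ch in self.alphabet and dst in self.states
                for (src, ch), dst in self.transitions.items()
            )
        )


DIGITS = frozenset("0123456789")

# Hypothetical automaton for integer = digit digit*:
# any digit moves to "in_number", which is the only accepting state.
integer_fa = FiniteAutomaton(
    alphabet=DIGITS,
    states=frozenset({"start", "in_number"}),
    start="start",
    transitions={(s, d): "in_number" for s in ("start", "in_number") for d in DIGITS},
    accepting=frozenset({"in_number"}),
)

if __name__ == "__main__":
    print(integer_fa.is_well_formed())
```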
Finite Automaton State Graphs (figure legend): a state; the start state; an accepting state; a transition labelled with a character a.
Finite Automata A finite automaton accepts a string if we can follow transitions labelled with characters in the string from start state to some accepting state.
FA Example A FA that accepts only “1” (state graph with one transition, labelled 1).
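The acceptance rule and the single-“1” automaton above can be sketched in Python. Since the rule says "if we can follow transitions ... to some accepting state", this sketch tracks the set of all states reachable so far. The function name `accepts`, the state names `q0` and `q1`, and the transition encoding are assumptions made for illustration.

```python
def accepts(start, accepting, transitions, w):
    """True if some path labelled with the characters of w ends in an accepting state.

    transitions maps (state, character) to the set of possible next states.
    """
    current = {start}
    for ch in w:
        # follow every transition labelled ch from every state we might be in
        current = {
            dst
            for state in current
            for dst in transitions.get((state, ch), ())
        }
    return bool(current & accepting)


# FA that accepts only "1": q0 --1--> q1, with q1 accepting
ONLY_ONE = {("q0", "1"): {"q1"}}

if __name__ == "__main__":
    for w in ["1", "", "11", "0"]:
        print(repr(w), accepts("q0", {"q1"}, ONLY_ONE, w))
```

A character with no matching transition leaves the set of current states empty, so the string is rejected.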
FA Example A FA that accepts any number of 1’s followed by a single 0
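One way to code this automaton directly, as a loop over the input with the current state held in a variable. This is a sketch under assumptions: "any number" is read as zero or more 1's (the expression 1*0), and the state names `q0` and `q1` and the function name are hypothetical.

```python
def accepts_ones_then_zero(w):
    """Hand-coded FA for 1*0: q0 loops on '1', and '0' moves q0 to the accepting q1."""
    state = "q0"
    for ch in w:
        if state == "q0" and ch == "1":
            state = "q0"        # stay in the start state while reading 1's
        elif state == "q0" and ch == "0":
            state = "q1"        # the single 0 leads to the accepting state
        else:
            return False        # no transition: reject
    return state == "q1"


if __name__ == "__main__":
    for w in ["0", "10", "1110", "", "1", "100", "01"]:
        print(repr(w), accepts_ones_then_zero(w))
```

The state `q1` has no outgoing transitions, so anything after the 0 causes rejection.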
FA Example A FA that accepts ab*a. Alphabet: {a,b}. (State graph transition labels: b, a, a.) End of Lecture 6.
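The ab*a automaton can be written as a transition table and compared against the regular expression it implements, in the spirit of "specification: regular expressions, implementation: finite automata". This is a sketch: the state names `q0`, `q1`, `q2`, the table layout, and the helper `disagreements` are assumptions, and the comparison covers only strings up to an assumed length bound.

```python
import re
from itertools import product

# q0 --a--> q1, q1 --b--> q1 (loop), q1 --a--> q2 (accepting)
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",
    ("q1", "a"): "q2",
}
START = "q0"
ACCEPTING = {"q2"}


def accepts(w):
    """Run the table-driven FA on w; a missing table entry means reject."""
    state = START
    for ch in w:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING


def disagreements(max_len=5):
    """Strings over {a,b}, up to max_len, where the FA and the RE ab*a differ."""
    differing = []
    for length in range(max_len + 1):
        for chars in product("ab", repeat=length):
            w = "".join(chars)
            if accepts(w) != (re.fullmatch("ab*a", w) is not None):
                differing.append(w)
    return differing


if __name__ == "__main__":
    print(accepts("aa"), accepts("abba"), accepts("abab"))
    print(disagreements())
```

An empty result from `disagreements` is evidence, not proof, that the automaton and the expression denote the same language, since only short strings are checked.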