Compiler Construction 2011, Lecture 2
http://lara.epfl.ch  http://tiny.cc/compilers
Drawing Hands, M.C. Escher, 1948

Staff:
  Viktor Kuncak – Lectures
  Etienne Kneuss and Philippe Suter – Labs
  Eva Darulova and Giuliano Losa – Exercises
  Regis Blanc – Assistant
  Yvette Gallay – Secretary
Reminder

Register on:
  - IS-Academia
  - Moodle – so you can get our emails
  - our repository (reachable from the course page)

Please form your lab groups.
Compiler (scalac, gcc)

source code (e.g. Scala, Java, C) – easy to write:

  i = 0
  while (i < 10) {
    a[i] = 7*i + 3
    i = i + 1
  }

characters --lexer--> words --parser--> trees --type check, optimizer (data-flow graphs), code gen--> machine code

machine code (e.g. x86, ARM, JVM) – efficient to execute:

  mov R1, #0
  mov R2, #40
  mov R3, #3
  jmp +12
  mov (a+R1), R3
  add R1, R1, #4
  add R3, R3, #7
  cmp R1, R2
  blt -16
Compiler (scalac, gcc)

source code:

  id3 = 0
  while (id3 < 10) {
    println("", id3);
    id3 = id3 + 1
  }

characters --lexer--> words (tokens) --parser--> trees

The lexer is specified using regular expressions. It groups characters into tokens and classifies them into token classes.
Today: Lexical Analysis. Summary:
  - a lexical analyzer maps a stream of characters into a stream of tokens
  - while doing that, it typically needs only bounded memory
  - we can specify tokens for a lexical analyzer using regular expressions
  - it is not difficult to construct a lexical analyzer manually; we give an example
  - for manually constructed analyzers, we often use the first character to decide on the token class; a notion: first(L) = { a | aw in L }
  - we follow the maximal munch rule: the lexical analyzer should eagerly accept the longest token that it can recognize from the current point
  - it is possible to automate the construction of lexical analyzers; the starting point is the conversion of regular expressions to automata
  - tools that automate this construction are part of compiler-compilers, such as JavaCC, described in the Tiger book
  - automated construction of lexical analyzers from regular expressions is an example of compilation for a domain-specific language
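The maximal munch rule above can be sketched as follows. This is an illustration only: the token list and the string-based interface are assumptions, not the course's lexer code.

```scala
// Maximal munch sketch: at each step, take the LONGEST prefix of the
// remaining input that matches some token, not merely the first match.
def munch(input: String): List[String] = {
  // hypothetical token literals; real lexers also handle identifiers, etc.
  val tokens = List("<=", "<", "==", "=", "while", "if")
  var rest = input
  var out = List.empty[String]
  while (rest.nonEmpty) {
    // among tokens that are a prefix of the input, pick the longest one
    tokens.filter(rest.startsWith).sortBy(-_.length).headOption match {
      case Some(t) => out = t :: out; rest = rest.drop(t.length)
      case None    => rest = rest.drop(1) // skip an unrecognized character
    }
  }
  out.reverse
}
```

For example, on the input "<==" the rule yields the tokens "<=" and "=", not three one-character tokens.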
While Language – Idea

A small language used to illustrate key concepts:
  - also used in your first lab – an interpreter
  - later labs will use a more complex language; we continue to use While in lectures
  - 'while' and 'if' are the only control statements
  - no procedures, no exceptions
  - the only variables are of 'int' type
  - no variable declarations; variables are initially zero
  - no objects, pointers, or arrays
While Language – Example Programs

  x = 13;
  while (x > 1) {
    println("x=", x);
    if (x % 2 == 0) { x = x / 2; }
    else { x = 3 * x + 1; }
  }

Does the program terminate for every initial value of x?
(Collatz conjecture – open)

Nested loop:

  while (i < 100) {
    j = i + 1;
    while (j < 100) {
      println(" ", i);
      println(",", j);
      j = j + 1;
    }
    i = i + 1;
  }
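To make the first program's behavior concrete, here is a Scala transliteration of the Collatz iteration that counts steps instead of printing (the function name is ours, not part of the While language):

```scala
// Scala transliteration of the first While program (3x+1 iteration).
// Returns the number of loop iterations until x reaches 1.
def collatzSteps(start: Int): Int = {
  var x = start
  var steps = 0
  while (x > 1) {
    if (x % 2 == 0) x = x / 2
    else x = 3 * x + 1
    steps += 1
  }
  steps
}
```

Starting from 13 the sequence is 13, 40, 20, 10, 5, 16, 8, 4, 2, 1 — nine steps; whether the loop terminates for every start value is exactly the open conjecture.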
Tokens (Words) of the While Language

regular expressions:

  ident        ::= letter (letter | digit)*
  integerConst ::= digit digit*
  stringConst  ::= " AnySymbolExceptQuote* "
  keywords:        if  else  while  println
  special symbols: (  )  &&  <  ==  +  -  *  /  %  !  -  {  }  ;  ,
  letter ::= a | b | c | … | z | A | B | C | … | Z
  digit  ::= 0 | 1 | … | 8 | 9
Regular Expressions: Definition

One way to denote (often infinite) languages.
A regular expression is an expression built from:
  - the empty-string language {ε}, denoted just ε
  - {a} for a in Σ, denoted simply by a
  - union, denoted | (or, sometimes, +)
  - concatenation, written as multiplication or by juxtaposition
  - Kleene star *

Identifiers: letter (letter | digit)*
(letter, digit are shorthands from before)
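The identifier expression can also be written with Scala's built-in regex library — an illustration of the same language, not how the lexer itself is implemented:

```scala
// letter (letter | digit)* as a Scala regular expression
val ident = "[a-zA-Z][a-zA-Z0-9]*".r

// does the whole string belong to the identifier language?
def isIdent(s: String): Boolean = ident.matches(s)
```

For example, "id3" is an identifier, but "3id" is not, since the first character must be a letter.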
History: Kleene (from Wikipedia) Stephen Cole Kleene (January 5, 1909, Hartford, Connecticut, United States – January 25, 1994, Madison, Wisconsin) was an American mathematician who helped lay the foundations for theoretical computer science. One of many distinguished students of Alonzo Church, Kleene, along with Alan Turing, Emil Post, and others, is best known as a founder of the branch of mathematical logic known as recursion theory. Kleene's work grounds the study of which functions are computable. A number of mathematical concepts are named after him: Kleene hierarchy, Kleene algebra, the Kleene star (Kleene closure), Kleene's recursion theorem and the Kleene fixpoint theorem. He also invented regular expressions, and was a leading American advocate of mathematical intuitionism.
Manually Constructing Lexers
Recall the pipeline: characters --lexer--> words (tokens) --parser--> trees. The lexer is specified using regular expressions; it groups characters into tokens and classifies them into token classes.
Lexer Input and Output

Stream of Char-s  --lexer-->  Stream of Token-s

class CharStream(fileName: String) {
  val file = new BufferedReader(new FileReader(fileName))
  var current: Char = ' '
  var eof: Boolean = false

  def next = {
    if (eof) throw EndOfInput("reading " + file)
    val c = file.read()
    eof = (c == -1)
    current = c.asInstanceOf[Char]
  }

  next // initialize 'current' with the first character
}

sealed abstract class Token
case class ID(content: String) extends Token  // "id3"
case class IntConst(value: Int) extends Token // 10
case class AssignEQ() extends Token           // '='
case class CompareEQ() extends Token          // '=='
case class MUL() extends Token                // '*'
case class PLUS() extends Token               // '+'
case class LEQ() extends Token                // '<='
case class OPAREN() extends Token             // '('
case class CPAREN() extends Token             // ')'
...
case class IF() extends Token                 // 'if'
case class WHILE() extends Token              // 'while'
case class EOF() extends Token                // end of file

class Lexer(ch: CharStream) {
  var current: Token
  def next: Unit = {
    // lexer code here
  }
}
Identifiers and Keywords

regular expression for identifiers: letter (letter | digit)*

if (isLetter) {
  b = new StringBuffer
  while (isLetter || isDigit) {
    b.append(ch.current)
    ch.next
  }
  keywords.get(b.toString) match {
    case None     => token = ID(b.toString)
    case Some(kw) => token = kw
  }
}

Keywords look like identifiers, but are simply listed as keywords in the language definition. Use a constant Map from strings to keyword tokens; if a word is not in the map, it is an ordinary identifier.
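A minimal sketch of the keyword-map idea; here tokens are represented as plain strings purely for illustration, rather than the Token case classes of the lecture:

```scala
// A constant map from keyword spellings to (string-encoded) keyword tokens.
val keywords: Map[String, String] =
  Map("if" -> "IF", "else" -> "ELSE", "while" -> "WHILE", "println" -> "PRINTLN")

// A word that is in the map is a keyword; otherwise it is an identifier.
def classify(word: String): String =
  keywords.getOrElse(word, s"ID($word)")
```

So classify("while") produces the keyword token, while classify("id3") falls through to an ordinary identifier.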
Integer Constants

regular expression for integers: digit digit*

if (isDigit) {
  k = 0
  while (isDigit) {
    k = 10 * k + (ch.current - '0') // accumulate the value digit by digit
    ch.next
  }
  token = IntConst(k)
}
Decision Tree to Map Symbols to Tokens

ch.current match {
  case '(' => { current = OPAREN; ch.next; return }
  case ')' => { current = CPAREN; ch.next; return }
  case '+' => { current = PLUS; ch.next; return }
  case '/' => { current = DIV; ch.next; return }
  case '*' => { current = MUL; ch.next; return }
  case '=' => {
    ch.next
    if (ch.current == '=') { ch.next; current = CompareEQ; return }
    else { current = AssignEQ; return }
  }
  case '<' => {
    ch.next
    if (ch.current == '=') { ch.next; current = LEQ; return }
    else { current = LESS; return }
  }
}
Skipping Comments

if (ch.current == '/') {
  ch.next
  if (ch.current == '/') { // line comment: skip to end of line
    while (!isEOL && !isEOF) { ch.next }
  }
}

Nested comments?
  /* foo /* bar */ baz */
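One common answer to the nested-comments question is a depth counter. A sketch under assumptions: the function name and the string-plus-index interface are ours, not the lecture's CharStream interface.

```scala
// Skip a nested /* ... */ comment starting at index `start` (which must
// point at "/*"); returns the index just past the matching "*/".
def skipNested(s: String, start: Int): Int = {
  var i = start + 2 // skip the opening "/*"
  var depth = 1     // how many unclosed "/*" we have seen
  while (depth > 0 && i < s.length) {
    if (i + 1 < s.length && s(i) == '/' && s(i + 1) == '*') { depth += 1; i += 2 }
    else if (i + 1 < s.length && s(i) == '*' && s(i + 1) == '/') { depth -= 1; i += 2 }
    else i += 1
  }
  i
}
```

Note that nested comments cannot be described by a regular expression (matching "/*" with "*/" requires counting), which is why this needs explicit lexer code rather than a token pattern.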
Further Important Topics

  - Longest match (maximal munch) rule
  - Combining the pieces together
  - Computing first symbols for regular expressions
  - Example of a tiny lexical analyzer – see the wiki
Computing first symbols
Computing nullable expressions
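The two computations named in the slides above — first symbols and nullable expressions — can be sketched by recursion over a small regex AST. The datatype and constructor names below are our assumptions, not the course's code; note that first of a concatenation needs nullable, which is why the two are defined together.

```scala
// A small AST for regular expressions
sealed trait Regex
case object Eps extends Regex                        // empty string ε
case class Sym(c: Char) extends Regex                // single symbol a
case class Union(l: Regex, r: Regex) extends Regex   // l | r
case class Concat(l: Regex, r: Regex) extends Regex  // l r
case class Star(r: Regex) extends Regex              // r*

// nullable(r): does L(r) contain the empty string?
def nullable(r: Regex): Boolean = r match {
  case Eps          => true
  case Sym(_)       => false
  case Union(l, r)  => nullable(l) || nullable(r)
  case Concat(l, r) => nullable(l) && nullable(r)
  case Star(_)      => true
}

// first(r) = { a | aw in L(r) }: the possible first symbols
def first(r: Regex): Set[Char] = r match {
  case Eps          => Set()
  case Sym(c)       => Set(c)
  case Union(l, r)  => first(l) ++ first(r)
  case Concat(l, r) => if (nullable(l)) first(l) ++ first(r) else first(l)
  case Star(r)      => first(r)
}
```

For the identifier pattern letter (letter | digit)*, first is just the set of letters — exactly the "use the first character to decide on the token class" idea from the summary.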
Automating Construction of Lexers
Example in JavaCC

TOKEN: {
  <IDENTIFIER: <LETTER> (<LETTER> | <DIGIT> | "_")* >
| <CONSTANT: <DIGIT> (<DIGIT>)* >
| <LETTER: ["a"-"z"] | ["A"-"Z"]>
| <DIGIT: ["0"-"9"]>
}

SKIP: {
  " " | "\n" | "\t"
}
Finite Automaton

Kinds of finite automata:
  - deterministic
  - non-deterministic
  - with epsilon transitions
  - with regular expressions on edges
Interpretation of Non-Determinism

For a given string, some paths in the automaton lead to accepting states, some to rejecting states. Does the automaton accept? Yes, if there exists an accepting path.

Continued in next lecture.
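The "exists an accepting path" condition can be checked without enumerating paths, by simulating the NFA on sets of states. A sketch under assumptions: the transition-map representation is ours, and epsilon transitions are omitted for brevity.

```scala
type State = Int

// Accept iff some path over `input` from `init` ends in an accepting state.
// `trans` maps (state, symbol) to the set of possible successor states.
def accepts(trans: Map[(State, Char), Set[State]],
            init: State, accepting: Set[State], input: String): Boolean = {
  var current = Set(init) // all states reachable after the prefix read so far
  for (c <- input)
    current = current.flatMap(s => trans.getOrElse((s, c), Set.empty[State]))
  current.exists(accepting) // accepted if ANY reachable state is accepting
}
```

For example, an NFA for strings ending in "ab" (states 0, 1, 2, accepting {2}) accepts "aab" because one of the paths reaches state 2, even though other paths do not.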