CS 432: Compiler Construction Lecture 3


1 CS 432: Compiler Construction Lecture 3
Department of Computer Science Salisbury University Fall 2017 Instructor: Dr. Sophie Wang 6/18/2018

2 Compiler Team Project
Write a compiler for a procedure-oriented source language that generates code for the Java virtual machine (JVM). The source language should be a procedural, non-object-oriented language or a subset thereof, or a language that the team invents. Tip: Start with a simple language! No Scheme, Lisp, or Lisp-like languages. The object language must be Jasmin, the assembly language for the Java virtual machine. You should use the JavaCC compiler-compiler. You can also use any Java code from the WCI book. Compile and run source programs written in your language.

3 Compiler Team Project, cont’d
Deliverables (what each team turns in): Java and JavaCC source files of a working compiler. Include the JavaCC .jj (or .jjt) grammar files, but do not include the Java files that JavaCC generated. A written report (5-10 pp., single spaced) that includes code diagrams for key source language constructs and, optionally, syntax diagrams for key source language constructs and UML diagrams for key compiler components. Instructions on how to build your compiler, if the process is not standard or not obvious. Instructions on how to run your compiler (scripts OK). Sample source programs to compile and execute.

4 Compiler Team Project, cont’d
Private individual post mortem report (up to 1 page from each student): What did you learn from this course? An assessment of your accomplishments for your project. An assessment of each of your project team members.

5 Minimum Acceptable Compiler Project
At least two data types with type checking. Basic arithmetic operations with operator precedence. Assignment statements. At least one conditional control statement (e.g., IF). At least one looping control statement. Procedures or functions with calls and returns. Parameters passed by value or by reference. Basic error recovery (skip to semicolon or end of line). “Nontrivial” sample programs written in the source language. Generate Jasmin code that can be assembled. Execute the resulting .class file standalone (preferred). No crashes (e.g., null pointer exceptions). Grade for a minimum acceptable project: 70/100.

6 Project Teams (So Far)
Team S & A: Sam, Andrew. Team B & B: Daniel, Blake. The Team A: Jun, Matt. Smash Brothers: Justin, Henry. Skynet: Joe, Jingwen, Utsab.
Before I came here, I was confused about this subject. Having listened to your lecture, I am still confused, but on a higher level. (Enrico Fermi, physicist)

7 How to Scan for Tokens
Suppose the source line contains:

    IF (index >= 10) THEN

The scanner skips over the leading blanks. The current character is I, so the next token must be a word. The scanner extracts a word token by copying characters up to but not including the first character that is not valid for a word, which in this case is a blank. The blank becomes the current character. The scanner determines that the word is a reserved word.

8 How to Scan for Tokens, cont’d
The scanner skips over any blanks between tokens. The current character is (. The next token must be a special symbol. After extracting the special symbol token, the current character is i. The next token must be a word. After extracting the word token, the current character is a blank.

9 How to Scan for Tokens, cont’d
Skip the blank. The current character is >. Extract the special symbol token (>=). The current character is a blank. Skip the blank. The current character is 1, so the next token must be a number. After extracting the number token, the current character is ).

10 How to Scan for Tokens, cont’d
Extract the special symbol token. The current character is a blank. Skip the blank. The current character is T, so the next token must be a word. Extract the word token. Determine that it’s a reserved word. The current character is \n, so the scanner is done with this line.

11 Basic Scanning Algorithm
Skip any blanks until the current character is nonblank. In Pascal, a comment and the end-of-line character each should be treated as a blank. The current (nonblank) character determines what the next token is and becomes that token’s first character. Extract the rest of the next token by copying successive characters up to but not including the first character that does not belong to that token. Extracting a token consumes all the source characters that constitute the token. After extracting a token, the current character is the first character after the last character of that token.
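The basic scanning algorithm can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: the class name ScanSketch and its scan method are hypothetical (they are not part of the course framework), comments are not handled, and only a handful of two-character special symbols are recognized.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the basic scanning algorithm; not the course's scanner classes.
final class ScanSketch {
    static List<String> scan(String line) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < line.length()) {
            char ch = line.charAt(pos);
            // Skip blanks until the current character is nonblank.
            if (Character.isWhitespace(ch)) {
                pos++;
                continue;
            }
            int start = pos;
            // The current character determines the kind of the next token.
            if (Character.isLetter(ch)) {
                // Word: copy characters up to the first one not valid for a word.
                while (pos < line.length() && Character.isLetterOrDigit(line.charAt(pos))) pos++;
            } else if (Character.isDigit(ch)) {
                // Unsigned integer: copy digits only.
                while (pos < line.length() && Character.isDigit(line.charAt(pos))) pos++;
            } else {
                // Special symbol: one character, or one of a few two-character symbols.
                pos++;
                if (pos < line.length()) {
                    String two = line.substring(start, pos + 1);
                    if (two.equals(">=") || two.equals("<=") || two.equals("<>") || two.equals(":=")) pos++;
                }
            }
            // After extraction, pos is the first character after the token.
            tokens.add(line.substring(start, pos));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [IF, (, index, >=, 10, ), THEN]
        System.out.println(scan("   IF (index >= 10) THEN"));
    }
}
```

Each branch leaves pos on the first character after the token, which is the invariant the slide states in its last sentence.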

12 Pascal-Specific Front End Classes

13 The Payoff
Now that we have source language-independent framework classes, Pascal-specific subclasses (mostly just placeholders for now), and an end-to-end test (the program listing generator), we can work on the individual components without worrying (too much) about breaking the rest of the code.

14 tokenval (token attribute)
The lexical analyzer reads the source text y := 31 + 28*x and passes the parser a stream of pairs, each consisting of a token and its tokenval (token attribute): <id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>

15 Front End Framework Classes

16 Pascal-Specific Subclasses

17 PascalTokenType
Each token is an enumerated value.

    public enum PascalTokenType implements TokenType {
        // Reserved words.
        AND, ARRAY, BEGIN, CASE, CONST, DIV, DO, DOWNTO, ELSE, END,
        FILE, FOR, FUNCTION, GOTO, IF, IN, LABEL, MOD, NIL, NOT,
        OF, OR, PACKED, PROCEDURE, PROGRAM, RECORD, REPEAT, SET,
        THEN, TO, TYPE, UNTIL, VAR, WHILE, WITH,

        // Special symbols.
        PLUS("+"), MINUS("-"), STAR("*"), SLASH("/"), COLON_EQUALS(":="),
        DOT("."), COMMA(","), SEMICOLON(";"), COLON(":"), QUOTE("'"),
        EQUALS("="), NOT_EQUALS("<>"), LESS_THAN("<"), LESS_EQUALS("<="),
        GREATER_EQUALS(">="), GREATER_THAN(">"), LEFT_PAREN("("), RIGHT_PAREN(")"),
        LEFT_BRACKET("["), RIGHT_BRACKET("]"), LEFT_BRACE("{"), RIGHT_BRACE("}"),
        UP_ARROW("^"), DOT_DOT(".."),

        IDENTIFIER, INTEGER, REAL, STRING, ERROR, END_OF_FILE;
        ...
    }

18 PascalTokenType, cont’d
The static set RESERVED_WORDS contains all of Pascal’s reserved word strings in lower case: "and", "array", "begin", etc. We can test whether a token is a reserved word:

    if (RESERVED_WORDS.contains(text.toLowerCase())) ...

The set is filled from the enumeration, relying on the reserved words occupying the ordinal range AND through WITH:

    // Set of lower-cased Pascal reserved word text strings.
    public static HashSet<String> RESERVED_WORDS = new HashSet<String>();
    static {
        PascalTokenType values[] = PascalTokenType.values();
        for (int i = AND.ordinal(); i <= WITH.ordinal(); ++i) {
            RESERVED_WORDS.add(values[i].getText().toLowerCase());
        }
    }

19 PascalTokenType (cont’d)
Static hash table SPECIAL_SYMBOLS contains all of Pascal’s special symbols. Each entry’s key is the string, such as "<", "=", "<=", etc. Each entry’s value is the corresponding enumerated value. We can test whether a token is a special symbol:

    if (PascalTokenType.SPECIAL_SYMBOLS.containsKey(Character.toString(currentChar))) ...

    // Hash table of Pascal special symbols.
    // Each special symbol's text is the key to its Pascal token type.
    public static Hashtable<String, PascalTokenType> SPECIAL_SYMBOLS =
        new Hashtable<String, PascalTokenType>();
    static {
        PascalTokenType values[] = PascalTokenType.values();
        for (int i = PLUS.ordinal(); i <= DOT_DOT.ordinal(); ++i) {
            SPECIAL_SYMBOLS.put(values[i].getText(), values[i]);
        }
    }

20 Pascal-Specific Token Classes
Each of the classes PascalWordToken, PascalNumberToken, PascalStringToken, PascalSpecialSymbolToken, and PascalErrorToken is a subclass of class PascalToken. PascalToken is a subclass of class Token. Each Pascal token subclass overrides the default extract() method of class Token. The default method could only create single-character tokens. The resulting design is loosely coupled and highly cohesive.

21 Syntax Diagrams

22 Class PascalScanner
The first character determines the type of the next token.

    protected Token extractToken() throws Exception {
        skipWhiteSpace();

        Token token;
        char currentChar = currentChar();

        // Construct the next token. The current character determines the
        // token type.
        if (currentChar == EOF) {
            token = new EofToken(source);
        }
        else if (Character.isLetter(currentChar)) {
            token = new PascalWordToken(source);
        }
        else if (Character.isDigit(currentChar)) {
            token = new PascalNumberToken(source);
        }
        ...

        return token;
    }

23 Class PascalWordToken
    protected void extract() throws Exception {
        StringBuilder textBuffer = new StringBuilder();
        char currentChar = currentChar();

        // Get the word characters (letter or digit). The scanner has
        // already determined that the first character is a letter.
        while (Character.isLetterOrDigit(currentChar)) {
            textBuffer.append(currentChar);
            currentChar = nextChar();  // consume character
        }

        text = textBuffer.toString();

        // Is it a reserved word or an identifier?
        type = (RESERVED_WORDS.contains(text.toLowerCase()))
                   ? PascalTokenType.valueOf(text.toUpperCase())  // reserved word
                   : IDENTIFIER;                                  // identifier
    }

24 Pascal String Tokens
A Pascal string literal constant uses single quotes. Two consecutive single quotes represent a single quote character inside a string: 'Don''t' is the string consisting of the characters Don't. A Pascal character literal constant is simply a string with only a single character, such as 'a'. The Pascal token subclass is PascalStringToken.
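The doubled-quote rule can be sketched as follows. This is a minimal illustration, not the PascalStringToken class from the course code: the class name PascalStringSketch and the method extractString are hypothetical, and the sketch works on a plain String instead of the framework's source reader.

```java
// Hypothetical illustration of Pascal string literal extraction.
final class PascalStringSketch {
    /**
     * Returns the value of the string literal whose opening quote is at
     * source.charAt(start), or null if the literal is never closed.
     */
    static String extractString(String source, int start) {
        StringBuilder value = new StringBuilder();
        int pos = start + 1;  // skip the opening quote
        while (pos < source.length()) {
            char ch = source.charAt(pos);
            if (ch == '\'') {
                if (pos + 1 < source.length() && source.charAt(pos + 1) == '\'') {
                    value.append('\'');  // two consecutive quotes stand for one quote
                    pos += 2;
                } else {
                    return value.toString();  // a lone quote closes the literal
                }
            } else {
                value.append(ch);
                pos++;
            }
        }
        return null;  // ran off the end of the source: unterminated string
    }

    public static void main(String[] args) {
        System.out.println(extractString("'Don''t'", 0));  // prints Don't
    }
}
```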

25 Pascal Number Tokens
A Pascal integer literal constant is an unsigned integer. A Pascal real literal constant starts with an unsigned integer (the whole part) followed by one of: a decimal point followed by another unsigned integer (the fraction part); an E or e, optionally followed by + or -, followed by an unsigned integer (the exponent part); or a fraction part followed by an exponent part. Any leading + or - sign before the literal constant is a separate token.

26 Class PascalNumberToken
For the token string "31415.926e-4", method extractNumber() passes the following parameter values to method computeFloatValue(): wholeDigits "31415", fractionDigits "926", exponentDigits "4", exponentSign '-'. Compute variable exponentValue: 4 as computed by computeIntegerValue(); -4 after negation, since exponentSign is '-'; -7 after subtracting fractionDigits.length(). Compute 31415926 x 10^-7 = 3.1415926. A bit of a hack!
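The exponent adjustment can be sketched as below. This is an assumption-laden illustration, not the actual computeFloatValue() of PascalNumberToken: the parameter names follow the slide, but the class name FloatValueSketch is hypothetical, and range checking and error handling are omitted.

```java
// Hypothetical sketch of the exponent adjustment described on the slide.
final class FloatValueSketch {
    static double computeFloatValue(String wholeDigits, String fractionDigits,
                                    String exponentDigits, char exponentSign) {
        // Exponent as written in the token (0 if there is no exponent part).
        int exponentValue = exponentDigits.isEmpty() ? 0 : Integer.parseInt(exponentDigits);
        if (exponentSign == '-') {
            exponentValue = -exponentValue;
        }
        // Treat whole and fraction digits as one integer, so shift the
        // exponent left by the number of fraction digits.
        exponentValue -= fractionDigits.length();

        double value = 0.0;
        for (char digit : (wholeDigits + fractionDigits).toCharArray()) {
            value = 10 * value + (digit - '0');
        }
        return value * Math.pow(10, exponentValue);
    }

    public static void main(String[] args) {
        // wholeDigits "31415", fractionDigits "926", exponent -4:
        // the result is approximately 3.1415926 (subject to floating-point rounding).
        System.out.println(computeFloatValue("31415", "926", "4", '-'));
    }
}
```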

27 Syntax Error Handling
Error handling is a three-step process. Detection: detect the presence of a syntax error. Flagging: flag the error by pointing it out or highlighting it, and display a descriptive error message. Recovery: move past the error and resume parsing. For now, we’ll just move on, starting with the current character, and attempt to extract the next token. A SYNTAX_ERROR message carries the source line number, the beginning source position, the token text, and the syntax error message.

28 Class PascalParserTD

    public void parse() throws Exception {
        ...
        // Loop over each token until the end of file.
        while (!((token = nextToken()) instanceof EofToken)) {
            TokenType tokenType = token.getType();

            if (tokenType != ERROR) {
                // Format each token.
                sendMessage(new Message(TOKEN,
                                        new Object[] {token.getLineNumber(),
                                                      token.getPosition(),
                                                      tokenType,
                                                      token.getText(),
                                                      token.getValue()}));
            }
            else {
                errorHandler.flag(token, (PascalErrorCode) token.getValue(), this);
            }
        }
        ...
    }

29 Program: Pascal Tokenizer
Verify the correctness of the Pascal token subclasses. Verify the correctness of the Pascal scanner.

30 Can We Build a Better Scanner?
Our scanner in the front end is relatively easy to understand and follow, with separate scanner classes for each token type. However, it’s big and slow: it creates lots of objects and makes lots of method calls. We can write a more compact and faster scanner. However, it may be harder to understand and follow.

31 Regular Expressions
Notation to specify a language. Declarative. Sort of like a programming language. Capable of describing the same thing as an NFA. An alphabet $\Sigma$ is a finite set of symbols (characters). A string $s$ is a finite sequence of symbols from $\Sigma$. $|s|$ denotes the length of string $s$. $\varepsilon$ denotes the empty string, thus $|\varepsilon| = 0$. A language is a specific set of strings over some fixed alphabet $\Sigma$.

32 String Operations
The concatenation of two strings $x$ and $y$ is denoted by $xy$. The exponentiation of a string $s$ is defined by $s^0 = \varepsilon$ and $s^i = s^{i-1}s$ for $i > 0$. Note that $s\varepsilon = \varepsilon s = s$.

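33 Language Operations
Union: $L \cup M = \{ s \mid s \in L \text{ or } s \in M \}$
Concatenation: $LM = \{ xy \mid x \in L \text{ and } y \in M \}$
Exponentiation: $L^0 = \{\varepsilon\}$; $L^i = L^{i-1}L$
Kleene closure: $L^* = \bigcup_{i=0,\ldots,\infty} L^i$
Positive closure: $L^+ = \bigcup_{i=1,\ldots,\infty} L^i$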
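For finite languages, these operations can be tried out directly with sets of strings. The sketch below is only an illustration: the class LanguageOps and its method names are hypothetical, the empty string "" plays the role of the empty string of the definitions, and because the Kleene closure is infinite whenever the language contains a nonempty string, closureUpTo computes only the finite union of the powers 0 through n.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of language operations on finite sets of strings.
final class LanguageOps {
    // Union: every string that is in l or in m.
    static Set<String> union(Set<String> l, Set<String> m) {
        Set<String> result = new TreeSet<>(l);
        result.addAll(m);
        return result;
    }

    // Concatenation: every xy with x in l and y in m.
    static Set<String> concat(Set<String> l, Set<String> m) {
        Set<String> result = new TreeSet<>();
        for (String x : l) {
            for (String y : m) {
                result.add(x + y);
            }
        }
        return result;
    }

    // Exponentiation: power 0 is {""}; power i is power (i-1) concatenated with l.
    static Set<String> power(Set<String> l, int i) {
        Set<String> result = new TreeSet<>();
        result.add("");
        for (int k = 0; k < i; k++) {
            result = concat(result, l);
        }
        return result;
    }

    // Finite approximation of the Kleene closure: union of powers 0..n.
    static Set<String> closureUpTo(Set<String> l, int n) {
        Set<String> result = new TreeSet<>();
        for (int i = 0; i <= n; i++) {
            result.addAll(power(l, i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [a0, a1, b0, b1]
        System.out.println(concat(Set.of("a", "b"), Set.of("0", "1")));
        System.out.println(closureUpTo(Set.of("ab"), 2));
    }
}
```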

34 Regular Expressions
Basis symbols: $\varepsilon$ is a regular expression denoting the language $\{\varepsilon\}$; $a \in \Sigma$ is a regular expression denoting $\{a\}$. If $r$ and $s$ are regular expressions denoting languages $L(r)$ and $M(s)$ respectively, then $r \mid s$ is a regular expression denoting $L(r) \cup M(s)$; $rs$ is a regular expression denoting $L(r)M(s)$; $r^*$ is a regular expression denoting $L(r)^*$; $(r^+)$ is a regular expression denoting $L(r)^+$. A language defined by a regular expression is called a regular set.

35 Regular Definitions
Regular definitions introduce a naming convention: $d_1 \to r_1$, $d_2 \to r_2$, …, $d_n \to r_n$, where each $r_i$ is a regular expression over $\Sigma \cup \{d_1, d_2, \ldots, d_{i-1}\}$. Any $d_j$ in $r_i$ can be textually substituted in $r_i$ to obtain an equivalent set of definitions.

36 Examples
$letter \to A \mid B \mid \cdots \mid Z \mid a \mid b \mid \cdots \mid z$
$digit \to 0 \mid 1 \mid \cdots \mid 9$
$id \to letter\ (\, letter \mid digit \,)^*$

37 Notational Shorthand
The following shorthands are often used: $r^+ = rr^*$; $r? = r \mid \varepsilon$; $[a\text{-}z] = a \mid b \mid c \mid \cdots \mid z$. Example: $digit \to 0 \mid 1 \mid \cdots \mid 9$ is simplified to $digit \to [0\text{-}9]$.

38 Regular Definitions and Grammars
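Grammar:
$stmt \to \text{if } expr \text{ then } stmt \mid \text{if } expr \text{ then } stmt \text{ else } stmt \mid \varepsilon$
$expr \to term\ relop\ term \mid term$
$term \to id \mid num$
Regular definitions:
$if \to \text{if}$
$then \to \text{then}$
$else \to \text{else}$
$relop \to\ < \ \mid\ <= \ \mid\ <> \ \mid\ > \ \mid\ >= \ \mid\ =$
$id \to letter\ (\, letter \mid digit \,)^*$
$num \to digit^+\ (.\ digit^+)?\ (\, E\ (+ \mid -)?\ digit^+ \,)?$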
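The regular definitions for id, num, and relop translate almost one-to-one into java.util.regex patterns, which makes them easy to experiment with. This is a sketch for exploration only (the class name RegularDefinitions is hypothetical); the scanners in this lecture do not use java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical illustration: the slide's regular definitions as Java regex patterns.
final class RegularDefinitions {
    static final String LETTER = "[A-Za-z]";
    static final String DIGIT = "[0-9]";

    // id -> letter ( letter | digit )*
    static final Pattern ID = Pattern.compile(LETTER + "(" + LETTER + "|" + DIGIT + ")*");

    // num -> digit+ (. digit+)? ( E (+|-)? digit+ )?
    static final Pattern NUM =
        Pattern.compile(DIGIT + "+(\\." + DIGIT + "+)?(E[+-]?" + DIGIT + "+)?");

    // relop -> < | <= | <> | > | >= | =
    static final Pattern RELOP = Pattern.compile("<=|<>|<|>=|>|=");

    public static void main(String[] args) {
        System.out.println(ID.matcher("index").matches());         // true
        System.out.println(NUM.matcher("31415.926E-4").matches()); // true
        System.out.println(NUM.matcher("1.").matches());           // false: a fraction needs digits
    }
}
```

Note that matches() tests the whole input against the pattern, which corresponds to asking whether a string belongs to the language the definition denotes.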

39 Finite Automaton

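40 State Diagram for M1
[Figure: state diagram of machine M1 with states q1, q2, and q3 and transitions labeled with the input symbols 0 and 1.]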

41 Data Representation for M1

42 Language
The language of machine M, written as L(M), is the set of all strings that machine M accepts. We say that M recognizes L(M). A language is called a regular language if some finite automaton recognizes it.

43 Nondeterministic and Deterministic Finite Automata
NFA and DFA stand for nondeterministic finite automaton and deterministic finite automaton, respectively. A nondeterministic finite automaton differs from a deterministic one in that, for a given state and input symbol, it may move to more than one state. It may also make epsilon transitions, which change state without consuming an input symbol. Every nondeterministic finite automaton has an equivalent deterministic finite automaton.
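One way to see why an equivalent DFA always exists is to simulate an NFA by tracking the set of all states it could be in: each such set behaves like a single state of a deterministic machine. The sketch below uses a made-up example NFA (not one from the lecture) that accepts binary strings ending in 01; it has no epsilon transitions, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration: simulating an NFA by tracking sets of states.
// Example NFA (an assumption, not from the slides): accepts strings over {0,1} ending in "01".
final class NfaSketch {
    // Transition relation: a state and a symbol may lead to several states, or to none.
    static Set<Integer> move(int state, char symbol) {
        Set<Integer> next = new HashSet<>();
        if (state == 0 && symbol == '0') {
            next.add(0);  // stay in the start state, or ...
            next.add(1);  // ... guess that this 0 begins the final "01"
        } else if (state == 0 && symbol == '1') {
            next.add(0);
        } else if (state == 1 && symbol == '1') {
            next.add(2);  // state 2 is the accepting state
        }
        return next;
    }

    static boolean accepts(String input) {
        Set<Integer> current = new HashSet<>();
        current.add(0);  // start state
        for (char symbol : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int state : current) {
                next.addAll(move(state, symbol));
            }
            current = next;  // the set of possible states acts like one DFA state
        }
        return current.contains(2);
    }

    public static void main(String[] args) {
        System.out.println(accepts("1101"));  // true
        System.out.println(accepts("10"));    // false
    }
}
```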

44 Equivalence of FA and RE
Finite Automata and Regular Expressions are equivalent.

45 Deterministic Finite Automata (DFA)
Pascal identifier. Regular expression: <letter> ( <letter> | <digit> )*. Implement the regular expression with a finite automaton (AKA finite state machine). [Figure: state diagram with start state 1, state 2, and accepting state 3; the transitions are labeled letter, digit, and [other].] This automaton is a deterministic finite automaton (DFA). At each state, the next input character uniquely determines which transition to take to the next state.

46 State-Transition Matrix
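Represent the behavior of a DFA by a state-transition matrix. [Figure: the identifier DFA with states 1, 2, and 3, shown next to its matrix, which has one row per state and one column for each of letter, digit, and [other].]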
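A state-transition matrix for the identifier DFA might look like the sketch below. The matrix on the slide is an image that did not survive in this transcript, so the entries here are an assumption derived from the regular expression <letter> ( <letter> | <digit> )*: state 1 is the start state, a letter leads to state 2, letters and digits stay in state 2, and any other character leads to accepting state 3. The class name, the ERR value, and the method identifierLength are hypothetical.

```java
// Hypothetical sketch of a state-transition matrix for the identifier DFA.
final class IdentifierDfa {
    static final int LETTER = 0, DIGIT = 1, OTHER = 2;  // column indexes
    static final int ERR = -1;                           // assumed error marker

    // Row index is (state - 1); entries are next states.
    static final int[][] MATRIX = {
        /* state 1 */ { 2,   ERR, ERR },
        /* state 2 */ { 2,   2,   3   },
        /* state 3 */ { ERR, ERR, ERR },
    };

    static int classOf(char ch) {
        return Character.isLetter(ch) ? LETTER
             : Character.isDigit(ch)  ? DIGIT
             :                          OTHER;
    }

    /** Returns the length of the identifier at the start of input, or -1 if there is none. */
    static int identifierLength(String input) {
        String padded = input + "\n";  // guarantees an [other] character at the end
        int state = 1;
        int length = 0;
        for (char ch : padded.toCharArray()) {
            state = MATRIX[state - 1][classOf(ch)];
            if (state == ERR) return -1;
            if (state == 3) return length;  // accepting: [other] is not part of the token
            length++;
        }
        return -1;  // not reached, because of the padding character
    }

    public static void main(String[] args) {
        System.out.println(identifierLength("x1+y"));  // 2
        System.out.println(identifierLength("9x"));    // -1
    }
}
```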

47 DFA for a Pascal Number
[Figure: state diagram for a Pascal number with states 3 through 12 and transitions labeled digit, +, -, E, ., and [other].] Note that this diagram allows only an upper-case E for an exponent. What changes are required to also allow a lower-case e?

48 DFA for a Pascal Identifier or Number
    private static final int matrix[][] = {
        /*        letter digit    +    -    .    E other */
        /*  0 */ {   1,    4,     3,   3, ERR,   1, ERR },
        /*  1 */ {   1,    1,    -2,  -2,  -2,   1,  -2 },
        /*  2 */ { ERR,  ERR,   ERR, ERR, ERR, ERR, ERR },
        /*  3 */ { ERR,    4,   ERR, ERR, ERR, ERR, ERR },
        /*  4 */ {  -5,    4,    -5,  -5,   6,   9,  -5 },
        /*  5 */ { ERR,  ERR,   ERR, ERR, ERR, ERR, ERR },
        /*  6 */ { ERR,    7,   ERR, ERR, ERR, ERR, ERR },
        /*  7 */ {  -8,    7,    -8,  -8,  -8,   9,  -8 },
        /*  8 */ { ERR,  ERR,   ERR, ERR, ERR, ERR, ERR },
        /*  9 */ { ERR,   11,    10,  10, ERR, ERR, ERR },
        /* 10 */ { ERR,   11,   ERR, ERR, ERR, ERR, ERR },
        /* 11 */ { -12,   11,   -12, -12, -12, -12, -12 },
        /* 12 */ { ERR,  ERR,   ERR, ERR, ERR, ERR, ERR },
    };

Negative numbers in the matrix are the accepting states. Notice how the letter E is handled! [Figure: combined state diagram for identifiers (states 1 and 2) and numbers (states 3 through 12), with transitions labeled letter, digit, +, -, E, ., and [other].]

49 A Simple DFA Scanner

    public class SimpleDFAScanner {
        // Input characters.
        private static final int LETTER = 0;
        private static final int DIGIT  = 1;
        private static final int PLUS   = 2;
        private static final int MINUS  = 3;
        private static final int DOT    = 4;
        private static final int E      = 5;
        private static final int OTHER  = 6;

        private static final int ERR = ;  // error state

        private static final int matrix[][] = { ... };

        private char ch;    // current input character
        private int state;  // current state
        ...
    }

50 A Simple DFA Scanner, cont’d
    int typeOf(char ch) {
        return (ch == 'E')             ? E
             : Character.isLetter(ch)  ? LETTER
             : Character.isDigit(ch)   ? DIGIT
             : (ch == '+')             ? PLUS
             : (ch == '-')             ? MINUS
             : (ch == '.')             ? DOT
             :                           OTHER;
    }

51 A Simple DFA Scanner, cont’d
This is the heart of the scanner. Table-driven scanners can be very fast!

    private String nextToken() throws IOException {
        while (Character.isWhitespace(ch)) nextChar();
        if (ch == 0) return null;  // EOF?

        state = 0;  // start state
        StringBuilder buffer = new StringBuilder();

        while (state >= 0) {  // not accepting state
            state = matrix[state][typeOf(ch)];  // transit

            if ((state >= 0) || (state == ERR)) {
                buffer.append(ch);  // build token string
                nextChar();
            }
        }

        return buffer.toString();
    }

52 Simple DFA Scanner, cont’d
How do we know which token we just got? The accepting state in which nextToken() stopped identifies the token type.

    private void scan() throws IOException {
        nextChar();

        while (ch != 0) {  // EOF?
            String token = nextToken();

            if (token != null) {
                System.out.print("=====> \"" + token + "\" ");
                String tokenType =
                      (state ==  -2) ? "IDENTIFIER"
                    : (state ==  -5) ? "INTEGER"
                    : (state ==  -8) ? "REAL (fraction only)"
                    : (state == -12) ? "REAL"
                    :                  "*** ERROR ***";
                System.out.println(tokenType);
            }
        }
    }
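The fragments on the last four slides can be assembled into a small self-contained program for experimentation. The sketch below is an adaptation under stated assumptions, not the lecture's SimpleDFAScanner: it reads from a String instead of a file, the class name DfaScannerSketch and the static scan method are hypothetical, and the value chosen for ERR is an assumption (the transcript does not show it; any negative value that is not one of the accepting-state codes would work).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, String-based adaptation of the table-driven scanner from the slides.
final class DfaScannerSketch {
    private static final int LETTER = 0, DIGIT = 1, PLUS = 2, MINUS = 3, DOT = 4, E = 5, OTHER = 6;
    private static final int ERR = -99999;  // assumed value; not shown in the transcript

    private static final int[][] MATRIX = {
        /*  0 */ {   1,   4,   3,   3, ERR,   1, ERR },
        /*  1 */ {   1,   1,  -2,  -2,  -2,   1,  -2 },
        /*  2 */ { ERR, ERR, ERR, ERR, ERR, ERR, ERR },
        /*  3 */ { ERR,   4, ERR, ERR, ERR, ERR, ERR },
        /*  4 */ {  -5,   4,  -5,  -5,   6,   9,  -5 },
        /*  5 */ { ERR, ERR, ERR, ERR, ERR, ERR, ERR },
        /*  6 */ { ERR,   7, ERR, ERR, ERR, ERR, ERR },
        /*  7 */ {  -8,   7,  -8,  -8,  -8,   9,  -8 },
        /*  8 */ { ERR, ERR, ERR, ERR, ERR, ERR, ERR },
        /*  9 */ { ERR,  11,  10,  10, ERR, ERR, ERR },
        /* 10 */ { ERR,  11, ERR, ERR, ERR, ERR, ERR },
        /* 11 */ { -12,  11, -12, -12, -12, -12, -12 },
        /* 12 */ { ERR, ERR, ERR, ERR, ERR, ERR, ERR },
    };

    private final String input;
    private int pos = 0;
    private int state;

    private DfaScannerSketch(String input) { this.input = input; }

    // Current character, or 0 at end of input (mirrors the slides' EOF convention).
    private char ch() { return pos < input.length() ? input.charAt(pos) : 0; }

    private static int typeOf(char ch) {
        return (ch == 'E') ? E
             : Character.isLetter(ch) ? LETTER
             : Character.isDigit(ch) ? DIGIT
             : (ch == '+') ? PLUS
             : (ch == '-') ? MINUS
             : (ch == '.') ? DOT
             : OTHER;
    }

    private String nextToken() {
        while (Character.isWhitespace(ch())) pos++;
        if (ch() == 0) return null;  // end of input
        state = 0;                   // start state
        StringBuilder buffer = new StringBuilder();
        while (state >= 0) {         // negative means accepting or error
            state = MATRIX[state][typeOf(ch())];
            if (state >= 0 || state == ERR) {
                buffer.append(ch());
                pos++;
            }
        }
        return buffer.toString();
    }

    // Returns one "TYPE text" entry per token.
    static List<String> scan(String input) {
        DfaScannerSketch scanner = new DfaScannerSketch(input);
        List<String> result = new ArrayList<>();
        for (String token = scanner.nextToken(); token != null; token = scanner.nextToken()) {
            String type = (scanner.state == -2) ? "IDENTIFIER"
                        : (scanner.state == -5) ? "INTEGER"
                        : (scanner.state == -8) ? "REAL (fraction only)"
                        : (scanner.state == -12) ? "REAL"
                        : "*** ERROR ***";
            result.add(type + " " + token);
        }
        return result;
    }

    public static void main(String[] args) {
        scan("x1 42 3.14 1E-5 Else").forEach(System.out::println);
    }
}
```

Tracing "Else" through the matrix shows how the letter E is handled: from the start state, column E leads to state 1 just as a letter does, so a word beginning with E is still an identifier, while in state 4 or 7 the same column starts an exponent.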

53 Compiler-Compilers
Professional compiler writers generally do not write scanners and parsers from scratch. A compiler-compiler is a tool for writing compilers. It can include a scanner generator, a parser generator, and parse tree utilities. General idea: Feed a compiler-compiler a grammar written in a textual form such as BNF or EBNF. The compiler-compiler outputs code written in a high-level language that implements a scanner, a parser, and a parse tree.

54 Popular Compiler-Compilers
Yacc (“Yet another compiler-compiler”): generates a bottom-up parser written in C. GNU version: Bison.
Lex: generates a scanner written in C. GNU version: Flex.
JavaCC: generates a scanner and a top-down parser written in Java.
JJTree: preprocessor for JavaCC grammars. Generates Java code to build and process parse trees.
The code generated by a compiler-compiler can be in a high-level language such as C or Java. However, you may find the code to be ugly and hard to read.

55 JavaCC Compiler-Compiler
Feed JavaCC the grammar for a source language and it will automatically generate a scanner and a parser. Specify the source language tokens with regular expressions → JavaCC generates a scanner for the source language. Specify the source language syntax rules with Extended BNF → JavaCC generates a parser for the source language. The generated scanner and parser are written in Java. Note: JavaCC calls the scanner the “tokenizer”.

56 Download and Install
Download and install JavaCC and JJTree. Also download the examples from the book Generating Parsers with JavaCC.

57 JavaCC Regular Expressions
In each definition, the part before the colon is the token name and the part after it describes the token string.

    Literals:           <HELLO : "hello">
    Character classes:  <SPACE_OR_COMMA : [" ", ","]>
    Character ranges:   <LOWER_CASE : ["a"-"z"]>
    Alternates:         <UPPER_OR_LOWER : ["A"-"Z"] | ["a"-"z"]>

58 JavaCC Regular Expressions (cont'd)
    Negation:     <NO_DIGITS : ~["0"-"9"]>
    Repetition:   <THREE_A : ("a"){3}>
                  <TWO_TO_FOUR_A : ("a"){2,4}>
    Quantifiers:  <ONE_OR_MORE_A : ("a")+>
                  <ZERO_OR_ONE_SIGN : (["+", "-"])?>
                  <ZERO_OR_MORE_DIGITS : (["0"-"9"])*>
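For readers more used to java.util.regex, the JavaCC expressions above correspond roughly to the Java patterns in the sketch below. The mapping is offered as an aid to reading the JavaCC notation, not as something JavaCC itself uses; the class name JavaccRegexEquivalents is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical side-by-side: JavaCC token expressions and comparable Java regex patterns.
final class JavaccRegexEquivalents {
    static final Pattern SPACE_OR_COMMA      = Pattern.compile("[ ,]");     // [" ", ","]
    static final Pattern LOWER_CASE          = Pattern.compile("[a-z]");    // ["a"-"z"]
    static final Pattern UPPER_OR_LOWER      = Pattern.compile("[A-Z]|[a-z]"); // ["A"-"Z"] | ["a"-"z"]
    static final Pattern NO_DIGITS           = Pattern.compile("[^0-9]");   // ~["0"-"9"]
    static final Pattern THREE_A             = Pattern.compile("a{3}");     // ("a"){3}
    static final Pattern TWO_TO_FOUR_A       = Pattern.compile("a{2,4}");   // ("a"){2,4}
    static final Pattern ONE_OR_MORE_A       = Pattern.compile("a+");       // ("a")+
    static final Pattern ZERO_OR_ONE_SIGN    = Pattern.compile("[+-]?");    // (["+", "-"])?
    static final Pattern ZERO_OR_MORE_DIGITS = Pattern.compile("[0-9]*");   // (["0"-"9"])*

    public static void main(String[] args) {
        System.out.println(THREE_A.matcher("aaa").matches());          // true
        System.out.println(TWO_TO_FOUR_A.matcher("aaaaa").matches());  // false
        System.out.println(NO_DIGITS.matcher("7").matches());          // false
    }
}
```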

59 Example: helloworld1.jj
Recognize tokens “hello” and “world”. The TOKEN_MGR_DECLS section holds the main Java method of the tokenizer, and the TOKEN section defines the two tokens as literals.

    options {
        BUILD_PARSER=false;
    }

    PARSER_BEGIN(HelloWorld)
    public class HelloWorld {}
    PARSER_END(HelloWorld)

    TOKEN_MGR_DECLS : {
        public static void main(String[] args) {
            java.io.StringReader sr = new java.io.StringReader(args[0]);
            SimpleCharStream scs = new SimpleCharStream(sr);
            HelloWorldTokenManager mgr = new HelloWorldTokenManager(scs);

            for (Token t = mgr.getNextToken(); t.kind != EOF; t = mgr.getNextToken()) {
                debugStream.println("Found token: " + t.image);
            }
        }
    }

    TOKEN : {
        <HELLO : "hello">
      | <WORLD : "world">
    }

60 Example: helloworld2.jj
Recognize tokens “hello” and “world”. Ignore case. Ignore spaces and commas between tokens (the SKIP section uses a character class). Use lexical actions (the Java blocks that follow each token definition).

    options {
        BUILD_PARSER=false;
        IGNORE_CASE=true;
    }

    PARSER_BEGIN(HelloWorld)
    public class HelloWorld {}
    PARSER_END(HelloWorld)

    TOKEN_MGR_DECLS : {
        public static void main(String[] args) {
            java.io.StringReader sr = new java.io.StringReader(args[0]);
            SimpleCharStream scs = new SimpleCharStream(sr);
            HelloWorldTokenManager mgr = new HelloWorldTokenManager(scs);

            while (mgr.getNextToken().kind != EOF) {}
        }
    }

    SKIP : {
        <IGNORE : [" ", ","]>
    }

    TOKEN : {
        <HELLO : "hello">
            { debugStream.println("HELLO token: " + matchedToken.image); }
      | <WORLD : "world">
            { debugStream.println("WORLD token: " + matchedToken.image); }
    }

61 Example: helloworld3.jj
Recognize words other than “hello” and “world” as identifiers. Why aren’t “hello” and “world” recognized as identifiers?

    options {
        BUILD_PARSER=false;
        IGNORE_CASE=true;
    }

    PARSER_BEGIN(HelloWorld)
    public class HelloWorld {}
    PARSER_END(HelloWorld)

    TOKEN_MGR_DECLS : {
        public static void main(String[] args) {
            ...
        }
    }

    SKIP : {
        <IGNORE : [" ", ","]>
    }

    TOKEN : {
        <HELLO : "hello">
            { debugStream.println("HELLO token: " + matchedToken.image); }
      | <WORLD : "world">
            { debugStream.println("WORLD token: " + matchedToken.image); }
      | <IDENTIFIER : ["a"-"z"](["a"-"z", "0"-"9", "_"])*>
            { debugStream.println("IDENTIFIER token: " + matchedToken.image); }
    }
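The answer to the question lies in how JavaCC chooses among competing token definitions: it prefers the longest match, and when two definitions match the same length, the one that appears first in the grammar file wins. The Java sketch below imitates that selection rule with java.util.regex so the effect can be observed; it is a simplified model with hypothetical names, not the code JavaCC generates.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of "longest match, earliest rule wins ties".
final class LongestMatchSketch {
    // LinkedHashMap keeps the rules in declaration order, as in the .jj file.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("HELLO", Pattern.compile("hello", Pattern.CASE_INSENSITIVE));
        RULES.put("WORLD", Pattern.compile("world", Pattern.CASE_INSENSITIVE));
        RULES.put("IDENTIFIER", Pattern.compile("[a-z][a-z0-9_]*", Pattern.CASE_INSENSITIVE));
    }

    /** Returns the kind of the token at the start of input, or null if no rule matches. */
    static String kindOf(String input) {
        String bestKind = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher matcher = rule.getValue().matcher(input);
            // Strictly longer matches replace the current best, so on a tie
            // the earlier rule is kept.
            if (matcher.lookingAt() && matcher.end() > bestLength) {
                bestLength = matcher.end();
                bestKind = rule.getKey();
            }
        }
        return bestKind;
    }

    public static void main(String[] args) {
        System.out.println(kindOf("hello"));       // HELLO: ties with IDENTIFIER, declared first
        System.out.println(kindOf("helloworld"));  // IDENTIFIER: the longer match wins
    }
}
```

So “hello” matches both HELLO and IDENTIFIER with the same length, and HELLO wins because it is declared first; moving the IDENTIFIER definition to the top of the TOKEN section would change the outcome.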

62 Example: helloworld4.jj
Use private tokens DIGIT and LETTER. A private token is marked with # and can only be used inside other token definitions.

    options {
        BUILD_PARSER=false;
        IGNORE_CASE=true;
    }

    PARSER_BEGIN(HelloWorld)
    public class HelloWorld {}
    PARSER_END(HelloWorld)

    TOKEN_MGR_DECLS : {
        public static void main(String[] args) {
            ...
        }
    }

    SKIP : {
        <IGNORE : [" ", ","]>
    }

    TOKEN : {
        <HELLO : "hello">
            { debugStream.println("HELLO token: " + matchedToken.image); }
      | <WORLD : "world">
            { debugStream.println("WORLD token: " + matchedToken.image); }
      | <IDENTIFIER : <LETTER> (<LETTER> | <DIGIT> | "_")*>
            { debugStream.println("IDENTIFIER token: " + matchedToken.image); }
      | <#DIGIT : ["0"-"9"]>
      | <#LETTER : ["a"-"z"]>
    }

63 Example: helloworld5.jj
What’s the tokenizer doing? Turn on debugging output with DEBUG_TOKEN_MANAGER.

    options {
        BUILD_PARSER=false;
        IGNORE_CASE=true;
        DEBUG_TOKEN_MANAGER=true;
    }

    PARSER_BEGIN(HelloWorld)
    public class HelloWorld {}
    PARSER_END(HelloWorld)

    TOKEN_MGR_DECLS : {
        public static void main(String[] args) {
            ...
        }
    }

    SKIP : {
        <IGNORE : [" ", ","]>
    }

    TOKEN : {
        <HELLO : "hello">
            { debugStream.println("HELLO token: " + matchedToken.image); }
      | <WORLD : "world">
            { debugStream.println("WORLD token: " + matchedToken.image); }
      | <IDENTIFIER : <LETTER> (<LETTER> | <DIGIT> | "_")*>
            { debugStream.println("IDENTIFIER token: " + matchedToken.image); }
      | <#DIGIT : ["0"-"9"]>
      | <#LETTER : ["a"-"z"]>
    }

64 The JavaCC Eclipse Plug-in
How to install the plug-in: Start Eclipse. Choose Help → Install New Software ... Click the Add button. Name: SF Eclipse JavaCC Location: Click the OK button. Select “SF JavaCC Eclipse Plug-in”. Or, visit

65 Review: DFA for a Pascal Identifier or Number
    private static final int matrix[][] = {
        /*        letter digit    +    -    .    E other */
        /*  0 */ {   1,    4,     3,   3, ERR,   1, ERR },
        /*  1 */ {   1,    1,    -2,  -2,  -2,   1,  -2 },
        /*  2 */ { ERR,  ERR,   ERR, ERR, ERR, ERR, ERR },
        /*  3 */ { ERR,    4,   ERR, ERR, ERR, ERR, ERR },
        /*  4 */ {  -5,    4,    -5,  -5,   6,   9,  -5 },
        /*  5 */ { ERR,  ERR,   ERR, ERR, ERR, ERR, ERR },
        /*  6 */ { ERR,    7,   ERR, ERR, ERR, ERR, ERR },
        /*  7 */ {  -8,    7,    -8,  -8,  -8,   9,  -8 },
        /*  8 */ { ERR,  ERR,   ERR, ERR, ERR, ERR, ERR },
        /*  9 */ { ERR,   11,    10,  10, ERR, ERR, ERR },
        /* 10 */ { ERR,   11,   ERR, ERR, ERR, ERR, ERR },
        /* 11 */ { -12,   11,   -12, -12, -12, -12, -12 },
        /* 12 */ { ERR,  ERR,   ERR, ERR, ERR, ERR, ERR },
    };

Negative numbers in the matrix are accepting states. [Figure: combined state diagram for identifiers (states 1 and 2) and numbers (states 3 through 12), with transitions labeled letter, digit, +, -, E, ., and [other].]

66 The JavaCC Tokenizer
The JavaCC tokenizer (scanner) is a DFA created from the regular expressions. The generated Java code implements state transitions with switch statements instead of a matrix. Example (from HelloWorldTokenManager.java):

    static private int jjMoveStringLiteralDfa0_0()
    {
        switch(curChar)
        {
            case 72:
            case 104:
                return jjMoveStringLiteralDfa1_0(0x4L);
            case 87:
            case 119:
                return jjMoveStringLiteralDfa1_0(0x8L);
            default :
                return jjMoveNfa_0(0, 0);
        }
    }
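The case labels in the generated method are character codes: 72 and 104 are 'H' and 'h', and 87 and 119 are 'W' and 'w', which is how the IGNORE_CASE option shows up in the generated code. The hand-written sketch below imitates only that first dispatch step to make the idea concrete. It is not JavaCC output, the class and method names are hypothetical, and the remaining characters are checked with a simple string comparison instead of further DFA states.

```java
// Hypothetical, hand-written imitation of dispatching on the first character with a switch.
final class SwitchDfaSketch {
    static String classify(String word) {
        if (word.isEmpty()) {
            return "NONE";
        }
        switch (word.charAt(0)) {
            case 72:   // 'H'
            case 104:  // 'h'
                return word.equalsIgnoreCase("hello") ? "HELLO" : "OTHER";
            case 87:   // 'W'
            case 119:  // 'w'
                return word.equalsIgnoreCase("world") ? "WORLD" : "OTHER";
            default:
                return "OTHER";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("Hello"));  // HELLO
        System.out.println(classify("whirl"));  // OTHER
    }
}
```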

