CS 432: Compiler Construction Lecture 3


CS 432: Compiler Construction Lecture 3 Department of Computer Science Salisbury University Fall 2017 Instructor: Dr. Sophie Wang http://faculty.salisbury.edu/~xswang 6/18/2018

Compiler Team Project Write a compiler for a procedure-oriented source language that will generate code for the Java virtual machine (JVM). The source language should be a procedural, non object-oriented language or subset thereof, or a language that the team invents. Tip: Start with a simple language! No Scheme, Lisp, or Lisp-like languages. The object language must be Jasmin, the assembly language for the Java virtual machine. You should use the JavaCC compiler-compiler. You can also use any Java code from the WCI book. Compile and run source programs written in your language. 6/18/2018

Compiler Team Project, cont’d Deliverables (what each team turns in) Java and JavaCC source files of a working compiler. The JavaCC .jj (or .jjt) grammar files. Do not include the Java files that JavaCC generated. Written report (5-10 pp. single spaced) Include: Code diagrams for key source language constructs. Optional: Syntax diagrams for key source language constructs. Optional: UML diagrams for key compiler components. Instructions on how to build your compiler. If it’s not standard or not obvious. Instructions on how to run your compiler (scripts OK). Sample source programs to compile and execute. 6/18/2018

Compiler Team Project, cont’d Private individual post mortem report (up to 1 page from each student) What did you learn from this course? An assessment of your accomplishments for your project An assessment of each of your project team members. 6/18/2018

Minimum Acceptable Compiler Project At least two data types with type checking. Basic arithmetic operations with operator precedence. Assignment statements. At least one conditional control statement (e.g., IF). At least one looping control statement. Procedures or functions with calls and returns. Parameters passed by value or by reference. Basic error recovery (skip to semicolon or end of line). “Nontrivial” sample programs written in the source language. Generate Jasmin code that can be assembled. Execute the resulting .class file standalone (preferred). No crashes (e.g., null pointer exceptions). 70/100 6/18/2018

Project Teams (So Far)
Team S & A: Sam, Andrew
Team B & B: Daniel, Blake
The Team A: Jun, Matt
Smash Brothers: Justin, Henry
Skynet: Joe, Jingwen, Utsab
"Before I came here, I was confused about this subject. Having listened to your lecture, I am still confused, but on a higher level." -- Enrico Fermi, physicist, 1901-1954

How to Scan for Tokens Suppose the source line contains IF (index >= 10) THEN The scanner skips over the leading blanks. The current character is I, so the next token must be a word. The scanner extracts a word token by copying characters up to but not including the first character that is not valid for a word, which in this case is a blank. The blank becomes the current character. The scanner determines that the word is a reserved word. 6/18/2018

How to Scan for Tokens, cont’d The scanner skips over any blanks between tokens. The current character is (. The next token must be a special symbol. After extracting the special symbol token, the current character is i. The next token must be a word. After extracting the word token, the current character is a blank. 6/18/2018

How to Scan for Tokens, cont’d Skip the blank. The current character is >. Extract the special symbol token. The current character is a blank. Skip the blank. The current character is 1, so the next token must be a number. After extracting the number token, the current character is ). 6/18/2018

How to Scan for Tokens, cont’d Extract the special symbol token. The current character is a blank. Skip the blank. The current character is T, so the next token must be a word. Extract the word token. Determine that it’s a reserved word. The current character is \n, so the scanner is done with this line. 6/18/2018

Basic Scanning Algorithm Skip any blanks until the current character is nonblank. In Pascal, a comment and the end-of-line character each should be treated as a blank. The current (nonblank) character determines what the next token is and becomes that token’s first character. Extract the rest of the next token by copying successive characters up to but not including the first character that does not belong to that token. Extracting a token consumes all the source characters that constitute the token. After extracting a token, the current character is the first character after the last character of that token. 6/18/2018
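Here is a minimal sketch (not the book's exact code) of the "skip the blanks" step, assuming the currentChar()/nextChar() helpers and the EOF constant used by the scanner code later in this lecture; it treats only { } comments as blanks:

private void skipWhiteSpace() throws Exception
{
    char currentChar = currentChar();

    while (Character.isWhitespace(currentChar) || (currentChar == '{')) {
        if (currentChar == '{') {                       // start of a comment
            do {
                currentChar = nextChar();               // consume comment characters
            } while ((currentChar != '}') && (currentChar != EOF));

            if (currentChar == '}') {
                currentChar = nextChar();               // consume the closing '}'
            }
        }
        else {
            currentChar = nextChar();                   // consume the whitespace character
        }
    }
}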

Pascal-Specific Front End Classes 6/18/2018

The Payoff Now that we have … Source language-independent framework classes Pascal-specific subclasses Mostly just placeholders for now An end-to-end test (the program listing generator) … we can work on the individual components Without worrying (too much) about breaking the rest of the code. _ 6/18/2018

The lexical analyzer converts the source text y := 31 + 28*x into the token stream <id, "y"> <assign, > <num, 31> <+, > <num, 28> <*, > <id, "x"> and passes each token to the parser. A token has a token type and, where needed, a token attribute (tokenval), such as an identifier's name or a number's value.

Front End Framework Classes 6/18/2018

Pascal-Specific Subclasses 6/18/2018

PascalTokenType
Each token type is an enumerated value.

public enum PascalTokenType implements TokenType
{
    // Reserved words.
    AND, ARRAY, BEGIN, CASE, CONST, DIV, DO, DOWNTO, ELSE, END,
    FILE, FOR, FUNCTION, GOTO, IF, IN, LABEL, MOD, NIL, NOT,
    OF, OR, PACKED, PROCEDURE, PROGRAM, RECORD, REPEAT, SET,
    THEN, TO, TYPE, UNTIL, VAR, WHILE, WITH,

    // Special symbols.
    PLUS("+"), MINUS("-"), STAR("*"), SLASH("/"), COLON_EQUALS(":="),
    DOT("."), COMMA(","), SEMICOLON(";"), COLON(":"), QUOTE("'"),
    EQUALS("="), NOT_EQUALS("<>"), LESS_THAN("<"), LESS_EQUALS("<="),
    GREATER_EQUALS(">="), GREATER_THAN(">"), LEFT_PAREN("("), RIGHT_PAREN(")"),
    LEFT_BRACKET("["), RIGHT_BRACKET("]"), LEFT_BRACE("{"), RIGHT_BRACE("}"),
    UP_ARROW("^"), DOT_DOT(".."),

    IDENTIFIER, INTEGER, REAL, STRING, ERROR, END_OF_FILE;

    ...
}

PascalTokenType, cont'd
The static set RESERVED_WORDS contains all of Pascal's reserved word strings in lower case: "and", "array", "begin", etc. We can test whether a token is a reserved word:

if (RESERVED_WORDS.contains(text.toLowerCase())) ...

// Set of lower-cased Pascal reserved word text strings.
public static HashSet<String> RESERVED_WORDS = new HashSet<String>();

static {
    PascalTokenType values[] = PascalTokenType.values();
    for (int i = AND.ordinal(); i <= WITH.ordinal(); ++i) {
        RESERVED_WORDS.add(values[i].getText().toLowerCase());
    }
}

PascalTokenType, cont'd
The static hash table SPECIAL_SYMBOLS contains all of Pascal's special symbols. Each entry's key is the symbol string, such as "<", "=", "<=", etc., and each entry's value is the corresponding enumerated value. We can test whether a token is a special symbol:

if (PascalTokenType.SPECIAL_SYMBOLS
        .containsKey(Character.toString(currentChar))) ...

// Hash table of Pascal special symbols.
// Each special symbol's text is the key to its Pascal token type.
public static Hashtable<String, PascalTokenType> SPECIAL_SYMBOLS =
    new Hashtable<String, PascalTokenType>();

static {
    PascalTokenType values[] = PascalTokenType.values();
    for (int i = PLUS.ordinal(); i <= DOT_DOT.ordinal(); ++i) {
        SPECIAL_SYMBOLS.put(values[i].getText(), values[i]);
    }
}

Pascal-Specific Token Classes
Each class PascalWordToken, PascalNumberToken, PascalStringToken, PascalSpecialSymbolToken, and PascalErrorToken is a subclass of class PascalToken. PascalToken is a subclass of class Token. Each Pascal token subclass overrides the default extract() method of class Token. The default method could only create single-character tokens. Loosely coupled. Highly cohesive.
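As a reminder of that default behavior, here is a minimal sketch of what the single-character extract() in class Token looks like (assuming the currentChar()/nextChar() helpers and the text/value fields used by the token code below):

protected void extract() throws Exception
{
    text  = Character.toString(currentChar());
    value = null;

    nextChar();  // consume the single character
}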

Syntax Diagrams 6/18/2018

Class PascalScanner
The first character determines the type of the next token.

protected Token extractToken() throws Exception
{
    skipWhiteSpace();

    Token token;
    char currentChar = currentChar();

    // Construct the next token.  The current character determines the
    // token type.
    if (currentChar == EOF) {
        token = new EofToken(source);
    }
    else if (Character.isLetter(currentChar)) {
        token = new PascalWordToken(source);
    }
    else if (Character.isDigit(currentChar)) {
        token = new PascalNumberToken(source);
    }
    ...

    return token;
}

Class PascalWordToken

protected void extract() throws Exception
{
    StringBuilder textBuffer = new StringBuilder();
    char currentChar = currentChar();

    // Get the word characters (letter or digit).  The scanner has
    // already determined that the first character is a letter.
    while (Character.isLetterOrDigit(currentChar)) {
        textBuffer.append(currentChar);
        currentChar = nextChar();  // consume character
    }

    text = textBuffer.toString();

    // Is it a reserved word or an identifier?
    type = (RESERVED_WORDS.contains(text.toLowerCase()))
           ? PascalTokenType.valueOf(text.toUpperCase())  // reserved word
           : IDENTIFIER;                                  // identifier
}

Pascal String Tokens
A Pascal string literal constant uses single quotes. Two consecutive single quotes represent a single quote character inside a string: 'Don''t' is the string consisting of the characters Don't. A Pascal character literal constant is simply a string with only a single character, such as 'a'. These are handled by the Pascal token subclass PascalStringToken.
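Here is a minimal sketch (not the book's exact code) of how PascalStringToken.extract() might handle the doubled single quotes, assuming the currentChar()/nextChar() helpers, the EOF constant, and the type/value fields of class Token:

protected void extract() throws Exception
{
    StringBuilder valueBuffer = new StringBuilder();
    char currentChar = nextChar();              // consume the opening quote

    while (currentChar != EOF) {
        if (currentChar == '\'') {
            currentChar = nextChar();           // consume the quote
            if (currentChar != '\'') break;     // a single quote ends the string
            // two quotes in a row: keep one quote character and continue
        }
        valueBuffer.append(currentChar);
        currentChar = nextChar();
    }

    type  = STRING;
    value = valueBuffer.toString();             // "Don't" for the source text 'Don''t'
}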

Pascal Number Tokens
A Pascal integer literal constant is an unsigned integer. A Pascal real literal constant starts with an unsigned integer (the whole part) followed by either:
a decimal point followed by another unsigned integer (the fraction part), or
an E or e, optionally followed by + or -, followed by an unsigned integer (the exponent part), or
a fraction part followed by an exponent part.
Any leading + or - sign before the literal constant is a separate token.
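For illustration only (the real scanner extracts numbers character by character), here is a hypothetical java.util.regex pattern that matches this number syntax, accepting either case of the exponent letter:

import java.util.regex.Pattern;

// Hypothetical pattern: unsigned integer, optional fraction part, optional exponent part.
static final Pattern PASCAL_NUMBER =
    Pattern.compile("\\d+(\\.\\d+)?([Ee][+-]?\\d+)?");

// e.g. PASCAL_NUMBER.matcher("31415.926e-4").matches() returns true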

Class PascalNumberToken
For the token string "31415.926e-4", method extractNumber() passes the following parameter values to method computeFloatValue():
    wholeDigits     "31415"
    fractionDigits  "926"
    exponentDigits  "4"
    exponentSign    '-'
Compute variable exponentValue:
    4 as computed by computeIntegerValue()
    -4 after negation, since exponentSign is '-'
    -7 after subtracting fractionDigits.length()
Compute 31415926 x 10^-7 = 3.1415926. A bit of a hack!
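A simplified sketch (not the book's exact code) of how computeFloatValue() could combine those parts:

private static double computeFloatValue(String wholeDigits, String fractionDigits,
                                        String exponentDigits, char exponentSign)
{
    // Concatenate the whole and fraction digits: "31415926".
    String digits = wholeDigits + fractionDigits;

    // 4, then -4 after negation since exponentSign is '-'.
    int exponentValue = exponentDigits.isEmpty() ? 0 : Integer.parseInt(exponentDigits);
    if (exponentSign == '-') exponentValue = -exponentValue;

    // -7 after subtracting fractionDigits.length().
    exponentValue -= fractionDigits.length();

    // 31415926 x 10^-7 = 3.1415926
    return Double.parseDouble(digits) * Math.pow(10, exponentValue);
}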

Syntax Error Handling
Error handling is a three-step process:
Detection. Detect the presence of a syntax error.
Flagging. Flag the error by pointing it out or highlighting it, and display a descriptive error message.
Recovery. Move past the error and resume parsing.
For now, we'll just move on, starting with the current character, and attempt to extract the next token.
A SYNTAX_ERROR message contains the source line number, the beginning source position, the token text, and the syntax error message.
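Here is a plausible sketch (not necessarily the book's exact method) of what the error handler's flag() method might do with those four items; the names Message, SYNTAX_ERROR, and sendMessage follow the parser code on the next slide:

public void flag(Token token, PascalErrorCode errorCode, Parser parser)
{
    // Send a SYNTAX_ERROR message to the parser's listeners.
    parser.sendMessage(new Message(SYNTAX_ERROR,
        new Object[] {token.getLineNumber(),    // source line number
                      token.getPosition(),      // beginning source position
                      token.getText(),          // token text
                      errorCode.toString()}));  // syntax error message
}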

Class PascalParserTD

public void parse() throws Exception
{
    ...

    // Loop over each token until the end of file.
    while (!((token = nextToken()) instanceof EofToken)) {
        TokenType tokenType = token.getType();

        if (tokenType != ERROR) {
            // Format each token.
            sendMessage(new Message(TOKEN,
                new Object[] {token.getLineNumber(),
                              token.getPosition(),
                              tokenType,
                              token.getText(),
                              token.getValue()}));
        }
        else {
            errorHandler.flag(token, (PascalErrorCode) token.getValue(), this);
        }
    }
}

Program: Pascal Tokenizer
Verify the correctness of the Pascal token subclasses. Verify the correctness of the Pascal scanner.

Can We Build a Better Scanner? Our scanner in the front end is relatively easy to understand and follow. Separate scanner classes for each token type. However, it’s big and slow. Creates lots of objects and makes lots of method calls. We can write a more compact and faster scanner. However, it may be harder to understand and follow. _ 6/18/2018

Regular Expressions
Notation to specify a language. Declarative, sort of like a programming language. Capable of describing the same things as an NFA.
An alphabet Σ is a finite set of symbols (characters).
A string s is a finite sequence of symbols from Σ. |s| denotes the length of string s. ε denotes the empty string, thus |ε| = 0.
A language is a specific set of strings over some fixed alphabet Σ.

String Operations
The concatenation of two strings x and y is denoted by xy.
The exponentiation of a string s is defined by s^0 = ε and s^i = s^(i-1)s for i > 0. For example, if s = ab then s^3 = ababab.
Note that εs = sε = s.

Language Operations
Union: L ∪ M = {s | s ∈ L or s ∈ M}
Concatenation: LM = {xy | x ∈ L and y ∈ M}
Exponentiation: L^0 = {ε}; L^i = L^(i-1)L
Kleene closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ … (the union of L^i over all i ≥ 0)
Positive closure: L+ = L^1 ∪ L^2 ∪ … (the union of L^i over all i ≥ 1)
For example, if L = {a, ab} and M = {c, d}, then LM = {ac, ad, abc, abd}.

Regular Expressions
Basis symbols:
ε is a regular expression denoting the language {ε}.
a ∈ Σ is a regular expression denoting {a}.
If r and s are regular expressions denoting languages L(r) and L(s) respectively, then:
r|s is a regular expression denoting L(r) ∪ L(s)
rs is a regular expression denoting L(r)L(s)
r* is a regular expression denoting (L(r))*
r+ is a regular expression denoting (L(r))+
A language defined by a regular expression is called a regular set.

Regular Definitions
Regular definitions introduce a naming convention:
d1 → r1
d2 → r2
…
dn → rn
where each ri is a regular expression over Σ ∪ {d1, d2, …, di-1}.
Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions.

Examples
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*

Notational Shorthand
The following shorthands are often used:
r+ = rr*
r? = r | ε
[a-z] = a | b | c | … | z
Example: digit → 0 | 1 | … | 9 is simplified to digit → [0-9]

Regular Definitions and Grammars
Grammar:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num
Regular definitions:
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

Finite Automaton

State Diagram for M1
[State diagram for M1: states q1, q2, q3 with transitions on inputs 0 and 1; the diagram itself is not reproduced in this transcript.]

Data Representation for M1

Language
The language of machine M, written as L(M), is the set of all strings that machine M accepts. We say that M recognizes L(M). A language is called a regular language if some finite automaton recognizes it.

Nondeterministic and Deterministic Finite Automata
NFA and DFA stand for nondeterministic finite automaton and deterministic finite automaton, respectively. A nondeterministic finite automaton differs from a deterministic one in that, for a given input symbol, it can transition to more than one state, and it can also make ε (epsilon) transitions that consume no input. Every nondeterministic finite automaton has an equivalent deterministic finite automaton.

Equivalence of FA and RE Finite Automata and Regular Expressions are equivalent.

Deterministic Finite Automata (DFA)
Pascal identifier regular expression: <letter> ( <letter> | <digit> )*
Implement the regular expression with a finite automaton (AKA finite state machine).
[State diagram: the start state takes a letter to a second state, which loops on letter or digit; any other character takes it to the accepting state.]
This automaton is a deterministic finite automaton (DFA). At each state, the next input character uniquely determines which transition to take to the next state.

State-Transition Matrix
Represent the behavior of a DFA by a state-transition matrix: one row per state, one column per input character class. [The matrix for the identifier DFA above appears as an image in the original slide.]
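Since the matrix image is not reproduced in this transcript, here is a hedged reconstruction of what it might look like for the identifier DFA alone, using the conventions of the combined matrix shown two slides later (negative entry = accepting exit, ERR = error state):

private static final int identifierMatrix[][] = {
    /*           letter  digit  other */
    /* start */ {    1,    ERR,   ERR },   // must begin with a letter
    /* word  */ {    1,      1,    -2 },   // letters/digits continue; [other] accepts
};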

DFA for a Pascal Number
[State diagram: states 3 through 12 with transitions on digit, '.', E, +, -, and [other]; see the state-transition matrix on the next slide.]
Note that this diagram allows only an upper-case E for an exponent. What changes are required to also allow a lower-case e?

DFA for a Pascal Identifier or Number

private static final int matrix[][] = {
    /*         letter  digit    +     -     .     E   other */
    /*  0 */  {    1,     4,    3,    3,  ERR,    1,  ERR },
    /*  1 */  {    1,     1,   -2,   -2,   -2,    1,   -2 },
    /*  2 */  {  ERR,   ERR,  ERR,  ERR,  ERR,  ERR,  ERR },
    /*  3 */  {  ERR,     4,  ERR,  ERR,  ERR,  ERR,  ERR },
    /*  4 */  {   -5,     4,   -5,   -5,    6,    9,   -5 },
    /*  5 */  {  ERR,   ERR,  ERR,  ERR,  ERR,  ERR,  ERR },
    /*  6 */  {  ERR,     7,  ERR,  ERR,  ERR,  ERR,  ERR },
    /*  7 */  {   -8,     7,   -8,   -8,   -8,    9,   -8 },
    /*  8 */  {  ERR,   ERR,  ERR,  ERR,  ERR,  ERR,  ERR },
    /*  9 */  {  ERR,    11,   10,   10,  ERR,  ERR,  ERR },
    /* 10 */  {  ERR,    11,  ERR,  ERR,  ERR,  ERR,  ERR },
    /* 11 */  {  -12,    11,  -12,  -12,  -12,  -12,  -12 },
    /* 12 */  {  ERR,   ERR,  ERR,  ERR,  ERR,  ERR,  ERR },
};

Negative numbers in the matrix are the accepting states.
[The slide also shows the combined state diagram: identifier states 1-2 and number states 3-12.]
Notice how the letter E is handled!

A Simple DFA Scanner

public class SimpleDFAScanner
{
    // Input characters.
    private static final int LETTER = 0;
    private static final int DIGIT  = 1;
    private static final int PLUS   = 2;
    private static final int MINUS  = 3;
    private static final int DOT    = 4;
    private static final int E      = 5;
    private static final int OTHER  = 6;

    private static final int ERR = -99999;  // error state

    private static final int matrix[][] = { ... };

    private char ch;     // current input character
    private int  state;  // current state

    ...
}

A Simple DFA Scanner, cont'd

int typeOf(char ch)
{
    return (ch == 'E')             ? E
         : Character.isLetter(ch)  ? LETTER
         : Character.isDigit(ch)   ? DIGIT
         : (ch == '+')             ? PLUS
         : (ch == '-')             ? MINUS
         : (ch == '.')             ? DOT
         :                           OTHER;
}

A Simple DFA Scanner, cont'd

private String nextToken() throws IOException
{
    while (Character.isWhitespace(ch)) nextChar();
    if (ch == 0) return null;  // EOF?

    state = 0;  // start state
    StringBuilder buffer = new StringBuilder();

    while (state >= 0) {  // not an accepting state
        state = matrix[state][typeOf(ch)];  // transit

        if ((state >= 0) || (state == ERR)) {
            buffer.append(ch);  // build the token string
            nextChar();
        }
    }

    return buffer.toString();
}

This is the heart of the scanner. Table-driven scanners can be very fast!

Simple DFA Scanner, cont'd

private void scan() throws IOException
{
    nextChar();

    while (ch != 0) {  // EOF?
        String token = nextToken();

        if (token != null) {
            System.out.print("=====> \"" + token + "\" ");

            String tokenType = (state ==  -2) ? "IDENTIFIER"
                             : (state ==  -5) ? "INTEGER"
                             : (state ==  -8) ? "REAL (fraction only)"
                             : (state == -12) ? "REAL"
                             :                  "*** ERROR ***";

            System.out.println(tokenType);
        }
    }
}

How do we know which token we just got? The final (negative) accepting state tells us.

Compiler-Compilers Professional compiler writers generally do not write scanners and parsers from scratch. A compiler-compiler is a tool for writing compilers. It can include: A scanner generator A parser generator Parse tree utilities General idea: Feed a compiler-compiler a grammar written in a textual form such as BNF or EBNF. The compiler-compiler outputs code written in a high-level language that implements a scanner, parser, and a parse tree. 6/18/2018

Popular Compiler-Compilers Yacc “Yet another compiler-compiler” Generates a bottom-up parser written in C. GNU version: Bison Lex Generates a scanner written in C. GNU version: Flex JavaCC Generates a scanner and a top-down parser written in Java. JJTree Preprocessor for JavaCC grammars. Generates Java code to build and process parse trees. The code generated by a compiler-compiler can be in a high level language such as C or Java. However, you may find the code to be ugly and hard to read. 6/18/2018

JavaCC Compiler-Compiler
Feed JavaCC the grammar for a source language and it will automatically generate a scanner and a parser.
Specify the source language tokens with regular expressions, and JavaCC generates a scanner for the source language.
Specify the source language syntax rules with Extended BNF, and JavaCC generates a parser for the source language.
The generated scanner and parser are written in Java.
Note: JavaCC calls the scanner the "tokenizer".

Download and Install Download and install JavaCC and JJTree http://javacc.java.net/ Also download the examples from the book Generating Parsers with JavaCC http://generatingparserswithjavacc.com/examples.html 6/18/2018

JavaCC Regular Expressions
Literals: <HELLO : "hello">
Character classes: <SPACE_OR_COMMA : [" ", ","]>
Character ranges: <LOWER_CASE : ["a"-"z"]>
Alternates: <UPPER_OR_LOWER : ["A"-"Z"] | ["a"-"z"]>
In each definition, the name before the colon is the token name and the expression after it describes the token string.

JavaCC Regular Expressions (cont'd)
Negation: <NO_DIGITS : ~["0"-"9"]>
Repetition: <THREE_A : ("a"){3}>  <TWO_TO_FOUR_A : ("a"){2,4}>
Quantifiers: <ONE_OR_MORE_A : ("a")+>  <ZERO_OR_ONE_SIGN : (["+", "-"])?>  <ZERO_OR_MORE_DIGITS : (["0"-"9"])*>

Example: helloworld1.jj
Recognize the tokens "hello" and "world", defined as literals. TOKEN_MGR_DECLS contains the main Java method of the tokenizer.

options { BUILD_PARSER=false; }

PARSER_BEGIN(HelloWorld)
public class HelloWorld {}
PARSER_END(HelloWorld)

TOKEN_MGR_DECLS : {
    public static void main(String[] args)
    {
        java.io.StringReader sr = new java.io.StringReader(args[0]);
        SimpleCharStream scs = new SimpleCharStream(sr);
        HelloWorldTokenManager mgr = new HelloWorldTokenManager(scs);

        for (Token t = mgr.getNextToken(); t.kind != EOF; t = mgr.getNextToken()) {
            debugStream.println("Found token: " + t.image);
        }
    }
}

TOKEN : {
    <HELLO : "hello">
  | <WORLD : "world">
}

Example: helloworld2.jj
Recognize the tokens "hello" and "world", ignoring case. Ignore spaces and commas between tokens (a SKIP rule with a character class). Use lexical actions.

options { BUILD_PARSER=false; IGNORE_CASE=true; }

PARSER_BEGIN(HelloWorld)
public class HelloWorld {}
PARSER_END(HelloWorld)

TOKEN_MGR_DECLS : {
    public static void main(String[] args)
    {
        java.io.StringReader sr = new java.io.StringReader(args[0]);
        SimpleCharStream scs = new SimpleCharStream(sr);
        HelloWorldTokenManager mgr = new HelloWorldTokenManager(scs);

        while (mgr.getNextToken().kind != EOF) {}
    }
}

SKIP : {
    <IGNORE : [" ", ","]>
}

TOKEN : {
    <HELLO : "hello"> { debugStream.println("HELLO token: " + matchedToken.image); }
  | <WORLD : "world"> { debugStream.println("WORLD token: " + matchedToken.image); }
}

Example: helloworld3.jj options { BUILD_PARSER=false; IGNORE_CASE=true; } PARSER_BEGIN(HelloWorld) public class HelloWorld {} PARSER_END(HelloWorld) TOKEN_MGR_DECLS : { public static void main(String[] args) { ... SKIP : { <IGNORE : [" ", ","]> TOKEN : { <HELLO : "hello"> { debugStream.println("HELLO token: " + matchedToken.image); } | <WORLD : "world"> { debugStream.println("WORLD token: " + matchedToken.image); } | <IDENTIFIER : ["a"-"z"](["a"-"z", "0"-"9", "_"])*> { debugStream.println("IDENTIFIER token: " + matchedToken.image); } Recognize words other than “hello” and “world” as identifiers. Why aren’t “hello” and “world” recognized as identifiers? 6/18/2018

Example: helloworld4.jj options { BUILD_PARSER=false; IGNORE_CASE=true; } PARSER_BEGIN(HelloWorld) public class HelloWorld {} PARSER_END(HelloWorld) TOKEN_MGR_DECLS : { public static void main(String[] args) { ... SKIP : { <IGNORE : [" ", ","]> TOKEN : { <HELLO : "hello"> { debugStream.println("HELLO token: " + matchedToken.image); } | <WORLD : "world"> { debugStream.println("WORLD token: " + matchedToken.image); } | <IDENTIFIER : <LETTER> (<LETTER> | <DIGIT> | "_")*> { debugStream.println("IDENTIFIER token: " + matchedToken.image); } | <#DIGIT : ["0"-"9"]> | <#LETTER : ["a"-"z"]> Use private tokens DIGIT and LETTER. Private tokens 6/18/2018

Example: helloworld5.jj options { BUILD_PARSER=false; IGNORE_CASE=true; DEBUG_TOKEN_MANAGER=true; } PARSER_BEGIN(HelloWorld) public class HelloWorld {} PARSER_END(HelloWorld) TOKEN_MGR_DECLS : { public static void main(String[] args) { ... SKIP : { <IGNORE : [" ", ","]> TOKEN : { <HELLO : "hello"> { debugStream.println("HELLO token: " + matchedToken.image); } | <WORLD : "world"> { debugStream.println("WORLD token: " + matchedToken.image); } | <IDENTIFIER : <LETTER> (<LETTER> | <DIGIT> | "_")*> { debugStream.println("IDENTIFIER token: " + matchedToken.image); } | <#DIGIT : ["0"-"9"]> | <#LETTER : ["a"-"z"]> What’s the tokenizer doing? Turn on debugging output. 6/18/2018

The JavaCC Eclipse Plug-in
How to install the plug-in:
Start Eclipse.
Help → Install New Software ...
Click the Add button.
Name: SF Eclipse JavaCC
Location: http://eclipse-javacc.sourceforge.net/
Click the OK button.
Select "SF JavaCC Eclipse Plug-in".
Or, visit http://sourceforge.net/projects/eclipse-javacc/

Review: DFA for a Pascal Identifier or Number
This review slide repeats the state-transition matrix and state diagram shown earlier (see "DFA for a Pascal Identifier or Number" above). Negative numbers in the matrix are accepting states.

The JavaCC Tokenizer
The JavaCC tokenizer (scanner) is a DFA created from the regular expressions. The generated Java code implements the state transitions with switch statements instead of a matrix. Example (from HelloWorldTokenManager.java):

static private int jjMoveStringLiteralDfa0_0()
{
    switch(curChar)
    {
        case 72:    // 'H'
        case 104:   // 'h'
            return jjMoveStringLiteralDfa1_0(0x4L);
        case 87:    // 'W'
        case 119:   // 'w'
            return jjMoveStringLiteralDfa1_0(0x8L);
        default :
            return jjMoveNfa_0(0, 0);
    }
}