Tokenizers 26-Apr-19.

Slides:



Advertisements
Similar presentations
1 StringBuffer & StringTokenizer Classes Chapter 5 - Other String Classes.
Advertisements

Code Generator Translator Architecture Parser Tokenizer string of characters (source code) string of tokens abstract program string of integers (object.
Chapter 7 Strings F To process strings using the String class, the StringBuffer class, and the StringTokenizer class. F To use the String class to process.
Chapter 7 Strings F Processing strings using the String class, the StringBuffer class, and the StringTokenizer class. F Use the String class to process.
23-Apr-15 Abstract Data Types. 2 Data types I We type data--classify it into various categories-- such as int, boolean, String, Applet A data type represents.
Arrays Liang, Chpt 5. arrays Fintan Array of chars For example, a String variable contains an array of characters: An array is a data structure.
14-Jun-15 State Machines. 2 What is a state machine? A state machine is a different way of thinking about computation A state machine has some number.
©The McGraw-Hill Companies, Inc. Permission required for reproduction or display. 4 th Ed Chapter Chapter 9 Characters and Strings (sections ,
String Tokenization What is String Tokenization?
23-Jun-15 Strings, Etc. Part I: String s. 2 About Strings There is a special syntax for constructing strings: "Hello" Strings, unlike most other objects,
StringBuffer class  Alternative to String class  Can be used wherever a string is used  More flexible than String  Has three constructors and more.
25-Jun-15 State Machines. 2 What is a state machine? A state machine is a different way of thinking about computation A state machine has some number.
Strings, Etc. Part I: Strings. About Strings There is a special syntax for constructing strings: "Hello" Strings, unlike most other objects, have a defined.
Lecture 2: Variables and Expressions Yoni Fridman 6/29/01 6/29/01.
Chapter 91 Streams and File I/O Chapter 9. 2 Reminders Project 6 released: due Nov 10:30 pm Project 4 regrades due by midnight tonight Discussion.
©The McGraw-Hill Companies, Inc. Permission required for reproduction or display. 4 th Ed Chapter Chapter 9 Characters and Strings (sections ,
14-Jul-15 State Machines Abbreviated lecture. 2 What is a state machine? A state machine is a different way of thinking about computation A state machine.
Introduction to Programming David Goldschmidt, Ph.D. Computer Science The College of Saint Rose Java Fundamentals (Comments, Variables, etc.)
Chapter 7 Strings  Use the String class to process fixed strings.  Use the StringBuffer class to process flexible strings.  Use the StringTokenizer.
Chapter 9-Text File I/O. Overview n Text File I/O and Streams n Writing to a file. n Reading from a file. n Parsing and tokenizing. n Random Access n.
CompSci 100E 2.1 Java Basics - Expressions  Literals  A literal is a constant value also called a self-defining term  Possibilities: o Object: null,
COMP Parsing 3 of 4 Lectures 23. Using the Scanner Break input into tokens Use Scanner with delimiter: public void parse(String input ) { Scanner.
13-Nov-1513-Nov-1513-Nov-15 State Machines. What is a state machine? A state machine is a different way of thinking about computation A state machine.
Introduction to Java Lecture Notes 3. Variables l A variable is a name for a location in memory used to hold a value. In Java data declaration is identical.
Java Programming Java Basics. Data Types Java has two main categories of data types: –Primitive data types Built in data types Many very similar to C++
Tokenizers 29-Nov-15. Tokens A tokenizer is a program that extracts tokens from an input stream A token is a “word” or a significant punctuation mark.
Strings and Text File I/O (and Exception Handling) Corresponds with Chapters 8 and 17.
1 CHAPTER 3 StringTokenizer. 2 StringTokenizer CLASS There are BufferedReader methods to read a line (i.e. a record) and a character, but not just a single.
CS 115 OBJECT ORIENTED PROGRAMMING I LECTURE 9 GEORGE KOUTSOGIANNAKIS Copyright: 2014 Illinois Institute of Technology- George Koutsogiannakis 1.
Sections © Copyright by Pearson Education, Inc. All Rights Reserved.
19-Dec-15 Tokenizers. Tokens A tokenizer is a program that extracts tokens from an input stream A token has two parts: Its value—this is just the characters.
Compiler Construction By: Muhammad Nadeem Edited By: M. Bilal Qureshi.
 In the java programming language, a keyword is one of 50 reserved words which have a predefined meaning in the language; because of this,
© 2006 Pearson Addison-Wesley. All rights reserved 1-1 Chapter 1 Review of Java Fundamentals.
© 2011 Pearson Education, publishing as Addison-Wesley Chapter 1: Computer Systems Presentation slides for Java Software Solutions for AP* Computer Science.
© 2006 Pearson Addison-Wesley. All rights reserved 1-1 Chapter 1 Review of Java Fundamentals.
C++ Lesson 1.
Basic concepts of C++ Presented by Prof. Satyajit De
CIS3931 – Intro to JAVA Lecture Note Set 2 17-May-05.
Working with Java.
Chapter 2 Scanning – Part 1 June 10, 2018 Prof. Abdelaziz Khamis.
Building Java Programs
OBJECT ORIENTED PROGRAMMING II LECTURE 2 GEORGE KOUTSOGIANNAKIS
Variables and Arithmetic Operators in JavaScript
University of Central Florida COP 3330 Object Oriented Programming
Introduction to C++ October 2, 2017.
Control Structures – Selection
Chapter 7: Strings and Characters
MSIS 655 Advanced Business Applications Programming
null, true, and false are also reserved.
Introduction to Java Programming
(and a review of switch statements)
(and a review of switch statements)
An overview of Java, Data types and variables
Building Java Programs
Building Java Programs
Building Java Programs
OBJECT ORIENTED PROGRAMMING I LECTURE 9 GEORGE KOUTSOGIANNAKIS
Tokenizers 25-Feb-19.
2. Second Step for Learning C++ Programming • Data Type • Char • Float
State Machines 6-Apr-196-Apr-19.
The Recursive Descent Algorithm
Comments Any string of symbols placed between the delimiters /* and */. Can span multiple lines Can’t be nested! Be careful. /* /* /* Hi */ is an example.
Tokenizers 3-May-19.
State Machines 8-May-19.
Just Enough Java 17-May-19.
State Machines 16-May-19.
Problem 1 Given n, calculate 2n
Building Java Programs
Presentation transcript:

Tokenizers 26-Apr-19

Tokens A tokenizer is a program that extracts tokens from an input stream A token has two parts: Its value Its kind, or type For example, if we tokenize "while (x >= 0)" we might get these tokens: "while", keyword "(", punctuation "x", name ">=", operator "0", integer ")", punctuation

Tokenizers as state machines Tokenizers can be implemented as state machines, but with these important differences: To succeed (recognize a token), the tokenizer does not have to reach the end of input; it only has to reach a final state When the tokenizer returns a token, the remainder of the input string is kept for use in getting the remaining tokens Tokenizers are almost always implemented as state machines We’ll do a quick tokenizer to recognize tokens in arithmetic expressions: Integers (digits only) Variables (letters and digits, starting with a letter) Operators, + - * / % Parentheses, ( ) Errors (anything not in the above list)

TokenType public enum TokenType { INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR; }

Token public class Token { private TokenType type; private String value; public Token(TokenType type, String value) { this.type = type; this.value = value; } public TokenType getType() { return type; } public String getValue() { return value; } }

Additions to the Token class For my JUnit testing, I needed to ask whether my Tokenizer was returning the correct Tokens public boolean equals(Object object) { Token that = (Token)object; return this.type == that.type && this.value.equals(that.value); } Since my tests were failing, I wanted to see what tokens I was actually getting public String toString() { return value + ":" + type; }

The constructor and hasNext() public class Tokenizer { private String input; private int position; public Tokenizer(String input) { this.input = input.trim() + " "; // to simplify getting last token position = -1; } public boolean hasNext() { return position < input.length() - 2; } public Token next() { ... } }

The shell of next() public class Tokenizer { private enum States { READY, IN_NUMBER, IN_VARIABLE, ERROR }; public Token next() { States state; String value = ""; if (!hasNext()) { throw new IllegalStateException("No more tokens!"); } state = States.READY; while ((++position) < input.length()) { char ch = input.charAt(position); switch (state) { case READY: { ... } case IN_VARIABLE: { ... } case IN_NUMBER: { ... } default: { ... } return new Token(TokenType.ERROR, value); } } assert false; // should never get here return null; } }

The READY state case READY: value = ch + ""; if (Character.isWhitespace(ch)) break; if ("()".contains(ch + "")) { return new Token(TokenType.PARENTHESIS, value); } if ("+-*/%".contains(ch + "")) { return new Token(TokenType.OPERATOR, value); } if (Character.isLetter(ch)) { state = States.IN_VARIABLE; break; } if (Character.isDigit(ch)) { state = States.IN_NUMBER; break; } return new Token(TokenType.ERROR, value);

The IN_NUMBER state case IN_NUMBER: if (Character.isDigit(ch)) { value += ch; break; } else { position--; // save char for next time return new Token(TokenType.INTEGER, value); }

The IN_VARIABLE state case IN_VARIABLE: if (Character.isLetter(ch) || Character.isDigit(ch)) { value += ch; break; } else { position--; // save char for next time return new Token(TokenType.VARIABLE, value); }

The default case default: return new Token(TokenType.ERROR, value);

java.util.StringTokenizer StringTokenizer is a trivial tokenizer provided by Sun Everything is either a “token” or a “delimiter” The most important methods are hasMoreTokens() and nextToken() There are three constructors: StringTokenizer(String str) Delimiters are whitespace characters; any sequence of non-whitespace characters is returned as a token StringTokenizer(String str, String delim) Same as above, except you get to specify which characters are delimiters StringTokenizer(String str, String delim, boolean returnDelims) Same as above, except you get to say you also want the delimiters returned as tokens

java.io.StreamTokenizer StreamTokenizer is a much more powerful (and much more complex) tokenizer It is basically capable of tokenizing C and Java programs, including integers, doubles, and comments There are a large number of possible settings, so that the tokenizer can be customized The constructor is StreamTokenizer(Reader r), where Reader is an abstract class for reading character streams The most important method is int nextToken(), where the returned int tells you what kind of token it found Once you know what kind of token has been found, you access fields of the tokenizer to get its value I’m not going to cover StreamTokenizer in my lectures All the details are in the Java API You may want to use StreamTokenizer in subsequent assignments

The End