Lecture 19: Strings and Regular Expressions (D&D Chapter 14)
Goals
By the end of this lesson, you should be able to:
- compare strings and extract substrings
- use regular expressions to check whether a string contains certain patterns
- use regular expression-based pattern replacement
Substring extraction
Outline: Substring extraction, String comparison, Search, Regular expressions, Pattern and Match, Summary

    import java.util.Scanner;
    import java.io.PrintStream;

    public class TestRoomDemo {
        public static void main(String[] args) {
            PrintStream p = System.out; // Lazy programmers...
            Scanner s = new Scanner(System.in);
            p.println("Please enter your first name: ");
            String firstName = s.next();
            p.println("Please enter your last name: ");
            String lastName = s.next();
            String upi = firstName.substring(0, 1) + lastName.substring(0, 3) + "123";
            upi = upi.toLowerCase();
            p.println("Your UPI could be: " + upi);
            …

The first parameter to substring() is the index of the first character we want. The second parameter is the index of the character after the last character we want. For example, lastName.substring(0, 3) returns the characters at indices 0, 1 and 2.
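To make the index rule concrete, here is a minimal sketch. The class SubstringDemo and the helper makeUpi() are hypothetical names introduced for this illustration; only substring() and toLowerCase() come from the slide.

```java
class SubstringDemo {
    // Hypothetical helper: builds a UPI-like string the same way the slide does.
    static String makeUpi(String firstName, String lastName) {
        // substring(0, 1) keeps index 0 only; substring(0, 3) keeps indices 0, 1 and 2.
        String upi = firstName.substring(0, 1) + lastName.substring(0, 3) + "123";
        return upi.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(makeUpi("Ada", "Lovelace"));   // alov123
        System.out.println("Lovelace".substring(2, 5));   // vel (indices 2, 3 and 4)
    }
}
```

Note that substring(0, 3) throws a StringIndexOutOfBoundsException if the last name has fewer than three characters, so a real program would need to check the length first.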
String comparison

    if (upi.compareTo("whsu014") == 0) {
        p.println("You're William and meant to supervise MLT3");
    } else if (upi.compareTo("csee015") <= 0) {
        p.println("You should have been in 206-220.");
    }
    …
    else if (upi.compareTo("vwon320") >= 0) {
        p.println("You should have been in 423-348.");
    } else {
        p.println("Were you really meant to sit the test?");
    }

The compareTo() method returns 0 if its parameter equals the string that the method is called on, a negative value if the string precedes the parameter in lexicographic order, and a positive value if the string comes after the parameter. The comparison is based on character codes, so all uppercase letters sort before all lowercase letters.
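The sign convention is easy to get backwards, so here is a small sketch. The class CompareToDemo and the helper sign() are hypothetical; Integer.signum() is used only to reduce the result to -1, 0 or 1, because the exact non-zero value returned by compareTo() should not be relied on.

```java
class CompareToDemo {
    // Hypothetical helper: reduces a.compareTo(b) to -1, 0 or 1.
    static int sign(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    public static void main(String[] args) {
        System.out.println(sign("csee015", "csee015")); // 0: equal strings
        System.out.println(sign("abcd001", "csee015")); // -1: receiver comes first
        System.out.println(sign("whsu014", "csee015")); // 1: receiver comes later
        System.out.println(sign("Zebra", "apple"));     // -1: 'Z' has a lower code than 'a'
    }
}
```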
Searching in strings

    …
    else if (upi.compareTo("jpet145") <= 0) {
        p.print("You should have been in ");
        String theatre = "HSB370/201N-370";
        int slashPos = theatre.indexOf("/");
        p.println(theatre.substring(slashPos + 1));
    }

The indexOf() method finds the first occurrence of its argument in the string that we invoke the method on, and returns -1 if there is none. If we need a subsequent occurrence, we can add a second (int) parameter that gives the index at which to start the search.
Note that we can also use the substring() method with only one parameter. In this case, we get the substring from the given position to the end of the string.
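The two-parameter form of indexOf() can be sketched as follows. The class IndexOfDemo and the helper secondIndexOf() are hypothetical names for this illustration; the theatre string is the one from the slide.

```java
class IndexOfDemo {
    // Hypothetical helper: index of the second occurrence of target, or -1 if there is none.
    static int secondIndexOf(String text, String target) {
        int first = text.indexOf(target);
        if (first < 0) {
            return -1;
        }
        // Restart the search one position after the first hit.
        return text.indexOf(target, first + 1);
    }

    public static void main(String[] args) {
        String theatre = "HSB370/201N-370";
        int slashPos = theatre.indexOf("/");
        System.out.println(slashPos);                          // 6
        System.out.println(theatre.substring(slashPos + 1));   // 201N-370
        System.out.println(theatre.indexOf("370"));            // 3
        System.out.println(secondIndexOf(theatre, "370"));     // 12
    }
}
```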
Regular expressions

    public static void main(String[] args) {
        PrintStream p = System.out; // Lazy programmers...
        Scanner s = new Scanner(System.in);
        p.println("Please enter a UPI to check: ");
        String upiCandidate = s.next();
        p.println("You entered \"" + upiCandidate + "\".");
        if (upiCandidate.matches("^[a-z]{3,4}\\d{3}$")) {
            p.println("Looks like a UPI to me!");
        } else {
            p.println("Sorry, that's not a UPI.");
        }
    }

The matches() method returns true if the regular expression that is passed as the parameter matches the whole of the string that the method is called on, and false if it doesn't.
Regular expressions (regexes) are a very powerful way of checking the format of strings, or finding whether a string contains a particular type of substring.
Regular expressions
Regular expressions in Java are just packaged in strings. This means that we also need to escape backslashes (\):

    …
    if (upiCandidate.matches("^[a-z]{3,4}\\d{3}$")) {

So the actual regular expression here is:

    ^[a-z]{3,4}\d{3}$

Translated into English, this means:
- ^: the pattern to match must start at the beginning of the string (note: redundant with matches(), which always matches the whole string)
- [a-z]: a lowercase character from a to z
- {3,4}: the previous character pattern (or subpattern in parentheses) occurring 3 to 4 times
- \d: a digit from 0 to 9
- {3}: the previous character pattern (or subpattern in parentheses) occurring 3 times
- $: the pattern to match must end at the end of the string (note: redundant with matches(), for the same reason)
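As a quick check of this reading, the sketch below tries the UPI pattern on a few inputs, with and without the anchors. The class UpiCheck and the helper looksLikeUpi() are hypothetical names for this illustration.

```java
class UpiCheck {
    // Hypothetical helper: 3 to 4 lowercase letters followed by exactly 3 digits.
    // No ^ or $ needed, because matches() always tests the whole string.
    static boolean looksLikeUpi(String s) {
        return s.matches("[a-z]{3,4}\\d{3}");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUpi("whsu014"));   // true
        System.out.println(looksLikeUpi("ab123"));     // false: only two letters
        System.out.println(looksLikeUpi("whsu0145"));  // false: four digits
        System.out.println(looksLikeUpi("WHSU014"));   // false: uppercase letters
        // The anchored version from the slide gives the same answers.
        System.out.println("whsu0145".matches("^[a-z]{3,4}\\d{3}$")); // false
    }
}
```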
Useful regex examples
A NZ car number plate:

    ^[A-Z]{2,3}\d{1,4}$

Note this also matches plates such as ABC1234, which it really shouldn't. Better:

    ^(([A-Z]{2}\d{1,4})|([A-Z]{3}\d{1,3}))$

The parentheses form subpatterns, and the | means OR:
Start of the string followed by a parenthesised pattern followed by the end of the string. The parenthesised pattern consists of one of two alternative subpatterns, also parenthesised:
- Two uppercase characters followed by 1-4 digits
OR
- Three uppercase characters followed by 1-3 digits
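The difference between the two plate patterns can be sketched like this. The class PlateCheck and its constant and method names are hypothetical; the two regexes are the ones given above, written as Java string literals.

```java
class PlateCheck {
    // The two patterns from the slide, with backslashes doubled for Java strings.
    static final String LOOSE = "^[A-Z]{2,3}\\d{1,4}$";
    static final String STRICT = "^(([A-Z]{2}\\d{1,4})|([A-Z]{3}\\d{1,3}))$";

    static boolean looseMatch(String plate) {
        return plate.matches(LOOSE);
    }

    static boolean strictMatch(String plate) {
        return plate.matches(STRICT);
    }

    public static void main(String[] args) {
        System.out.println(looseMatch("ABC1234"));   // true: the unwanted case
        System.out.println(strictMatch("ABC1234"));  // false
        System.out.println(strictMatch("AB1234"));   // true: two letters, four digits
        System.out.println(strictMatch("ABC123"));   // true: three letters, three digits
    }
}
```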
Useful regex examples
A COMPSCI course number:

    ^COMPSCI\d{3}$

An IPv4 network address (four integer numbers between 0 and 255 separated by dots):

    ^(\d{1,3}\.){3}\d{1,3}$

This also matches nonsensical addresses such as 999.000.0.555 though. Better (all in one line):

    ^((\d|([1-9]\d)|(1\d{2})|(2[0-4]\d)|(25[0-5]))\.){3}(\d|([1-9]\d)|(1\d{2})|(2[0-4]\d)|(25[0-5]))$

A \. means "match a dot". We need a backslash escape here because a dot on its own means "match any character".
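Long patterns like the IPv4 one are easier to read when the repeated part is built once. The following is a sketch under that assumption; the class Ipv4Check and the names OCTET, STRICT, LOOSE and isIpv4() are hypothetical, and the resulting pattern is the same as the one-line version above.

```java
class Ipv4Check {
    // One number from 0 to 255 without leading zeros: the repeated part of the slide's pattern.
    static final String OCTET = "(\\d|([1-9]\\d)|(1\\d{2})|(2[0-4]\\d)|(25[0-5]))";
    // Three "octet plus dot" groups followed by a final octet.
    static final String STRICT = "^(" + OCTET + "\\.){3}" + OCTET + "$";
    // The simpler pattern from the slide, which accepts any 1 to 3 digits per part.
    static final String LOOSE = "^(\\d{1,3}\\.){3}\\d{1,3}$";

    static boolean isIpv4(String s) {
        return s.matches(STRICT);
    }

    public static void main(String[] args) {
        System.out.println("999.000.0.555".matches(LOOSE)); // true: the unwanted case
        System.out.println(isIpv4("999.000.0.555"));        // false
        System.out.println(isIpv4("192.168.0.1"));          // true
        System.out.println(isIpv4("255.255.255.255"));      // true
        System.out.println(isIpv4("256.1.1.1"));            // false
    }
}
```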
Regex multipliers
A "?" means that the previous character / parenthesised pattern may occur 0 or 1 time. Match "user" or "users":

    ^users?$

A "+" means "one or more". Match a non-empty string:

    .+

A "*" means "any number of times" (including zero), and a "[^x]" means "anything but x". Match a pair of curly braces in a string:

    \{[^\}]*\}

Backslash escapes are generally required for characters that are part of regular expression syntax, but may often be omitted when the syntax element would make no sense in the position otherwise. E.g., here the pattern has no unescaped opening curly brace that could start a repetition count, so we need not worry about escaping closing braces:

    \{[^}]*}
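A short sketch of the three multipliers in use. The class QuantifierDemo and the helper maskBraces() are hypothetical; the brace example uses replaceAll() rather than matches() so that the pattern can occur anywhere inside a longer string.

```java
class QuantifierDemo {
    // Hypothetical helper: replaces every {...} group in text with "<>".
    static String maskBraces(String text) {
        // \{ is a literal opening brace, [^}]* is any run of characters other than },
        // and the final } is literal because no repetition count is open.
        return text.replaceAll("\\{[^}]*}", "<>");
    }

    public static void main(String[] args) {
        System.out.println("user".matches("users?"));    // true: zero s at the end
        System.out.println("users".matches("users?"));   // true: one s at the end
        System.out.println("userss".matches("users?"));  // false: two s at the end
        System.out.println("".matches(".+"));            // false: + needs at least one character
        System.out.println(maskBraces("a {b} c {} d"));  // a <> c <> d
    }
}
```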
Regex macros
We have already met "\d", which represents "a digit". There are more:
- \n, \t, \r are the usual backslash escapes for newline, tab and carriage return.
- \w is a "word character": any letter, digit or underscore (anything you might find in a Java variable name!)
- \s is any whitespace character
- \D means "any character that is not a digit"
- \W means "any character that is not a word character"
- \S means "anything that isn't whitespace of sorts"
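A few of these macros in action, as a minimal sketch. The class PredefinedClassDemo and both helper names are hypothetical. Remember that each backslash has to be doubled inside a Java string literal.

```java
class PredefinedClassDemo {
    // Hypothetical helper: removes every character that is not a digit.
    static String digitsOnly(String text) {
        return text.replaceAll("\\D", "");
    }

    // Hypothetical helper: collapses each run of whitespace into a single space.
    static String squeezeWhitespace(String text) {
        return text.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println("var_1".matches("\\w+"));       // true: letters, digit, underscore
        System.out.println("two words".matches("\\w+"));   // false: the space is not a word character
        System.out.println(digitsOnly("Tel: 373-7599"));   // 3737599
        System.out.println(squeezeWhitespace("a \t\n b")); // a b
    }
}
```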
Regex character classes
We have already met these in the UPI and IP address examples. Further examples:
- [a-zA-Z]: any uppercase or lowercase alphabet character
- [a-c]: any lowercase character a to c
- [^xyz]: anything but x, y, or z
- [bcdfjlqsvxyz]: any character listed (you can use this class to see whether a word could be from te reo Māori, where these letters don't occur)
- [-abc]: a, b, c or a hyphen
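The te reo Māori check mentioned above could be sketched as follows. The class CharClassDemo and the helper couldBeMaori() are hypothetical, and the test is only a necessary condition: it rules words out, but passing it does not prove that a word is Māori.

```java
class CharClassDemo {
    // Hypothetical helper: false if the word contains any of the letters listed on the slide.
    static boolean couldBeMaori(String word) {
        // .* on both sides lets the listed letter appear anywhere in the word.
        return !word.toLowerCase().matches(".*[bcdfjlqsvxyz].*");
    }

    public static void main(String[] args) {
        System.out.println(couldBeMaori("aroha"));     // true: none of the listed letters
        System.out.println(couldBeMaori("school"));    // false: contains s, c and l
        System.out.println("B".matches("[a-zA-Z]"));   // true
        System.out.println("d".matches("[a-c]"));      // false: d is outside the range
        System.out.println("-".matches("[-abc]"));     // true: a leading hyphen is literal
        System.out.println("x".matches("[^xyz]"));     // false: x is excluded
    }
}
```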
More on regexes
This was just a rough introduction. More under:
https://docs.oracle.com/javase/tutorial/essential/regex/index.html
Test your own regexes on your own example strings with the RegexChecker lecture example.
Regex replaceAll()

    String input = "We have 4,999 apples at only $4.99 a kg. …";
    String newPrice = "5.49";
    input = input.replaceAll("4\\.99", newPrice);
    System.out.println(input);

The replaceAll() method returns a string in which every substring that matches the regular expression in the first parameter is replaced by the string in the second parameter. The escaped dot matters here: an unescaped dot would match any character, so the pattern would also match the "4,99" in "4,999". See also replaceFirst().
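The sketch below contrasts the escaped and unescaped dot and shows replaceFirst(). The class ReplaceDemo, its method names and the shortened sample sentence are assumptions made for this illustration. It also uses Matcher.quoteReplacement(), which is not covered on the slide: in the replacement string, $ and \ have a special meaning, so a replacement that contains them literally needs to be quoted.

```java
import java.util.regex.Matcher;

class ReplaceDemo {
    // Escaped dot: only the literal text "4.99" is replaced.
    static String updatePrice(String text) {
        return text.replaceAll("4\\.99", "5.49");
    }

    // Unescaped dot: "4", then ANY character, then "99" is replaced.
    static String updatePriceUnescaped(String text) {
        return text.replaceAll("4.99", "5.49");
    }

    public static void main(String[] args) {
        String input = "We have 4,999 apples at only $4.99 a kg.";
        System.out.println(updatePrice(input));
        // We have 4,999 apples at only $5.49 a kg.
        System.out.println(updatePriceUnescaped(input));
        // We have 5.499 apples at only $5.49 a kg.
        System.out.println("4.99 and 4.99".replaceFirst("4\\.99", "5.49"));
        // 5.49 and 4.99
        // A replacement containing a literal $ must be quoted.
        System.out.println(input.replaceAll("\\$4\\.99", Matcher.quoteReplacement("$5.49")));
        // We have 4,999 apples at only $5.49 a kg.
    }
}
```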
Pattern and Matcher
A regular expression specified as a string needs to be compiled into a Pattern object before it can be used. Together with the string that is to be matched, the Pattern is then used to generate a Matcher object that takes care of the actual matching.
In the case of the matches() and replaceAll() methods etc., the methods perform these two steps internally for us. If we need to match multiple times with the same expression, this is inefficient, because the expression is recompiled on every call.
In these cases, it is better to pre-compile the expression into a Pattern object and re-use it.
A Pattern object also allows more flexible matching with various flags that let us modify the matching behaviour.
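A minimal sketch of the pre-compiled approach, using java.util.regex.Pattern and Matcher. The class PatternDemo, the constant names and the two helpers are hypothetical. The second pattern uses the Pattern.CASE_INSENSITIVE flag as one example of the flags mentioned above, and find() to locate matches inside a longer string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternDemo {
    // Compiled once and re-used for every call.
    private static final Pattern UPI = Pattern.compile("[a-z]{3,4}\\d{3}");
    private static final Pattern UPI_ANY_CASE =
            Pattern.compile("[a-z]{3,4}\\d{3}", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: whole-string check, like String.matches().
    static boolean isUpi(String s) {
        return UPI.matcher(s).matches();
    }

    // Hypothetical helper: collects every UPI-like substring, ignoring case.
    // Word boundaries are not enforced, so a match may sit inside a longer word.
    static List<String> findUpis(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = UPI_ANY_CASE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(isUpi("whsu014"));  // true
        System.out.println(isUpi("WHSU014"));  // false: this pattern is case-sensitive
        System.out.println(findUpis("Contact whsu014 or CSEE015 today"));
        // [whsu014, CSEE015]
    }
}
```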
What do we know
- We can extract substrings with substring(), find substrings with indexOf(), and compare strings alphabetically with compareTo().
- Regular expressions are a powerful way to search for and manipulate complex patterns in strings.
- Regular expression syntax means that syntax characters must be backslash-escaped if they are meant to represent their literal character.
- In a Java string, the backslash from the escape needs a second backslash!
- If we want to use a regular expression many times, we should use a Pattern object.
Resources & Homework
- D&D Chapter 14
- https://docs.oracle.com/javase/tutorial/essential/regex/
- https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
Homework: Write a Java program that takes a String input from the console and checks whether it is a UPI (3 to 4 letters followed by 3 digits), an AUID (student ID number, 7 or 9 digits), a name, or an e-mail address. Names for this purpose can contain any letters from the English alphabet, apostrophes or hyphens between letters, and spaces between parts of the name.
Next Lecture File I/O (Chapter 15)