9-Sep-15 Regular Expressions. About “Regular” Expressions In a theory course you should have learned about regular expressions Regular expressions describe.

Slides:



Advertisements
Similar presentations
Python: Regular Expressions
Advertisements

Regular Expression Original Notes by Song Guo. What Regular Expressions Are Exactly - Terminology a regular expression is a pattern describing a certain.
ISBN Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl –(on reserve.
176 Formal Languages and Applications: We know that Pascal programming language is defined in terms of a CFG. All the other programming languages are context-free.
1 Regular Expressions & Automata Nelson Padua-Perez Bill Pugh Department of Computer Science University of Maryland, College Park.
13-Jun-15 Regular Expressions in Java. 2 Regular Expressions A regular expression is a kind of pattern that can be applied to text ( String s, in Java)
Regular Expressions in Java. Namespace in XML Transparency No. 2 Regular Expressions Regular expressions are an extremely useful tool for manipulating.
Regular Expressions in Java. Regular Expressions A regular expression is a kind of pattern that can be applied to text ( String s, in Java) A regular.
CS 330 Programming Languages 10 / 10 / 2006 Instructor: Michael Eckmann.
1 A Quick Introduction to Regular Expressions in Java.
More Regular Expressions. List/Scalar Context for m// Last week, we said that m// returns ‘true’ or ‘false’ in scalar context. (really, 1 or 0). In list.
Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl Linux editors and commands (e.g.
Regular Expressions. What are regular expressions? A means of searching, matching, and replacing substrings within strings. Very powerful (Potentially)
Characters and Strings. Characters In Java, a char is a primitive type that can hold one single character A character can be: –A letter or digit –A punctuation.
CSci 142 Data and Expressions. 2  Topics  Strings  Primitive data types  Using variables and constants  Expressions and operator precedence  Data.
Regular Expressions & Automata Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
1 Overview Regular expressions Notation Patterns Java support.
Scripting Languages Chapter 8 More About Regular Expressions.
Regular Expressions. String Matching The problem of finding a string that “looks kind of like …” is common  e.g. finding useful delimiters in a file,
An Introduction to TokensRegex
Applications of Regular Expressions BY— NIKHIL KUMAR KATTE 1.
Lesson 3 – Regular Expressions Sandeepa Harshanganie Kannangara MBCS | B.Sc. (special) in MIT.
1 Form Validation. Validation  Validation of form data can be cumbersome using the basic techniques  StringTokenizer  If-else statements  Most of.
Last Updated March 2006 Slide 1 Regular Expressions.
Language Recognizer Connecting Type 3 languages and Finite State Automata Copyright © – Curt Hill.
Regular Expression Darby Tien-Hao Chang (a.k.a. dirty) Department of Electrical Engineering, National Cheng Kung University.
 Text Manipulation and Data Collection. General Programming Practice Find a string within a text Find a string ‘man’ from a ‘A successful man’
ADSA: RegExprs/ Advanced Data Structures and Algorithms Objective –look at programming with regular expressions (REs) in Java Semester 2,
Perl and Regular Expressions Regular Expressions are available as part of the programming languages Java, JScript, Visual Basic and VBScript, JavaScript,
5 BASIC CONCEPTS OF ANY PROGRAMMING LANGUAGE Let’s get started …
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 4. Document Search and Regular Expressions.
Regular Expressions CSC207 – Software Design. Motivation Handling white space –A program ought to be able to treat any number of white space characters.
Regular Expression in Java 101 COMP204 Source: Sun tutorial, …
Regular Expressions.
Chapter 3: Formatted Input/Output Copyright © 2008 W. W. Norton & Company. All rights reserved. 1 Chapter 3 Formatted Input/Output.
Regular Expressions – An Overview Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in.
Java The Java programming language was created by Sun Microsystems, Inc. It was introduced in 1995 and it's popularity has grown quickly since A programming.
Overview A regular expression defines a search pattern for strings. Regular expressions can be used to search, edit and manipulate text. The pattern defined.
Regular Expressions Regular Expressions. Regular Expressions  Regular expressions are a powerful string manipulation tool  All modern languages have.
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
Module 6 – Generics Module 7 – Regular Expressions.
GREP. Whats Grep? Grep is a popular unix program that supports a special programming language for doing regular expressions The grammar in use for software.
© 2004 Pearson Addison-Wesley. All rights reserved ComS 207: Programming I Instructor: Alexander Stoytchev
May 2008CLINT-LIN Regular Expressions1 Introduction to Computational Linguistics Regular Expressions (Tutorial derived from NLTK)
CS 330 Programming Languages 10 / 02 / 2007 Instructor: Michael Eckmann.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
Copyright © Curt Hill Regular Expressions Providing a Search Pattern.
Programming Fundamentals. Overview of Previous Lecture Phases of C++ Environment Program statement Vs Preprocessor directive Whitespaces Comments.
-Joseph Beberman *Some slides are inspired by a PowerPoint presentation used by professor Seikyung Jung, which was derived from Charlie Wiseman.
17-Mar-16 Characters and Strings. 2 Characters In Java, a char is a primitive type that can hold one single character A character can be: A letter or.
Pattern Matching: Simple Patterns. Introduction Programmers often need to scan a file, directory, etc. for a specific substring. –Find all files that.
CS 330 Programming Languages 09 / 30 / 2008 Instructor: Michael Eckmann.
OOP Tirgul 11. What We’ll Be Seeing Today  Regular Expressions Basics  Doing it in Java  Advanced Regular Expressions  Summary 2.
Chapter 3: Formatted Input/Output 1 Chapter 3 Formatted Input/Output.
May 2006CLINT-LIN Regular Expressions1 Introduction to Computational Linguistics Regular Expressions (Tutorial derived from NLTK)
Lesson 4 String Manipulation. Lesson 4 In many applications you will need to do some kind of manipulation or parsing of strings, whether you are Attempting.
1 Problem Solving  The purpose of writing a program is to solve a problem  The general steps in problem solving are: Understand the problem Dissect the.
CSC201: Computer Programming
Lecture 19 Strings and Regular Expressions
CSC 594 Topics in AI – Natural Language Processing
Java Programming Course Regular Expression
Week 14 - Friday CS221.
CSC 594 Topics in AI – Natural Language Processing
Regular Expressions in Java
Theory of Computation Languages.
Regular Expressions in Java
Regular Expressions in Java
Regular Expression in Java 101
Regular Expressions in Java
Presentation transcript:

9-Sep-15 Regular Expressions

About “Regular” Expressions In a theory course you should have learned about regular expressions Regular expressions describe regular languages Regular expressions are equivalent to finite state machines Etc. Regular expressions in a programming language are based on regular expressions, but have super powers For example, you can use them to recognize αα, where α is any string Regular expressions are built into nearly every popular programming language, including Python, Java, and Scala The syntax of regular expressions is extremely consistent across languages Every programmer should know how to use regular expressions! 2

3 Regular Expressions A regular expression is a kind of pattern that can be applied to text ( String s, in Java) A regular expression either matches the text (or part of the text), or it fails to match If a regular expression matches a part of the text, then you can easily find out which part If a regular expression is complex, then you can easily find out which parts of the regular expression match which parts of the text With this information, you can readily extract parts of the text, or do substitutions in the text Regular expressions are extremely useful for manipulating text Regular expressions are used in the automatic generation of Web pages

4 Perl, Python, Java, Scala The Perl programming language is heavily used in server-side programming, because Much server-side programming is text manipulation Regular expressions are built into the syntax of Perl Java has a regular expression package, java.util.regex Java’s regular expressions are almost identical to those of Perl Scala’s regular expressions are even more similar to Perl Regular expressions in Java are just a normal package, with no new syntax to support them Java’s regular expressions are just as powerful as Perl’s, but Regular expressions are easier and more convenient in Perl In Python and Scala, regular expressions are intermediate in convenience

Matching and searching in Python >>> import re >>> match_object = re.match("abc", "abcdef") >>> if match_object: print match_object.group() else: print("No match") abc >>> match_object = re.search("def", "abcdef") >>> if match_object: print match_object.group() else: print("No match") def 5

Matching and searching in Scala scala> import scala.util.matching.Regex import scala.util.matching.Regex scala> val matchObject1 = "abc".r findPrefixMatchOf("abcdef") matchObject1: Option[scala.util.matching.Regex.Match] = Some(abc) scala> val matchObject2 = "def".r findPrefixMatchOf("abcdef") matchObject2: Option[scala.util.matching.Regex.Match] = None scala> val matchObject3 = "def".r findFirstIn("abcdef") matchObject3: Option[String] = Some(def) 6

Matching and searching in Java Pattern p = Pattern.compile("abc"); Matcher m = p.matcher("abcdef"); if (m.matches()) System.out.println("matches " + m.group(0)); m.reset(); if (m.lookingAt()) System.out.println("lookingAt " + m.group(0)); m.reset(); if (m.find()) System.out.println("find " + m.group(0)); lookingAt abc find abc 7

Raw strings Strings can contain various “escape characters,” such as \n for newline or \t for tab A raw string is one in which these escapes are not processed Python has raw strings (using an r prefix) >>> print "123\n456" >>> print r"123\n456" 123\n456 Scala has raw strings (using a raw prefix or triple quotes) scala> print("123\n456") scala> print(raw"123\n456") 123\n456 scala> print("""123\n456""") 123\n456 Java doesn’t have raw strings 8

9 Double backslashes in Java Backslashes have a special meaning in regular expressions; for example, \b means a word boundary The Java compiler treats backslashes specially; for example, \b in a String or as a char means the backspace character Java syntax rules apply first! If you write "\b[a-z]+\b" you get a string with backspace characters in it--this is not what you want! Remember, you can quote a backslash with another backslash, so "\\b[a-z]+\\b" gives the correct string This also works in Scala for non-raw strings Note: if you read in a String from somewhere, you are not compiling it, so you get whatever characters are actually there

Matching dates in Python \d will match any digit + means “one or more” Parentheses are used to group parts of the pattern >>> date_pattern = r"(\d+)/(\d+)/(\d+)" >>> parts = re.match(date_pattern, "11/25/2013") >>> for i in range(0, 4): print "group(" + str(i) + ") = " + parts.group(i) group(0) = 11/25/2013 group(1) = 11 group(2) = 25 group(3) =

Matching dates in Scala scala> val pattern = new Regex(raw"(\d+)/(\d+)/(\d+)") pattern: scala.util.matching.Regex = (\d+)/(\d+)/(\d+) scala> val pattern(month, day, year) = "11/25/2013" month: String = 11 day: String = 25 year: String = 2013 scala> "11/25/2013" match { | case pattern(month, day, year) => | println(s"Day $day of month $month of year $year") | case _ => "Illegal date format" | } Day 25 of month 11 of year

Matching dates in Java Pattern p = Pattern.compile("(\\d+)/(\\d+)/(\\d+)"); Matcher m = p.matcher("11/25/2013"); if (m.matches()) { for (int i = 0; i <= 3; i++) { System.out.println("group(" + i + ") = " + m.group(i)); } } group(0) = 11/25/2013 group(1) = 11 group(2) = 25 group(3) =

Things to notice The form of the regular expression is identical for all these languages Java doesn’t have raw strings, so backslashes must be doubled, but the resultant string is the same Each language may add a few unimportant features to regular expressions, but the core features are exactly the same Methods differ across languages, both in their names and what they do Scala can use regular expressions in pattern matching Despite differences in methods, regular expressions can always be used to: Match an entire string Match the prefix of a string Find one or all occurrences within a string Record what was matched by each group Perform substitutions (not shown in previous examples) 13

14 Ways to use a regular expression The regular expression "[a-z]+" will match a sequence of one or more lowercase letters [a-z] means any character from a through z, inclusive + means “one or more” Suppose we apply this pattern to the String "Now is the time" Here are some of the ways we can apply this pattern: To the entire string: it fails to match because the string contains characters other than lowercase letters To the beginning of the string: it fails to match because the string does not begin with a lowercase letter To search the string: it will succeed and match ow If the pattern is applied a second time, it will find is Further applications will find is, then the, then time After time, another application will fail

15 A complete Java program import java.util.regex.*; public class RegexTest { public static void main(String args[]) { String pattern = "[a-z]+"; String text = "Now is the time"; Pattern p = Pattern.compile(pattern); Matcher m = p.matcher(text); while (m.find()) { System.out.print(text.substring ( m.start(), m.end() ) + "*"); } } } Output: ow*is*the*time*

A complete Scala program scala> import scala.util.matching.Regex import scala.util.matching.Regex scala> object RegexTest { | def main(args: Array[String]) { | val pattern = "[a-z]+".r | val text = "Now is the time" | for (word RegexTest.main(null) ow*is*the*time* 16

17 Some simple patterns abc exactly this sequence of three letters [abc] any one of the letters a, b, or c [^abc] any character except one of the letters a, b, or c (immediately within an open bracket, ^ means “not,” but anywhere else it just means the character ^ ) [a-z] any one character from a through z, inclusive [a-zA-Z0-9] any one letter or digit

18 Sequences and alternatives If one pattern is followed by another, the two patterns must match consecutively For example, [A-Za-z]+[0-9] will match one or more letters immediately followed by one digit The vertical bar, |, is used to separate alternatives For example, the pattern abc|xyz will match either abc or xyz

19 Some predefined character classes. any one character except a line terminator \d a digit: [0-9] \D a non-digit: [^0-9] \s a whitespace character: [ \t\n\x0B\f\r] \S a non-whitespace character: [^\s] \w a word character: [a-zA-Z_0-9] \W a non-word character: [^\w] Notice the space. Spaces are significant in regular expressions!

20 Boundary matchers These patterns match the empty string if at the specified position: ^ the beginning of a line $ the end of a line \b a word boundary \B not a word boundary \A the beginning of the input (can be multiple lines) \Z the end of the input except for the final terminator, if any \z the end of the input \G the end of the previous match

21 Greedy quantifiers (The term “greedy” will be explained later) Assume X represents some pattern X? optional, X occurs once or not at all X*X occurs zero or more times X+X occurs one or more times X{n}X occurs exactly n times X{n,}X occurs n or more times X{n,m}X occurs at least n but not more than m times Note that these are all postfix operators, that is, they come after the operand

22 Types of quantifiers A greedy quantifier will match as much as it can, and back off if it needs to We’ll do examples in a moment A reluctant quantifier will match as little as possible, then take more if it needs to You make a quantifier reluctant by appending a ? : X?? X*? X+? X{n}? X{n,}? X{n,m}? A possessive quantifier will match as much as it can, and never let go You make a quantifier possessive by appending a + : X?+ X*+ X++ X{n}+ X{n,}+ X{n,m}+

23 Quantifier examples Suppose your text is aardvark Using the pattern a*ardvark ( a* is greedy): The a* will first match aa, but then ardvark won’t match The a* then “backs off” and matches only a single a, allowing the rest of the pattern ( ardvark ) to succeed Using the pattern a*?ardvark ( a*? is reluctant): The a*? will first match zero characters (the null string), but then ardvark won’t match The a*? then extends and matches the first a, allowing the rest of the pattern ( ardvark ) to succeed Using the pattern a*+ardvark ( a*+ is possessive): The a*+ will match the aa, and will not back off, so ardvark never matches and the pattern match fails

24 Capturing groups In regular expressions, parentheses are used for grouping, but they also capture (keep for later use) anything matched by that part of the pattern Example: ([a-zA-Z]*)([0-9]*) matches any number of letters followed by any number of digits If the match succeeds, \1 holds the matched letters and \2 holds the matched digits In addition, \0 holds everything matched by the entire pattern Capturing groups are numbered by counting their opening parentheses from left to right: ( ( A ) ( B ( C ) ) ) \0 = \1 = ((A)(B(C))), \2 = (A), \3 = (B(C)), \4 = (C) Example: ([a-zA-Z])\1 will match a double letter, such as letter

25 Capturing groups in Java If m is a matcher that has just performed a successful match, then m.group( n ) returns the String matched by capturing group n This could be an empty string This will be null if the pattern as a whole matched but this particular group didn’t match anything m.group() returns the String matched by the entire pattern (same as m.group(0) ) This could be an empty string If m didn’t match (or wasn’t tried), then these methods will throw an IllegalStateException

Capturing groups in Scala scala> val pattern = raw"(\d+)/(\d+)/(\d+)".r pattern: scala.util.matching.Regex = (\d+)/(\d+)/(\d+) scala> val m = pattern findAllIn "11/25/2013" m: scala.util.matching.Regex.MatchIterator = non-empty iterator scala> for(i <- 0 to 3) print(m.group(i) + "-") 11/25/ scala> val pattern(month, day, year) = "11/25/2013" month: String = 11 day: String = 25 year: String =

27 Pig Latin Pig Latin is a spoken “secret code” that many English- speaking children learn There are some minor variations (regional dialects?) The rules for (written) Pig Latin are: If a word begins with a consonant cluster, move it to the end and add “ay” If a word begins with a vowel, add “hay” to the end Example: regular expressions are fun!  egularray expressionshay arehay unfay!

28 Example use of capturing groups Suppose word holds a word in English Also suppose we want to move all the consonants at the beginning of word (if any) to the end of the word (so string becomes ingstr ) Pattern p = Pattern.compile("([^aeiou]*)(.*)"); Matcher m = p.matcher(word); if (m.matches()) { System.out.println(m.group(2) + m.group(1)); } Note the use of (.*) to indicate “all the rest of the characters”

29 Pig Latin translator Pattern wordPlusStuff = Pattern.compile("([a-zA-Z]+)([^a-zA-Z]*)"); Pattern consonantsPlusRest = Pattern.compile("([^aeiouAEIOU]+)([a-zA-Z]*)"); public String translate(String text) { Matcher m = wordPlusStuff.matcher(text); String translatedText = ""; while (m.find()) { translatedText += translateWord(m.group(1)) + m.group(2); } return translatedText; } private String translateWord(String word) { Matcher m = consonantsPlusRest.matcher(word); if (m.matches()) { return m.group(2) + m.group(1) + "ay"; } else return word + "hay"; }

Pig Latin translator in Scala scala> def translateWord(consonants: String, rest: String) = | if (consonants == "") rest + "hay" else | rest + consonants + "ay" translateWord: (consonants: String, rest: String)String scala> def translate(sentence: String): String = { | val wordPattern = raw"\b([^aeiouAEIOU ]*)([a-zA-Z]+\b)".r | wordPattern.replaceAllIn(sentence, | m => translateWord(m.group(1), m.group(2))) | } translate: (sentence: String)String scala> translate("regular expressions are fun!") res14: String = egularray expressionshay arehay unfay! 30

31 Additions to Java’s String class All of the following are public: public boolean matches(String regex ) public String replaceFirst(String regex, String replacement ) public String replaceAll(String regex, String replacement ) public String[] split(String regex ) public String[] split(String regex, int limit ) If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible Everything in the Java API is available to Scala, so of course the above all work in Scala

32 Escaping metacharacters A lot of special characters--parentheses, brackets, braces, stars, plus signs, etc.--are used in defining regular expressions; these are called metacharacters Suppose you want to search for the character sequence a* (an a followed by a star) "a*" ; doesn’t work; that means “zero or more a s” "a\*" ; doesn’t work; since a star doesn’t need to be escaped (in Java String constants), Java just ignores the \ "a\\*" does work; it’s the three-character string a, \, * Just to make things even more difficult, it’s illegal to escape a non-metacharacter in a regular expression Hence, you can’t backslash special characters “just in case”

33 Spaces There is only one thing to be said about spaces (blanks) in regular expressions, but it’s important: Spaces are significant! A space stands for a space--when you put a space in a pattern, that means to match a space in the text string It’s a really bad idea to put spaces in a regular expression just to make it look better Sometimes regular expression packages provide a setting to allow blanks in regular expressions to be ignored This makes them easier to read, but I’m not convinced it’s a good idea

34 Regular expressions are a language Regular expressions are not easy to use at first It’s a bunch of punctuation, not words The individual pieces are not hard, but it takes practice to learn to put them together correctly Regular expressions form a miniature programming language It’s a different kind of programming language, and requires learning new thought patterns In Java you can’t just use a regular expression; you have to first create Patterns and Matchers Java’s syntax for String constants doesn’t help, either Despite all this, regular expressions bring so much power and convenience to String manipulation that they are well worth the effort of learning

35 Thinking in regular expressions The fundamental concept in regular expressions is automatic backtracking You match the parts of a pattern left to right Some pattern parts, such as x (the letter “x”),. (any one character), and ^ (the beginning of the string) are deterministic: they either match or don’t match; there are no other alternatives to try Other pattern parts are nondeterministic: they have alternatives, such as x* (zero or more letter “x”s), x+ (one or more letter “x”s), [aeiou] (any vowel), and yes|no (either “yes” or “no”) If some part fails to match, you backtrack to the most recent nondeterministic part and look for a different match for that part

36 Backtracking examples Search cases for a [aeiou]s$, that is, a vowel followed by an “s” at the end of the string [aeiou] doesn’t match c [aeiou] matches a, s matches s, $ fails There is no other possible match for s in this position [aeiou] doesn’t match s [aeiou] matches a, s matches s, $ succeeds Search Java for J.*.+a J matches J, the.* matches ava, the.+ fails Backtrack to.* : The.* matches av, the.+ matches a, the a fails Backtrack to.* : The.* matches a, the.+ matches va, the a fails Backtrack to.+ : The.+ matches v, the a succeeds

Backreferences Within a regular expression, \1 matches the first matched group, \2 matches the second matched group, etc. scala> raw"([a-z])\1".r findFirstIn "It's a long way to Tipperary" res15: Option[String] = Some(pp) scala> raw"([a-z]).*\1".r findFirstIn "It's a long way to Tipperary" res16: Option[String] = Some(t's a long way t) 37

Recognizing αα Here’s a simple regular expression to test whether the second half of a string is identical to the first half: ^(.*)\1$ 38

39 Hazards of regular expressions Regular expressions are complex They are often used when you cannot guarantee “good” input, so you have to make them fail-safe Backtracking can be extremely expensive Avoid.* and other highly nondeterministic patterns Test with non-trivial data to make sure your patterns scale Test thoroughly! Break a complex regular expression into its components, and test each separately Every pattern is a program, and needs to be treated with respect Pay special attention to edge cases Consider alternatives Regular expressions are powerful, but... If you can get the job done with a few simple String methods, you probably are better off doing it that way

Regular expressions in Sublime Text 3 40 Sublime Text 3 shows you what is being matched as you type in the regular expression

41 The End A little learning is a dangerous thing; Drink deep, or taste not the Pierian spring: There shallow draughts intoxicate the brain, And drinking largely sobers us again. --Alexander Pope