Pattern Matching in Strings


Regular Expressions: Pattern Matching in Strings
September 17, 2018

More about strings
In a previous lecture, we discussed the String class and its methods. It is a powerful class with many string-handling methods that let us search strings, compare strings, modify string values, manage substrings, split, join, and so forth. We saw examples of its functionality in C# code. However, the String class lacks pattern-matching functionality.

Pattern Matching in Data Validation
Many everyday values must match a particular pattern in order to be valid. For example:
Social Security number: ddd-dd-dddd
Telephone number: ddd-dddd or (ddd) ddd-dddd, etc.
US Zip code: ddddd or ddddd-dddd
Matching the pattern does not guarantee that a value is valid, but a value that fails to match the pattern cannot be valid. 00000 matches the pattern for a Zip code, but no location uses that code. 111-1111 matches the pattern for a phone number, but it is not a US phone number assigned to any phone. On the other hand, 23 and 234567890123450987 do not match the patterns, so they cannot be valid Social Security numbers or valid Zip codes.
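The patterns listed on this slide can be sketched as .NET regular expressions. This is only an illustration: the class and method names (PatternChecks, LooksLikeSsn, and so on) are hypothetical, and the exact patterns are one reasonable reading of "ddd-dd-dddd" and its siblings, not code from the lecture.

```csharp
using System.Text.RegularExpressions;

public static class PatternChecks
{
    // ^ and $ anchor each pattern so the whole value must fit it,
    // not just some substring of the value.
    private static readonly Regex Ssn = new Regex(@"^\d{3}-\d{2}-\d{4}$");
    private static readonly Regex Zip = new Regex(@"^\d{5}(-\d{4})?$");
    private static readonly Regex Phone =
        new Regex(@"^(\d{3}-\d{4}|\(\d{3}\) \d{3}-\d{4})$");

    // Each method answers "could this be valid?", never "is this valid?".
    public static bool LooksLikeSsn(string value) => Ssn.IsMatch(value);
    public static bool LooksLikeZip(string value) => Zip.IsMatch(value);
    public static bool LooksLikePhone(string value) => Phone.IsMatch(value);
}
```

With these definitions, PatternChecks.LooksLikeZip("00000") returns true even though no location uses that code, while PatternChecks.LooksLikeZip("23") returns false, which mirrors the distinction the slide draws.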

Pattern Matching and Data Validation
Data validation is important in most real-world applications (garbage in, garbage out). Some things, such as a person's name, cannot be fully validated unless we have a list of all possible valid values to compare against, and often we have no such list. In those cases, if valid data is supposed to follow a particular pattern, we can at least ensure that an input could be valid because it matches the right pattern. This allows us to rule out much invalid input data.

Introduction to Regular Expressions
One could have a full course on regular expressions. This is merely a very brief introduction to the use of regular expressions in C#; it is not intended to be in-depth coverage. Many tools use regular expressions. Most work with a common set of rules, but some have their own extensions to the common rules that other tools may or may not support.

Regular Expressions
Regular expressions provide a (somewhat) standard, formal way to specify the patterns we want to match in (we hope) unambiguous ways. There is little that is intuitive about the way regular expressions are written. There are tools that can help; more on this later.

Regular Expressions
A regular expression is a string representing a specific pattern by use of an accepted notation. Each character in a regular expression has a specific meaning. It can represent an occurrence of the character itself, or it can have a more "generic" meaning. For example, a 'd' might be used to represent a digit. However, the same character might represent a specific lower-case letter (namely, the 4th letter in the alphabet), as in "abcdefg".

Regular Expressions
Thus, regular expressions use an "escape" character ('\') to signal whether the next character has its literal meaning or a special meaning. An escape character is only needed when the next character is one that has both a literal meaning and an alternate meaning. This is similar to the use of '\n' to represent a newline character rather than the letter 'n' and the use of '\t' to represent a tab character instead of the letter 't'.

Some Escape Sequences
(The table shown on this slide is not included in the transcript.)

Regular Expressions
In regular expressions, all characters that are not escaped match themselves except for the following special characters:
.[{()\*+?|^$
These characters have special meanings. Note that some opening symbols such as { and [ are included in this list, but their closing counterparts are not.
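When a search string happens to contain special characters, they have to be escaped before the string can be matched literally. A minimal sketch using the .NET method Regex.Escape; the class and method names here are hypothetical and chosen only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class LiteralSearch
{
    // Treats 'text' as a pattern: special characters keep their special meaning.
    public static bool ContainsAsPattern(string input, string text)
        => Regex.IsMatch(input, text);

    // Escapes the special characters first, so 'text' is matched literally.
    public static bool ContainsLiterally(string input, string text)
        => Regex.IsMatch(input, Regex.Escape(text));
}
```

For the input "1+1=2" and the search text "1+1=2", ContainsAsPattern returns false because '+' means "one or more of the preceding '1'", whereas ContainsLiterally returns true.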

Wildcard in Regular Expressions
The single character '.', when used outside of a character set (character sets are defined on a later slide), will match any single character. It serves as a "wildcard" character. For example, "x.z" matches "xyz", "x1z", "xkz", or "x.z" but not "abc", "xabcz", or "xz".
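The wildcard example can be checked with a short sketch. The anchors ^ and $ are added here as an assumption so that the whole string must match, and the names WildcardDemo and IsXAnyZ are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class WildcardDemo
{
    // 'x', then exactly one character of any kind, then 'z'.
    // Note: by default '.' in .NET does not match a newline character.
    private static readonly Regex XAnyZ = new Regex(@"^x.z$");

    public static bool IsXAnyZ(string s) => XAnyZ.IsMatch(s);
}
```

IsXAnyZ returns true for "xyz", "x1z", "xkz", and "x.z", and false for "abc", "xabcz", and "xz".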

Anchors in a Regular Expression
A '^' character will match the start of a line. (In .NET, it matches only the start of the input string unless RegexOptions.Multiline is specified.) It is used when the designated pattern must start at the beginning of a line.
"^All": the letters "All" must come at the beginning of the line to match this pattern.
Examples that match (if at the beginning of a line): "All good men . . .", "Allen Likens", "All-Star Game".
It would not match "Ball", "CALL", "TAll", or "That's All".

Anchors in Regular Expressions
A '$' character will match the end of a line. (In .NET, it matches only the end of the input string, or just before a final newline, unless RegexOptions.Multiline is specified.) It is used when the designated pattern must match the characters at the end of a line.
"end$": the letters "end" must be at the end of the line to match this regular expression.
Examples that match (if at the end of a line): "the end", "around the bend", "uranium comes from pitch blend".
It would not match "Send some men to end the fire".
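Both anchors can be tried out with the example strings from these two slides. This is a small sketch; AnchorDemo and its method names are hypothetical, and each input string is treated as a single line.

```csharp
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // ^ requires "All" to be the first three characters of the input.
    public static bool StartsWithAll(string line) => Regex.IsMatch(line, @"^All");

    // $ requires "end" to be the last three characters of the input.
    public static bool EndsWithEnd(string line) => Regex.IsMatch(line, @"end$");
}
```

StartsWithAll("Allen Likens") returns true and StartsWithAll("That's All") returns false; EndsWithEnd("around the bend") returns true and EndsWithEnd("Send some men to end the fire") returns false.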

Repetition in a Regular Expression
Any atom can be repeated some designated number of times through the use of one or more of the *, +, ?, and { } operators.
The * operator will match the preceding atom zero or more times. For example, the expression a*b will match, in full, any of "b", "ab", or "aaaaaaaab", but not "tom" or "ba".

Repetition
The ? operator will match the preceding atom zero or one time. For example, the expression ca?b will match either "cb" or "cab", but it will not match "caab".

Repetition in a Regular Expression
The + operator will match the preceding atom one or more times. For example, the expression a+b will match either "ab" or "aaaaaaaab", but it will not match "b" or "ba".
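The three repetition operators *, ?, and + can be compared side by side. In this sketch the patterns are anchored with ^ and $ (an assumption made so that the whole string must match, as the slide examples imply); the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class RepetitionDemo
{
    // a*b : zero or more 'a' characters followed by 'b'
    public static bool ZeroOrMoreA(string s) => Regex.IsMatch(s, @"^a*b$");

    // ca?b : 'c', then an optional single 'a', then 'b'
    public static bool OptionalA(string s) => Regex.IsMatch(s, @"^ca?b$");

    // a+b : one or more 'a' characters followed by 'b'
    public static bool OneOrMoreA(string s) => Regex.IsMatch(s, @"^a+b$");
}
```

Without the anchors, a*b would still find the substring "b" inside "ba", so Regex.IsMatch("ba", "a*b") returns true; the anchored form rejects "ba".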

Bounded Repetition
An atom can also be repeated with a bounded repeat:
a{n} matches 'a' repeated exactly n times
a{n,} matches 'a' repeated n or more times
a{n,m} matches 'a' repeated between n and m times, inclusive (written with no space after the comma)
For example, ^a{2,4}$ will match any of "aa", "aaa", or "aaaa", but it will match neither "a" nor "aaaaaa".
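The bounded-repeat example from this slide, as a small sketch (BoundedDemo and TwoToFourAs are hypothetical names):

```csharp
using System.Text.RegularExpressions;

public static class BoundedDemo
{
    // The whole string must consist of two, three, or four 'a' characters.
    public static bool TwoToFourAs(string s) => Regex.IsMatch(s, @"^a{2,4}$");
}
```

TwoToFourAs returns true for "aa", "aaa", and "aaaa", and false for "a" and "aaaaaa".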

Alternative values
The | operator will match either of its arguments (the values on its left-hand side or those on its right-hand side). For example, a|d will match either "a" or "d".
Parentheses can be used to group alternations. For example, ab(d|e) will match either "abd" or "abe".
Empty alternatives, as in (a|), are rejected by some regular expression tools; .NET accepts them, but they are usually a mistake and are best avoided.
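A short sketch of grouped alternation, using the slide's ab(d|e) example with anchors added so the whole string must match. The names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Either "a" or "d"; the parentheses keep ^ and $ applied to both alternatives.
    public static bool IsAOrD(string s) => Regex.IsMatch(s, @"^(a|d)$");

    // "ab" followed by either 'd' or 'e'.
    public static bool IsAbdOrAbe(string s) => Regex.IsMatch(s, @"^ab(d|e)$");
}
```

IsAbdOrAbe returns true for "abd" and "abe" and false for "abf" or "ab".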

Character sets
A character set is a bracketed expression starting with [ and ending with ]. It defines a set of characters (those contained within the brackets) and matches any single character that is a member of that set. There are several ways the characters in the brackets can be specified; the following slides show those ways.

Character sets: single characters
For example, [abc] will match any of the single characters 'a', 'b', or 'c'. The desired characters are listed together with no separating characters such as commas. If commas or other separators are included, they become matching characters rather than separators: [a, b, c] would match an 'a', a 'b', a 'c', a comma, or a space.
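The effect of putting separators inside the brackets is easy to see in code. A minimal sketch with hypothetical names, anchored so that the input must be exactly one character from the set:

```csharp
using System.Text.RegularExpressions;

public static class CharSetDemo
{
    // Exactly one of 'a', 'b', or 'c'.
    public static bool IsAbc(string s) => Regex.IsMatch(s, @"^[abc]$");

    // The commas and spaces are members of the set too, not separators.
    public static bool IsAbcCommaOrSpace(string s) => Regex.IsMatch(s, @"^[a, b, c]$");
}
```

IsAbc(",") returns false, but IsAbcCommaOrSpace(",") and IsAbcCommaOrSpace(" ") both return true.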

Character sets: ranges
A hyphen may be used to designate a range of characters. For example, [a-m] will match any single character in the range 'a' to 'm' in the Unicode character sequence. This range includes both 'a' and 'm' as well as the lower-case letters of the alphabet between the two.

Character sets: negation
If the bracketed expression begins with the ^ character, then it matches the complement of the characters it contains. The complement is every character not in the specified set. For example, [^a-c] matches any single character that is not in the range a-c.
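Ranges and negated sets can be sketched the same way. The names RangeDemo, InAToM, and NotAToC are hypothetical; the anchors are added so that the input must be exactly one character.

```csharp
using System.Text.RegularExpressions;

public static class RangeDemo
{
    // One lower-case letter from 'a' through 'm', inclusive.
    public static bool InAToM(string s) => Regex.IsMatch(s, @"^[a-m]$");

    // One character that is anything except 'a', 'b', or 'c'.
    // Inside the brackets ^ means "not"; outside it is the start anchor.
    public static bool NotAToC(string s) => Regex.IsMatch(s, @"^[^a-c]$");
}
```

InAToM("m") returns true while InAToM("n") and InAToM("A") return false; NotAToC("d") and NotAToC("7") return true while NotAToC("b") returns false.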

Word Boundaries: Escape Sequences
(The table shown on this slide is not included in the transcript.)

In C#: escape character rules
It is important to remember that C# has its own escape sequences, such as '\n' and '\t'. The backslash is used to designate an escape sequence in C# as well as in regular expressions used in C# programs. To write a regular expression escape sequence inside an ordinary C# string literal, the backslash must be doubled, as in "\\w". Verbatim string literals in C# allow us to avoid this issue: use @"\w" instead of "\\w".
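The two spellings produce the same pattern string, which a short sketch can confirm. EscapeDemo and its members are hypothetical names used only for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Ordinary literal: C# turns \\ into one backslash, giving the 3 characters \w+
    public const string Regular = "\\w+";

    // Verbatim literal: the backslash is kept exactly as typed.
    public const string Verbatim = @"\w+";

    public static bool SamePattern() => Regular == Verbatim;

    // Returns the first run of word characters in the input.
    public static string FirstWord(string input) => Regex.Match(input, Verbatim).Value;
}
```

SamePattern() returns true, and FirstWord("hello world") returns "hello" whichever of the two constants is used.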

Example
@"\b[0-1]?\d/[0-3]\d/(\d\d|19\d\d|200\d|201\d)\b"
\b represents a word boundary.
[0-1]? represents zero or one occurrence of a 0 or a 1.
\d represents a digit.
/ represents a forward slash.
Summary so far: matches "04/", "4/", "12/", ... at the beginning of a word, but also matches "18/" and "00/" at the beginning of a word.
[0-3]\d/ matches a 0, 1, 2, or 3 followed by a digit followed by a forward slash.
(\d\d|19\d\d|200\d|201\d)\b matches two digits or 19dd or 200d or 201d, ending on a word boundary.
The full pattern matches 04/18/92, 12/31/1982, 7/09/2007, and 11/11/2016. However, it also matches 18/36/2000, which is not a real date.
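The date pattern from this slide can be exercised directly. In this sketch only the pattern comes from the slide; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class DatePattern
{
    // The pattern from the slide, unchanged.
    public static readonly Regex Date =
        new Regex(@"\b[0-1]?\d/[0-3]\d/(\d\d|19\d\d|200\d|201\d)\b");

    // True if the text contains something shaped like a date.
    public static bool ContainsDate(string text) => Date.IsMatch(text);

    // The first date-shaped substring, or an empty string if there is none.
    public static string FirstDate(string text) => Date.Match(text).Value;
}
```

ContainsDate returns true for "04/18/92", "12/31/1982", "7/09/2007", and "11/11/2016", and also for the impossible date "18/36/2000", so a pattern match is only a first filter. A stricter check would need logic beyond this pattern.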

The Regex Class: Regular Expressions in .NET

The Regex Class
using System.Text.RegularExpressions;
The Regex class provides support for using regular expressions to determine whether a string matches a specified pattern or whether it contains a substring that does. Along with some supporting classes in the same namespace, it contains methods that let us check for an exact match, search for any or all matches, perform search-and-replace operations, and perform other such tasks.

The Regex and Match Classes
The Regex class allows one to create a regular expression object from a string. The constructor takes a string argument representing the pattern to be matched. Example:
Regex pattern = new Regex(@"\d{5}");
The Match method in Regex returns a Match object that contains the first substring of a target string that matches the pattern:
Match match = pattern.Match(strTarget);
The search proceeds from left to right. Note that there is a method named Match in the Regex class and also a class named Match, which is used to manage substrings that match the pattern.
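Putting the two statements from this slide together gives a small helper that returns the first five-digit run in a string. The wrapper class and method names are hypothetical; the pattern and the variable names pattern, match, and strTarget follow the slide.

```csharp
using System.Text.RegularExpressions;

public static class FirstMatchDemo
{
    public static string FirstFiveDigits(string strTarget)
    {
        Regex pattern = new Regex(@"\d{5}");

        // Regex.Match (the method) returns a Match (the class) for the
        // leftmost substring that fits the pattern.
        Match match = pattern.Match(strTarget);

        // Success is false when nothing matched.
        return match.Success ? match.Value : string.Empty;
    }
}
```

FirstFiveDigits("Zip 62026 or 90210") returns "62026", the leftmost match, and FirstFiveDigits("no digits") returns an empty string.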

RegexOptions
When creating a regular expression, one may specify certain options:
public Regex (string pattern, RegexOptions options)
For example, the following statement creates a pattern that matches either an upper-case or a lower-case letter between "a" and "f":
Regex pat = new Regex(@"[abcdef]", RegexOptions.IgnoreCase);
It would match "D", "d", "a", or "F", but not "3", "k", or "&".
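A sketch of the IgnoreCase option in use. The anchors ^ and $ are an addition to the slide's pattern so that the input must be a single character, and the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class IgnoreCaseDemo
{
    // IgnoreCase makes [abcdef] accept 'A' through 'F' as well.
    private static readonly Regex Pat =
        new Regex(@"^[abcdef]$", RegexOptions.IgnoreCase);

    public static bool IsLetterAToF(string s) => Pat.IsMatch(s);
}
```

IsLetterAToF returns true for "D", "d", "a", and "F", and false for "3", "k", and "&".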

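Example: finding matches one at a time
(The code and output shown on this slide are not included in the transcript; only the callout labels survive.)
The steps labeled on the slide: declare a regular expression representing a date; find the first match; test whether a match was found; display the match; get the next match, if any.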
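Since the slide's own code did not survive, the following is a reconstruction of the labeled steps under assumptions, not the lecture's code. It reuses the date pattern from the earlier example and the .NET methods Regex.Match and Match.NextMatch; the class and method names are hypothetical, and the matches are collected in a list rather than displayed.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NextMatchDemo
{
    public static List<string> FindDates(string text)
    {
        // Declare a regular expression representing a date.
        Regex datePattern =
            new Regex(@"\b[0-1]?\d/[0-3]\d/(\d\d|19\d\d|200\d|201\d)\b");
        var found = new List<string>();

        Match match = datePattern.Match(text);  // find the first match
        while (match.Success)                   // match found?
        {
            found.Add(match.Value);             // keep (or display) the match
            match = match.NextMatch();          // get the next match, if any
        }
        return found;
    }
}
```

FindDates("Born 04/18/92, hired 7/09/2007.") returns a list containing "04/18/92" and "7/09/2007", in that order.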

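The MatchCollection Class
Find and return all matches at once using MatchCollection.
(The code and output shown on this slide are not included in the transcript; only the callout labels survive.)
The steps labeled on the slide: return all matches; get the number of matches found; loop through each match in the collection; retrieve and display the substring matching the pattern.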
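A sketch of those steps using Regex.Matches, which returns a MatchCollection. The five-digit pattern, the class name, and the method name are assumptions made for illustration; the slide's original code is not available.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchCollectionDemo
{
    public static List<string> AllFiveDigitRuns(string text)
    {
        // Returns every non-overlapping match, scanning left to right.
        MatchCollection matches = Regex.Matches(text, @"\b\d{5}\b");

        // matches.Count holds the number of matches found.
        var result = new List<string>(matches.Count);

        // Loop through each match and retrieve the matching substring.
        foreach (Match m in matches)
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

AllFiveDigitRuns("62026, 90210 and 123456") returns "62026" and "90210"; the six-digit run is skipped because of the word boundaries.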

Regex.Replace method
Find all matches and replace each with a replacement string.
(The code and output shown on this slide are not included in the transcript.)
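As an illustration of Regex.Replace, the sketch below masks anything shaped like a Social Security number. The scenario, pattern, and names are assumptions chosen to fit the earlier slides, not the lecture's own example.

```csharp
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // Replaces every substring shaped like ddd-dd-dddd with a fixed mask.
    public static string MaskSsns(string text)
        => Regex.Replace(text, @"\b\d{3}-\d{2}-\d{4}\b", "***-**-****");
}
```

MaskSsns("SSN 123-45-6789 on file") returns "SSN ***-**-**** on file"; text without a match is returned unchanged.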

Regular Expression Tools on the Web
There are several tools that can be of use in developing regular expressions and in attempting to verify their accuracy, and some of them are free. At the time this was written, two useful tools were the RegExLibrary and Expresso. Both provide regular expressions for common situations, and both give the ability to test candidate regular expressions to see whether they match the intended patterns. These are free tools, but many of the regular expressions have been contributed by the public at large with no rigorous verification of their accuracy.

The Expresso Tool
(Screenshot not included in the transcript. Its callout labels: the regular expression to be tested; an interpretation of the regular expression; the sample text to be searched for matches; Run Match starts the search; the matches found in the sample text.)

Expresso Library Contents