Operators Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See


Regular Expressions: Operators

Notebook #1

Columns: Site, Date, Evil (millivaders)
Rows: Baker … (six shown, then ⋮)

Features of this format:
- a single tab as separator
- spaces in site names
- dates in international standard format (YYYY-MM-DD)

Notebook #2

    Site/Date/Evil
    Davison/May 22, 2010/
    Davison/May 23, 2010/
    Pertwee/May 24, 2010/
    Davison/June 19, 2010/
    Davison/July 6, 2010/
    Pertwee/Aug 4, 2010/
    Pertwee/Sept 3, 2010/
    ⋮

Features of this format:
- slashes as separators
- month names and day numbers of varying length

Regular expressions are patterns that match text.
1. Letters and digits match themselves.
2. '|' means OR.
3. '.' matches any single character.
4. Use '()' to enforce grouping.
5. re.search returns a match object or None.
6. match.group(k) is the text that matched group k.
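The six rules recapped above can be tried in one short session. This is only a sketch: the pattern and the sample sentences are made up for illustration and do not come from the notebooks.

```python
import re

# '|' means OR, parentheses group the alternatives,
# and each '.' matches any single character.
pattern = '(Jan|Feb|Mar) (..)'

match = re.search(pattern, 'reading taken Feb 07 at noon')
month = match.group(1)   # text matched by the first group
day = match.group(2)     # text matched by the second group
print(month, day)

# re.search returns None when the pattern matches nowhere in the text.
no_match = re.search(pattern, 'reading taken in April')
print(no_match)
```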

    # get fields from Notebook #2 with simple string methods
    record = 'Davison/May 22, 2010/1721.3'
    site, date, reading = record.split('/')
    month, day, year = date.split(' ')
    if day[-1] == ',':
        day = day[:-1]
    print(year, month, day)

Output:

    2010 May 22

This is procedural (we tell the computer how). Regular expressions are declarative (we tell the computer what, and it figures out how).

    # use '*' to break whole record into pieces
    import re
    match = re.search('(.*)/(.*)/(.*)', 'Davison/May 22, 2010/1721.3')
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))

Output:

    Davison
    May 22, 2010
    1721.3

- '*' means "zero or more".
- It is a postfix operator (like the 2 in x²).
- So (.*) means "zero or more characters".
- But the slashes must match exactly for it to work.
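To see "zero or more" on its own, here is a small sketch with a made-up pattern and made-up strings (none of them are from the notebooks). The '*' binds only to the item immediately before it.

```python
import re

# 'ab*c' reads as: an 'a', then zero or more 'b's, then a 'c'.
# The '*' applies only to the 'b' immediately before it.
samples = ['ac', 'abc', 'abbbc', 'adc']
results = {text: re.search('ab*c', text) is not None for text in samples}

for text in samples:
    print(text, results[text])
```

Only 'adc' fails, because 'd' is neither a 'b' nor the 'c' that must follow.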

    # but this permits false positives
    match = re.search('(.*)/(.*)/(.*)', '//')
    print('*', match.group(1))
    print('*', match.group(2))
    print('*', match.group(3))

Output:

    *
    *
    *

- '.*' can match the empty string (zero characters).
- So this pattern will accept badly-formatted data.

    # force pattern to match _some_ characters
    print(re.search('(.+)/(.+)/(.+)', '//'))

Output:

    None

'+' is a postfix operator meaning "1 or more".
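The difference between '*' and '+' also shows up when only one field is empty. The record below is hypothetical (a Notebook #2 line with its date missing) and is used purely as an illustration.

```python
import re

record = 'Davison//1721.3'   # hypothetical record whose date field is empty

with_star = re.search('(.*)/(.*)/(.*)', record)
with_plus = re.search('(.+)/(.+)/(.+)', record)

print(with_star is not None)   # the '*' pattern accepts the empty field
print(with_plus is not None)   # the '+' pattern rejects the record
```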

    # force pattern to match _some_ characters
    m = re.search('(.+)/(.+)/(.+)', 'Davison/May 22, 2010/1721.3')
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))

Output:

    Davison
    May 22, 2010
    1721.3

Always check that it still works with valid data.

    # write a function to show matched groups
    import re

    def show_groups(pattern, text):
        m = re.search(pattern, text)
        if m is None:
            print('NO MATCH')
            return
        for i in range(1, 1 + len(m.groups())):
            print('%2d: %s' % (i, m.group(i)))

    show_groups('(.+)/(.+)/(.+)', 'Davison/May 22, 2010/1721.3')

Output:

     1: Davison
     2: May 22, 2010
     3: 1721.3

    # get the year, month, and day at the same time
    show_groups('(.+)/(.+) (.+), (.+)/(.+)', 'Davison/May 22, 2010/1721.3')

Output:

     1: Davison
     2: May
     3: 22
     4: 2010
     5: 1721.3

    # why doesn't this work?
    show_groups('(.+)/(.+) (.+), (.+)/(.+)', 'Davison/May 22 2010/1721.3')

Output:

    NO MATCH

There is no comma in the data.

    # make the comma optional
    show_groups('(.+)/(.+) (.+),? (.+)/(.+)', 'Davison/May 22 2010/1721.3')

Output:

     1: Davison
     2: May
     3: 22
     4: 2010
     5: 1721.3

'?' is a postfix operator meaning "0 or 1", i.e., "optional".

The pattern still matches data with a comma, although the greedy '(.+)' keeps the comma in group 3, so ',?' matches nothing:

    show_groups('(.+)/(.+) (.+),? (.+)/(.+)', 'Davison/May 22, 2010/1721.3')

Output:

     1: Davison
     2: May
     3: 22,
     4: 2010
     5: 1721.3
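A stripped-down sketch of "0 or 1": the pattern below spells out one day and year literally with an optional comma between them. The third sample string (with a doubled comma) is invented to show that '?' does not allow more than one occurrence.

```python
import re

pattern = '22,? 2010'   # a literal day and year with an optional comma between them
samples = ['May 22, 2010', 'May 22 2010', 'May 22,, 2010']

# Keep only the strings in which the pattern is found.
accepted = [text for text in samples if re.search(pattern, text)]
print(accepted)
```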

    # we _don't_ want to match this:
    # years should have four digits, shouldn't they?
    show_groups('(.+)/(.+) (.+),? (.+)/(.+)', 'Davison/May 22, 201/1721.3')

Output:

     1: Davison
     2: May
     3: 22,
     4: 201
     5: 1721.3

Nobody's prefect.

    # force the pattern to match four characters
    show_groups('(.+)/(.+) (.+),? (....)/(.+)', 'Davison/May 22, 201/1721.3')

Output:

    NO MATCH

Won't win any awards for readability.

    # force the pattern to match four characters
    show_groups('(.+)/(.+) (.+),? (.{4})/(.+)', 'Davison/May 22, 201/1721.3')

Output:

    NO MATCH

    show_groups('(.+)/(.+) (.+),? (.{4})/(.+)', 'Davison/May 22, 2010/1721.3')

Output:

     1: Davison
     2: May
     3: 22,
     4: 2010
     5: 1721.3

'{N}' is a postfix operator meaning "match N times".
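As a quick check that '....' and '.{4}' mean the same thing, the sketch below runs both forms against two simplified, hypothetical records (site/year/reading only, not the real Notebook #2 layout).

```python
import re

long_form = '/(....)/'     # four dots written out
short_form = '/(.{4})/'    # the same thing with a repeat count

records = ['Davison/2010/1721.3', 'Davison/201/1721.3']
outcomes = []
for record in records:
    a = re.search(long_form, record)
    b = re.search(short_form, record)
    # Record the captured text, or None when the pattern does not match.
    outcomes.append((a.group(1) if a else None, b.group(1) if b else None))

print(outcomes)
```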

    # force the day to match a given width
    tests = (
        'Davison/May , 2010/1721.3',
        'Davison/May 2, 2010/1721.3',
        'Davison/May 22, 2010/1721.3',
        'Davison/May 222, 2010/1721.3',
        'Davison/May 2, 201/1721.3',
        'Davison/ 22, 2010/1721.3',
        '/May 22, 2010/1721.3',
        'Davison/May 22, 2010/'
    )
    pattern = '(.+)/(.+) (.{1,2}),? (.{4})/(.+)'
    show_matches(pattern, tests)

'{M,N}' matches from M to N times.

    # matching against '(.+)/(.+) (.{1,2}),? (.{4})/(.+)'
    ** Davison/May , 2010/1721.3
    ** Davison/May 2, 2010/1721.3
    ** Davison/May 22, 2010/1721.3
       Davison/May 222, 2010/1721.3
       Davison/May 2, 201/1721.3
       Davison/ 22, 2010/1721.3
       /May 22, 2010/1721.3
       Davison/May 22, 2010/

Why does the first case match?
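show_matches is called in these slides but never defined in this transcript. A minimal sketch, assuming it simply marks each matching string with '**', might look like the following; the returned list is an addition made here so the result can be inspected, and is not something the slides show.

```python
import re

def show_matches(pattern, strings):
    """Print each string, prefixed with '**' if the pattern matches it.

    Hypothetical reconstruction: the slides only show this function's output.
    """
    lines = []
    for s in strings:
        prefix = '**' if re.search(pattern, s) else '  '
        line = prefix + ' ' + s
        print(line)
        lines.append(line)
    return lines

marked = show_matches('(.+)/(.+) (.{1,2}),? (.{4})/(.+)',
                      ['Davison/May 2, 2010/1721.3',
                       'Davison/May 222, 2010/1721.3'])
```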

    # look at that test case more closely
    show_groups('(.+)/(.+) (.{1,2}),? (.{4})/(.+)', 'Davison/May , 2010/1721.3')

Output:

     1: Davison
     2: May
     3: ,
     4: 2010
     5: 1721.3

Lining up the pattern '(.+)/(.+) (.{1,2}),? (.{4})/(.+)' against 'Davison/May , 2010/1721.3':
1. space matches space
2. '.{1,2}' matches ','
3. ',?' matches nothing (it's optional)
4. space matches space again
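The walk-through above can be checked by asking the match object where each group landed. This sketch assumes the record has a space before the comma, which is what the step-by-step explanation implies; match.span(k) returns the start and end offsets of group k.

```python
import re

pattern = '(.+)/(.+) (.{1,2}),? (.{4})/(.+)'
record = 'Davison/May , 2010/1721.3'   # assumed: a space sits before the comma

m = re.search(pattern, record)
for k in range(1, 6):
    # span(k) is the (start, end) slice of the record that group k matched
    print(k, m.span(k), repr(m.group(k)))
```

Group 3 covers only the comma, and the optional ',?' after it consumes nothing.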

    # force a match against digits
    show_groups('(.+)/(.+) ([0-9]{1,2}),? (.{4})/(.+)', 'Davison/May , 2010/1721.3')

Output:

    NO MATCH

    show_groups('(.+)/(.+) ([0-9]{1,2}),? (.{4})/(.+)', 'Davison/May 22, 2010/1721.3')

Output:

     1: Davison
     2: May
     3: 22
     4: 2010
     5: 1721.3

'[...]' matches any one character in a set. E.g., '[aeiou]' matches vowels.
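A short sketch of character sets in isolation; the sample strings are invented for illustration. group(0) is the whole text matched by the pattern.

```python
import re

# A set matches exactly one character drawn from the listed alternatives.
first_vowel = re.search('[aeiou]', 'rhythm and blues').group(0)
no_vowel = re.search('[aeiou]', 'rhythm')

# A range such as 0-9 is shorthand for listing every digit.
day = re.search('[0-9]{1,2}', 'May 22, 2010').group(0)

print(first_vowel, no_vowel, day)
```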

    # match everything against characters and width
    p = '(.+)/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/(.+)'

- Month name is an upper-case letter followed by one or more lower-case letters.
- Day is one or two digits. This format allows '0', '00', '99', and so on. It is easiest to check that after converting to an integer, since valid ranges depend on the month.
- Year is exactly four digits. Again, check for '0000' and the like after conversion.
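The range checks that the pattern leaves for later could be sketched as follows. The helper name is_valid_day and the MONTHS table are assumptions introduced here, not part of the slides; month names are looked up by their first three letters so that both 'Sept' and 'June' resolve. calendar.monthrange from the standard library supplies the number of days in a month.

```python
import calendar

# Hypothetical lookup table: three-letter month prefix -> month number.
MONTHS = {name: number for number, name in enumerate(
    ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
     'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], start=1)}

def is_valid_day(year, month_name, day):
    """Check year, month and day strings as captured by the pattern."""
    month = MONTHS.get(month_name[:3])
    if month is None:
        return False
    y, d = int(year), int(day)
    if y < 1:                      # rejects '0000'
        return False
    last_day = calendar.monthrange(y, month)[1]   # days in that month
    return 1 <= d <= last_day

print(is_valid_day('2010', 'May', '22'))
print(is_valid_day('2010', 'Feb', '30'))
```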

    # Put it all together.
    def get_date(record):
        '''Return (Y, M, D) as strings, or None.'''

        # international standard format: YYYY-MM-DD
        m = re.search('([0-9]{4})-([0-9]{2})-([0-9]{2})', record)
        if m:
            return m.group(1), m.group(2), m.group(3)

        # Jan 1, 2010 (comma optional, day may be 1 or 2 digits)
        m = re.search('/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/', record)
        if m:
            return m.group(3), m.group(1), m.group(2)

        return None

Created by Greg Wilson, June 2010. Copyright © Software Carpentry 2010. This work is licensed under the Creative Commons Attribution License.