Regular Expressions Chapter 11 Python for Informatics: Exploring Information www.pythonlearn.com.

Slides:



Advertisements
Similar presentations
Session 3BBK P1 ModuleApril 2010 : [#] Regular Expressions.
Advertisements

Python for Informatics: Exploring Information
BBK P1 Module2010/11 : [‹#›] Regular Expressions.
Reading Files Chapter 7 Python for Informatics: Exploring Information
Regular Expression Original Notes by Song Guo. What Regular Expressions Are Exactly - Terminology a regular expression is a pattern describing a certain.
JavaScript, Third Edition
Regular expression. Validation need a hard and very complex programming. Sometimes it looks easy but actually it is not. So there is a lot of time and.
Scripting Languages Chapter 8 More About Regular Expressions.
Loops and Iteration Chapter 5 Python for Informatics: Exploring Information
More on Regular Expressions Regular Expressions More character classes \s matches any whitespace character (space, tab, newline etc) \w matches.
Lesson 3 – Regular Expressions Sandeepa Harshanganie Kannangara MBCS | B.Sc. (special) in MIT.
Game Programming © Wiley Publishing All Rights Reserved. The L Line The Express Line to Learning L Line L.
Regular Expressions Week 07 TCNJ Web 2 Jean Chu. Regular Expressions Regular Expressions are a powerful way to validate and format text strings that may.
© Copyright 1992–2004 by Deitel & Associates, Inc. and Pearson Education Inc. All Rights Reserved Streams Streams –Sequences of characters organized.
Regular Expressions Dr. Ralph D. Westfall May, 2011.
Overview of the grep Command Alex Dukhovny CS 265 Spring 2011.
Pattern matching with regular expressions A common file processing requirement is to match strings within the file to a standard form, e.g. address.
 Text Manipulation and Data Collection. General Programming Practice Find a string within a text Find a string ‘man’ from a ‘A successful man’
PHP Workshop ‹#› Data Manipulation & Regex. PHP Workshop ‹#› What..? Often in PHP we have to get data from files, or maybe through forms from a user.
Computer Programming for Biologists Class 5 Nov 20 st, 2014 Karsten Hokamp
Python for Informatics: Exploring Information
CIS 451: Regular Expressions Dr. Ralph D. Westfall January, 2009.
Regular Expression (continue) and Cookies. Quick Review What letter values would be included for the following variable, which will be used for validation.
Perl and Regular Expressions Regular Expressions are available as part of the programming languages Java, JScript, Visual Basic and VBScript, JavaScript,
Functions Chapter 4 Python for Informatics: Exploring Information
Instructor: Craig Duckett Lecture 08: Thursday, October 22 nd, 2015 Patterns, Order of Evaluation, Concatenation, Substrings, Trim, Position 1 BIT275:
By: Andrew Cory. Grouping Things & Hierarchical Matching Grouping characters – ( and ) Allows parts of a regular expression to be treated as a single.
Regular Expressions Regular Expressions. Regular Expressions  Regular expressions are a powerful string manipulation tool  All modern languages have.
Working with Forms and Regular Expressions Validating a Web Form with JavaScript.
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
Regular Expressions. Overview Regular expressions allow you to do complex searches within text documents. Examples: Search 8-K filings for restatements.
Python for NLP Regular Expressions CS1573: AI Application Development, Spring 2003 (modified from Steven Bird’s notes)
Regular Expressions in Perl CS/BIO 271 – Introduction to Bioinformatics.
Copyright © Curt Hill Regular Expressions Providing a Search Pattern.
CIT 383: Administrative ScriptingSlide #1 CIT 383: Administrative Scripting Regular Expressions.
CompSci 101 Introduction to Computer Science November 18, 2014 Prof. Rodger.
Unit 11 –Reglar Expressions Instructor: Brent Presley.
Standard Types and Regular Expressions CS 480/680 – Comparative Languages.
NOTE: To change the image on this slide, select the picture and delete it. Then click the Pictures icon in the placeholder to insert your own image. ADVANCED.
Regular Expressions /^Hel{2}o\s*World\n$/ SoftUni Team Technical Trainers Software University
Regular expressions Day 11 LING Computational Linguistics Harry Howard Tulane University.
Regular Expressions /^Hel{2}o\s*World\n$/ SoftUni Team Technical Trainers Software University
-Joseph Beberman *Some slides are inspired by a PowerPoint presentation used by professor Seikyung Jung, which was derived from Charlie Wiseman.
Python for Informatics: Exploring Information
Python Pattern Matching and Regular Expressions Peter Wad Sackett.
Compsci 101.2, Fall Plan for LWoC l Power of Regular Expressions  From theoretical computer science to scraping webpages  Using documentation,
Editing Tons of Text? RegEx to the Rescue! Eric Cressey Senior UX Content Writer Symantec Corporation.
Lesson 4 String Manipulation. Lesson 4 In many applications you will need to do some kind of manipulation or parsing of strings, whether you are Attempting.
Regular Expressions In Javascript cosc What Do They Do? Does pattern matching on text We use the term “string” to indicate the text that the regular.
Chapter 4 Computing With Strings Charles Severance Textbook: Python Programming: An Introduction to Computer Science, John Zelle.
Reading Files Chapter 7 Python for Everybody
Regular Expressions Upsorn Praphamontripong CS 1110
Looking for Patterns - Finding them with Regular Expressions
CompSci 101 Introduction to Computer Science
Concepts of Programming Languages
Regular Expressions Chapter 11 Python for Everybody
Python Lists Chapter 8 Python for Everybody
Chapter 4 Computing With Strings
Python for Informatics: Exploring Information
Strings Chapter 6 Slightly modified by Recep Kaya Göktaş in April Python for Informatics: Exploring Information
Advanced Find and Replace with Regular Expressions
Algorithm Discovery and Design
Python for Informatics: Exploring Information
Data Manipulation & Regex
Python for Informatics: Exploring Information
Introduction to Computer Science
L L Line CSE 420 Computer Games Lecture #4 Working with Data.
ADVANCE FIND & REPLACE WITH REGULAR EXPRESSIONS
Introduction to Computer Science
Presentation transcript:

Regular Expressions Chapter 11 Python for Informatics: Exploring Information

Unless otherwise noted, the content of this course material is licensed under a Creative Commons Attribution 3.0 License. Copyright Charles Severance

Regular Expressions In computing, a regular expression, also referred to as "regex" or "regexp", provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.computingf text,that can be int

Regular Expressions Really clever "wild card" expressions for matching and parsing strings.

Really smart "Find" or "Search"

Understanding Regular Expressions Very powerful and quite cryptic Fun once you understand them Regular expressions are a language unto themselves A language of "marker characters" - programming with characters It is kind of an "old school" language - compact

Regular Expression Quick Guide ^ Matches the beginning of a line $ Matches the end of the line. Matches any character \s Matches whitespace \S Matches any non-whitespace character * Repeats a character zero or more times *? Repeats a character zero or more times (non-greedy) + Repeats a chracter one or more times +? Repeats a character one or more times (non-greedy) [aeiou] Matches a single character in the listed set [^XYZ] Matches a single character not in the listed set [a-z0-9] The set of characters can include a range ( Indicates where string extraction is to start ) Indicates where string extraction is to end

The Regular Expression Module Before you can use regular expressions in your program, you must import the library using "import re" You can use re.search() to see if a string matches a regular expression similar to using the find() method for strings You can use re.findall() extract portions of a string that match your regular expression similar to a combination of find() and slicing: var[5:10]

Using re.search() like find() import rehand = open('mbox-short.txt')for line in hand: line = line.rstrip() if re.search('From:', line) : print line hand = open('mbox-short.txt')for line in hand: line = line.rstrip() if line.find('From:') >= 0: print line

Using re.search() like startswith() import rehand = open('mbox-short.txt')for line in hand: line = line.rstrip() if re.search('^From:', line) : print line hand = open('mbox-short.txt')for line in hand: line = line.rstrip() if line.startswith('From:') : print line We fine-tune what is matched by adding special characters to the string

Wild-Card Characters The dot character matches any character If you add the asterisk character, the character is "any number of times" X-Sieve: CMU Sieve 2.3 X-DSPAM-Result: InnocentX-DSPAM-Confidence: X-Content-Type-Message-Body: text/plain ^X.*:

Wild-Card Characters The dot character matches any character If you add the asterisk character, the character is "any number of times" X-Sieve: CMU Sieve 2.3 X-DSPAM-Result: InnocentX-DSPAM-Confidence: X-Content-Type-Message-Body: text/plain ^X.*:^X.*: Match the start of the line Match any character Many times

Wild-Card Characters The dot character matches any character If you add the asterisk character, the character is "any number of times" X-Sieve: CMU Sieve 2.3 X-DSPAM-Result: InnocentX-DSPAM-Confidence: X-Content-Type-Message-Body: text/plain ^X.*:^X.*: Match the start of the line Match any character Many times

Fine-Tuning Your Match Depending on how "clean" your data is and the purpose of your application, you may want to narrow your match down a bit X-Sieve: CMU Sieve 2.3 X-DSPAM-Result: Innocent X Plane is behind schedule: two weeks ^X.*:^X.*: Match the start of the line Match any character Many times

Fine-Tuning Your Match Depending on how "clean" your data is and the purpose of your application, you may want to narrow your match down a bit X-Sieve: CMU Sieve 2.3 X-DSPAM-Result: Innocent X Plane is behind schedule: two weeks ^X-\S+: Match the start of the line Match any non-whitespace character One or more times

Matching and Extracting Data The re.search() returns a True/False depending on whether the string matches the regular expression If we actually want the matching strings to be extracted, we use re.findall() >>> import re >>> x = 'My 2 favorite numbers are 19 and 42'>>> y = re.findall('[0-9]+',x)>>> print y['2', '19', '42'] [0-9]+ One or more digits

Matching and Extracting Data When we use re.findall() it returns a list of zero or more sub- strings that match the regular expression >>> import re >>> x = 'My 2 favorite numbers are 19 and 42'>>> y = re.findall('[0-9]+',x)>>> print y['2', '19', '42'] >>> y = re.findall('[AEIOU]+',x)>>> print y[]

Warning: Greedy Matching The repeat characters (* and +) push outward in both directions (greedy) to match the largest possible string >>> import re>>> x = 'From: Using the : character'>>> y = re.findall('^F.+:', x)>>> print y['From: Using the :'] ^F.+: One or more characters First character in the match is an F Last character in the match is a : Why not 'From:'?

Non-Greedy Matching Not all regular expression repeat codes are greedy! If you add a ? character - the + and * chill out a bit... >>> import re>>> x = 'From: Using the : character'>>> y = re.findall('^F.+?:', x)>>> print y['From:'] ^F.+?: One or more characters but not greedily First character in the match is an F Last character in the match is a :

Fine Tuning String Extraction You can refine the match for re.findall() and separately determine which portion of the match that is to be extracted using parenthesis From Sat Jan 5 09:14: >>> y = print y = re.findall('^From:.*? print At least one non-whitespace character

Fine Tuning String Extraction Parenthesis are not part of the match - but they tell where to start and stop what string to extract From Sat Jan 5 09:14: >>> y = print y = re.findall('^From print ^From

>>> data = 'From Sat Jan 5 09:14: '>>> atpos = print atpos21>>> sppos = data.find(' ',atpos)>>> print sppos31>>> host = data[atpos+1 : sppos]>>> print hostuct.ac.za From Sat Jan 5 09:14: Extracting a host name - using find and string slicing.

The Double Split Version Sometimes we split a line one way and then grab one of the pieces of the line and split that piece again From Sat Jan 5 09:14:

The Double Split Version Sometimes we split a line one way and then grab one of the pieces of the line and split that piece again From Sat Jan 5 09:14: words = line.split() = words[1] pieces = print pieces[1] ['stephen.marquard', 'uct.ac.za'] 'uct.ac.za'

The Regex Version From Sat Jan 5 09:14: import re lin = 'From Sat Jan 5 09:14: 'y = ]*)',lin) print y['uct.ac.za'] ]*)' Look through the string until you find an at-sign

The Regex Version From Sat Jan 5 09:14: import re lin = 'From Sat Jan 5 09:14: 'y = ]*)',lin) print y['uct.ac.za'] ]*)' Match non-blank characterMatch many of them

The Regex Version From Sat Jan 5 09:14: import re lin = 'From Sat Jan 5 09:14: 'y = ]*)',lin) print y['uct.ac.za'] ]*)' Extract the non-blank characters

Even Cooler Regex Version From Sat Jan 5 09:14: import re lin = 'From Sat Jan 5 09:14: 'y = ]*)',lin) print y['uct.ac.za'] ]*)' Starting at the beginning of the line, look for the string 'From '

Even Cooler Regex Version From Sat Jan 5 09:14: import re lin = 'From Sat Jan 5 09:14: 'y = ]*)',lin) print y['uct.ac.za'] ]*)' Skip a bunch of characters, looking for an at-sign

Even Cooler Regex Version From Sat Jan 5 09:14: import re lin = 'From Sat Jan 5 09:14: 'y = ]*)',lin) print y['uct.ac.za'] ]*)' Start 'extracting'

Even Cooler Regex Version From Sat Jan 5 09:14: import re lin = 'From Sat Jan 5 09:14: 'y = ]*)',lin) print y['uct.ac.za'] ]*)' Match non-blank characterMatch many of them

Even Cooler Regex Version From Sat Jan 5 09:14: import re lin = 'From Sat Jan 5 09:14: 'y = ]*)',lin) print y['uct.ac.za'] ]*)' Stop 'extracting'

Spam Confidence import rehand = open('mbox-short.txt')numlist = list()for line in hand: line = line.rstrip() stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line) if len(stuff) != 1 : continue num = float(stuff[0]) numlist.append(num)print 'Maximum:', max(numlist) python ds.py Maximum:

Regular Expression Quick Guide ^ Matches the beginning of a line $ Matches the end of the line. Matches any character \s Matches whitespace \S Matches any non-whitespace character * Repeats a character zero or more times *? Repeats a character zero or more times (non-greedy) + Repeats a chracter one or more times +? Repeats a character one or more times (non-greedy) [aeiou] Matches a single character in the listed set [^XYZ] Matches a single character not in the listed set [a-z0-9] The set of characters can include a range ( Indicates where string extraction is to start ) Indicates where string extraction is to end

Escape Character If you want a special regular expression character to just behave normally (most of the time) you prefix it with '\' >>> import re>>> x = 'We just received $10.00 for cookies.'>>> y = re.findall('\$[0-9.]+',x)>>> print y['$10.00'] \$[0-9.]+ A digit or period A real dollar sign At least one or more

Summary Regular expressions are a cryptic but powerful language for matching strings and extracting elements from those strings Regular expressions have special characters that indicate intent