REGULAR EXPRESSIONS 4 DAY 9 - 9/15/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.



Course organization

The syllabus is under construction.

15-Sept-2014, NLP, Prof. Howard, Tulane University

Review

The quiz was the review.

Summary table

meta-character  matches             name         notes
a|b             a or b              disjunction
(ab)            a and b             grouping     re.findall() only outputs what is in (); use (?:ab) to output the whole match
[ab]            a or b              range        [a-z] lowercase, [A-Z] uppercase, [0-9] digits
[^a]            all but a           negation
a{m,n}          from m to n of a    repetition   a{n} exactly n of a
^a              a at start of S
a$              a at end of S
a+              one or more of a                 a+? lazy +
a*              zero or more of a   Kleene star  a*? lazy *
a?              with or without a   optionality  a?? lazy ?
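A few rows of the table can be checked directly in Python. The sketch below is only an illustration: the sample strings are made up for this purpose and are not from the course materials. It contrasts a capturing group with a non-capturing one, a greedy quantifier with its lazy counterpart, and shows a bounded repetition.

```python
import re

# Hypothetical sample text, chosen only to exercise the meta-characters above.
text = "abab aaa ab b"

# Capturing group: findall returns only what is inside ().
captured = re.findall(r"(ab)+", text)
# Non-capturing group: findall returns the whole match.
whole = re.findall(r"(?:ab)+", text)

# Greedy + takes as many characters as it can; lazy +? takes as few as it can.
greedy = re.findall(r"a+", "aaa")
lazy = re.findall(r"a+?", "aaa")

# Bounded repetition: between 2 and 3 of a (no space after the comma).
ranged = re.findall(r"a{2,3}", "a aa aaaa")

print(captured)  # ['ab', 'ab']
print(whole)     # ['abab', 'ab']
print(greedy)    # ['aaa']
print(lazy)      # ['a', 'a', 'a']
print(ranged)    # ['aa', 'aaa']
```

Note that a{m, n} written with a space after the comma is not treated as a repetition by Python's re module, which is why the pattern above uses a{2,3}.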

§4. Regular expressions 4

There is a bit more to say.

Open Spyder

Sample string

>>> import re
>>> S = '''This above all: to thine own self be true,
... And it must follow, as the night the day,
... Thou canst not then be false to any man.'''

4.4. Character classes

class  abbreviates      name                  notes
\w     [a-zA-Z0-9_]     alphanumeric          it's really alphanumeric and underscore, but we are lazy
\W     [^a-zA-Z0-9_]    not alphanumeric
\d     [0-9]            digit
\D     [^0-9]           not a digit
\s     [ \t\v\n\r\f]    whitespace
\S     [^ \t\v\n\r\f]   not whitespace
\t                      horizontal tab
\v                      vertical tab
\n                      newline
\r                      carriage return
\f                      form-feed
\b                      word boundary
\B                      not a word boundary
\A     ^
\Z     $
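The classes in the table can be tried out on a short string. This is a minimal sketch with a made-up sample line (not from the slides). It also illustrates one caveat: in Python 3 the classes \w, \d and \s match Unicode characters by default, so the bracket expressions in the "abbreviates" column are exact equivalents only for ASCII text or when the re.ASCII flag is given.

```python
import re

# Hypothetical sample line containing a space, a tab and a newline.
line = "Room 12B,\tfloor 3\n"

digits = re.findall(r"\d", line)          # same as [0-9] for this ASCII text
words = re.findall(r"\w+", line)          # runs of letters, digits, underscore
whitespace = re.findall(r"\s", line)      # space, tab, newline
non_space_runs = re.findall(r"\S+", line) # everything between whitespace

print(digits)          # ['1', '2', '3']
print(words)           # ['Room', '12B', 'floor', '3']
print(whitespace)      # [' ', '\t', ' ', '\n']
print(non_space_runs)  # ['Room', '12B,', 'floor', '3']

# Unicode caveat: by default \w also matches non-ASCII letters.
unicode_words = re.findall(r"\w+", "naïve")
ascii_words = re.findall(r"\w+", "naïve", flags=re.ASCII)
print(unicode_words)   # ['naïve']
print(ascii_words)     # ['na', 've']
```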

Raw string notation with r''

Python processes the string that holds a regular expression just like any other string. This can lead to unexpected results with class meta-characters, because the backslash that they incorporate is sometimes also used by Python for its own escape sequences.

For instance, we just met a class meta-character \b, which marks the edge of a word. It will be extremely useful for us, but it happens to overlap with Python's own escape sequence for the backspace character, \b.

Raw text

The way to resolve this ambiguity is to prefix an r to a regular expression. The r marks the regular expression as raw text, so Python does not process it for special characters. The previous example is augmented with the raw text notation below:

>>> re.findall(r'\b\w\w\b', S)
['to', 'be', 'it', 'as', 'be', 'to']
>>> re.findall(r'\b\w{2}\b', S)
['to', 'be', 'it', 'as', 'be', 'to']
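The difference that the r prefix makes can be checked directly. The sketch below reuses the sample string S from the slides; the variable names plain, raw, no_hits and hits are introduced here only for illustration. Without the prefix, Python turns \b into a single backspace character before the regular expression engine ever sees it, so the pattern looks for backspaces rather than word boundaries.

```python
import re

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

plain = '\b'   # one character: backspace (\x08)
raw = r'\b'    # two characters: a backslash followed by b

print(len(plain), len(raw))  # 1 2

# Without r, each \b is a literal backspace, which never occurs in S.
# (\\w is written with a doubled backslash so that only \b differs.)
no_hits = re.findall('\b\\w\\w\b', S)

# With r, \b reaches the regex engine intact and means "word boundary".
hits = re.findall(r'\b\w\w\b', S)

print(no_hits)  # []
print(hits)     # ['to', 'be', 'it', 'as', 'be', 'to']
```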

More raw text

As a further illustration, what do you think are the non-alphanumeric characters in the Shakespeare text?

>>> re.findall(r'\W', S)
[' ', ' ', ':', ' ', ' ', ' ', ' ', ' ', ' ', ',', '\n',
 ' ', ' ', ' ', ',', ' ', ' ', ' ', ' ', ' ', ',', '\n',
 ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '.']
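A long list of single characters is hard to read, so one option is to tally it. The sketch below is an addition for illustration, not part of the original slides: it counts the \W matches with collections.Counter and then narrows the pattern to [^\w\s] to leave out the whitespace.

```python
import re
from collections import Counter

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

# Tally every non-alphanumeric character in S.
tally = Counter(re.findall(r'\W', S))
print(tally[' '], tally[','], tally['\n'], tally[':'], tally['.'])  # 24 3 2 1 1

# Neither alphanumeric nor whitespace: punctuation only.
punctuation = re.findall(r'[^\w\s]', S)
print(punctuation)  # [':', ',', ',', ',', '.']
```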

Practice

Further practice of variable-length matching
4.6. Further practice
Practice with answers on a different page

§5. Lists 1

There is a bit more to say.

Introduction

In working with re.findall(), you have seen many instances of a collection of strings held within square brackets, such as the one below:

>>> S = '''This above all: to thine own self be true,
... And it must follow, as the night the day,
... Thou canst not then be false to any man.'''
>>> re.findall(r'\b[a-zA-Z]{4}\b', S)
['This', 'self', 'true', 'must', 'Thou', 'then']

Definition of list

A list in Python is a sequence of objects delimited by square brackets, []. The objects are separated by commas. Consider this sentence from Shakespeare's A Midsummer Night's Dream represented as a list:

>>> L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',', 'but', 'with', 'the', 'mind', '.']
>>> type(L)
>>> type(L[0])

L is a list of strings. You may think that a string is also a list of characters, and you would be correct for ordinary English, but in pythonic English, the word 'list' refers exclusively to a sequence of objects delimited by square brackets.
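The two type() calls can be run as follows. This is a sketch assuming Python 3, where the results print as <class '...'> (Python 2 prints <type '...'> instead). The last lines illustrate the point about pythonic English: a string is not a list, although list() will turn one into a list of its characters.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

print(type(L))     # <class 'list'>
print(type(L[0]))  # <class 'str'>

# A string is a sequence of characters, but it is not a list.
word = L[0]
print(isinstance(word, list))  # False

# list() converts a string into an actual list of one-character strings.
letters = list(word)
print(letters)                 # ['L', 'o', 'v', 'e']
```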

An example with numerical objects

>>> i = 2
>>> type(i)
>>> I = [0,1,i,3]
>>> type(I)
>>> type(I[0])
>>> n =
>>> type(n)
>>> N = [2.0,2.1,2.2,n]
>>> type(N)
>>> type(N[0])
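A runnable version of this example might look like the sketch below, again assuming Python 3. The value assigned to n is missing from the transcript, so 2.3 is a hypothetical stand-in chosen only because the other members of N are floats. The list mixed is also an addition, to show that the type of a list does not depend on the types of its members.

```python
i = 2
I = [0, 1, i, 3]
print(type(i), type(I), type(I[0]))
# <class 'int'> <class 'list'> <class 'int'>

n = 2.3  # ASSUMPTION: hypothetical value, the original is not recoverable
N = [2.0, 2.1, 2.2, n]
print(type(n), type(N), type(N[0]))
# <class 'float'> <class 'list'> <class 'float'>

# A list is a list whatever it holds, and its members may differ in type.
mixed = [i, n, 'two']
element_types = [type(x).__name__ for x in mixed]
print(element_types)  # ['int', 'float', 'str']
```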

Most of the string methods work just as well on lists

>>> len(L)
>>> sorted(L)
>>> set(L)
>>> sorted(set(L))
>>> len(sorted(set(L)))
>>> L+'!'
>>> len(L+'!')
>>> L*2
>>> len(L*2)
>>> L.count('the')
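Most of these calls behave on the list L as they would on a string, but not all of them. The sketch below runs them on L. The one that fails is L+'!', because a list can only be concatenated with another list; wrapping the string in square brackets is one way around that. The names unique_sorted, concat_error, extended and doubled are introduced here for illustration only.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

print(len(L))                # 12
unique_sorted = sorted(set(L))
print(unique_sorted)
# [',', '.', 'Love', 'but', 'eyes', 'looks', 'mind', 'not', 'the', 'with']
print(len(unique_sorted))    # 10

# Concatenating a list with a string raises TypeError.
try:
    L + '!'
    concat_error = None
except TypeError as err:
    concat_error = err
print(type(concat_error).__name__)  # TypeError

# Concatenating a list with a one-member list works.
extended = L + ['!']
print(len(extended))         # 13

doubled = L * 2
print(len(doubled))          # 24
print(L.count('the'))        # 2
```

Punctuation and capitalized words sort before lowercase words because sorted() compares strings by character code.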

String methods work on lists, cont.

>>> L.count('Love')
>>> L.count('love')
>>> L.index('with')
>>> L.rindex('with')
>>> L[2:]
>>> L[:2]
>>> L[-2:]
>>> L[:-2]
>>> L[2:-2]
>>> L[-2:2]
>>> L[:]
>>> L[:-1]+['!']
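The same kind of check can be made for this second batch. In the sketch below, counting, indexing and slicing work as they do on strings, but lists have no rindex() method, so that call raises AttributeError. The expression used for last_with is one possible workaround, not something taken from the slides.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

print(L.count('Love'), L.count('love'))  # 1 0  (matching is case-sensitive)
print(L.index('with'))                   # 3    (first occurrence)

# Strings have rindex(), lists do not.
has_rindex = hasattr(L, 'rindex')
print(has_rindex)                        # False

# One workaround: search the reversed list and convert the position back.
last_with = len(L) - 1 - L[::-1].index('with')
print(last_with)                         # 8

print(L[:2])    # ['Love', 'looks']
print(L[-2:])   # ['mind', '.']
print(L[2:-2])  # ['not', 'with', 'the', 'eyes', ',', 'but', 'with', 'the']
print(L[-2:2])  # []  (the start lies after the end, so the slice is empty)

# Replace the final full stop with an exclamation mark.
exclaimed = L[:-1] + ['!']
print(exclaimed[-2:])  # ['mind', '!']
```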

Q1

MIN 5.0
AVG 9.5
MAX

Next time

More on lists