REGULAR EXPRESSIONS 3 DAY 8 - 9/12/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.


Course organization

• The syllabus is under construction.

12-Sept-2014, NLP, Prof. Howard, Tulane University

Review

• How did the homework go?

Open Spyder

>>> import re

§4. Regular expressions 2

4.3. Variable-length matching

Match an unknown number of characters with + and *

• Imagine that you are given string S5 and told to match the prepositions at the end of each word:

>>> S5 = 'break breakup breakout breakdown breakthrough '
>>> re.findall('break([a-z]+) ', S5)
['up', 'out', 'down', 'through']
>>> re.findall('break([a-z]*) ', S5)
['', 'up', 'out', 'down', 'through']

• + requires at least one character, while * also accepts none, which is why only the * pattern matches bare break and returns an empty string for it.
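
As a cross-check on the difference between + and *, the sketch below restates each one with curly brackets. It assumes the open-ended counts {1,} and {0,}, which Python's re module accepts but which do not appear on the slide; the variable names are illustrative only.

```python
import re

# S5 is the string from the slide; note the trailing space after the last word.
S5 = 'break breakup breakout breakdown breakthrough '

# + means "one or more", which can also be written {1,}
plus_result = re.findall('break([a-z]+) ', S5)
plus_braces = re.findall('break([a-z]{1,}) ', S5)

# * means "zero or more", which can also be written {0,}
star_result = re.findall('break([a-z]*) ', S5)
star_braces = re.findall('break([a-z]{0,}) ', S5)

print(plus_result == plus_braces)  # True
print(star_result == star_braces)  # True
print(star_result[0] == '')        # True: bare 'break' has an empty suffix
```

The only difference between the two results is the leading empty string, which is the zero-length suffix of bare break.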

Match optional characters with ?

• A similar sort of problem is to match the singular and plural forms of fish in S6:

>>> S6 = 'fish fishes fishy fisher '
>>> re.findall('fish(?:es)? ', S6)
['fish ', 'fishes ']

• The same output can be had by listing both words as alternatives with |:

>>> re.findall('fish |fishes ', S6)
['fish ', 'fishes ']

• But this formulation doesn't encode the asymmetry between the two words: fishes is a variant of fish. That is to say, | over-fits the data.

• Optionality is equivalent to matching zero or one instances of the string in question, so it can be mimicked with curly brackets:

>>> re.findall('fish(?:es){0,1} ', S6)
['fish ', 'fishes ']
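
The patterns above wrap es in (?:...) rather than plain parentheses. A minimal sketch of why, using the slide's S6 and two illustrative variable names: with plain parentheses, findall returns only the contents of the group, not the whole word.

```python
import re

# S6 is the string from the slide.
S6 = 'fish fishes fishy fisher '

# Plain parentheses capture: findall returns only the group's contents,
# and an optional group that took no part in the match comes back as ''.
captured = re.findall('fish(es)? ', S6)

# (?:...) groups without capturing, so findall returns the whole match.
whole = re.findall('fish(?:es)? ', S6)

print(captured)  # ['', 'es']
print(whole)     # ['fish ', 'fishes ']
```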

Match optional characters with ?, cont.

• Optionality provides a means to do a rough morphological matching:

>>> S6 = 'fish fishes fishy fisher fishers '
>>> re.findall('fish(?:er)?(?:s)? ', S6)
['fish ', 'fisher ', 'fishers ']

• Restating the optional substrings as a choice among entire words obfuscates the linguistic generalization that er and s are suffixes to nouns:

>>> re.findall('fish |fisher |fishers ', S6)
['fish ', 'fisher ', 'fishers ']
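
One way to keep the stem-plus-suffixes generalization visible is to build the pattern from its parts. The helper below, optional_suffix_pattern, is a hypothetical name introduced only for this sketch; it assumes that each suffix is optional, that suffixes occur in the order given, and that every word is followed by a space, as in S6.

```python
import re

def optional_suffix_pattern(stem, suffixes):
    """Build 'stem(?:suffix1)?(?:suffix2)?... ' with each suffix optional."""
    parts = [re.escape(stem)]
    for suffix in suffixes:
        # (?:...)? makes the suffix optional without capturing it
        parts.append('(?:' + re.escape(suffix) + ')?')
    # The trailing space marks the end of the word, as on the slide
    return ''.join(parts) + ' '

# S6 is the string from the slide.
S6 = 'fish fishes fishy fisher fishers '
pattern = optional_suffix_pattern('fish', ['er', 's'])
matches = re.findall(pattern, S6)

print(repr(pattern))  # 'fish(?:er)?(?:s)? '
print(matches)        # ['fish ', 'fisher ', 'fishers ']
```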

+, * and ? match lazily with ?

• You wouldn't want to trust the plus, star and question meta-characters with your wallet, since they take as much as they can get. Consider the problem of matching the individual quotes in a sequence:

>>> S7 = "'Cool' never goes out of style, but 'gnarly' does."
>>> re.findall("'.*'", S7)
["'Cool' never goes out of style, but 'gnarly'"]

• This is because * (and + and ?) are greedy: they match the largest number of characters that they can. So even though you may see 'Cool' and 'gnarly' as the only accurate matches, what * actually sees is something like 'Cool' never goes out of style, but 'gnarly'.

• If such greedy matching is not desired, it can be turned off by suffixing any of the three meta-characters with a question mark:

>>> re.findall("'.*?'", S7)
["'Cool'", "'gnarly'"]

• Turning off the greediness of ? means typing two question marks, ??, which I have not found a good example of yet.
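
One possible illustration of ??, offered as a sketch rather than as the author's example: when nothing follows the optional group, the greedy ? takes es and the lazy ?? leaves it behind. The test string 'fishes' and the variable names are assumptions made for this example.

```python
import re

# Greedy ?: takes the optional 'es' whenever it can.
greedy = re.findall('fish(?:es)?', 'fishes')

# Lazy ??: prefers to skip the optional 'es', and nothing after it forces a retry.
lazy = re.findall('fish(?:es)??', 'fishes')

# When something must follow (here a space), the lazy version is forced
# to take 'es' after all, so both patterns give the same result.
S6 = 'fish fishes fishy fisher '
greedy_spaced = re.findall('fish(?:es)? ', S6)
lazy_spaced = re.findall('fish(?:es)?? ', S6)

print(greedy)                        # ['fishes']
print(lazy)                          # ['fish']
print(greedy_spaced == lazy_spaced)  # True
```

The second pair suggests why good examples are scarce: as soon as the rest of the pattern pins down where the match must end, ? and ?? find the same strings.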

Summary table

meta-character  matches            name         notes
a|b             a or b             disjunction
(ab)            a and b            grouping     only what is in () is output; use (?:ab) for groups that should not be output
[ab]            a or b             range        [a-z] lowercase, [A-Z] uppercase, [0-9] digits
[^a]            all but a          negation
a{m,n}          from m to n of a   repetition   a{n} exactly n of a
^a              a at start of S
a$              a at end of S
a+              one or more of a                a+? lazy +
a*              zero or more of a  Kleene star  a*? lazy *
a?              with or without a  optionality  a?? lazy ?
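
The rows of the summary table can be tried out in one pass. In the sketch below, every test string and expected result is made up for illustration and is not taken from the slides; each expected list is what re.findall returns for that pattern and string.

```python
import re

# (pattern, made-up test string, expected findall result)
examples = [
    ('a|b',     'abc',           ['a', 'b']),           # disjunction
    ('(ab)c',   'abc abd',       ['ab']),               # capturing group: only () is output
    ('(?:ab)c', 'abc abd',       ['abc']),              # non-capturing group: whole match
    ('[ab]',    'cab',           ['a', 'b']),           # range
    ('[^a]',    'banana',        ['b', 'n', 'n']),      # negation
    ('a{2,3}',  'a aa aaa aaaa', ['aa', 'aaa', 'aaa']), # from 2 to 3 of a
    ('a{2}',    'aaa',           ['aa']),               # exactly 2 of a
    ('^a',      'aba',           ['a']),                # a at start of string
    ('a$',      'aba',           ['a']),                # a at end of string
    ('ba+',     'b ba baa',      ['ba', 'baa']),        # one or more of a
    ('ba+?',    'b ba baa',      ['ba', 'ba']),         # lazy +
    ('ba*',     'b ba baa',      ['b', 'ba', 'baa']),   # zero or more of a
    ('ba*?',    'b ba baa',      ['b', 'b', 'b']),      # lazy *
    ('ba?',     'b ba baa',      ['b', 'ba', 'ba']),    # with or without a
    ('ba??',    'b ba baa',      ['b', 'b', 'b']),      # lazy ?
]

for pattern, text, expected in examples:
    result = re.findall(pattern, text)
    print(pattern.ljust(8), repr(text).ljust(17), result, result == expected)

# True if every row behaved as listed
all_ok = all(re.findall(p, t) == e for p, t, e in examples)
print(all_ok)
```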

Practice with answers on a different page

• Further practice of variable-length matching

Next time

• Q2 to be emailed to you and due in class on Monday, on material since the last quiz: regular expressions, up to and including the Summary table.

• Finish regular expressions, maybe start lists.