LING/C SC/PSYC 438/538 Lecture 10 Sandiway Fong
Last time
Today's Topics
Homework 8: a Perl regex homework
Examples of advanced uses of Perl regexes:
  Recursive regexes
  Prime number testing
Homework 8: Part 1 wsj.txt is a tokenized text file containing the Wall Street Journal (WSJ) corpus. It contains nearly 50,000 sentences (one per line). (The syntactically annotated version is core data used to train and test many statistical parsers, e.g. the Stanford and Berkeley parsers.) Note: tokenized here means punctuation is spaced; also 's and n't are separated by spaces.
Homework 8: Part 1
English past participle forms are used in passives and perfectives, e.g.
  the apple was/is eaten
  the apple(s) will be eaten
  the apples were eaten
  the apples were being eaten
  the apples had been eaten
  the women have/had eaten the apples
  Mary has/had eaten the apples
There are also negated versions of the passives and perfectives, e.g.
  the apple was n't eaten
  the apple was not eaten
  the apple was n't yet eaten
  the apple was not yet eaten
  Mary has n't eaten the apples
  Mary has not eaten the apples
  Mary has not yet eaten the apples
Homework 8: Part 1 Based on the data shown in the previous slide, and assuming the past participle ending –en, write a Perl regex program that searches the WSJ corpus, then computes and prints the frequency of regular passives, perfectives, and their negated counterparts. Hint: for readability you may want to incorporate regex variables (see qr/../ from the previous lecture). Hint: be careful of regex precedence: (a|b)\s+c is not the same as a|b\s+c. Note: your program will underreport the true frequencies for several reasons.
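The two hints can be illustrated with a small sketch. This is not the homework answer: the patterns, the helper count_matches and the three sample lines are made up for illustration only.

```perl
use strict;
use warnings;

# Hypothetical toy patterns (illustration only, not the homework solution).
my $aux       = qr/was|were/;                # alternation kept in a regex variable
my $grouped   = qr/\b(?:$aux)\s+eaten\b/;    # (a|b)\s+c : either auxiliary, then "eaten"
my $ungrouped = qr/\bwas|were\s+eaten\b/;    # a|b\s+c  : "was" alone OR "were eaten"

# Count every match of $re over a list of lines.
sub count_matches {
    my ($re, @lines) = @_;
    my $n = 0;
    for my $line (@lines) {
        $n++ while $line =~ /$re/g;
    }
    return $n;
}

my @lines = (
    "the apple was eaten",
    "the apple was red",
    "the apples were eaten",
);

printf "grouped:   %d\n", count_matches($grouped,   @lines);
printf "ungrouped: %d\n", count_matches($ungrouped, @lines);
```

The ungrouped pattern also counts "the apple was red", because the alternation bar has lower precedence than concatenation. A qr// object carries its own grouping when interpolated, which is one reason regex variables help with readability.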
Homework 8: Part 2 One reason for underreporting: not all past participles conveniently end in –en, e.g. the cookies were burnt, the rope had been cut, the demonstrators were arrested. Question: what are some other possible reasons for underreporting? Give some examples from the WSJ corpus.
Homework 8: Part 2 File: irregular_verbs.txt (source: http://www.gingersoftware.com/content/grammar-rules/verbs/list-of-irregular-verbs/) Note: \t (tab) separates the three columns borne simplified
Homework 8: Part 2 Write a Perl program to extract the irregular past participles from the file irregular_verbs.txt and print them out one per line. Make sure to split alternate forms into two lines, e.g. burnt/burned. Ignore: (been able) and … (ellipsis). Extra Credit: give a Perl one-liner that does the job… Hint: you may want to make use of join("\n", split( "/", … ))
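A minimal sketch of the join/split hint, applied to a single hard-coded cell rather than to the file. The helper name split_forms is hypothetical.

```perl
use strict;
use warnings;

# Split a cell such as "burnt/burned" into its alternate forms.
sub split_forms {
    my ($cell) = @_;
    return split("/", $cell);
}

# Print the alternate forms one per line.
print join("\n", split_forms("burnt/burned")), "\n";
```

A cell without a slash comes back unchanged as a one-element list, so the same call can be applied to every participle.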
Homework 8: Part 3
Incorporate the past participles found in Part 2 into your program from Part 1.
Hints:
  one method: do it in stages, e.g. save the irregular past participles into a file, then read them into your program for Part 3
  another method: combine your programs so you parse irregular_verbs.txt directly
  another method: copy and paste them into your program for Part 3 directly
Add to your program a printout of the frequency of the irregular verb counterparts, i.e.
  regular passives: #
  irregular passives: #
  regular perfectives: #
  irregular perfectives: #
  negated regular passives: #
  etc.
General instructions for submission (repeated) One pdf file containing everything Code and output must both be submitted Summarize/explain what you did If you like, you may add your programs separately as attachments to the email (so I can download and run them if necessary). Submission due date: next Wednesday midnight (before Thursday class)
Regex Recursion Word palindrome = a word that reads the same backwards or forwards, e.g. kayak and racecar. Normally regexes cannot express palindromes, but Perl regexes can because we can use backreferences recursively. Note: recursion here refers to the ability to repeatedly embed regexes inside themselves
Regex Recursion Program: (?group-ref)
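A minimal sketch of a recursive palindrome regex, modeled on the palindrome example in the Perl regex tutorial and not necessarily the program shown in class. (?1) recurses into capture group 1, and \2 is the backreference to the letter matched at the current level. The helper name is_palindrome is hypothetical.

```perl
use strict;
use warnings;

# Group 1 is either: a letter, a smaller palindrome (recursion via (?1)),
# and the same letter again (\2); or the base case of at most one letter.
my $palindrome = qr/^((\w)(?1)\2|\w?)$/;

sub is_palindrome {
    my ($word) = @_;
    return $word =~ $palindrome ? 1 : 0;
}

for my $word (qw(kayak racecar perl)) {
    printf "%-8s %s\n", $word, is_palindrome($word) ? "palindrome" : "not a palindrome";
}
```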
Regex Lookahead and Lookback
Zero-width regexes:
  ^ (start of string)
  $ (end of string)
  \b (word boundary): matches the imaginary position between \w\W or \W\w, or just before the beginning of the string if ^\w, or just after the end of the string if \w$
Current position of match (so far) doesn't change!
  (?=regex) (lookahead from current position)
  (?<=regex) (lookback from current position)
  (?!regex) (negative lookahead)
  (?<!regex) (negative lookback)
Regex Lookahead and Lookback Example: looks for a word beginning with _ such that there is a duplicate ahead without the _ Restriction: lookback cannot be variable length in Perl (Perl 5.30 and later accept bounded variable-length lookback, but only as an experimental feature)
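One way the example described above might be written, as a sketch only: the pattern, the helper name marked_duplicate and the test sentences are assumptions, not the code from the slide.

```perl
use strict;
use warnings;

# Find a word written as _word such that the bare word occurs later in the text.
# The lookahead (?=...) checks the rest of the string without consuming it.
sub marked_duplicate {
    my ($text) = @_;
    return $text =~ /\b_(\w+)\b(?=.*\b\1\b)/ ? $1 : undef;
}

for my $text ("the _cat sat near the cat", "the _dog sat near the cat") {
    my $word = marked_duplicate($text);
    print "$text => ", defined $word ? $word : "(no match)", "\n";
}
```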
Debugging Perl regex: (?{ Perl code }) can be inserted anywhere in a regex and can assist with debugging. Example:
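A small sketch of the idea, not the example from the slide: an embedded code block records what group 1 holds each time the matcher passes that point, which exposes backtracking. The names @trace and trace_match are hypothetical.

```perl
use strict;
use warnings;

our @trace;   # package variable, so the embedded code block can see it

# Match a run of a's followed by one final a, logging each attempt for group 1.
sub trace_match {
    my ($str) = @_;
    @trace = ();
    my $ok = $str =~ /^(a+)(?{ push @trace, $1 })a$/;
    return $ok ? 1 : 0;
}

my $matched = trace_match("aaa");
print "matched: $matched\n";
print "group 1 tried: @trace\n";
```

The greedy a+ first grabs the whole string, the final a then fails, and the matcher backs off by one character; the trace shows both attempts.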
Prime Number Testing using Perl Regular Expressions Another example: the set of prime numbers is not a regular language: Lprime = {2, 3, 5, 7, 11, 13, 17, 19, 23, ...}. It turns out we can use a Perl regex to determine membership in this set, and to factorize numbers: /^(11+?)\1+$/
Prime Number Testing using Perl Regular Expressions
L = {1^n | n is prime} is not a regular language (can be proved using the Pumping Lemma for regular languages, later)
Keys to making this work:
  \1 backreference
  unary notation for representing numbers, e.g.
    11111 “five ones” = 5
    111111 “six ones” = 6
  unary notation allows us to factorize numbers by repetitive pattern matching
    (11)(11)(11) “six ones” = 6
    (111)(111) “six ones” = 6
  numbers that can be factorized in this way aren’t prime
  no way to get nontrivial subcopies of 11111 “five ones” = 5
Then /^(11+?)\1+$/ will match anything that’s greater than 1 that’s not prime
Prime Number Testing using Perl Regular Expressions Let’s analyze this Perl regex: /^(11+?)\1+$/ ^ and $ anchor both ends of the string, forcing (11+?)\1+ to cover the string exactly. (11+?) is the non-greedy version of (11+). \1+ provides one or more copies of what we matched in (11+?). Question: is the non-greedy operator necessary?
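The analysis can be tried out on small numbers with a sketch like the following. The helper name smallest_factor is hypothetical; it returns 0 when the regex does not match.

```perl
use strict;
use warnings;

# Write $n in unary and apply the regex from the slides.
# On a match, the length of group 1 is a nontrivial factor of $n.
sub smallest_factor {
    my ($n) = @_;
    my $unary = '1' x $n;
    return $unary =~ /^(11+?)\1+$/ ? length($1) : 0;
}

for my $n (2 .. 15) {
    my $f = smallest_factor($n);
    print $f ? "$n factored using $f, not a prime\n" : "$n is prime\n";
}
```

The regex also fails to match for 0 and 1, which are not prime, so this sketch is only meaningful for numbers greater than 1.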
Prime Number Testing using Perl Regular Expressions
Compare /^(11+?)\1+$/ with /^(11+)\1+$/, i.e. non-greedy vs. greedy matching
Non-greedy finds the smallest factor, greedy the largest:
  90021 factored using 3, not a prime (0 secs)
  vs. 90021 factored using 30007, not a prime (0 secs)
This affects computational efficiency for non-primes
Puzzling behavior: same output for non-greedy vs. greedy:
  900021 factored using 300007, not a prime (48 secs vs. 13 secs)
Prime Number Testing using Perl Regular Expressions Prime Numbers 100003 200003 300007 400009 500009 600011 700001 800011 900001 1000003 1100009 1200007 1300021 1400017 1500007 testing with prime numbers only can take a lot of time to compute …
Prime Number Testing using Perl Regular Expressions
http://www.xav.com/perl/lib/Pod/perlre.html
Preset limit: 32766
Nearest primes to the preset limit: 32749 and 32771
  3*32749 = 98247
  3*32771 = 98313
Prime Number Testing using Perl Regular Expressions When preset limit is exceeded: Perl’s regex matching fails quietly
Prime Number Testing using Perl Regular Expressions
Can also get non-greedy to skip several factors
Example: pick non-prime 164055 = 3 x 5 x 10937 (prime factorization)
Non-greedy: missed factors 3 and 5 …
Because of the 32766 limit:
  3 * 54685 = 164055
  5 * 32811 = 164055
  15 * 10937 = 164055
greedy version
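The arithmetic behind the skipped factors can be checked with a short sketch. It assumes the 32766 limit applies to the number of \1 copies that \1+ must supply (the total number of copies minus one); that reading is an assumption consistent with the numbers above, and the actual limit depends on the Perl version. The names $LIMIT, copies_needed and first_usable_factor are hypothetical.

```perl
use strict;
use warnings;

my $LIMIT = 32766;   # preset limit quoted on the slides (assumption: newer Perls may differ)

# If group 1 has length $f, then \1+ must supply the remaining $n/$f - 1 copies.
sub copies_needed {
    my ($n, $f) = @_;
    return $n / $f - 1;
}

# First divisor, in increasing order, whose copy count stays within the limit.
sub first_usable_factor {
    my ($n) = @_;
    for my $f (2 .. int($n / 2)) {
        next if $n % $f;                               # not a divisor
        return $f if copies_needed($n, $f) <= $LIMIT;  # within the assumed limit
    }
    return 0;
}

my $n = 164055;
for my $f (3, 5, 15) {
    printf "factor %2d needs %5d copies of \\1 (%s)\n",
        $f, copies_needed($n, $f),
        copies_needed($n, $f) <= $LIMIT ? "within limit" : "exceeds limit";
}
print "first usable factor: ", first_usable_factor($n), "\n";
```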
Prime Number Testing using Perl Regular Expressions Results are still right so far though, with respect to prime vs. non-prime. But we predict it will report an incorrect result for 1,070,009,521: it should claim (incorrectly) that this is prime, since 1070009521 = 32711^2