LING/C SC/PSYC 438/538 Lecture 10 Sandiway Fong.

Last time

Today's Topics
Homework 8: a Perl regex homework
Examples of advanced uses of Perl regexes:
  Recursive regexes
  Prime number testing

Homework 8: Part 1
wsj.txt is a tokenized text file containing the Wall Street Journal (WSJ) corpus.
It contains nearly 50,000 sentences (one per line).
(The syntactically annotated version is core data used to train and test many statistical parsers, e.g. the Stanford and Berkeley parsers.)
Note: tokenized here means punctuation is set off by spaces; also 's and n't are separated by spaces.

Homework 8: Part 1
English past participle forms are used in passives and perfectives, e.g.
  the apple was/is eaten
  the apple(s) will be eaten
  the apples were eaten
  the apples were being eaten
  the apples had been eaten
  the women have/had eaten the apples
  Mary has/had eaten the apples
There are also negated versions of the passives and perfectives, e.g.
  the apple was n't eaten
  the apple was not eaten
  the apple was n't yet eaten
  the apple was not yet eaten
  Mary hasn't eaten the apples
  Mary has not eaten the apples
  Mary has not yet eaten the apples

Homework 8: Part 1
Based on the data shown in the previous slide, and assuming the past participle ending –en, write a Perl regex program that searches the WSJ corpus and computes and prints the frequency of regular passives, perfectives, and their negated counterparts.
Hint: for readability you may want to incorporate regex variables (see qr/../ from previous lecture)
Hint: be careful of regex precedence: (a|b)\s+c is not the same as a|b\s+c
Note: your program will underreport the true frequencies for several reasons.
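The two hints can be illustrated with a minimal sketch. It runs over a made-up five-sentence list rather than wsj.txt, and the variable names ($be, $have, $neg, $en) and the choice of auxiliary forms are assumptions made for illustration, not the expected homework solution.

```perl
use strict;
use warnings;

# Hypothetical mini-corpus in the tokenized style described above
my @sentences = (
    "the apples were eaten",
    "the apple was n't eaten",
    "Mary has not yet eaten the apples",
    "the women have eaten the apples",
    "the apples were being eaten",
);

# Reusable sub-patterns built with qr// (names are illustrative)
my $be   = qr/(?:is|was|were|be|been|being)/;
my $have = qr/(?:has|have|had)/;
my $neg  = qr/(?:n't|not)(?:\s+yet)?/;
my $en   = qr/\w+en/;                      # crude "ends in -en" test

my %count = (passive => 0, perfective => 0,
             neg_passive => 0, neg_perfective => 0);
for my $s (@sentences) {
    $count{passive}++        if $s =~ /\b$be\s+$en\b/;
    $count{perfective}++     if $s =~ /\b$have\s+$en\b/;
    $count{neg_passive}++    if $s =~ /\b$be\s+$neg\s+$en\b/;
    $count{neg_perfective}++ if $s =~ /\b$have\s+$neg\s+$en\b/;
}
print "$_: $count{$_}\n" for sort keys %count;

# Precedence: alternation binds loosest, so a|b\s+c means a | (b\s+c)
my $loose  = ("a" =~ /a|b\s+c/)   ? 1 : 0;   # matches the bare "a"
my $strict = ("a" =~ /(a|b)\s+c/) ? 1 : 0;   # needs whitespace and c as well
print "a|b\\s+c on 'a': $loose, (a|b)\\s+c on 'a': $strict\n";
```

Note that \w+en also accepts words such as "been", so on real text a pattern like this overcounts in some places while undercounting in others.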

Homework 8: Part 2
One reason for underreporting: not all past participles conveniently end in –en
  e.g. the cookies were burnt, the rope had been cut
  e.g. the demonstrators were arrested
Question: what are some other possible reasons for underreporting? Give some examples from the WSJ corpus.

Homework 8: Part 2
File: irregular_verbs.txt
Source: http://www.gingersoftware.com/content/grammar-rules/verbs/list-of-irregular-verbs/
Note: \t (tab) separates the three columns
[Figure annotation: borne simplified]

Homework 8: Part 2
Write a Perl program to extract the irregular past participles from the file irregular_verbs.txt and print them out one per line.
Make sure to split alternate forms into two lines, e.g. burnt/burned
Ignore: (been able) and … (ellipsis)
Extra Credit: give a Perl one-liner that does the job…
Hint: you may want to make use of join("\n", split( "/", … ))
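One possible shape for the extraction step is sketched below. The rows are hypothetical stand-ins for lines of irregular_verbs.txt, and the column order (base form, past simple, past participle) is an assumption; the sub name past_participles is likewise invented for this sketch.

```perl
use strict;
use warnings;

# Hypothetical rows standing in for lines of irregular_verbs.txt;
# assumed layout: base form, past simple, past participle (tab-separated)
my @rows = (
    "be\twas/were\tbeen",
    "burn\tburnt/burned\tburnt/burned",
    "can\tcould\t(been able)",
    "cut\tcut\tcut",
    "may\tmight\t…",
);

sub past_participles {
    my @out;
    for my $row (@_) {
        my @cols = split /\t/, $row;
        next unless @cols == 3;
        my $pp = $cols[2];
        # keep only plain words, optionally slash-separated;
        # this skips "(been able)" and the ellipsis entry
        next unless $pp =~ m{^[a-z]+(?:/[a-z]+)*$}i;
        push @out, split m{/}, $pp;      # burnt/burned becomes two entries
    }
    return @out;
}

my @forms = past_participles(@rows);
print join("\n", @forms), "\n";
```

With the real file, the rows would be read line by line (with chomp) instead of coming from an in-memory list.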

Homework 8: Part 3
Incorporate the past participles found in Part 2 into your program from Part 1.
Hints:
  one method: do it in stages, e.g. save those irregular past participles into a file, then read them into your program for Part 3
  another method: combine your programs so you parse irregular_verbs.txt directly
  another method: copy and paste them into your program for Part 3 directly
Add to your program a printout of the frequency of the irregular verb counterparts, i.e.
  regular passives: #
  irregular passives: #
  regular perfectives: #
  irregular perfectives: #
  negated regular passives: #
  etc.
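Whichever method is used to get the participles into the program, they eventually have to become part of a regex. A minimal sketch of that step follows, assuming the list is already in an array; the three-word list, the sample sentences, and the variable names are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical list; in practice this would come from Part 2
my @irregular = qw(burnt cut gone);

# Build one alternation, longest forms first, with metacharacters escaped
my $alt = join '|', map { quotemeta }
          sort { length($b) <=> length($a) } @irregular;
my $irr = qr/(?:$alt)/;
my $be  = qr/(?:is|was|were|be|been|being)/;

my @sentences = (
    "the cookies were burnt",
    "the rope had been cut",
    "the apples were eaten",
);

# grep in scalar context returns the number of matching sentences
my $irregular_passives = grep { /\b$be\s+$irr\b/ } @sentences;
print "irregular passives: $irregular_passives\n";
```

If the irregular list itself contains –en forms such as eaten, the regular and irregular counts overlap, so the two categories need to be kept apart explicitly.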

General instructions for submission (repeated)
One PDF file containing everything
Code and output must both be submitted
Summarize/explain what you did
If you like, you may add your programs separately as attachments to the email (so I can download and run them if necessary).
Submission due date: next Wednesday midnight (before Thursday class)

Regex Recursion
Word palindrome = a word that reads the same backwards or forwards, e.g. kayak and racecar.
Normally regexes cannot express palindromes, but Perl regexes can because we can use backreferences recursively.
Note: recursion here refers to the ability to repeatedly embed regexes inside

Regex Recursion Program: (?group-ref)
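The program from the slide is not reproduced in this transcript. As a stand-in, here is a minimal sketch of a recursive palindrome test using (?1), which re-enters capture group 1 from inside itself; the pattern and the sub name is_palindrome are assumptions for illustration, not necessarily what was shown in class.

```perl
use strict;
use warnings;

# Group 1 is either:  one char, group 1 again, the same char   (outer pair)
#                or:  at most one char                         (the middle)
my $palindrome = qr/^((.)(?1)\2|.?)$/;

sub is_palindrome {
    my ($word) = @_;
    return $word =~ $palindrome ? 1 : 0;
}

for my $w (qw(kayak racecar perl)) {
    printf "%-8s %s\n", $w, is_palindrome($w) ? "palindrome" : "not a palindrome";
}
```

The pattern is used on its own here; interpolating it into a larger regex would renumber the groups, so (?1) and \2 would then need adjusting.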

Regex Lookahead and Lookback
Zero-width regexes:
  ^ (start of string)
  $ (end of string)
  \b (word boundary): matches the imaginary position between \w\W or \W\w, or just before the beginning of the string if ^\w, or just after the end of the string if \w$
Current position of match (so far) doesn't change!
  (?=regex) (lookahead from current position)
  (?<=regex) (lookback from current position)
  (?!regex) (negative lookahead)
  (?<!regex) (negative lookback)

Regex Lookahead and Lookback
Example: looks for a word beginning with _ such that there is a duplicate ahead without the _
Restriction: lookback cannot be variable length in Perl
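The regex for this example did not survive in the transcript. One way to write a pattern matching that description is sketched here; the pattern and the test string are assumptions.

```perl
use strict;
use warnings;

my $text = "_foo bar _baz foo";

# \b_(\w+)\b        a word starting with _, capturing the rest of the word
# (?=.*\b\1\b)      lookahead: the captured word occurs later on its own
my @hits = $text =~ /\b_(\w+)\b(?=.*\b\1\b)/g;

print "words with a later duplicate: @hits\n";
```

Because _ counts as a \w character, \b\1\b does not match inside another _-prefixed word, so a later "_foo" would not count as a duplicate of "foo".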

Debugging Perl regex
(?{ Perl code }) can be inserted anywhere in a regex and can assist with debugging
Example:
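A minimal sketch of the idea, with a made-up string and pattern: the embedded block records pos(), the current match offset, each time the engine passes that point in the pattern.

```perl
use strict;
use warnings;

our @trace;   # package variable so the embedded code block can see it

my $string  = "xaab";
my $matched = $string =~ /a+(?{ push @trace, pos() })b/ ? 1 : 0;

print "matched: $matched\n";
print "a+ ended at offset(s): @trace\n";
```

On a string where the engine has to backtrack, the block can run several times, which is what makes it useful for watching how a match proceeds.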

Prime Number Testing using Perl Regular Expressions
Another example: the set of prime numbers is not a regular language
  Lprime = {2, 3, 5, 7, 11, 13, 17, 19, 23, ...}
Turns out, we can use a Perl regex to determine membership in this set, and to factorize numbers:
  /^(11+?)\1+$/

Prime Number Testing using Perl Regular Expressions
L = {1^n | n is prime} is not a regular language (can be proved using the Pumping Lemma for regular languages, later)
Keys to making this work:
  \1 backreference
  unary notation for representing numbers, e.g.
    11111 “five ones” = 5
    111111 “six ones” = 6
  unary notation allows us to factorize numbers by repetitive pattern matching
    (11)(11)(11) “six ones” = 6
    (111)(111) “six ones” = 6
  numbers that can be factorized in this way aren’t prime
    no way to get nontrivial subcopies of 11111 “five ones” = 5
Then /^(11+?)\1+$/ will match anything that’s greater than 1 that’s not prime
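The unary trick can be tried directly on small numbers. This is a minimal sketch; the sub name smallest_factor is an assumption, and it returns 0 whenever the regex fails to match (which covers primes, and also 0 and 1).

```perl
use strict;
use warnings;

sub smallest_factor {
    my ($n) = @_;
    my $unary = '1' x $n;                 # e.g. 6 becomes "111111"
    # (11+?) grabs a block of two or more ones; \1+ demands that the rest
    # of the string be whole copies of that block
    return $unary =~ /^(11+?)\1+$/ ? length($1) : 0;
}

for my $n (4, 5, 6, 9, 13, 15) {
    my $f = smallest_factor($n);
    print $f ? "$n factored using $f, not a prime\n" : "$n: no match\n";
}
```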

Prime Number Testing using Perl Regular Expressions
Let’s analyze this Perl regex: /^(11+?)\1+$/
  ^ and $ anchor both ends of the string, forcing (11+?)\1+ to cover the string exactly
  (11+?) is the non-greedy version of (11+)
  \1+ provides one or more copies of what we matched in (11+?)
Question: is the non-greedy operator necessary?

Prime Number Testing using Perl Regular Expressions
Compare /^(11+?)\1+$/ with /^(11+)\1+$/, i.e. non-greedy vs. greedy matching
  finds smallest factor vs. largest:
    90021 factored using 3, not a prime (0 secs) vs.
    90021 factored using 30007, not a prime (0 secs)
  affects computational efficiency for non-primes
Puzzling behavior: same output non-greedy vs. greedy
  900021 factored using 300007, not a prime (48 secs vs. 13 secs)
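The smallest-vs-largest contrast is easy to reproduce on a small number, well below any size where limits or running time matter. A minimal sketch, with the sub name factor_with and the choice of 21 being assumptions:

```perl
use strict;
use warnings;

my $lazy   = qr/^(11+?)\1+$/;   # non-greedy: tries short blocks first
my $greedy = qr/^(11+)\1+$/;    # greedy: tries long blocks first

sub factor_with {
    my ($n, $re) = @_;
    my $unary = '1' x $n;
    return $unary =~ $re ? length($1) : 0;
}

my $n = 21;                      # 21 = 3 x 7
printf "%d: non-greedy finds %d, greedy finds %d\n",
    $n, factor_with($n, $lazy), factor_with($n, $greedy);
```

Both versions agree on whether a match exists at all; they differ only in which factor ends up in $1 and in how much backtracking it takes to get there.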

Prime Number Testing using Perl Regular Expressions
Prime Numbers: 100003 200003 300007 400009 500009 600011 700001 800011 900001 1000003 1100009 1200007 1300021 1400017 1500007
testing with prime numbers only can take a lot of time to compute …

Prime Number Testing using Perl Regular Expressions
http://www.xav.com/perl/lib/Pod/perlre.html
preset limit: 32766
nearest primes to preset limit: 32749 and 32771
  3*32749 = 98247
  3*32771 = 98313

Prime Number Testing using Perl Regular Expressions When preset limit is exceeded: Perl’s regex matching fails quietly

Prime Number Testing using Perl Regular Expressions
Can also get non-greedy to skip several factors
Example: pick non-prime 164055 = 3 x 5 x 10937 (prime factorization)
Non-greedy: missed factors 3 and 5 … because 54685 and 32811 are above the 32766 limit, while 10937 is below it:
  3 * 54685 = 164055
  5 * 32811 = 164055
  (32766 limit)
  15 * 10937 = 164055
[Figure: greedy version]
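The arithmetic behind the skipped factors can be checked with a short sketch. It assumes that the limit applies to the number of times \1 has to repeat, and it takes the value 32766 from the slides; the sub names are invented for illustration.

```perl
use strict;
use warnings;

my $limit = 32766;    # preset limit quoted on the slides

# Assumption: for a candidate factor f of n, (11+?) matches f ones and
# \1+ must then repeat n/f - 1 times
sub copies_needed {
    my ($n, $f) = @_;
    return $n / $f - 1;
}

sub within_limit {
    my ($n, $f) = @_;
    return copies_needed($n, $f) <= $limit ? 1 : 0;
}

my $n = 164055;
for my $f (3, 5, 15) {
    printf "factor %2d needs %5d copies: %s\n", $f, copies_needed($n, $f),
        within_limit($n, $f) ? "within limit" : "over limit";
}
```

Under that assumption, 15 is the first block size the non-greedy version can complete, which agrees with factors 3 and 5 being skipped.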

Prime Number Testing using Perl Regular Expressions
Results are still right so far though, wrt. prime vs. non-prime
But we predict it will report an incorrect result for 1,070,009,521: it should claim (incorrectly) that this is prime, since 1070009521 = 32711^2