LING 388: Computers and Language Lecture 14
Today's Topics
Homework 6 review
More practice with regexes in class
Extra Credit
Homework 7 (due Friday midnight)
File: hw6.txt (Full text of the BBC News article used in Homework 5 as processed by the U. of Illinois Extended NER demo)
Homework 6 Review Question 1: Modify the template code to parse hw6.txt to create Counters for categories ORG, PERSON and GPE. Have your code print the Counters.
Homework 6 Review Crucial part of the code… Code is available on the class webpage
Homework 6 Review Key part to grok: how to match what you want from a sample: [^]] means match any character as long as it's not ]
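As a minimal sketch of how [^]] can pull labeled entities out of a line, assume the file marks entities as "[LABEL text ]" (the sample string and names below are made up for illustration and are not taken from hw6.txt or the class template code):

```python
import re
from collections import Counter

# Hypothetical sample line; the real hw6.txt may use a slightly different layout.
sample = ("[PERSON Jane Smith ] met officials from the [ORG UN ] in "
          "[GPE Chad ] ; the [ORG UN ] later commented .")

# \[          a literal opening bracket
# (ORG|...)   the category label
# ([^]]+)     one or more characters that are not ]
# \]          the closing bracket
pattern = re.compile(r'\[(ORG|PERSON|GPE) ([^]]+)\]')

# One Counter per category, keyed by the entity text.
counters = {label: Counter() for label in ("ORG", "PERSON", "GPE")}
for label, entity in pattern.findall(sample):
    counters[label][entity.strip()] += 1

for label, counter in counters.items():
    print(label, counter)
```

Because [^]]+ stops at the first ], each match covers exactly one bracketed entity, even when several appear on the same line.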
Homework 6 Review Question 2: Using .sub(), modify the template code to parse hw6.txt and create a modified file hw6new.txt where abbreviations like OAR are labeled as GPEs. Now re-run your code from Question 1 on the modified hw6new.txt file.
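A rough sketch of the .sub() step for a single known abbreviation follows. The helper name label_abbreviation, the sample sentence, and the "[GPE text ]" output layout are assumptions for illustration; the class template code may differ, and reading hw6.txt and writing hw6new.txt are left out.

```python
import re

def label_abbreviation(text, abbreviation, label="GPE"):
    """Wrap each whole-word occurrence of abbreviation as [label abbreviation ]."""
    # \b keeps the match from firing inside a longer word (e.g. ROAR).
    pattern = re.compile(r'\b' + re.escape(abbreviation) + r'\b')
    # A function replacement avoids any special handling of backslashes.
    return pattern.sub(lambda m: '[' + label + ' ' + m.group(0) + ' ]', text)

# Hypothetical sample sentence, not taken from hw6.txt.
text = "Fighting was reported in OAR on Monday , with a ROAR of protest ."
new_text = label_abbreviation(text, "OAR")
print(new_text)
```

Running this twice on the same text would wrap the abbreviation again, so it should be applied once to the original file.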
Homework 6 Review Question 3: Along the lines of Question 2, further modify your program to also mark all pronouns and possessive pronouns as PERSON. Show your latest hw6new.txt file. Re-run your code from Question 1. Show your new output.
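One possible way to approach the pronoun step is a single alternation with word boundaries. The pronoun list below is a deliberately short, hypothetical one (third person only), and the "[PERSON text ]" layout is assumed rather than taken from the template code:

```python
import re

# Hypothetical, incomplete pronoun list for illustration.
PRONOUNS = ["he", "she", "him", "her", "his", "hers",
            "they", "them", "their", "theirs"]

# \b on both sides so that "he" does not match inside "the" or "other".
pronoun_pattern = re.compile(r'\b(' + '|'.join(PRONOUNS) + r')\b',
                             re.IGNORECASE)

def label_pronouns(text):
    """Wrap each pronoun as [PERSON pronoun ], keeping its original case."""
    return pronoun_pattern.sub(r'[PERSON \1 ]', text)

print(label_pronouns("She said her team met him ."))
```

Extending the list needs some care: with re.IGNORECASE, adding "us" would also match the abbreviation US.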
Homework 6 Review We will return to discuss the general case of abbreviations
Homework 6 Review
Abbreviations:
Problem: match abbreviations in general, e.g. OAR
Solution: [A-Z][A-Z]+ or [A-Z]{2,} ({2,} means at least 2 times; {n,m} means a range from n to m)
Problem: Our input file already has things that look like abbreviations, e.g. the labels in [PERSON and [ORG
Advanced Solution: Match [A-Z]{2,} with the condition that it's not preceded by [. This is what's known as (negative) lookbehind.
Homework 6 Review
(?<!\[)([A-Z]{2,})
(?<!\[) looks for a position that isn't preceded by [ (advanced usage: negative lookbehind)
([A-Z]{2,}) then matches a sequence of two or more capital letters
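The lookbehind can be tried out on a short string. The sample text below is made up, and the second pattern (with \b added) is a variation for comparison, not something from the lecture. One thing worth checking on your own data: the lookbehind only blocks a match that starts directly after [, so a match can still start one letter later inside a label.

```python
import re

# Hypothetical sample; the layout of hw6.txt is assumed, not confirmed.
text = "[PERSON Smith ] said the OAR border with [GPE Chad ] was closed ."

# Pattern as given in the lecture: capitals not directly preceded by [.
lecture_pattern = re.compile(r'(?<!\[)([A-Z]{2,})')

# Variation (an assumption, not from the slides): also require word
# boundaries, so a match cannot start in the middle of PERSON or GPE.
bounded_pattern = re.compile(r'(?<!\[)\b([A-Z]{2,})\b')

lecture_hits = lecture_pattern.findall(text)
bounded_hits = bounded_pattern.findall(text)
print(lecture_hits)
print(bounded_hits)

# Label whatever the bounded pattern finds as a GPE.
labeled = bounded_pattern.sub(r'[GPE \1 ]', text)
print(labeled)
```

On this sample the lookbehind-only pattern also picks up the tails of the labels (ERSON, PE), while the bounded version finds only OAR.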
More Python regex practice Download wordlist.py (Brown Corpus words) to your computer Then run the following:
More Python regex practice
Exercise 1: produce a list of all the words in wordlist that have two a's in a row
    aa = [word for word in wordlist if re.search('aa', word)]
    len(aa)
Exercise 2: are there more words with two b's in a row?
Exercise 3: words with two p's or b's or d's in a row – which is the most frequent?
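The same list-comprehension idea extends to comparing several patterns. The sketch below uses a tiny stand-in word list (not the Brown Corpus wordlist) so that it runs on its own; the helper name count_matches is hypothetical.

```python
import re

# Stand-in for the real wordlist, for illustration only.
wordlist = ["bazaar", "rabbit", "ribbon", "apple", "happy",
            "ladder", "add", "hobby", "bar"]

def count_matches(pattern, words):
    """Return how many words contain a match for pattern."""
    return len([word for word in words if re.search(pattern, word)])

# Count each doubled letter separately, then pick the most frequent.
counts = {letters: count_matches(letters, wordlist)
          for letters in ("aa", "bb", "pp", "dd")}
most_frequent = max(counts, key=counts.get)
print(counts, most_frequent)

# A backreference matches any of p, b, d immediately repeated.
doubled = count_matches(r'([pbd])\1', wordlist)
print(doubled)
```

With the real wordlist, the counts and the winner will of course differ from this toy list.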
More Python regex practice
Exercise 4: find a word with both bb and dd in it
Exercise 5: are there any words with pp and dd?
Exercise 6: find words ending in zac. How many are there?
Recall: meta-character for the end of line anchor
Exercise 7: find words beginning in anti. How many are there?
Hint: some cases may begin with a capital letter
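The techniques these exercises call for (anchors, a character class for the capital letter, and combining two searches) can be sketched on a small made-up list. The words below are stand-ins, not Brown Corpus results, and the combined search uses oo and kk rather than the letters in the exercises.

```python
import re

# Stand-in word list for illustration only.
wordlist = ["antibody", "Antilles", "panting", "anticipate",
            "Prozac", "Balzac", "zachary", "bookkeeper", "keeper"]

# $ anchors the match to the end of the word.
ends_zac = [w for w in wordlist if re.search(r'zac$', w)]

# ^ anchors to the start; [Aa] accepts either case for the first letter.
starts_anti = [w for w in wordlist if re.search(r'^[Aa]nti', w)]

# Two conditions on the same word: run two searches and combine with and.
both = [w for w in wordlist if re.search('oo', w) and re.search('kk', w)]

print(ends_zac, starts_anti, both)
```

Note that panting contains anti but is excluded by the ^ anchor, and zachary contains zac but is excluded by $.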
Extra Credit: Python regex (Optional Homework)
Familiarize yourself with British and American spelling
https://en.oxforddictionaries.com/spelling/british-and-spelling
Note: The Brown Corpus is an American English corpus
Question 1: Find four examples of words spelled both ways but present in the corpus
Question 2: Consider the –or vs. –our ending, as in color and colour
Write Python code that searches wordlist and finds all examples present with both endings
Show your code and the result of your code running