1
LING 388: Computers and Language
Lecture 14
2
Today's Topics
Homework 6 review
More practice with regexes in class
Extra Credit Homework 7 (due Friday midnight)
3
File: hw6.txt (Full text of the BBC News article used in Homework 5 as processed by the U. of Illinois Extended NER demo)
4
Homework 6 Review
Question 1:
Modify the template code to parse hw6.txt to create Counters for categories ORG, PERSON and GPE. Have your code print the Counters.
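One way to approach Question 1 is sketched below. It is not the class template code. The bracketed layout `[LABEL text ]` and the sample string are assumptions made for illustration; the real hw6.txt may differ in spacing or nesting.

```python
import re
from collections import Counter

# Hypothetical sample standing in for the contents of hw6.txt.
sample = ("[PERSON Ann Lee ] works at [ORG Acme ] in [GPE Ohio ] . "
          "[PERSON Ann Lee ] left [ORG Acme ] .")

def count_entities(text, label):
    """Count every bracketed span that carries the given label."""
    # [^]]+ grabs everything up to (not including) the closing bracket.
    pattern = r'\[' + label + r' ([^]]+)\]'
    return Counter(match.strip() for match in re.findall(pattern, text))

# One Counter per category, as the question asks.
counters = {label: count_entities(sample, label)
            for label in ('ORG', 'PERSON', 'GPE')}

for label, counter in counters.items():
    print(label, counter)
```

To run this on the homework file, `sample` would be replaced by the text read from hw6.txt.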
5
Homework 6 Review
6
Homework 6 Review
Crucial part of the code…
Code is available on the class webpage
7
Homework 6 Review
Key part to grok: how to match what you want from a sample.
[^]] means match any single character as long as it's not ]
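A small sketch of why the negated class [^]] matters. The sample string is hypothetical. The first pattern stops at the nearest closing bracket, while a greedy `.+` runs on to the last one.

```python
import re

# Hypothetical sample in the assumed [LABEL text ] layout.
sample = "[ORG BBC News ] reported from [GPE London ]"

# [^]]+ : one or more characters that are not a closing bracket.
pairs = re.findall(r'\[([A-Z]+) ([^]]+)\]', sample)
print(pairs)   # [('ORG', 'BBC News '), ('GPE', 'London ')]

# For contrast, a greedy .+ swallows everything up to the LAST ].
greedy = re.findall(r'\[ORG (.+)\]', sample)
print(greedy)  # ['BBC News ] reported from [GPE London ']
```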
8
Homework 6 Review
9
Homework 6 Review
Question 2:
Using .sub(), modify the template code to parse hw6.txt and create a modified file hw6new.txt where abbreviations like OAR are labeled as GPEs. Now re-run your code from Question 1 on the modified hw6new.txt file.
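A minimal sketch of the .sub() step, assuming the same `[LABEL text ]` layout and a made-up sample sentence. In the homework the text would be read from hw6.txt and the result written to hw6new.txt; the file handling is left out here.

```python
import re

# Hypothetical sentence; only the abbreviation OAR comes from the homework.
sample = "The OAR region borders [GPE Ukraine ] ."

# \b keeps the match to the whole word OAR (not, say, part of ROAR).
oar = re.compile(r'\bOAR\b')

# Wrap each occurrence in a GPE label.
relabeled = oar.sub('[GPE OAR ]', sample)
print(relabeled)  # The [GPE OAR ] region borders [GPE Ukraine ] .
```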
10
Homework 6 Review
Question 3:
Along the lines of Question 2, further modify your program to also mark all pronouns and possessive pronouns as PERSON. Show your latest hw6new.txt file. Re-run your code from Question 1. Show your new output.
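The pronoun step could be sketched as follows. The pronoun list is an illustrative assumption, not the list used in class, and the sample sentence is made up. Note that this simple version would also relabel a pronoun that already sits inside a bracketed entity.

```python
import re

# Illustrative (incomplete) list of pronouns and possessive pronouns.
PRONOUNS = r'\b(he|she|him|her|his|hers|they|them|their|theirs)\b'

# Hypothetical sentence.
sample = "She said his car was theirs ."

# \1 re-inserts the matched pronoun with its original capitalization.
marked = re.sub(PRONOUNS, r'[PERSON \1 ]', sample, flags=re.IGNORECASE)
print(marked)  # [PERSON She ] said [PERSON his ] car was [PERSON theirs ] .
```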
11
Homework 6 Review
We will return to discuss the general case of abbreviations
12
Homework 6 Review
Abbreviations: OAR
[A-Z][A-Z]+ or [A-Z]{2,} (at least 2 times)
{n,m} means range from n to m
Problem: Our input file already has things that look like abbreviations, e.g. [PERSON and [ORG
Advanced Solution: Match [A-Z]{2,} with the condition that it's not preceded by [
This is what's known as (negative) lookbehind
13
Homework 6 Review
(?<!\[)([A-Z]{2,})
(?<!\[) looks for something that isn't preceded by [ (advanced usage: negative lookbehind)
([A-Z]{2,}) then matches a sequence of two or more capital letters
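The lookbehind pattern can be tried on a short, made-up sample. One thing worth checking when experimenting: used on its own, the pattern can still start a match at the second letter of a label such as [PERSON, because that letter is preceded by P rather than [. The variant with \b word boundaries below is an added assumption, not part of the slide.

```python
import re

# Hypothetical sample in the assumed [LABEL text ] layout.
sample = "[PERSON Smith ] visited OAR with [ORG Reuters ] ."

# Pattern as given on the slide.
as_written = re.findall(r'(?<!\[)([A-Z]{2,})', sample)
print(as_written)  # ['ERSON', 'OAR', 'RG']

# Assumed variant: \b forces the match to cover a whole word.
bounded_pattern = re.compile(r'(?<!\[)\b([A-Z]{2,})\b')
bounded = bounded_pattern.findall(sample)
print(bounded)     # ['OAR']

# Label the remaining abbreviations as GPE.
relabeled = bounded_pattern.sub(r'[GPE \1 ]', sample)
print(relabeled)   # [PERSON Smith ] visited [GPE OAR ] with [ORG Reuters ] .
```

An all-caps name that is already inside a bracketed entity would still be matched by either pattern, so the output is worth inspecting by hand.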
14
More Python regex practice
Download wordlist.py (Brown Corpus words) to your computer.
Then run the following:
15
More Python regex practice
Exercise 1: produce a list of all the words in wordlist that have two a's in a row
aa = [word for word in wordlist if re.search('aa',word)]
len(aa)
Exercise 2: are there more words with two b's in a row?
Exercise 3: words with two p's or b's or d's in a row: which is the most frequent?
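The same list-comprehension idea extends to Exercises 2 and 3. The sketch below uses a tiny made-up `wordlist` so it runs on its own; with the real wordlist.py loaded, the counts will be different.

```python
import re

# Tiny stand-in for the Brown Corpus wordlist (hypothetical).
wordlist = ['bazaar', 'Aaron', 'rabbit', 'cabbage', 'apple',
            'happy', 'puppy', 'ladder', 'hidden']

# Exercise 1: 'Aaron' is not matched because 'Aa' is not 'aa'.
aa = [word for word in wordlist if re.search('aa', word)]
print(len(aa), aa)

# Exercises 2 and 3: count the words containing each doubled letter.
counts = {pair: len([w for w in wordlist if re.search(pair, w)])
          for pair in ('pp', 'bb', 'dd')}
most_frequent = max(counts, key=counts.get)
print(counts, most_frequent)
```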
16
More Python regex practice
Exercise 4: find a word with both bb and dd in it
Exercise 5: are there any words with pp and dd?
Exercise 6: find words ending in zac. How many are there?
Recall: meta-character for the end of line anchor
Exercise 7: find words beginning in anti. How many are there?
Hint: some cases may begin with a capital letter
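A sketch of the techniques these exercises call for: combining two searches, the end anchor $, and the start anchor ^ with a character class for the optional capital. The word list is a hypothetical stand-in, and 'abbadd' is an invented token included only so the first test has something to find.

```python
import re

# Hypothetical stand-in for wordlist; 'abbadd' is not a real word.
words = ['Prozac', 'Balzac', 'zacks', 'antibody', 'Antitrust',
         'pedantic', 'rabbit', 'ladder', 'abbadd']

# Exercises 4 and 5: both patterns must be found somewhere in the word.
both = [w for w in words if re.search('bb', w) and re.search('dd', w)]

# Exercise 6: $ anchors the match to the end of the word.
ends_zac = [w for w in words if re.search('zac$', w)]

# Exercise 7: ^ anchors to the start; [Aa] allows either case.
starts_anti = [w for w in words if re.search('^[Aa]nti', w)]

print(both, ends_zac, starts_anti)
```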
17
Extra Credit: Python regex
(Optional Homework)
Familiarize yourself with British and American spelling
Note: The Brown Corpus is an American English corpus
Question 1: Find four examples of words spelled both ways but present in the corpus
Question 2: Consider the -or vs. -our ending, as in color and colour
Write Python code that searches wordlist and finds all examples present with both endings
Show your code and the result of your code running
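One possible starting point for Question 2, run on a made-up list rather than the real wordlist. It rewrites each word ending in or to end in our and checks whether that form is also present. The output shows why results need a manual check: for and four pair up even though they are unrelated words.

```python
import re

# Hypothetical stand-in for wordlist.
words = ['color', 'colour', 'honor', 'labor', 'labour',
         'hour', 'door', 'four', 'for']

# A set makes the membership test fast on a large list.
word_set = set(words)

pairs = sorted(
    (w, re.sub(r'or$', 'our', w))
    for w in word_set
    if re.search(r'or$', w) and re.sub(r'or$', 'our', w) in word_set
)
print(pairs)
# [('color', 'colour'), ('for', 'four'), ('labor', 'labour')]
```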