LING 388: Computers and Language

LING 388: Computers and Language Lecture 14

Today's Topics: Homework 6 review; more practice with regexes in class; Extra Credit Homework 7 (due Friday midnight)

File: hw6.txt (Full text of the BBC News article used in Homework 5 as processed by the U. of Illinois Extended NER demo)

Homework 6 Review Question 1: Modify the template code to parse hw6.txt to create Counters for categories ORG, PERSON and GPE. Have your code print the Counters.

Homework 6 Review Crucial part of the code… Code is available on the class webpage

Homework 6 Review Key part to grok: how to match what you want from a sample. [^]] means match any single character as long as it is not ]
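As a rough illustration of how [^]] fits into Question 1, the sketch below counts labeled entities in a short string. The bracket layout "[LABEL text ]", the sample sentence, and the names `sample`, `entity` and `counters` are assumptions made for illustration; this is not the template code from the class webpage, and hw6.txt may be laid out differently.

```python
import re
from collections import Counter

# Hypothetical sample; the real hw6.txt may use a different layout.
sample = ("[PERSON Jane Doe ] met [ORG BBC ] reporters in [GPE London ] ; "
          "[PERSON Jane Doe ] later left [GPE London ] .")

# \[            a literal opening bracket
# (ORG|...)     the category label
# ([^]]+)       one or more characters that are not ]
# \]            the literal closing bracket
entity = re.compile(r'\[(ORG|PERSON|GPE) ([^]]+)\]')

counters = {label: Counter() for label in ('ORG', 'PERSON', 'GPE')}
for label, text in entity.findall(sample):
    counters[label][text.strip()] += 1

for label, counter in counters.items():
    print(label, counter)
```

Because [^]]+ cannot cross a ], each match stops at the first closing bracket, so two entities on the same line are never merged into one match.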

Homework 6 Review Question 2: Using .sub(), modify the template code to parse hw6.txt and create a modified file hw6new.txt where abbreviations like OAR are labeled as GPEs. Now re-run your code from Question 1 on the modified hw6new.txt file.
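One possible shape for the Question 2 substitution is sketched below. The label layout "[GPE OAR ]", the sample sentence, and the helper name `relabel_file` are assumptions for illustration, not the class template.

```python
import re

def label_oar(text):
    # \b keeps the match to the whole word OAR (so e.g. ROAR is untouched).
    return re.sub(r'\bOAR\b', '[GPE OAR ]', text)

def relabel_file(src_path, dst_path):
    # Hypothetical helper: read one file, write the relabeled copy to another.
    with open(src_path, encoding='utf-8') as src:
        text = src.read()
    with open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write(label_oar(text))

sample = "The flight left OAR for [GPE London ] ."
relabeled = label_oar(sample)
print(relabeled)
```

Under these assumptions, calling relabel_file('hw6.txt', 'hw6new.txt') would produce the modified file that the Question 1 code is then re-run on.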

Homework 6 Review Question 3: Along the lines of Question 2, further modify your program to also mark all pronouns and possessive pronouns as PERSON. Show your latest hw6new.txt file. Re-run your code from Question 1. Show your new output.
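A minimal sketch of the Question 3 idea, using an alternation and a backreference in the replacement. The pronoun list here is a partial, assumed one (which pronouns count is a judgment call the homework leaves to you), and the label layout "[PERSON ... ]" is likewise an assumption.

```python
import re

# Assumed, incomplete list of pronouns and possessive pronouns.
PRONOUNS = ['he', 'him', 'his', 'she', 'her', 'hers',
            'they', 'them', 'their', 'theirs']

# \b on both sides prevents matches inside longer words such as "that".
pronoun = re.compile(r'\b(' + '|'.join(PRONOUNS) + r')\b', re.IGNORECASE)

sample = "She said that his report surprised them ."

# \1 in the replacement re-inserts whatever pronoun was matched,
# preserving its original capitalization.
marked = pronoun.sub(r'[PERSON \1 ]', sample)
print(marked)
```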

Homework 6 Review We will return to discuss the general case of abbreviations

Homework 6 Review
Abbreviations:
Problem: match abbreviations such as OAR
Solution: [A-Z][A-Z]+ or [A-Z]{2,} (at least 2 times); {n,m} means a range from n to m repetitions
Problem: our input file already has things that look like abbreviations, e.g. [PERSON and [ORG
Advanced solution: match [A-Z]{2,} with the condition that it's not preceded by [
This is what's known as (negative) lookbehind

Homework 6 Review
(?<!\[)([A-Z]{2,})
(?<!\[) looks for something that isn't preceded by [ (advanced usage: negative lookbehind)
([A-Z]{2,}) then matches a sequence of two or more capital letters
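The sketch below tries the lookbehind pattern on a made-up sentence (the sample text and the "[GPE ... ]" label layout are assumptions). One thing worth checking when you run it: with the pattern exactly as written, the lookbehind only blocks a match that starts right after [, so the regex can still start one letter later and pick up ERSON from [PERSON. Adding word boundaries (\b) is one way to avoid that; this is a suggested variation, not the solution shown in class.

```python
import re

# Hypothetical sample with two existing labels and one bare abbreviation.
sample = "[PERSON Smith ] flew from OAR to the [ORG Reuters ] office."

# Pattern as given: not preceded by [, then two or more capitals.
slide_pattern = re.compile(r'(?<!\[)([A-Z]{2,})')

# Variation with word boundaries, so a match must cover a whole word.
bounded_pattern = re.compile(r'(?<!\[)\b([A-Z]{2,})\b')

slide_hits = slide_pattern.findall(sample)
bounded_hits = bounded_pattern.findall(sample)
print(slide_hits)
print(bounded_hits)

# Label whatever the bounded pattern finds (GPE is assumed here).
relabeled = bounded_pattern.sub(r'[GPE \1 ]', sample)
print(relabeled)
```

Note that neither pattern looks inside an existing label, so an all-capitals name that is already bracketed (for example inside an ORG label) would still need separate handling.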

More Python regex practice Download wordlist.py (Brown Corpus words) to your computer Then run the following:

More Python regex practice
Exercise 1: produce a list of all the words in wordlist that have two a's in a row
    aa = [word for word in wordlist if re.search('aa',word)]
    len(aa)
Exercise 2: are there more words with two b's in a row?
Exercise 3: words with two p's or b's or d's in a row: which is the most frequent?
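To try the pattern from Exercises 1 to 3 without the Brown Corpus list, here is a small sketch on a made-up list. The nine-word `wordlist` below is a hypothetical stand-in for the one defined in wordlist.py, and `words_with` is a helper name introduced only for this example.

```python
import re

# Hypothetical stand-in for the list in wordlist.py.
wordlist = ['bazaar', 'Aaron', 'rabbit', 'apple', 'happy',
            'puppy', 'add', 'ladder', 'cat']

def words_with(pattern, words):
    # re.search finds the pattern anywhere in the word (case-sensitive).
    return [w for w in words if re.search(pattern, w)]

aa = words_with('aa', wordlist)

# Count words containing pp, bb and dd, then pick the largest count.
counts = {letter: len(words_with(letter * 2, wordlist)) for letter in 'pbd'}
most_frequent = max(counts, key=counts.get)

print(aa, counts, most_frequent)
```

Matching is case-sensitive by default, which is why Aaron is not counted for 'aa' in this toy list.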

More Python regex practice
Exercise 4: find a word with both bb and dd in it
Exercise 5: are there any words with pp and dd?
Exercise 6: find words ending in zac. How many are there?
    Recall: the meta-character for the end-of-line anchor
Exercise 7: find words beginning in anti. How many are there?
    Hint: some cases may begin with a capital letter
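The sketch below shows one way to combine two conditions and to use the start and end anchors, again on a made-up list (the `wordlist` here, including the token 'rubber-padded', is hypothetical and stands in for wordlist.py).

```python
import re

# Hypothetical stand-in for the list in wordlist.py.
wordlist = ['Balzac', 'zachary', 'antibody', 'Antitrust', 'romantic',
            'rubber-padded', 'rabbit', 'ladder']

# Two separate searches: the word must contain bb and must contain dd,
# in either order.
bb_and_dd = [w for w in wordlist
             if re.search('bb', w) and re.search('dd', w)]

# $ anchors the match at the end of the string.
ends_zac = [w for w in wordlist if re.search(r'zac$', w)]

# ^ anchors at the start; [Aa] allows a capitalized first letter.
starts_anti = [w for w in wordlist if re.search(r'^[Aa]nti', w)]

print(bb_and_dd, ends_zac, starts_anti)
```

Without the anchors, zachary and romantic would also be returned, since re.search accepts a match anywhere in the word.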

Extra Credit: Python regex (Optional Homework)
Familiarize yourself with British and American spelling: https://en.oxforddictionaries.com/spelling/british-and-spelling
Note: The Brown Corpus is an American English corpus
Question 1: Find four examples of words whose British and American spellings are both present in the corpus
Question 2: Consider the -or vs. -our ending, as in color and colour. Write Python code that searches wordlist and finds all examples present with both endings. Show your code and the result of your code running
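One possible starting point for Question 2 is sketched below on a made-up list (the `wordlist` is a hypothetical stand-in for wordlist.py, and `vocab` and `pairs` are names introduced for this example). The idea is to take each word ending in or, rewrite the ending as our, and check whether the rewritten form is also in the list.

```python
import re

# Hypothetical stand-in for the list in wordlist.py.
wordlist = ['color', 'colour', 'honor', 'labour', 'labor',
            'doctor', 'four', 'for']

vocab = set(wordlist)  # set membership tests are fast

pairs = sorted(
    (w, re.sub(r'or$', 'our', w))
    for w in vocab
    if re.search(r'or$', w) and re.sub(r'or$', 'our', w) in vocab
)
print(pairs)
```

A purely mechanical check like this also returns pairs such as for and four, which are different words rather than spelling variants, so the output still needs to be read through by hand.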