LING 388: Computers and Language Lecture 15
Administrivia
  Reminder: Homework 7 is due Friday (or Saturday) night by midnight
  A printout of last lecture's terminal session is available as lecture14.txt
Python regex
  Methods (key: RE = regex raw string, String = where to search):
    import re
    re.match(RE, String)       matching must start from the start of String
    re.search(RE, String)      searches anywhere in String
    re.findall(RE, String)
    re.finditer(RE, String)    use with a loop: for m in re.finditer(RE, String)
    re.sub(RE, SUB, String)    SUB = raw string substituted for each match of RE (may contain backreferences)
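The methods listed above can be compared side by side on one short string. This is only a minimal sketch: the sample sentence and the variable names are made up for illustration and are not from the lecture.

```python
import re

text = "the cat sat on the mat"  # illustrative sample string

# re.match succeeds only if the pattern matches at the very start of the string.
m_start = re.match(r"the", text)   # match object: text begins with "the"
m_fail = re.match(r"cat", text)    # None: "cat" is not at position 0

# re.search scans the whole string and returns the first match it finds.
m_any = re.search(r"cat", text)

# re.findall returns a list of all non-overlapping matching substrings.
at_words = re.findall(r"[a-z]at", text)

# re.finditer yields one match object per match; use it in a loop.
positions = []
for m in re.finditer(r"[a-z]at", text):
    positions.append((m.group(), m.start()))

# re.sub returns a new string with every match replaced; text is unchanged.
replaced = re.sub(r"[a-z]at", "dog", text)

print(at_words)
print(positions)
print(replaced)
```

Note that re.match and re.search return None when nothing matches, so test the result before calling .group() on it.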
Substitution examples
  Using re.sub(RE, SUB, String)
  Example:
    >>> import re
    >>> text = "Google is a tech giant. Google is the most valuable company in the world."
    >>> re.sub(r"Google","Microsoft",text)
    'Microsoft is a tech giant. Microsoft is the most valuable company in the world.'
    >>> text
    'Google is a tech giant. Google is the most valuable company in the world.'
    >>> re.sub(r"Google","Microsoft",text,1)
    'Microsoft is a tech giant. Google is the most valuable company in the world.'
  Note: re.sub returns a new string, so text itself is unchanged
  Note: the fourth argument is the count, so 1 replaces only the first match (equivalently count=1)
Substitution examples
  Using re.sub(RE, SUB, String)
  Substitution using .sub() with backreferences and grouping:
    Suppose we want to change section{one} into subsection{one}
    [^}]    means any character but }
    (..)    capturing group
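The code for this slide did not survive in the text, so the following is one possible way to do the section{one} substitution, not necessarily the version shown in class. It assumes the braces contain no nested }, and the sample string is made up for illustration.

```python
import re

text = "section{one} and section{two}"  # illustrative sample string

# \{ and \} match literal braces.
# ([^}]*) is a capturing group: any run of characters other than }.
pattern = r"section\{([^}]*)\}"

# \1 in the replacement is a backreference to whatever group 1 captured.
replacement = r"subsection{\1}"

result = re.sub(pattern, replacement, text)
print(result)
```

One caveat with this sketch: the pattern is not anchored, so it would also match inside a string that already contains subsection{...} and prefix it a second time.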
Running Python on the command line in Windows
More Python regex practice
  Download wordlist.py (Brown Corpus words) to your computer
  Put it in the same directory as your Python
  Then run the following:
Python regex practice
  Exercise 1: produce a list of all the words in wordlist that have two a's in a row
    aa = [word for word in wordlist if re.search('aa',word)]
    len(aa)
  Exercise 2: are there more words with two b's in a row?
  Exercise 3: words with two p's or b's or d's in a row: which is the most frequent?
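The Exercise 1 pattern generalizes to any doubled letter, which makes the comparisons in Exercises 2 and 3 a matter of counting. The sketch below assumes wordlist is a plain list of strings; the short list here is a hypothetical stand-in so the block runs on its own, and the helper name doubled is invented for illustration. The counts it prints say nothing about the real Brown Corpus list.

```python
import re

# Hypothetical stand-in; in class, wordlist comes from wordlist.py.
wordlist = ["bazaar", "rabbit", "ribbon", "apple", "happy", "puppy", "ladder", "cat"]

def doubled(letter, words):
    """Return the words that contain the given letter twice in a row."""
    pattern = re.escape(letter) * 2   # e.g. "a" -> "aa"
    return [w for w in words if re.search(pattern, w)]

# Count the words for each doubled letter of interest.
counts = {letter: len(doubled(letter, wordlist)) for letter in "apbd"}

# The letter whose doubled form occurs in the most words.
most_frequent = max(counts, key=counts.get)

print(counts)
print(most_frequent)
```

With the real list, replace the stand-in assignment with whatever import of wordlist was used in class.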
Python regex practice
  Exercise 4: find a word with both bb and dd in it
  Exercise 5: are there any words with pp and dd?
  Exercise 6: find words ending in zac. How many are there?
    Recall: the meta-character for the end-of-line anchor is $
  Exercise 7: find words beginning in anti. How many are there?
    Hint: some cases may begin with a capital letter
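One way to approach these exercises is sketched below: two searches joined with and for the "both" cases, $ to anchor at the end of the word, and ^ with a character class for the optional capital. The word list is a hypothetical stand-in (some entries are invented purely to trigger the patterns), so the results here are not answers for the Brown Corpus list.

```python
import re

# Hypothetical stand-in; in class, wordlist comes from wordlist.py.
wordlist = ["rabbit", "ladder", "rubber-padded", "Balzac", "Prozac",
            "antibody", "Antarctic", "Antitrust", "panting"]

# Both bb and dd somewhere in the word, in either order: two separate searches.
bb_and_dd = [w for w in wordlist if re.search("bb", w) and re.search("dd", w)]

# $ anchors the match at the end of the string.
ends_zac = [w for w in wordlist if re.search(r"zac$", w)]

# ^ anchors at the start; [Aa] accepts either a capital or a lowercase a.
starts_anti = [w for w in wordlist if re.search(r"^[Aa]nti", w)]

print(bb_and_dd)
print(ends_zac, len(ends_zac))
print(starts_anti, len(starts_anti))
```

Without the ^ anchor, "panting" would also be reported, since it contains anti in the middle.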
Python regex practice
  Look for words with the prefix "pre"
  Are all of them correct? (cf. pretend)
  Devise a search that looks for words beginning with 'pre' whose remainder is also a word in the Brown corpus
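A possible shape for that search: capture everything after pre with a group, then check whether the captured remainder is itself in the word list. This is a sketch under the assumption that wordlist is a list of strings; the list below is a hypothetical stand-in and the name vocabulary is invented for illustration.

```python
import re

# Hypothetical stand-in; in class, wordlist comes from wordlist.py.
wordlist = ["preview", "view", "prepay", "pay", "pretty", "press", "pretend"]

# A set makes the "is the remainder a word?" lookup fast.
vocabulary = set(wordlist)

pre_words = []
for w in wordlist:
    # Group 1 captures the rest of the word after the leading "pre".
    m = re.match(r"pre(.+)$", w)
    if m and m.group(1) in vocabulary:
        pre_words.append(w)

print(pre_words)
```

This is only a heuristic. A word such as pretend would still pass whenever tend happens to be in the list, so the output needs checking by eye.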