LING 388: Computers and Language Lecture 14
Administrivia Homework 7 out today – due Friday night by midnight
Python regex recap
Unicode characters ok in Python 3.x
Summary:
  \w   a word character [A-Za-z0-9_]
  \d   a digit [0-9]
  \b   word boundary
  \s   whitespace character [ \t\n\r\f\v]
Operators:
  *    zero or more repeats
  +    one or more repeats
  ( )  grouping
Raw string (avoid escaping \): r"\w+"
Negation:
  \W   anything not in \w
  \D   anything not in \d
Methods:
  m = re.search(pattern, string)    returns a match object (first match anywhere in the string) or None
  m = re.match(pattern, string)     same, but only matches at the start of the string
  l = re.findall(pattern, string)   returns a list of strings/tuples
Full Documentation: https://docs.python.org/3/library/re.html
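To make the recap concrete, here is a minimal sketch contrasting the three methods. The sample string and the variable names are made up for illustration and are not from the slides.

```python
import re

# Hypothetical sample string, chosen only for illustration.
s = "Room 388 opens at 9am"

# re.search scans the whole string and stops at the first match.
m_search = re.search(r"\d+", s)
first_number = m_search.group() if m_search else None

# re.match only tries to match at the very start of the string,
# so it returns None here: the string starts with 'R', not a digit.
m_match = re.match(r"\d+", s)

# re.findall returns every non-overlapping match as a list of strings.
words = re.findall(r"\w+", s)
numbers = re.findall(r"\d+", s)

# \D is the negation of \d: runs of anything that is not a digit.
non_digits = re.findall(r"\D+", s)

print(first_number)  # 388
print(m_match)       # None
print(words)         # ['Room', '388', 'opens', 'at', '9am']
print(numbers)       # ['388', '9']
print(non_digits)    # ['Room ', ' opens at ', 'am']
```

Note that both re.search and re.match can return None, so it is safest to test the result before calling .group() on it.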
Python regex More examples from https://docs.python.org/3/howto/regex.html
The trouble with re.findall()
If the pattern contains capturing groups (…), only the groups are reported, not the whole match
Example:
>>> text = "ababcababababacabd"
>>> import re
>>> re.findall(r'(ab)+', text)
['ab', 'ab', 'ab']
>>> re.findall(r'((ab)+)', text)
[('abab', 'ab'), ('abababab', 'ab'), ('ab', 'ab')]
The trouble with re.findall()
Example (using list comprehension):
>>> text = "ababcababababacabd"
>>> [t[0] for t in re.findall(r'((ab)+)', text)]
['abab', 'abababab', 'ab']
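Two other ways around the same problem, sketched here as alternatives rather than as the approach the slides recommend: a non-capturing group (?:…), which groups without capturing, and re.finditer, which yields match objects whose group(0) is always the whole match. Both are standard features of Python's re module; the variable names are illustrative.

```python
import re

text = "ababcababababacabd"

# (?:...) groups for repetition but does not capture,
# so findall reports the whole match as a plain string.
runs = re.findall(r"(?:ab)+", text)

# finditer yields match objects; group(0) is the whole match
# even though the pattern contains a capturing group.
runs_iter = [m.group(0) for m in re.finditer(r"(ab)+", text)]

# Match objects also say where each match was found.
spans = [m.span() for m in re.finditer(r"(ab)+", text)]

print(runs)       # ['abab', 'abababab', 'ab']
print(runs_iter)  # ['abab', 'abababab', 'ab']
print(spans)      # [(0, 4), (5, 13), (15, 17)]
```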
Review examples
Regex for money:
  $ followed by digits
  comma (for thousands, optional)
  decimal point (optional)
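One possible pattern for the money description above, offered as a sketch and not as the answer given in class. It assumes thousands are written as a comma followed by exactly three digits, and it uses the ? operator (zero or one occurrence) for the optional decimal part. The sample sentence and names are made up.

```python
import re

# Hypothetical sample sentence, for illustration only.
sentence = "The ticket cost $45, the laptop $1,299.99, and the house $350,000."

# \$            a literal dollar sign ($ alone means end of line)
# \d+           one or more digits
# (?:,\d\d\d)*  zero or more comma-plus-three-digit groups (thousands)
# (?:\.\d+)?    an optional decimal point followed by digits
# Non-capturing groups are used so findall returns whole matches.
money = re.compile(r"\$\d+(?:,\d\d\d)*(?:\.\d+)?")

amounts = money.findall(sentence)
print(amounts)  # ['$45', '$1,299.99', '$350,000']
```

The comma after $45 and the full stop after $350,000 are left out of the matches because neither is followed by digits.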
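Python regex
Other useful meta-characters:
  ^    matches beginning of the string (of each line with re.MULTILINE)
  $    matches end of the string (of each line with re.MULTILINE)
  \N   backreference, where N is a group number (e.g. \1): must match exactly the same text that group N matched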
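A short sketch of these meta-characters in use. The sample text and variable names are assumptions for illustration. The backreference example looks for an accidentally doubled word.

```python
import re

# Hypothetical three-line sample text, for illustration only.
lines = "the cat sat\nit was was late\nthe end"

# With re.MULTILINE, ^ and $ match at the start and end of every line.
starts = re.findall(r"^the", lines, re.MULTILINE)
ends = re.findall(r"\w+$", lines, re.MULTILINE)

# Without the flag, ^ only matches at the start of the whole string.
starts_no_flag = re.findall(r"^the", lines)

# \1 must repeat exactly what group 1 matched: a doubled word.
doubled = re.search(r"\b(\w+) \1\b", lines)
doubled_text = doubled.group() if doubled else None
doubled_word = doubled.group(1) if doubled else None

print(starts)          # ['the', 'the']
print(ends)            # ['sat', 'late', 'end']
print(starts_no_flag)  # ['the']
print(doubled_text)    # was was
print(doubled_word)    # was
```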
Python's re module
Homework 7
What went wrong on the High Street in 2018?
https://www.bbc.com/news/business-46646990?intlink_from_url=https://www.bbc.com/news/topics/cxqvep8kqext/long-reads&link_location=live-reporting-story
hw7.txt
Using regexes in Python:
  Find the numbers in the article. List them. How many of them are there?
  Find all the named entities (approximately everything beginning with an uppercase letter denoting people, places, organizations etc.), e.g. Toys R Us or New Look. List them. How many of them are there?
  How could you filter out the words at the beginning of each sentence that aren't really named entities? Show your code. How many named entities now?
Homework 7
One PDF file
Show your Python work
Submission by email to me by Friday night