LING 388: Computers and Language

LING 388: Computers and Language Lecture 14

Administrivia Homework 7 out today – due Friday night by midnight

Python regex recap
(Unicode characters are OK in Python 3.x)
Summary:
  \w   a word character [A-Za-z0-9_] (plus Unicode letters and digits in Python 3.x)
  \d   a digit [0-9]
  \b   word boundary
  \s   whitespace character [ \t\n\r\f\v]
Operators:
  *    zero or more repeats
  +    one or more repeats
  ( )  grouping
Raw string (avoids escaping \): r"\w+"
Negation:
  \W   anything not in \w
  \D   anything not in \d
Methods:
  m = re.search(pattern, string)   returns a match object or None
  m = re.match(pattern, string)    like re.search, but matches only at the start of the string
  l = re.findall(pattern, string)  returns a list of strings/tuples
Full documentation: https://docs.python.org/3/library/re.html
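A minimal sketch of the recap, run on a made-up sample sentence (the string and the variable names are illustrative assumptions, not from the lecture):

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "Room 388 opens at 9 am"

# re.search scans the whole string for the first match.
first_number = re.search(r"\d+", sample)

# re.match only tries to match at the start of the string.
at_start = re.match(r"\d+", sample)    # None: the string starts with "Room"
first_word = re.match(r"\w+", sample)

# re.findall returns every non-overlapping match as a list of strings.
words = re.findall(r"\b\w+\b", sample)
non_digits = re.findall(r"\D+", sample)

print(first_number.group())  # 388
print(at_start)              # None
print(first_word.group())    # Room
print(words)                 # ['Room', '388', 'opens', 'at', '9', 'am']
print(non_digits)            # ['Room ', ' opens at ', ' am']
```

Note the raw strings (r"..."): without the r prefix, a pattern such as "\b" would be read by Python as a backspace character before the regex engine ever sees it.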

Python regex More examples from https://docs.python.org/3/howto/regex.html

The trouble with re.findall()
If the pattern contains capturing groups (…), only the groups are reported, not the whole match.
Example:
>>> import re
>>> text = "ababcababababacabd"
>>> re.findall(r'(ab)+', text)
['ab', 'ab', 'ab']
>>> re.findall(r'((ab)+)', text)
[('abab', 'ab'), ('abababab', 'ab'), ('ab', 'ab')]

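The trouble with re.findall()
Example (using a list comprehension to keep only the outer group):
>>> import re
>>> text = "ababcababababacabd"
>>> [groups[0] for groups in re.findall(r'((ab)+)', text)]
['abab', 'abababab', 'ab']
(The loop variable is named groups rather than tuple, so the built-in tuple type is not shadowed.)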
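Two alternatives that are not on the slide but are part of the standard re module can avoid the extra list comprehension step. A short sketch using the same text:

```python
import re

text = "ababcababababacabd"

# (?:...) groups without capturing, so findall reports the whole match.
whole_matches = re.findall(r"(?:ab)+", text)

# finditer yields match objects; group(0) is always the whole match,
# regardless of how many capturing groups the pattern has.
via_finditer = [m.group(0) for m in re.finditer(r"(ab)+", text)]

print(whole_matches)  # ['abab', 'abababab', 'ab']
print(via_finditer)   # ['abab', 'abababab', 'ab']
```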

Review examples
Regex for money:
  $ followed by digits
  comma (for thousands, optional)
  decimal point (optional)
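One possible pattern for the money description, as a sketch rather than the lecture's own answer. The sample sentence is invented, and the pattern assumes commas always separate groups of exactly three digits:

```python
import re

# Hypothetical sample text, chosen only for illustration.
prices = "Tickets cost $5, the car was $12,500.99 and the house $1,250,000."

# \$            a literal dollar sign ($ alone means end of line)
# \d+           one or more digits
# (?:,\d{3})*   optional thousands groups, e.g. ,500
# (?:\.\d+)?    optional decimal point followed by digits
# Non-capturing groups keep re.findall returning the whole match.
money = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")

amounts = money.findall(prices)
print(amounts)  # ['$5', '$12,500.99', '$1,250,000']
```

Requiring digits after the decimal point keeps the sentence-final period out of the last match.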

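Python regex
Other useful meta-characters:
  ^    matches beginning of line (only the beginning of the string unless re.MULTILINE is set)
  $    matches end of line (only the end of the string unless re.MULTILINE is set)
  \n   backreference, where n = group number (\1, \2, …, not the newline character): must match the same text that group n matched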
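A small sketch of the anchors and a backreference, using an invented two-line string:

```python
import re

# Hypothetical two-line sample text.
lines = "the cat sat\nthe the dog ran"

# With re.MULTILINE, ^ and $ match at the start and end of each line;
# without it they match only at the start and end of the whole string.
first_words = re.findall(r"^\w+", lines, re.MULTILINE)
last_words = re.findall(r"\w+$", lines, re.MULTILINE)

# \1 must match exactly the same text that group 1 matched,
# so this pattern finds a word that is immediately repeated.
repeated = re.search(r"\b(\w+) \1\b", lines)

print(first_words)        # ['the', 'the']
print(last_words)         # ['sat', 'ran']
print(repeated.group())   # the the
print(repeated.group(1))  # the
```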

Python's re module

Homework 7
What went wrong on the High Street in 2018?
https://www.bbc.com/news/business-46646990?intlink_from_url=https://www.bbc.com/news/topics/cxqvep8kqext/long-reads&link_location=live-reporting-story
hw7.txt
Using regexes in Python:
  Find the numbers in the article. List them. How many of them are there?
  Find all the named entities (approximately everything beginning with an uppercase letter denoting people, places, organizations etc.), e.g. Toys R Us or New Look. List them. How many of them are there?
  How could you filter out the words at the beginning of each sentence that aren't really named entities? Show your code. How many named entities now?

Homework 7
  One PDF file
  Show your Python work
  Submission by email to me by Friday night