Published by Ella Bradley. Modified over 9 years ago.
1
Methods in Computational Linguistics II
with reference to Matt Huenerfauth’s Language Technology material
Lecture 4: Matching Things. Regular Expressions
2
Today
Regular Expressions
Snippet on Speech Recognition
– At least half of it.
3
Regular Expressions
Can be viewed as a way to:
– Specify search patterns over a text string
– Design a particular kind of machine, a Finite State Automaton (FSA); we probably won’t cover this today
– Define a formal “language”, i.e. a set of strings
4
Uses of Regular Expressions
Simple, powerful tools for large corpus analysis and ‘shallow’ processing
– What word is most likely to begin a sentence?
– What word is most likely to begin a question?
– Are you more or less polite than the people you correspond with?
5
Definitions
Regular Expression: a formula in algebraic notation for specifying a set of strings
String: any sequence of characters
Regular Expression Search
– Pattern: specifies the set of strings we want to search for
– Corpus: the texts we want to search through
6
Simple Example
7
More Examples
8
And still more examples
9
Optionality and Repetition
/[Ww]oodchucks?/
/colou?r/
/he{3}/
/(he){3}/
/(he){3,}/
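The patterns on this slide can be tried directly with Python's `re` module. The sketch below is illustrative only: the test words are made up, and the `{3,}` form assumes the last pattern on the slide is meant as "three or more repetitions".

```python
import re

# ? makes the preceding character optional
woodchuck = re.compile(r"[Ww]oodchucks?")
colour = re.compile(r"colou?r")

# {3} applies only to the preceding item: "h" followed by exactly three "e"
heee = re.compile(r"he{3}")
# Parentheses group "he", so the count applies to the whole unit
hehehe = re.compile(r"(he){3}")
# {3,} means three or more repetitions (assumed reading of the last pattern)
hehehe_plus = re.compile(r"(he){3,}")

# fullmatch requires the whole string to match the pattern
optional_hits = [w for w in ["woodchuck", "Woodchucks", "woodchip"]
                 if woodchuck.fullmatch(w)]
spelling_hits = [w for w in ["color", "colour", "colouur"]
                 if colour.fullmatch(w)]

print(optional_hits)
print(spelling_hits)
print(bool(heee.fullmatch("heee")), bool(hehehe.fullmatch("hehehe")))
```

Grouping is what separates `he{3}` ("heee") from `(he){3}` ("hehehe").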
10
Character Groups
Some groups of characters are used very frequently, so the RE language includes shorthands for them
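The slide's table of shorthands is not preserved in this transcript. As a minimal sketch, here are the shorthands that Python's `re` module supports, applied to a made-up sample string:

```python
import re

text = "Room 42 opens at 9am"

digits = re.findall(r"\d+", text)   # \d: any digit
words = re.findall(r"\w+", text)    # \w: letter, digit, or underscore
spaces = re.findall(r"\s", text)    # \s: any whitespace character

# Upper-case versions are the complements: \D is "anything but a digit"
non_digits = re.findall(r"\D+", "a1b22c")

print(digits)
print(words)
print(len(spaces))
print(non_digits)
```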
11
Special Characters
These enable the matching of multiple occurrences of a pattern
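The slide's list of special characters is not preserved here. Assuming it covers the usual quantifiers `*`, `+`, and `?`, a short sketch of how they differ on made-up strings:

```python
import re

samples = ["b", "ba", "baa", "baaa"]

# * : zero or more of the preceding item
star = [s for s in samples if re.fullmatch(r"ba*", s)]
# + : one or more of the preceding item
plus = [s for s in samples if re.fullmatch(r"ba+", s)]
# ? : zero or one of the preceding item
opt = [s for s in samples if re.fullmatch(r"ba?", s)]

print(star)
print(plus)
print(opt)
```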
12
Escape Characters
Sometimes you want to use an asterisk “*” as an asterisk and not as a modifier.
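A small sketch of the difference between an escaped and an unescaped asterisk in Python. The sample strings are invented for illustration. `re.escape` is the standard-library helper for escaping arbitrary text.

```python
import re

text = "2*3 = 6 and a**b"

# Escaped: \* matches a literal asterisk
literal_stars = re.findall(r"\*", text)

# re.escape builds an escaped pattern from arbitrary text
pattern = re.escape("2*3")
found = re.search(pattern, text).group()

# Unescaped: * is a quantifier, so "2*3" means zero or more "2", then "3"
quantified = re.findall(r"2*3", "3 23 223 2*3")

print(literal_stars)
print(found)
print(quantified)
```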
13
RE Matching in Python NLTK
Set up:
– import re
– from nltk.util import re_show
– sent = "colourless green ideas sleep furiously"
re_show(pattern, string)
– shows where the pattern matches
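If NLTK is not installed, a similar effect can be approximated with the standard library alone. `mark_matches` below is a hypothetical helper, not part of NLTK. It returns the marked string, whereas `re_show` prints it.

```python
import re

sent = "colourless green ideas sleep furiously"

def mark_matches(pattern, string, left="{", right="}"):
    """Hypothetical stand-in for re_show: wrap each match in braces."""
    return re.sub(pattern, lambda m: left + m.group(0) + right, string)

marked = mark_matches(r"l+", sent)
print(marked)
```

Seeing the matches marked in context makes it easier to spot a pattern that matches too much or too little.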
14
Substitutions
Replace every l with an s
re.sub('l', 's', sent)
– 'cosoursess green ideas sseep furioussy'
re.sub('green', 'red', sent)
– 'colourless red ideas sleep furiously'
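The two substitutions above use literal strings, but the first argument of `re.sub` is a full regular expression. A brief sketch follows. The vowel-masking example is an added illustration, not from the slides.

```python
import re

sent = "colourless green ideas sleep furiously"

every_l = re.sub("l", "s", sent)
colour_swap = re.sub("green", "red", sent)

# A pattern rather than a literal: replace each run of vowels with "_"
vowels_masked = re.sub(r"[aeiou]+", "_", "green ideas")

print(every_l)
print(colour_swap)
print(vowels_masked)
```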
15
Findall
re.findall(pattern, sent)
– will return all of the substrings that match the pattern
– re.findall('(green|sleep)', sent)
  ['green', 'sleep']
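What `re.findall` returns depends on the number of capturing groups in the pattern. A short sketch, using patterns invented for illustration:

```python
import re

sent = "colourless green ideas sleep furiously"

# One group: a list of that group's text
one_group = re.findall(r"(green|sleep)", sent)

# No groups: a list of whole matches
no_group = re.findall(r"\w+ly", sent)

# Two groups: a list of tuples, one element per group
two_groups = re.findall(r"(\w+)(less|ly)\b", sent)

print(one_group)
print(no_group)
print(two_groups)
```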
16
Match
Matches from the beginning of the string
re.match(pattern, string)
– Returns: a Match object, or None if the pattern does not match at the start of the string
Match objects contain information about the search
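A minimal sketch of the two possible outcomes of `re.match`, using the sample sentence from the earlier slides:

```python
import re

sent = "colourless green ideas sleep furiously"

# The pattern is found at position 0, so a Match object comes back
at_start = re.match(r"colou?r", sent)

# "green" occurs later in the string, so match() returns None
not_at_start = re.match(r"green", sent)

# Check for None before calling methods on the result
matched_text = at_start.group() if at_start else None

print(matched_text)
print(not_at_start)
```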
17
Methods in Match
18
More Match Methods
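The method tables from these two slides are not preserved in the transcript. As a sketch of commonly used Match methods in Python's `re` module (which may not be exactly the set the slides listed):

```python
import re

m = re.match(r"(\w+) (\w+)", "colourless green ideas sleep furiously")

whole = m.group(0)       # the entire matched text
first = m.group(1)       # text of the first parenthesised group
both = m.groups()        # tuple of all group texts
where = m.span()         # (start, end) offsets of the whole match
start, end = m.start(), m.end()
second_span = m.span(2)  # offsets of the second group

print(whole, first, both)
print(where, start, end, second_span)
```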
19
Search
re.search(pattern, string)
– Finds the pattern anywhere in the string.
– re.search(r'\d+', ' 1034 ').group()
  '1034'
– re.search(r'\d+', ' abc123 ').group()
  '123'
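A short sketch contrasting `re.search` with `re.match` on the same input. Because `search` returns None when nothing is found, calling `.group()` directly on the result can raise an AttributeError.

```python
import re

text = " abc123 "

found = re.search(r"\d+", text)    # scans the whole string
anchored = re.match(r"\d+", text)  # only tries position 0

number = found.group()
position = found.span()

# No digits at all: search returns None rather than a Match object
missing = re.search(r"\d+", "no digits here")

print(number, position)
print(anchored, missing)
```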
20
Splitting
'text can be made into lists'.split()
re.split(pattern, string)
– uses the pattern to identify the split point
– re.split(r'\d+', "I want 4 cats and 13 dogs")
  ["I want ", " cats and ", " dogs"]
– re.split(r'\s*\d+\s*', "I want 4 cats and 13 dogs")
  ["I want", "cats and", "dogs"]
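The splits above can be reproduced as follows. The last line is an added illustration of a related behaviour: wrapping the separator in a capturing group makes `re.split` keep it in the result.

```python
import re

text = "I want 4 cats and 13 dogs"

plain = "text can be made into lists".split()
on_digits = re.split(r"\d+", text)
trimmed = re.split(r"\s*\d+\s*", text)

# With a capturing group, the matched separators are kept in the list
kept = re.split(r"\s*(\d+)\s*", text)

print(plain)
print(on_digits)
print(trimmed)
print(kept)
```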
21
Joining
' '.join(['lists', 'can', 'be', 'made', 'into', 'strings'])
This simple formatting can be helpful to report results or merge information
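A minimal sketch of `str.join`, first on the slide's word list and then on a small report. The `counts` dictionary is made-up data for illustration.

```python
words = ["lists", "can", "be", "made", "into", "strings"]
sentence = " ".join(words)

# Hypothetical counts, formatted into a single report line
counts = {"green": 2, "sleep": 1}
report = ", ".join(f"{word}: {n}" for word, n in counts.items())

print(sentence)
print(report)
```

The separator is the string that `join` is called on, and every item being joined must already be a string.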
22
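Stemming with Regular Expressions
import re

def stem(word):
    # Non-greedy (.*?) keeps the stem as short as possible, so the optional suffix group can match
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem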
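A sketch of the stemmer applied to a few made-up words, alongside a greedy variant. `stem_greedy` is a hypothetical comparison function, not from the slides. It shows why the `?` in `(.*?)` matters.

```python
import re

SUFFIXES = r"(ing|ly|ed|ious|ies|ive|es|s|ment)?$"

def stem(word):
    # Non-greedy stem: stop as soon as the rest is a listed suffix
    stem_part, suffix = re.findall(r"^(.*?)" + SUFFIXES, word)[0]
    return stem_part

def stem_greedy(word):
    # Greedy stem (for comparison): .* swallows the whole word,
    # leaving the optional suffix group empty
    stem_part, suffix = re.findall(r"^(.*)" + SUFFIXES, word)[0]
    return stem_part

test_words = ["sleeping", "furiously", "ideas", "green", "processes"]
stems = [stem(w) for w in test_words]
greedy_stems = [stem_greedy(w) for w in test_words]

print(stems)
print(greedy_stems)
```

This stemmer is deliberately crude. It strips at most one suffix and knows nothing about the words themselves, so "furiously" comes out as "furious" rather than "fury".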
23
Play with some code
24
Snippet on Speech Recognition