Published by Lorraine Walton. Modified over 9 years ago.
1
Regular Expressions: grep
LING 5200 Computational Corpus Linguistics
Martha Palmer
2
LING 5200, 2006. Based on Kevin Cohen’s LING 5200.
Google
Whole-word wildcard
How should multi-word metaphors, expressions, etc., be represented in the lexicon?
the tip of the iceberg
3
Google
Whole-word wildcard
How should multi-word metaphors, expressions, etc., be represented in the lexicon?
If compositionally: how to explain meaning?
If fixed string: what about productivity?
4
Google
5
Google
the tip of the nuclear iceberg
the tip of the MS iceberg
the tip of the toxic iceberg
the tip of the Satanic iceberg
the tip of the jihadi iceberg
the tip of the homosexual iceberg
the tip of the WorldCom iceberg
the tip of the melting iceberg
6
Task: find some text
text in a file: find all instances of “the”
text in the output of a command: pick out my directory in /home
7
grep
grep pattern file
grep: Look...
pattern: ...for this...
file: ...in this file
8
Switches
-c  print only a count of matching lines (like adding | wc -l)
-i  ignore the case of the letters in the pattern
-n  include the line numbers
-v  show lines that do NOT match the pattern
grep -i lemma README.english
grep -ic lemma README.english
grep -in lemma README.english
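The switches above can be tried on a few lines of text. A minimal sketch, assuming a made-up three-line sample (its contents are hypothetical, not taken from README.english) that is piped to grep instead of being read from a file:

```shell
# Hypothetical sample text; README.english itself is not reproduced here.
sample='Lemma list for English
each lemma has an entry
frequency counts only'

matching=$(printf '%s\n' "$sample" | grep -i lemma)    # -i: matches both Lemma and lemma
count=$(printf '%s\n' "$sample" | grep -ic lemma)      # -c: number of matching lines
numbered=$(printf '%s\n' "$sample" | grep -in lemma)   # -n: each match prefixed with its line number
inverted=$(printf '%s\n' "$sample" | grep -iv lemma)   # -v: lines that do NOT match

echo "$count"
echo "$numbered"
echo "$inverted"
```

Note that -c counts matching lines, not occurrences: a line containing the pattern twice still counts once.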
9
What if it's not always written the same way?
Find all instances of the definite article
The/DT the/DT The/DT
10
What if it's not always written the same way?
grep 'The' corpora/treebank2/raw/wsj/05 corpora/treebank2/tagged/wsj/05 *
The/DT the/DT The/DT
11
Regular expressions
A way of specifying members of a set
E.g.:
All strings that contain 3 capital H's
All strings that contain “Lemma”
All strings that contain an upper-case vowel
All strings that begin with a b, d, or g
All strings that end with VBZ
12
Regular expressions
Always defined with reference to some alphabet (Σ)
Generally, ASCII characters, A-Z
Could be smaller (Σ = {a, b, !}) or larger (Unicode)
13
The Chomsky Grammar Hierarchy
Regular grammars, aabbbb: S → aS | nil | bS
Context free grammars, aaabbb: S → aSb | nil
Context sensitive grammars, aaabbbccc: xSy → xby
Transformational grammars - Turing Machines
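The regular grammar on this slide, S → aS | nil | bS, generates every string made of a's and b's, which is the set the regular expression [ab]* describes. A minimal sketch, assuming whole-line matching with ^ and $; the helper name is_regular_ab is made up for illustration:

```shell
# Succeeds if the whole string could be generated by S -> aS | bS | nil.
# (is_regular_ab is a hypothetical helper name.)
is_regular_ab() {
  printf '%s\n' "$1" | grep -Eq '^[ab]*$'
}

is_regular_ab aabbbb && echo "aabbbb: accepted"
is_regular_ab aaabbb && echo "aaabbb: accepted"
is_regular_ab aabbcc || echo "aabbcc: rejected"

# ^a*b*$ is only an approximation of the context-free rule S -> aSb | nil:
# it accepts the balanced string aaabbb, but also the unbalanced string aab.
loose=$(printf '%s\n' aaabbb aab | grep -Ec '^a*b*$')
echo "$loose"
```

The context-free language of strings like aaabbb (equal numbers of a's and b's) has no exact regular expression, which is why the last pattern over-accepts.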
14
Movement
What did John give to Mary?
*Where did John give to Mary?
John gave cookies to Mary.
John gave to Mary.
15
Nested Dependencies and Crossing Dependencies
Crossing (CS): John, Mary and Bill ate peaches, pears and apples, respectively.
The dog chased the cat that bit the mouse that ran.
Nested (CF): The mouse the cat the dog chased bit ran.
16
Most parsers are Turing Machines
To give a more natural and comprehensible treatment of movement
For a more efficient treatment of features
Not because of “respectively”: most parsers can’t handle it.
17
Finite state automata: recognizing regular grammar strings
18
Finite state automata
This arrow means "start here"
19
Finite state automata
This arrow means "start here"
Double circle means "OK"
20
Finite state automata
21
Finite state automata
22
What's regular?
Informally
23
What's regular?
Empty string
Any single character
Sequence of characters
Union/disjunction of characters
Zero or more ("Kleene closure") of characters
24
What's regular?
Empty string
25
What's regular?
Any single character
a
26
What's regular?
Sequence of characters
ba
27
What's regular?
Union/disjunction
[Tt]he
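The disjunction [Tt]he answers the earlier question of how to find the definite article whether or not it is capitalized. A minimal sketch, assuming a handful of made-up tokens in the word/TAG style used on the slides:

```shell
# Hypothetical tagged tokens, one per line.
tokens='The/DT
the/DT
Then/RB
theory/NN
other/JJ'

# [Tt] matches either T or t, so the pattern covers both The/DT and the/DT.
articles=$(printf '%s\n' "$tokens" | grep -c '[Tt]he/DT')
echo "$articles"
```

Including the /DT tag in the pattern keeps words such as Then, theory, and other out of the count.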
28
What's regular?
Zero or more ("Kleene closure")
ba*
29
What's regular?
One or more ("Kleene plus")
aa*
(figure: finite state automaton with state 0 and two transitions labeled a)
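The two patterns ba* and aa* can be checked against short strings. A minimal sketch, assuming made-up test strings and the -x switch, which makes grep accept a line only if the pattern matches the whole line:

```shell
# ba*: one b followed by zero or more a's.
closure=$(printf '%s\n' b ba baaa ab bb | grep -x 'ba*')

# aa*: one a followed by zero or more a's, i.e. at least one a.
# The empty line and b are rejected.
plus=$(printf '%s\n' a aaa '' b | grep -cx 'aa*')

echo "$closure"
echo "$plus"
```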
30
b*c matches the first character in the string cabbbcde (b* matches zero b's there).
b*cd matches the third to seventh characters in the string cabbbcdebbbbbbcdbc.
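The positions claimed on this slide can be checked mechanically. grep prints matching lines rather than character positions, so this sketch uses awk instead, whose match() function sets RSTART (1-based start) and RLENGTH (length) for the leftmost match. The helper name first_match is made up for illustration:

```shell
# Print "start length" for the leftmost match of regex $2 in string $1.
# (first_match is a hypothetical helper name.)
first_match() {
  awk -v s="$1" -v re="$2" 'BEGIN {
    if (match(s, re)) print RSTART, RLENGTH
    else print "no match"
  }'
}

first_match cabbbcde 'b*c'             # start 1, length 1: the first character
first_match cabbbcdebbbbbbcdbc 'b*cd'  # start 3, length 5: characters 3 to 7
```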
31
Character classes
[AEIOU]  Any one of A, E, I, O, or U
h[aeiou]d  h, followed by any one of a, e, i, o, or u, followed by d
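A character class stands for exactly one character, which the h[aeiou]d pattern makes easy to see. A minimal sketch, assuming a made-up word list and whole-line matching with -x:

```shell
# head has two vowels, hd has none, and y is not in the class,
# so only had, hid, hod, and hud match.
matches=$(printf '%s\n' had hid hod hud head hd hyd | grep -x 'h[aeiou]d')
echo "$matches"
```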
32
Character classes: ranges
All upper-case, all lower-case, all letters, any digit from zero to 9…
33
Character classes: ranges
All upper-case, all lower-case, all letters, any digit from zero to 9…
[A-Z]
[a-z]
[A-Za-z]
[0-9]
Practice!
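As practice, the four ranges can be run against a small word list. A minimal sketch, assuming made-up tokens; LC_ALL=C is set as a precaution, since in some locales a range such as [a-z] can also match upper-case letters:

```shell
# Hypothetical tokens, one per line.
tokens='Walton
grep
5200
LING
wsj05'

# Each count is the number of lines containing at least one character from the range.
upper=$(printf '%s\n' "$tokens" | LC_ALL=C grep -c '[A-Z]')      # Walton, LING
lower=$(printf '%s\n' "$tokens" | LC_ALL=C grep -c '[a-z]')      # Walton, grep, wsj05
letter=$(printf '%s\n' "$tokens" | LC_ALL=C grep -c '[A-Za-z]')  # everything except 5200
digit=$(printf '%s\n' "$tokens" | LC_ALL=C grep -c '[0-9]')      # 5200, wsj05

echo "$upper $lower $letter $digit"
```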
34
Character classes: complements
Any character that's not a vowel
35
Character classes: complements
Any character that's not a vowel
[^aeiouAEIOU]
36
Character classes: complements
Any character that's not a vowel
[^aeiouAEIOU]
In this context, ^ means "not"
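A complemented class still matches exactly one character; it just has to be a character outside the listed set. A minimal sketch, assuming made-up input lines:

```shell
# Lines made up only of vowels (aeiou, IOU) contain no non-vowel character,
# so only grep (g, r, p) and a1 (the digit 1) are counted.
nonvowel_lines=$(printf '%s\n' aeiou IOU grep a1 | grep -c '[^aeiouAEIOU]')
echo "$nonvowel_lines"
```

Digits, spaces, and punctuation all count as "not a vowel" here, not just consonants.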
37
Anchors
Any line that begins with…
Any line that ends with…
38
Anchors
Any line that begins with…
Any line that ends with…
^T  line that begins with T
VBZ$  line that ends with VBZ
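The two anchors can be tried on a few tagged lines. A minimal sketch, assuming made-up sentences in the word/TAG style used on the slides:

```shell
# Hypothetical tagged sentences, one per line.
lines='The/DT dog/NN barks/VBZ
A/DT cat/NN sleeps/VBZ
The/DT cats/NNS sleep/VBP
It/PRP is/VBZ late/JJ'

starts_t=$(printf '%s\n' "$lines" | grep -c '^T')     # lines 1 and 3
ends_vbz=$(printf '%s\n' "$lines" | grep -c 'VBZ$')   # lines 1 and 2
both=$(printf '%s\n' "$lines" | grep '^T' | grep -c 'VBZ$')   # line 1 only

echo "$starts_t $ends_vbz $both"
```

The last line of the sample contains VBZ but does not end with it, so VBZ$ skips it.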
39
Quantifiers
One or more…
Zero or more…
One or zero…
40
Quantifiers
One or more…
Zero or more…
One or zero…
a+  one or more “a's”
a*  zero or more “a's”
a?  one “a”, or nothing
And more…
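The three quantifiers differ only in how many a's they allow, which shows up when each is matched against the empty string, a, and aa. A minimal sketch, assuming grep -E (the current spelling of egrep), because in grep's default basic syntax an unescaped + or ? is an ordinary character; the helper name count_full_matches is made up:

```shell
# Count how many of the three test lines ('', a, aa) match the pattern in full.
# (count_full_matches is a hypothetical helper name.)
count_full_matches() {
  printf '%s\n' '' a aa | grep -Ecx "$1"
}

one_or_more=$(count_full_matches 'a+')    # a, aa
zero_or_more=$(count_full_matches 'a*')   # empty line, a, aa
zero_or_one=$(count_full_matches 'a?')    # empty line, a

echo "$one_or_more $zero_or_more $zero_or_one"
```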
41
grep/egrep
egrep allows x+ instead of xx*
egrep allows alternation: (xxx|yyy)
42
What's regular?
One or more ("Kleene plus")
aa* = a+
(figure: finite state automaton with state 0 and two transitions labeled a)
43
grep/egrep
grep '^[^a-z]*epl' README.english
egrep '^[^a-z]*(epl|epw)' README.english
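Both commands look for a token that is preceded, from the start of the line, only by characters that are not lower-case letters. A minimal sketch, assuming three made-up lines (the real contents of README.english are not reproduced here) and grep -E as the equivalent of egrep:

```shell
# Hypothetical lines standing in for README.english.
lines='   epl lemmas
2) epw wordforms
see epl above'

# Basic grep: only epl, and only when nothing lower-case precedes it.
basic=$(printf '%s\n' "$lines" | grep -c '^[^a-z]*epl')

# Extended syntax adds alternation, so epw is found as well.
extended=$(printf '%s\n' "$lines" | grep -Ec '^[^a-z]*(epl|epw)')

echo "$basic $extended"
```

The third line contains epl but starts with the lower-case word see, so neither pattern matches it.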
44
Searching the treebank
cat ??/* | egrep -i '(push|pull)[a-z]*'
cat ??/* | egrep -i '(push|pull)[a-z]
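The push/pull pattern can be tried without the treebank files. A minimal sketch, assuming three made-up sentences in place of the output of cat ??/*, grep -E in place of egrep, and the -o switch (supported by GNU and BSD grep, but not required by POSIX) to print only the matched text:

```shell
# Hypothetical text standing in for the treebank files.
text='He pushed the door and she pulled it.
Pushing and pulling are opposites.
Nobody spoke.'

# Number of lines containing some form of push or pull, ignoring case.
line_count=$(printf '%s\n' "$text" | grep -Eic '(push|pull)[a-z]*')

# -o prints each matched word form on its own line.
forms=$(printf '%s\n' "$text" | grep -Eio '(push|pull)[a-z]*')

echo "$line_count"
echo "$forms"
```

The trailing [a-z]* does not change which lines match; it only extends each match to the end of the word, which matters once the matched text itself is printed.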