Regular Expressions: grep
LING 5200 Computational Corpus Linguistics
Martha Palmer
LING 5200, 2006 BASED on Kevin Cohen’s LING

Homework 2
  Bytes
  Read path names
  ~ not necessary in home directory
  Display results of commands if they’re just a few lines.
Switches
  -c  list a count of matching lines only (like adding | wc -l)
  -i  ignore the case of the letters in the pattern
  -n  include the line numbers
  -v  show lines that do NOT match the pattern

  grep -i lemma README.english
  grep -ic lemma README.english
  grep -in lemma README.english
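To see how the switches differ, here is a small sketch that runs them on three made-up lines rather than on README.english. The sample text is an assumption for illustration only, not the contents of the real file.

```shell
# Hypothetical sample text standing in for README.english.
sample='The Lemma file lists headwords.
No match on this line.
Each lemma has an ID number.'

# -i: ignore case, so both "Lemma" and "lemma" match.
printf '%s\n' "$sample" | grep -i lemma

# -ic: print only the number of matching lines.
count=$(printf '%s\n' "$sample" | grep -ic lemma)
echo "$count"

# -in: prefix each matching line with its line number.
numbered=$(printf '%s\n' "$sample" | grep -in lemma)
echo "$numbered"

# -iv: print the lines that do NOT match.
rest=$(printf '%s\n' "$sample" | grep -iv lemma)
echo "$rest"
```

Combining single-letter switches, as in -ic, behaves the same as writing -i -c.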
The Chomsky Grammar Hierarchy
  Regular grammars, aabbbb
    S → aS | nil | bS
  Context free grammars, aaabbb
    S → aSb | nil
  Context sensitive grammars, aaabbbccc
    xSy → xby
  Transformational grammars - Turing Machines
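The regular grammar S → aS | bS | nil generates every string of a's and b's, which is the language the pattern ^[ab]*$ describes. The sketch below, with test strings that are partly made up, suggests why grep patterns stop at this level: a pattern can force a's to come before b's, but it cannot require the two counts to be equal, which is what S → aSb | nil does.

```shell
# Strings to classify; only the first two appear on the slide.
strings='aabbbb
aaabbb
abab
aabbc'

# ^[ab]*$ corresponds to S -> aS | bS | nil: any mix of a's and b's.
regular=$(printf '%s\n' "$strings" | grep -c '^[ab]*$')

# ^a*b*$ puts the a's first but cannot count them, so it accepts
# aabbbb (2 a's, 4 b's) as readily as aaabbb (3 a's, 3 b's).
ordered=$(printf '%s\n' "$strings" | grep -c '^a*b*$')

echo "$regular $ordered"
```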
Movement
  What did John give to Mary?
  *Where did John give to Mary?
  John gave cookies to Mary.
  John gave to Mary.
Nested Dependencies and Crossing Dependencies
  Crossing (CS): John, Mary and Bill ate peaches, pears and apples, respectively.
  The dog chased the cat that bit the mouse that ran.
  Nested (CF): The mouse the cat the dog chased bit ran.
Most parsers are Turing Machines
  To give a more natural and comprehensible treatment of movement
  For a more efficient treatment of features
  Not because of respectively – most parsers can’t handle it.
b*c matches the first character in the string cabbbcde.
b*cd matches the third to seventh characters in the string cabbbcdebbbbbbcdbc.
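One way to check where a pattern matches is to have sed bracket the match. This is only a sketch: sed is swapped in here because it rewrites the leftmost match and lets & stand for the matched text, whereas plain grep reports just the matching line.

```shell
# b*c: zero b's followed by c already matches at the first character.
first=$(echo 'cabbbcde' | sed 's/b*c/[&]/')
echo "$first"

# b*cd: the leftmost place where b's, then c, then d line up
# is characters three to seven.
second=$(echo 'cabbbcdebbbbbbcdbc' | sed 's/b*cd/[&]/')
echo "$second"
```

The first command prints [c]abbbcde and the second prints ca[bbbcd]ebbbbbbcdbc.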
Character classes: ranges
  All upper-case, all lower-case, all letters, any digit from zero to 9…
  [A-Z]     all upper-case letters
  [a-z]     all lower-case letters
  [A-Za-z]  all letters
  [0-9]     any digit from zero to 9
  Practice!
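For practice, the ranges can be tried on a few made-up tokens. The token list is an assumption, and LC_ALL=C is set because in some locales a range such as [a-z] can also take in upper-case letters.

```shell
# Hypothetical tokens, one per line.
tokens='NP
walked
1987
Inc.'

# Count the lines that contain at least one character from each range.
upper=$(printf '%s\n' "$tokens" | LC_ALL=C grep -c '[A-Z]')
lower=$(printf '%s\n' "$tokens" | LC_ALL=C grep -c '[a-z]')
letter=$(printf '%s\n' "$tokens" | LC_ALL=C grep -c '[A-Za-z]')
digit=$(printf '%s\n' "$tokens" | LC_ALL=C grep -c '[0-9]')

echo "$upper $lower $letter $digit"
```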
Character classes: complements
  Any character that's not a vowel
  [^aeiouAEIOU]
  In this context, ^ means "not"
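A complemented class matches any single character outside the set, including punctuation and digits, not just consonants. A small sketch with made-up words:

```shell
# Hypothetical word list, one per line.
words='aeiou
rhythm
AEI
queue!'

# Keep the lines that contain at least one non-vowel character.
nonvowel=$(printf '%s\n' "$words" | grep '[^aeiouAEIOU]')
echo "$nonvowel"
```

Only rhythm and queue! are printed; the other two lines consist entirely of vowels.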
Anchors
  Any line that begins with…
  Any line that ends with…
  ^T    line that begins with T
  VBZ$  line that ends with VBZ
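The two anchors can be tried separately and together. The lines below are invented and only loosely resemble tagged text.

```shell
# Hypothetical lines pairing words with tags.
lines='The DT
runs VBZ
VBZ is a tag
Tom eats VBZ'

# ^T: lines whose first character is T.
starts=$(printf '%s\n' "$lines" | grep -c '^T')

# VBZ$: lines whose last three characters are VBZ.
ends=$(printf '%s\n' "$lines" | grep -c 'VBZ$')

# Both anchors in one pattern, with .* covering everything in between.
both=$(printf '%s\n' "$lines" | grep '^T.*VBZ$')

echo "$starts $ends"
echo "$both"
```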
Quantifiers
  One or more… Zero or more… One or zero…
  a+  one or more “a's”
  a*  zero or more “a's”
  a?  one “a”, or nothing
  And more…
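The three quantifiers give different counts on the same input. This sketch uses grep -E, the POSIX spelling of egrep, on made-up strings.

```shell
# Hypothetical strings, one per line.
words='b
ab
aab
ba'

# a+ : at least one a before the final b.
plus=$(printf '%s\n' "$words" | grep -E -c '^a+b$')

# a* : any number of a's, including none.
star=$(printf '%s\n' "$words" | grep -E -c '^a*b$')

# a? : at most one a.
opt=$(printf '%s\n' "$words" | grep -E -c '^a?b$')

echo "$plus $star $opt"
```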
grep/egrep
  x+ instead of xx*
  (xxx|yyy) matches either xxx or yyy
  ? matches a single character in shell file name patterns (as in ??/*); in egrep, x? means one x or nothing
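A brief sketch of the difference between the two pattern dialects, using grep -E in place of egrep. In basic patterns a bare + is an ordinary character, which is why xx* is needed there. The input lines are made up.

```shell
# Hypothetical lines; the last one contains a literal plus sign.
lines='x
xxx
yyy
x+'

# Basic pattern: "one or more x" has to be spelled xx*.
basic=$(printf '%s\n' "$lines" | grep -c '^xx*$')

# Extended pattern: x+ means the same thing.
extended=$(printf '%s\n' "$lines" | grep -E -c '^x+$')

# The same x+ without -E matches only the literal text "x+".
literal=$(printf '%s\n' "$lines" | grep -c '^x+$')

# Alternation is available in extended patterns.
either=$(printf '%s\n' "$lines" | grep -E -c '^(xxx|yyy)$')

echo "$basic $extended $literal $either"
```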
Searching the treebank
  cat ??/* | egrep -i '(push|pull)[a-z]*'
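The command can be tried without the treebank by building a throwaway directory tree. Everything below (the directory names 00, 01 and extra, the file names, and the sentences) is hypothetical, and mktemp -d is assumed to be available, as it is on most Linux and macOS systems.

```shell
# Build a scratch tree with two-character directory names.
tmp=$(mktemp -d)
mkdir "$tmp/00" "$tmp/01" "$tmp/extra"
printf 'He pushed the door .\n' > "$tmp/00/wsj_0001"
printf 'Pulling ahead , she won .\nNothing here .\n' > "$tmp/01/wsj_0101"
printf 'push in a longer-named directory\n' > "$tmp/extra/notes"

# ??/* expands to every file in a directory whose name is exactly two
# characters long, so extra/notes is never read.
hits=$(cd "$tmp" && cat ??/* | grep -E -i -c '(push|pull)[a-z]*')
echo "$hits"

rm -r "$tmp"
```

Because of -i, the capitalized Pulling counts as a match alongside pushed.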
grep/egrep
  grep '^[^a-z]*epl' README.english
  grep ' epl' README.english
  egrep '^[^a-z]*(epl|epw)' README.english
  egrep ' (epl|epw)' README.english
  Nice when you have tokenized strings…
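The anchored and the space-prefixed patterns select different lines. The sketch below uses invented lines, not the real README.english, and sets LC_ALL=C so that [^a-z] reliably treats upper-case letters as "not lower-case".

```shell
# Hypothetical lines mentioning the epl and epw files.
sample='  epl.cd  lemma file
see also epl.cd
EPL/epl.cd
the epl directory
also epw.cd'

# epl preceded, from the start of the line, by non-lower-case characters only.
anchored=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '^[^a-z]*epl')

# epl preceded by a space anywhere in the line.
spaced=$(printf '%s\n' "$sample" | grep -c ' epl')

# Either name preceded by a space (extended pattern).
either=$(printf '%s\n' "$sample" | grep -E -c ' (epl|epw)')

echo "$anchored $spaced $either"
```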
More grepping
  But when you don’t…
  /corpora/celex/english/epw/epw.cd
  Find all capitalized words
    grep '^[0-9][0-9]*.[A-Z]' epw.cd | wc -l
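In this pattern the unescaped . matches any single character, so it accepts whatever separator follows the leading number. The sketch below assumes backslash-separated fields with a numeric ID first. That layout and all four entries are assumptions for illustration, not lines copied from epw.cd.

```shell
# Hypothetical entries: ID, separator, word, separator, count.
sample='1\a\100
2\Aaron\3
3\abbey\7
4\Zurich\2'

# Digits, any one character, then an upper-case letter.
# grep -c counts lines directly, like piping to wc -l.
capitalized=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '^[0-9][0-9]*.[A-Z]')

# A stricter variant that insists the separator is a backslash.
strict=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '^[0-9][0-9]*\\[A-Z]')

echo "$capitalized $strict"
```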
Exercises – pick a directory
  How many 5 letter words?
    head -10 wsj_0564 | grep -i ' [a-z][a-z][a-z][a-z][a-z] ' | wc
    grep -i ' [a-z][a-z][a-z][a-z][a-z] ' * | wc
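Note that these commands count matching lines, not words: a line with several five-letter words is counted once, and a word at the very start or end of a line has no space on one side. One possible alternative, sketched here on a made-up sentence, is to put each token on its own line with tr first and then count exact five-letter tokens.

```shell
# Hypothetical one-line text.
text='The quick brown foxes jumped over three lazy dogs today'

# The slide's pattern: counts lines with at least one five-letter word.
lines=$(printf '%s\n' "$text" | LC_ALL=C grep -i -c ' [a-z][a-z][a-z][a-z][a-z] ')

# One token per line, then count tokens that are exactly five letters.
words=$(printf '%s\n' "$text" | tr -s ' ' '\n' | LC_ALL=C grep -E -i -c '^[a-z]{5}$')

echo "$lines $words"
```

Here the line count is 1 while the word count is 5 (quick, brown, foxes, three, today).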
Lab (cont.)
  Are there any words with no vowels?
    grep -i ' [^aeiou][^aeiou]* ' wsj_0564 | wc
    grep -i ' [^aeiouy][^aeiouy.]* ' wsj_0564 | wc
    grep -i ' [^aeiouy"][^aeiouy."]* ' wsj_ %?
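Each successive command excludes more characters because a complemented class such as [^aeiou] also matches digits, punctuation and even spaces. A different approach, sketched here on a made-up sentence, is to list the consonants positively and test one token at a time. Treating y as a vowel follows the later commands on the slide.

```shell
# Hypothetical text.
text='Mr Smith said hmm and the TV crew left by 5 pm'

# [b-df-hj-np-tv-xz] is every letter except a, e, i, o, u and y.
novowel=$(printf '%s\n' "$text" | tr -s ' ' '\n' |
  LC_ALL=C grep -E -i '^[b-df-hj-np-tv-xz]+$')
echo "$novowel"
```

This prints Mr, hmm, TV and pm; the token 5 is rejected because it is not a letter, and by because of its y.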
Lab (cont.)
  Find “1-syllable” words (words with exactly one vowel).
    grep -i ' [^aeiouy]*[aeiouy][^aeiouy]* '
  Find “2-syllable” words (words with exactly two vowels).
  Delete words ending with a silent “e” from the “2-syllable” list.
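One possible way to work through all three steps is sketched below. It tests one token per line and builds the patterns from two shell variables. The variable names C and V and the word list are made up, and dropping every word that ends in e is only a rough stand-in for "silent e".

```shell
# C: any consonant (y is counted as a vowel); V: any vowel.
C='[b-df-hj-np-tv-xz]'
V='[aeiouy]'

# Hypothetical word list, turned into one token per line.
text='cat stone banana dog robot sky grape basin'
tokens=$(printf '%s\n' "$text" | tr -s ' ' '\n')

# Exactly one vowel, with consonants allowed on either side.
one=$(printf '%s\n' "$tokens" | LC_ALL=C grep -E -i "^${C}*${V}${C}*\$")

# Exactly two vowels.
two=$(printf '%s\n' "$tokens" | LC_ALL=C grep -E -i "^${C}*${V}${C}*${V}${C}*\$")

# Remove words ending in e from the two-vowel list.
two_no_e=$(printf '%s\n' "$two" | grep -v 'e$')

echo "$one"
echo "$two_no_e"
```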
Emacs
  emacs -nw
  Control x, control c – exit
  Control x, control s – save
  Control x, control v – visit
  Apropos