Regular Expressions and Xkwic
LING 5200 Computational Corpus Linguistics
Martha Palmer
February 28, 2006
LING 5200, 2006 BASED on Kevin Cohen’s LING
grep/egrep
  x+ instead of xx*
  (xxx|yyy) matches xxx OR yyy
  ? matches zero or one occurrence of the preceding character or character set
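The three operators on this slide can be tried out on a few throwaway words. The sketch below is only an illustration: the word list is made up, and `grep -E` is used as the equivalent of `egrep`.

```shell
# Made-up word list for illustration (not from any corpus).
words='color
colour
colouur
fur
cat
dog
bird'

# x+ and xx* both mean "one or more x", so they select the same lines.
plus=$(printf '%s\n' "$words" | grep -E -c 'u+r')
star=$(printf '%s\n' "$words" | grep -E -c 'uu*r')

# ? makes the preceding item optional (zero or one occurrence).
optional=$(printf '%s\n' "$words" | grep -E -c '^colou?r$')

# (xxx|yyy) matches either alternative.
either=$(printf '%s\n' "$words" | grep -E -c '^(cat|dog)$')

echo "u+r: $plus  uu*r: $star  colou?r: $optional  (cat|dog): $either"
```

The `-c` option prints the number of matching lines, which does the same job here as piping to `wc -l`.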
More grepping/egrepping
  File: /corpora/celex/english/epw/epw.cd
  Find all capitalized words:
    grep ^'[0-9][0-9]*.[A-Z]' epw.cd | wc -l
  OR
    egrep ^'[0-9]+.[A-Z]' epw.cd | wc -l
Homework 3
  Please give me command AND results!
  1. In the file /corpora/celex/english/epw/epw.cd, find all words that contain only uppercase letters, e.g. USSR and VTOL.
  ANS: 158
    grep '^[0-9][0-9]*\\[A-Z][A-Z]*\\' epw.cd | wc -l
    egrep '^[0-9]+\\[A-Z]+\\' epw.cd | wc -l
    egrep ^'[0-9]+[\][A-Z]+\\' epw.cd | wc -l
    egrep ^'[0-9]+.[A-Z]+\\' epw.cd | wc -l
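What separates "capitalized" from "uppercase only" is the closing `\\`, which forces the pattern to span the whole word field. A minimal sketch follows. It assumes a simplified, hypothetical three-field layout (`id\word\count`) rather than real epw.cd lines, and uses `grep -E` in place of `egrep`.

```shell
# Hypothetical, simplified entries (id\word\count); real epw.cd lines have more fields.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
1\USSR\12
2\VTOL\3
3\Union Jack\7
4\abacus\5
5\Xmas\9
EOF

# No closing \\ : any word that merely starts with a capital letter.
capitalized=$(grep -E -c '^[0-9]+\\[A-Z]' "$tmp")

# Closing \\ : the whole word field must consist of uppercase letters.
all_upper=$(grep -E -c '^[0-9]+\\[A-Z]+\\' "$tmp")

# Same search in basic grep syntax, writing x+ as xx*.
all_upper_basic=$(grep -c '^[0-9][0-9]*\\[A-Z][A-Z]*\\' "$tmp")

# Using . for the separator: it matches any character, not just a backslash.
all_upper_dot=$(grep -E -c '^[0-9]+.[A-Z]+\\' "$tmp")

rm -f "$tmp"
echo "capitalized=$capitalized all_upper=$all_upper basic=$all_upper_basic dot=$all_upper_dot"
```

On this sample the three uppercase-only variants agree. The `.` version is the loosest of them, since it would also accept a character other than the backslash separator after the id.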
Homework 3
  2. How many entries have a syllable that ends with a 4-consonant cluster?
  ANS: 45
    egrep 'CCCC]' epw.cd                        56    (why not \] ?)
    grep 'CCCC]' epw.cd                         56
    grep 'CCCC]' epw.cd | grep -v 'ed[ \\]'     36
    egrep 'CCCC]\\' epw.cd                      45
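Two points from this slide can be checked on a toy file: a `]` with no opening `[` before it is an ordinary character, and adding `\\` after it restricts the match to a cluster that closes the field. The entries below are invented and simplified (`id\word\CV skeleton\`), including the made-up word "textsful"; they are not real CELEX lines.

```shell
# Invented, simplified entries (id\word\CV skeleton\); not real CELEX lines.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
1\texts\[CVCCCC]\
2\prompts\[CCVCCCC]\
3\textsful\[CVCCCC][CVC]\
4\cat\[CVC]\
EOF

# ] with no opening [ before it is an ordinary character, so no escape is needed.
any_syllable=$(grep -E -c 'CCCC]' "$tmp")

# Adding \\ ties the cluster to the end of the field, i.e. the last syllable.
last_syllable=$(grep -E -c 'CCCC]\\' "$tmp")

rm -f "$tmp"
echo "any syllable: $any_syllable  last syllable only: $last_syllable"
```

The gap between the two counts mirrors the gap between 56 and 45 on the slide: the shorter pattern also accepts clusters in non-final syllables.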
Homework 3
  3. Find all multi-word terms in which only the first letter is capitalized, e.g. Colorado potato beetle.
  ANS: 238/243
    egrep ^'[0-9]+.[A-Z][a-z]+( [a-z]+)+\\' epw.cd | wc -l
    egrep ^'[0-9]+\\[A-Z][a-z]*( [a-z]+)+\\' epw.cd | wc -l
Homework 3
  4. Find all multi-word terms in which the first letter (and only the first letter) of each word is capitalized, e.g. Union Jacks and Royal Automobile Club. Note: your regex should be able to accommodate an arbitrary number of words.
  ANS: 296/298
    egrep ^'[0-9]+.[A-Z][a-z]+( [A-Z][a-z]*)+\\' epw.cd
    egrep ^'[0-9]+.[A-Z][a-z]*( [A-Z][a-z]*)+\\' epw.cd
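The repeated group `( [A-Z][a-z]*)+` is what lets the pattern cover any number of words, and the two answers differ only in whether the first word may be a single capital letter. The sketch below uses a simplified, hypothetical layout (`id\word\`); the entry "X Ray" is invented to show the edge case and is not claimed to be in epw.cd.

```shell
# Invented, simplified entries (id\word\); "X Ray" is made up to show the edge case.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
1\Union Jacks\
2\Royal Automobile Club\
3\Colorado potato beetle\
4\X Ray\
5\USSR\
EOF

# First word needs a capital plus at least one lowercase letter ([a-z]+).
strict=$(grep -E -c '^[0-9]+\\[A-Z][a-z]+( [A-Z][a-z]*)+\\' "$tmp")

# First word may be a single capital letter ([a-z]*).
loose=$(grep -E -c '^[0-9]+\\[A-Z][a-z]*( [A-Z][a-z]*)+\\' "$tmp")

rm -f "$tmp"
echo "strict: $strict  loose: $loose"
```

Both patterns reject "Colorado potato beetle" (later words are lowercase) and "USSR" (no second word); only the looser one accepts a one-letter first word.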
Homework 3
  5. Find all disyllabic words that contain only vowels.
  ANS: 4
    egrep '\\\[V+\]\[V+\]\\' epw.cd
  Sample output:
    5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
    6\AA\95\6\1\P\"1-'1\[VV][VV]\[eI][eI]
    \i.e.\424\22210\1\P\"2-'i\[VV][VV]\[aI][i:]
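The backslashes before the brackets matter: without them, `[V+]` is read as a bracket expression meaning "one character, either V or +". A small sketch of the difference follows. The first two entries are the AA lines from the slide; entries 7 to 9 are invented placeholders, not real CELEX data.

```shell
# Entries 5 and 6 come from the slide; 7-9 are invented placeholders.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
6\AA\95\6\1\P\"1-'1\[VV][VV]\[eI][eI]
7\xx\0\7\1\P\1\[CVC]\[xxx]
8\yy\0\8\1\P\1\[VV][CV]\[xx][xx]
9\zz\0\9\1\P\1\[V][V][V]\[x][x][x]
EOF

# Escaped brackets are literal: exactly two vowel-only syllables fill the field.
escaped=$(grep -E -c '\\\[V+\]\[V+\]\\' "$tmp")

# Unescaped, [V+] is a one-character set (V or +), so nothing here matches.
# grep exits non-zero when the count is 0, hence the "|| true".
unescaped=$(grep -E -c '\\[V+][V+]\\' "$tmp" || true)

rm -f "$tmp"
echo "escaped: $escaped  unescaped: $unescaped"
```

The surrounding `\\` on both sides keeps the match to a whole field, which is why the three-syllable placeholder entry is excluded.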
Homework 3
  6. Multiword expressions: find a similar phrase in the wsj/raw corpus, and search for all variants of it in the entire corpus.
    egrep -i '.tip of the *[a-z] iceberg'
    egrep '[Tt]he tip of (a|the).* iceberg'
  Variants: patriarchical / a more alarming
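The second pattern can be tried on a handful of sentences to see how `(a|the).*` picks up variants. The sentences below are invented around the variants listed on the slide; they are not lines from the wsj/raw corpus, and `grep -E` stands in for `egrep`.

```shell
# Invented sentences for illustration; not lines from the wsj/raw corpus.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
This is only the tip of the iceberg.
It may be the tip of the patriarchical iceberg.
The tip of a more alarming iceberg is showing.
The iceberg drifted south.
EOF

# (a|the) picks the article; .* allows any words (or none) before " iceberg".
variants=$(grep -E -c '[Tt]he tip of (a|the).* iceberg' "$tmp")
echo "matching lines: $variants"

# -o prints only the matched text, which makes the variants easy to compare.
grep -E -o '[Tt]he tip of (a|the).* iceberg' "$tmp"

rm -f "$tmp"
```

Because `.*` accepts anything, this pattern errs on the side of finding too much; the results still need to be read through by hand.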
Homework 3
  6. Other multiword expressions
    war on (inflation/drugs/the dictator)
    fight the war on the expenditure side rather
    rule of (the day/journalism/Ferdinand Marcos)
    cream of the (British) crop
Searching the treebank
    cat ??/* | egrep -i '(push|pull)[a-z]*'
  OR xkwic?
XWin 32
  See Load on laptops; bring laptops to class if there are any issues
  Go to Feb 9 Emacs & Xkwic lecture