Published by Oliver Dennis. Modified over 8 years ago.
1
Regular Expressions and Xkwic
LING 5200 Computational Corpus Linguistics
Martha Palmer
February 28, 2006
2
LING 5200, 2006. Based on Kevin Cohen's LING 5200.
grep/egrep
x+ instead of xx*
(xxx|yyy) matches xxx OR yyy
? matches zero or one occurrence of the preceding item (a character, a character set, or a group)
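The three operators on this slide can be tried on a handful of test words. The sketch below is only an illustration: the word list is invented, and it uses grep -E, the standard spelling of egrep.

```shell
# Invented sample words, one per line (not taken from the course corpora).
words=$(printf '%s\n' x xx xxx yyy color colour colouur)

# 'x+' in egrep syntax and 'xx*' in plain grep syntax both mean "one or more x".
plus_count=$(printf '%s\n' "$words" | grep -E -c '^x+$')
star_count=$(printf '%s\n' "$words" | grep -c '^xx*$')

# Alternation: lines that are exactly xxx OR yyy.
alt_count=$(printf '%s\n' "$words" | grep -E -c '^(xxx|yyy)$')

# '?' makes the preceding item optional: color and colour match, colouur does not.
opt_count=$(printf '%s\n' "$words" | grep -E -c '^colou?r$')

echo "$plus_count $star_count $alt_count $opt_count"   # prints: 3 3 2 2
```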
3
More grepping/egrepping
/corpora/celex/english/epw/epw.cd
Find all capitalized words:
    grep ^'[0-9][0-9]*.[A-Z]' epw.cd | wc -l
OR
    egrep ^'[0-9]+.[A-Z]' epw.cd | wc -l
4
Homework 3
Please give me command AND results!
1. In the file /corpora/celex/english/epw/epw.cd, find all words that contain only upper-case letters, e.g. USSR and VTOL. ANS: 158
    grep '^[0-9][0-9]*\\[A-Z][A-Z]*\\' epw.cd | wc -l
    egrep '^[0-9]+\\[A-Z]+\\' epw.cd | wc -l
    egrep ^'[0-9]+[\][A-Z]+\\' epw.cd | wc -l
    egrep ^'[0-9]+.[A-Z]+\\' epw.cd | wc -l
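The variants above differ mainly in how they match the backslash that separates the entry number from the word: \\ matches it literally, while . matches any single character, which here happens to be the backslash. A small sketch, run on five entries quoted elsewhere in these slides rather than on the full epw.cd, so the counts are not the homework answer. LC_ALL=C is an added precaution, because the meaning of ranges such as [A-Z] can depend on the locale.

```shell
export LC_ALL=C   # keep [A-Z] to ASCII upper-case letters
tmp=$(mktemp)
# Five epw.cd entries as they appear on later slides.
cat > "$tmp" <<'EOF'
5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
6\AA\95\6\1\P\"1-'1\[VV][VV]\[eI][eI]
4727\ayah\13\2714\2\P\'2-@\[VV][V]\[aI][@]\S\'#-j@\[VV][CV]\[A:][j@]
43355\i.e.\424\22210\1\P\"2-'i\[VV][VV]\[aI][i:]
100184\X chromosome\0\52203\1\P\'Eks-"kr5-m@-s5m\[VCC][CCVV][CV][CVVC]\[Eks][kr@U][m@][s@Um]
EOF

# Separator matched as a literal backslash.
upper_escaped=$(grep -E -c '^[0-9]+\\[A-Z]+\\' "$tmp")
# Separator matched by '.', i.e. by any character.
upper_dot=$(grep -E -c '^[0-9]+.[A-Z]+\\' "$tmp")
rm -f "$tmp"

echo "$upper_escaped $upper_dot"   # prints: 2 2 (the two AA entries)
```

On this sample both forms agree. The escaped form states the intent more precisely, since . would also accept a different character in the separator position.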
5
Homework 3
2. How many entries have a syllable that ends with a 4-consonant cluster? ANS: 45
    egrep 'CCCC]' epw.cd    -> 56    (why not \] ?)
    grep 'CCCC]' epw.cd    -> 56
    grep 'CCCC]' epw.cd | grep -v 'ed[ \\]'    -> 36
    egrep 'CCCC]\\' epw.cd    -> 45
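On the "why not \]" question: outside a bracket expression, ] is an ordinary character in POSIX regular expressions, so 'CCCC]' needs no backslash. The sketch below shows why the three patterns give different counts. The four rows are invented and heavily simplified (id, word, syllable pattern only). They are not real CELEX entries, and their counts say nothing about epw.cd.

```shell
export LC_ALL=C
tmp=$(mktemp)
# Invented rows: id\word\syllable pattern\  (hypothetical, for illustration only).
cat > "$tmp" <<'EOF'
1\sixths\[CVCCCC]\
2\glimpsed\[CCVCCCC]\
3\textbook\[CVCCC][CVC]\
4\madeup\[CVCCCC][CVC]\
EOF

# Any syllable ending in four C's: rows 1, 2 and 4.
any_syllable=$(grep -c 'CCCC]' "$tmp")
# Same, minus rows where "ed" is followed by a space or a backslash: drops row 2.
no_ed=$(grep 'CCCC]' "$tmp" | grep -c -v 'ed[ \\]')
# Cluster immediately followed by the field separator, i.e. in the last syllable: rows 1 and 2.
last_syllable=$(grep -E -c 'CCCC]\\' "$tmp")
rm -f "$tmp"

echo "$any_syllable $no_ed $last_syllable"   # prints: 3 2 2
```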
6
Homework 3
3. Find all multi-word terms in which only the first letter is capitalized, e.g. Colorado potato beetle. ANS: 238/243
    egrep ^'[0-9]+.[A-Z][a-z]+( [a-z]+)+\\' epw.cd | wc -l
    egrep ^'[0-9]+\\[A-Z][a-z]*( [a-z]+)+\\' epw.cd | wc -l
100184\X chromosome\0\52203\1\P\'Eks-"kr5-m@-s5m\[VCC][CCVV][CV][CVVC]\[Eks][kr@U][m@][s@Um]
100185\X chromosomes\0\52203\1\P\'Eks-"kr5-m@-s5mz\[VCC][CCVV][CV][CVVCC]\[Eks][kr@U][m@][s@Umz]
100287\Y chromosome\0\52250\1\P\'w2-"kr5-m@-s5m\[CVV][CCVV][CV][CVVC]\[waI][kr@U][m@][s@Um]
100288\Y chromosomes\0\52250\1\P\'w2-"kr5-m@-s5mz\[CVV][CCVV][CV][CVVCC]\[waI][kr@U][m@][s@Umz]
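The two commands differ in [a-z]+ versus [a-z]* after the initial capital. With +, the first word needs at least one lower-case letter, so a one-letter first word such as X in "X chromosome" is rejected. With *, it is accepted. A sketch on two entries from this slide plus two rows whose ids and trailing fields are invented placeholders:

```shell
export LC_ALL=C
tmp=$(mktemp)
# First two rows are quoted from the slide; the last two use hypothetical ids
# and truncated fields, added only to have a positive and a negative case.
cat > "$tmp" <<'EOF'
100184\X chromosome\0\52203\1\P\'Eks-"kr5-m@-s5m\[VCC][CCVV][CV][CVVC]\[Eks][kr@U][m@][s@Um]
100185\X chromosomes\0\52203\1\P\'Eks-"kr5-m@-s5mz\[VCC][CCVV][CV][CVVCC]\[Eks][kr@U][m@][s@Umz]
900001\Colorado potato beetle\0
900002\Union Jacks\0
EOF

# [a-z]+ : only "Colorado potato beetle" qualifies.
strict=$(grep -E -c '^[0-9]+.[A-Z][a-z]+( [a-z]+)+\\' "$tmp")
# [a-z]* : the two "X chromosome(s)" entries qualify as well.
loose=$(grep -E -c '^[0-9]+\\[A-Z][a-z]*( [a-z]+)+\\' "$tmp")
rm -f "$tmp"

echo "$strict $loose"   # prints: 1 3
```

"Union Jacks" is rejected by both patterns because its second word starts with a capital.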
7
Homework 3
4. Find all multi-word terms in which the first letter (and only the first letter) of each word is capitalized, e.g. Union Jacks and Royal Automobile Club. Note: your regex should be able to accommodate an arbitrary number of words. ANS: 296/298
    egrep ^'[0-9]+.[A-Z][a-z]+( [A-Z][a-z]*)+\\' epw.cd
    egrep ^'[0-9]+.[A-Z][a-z]*( [A-Z][a-z]*)+\\' epw.cd
8
Homework 3
5. Find all disyllabic words that contain only vowels. ANS: 4
    egrep '\\\[V+\]\[V+\]\\' epw.cd
5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
6\AA\95\6\1\P\"1-'1\[VV][VV]\[eI][eI]
4727\ayah\13\2714\2\P\'2-@\[VV][V]\[aI][@]\S\'#-j@\[VV][CV]\[A:][j@]
43355\i.e.\424\22210\1\P\"2-'i\[VV][VV]\[aI][i:]
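Read piece by piece, the pattern is: a literal backslash, then [V...V], then [V...V], then another literal backslash. That is a syllable-pattern field consisting of exactly two all-vowel syllables. A sketch that runs the slide's pattern on the four matching entries plus one non-matching entry from another slide (a five-line sample, not the full epw.cd):

```shell
export LC_ALL=C
tmp=$(mktemp)
# Four entries from this slide and, as a negative case, "X chromosome".
cat > "$tmp" <<'EOF'
5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
6\AA\95\6\1\P\"1-'1\[VV][VV]\[eI][eI]
4727\ayah\13\2714\2\P\'2-@\[VV][V]\[aI][@]\S\'#-j@\[VV][CV]\[A:][j@]
43355\i.e.\424\22210\1\P\"2-'i\[VV][VV]\[aI][i:]
100184\X chromosome\0\52203\1\P\'Eks-"kr5-m@-s5m\[VCC][CCVV][CV][CVVC]\[Eks][kr@U][m@][s@Um]
EOF

# \\ = literal backslash, \[ and \] = literal brackets, V+ = one or more V.
vowel_only=$(grep -E -c '\\\[V+\]\[V+\]\\' "$tmp")
rm -f "$tmp"

echo "$vowel_only"   # prints: 4
```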
9
Homework 3
6. Multiword expressions (find a similar phrase in the wsj/raw corpus, and search for all variants of it in the entire corpus).
    egrep -i '.tip of the *[a-z] iceberg'
    egrep '[Tt]he tip of (a|the).* iceberg'
patriarchical / a more alarming
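The second pattern allows either article and any amount of intervening text before "iceberg", which is what lets it pick up modified variants. A sketch on four invented sentences, which are not taken from the wsj/raw corpus:

```shell
tmp=$(mktemp)
# Invented test sentences; the modifiers echo the variants named on the slide.
cat > "$tmp" <<'EOF'
That is only the tip of the iceberg.
The tip of a more alarming iceberg emerged.
It was the tip of the patriarchical iceberg.
The iceberg melted.
EOF

# (a|the) picks the article; .* absorbs any modifiers before " iceberg".
variants=$(grep -E -c '[Tt]he tip of (a|the).* iceberg' "$tmp")
rm -f "$tmp"

echo "$variants"   # prints: 3
```

Because .* is unrestricted, the pattern would also match a long sentence where "iceberg" appears much later on the same line, so on a real corpus the hits are worth checking by eye.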
10
Homework 3
6. Other multiword expressions
war on (inflation/drugs/the dictator)
fight the war on the expenditure side rather
rule of (the day/journalism/Ferdinand Marcos)
cream of the (British) crop
11
Searching the treebank
    cat ??/* | egrep -i '(push|pull)[a-z]*'
OR xkwic?
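The glob ??/* expands to every file inside every directory with a two-character name, and cat feeds them all to one egrep. The sketch below rebuilds that layout in a temporary directory. The directory names, file names and sentences are invented stand-ins, not treebank data.

```shell
dir=$(mktemp -d)
# Hypothetical layout: two directories with two-character names, one file each.
mkdir "$dir/00" "$dir/01"
printf '%s\n' 'Investors pushed prices higher.' 'Nothing relevant here.' > "$dir/00/sample_a"
printf '%s\n' 'The company PULLED its offer.' 'Pulling out was costly.' > "$dir/01/sample_b"

# Same pipeline as on the slide, run from inside the temporary directory.
hits=$(cd "$dir" && cat ??/* | grep -E -i -c '(push|pull)[a-z]*')
# Without the trailing [a-z]* the same lines are selected.
bare=$(cd "$dir" && cat ??/* | grep -E -i -c '(push|pull)')
rm -rf "$dir"

echo "$hits $bare"   # prints: 3 3
```

Since grep selects whole lines, the trailing [a-z]* does not change which lines are returned. It only changes how much of each line counts as the match, which matters when the matched text itself is highlighted or extracted.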
12
XWin 32
See e-mail
Load on laptops, bring laptops to class if any issues
Go to Feb 9 Emacs & Xkwic lecture