1 Regular Expressions and Xkwic LING 5200 Computational Corpus Linguistics Martha Palmer February 28, 2006.

2 LING 5200, 2006 BASED on Kevin Cohen’s LING 5200
grep/egrep
x+ instead of xx*
(xxx|yyy) matches xxx OR yyy
? matches a single occurrence of the preceding character or character set, or nothing
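The three operators on this slide can be tried on a tiny word list. The sketch below is only an illustration: the words are made up, and `grep -E` is used as the equivalent of `egrep`.

```shell
# Made-up word list for illustration; not taken from any corpus.
words=$(printf '%s\n' color colour colouur cat dog bird x xx)

# x+ (extended regex) selects the same lines as xx* (basic regex).
plus=$(printf '%s\n' "$words" | grep -E -c '^x+$')
star=$(printf '%s\n' "$words" | grep -c '^xx*$')

# (cat|dog) matches either alternative.
alt=$(printf '%s\n' "$words" | grep -E -c '^(cat|dog)$')

# u? matches zero or one u: color and colour, but not colouur.
opt=$(printf '%s\n' "$words" | grep -E '^colou?r$')

echo "x+: $plus  xx*: $star  (cat|dog): $alt"
echo "$opt"
```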

3 More grepping/egrepping
/corpora/celex/english/epw/epw.cd
Find all capitalized words (the . matches the backslash that separates the ID from the word):
    grep ^'[0-9][0-9]*.[A-Z]' epw.cd | wc -l
OR
    egrep ^'[0-9]+.[A-Z]' epw.cd | wc -l
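To see how the two commands behave without access to epw.cd, here is a minimal sketch on a four-line stand-in file. The records are assumptions: the first is invented and the rest are shortened from entries quoted on the homework slides. The third pattern, which uses `\\` instead of `.`, is an added variant and not from the slide.

```shell
# Truncated records in an ID\word\... layout; the first line is made up,
# the others are shortened from entries quoted on the homework slides.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1\a\0
5\AA\52
4727\ayah\13
100184\X chromosome\0
EOF

# Basic regex: [0-9][0-9]* is one or more digits; . matches the separator.
basic=$(grep -c '^[0-9][0-9]*.[A-Z]' "$sample")
# Extended regex: + replaces the doubled character class.
extended=$(grep -E -c '^[0-9]+.[A-Z]' "$sample")
# Stricter variant (not on the slide): \\ matches only a literal backslash.
strict=$(grep -E -c '^[0-9]+\\[A-Z]' "$sample")

echo "basic=$basic extended=$extended strict=$strict"
rm -f "$sample"
```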

4 Homework 3
Please give me command AND results!
1. In the file /corpora/celex/english/epw/epw.cd, find all words that contain only upper-case letters, e.g. USSR and VTOL.
ANS: 158
    grep '^[0-9][0-9]*\\[A-Z][A-Z]*\\' epw.cd | wc -l
    egrep '^[0-9]+\\[A-Z]+\\' epw.cd | wc -l
    egrep ^'[0-9]+[\][A-Z]+\\' epw.cd | wc -l
    egrep ^'[0-9]+.[A-Z]+\\' epw.cd | wc -l
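The four answers differ only in how they write the backslash separator. A small sketch, assuming a made-up five-line file in a simplified ID\word\... layout (the IDs for USSR, VTOL and Union Jacks are invented), suggests they select the same lines:

```shell
# Simplified, partly invented records; only the ID and word fields matter here.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1\USSR\0
2\VTOL\0
3\Union Jacks\0
5\AA\52
4727\ayah\13
EOF

# Basic regex with doubled classes instead of +.
bre=$(grep -c '^[0-9][0-9]*\\[A-Z][A-Z]*\\' "$sample")
# Extended regex; \\ is a literal backslash.
ere=$(grep -E -c '^[0-9]+\\[A-Z]+\\' "$sample")
# Inside a bracket expression a backslash is literal, so [\] also works.
bracket=$(grep -E -c '^[0-9]+[\][A-Z]+\\' "$sample")
# . matches any character, including the backslash.
dot=$(grep -E -c '^[0-9]+.[A-Z]+\\' "$sample")

echo "bre=$bre ere=$ere bracket=$bracket dot=$dot"
rm -f "$sample"
```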

5 Homework 3
2. How many entries have a syllable that ends with a 4-consonant cluster?
ANS: 45
    egrep 'CCCC]' epw.cd    56    (why not \] ?)
    grep 'CCCC]' epw.cd    56
    grep 'CCCC]' epw.cd | grep -v 'ed[ \\]'    36
    egrep 'CCCC]\\' epw.cd    45
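The difference between these commands can be sketched on a few invented records. Everything in the sample file is an assumption: the ID\word\CV-pattern\ layout is a heavy simplification of the real entries, and the words and patterns are illustrative only. The sketch relies on the fact that a `]` with no opening `[` before it is an ordinary character.

```shell
# Invented records: ID\word\CV-pattern\ (a simplification, not real CELEX data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
1\texts\[CVCCCC]\
2\glimpsed\[CCVCCCC]\
3\wordX\[CVCCCC][CVC]\
4\cat\[CVC]\
EOF

# Any syllable ending in four consonants; the unpaired ] is literal.
all=$(grep -c 'CCCC]' "$sample")
# Same, minus lines where "ed" is followed by a space or a backslash.
no_ed=$(grep 'CCCC]' "$sample" | grep -v -c 'ed[ \\]')
# Only clusters in the last syllable: ] directly followed by a backslash.
final=$(grep -E -c 'CCCC]\\' "$sample")

echo "all=$all no_ed=$no_ed final=$final"
rm -f "$sample"
```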

6 Homework 3
3. Find all multi-word terms in which only the first letter is capitalized, e.g. Colorado potato beetle.
ANS: 238/243
    egrep ^'[0-9]+.[A-Z][a-z]+( [a-z]+)+\\' epw.cd | wc -l
    egrep ^'[0-9]+\\[A-Z][a-z]*( [a-z]+)+\\' epw.cd | wc -l
Entries such as these match only the second command, because [a-z]* allows a one-letter first word:
100184\X chromosome\0\52203\1\P\'Eks-"kr5-m@-s5m\[VCC][CCVV][CV][CVVC]\[Eks][kr@U][m@][s@Um]
100185\X chromosomes\0\52203\1\P\'Eks-"kr5-m@-s5mz\[VCC][CCVV][CV][CVVCC]\[Eks][kr@U][m@][s@Umz]
100287\Y chromosome\0\52250\1\P\'w2-"kr5-m@-s5m\[CVV][CCVV][CV][CVVC]\[waI][kr@U][m@][s@Um]
100288\Y chromosomes\0\52250\1\P\'w2-"kr5-m@-s5mz\[CVV][CCVV][CV][CVVCC]\[waI][kr@U][m@][s@Umz]
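A minimal sketch of the + versus * difference, assuming a four-line stand-in file. The X chromosome record is shortened from the slide; the other three records and their IDs are invented.

```shell
# Shortened or invented records in an ID\word\... layout.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1\Colorado potato beetle\0
2\Union Jacks\0
3\potato\0
100184\X chromosome\0
EOF

# [a-z]+ requires at least one lower-case letter after the capital,
# so "X chromosome" is rejected.
with_plus=$(grep -E -c '^[0-9]+.[A-Z][a-z]+( [a-z]+)+\\' "$sample")
# [a-z]* also accepts a first word that is a single capital letter.
with_star=$(grep -E -c '^[0-9]+\\[A-Z][a-z]*( [a-z]+)+\\' "$sample")

echo "with_plus=$with_plus with_star=$with_star"
rm -f "$sample"
```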

7 Homework 3
4. Find all multi-word terms in which the first letter (and only the first letter) of each word is capitalized, e.g. Union Jacks and Royal Automobile Club. Note: your regex should be able to accommodate an arbitrary number of words.
ANS: 296/298
    egrep ^'[0-9]+.[A-Z][a-z]+( [A-Z][a-z]*)+\\' epw.cd
    egrep ^'[0-9]+.[A-Z][a-z]*( [A-Z][a-z]*)+\\' epw.cd
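The group `( [A-Z][a-z]*)+` is what lets the pattern cover any number of words. The sketch below runs both answers on an invented stand-in file; all IDs are made up, and "A Level" is a hypothetical entry added only to show where the two patterns diverge.

```shell
# Invented records in an ID\word\... layout; "A Level" is hypothetical.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1\Union Jacks\0
2\Royal Automobile Club\0
3\Colorado potato beetle\0
4\A Level\0
5\USSR\0
EOF

# The first word needs a capital plus at least one lower-case letter.
first=$(grep -E -c '^[0-9]+.[A-Z][a-z]+( [A-Z][a-z]*)+\\' "$sample")
# The first word may be a single capital letter.
second=$(grep -E -c '^[0-9]+.[A-Z][a-z]*( [A-Z][a-z]*)+\\' "$sample")

echo "first=$first second=$second"
rm -f "$sample"
```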

8 Homework 3
5. Find all disyllabic words that contain only vowels.
ANS: 4
    egrep '\\\[V+\]\[V+\]\\' epw.cd
5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
6\AA\95\6\1\P\"1-'1\[VV][VV]\[eI][eI]
4727\ayah\13\2714\2\P\'2-@\[VV][V]\[aI][@]\S\'#-j@\[VV][CV]\[A:][j@]
43355\i.e.\424\22210\1\P\"2-'i\[VV][VV]\[aI][i:]
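The pattern can be checked against the entries quoted on the slides. This sketch assumes a stand-in file holding the four matching entries plus the X chromosome entry from Homework 3 as a non-match; line-wrap breaks inside the transcriptions are assumed to be joins (for example `'#-j@`).

```shell
# Entries copied from the slides into a temporary stand-in for epw.cd.
sample=$(mktemp)
cat > "$sample" <<'EOF'
5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
6\AA\95\6\1\P\"1-'1\[VV][VV]\[eI][eI]
4727\ayah\13\2714\2\P\'2-@\[VV][V]\[aI][@]\S\'#-j@\[VV][CV]\[A:][j@]
43355\i.e.\424\22210\1\P\"2-'i\[VV][VV]\[aI][i:]
100184\X chromosome\0\52203\1\P\'Eks-"kr5-m@-s5m\[VCC][CCVV][CV][CVVC]\[Eks][kr@U][m@][s@Um]
EOF

# A backslash, exactly two bracketed syllables made only of V, a backslash.
count=$(grep -E -c '\\\[V+\]\[V+\]\\' "$sample")

echo "count=$count"
rm -f "$sample"
```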

9 Homework 3
6. Multiword expressions (Find a similar phrase in the wsj/raw corpus, and search for all variants of it in the entire corpus.)
    egrep -i '.tip of the *[a-z] iceberg'
    egrep '[Tt]he tip of (a|the).* iceberg'
patriarchical / a more alarming
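As a rough sketch of how the second pattern picks up variants, the lines below are invented sentences (not taken from wsj/raw) built around the two variants listed on the slide.

```shell
# Invented sentences; only the variant wording comes from the slide.
text=$(printf '%s\n' \
  'This is only the tip of the iceberg.' \
  'It may be the tip of a more alarming iceberg.' \
  'The tip of the patriarchical iceberg is showing.' \
  'An iceberg drifted past the ship.')

# (a|the).* allows any material between the article and " iceberg".
hits=$(printf '%s\n' "$text" | grep -E -c '[Tt]he tip of (a|the).* iceberg')

echo "hits=$hits"
```

Because `.*` is unrestricted, this pattern would also accept strings such as "the tip of another iceberg", so results from a real corpus are worth checking by eye.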

10 Homework 3
6. Other multiword expressions
    war on (inflation/drugs/the dictator)
    fight the war on the expenditure side rather
    rule of (the day/journalism/Ferdinand Marcos)
    cream of the (British) crop

11 Searching the treebank
    cat ??/* | egrep -i '(push|pull)[a-z]*'
OR xkwic?
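The `??/*` glob expands to every file inside every two-character subdirectory. The sketch below assumes such a layout (directories named 00 and 01, with invented file names and sentences) in a temporary directory; `-o`, which prints only the matched text, is an addition and not part of the slide's command.

```shell
# Build a throwaway directory tree; all names and sentences are invented.
dir=$(mktemp -d)
mkdir -p "$dir/00" "$dir/01"
printf '%s\n' 'They pushed the cart.' 'Pull the lever.' > "$dir/00/a.txt"
printf '%s\n' 'Prices were pulled down.' 'Nothing to see here.' > "$dir/01/b.txt"

# Count matching lines across all files, ignoring case.
lines=$(cd "$dir" && cat ??/* | grep -E -i -c '(push|pull)[a-z]*')
# List the matched word forms themselves.
forms=$(cd "$dir" && cat ??/* | grep -E -i -o '(push|pull)[a-z]*')

echo "lines=$lines"
echo "$forms"
rm -rf "$dir"
```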

12 XWin 32
See e-mail
Load on laptops, bring laptops to class if any issues
Go to Feb 9 Emacs & Xkwic lecture

