LING/C SC/PSYC 438/538 Lecture 13 Sandiway Fong
Today's Topics
  Homework 5: regex
  Review of ungraded homework exercises
  More Perl regex
Homework 5
  File: hw5.pos
  Source: Penn Treebank POS (Part-of-Speech) tagged sentences
Homework 5
  You may use either Perl or Python, e.g. for one-line answers: perl -ne 'code END { code }' hw5.pos
  Write a regular expression to extract all the proper nouns (POS tag: NNP singular or NNPS plural).
  Hint: you may wish to print the nouns out to debug your regex…
  Question 1: How many proper nouns (tokens) are there?
  Question 2: How many different proper nouns (types) are there?
  Question 3: How many different plural proper nouns (types) are there?
  Question 4: What is the most frequent proper noun, and what is its frequency?
  Question 5: What is the most frequent plural proper noun, and what is its frequency?
  Question 6: Print the top 5 proper nouns and their frequencies.
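The token/type distinction in the questions above can be illustrated with a minimal Python sketch. It assumes that each token has the form word/TAG and that tokens are separated by whitespace; the actual layout of hw5.pos may differ. The sample string and the names PROPER_NOUN and proper_noun_counts are made up for illustration. Here a type is a distinct (word, tag) pair, which is one possible reading of "different proper nouns".

```python
import re
from collections import Counter

# Assumption: tokens look like word/TAG and are separated by whitespace.
# NNPS? matches NNP or NNPS; the lookahead requires the tag to end there.
PROPER_NOUN = re.compile(r"(\S+)/(NNPS?)(?=\s|$)")

def proper_noun_counts(text):
    """Return a Counter mapping (word, tag) pairs to their frequency."""
    return Counter(PROPER_NOUN.findall(text))

# Made-up sample, not taken from hw5.pos.
sample = "Mr./NNP Smith/NNP visited/VBD the/DT Smiths/NNPS ./.\nSmith/NNP left/VBD ./."

counts = proper_noun_counts(sample)
tokens = sum(counts.values())                                    # every occurrence
types = len(counts)                                              # distinct (word, tag) pairs
plural_types = sum(1 for (_, tag) in counts if tag == "NNPS")    # distinct NNPS pairs

print(tokens, types, plural_types)
print(counts.most_common(1))
```

Counter.most_common(n) returns the n most frequent entries, so the same table serves for both "most frequent" and "top 5" style questions.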
Homework 5 Extra Credit
  Question 7: Print out the frequency table for determiners (POS tag: DT) in hw5.pos.
  Note: be case-insensitive.
  In your opinion, does this table follow Zipf's Law?
Homework 5
  Due date: next Monday midnight
  Usual instructions: 438/538 Homework 5 Your Name
  one PDF file!
Review: Ungraded Homework Exercises
Review: Ungraded Homework Exercises
  Perl one-liners: perl -ne 'm/regex/ and print "$.:$&\n"'
    -e: execute Perl code
    -n: put code inside an implicit while loop (uses the default variable $_)
    $. is the line number
    $& is the whole match
  Grouping (..) and backreferences, e.g. \1
  Anchors ^ and $
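Since Python is also allowed, here is a rough Python counterpart of the one-liner above, offered as a sketch rather than an exact equivalent. The function name matches_with_line_numbers and the input lines are made up for illustration. The loop counter stands in for $. and m.group(0) stands in for $&.

```python
import re

def matches_with_line_numbers(lines, pattern):
    """Yield 'lineno:match' for the first match found on each line."""
    regex = re.compile(pattern)
    for lineno, line in enumerate(lines, start=1):   # lineno plays the role of $.
        m = regex.search(line)
        if m:
            yield f"{lineno}:{m.group(0)}"           # m.group(0) plays the role of $&

# Made-up input lines.
lines = ["the the cat", "a dog", "12345", "it is is fine"]

# Grouping plus the backreference \1: a word that is immediately repeated.
repeated = list(matches_with_line_numbers(lines, r"\b(\w+) \1\b"))

# Anchors ^ and $: the whole line must consist of digits.
integers = list(matches_with_line_numbers(lines, r"^\d+$"))

print(repeated)
print(integers)
```

To run this over a file instead of a list, pass an open file object as lines; iterating over it yields one line at a time, much like the implicit loop that -n provides.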
Review: Ungraded Homework Exercises
  #3: repeated.txt
  #4: ab.txt
  #5: integerline.txt
Zipf's Law
  Zipf's Law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.
  See: http://demonstrations.wolfram.com/ZipfsLawAppliedToWordAndLetterFrequencies/
  Brown Corpus (1,015,945 words): only 135 distinct words are needed to account for half the corpus.
  On a log-log scale, the rank-frequency plot is almost a straight line.
  http://www.learnholistically.it/esp-clil/wfk2.htm
  https://finnaarupnielsen.wordpress.com/2013/10/22/zipf-plot-for-word-counts-in-brown-corpus/
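If frequency is inversely proportional to rank, then rank times frequency should stay roughly constant down the table, which is also why the log-log plot is close to a straight line. The sketch below builds a rank-frequency table and shows that product. The word list is toy data constructed to be exactly Zipfian, not drawn from any corpus, and the function name rank_frequency is made up for illustration.

```python
from collections import Counter

def rank_frequency(words):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(w.lower() for w in words)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Made-up toy data with frequencies 12, 6, 4, 3 (exactly proportional to 1/rank).
words = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3

for row in rank_frequency(words):
    print(row)
```

On real text the last column will only be approximately constant, so inspecting it (or plotting log frequency against log rank) is a quick informal check rather than a statistical test.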
Character Frequency Counting
  Sample code is rather interesting:
    e flag (the /e modifier on a substitution): evaluate the right-hand side as an expression
    Generally (see next slide): (?{ Perl code })
  Slightly modified but easier to read:
    note: lc(..) for lowercase
(?{ Perl code })
Character Frequency Counting Does Zipf's Law apply to character frequencies?
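One way to explore this question is to build a ranked character frequency table and look at how rank times frequency behaves. The following Python sketch does that with collections.Counter instead of the regex-based Perl approach from the slides. It lowercases each character (the counterpart of lc(..)) and counts alphabetic characters only, which is an assumption of this sketch. The function name char_frequencies and the sample text are made up for illustration.

```python
from collections import Counter

def char_frequencies(text):
    """Count alphabetic characters case-insensitively, most frequent first."""
    counts = Counter(ch.lower() for ch in text if ch.isalpha())
    return counts.most_common()

# Made-up sample text.
table = char_frequencies("Zipf's Law: a law about rank")

for rank, (ch, freq) in enumerate(table, start=1):
    print(rank, ch, freq, rank * freq)
```

A sample this small says nothing about Zipf's Law; the table only becomes informative when the function is run over a large text.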