LING/C SC/PSYC 438/538 Lecture 13 Sandiway Fong.


1 LING/C SC/PSYC 438/538 Lecture 13 Sandiway Fong

2 Today's Topics
Homework 5: regex
Review of ungraded homework exercises
More Perl regex

3 Homework 5
File: hw5.pos
Source: Penn Treebank POS (Part-of-Speech) tagged sentences

4 Homework 5
You may use either Perl or Python, e.g. for one-line answers: perl -ne 'code END { code }' hw5.pos
Write a regular expression to extract all the proper nouns (POS tag: NNP singular or NNPS plural).
Hint: you may wish to print the nouns out to debug your regex…
Question 1: How many proper nouns (tokens) are there?
Question 2: How many different proper nouns (types)?
Question 3: How many different plural proper nouns (types)?
Question 4: What is the most frequent proper noun and its frequency?
Question 5: What is the most frequent plural proper noun and its frequency?
Question 6: Print the top 5 proper nouns and their frequencies.
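The token/type distinction in the questions above can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the slides do not show the layout of hw5.pos, so the code assumes whitespace-separated word/TAG tokens, and it runs on a made-up sample string rather than the real file. The names SAMPLE, PROPER_NOUN and proper_noun_counts are hypothetical.

```python
import re
from collections import Counter

# Made-up sample; ASSUMPTION: tokens look like word/TAG, separated by whitespace.
SAMPLE = (
    "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, "
    "will/MD join/VB [ the/DT board/NN ] ./. "
    "[ Mr./NNP Vinken/NNP ] is/VBZ [ chairman/NN ] of/IN [ Elsevier/NNP N.V./NNP ] ./. "
    "[ Securities/NNPS ] rose/VBD ./."
)

# A word, a slash, then NNP or NNPS; (?!\S) requires the tag to end there,
# so a longer tag that merely starts with NNP is not accepted.
PROPER_NOUN = re.compile(r"(\S+)/(NNPS?)(?!\S)")

def proper_noun_counts(text):
    """Return two Counters: all proper nouns, and plural (NNPS) ones only."""
    all_counts = Counter()
    plural_counts = Counter()
    for word, tag in PROPER_NOUN.findall(text):
        all_counts[word] += 1
        if tag == "NNPS":
            plural_counts[word] += 1
    return all_counts, plural_counts

all_counts, plural_counts = proper_noun_counts(SAMPLE)
print("tokens:", sum(all_counts.values()))   # every occurrence counts
print("types:", len(all_counts))             # distinct words only
print("plural types:", len(plural_counts))
print("top 5:", all_counts.most_common(5))
```

Tokens are counted by summing the Counter values, types by counting its keys; most_common(n) gives a frequency-sorted list.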

5 Homework 5 Extra Credit
Question 7: Print out the frequency table for determiners (POS tag: DT) in hw5.pos.
Note: be case-insensitive.
In your opinion, does this table follow Zipf's Law?

6 Homework 5
Due date: next Monday midnight
Usual instructions:
438/538 Homework 5 Your Name
one PDF file!

7 Review: Ungraded Homework Exercises

8 Review: Ungraded Homework Exercises
Perl one-liners: perl -ne 'm/regex/ and print "$.:$&\n"'
-e: execute Perl code
-n: put code inside implicit while loop (uses default variable $_)
$. is line number
$& is whole match
grouping (..) and backreferences, e.g. \1
anchors ^ and $
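Since the homework also allows Python, here is a rough Python analogue of the one-liner pattern above. It is a sketch, not a translation of any specific exercise: the helper name matches_with_line_numbers and the sample lines are made up for illustration.

```python
import re

def matches_with_line_numbers(lines, pattern):
    """Return 'lineno:match' strings, mimicking print "$.:$&\n" in the one-liner."""
    regex = re.compile(pattern)
    out = []
    for lineno, line in enumerate(lines, start=1):  # lineno plays the role of $.
        m = regex.search(line)                      # like m/regex/ applied to $_
        if m:
            out.append(f"{lineno}:{m.group(0)}")    # group(0) is the whole match, like $&
    return out

# Grouping and a backreference: (\w+) captures a word, \1 requires the same word again.
repeated = matches_with_line_numbers(
    ["the the cat", "a dog", "it is is fine"], r"\b(\w+) \1\b"
)

# Anchors: ^ and $ force the whole line to consist of digits.
integers = matches_with_line_numbers(["42", "4x2", " 7 "], r"^\d+$")

print(repeated)
print(integers)
```

The explicit enumerate loop stands in for what -n does implicitly in Perl.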

9 Review: Ungraded Homework Exercises
#3: repeated.txt
#4: ab.txt
#5: integerline.txt

10 Zipf's Law
Zipf's Law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.
Example: in the Brown Corpus (1,015,945 words), only 135 words are needed to account for half the corpus.
On a log-log scale, the rank-frequency plot is almost a straight line.
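One informal way to check a frequency table against Zipf's Law: if frequency is proportional to 1/rank, then rank × frequency stays roughly constant, and log(frequency) = log(C) - log(rank) is a straight line with slope -1. The Python sketch below builds that table. The counts are hypothetical toy numbers chosen to be exactly Zipfian, not real corpus data, and rank_frequency is an illustrative helper name.

```python
from collections import Counter

def rank_frequency(counts):
    """Return rows (rank, word, freq, rank * freq), most frequent word first."""
    # Sort by descending frequency; break ties alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ordered, start=1)
    ]

# Hypothetical toy counts with freq = 120 / rank (NOT taken from any corpus).
toy = Counter({"the": 120, "of": 60, "and": 40, "to": 30, "a": 24})

table = rank_frequency(toy)
for rank, word, freq, product in table:
    print(f"{rank}\t{word}\t{freq}\t{product}")
```

On real data the last column will drift rather than stay fixed; how much drift still counts as "following Zipf's Law" is a judgment call.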

11 Character Frequency Counting
Sample code is rather interesting:
/e flag (on s///): evaluate the right-hand side as an expression
Generally (see next slide): (?{ Perl code })
Slightly modified but easier to read:
note: lc(..) for lowercase
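The Perl sample itself is not reproduced in this transcript. As a stand-in, here is a minimal Python sketch of case-insensitive character frequency counting; it is an assumed equivalent, not the slide's code, and char_frequencies is a hypothetical name. str.lower() plays the role of Perl's lc(..).

```python
from collections import Counter

def char_frequencies(text):
    """Count alphabetic characters, folding case first (like lc(..) in Perl)."""
    return Counter(ch for ch in text.lower() if ch.isalpha())

freq = char_frequencies("Hello, World!")
for ch, n in freq.most_common():
    print(ch, n)
```

Feeding the resulting counts into a rank-frequency table is one way to explore whether Zipf's Law holds for characters.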

12 (?{ Perl code })

13 Character Frequency Counting
Does Zipf's Law apply to character frequencies?

14 Character Frequency Counting

