LING 388: Language and Computers Sandiway Fong Lecture 27: 12/6.


1 LING 388: Language and Computers Sandiway Fong Lecture 27: 12/6

2 Today’s Topic Computer Laboratory Class –n-gram statistics Homework #5 Due next Monday –13th December –exam period

3 Software Required NSP –Ted Pedersen’s n-gram statistics package –written in Perl –free –http://www.d.umn.edu/~tpederse/nsp.html ActiveState Perl –free Perl –http://www.activestate.com/ [Diagram: NSP runs on Perl, which runs on Windows/Mac etc.]

4 ActivePerl Installed on all the machines

5 NSP On the SBSRI computers Already present on the C drive – C:\nsp Otherwise access it using: – 1. Click "Start" and choose "Run" –2. Type \\sbsri0\apps\nsp and click "OK"

6 Command Processor We will run NSP from the command line here

7 Exercise 1 Download and prepare text –1. Google “marion jones steroids” –2. Click on the USA Today article

8 Exercise 1 Download and prepare text –3. Click on “Print this” –4. Copy the text of the article into a text editor

9 Exercise 1 Reformat the text of the article into lines Lower-case the first letter of a sentence when appropriate Question 1: (1pt) –How many lines of text are there in the article, including the headline?

10 Exercise 2 In the command line environment... perl count.pl --ngram 1 --newline unigrams.txt text.txt –Usage: count.pl [OPTIONS] DESTINATION SOURCE [[, SOURCE]...] Counts up the frequency of all n-grams occurring in SOURCE. Sends to DESTINATION the list of n-grams found, along with the frequencies of combinations of the n tokens that the n-gram is composed of. If SOURCE is a directory, all text files in it are counted. –OPTIONS: --ngram N –Creates n-grams of N tokens each. N = 2 by default. --newLine –Prevents n-grams from spanning across the new-line character.
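To make the two options concrete, here is a minimal Python sketch of what counting n-grams within line boundaries amounts to. It is not NSP's code: the function name `count_ngrams`, the plain whitespace tokenization, and the toy text are all assumptions made for illustration, and NSP applies its own tokenization rules, so its counts on real text may differ.

```python
from collections import Counter

def count_ngrams(lines, n):
    """Count n-grams of whitespace-separated tokens, never crossing a line break."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # Slide a window of n tokens across this line only.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy two-line text, invented for illustration (not the article).
toy_lines = ["the cat sat", "the cat ran"]
unigrams = count_ngrams(toy_lines, 1)
bigrams = count_ngrams(toy_lines, 2)

print(unigrams.most_common(2))
print(bigrams[("the", "cat")])
```

Because each line is windowed separately, the pair ("sat", "the") is never counted even though "sat" ends one line and "the" begins the next.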

11 Exercise 2 Obtain unigrams from the text Question 1 (1pt) –How many different words are there in the text, excluding punctuation symbols (. , : ? etc.)? Question 2 (1pt) –Which is the most frequent non-closed class word? Question 3 (1pt) –Which is the 2nd most frequent non-closed class word? Question 4 (1pt) –Which person is mentioned most in the article?

12 Exercise 3 Obtain bigrams from the text Obtain trigrams from the text Question 1: (2pts) –Compute the probability of the sequence “... Jones has denied using steroids” using the bigram approximation –show the workings of your answer –assume first word is Jones and p(Jones) is the unigram probability Question 2: (2pts) –Compute the probability of the sequence “... Jones has denied using steroids” using the trigram approximation
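Written out, the two approximations for this five-word sequence take roughly the following shape. This is a sketch under the slide's stated assumption that the first word contributes its unigram probability; conditioning the second word on the first alone in the trigram case is a common convention that the slide does not spell out, so treat it as an assumption. Here f(...) denotes a frequency taken from the n-gram counts.

```latex
\begin{align*}
\text{Bigram:}\quad
p(\text{Jones has denied using steroids})
  &\approx p(\text{Jones})\,
     p(\text{has} \mid \text{Jones})\,
     p(\text{denied} \mid \text{has})\,
     p(\text{using} \mid \text{denied})\,
     p(\text{steroids} \mid \text{using}) \\[4pt]
\text{Trigram:}\quad
p(\text{Jones has denied using steroids})
  &\approx p(\text{Jones})\,
     p(\text{has} \mid \text{Jones})\,
     p(\text{denied} \mid \text{Jones has})\,
     p(\text{using} \mid \text{has denied})\,
     p(\text{steroids} \mid \text{denied using}) \\[4pt]
\text{with}\quad
p(w_n \mid w_{n-1}) &= \frac{f(w_{n-1}\, w_n)}{f(w_{n-1})},
\qquad
p(w_n \mid w_{n-2}\, w_{n-1}) = \frac{f(w_{n-2}\, w_{n-1}\, w_n)}{f(w_{n-2}\, w_{n-1})}
\end{align*}
```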

13 Exercise 4 Insert the start of sentence dummy symbol StarT where appropriate Question 1: (2pts) –Compute the probability of the sentence beginning “Jones has denied using steroids” using the bigram approximation Question 2: (1pt) –Compare your answer with the answer you gave in Exercise 3 Question 1 –Which probability is greater and why?
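The effect of the dummy symbol can be sketched in Python as follows. The sketch assumes each line holds exactly one sentence, uses whitespace tokenization, and runs on an invented toy text; the helper names `add_start` and `bigram_counts` are hypothetical and not part of NSP.

```python
from collections import Counter

START = "StarT"  # dummy start-of-sentence symbol named on the slide

def add_start(lines):
    """Prefix each line (assumed to hold one sentence) with the start symbol."""
    return [f"{START} {line}" for line in lines]

def bigram_counts(lines):
    """Return (unigram, bigram) counters; bigrams do not cross line breaks."""
    uni, bi = Counter(), Counter()
    for line in lines:
        tokens = line.split()
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
    return uni, bi

# Toy text invented for illustration (not the article).
toy = ["jones ran", "jones won", "smith ran"]
uni, bi = bigram_counts(add_start(toy))

# Probability that a sentence begins with "jones": f(StarT jones) / f(StarT).
p_first = bi[(START, "jones")] / uni[START]
print(p_first)
```

With the symbol in place, the first factor of the bigram product becomes p(Jones | StarT), the probability of the word in sentence-initial position, in place of its plain unigram probability.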

14 Smoothing Small and sparse dataset means that zero frequency values are a problem –Zero probabilities: $p(w_1 w_2 w_3 \ldots w_n) \approx p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1})$ (bigram model); one zero and the whole product is zero –Zero frequencies are a problem: $p(w_n \mid w_{n-1}) = f(w_{n-1} w_n)/f(w_{n-1})$ (relative frequency); if the word doesn’t exist in the dataset, we’re dividing by zero What to do when frequencies are zero? Answer 1: get a larger corpus –(even with the BNC) never large enough, always going to have zeros Answer 2: (Smoothing) –assign some small non-zero value to unknown frequencies
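Both failure modes can be reproduced on a toy text with a short Python sketch. Everything here is an illustrative assumption: the function names, the whitespace tokenization, and the two-line toy corpus are invented and are not taken from NSP or the article.

```python
from collections import Counter

def train(lines):
    """Collect unigram and bigram counts; bigrams do not cross line breaks."""
    uni, bi = Counter(), Counter()
    for line in lines:
        tokens = line.split()
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
    return uni, bi

def p_mle(prev, word, uni, bi):
    """Relative frequency f(prev word) / f(prev); undefined if prev is unseen."""
    if uni[prev] == 0:
        raise ZeroDivisionError(f"history {prev!r} never occurs in the data")
    return bi[(prev, word)] / uni[prev]

def sequence_prob(words, uni, bi):
    """Bigram approximation, using the unigram probability for the first word."""
    p = uni[words[0]] / sum(uni.values())
    for prev, word in zip(words, words[1:]):
        p *= p_mle(prev, word, uni, bi)
    return p

# Toy text invented for illustration.
uni, bi = train(["a b c", "a b d"])
seen = sequence_prob(["a", "b", "c"], uni, bi)    # every bigram was observed
unseen = sequence_prob(["a", "b", "a"], uni, bi)  # bigram "b a" never observed
print(seen, unseen)
```

The second sequence gets probability exactly zero because a single unseen bigram zeroes the product, and asking for `p_mle("z", "a", uni, bi)` raises an error because the history "z" has frequency zero.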

15 Smoothing Simplest Algorithm –not the best way... –called Add-One –add one to the frequency counts for everything –bigram probability for sequence $w_{n-1} w_n$ now given by $p(w_n \mid w_{n-1}) = (f(w_{n-1} w_n) + 1)/(f(w_{n-1}) + V)$, where $V$ = number of different words in the corpus
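A minimal Python sketch of the Add-One formula, again on an invented toy text with whitespace tokenization; the function name `p_add_one` is hypothetical.

```python
from collections import Counter

def p_add_one(prev, word, uni, bi, vocab_size):
    """Add-One estimate: (f(prev word) + 1) / (f(prev) + V)."""
    return (bi[(prev, word)] + 1) / (uni[prev] + vocab_size)

# Toy text invented for illustration.
toy = ["a b c", "a b d"]
uni, bi = Counter(), Counter()
for line in toy:
    tokens = line.split()
    uni.update(tokens)
    bi.update(zip(tokens, tokens[1:]))

V = len(uni)  # number of different words in the toy text

p_seen = p_add_one("b", "c", uni, bi, V)    # bigram "b c" occurs once
p_unseen = p_add_one("b", "a", uni, bi, V)  # bigram "b a" never occurs
print(p_seen, p_unseen)
```

The unseen bigram now receives a small non-zero probability, paid for by lowering the estimate for the seen one: unsmoothed, p(c | b) would be 1/2 on this toy text.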

16 Exercise 5 Using Add-One and StarT (from Exercise 4) Question 1: (2pts) –Recompute the bigram probability approximation for “Jones has denied using steroids” Question 2: (2pts) –Compute the bigram probability approximation for “Jones has admitted using steroids” –a sentence that does not exist in the original article

17 Homework Summary Total points on offer: 16 Exercise 1 –1pt Exercise 2 –4pts Exercise 3 –4pts Exercise 4 –3pts Exercise 5 –4pts

