LING 388: Language and Computers Sandiway Fong Lecture 27: 12/6.

Today’s Topic
Computer Laboratory Class
– n-gram statistics
Homework #5
– Due next Monday
– 13th December
– exam period

Software Required
NSP
– Ted Pedersen’s n-gram statistics package
– written in Perl
– free
– http://www.d.umn.edu/~tpederse/nsp.html
Active State Perl
– free Perl
– http://www.activestate.com/
(Diagram: NSP on top of Perl on top of Windows/Mac etc.)

Active Perl
Installed on all the machines

NSP
On the SBSRI computers
Already present on the C drive
– C:\nsp
Otherwise access it using:
– 1. Click "Start" and choose "Run"
– 2. Type \\sbsri0\apps\nsp and click "OK"

Command Processor
We will run NSP from the command line here

Exercise 1
Download and prepare text
– 1. Google “marion jones steroids”
– 2. Click on the USA Today article

Exercise 1
Download and prepare text
– 3. Click on "Print this"
– 4. Copy the text of the article into a text editor

Exercise 1
Reformat the text of the article into lines
Lowercase the first letter of a sentence when appropriate
Question 1: (1pt)
– How many lines of text are there in the article, including the headline?

Exercise 2
In the command line environment...
    perl count.pl --ngram 1 --newline unigrams.txt text.txt
– Usage: count.pl [OPTIONS] DESTINATION SOURCE [[, SOURCE]...]
  Counts up the frequency of all n-grams occurring in SOURCE. Sends to DESTINATION the list of n-grams found, along with the frequencies of combinations of the n tokens that the n-gram is composed of. If SOURCE is a directory, all text files in it are counted.
– OPTIONS:
  --ngram N
  – Creates n-grams of N tokens each. N = 2 by default.
  --newLine
  – Prevents n-grams from spanning across the new-line character.
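To make the effect of the two options concrete, here is a minimal Python sketch of line-by-line n-gram counting. It is not NSP's implementation: the function name count_ngrams, the whitespace tokenization, and the two sample lines are assumptions made for illustration, so its counts may differ from what count.pl reports on the same text.

```python
from collections import Counter

def count_ngrams(lines, n=2):
    """Count n-grams one line at a time, so no n-gram spans a line break."""
    counts = Counter()
    for line in lines:
        tokens = line.split()  # assumption: tokens are separated by whitespace
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical two-line input, not taken from the article.
lines = ["jones has denied using steroids", "jones has denied it"]
unigrams = count_ngrams(lines, n=1)
bigrams = count_ngrams(lines, n=2)
```

Because each line is processed separately, the last word of one line and the first word of the next never form a bigram, which is the behaviour the --newLine option is described as providing.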

Exercise 2
Obtain unigrams from the text
Question 1: (1pt)
– How many different words are there in the text, excluding punctuation symbols (. , : ? etc.)?
Question 2: (1pt)
– Which is the most frequent non-closed class word?
Question 3: (1pt)
– Which is the 2nd most frequent non-closed class word?
Question 4: (1pt)
– Which person is mentioned most in the article?

Exercise 3
Obtain bigrams from the text
Obtain trigrams from the text
Question 1: (2pts)
– Compute the probability of the sequence “... Jones has denied using steroids” using the bigram approximation
– show the workings of your answer
– assume first word is Jones and p(Jones) is the unigram probability
Question 2: (2pts)
– Compute the probability of the sequence “... Jones has denied using steroids” using the trigram approximation
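The bigram approximation in Question 1 multiplies the unigram probability of the first word by the relative-frequency estimate of each following word given its predecessor. A minimal sketch, assuming the counts are held in Python Counter objects; the function name and the toy token sequence "a b c a" are hypothetical and stand in for the counts you read off the NSP output:

```python
from collections import Counter

def bigram_sequence_probability(words, unigrams, bigrams):
    """p(w1) * p(w2|w1) * ... with p(wn|wn-1) = f(wn-1 wn) / f(wn-1)."""
    total_tokens = sum(unigrams.values())
    prob = unigrams[words[0]] / total_tokens  # unigram probability of first word
    for prev, cur in zip(words, words[1:]):
        # Raises ZeroDivisionError if prev never occurs in the counts.
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

# Toy counts from a hypothetical token sequence, not from the article.
tokens = "a b c a".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
p_abc = bigram_sequence_probability(["a", "b", "c"], unigrams, bigrams)
```

With these toy counts the product is (2/4) × (1/2) × (1/1). The trigram approximation follows the same pattern, conditioning each word on the two preceding words instead of one.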

Exercise 4
Insert the start of sentence dummy symbol StarT where appropriate
Question 1: (2pts)
– Compute the probability of the sentence beginning “Jones has denied using steroids” using the bigram approximation
Question 2: (1pt)
– Compare your answer with the answer you gave in Exercise 3, Question 1
– Which probability is greater and why?
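With a start-of-sentence symbol in place, the first factor of the bigram product becomes the probability of the first word given StarT rather than its plain unigram probability. A minimal sketch, assuming each line holds exactly one sentence and tokens are separated by whitespace; the helper names and the three toy lines are hypothetical:

```python
from collections import Counter

START = "StarT"  # start-of-sentence dummy symbol from the exercise

def counts_with_start(lines):
    """Prefix every line with START, then collect unigram and bigram counts."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        tokens = [START] + line.split()  # assumption: one sentence per line
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_start_probability(words, unigrams, bigrams):
    """p(w1|START) * p(w2|w1) * ... using relative frequencies."""
    tokens = [START] + list(words)
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

# Toy lines, not taken from the article.
unigrams, bigrams = counts_with_start(["a b", "a c", "b a"])
p_ab = sentence_start_probability(["a", "b"], unigrams, bigrams)
```

Here two of the three toy sentences begin with "a", so the first factor is 2/3, and the second factor is f(a b)/f(a) = 1/3.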

Smoothing
Small and sparse dataset means that zero frequency values are a problem
– Zero probabilities
  $p(w_1 w_2 w_3 \ldots w_n) \approx p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1})$ (bigram model)
  one zero and the whole product is zero
– Zero frequencies are a problem
  $p(w_n \mid w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1})$ (relative frequency)
  word doesn’t exist in dataset and we’re dividing by zero
What to do when frequencies are zero?
Answer 1: get a larger corpus
– (even with the BNC) never large enough, always going to have zeros
Answer 2: (Smoothing)
– assign some small non-zero value to unknown frequencies

Smoothing
Simplest Algorithm
– not the best way...
– called Add-One
– add one to the frequency counts for everything
– bigram probability for sequence $w_{n-1} w_n$ now given by
  $p(w_n \mid w_{n-1}) = (f(w_{n-1} w_n) + 1) / (f(w_{n-1}) + V)$
  V = # different words in corpus
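The Add-One formula can be sketched in a few lines of Python. The function name and the toy token sequence "a b c a" are assumptions for illustration; V is taken here as the number of distinct tokens in the toy data.

```python
from collections import Counter

def add_one_bigram_probability(prev, cur, unigrams, bigrams, vocab_size):
    """Add-One estimate: (f(prev cur) + 1) / (f(prev) + V)."""
    return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)

# Toy counts from a hypothetical token sequence, not from the article.
tokens = "a b c a".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # number of different words in the toy corpus

p_seen = add_one_bigram_probability("a", "b", unigrams, bigrams, V)    # bigram occurs once
p_unseen = add_one_bigram_probability("a", "c", unigrams, bigrams, V)  # bigram never occurs
```

The unseen bigram "a c" now gets (0 + 1)/(2 + 3) instead of zero, so a single missing bigram no longer wipes out the whole product, and the denominator can never be zero.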

Exercise 5
Using Add-One and StarT (from Exercise 4)
Question 1: (2pts)
– Recompute the bigram probability approximation for “Jones has denied using steroids”
Question 2: (2pts)
– Compute the bigram probability approximation for “Jones has admitted using steroids”
– a sentence that does not exist in the original article

Homework Summary
Total points on offer: 16
Exercise 1
– 1pt
Exercise 2
– 4pts
Exercise 3
– 4pts
Exercise 4
– 3pts
Exercise 5
– 4pts
