LING 388: Language and Computers
Sandiway Fong
Lecture 27: 12/6

Today’s Topic
Computer laboratory class
–n-gram statistics
Homework #5 due next Monday
–13th December
–exam period

Software Required
NSP
–Ted Pedersen’s n-gram statistics package
–written in Perl
–free
ActiveState Perl
–free Perl distribution for Windows/Mac etc.

ActivePerl
Installed on all the machines
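To check that Perl is ready to use, the following at a command prompt should print a version banner (the exact version installed on the machines may differ):

perl -v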

NSP
On the SBSRI computers it is already present on the C drive:
–C:\nsp
Otherwise access it using:
–1. Click "Start" and choose "Run"
–2. Type \\sbsri0\apps\nsp and click "OK"

Command Processor
We will run NSP from the Windows command line.
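For example (a sketch, assuming the C:\nsp location from the previous slide): click "Start", choose "Run", type cmd and click "OK" to open the command processor, then at the prompt:

cd C:\nsp
dir

The directory listing should include count.pl, the NSP script used in the exercises below.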

Exercise 1
Download and prepare text
–1. Google "marion jones steroids"
–2. Click on the USAToday article

Exercise 1 (continued)
Download and prepare text
–3. Click on "Print this"
–4. Copy the text of the article into a text editor

Exercise 1 (continued)
Reformat the article text into lines (one sentence per line)
Lowercase the first letter of a sentence when appropriate (a one-liner that may help with this step is sketched after this slide)
Question 1: (1pt)
–How many lines of text are there in the article, including the headline?
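A minimal Perl one-liner for the lowercasing step, run from the Windows command prompt (an illustrative sketch, not part of NSP; it blindly lowercases the first character of every line, so lines that legitimately begin with a proper noun such as "Jones" still need to be re-capitalized by hand; text.txt and text_lc.txt are just suggested file names):

perl -pe "s/^([A-Z])/\l$1/" text.txt > text_lc.txt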

Exercise 2
In the command-line environment:

perl count.pl --ngram 1 --newline unigrams.txt text.txt

–Usage: count.pl [OPTIONS] DESTINATION SOURCE [[, SOURCE]...]
–Counts up the frequency of all n-grams occurring in SOURCE. Sends to DESTINATION the list of n-grams found, along with the frequencies of combinations of the n tokens that the n-gram is composed of. If SOURCE is a directory, all text files in it are counted.
–OPTIONS:
--ngram N: Creates n-grams of N tokens each. N = 2 by default.
--newLine: Prevents n-grams from spanning across the new-line character.
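Based on the --ngram option above, the bigram and trigram count files needed in Exercises 3–5 can presumably be produced in the same way (bigrams.txt and trigrams.txt are just suggested output names):

perl count.pl --ngram 2 --newline bigrams.txt text.txt
perl count.pl --ngram 3 --newline trigrams.txt text.txt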

Exercise 2
Obtain unigrams from the text
Question 1: (1pt)
–How many different words are there in the text, excluding punctuation symbols (., : ? etc.)?
Question 2: (1pt)
–Which is the most frequent non-closed-class word?
Question 3: (1pt)
–Which is the 2nd most frequent non-closed-class word?
Question 4: (1pt)
–Which person is mentioned most in the article?
(A frequency-sorting helper is sketched after this slide.)
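To eyeball the most frequent words, it can help to sort the unigram counts. The small script below is a sketch that assumes each data line of an NSP count file ends with the frequency after the last <> delimiter, and that the very first line is a grand total (open unigrams.txt to confirm this before relying on the script; the file name sortfreq.pl is just a suggestion):

# sortfreq.pl -- print an NSP count file sorted by frequency, highest first.
# Usage: perl sortfreq.pl unigrams.txt
open(my $in, '<', $ARGV[0]) or die "cannot open $ARGV[0]: $!";
my $total = <$in>;                  # first line: grand total, not a token
my @rows;
while (<$in>) {
    chomp;
    s/\s+$//;                       # drop any trailing whitespace
    my @fields = split /<>/;        # e.g. "steroids<>7" -> ("steroids", "7")
    my $count  = pop @fields;       # last field is the frequency
    push @rows, [ join(' ', @fields), $count ];
}
print "$_->[1]\t$_->[0]\n" for sort { $b->[1] <=> $a->[1] } @rows;

The same script works on the bigram and trigram files, since it keeps everything before the final frequency as the n-gram itself. Deciding which words are closed-class is still up to you.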

Exercise 3
Obtain bigrams from the text
Obtain trigrams from the text
Question 1: (2pts)
–Compute the probability of the sequence “... Jones has denied using steroids” using the bigram approximation
–show the workings of your answer
–assume the first word is Jones and that p(Jones) is the unigram probability
Question 2: (2pts)
–Compute the probability of the sequence “... Jones has denied using steroids” using the trigram approximation
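As a reminder (consistent with the formulas on the Smoothing slide below, and with the assumption in Question 1 that the sequence starts at Jones), the bigram approximation multiplies out as:

p(Jones has denied using steroids) ≈ p(Jones) p(has|Jones) p(denied|has) p(using|denied) p(steroids|using)

where p(Jones) = f(Jones)/N for N the total number of word tokens, and p(wn|wn-1) = f(wn-1 wn)/f(wn-1). All of the counts can be read off unigrams.txt and bigrams.txt. The trigram case is analogous, using p(wn|wn-2 wn-1) = f(wn-2 wn-1 wn)/f(wn-2 wn-1).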

Exercise 4
Insert the start-of-sentence dummy symbol StarT where appropriate (one way to do this is sketched after this slide)
Question 1: (2pts)
–Compute the probability of the sentence beginning “Jones has denied using steroids” using the bigram approximation
Question 2: (1pt)
–Compare your answer with the answer you gave in Exercise 3, Question 1
–Which probability is greater and why?
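One possible way to prepend the dummy symbol to every line and then recount (a sketch; it assumes the text already has one sentence per line, and text_start.txt and bigrams_start.txt are just suggested file names):

perl -pe "s/^/StarT /" text.txt > text_start.txt
perl count.pl --ngram 2 --newline bigrams_start.txt text_start.txt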

Smoothing
A small, sparse dataset means that zero frequency values are a problem
–Zero probabilities
p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)   (bigram model)
one zero and the whole product is zero
–Zero frequencies are a problem
p(wn|wn-1) = f(wn-1 wn) / f(wn-1)   (relative frequency)
if a word doesn’t exist in the dataset, we’re dividing by zero
What to do when frequencies are zero?
Answer 1: get a larger corpus
–never large enough (even with the BNC): always going to have zeros
Answer 2: Smoothing
–assign some small non-zero value to unknown frequencies

Smoothing
Simplest algorithm
–not the best way...
–called Add-One
–add one to the frequency counts for everything
–the bigram probability for the sequence wn-1 wn is now given by
p(wn|wn-1) = (f(wn-1 wn) + 1) / (f(wn-1) + V)
where V = # different words in the corpus
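A minimal Perl sketch of the Add-One formula above (illustrative only: the hash names, the subroutine name and the toy counts are made up for the example; in the homework the real counts come from your unigrams.txt and bigrams.txt files):

# add_one.pl -- Add-One smoothed bigram probability (a sketch).
# %bigram_f maps "w1 w2" to its corpus frequency, %unigram_f maps a word to
# its frequency, and $V is the number of different words in the corpus.
# Unseen bigrams simply have no entry, i.e. a raw frequency of 0.

sub p_add_one {
    my ($w_prev, $w, $bigram_f, $unigram_f, $V) = @_;
    my $f_bi  = $bigram_f->{"$w_prev $w"} || 0;   # f(wn-1 wn), 0 if unseen
    my $f_uni = $unigram_f->{$w_prev}     || 0;   # f(wn-1)
    return ($f_bi + 1) / ($f_uni + $V);           # (f(wn-1 wn)+1) / (f(wn-1)+V)
}

# Toy counts plugged in by hand; substitute the counts from your own files.
my %bigram_f  = ('denied using' => 1, 'using steroids' => 1);
my %unigram_f = ('denied' => 1, 'using' => 1);
my $V = 250;   # made-up vocabulary size; use your answer to Exercise 2, Question 1

print p_add_one('denied', 'using',    \%bigram_f, \%unigram_f, $V), "\n";  # seen bigram
print p_add_one('denied', 'admitted', \%bigram_f, \%unigram_f, $V), "\n";  # unseen bigram

Note how the unseen bigram still gets a small non-zero probability, which is exactly what Exercise 5 relies on.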

Exercise 5
Using Add-One and StarT (from Exercise 4)
Question 1: (2pts)
–Recompute the bigram probability approximation for “Jones has denied using steroids”
Question 2: (2pts)
–Compute the bigram probability approximation for “Jones has admitted using steroids”
–a sentence that does not exist in the original article

Homework Summary
Total points on offer: 16
Exercise 1 –1pt
Exercise 2 –4pts
Exercise 3 –4pts
Exercise 4 –3pts
Exercise 5 –4pts