1 I256: Applied Natural Language Processing
Marti Hearst
Sept 13, 2006

2 Counting Tokens
Useful for lots of things
One cute application: see who talks where in a novel
The idea comes from Eick et al., who did it with The Jungle Book by Kipling

3 SeeSoft Visualization of Jungle Book Characters, from Eick, Steffen, and Sumner '92

4 The FreqDist Data Structure
Purpose: collect counts and frequencies for some phenomenon
Initialize a new FreqDist:
    from nltk_lite.probability import FreqDist
    fd = FreqDist()
When in a counting loop:
    fd.inc('item of interest')
After done counting:
    fd.N()               # total number of tokens counted
    fd.B()               # number of unique tokens
    fd.samples()         # list of the distinct tokens seen (there are B of them)
    fd.Nr(10)            # number of samples that occurred 10 times
    fd.count('red')      # number of times the token 'red' was seen
    fd.freq('red')       # frequency of 'red'; that is, fd.count('red')/fd.N()
    fd.max()             # the token with the highest count
    fd.sorted_samples()  # the samples in decreasing order of frequency

5 FreqDist() in action
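The screenshot on the original slide did not survive the transcript. Here is a minimal sketch of FreqDist in action, using only the calls listed on the previous slide (Python 2, as nltk_lite requires; the toy sentence is made up):

    from nltk_lite.probability import FreqDist

    fd = FreqDist()
    text = "the quick brown fox jumps over the lazy dog the end"
    for token in text.split():    # the counting loop
        fd.inc(token)

    print fd.N()                  # 11: total tokens counted
    print fd.B()                  # 9: distinct tokens
    print fd.count('the')         # 3
    print fd.freq('the')          # 3/11, about 0.27
    print fd.max()                # 'the'
    print fd.sorted_samples()     # ['the', ...] in decreasing frequency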

6 Word Lengths by Language
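The chart on the original slide is lost; this sketch shows the kind of computation presumably behind it, counting word lengths instead of words (the two toy word lists are assumptions):

    from nltk_lite.probability import FreqDist

    samples = {
        'english': "the cat sat on the mat".split(),
        'german': "die Katze sass auf der Matte".split(),
    }
    for language in samples:
        fd = FreqDist()
        for token in samples[language]:
            fd.inc(len(token))    # the sample counted is the word's length
        print language, [(n, fd.count(n)) for n in fd.sorted_samples()]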

7 (chart of word-length distributions by language; text not recoverable)

8 Doing Character Distribution

9 How to determine the characters?
Write some code that takes as input a Gutenberg file and quickly suggests who the main characters are.

10 How to determine the characters?
My solution: look for words that begin with capital letters and count how often each occurs. Then show the most frequent; see the sketch below.
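A sketch of this heuristic (the file name is an assumption; any plain-text Gutenberg novel works):

    from nltk_lite.probability import FreqDist

    fd = FreqDist()
    for line in open('austen-emma.txt'):
        for token in line.split():
            word = token.strip('.,;:!?"\'()')   # crude punctuation stripping
            if word[:1].isupper() and word[1:].islower():
                fd.inc(word)

    # Character names surface near the top, along with sentence-initial
    # words like 'The' -- the heuristic is quick, not perfect.
    print fd.sorted_samples()[:20]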

11 (screenshot of the solution's code and output; text not recoverable)

12 Language Modeling
A fundamental concept in NLP
Main idea: for a given language, some words are more likely than others to follow each other
Put another way: you can predict (with some degree of accuracy) the probability that a given word will follow another word

13 Next Word Prediction (adapted from a slide by Bonnie Dorr)
From a NY Times story...
Stocks...
Stocks plunged this...
Stocks plunged this morning, despite a cut in interest rates...
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall...
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began...

14 Human Word Prediction (adapted from a slide by Bonnie Dorr)
Clearly, at least some of us have the ability to predict future words in an utterance. How?
Domain knowledge
Syntactic knowledge
Lexical knowledge

15 Simple Statistics Does a Lot (adapted from a slide by Bonnie Dorr)
A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques
In particular, we'll rely on the notion of the probability of a sequence (a phrase, a sentence)

16 N-Gram Models of Language (adapted from a slide by Bonnie Dorr)
Use the previous N-1 words in a sequence to predict the next word
How do we train these models? Very large corpora

17 Simple N-Grams (adapted from a slide by Bonnie Dorr)
Assume a language has V word types in its lexicon; how likely is word x to follow word y?
Simplest model of word probability: 1/V
Alternative 1: estimate the likelihood of x occurring in new text based on its general frequency of occurrence estimated from a corpus (unigram probability)
    'popcorn' is more likely to occur than 'unicorn'
Alternative 2: condition the likelihood of x occurring on the context of previous words (bigrams, trigrams, ...)
    'mythical unicorn' is more likely than 'mythical popcorn'
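For the bigram case (Alternative 2), the standard maximum-likelihood estimate, stated here for reference, computes the conditional probability directly from corpus counts, where C(.) is a count over the training corpus:

    P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n)}{C(w_{n-1})}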

18 The ConditionalFreqDist() Data Structure
A collection of FreqDist() objects
Indexed by the "condition" that is being tested or compared
Initialize a new one:
    cfd = ConditionalFreqDist()
Add a count:
    cfd['berkeley'].inc('blue')
    cfd['berkeley'].inc('gold')
    cfd['stanford'].inc('red')
Can access each FreqDist object by indexing on the condition:
    cfd['berkeley'].samples()
    cfd['berkeley'].N()
Can also get a list of the conditions from the cfd object:
    cfd.conditions()
    >> ['stanford', 'berkeley']

19 Computing Next Words
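The code on the original slide is lost; here is a sketch that builds a next-word table from bigram counts (the file name is an assumption, and ConditionalFreqDist is assumed to live in nltk_lite.probability alongside FreqDist):

    from nltk_lite.probability import ConditionalFreqDist

    cfd = ConditionalFreqDist()
    tokens = open('austen-emma.txt').read().split()
    for current, following in zip(tokens, tokens[1:]):   # every adjacent pair
        cfd[current].inc(following)

    print cfd['the'].sorted_samples()[:10]   # likely next words after 'the'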

20 Auto-generate a Story
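Again the original code is lost; a greedy sketch that reuses the cfd built in the previous sketch, always emitting the most frequent next word. (Greedy generation soon falls into a repeating loop, which is why real generators sample from the distribution instead of taking max().)

    def generate(cfd, word, length=15):
        words = [word]
        for i in range(length):
            word = cfd[word].max()   # most likely continuation
            words.append(word)
        return ' '.join(words)

    print generate(cfd, 'the')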

21 Applications (adapted from a slide by Bonnie Dorr)
Why do we want to predict a word, given some preceding words?
To rank the likelihood of sequences containing various alternative hypotheses, e.g. for ASR:
    Theatre owners say popcorn/unicorn sales have doubled...
To assess the likelihood/goodness of a sentence, for text generation or machine translation:
    The doctor recommended a cat scan.
    El doctor recomendó una exploración del gato. (literally, 'an exploration of the cat')

22 How to implement this? Comparing Modal Verb Counts
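A sketch of one way to implement it, treating each source text as the condition (both file names are assumptions):

    from nltk_lite.probability import ConditionalFreqDist

    modals = ['can', 'could', 'may', 'might', 'must', 'will', 'would', 'should']
    cfd = ConditionalFreqDist()
    for source in ['austen-emma.txt', 'melville-moby_dick.txt']:
        for token in open(source).read().lower().split():
            if token in modals:
                cfd[source].inc(token)

    for source in cfd.conditions():
        print source, [(m, cfd[source].count(m)) for m in modals]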

23 Comparing Modals

24 Comparing Modals

25 Next Time: Part-of-Speech Tagging