School of Computing, Faculty of Engineering
Word-counts, visualizations and N-grams
Eric Atwell, Language Research Group (with thanks to Katja Markert, Marti Hearst, and other contributors)

Reminder
Tokenization: by whitespace or by regular expressions. Problem cases: "It's", "data-base", "New York", ...
"Jabberwocky" shows we can break words into morphemes.
Morpheme types: root/stem, affix, clitic; derivational vs. inflectional; regular vs. irregular; concatenative vs. templatic (root-and-pattern).
Morphological analysers: Porter stemmer, Morphy, PC-Kimmo.
Morphology by lookup: CatVar, CELEX, OALD++.
MorphoChallenge: unsupervised machine learning of morphology.

Counting Token Distributions
Useful for lots of things. One cute application: see who talks where in a novel. The idea comes from Eick et al., who did it with The Jungle Book by Kipling.

SeeSoft visualization of Jungle Book characters, from Eick, Steffen, and Sumner (1992).

The FreqDist Data Structure
Purpose: collect counts and frequencies for some phenomenon.
Initialize a new FreqDist:
>>> import nltk
>>> from nltk.probability import FreqDist
>>> fd = FreqDist()
Inside a counting loop:
fd.inc(item)  # increment the count for the item of interest
After counting is done:
fd.N()  # total number of tokens counted (N = number)
fd.B()  # number of unique tokens (types; B = buckets)
fd.samples()  # list of all the tokens seen (there are N)
fd.Nr(10)  # number of samples that occurred 10 times
fd.count('red')  # number of times the token 'red' was seen
fd.freq('red')  # relative frequency of 'red', i.e. fd.count('red') / fd.N()
fd.max()  # the token with the highest count
fd.sorted_samples()  # the samples in decreasing order of frequency

FreqDist() in action
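(The slide above uses the old NLTK 2 API; in NLTK 3 a FreqDist behaves like collections.Counter, so fd.inc(x) became fd[x] += 1 and fd.count(x) became fd[x]. A minimal sketch with the current API; the corpus choice is illustrative and assumes nltk.download('gutenberg') has been run.)

import nltk
from nltk.corpus import gutenberg

# Count token frequencies in Austen's "Emma"
words = [w.lower() for w in gutenberg.words('austen-emma.txt')]
fd = nltk.FreqDist(words)

print(fd.N())              # total number of tokens counted
print(fd.B())              # number of distinct types
print(fd['emma'])          # count of the token 'emma'
print(fd.freq('emma'))     # relative frequency: fd['emma'] / fd.N()
print(fd.max())            # most frequent token
print(fd.most_common(10))  # replaces fd.sorted_samples()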

Word Lengths by Language
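(The original slide showed a plot. A sketch of how such a comparison can be built with a ConditionalFreqDist over the Universal Declaration of Human Rights corpus; the language selection is illustrative, and nltk.download('udhr') plus matplotlib are assumed.)

import nltk
from nltk.corpus import udhr

# One condition per language; one sample per word length
languages = ['English', 'German_Deutsch', 'Finnish_Suomi']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.plot(cumulative=True)  # cumulative word-length curves, one per language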

How to determine the characters? Who are the main characters in a story? Simple solution: look for words that begin with capital letters; count how often each occurs. Then show the most frequent.
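(A rough sketch of this idea; the filter is crude, since it also catches sentence-initial words, and the text choice is illustrative.)

import nltk
from nltk.corpus import gutenberg

# Count capitalized words as candidate character names
words = gutenberg.words('austen-emma.txt')
fd = nltk.FreqDist(w for w in words if w.istitle() and len(w) > 2)
print(fd.most_common(15))  # the most frequent capitalized words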

Who are the main characters? And where in the story?
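(The original slide showed a plot. NLTK's dispersion plot gives a similar who-appears-where view to the SeeSoft display above; the names are illustrative, and matplotlib is assumed.)

import nltk
from nltk.corpus import gutenberg

# Each stripe marks an occurrence of a name along the length of the text
text = nltk.Text(gutenberg.words('austen-emma.txt'))
text.dispersion_plot(['Emma', 'Knightley', 'Harriet', 'Frank', 'Jane'])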

Language Modeling
N-gram modelling is a fundamental concept in NLP.
Main idea: in a given language, some words are more likely than others to follow each other, so you can predict (with some degree of accuracy) the probability that a given word will follow another.
This works for words, and also for parts-of-speech, prosodic features, dialogue acts, ...

Next Word Prediction (adapted from a slide by Bonnie Dorr)
From a NY Times story:
Stocks ...
Stocks plunged this ...
Stocks plunged this morning, despite a cut in interest rates ...
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began ...

Human Word Prediction (adapted from a slide by Bonnie Dorr)
Clearly, at least some of us have the ability to predict future words in an utterance. How?
Domain knowledge. Syntactic knowledge. Lexical knowledge.

Simple Statistics Does a Lot (adapted from a slide by Bonnie Dorr)
A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques.
In particular, we'll rely on the notion of the probability of a sequence (a phrase, a sentence).

N-Gram Models of Language (adapted from a slide by Bonnie Dorr)
Use the previous N-1 words in a sequence to predict the next word.
How do we train these models? Very large corpora.

Simple N-Grams (adapted from a slide by Bonnie Dorr)
Assume a language has V word types in its lexicon; how likely is word x to follow word y?
Simplest model of word probability: 1/V.
Alternative 1: estimate the likelihood of x occurring in new text from its general frequency of occurrence in a corpus (unigram probability): 'popcorn' is more likely to occur than 'unicorn'.
Alternative 2: condition the likelihood of x on the context of previous words (bigrams, trigrams, ...): 'mythical unicorn' is more likely than 'mythical popcorn'.
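(Alternative 2 is usually estimated by maximum likelihood: P(x | y) = count(y x) / count(y). A sketch, with corpus and word pairs illustrative; note that unseen bigrams get probability 0, which is the sparse-data problem.)

import nltk
from nltk.corpus import gutenberg

words = [w.lower() for w in gutenberg.words('austen-emma.txt')]
unigram_fd = nltk.FreqDist(words)
bigram_fd = nltk.FreqDist(nltk.bigrams(words))

def bigram_prob(y, x):
    # Maximum-likelihood estimate of P(x | y)
    return bigram_fd[(y, x)] / unigram_fd[y] if unigram_fd[y] else 0.0

print(bigram_prob('very', 'much'))         # a common pair in this corpus
print(bigram_prob('mythical', 'popcorn'))  # almost certainly 0.0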

Computing Next Words
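(The original slide showed code output. One way to compute next-word candidates is a ConditionalFreqDist over bigrams; the corpus and query word are illustrative.)

import nltk
from nltk.corpus import gutenberg

# For each word, collect a frequency distribution over the words that follow it
words = [w.lower() for w in gutenberg.words('austen-emma.txt')]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))
print(cfd['the'].most_common(5))  # most likely words to follow 'the'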

Auto-generate a Story
If the generator simply chooses the most probable next word given the current word, it loops. Can you see why? Always taking the maximum is deterministic, so once any word recurs, the same sequence follows forever.
This is a bigram model. Is it better to take a longer history into account: trigram, 4-gram, ...? (But will this guarantee no loops?)
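(A minimal bigram generator in the style of the NLTK book; corpus and seed word are illustrative.)

import nltk
from nltk.corpus import gutenberg

words = [w.lower() for w in gutenberg.words('austen-emma.txt')]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))

def generate(cfd, word, num=20):
    # Always choosing the single most probable successor is deterministic,
    # so the output soon settles into a cycle.
    out = []
    for _ in range(num):
        out.append(word)
        word = cfd[word].max()
    return ' '.join(out)

print(generate(cfd, 'emma'))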

Applications (adapted from a slide by Bonnie Dorr)
Why do we want to predict a word, given some preceding words?
To rank the likelihood of sequences containing various alternative hypotheses, e.g. for automatic speech recognition (ASR): Theatre owners say popcorn/unicorn sales have doubled ... (See for yourself: EBL has Dragon NaturallySpeaking ASR.)
To assess the likelihood/goodness of a sentence, for text generation or machine translation: The doctor recommended a cat scan. / El doctor recomendó una exploración del gato.
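(A sketch of ranking alternative hypotheses with an unsmoothed unigram-plus-bigram model; the corpus choice is illustrative and assumes nltk.download('brown'). Real ASR language models are smoothed, so unseen n-grams don't zero out the score.)

import nltk
from nltk.corpus import brown

words = [w.lower() for w in brown.words()]
uni = nltk.FreqDist(words)
bi = nltk.FreqDist(nltk.bigrams(words))

def sentence_score(tokens):
    # P(w1) * product of MLE bigram probabilities P(wi | wi-1)
    p = uni.freq(tokens[0])
    for y, x in nltk.bigrams(tokens):
        p *= bi[(y, x)] / uni[y] if uni[y] else 0.0
    return p

print(sentence_score(['popcorn', 'sales', 'have', 'doubled']))
print(sentence_score(['unicorn', 'sales', 'have', 'doubled']))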

Comparing Modal Verb Counts
'can' and 'will' are more frequent in the 'skills and hobbies' genre (Bob the Builder: "Yes we can!").
How to implement this?

Comparing Modals
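(The original slide showed a table. A sketch in the style of the NLTK book, tabulating modal counts by Brown corpus genre; assumes nltk.download('brown'), and the genre selection is illustrative.)

import nltk
from nltk.corpus import brown

# One condition per genre; count every word in that genre
cfd = nltk.ConditionalFreqDist(
    (genre, word.lower())
    for genre in brown.categories()
    for word in brown.words(categories=genre))
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=['news', 'religion', 'hobbies', 'science_fiction'],
             samples=modals)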

Reminder
FreqDist counts of tokens and their distribution can be useful:
e.g. find the main characters in Gutenberg texts;
e.g. compare word-lengths in different languages.
Humans can predict the next word; N-gram models do this from counts in a large corpus.
Auto-generate a story ... (but a greedy bigram generator gets stuck in a loop at a local maximum).
Grammatical trends: modal verb distribution predicts genre.