
Ngram Models Bahareh Sarrafzadeh Winter 2010

Agenda Ngrams – Language Modeling – Evaluation of LMs; Markov Models – Stochastic Process – Markov Chain; Text Classification – Ngram-based Approach

NGram

What is an N-Gram? A contiguous subsequence of n items from a given sequence Items: – Phonemes – Syllables – Letters – Words Number of Items: – Unigram, Bigram, Trigram,...

N-Gram - Examples 3-Grams – ceramics collectables collectibles (55) – ceramics collectables fine (130) – ceramics collected by (52) – ceramics collectible pottery (50) – ceramics collectibles cooking (45) 4-Grams – serve as the incoming (92) – serve as the incubator (99) – serve as the independent (794) – serve as the index (223) – serve as the indication (72) – serve as the indicator (120)

N-Gram Model A Probabilistic Model for Predicting the next Item in such a sequence. Why do we want to Predict Words? – Chatbots – Speech recognition – Handwriting recognition/OCR – Spelling correction – Author attribution – Plagiarism detection –...

N-Gram Model Models sequences, especially natural language, using the statistical properties of N-grams. Idea (Shannon): given a sequence of letters (e.g. "for ex"), what is the likelihood of the next letter? From training data, derive a probability distribution for the next letter given a history of size n.

N-Gram Model Predicts x_i based on x_{i-1}, x_{i-2}, ..., x_{i-(n-1)}. N-gram independence assumption: – a word is affected only by its “prior local context” (the last few words) – Advantage: massively simplifies the problem of learning the language model – Because of the open nature of language, it is common to group words unknown to the language model together
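In symbols, the independence assumption reads as follows (a standard statement of the idea, not reproduced verbatim from the slide):

```latex
% order-n model: the next item depends only on the previous n-1 items
P(x_i \mid x_1, \dots, x_{i-1}) \;\approx\; P(x_i \mid x_{i-(n-1)}, \dots, x_{i-1})
```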

Language Models A statistical language model assigns a probability to a sequence of m words by means of a probability distribution Applications in NLP: – speech recognition, – machine translation, – part-of-speech tagging, – parsing, – information retrieval.

The goal of Statistical Language Modeling is to build a statistical language model that can estimate the distribution of natural language as accurately as possible.

A bad language model

What happened? A language model is a probability distribution over word sequences – P(“And nothing but the truth”) should be relatively high – P(“And nuts sing on the roof”) ≈ 0

How do language models work? It is hard to compute P(“And nothing but the truth”) directly. Step 1: decompose the probability with the chain rule.
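The decomposition is the chain rule of probability (the slide's formula image did not survive the transcript; this is the standard form):

```latex
P(w_1, \dots, w_m)
  = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \dots, w_{m-1})
  = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
```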

Language Models - Simplification Estimating the probability of sequences can become difficult in corpora – Arbitrarily long phrases or sentences – Data sparseness – Overfitting Solution: models are often approximated using smoothed N-gram models.

N-gram Modeling of a Language In an n-gram model, the probability of observing the sentence w_1, ..., w_m is approximated as

P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})

where w_{i-(n-1)}, ..., w_{i-1} is the history and w_i is the prediction. The conditional probability can be calculated from n-gram frequency counts:

P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)}, \dots, w_i)}{\mathrm{count}(w_{i-(n-1)}, \dots, w_{i-1})}

Example Assume each word depends only on the previous two words (Trigram Assumption)
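Written out, the trigram approximation looks like this (a standard rendering; the slide's original formula image is not in the transcript):

```latex
P(w_1, \dots, w_m) \;\approx\; \prod_{i=1}^{m} P(w_i \mid w_{i-2}, w_{i-1})
% e.g.  P(\text{and nothing but the truth}) \approx
%       P(\text{and}) \cdot P(\text{nothing} \mid \text{and}) \cdot
%       P(\text{but} \mid \text{and, nothing}) \cdot
%       P(\text{the} \mid \text{nothing, but}) \cdot
%       P(\text{truth} \mid \text{but, the})
```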

Smoothing It is useful to assign small probabilities to unseen n-grams. For example, for 3-grams we add 2 “dummy” tokens (such as ‘.’) to the beginning of each sentence before counting.
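The slide's formula is not in the transcript; below is a minimal sketch of the padding-and-counting idea combined with simple add-one (Laplace) smoothing. The '.' padding token matches the slide, but the function names and the choice of add-one smoothing are illustrative assumptions.

```python
from collections import Counter

PAD = "."  # "dummy" token prepended to each sentence, as on the slide

def trigram_counts(sentences):
    """Count trigrams and their bigram histories over padded sentences."""
    tri, bi = Counter(), Counter()
    for sentence in sentences:
        words = [PAD, PAD] + sentence.split()
        for i in range(2, len(words)):
            history = (words[i - 2], words[i - 1])
            tri[history + (words[i],)] += 1
            bi[history] += 1
    return tri, bi

def add_one_prob(word, history, tri, bi, vocab_size):
    """Add-one smoothed P(word | history): unseen trigrams get a small non-zero probability."""
    return (tri[history + (word,)] + 1) / (bi[history] + vocab_size)
```

With this scheme a trigram never seen in training still receives probability 1 / (count(history) + V) rather than zero.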

Graphical Representation [figure: 1-gram, 2-gram, ..., n-gram, with each prediction conditioned on the previous (n-1)-gram]

Use of Log Probabilities Multiplying a large number of probabilities gives a very small result (close to zero). So, in order to avoid floating-point underflow, we should use logarithms of the probabilities in the model.
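A minimal sketch of the underflow problem and the log-space fix; the probability values are invented for illustration.

```python
import math

# Multiplying many small probabilities underflows to zero in double precision;
# summing their logarithms does not.
probs = [0.001] * 400

product = 1.0
for p in probs:
    product *= p          # ends up as 0.0 (underflow: the true value is 1e-1200)

log_prob = sum(math.log(p) for p in probs)   # about -2763.1, perfectly representable

print(product, log_prob)
```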

Evaluation Extrinsic – the language model is embedded in a wider application: slow, and specific to the application. Intrinsic – the language model is evaluated directly using some measure, such as perplexity.

Perplexity Perplexity measures the effective size of the set of words from which the next word is chosen, given the history of words observed so far. The perplexity of a LM depends on the domain of discourse.
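For reference, the usual definition over a test sequence w_1, ..., w_N (not spelled out on the slide):

```latex
\mathrm{PP}(W) \;=\; P(w_1, \dots, w_N)^{-1/N}
            \;=\; \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \dots, w_{i-1})}}
```

If every next word were chosen uniformly from 10 digits, each conditional probability would be 1/10 and the perplexity would be exactly 10, matching the intuition on the next slide.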

Perplexity: Intuition Ask a speech recognizer to recognize digits “0, 1, 2, 3, 4, 5, 6, 7, 8, 9” – easy – perplexity 10 Ask a speech recognizer to recognize 30,000 names at Microsoft – hard – perplexity 30,000 Perplexity is the weighted equivalent branching factor.

Perplexity: Is lower better? Remarkable fact: the true model for the data has the lowest possible perplexity. The lower the perplexity, the closer we are to the true model.

Markov Model

Markov Property – Markov Process “The future is independent of the past given the present.” A stochastic process has the Markov property if the conditional probability distribution of future states of the process depends only upon the present state. A process with this property is called a Markov process.

Markov Chain We have a set of states, S = {s_1, s_2, ..., s_r}. The process starts in one of these states and moves successively from one state to another. Each move is called a step. If the chain is currently in state s_i, then it moves to state s_j at the next step with a probability denoted by p_ij. This probability does not depend upon which states the chain was in before the current state.
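A minimal sketch of a two-state chain; the state names and the transition probabilities p_ij below are invented for illustration.

```python
import random

# Transition probabilities p_ij = P(next state is j | current state is i);
# each row sums to 1, and the next step depends only on the current state.
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state):
    """Take one step of the chain from the given state."""
    next_states, probs = zip(*P[state].items())
    return random.choices(next_states, weights=probs)[0]

state = "sunny"
for _ in range(5):
    state = step(state)
    print(state)
```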

Order m – Markov Chain A Markov chain of order m (or a Markov chain with memory m) where m is finite, is a process in which the future state depends on the past m states.

Text Generation using Markov Chains Markov processes can also be used to generate superficially "real-looking" text given a sample document. These processes are also used by spammers to inject real-looking hidden paragraphs into emails to get these messages past spam filters.
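A minimal sketch of first-order, word-level text generation in this style; the sample text and function names are illustrative, not from the slides.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed to follow it (a first-order chain)."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length=12):
    """Walk the chain: each next word is drawn from the followers of the current word."""
    output = [start]
    for _ in range(length - 1):
        followers = chain.get(output[-1])
        if not followers:
            break
        output.append(random.choice(followers))
    return " ".join(output)

sample = ("the cat sat on the mat and the dog sat on the rug "
          "and the cat saw the dog")
print(generate(build_chain(sample), "the"))
```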

Shannon considers a series of Markov chain approximations to English prose. For example, he presents first a simulation where the words are chosen independently but with appropriate frequencies. REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

He then notes the increased similarity to ordinary English text when the words are chosen as a Markov chain, in which case he obtains THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

Garkov!

Text Classification using NGram

Text Classification A fundamental kind of document processing: the content-based assignment of one or more predefined categories to free text. Approaches: – Supervised – Unsupervised – Semi-supervised

Main Tasks 1. Feature Construction / Selection – extracting representative features: words (frequency), context of words (set of words), sparse phrases (neighbouring words), word N-grams (frequency) 2. Learning Phase – binary classifiers – M-ary classifiers

Learning Algorithms Decision Trees Naive Bayes KNN Neural Networks Support Vector Machines

Ngram-based Text Classification Features: N-grams. Values: N-gram frequencies. Similarity measure: of various types.

Classifier’s Characteristics The categorization must work reliably in spite of textual errors. The categorization must be efficient, consuming as little storage and processing time as possible. The categorization must be able to recognize when a given document does not match any category, or when it falls between two categories.

Overall Approach Start with a set of pre-existing text categories (such as subject domains). Generate a set of N-gram frequency profiles to represent each of the categories. When a new document arrives for classification, the system first computes its N-gram frequency profile. It then compares this profile against the profiles for each of the categories using an easily calculated distance measure. The system classifies the document as belonging to the category having the smallest distance.

N-gram Frequency Statistics Each word occurs in human languages with a different frequency. One of the most common ways of expressing this idea: Zipf’s Law

Zipf’s Law The nth most common word in a human language text occurs with a frequency inversely proportional to n: there is always a set of words which dominates most of the other words of the language in terms of frequency of use.
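As a formula (the standard statement, not shown on the slide): if f(r) is the frequency of the word of rank r, then

```latex
f(r) \;\propto\; \frac{1}{r}
\qquad\text{so}\qquad
f(1) \approx 2\,f(2) \approx 3\,f(3) \approx \cdots
```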

Zipf’s Law The most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word... This is true for: – languages as a whole – subject-specific words

Zipf’s Law: Example For example, in the Brown Corpus "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences, The second-place word "of" accounts for slightly over 3.5% of words, Followed by "and" (about 2%) Only 135 vocabulary items are needed to account for half the Brown Corpus.

Zipf’s Law Applies to Lots of Things frequency of accesses to web pages sizes of settlements income distribution amongst individuals size of earthquakes words in the English language

[Figure: word frequency in Wikipedia]

Zipf’s Law: Classification Zipf’s Law implies that classifying documents with N-gram frequency statistics will not be very sensitive to cutting off the distributions at a particular rank. It also implies that if we are comparing documents from the same category they should have similar N-gram frequency distributions.

Document Representation Documents are represented by their N-gram frequency profiles: – the list of N-grams ordered by the number of occurrences in the given document – it simply describes the Zipfian distribution of N-grams in the document.

Generating N-Gram Frequency Profiles Split the text into separate tokens. Scan down each token, generating all possible N-grams. Hash into a table to find the counter for the N-gram, and increment it. When done, output all N-grams and their counts. Finally, sort the N-grams into descending order by their counts.
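A minimal sketch of this procedure, assuming character N-grams of length 1 to 5, underscore padding for word boundaries, and a profile cut off at the top 300 entries, as in the Cavnar and Trenkle paper cited in the references; the exact parameters and the tokenizing regex are illustrative.

```python
import re
from collections import Counter

def ngram_profile(text, n_values=(1, 2, 3, 4, 5), top_k=300):
    """Return the document's N-grams ranked by descending frequency (its frequency profile)."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        padded = f"_{token}_"              # underscores mark word boundaries
        for n in n_values:
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Keep only the most frequent entries: the Zipfian head carries most of the signal.
    return [gram for gram, _ in counts.most_common(top_k)]
```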

Comparing and Ranking N-Gram Frequency Profiles Take two N-gram profiles and calculate a simple rank-order statistic, e.g. the “out-of-place” measure.
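A minimal sketch of the "out-of-place" measure, assuming profiles are lists of N-grams ordered from most to least frequent (as produced by the sketch above); the maximum penalty used for N-grams missing from the category profile is an assumption.

```python
def out_of_place(doc_profile, cat_profile, max_penalty=None):
    """Sum, over the document's N-grams, of how far each rank is from its rank in the category profile."""
    if max_penalty is None:
        max_penalty = len(cat_profile)   # penalty for N-grams missing from the category profile
    cat_rank = {gram: rank for rank, gram in enumerate(cat_profile)}
    distance = 0
    for doc_rank, gram in enumerate(doc_profile):
        if gram in cat_rank:
            distance += abs(doc_rank - cat_rank[gram])
        else:
            distance += max_penalty
    return distance

# The document is assigned to the category whose profile gives the smallest distance:
# best = min(category_profiles, key=lambda c: out_of_place(doc_profile, category_profiles[c]))
```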

Language Classification Most writing systems support more than one language. Given a text that uses a particular writing system, it is necessary to determine the language in which it is written before further processing is possible.

Lexicon-based Approach Keep a lexicon for each possible language Look up every word in the sample text to see in which lexicon it falls The lexicon that contains the most words from the sample indicates which language was used Is it a good approach?

Challenges Building or obtaining a representative lexicon is not easy! For highly inflected languages: – a much larger lexicon is needed – some language-specific morphological processing is required Spelling errors (e.g. as the result of an OCR process) will disrupt the lexicon lookup process.

Ngram-based Approach Basic idea: identify N-grams whose occurrence in a document gives strong evidence for / against identification of a text as belonging to a particular language. The N-gram frequency profile technique can be used to classify documents according to their language.

Requirements No lexicon No Morphological Processing rules A good number of sample texts (10K to 20K bytes) Calculating the N-gram frequency profiles

Advantages Modest Computational and Storage requirements Very effective Simple No Semantic or Content analysis required (apart from the N-gram frequency profile)

Subject Classification The same text categorization approach Extended to a multi-language database Overall: – A training set is obtained – N-gram frequencies are calculated for each class – N-gram frequencies are calculated for a new article – An overall distance measure between profiles is computed – The article is assigned to the category which minimizes this distance

N-grams: Summary Very simple but effective Resistant to Textual Errors No Text Preprocessing Language Independent

References
P. Brown et al., “Class-Based n-gram Models of Natural Language”, Computational Linguistics, 1992.
V. Kešelj, N. Cercone, et al., “N-gram-based Author Profiles for Authorship Attribution”, 2003.
W. B. Cavnar, J. M. Trenkle, “N-gram-based Text Categorization”, Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994.
P. Náther, “N-gram based Text Categorization”, Diploma thesis, 2005.
J. Henke, “Statistical Inference: n-gram Models over Sparse Data”, TDM Seminar.
J. Goodman, “The State of The Art in Language Modeling”, Microsoft Research, Speech Technology Group.

Thank You!