Introduction to Text Mining. Based on Hongning Wang's Text Mining course slides, slightly modified by Lei Chen.


What is “Text Mining”? “Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text.” - Wikipedia. “Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” - Hearst, 1999

Two different definitions of mining. Goal-oriented (effectiveness driven): any process that generates useful, non-obvious results is called “mining”; keywords: “useful” + “non-obvious”; the data isn't necessarily massive. Method-oriented (efficiency driven): any process that extracts information from massive data is called “mining”; keywords: “massive” + “pattern”; the patterns aren't necessarily useful.

Text mining around us: sentiment analysis.

Text mining around us: document summarization.

Text mining around us: restaurant/hotel recommendation.

Text mining around us: text analytics in financial services.

How to perform text mining? As computer scientists, we view it as: Text Mining = Data Mining + Text Data, drawing on applied machine learning, natural language processing, and information retrieval. Text data comes in many forms: blogs, news articles, web pages, tweets, scientific literature, software documentation.

Text mining vs. NLP, IR, DM… How does it relate to data mining in general? To computational linguistics? To information retrieval?

                    Finding patterns            Finding “nuggets” (novel)     Finding “nuggets” (non-novel)
Non-textual data    General data mining         Exploratory data analysis     Database queries
Textual data        Computational linguistics   Text mining                   Information retrieval

Text mining in general has three aspects: Access (filter information; serves IR applications), Organization (add structure/annotations), and Mining (discover knowledge) – based on NLP/ML techniques and a sub-area of DM research.

Challenges in text mining. The data collection is “free text”: data is not well-organized (semi-structured or unstructured), and natural language text contains ambiguities on many levels (lexical, syntactic, semantic, and pragmatic). Learning techniques for processing text typically need annotated training examples, which are expensive to acquire at scale. And what should we mine in the first place?

Text mining problems we will solve: document categorization – adding structure to the text corpus.

Text mining problems we will solve: text clustering – identifying structure in the text corpus.

Text mining problems we will solve: topic modeling – identifying structure in the text corpus.

Text Document Representation & Language Model
1. How to represent a document? – Make it computable
2. How to infer the relationship among documents or identify the structure within a document? – Knowledge discovery
3. Language models and N-grams

How to represent a document? Represent it by a string? – No semantic meaning. Represent it by a list of sentences? – A sentence is just like a short document (recursive definition).

Vector space model: represent documents by concept vectors – each concept defines one dimension, k concepts define a high-dimensional space, and each element of the vector is a concept weight, e.g., d = (x1, …, xk), where xi is the “importance” of concept i in d. Distance between vectors in this concept space captures the relationship among documents.

An illustration of the VS model: all documents are projected into this concept space. [Figure: documents D1–D5 plotted against the axes Sports, Education, and Finance, with the distance |D2 − D4| marked.]

What the VS model doesn't say: how to define/select the “basic concepts” (concepts are assumed to be orthogonal), how to assign weights (weights indicate how well a concept characterizes the document), and how to define the distance metric.

What is a good “basic concept”? It should be orthogonal (linearly independent basis vectors, “non-overlapping” in meaning, no ambiguity), and its weights should be assignable automatically and accurately. Existing solutions: terms or N-grams (a.k.a. bag-of-words) and topics. We will come back to this later.

Bag-of-Words representation. Terms are the basis of the vector space:
– Doc1: Text mining is to identify useful information.
– Doc2: Useful information is mined from text.
– Doc3: Apple is delicious.

        text  information  identify  mining  mined  is  useful  to  from  apple  delicious
Doc1     1        1            1        1      0     1     1     1    0     0        0
Doc2     1        1            0        0      1     1     1     0    1     0        0
Doc3     0        0            0        0      0     1     0     0    0     1        1
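Below is a minimal sketch (not from the original slides) of how such a term–document matrix can be built. The three documents are the ones above; the helper function and the use of Python's collections.Counter are illustrative choices.

```python
from collections import Counter

docs = [
    "Text mining is to identify useful information.",
    "Useful information is mined from text.",
    "Apple is delicious.",
]

def tokenize(text):
    # Naive whitespace tokenization plus lowercasing and punctuation stripping,
    # just enough for this toy example.
    return [tok.strip(".,").lower() for tok in text.split()]

# Vocabulary = union of all tokens, in first-seen order.
vocab = []
for doc in docs:
    for tok in tokenize(doc):
        if tok not in vocab:
            vocab.append(tok)

# Each document becomes a vector of raw term counts over the vocabulary.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocab])

for row in matrix:
    print(row)
```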

Tokenization: break a stream of text into meaningful units – tokens: words, phrases, symbols – where the definition depends on the language, corpus, or even context. Input: It's not straight-forward to perform so-called “tokenization.” Output (1): 'It's', 'not', 'straight-forward', 'to', 'perform', 'so-called', '“tokenization.”' Output (2): 'It', '’', 's', 'not', 'straight', '-', 'forward', 'to', 'perform', 'so', '-', 'called', '“', 'tokenization', '.', '”'

Tokenization solutions. Regular expressions: [\w]+ turns “so-called” into 'so', 'called'; [\S]+ keeps “It's” as 'It's' instead of 'It', ''s'. Statistical methods: exploit rich features to decide where a word boundary is – e.g., Apache OpenNLP or the Stanford NLP parser (online demos are available from Stanford and UIUC). We will come back to this later.
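A hedged sketch of the two regular-expression strategies above; the patterns [\w]+ and [\S]+ are the ones named on the slide, and the example sentence is the one used earlier.

```python
import re

sentence = "It's not straight-forward to perform so-called \"tokenization.\""

# [\w]+ keeps runs of word characters: it splits "so-called" into 'so', 'called'
# and "It's" into 'It', 's'.
print(re.findall(r"[\w]+", sentence))

# [\S]+ keeps runs of non-whitespace: "It's" and "straight-forward" stay intact,
# but punctuation sticks to the tokens (e.g., 'tokenization."').
print(re.findall(r"[\S]+", sentence))
```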

Bag-of-Words representation. Assumption: words are independent from each other. Pros: simple. Cons: the basis vectors are clearly not linearly independent, and grammar and word order are lost. Still, it is the most frequently used document representation – not only for text but also for images, speech, and gene sequences (see the term–document matrix above).

Bag-of-Words with N-grams: use contiguous N-word sequences, in addition to single terms, as the basis of the vector space.

Automatic document representation

A statistical property of language. [Figure: a plot of word frequency vs. word rank by frequency in Wikipedia (Nov 27, 2006).] In the Brown Corpus of American English text, the word "the" is the most frequently occurring word and by itself accounts for nearly 7% of all word occurrences; the second-place word "of" accounts for slightly over 3.5% of words. This is Zipf's law, a discrete version of a power law.
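A small sketch (my addition, not from the slides) for checking this property on any tokenized corpus: under Zipf's law, rank times frequency stays roughly constant, so the log-log rank–frequency curve is close to a straight line.

```python
from collections import Counter

def zipf_check(tokens):
    """Rank words by frequency; under Zipf's law, rank * frequency is roughly constant."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return [(rank, f, rank * f) for rank, f in enumerate(freqs, start=1)]

# Usage on any tokenized corpus (corpus_tokens is assumed to be a list of words):
# for rank, freq, product in zipf_check(corpus_tokens)[:20]:
#     print(rank, freq, product)
```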

Zipf's law tells us: head words account for a large portion of occurrences but are semantically meaningless (e.g., the, a, an, we, do, to); tail words account for a major portion of the vocabulary but rarely occur in documents (e.g., dextrosinistral). The words in between are the most representative – these are the ones to include in the controlled vocabulary.

Automatic document representation: remove non-informative words and remove rare words.

Normalization: convert different forms of a word to a normalized form in the vocabulary – e.g., U.S.A. -> USA, St. Louis -> Saint Louis. Solutions: rule-based (delete periods and hyphens, convert everything to lower case) or dictionary-based (construct equivalence classes, e.g., “car” -> “automobile, vehicle”, “mobile phone” -> “cellphone”). We will come back to this later.

Stemming: reduce inflected or derived words to their root form – plurals, adverbs, inflected word forms, e.g., ladies -> lady, referring -> refer, forgotten -> forget – to bridge the vocabulary gap. Solutions (for English): the Porter stemmer (patterns of vowel–consonant sequences) and the Krovetz stemmer (morphological rules). Risk: we may lose the precise meaning of a word, e.g., lay -> lie (a false statement? or to be in a horizontal position?).
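The Porter stemmer mentioned above is available in NLTK; this short sketch (assuming NLTK is installed) just shows what its output looks like. The Krovetz stemmer is not part of NLTK, so it is not shown here.

```python
# Assumes `pip install nltk`.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["referring", "ladies", "forgotten"]:
    # Porter applies vowel-consonant pattern rules; the result is a stem,
    # not necessarily a dictionary word (e.g., "ladies" becomes "ladi").
    print(word, "->", stemmer.stem(word))
```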

Stopwords: words that are useless for document analysis – not all words are informative, so we remove them to reduce the vocabulary size. There is no universal stopword list. Risk: removal can break the original meaning and structure of the text, e.g., “this is not a good option” -> “option”, and “to be or not to be” -> null.

Constructing a VSM representation.
1. Tokenization: D1: 'Text mining is to identify useful information.' -> 'Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.'
2. Stemming/normalization: -> 'text', 'mine', 'is', 'to', 'identify', 'use', 'inform', '.'
3. N-gram construction: -> 'text-mine', 'mine-is', 'is-to', 'to-identify', 'identify-use', 'use-inform', 'inform-.'
4. Stopword/controlled-vocabulary filtering: -> 'text-mine', 'to-identify', 'identify-use', 'use-inform'
Documents in a vector space! The pipeline naturally fits the MapReduce paradigm (mapper/reducer). A sketch of the pipeline follows.
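A compact sketch of the four-step pipeline above. The tiny stopword list and the crude suffix-stripping stemmer are illustrative stand-ins (a real system would use a full stopword list and a proper stemmer such as Porter), so the output is analogous to, not identical with, the D1 example.

```python
import re

STOPWORDS = {"is", "to"}   # illustrative; real lists are much larger

def tokenize(text):
    return re.findall(r"[\w]+", text.lower())

def crude_stem(token):
    # Stand-in for a real stemmer: just strip a few common suffixes.
    for suffix in ("ing", "ful", "ation"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def bigrams(tokens):
    return [a + "-" + b for a, b in zip(tokens, tokens[1:])]

def to_features(text):
    tokens = [crude_stem(t) for t in tokenize(text)]   # steps 1-2
    grams = bigrams(tokens)                            # step 3
    return [g for g in grams                           # step 4
            if not any(part in STOPWORDS for part in g.split("-"))]

print(to_features("Text mining is to identify useful information."))
```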

Recap: an illustration of the VS model – all documents are projected into this concept space. [Figure: documents D1–D5 plotted against the axes Sports, Education, and Finance, with the distance |D2 − D4| marked.]

Recap: Bag-of-Words representation. Assumption: words are independent from each other. Pros: simple. Cons: the basis vectors are clearly not linearly independent, and grammar and word order are lost. Still the most frequently used document representation – also for images, speech, and gene sequences.

Recap: automatic document representation – remove non-informative words and remove rare words.

Recap: constructing a VSM representation – tokenization, stemming/normalization, N-gram construction, and stopword/controlled-vocabulary filtering turn D1 into 'text-mine', 'to-identify', 'identify-use', 'use-inform'; documents then live in a vector space, and the pipeline naturally fits the MapReduce paradigm.

How to assign weights? This is important. Why? Corpus-wise, some terms carry more information about the document content; document-wise, not all terms are equally important. How? Two basic heuristics: TF (term frequency, i.e., within-document frequency) and IDF (inverse document frequency).

Term frequency tf(w, d): how many times word w occurs in document d. Example – Doc A: 'good' occurs 10 times; Doc B: 'good' occurs 2 times; Doc C: 'good' occurs 3 times. Which two documents are more similar to each other?

TF normalization. Two views of document length: a document may be long because it is verbose, or because it genuinely has more content. Raw TF is therefore inaccurate: document length varies, “repeated occurrences” are less informative than the “first occurrence”, and the information a term conveys does not grow proportionally with its number of occurrences. We generally penalize long documents, but must avoid over-penalizing them – e.g., pivoted length normalization.

TF normalization. [Figures: normalized TF plotted as a function of raw TF for different normalization schemes.]

Document frequency. Idea: a term is more discriminative if it occurs in fewer documents.

Inverse document frequency: IDF(w) = 1 + log(N / df(w)), where N is the total number of documents in the collection and df(w) is the number of documents containing w; the logarithm provides a non-linear scaling.
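A hedged sketch of one common TF-IDF variant that combines these two heuristics: sublinear TF, 1 + log tf, together with the IDF form above. As the slides warn later, many TF-IDF variants exist, so treat this as one reasonable choice rather than the definitive weighting; the toy corpus is illustrative.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, doc_freq, n_docs):
    """One common TF-IDF variant: weight = (1 + log tf) * (1 + log(N / df))."""
    weights = {}
    for term, tf in Counter(doc_tokens).items():
        idf = 1.0 + math.log(n_docs / doc_freq[term])   # df(term) > 0 for seen terms
        weights[term] = (1.0 + math.log(tf)) * idf      # sublinear TF scaling
    return weights

corpus = [["text", "mining", "useful", "information"],
          ["useful", "information", "mined", "text"],
          ["apple", "delicious"]]
df = Counter(term for doc in corpus for term in set(doc))
print(tfidf_vector(corpus[0], df, len(corpus)))
```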

Why document frequency? Table 1 (example total term frequency vs. document frequency in the Reuters-RCV1 collection) compares the words “try” and “insurance”: two words can have a similar total term frequency (ttf) yet a very different document frequency (df), and df better reflects how discriminative a term is.

TF-IDF weighting combines the two heuristics. It goes back to Gerard Salton: “Salton was perhaps the leading computer scientist working in the field of information retrieval during his time.” - Wikipedia. The Gerard Salton Award is the highest achievement award in IR.

How to define a good similarity metric? Euclidean distance? [Figure: documents D1–D6 in a TF-IDF space with axes Sports, Education, and Finance.]

How to define a good similarity metric? Euclidean distance between TF-IDF vectors is problematic: it is dominated by document length, so a long document can look far from a short one even when they discuss the same topic.

From distance to angle. The angle measures how much two vectors overlap: cosine similarity is the projection of one vector onto the other. [Figure: documents D1, D2, D6 in a TF-IDF space (axes Sports, Finance), contrasting the choice of angle with the choice of Euclidean distance.]

Cosine similarity: cosine(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|), i.e., the dot product of the TF-IDF vectors after each is normalized to a unit vector. [Figure: documents D1, D2, D6 in a TF-IDF space (axes Sports, Finance).]
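A minimal sketch of cosine similarity, assuming the sparse {term: weight} dictionaries produced by the TF-IDF sketch earlier; the example weights are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine({"text": 1.4, "mining": 2.0}, {"text": 0.7, "apple": 1.1}))
```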

Advantages of the VS model: empirically effective, intuitive, easy to implement, and well studied / extensively evaluated – e.g., the SMART system developed at Cornell, still widely used. Warning: many variants of TF-IDF exist!

Disadvantages of the VS model: it assumes term independence and lacks “predictive adequacy” – the term weighting and the similarity measure are somewhat arbitrary, which means lots of parameter tuning!

Text Document Representation & Language Model
1. How to represent a document? – Make it computable
2. How to infer the relationship among documents or identify the structure within a document? – Knowledge discovery
3. Language models and N-grams

What is a statistical LM? A model specifying a probability distribution over word sequences – e.g., p(“Today is Wednesday”), p(“Today Wednesday is”), p(“The eigenvalue is positive”) – where a fluent, plausible sentence should receive a much higher probability than a scrambled one. It can be regarded as a probabilistic mechanism for “generating” text, and is therefore also called a “generative” model.

Why is a LM useful? It provides a principled way to quantify the uncertainties associated with natural language, and lets us answer questions like: given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? (speech recognition); given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports” vs. “politics”? (text categorization); given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval).

Measure the fluency of documents

Source-Channel framework [Shannon 48]. A source emits X, a noisy channel turns it into Y (source -> transmitter/encoder -> noisy channel -> receiver/decoder -> destination), and the receiver recovers X' = argmax_X p(X|Y) = argmax_X p(Y|X) p(X) (Bayes rule). When X is text, p(X) is a language model. Many examples: speech recognition (X = word sequence, Y = speech signal), machine translation (X = English sentence, Y = Chinese sentence), OCR error correction (X = correct word, Y = erroneous word), information retrieval (X = document, Y = query), summarization (X = summary, Y = document).

Basic concepts of probability. Random experiment: an experiment with an uncertain outcome (e.g., tossing a coin, picking a word from text). Sample space S: all possible outcomes of the experiment, e.g., tossing two fair coins gives S = {HH, HT, TH, TT}. Event E ⊆ S: E happens iff the outcome is in E, e.g., E = {HH} (all heads), E = {HH, TT} (same face); the impossible event is {} and the certain event is S. Probability of an event: 0 ≤ P(E) ≤ 1.

Recap: how to assign weights? Corpus-wise, some terms carry more information about the document content; document-wise, not all terms are equally important. Two basic heuristics: TF (term frequency, i.e., within-document frequency) and IDF (inverse document frequency).

Sampling with replacement: pick a random shape, then put it back in the bag.

Essential probability concepts

Language model for text. Chain rule: from conditional probabilities to the joint probability of a sentence, p(w1 w2 … wn) = p(w1) p(w2|w1) p(w3|w1, w2) … p(wn|w1, …, wn−1). How large is this model? The average English sentence is 14.3 words long and Webster's Third New International Dictionary has 475,000 main headwords, so the full joint distribution is hopelessly large to estimate – we need independence assumptions!

Unigram language model: assume words are generated independently, p(w1 w2 … wn) = Π_i p(wi), i.e., text is drawn word by word from a single multinomial distribution over the vocabulary. The simplest and most popular choice!

More sophisticated LMs: e.g., N-gram models, which condition each word on the previous N−1 words.

Why just unigram models? Moving toward more complex models is difficult: they involve more parameters, so they need more data to estimate, and they significantly increase the computational complexity in both time and space. Capturing word order or structure may not add much value for “topical inference”. Still, more sophisticated models can be expected to improve performance.

Generative view of text documents. Each (unigram) language model θ defines a word distribution p(w|θ). Topic 1 (“text mining”): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food …; Topic 2 (“health”): text 0.01, food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … Sampling from the first model yields a text-mining document; sampling from the second yields a food/nutrition document.

How to generate text from an N-gram language model?

Generating text from language models: under a unigram language model, each word is sampled independently from p(w|θ).

Generating text from language models: under a unigram language model, any reordering of the same words has exactly the same likelihood – word order is ignored (see the sketch below).
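A small sketch illustrating both points: the likelihood of a word sequence under a unigram model, which is identical for any reordering, and sampling text from the model. The toy distribution over “today/is/wednesday” is made up for illustration.

```python
import math
import random

def log_likelihood(unigram, words):
    """log p(w1 ... wn) = sum_i log p(wi) under a unigram model."""
    return sum(math.log(unigram[w]) for w in words)

def generate(unigram, length, seed=0):
    """Sample `length` words independently from the unigram distribution."""
    rng = random.Random(seed)
    words = list(unigram)
    return " ".join(rng.choices(words, weights=[unigram[w] for w in words], k=length))

lm = {"today": 0.3, "is": 0.4, "wednesday": 0.3}
print(log_likelihood(lm, ["today", "is", "wednesday"]))
print(log_likelihood(lm, ["today", "wednesday", "is"]))   # identical: order is ignored
print(generate(lm, 5))
```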

N-gram language models will help. Samples generated from language models trained on the New York Times: Unigram – “Months the my and issue of year foreign new exchange's september were recession exchange new endorsed a q acquire to six executives.” Bigram – “Last December through the way to preserve the Hudson corporation N.B.E.C. Taylor would seem to complete the major central planners one point five percent of U.S.E. has already told M.X. corporation of living on information such as more frequently fishing to keep her.” Trigram – “They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions.”

Turing test: generating Shakespeare. [Figure: four generated text samples, A–D.] See also SCIgen – an automatic CS paper generator.

Estimation of language models. Given a “text mining” paper (total #words = 100) with counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, estimate the unigram language model p(w|θ) = ? (text ?, mining ?, association ?, database ?, …, query ?, …).

Sampling with replacement: pick a random shape, then put it back in the bag.

Recap: language model for text. Chain rule: p(w1 w2 … wn) = p(w1) p(w2|w1) … p(wn|w1, …, wn−1). With an average English sentence length of 14.3 words and 475,000 main headwords in Webster's Third New International Dictionary, this joint distribution is far too large to estimate directly – we need independence assumptions!

Recap: how to generate text from an N-gram language model?

Recap: estimation of language models. Maximum likelihood estimation on the “text mining” paper (total #words = 100) with counts text 10, mining 5, association 3, …, query 1, efficient 1 gives p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = 3/100, …, p(query|θ) = 1/100.
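A minimal sketch of this maximum likelihood estimate; the counts are the ones from the toy “text mining” paper above, and the 75 remaining word occurrences are simply omitted.

```python
from collections import Counter

def mle_unigram(counts, doc_length):
    """Maximum likelihood estimate: p(w | theta) = c(w, d) / |d|."""
    return {w: c / doc_length for w, c in counts.items()}

# Word counts from the toy "text mining" paper (remaining words omitted).
counts = Counter({"text": 10, "mining": 5, "association": 3, "database": 3,
                  "algorithm": 2, "query": 1, "efficient": 1})
print(mle_unigram(counts, doc_length=100))   # e.g., p(text) = 0.1, p(query) = 0.01
```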

Parameter estimation

Maximum likelihood vs. Bayesian. Maximum likelihood estimation: “best” means the data likelihood reaches its maximum; issue: small sample sizes (ML is the frequentist's point of view). Bayesian estimation: “best” means being consistent with our “prior” knowledge while explaining the data well – a.k.a. maximum a posteriori (MAP) estimation; issue: how do we define the prior? (MAP is the Bayesian's point of view.)

Illustration of Bayesian estimation: prior p(θ), likelihood p(X|θ) with X = (x1, …, xN), posterior p(θ|X) ∝ p(X|θ) p(θ). [Figure: the prior mode, the ML estimate θml, and the posterior mode shown on the same axis.]

Maximum likelihood estimation. Maximize the log-likelihood Σ_w c(w, d) log θ_w subject to the probability requirement Σ_w θ_w = 1 (via a Lagrange multiplier) and set the partial derivatives to zero: this gives θ_w ∝ c(w, d), and since the probabilities must sum to one, we have the ML estimate θ_w = c(w, d) / Σ_w' c(w', d).

Maximum likelihood estimation: p(w|θ) = c(w, d) / |d|, where |d| is the length of the document (or the total number of words in a corpus).

Problem with MLE: unseen events. We estimated a model on 440K word tokens, but only 30,000 unique words occurred, and only 0.04% of all possible bigrams occurred. This means any word or N-gram that does not occur in the training data gets zero probability – so no future document could contain those unseen words/N-grams. (Recall the Zipf plot of word frequency vs. rank.)

Smoothing: if we want to assign non-zero probabilities to unseen words (new words, new N-grams), we must discount the probabilities of observed words. General procedure: (1) reserve some probability mass from the words seen in a document/corpus; (2) re-allocate it to the unseen words.

Illustration of N-gram language model smoothing. [Figure: for each word w, the smoothed LM discounts mass from the seen words relative to the maximum likelihood estimate and assigns nonzero probabilities to the unseen words.]

Smoothing methods: additive smoothing – add a constant δ to the count of each word. For a unigram language model, p(w|d) = (c(w, d) + δ) / (|d| + δ|V|), where c(w, d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size; δ = 1 gives “add one” (Laplace) smoothing. Problems? Hint: are all unseen words really equally important?
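A minimal sketch of this additive (Laplace) smoothing formula; the counts and vocabulary size are illustrative.

```python
def additive_prob(word, counts, doc_length, vocab_size, delta=1.0):
    """p(w | d) = (c(w, d) + delta) / (|d| + delta * |V|); delta = 1 is Laplace ("add one")."""
    return (counts.get(word, 0) + delta) / (doc_length + delta * vocab_size)

counts = {"text": 10, "mining": 5}
print(additive_prob("text", counts, doc_length=100, vocab_size=10000))   # a seen word
print(additive_prob("llama", counts, doc_length=100, vocab_size=10000))  # unseen, yet nonzero
```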

Add-one smoothing for bigrams

After smoothing: add-one gives too much probability mass to the unseen events.

Refine the idea of smoothing: should all unseen words get equal probabilities? We can use a reference language model (e.g., the collection model) to discriminate among unseen words: seen words keep a discounted ML estimate, and the reserved mass is distributed over unseen words in proportion to the reference language model.

Smoothing methods: linear interpolation – use (N−1)-gram probabilities to smooth N-gram probabilities. For example, we may never see the trigram “Bob was reading”, but we do see the bigram “was reading”, so the trigram estimate is interpolated with the bigram estimate.

Smoothing methods: linear interpolation, generalized recursively – the smoothed N-gram is a weighted combination of the ML N-gram estimate and the smoothed (N−1)-gram estimate, and so on down the orders. A sketch follows.
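A hedged sketch of the bigram case. In the slides' recursive formulation the lower-order model would itself be smoothed; here, to keep the sketch short, it is just the ML unigram, and the interpolation weight and toy counts are made up.

```python
def interpolated_bigram(prev, w, bigram_counts, unigram_counts, total_words, lam=0.8):
    """p(w | prev) = lam * p_ML(w | prev) + (1 - lam) * p_ML(w)."""
    prev_count = sum(c for (a, _), c in bigram_counts.items() if a == prev)
    p_bi = bigram_counts.get((prev, w), 0) / prev_count if prev_count else 0.0
    p_uni = unigram_counts.get(w, 0) / total_words
    return lam * p_bi + (1 - lam) * p_uni

bigram_counts = {("was", "reading"): 3, ("was", "here"): 1}
unigram_counts = {"was": 4, "reading": 3, "here": 1}
print(interpolated_bigram("was", "reading", bigram_counts, unigram_counts, total_words=8))
```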

Smoothing methods (continued): further variants recursively combine the smoothed N-gram with the smoothed (N−1)-gram. We will come back to this later.

What you should know: the basic ideas of the vector space model, the procedure for constructing a VS representation of a document, the two important heuristics in the bag-of-words representation (TF and IDF), and the similarity metric for the VS model.