Bag-of-Words Methods for Text Mining
CSCI-GA.2590 – Lecture 2A
Ralph Grishman, NYU
Bag of Words Models

Do we really need elaborate linguistic analysis? Let's look at three text mining applications:
- document retrieval
- opinion mining
- association mining

and see how far we can get with document-level bag-of-words models, introducing some of our mathematical approaches along the way.
- document retrieval
- opinion mining
- association mining
Information Retrieval

Task: given a query (a list of keywords), identify and rank relevant documents from a collection.

Basic idea: find the documents whose set of words most closely matches the words in the query.
Topic Vector

Suppose the document collection has n distinct words, w_1, …, w_n. Each document is characterized by an n-dimensional vector whose i-th component is the frequency of word w_i in the document.

Example:
D1 = [The cat chased the mouse.]
D2 = [The dog chased the cat.]
W = [the, chased, dog, cat, mouse] (n = 5)
V1 = [2, 1, 0, 1, 1]
V2 = [2, 1, 1, 1, 0]
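A minimal sketch of building such term-frequency vectors in Python (the tokenizer and the sorted vocabulary order are illustrative choices, not part of the slides):

```python
import re
from collections import Counter

def tokenize(text):
    # lowercase and split on runs of non-letters
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def topic_vector(doc, vocab):
    # i-th component = frequency of vocab[i] in the document
    counts = Counter(tokenize(doc))
    return [counts[w] for w in vocab]

docs = ["The cat chased the mouse.", "The dog chased the cat."]
vocab = sorted({t for d in docs for t in tokenize(d)})
# vocab = ['cat', 'chased', 'dog', 'mouse', 'the']
vectors = [topic_vector(d, vocab) for d in docs]
print(vectors)  # [[1, 1, 0, 1, 2], [1, 1, 1, 0, 2]]
```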
Weighting the components

Unusual words like "elephant" determine the topic much more than common words such as "the" or "have". We can ignore words on a stop list, or weight each term frequency tf_i by its inverse document frequency idf_i:

idf_i = log(N / n_i)

where N = size of the collection and n_i = number of documents containing term i.
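A sketch of applying this weighting to the count vectors above, assuming the standard idf_i = log(N / n_i) form (log base and any idf smoothing are choices the slide leaves open):

```python
import math

def tfidf_vectors(vectors):
    # vectors: term-frequency vectors over a shared vocabulary
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    # n_i = number of documents containing term i
    df = [sum(1 for v in vectors if v[i] > 0) for i in range(n_terms)]
    idf = [math.log(n_docs / df[i]) for i in range(n_terms)]
    return [[tf * w for tf, w in zip(v, idf)] for v in vectors]
```

With only two documents, the words shared by both ("the", "cat", "chased") get weight log(2/2) = 0, an extreme illustration of how common words are discounted.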
Cosine similarity metric

Define a similarity metric between topic vectors. A common choice is cosine similarity, a normalized dot product:

sim(A, B) = (A · B) / (|A| |B|) = Σ_i a_i b_i / (√(Σ_i a_i²) √(Σ_i b_i²))
Cosine similarity metric

The cosine similarity metric is the cosine of the angle between the term vectors.

[Figure: two term vectors plotted on axes w_1 and w_2; the similarity is the cosine of the angle between them.]
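A direct implementation of the metric (the zero-vector guard is a defensive choice not discussed on the slides):

```python
import math

def cosine_similarity(a, b):
    # cosine of the angle between vectors a and b
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity([2, 1, 0, 1, 1], [2, 1, 1, 1, 0]))  # 6/7 ≈ 0.857
```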
Verdict: a success

For heterogeneous text collections, the vector space model, tf-idf weighting, and cosine similarity have been the basis for successful document retrieval for over 50 years.
- stemming is required for inflected languages
- limited resolution: it returns documents, not answers
- document retrieval
- opinion mining
- association mining
Opinion Mining

Task: judge whether a document expresses a positive or negative opinion (or no opinion) about an object or topic
- a classification task
- valuable for producers and marketers of all sorts of products

Simple strategy: bag of words
- make lists of positive and negative words
- see which predominate in a given document (and mark the document as 'no opinion' if there are few words of either type)
- problem: such lists are hard to make, and they differ from topic to topic
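A minimal sketch of this word-list strategy; the two lexicons here are tiny illustrative stand-ins, not real sentiment lists:

```python
POSITIVE = {"great", "excellent", "reliable", "cheap"}   # illustrative only
NEGATIVE = {"poor", "failed", "broken", "expensive"}     # illustrative only

def word_list_opinion(tokens):
    # see which lexicon predominates in the document
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "no opinion"   # few (or balanced) words of either type
```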
Training a Model

Instead, label the documents in a corpus and train a classifier from that corpus:
- labeling is easier than thinking up words
- in effect, we learn the positive and negative words from the corpus

Probabilistic classifier: identify the most likely class

s = argmax_{t ∈ {pos, neg}} P(t | W)
Using Bayes' Rule

argmax_t P(t | W) = argmax_t P(W | t) P(t) / P(W)
                  = argmax_t P(W | t) P(t)
                  ≈ argmax_t P(t) ∏_i P(w_i | t)

The last step is based on the naïve assumption of independence of the word probabilities.
Training

We now estimate these probabilities from the training corpus (of N documents) using maximum likelihood estimators:

P(t) = count(docs labeled t) / N

P(w_i | t) = count(docs labeled t containing w_i) / count(docs labeled t)
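A sketch of the whole classifier under these estimates. It scores only the words present in a document (a common simplification) and counts each word once per document, matching the document-count estimators above:

```python
import math
from collections import defaultdict

def train(labeled_docs):
    # labeled_docs: list of (set_of_words, label) pairs, N documents in all
    n = len(labeled_docs)
    class_count = defaultdict(int)
    word_count = defaultdict(lambda: defaultdict(int))
    for words, label in labeled_docs:
        class_count[label] += 1
        for w in words:                 # each word counted once per document
            word_count[label][w] += 1
    priors = {t: class_count[t] / n for t in class_count}
    cond = {t: {w: word_count[t][w] / class_count[t] for w in word_count[t]}
            for t in class_count}
    return priors, cond

def classify(words, priors, cond):
    # most likely class: argmax over t of log P(t) + sum of log P(w | t)
    best, best_score = None, float("-inf")
    for t in priors:
        score = math.log(priors[t])
        for w in words:
            p = cond[t].get(w, 0.0)
            score += math.log(p) if p > 0 else float("-inf")
        if score > best_score:
            best, best_score = t, score
    return best
```

Note the float("-inf") branch: a single word never seen with a class drives that class's score to minus infinity, which is exactly the problem the next slides address.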
A Problem

Suppose a glowing review GR (with lots of positive words) includes one word, "mathematical", previously seen only in negative reviews.

P(positive | GR) = ?
P(positive | GR) = 0, because P("mathematical" | positive) = 0.

The maximum likelihood estimate is poor when there is very little data. We need to 'smooth' the probabilities to avoid this problem.
Laplace Smoothing

A simple remedy is to add 1 to each count:
- to keep the estimates valid probabilities, we increase the denominator N by the number of outcomes, which is 2 for P(t) (the values 'positive' and 'negative')
- for the conditional probabilities P(w | t) there are similarly two outcomes (w is present or absent)
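The same estimators with add-one smoothing, following the slide's counting scheme:

```python
def train_smoothed(labeled_docs, classes=("pos", "neg")):
    # labeled_docs: list of (set_of_words, label) pairs
    n = len(labeled_docs)
    class_count = {t: 0 for t in classes}
    word_count = {t: {} for t in classes}
    vocab = set()
    for words, label in labeled_docs:
        class_count[label] += 1
        for w in words:
            word_count[label][w] = word_count[label].get(w, 0) + 1
            vocab.add(w)
    # add 1 to each count; each denominator grows by its 2 possible outcomes
    priors = {t: (class_count[t] + 1) / (n + 2) for t in classes}
    cond = {t: {w: (word_count[t].get(w, 0) + 1) / (class_count[t] + 2)
                for w in vocab}
            for t in classes}
    return priors, cond
```

Now P("mathematical" | positive) is 1 / (count(positive docs) + 2) rather than 0, so one unseen word no longer vetoes a class.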
An NLTK Demo

Sentiment analysis with Python NLTK text classification: http://text-processing.com/demo/sentiment/

NLTK code (simplified classifier): http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier
Is "low" a positive or a negative term?
Ambiguous terms

"low" can be positive ("low price") or negative ("low quality").
Negation

How do we handle "the equipment never failed"?
Negation

Modify the words following a negation:
"the equipment never NOT_failed"
and treat them as a separate 'negated' vocabulary.
Negation: how far to go?

"the equipment never failed and was cheap to run"
→ "the equipment never NOT_failed NOT_and NOT_was NOT_cheap NOT_to NOT_run"

We have to determine the scope of the negation.
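A common heuristic (an assumption here, not something the slides specify) is to mark every token from a negation word up to the next punctuation; both word lists below are illustrative:

```python
NEGATIONS = {"not", "never", "no", "n't"}    # illustrative negation list
END_OF_SCOPE = {".", ",", ";", "!", "?"}     # assumed scope boundary

def mark_negation(tokens):
    # prefix tokens with NOT_ from a negation word to the next punctuation
    out, negating = [], False
    for tok in tokens:
        if tok in END_OF_SCOPE:
            negating = False
            out.append(tok)
        elif tok in NEGATIONS:
            negating = True
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
    return out

print(mark_negation("the equipment never failed and was cheap to run".split()))
# ['the', 'equipment', 'never', 'NOT_failed', 'NOT_and', 'NOT_was',
#  'NOT_cheap', 'NOT_to', 'NOT_run']
```

With no punctuation before the end of the sentence, this reproduces the slide's over-marked example, which is precisely why scope determination matters.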
Verdict: mixed

A simple bag-of-words strategy with a naïve Bayes model works quite well for simple reviews referring to a single item, but it fails:
- for ambiguous terms
- for negation
- for comparative reviews
- to reveal the individual aspects of an opinion: "the car looked great and handled well, but the wheels kept falling off"
- document retrieval
- opinion mining
- association mining
Association Mining

Goal: find interesting relationships among attributes of an object in a large collection … objects with attribute A also have attribute B
- e.g., "people who bought A also bought B"

For text: documents with term A also have term B
- widely used in the scientific and medical literature
Bag-of-words

Simplest approach: look for words A and B for which
frequency(A and B in the same document) >> frequency(A) × frequency(B)

This doesn't work well:
- we want to find names (of companies, products, genes), not individual words
- we are interested in specific types of terms
- we want to learn from a few examples, so we need contexts to avoid noise
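The co-occurrence test can be read as pointwise mutual information over document frequencies; that framing is my gloss, not the slide's term. A sketch:

```python
import math

def doc_pmi(docs, a, b):
    # docs: list of token sets; compare the joint document frequency of
    # a and b to what independence would predict
    n = len(docs)
    p_a = sum(1 for d in docs if a in d) / n
    p_b = sum(1 for d in docs if b in d) / n
    p_ab = sum(1 for d in docs if a in d and b in d) / n
    if p_ab == 0:
        return float("-inf")
    return math.log(p_ab / (p_a * p_b))
```

A large positive value means A and B co-occur in documents far more often than chance would predict.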
Needed Ingredients

Effective text association mining needs:
- name recognition
- term classification
- preferably, the ability to learn patterns
- preferably, a good GUI

Demo: www.coremine.com
Conclusion

Some tasks can be handled effectively (and very simply) by bag-of-words models, but most benefit from an analysis of language structure.