1 Two Methods for Improving Text Classification when Training Data is Sparse
Andrew McCallum, Just Research (formerly JPRC) and Carnegie Mellon University.
For more detail, see "Improving Text Classification by Shrinkage in a Hierarchy of Classes" (sub. to ICML-98), McCallum, Rosenfeld, Mitchell, and Ng; and "Learning to Classify Text from Labeled and Unlabeled Documents" (AAAI-98), Nigam, McCallum, Thrun, and Mitchell.

2 The Task: Document Classification
(AKA "Document Categorization", "Routing", or "Tagging") Automatically placing documents in their correct categories.

[Figure: a test document "wheat grow tractor…" is routed into one of six categories (Irrigation, Crops, Botany, Evolution, Magnetism, Relativity), each illustrated with training documents such as "water grating ditch tractor...", "wheat corn silo grow...", "wheat tulips splicing grow...", and "selection mutation Darwin...".]

3 A Probabilistic Approach to Document Classification
Pick the most probable class, given the evidence: c* = argmax_c Pr(c|d), where c is a class (like "Crops") and d is a document (like "wheat grow tractor...").

Bayes rule: Pr(c|d) = Pr(c) Pr(d|c) / Pr(d)

"Naïve Bayes" independence assumption: Pr(c|d) ∝ Pr(c) ∏_i Pr(w_i|c), where w_i is the i-th word in d (like "grow").
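To make the decision rule concrete, here is a minimal Python sketch, computed in log space to avoid underflow; the function and variable names are illustrative, not from the talk:

```python
import math

def classify(doc_words, priors, word_probs):
    """Return argmax_c Pr(c) * prod_i Pr(w_i|c), computed in log space.

    priors:     dict mapping class -> Pr(c)
    word_probs: dict mapping class -> dict mapping word -> Pr(w|c)
    """
    best_class, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for w in doc_words:
            # Tiny floor for words unseen in training (smoothing handles the rest).
            score += math.log(word_probs[c].get(w, 1e-10))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```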

4 Comparison with TFIDF: TFIDF/Rocchio vs. Naïve Bayes
TFIDF/Rocchio scores a document by its similarity to a per-class TFIDF centroid vector; Naïve Bayes scores it as Pr(c|d) = (1/Z) Pr(c) ∏_i Pr(w_i|c), where Z is some normalization constant.
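For contrast, a minimal sketch of the TFIDF/Rocchio side of the comparison, assuming the standard variant (cosine similarity to per-class TFIDF centroids) since the slide's exact weighting formula is not shown here; all names are illustrative:

```python
import math
from collections import Counter, defaultdict

def tfidf_centroids(docs_by_class, idf):
    """Build one TFIDF centroid per class by summing TFIDF document vectors."""
    centroids = {}
    for c, docs in docs_by_class.items():
        cen = defaultdict(float)
        for doc in docs:
            for w, tf in Counter(doc).items():
                cen[w] += tf * idf.get(w, 0.0)
        centroids[c] = cen
    return centroids

def rocchio_classify(doc, centroids, idf):
    """Assign the class whose centroid is most cosine-similar to the document."""
    vec = {w: tf * idf.get(w, 0.0) for w, tf in Counter(doc).items()}
    vnorm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    def cosine(cen):
        cnorm = math.sqrt(sum(v * v for v in cen.values())) or 1.0
        return sum(vec[w] * cen.get(w, 0.0) for w in vec) / (vnorm * cnorm)
    return max(centroids, key=lambda c: cosine(centroids[c]))
```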

5 Parameter Estimation in Naïve Bayes
Bayes optimal estimate of Pr(w|c) (via Laplace smoothing): Pr(w|c) = (1 + N(w,c)) / (|V| + Σ_w' N(w',c)), where N(w,c) is the number of occurrences of word w in training documents of class c and |V| is the vocabulary size. A key problem: getting better estimates of Pr(w|c) when training data is sparse.
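A minimal sketch of that estimator in Python, assuming documents are lists of word tokens; names are illustrative:

```python
from collections import Counter

def estimate_word_probs(docs_by_class, vocabulary):
    """Laplace-smoothed estimate: Pr(w|c) = (1 + N(w,c)) / (|V| + sum_w' N(w',c))."""
    word_probs = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)  # N(w,c) for all w
        total = sum(counts.values())                      # sum_w' N(w',c)
        word_probs[c] = {w: (1 + counts[w]) / (len(vocabulary) + total)
                         for w in vocabulary}
    return word_probs
```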

6 Document Classification in a Hierarchy of Classes
Andrew McCallum, Roni Rosenfeld, Tom Mitchell, Andrew Ng

7 The Idea: “Deleted Interpolation” or “Shrinkage”
We can improve the parameter estimates in a leaf by averaging them with the estimates in its ancestors. This represents a tradeoff between reliability and specificity.

[Figure: the same test document "wheat grow tractor…" and the six leaf categories as before, now arranged in a hierarchy: Science at the root, with children Agriculture (Irrigation, Crops), Biology (Botany, Evolution), and Physics (Magnetism, Relativity), each leaf with its training documents.]

8 “Deleted Interpolation” or “Shrinkage”
[Jelinek and Mercer, 1980], [James and Stein, 1961]

"Deleted interpolation" in n-gram space: Pr(w_i | w_{i-1}, w_{i-2}) = λ_3 Pr_ML(w_i | w_{i-1}, w_{i-2}) + λ_2 Pr_ML(w_i | w_{i-1}) + λ_1 Pr_ML(w_i)

"Deleted interpolation" in class-hierarchy space: Pr(w|c) = λ_1 Pr_ML(w | c_leaf) + λ_2 Pr_ML(w | c_parent) + … + λ_k Pr_ML(w | c_root)

Learn the λ's via EM, performing the E-step with leave-one-out cross-validation.
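A minimal sketch of the hierarchy-space mixture, assuming the λ's have already been learned by the EM procedure just described, and including a uniform component as a final fallback (an assumption here, not stated on the slide); names are illustrative:

```python
def shrinkage_estimate(word, path_estimates, lambdas, vocab_size):
    """Mix Pr_ML(w|c) along the leaf-to-root path, plus a uniform component.

    path_estimates: list of dicts word -> Pr_ML(w | ancestor), leaf first
    lambdas:        mixture weights for this leaf, summing to 1;
                    lambdas[-1] weights the uniform estimate
    """
    prob = lambdas[-1] * (1.0 / vocab_size)         # uniform fallback component
    for lam, est in zip(lambdas, path_estimates):   # leaf, parent, ..., root
        prob += lam * est.get(word, 0.0)
    return prob
```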

9 Experimental Results
Industry Sector dataset: 71 classes, 6.5k documents, 1.2 million words, 30k vocabulary
20 Newsgroups dataset: 15 classes, 15k documents, 1.7 million words, 52k vocabulary
Yahoo Science dataset: 95 classes, 13k documents, 0.6 million words, 44k vocabulary

10 Learning to Classify Text from Labeled and Unlabeled Documents
Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom Mitchell (joint work)

11 The Scenario
Training data with class labels, plus data available at training time but without class labels. This is the scenario in which the technique applies: a small amount of labeled data and vast quantities of unlabeled data. For example: (1) learning a user's interests, (2) Yahoo data, (3) machine learning for information extraction. Concretely: Web pages the user says are interesting, Web pages the user says are uninteresting, and Web pages the user hasn't seen or said anything about. Question: can we use the unlabeled documents to increase accuracy?

12 Using the Unlabeled Data
(1) Build a classification model using the limited labeled data. (2) Use that model to guess the labels of the unlabeled documents (probabilistically). (3) Use all documents to train a new classification model, which is more accurate because it is trained using more data. The answer to the previous slide's question is "yes", in a way that seems almost magic, like pulling yourself up by your bootstraps!

13 Expectation Maximization [Dempster, Laird, Rubin 1977]
EM applies when there are two inter-dependent unknowns: (1) the word probabilities for each class, and (2) the class labels of the unlabeled documents.

E-step: use the current guess of (1) to estimate the value of (2), i.e. use the classification model built from the limited training data to assign probabilistic labels to the unlabeled documents.

M-step: use the probabilistic estimates of (2) to update (1), i.e. use the probabilistic class labels on the unlabeled documents to build a more accurate classification model.

Repeat the E- and M-steps until convergence. This is an instance of a well-known algorithm, also used in clustering, speech recognition, spelling correction, and more.

14 Why it Works -- An Example
It seems like magic; why does it work? Because EM takes advantage of feature co-occurrences (in our case, word co-occurrences). With limited labeled training data for two classes, Baseball and Ice Skating, the estimates of word probabilities are generally bad: some common words get good estimates, but less common ones, e.g. "Lipinski" (or "Yamaguchi"), do not. Seeing the unlabeled data, where such words co-occur with well-estimated words, yields better word probability estimates.

[Figure: labeled documents such as "The new hitter struck out..." (Baseball) and "Tara Lipinski's new ice skates didn't hurt her performance. She graced the ice with a series of perfect jumps and won the gold medal..." (Ice Skating), alongside unlabeled documents such as "Struck out in last inning...", "Homerun in the first inning...", "Fell on the ice...", "Perfect triple jump...", "Katarina Witt's gold medal performance...", "New ice skates...", "Practice at the ice rink every day...", "Tara Lipinski bought a new house for her parents.", and "Pete Rose is not as good an athlete as Tara Lipinski...", annotated with changing estimates Pr(Lipinski) = 0.02, 0.01, 0.003, 0.001.]

15 EM for Text Classification
Expectation-step (guess the class labels): Pr(c|d) ∝ Pr(c) ∏_i Pr(w_i|c), normalized over classes, computed with the current parameters for each unlabeled document d.

Maximization-step (set parameters using the guesses): Pr(w|c) = (1 + Σ_d Pr(c|d) N(w,d)) / (|V| + Σ_w' Σ_d Pr(c|d) N(w',d)), where N(w,d) is the count of word w in document d, i.e. the Laplace-smoothed estimate from slide 5 with counts weighted by the guessed labels.
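A minimal sketch of the full loop in Python, combining these two steps with the Laplace-smoothed estimator from slide 5; documents are assumed to be lists of word tokens, and all names are illustrative:

```python
import math

def em_text_classifier(labeled, unlabeled, classes, vocabulary, iters=10):
    """labeled: list of (word_list, class_label); unlabeled: list of word_list."""
    docs = [d for d, _ in labeled] + unlabeled
    # Hard labels for the labeled documents; no guesses yet for the unlabeled.
    soft = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    for _ in range(iters):
        # M-step: Laplace-smoothed Pr(c) and Pr(w|c) from (fractional) counts.
        # zip truncates, so the first pass uses only the labeled documents.
        priors = {c: 1.0 for c in classes}
        counts = {c: dict.fromkeys(vocabulary, 1.0) for c in classes}  # the "+1"
        for doc, probs in zip(docs, soft):
            for c in classes:
                priors[c] += probs[c]
                for w in doc:
                    if w in counts[c]:
                        counts[c][w] += probs[c]
        z = sum(priors.values())
        priors = {c: p / z for c, p in priors.items()}
        word_probs = {}
        for c in classes:
            total = sum(counts[c].values())
            word_probs[c] = {w: n / total for w, n in counts[c].items()}
        # E-step: probabilistic labels for the unlabeled docs; labeled stay fixed.
        soft = soft[:len(labeled)]
        for doc in unlabeled:
            logs = {c: math.log(priors[c])
                       + sum(math.log(word_probs[c].get(w, 1e-10)) for w in doc)
                    for c in classes}
            m = max(logs.values())
            exps = {c: math.exp(v - m) for c, v in logs.items()}
            zz = sum(exps.values())
            soft.append({c: e / zz for c, e in exps.items()})
    return priors, word_probs
```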

16 Experimental Results -- The Data
Four classes of Web pages (Student, Faculty, Course, Project): 4199 Web pages total
Twenty newsgroups from UseNet (several each of religion, politics, sports, comp.*): 1000 articles per class
News articles from Reuters: 90 different categories, 12902 articles total

Kamal has coded this algorithm up, with experiments on the UseNet data and WebKB. Here, just UseNet: 5 confusable classes. The total data set includes 1000 articles per class, split among training and testing, with the training portion split among labeled and unlabeled.

17 Word Vector Evolution with EM
Iteration 0: intelligence, DD, artificial, understanding, DDw, dist, identical, rus, arrange, games, dartmouth, natural, cognitive, logic, proving, prolog
Iteration 1: DD, D, lecture, cc, D*, DD:DD, handout, due, problem, set, tay, DDam, yurtas, homework, kfoury, sec
Iteration 2: D, DD, lecture, cc, DD:DD, due, D*, homework, assignment, handout, set, hw, exam, problem, DDam, postscript
(D is a digit)

18 Related Work
Using EM to reduce the need for training examples: [Miller and Uyar 1997], [Shahshahani and Landgrebe 1994]
AutoClass, unsupervised EM with Naïve Bayes: [Cheeseman 1988]
Using EM to fill in missing values: [Ghahramani and Jordan 1995]

This is a widely studied technique and phenomenon. EM has been around since at least 1977. Cheeseman's AutoClass system uses it to cluster data in ways very similar to what I presented. Zoubin Ghahramani's work very much inspired mine; he used EM to fill in missing values, and here the class label is the missing value. At this last NIPS conference, Zehra Cataltepe from CalTech presented a poster on… END OF EM. Questions?

