Slide 1: Document Classification with Naïve Bayes -- How to Build Yahoo Automatically
Andrew McCallum, Just Research & CMU
www.cs.cmu.edu/~mccallum
Joint work with Kamal Nigam, Jason Rennie, Kristie Seymore, Tom Mitchell, Sebastian Thrun, Roni Rosenfeld, Andrew Ng.
Slide 6: Document Classification
(Figure: an example classification task.)
Categories: Multimedia, GUI, Garb. Coll., Semantics, ML, Planning
Training data: Planning: "planning temporal reasoning plan language..."; Semantics: "programming semantics types language proof..."; ML: "learning algorithm reinforcement intelligence network..."; Garb. Coll.: "garbage collection memory optimization region..."
Testing data: "planning language semantics proof intelligence" -> (Planning)
Slide 7: A Probabilistic Approach to Document Classification
Pick the most probable class, given the evidence:
  c - a class (like "Planning")
  d - a document (like "language intelligence proof...")
  w_i - the i-th word in d (like "proof")
Bayes Rule: Pr(c|d) = Pr(c) Pr(d|c) / Pr(d)
With (1) one mixture component per class and (2) an independence assumption over words, we get "Naïve Bayes":
  Pr(c|d) ∝ Pr(c) · Pr(w_1|c) · Pr(w_2|c) · ... · Pr(w_n|c)
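A minimal Python sketch of this decision rule, assuming the class priors Pr(c) and word probabilities Pr(w|c) have already been estimated (slide 9 covers that); the class names and probability values below are illustrative, not numbers from the talk:

```python
import math

def classify(words, prior, word_given_class, unseen=1e-6):
    """Return argmax_c Pr(c) * prod_i Pr(w_i | c), computed in log space."""
    best_class, best_score = None, -math.inf
    for c, pc in prior.items():
        score = math.log(pc)
        for w in words:
            score += math.log(word_given_class[c].get(w, unseen))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy parameters (illustrative only).
prior = {"Planning": 0.5, "Semantics": 0.5}
word_given_class = {
    "Planning":  {"planning": 0.10, "language": 0.03, "proof": 0.01, "intelligence": 0.04},
    "Semantics": {"planning": 0.01, "language": 0.08, "proof": 0.06, "intelligence": 0.01},
}
print(classify("planning language semantics proof intelligence".split(),
               prior, word_given_class))    # -> "Planning", as on slide 6
```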
Slide 8: A Probabilistic Bayesian Approach
Define a probabilistic generative model for documents with classes.
Learn the parameters of this model by fitting them to the data and a prior.
Slide 9: Parameter Estimation in Naïve Bayes
Maximum a posteriori estimate of Pr(w|c), with a Dirichlet prior (AKA "Laplace smoothing"):
  Pr(w|c) = (1 + Σ_{d∈c} N(w,d)) / (|Vocabulary| + Σ_{w'} Σ_{d∈c} N(w',d))
where N(w,d) is the number of times word w occurs in document d.
Two ways to improve this method:
(A) Make less restrictive assumptions about the model.
(B) Get better estimates of the model parameters, i.e. Pr(w|c).
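A minimal sketch of this estimate in Python: per-class word counts with the add-one (Dirichlet/Laplace) prior. The toy training documents are illustrative, not data from the talk:

```python
from collections import Counter, defaultdict

labeled_docs = [
    ("planning temporal reasoning plan language", "Planning"),
    ("programming semantics types language proof", "Semantics"),
]

def estimate_parameters(docs):
    """Return Pr(c) and the Laplace-smoothed Pr(w|c) tables."""
    vocab = {w for text, _ in docs for w in text.split()}
    word_counts = defaultdict(Counter)          # word_counts[c][w] = sum_d N(w,d)
    class_counts = Counter()
    for text, c in docs:
        class_counts[c] += 1
        word_counts[c].update(text.split())

    prior = {c: n / len(docs) for c, n in class_counts.items()}
    word_given_class = {
        c: {w: (1 + word_counts[c][w]) / (len(vocab) + sum(word_counts[c].values()))
            for w in vocab}
        for c in class_counts
    }
    return prior, word_given_class

prior, word_given_class = estimate_parameters(labeled_docs)
print(word_given_class["Planning"]["planning"])   # (1 + 1) / (|Vocabulary| + 5)
```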
Slide 10: The Scenario
Training data with class labels: web pages the user says are interesting; web pages the user says are uninteresting.
Data available at training time, but without class labels: web pages the user hasn't seen or said anything about.
Can we use the unlabeled documents to increase accuracy?
Slide 11: Using the Unlabeled Data
1. Build a classification model using the limited labeled data.
2. Use the model to estimate the labels of the unlabeled documents.
3. Use all the documents to build a new classification model, which is often more accurate because it is trained using more data.
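A simplified sketch of these three steps, reusing the classify() and estimate_parameters() helpers (and the toy labeled_docs) from the earlier sketches. It commits to a single hard label per unlabeled document; the EM procedure on the next slides keeps probabilistic label estimates instead:

```python
unlabeled_docs = [
    "planning language semantics proof intelligence",
    "learning algorithm reinforcement network intelligence",
]

# 1. Build a model from the limited labeled data.
prior, word_given_class = estimate_parameters(labeled_docs)

# 2. Use the model to estimate labels for the unlabeled documents.
guessed = [(text, classify(text.split(), prior, word_given_class))
           for text in unlabeled_docs]

# 3. Train a new model on all documents, labeled and guess-labeled.
prior, word_given_class = estimate_parameters(labeled_docs + guessed)
```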
Slide 12: An Example
Labeled data, Baseball: "The new hitter struck out...", "Pete Rose is not as good an athlete as Tara Lipinski...", "Struck out in last inning...", "Homerun in the first inning..."
Labeled data, Ice Skating: "Fell on the ice...", "Perfect triple jump...", "Katarina Witt's gold medal performance...", "New ice skates...", "Practice at the ice rink every day..."
Unlabeled data: "Tara Lipinski's substitute ice skates didn't hurt her performance. She graced the ice with a series of perfect jumps and won the gold medal." "Tara Lipinski bought a new house for her parents."
Before EM: Pr(Lipinski) = 0.01, Pr(Lipinski) = 0.001
After EM: Pr(Lipinski | Ice Skating) = 0.02, Pr(Lipinski | Baseball) = 0.003
Slide 13: Filling in Missing Labels with EM
Expectation Maximization is a class of iterative algorithms for maximum likelihood estimation with incomplete data. [Dempster et al. '77], [Ghahramani & Jordan '95], [McLachlan & Krishnan '97]
E-step: Use current estimates of the model parameters to "guess" the values of the missing labels.
M-step: Use the current "guesses" for the missing labels to calculate new estimates of the model parameters.
Repeat the E- and M-steps until convergence.
EM finds the model parameters that locally maximize the probability of both the labeled and the unlabeled data.
Slide 14: EM for Text Classification
Expectation-step (estimate the class labels): for each unlabeled document d, compute Pr(c|d) ∝ Pr(c) · Pr(w_1|c) · ... · Pr(w_n|c), normalized over the classes.
Maximization-step (new parameters using the estimates): re-estimate Pr(c) and Pr(w|c) as on slide 9, with each unlabeled document counted fractionally according to its estimated Pr(c|d).
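A self-contained sketch of this EM loop for a multinomial naïve Bayes model, with hypothetical toy data in the spirit of slide 12 (a real run would use thousands of documents and iterate until the likelihood stops improving):

```python
import math
from collections import defaultdict

labeled = [("fell on the ice and hurt my skates", "IceSkating"),
           ("the new hitter struck out in the last inning", "Baseball")]
unlabeled = ["perfect triple jump on the ice",
             "homerun in the first inning"]
classes = ["IceSkating", "Baseball"]
vocab = sorted({w for d, _ in labeled for w in d.split()} |
               {w for d in unlabeled for w in d.split()})

def m_step(soft_labels):
    """Re-estimate Pr(c) and Pr(w|c) from (possibly fractional) class assignments."""
    weight = {c: 0.0 for c in classes}                  # expected number of docs per class
    counts = {c: defaultdict(float) for c in classes}   # expected word counts per class
    for doc, label in labeled:
        weight[label] += 1.0
        for w in doc.split():
            counts[label][w] += 1.0
    for doc, probs in zip(unlabeled, soft_labels):
        for c in classes:
            weight[c] += probs[c]
            for w in doc.split():
                counts[c][w] += probs[c]
    total = sum(weight.values())
    prior = {c: (1 + weight[c]) / (len(classes) + total) for c in classes}
    word_probs = {c: {w: (1 + counts[c][w]) / (len(vocab) + sum(counts[c].values()))
                      for w in vocab}
                  for c in classes}
    return prior, word_probs

def e_step(prior, word_probs):
    """Estimate Pr(c|d) for each unlabeled document via Bayes rule (log space)."""
    soft_labels = []
    for doc in unlabeled:
        log_score = {c: math.log(prior[c]) +
                        sum(math.log(word_probs[c][w]) for w in doc.split())
                     for c in classes}
        m = max(log_score.values())
        z = sum(math.exp(s - m) for s in log_score.values())
        soft_labels.append({c: math.exp(log_score[c] - m) / z for c in classes})
    return soft_labels

# Initialize the model from the labeled data alone, then alternate E- and M-steps.
prior, word_probs = m_step([{c: 0.0 for c in classes} for _ in unlabeled])
for _ in range(10):
    soft_labels = e_step(prior, word_probs)
    prior, word_probs = m_step(soft_labels)
print(soft_labels)   # Pr(c|d) for each unlabeled document
```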
Slide 15: WebKB Data Set
4 classes: student, faculty, course, project
4,199 documents from CS academic departments
Slide 16: Word Vector Evolution with EM
(Top words per class at successive EM iterations; D stands for a digit.)
Iteration 0: intelligence, DD, artificial, understanding, DDw, dist, identical, rus, arrange, games, dartmouth, natural, cognitive, logic, proving, prolog
Iteration 1: DD, D, lecture, cc, D*, DD:DD, handout, due, problem, set, tay, DDam, yurtas, homework, kfoury, sec
Iteration 2: D, DD, lecture, cc, DD:DD, due, D*, homework, assignment, handout, set, hw, exam, problem, DDam, postscript
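Word lists like these are simply the highest-probability entries of Pr(w|c) after each EM iteration; a small sketch, reusing the word_given_class table and class names from the earlier sketches:

```python
def top_words(word_given_class, c, k=15):
    """Return the k most probable words for class c under Pr(w|c)."""
    return sorted(word_given_class[c], key=word_given_class[c].get, reverse=True)[:k]

print(top_words(word_given_class, "Planning"))
```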
Slide 17: EM as Clustering
(Figure; x = unlabeled points.)
Slide 18: EM as Clustering, Gone Wrong
(Figure.)
Slide 19: 20 Newsgroups Data Set
20 class labels, 20,000 documents, 62k unique words.
Classes include: comp.sys.mac.hardware, comp.sys.ibm.pc.hardware, comp.os.ms-windows.misc, alt.atheism, comp.graphics, comp.windows.x, rec.sport.baseball, rec.sport.hockey, talk.politics.mideast, talk.politics.guns, talk.politics.misc, talk.religion.misc, sci.crypt, sci.electronics, sci.med, sci.space, …
Slide 20: Newsgroups Classification Accuracy, varying # labeled documents (results plot)
Slide 21: Newsgroups Classification Accuracy, varying # unlabeled documents (results plot)
Slide 22: WebKB Classification Accuracy, varying # labeled documents (results plot)
Slide 23: WebKB Classification Accuracy, varying weight of the unlabeled data (results plot)
Slide 24: WebKB Classification Accuracy, varying # labeled documents and selecting the unlabeled-data weight by cross-validation (results plot)
Slide 25: Populating a hierarchy
Naïve Bayes:
+ Simple, robust document classification.
+ Many principled enhancements (e.g. shrinkage).
– Requires some labeled training data.
Keyword matching:
+ Requires no labeled training data except the keywords themselves.
– Brittle; breaks easily.
Slide 26: Combine Naïve Bayes and Keywords for Best of Both
1. Classify unlabeled documents with keyword matching.
2. Pretend these category labels are correct, and use this data to train naïve Bayes.
Naïve Bayes acts to temper and "round out" the keyword class definitions: it brings in new, probabilistically-weighted keywords that are correlated with the few original keywords.
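A minimal sketch of this bootstrap, reusing estimate_parameters() from the earlier sketch; the keywords and documents below are illustrative, not the ones used in the talk:

```python
keywords = {
    "Planning": ["planning", "plan"],
    "Machine Learning": ["learning", "reinforcement"],
}

unlabeled_pages = [
    "planning temporal reasoning and plan languages",
    "reinforcement learning with neural network function approximation",
    "denotational semantics and type systems",        # matches no keyword, left out
]

# 1. Keyword matching assigns preliminary class labels where it can.
keyword_labeled = []
for page in unlabeled_pages:
    for c, words in keywords.items():
        if any(w in page.split() for w in words):
            keyword_labeled.append((page, c))
            break                                      # first matching class wins

# 2. Train naive Bayes on the keyword-labeled documents; the learned Pr(w|c)
#    now gives probabilistic weight to many words correlated with the keywords.
prior, word_given_class = estimate_parameters(keyword_labeled)
```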
Slide 27: Top words found by naïve Bayes and Shrinkage
ROOT: computer, university, science, system, paper
HCI: computer, system, multimedia, university, paper
IR: information, text, documents, classification, retrieval
Hardware: circuits, designs, computer, university, performance
AI: learning, university, computer, based, intelligence
Programming: programming, language, logic, university, programs
GUI: interface, design, user, sketch, interfaces
Cooperative: collaborative, CSCW, work, provide, group
Multimedia: multimedia, real, time, data, media
Planning: planning, temporal, reasoning, plan, problems
Machine Learning: learning, algorithm, university, networks
NLP: language, natural, processing, information, text
Semantics: semantics, denotational, language, construction, types
Garbage Collection: garbage, collection, memory, optimization, region
Slide 28: Classification Results
400 test documents; 70 classes in a hierarchy of depth 2-4.
Slide 29: Conclusions
Naïve Bayes is a method of document classification based on Bayesian statistics. It has many parameters to estimate, and therefore requires much labeled training data.
We can build on its probabilistic, statistical foundations to improve performance (e.g. unlabeled data + EM).
These techniques are accurate and robust enough to build useful Web services.