A Survey on Text Categorization with Machine Learning
Chikayama lab. Dai Saito
Introduction: Text Categorization
- Many digital texts are available: e-mail, online news, blogs, ...
- The need for automatic text categorization is growing: categorizing without human labor saves time and cost
Introduction: Text Categorization
- Applications: spam filtering, topic categorization
Introduction: Machine Learning
- Build categorization rules automatically from features of the text
- Types of machine learning (ML):
  - Supervised learning: labeling
  - Unsupervised learning: clustering
Introduction: Flow of ML
1. Prepare labeled training texts and extract their features
2. Learn a model
3. Categorize new texts
(Figure: training texts under Label1 and Label2, and a new text marked "?")
Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion
Number of Labels
- Binary-label: true or false (e.g., spam or not); binary classifiers can also be combined to handle the other settings
- Multi-label: many labels, but each text gets exactly one of them
- Overlapping-label: one text may carry several labels
(Figure: a yes/no split, a choice among labels L1-L4, and overlapping labels L1-L4)
Types of Labels
- Topic categorization: the basic task; comparing individual words works well
- Author categorization
- Sentiment categorization (e.g., product reviews): these need more linguistic information
Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion
Feature of Text
- How to express the features of a text? "Bag of Words": ignore word order and structure (a minimal sketch follows below)
- Example: "I like this car." vs. "I don't like this car.": "Bag of Words" will not work well here, since the two word sets are nearly identical
- Notation: d = document (text), t = term (word)
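A minimal sketch of the bag-of-words idea in Python; the example sentences are from the slide, everything else is illustrative:

```python
from collections import Counter

def bag_of_words(text):
    """Count term occurrences, ignoring word order and structure."""
    return Counter(text.lower().replace(".", "").split())

d1 = bag_of_words("I like this car.")
d2 = bag_of_words("I don't like this car.")
print(d1)  # Counter({'i': 1, 'like': 1, 'this': 1, 'car': 1})
print(d2)  # the same bag plus "don't": the negation barely changes it
```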
Preprocessing
- Remove stop words: "the", "a", "for", ... (sketched below)
- Stemming: relational -> relate, truly -> true
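A sketch with NLTK's stop-word list and Porter stemmer; note that Porter produces truncated stems (e.g., "relat", "truli") rather than the dictionary forms shown on the slide:

```python
# requires: pip install nltk, then nltk.download("stopwords") once
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

tokens = "the relational model is truly useful for search".split()
processed = [stemmer.stem(t) for t in tokens if t not in stop]
print(processed)  # e.g. ['relat', 'model', 'truli', 'use', 'search']
```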
Term Weighting
- Term frequency (tf): the number of occurrences of a term in a document; terms frequent in a document seem important for categorization
- tf-idf: terms appearing in many documents are not useful for categorization, so tf is discounted by document frequency: tf-idf(t, d) = tf(t, d) * log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t (see the sketch below)
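A small sketch of this weighting on a made-up three-document corpus:

```python
import math
from collections import Counter

docs = ["i like this car",
        "i do not like this car",
        "the car market is growing"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# df(t): in how many documents each term appears
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

# "car" occurs in every document, so log(3/3) = 0 wipes it out;
# rarer terms such as "like" keep a positive weight
print(tfidf(tokenized[0]))
```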
Sentiment Weighting
- For sentiment classification, weight each word as positive or negative
- Constructing a sentiment dictionary from WordNet, a synonym database [04 Kamps et al.]: use the word's distance from "good" and from "bad" in the synonym graph (sketched below)
- Example: d(good, happy) = 2 and d(bad, happy) = 4, so "happy" is closer to "good" and is weighted positive
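A sketch of the distance idea over a toy synonym graph; the edges below are made up for illustration, where a real system would take them from WordNet:

```python
from collections import deque

# hypothetical synonym edges (undirected, listed both ways)
synonyms = {
    "good":  ["fine", "nice"],
    "fine":  ["good", "happy"],
    "nice":  ["good"],
    "happy": ["fine", "glad"],
    "glad":  ["happy", "awful"],
    "bad":   ["poor"],
    "poor":  ["bad", "awful"],
    "awful": ["poor", "glad"],
}

def distance(src, dst):
    """Shortest-path length between two words (BFS over synonyms)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        word, d = queue.popleft()
        if word == dst:
            return d
        for nxt in synonyms.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")

# "happy" is nearer to "good" than to "bad", so it counts as positive
print(distance("good", "happy"), distance("bad", "happy"))  # 2 4
```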
Dimension Reduction
- The full feature matrix has size (#terms) x (#documents), and #terms is roughly the size of a dictionary
- Problems: high calculation cost, and risk of overfitting (the best fit on training data is not the best on real data)
- Goal: choose effective features so as to improve both accuracy and calculation cost
Dimension Reduction
- df-threshold: terms appearing in very few documents (e.g., only one) are not important and can be dropped
- Dependence score: score each term t against each category c_j (for example with a chi-square statistic, sketched below); if t and c_j are independent, the score is zero, so such terms can be dropped
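One common such score is the chi-square statistic over the 2x2 contingency table of term presence vs. category membership; the slide does not name the exact score, so chi-square here is an assumption, and the counts are made up:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square dependence score for a term t and category c_j.

    n11: docs in c_j containing t     n10: docs outside c_j containing t
    n01: docs in c_j without t        n00: docs outside c_j without t
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den

# independent: t appears at the same rate inside and outside c_j -> 0
print(chi_square(10, 10, 40, 40))  # 0.0
# dependent: t appears mostly inside c_j -> large score
print(chi_square(30, 5, 20, 45))
```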
Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion
Learning Algorithm
- Many (almost all?) ML algorithms have been used for text categorization
- Simple approaches: Naïve Bayes, k-Nearest Neighbor
- High-performance approaches: Boosting, Support Vector Machines, hierarchical learning
Naïve Bayes
- Bayes' rule: P(c | d) = P(c) P(d | c) / P(d)
- P(d | c) is hard to calculate directly
- Assumption: each term occurs independently, so P(d | c) = Π_t P(t | c), which can be estimated from term counts in the training data (sketched below)
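A minimal multinomial Naïve Bayes sketch with add-one smoothing; the four training texts are made up:

```python
import math
from collections import Counter, defaultdict

train = [("cheap pills buy now", "spam"),
         ("buy cheap watches", "spam"),
         ("meeting schedule for monday", "ham"),
         ("project meeting notes", "ham")]

prior = Counter(label for _, label in train)
counts = defaultdict(Counter)          # per-class term counts
for text, label in train:
    counts[label].update(text.split())
vocab = {t for c in counts.values() for t in c}

def predict(text):
    scores = {}
    for label in prior:
        total = sum(counts[label].values())
        # log P(c) + sum over terms of log P(t | c), add-one smoothed
        s = math.log(prior[label] / len(train))
        for t in text.split():
            s += math.log((counts[label][t] + 1) / (total + len(vocab)))
        scores[label] = s
    return max(scores, key=scores.get)

print(predict("buy cheap pills"))  # -> 'spam'
```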
k-Nearest Neighbor
- Define a "distance" (similarity) between two texts, e.g. cosine similarity: Sim(d1, d2) = d1 · d2 / (|d1| |d2|) = cos θ
- Find the k training texts with the highest similarity and categorize by majority vote (e.g., k = 3; see the sketch below)
- Since all training texts must be stored and searched, memory and search costs grow with the size of the training data
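A sketch of k-NN with cosine similarity over bag-of-words vectors; the five training texts are made up:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two bag-of-words vectors: a.b / (|a||b|)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

train = [(Counter("buy cheap pills".split()), "spam"),
         (Counter("cheap watches buy".split()), "spam"),
         (Counter("meeting on monday".split()), "ham"),
         (Counter("notes from the meeting".split()), "ham"),
         (Counter("monday project meeting".split()), "ham")]

def knn(text, k=3):
    vec = Counter(text.split())
    nearest = sorted(train, key=lambda ex: cosine(vec, ex[0]),
                     reverse=True)[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)   # majority vote

print(knn("cheap pills for monday"))  # -> 'spam' (2 of the 3 neighbors)
```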
Boosting
- BoosTexter [00 Schapire et al.]: AdaBoost applied to text
- AdaBoost builds many "weak learners" with different parameters
- The K-th weak learner looks at where learners 1..K-1 performed worst, and tries to classify those hardest (most heavily reweighted) training examples correctly
- BoosTexter uses a decision stump, a one-level decision rule, as the weak learner (sketched below)
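A minimal sketch of boosting decision stumps with scikit-learn; the data is made up, and note the estimator argument is called base_estimator in scikit-learn versions before 1.2:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# made-up term-count vectors over [cheap, buy, meeting, monday]
X = [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 2, 1], [0, 0, 1, 1], [1, 0, 1, 0]]
y = ["spam", "spam", "ham", "ham", "ham"]

# each stump tests a single feature; boosting reweights the training
# examples that the previous stumps misclassified
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=10)
clf.fit(X, y)
print(clf.predict([[1, 1, 0, 0]]))  # -> ['spam']
```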
Simple Example of Boosting
(Figure: three boosting rounds on a line of "+" and "-" training points; each round a decision stump splits the points, and misclassified points are reweighted so the next stump focuses on them)
Support Vector Machine
- Text categorization with SVM [98 Joachims]
- Find the separating hyperplane that maximizes the margin between the two classes: maximize 2 / |w| subject to y_i (w · x_i + b) >= 1 for every training example (x_i, y_i); see the sketch below
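A sketch of a linear SVM on tf-idf vectors with scikit-learn; the miniature corpus is made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["buy cheap pills", "cheap watches buy now",
         "meeting on monday", "project meeting notes"]
labels = ["spam", "spam", "ham", "ham"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)       # sparse tf-idf document vectors

clf = LinearSVC()                  # linear, soft-margin SVM
clf.fit(X, labels)
print(clf.predict(vec.transform(["cheap meeting pills"])))  # e.g. ['spam']
```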
Text Categorization with SVM
- SVM works well for text categorization: it is robust in high-dimensional feature spaces and robust against overfitting
- Most text categorization problems are linearly separable: all of OHSUMED (a MEDLINE collection) and most of Reuters-21578 (a news collection)
Comparison of These Methods [02 Sebastiani]
Reuters-21578, two versions differing in the number of categories:

  Method        Ver.1 (90 categories)   Ver.2 (10 categories)
  k-NN          .860                    .823
  Naïve Bayes   .795                    .815
  Boosting      .878                    -
  SVM           .870                    .920
Hierarchical Learning
- TreeBoost [06 Esuli et al.]: a boosting algorithm for hierarchical labels
- Training data: a label hierarchy plus texts with labels; AdaBoost is applied recursively at each node of the hierarchy (see the sketch below)
- Yields a better classifier than "flat" AdaBoost: accuracy improves by 2-3%, and both training and categorization time go down
- Hierarchical SVM [04 Cai et al.] applies the same idea to SVMs
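A sketch of the recursive routing idea on the label tree from the next slide; the per-node keyword scoring below is a made-up stand-in, where TreeBoost would train an AdaBoost classifier at each internal node:

```python
# label hierarchy: node -> children (leaves are final labels)
tree = {"root": ["L1", "L2", "L3", "L4"],
        "L1":   ["L11", "L12"],
        "L4":   ["L41", "L42", "L43"],
        "L42":  ["L421", "L422"]}

# toy stand-in for the per-node classifiers: keyword overlap
keywords = {"L4": {"sports"}, "L42": {"soccer"}, "L421": {"league"}}

def score(label, text):
    return len(keywords.get(label, set()) & set(text.split()))

def classify(text, node="root"):
    """Descend the hierarchy, picking one child per internal node."""
    while node in tree:
        node = max(tree[node], key=lambda c: score(c, text))
    return node

print(classify("soccer league sports"))  # -> 'L421'
```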
TreeBoost
(Figure: example label hierarchy: root -> L1, L2, L3, L4; L1 -> L11, L12; L4 -> L41, L42, L43; L42 -> L421, L422)
Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion
Conclusion
- Overview of text categorization with machine learning: features of text and learning algorithms
- Future work: natural language processing with machine learning, especially in Japanese, and reducing calculation cost