Published by Jemimah Lamb. Modified over 9 years ago.
1
Intelligent Database Systems Lab
A comparative study of TF*IDF, LSI and multi-words for text classification
Presenter: JIAN-REN CHEN
Authors: Wen Zhang, Taketoshi Yoshida, Xijin Tang
2011, ESWA
2
Outline
- Motivation
- Objectives
- Methodology
- Experiments
- Conclusions
- Comments
3
Motivation
Although TF*IDF, LSI and multi-word indexing were proposed long ago, there is no comparative study of these indexing methods, and no results have been reported on their classification performance.
4
Objectives
A comparative study of TF*IDF, LSI and multi-words for text classification.
- information retrieval
- text categorization
Indexing term criteria: semantic quality, statistical quality
5
Methodology - TF*IDF
1) w_{i,j}: the weight of term i in document j
2) N: the number of documents in the collection
3) tf_{i,j}: the term frequency of term i in document j
4) df_i: the document frequency of term i in the collection
[Figure: term-document matrix, labeled "Terms (keywords) of the document collection" and "documents"]
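The weighting formula itself is an image on the slide and did not survive in the transcript. A minimal sketch, assuming the common form w_{i,j} = tf_{i,j} * log(N / df_i) with a natural logarithm (both the exact form and the log base are assumptions, and `tfidf_weights` is a hypothetical helper name), shows how the four quantities above fit together:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed formula (not confirmed by the slide): w_ij = tf_ij * log(N / df_i).
    """
    n_docs = len(docs)  # N: number of documents in the collection
    # df_i: number of documents that contain term i at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: raw count of term i in document j
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy collection of three tokenized documents (illustrative data only).
docs = [
    ["oil", "price", "oil"],
    ["trade", "price"],
    ["interest", "rate", "trade"],
]
w = tfidf_weights(docs)
```

Under this form, a term that appears in every document gets weight 0, and a term that is frequent in one document but rare in the collection gets the largest weight.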
6
Methodology - LSI
Given a term-document matrix X = [x_1, x_2, ..., x_n], where each x_i ∈ R^m, and supposing the rank of X is r, LSI decomposes X using SVD as follows:
1. X_k = U_k Σ_k V_k^T
2. [equation image not included in the transcript]
[Figure: term-document matrix, labeled "Terms (keywords) of the document collection" and "documents"]
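The rank-k truncation in step 1 can be sketched with NumPy's `numpy.linalg.svd`. The toy matrix, the choice k = 2, and the helper name `lsi_truncate` are assumptions made for illustration; representing documents as the columns of Σ_k V_k^T is one common convention, not necessarily the one used in the paper.

```python
import numpy as np

def lsi_truncate(X, k):
    """Return (U_k, S_k, Vt_k): the k largest singular triplets of X."""
    # full_matrices=False gives the thin SVD: U is m x r', s has r' values,
    # Vt is r' x n, with r' = min(m, n) and s sorted in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Toy term-document matrix: 4 terms (rows) x 3 documents (columns).
X = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 0.0],
])

U_k, S_k, Vt_k = lsi_truncate(X, k=2)
X_k = U_k @ S_k @ Vt_k      # rank-k approximation X_k = U_k S_k V_k^T
doc_vectors = S_k @ Vt_k    # column j is document j in the k-dim latent space
```

Classification or retrieval can then operate on the k-dimensional columns of `doc_vectors` instead of the original m-dimensional term vectors.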
7
Methodology - Multi-word
- The length of a multi-word should be between 2 and 6 words.
- It should occur at least twice in a document.
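The two constraints above can be illustrated with plain n-gram counting. This is a simplified stand-in, not necessarily the extraction procedure used in the paper; `candidate_multiwords` and its parameter names are hypothetical, and whitespace tokenization is an assumption.

```python
from collections import Counter

def candidate_multiwords(tokens, min_len=2, max_len=6, min_freq=2):
    """Count contiguous word sequences of min_len..max_len words and keep
    those that occur at least min_freq times in one tokenized document."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of n words over the document.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_freq}

doc = "text classification needs good indexing and text classification is hard".split()
result = candidate_multiwords(doc)
print(result)  # {'text classification': 2}
```

Only "text classification" is kept here: every other sequence of 2 to 6 words occurs once in the toy document.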
8
Experiments - Datasets
Chinese corpus: TanCorpV1.0
  Full corpus: 14,150 documents, 20 categories
  Selected: 1,200 documents, 219,115 sentences, 5,468,301 individual words
  Categories: agriculture, history, politics, economy
English corpus: Reuters-22173 distribution 1.0
  Full corpus: 22,173 documents, 135 categories
  Selected: 2,032 documents, 50,837 sentences, 281,111 individual words
  Categories: crude (520), agriculture (574), trade (514), interest (424)
9
Experiments - Evaluation
10
Experiments - Chinese
11
Experiments - English
12
Experiments - t-test
13
Comparison
Method     | Information retrieval | Text categorization | Computation complexity
TF*IDF     | Chinese               | -                   | O(nm)
LSI        | English               | best                | O(n^2 r^3)
multi-word | -                     | -                   | O(ms^2)
14
Conclusions
- LSI can produce indexing with better discriminative power.
- LSI and multi-word have better semantic quality than TF*IDF, while TF*IDF has better statistical quality than the other two methods.
- The number of dimensions is still a decisive factor for indexing when different indexing methods are used for classification.
15
Comments
- Advantages: compares TF*IDF, LSI and multi-words
- Disadvantage: semantic quality and statistical quality are judged only by intuition rather than by theory
- Applications: text mining