Vector Space Classification
1. Vector space text classification
2. Rocchio text classification
Vector Space Classification
Using Projection to handle 2D and 3D graphs
Rocchio Text Classification
Illustration of Rocchio Text Categorization
Rocchio Text Categorization Algorithm (Training)
Assume the set of categories is {c_1, c_2, …, c_n}
For i from 1 to n:
    let p_i = <0, 0, …, 0> (init. prototype vectors)
For each training example <x, c(x)> in D:
    Let d be the frequency-normalized TF/IDF term vector for doc x
    Let i = j such that c_j = c(x)
    Let p_i = p_i + d (sum all the document vectors in c_i to get p_i)
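The training step above can be sketched in a few lines of Python. This is only an illustration, under the assumption that each document has already been turned into a TF/IDF vector stored as a dict from term to weight. The function name `train_rocchio` and the toy documents are hypothetical and are not part of the original slides.

```python
from collections import defaultdict


def train_rocchio(examples):
    """Sum the document vectors of each class into one prototype vector.

    `examples` is an iterable of (vector, label) pairs. Each vector is a dict
    mapping term -> weight (assumed here to be precomputed TF/IDF weights).
    """
    # A missing class starts as an all-zero prototype (empty sparse vector).
    prototypes = defaultdict(lambda: defaultdict(float))
    for vector, label in examples:
        proto = prototypes[label]
        for term, weight in vector.items():
            proto[term] += weight  # p_i = p_i + d, one component at a time
    # Convert the nested defaultdicts to plain dicts before returning.
    return {label: dict(proto) for label, proto in prototypes.items()}


# Toy training set (made-up weights, for illustration only).
training = [
    ({"ball": 1.0, "goal": 2.0}, "sports"),
    ({"ball": 0.5, "team": 1.0}, "sports"),
    ({"vote": 2.0, "party": 1.0}, "politics"),
]
prototypes = train_rocchio(training)
```

Sparse dicts are used here because most terms have zero weight in any one document. A dense array would work equally well for a small vocabulary.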
Rocchio Text Categorization Algorithm (Test)
Given test document x
Let d be the TF/IDF weighted term vector for x
Let m = -2 (init. maximum cosSim)
For i from 1 to n: (compute similarity to prototype vector)
    Let s = cosSim(d, p_i)
    if s > m:
        let m = s
        let r = c_i (update most similar class prototype)
Return class r
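A minimal Python sketch of the test step follows, assuming the prototypes are already available as sparse dicts from term to weight. The helper names `cos_sim` and `classify_rocchio` and the example prototypes are hypothetical and chosen only for illustration. Cosine similarity always lies in [-1, 1], which is why -2 is a safe initial maximum.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two sparse vectors stored as term -> weight dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


def classify_rocchio(doc, prototypes):
    """Return the label whose prototype is most cosine-similar to `doc`."""
    best_sim = -2.0  # below any possible cosine value
    best_label = None
    for label, proto in prototypes.items():
        s = cos_sim(doc, proto)
        if s > best_sim:
            best_sim, best_label = s, label
    return best_label


# Hypothetical prototypes, e.g. produced by summing training vectors per class.
prototypes = {
    "sports": {"ball": 1.5, "goal": 2.0, "team": 1.0},
    "politics": {"vote": 2.0, "party": 1.0},
}
label = classify_rocchio({"goal": 1.0, "team": 1.0}, prototypes)
```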
Rocchio Anomaly
Prototype models have problems with polymorphic (disjunctive) categories: a single prototype per class cannot represent a class that consists of several separate clusters. (Sec. 14.2)
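The problem can be reproduced with a tiny made-up example. In the sketch below, class "A" is disjunctive: one of its documents uses only the term "river" and the other only "money". Class "B" has a single document that mixes both terms. All names, terms and weights are hypothetical. The summed prototype for "A" points between its two clusters, so a test document identical to one of A's own training documents ends up closer to B's prototype.

```python
import math
from collections import defaultdict


def cos_sim(a, b):
    """Cosine similarity of two sparse term -> weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train(examples):
    """Sum the vectors of each class into a prototype."""
    protos = defaultdict(lambda: defaultdict(float))
    for vec, label in examples:
        for t, w in vec.items():
            protos[label][t] += w
    return {label: dict(p) for label, p in protos.items()}


def classify(doc, protos):
    """Pick the class with the most cosine-similar prototype."""
    return max(protos, key=lambda label: cos_sim(doc, protos[label]))


training = [
    ({"river": 1.0}, "A"),                # first cluster of class A
    ({"money": 1.0}, "A"),                # second cluster of class A
    ({"river": 1.0, "money": 0.4}, "B"),  # class B lies near the first cluster
]
protos = train(training)

river_doc = {"river": 1.0}  # identical to a training document of class A
sim_a = cos_sim(river_doc, protos["A"])
sim_b = cos_sim(river_doc, protos["B"])
predicted = classify(river_doc, protos)
```

Here `sim_a` is cos(45°), about 0.71, while `sim_b` is about 0.93, so the document is assigned to "B" even though it appears in the training set with label "A".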
Properties
Rocchio classification (Sec. 14.2)
Rocchio forms a simple representation for each class: the centroid/prototype
Classification is based on similarity to / distance from the prototype/centroid
It does not guarantee that classifications are consistent with the given training data
It is little used outside text classification
    – It has been used quite effectively for text classification
    – But it is in general worse than Naïve Bayes
Again, it is cheap both to train and to classify test documents
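As a short worked note on "centroid/prototype": the centroid of a class is the mean of its document vectors, while the training algorithm accumulates their sum. The notation below is an assumption made for illustration: D_c is the set of training documents in class c, v(d) is the TF/IDF vector of d, and p_c is the summed prototype.

```latex
\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d),
\qquad
\vec{p}_c = \sum_{d \in D_c} \vec{v}(d) = |D_c| \, \vec{\mu}(c)

\cos\bigl(\vec{v}(x), \vec{p}_c\bigr)
  = \frac{\vec{v}(x) \cdot |D_c| \, \vec{\mu}(c)}
         {\lVert \vec{v}(x) \rVert \; |D_c| \, \lVert \vec{\mu}(c) \rVert}
  = \cos\bigl(\vec{v}(x), \vec{\mu}(c)\bigr)
```

Because cosine similarity ignores vector length, the sum and the centroid give the same classification. This does not hold when Euclidean distance is used instead, in which case the sum would need to be divided by |D_c| first.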
References
Stefan Büttcher, Charles L. A. Clarke, Gordon V. Cormack. Information retrieval. MIT Press.
Rocchio, J. J. Relevance feedback in information retrieval. In Salton (1971b), pp