A New Suffix Tree Similarity Measure for Document Clustering
Hung Chim, Xiaotie Deng
WWW '07
1. Document Clustering
Agglomerative Hierarchical Clustering (AHC)
Suffix Tree Clustering (STC): commonly used for clustering search results
2-1. Suffix Tree Clustering
Example with 3 documents:
cat ate cheese
cat ate mouse too
mouse ate cheese too
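To make the idea of a base cluster concrete, here is a minimal Python sketch over the three example documents. It is not a suffix tree: it simply enumerates every contiguous word sequence (quadratic per document) and keeps the phrases shared by at least two documents. The function name `base_clusters` is hypothetical. A real suffix tree would also collapse "cat" into the node "cat ate", since "cat" never occurs without "ate" in these documents.

```python
from collections import defaultdict


def base_clusters(docs):
    """Map each shared word phrase to the set of document indices containing it.

    Naive stand-in for a suffix tree: every contiguous word sequence is
    enumerated explicitly instead of being stored in a tree.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase_docs[" ".join(words[start:end])].add(doc_id)
    # A base cluster is a phrase that occurs in at least two documents.
    return {p: ids for p, ids in phrase_docs.items() if len(ids) >= 2}


docs = ["cat ate cheese", "cat ate mouse too", "mouse ate cheese too"]
clusters = base_clusters(docs)
for phrase in sorted(clusters):
    print(phrase, sorted(clusters[phrase]))
```

Each printed line is one candidate base cluster: a phrase together with the documents that contain it.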
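2-2. Base Cluster
score(B) = |B| · f(|P|)
|B|: number of documents in base cluster B
|P|: number of words in phrase P, not counting stopwords or words that appear in <= 3 documents or in > 40% of the documents
f: penalizes single-word phrases and is constant once |P| exceeds a length threshold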
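The scoring rule can be sketched in Python as follows. The slide does not give the exact shape of f, so several things here are assumptions: the linear growth between the single-word penalty and the cap, the placeholder values `cap=6` and `single_word_penalty=0.5`, and all function names.

```python
def effective_length(phrase, doc_freq, n_docs, stopwords,
                     min_df=3, max_df_ratio=0.4):
    """Count the words of a phrase that contribute to its score."""
    count = 0
    for word in phrase.split():
        df = doc_freq.get(word, 0)
        # Ignore stopwords, words in <= min_df documents, and words in
        # more than max_df_ratio of all documents.
        if word in stopwords or df <= min_df or df > max_df_ratio * n_docs:
            continue
        count += 1
    return count


def phrase_factor(length, cap=6, single_word_penalty=0.5):
    """Assumed shape of f: penalize one word, grow linearly, then flatten."""
    if length <= 0:
        return 0.0
    if length == 1:
        return single_word_penalty
    return float(min(length, cap))


def score(cluster_docs, phrase, doc_freq, n_docs, stopwords):
    """score(B) = |B| * f(|P|)."""
    return len(cluster_docs) * phrase_factor(
        effective_length(phrase, doc_freq, n_docs, stopwords))


# Made-up document frequencies for a collection of 100 documents.
doc_freq = {"the": 90, "suffix": 10, "tree": 12, "rare": 2}
stopwords = {"the"}
docs_in_cluster = {1, 2, 3, 4}
print(score(docs_in_cluster, "the suffix tree", doc_freq, 100, stopwords))  # 8.0
print(score(docs_in_cluster, "suffix", doc_freq, 100, stopwords))           # 2.0
print(score(docs_in_cluster, "rare", doc_freq, 100, stopwords))             # 0.0
```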
2-3. Combining Base Clusters
Keep the top k (= 500) base clusters
Merge base clusters with high overlap: merge B_i and B_j iff
|B_i ∩ B_j| / |B_i| > 0.5 and |B_i ∩ B_j| / |B_j| > 0.5
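The merge rule might be sketched like this in Python. The sketch assumes that merging is transitive, so the final clusters are the connected components of the "high overlap" graph. It uses a small union-find for that and leaves out the top-k selection. The names `overlaps` and `merge_base_clusters` are hypothetical.

```python
def overlaps(a, b, threshold=0.5):
    """True iff the shared documents exceed the threshold for both clusters."""
    shared = len(a & b)
    return shared / len(a) > threshold and shared / len(b) > threshold


def merge_base_clusters(base, threshold=0.5):
    """Union base clusters (sets of document ids) linked by high overlap."""
    parent = list(range(len(base)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(base)):
        for j in range(i + 1, len(base)):
            if overlaps(base[i], base[j], threshold):
                parent[find(i)] = find(j)

    groups = {}
    for i, docs in enumerate(base):
        groups.setdefault(find(i), set()).update(docs)
    return sorted(groups.values(), key=sorted)


base = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {8, 9}]
print(merge_base_clusters(base))  # [{1, 2, 3, 4}, {7, 8}, {8, 9}]
```

The first two base clusters share two of their three documents and are merged. The last two share exactly half, which does not exceed 0.5, so they stay separate.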
2-4. Advantages
High precision even when using only snippets
Incremental and linear time
Order independent
No magic k (number of clusters)
But: why the top k base clusters? Why 0.5?
3. New Suffix Tree Clustering (NSTC)
Represent each document as a vector over the suffix tree nodes: d_i^T = [tfidf(n_1, d_i), tfidf(n_2, d_i), …]
Cluster the vectors with group-average AHC (GAHC)
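A rough Python sketch of this pipeline is given below, under several assumptions. The tf-idf weight is the plain tf · log(N/df) variant, which may differ from the weighting in the paper. Document similarity is taken to be the cosine of the vectors. "Group average" is implemented as the mean pairwise similarity between two clusters. The per-document node counts are supplied by hand instead of being read from a suffix tree, and all function names and the `stop` threshold are hypothetical.

```python
import math


def tfidf_vectors(node_counts):
    """node_counts[i][n] = occurrences of suffix tree node n in document i."""
    n_docs = len(node_counts)
    nodes = sorted({n for counts in node_counts for n in counts})
    df = {n: sum(1 for c in node_counts if c.get(n, 0) > 0) for n in nodes}
    vectors = [[c.get(n, 0) * math.log(n_docs / df[n]) for n in nodes]
               for c in node_counts]
    return nodes, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def gahc(sim, stop=0.5):
    """Merge the most similar pair of clusters until similarity drops below stop."""
    clusters = [[i] for i in range(len(sim))]

    def avg(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, (x, y) = max(
            (avg(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters)))
        if best < stop:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Node counts for "cat ate cheese", "cat ate mouse too", "mouse ate cheese too".
node_counts = [
    {"cat ate": 1, "ate": 1, "cheese": 1, "ate cheese": 1},
    {"cat ate": 1, "ate": 1, "mouse": 1, "too": 1},
    {"ate": 1, "cheese": 1, "ate cheese": 1, "mouse": 1, "too": 1},
]
nodes, vectors = tfidf_vectors(node_counts)
print(round(cosine(vectors[0], vectors[1]), 3))  # 0.333
print(round(cosine(vectors[0], vectors[2]), 3))  # 0.577
sim = [[cosine(u, v) for v in vectors] for u in vectors]
print(gahc(sim, stop=0.5))
```

The node "ate" occurs in all three documents, so its weight is log(1) = 0 and it does not contribute to any similarity.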
4. Evaluation
Use the F-measure, based on the precision and recall of cluster C_i against ground-truth class G_j:
precision(C_i, G_j) = |C_i ∩ G_j| / |C_i|
recall(C_i, G_j) = |C_i ∩ G_j| / |G_j|
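The precision and recall definitions above translate directly into Python. The pairwise F-measure below is the usual harmonic mean, 2PR / (P + R). The aggregate `overall_f`, which weights each class by its size and takes its best-matching cluster, is one common way to score a whole clustering and is an assumption here, not something stated on the slide.

```python
def precision(cluster, group):
    return len(cluster & group) / len(cluster)


def recall(cluster, group):
    return len(cluster & group) / len(group)


def f_measure(cluster, group):
    """Harmonic mean of precision and recall for one cluster/class pair."""
    p, r = precision(cluster, group), recall(cluster, group)
    return 2 * p * r / (p + r) if p + r else 0.0


def overall_f(clusters, groups):
    """Size-weighted average of each class's best F-measure over all clusters."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * max(f_measure(c, g) for c in clusters)
               for g in groups)


clusters = [{1, 2, 3}, {4, 5}]   # clustering output
groups = [{1, 2}, {3, 4, 5}]     # ground-truth classes
print(round(f_measure(clusters[0], groups[0]), 3))  # 0.8
print(round(overall_f(clusters, groups), 3))        # 0.8
```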
Data sets and ground-truth labels:
OHSUMED document collection: MeSH indexing terms
RCV1 document collection: categories
5. Comparison
STC: seldom generates large clusters
NSTC: not incremental