Published by Alexia West. Modified over 9 years ago.
1
A New Suffix Tree Similarity Measure for Document Clustering
Hung Chim, Xiaotie Deng
WWW 2007
2
1. Document Clustering
Agglomerative Hierarchical Clustering (AHC)
3
Suffix Tree Clustering (STC): commonly used for clustering search results
4
2-1. Suffix Tree Clustering
Ex: 3 documents
    cat ate cheese
    cat ate mouse too
    mouse ate cheese too
5
cat ate cheese
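To make the example concrete, here is a minimal sketch that finds the phrases shared by at least two of the three documents. It is an assumption-laden simplification: it enumerates word-level phrases naively instead of building a real (linear-time) suffix tree, and the names `phrase_index` and `base_clusters` are hypothetical.

```python
from collections import defaultdict

docs = [
    "cat ate cheese",
    "cat ate mouse too",
    "mouse ate cheese too",
]

def phrase_index(documents):
    """Map every word-level phrase to the set of document ids containing it.

    Naive stand-in for a suffix tree: every prefix of every suffix is a phrase.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        words = text.split()
        for start in range(len(words)):                    # each suffix
            for end in range(start + 1, len(words) + 1):   # each prefix of it
                index[tuple(words[start:end])].add(doc_id)
    return index

index = phrase_index(docs)

# Candidate base clusters: phrases that occur in two or more documents.
base_clusters = {p: ids for p, ids in index.items() if len(ids) >= 2}

for phrase, doc_ids in sorted(base_clusters.items()):
    print(" ".join(phrase), sorted(doc_ids))
```

Note that this naive enumeration lists "cat" and "cat ate" separately; a compact suffix tree would typically hold them in a single node, because "cat" is always followed by "ate" in these documents.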
9
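2-2. Base Cluster
score(B) = |B| · f(|P|)
    |B|: number of documents in base cluster B
    |P|: number of words in phrase P, not counting stopwords
    Stopwords: stoplist words, plus words appearing in <= 3 documents or in > 40% of the documents
    f: penalizes single-word phrases; constant for |P| > 6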
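The scoring rule can be sketched as follows. The slide only says that f penalizes single words and is constant beyond 6 words, so the concrete values below (0.5 for a single word, linear growth in between) are assumptions chosen for illustration, as are the names `f`, `effective_length`, and `score`.

```python
def f(phrase_len):
    """Phrase-length factor. The exact values are assumptions, not from the slides."""
    if phrase_len <= 0:
        return 0.0                      # phrase consists only of stopwords
    if phrase_len == 1:
        return 0.5                      # assumed penalty for single-word phrases
    return float(min(phrase_len, 6))    # assumed linear growth, constant beyond 6

def effective_length(phrase, doc_freq, n_docs, stoplist=frozenset()):
    """Count the words of a phrase that are not treated as stopwords.

    A word is skipped if it is on the stoplist, appears in <= 3 documents,
    or appears in more than 40% of the documents.
    """
    count = 0
    for word in phrase:
        df = doc_freq.get(word, 0)
        if word in stoplist or df <= 3 or df > 0.4 * n_docs:
            continue
        count += 1
    return count

def score(doc_ids, phrase, doc_freq, n_docs, stoplist=frozenset()):
    """score(B) = |B| * f(|P|) for a base cluster with documents doc_ids."""
    return len(doc_ids) * f(effective_length(phrase, doc_freq, n_docs, stoplist))

# Hypothetical document frequencies in a collection of 100 documents.
doc_freq = {"suffix": 10, "tree": 12, "the": 90, "rare": 2}
print(score({1, 2, 3, 4, 5}, ("the", "suffix", "tree"), doc_freq, 100))
```

In the usage line, "the" is dropped as too frequent, so the phrase counts as two words.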
10
2-3. Combining Base Clusters
Keep the top k (= 500) base clusters
Merge base clusters with high overlap:
    merge B_i and B_j iff |B_i ∩ B_j| / |B_i| > 0.5 and |B_i ∩ B_j| / |B_j| > 0.5
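A minimal sketch of the merge step, assuming base clusters are non-empty sets of document ids and that final clusters are the connected components of the "high overlap" graph. The top-k selection is not shown, and the function names are hypothetical.

```python
def similar(a, b, threshold=0.5):
    """True iff the overlap exceeds the threshold relative to both clusters."""
    common = len(a & b)
    return common / len(a) > threshold and common / len(b) > threshold

def merge_base_clusters(clusters, threshold=0.5):
    """Merge base clusters (non-empty sets of doc ids) linked by high overlap.

    Uses a small union-find to collect connected components.
    """
    parent = list(range(len(clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if similar(clusters[i], clusters[j], threshold):
                parent[find(i)] = find(j)

    merged = {}
    for i, cluster in enumerate(clusters):
        merged.setdefault(find(i), set()).update(cluster)
    return list(merged.values())

# Document sets of the shared phrases from the three-document example.
print(merge_base_clusters([{0, 1}, {0, 1, 2}, {0, 2}, {1, 2}]))
```

For the three-document example every base cluster overlaps strongly with the "ate" cluster, so everything collapses into a single cluster.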
11
2-4. Advantage
High precision, even when using only snippets
Incremental and linear time
Order independent
No magic k (the number of clusters is not fixed in advance)
But what about the top-k base clusters and the 0.5 threshold?
13
3. New Suffix Tree Clustering
d_i^T = [tfidf(n_1, d_i), tfidf(n_2, d_i), …]   (n_k: nodes of the suffix tree)
Group-average AHC (GAHC)
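The idea of treating suffix tree nodes as vector dimensions can be sketched like this. Several details are assumptions rather than facts from the slides: the tf-idf variant `(1 + log tf) * log(1 + N / df)`, cosine as the similarity, group average taken as the mean pairwise similarity between two clusters, and a similarity threshold as the stopping rule. Node labels are given directly instead of being read off a real suffix tree.

```python
import math
from collections import Counter

def node_vectors(doc_nodes):
    """doc_nodes[i] lists the suffix tree nodes traversed by document i.

    Returns one sparse vector {node: weight} per document.
    """
    n_docs = len(doc_nodes)
    tf = [Counter(nodes) for nodes in doc_nodes]
    df = Counter(node for counts in tf for node in counts)
    return [
        {node: (1 + math.log(c)) * math.log(1 + n_docs / df[node])
         for node, c in counts.items()}
        for counts in tf
    ]

def cosine(u, v):
    dot = sum(w * v.get(node, 0.0) for node, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def gahc(vectors, threshold):
    """Group-average AHC: merge the most similar pair until it falls below threshold."""
    sim = [[cosine(u, v) for v in vectors] for u in vectors]
    clusters = [[i] for i in range(len(vectors))]

    def group_avg(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, (a, b) = max(
            (group_avg(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Node labels for the three example documents (phrases stand in for nodes).
doc_nodes = [
    ["cat ate", "ate", "ate cheese", "cheese"],
    ["cat ate", "ate", "mouse", "too"],
    ["mouse", "ate", "ate cheese", "cheese", "too"],
]
vectors = node_vectors(doc_nodes)
print(gahc(vectors, threshold=0.3))
```

Because every pairwise similarity is computed up front, this sketch is not incremental.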
14
4. Evaluation
Use F-measure
    precision(C_i, G_j) = |C_i ∩ G_j| / |C_i|
    recall(C_i, G_j) = |C_i ∩ G_j| / |G_j|
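The measures above can be sketched in a few lines, with clusters C_i and ground-truth classes G_j as non-empty sets of document ids. The harmonic-mean F and the class-size-weighted overall score are common conventions and are assumed here; the slides do not spell them out.

```python
def precision(c, g):
    return len(c & g) / len(c)

def recall(c, g):
    return len(c & g) / len(g)

def f_measure(c, g):
    """Harmonic mean of precision and recall (assumed F1 form)."""
    p, r = precision(c, g), recall(c, g)
    return 2 * p * r / (p + r) if p + r else 0.0

def overall_f(clusters, classes):
    """Assumed aggregate: each class takes its best-matching cluster,
    weighted by the class's share of the documents."""
    n = sum(len(g) for g in classes)
    return sum(
        len(g) / n * max(f_measure(c, g) for c in clusters)
        for g in classes
    )

clusters = [{0, 1, 2}, {3, 4}]      # hypothetical clustering result
classes = [{0, 1}, {2, 3, 4}]       # hypothetical ground-truth classes
print(overall_f(clusters, classes))
```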
15
OHSUMED Document Collection: MeSH indexing terms
RCV1 Document Collection: categories
17
5. Comparison
STC: seldom generates large clusters
NSTC: not incremental