Improving Suffix Tree Clustering
  Base cluster ranking: s(B) = |B| * f(|P|)
    – |B| is the number of documents in base cluster B
    – |P| is the number of words in the phrase P of B that have a non-zero score
    – Zero-score words: stopwords, and words that appear in too few or too many (more than 40% of) documents
  Tf-Idf scoring is better
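The ranking formula above can be sketched in a few lines. The slide does not define f. In the original STC description, f penalizes single-word phrases, grows linearly for phrases of two to six words, and is constant beyond that. The constants below, and the names `phrase_length_factor` and `base_cluster_score`, are assumptions for illustration only.

```python
def phrase_length_factor(effective_words: int) -> float:
    """Hypothetical f(|P|): penalized for one word, linear for 2-6, constant after."""
    if effective_words <= 0:
        return 0.0
    if effective_words == 1:
        return 0.5  # assumed penalty for single-word phrases
    return float(min(effective_words, 6))  # linear up to 6 words, then flat


def base_cluster_score(num_documents: int, phrase: list, zero_score_words: set) -> float:
    """s(B) = |B| * f(|P|), counting only words with a non-zero score."""
    effective = sum(1 for word in phrase if word.lower() not in zero_score_words)
    return num_documents * phrase_length_factor(effective)


# Assumed zero-score set; in practice it would also hold too-rare and too-common words.
zero_score = {"the", "of"}
print(base_cluster_score(10, ["history", "of", "the", "jaguar"], zero_score))  # 20.0
```

Only two words of the example phrase carry a score, so the cluster scores 10 * f(2).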
Improving Suffix Tree Clustering
  Cluster similarity
    – Page overlap
    – Add: cluster label distance (word pair distance)
        Google normalised distance
        WikiMiner: wikilink similarity
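As an illustration of a word pair distance, the sketch below computes the normalised Google distance from hit counts. The function name and the example counts are hypothetical. How the slides obtain the counts, and how pairwise word distances are combined into a cluster label distance, is not stated.

```python
import math


def normalized_google_distance(fx: float, fy: float, fxy: float, n: float) -> float:
    """NGD from hit counts: fx and fy for each word, fxy for both together,
    n for the total number of indexed pages."""
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf  # the words never occur, or never occur together
    log_x, log_y, log_xy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(n) - min(log_x, log_y)
    if denominator <= 0:
        raise ValueError("n must be larger than the individual hit counts")
    return (max(log_x, log_y) - log_xy) / denominator


# Words that always co-occur are at distance 0; rarer co-occurrence gives a larger distance.
print(normalized_google_distance(100, 100, 100, 10_000))  # 0.0
print(normalized_google_distance(1000, 1000, 10, 1_000_000))
```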
Improving Suffix Tree Clustering
  3rd step: cluster merging
    – Merge if more than half of the pages overlap
    – New: HAC (hierarchical agglomerative clustering)
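A minimal sketch of the overlap-based merging step follows. It assumes, as in the original STC description, that two base clusters are linked when the shared pages exceed half of each cluster, and that final clusters are the connected components of those links. The function names are hypothetical.

```python
def similar(a: set, b: set, threshold: float = 0.5) -> bool:
    """True when the shared pages exceed `threshold` of both clusters."""
    overlap = len(a & b)
    return overlap > threshold * len(a) and overlap > threshold * len(b)


def merge_base_clusters(base_clusters: list, threshold: float = 0.5) -> list:
    """Merge linked base clusters using a small union-find structure."""
    parent = list(range(len(base_clusters)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(base_clusters)):
        for j in range(i + 1, len(base_clusters)):
            if similar(base_clusters[i], base_clusters[j], threshold):
                parent[find(i)] = find(j)

    merged = {}
    for i, pages in enumerate(base_clusters):
        merged.setdefault(find(i), set()).update(pages)
    return list(merged.values())


clusters = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(sorted(sorted(c) for c in merge_base_clusters(clusters)))  # [[1, 2, 3, 4], [7, 8]]
```

Because links are transitive, this behaves like single-link clustering and can chain loosely related clusters together.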
Query Directed Web Page Clustering
  Daniel Crabtree, Peter Andreae, Xiaoying Gao
  Victoria University of Wellington
Related Work: Web Page Clustering
  All Standard Algorithms
    – partitioning (k-means), hierarchical (agglomerative, divisive), ...
  Web Features
    – structure, hyperlinks, colour
  Textual Features
    – STC: phrases, Lingo: latent semantic indexing
  Word Semantics
    – Global document analysis, co-occurrence statistics
  Query is never used
QDC – Query Directed Clustering
  1: Find Base Clusters
  2: Merge Clusters
  3: Split Clusters
  4: Select Clusters
  5: Clean Clusters
QDC – 1: Find Base Clusters
  Clean Pages
  Identify Base Clusters
  Prune Small Clusters
  Semantic Prune #1
  Semantic Prune #2
  [Figure: example base clusters with sizes: Mac (28), Car (40), Auto (25), Animal (18), OS (12), Atari (22), Game (5), Service (80), Forest (11). Score #1 and Score #2 are computed from cluster size and distance(cluster, query); the formulas are not shown.]
QDC – 1: Query Distance
  [Figure: query "Jaguar" with example terms Car, Home Page, Toyota and the labels Specific, Broad, Ambiguous]
QDC – 1: Find Base Clusters
  Removes Many Base Clusters
    – Normally Negative Effect on Performance
  But … Query Directed Score
    – Reliable Guide to Cluster Quality
    – Removes just Low Quality Clusters
    – Improves Performance
QDC – 2: Merge Clusters
  [Figure: merging. Among the base clusters Mac (28), Car (40), Auto (25), Animal (18), OS (12), Atari (22), Car and Auto merge into Car, Auto (40), and Mac and OS merge into Mac, OS (28).]
QDC – 2: Merge Clusters
  Single-link Clustering
  Similarity Function
    – Extension (by page overlap)
    – Intension (by description similarity)
  Global document analysis: co-occurrence frequency relative to the frequency expected if the words were independent
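The co-occurrence measure named above can be sketched as an observed-to-expected ratio over a document collection. This is one plausible reading of the slide, not necessarily the exact formula used by QDC. The function name and the toy collection are hypothetical.

```python
def cooccurrence_ratio(documents: list, a: str, b: str) -> float:
    """Observed co-occurrence count divided by the count expected under independence.

    `documents` is a list of word sets. A ratio above 1 suggests the words are
    related; a ratio near 0 suggests they rarely share a document.
    """
    n = len(documents)
    n_a = sum(1 for doc in documents if a in doc)
    n_b = sum(1 for doc in documents if b in doc)
    n_ab = sum(1 for doc in documents if a in doc and b in doc)
    if n_a == 0 or n_b == 0:
        return 0.0
    expected = n_a * n_b / n  # expected joint count if a and b were independent
    return n_ab / expected


docs = [{"car", "auto"}, {"car", "auto"}, {"mac", "os"}, {"mac"}]
print(cooccurrence_ratio(docs, "car", "auto"))  # 2.0
print(cooccurrence_ratio(docs, "car", "mac"))   # 0.0
```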
QDC – 2: Merge Clusters
  Reducing Page Overlap Threshold
    – Normally Negative Effect on Performance
  But … Description Similarity
    – More semantically related clusters merge, increasing cluster coverage
    – Fewer semantically unrelated clusters merge, increasing cluster quality
QDC – 3: Split Clusters
  Single Link Merging
    – Cluster Chaining (Drifting)
  Hierarchical Agglomerative
    – Distance Measure: Path Length
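To illustrate the path length distance, the sketch below runs a breadth-first search over a graph whose nodes are base clusters. It assumes that edges connect base clusters judged similar in the merge step; the slide does not spell out the graph construction. The function name and node labels are hypothetical.

```python
from collections import deque


def path_length(graph: dict, start: str, goal: str):
    """Number of edges on the shortest path, or None if the nodes are unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# A chained cluster: each node is similar only to its neighbours.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
print(path_length(chain, "A", "E"))  # 4
```

In a chained cluster the two ends are far apart by this measure, which gives a hierarchical agglomerative pass a signal for where to split.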
QDC – 4: Select Clusters
  ESTC cluster selection algorithm
    – Heuristic-based hill-climbing search with look-ahead and advanced branch and bound pruning
  Original heuristic
    – Page Coverage and Cluster Overlap
  New heuristic
    – Page Coverage and Cluster Overlap
    – Pages Not Covered and Cluster Quality
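A heavily simplified sketch of coverage-versus-overlap selection is given below. It is a plain greedy hill-climb without look-ahead or branch and bound pruning, and it does not model cluster quality, so it is not the ESTC or QDC selection algorithm itself. The function name, the `overlap_weight` parameter, and the gain formula are assumptions.

```python
def select_clusters(clusters: dict, k: int, overlap_weight: float = 1.0) -> list:
    """Greedily pick up to k clusters, rewarding new pages and penalizing overlap."""
    selected = []
    covered = set()
    remaining = dict(clusters)

    def gain(pages: set) -> float:
        newly_covered = len(pages - covered)
        overlapping = len(pages & covered)
        return newly_covered - overlap_weight * overlapping

    while remaining and len(selected) < k:
        best = max(remaining, key=lambda name: gain(remaining[name]))
        if gain(remaining[best]) <= 0:
            break  # no remaining cluster improves the heuristic
        selected.append(best)
        covered |= remaining.pop(best)
    return selected


candidates = {"car": {1, 2, 3, 4}, "auto": {2, 3, 4}, "animal": {5, 6}, "mac": {7}}
print(select_clusters(candidates, k=3))  # ['car', 'animal', 'mac']
```

The "auto" cluster is skipped because every page it holds is already covered by "car".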
QDC – 5: Clean Clusters
  Page-Cluster Relevance
    – Based on Base Cluster Membership
    – Cluster Size, Cluster Quality
  Remove Outliers and Erroneous Inclusions
  Sorting improves usability
Evaluation
  Algorithm Efficiency on 250 Documents
    – Ten Times Faster than STC
    – One Hundred Times Faster than K-means
  Algorithm Performance
    – External Evaluation against a rich gold standard
  Real World Usability
    – Informal Usability Comparison with four algorithms: K-means, ESTC, Lingo, Vivisimo
Evaluation: Algorithm Performance
  External Evaluation against a rich gold standard
  Four Algorithms
    – STC, ESTC, K-means, Random
  Four Data Sets
    – Salsa, Jaguar, GP, Victoria University
  Eleven Measurements
    – Average and Weighted: Quality, Coverage, Precision, Recall, and Entropy
    – Mutual Information
  Snippets and Full Page Text
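For readers unfamiliar with external evaluation, the sketch below shows textbook per-cluster precision, recall, and entropy against a gold standard. The definitions used in the evaluation itself, in particular the average and weighted variants and the handling of the rich gold standard, may differ. The function names and toy data are hypothetical.

```python
import math
from collections import Counter


def precision_recall(cluster: set, topic: set) -> tuple:
    """Precision and recall of one cluster against one gold-standard topic."""
    hits = len(cluster & topic)
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(topic) if topic else 0.0
    return precision, recall


def cluster_entropy(cluster: set, page_topic: dict) -> float:
    """Entropy in bits of gold-standard topics inside a cluster; lower is purer."""
    counts = Counter(page_topic[page] for page in cluster if page in page_topic)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(precision_recall({1, 2, 3, 4}, {1, 2, 3, 5, 6}))  # (0.75, 0.6)
print(cluster_entropy({1, 2, 3, 4}, {1: "car", 2: "car", 3: "cat", 4: "cat"}))  # 1.0
```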
Evaluation: Quality and Coverage
Evaluation: Improvement over Random
Evaluation: Precision and Recall
Evaluation: Entropy and Mutual Information
Evaluation: Real World Usability
  QDC finds broader topics
    – Maximizes probability of refinement
    – Simplifies user’s decision process
        Fewer choices
        Less chance of multiple relevant choices
        Fewer semantically meaningless clusters
  [Figure: Jaguar Results]
Evaluation: Real World Usability
  Performance better than indicated by external evaluation
    – No penalty for overly specific clusters since gold standard included them
  External evaluation shows QDC clusters:
    – Contain fewer irrelevant pages
    – Cover more relevant pages
Conclusion
  QDC: New Web Page Clustering Algorithm
  Key innovations:
    – Query Directed Scoring
    – Merging using cluster descriptions
    – Solving cluster chaining by splitting
    – Improved cluster selection heuristic
  Vastly improved performance over other algorithms
    – External evaluation
    – Informal usability evaluation
Further Extension
  Use Phrases rather than just Words
    – STC, Lingo show large improvement possible
  Use Wiki Link similarity (WikiMiner) instead of GND (Google normalised distance)
  Future work:
    – Improve cluster description similarity merging to consider entire description
    – Common shared phrases as key features, use VSM, build vectors for each cluster, new weighting
    – Formal usability evaluation