Published by Miranda Gallagher. Modified over 9 years ago.
1
Web Document Clustering: A Feasibility Demonstration Hui Han CSE dept. PSU 10/15/01
2
Motivation
Low precision of Web search engines makes it hard for users to locate the information they want quickly.
Solutions:
1. Increase precision: by filtering methods? by advanced pruning options?
2. Web Document Clustering: cluster the documents returned by a search engine in response to a query and re-present them
3
Key Requirements for Web Document Clustering
– Relevance
– Browsable Summaries
– Overlap
– Snippet-tolerance (a “snippet” is a small piece of information or a brief extract)
– Speed
– Incrementality
4
Suffix Tree Clustering (STC)
STC is a linear-time clustering algorithm based on a suffix tree, which efficiently identifies sets of documents that share common phrases.
STC satisfies the key requirements:
– STC treats a document as a string, making use of proximity information between words.
– STC is a novel, incremental, O(n)-time algorithm.
– STC succinctly summarizes clusters’ contents for users.
– STC is quick because it works on a smaller set of documents and is incremental.
– …
5
Operating procedure of STC
Step 1: Document “cleaning”
– HTML -> plain text
– Word stemming
– Mark sentence boundaries
– Remove non-word tokens
Step 2: Identifying Base Clusters
Step 3: Combining Base Clusters
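The cleaning step can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the function names `clean_document` and `naive_stem` are hypothetical, tag stripping is done with a naive regular expression, and the "stemmer" only removes a plural "s" (a real system would likely use a proper stemming algorithm).

```python
import re
from html import unescape


def naive_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a plural "s" only.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean_document(html):
    """Return a list of sentences, each a list of lowercased, crudely stemmed words."""
    text = re.sub(r"<[^>]+>", " ", html)      # HTML -> plain text (naive tag stripping)
    text = unescape(text)                      # decode entities such as &amp;
    sentences = re.split(r"[.!?;:]+", text)    # mark sentence boundaries
    cleaned = []
    for sentence in sentences:
        # Keep alphabetic tokens only, which drops numbers and punctuation.
        words = re.findall(r"[A-Za-z]+", sentence.lower())
        stems = [naive_stem(w) for w in words]
        if stems:
            cleaned.append(stems)
    return cleaned


example = clean_document("<p>Cats ate cheese. Mice ate 2 cats!</p>")
```

Each sentence becomes its own word list, so phrases found later never span a sentence boundary.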
6
Step 2: Identifying Base Clusters: Suffix Tree
* STC treats a document as a set of strings…
Suffix tree of string S: a compact tree containing all the suffixes of S
– Suffix of a word: lovely
– Suffix of a string: “Friends” is a lovely show.
Precise definition:
– A suffix tree is a rooted, directed tree.
– Each internal node has 2+ children.
– Each edge is labeled with a non-empty sub-string of S. The label of a node is defined to be the concatenation of the edge labels on the path from the root to that node.
– No two edges out of the same node can have edge labels that begin with the same word (this is what makes the tree compact).
7
Ex. A Suffix Tree of Strings
String 1: “cat ate cheese”
String 2: “mouse ate cheese too”
String 3: “cat ate mouse too”
8
Base clusters Base clusters corresponding to the suffix tree nodes
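To make the link between tree nodes and base clusters concrete, here is a rough sketch using the three example strings. It is an assumption-laden simplification: instead of a linear-time compact suffix tree it builds an uncompressed word-level suffix trie (quadratic in document length), then keeps only the nodes that would be explicit in a compact tree, meaning nodes with at least two distinct continuations. The names `Node`, `build_suffix_trie` and `base_clusters` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # next word -> Node
        self.docs = set()    # ids of documents that contain this node's phrase
        self.ends = set()    # ids of documents in which a suffix ends exactly here


def build_suffix_trie(docs):
    """docs: mapping of document id -> list of words."""
    root = Node()
    for doc_id, words in docs.items():
        for start in range(len(words)):          # insert every word-level suffix
            node = root
            for word in words[start:]:
                node = node.children.setdefault(word, Node())
                node.docs.add(doc_id)
            node.ends.add(doc_id)
    return root


def base_clusters(root, min_docs=2):
    """Return phrase -> document set for nodes shared by at least min_docs documents."""
    clusters = {}
    stack = [((), root)]
    while stack:
        phrase, node = stack.pop()
        for word, child in node.children.items():
            stack.append((phrase + (word,), child))
        # A node is explicit in the compact tree if the phrase can continue in
        # two or more ways (a different next word, or the end of a document).
        continuations = len(node.children) + len(node.ends)
        if phrase and continuations >= 2 and len(node.docs) >= min_docs:
            clusters[" ".join(phrase)] = frozenset(node.docs)
    return clusters


docs = {
    1: "cat ate cheese".split(),
    2: "mouse ate cheese too".split(),
    3: "cat ate mouse too".split(),
}
clusters = base_clusters(build_suffix_trie(docs))
```

Under these assumptions the example yields six base clusters; for instance the phrase "cat" is skipped because it is always followed by "ate", so only "cat ate" appears as a node.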
9
Cluster score
s(B) = |B| * f(|P|)
– |B| is the number of documents in base cluster B
– |P| is the number of words in phrase P that have a non-zero score
– Zero-score words: stopwords, and words that appear in too few or too many (more than 40%) of the documents
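A small sketch of the score follows. The slide does not define f, so the shape used here is an assumption chosen for illustration (single-word phrases are penalized, growth is linear up to six words, then constant). The set of zero-score words is passed in rather than computed, and all names are hypothetical.

```python
def phrase_weight(effective_len):
    # Assumed shape of f; the slide does not give its definition.
    if effective_len <= 0:
        return 0.0
    if effective_len == 1:
        return 0.5
    return float(min(effective_len, 6))


def cluster_score(doc_ids, phrase, zero_score_words=frozenset()):
    """s(B) = |B| * f(|P|), where |P| counts only words with a non-zero score."""
    effective_len = sum(1 for w in phrase.split() if w not in zero_score_words)
    return len(doc_ids) * phrase_weight(effective_len)


score_cat_ate = cluster_score({1, 3}, "cat ate")
score_ate = cluster_score({1, 2, 3}, "ate")
score_ate_as_stopword = cluster_score({1, 2, 3}, "ate", zero_score_words={"ate"})
```

With this choice of f, a two-word phrase shared by two documents outscores a single word shared by three, and a phrase made up only of zero-score words scores 0.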
10
Step 3: Combining Base Clusters
Merge base clusters with a high overlap in their document sets, since documents may share multiple phrases.
Similarity of $B_m$ and $B_n$ (0.5 is a parameter):
$$\mathrm{sim}(B_m, B_n) = \begin{cases} 1 & \text{iff } |B_m \cap B_n| / |B_m| > 0.5 \text{ and } |B_m \cap B_n| / |B_n| > 0.5 \\ 0 & \text{otherwise} \end{cases}$$
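The similarity test translates directly into code. The sketch below assumes base clusters are represented as Python sets of document ids; the function name `similar` and the empty-set guard are illustrative additions.

```python
def similar(bm, bn, threshold=0.5):
    """Return 1 if the overlap exceeds the threshold relative to both clusters, else 0."""
    if not bm or not bn:
        return 0                      # guard against division by zero
    common = len(bm & bn)
    return int(common / len(bm) > threshold and common / len(bn) > threshold)


sim_nested = similar({1, 3}, {1, 2, 3})   # overlap 2/2 and 2/3
sim_half = similar({1, 3}, {2, 3})        # overlap 1/2 and 1/2
```

Because the inequality is strict, two clusters that share exactly half of their documents are not considered similar.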
11
Base Cluster Graph
Node: cluster
Edge: similarity between two clusters = 1
What if “ate” is in the stop word list?
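One way to explore the question on this slide is to build the graph and take its connected components as the merged clusters. The sketch below assumes the six base clusters of the three-string example, and it models "ate is a stop word" simply by dropping the single-word base cluster "ate" (on the assumption that a zero-score base cluster is discarded). All function names are hypothetical.

```python
from itertools import combinations


def similar(bm, bn, threshold=0.5):
    common = len(bm & bn)
    return common / len(bm) > threshold and common / len(bn) > threshold


def merge_base_clusters(base):
    """base: phrase -> set of doc ids. Returns (phrases, docs) per connected component."""
    neighbours = {name: set() for name in base}
    for m, n in combinations(base, 2):
        if similar(base[m], base[n]):
            neighbours[m].add(n)
            neighbours[n].add(m)
    seen, merged = set(), []
    for name in base:
        if name in seen:
            continue
        component, stack = set(), [name]
        while stack:                               # depth-first traversal
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        docs = set().union(*(base[p] for p in component))
        merged.append((frozenset(component), frozenset(docs)))
    return merged


base = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
all_merged = merge_base_clusters(base)
without_ate = merge_base_clusters({p: d for p, d in base.items() if p != "ate"})
```

With "ate" present, every base cluster is linked through it and a single merged cluster results; without it, the graph falls apart into three components.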
12
STC is Incremental
As each document arrives from the web, we
– “Clean” it (linear with collection size)
– Add it to the suffix tree. Each node that is updated or created as a result is tagged (linear)
– Update the relevant base clusters and recalculate the similarity of these base clusters to the rest of the k highest-scoring base clusters (linear)
– Check for any changes to the final clusters (linear)
– Score and sort the final clusters, choose the top 10... (linear)
13
STC allows cluster overlap…
– Why is overlap reasonable? A document often has 1+ topics
– STC allows a document to appear in 1+ clusters, since documents may share 1+ phrases with other documents
– But the clusters must not be so similar that they would be merged into one cluster
14
Experiments Cluster output of meta search engine, using STC alg. – Representative of Web search engines – WEB clustering, instead of “IR corpus”
15
Evaluation: Precision
Precision of different clustering algorithms
16
Cluster overlap & multi-word phrases are critical to STC’s success
17
Cluster overlap & multi-word phrases are specifically effective to STC’s success
18
Why? Allowing a document to appear in multiple clusters is only advantageous if that document is relevant; placing an irrelevant document in multiple clusters can only hurt cluster quality
19
Snippets versus Whole Document
20
Execution time Incremental – use “free” CPU time when the system is waiting for the search engine results to arrive over the web – speedy
21
Conclusion
– The identification of the unique requirements of clustering Web search engine results
– The definition of STC, an incremental, O(n)-time clustering algorithm that satisfies these requirements
– The first experimental evaluation of clustering algorithms on Web search engine results