Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab
Background
Open Directory Project: used by Google, Lycos, etc. Web pages are categorized by hand, so the directory is accurate, but it is slow to update and does not scale.
World Wide Web: rapid increase (= # of clusters changes) and daily updates (= cluster centers move). Because of these two properties of the Web, a Web page clustering system that needs no human effort is required.
Purpose: construct a Web page clustering system which finds clusters without human help, is scalable, clusters Web pages at high speed, and clusters Web pages accurately.
Agenda Introduction Related Work Proposal Comparison Conclusion
Clustering Algorithms. Text-based clustering: uses words as features; the generally used approach. Link-based clustering: focuses on link structure; used especially for clustering Web pages.
k-means Algorithm (figure: k = 3; each point is the vector representation of a document)
Problems of k-means Algorithm: the appropriate k depends on the data set. The clustering result is sensitive to outliers.
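To make the two problems concrete, here is a minimal k-means sketch in Python. It is an illustration only, not the implementation used in this work; seeding the centers with the first k points and using Euclidean distance are assumptions. Note that k must be supplied by the caller, and that each center is a mean, so a single distant point can pull it away.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on lists of floats; the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels, centers

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(points, k=2)
```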
Hierarchical Clustering BIRCH [Zhang ’96], CURE [Guha ’98], Chameleon [Karypis ’99], ROCK [Guha ’00]
Hierarchical Clustering: # of clusters can be determined by a stopping condition. Clustering a large number of points (pages) results in many I/O accesses.
Use of Link Structure: Web pages include not only text but also links. People link Web pages to other related pages. Linked Web pages may share the same topic.
Extraction of Web Community based on Link Analysis An Approach to Find Related Communities Based on Bipartite Graphs [P.Krishna Reddy et al., 2001]
Terminology: Fans and Centers; Bipartite Graph (BG); Complete BG (CBG); Dense BG (DBG). Figure: fans and centers in (a) a CBG and (b) a DBG, with p and q marked.
An Approach to Find Related Communities Based on Bipartite Graphs. Definition: the set T contains the members of the community if there exists a dense bipartite graph DBG(T, I, p, q), where T: fans, I: centers, p: minimum # of out-links per fan, q: minimum # of in-links per center. Figure: DBG(T, I, 2, 3).
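The definition can be checked mechanically. The sketch below assumes that p and q are lower bounds (each fan links to at least p centers in I, and each center is linked from at least q fans in T); the function name `is_dbg` and the dictionary-of-sets graph representation are hypothetical choices made for illustration.

```python
def is_dbg(links, fans, centers, p, q):
    """links maps each page to the set of pages it links to."""
    fans, centers = set(fans), set(centers)
    # Every fan must link to at least p of the centers.
    fans_ok = all(len(links.get(f, set()) & centers) >= p for f in fans)
    # Every center must be linked from at least q of the fans.
    centers_ok = all(
        sum(1 for f in fans if c in links.get(f, set())) >= q
        for c in centers
    )
    return fans_ok and centers_ok

# Four fans and three centers: dense, but not complete.
links = {
    "a": {"x", "y"},
    "b": {"y", "z"},
    "c": {"x", "z"},
    "d": {"x", "y", "z"},
}
dense = is_dbg(links, "abcd", "xyz", p=2, q=3)
```

With p = 3 the same graph fails the test, because fans a, b, and c each link to only two centers.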
DBG Extraction Algorithm (pt = 2, qt = 3) 1. Gathering related nodes (threshold = 1)
DBG Extraction Algorithm (pt = 2, qt = 3) 2. Extracting a DBG
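One plausible reading of step 2 is iterative pruning: repeatedly drop fans with too few out-links and centers with too few in-links until the degree bounds hold. The sketch below assumes that reading and assumes the candidate pages from step 1 are already given; the gathering step itself is not reproduced, and `extract_dbg` is a hypothetical name.

```python
def extract_dbg(links, candidates, p, q):
    """Prune candidate fans and centers until the (p, q) degree bounds hold.

    links maps each page to the set of pages it links to.
    """
    fans = {f for f in candidates if links.get(f)}
    centers = set().union(*(links[f] for f in fans)) if fans else set()
    changed = True
    while changed:
        # Fans with fewer than p links into the surviving centers.
        weak_fans = {f for f in fans if len(links[f] & centers) < p}
        # Centers with fewer than q links from the surviving fans.
        weak_centers = {
            c for c in centers
            if sum(1 for f in fans if c in links[f]) < q
        }
        fans -= weak_fans
        centers -= weak_centers
        changed = bool(weak_fans or weak_centers)
    return fans, centers

links = {
    "a": {"x", "y"},
    "b": {"y", "z"},
    "c": {"x", "z"},
    "d": {"x", "y", "z"},
    "e": {"x", "w"},  # weakly connected page
}
fans, centers = extract_dbg(links, ["a", "b", "c", "d", "e"], p=2, q=3)
```

Here center w is pruned first (one in-link), which leaves fan e with a single remaining out-link, so e is pruned in the next pass. Each pass inspects the surviving links, which is consistent with a cost that grows with the number of links.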
DBG-based Web Community. Pros: high speed (O( #links )); finds topics across the Web. Cons: may extract groups of unrelated Web pages.
Comparison. Text-based clustering: accurate, but the center of a cluster is difficult to determine. Community topology based on DBG: inaccurate, but can be used for topic selection. Refined Web Community (Center of Cluster)
Agenda Introduction Related Work Proposal Comparison Conclusion
Proposal 1. Extract DBGs through link analysis 2. Refine communities and fix centers with DBSCAN 3. Assign the remaining pages to the nearest center
Community Extraction: extract DBGs from the Web graph. Do not allow the same page to be included in more than one Web community.
Cluster Center Refinement: find meaningful page sets. 1. Does each DBG really have a topic? 2. Is there any page in the community that is not related to the topic? Feature: terms of the extracted pages. DBSCAN [Martin Ester et al., A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996]
DBSCAN. Parameters: radius r, minP m. Figure: a circle of radius r around a core point, density-reachable points, and the resulting community (center of cluster).
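The following is a compact DBSCAN sketch using the slide's parameters r (radius) and m (minP). It is a quadratic-time illustration, not the code used in the experiments; Euclidean distance over plain lists and counting a point as its own neighbor are assumptions.

```python
import math

def dbscan(points, r, m):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # All points within distance r of point i, including i itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= r]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < m:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= m:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels

points = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [50, 50]]
labels = dbscan(points, r=1.5, m=3)
```

In the refinement step, points labelled -1 would correspond to pages that do not belong to any dense topic group.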
Partitioning Remaining Pages. Feature: term occurrence. 1. Calculate the distance between a remaining page and each center 2. If the distance to the nearest center is shorter than the threshold, attach the page to that cluster 3. Otherwise, attach the page to the “Unclassified” cluster
Agenda Introduction Related Work Proposal Experimental Result Conclusion
Target. Seed: 3,000 pages categorized under Computer/Software by ODP, plus 70,000 pages within 2 hops of the seed pages
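The three steps can be sketched as follows. Euclidean distance over term vectors is an assumption (the slide does not name the metric), and `assign_remaining` and the `"unclassified"` label are hypothetical names.

```python
import math

def assign_remaining(pages, centers, threshold):
    """pages: {page_id: vector}; centers: {cluster_id: vector}."""
    assignment = {}
    for page_id, vec in pages.items():
        # Step 1: find the nearest center.
        nearest = min(centers, key=lambda c: math.dist(vec, centers[c]))
        # Steps 2 and 3: accept it only if it is closer than the threshold.
        if math.dist(vec, centers[nearest]) < threshold:
            assignment[page_id] = nearest
        else:
            assignment[page_id] = "unclassified"
    return assignment

centers = {"c0": [0.0, 0.0], "c1": [10.0, 10.0]}
pages = {"p1": [1.0, 0.0], "p2": [9.0, 10.0], "p3": [5.0, 40.0]}
result = assign_remaining(pages, centers, threshold=3.0)
```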
Preprocess. Word ID: use the words of a dictionary as base vectors; assign the same ID to words sharing the same derivation; add terms which appear in many documents (IDF <= 8). Total: Link Extraction: eliminate links to pages which were not collected.
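Two of the preprocessing steps are easy to illustrate. The sketch below assumes documents are represented as sets of terms and that IDF is computed as log2(N / df); the slide does not give the logarithm base or the exact formula, so both are assumptions, and the function names are hypothetical.

```python
import math

def drop_uncollected_links(links, collected):
    """Keep only links whose source and target are both collected pages."""
    collected = set(collected)
    return {
        src: {dst for dst in dsts if dst in collected}
        for src, dsts in links.items()
        if src in collected
    }

def idf(term, documents):
    """log2(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for doc in documents if term in doc)
    return math.log2(len(documents) / df)

links = {"a": {"b", "zzz"}, "b": {"a"}, "zzz": {"a"}}
kept = drop_uncollected_links(links, collected=["a", "b"])

docs = [{"web", "page"}, {"web", "graph"}, {"link"}, {"cluster"}]
web_idf = idf("web", docs)
```

A lower IDF means the term appears in more documents, which matches the slide's description of the threshold as selecting terms that appear in many documents.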
# Communities
# Community Members (pt=3, qt=3)
# Community Members
Variance of Terms
After DBSCAN
Conclusion
Future Work: applying the method to a larger data set. This may need parallel processing. Analyzing with