Published by Darlene Cook. Modified over 8 years ago.
1
Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab
2
Background
3
Open Directory Project Used by Google, Lycos, etc. Categorizing Web pages by hand: accurate, but updated late and unscalable
4
World Wide Web Rapid increase (= # of clusters changes) Daily updated (= cluster centers move) Due to these two properties of the Web, a Web page clustering system that works without human effort is needed.
5
Purpose Constructing a Web page clustering system which: finds clusters without human help; is scalable; clusters Web pages at high speed; clusters Web pages accurately
6
Agenda Introduction Related Work Proposal Comparison Conclusion
7
Clustering Algorithm Text-based clustering Use of word as feature Generally used algorithm Link-based clustering Focus on link structure Especially used in clustering Web pages
8
k-means Algorithm k = 3 point: vector expression of each document
9
Problems of k-means Algorithm The appropriate k depends on the data set. Outliers strongly affect the clustering result.
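The k-means procedure discussed on the last two slides can be sketched as follows. This is a minimal, generic version of Lloyd's algorithm, not the implementation used in the presentation; the function names, the random initialization, and the fixed iteration count are assumptions made for illustration. Note that k has to be passed in by the caller, which is the first problem listed above.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Cluster document vectors into k groups (k must be chosen in advance)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct input points picked at random.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda c: sq_dist(p, centers[c]))
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

docs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(docs, k=2)
```

Because every center is a plain mean, a single far-away point pulls its center toward itself, which is the outlier sensitivity noted above.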
10
Hierarchical Clustering BIRCH [Zhang ’96], CURE [Guha ’98], Chameleon [Karypis ’99], ROCK [Guha ’00]
11
Hierarchical Clustering # of clusters can be determined by condition. Clustering a large number of points (pages) results in many I/O accesses.
12
Use of Link Structure Web pages include not only text but also links. People link Web pages to other related pages. Linked Web pages may share the same topic
13
Extraction of Web Community based on Link Analysis An Approach to Find Related Communities Based on Bipartite Graphs [P.Krishna Reddy et al., 2001]
14
Terminology Fans and Centers; Bipartite Graph (BG); Complete BG (CBG); Dense BG (DBG) [Figure: fans and centers; (a) CBG, (b) DBG with parameters p and q]
15
An Approach to Find Related Communities Based on Bipartite Graphs Definition: The set T contains the members of the community if there exists a dense bipartite graph DBG(T, I, p, q), where T: fans; I: centers; p: # of out-links; q: # of in-links. [Figure: example DBG(T, I, 2, 3)]
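The definition can be made concrete with a small membership check. The sketch below assumes the usual reading of DBG(T, I, p, q): every fan in T links to at least p centers in I, and every center in I is linked from at least q fans in T. The function name `is_dbg` and the dictionary representation of the link graph are hypothetical choices for this example.

```python
def is_dbg(links, fans, centers, p, q):
    """Check whether (fans, centers) form a DBG(T, I, p, q).

    links: dict mapping a page to the set of pages it links to.
    Assumption: p and q are lower bounds ("at least p out-links",
    "at least q in-links").
    """
    fans, centers = set(fans), set(centers)
    # Every fan needs at least p out-links into the center set.
    for f in fans:
        if len(links.get(f, set()) & centers) < p:
            return False
    # Every center needs at least q in-links from the fan set.
    for c in centers:
        in_links = sum(1 for f in fans if c in links.get(f, set()))
        if in_links < q:
            return False
    return True

web = {"a": {"x", "y"}, "b": {"x", "y"}, "c": {"x", "y"}}
ok = is_dbg(web, fans={"a", "b", "c"}, centers={"x", "y"}, p=2, q=3)
```

A complete bipartite graph is the special case in which every fan links to every center.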
16
DBG Extraction Algorithm (pt = 2, qt = 3) 1. Gathering related nodes threshold = 1
17
DBG Extraction Algorithm (pt = 2, qt = 3) 2. Extracting a DBG [Figure: example graph with numbered nodes]
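One plausible reading of the extraction step is iterative pruning: starting from the gathered candidate nodes, repeatedly drop fans with fewer than pt out-links and centers with fewer than qt in-links until nothing changes. The slides do not spell out the procedure, so the sketch below is an assumption, and `extract_dbg` and its arguments are hypothetical names.

```python
def extract_dbg(links, fans, centers, pt, qt):
    """Prune candidate fans and centers down to a DBG (sketch).

    links: dict mapping a page to the set of pages it links to.
    Assumption: pruning repeats until every remaining fan has at least
    pt out-links and every remaining center has at least qt in-links.
    """
    fans, centers = set(fans), set(centers)
    changed = True
    while changed:
        changed = False
        # Remove fans that point to too few of the remaining centers.
        weak_fans = {f for f in fans
                     if len(links.get(f, set()) & centers) < pt}
        if weak_fans:
            fans -= weak_fans
            changed = True
        # Remove centers that are cited by too few of the remaining fans.
        weak_centers = {c for c in centers
                        if sum(1 for f in fans if c in links.get(f, set())) < qt}
        if weak_centers:
            centers -= weak_centers
            changed = True
    return fans, centers

web = {"a": {"x", "y"}, "b": {"x", "y"}, "c": {"x", "y"}, "d": {"x"}, "e": {"z"}}
community = extract_dbg(web, {"a", "b", "c", "d", "e"}, {"x", "y", "z"}, pt=2, qt=3)
```

Each pass only inspects the links of the remaining nodes, and a node is never added back once removed, so the loop always terminates.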
18
DBG-based Web Community O High speed (O( #links )) O Finding out topics over the Web X Possibility of extracting a group of unrelated Web pages
19
Comparison Text-based clustering: accurate; difficult to determine the center of a cluster. Community topology based on DBG: inaccurate; can be used for topic selection. [Figure: Refined Web Community, Center of Cluster]
20
Agenda Introduction Related Work Proposal Comparison Conclusion
21
Proposal 1. Extract DBGs through link analysis 2. Refine communities and fix centers with DBSCAN 3. Partition other pages to the nearest center
22
Community Extraction Extract DBGs from the Web graph. Do not allow the same page to be included in more than one Web community. [Figure: Web Graph]
23
Cluster Center Refinement Find meaningful page sets 1. Do the DBGs really have a topic? 2. Is there any page in the community that is not related to the topic? Feature: terms of extracted pages DBSCAN [Martin Ester et al., A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996]
24
DBSCAN Parameters: radius r, minP m [Figure: core points and density-reachable points within radius r; the resulting community is the center of a cluster]
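The DBSCAN step with its two parameters (radius r and minP m) can be sketched as below. This is a textbook version on plain coordinate tuples, not the presenter's code; Euclidean distance and the names `dbscan` and `min_pts` are assumptions. Points that are neither core points nor density-reachable from one receive the label -1 (noise), which is how pages unrelated to a community's topic would be filtered out.

```python
import math

def dbscan(points, r, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # All points within radius r of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= r]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # not a core point; may become a border point later
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels

pages = [(0, 0), (0, 1), (1, 0), (1, 1),
         (10, 10), (10, 11), (11, 10), (11, 11),
         (50, 50)]
labels = dbscan(pages, r=1.5, min_pts=3)
```

Unlike k-means, the number of clusters is not an input here; it follows from r and minP.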
25
Partitioning Remaining Pages Feature: term’s appearance 1. Calculate distance between a remaining page and each center 2. If the distance to the nearest center is shorter than threshold, attach the page to that cluster 3. Otherwise, attach the page to “Unclassified cluster”
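The three partitioning steps above can be sketched as a nearest-center assignment with a cutoff. The slides do not state which distance measure is used, so Euclidean distance over term vectors is an assumption here, as are the names `assign_pages` and `UNCLASSIFIED`.

```python
import math

UNCLASSIFIED = "Unclassified"  # hypothetical label for the fallback cluster

def assign_pages(pages, centers, threshold):
    """Attach each remaining page to its nearest center, if close enough.

    pages:   dict page_id -> term vector (tuple of floats)
    centers: dict cluster_id -> center vector of the same length
    Assumption: distance is Euclidean.
    """
    result = {}
    for page_id, vec in pages.items():
        # Step 1: distance between the page and each center.
        nearest, dist = min(
            ((cid, math.dist(vec, center)) for cid, center in centers.items()),
            key=lambda pair: pair[1],
        )
        # Steps 2 and 3: accept the nearest center only below the threshold.
        result[page_id] = nearest if dist < threshold else UNCLASSIFIED
    return result

centers = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
pages = {"p1": (1.0, 0.0), "p2": (9.0, 1.0), "p3": (5.0, 5.0)}
clusters = assign_pages(pages, centers, threshold=3.0)
```

The threshold keeps pages that match no community from distorting an existing cluster.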
26
Agenda Introduction Related Work Proposal Experimental Result Conclusion
27
Target Seed: 3,000 pages categorized under Computer/Software by ODP; 70,000 pages within 2 hops of the seed pages
28
Preprocess Word ID: Use words of a dictionary as base vectors; attribute the same ID to words sharing the same derivation; add terms which appear in many documents (IDF <= 8); Total: 29347. Link Extraction: Elimination of links to pages which were not collected.
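The IDF cutoff in the preprocessing step can be illustrated as follows. The slides do not give the exact IDF formula, so the common form log2(N / df) is an assumption, and `idf_table` and `frequent_terms` are hypothetical helper names.

```python
import math

def idf_table(documents):
    """Map each term to log2(N / df), where df counts documents containing it.

    documents: list of term collections, one per page.
    Assumption: base-2 logarithm without smoothing.
    """
    n = len(documents)
    df = {}
    for doc in documents:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {term: math.log2(n / count) for term, count in df.items()}

def frequent_terms(documents, max_idf):
    """Terms whose IDF does not exceed max_idf, i.e. terms seen in many documents."""
    return {t for t, v in idf_table(documents).items() if v <= max_idf}

docs = [{"web", "page"}, {"web", "graph"}, {"web", "link"}, {"web", "page"}]
kept = frequent_terms(docs, max_idf=1)
```

A lower IDF means the term occurs in more documents, so an upper bound on IDF keeps only terms that are frequent enough across the collection.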
29
# Communities
30
# Community Members (pt=3, qt=3)
31
# Community Members
32
Variance of Terms
33
After DBSCAN
34
Conclusion
35
Future Work Applying to a larger data set (this may need parallel processing) Analyzing with