Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab.

Background

Open Directory Project
- Used by Google, Lycos, etc.
- Web pages are categorized by hand
  - Accurate
  - Updated slowly
  - Not scalable

World Wide Web
- Rapid growth (= the number of clusters changes)
- Daily updates (= cluster centers move)
Because of these two properties of the Web, a Web page clustering system that needs no human effort is required.

Purpose
Construct a Web page clustering system which
- finds clusters without human help
- is scalable
- clusters Web pages at high speed
- clusters Web pages accurately

Agenda Introduction Related Work Proposal Comparison Conclusion

Clustering Algorithm
- Text-based clustering
  - Uses words as features
  - The generally used approach
- Link-based clustering
  - Focuses on link structure
  - Used especially for clustering Web pages

k-means Algorithm
[Figure: k-means with k = 3; each point is the vector representation of a document]

Problems of k-means Algorithm
- The appropriate k depends on the data set.
- Outliers strongly affect the clustering result.
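The first problem is easier to see in code: k has to be passed in before any clustering happens. The following is a minimal sketch of plain k-means on small tuples, not the implementation evaluated in this talk. The function names, the fixed iteration cap, and the random seed are assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points]


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means; k must be chosen by the caller (hypothetical helper)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers picked from the data
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                # A center is the mean of its members, so one far-away
                # outlier can drag it noticeably.
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


if __name__ == "__main__":
    docs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(docs, k=2))
```

Because every center is a mean, a single distant point shifts it, which is the outlier sensitivity noted on the slide.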

Hierarchical Clustering BIRCH [Zhang ’96], CURE [Guha ’98], Chameleon [Karypis ’99], ROCK [Guha ’00]

Hierarchical Clustering
- The number of clusters can be determined by a condition.
- Clustering a large number of points (pages) results in many I/O accesses.

Use of Link Structure
- Web pages include not only text but also links.
- People link Web pages to other related pages.
- Linked Web pages may share the same topic.

Extraction of Web Community based on Link Analysis An Approach to Find Related Communities Based on Bipartite Graphs [P.Krishna Reddy et al., 2001]

Terminology
- Fans and centers
- Bipartite graph (BG)
  - Complete BG (CBG)
  - Dense BG (DBG)
[Figure: fans linking to centers in (a) a CBG and (b) a DBG with parameters p and q]

An Approach to Find Related Communities Based on Bipartite Graphs
Definition: the set T contains the members of the community if there exists a dense bipartite graph DBG(T, I, p, q), where
- T: fans
- I: centers
- p: minimum number of out-links from each fan
- q: minimum number of in-links to each center
[Figure: example DBG(T, I, 2, 3)]
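The definition can be written as a small check. This is a sketch under one assumption: p and q are read as lower bounds, so every fan links to at least p centers and every center is linked from at least q fans. The function name and the link representation (a set of (fan, center) pairs) are illustrative choices, not taken from the paper.

```python
def is_dbg(fans, centers, links, p, q):
    """Return True if (fans, centers, links) satisfies DBG(T, I, p, q).

    links is an iterable of (source, target) pairs. Only links that go
    from a fan to a center are counted.
    """
    fans, centers = set(fans), set(centers)
    if fans & centers:
        return False  # the two sides of a bipartite graph must be disjoint
    out_deg = {f: 0 for f in fans}
    in_deg = {c: 0 for c in centers}
    for src, dst in set(links):  # a set, so duplicate links count once
        if src in fans and dst in centers:
            out_deg[src] += 1
            in_deg[dst] += 1
    return (all(d >= p for d in out_deg.values())
            and all(d >= q for d in in_deg.values()))


if __name__ == "__main__":
    T = {"a", "b", "c"}
    I = {"x", "y"}
    E = {(f, c) for f in T for c in I}  # every fan links to every center
    print(is_dbg(T, I, E, p=2, q=3))
```

In this example every fan has 2 out-links and every center has 3 in-links, so the graph satisfies DBG(T, I, 2, 3); it is also complete, so it is a CBG as well.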

DBG Extraction Algorithm (pt = 2, qt = 3)
1. Gather related nodes (threshold = 1)

DBG Extraction Algorithm (pt = 2, qt = 3)
2. Extract a DBG
[Figure: example graph with per-node counts]
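The slides do not spell out how step 2 proceeds. One plausible reading, sketched below as an assumption, is iterative pruning: repeatedly drop links whose source has fewer than pt out-links or whose target has fewer than qt in-links, until nothing changes. Step 1 (gathering related nodes) is not covered here, and `extract_dbg` is a hypothetical name.

```python
def extract_dbg(links, pt, qt):
    """Prune a set of (fan, center) links down to a DBG core.

    Assumption: pruning by degree thresholds until a fixed point is
    reached. Returns the surviving fans and centers; both sets are
    empty if no DBG survives.
    """
    links = set(links)
    while True:
        out_deg, in_deg = {}, {}
        for src, dst in links:
            out_deg[src] = out_deg.get(src, 0) + 1
            in_deg[dst] = in_deg.get(dst, 0) + 1
        kept = {(s, d) for s, d in links
                if out_deg[s] >= pt and in_deg[d] >= qt}
        if kept == links:  # fixed point: every remaining node qualifies
            break
        links = kept
    fans = {s for s, _ in links}
    centers = {d for _, d in links}
    return fans, centers


if __name__ == "__main__":
    edges = {("f1", "c1"), ("f1", "c2"),
             ("f2", "c1"), ("f2", "c2"),
             ("f3", "c1"), ("f3", "c2"), ("f3", "c3"),
             ("f4", "c1")}
    print(extract_dbg(edges, pt=2, qt=3))
```

Each pass touches every remaining link once, which fits the O(#links) cost per pass that makes link-based extraction attractive.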

DBG-based Web Community
- Pro: high speed (O(#links))
- Pro: finds topics across the Web
- Con: may extract groups of unrelated Web pages

Comparison
- Text-based clustering
  - Accurate
  - Difficult to determine the center of each cluster
- Community topology based on DBG
  - Inaccurate
  - Can be used for topic selection
A refined Web community serves as the center of a cluster.

Agenda Introduction Related Work Proposal Comparison Conclusion

Proposal
1. Extract DBGs through link analysis
2. Refine communities and fix centers with DBSCAN
3. Assign the other pages to the nearest center

Community Extraction
Extract DBGs from the Web graph
- A page may not be included in more than one Web community
[Figure: Web graph]

Cluster Center Refinement
Find meaningful page sets
1. Does each DBG really have a topic?
2. Is there any page in the community that is not related to the topic?
Feature: terms of the extracted pages
DBSCAN [Martin Ester et al., A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996]

DBSCAN
- radius: r
- minP: m
[Figure: core points and density-reachable points within radius r; the resulting community becomes the center of a cluster]
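To show how r and minP interact, here is a compact DBSCAN sketch over small coordinate tuples. It assumes Euclidean distance and a brute-force neighbor search; in the proposal the points would be term vectors of the pages in one extracted community, and pages labeled as noise would be the ones unrelated to the topic. The names `dbscan`, `min_pts` and `NOISE` are illustrative.

```python
import math

NOISE = -1


def dbscan(points, r, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within radius r of point i, including i itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= r]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: density reachable
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is also core: keep expanding
        cluster += 1
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dbscan(pts, r=1.5, min_pts=3))
```

Unlike k-means, the number of clusters is not an input here; it falls out of r and minP, and isolated points are set aside as noise rather than distorting a center.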

Partitioning Remaining Pages
Feature: term appearance
1. Calculate the distance between a remaining page and each center
2. If the distance to the nearest center is shorter than the threshold, attach the page to that cluster
3. Otherwise, attach the page to the “Unclassified cluster”
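The three steps above can be sketched as follows. The slides do not say which distance measure is used, so Euclidean distance over term vectors is an assumption here, as are the function name and the dictionary-based inputs.

```python
import math

UNCLASSIFIED = "unclassified"


def partition(pages, centers, threshold):
    """Attach each remaining page to its nearest center, or to UNCLASSIFIED.

    pages:   {page_id: term vector}
    centers: {cluster_id: center vector}
    """
    assignment = {}
    for page_id, vec in pages.items():
        # Step 1: distance from this page to every cluster center.
        nearest, dist = min(
            ((cid, math.dist(vec, center)) for cid, center in centers.items()),
            key=lambda pair: pair[1])
        # Steps 2 and 3: accept only if strictly closer than the threshold.
        assignment[page_id] = nearest if dist < threshold else UNCLASSIFIED
    return assignment


if __name__ == "__main__":
    centers = {"c1": (0.0, 0.0), "c2": (10.0, 10.0)}
    pages = {"a": (1.0, 0.0), "b": (9.0, 10.0), "z": (5.0, 5.0)}
    print(partition(pages, centers, threshold=3.0))
```

The threshold keeps pages that are far from every center out of the topical clusters, at the cost of leaving some pages unclassified.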

Agenda Introduction Related Work Proposal Experimental Result Conclusion

Target
- Seed: 3,000 pages categorized under Computer/Software by ODP
- 70,000 pages within 2 hops of the seed pages

Preprocess
Word ID
- Use the words of a dictionary as base vectors
- Assign the same ID to words sharing the same derivation
- Add terms which appear in many documents (IDF <= 8)
- Total: 29347
Link Extraction
- Eliminate links to pages which were not collected

# Communities

# Community Members (pt=3, qt=3)

# Community Members

Variance of Terms

After DBSCAN

Conclusion

Future Work
- Applying to a larger data set
  - This may need parallel processing
- Analyzing with
