Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab.

Background

Open Directory Project
- Used by Google, Lycos, etc.
- Web pages are categorized by hand
  - Accurate
  - Updated slowly
  - Not scalable

World Wide Web
- Rapid growth (= the number of clusters changes)
- Daily updates (= cluster centers move)
Because of these two properties of the Web, a Web page clustering system that needs no human effort is required.

Purpose
Construct a Web page clustering system which
- finds clusters without human help
- is scalable
- clusters Web pages at high speed
- clusters Web pages accurately

Agenda
- Introduction
- Related Work
- Proposal
- Comparison
- Conclusion

Clustering Algorithm
- Text-based clustering
  - Uses words as features
  - The generally used approach
- Link-based clustering
  - Focuses on link structure
  - Used especially for clustering Web pages

k-means Algorithm
(Figure: example with k = 3; each point is the vector representation of a document.)
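The k-means procedure illustrated on this slide can be sketched as follows. This is a minimal illustration, not the implementation used in the work: the function name `kmeans`, the Euclidean distance, and the choice of the first k points as initial centers are all assumptions made for the example.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length tuples (document vectors)."""
    # Assumption: seed the centers with the first k points (deterministic).
    centers = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                mean = tuple(sum(xs) / len(members) for xs in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, assignment
```

Note that k is an input here, which is the first weakness listed on the next slide: the caller has to know the number of clusters in advance.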

Problems of k-means Algorithm
- The appropriate k depends on the data set.
- The clustering result is sensitive to outliers.

Hierarchical Clustering
BIRCH [Zhang ’96], CURE [Guha ’98], Chameleon [Karypis ’99], ROCK [Guha ’00]

Hierarchical Clustering
- The number of clusters can be determined by a condition.
- Clustering a large number of points (pages) requires many I/O accesses.

Use of Link Structure
- Web pages include not only text but also links.
- People link Web pages to other related pages.
- Linked Web pages may share the same topic.

Extraction of Web Community based on Link Analysis
An Approach to Find Related Communities Based on Bipartite Graphs [P. Krishna Reddy et al., 2001]

Terminology
- Fans and Centers
- Bipartite Graph (BG)
  - Complete BG (CBG)
  - Dense BG (DBG)
(Figure: fans linking to centers in (a) a CBG and (b) a DBG, annotated with p and q.)

An Approach to Find Related Communities Based on Bipartite Graphs
Definition: The set T contains the members of the community if there exists a dense bipartite graph DBG(T, I, p, q), where
- T: fans
- I: centers
- p: number of out-links
- q: number of in-links
(Figure: example of DBG(T, I, 2, 3).)
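The definition above can be checked mechanically. The sketch below assumes the usual reading of DBG(T, I, p, q): every fan in T has at least p out-links to centers in I, and every center in I has at least q in-links from fans in T. The function name `is_dbg` and the representation of links as (source, target) pairs are hypothetical choices for illustration.

```python
def is_dbg(fans, centers, links, p, q):
    """Return True if (fans, centers) form a dense bipartite graph DBG(T, I, p, q).

    links is an iterable of (source, target) page pairs.
    Assumed reading: each fan has >= p out-links to centers,
    and each center has >= q in-links from fans.
    """
    fans, centers = set(fans), set(centers)
    # A bipartite graph needs two non-empty, disjoint node sets.
    if not fans or not centers or fans & centers:
        return False
    out_deg = {f: 0 for f in fans}
    in_deg = {c: 0 for c in centers}
    # Count only fan -> center links; duplicates are ignored.
    for src, dst in set(links):
        if src in fans and dst in centers:
            out_deg[src] += 1
            in_deg[dst] += 1
    return (all(d >= p for d in out_deg.values())
            and all(d >= q for d in in_deg.values()))
```

For example, three fans that each link to the same two centers satisfy DBG(T, I, 2, 3), since each fan has two out-links and each center has three in-links.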

DBG Extraction Algorithm (pt = 2, qt = 3)
1. Gather related nodes (threshold = 1)

DBG Extraction Algorithm (pt = 2, qt = 3)
2. Extract a DBG
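The slides name only the two steps, so the following is a sketch under stated assumptions rather than the algorithm as published. It assumes that step 1 has already produced candidate fan and center sets, and that step 2 repeatedly removes fans with fewer than pt out-links and centers with fewer than qt in-links until nothing changes. The function name `extract_dbg` is hypothetical.

```python
def extract_dbg(fans, centers, links, pt, qt):
    """Prune candidate fans/centers until every remaining node meets pt/qt.

    Assumption: pruning is iterative, because removing one node can push
    its neighbors below their threshold.
    """
    fans, centers = set(fans), set(centers)
    edges = {(s, d) for s, d in links if s in fans and d in centers}
    changed = True
    while changed:
        changed = False
        out_deg = {f: 0 for f in fans}
        in_deg = {c: 0 for c in centers}
        for s, d in edges:
            out_deg[s] += 1
            in_deg[d] += 1
        weak_fans = {f for f, n in out_deg.items() if n < pt}
        weak_centers = {c for c, n in in_deg.items() if n < qt}
        if weak_fans or weak_centers:
            changed = True
            fans -= weak_fans
            centers -= weak_centers
            # Drop links that touch a removed node.
            edges = {(s, d) for s, d in edges if s in fans and d in centers}
    return fans, centers
```

Each pass visits every remaining link once, which is in line with the O(#links) cost cited for this approach when the number of passes is small.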

DBG-based Web Community
- Pro: high speed (O(#links))
- Pro: finds topics across the Web
- Con: may extract groups of unrelated Web pages

Comparison
- Text-based clustering
  - Accurate
  - Difficult to determine the center of a cluster
- Community topology based on DBG
  - Inaccurate
  - Can be used for topic selection
(Figure labels: Refined Web Community, Center of Cluster)

Agenda
- Introduction
- Related Work
- Proposal
- Comparison
- Conclusion

Proposal
1. Extract DBGs through link analysis
2. Refine communities and fix centers with DBSCAN
3. Assign the other pages to the nearest center

Community Extraction
- Extract DBGs from the Web graph
  - A page may not be included in more than one Web community
(Figure: Web Graph)

Cluster Center Refinement
Find meaningful page sets:
1. Do the DBGs really have a topic?
2. Is there any page in the community that is not related to the topic?
Feature: terms of the extracted pages
DBSCAN [Martin Ester et al., A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996]

DBSCAN
- radius: r
- minP: m
(Figure labels: r, Core, Density reachable, Community (Center of cluster))
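The two parameters on this slide drive the whole algorithm: a point is a core point if at least m points (counting itself) lie within radius r, and clusters grow by following density-reachable points from core points. The sketch below is a plain, quadratic-time illustration of that idea. The Euclidean distance and the names `dbscan` and `NOISE` are assumptions, and the slides do not say how the pages' term features were turned into distances.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, r, m):
    """Label each point with a cluster index, or NOISE.

    r: neighborhood radius; m: minimum number of points (minP) within r,
    counting the point itself, for it to be a core point.
    """
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= r]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < m:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= m:
                queue.extend(nbrs)  # j is a core point: keep expanding
        cluster += 1
    return labels
```

In the setting of these slides, pages labeled as noise inside a community would be the candidates for removal during refinement; that interpretation is an inference from the refinement questions, not something the slides state outright.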

Partitioning Remaining Pages
Feature: term appearance
1. Calculate the distance between a remaining page and each center.
2. If the distance to the nearest center is shorter than the threshold, attach the page to that cluster.
3. Otherwise, attach the page to the “Unclassified” cluster.
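The three steps above can be sketched as a nearest-center assignment with a cutoff. The slides do not name the distance measure, so Euclidean distance over term-appearance vectors is an assumption here, as are the names `assign_page` and `UNCLASSIFIED`.

```python
import math

UNCLASSIFIED = "unclassified"  # hypothetical label for the catch-all cluster


def assign_page(page_vec, centers, threshold):
    """Attach a page to the nearest cluster center, or to UNCLASSIFIED.

    centers: dict mapping a cluster name to its center vector.
    Assumption: Euclidean distance between term-appearance vectors.
    """
    best_name, best_dist = None, math.inf
    # Step 1: distance from the page to each center.
    for name, center in centers.items():
        d = math.dist(page_vec, center)
        if d < best_dist:
            best_name, best_dist = name, d
    # Steps 2 and 3: accept the nearest center only if it is close enough.
    return best_name if best_dist < threshold else UNCLASSIFIED
```

Because each page is compared only against the cluster centers and not against every other page, this step scales with the number of pages times the number of clusters.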

Agenda
- Introduction
- Related Work
- Proposal
- Experimental Result
- Conclusion

Target
- Seed: 3,000 pages categorized under Computer/Software by ODP
- 70,000 pages within 2 hops of the seed pages

Preprocess
- Word ID
  - Use the words of a dictionary as base vectors
  - Assign the same ID to words sharing the same derivation
  - Add terms which appear in many documents (IDF <= 8)
  - Total:
- Link Extraction
  - Eliminate links to pages which were not collected
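The IDF cutoff in the preprocessing step can be illustrated as follows. The slides give only the threshold, not the formula, so the unsmoothed definition IDF(t) = log2(N / df(t)) used here is an assumption, and the function names are hypothetical. A low IDF means the term appears in many documents.

```python
import math


def idf_table(documents):
    """Map each term to log2(N / df), where df is its document frequency.

    documents: list of token lists. The base-2, unsmoothed formula is an
    assumption; the slides state only a threshold on the IDF value.
    """
    n = len(documents)
    df = {}
    for doc in documents:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log2(n / count) for term, count in df.items()}


def frequent_terms(documents, max_idf):
    """Return the terms whose IDF is at most max_idf."""
    return {t for t, v in idf_table(documents).items() if v <= max_idf}
```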

# Communities

# Community Members (pt=3, qt=3)

# Community Members

Variance of Terms

After DBSCAN

Conclusion

Future Work
- Apply to larger data sets
  - This may need parallel processing
- Analyzing with