Evaluation of Clustering Techniques on DMOZ Data


Evaluation of Clustering Techniques on DMOZ Data Alper Rifat Uluçınar Rıfat Özcan Mustafa Canım

Outline What is DMOZ and why do we use it? What is our aim? Evaluation of partitioning clustering algorithms Evaluation of hierarchical clustering algorithms Conclusion

What is DMOZ and why do we use it? www.dmoz.org Another name for the ODP, the Open Directory Project The largest human-edited directory on the Internet 5,300,000 sites 72,000 editors 590,000 categories

What is our aim? Evaluating clustering algorithms is not easy We will use DMOZ as a reference point (an ideal cluster structure) Run our own clustering algorithms on the same data Finally, compare the results.

[Diagram] All DMOZ documents (websites) are partitioned in two ways: by human evaluation (the DMOZ clusters) and by applying clustering algorithms such as C3M, K-Means etc. The question is how well the two clusterings match.

A) Evaluation of Partitioning Clustering Algorithms 20,000 documents from DMOZ flat-partitioned data (214 folders) We applied HTML parsing, stemming and stop-word elimination (a sketch of this pipeline follows the before/after examples below) We will apply two clustering algorithms: C3M and K-Means

Before applying HTML parsing, stemming and stop-word elimination

After applying HTML parsing, stemming and stop-word elimination
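A minimal sketch of such a preprocessing step, assuming simplifications on our part: the regular-expression tag stripping, the tiny stop-word list and the choice of the Porter stemmer are stand-ins, since the presentation does not say which tools were used.

import re
from nltk.stem import PorterStemmer  # pip install nltk

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}
stemmer = PorterStemmer()

def preprocess(html):
    text = re.sub(r"<[^>]+>", " ", html)          # crude HTML tag removal
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase word tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]      # stem what remains

print(preprocess("<html><body>The clustering of documents</body></html>"))
# ['cluster', 'document']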

[Diagram] 20,000 DMOZ documents: human evaluation yields 214 clusters; applying C3M yields 642 clusters.

How do we compare the DMOZ clusters and the C3M clusters? Answer: the Corrected Rand coefficient

Validation of Partitioning Clustering Comparison of two clustering structures over the same N documents Clustering structure 1: R clusters Clustering structure 2: C clusters Metrics [1]: Rand Index, Jaccard Coefficient, Corrected Rand Coefficient

Validation of Partitioning Clustering For each pair of documents (d1, d2) there are four cases: Type I (frequency a): together in the same cluster in both structures; Type II (frequency b): together in structure 1 but separated in structure 2; Type III (frequency c): separated in structure 1 but together in structure 2; Type IV (frequency d): separated in both structures.

Validation of Partitioning Clustering Rand Index = (a+d) / (a+b+c+d) Jaccard Coefficient = a / (a+b+c) Corrected Rand Coefficient Accounts for randomness Normalizes the Rand index so that it is 0 when the partitions are selected by chance and 1 when a perfect match is achieved CR = (R − E(R)) / (1 − E(R))

Validation of Partitioning Clustering Example: Docs: d1, d2, d3, d4, d5, d6 Clustering Structure 1: C1: d1, d2, d3 C2: d4, d5, d6 Clustering Structure 2: D1: d1, d2 D2: d3, d4 D3: d5, d6

Validation of Partitioning Clustering Pair counts: a: (d1, d2), (d5, d6) b: (d1, d3), (d2, d3), (d4, d5), (d4, d6) c: (d3, d4) d: the remaining 8 pairs (15 − 7) Rand Index = (2+8)/15 = 0.67 Jaccard Coeff. = 2/(2+4+1) = 0.29 Corrected Rand = 0.24 Contingency table:
        D1  D2  D3 | total
C1       2   1   0 |  3
C2       0   1   2 |  3
total    2   2   2 |  6
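These numbers can be reproduced with a short script; below is a minimal Python sketch (the function names and label encodings are ours). The Corrected Rand is computed in its pair-count (Hubert-Arabie) form, which is algebraically equivalent to the (R − E(R)) / (1 − E(R)) form shown two slides earlier.

from itertools import combinations

def pair_counts(labels1, labels2):
    # Classify every document pair into Types I-IV (frequencies a, b, c, d).
    a = b = c = d = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1  # Type I: together in both structures
        elif same1:
            b += 1  # Type II: together only in structure 1
        elif same2:
            c += 1  # Type III: together only in structure 2
        else:
            d += 1  # Type IV: apart in both structures
    return a, b, c, d

def corrected_rand(a, b, c, d):
    # Pair-count form: (observed - expected) / (maximum - expected),
    # equal to (R - E(R)) / (1 - E(R)) for the Rand index R.
    n_pairs = a + b + c + d
    expected = (a + b) * (a + c) / n_pairs
    maximum = ((a + b) + (a + c)) / 2
    return (a - expected) / (maximum - expected)

# The six-document example: C1={d1,d2,d3}, C2={d4,d5,d6}
# versus D1={d1,d2}, D2={d3,d4}, D3={d5,d6}.
a, b, c, d = pair_counts([1, 1, 1, 2, 2, 2], [1, 1, 2, 2, 3, 3])
print(a, b, c, d)                            # 2 4 1 8
print((a + d) / (a + b + c + d))             # Rand index: 0.666...
print(a / (a + b + c))                       # Jaccard: 0.2857...
print(round(corrected_rand(a, b, c, d), 2))  # Corrected Rand: 0.24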

Results: Low Corrected Rand and Jaccard values (≈ 0.01); Rand index ≈ 0.77 Possible reasons: Noise in the data, e.g., 300 "Document Not Found" pages The problem is difficult, e.g., the Homepages category

B) Evaluation of Hierarchical Clustering Algorithms Obtain a partitioning of DMOZ Determine a cut depth (experimentally?) Collect the documents at that depth or deeper, grouped at that level Documents at shallower depths? Ignore them… (see the sketch below)
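A toy sketch of this depth-based flattening; the category paths, document names and cut depth below are invented for illustration and are not from the presentation.

# Derive a flat partition from hierarchical category paths by cutting
# the hierarchy at a fixed depth.
docs = {
    "doc1": "Top/Computers/Internet/Searching",
    "doc2": "Top/Computers",            # shallower than the cut: ignored
    "doc3": "Top/Arts/Music/Jazz",
}
CUT_DEPTH = 3                           # keep documents at depth >= 3

partition = {}
for doc, path in docs.items():
    levels = path.split("/")
    if len(levels) < CUT_DEPTH:
        continue                        # ignore shallower documents
    cluster = "/".join(levels[:CUT_DEPTH])  # ancestor category at the cut
    partition.setdefault(cluster, []).append(doc)

print(partition)
# {'Top/Computers/Internet': ['doc1'], 'Top/Arts/Music': ['doc3']}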

Hierarchical Clustering: Steps Obtain the hierarchical clusters using: Single Linkage Average Linkage Complete Linkage Obtain a partitioning of the hierarchical clustering… (one possible sketch below)
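As a sketch of this step with SciPy; the random vectors stand in for the real document vectors, which the presentation does not specify.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
vectors = rng.random((10, 5))               # stand-in document vectors
dists = pdist(vectors, metric="euclidean")  # condensed distance matrix

Z_single = linkage(dists, method="single")
Z_average = linkage(dists, method="average")
Z_complete = linkage(dists, method="complete")
# Each Z is an (n-1) x 4 array: one row per fusion level, giving the two
# merged clusters, the fusion height, and the size of the new cluster.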

Hierarchical Clustering: Steps One way, treat the DMOZ clusters as "queries": For each selected DMOZ cluster, find the number of "target clusters" in the computed partitioning Take the average See if Nt < Ntr If not, either the choice of partitioning or the hierarchical clustering did not perform well…

Hierarchical Clustering: Steps Another way: compare the two partitions using an index, e.g., C-RAND…

Choice of Partition: Outline Obtain the dendrogram Single linkage Complete linkage Group average linkage Ward's method

Choice of Partition: Outline How to convert a hierarchical cluster structure into a partition? Visually inspect the dendrogram? Use tools from statistics?

Choice of Partition: Inconsistency Coefficient At each fusion level: Calculate the "inconsistency coefficient" Utilize statistics from the previous fusion levels Choose the fusion level at which the inconsistency coefficient is at its maximum.

Choice of Partition: Inconsistency Coefficient Inconsistency coefficient (I.C.) at fusion level i, in its standard form: IC_i = (h_i − mean(h)) / std(h), where h_i is the fusion height at level i and the mean and standard deviation are taken over the fusion heights of the links within a given depth below level i (an I.C. of 0 is used when the heights are all equal).
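A sketch of how this can be computed with SciPy, whose inconsistent() implements the same statistic; the toy points, the depth d=2 and the threshold are our assumptions, since the presentation does not say which tool it used.

import numpy as np
from scipy.cluster.hierarchy import fcluster, inconsistent, linkage
from scipy.spatial.distance import pdist

points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])  # two tight groups
Z = linkage(pdist(points), method="single")

R = inconsistent(Z, d=2)  # per level: mean height, std, link count, I.C.
print(R[:, 3])            # the I.C. column; cut where it peaks

# Convert the dendrogram into a flat partition at an I.C. threshold.
labels = fcluster(Z, t=1.0, criterion="inconsistent", depth=2)
print(labels)             # e.g., [1 1 1 2 2 2]: one cluster per group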

Choice of Partition: I.C. Hands on, Objects [Plot of the objects; distance measure: Euclidean distance]

Choice of Partition: I.C. Hands on, Single Linkage

Choice of Partition: I.C. Single Linkage Results Level 1 → 0 Level 2 → 0 Level 3 → 0 Level 4 → 0 Level 5 → 0 Level 6 → 1.1323 Level 7 → 0.6434 => Cut the dendrogram at a height between levels 5 and 6

Choice of Partition: I.C. Single Linkage Results

Choice of Partition: I.C. Hands on, Average Linkage

Choice of Partition: I.C. Average Linkage Results Level 1 → 0 Level 2 → 0 Level 3 → 0.7071 Level 4 → 0 Level 5 → 0.7071 Level 6 → 1.0819 Level 7 → 0.9467 => Cut the dendrogram at a height between levels 5 and 6

Choice of Partition: I.C. Hands on, Complete Linkage

Choice of Partition: I.C. Complete Linkage Results Level 1 → 0 Level 2 → 0 Level 3 → 0.7071 Level 4 → 0 Level 5 → 0.7071 Level 6 → 1.0340 Level 7 → 1.0116 => Cut the dendrogram at a height between levels 5 and 6

Conclusion Our aim is to evaluate clustering techniques on DMOZ data Analysis of partitioning & hierarchical clustering algorithms If the experiments are successful, we will run the same experiments on a larger portion of the DMOZ data after downloading it Otherwise, we will try other methodologies to improve our experimental results

References www.dmoz.org [1] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988. [2] T. Korenius, J. Laurikkala, M. Juhola, K. Järvelin. Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments. Information Retrieval, 9(1). Kluwer Academic Publishers, 2006.