A Clustering Framework for Unbalanced Partitioning and Outlier Filtering on High Dimensional Datasets
Turgay Tugay Bilgin (1) and A. Yilmaz Camurcu (2)
(1) Department of Computer Engineering, Maltepe University
(2) Department of Computer and Control Education, Marmara University
Outline
Introduction
Relationship based clustering approach / framework
Visualization using CLUSION (CLUSter visualizatION)
Problems of the Framework
Graclus partitioning system
Our Proposed Framework
    Using Graclus: to create Micro-partition Space
    Outlier filtering on micro-partition space
    Using Graclus: to cluster ΔP Space
    Visualization of the results using CLUSION graphs
Experiments
Results
Introduction
Mining high dimensional datasets is an important problem for the data mining community.
Well-known problem: the curse of dimensionality.
Graph based methods such as METIS and CHACO perform best on high dimensional space.
However, these methods have 2 major problems:
    They cannot perform outlier filtering.
    They force clusters to be balanced.
Relationship based Clustering Approach
Strehl A. and Ghosh J. proposed a better approach for mining high dimensional datasets [1].
They focus on similarity space rather than feature space.
The graph partitioning tool METIS is used to perform balanced clustering (OPOSSUM).
They also provide a customized matrix visualization tool called CLUSION.
CLUSION is fast and simple, and it can operate on very high dimensional datasets.
Relationship based Clustering Framework
Data Sources → (Feature Extraction) → Feature Space → (Similarity computation) → Similarity Space → (OPOSSUM) → Cluster Labels
OPOSSUM: Optimal partitioning of Similarity space using Metis
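The similarity computation step of this pipeline can be sketched as follows. This is a minimal illustration, not the framework's actual code: the slides do not say which similarity measure is used, so cosine similarity is assumed here, and the function name `cosine_similarity_matrix` and the toy data are hypothetical.

```python
import math

def cosine_similarity_matrix(X):
    """Map n feature vectors to an n x n similarity matrix S.

    Assumption: cosine similarity; the slides do not name the measure.
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in X]
    n = len(X)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            if norms[i] == 0.0 or norms[j] == 0.0:
                s = 0.0  # convention chosen here for all-zero vectors
            else:
                dot = sum(a * b for a, b in zip(X[i], X[j]))
                s = dot / (norms[i] * norms[j])
            S[i][j] = S[j][i] = s  # S is symmetric
    return S

# Toy feature space: 3 samples, 3 features (made-up values).
X = [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 0.0]]
S = cosine_similarity_matrix(X)
```

Whatever the number of features, S is always n x n, which is why the later stages only depend on the number of samples.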
Visualization using CLUSION
Similarity Matrix S + λ index → CLUSION → Cluster Visualization
S is permuted with an n x n permutation matrix P.
Clusters appear as symmetrical dark squares along the main diagonal.
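The reordering idea can be sketched in a few lines. This is an illustration under assumptions, not CLUSION itself: the helper names `clusion_order` and `permute_similarity` and the toy matrix are hypothetical. Sorting indices by cluster label and applying that order to both rows and columns has the same effect as multiplying S by a permutation matrix on both sides.

```python
def clusion_order(labels):
    """Indices sorted so points with the same label are contiguous (stable sort)."""
    return sorted(range(len(labels)), key=lambda i: labels[i])

def permute_similarity(S, order):
    """Apply the same reordering to the rows and the columns of S."""
    return [[S[i][j] for j in order] for i in order]

# Toy similarity matrix: points 0 and 2 are similar, points 1 and 3 are similar.
S = [[1.0, 0.1, 0.9, 0.2],
     [0.1, 1.0, 0.2, 0.8],
     [0.9, 0.2, 1.0, 0.1],
     [0.2, 0.8, 0.1, 1.0]]
labels = [0, 1, 0, 1]  # the λ index: one cluster label per point

order = clusion_order(labels)
S_perm = permute_similarity(S, order)
```

In `S_perm` the high similarities sit in two 2 x 2 blocks on the main diagonal, which is what would show up as dark squares when the matrix is drawn as an image.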
Problems of the Framework
Produces balanced clusters only: it forces clusters to be of equal size. In some datasets this can be important, because it avoids trivial clusterings. In most cases, however, it can cause undesired results.
No outlier filtering: outliers can reduce the quality and the validity of the clusters, depending on the resolution and distribution of the dataset.
Graclus* partitioning system
Graclus* is a fast kernel based multilevel algorithm which involves coarsening, initial partitioning and refinement phases.
Unlike METIS, it does not force clusters to be of nearly equal size.
It uses a weighted form of the kernel based k-means approach.
The kernel k-means approach is extremely fast and gives high-quality partitions (*).
* Dhillon, I., Guan, Y., Kulis, B.: A Fast Kernel-based Multilevel Algorithm for Graph Clustering, Proceedings of The 11th ACM SIGKDD, Chicago, IL, August 2005.
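To make the weighted kernel k-means idea concrete, here is a minimal sketch of the assignment loop only. It is not Graclus: the multilevel coarsening and initial partitioning phases are omitted, and the function name, the linear kernel and the toy data are assumptions made for illustration. Each point i is assigned to the cluster c that minimizes K[i][i] - 2 * sum_j(w_j K[i][j]) / s_c + sum_jl(w_j w_l K[j][l]) / s_c^2, where j and l range over the members of c and s_c is their total weight.

```python
def weighted_kernel_kmeans(K, weights, labels, k, max_iter=100):
    """Batch weighted kernel k-means on a kernel matrix K.

    Assumes strictly positive weights. Labels are integers in range(k).
    """
    n = len(K)
    labels = list(labels)
    for _ in range(max_iter):
        members = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        totals = [sum(weights[j] for j in m) for m in members]
        # Squared norm of each weighted cluster centroid in kernel space.
        within = [
            (sum(weights[j] * weights[l] * K[j][l] for j in m for l in m)
             / totals[c] ** 2) if m else 0.0
            for c, m in enumerate(members)
        ]
        new_labels = []
        for i in range(n):
            best_c, best_d = labels[i], float("inf")
            for c in range(k):
                if not members[c]:
                    continue  # empty clusters cannot attract points
                cross = sum(weights[j] * K[i][j] for j in members[c]) / totals[c]
                d = K[i][i] - 2.0 * cross + within[c]
                if d < best_d:
                    best_c, best_d = c, d
            new_labels.append(best_c)
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
    return labels

# Toy 1-D data with a linear kernel, unit weights and a poor initial labelling.
x = [0.0, 0.1, 5.0, 5.2, 0.2]
K = [[a * b for b in x] for a in x]
weights = [1.0] * len(x)
result = weighted_kernel_kmeans(K, weights, [0, 1, 0, 1, 0], k=2)
```

The result groups three points in one cluster and two in the other; nothing in the objective pushes the clusters toward equal size.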
Our Proposed Framework
Three major improvements:
An intermediate space (P): we call it the micro-partition space. Graclus is used for creating unbalanced micro-partitions.
Outlier filtering on the P space (yields ΔP): Graclus creates micro-partitions of different sizes. Singletons in the P space are points that do not have enough neighbors; they can be filtered or marked as outliers.
Using Graclus for clustering the ΔP space: Graclus has two important roles in our framework. The first role is creating the micro-partition space. The second role is unbalanced clustering of the filtered space ΔP, which is denoted by Φ.
Our Proposed Framework
Creating micro-partitions (using Graclus) → Micro-partition space (P): contains unbalanced tiny partitions
Outlier filtering and (re)clustering (using Graclus) → ΔP space → results
Using Graclus: to create Micro-partition Space
Use Graclus in similarity space to create tiny partitions (micro-partitions).
Notation: n = number of samples, k = number of micro-partitions in the P space.
The relation between k and p should be: [1]
Micro-partitions can contain up to 4 objects, therefore: [2]
Outlier filtering on micro-partition space illustration
Outlier filtering on micro-partition space
The outliers in the P space (P_o) are:
where T_o is the outlier threshold value.
Then, the ΔP space is:
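A minimal sketch of the filtering step follows. The exact definitions of P_o and ΔP are not reproduced in the text above, so this code assumes that a micro-partition is treated as an outlier when it holds at most T_o points (T_o = 1 filters singletons), and that ΔP is whatever remains. The function name `filter_outliers` and the example labels are hypothetical.

```python
from collections import Counter

def filter_outliers(micro_labels, t_o=1):
    """Split point indices into kept points and outliers.

    Assumption: a micro-partition with size <= t_o counts as an outlier
    partition; this inequality is an illustrative guess.
    """
    sizes = Counter(micro_labels)  # micro-partition label -> number of points
    kept = [i for i, lab in enumerate(micro_labels) if sizes[lab] > t_o]
    outliers = [i for i, lab in enumerate(micro_labels) if sizes[lab] <= t_o]
    return kept, outliers

# Micro-partition label of each point: partitions 1 and 3 are singletons.
lambda_1 = [0, 0, 1, 2, 2, 2, 3, 0]
kept, outliers = filter_outliers(lambda_1, t_o=1)
```

The `kept` indices would then select the rows and columns of the similarity matrix that are passed on to the second clustering stage.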
Using Graclus: to cluster ΔP Space
Graclus needs the number of partitions k.
In formula [1], k refers to the number of micro-partitions. Here k refers to the number of clusters we desire.
We denote the former by k_1 and the latter by k_2.
Graclus performs clustering on the ΔP space and produces the λ index, which is defined as:
Visualization of the results using CLUSION graphs
CLUSION looks at λ, reorders the ΔP space so that points with the same cluster label are contiguous, and then visualizes the resulting permuted ΔP.
There are two λ indices produced during the clustering process:
    λ_1 is created while forming micro-partitions.
    λ_2 is created while clustering the ΔP space.
We use λ_2 for CLUSION; λ_1 is only used for forming micro-partitions.
Experiments: Datasets
We evaluated our proposed framework on two different real-world datasets:
1. Terms from 2225 complete news articles from the BBC News web site. (2225 dimensional dataset, 5 natural clusters)
2. A collection of news articles from the Turkish newspaper Milliyet. Contains 6223 terms in Turkish from 1455 news articles. (1455 dimensional dataset, 3 natural clusters)
Experiments: Evaluated Frameworks
OPOSSUM: Strehl & Ghosh's original METIS based framework.
S&G(Graclus): we replaced METIS with Graclus in Strehl & Ghosh's framework to test the quality of the clusters produced by the Graclus algorithm.
P space+Graclus: our proposed framework.
Experiments: Comparison Criteria
Purity
Entropy
Mutual Information
CLUSION graphics (visual identification, visual data mining)
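The three numeric criteria can be computed from cluster labels and known class labels as sketched below. The slides do not give the formulas, so the common unnormalized textbook definitions are assumed here (entropy and mutual information in bits); the normalization used in the experiments may differ. All function names and example labels are hypothetical.

```python
import math
from collections import Counter

def purity(clusters, classes):
    """Fraction of points that belong to the majority class of their cluster."""
    best = {}
    for (c, _h), cnt in Counter(zip(clusters, classes)).items():
        best[c] = max(best.get(c, 0), cnt)
    return sum(best.values()) / len(clusters)

def entropy(clusters, classes):
    """Size-weighted average of the class entropy inside each cluster (bits)."""
    n = len(clusters)
    sizes = Counter(clusters)
    total = 0.0
    for (c, _h), cnt in Counter(zip(clusters, classes)).items():
        total -= (cnt / n) * math.log2(cnt / sizes[c])
    return total

def mutual_information(clusters, classes):
    """Mutual information between cluster labels and class labels (bits)."""
    n = len(clusters)
    csz, hsz = Counter(clusters), Counter(classes)
    return sum((cnt / n) * math.log2(cnt * n / (csz[c] * hsz[h]))
               for (c, h), cnt in Counter(zip(clusters, classes)).items())

# A perfect clustering and an imperfect one (made-up labels).
perfect = ([0, 0, 1, 1], ["x", "x", "y", "y"])
mixed = ([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"])

p_perfect = purity(*perfect)
e_perfect = entropy(*perfect)
mi_perfect = mutual_information(*perfect)
p_mixed = purity(*mixed)
```

Higher purity and mutual information indicate better agreement with the natural clusters, while lower entropy is better.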
Results: BBC Dataset
Results: BBC Dataset OPOSSUM
Results: BBC Dataset S&G(Graclus)
Results: BBC Dataset P space+Graclus
Results: Milliyet Dataset
Results: Milliyet Dataset OPOSSUM
Results: Milliyet Dataset S&G(Graclus)
Results: Milliyet Dataset P space+Graclus
Thank You!
Presenter: T. Tugay Bilgin