A Clustering Framework for Unbalanced Partitioning and Outlier Filtering on High Dimensional Datasets. Turgay Tugay Bilgin and A. Yilmaz Camurcu.

A Clustering Framework for Unbalanced Partitioning and Outlier Filtering on High Dimensional Datasets
Turgay Tugay Bilgin (1) and A. Yilmaz Camurcu (2)
(1) Department of Computer Engineering, Maltepe University
(2) Department of Computer and Control Education, Marmara University

Outline
Introduction; Relationship based clustering approach / framework; Visualization using CLUSION (CLUSter visualizatION); Problems of the Framework; Graclus partitioning system; Our Proposed Framework; Using Graclus: to create Micro-partition Space; Outlier filtering on micro-partition space; Using Graclus: to cluster ΔP Space; Visualization of the results using CLUSION graphs; Experiments; Results.

Introduction
Mining high dimensional datasets is an important problem for the data mining community. A well-known difficulty is the curse of dimensionality. Graph-based methods such as METIS and CHACO perform best in high dimensional spaces. However, these methods have two major problems: they cannot perform outlier filtering, and they force clusters to be balanced.

Relationship based Clustering Approach
Strehl and Ghosh proposed a better approach for mining high dimensional datasets [1]. They focus on the similarity space rather than the feature space. The graph partitioning tool METIS is used to perform balanced clustering (OPOSSUM). They also provide a customized matrix visualization tool called CLUSION, which is fast and simple and can operate on very high dimensional datasets.

Relationship based Clustering Framework
Data Sources → (feature extraction) → Feature Space → (similarity computation) → Similarity Space → (OPOSSUM: Optimal partitioning of Similarity space using Metis) → Cluster Labels
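The step from feature space to similarity space can be sketched as follows. This is a minimal illustration that assumes cosine similarity as the similarity measure; the slides do not say which measure was used, and the function name `cosine_similarity_matrix` is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Return the n x n cosine similarity matrix for the rows of X.

    Assumption: cosine similarity stands in for the unspecified
    similarity measure of the framework.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows stay zero instead of dividing by 0
    unit = X / norms         # each row scaled to unit length
    return unit @ unit.T     # pairwise dot products of unit vectors

# Three samples in a 2-dimensional feature space
S = cosine_similarity_matrix([[1, 0], [2, 0], [0, 3]])
```

The result is always n×n regardless of the number of features, which is why the later steps of the framework work on S alone.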

Visualization using CLUSION
CLUSION takes the similarity matrix S and the λ index as input. S is permuted with an n×n permutation matrix P to produce the cluster visualization. Clusters appear as symmetrical dark squares along the main diagonal.
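The permutation step can be sketched in a few lines. This is an illustrative reconstruction, not CLUSION itself: it assumes the λ index is a vector of integer cluster labels, and `permute_by_labels` is a hypothetical helper name.

```python
import numpy as np

def permute_by_labels(S, labels):
    """Reorder S so that samples with the same cluster label are contiguous.

    Builds the n x n permutation matrix P explicitly and returns
    (P @ S @ P.T, order), where order[i] is the original index of row i.
    """
    S = np.asarray(S, dtype=float)
    order = np.argsort(labels, kind="stable")  # stable: keeps order inside a cluster
    P = np.eye(len(order))[order]              # row i of P selects sample order[i]
    return P @ S @ P.T, order

S = np.array([[1.0, 0.0, 0.9],
              [0.0, 1.0, 0.0],
              [0.9, 0.0, 1.0]])
S_perm, order = permute_by_labels(S, [0, 1, 0])
```

After the permutation, the two similar samples (original indices 0 and 2) sit next to each other, so their high mutual similarity forms a block on the main diagonal. For large n, indexing with `S[np.ix_(order, order)]` gives the same result without building P.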

Problems of the Framework
Produces balanced clusters only: it forces clusters to be of equal size. In some datasets this can be important because it avoids trivial clusterings, but in most cases it can cause undesired results.
No outlier filtering: outliers can reduce the quality and validity of the clusters, depending on the resolution and distribution of the dataset.

Graclus partitioning system
Graclus [*] is a fast kernel-based multilevel algorithm with coarsening, initial partitioning and refinement phases. Unlike METIS, it does not force clusters to be of nearly equal size. It uses a weighted form of the kernel-based k-means approach; kernel k-means is extremely fast and gives high-quality partitions [*].
[*] Dhillon, I., Guan, Y., Kulis, B.: A Fast Kernel-based Multilevel Algorithm for Graph Clustering. Proceedings of the 11th ACM SIGKDD, Chicago, IL, August (2005).
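To make the weighted kernel k-means idea concrete, here is a minimal batch sketch. It is not Graclus and omits the multilevel coarsening and refinement phases entirely; it only shows the assignment rule that weighted kernel k-means iterates. The function name, the batch update scheme and the stopping rule are assumptions made for illustration.

```python
import numpy as np

def weighted_kernel_kmeans(K, weights, labels, n_clusters, max_iter=50):
    """Batch weighted kernel k-means on a kernel (similarity) matrix K.

    Squared distance from point i to the weighted mean of cluster c:
      K_ii - 2 * sum_j(w_j K_ij) / s_c + sum_jl(w_j w_l K_jl) / s_c**2
    with j, l ranging over cluster c and s_c the total weight of c.
    """
    K = np.asarray(K, dtype=float)
    w = np.asarray(weights, dtype=float)
    labels = np.asarray(labels).copy()
    n = K.shape[0]
    for _ in range(max_iter):
        dist = np.full((n, n_clusters), np.inf)  # empty clusters stay at inf
        for c in range(n_clusters):
            mask = labels == c
            s = w[mask].sum()
            if s == 0:
                continue
            cross = K[:, mask] @ w[mask] / s
            within = w[mask] @ K[np.ix_(mask, mask)] @ w[mask] / s**2
            dist[:, c] = np.diag(K) - 2.0 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels

# Linear kernel on two well-separated groups, deliberately poor initial labels
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
final = weighted_kernel_kmeans(X @ X.T, np.ones(6), [0, 1, 0, 1, 0, 1], 2)
```

Nothing in the update rule constrains cluster sizes, which is the property the slide contrasts with METIS.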

Our Proposed Framework
Three major improvements:
1. An intermediate space (P): we call it the micro-partition space. Graclus is used to create unbalanced micro-partitions.
2. Outlier filtering on the P space (resulting in ΔP): Graclus creates micro-partitions of different sizes. Singletons in the P space are points that do not have enough neighbors; they can be filtered out or marked as outliers.
3. Using Graclus to cluster the ΔP space: Graclus has two important roles in our framework. The first is creating the micro-partition space. The second is the unbalanced clustering of the filtered space ΔP, which is denoted by Φ.

Our Proposed Framework (diagram)
Creating micro-partitions (using Graclus) → micro-partition space (P), which contains unbalanced tiny partitions → outlier filtering and (re)clustering (using Graclus) → ΔP space

Using Graclus: to create Micro-partition Space
Graclus is used in the similarity space to create tiny partitions (micro-partitions). Notation: n = number of samples, k = number of micro-partitions in the P space. The relation between k and p should be: [1]. Micro-partitions can contain up to 4 objects, therefore: [2].

Outlier filtering on micro-partition space (illustration)

Outlier filtering on micro-partition space
The outliers in the P space (P_o) are: [equation], where T_o is the outlier threshold value. Then the ΔP space is: [equation]
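The filtering step might look like the sketch below. The formulas on this slide did not survive in the transcript, so the rule used here is an assumption: a micro-partition counts as an outlier partition when it holds at most `threshold` points, which with `threshold=1` matches the earlier remark that singletons are filtered. The function name and return convention are hypothetical.

```python
import numpy as np

def filter_outliers(micro_labels, threshold=1):
    """Split sample indices into kept points and outliers.

    Assumed rule (not taken from the slides): a sample is an outlier when
    its micro-partition contains `threshold` or fewer samples.
    Returns (kept_indices, outlier_indices).
    """
    micro_labels = np.asarray(micro_labels)
    _, inverse, counts = np.unique(
        micro_labels, return_inverse=True, return_counts=True
    )
    partition_size = counts[inverse]  # size of each sample's micro-partition
    keep = partition_size > threshold
    return np.flatnonzero(keep), np.flatnonzero(~keep)

# Micro-partitions 1 and 3 are singletons, so samples 2 and 6 are dropped
kept, outliers = filter_outliers([0, 0, 1, 2, 2, 2, 3], threshold=1)
```

The filtered similarity matrix for the next clustering pass could then be taken as `S[np.ix_(kept, kept)]`.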

Using Graclus: to cluster ΔP Space
Graclus needs the number of partitions k. In formula [1], k refers to the number of micro-partitions; here, k refers to the number of clusters we desire. We denote the former by k_1 and the latter by k_2. Graclus performs clustering on the ΔP space and produces the λ index, which is defined as: [equation]

Visualization of the results using CLUSION graphs
CLUSION looks at λ, reorders the ΔP space so that points with the same cluster label are contiguous, and then visualizes the resulting permuted ΔP. Two λ indices are produced during the clustering process: λ_1 is created while forming micro-partitions, and λ_2 is created while clustering the ΔP space. We use λ_2 for CLUSION; λ_1 is only used for forming micro-partitions.

Experiments: Datasets
We evaluated our proposed framework on two different real-world datasets:
1. Terms from 2225 complete news articles from the BBC News web site (2225-dimensional dataset, 5 natural clusters).
2. A collection of news articles from the Turkish newspaper Milliyet, containing 6223 terms in Turkish from 1455 news articles (1455-dimensional dataset, 3 natural clusters).

Experiments: Evaluated Frameworks
OPOSSUM: Strehl & Ghosh's original METIS-based framework.
S&G(Graclus): Strehl & Ghosh's framework with METIS replaced by Graclus, used to test the quality of the clusters produced by the Graclus algorithm.
P space+Graclus: our proposed framework.

Experiments: Comparison Criteria
Purity; Entropy; Mutual Information; CLUSION graphics (visual identification, visual data mining).
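For readers unfamiliar with the three numeric criteria, the sketch below computes them from a cluster-by-class contingency table. These are the common textbook definitions, using base-2 logarithms and no normalization. The slides do not state which variants or normalizations the authors used, so treat the exact formulas and all function names as assumptions.

```python
import numpy as np

def contingency(clusters, classes):
    """Count matrix: rows are clusters, columns are true classes."""
    _, ci = np.unique(clusters, return_inverse=True)
    _, ki = np.unique(classes, return_inverse=True)
    table = np.zeros((ci.max() + 1, ki.max() + 1))
    np.add.at(table, (ci, ki), 1)
    return table

def purity(clusters, classes):
    """Fraction of samples belonging to the majority class of their cluster."""
    t = contingency(clusters, classes)
    return float(t.max(axis=1).sum() / t.sum())

def entropy(clusters, classes):
    """Size-weighted average of the class entropy inside each cluster (bits)."""
    t = contingency(clusters, classes)
    sizes = t.sum(axis=1, keepdims=True)
    p = t / sizes
    logp = np.log2(p, where=p > 0, out=np.zeros_like(p))  # treat 0*log(0) as 0
    per_cluster = -(p * logp).sum(axis=1)
    return float((sizes.ravel() / t.sum()) @ per_cluster)

def mutual_information(clusters, classes):
    """Mutual information between cluster labels and class labels (bits)."""
    t = contingency(clusters, classes)
    pxy = t / t.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
scores = (purity(clusters, classes),
          entropy(clusters, classes),
          mutual_information(clusters, classes))
```

Higher purity and mutual information indicate better agreement with the natural clusters, while lower entropy is better.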

Results: BBC Dataset

Results: BBC Dataset, OPOSSUM

Results: BBC Dataset, S&G(Graclus)

Results: BBC Dataset, P space+Graclus

Results: Milliyet Dataset

Results: Milliyet Dataset, OPOSSUM

Results: Milliyet Dataset, S&G(Graclus)

Results: Milliyet Dataset, P space+Graclus

Thank You! Presenter: T. Tugay Bilgin