Big Data Analysis and Mining

Presentation transcript:

Big Data Analysis and Mining: Cluster Validity. Qinpei Zhao (赵钦佩), qinpeizhao@tongji.edu.cn. Fall 2015. 2018/9/16

Background & Status What do we have?

Data Sets: s1 s2 s3 s4

Background & Status What do we have? What have we done?

Clustering Results s1 s2 s3 s4

Background & Status Data Sets Clustering Algorithms What are we still struggling with? -- How many clusters? -- How good is the clustering?

Problems How many clusters? How good is the clustering?

“Clusters are in the eye of the beholder”!
Figure 1. (a) A data set consisting of 3 clusters. (b) The result of k-means when asked for 4 clusters.
Figure 2. Different partitions produced by DBSCAN with different input parameter values.

Why?
- To evaluate clustering results, especially in high-dimensional data spaces
- To compare clustering algorithms
- To compare two sets of clusters
- To compare two clusters

Different Aspects
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (using only the data).
4. Comparing the results of two different cluster analyses to determine which is better.
5. Determining the ‘correct’ number of clusters.

Measures of Cluster Validity
A typical view of cluster validation measures:
- External measures: match a cluster structure to prior information, e.g., class labels. E.g., Rand index, Γ statistics, F-measure, Mutual Information.
- Internal measures: assess the fit between the structure and the data themselves only. E.g., Silhouette index, CPCC, Γ statistics.
- Relative measures: decide which of two structures is better; often used for selecting the right clustering parameters, e.g., the number of clusters. E.g., Dunn’s indices, Davies-Bouldin index, partition coefficient, Xie-Beni index.
Other views:
- Partitional indices vs. hierarchical indices
- Fuzzy indices vs. non-fuzzy indices
- Statistics-based indices vs. information-based indices

Survey Status
Existing comparisons:
- Comparison of 30 indices (hierarchical clustering algorithms) by Milligan and Cooper, 1985
- Comparison of 15 indices (binary data sets) by Dimitriadou et al., 2002
- Comparison of internal indices by Q. Zhao, 2014
Existing techniques:
- Davies-Bouldin index
- Dunn’s index
- Calinski-Harabasz index
- Bayesian Information Criterion (BIC)
- Rand index
- Jaccard ……

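Sum-of-square based index
SSW (sum of squares within clusters), in its standard form: $SSW = \sum_{i=1}^{n} \lVert x_i - c_{p_i} \rVert^2$, where $c_{p_i}$ is the centroid of the cluster containing $x_i$
SSB (sum of squares between clusters), in its standard form: $SSB = \sum_{j=1}^{m} n_j \lVert c_j - \bar{x} \rVert^2$, where $n_j$ is the size of cluster $j$ and $\bar{x}$ is the mean of all data
Define SSW/SSB as the WB-ratio
Proposed index: wb-index = m · SSW/SSB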
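
The wb-index on this slide can be sketched in a few lines of pure Python. The sketch assumes the standard sum-of-squares definitions (SSW as the squared distance of each point to its own cluster centroid, SSB as the size-weighted squared distance of each centroid to the overall mean) and at least two non-empty clusters; the function names `ssw_ssb` and `wb_index` and the toy data are illustrative, not taken from the slides.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def ssw_ssb(data, labels):
    """Return (SSW, SSB, m) for a hard partition given as one label per point."""
    clusters = {}
    for point, label in zip(data, labels):
        clusters.setdefault(label, []).append(point)
    grand_mean = centroid(data)
    ssw = 0.0
    ssb = 0.0
    for members in clusters.values():
        c = centroid(members)
        ssw += sum(sq_dist(p, c) for p in members)      # within-cluster part
        ssb += len(members) * sq_dist(c, grand_mean)    # between-cluster part
    return ssw, ssb, len(clusters)


def wb_index(data, labels):
    """wb-index = m * SSW / SSB (assumes m >= 2, so SSB > 0)."""
    ssw, ssb, m = ssw_ssb(data, labels)
    return m * ssw / ssb


# Toy data (hypothetical): two tight pairs of points, far apart.
data = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = [0, 0, 1, 1]   # groups the nearby points together
bad = [0, 1, 0, 1]    # groups distant points together
print(wb_index(data, good), wb_index(data, bad))
```

Under this definition a compact, well-separated partition gives a small value, so the `good` labelling scores far lower than the `bad` one.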

Sum-of-square based index
WB-type index: an index that makes use of the within-cluster and between-cluster sums of squares (SSW and SSB).
History of WB-type indices:
- Ball and Hall (1965): SSW / m
- Marriott (1971): $m^2 |W|$
- Calinski & Harabasz (1974): $\frac{SSB/(m-1)}{SSW/(n-m)}$
- Hartigan (1975): log(SSB/SSW)
- Xu (1997)
(d is the dimension of the data; n is the size of the data; m is the number of clusters)
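
As a rough illustration, three of these indices can be computed directly from SSW and SSB. The Ball and Hall and Hartigan formulas are the ones given on the slide; the Calinski-Harabasz formula used here is its standard definition, (SSB/(m-1)) / (SSW/(n-m)), which is an assumption about what the slide showed. The function names and input numbers are hypothetical.

```python
import math


def ball_hall(ssw, m):
    """Ball and Hall (1965): SSW / m."""
    return ssw / m


def calinski_harabasz(ssw, ssb, n, m):
    """Calinski & Harabasz (1974), standard form; needs 1 < m < n."""
    return (ssb / (m - 1)) / (ssw / (n - m))


def hartigan(ssw, ssb):
    """Hartigan (1975): log(SSB / SSW), natural logarithm assumed."""
    return math.log(ssb / ssw)


# Hypothetical values: n = 4 points, m = 2 clusters, SSW = 4, SSB = 100.
ssw, ssb, n, m = 4.0, 100.0, 4, 2
print(ball_hall(ssw, m))                  # 2.0
print(calinski_harabasz(ssw, ssb, n, m))  # 50.0
print(hartigan(ssw, ssb))                 # log(25)
```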

Internal Index

External Validity
- External indices (hard, soft)
- Resampling method
- Determining the number of clusters efficiently
[Figure: partitions P1 and P2; panels "RLS vs. KM (S3)" and "RLS vs. Genetic (S3)"]

External Index (1)
C = {C1,…,Ck’} (clustering structure) and P = {P1,…,Pk} (known partition). Each pair of points (Xu, Xv) is counted in one of four cells:

No. of pairs            Same cluster in C    Different clusters in C
Same class in P         SS                   SD
Different class in P    DS                   DD

Rand statistic: R = (SS+DD)/(SS+SD+DS+DD)
Jaccard coefficient: J = SS/(SS+SD+DS)
Fowlkes and Mallows index: $FM = \sqrt{\frac{SS}{SS+SD} \cdot \frac{SS}{SS+DS}}$
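
The pair-counting indices on this slide can be sketched as follows. The sketch represents C and P as one label per point, counts SS, SD, DS and DD over all pairs, and uses the standard Fowlkes-Mallows formula sqrt(SS/(SS+SD) · SS/(SS+DS)). The function names and the six-point example are illustrative assumptions, and the code assumes SS+SD and SS+DS are non-zero.

```python
import math
from itertools import combinations


def pair_counts(clusters, classes):
    """Count point pairs by agreement between clustering C and partition P.

    clusters[i] is the cluster label of point i in C,
    classes[i] is its class label in P.
    Returns (SS, SD, DS, DD).
    """
    ss = sd = ds = dd = 0
    for u, v in combinations(range(len(clusters)), 2):
        same_cluster = clusters[u] == clusters[v]
        same_class = classes[u] == classes[v]
        if same_class and same_cluster:
            ss += 1
        elif same_class:       # same class in P, different clusters in C
            sd += 1
        elif same_cluster:     # different class in P, same cluster in C
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd


def rand_statistic(ss, sd, ds, dd):
    return (ss + dd) / (ss + sd + ds + dd)


def jaccard_coefficient(ss, sd, ds, dd):
    return ss / (ss + sd + ds)


def fowlkes_mallows(ss, sd, ds, dd):
    return math.sqrt((ss / (ss + sd)) * (ss / (ss + ds)))


# Hypothetical example: 6 points, 2 true classes, 3 found clusters.
classes = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 2, 2]
counts = pair_counts(clusters, classes)
print(counts)  # (2, 4, 1, 8)
print(rand_statistic(*counts), jaccard_coefficient(*counts), fowlkes_mallows(*counts))
```

The double loop over pairs is quadratic in the number of points; for large data sets the same counts are usually derived from the contingency matrix instead.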

External Index (2) Contingency Matrix Confusion matrix

A test Data: 50 documents from 5 classes. The class sizes are 30, 2, 6, 10, and 2, respectively. i.e. |C|= {30, 2, 6, 10, 2} Two clustering results are as follows. Which one is better?

Determining the K
Typical procedure:
1. Input a dataset X.
2. Define the range of the number of clusters, K = [Kmin, Kmax].
3. For each K: run the clustering algorithm and calculate the value of a chosen validity index on the clustering result.
4. Plot "number of clusters vs. index value" and use features of the plot to determine the optimal K*.
[Figure: scheme diagram of the cluster validity process. INPUT: DataSet(X) and parameter K go to the clustering algorithm, which outputs partitions P and codebook C; the validity index then yields K*.]
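
The procedure above can be sketched end to end in pure Python. Several choices here are assumptions made for illustration, not taken from the slides: the clustering algorithm is a basic k-means with a deterministic farthest-first initialisation, the validity index is wb-index = m · SSW/SSB, and K* is simply the K with the smallest index value instead of being read off a plot. All function names and the toy data are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def farthest_first(data, k):
    """Deterministic init: start at data[0], then repeatedly add the point
    farthest from all centers chosen so far."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans(data, k, max_iter=100):
    """Basic Lloyd iterations; returns one cluster label per point."""
    centers = farthest_first(data, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in data]
        if new_labels == labels:   # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                centers[j] = mean(members)
    return labels


def wb_index(data, labels):
    """wb-index = m * SSW / SSB over the non-empty clusters (assumes m >= 2)."""
    clusters = {}
    for p, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(p)
    grand = mean(data)
    ssw = sum(sq_dist(p, mean(ms)) for ms in clusters.values() for p in ms)
    ssb = sum(len(ms) * sq_dist(mean(ms), grand) for ms in clusters.values())
    return len(clusters) * ssw / ssb


def select_k(data, k_min, k_max):
    """Run the clustering for each K and return (K*, {K: index value})."""
    scores = {k: wb_index(data, kmeans(data, k)) for k in range(k_min, k_max + 1)}
    return min(scores, key=scores.get), scores


# Toy data (hypothetical): three well-separated groups of four points each.
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
data = [(x + ox, y + oy) for ox, oy in [(0, 0), (10, 0), (0, 10)] for x, y in square]
best_k, scores = select_k(data, 2, 5)
print(best_k, scores)
```

Taking the minimum is only one way to read the curve; in practice the plot is often inspected for a knee or a plateau, as the slide suggests.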

Cluster Validity in Image Segmentation 2011/12/12

Text categorization