数据挖掘 Introduction to Data Mining Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Spring 2018 S8700113C
Course schedule (日程安排) Lecture 1 Introduction: what is the knowledge discovery process? Lecture 2 Exploring the data. Lecture 3 Classification (part 1). Lecture 4 Classification (part 2). Lecture 5 Association analysis (part 1). Lecture 6 Association analysis (part 2). Lecture 7 Clustering. Lecture 8 Anomaly detection and advanced topics. Final exam (date to be announced).
Introduction Last time: Association analysis (part 2); solution to assignment #1. Important: Assignment #2. QQ group: 723166394. The PPTs are on the website.
Clustering (群集)
Introduction Clustering (群集): to automatically group similar objects/instances into clusters (groups). The clusters should capture the natural structure of the data.
Clustering Why do clustering? To summarize the data, to understand the data for decision-making, or as a first step before applying other data mining techniques. Clustering is a task that humans naturally do in everyday life. Many applications: grouping similar webpages, grouping customers with similar behavior or preferences, grouping similar movies or songs.
What are "good" clusters? In general, we may want to find clusters that minimize the similarity between points of different clusters and maximize the similarity between points within the same cluster.
To reduce the size of datasets Some data mining techniques may be slow if a database is large (some have a high computational complexity). A solution is to replace all points in each cluster by a single data point representing the cluster. This reduces the size of the database and allows data mining algorithms to run faster.
Classification (分类) Classification (分类): predicting the value of a target attribute for some new data. The possible values for the target attribute are called "classes" or categories.

NAME    AGE   INCOME   GENDER   EDUCATION ("target attribute")
John     99     1 元    Male     Ph.D.
Lucia    44    20 元    Female   Master
Paul     33    25 元
Daisy    20    50 元             High school
Jack     15    10 元
Macy     35                      ?????????

Classes are known in advance: Ph.D., Master, High school…
Classification (分类) Supervised classification (监督分类) requires training data that is already labelled, to train a classification model. In the table above, the rows whose EDUCATION is known form the training data (训练数据); the EDUCATION of Macy is the value to predict.
Clustering (群集) Automatically group instances into groups. No training data is required No labels or target attribute needs to be selected.
What is a good clustering? How many categories? Six? Four? Two?
Partitional Clustering (划分聚类) Each object must belong to exactly one cluster. (Figure: the original points and a partitional clustering of these points)
Hierarchical Clustering (层次聚类) Clusters are created as a hierarchy of clusters. (Figures: a traditional hierarchical clustering with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram)
An example of a dendrogram: http://www.instituteofcaninebiology.org/how-to-read-a-dendrogram.html
Many types of clustering Exclusive versus non-exclusive In a non-exclusive clustering, points may belong to multiple clusters, e.g., to represent multiple classes or 'border' points. (Figures: an exclusive clustering and a non-exclusive clustering)
Many types of clustering Fuzzy versus non-fuzzy In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights of a point must sum to 1. Probabilistic clustering has similar characteristics. (Figures: a non-fuzzy clustering and a fuzzy clustering)
Many types of clustering Partial versus complete In some cases, we only want to cluster some of the data, e.g., to eliminate the outliers. (Figures: a complete clustering and a partial clustering)
Many types of clustering Heterogeneous versus homogeneous Clusters of widely different sizes, shapes, and densities. (Figures: homogeneous (均匀的) clusters and heterogeneous (各种各样的) clusters, in terms of size)
Types of clusters: Well-Separated Clusters A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. 3 well-separated clusters
Types of clusters: Center-Based clusters A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of the cluster. 4 center-based clusters
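To make the difference concrete, here is a minimal Python/NumPy sketch (the function names centroid and medoid and the toy points are only illustrative, not from the lecture):

import numpy as np

def centroid(points):
    # Centroid: the average of all the points in the cluster
    return points.mean(axis=0)

def medoid(points):
    # Medoid: the actual point whose total distance to the other points is smallest
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[dist.sum(axis=1).argmin()]

# Example: a small cluster of 2D points
cluster = np.array([[1.0, 1.0], [2.0, 1.0], [9.0, 1.0]])
print(centroid(cluster))   # [4. 1.] -> pulled toward the outlying point
print(medoid(cluster))     # [2. 1.] -> an actual point of the cluster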
Types of Clusters: Contiguity-Based Contiguous Cluster (Nearest neighbor or Transitive) A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. 8 contiguous clusters
Types of Clusters: Density-Based A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density. Used when the clusters are irregular or intertwined, and when noise and outliers are present. 6 density-based clusters
The K-Means algorithm
Introduction A simple and popular approach Partitional clustering Each cluster is associated with a centroid (center point) Each point (object) is assigned to the cluster with the closest centroid Number of clusters, K, must be specified by the user.
K-Means Input: k, the number of clusters to be generated; P, a set of points to be clustered. Output: k clusters (some may be empty).
Example – iteration 1 Three points are randomly selected to be the centroids
Example – iteration 2 Centroids are recalculated as the average of each category, and each point is assigned to the category with the closest centroid.
Example – iteration 3 Centroids are recalculated as the average of each category, and each point is assigned to the category with the closest centroid.
Example – iteration 4 Centroids are recalculated as the average of each category, and each point is assigned to the category with the closest centroid.
Example – iteration 5 Centroids are recalculated as the average of each category, and each point is assigned to the category with the closest centroid.
Example – iteration 6 This is the last iteration because after that, the categories do not change.
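The iterations above can be summarized in a short Python/NumPy sketch of K-Means (a minimal illustration; the function name kmeans, the iteration limit and the random seed are my own choices, not part of the lecture):

import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly select k points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to the cluster with the closest centroid
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # 3. Recalculate each centroid as the average of the points of its cluster
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # 4. Stop when the centroids (and thus the clusters) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

Calling kmeans(X, k=3) on a matrix of points X reproduces the kind of behaviour shown in the example: the centroids move at each iteration until the clusters stop changing.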
More information about K-Means Initially, centroids are randomly selected. Thus, if we run K-Means several times, the result may be different. The similarity or distance between points may be calculated using different distance functions such as the Euclidean distance, correlation, etc. For such measures, K-Means will always converge to a solution (a set of clusters). Usually, the clusters change more during the first iterations. We can stop K-Means when the result does not change much between two iterations.
The choice of the initial centroids can have a huge influence on the final result. (Figures: the data, a clustering that is optimal, and a clustering that is quite bad)
In some cases, K-Means can find a good solution even when the initial choice of centroids does not appear to be very good.
How to evaluate a clustering Sum of squared errors (SSE): SSE = \sum_{i=1}^{k} \sum_{x \in C_i} dist(m_i, x)^2, where k is the number of clusters, x is an object from a cluster C_i, and m_i is the prototype (centroid) of C_i. The SSE allows us to choose the best clustering. Note: increasing k will decrease the SSE, but a good clustering will still have a small SSE even for a small k value.
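A small sketch of how the SSE could be computed in Python/NumPy, assuming the labels/centroids format of the K-Means sketch above (the name sse is illustrative):

import numpy as np

def sse(points, labels, centroids):
    # Sum over all clusters of the squared distances between each point
    # and the centroid of its cluster
    return sum(np.sum((points[labels == i] - centroids[i]) ** 2)
               for i in range(len(centroids)))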
Some problems with K-Means It may be difficult to find a perfect clustering if k is large, because it becomes unlikely that an initial centroid will be chosen in each natural cluster. K-Means can also create some empty clusters. There are many strategies to fix these problems, e.g., applying the algorithm several times…
Limitations of K-Means K-means does not work very well for clusters having different sizes, having different densities, or having a non-globular shape. K-means may also not work very well when the data contains outliers.
Limitations of K-means: different sizes Original Points K-Means (3 clusters)
And what if we increase k? (Figures: original points and K-Means with a larger number of clusters)
Limitations of K-Means: different densities (Figures: original points and K-Means with 3 clusters)
And what if we increase k ? Original points K-Means 9 clusters
Limitations of K-Means: non-globular shapes Original points K-Means (2 clusters)
And what if we increase k ? Not better…
Pre-processing and post-processing Pre-processing: normalize the data, remove outliers. Post-processing: remove small clusters that could be outliers; if a cluster has a high SSE, split it into two clusters; merge two clusters that are very similar to each other, if the SSE remains low. Some of these operations can be integrated into the K-Means algorithm.
Density-Based Clustering (基于密度的聚类) (DBSCAN)
What is density? Density can be defined as the number of points within a circular area defined by some radius (半径) Density is here defined with respect to a given point
DBSCAN (1996) Input: some data points (objects); eps, a distance (a positive number); MinPts, a minimum number of points. Output: clusters that are created based on the density of points; some points are considered as noise and are not included in any cluster.
Definitions Neighbors: the points at a distance not greater than eps from a given point. Core point: a point having at least MinPts neighbors. Border point: a point having fewer than MinPts neighbors, but having a neighbor that is a core point. Noise: all other points. Example: eps = 1, MinPts = 4.
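A small Python/NumPy sketch that applies these definitions to classify points as core, border or noise (the function name point_types is illustrative, and counting a point as its own neighbor is an assumed convention):

import numpy as np

def point_types(points, eps, min_pts):
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = dist <= eps                       # includes the point itself
    is_core = neighbors.sum(axis=1) >= min_pts
    # Border point: not core, but has at least one core point among its neighbors
    is_border = ~is_core & (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(is_border, "border", "noise"))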
How does DBSCAN work? current_label = 1. FOR EACH core point p: IF p has no label THEN: p.label = current_label; FOR EACH point y reachable from p through neighboring core points (transitively): IF y is a core point or a border point without a label THEN y.label = current_label; END FOR; current_label = current_label + 1.
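A runnable Python/NumPy sketch following this pseudocode (the brute-force distance matrix is only suitable for small datasets, and using the label 0 for noise/unlabelled points is my own convention):

import numpy as np

def dbscan(points, eps, min_pts):
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.zeros(n, dtype=int)          # 0 = unlabelled / noise
    current_label = 0
    for p in range(n):
        if not is_core[p] or labels[p] != 0:
            continue
        current_label += 1
        labels[p] = current_label
        # Expand the cluster: visit all points reachable from p through core points
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if labels[q] == 0:
                labels[q] = current_label    # core or border point of this cluster
                if is_core[q]:               # only core points propagate further
                    queue.extend(neighbors[q])
    return labels                            # points still labelled 0 are noise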
DBSCAN: Illustration (Figure: the original points and the types of points found: core points, border points and noise, with Eps = 10 and MinPts = 4)
Advantages of DBSCAN It is noise-tolerant and can discover clusters of various sizes and shapes. (Figure: original points and the clusters found)
Other examples
Limitations of DBSCAN It does not handle well clusters of various densities, or high-dimensional data. (Figures: original points; clusters found with MinPts=4, Eps=9.75; clusters found with MinPts=4, Eps=9.92)
Other examples
How to choose the eps and MinPts parameters? We can observe the distance from each point to its k-th closest neighbor: noise points are farther from their k-th nearest neighbor than points that are not noise. To choose eps, we pick a value of k (used as MinPts), sort the points by their distance to their k-th nearest neighbor, and look for a sharp change ("knee") in this curve; the corresponding distance is a reasonable value for eps.
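A sketch of this heuristic in Python/NumPy (the function name k_distance_curve is illustrative):

import numpy as np

def k_distance_curve(points, k):
    # Distance from each point to its k-th nearest neighbor, sorted in decreasing order
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    dist.sort(axis=1)          # column 0 is each point's distance to itself (0)
    return np.sort(dist[:, k])[::-1]

# Plotting this curve and looking for a sharp bend ("knee") gives a candidate
# value for eps, with MinPts set to k.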
Density-based clustering Advantages: it can find clusters of different sizes and shapes; there is no need to specify the number of clusters; it removes points that are noise; it can be quite fast if the software uses appropriate spatial data structures to search quickly for neighbors. Disadvantages: it can be difficult to find good parameter values, and results may vary greatly depending on how the parameters are set.
Density-peak clustering (Science, 2014) Clusters correspond to peaks in the density of points. It allows finding non-spherical clusters of different densities. The number of clusters is found automatically. It can also remove noise. It is simple.
The density of a point is here the number of points within a distance dc. Reference: http://conference.mipt.ru/img/conference/material-design-2014/talks/Laio-talk.pdf
Decision graph: for each point, plot its density against its minimum distance to any point of higher density. Cluster centers stand out as points with both a high density and a high minimum distance.
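A simplified Python/NumPy sketch of this idea (not the exact algorithm of the paper: the noise/halo step is omitted, and selecting the centers with the product density x distance, as well as the names density_peaks, dc and n_clusters, are illustrative choices):

import numpy as np

def density_peaks(points, dc, n_clusters):
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Density rho: number of other points within distance dc
    rho = (dist < dc).sum(axis=1) - 1
    # Delta: minimum distance to a point of higher density
    # (for the densest point, the maximum distance is used instead)
    delta = np.empty(len(points))
    for i in range(len(points)):
        higher = np.where(rho > rho[i])[0]
        delta[i] = dist[i].max() if len(higher) == 0 else dist[i, higher].min()
    # Centers: points with both a high density and a high delta
    centers = np.argsort(rho * delta)[-n_clusters:]
    labels = np.full(len(points), -1)
    labels[centers] = np.arange(n_clusters)
    # Assign the remaining points, in decreasing order of density, to the
    # cluster of their nearest neighbor of higher density
    for i in np.argsort(-rho):
        if labels[i] != -1:
            continue
        higher = np.where(rho > rho[i])[0]
        if len(higher) == 0:
            labels[i] = labels.max() + 1   # safeguard for ties in density
        else:
            labels[i] = labels[higher[np.argmin(dist[i, higher])]]
    return labels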
This algorithm solves some of the problems of DBSCAN.
Clustering evaluation
Clustering evaluation Evaluating a clustering found by an algorithm is very important: to avoid finding clusters in noise, to compare different clustering algorithms, to compare two clusterings generated by the same algorithm, or to compare two clusters.
Clusters found in random data (Figures: random points, and the clusters found on these points by DBSCAN, K-means, and hierarchical clustering)
Issues related to cluster evaluation Is there really some natural structure (categories) in the data, or is it just random data? Evaluating clusters using external data (e.g., some already known class labels). Evaluating clusters without using external data (e.g., using the SSE or other measures). Comparing two clusterings to choose one. Determining how many categories there are.
A method: using a similarity matrix Order the points by cluster label and calculate the similarity between all pairs of points. EXAMPLE 1: K-MEANS. If the categories are well separated, some squares should appear along the diagonal.
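A sketch of this method in Python/NumPy (the similarity 1/(1+distance) and the function name ordered_similarity_matrix are illustrative choices):

import numpy as np

def ordered_similarity_matrix(points, labels):
    # Sort the points by cluster label, then compute all pairwise similarities;
    # well-separated clusters appear as square blocks along the diagonal
    order = np.argsort(labels)
    p = points[order]
    dist = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=2)
    return 1.0 / (1.0 + dist)   # one simple way to turn a distance into a similarity

Displaying the returned matrix as an image (e.g., with matplotlib's imshow) makes the diagonal squares visible.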
EXAMPLE 2: DBSCAN on random data (the diagonal is less well defined). EXAMPLE 3: K-Means on random data.
EXAMPLE 4: DBSCAN
A method to choose the number of categories There are various methods. A simple method is to use the sum of squared errors (SSE). For example, plot the SSE with respect to the number of categories for K-Means; a "knee" in the curve suggests a good number of categories.
Another example: (figure showing the SSE as a function of the number of categories for another dataset)
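A sketch of how such a curve can be produced, assuming scikit-learn is available (its KMeans.inertia_ attribute is exactly the SSE; the function name sse_curve is illustrative):

from sklearn.cluster import KMeans

def sse_curve(points, k_values):
    # KMeans.inertia_ is the sum of squared distances of the points
    # to their closest centroid, i.e. the SSE
    return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_
            for k in k_values}

# Example: plotting sse_curve(X, range(1, 11)) against k and looking for the
# "knee" of the curve suggests a number of clusters.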
Hierarchical clustering (层次聚类)
Defining the proximity between two categories: MIN: the proximity between the two closest points of the two categories. MAX: the proximity between the two farthest points of the two categories. Average: the average proximity between the points of the two categories.
Comparison of hierarchical clustering methods (Figures: dendrograms obtained on the same six points with the MIN, MAX, and Average definitions of proximity)
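A small sketch of such a comparison using SciPy's hierarchical clustering (the toy points are made up; 'single', 'complete' and 'average' correspond to MIN, MAX and Average):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (made up): six 2D points
points = np.random.default_rng(1).random((6, 2))
for method in ("single", "complete", "average"):        # MIN, MAX, Average
    Z = linkage(points, method=method)                  # (n-1) x 4 merge history
    print(method, fcluster(Z, t=3, criterion="maxclust"))  # cut into 3 clusters

The merge history Z can also be drawn as a dendrogram (scipy.cluster.hierarchy.dendrogram) to compare the hierarchies visually.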
Conclusion Today, we discussed clustering: K-Means, DBSCAN, density-peak clustering, and how to evaluate clusters. Next week, we will discuss anomaly detection and some more advanced topics. Tutorial: how to use K-Means with the SPMF software: http://data-mining.philippe-fournier-viger.com/introduction-clustering-k-means-java-code/
References Chapters 8 and 9 of: Tan, Steinbach & Kumar (2006). Introduction to Data Mining. Pearson Education. ISBN-10: 0321321367 (and the accompanying PPTs). Han & Kamber (2011). Data Mining: Concepts and Techniques.