1
数据挖掘 Introduction to Data Mining
Philippe Fournier-Viger, Full Professor, School of Natural Sciences and Humanities. Spring 2018.
2
Course schedule (日程安排)
Lecture 1: Introduction – What is the knowledge discovery process?
Lecture 2: Exploring the data
Lecture 3: Classification (part 1)
Lecture 4: Classification (part 2)
Lecture 5: Association analysis
Lecture 6: Association analysis (part 2)
Lecture 7: Clustering
Lecture 8: Anomaly detection and advanced topics
Final exam (date to be announced)
3
Introduction
Last time: Association analysis (part 2); solution of assignment #1.
Assignment #2.
Important: QQ group; the PPTs are on the website.
4
Clustering (群集)
5
Introduction Clustering (群集): to automatically group similar objects/instances into clusters (groups). The clusters should capture the natural structure of the data.
6
Clustering
Why do clustering?
To summarize the data,
to understand the data for decision-making,
as a first step before applying other data mining techniques.
Clustering is a task that humans naturally do in everyday life.
Many applications: grouping similar webpages, grouping customers with similar behavior or preferences, grouping similar movies or songs.
7
What are "good" clusters?
In general, we may want to find clusters that:
minimize the similarity between points of different categories,
maximize the similarity between points within the same category.
8
To reduce the size of datasets
Some data mining techniques, such as PCA, may be slow if the database is large, since they can have a high computational complexity. A solution is to replace all points in each cluster by a single data point representing the cluster. This reduces the size of the database and allows data mining algorithms to run faster.
10
Classification (分类): predicting the value of a target attribute for some new data. The possible values of the target attribute are called "classes" or categories.
Target attribute: EDUCATION
NAME   AGE  INCOME  GENDER  EDUCATION
John   99   1 元    Male    Ph.D.
Lucia  44   20 元   Female  Master
Paul   33   25 元
Daisy  20   50 元           High school
Jack   15   10 元
Macy   35                   ?????????
Classes are known in advance: Ph.D., Master, High school, …
11
Classification (分类): Supervised classification (监督分类) requires training data that is already labelled, in order to train a classification model.
Target attribute: EDUCATION
Training data (训练数据):
NAME   AGE  INCOME  GENDER  EDUCATION
John   99   1 元    Male    Ph.D.
Lucia  44   20 元   Female  Master
Paul   33   25 元
Daisy  20   50 元           High school
Jack   15   10 元
Macy   35                   ?????????
12
Clustering (群集): automatically group instances into groups.
No training data is required.
No labels or target attribute need to be selected.
14
What is a good clustering?
How many categories? Six? Four? Two?
15
Partitional Clustering (划分聚类)
Each object must belong to exactly one cluster. (Figures: the original points and a partitional clustering.)
16
Hierarchical Clustering (层次聚类)
Clusters are created as a hierarchy of clusters. (Figures: traditional hierarchical clustering with its dendrogram; non-traditional hierarchical clustering with its dendrogram.)
17
An example of a dendrogram: http://www.instituteofcaninebiology.org/how-to-read-a-dendrogram
18
Many types of clustering
Exclusive versus non-exclusive: in a non-exclusive clustering, points may belong to multiple clusters. This can represent points belonging to multiple classes, or 'border' points. (Figures: exclusive clustering vs. non-exclusive clustering.)
19
Many types of clustering
Fuzzy versus non-fuzzy: in fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights must sum to 1. Probabilistic clustering has similar characteristics. (Figures: non-fuzzy clustering vs. fuzzy clustering.)
20
Many types of clustering
Partial versus complete: in some cases, we only want to cluster some of the data, e.g. to eliminate outliers. (Figures: complete clustering vs. partial clustering.)
21
Many types of clustering
Heterogeneous versus homogeneous: a heterogeneous clustering contains clusters of widely different sizes, shapes, and densities. (Figures: homogeneous (均匀的) vs. heterogeneous (各种各样的) clusters, in terms of size.)
22
Types of clusters: Well-Separated Clusters
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. (Figure: 3 well-separated clusters.)
23
Types of clusters: Center-Based clusters
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster. The center of a cluster is often a centroid (the average of all the points in the cluster) or a medoid (the most "representative" point of the cluster). (Figure: 4 center-based clusters.)
24
Types of Clusters: Contiguity-Based
Contiguous cluster (nearest neighbor, or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. (Figure: 8 contiguous clusters.)
25
Types of Clusters: Density-Based
A cluster is a dense region of points, which is separated by low-density regions from other regions of high density. Used when the clusters are irregular or intertwined, and when noise and outliers are present. (Figure: 6 density-based clusters.)
26
The K-Means algorithm
27
Introduction
K-Means is a simple and popular approach for partitional clustering.
Each cluster is associated with a centroid (center point).
Each point (object) is assigned to the cluster with the closest centroid.
The number of clusters, K, must be specified by the user.
28
K-Means
Input:
k, the number of clusters to be generated,
P, a set of points to be clustered.
Output: k partitions (some may be empty).
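As a rough illustration of these input/output conventions, here is a minimal NumPy sketch of the K-Means loop described on the next slides (random initial centroids, assignment to the closest centroid, centroid recomputation). The function name, parameters, and toy data are illustrative assumptions, not part of the course material or of SPMF.

```python
import numpy as np

def kmeans(P, k, max_iter=100, seed=0):
    """Illustrative K-Means sketch: P is an (n, d) array of points, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Randomly pick k points of P as the initial centroids
    centroids = P[rng.choice(len(P), size=k, replace=False)].astype(float)
    labels = np.full(len(P), -1)
    for _ in range(max_iter):
        # Assign each point to the cluster with the closest centroid (Euclidean distance)
        distances = np.linalg.norm(P[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # stop when the clusters no longer change
        labels = new_labels
        # Recompute each centroid as the average of its points (empty clusters keep their centroid)
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = P[labels == i].mean(axis=0)
    return labels, centroids

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
labels, centroids = kmeans(points, k=3)
print(labels, centroids, sep="\n")
```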
29
Example – iteration 1: Three points are randomly selected to be the initial centroids.
30
Example – iteration 2: Centroids are recalculated as the average of the points of each cluster, and each point is assigned to the cluster with the closest centroid.
31
Example – iteration 3: Centroids are recalculated as the average of the points of each cluster, and each point is assigned to the cluster with the closest centroid.
32
Example – iteration 4: Centroids are recalculated as the average of the points of each cluster, and each point is assigned to the cluster with the closest centroid.
33
Example – iteration 5: Centroids are recalculated as the average of the points of each cluster, and each point is assigned to the cluster with the closest centroid.
34
Example – iteration 6: This is the last iteration because, after it, the clusters do not change.
35
More information about K-Means
Initially, centroids are randomly selected. Thus, if we run K-Means several times, the result may be different. The similarity or distance between points may be calculated using different distance functions such as the Euclidean distance, correlation, etc. For such measures, K-Means will always converge to a solution (a set of clusters). Usually, the clusters change more during the first iterations. We can stop K-Means when the result does not change much between two iterations.
36
The choice of the initial centroids can have a huge influence on the final result
(Figures: the data, an optimal clustering, and a quite bad clustering.)
37
In some cases, K-Means can find a good solution despite an initial choice of centroids that does not appear very good.
38
How to evaluate a clustering
Sum of squared errors (SSE):
SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(m_i, x)^2
where k is the number of clusters, x is an object of cluster C_i, and m_i is the prototype (centroid) of C_i.
The SSE allows choosing the best clustering.
Note: increasing k will decrease the SSE, but a good clustering will still have a small SSE even for a small value of k.
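A small sketch of how this SSE could be computed for the output of the kmeans() sketch above; squared Euclidean distance is assumed as the dist function.

```python
import numpy as np

def sse(P, labels, centroids):
    """Illustrative sketch: sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for i, m in enumerate(centroids):
        members = P[labels == i]
        if len(members) > 0:
            total += np.sum((members - m) ** 2)
    return total
```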
39
Some problems with k-means
It may be difficult to find a perfect clustering if k is large, because it becomes unlikely that an initial centroid will be chosen in each natural cluster. K-Means can also create some empty clusters. There are many strategies to fix these problems, e.g. applying the algorithm several times…
40
Limitations of K-Means
K-means does not work very well for clusters:
having different sizes,
having different densities,
having a non-globular shape.
K-means may also not work very well when the data contains outliers.
41
Limitations of K-means: different sizes
(Figures: original points; K-Means with 3 clusters.)
42
And what if we increase k ?
(Figures: original points; K-Means with 3 clusters.)
43
Limitations of K-Means : different densities
(Figures: original points; K-Means with 3 clusters.)
44
And what if we increase k ?
(Figures: original points; K-Means with 9 clusters.)
45
Limitations of K-Means: non-globular shapes
(Figures: original points; K-Means with 2 clusters.)
46
And what if we increase k ?
Not better…
47
Pre-processing and post-processing
Pre-processing:
Normalize the data.
Remove outliers.
Post-processing:
Remove small clusters that could be outliers.
If a cluster has a high SSE, split it into two clusters.
Merge two clusters that are very similar to each other, if the resulting SSE is low.
Some of these operations can be integrated into the K-Means algorithm.
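As a sketch of the first post-processing idea only (removing small clusters); the minimum cluster size used here is an arbitrary assumption.

```python
import numpy as np

def remove_small_clusters(labels, min_size=5):
    """Illustrative sketch: mark points belonging to very small clusters as noise (label -1)."""
    labels = labels.copy()
    for c in np.unique(labels):
        if np.sum(labels == c) < min_size:
            labels[labels == c] = -1  # members of tiny clusters are treated as potential outliers
    return labels
```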
48
Density-based clustering (基于密度的聚类): DBSCAN
49
What is density? Density can be defined as the number of points within a circular area defined by some radius (半径). Here, density is defined with respect to a given point.
50
DBScan (1996)
Input:
some data points (objects),
eps: a distance (a positive number),
minPoints: a minimum number of points.
Output: clusters that are created based on the density of points; some points are considered as noise and are not included in any cluster.
51
Definitions
Neighbors: the points at a distance not greater than eps from a given point.
Core point: a point having at least MinPts neighbors.
Border point: a point having fewer than MinPts neighbors, but having a neighbor that is a core point.
Noise: the other points.
Example: eps = 1, minPts = 4.
52
How does DBScan work?
current_label = 1
FOR EACH core point p:
    IF p has no label THEN:
        p.label = current_label
        FOR EACH point y in the neighborhood of p (transitively):
            IF y is a border point, or a core point without a label, THEN:
                y.label = current_label
        current_label = current_label + 1
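A compact Python sketch of this label-propagation idea, using a brute-force neighborhood search, so it is only meant for small datasets; the function and variable names are illustrative assumptions.

```python
import numpy as np

def dbscan(P, eps, min_pts):
    """Illustrative DBSCAN sketch. Returns one label per point; -1 means noise."""
    n = len(P)
    # Pairwise Euclidean distances and neighborhoods (a point counts itself as a neighbor)
    dist = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    current_label = 0
    for p in range(n):
        if not core[p] or labels[p] != -1:
            continue
        # Grow a new cluster from the unlabelled core point p
        labels[p] = current_label
        queue = list(neighbors[p])
        while queue:
            y = queue.pop()
            if labels[y] == -1:
                labels[y] = current_label          # border or core point joins the cluster
                if core[y]:
                    queue.extend(neighbors[y])     # only core points propagate the cluster
        current_label += 1
    return labels
```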
53
DBSCAN: Illustration
(Figures: original points; types of points: core points, border points, noise; Eps = 10, MinPts = 4.)
54
Advantages of DBScan
Noise-tolerant.
Can discover clusters of various sizes and shapes.
(Figures: original points and the clusters found.)
55
Other examples
56
Limitations of DBScan
Clusters of various densities.
High-dimensional data.
(Figures: original points; DBSCAN with MinPts = 4, Eps = 9.75; DBSCAN with MinPts = 4, Eps = 9.92.)
57
Other examples
58
How to choose the EPS and MinPTS parameters?
We can observe the distance from each point to its kth closest neighbor. Noise points are farther from their kth nearest neighbor than points that are not noise. To choose a value for eps, we pick a value of k (used as MinPts), sort the points by their distance to their kth nearest neighbor, and look for a sharp increase ("knee") in this sorted distance plot, as sketched below.
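A sketch of this heuristic with scikit-learn and matplotlib, assuming both are available; X stands for a data matrix with one row per point.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k=4):
    """Illustrative sketch: plot the sorted distance of each point to its k-th nearest neighbor.
    A sharp bend ('knee') in the curve suggests a value for eps, with MinPts = k."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because a point is its own nearest neighbor
    distances, _ = nn.kneighbors(X)
    kth = np.sort(distances[:, -1])
    plt.plot(kth)
    plt.xlabel("Points sorted by k-distance")
    plt.ylabel(f"Distance to the {k}-th nearest neighbor")
    plt.show()
```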
59
Density-based clustering
Advantages:
Finds clusters of different sizes and shapes.
Does not need the number of clusters to be specified.
Removes points that are noise.
Can be quite fast, if the software uses appropriate spatial data structures to search quickly for neighbors.
Disadvantages:
It can be difficult to find good parameter values.
Results may vary greatly depending on how the parameters are set.
60
Density-peak clustering (Science, 2014)
Clusters are defined as peaks in the density of points.
Allows finding non-spherical clusters of different densities.
The number of clusters is found automatically.
Can also remove noise.
Simple.
61
(The local density of each point is computed for a distance dc.)
62
(Figure: decision graph, with the density of each point on one axis and its minimum distance to a denser point on the other.)
63
(Figure: decision graph; the highlighted points are candidate cluster centers.)
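A rough sketch of the two quantities behind this decision graph, following the description in the Science 2014 paper: the local density ρ of a point (number of points within distance dc) and δ, its distance to the nearest point of higher density. Brute force, for illustration only.

```python
import numpy as np

def decision_graph(P, dc):
    """Illustrative sketch: compute, for each point, its local density rho and its distance
    delta to the nearest point with a higher density (density-peak clustering)."""
    dist = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    rho = np.sum(dist < dc, axis=1) - 1          # exclude the point itself
    delta = np.zeros(len(P))
    for i in range(len(P)):
        higher = np.where(rho > rho[i])[0]
        if len(higher) > 0:
            delta[i] = dist[i, higher].min()
        else:
            delta[i] = dist[i].max()             # the highest-density point gets the largest distance
    return rho, delta

# Cluster centers are the points with both a high rho and a high delta.
```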
64
This algorithm solves some problems of DBScan.
65
Clustering evaluation
66
Clustering evaluation
Evaluating a clustering found by an algorithm is very important:
to avoid finding clusters in noise,
to compare different clustering algorithms,
to compare two clusterings generated by the same algorithm,
to compare two clusters.
67
Clusters found in random data
(Figures: random points, and the clusters found by DBSCAN, K-means, and hierarchical clustering.)
68
Issues related to cluster evaluation
Are there really some natural categories in the data, or is it just random data?
Evaluating clusters using external data (e.g. some already known class labels).
Evaluating clusters without using external data (e.g. using the SSE or other measures).
Comparing two clusterings to choose one.
Determining how many categories there are.
69
A method: using a similarity matrix
Order the points by cluster label.
Calculate the similarity between all pairs of points.
If the clusters are well separated, square blocks should appear along the diagonal of the matrix.
Example 1: K-Means.
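A sketch of this visual check with NumPy and matplotlib; the labels could come from any clustering algorithm, and turning distances into similarities with 1/(1+d) is just one simple, assumed choice.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_similarity_matrix(X, labels):
    """Illustrative sketch: sort points by cluster label and show the pairwise similarity matrix.
    Well-separated clusters appear as bright square blocks along the diagonal."""
    order = np.argsort(labels)
    Xs = X[order]
    dist = np.linalg.norm(Xs[:, None, :] - Xs[None, :, :], axis=2)
    similarity = 1.0 / (1.0 + dist)              # one simple way to turn distances into similarities
    plt.imshow(similarity, cmap="viridis")
    plt.colorbar(label="similarity")
    plt.show()
```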
70
Example 2: DBScan, random data. The diagonal is less well defined.
Example 3: K-Means, random data.
71
Example 4: DBSCAN.
72
A method to choose the number of categories
There are various methods. A simple method is to plot the sum of squared errors (SSE) with respect to the number of clusters (for example, for K-Means) and choose the value of k after which the SSE stops decreasing quickly, as sketched below.
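A sketch of such an "SSE versus k" plot, reusing the kmeans() and sse() sketches from earlier; the synthetic data and the range of k values are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumes the kmeans() and sse() sketches defined earlier in these notes.
rng = np.random.default_rng(0)
# Synthetic 2D data with three natural groups
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in [(0, 0), (3, 3), (6, 0)]])

ks = range(1, 9)
errors = [sse(X, *kmeans(X, k)) for k in ks]   # SSE of the clustering found for each k
plt.plot(list(ks), errors, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("SSE")
plt.show()
```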
73
Another example:
74
Hierarchical clustering (层次聚类)
75
MIN (single link): proximity between the two closest points of two clusters.
MAX (complete link): proximity between the two farthest points of two clusters.
Average: average proximity between the points of two clusters.
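A sketch comparing these linkage criteria with SciPy, where 'single' corresponds to MIN, 'complete' to MAX, and 'average' to Average; SciPy and matplotlib are assumed to be available, and the random points are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.random((6, 2))                            # six random 2D points, as in the slide's example

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, method in zip(axes, ["single", "complete", "average"]):  # MIN, MAX, Average
    Z = linkage(X, method=method)                 # hierarchical clustering with the chosen linkage
    dendrogram(Z, ax=ax)
    ax.set_title(method)
plt.show()
```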
76
Comparison of hierarchical clustering methods
(Figures: dendrograms obtained on the same six points using MIN, MAX, and Average linkage.)
77
Conclusion Today, we discussed clustering.
K-Means,
DBScan,
density-peak clustering,
how to evaluate clusters.
Next week, we will discuss anomaly detection and some more advanced topics.
Tutorial: how to use K-Means with the SPMF software:
78
References
Tan, Steinbach & Kumar (2006). Introduction to Data Mining, chapters 8 and 9. Pearson Education (and the accompanying PPTs).
Han & Kamber (2011). Data Mining: Concepts and Techniques.