数据挖掘 Introduction to Data Mining

数据挖掘 Introduction to Data Mining
Philippe Fournier-Viger, Full Professor
School of Natural Sciences and Humanities
philfv8@yahoo.com
Spring 2018, S8700113C

Course schedule (日程安排)
Lecture 1: Introduction. What is the knowledge discovery process?
Lecture 2: Exploring the data
Lecture 3: Classification (part 1)
Lecture 4: Classification (part 2)
Lecture 5: Association analysis (part 1)
Lecture 6: Association analysis (part 2)
Lecture 7: Clustering
Lecture 8: Anomaly detection and advanced topics
Final exam (date to be announced)

Introduction
Last time: association analysis (part 2); solution to assignment #1; assignment #2.
Important: QQ group: 723166394. The PPTs are on the website.

Clustering (群集)

Introduction Clustering (群集): to automatically group similar objects/instances into clusters (groups). The clusters should capture the natural structure of the data.

Clustering
Why do clustering?
- to summarize the data,
- to understand the data for decision-making,
- as a first step before applying other data mining techniques.
Clustering is a task that humans naturally do in everyday life.
Many applications: grouping similar webpages, grouping customers with similar behavior or preferences, grouping similar movies or songs.

What are "good" clusters? In general, we may want to find clusters that:
- minimize the similarity between points of different categories,
- maximize the similarity between points of the same category.

To reduce the size of datasets
Some data mining techniques, such as PCA, may be slow if a database is large (their cost grows quickly with the size of the data). A solution is to replace all points in each cluster by a single data point representing the cluster. This reduces the size of the database and allows data mining algorithms to run faster.

Classification (分类)
Classification (分类): predicting the value of a target attribute for some new data. The possible values of the target attribute are called "classes" or categories.

NAME    AGE   INCOME   GENDER   EDUCATION ("target attribute")
John    99    1 元     Male     Ph.D.
Lucia   44    20 元    Female   Master
Paul    33    25 元
Daisy   20    50 元             High school
Jack    15    10 元
Macy    35                      ? (to be predicted)

Classes are known in advance: Ph.D., Master, high school…

Classification (分类)
Supervised classification (监督分类) requires training data (训练数据) that is already labelled, in order to train a classification model. In the table above, the labelled rows form the training data, and the EDUCATION value (the target attribute) must be predicted for Macy.

Clustering (群集)
Automatically group instances into groups. No training data is required. No labels or target attribute need to be selected.

What is a good clustering? How many categories? Six? Four? Two?

Partitional Clustering (划分聚类)
Each object must belong to exactly one cluster.
(Figures: the original points and a partitional clustering of them.)

Hierarchical Clustering (层次聚类)
Clusters are created as a hierarchy of clusters.
(Figures: a traditional hierarchical clustering with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram.)

An example of a dendrogram: http://www.instituteofcaninebiology.org/how-to-read-a-dendrogram.html

Many types of clustering
Exclusive versus non-exclusive: in a non-exclusive clustering, points may belong to multiple clusters, e.g. to represent multiple classes or "border" points.
(Figures: an exclusive clustering and a non-exclusive clustering.)

Many types of clustering
Fuzzy versus non-fuzzy: in fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights of a point must sum to 1. Probabilistic clustering has similar characteristics.
(Figures: a non-fuzzy clustering and a fuzzy clustering.)

Many types of clustering
Partial versus complete: in some cases, we only want to cluster some of the data, e.g. to eliminate the outliers.
(Figures: a complete clustering and a partial clustering.)

Many types of clustering
Heterogeneous versus homogeneous: clusters of widely different sizes, shapes, and densities.
(Figures: a homogeneous (均匀的) clustering and a heterogeneous (各种各样的) clustering, in terms of size.)

Types of clusters: Well-Separated Clusters
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
(Figure: 3 well-separated clusters.)

Types of clusters: Center-Based Clusters
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster.
(Figure: 4 center-based clusters.)
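
To make the distinction concrete, here is a minimal sketch (assuming numpy and Euclidean distance; the variable names are illustrative) of how a centroid and a medoid can be computed:

import numpy as np

points = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [2.0, 2.5]])

# Centroid: the coordinate-wise mean (may not be one of the actual points).
centroid = points.mean(axis=0)

# Medoid: the actual point minimizing the total distance to all other points.
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
medoid = points[dists.sum(axis=1).argmin()]

print("centroid:", centroid, "medoid:", medoid)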

Types of Clusters: Contiguity-Based
Contiguous cluster (nearest neighbor, or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
(Figure: 8 contiguous clusters.)

Types of Clusters: Density-Based
A cluster is a dense region of points, separated from other regions of high density by regions of low density. Used when the clusters are irregular or intertwined, and when noise and outliers are present.
(Figure: 6 density-based clusters.)

The K-Means algorithm

Introduction
A simple and popular approach to partitional clustering:
- each cluster is associated with a centroid (center point),
- each point (object) is assigned to the cluster with the closest centroid,
- the number of clusters, K, must be specified by the user.

K-Means
Input: k, the number of clusters to be generated; P, a set of points to be clustered.
Output: k partitions (clusters), some of which may be empty.
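
As a concrete illustration, here is a minimal K-Means sketch in Python (assuming numpy and Euclidean distance; the function and parameter names are illustrative, not the course software's API):

import numpy as np

def kmeans(P, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly select k points as the initial centroids.
    centroids = P[rng.choice(len(P), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to the cluster with the closest centroid.
        dists = np.linalg.norm(P[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the average of its cluster
        #    (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            P[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # 4. Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids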

Example – iteration 1 Three points are randomly selected to be the centroids

Example – iteration 2
Centroids are recalculated as the average of each category, and each point is assigned to the category with the closest centroid.

Example – iteration 3
Centroids are recalculated as the average of each category, and each point is assigned to the category with the closest centroid.

Example – iteration 4
Centroids are recalculated as the average of each category, and each point is assigned to the category with the closest centroid.

Example – iteration 5
Centroids are recalculated as the average of each category, and each point is assigned to the category with the closest centroid.

Example – iteration 6 This is the last iteration because after that, the categories do not change.

More information about K-Means
Initially, the centroids are randomly selected. Thus, if we run K-Means several times, the result may be different.
The similarity or distance between points may be calculated using different distance functions, such as the Euclidean distance, correlation, etc. For such measures, K-Means will always converge to a solution (a set of clusters).
Usually, the clusters change more during the first iterations. We can stop K-Means when the result does not change much between two iterations.

The choice of the initial centroids can have a huge influence on the final result.
(Figures: the data, a clustering that is optimal, and a clustering that is quite bad.)

In some cases, K-Means can find a good solution despite an initial choice of centroids that does not appear to be very good.

How to evaluate a clustering
Sum of squared errors (SSE):

    SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(m_i, x)^2

where K is the number of clusters, x is an object from cluster C_i, and m_i is the prototype (centroid) of C_i.
The SSE allows us to choose the best of several clusterings. Note: increasing K decreases the SSE, but a good clustering will still have a small SSE even for a small K value.
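
In code, the SSE can be computed as follows (a sketch assuming numpy, with labels and centroids as produced by a K-Means run such as the one sketched above):

import numpy as np

def sse(P, labels, centroids):
    # Sum, over all points, of the squared distance to their cluster's centroid.
    return float(((P - centroids[labels]) ** 2).sum())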

Some problems with K-Means
It may be difficult to find a perfect clustering if k is large, because it becomes unlikely that a centroid will be chosen in each natural cluster.
K-Means can create some empty clusters. There are many strategies to fix this problem, e.g. applying the algorithm several times…

Limitations of K-Means
K-Means does not work very well for categories:
- having different sizes,
- having different densities,
- having a non-globular shape.
K-Means may also not work very well when the data contains outliers.

Limitations of K-Means: different sizes
(Figures: the original points and the K-Means result with 3 clusters.)

And what if we increase k?
(Figures: the original points and the K-Means result.)

Limitations of K-Means: different densities
(Figures: the original points and the K-Means result with 3 clusters.)

And what if we increase k?
(Figures: the original points and the K-Means result with 9 clusters.)

Limitations of K-Means: non-globular shapes
(Figures: the original points and the K-Means result with 2 clusters.)

And what if we increase k? Not better…

Pre-processing and post-processing
Pre-processing: normalize the data; remove outliers.
Post-processing:
- remove small clusters that could be outliers,
- if a cluster has a high SSE, split it into two clusters,
- merge two clusters that are very similar to each other, if the SSE stays low.
Some of these operations can be integrated into the K-Means algorithm.

Density-based clustering (基于密度的聚类): DBSCAN

What is density?
Density can be defined as the number of points within a circular area defined by some radius (半径). Here, density is defined with respect to a given point.

DBSCAN (1996)
Input: some data points (objects); eps, a distance (a positive number); minPoints, a number of points.
Output: clusters that are created based on the density of points; some points are considered as noise and are not included in any cluster.

Definitions
Neighbors: the points at a distance not greater than eps.
Core point: a point having at least MinPts neighbors.
Border point: a point having fewer than MinPts neighbors, but having a neighbor that is a core point.
Noise: the other points.
Example: eps = 1, minPts = 4.

How does DBSCAN work?
current_label = 1
FOR EACH core point p:
    IF p has no label THEN:
        p.label = current_label
        FOR EACH point y in the neighborhood of p (transitively, through core points):
            IF y is a border point or a core point without a label THEN:
                y.label = current_label
        current_label = current_label + 1
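
A minimal runnable version of this procedure (a sketch assuming numpy and Euclidean distance; a label of -1 marks noise; real implementations use spatial index structures instead of a full distance matrix):

import numpy as np

def dbscan(P, eps, min_pts):
    n = len(P)
    dists = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    # Neighbors: points at a distance not greater than eps (here including the point itself).
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)  # -1 = noise, until the point joins a cluster
    current_label = 0
    for p in range(n):
        if not core[p] or labels[p] != -1:
            continue
        # Grow a new cluster from this unlabelled core point.
        labels[p] = current_label
        stack = list(neighbors[p])
        while stack:
            y = stack.pop()
            if labels[y] == -1:
                labels[y] = current_label        # border or core point joins the cluster
                if core[y]:
                    stack.extend(neighbors[y])   # only core points expand the cluster further
        current_label += 1
    return labels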

DBSCAN: Illustration
(Figures: the original points, and the types of points (core points, border points, noise) found with Eps = 10, MinPts = 4.)

Advantages of DBSCAN
Noise-tolerant. Can discover clusters of various sizes and shapes.
(Figures: the original points and the clusters found.)

Other examples

Limitations of DBSCAN
Varying densities; high-dimensional data.
(Figures: the original points, and the clusters found with MinPts = 4, Eps = 9.75, and with MinPts = 4, Eps = 9.92.)

Other examples

How to choose the eps and MinPts parameters?
We can observe the distance from each point to its k-th nearest neighbor: noise points are farther from their k-th nearest neighbor than points that are not noise. So we choose a value of k, sort the points by the distance to their k-th nearest neighbor, and set eps around the distance where this sorted curve rises sharply.
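
A sketch of this heuristic (assuming numpy; k would typically be set to MinPts):

import numpy as np

def k_distances(P, k):
    dists = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    dists.sort(axis=1)       # row i = distances from point i, in increasing order
    kth = dists[:, k]        # column 0 is the distance of each point to itself (0)
    return np.sort(kth)      # sorted k-distances: look for a sharp rise ("knee") to pick eps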

Density-based clustering
Advantages:
- clusters of different sizes and shapes,
- no need to specify the number of clusters,
- removes points that are noise,
- can be quite fast, if the software uses appropriate spatial data structures to search quickly for neighbors.
Disadvantages:
- it can be difficult to find good parameter values,
- results may vary greatly depending on how the parameters are set.

Density-peak clustering (Science, 2014)
Clusters are peaks in the density of points.
- Can find non-spherical clusters of different densities.
- The number of clusters is found automatically.
- Can also remove noise.
- Simple.

(Figure: the density of each point, for a distance dc.) Source: http://conference.mipt.ru/img/conference/material-design-2014/talks/Laio-talk.pdf

(Figure: for each point, its minimum distance to a denser point plotted against its density.)

This algorithm solves some of the problems of DBSCAN.
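
A sketch of the two quantities the method computes for each point (assuming numpy; rho = the number of points within a distance dc, delta = the distance to the nearest point of higher density); cluster centers are the points for which both values are large:

import numpy as np

def density_peaks(P, dc):
    n = len(P)
    dists = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    rho = (dists < dc).sum(axis=1) - 1     # local density, excluding the point itself
    delta = np.empty(n)
    for i in range(n):
        denser = np.flatnonzero(rho > rho[i])
        if len(denser) == 0:
            delta[i] = dists[i].max()      # densest point: use the largest distance
        else:
            delta[i] = dists[i, denser].min()
    return rho, delta                      # plot delta against rho (the "decision graph")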

Clustering evaluation

Clustering evaluation
Evaluating a clustering found by an algorithm is very important:
- to avoid finding clusters in noise,
- to compare different clustering algorithms,
- to compare two clusterings generated by the same algorithm,
- to compare two clusters.

Clusters found in random data
(Figures: random points, and the clusters found in them by DBSCAN, K-Means, and hierarchical clustering.)

Issues related to cluster evaluation
- Are there really natural categories in the data, or is it just random data?
- Evaluating clusters using external data (e.g. some already known class labels).
- Evaluating clusters without using external data (e.g. using the SSE or other measures).
- Comparing two clusterings to choose one.
- Determining how many categories there are.

A method: using a similarity matrix
Order the points by cluster labels, then calculate the similarity between all pairs of points. If the categories are well separated, square blocks should appear along the diagonal of the matrix.
EXAMPLE 1: K-Means.
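
A sketch of this evaluation (assuming numpy, with matplotlib for display; similarity is taken here as the negated Euclidean distance):

import numpy as np
import matplotlib.pyplot as plt

def plot_similarity_matrix(P, labels):
    order = np.argsort(labels)   # order the points by cluster label
    Q = P[order]
    sim = -np.linalg.norm(Q[:, None, :] - Q[None, :, :], axis=2)
    plt.imshow(sim)              # well-separated clusters appear as square blocks on the diagonal
    plt.colorbar()
    plt.show()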

EXAMPLE 2: DBSCAN, random data (the diagonal is less well defined). EXAMPLE 3: K-Means, random data.

EXAMPLE 4: DBSCAN

A method to choose the number of categories
There are various methods. A simple one is to use the sum of squared errors (SSE). For example: plot the SSE with respect to the number of categories for K-Means, and look for the point where the curve stops decreasing sharply.
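
For example, with scikit-learn one can run K-Means for increasing k and plot the SSE (called inertia_ there); the "elbow" where the curve flattens suggests a good number of clusters (a sketch, assuming a numpy array X of points):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
sses = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(ks, sses, marker="o")   # look for the "elbow" in the curve
plt.xlabel("number of clusters k")
plt.ylabel("SSE")
plt.show()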

Another example (figure).

Hierarchical clustering (层次聚类)

How to define the proximity between two clusters?
MIN: the proximity between the two closest points of the two categories.
MAX: the proximity between the two farthest points of the two categories.
Average: the average proximity between the points of the two categories.

Comparison of hierarchical clustering methods
(Figures: dendrograms obtained with MIN, MAX, and Average linkage on the same six points.)
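
With scipy, the three criteria can be compared directly (a sketch: 'single' corresponds to MIN, 'complete' to MAX, and 'average' to Average; X is assumed to be a numpy array of points):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

for method in ["single", "complete", "average"]:
    Z = linkage(X, method=method)   # agglomerative clustering with the given linkage criterion
    dendrogram(Z)
    plt.title(method)
    plt.show()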

Conclusion
Today, we discussed clustering: K-Means, DBSCAN, density-peak clustering, and how to evaluate clusters. Next week, we will discuss anomaly detection and some more advanced topics.
Tutorial: how to use K-Means with the SPMF software: http://data-mining.philippe-fournier-viger.com/introduction-clustering-k-means-java-code/

References
Tan, Steinbach & Kumar (2006). Introduction to Data Mining. Pearson Education, ISBN-10: 0321321367. Chapters 8 and 9 (and the accompanying PPTs).
Han & Kamber (2011). Data Mining: Concepts and Techniques.