Data Mining 資料探勘 分群分析 (Cluster Analysis) Min-Yuh Day 戴敏育


Data Mining 資料探勘: 分群分析 (Cluster Analysis)
1002DM04 MI4, Thu. 9, 10 (16:10-18:00), B513
Min-Yuh Day 戴敏育, Assistant Professor (專任助理教授)
Dept. of Information Management, Tamkang University (淡江大學 資訊管理學系)
http://mail.tku.edu.tw/myday/
2012-03-08

課程大綱 (Syllabus)
Week 1, 101/02/16: Introduction to Data Mining (資料探勘導論)
Week 2, 101/02/23: Association Analysis (關連分析)
Week 3, 101/03/01: Classification and Prediction (分類與預測)
Week 4, 101/03/08: Cluster Analysis (分群分析)
Week 5, 101/03/15: Case Study 1 (Cluster Analysis): Banking Segmentation (K-Means)
Week 6, 101/03/22: Case Study 2 (Association Analysis): Web Site Usage Associations
Week 7, 101/03/29: Midterm Presentation (期中報告)
Week 8, 101/04/05: Teaching Observation Day (No Class)

課程大綱 (Syllabus)
Week 9, 101/04/12: Case Study 3 (Decision Tree, Model Evaluation): Enrollment Management
Week 10, 101/04/19: Midterm Exam Week (期中考試週)
Week 11, 101/04/26: Case Study 4 (Regression Analysis, Artificial Neural Network): Credit Risk
Week 12, 101/05/03: Text and Web Mining (文字探勘與網頁探勘)
Week 13, 101/05/10: Social Network Analysis, Opinion Mining (社會網路分析、意見分析)
Week 14, 101/05/17: Term Project Presentation (期末專題報告)
Week 15, 101/05/24: Final Exam Week (畢業考試週)

Outline Cluster Analysis K-Means Clustering Source: Han & Kamber (2006)

Cluster Analysis
- Used for automatic identification of natural groupings of things
- Part of the machine-learning family
- Employs unsupervised learning: learns the clusters of things from past data, then assigns new instances
- There is no output variable
- Also known as segmentation
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems

Cluster Analysis Clustering of a set of objects based on the k-means method. (The mean of each cluster is marked by a “+”.) Source: Han & Kamber (2006)

Cluster Analysis
Clustering results may be used to:
- Identify natural groupings of customers
- Identify rules for assigning new cases to classes for targeting/diagnostic purposes
- Provide characterization, definition, and labeling of populations
- Decrease the size and complexity of problems for other data mining methods
- Identify outliers in a specific domain (e.g., rare-event detection)
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems

Example of Cluster Analysis
Point  P  P(x, y)
p01    a  (3, 4)
p02    b  (3, 6)
p03    c  (3, 8)
p04    d  (4, 5)
p05    e  (4, 7)
p06    f  (5, 1)
p07    g  (5, 5)
p08    h  (7, 3)
p09    i  (7, 5)
p10    j  (8, 5)

Cluster Analysis for Data Mining
Analysis methods:
- Statistical methods (both hierarchical and nonhierarchical), such as k-means, k-modes, and so on
- Neural networks (adaptive resonance theory [ART], self-organizing map [SOM])
- Fuzzy logic (e.g., fuzzy c-means algorithm)
- Genetic algorithms
- Divisive versus agglomerative methods
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems

Cluster Analysis for Data Mining
How many clusters?
- There is no "truly optimal" way to calculate it; heuristics are often used:
  - Look at the sparseness of clusters
  - Number of clusters ≈ (n/2)^(1/2), where n is the number of data points
  - Use the Akaike information criterion (AIC)
  - Use the Bayesian information criterion (BIC)
- Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items
  - Euclidean versus Manhattan (rectilinear) distance
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
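The (n/2)^(1/2) rule of thumb above is easy to compute; a minimal sketch (the function name is mine, not from the slides):

```python
from math import sqrt

def heuristic_num_clusters(n: int) -> int:
    """Rule-of-thumb estimate: number of clusters ≈ sqrt(n / 2)."""
    return round(sqrt(n / 2))

# For the 10-point example used later in this lecture:
print(heuristic_num_clusters(10))  # sqrt(5) ≈ 2.24, rounds to 2
```

This matches the K = 2 used in the worked example later in the slides.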

k-Means Clustering Algorithm
k: predetermined number of clusters
Algorithm:
- Step 0: Determine the value of k
- Step 1: Randomly generate k points as initial cluster centers
- Step 2: Assign each point to the nearest cluster center
- Step 3: Re-compute the new cluster centers
- Repetition step: Repeat Steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable)
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
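The steps above can be sketched in a few lines of Python. This is a minimal sketch of the assign/update loop, not a production implementation: it assumes no cluster ever becomes empty, and it uses the data points and initial centers from the worked example later in these slides.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def k_means(points, centers, max_iter=100):
    """Alternate assignment (Step 2) and mean-update (Step 3) until stable."""
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest cluster center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: re-compute each center as the mean of its cluster
        # (assumes every cluster is nonempty)
        new_centers = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) for c in clusters
        ]
        if new_centers == centers:  # convergence: assignments are stable
            break
        centers = new_centers
    return centers, clusters

# The 10 points and initial centers from the worked example in these slides:
pts = [(3, 4), (3, 6), (3, 8), (4, 5), (4, 7), (5, 1), (5, 5), (7, 3), (7, 5), (8, 5)]
final_centers, final_clusters = k_means(pts, [(3, 4), (8, 5)])
print([tuple(round(v, 2) for v in c) for c in final_centers])
# → [(3.67, 5.83), (6.75, 3.5)]
```

With this initialization the loop reproduces the final centers derived step by step later in the lecture.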

Cluster Analysis for Data Mining - k-Means Clustering Algorithm Source: Turban et al. (2011), Decision Support and Business Intelligence Systems

Quality: What Is Good Clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Source: Han & Kamber (2006)

Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects. Some popular ones include the Minkowski distance:
d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer.
If q = 1, d is the Manhattan distance:
d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
Source: Han & Kamber (2006)

Similarity and Dissimilarity Between Objects (Cont.)
If q = 2, d is the Euclidean distance:
d(i, j) = (|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)^(1/2)
Properties:
- d(i, j) ≥ 0
- d(i, i) = 0
- d(i, j) = d(j, i)
- d(i, j) ≤ d(i, k) + d(k, j)
Also, one can use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures.
Source: Han & Kamber (2006)

Euclidean distance vs Manhattan distance
Distance between two points x1 = (1, 2) and x2 = (3, 5):
Euclidean distance:
= ((3-1)^2 + (5-2)^2)^(1/2) = (2^2 + 3^2)^(1/2) = (4 + 9)^(1/2) = (13)^(1/2) = 3.61
Manhattan distance:
= |3-1| + |5-2| = 2 + 3 = 5
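The worked example above can be checked with a small Minkowski-distance helper (a sketch; the function name is mine, not from the slides). Setting q = 1 gives Manhattan distance and q = 2 gives Euclidean distance:

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two points; q=1 is Manhattan, q=2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

x1, x2 = (1, 2), (3, 5)
print(round(minkowski(x1, x2, q=2), 2))  # Euclidean: sqrt(13) → 3.61
print(minkowski(x1, x2, q=1))            # Manhattan: 2 + 3 → 5.0
```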

Binary Variables
A contingency table for binary data, counting attribute matches between objects i and j:

                 Object j
                  1       0       sum
Object i   1      q       r       q + r
           0      s       t       s + t
          sum   q + s   r + t     p

Distance measure for symmetric binary variables:
d(i, j) = (r + s) / (q + r + s + t)
Distance measure for asymmetric binary variables:
d(i, j) = (r + s) / (q + r + s)
Jaccard coefficient (similarity measure for asymmetric binary variables):
sim_Jaccard(i, j) = q / (q + r + s)
Source: Han & Kamber (2006)
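These three measures follow directly from the contingency counts; a minimal sketch in Python (the function names are mine, and the example vectors are hypothetical asymmetric attributes with 1 = positive, 0 = negative):

```python
def binary_counts(i, j):
    """Contingency counts for two 0/1 vectors:
    q = both 1, r = 1 in i only, s = 1 in j only, t = both 0."""
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    return q, r, s, t

def d_symmetric(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)

def d_asymmetric(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s)  # matching 0s carry no information

def jaccard(i, j):
    q, r, s, t = binary_counts(i, j)
    return q / (q + r + s)

# Hypothetical example: two patients' asymmetric binary test results
i, j = [1, 0, 1, 0, 0, 0], [1, 0, 1, 0, 1, 0]
print(round(d_asymmetric(i, j), 2))  # (0 + 1) / (2 + 0 + 1) → 0.33
```

Note that for any pair, d_asymmetric + jaccard = 1: the Jaccard coefficient is the similarity counterpart of the asymmetric distance.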

Dissimilarity between Binary Variables
Example:
- gender is a symmetric attribute
- the remaining attributes are asymmetric binary
- let the values Y and P be set to 1, and the value N be set to 0
Source: Han & Kamber (2006)

The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when there are no more new assignments
Source: Han & Kamber (2006)

The K-Means Clustering Method
Example (K = 2): Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and repeat until the assignments no longer change.
Source: Han & Kamber (2006)

K-Means Clustering Step by Step
Point  P  P(x, y)
p01    a  (3, 4)
p02    b  (3, 6)
p03    c  (3, 8)
p04    d  (4, 5)
p05    e  (4, 7)
p06    f  (5, 1)
p07    g  (5, 5)
p08    h  (7, 3)
p09    i  (7, 5)
p10    j  (8, 5)

K-Means Clustering
Step 1: K = 2; arbitrarily choose K objects as initial cluster centers:
m1 = (3, 4), m2 = (8, 5)

Point  P  P(x, y)
p01    a  (3, 4)
p02    b  (3, 6)
p03    c  (3, 8)
p04    d  (4, 5)
p05    e  (4, 7)
p06    f  (5, 1)
p07    g  (5, 5)
p08    h  (7, 3)
p09    i  (7, 5)
p10    j  (8, 5)

K-Means Clustering
Step 2: Compute seed points as the centroids of the clusters of the current partition
Step 3: Assign each object to the most similar center
m1 = (3, 4), m2 = (8, 5)

Point  P  P(x, y)  m1 distance  m2 distance  Cluster
p01    a  (3, 4)   0.00         5.10         Cluster1
p02    b  (3, 6)   2.00         5.10         Cluster1
p03    c  (3, 8)   4.00         5.83         Cluster1
p04    d  (4, 5)   1.41         4.00         Cluster1
p05    e  (4, 7)   3.16         4.47         Cluster1
p06    f  (5, 1)   3.61         5.00         Cluster1
p07    g  (5, 5)   2.24         3.00         Cluster1
p08    h  (7, 3)   4.12         2.24         Cluster2
p09    i  (7, 5)   4.12         1.00         Cluster2
p10    j  (8, 5)   5.10         0.00         Cluster2

K-Means Clustering
Steps 2 and 3 (cont.): Euclidean distance of point b(3, 6) to each center:
Euclidean distance b(3, 6) to m1(3, 4)
= ((3-3)^2 + (4-6)^2)^(1/2) = (0^2 + (-2)^2)^(1/2) = (0 + 4)^(1/2) = (4)^(1/2) = 2.00
Euclidean distance b(3, 6) to m2(8, 5)
= ((8-3)^2 + (5-6)^2)^(1/2) = (5^2 + (-1)^2)^(1/2) = (25 + 1)^(1/2) = (26)^(1/2) = 5.10
Since 2.00 < 5.10, b is assigned to Cluster1.
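The same per-point computation for the entire first assignment can be regenerated in a few lines (a sketch using Python's math.dist; the point labels follow the slides):

```python
from math import dist  # Euclidean distance (Python 3.8+)

m1, m2 = (3, 4), (8, 5)  # initial centers chosen in Step 1
points = {"a": (3, 4), "b": (3, 6), "c": (3, 8), "d": (4, 5), "e": (4, 7),
          "f": (5, 1), "g": (5, 5), "h": (7, 3), "i": (7, 5), "j": (8, 5)}

for name, p in points.items():
    d1, d2 = dist(p, m1), dist(p, m2)
    cluster = "Cluster1" if d1 <= d2 else "Cluster2"
    print(f"{name}: m1={d1:.2f}  m2={d2:.2f}  {cluster}")
```

Points a through g land in Cluster1 and h, i, j in Cluster2, matching the table.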

K-Means Clustering
Step 4: Update the cluster means; repeat Steps 2 and 3; stop when there are no more new assignments
m1 = (3.86, 5.14), m2 = (7.33, 4.33)

Point  P  P(x, y)  m1 distance  m2 distance  Cluster
p01    a  (3, 4)   1.43         4.34         Cluster1
p02    b  (3, 6)   1.22         4.64         Cluster1
p03    c  (3, 8)   2.99         5.68         Cluster1
p04    d  (4, 5)   0.20         3.40         Cluster1
p05    e  (4, 7)   1.87         4.27         Cluster1
p06    f  (5, 1)   4.29         4.06         Cluster2
p07    g  (5, 5)   1.15         2.42         Cluster1
p08    h  (7, 3)   3.80         1.37         Cluster2
p09    i  (7, 5)   3.14         0.75         Cluster2
p10    j  (8, 5)   4.14         0.95         Cluster2
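The updated means in this step are just the coordinate-wise averages of each cluster's members from the first assignment (points a through g in Cluster1, h through j in Cluster2); a quick check:

```python
cluster1 = [(3, 4), (3, 6), (3, 8), (4, 5), (4, 7), (5, 1), (5, 5)]  # points a-g
cluster2 = [(7, 3), (7, 5), (8, 5)]                                  # points h-j

def centroid(cluster):
    """Coordinate-wise mean of a cluster's points, rounded to 2 decimals."""
    return tuple(round(sum(c) / len(cluster), 2) for c in zip(*cluster))

print(centroid(cluster1))  # → (3.86, 5.14)
print(centroid(cluster2))  # → (7.33, 4.33)
```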

K-Means Clustering
Step 4 (cont.): Update the cluster means; repeat Steps 2 and 3; stop when there are no more new assignments
m1 = (3.67, 5.83), m2 = (6.75, 3.50)

Point  P  P(x, y)  m1 distance  m2 distance  Cluster
p01    a  (3, 4)   1.95         3.78         Cluster1
p02    b  (3, 6)   0.69         4.51         Cluster1
p03    c  (3, 8)   2.27         5.86         Cluster1
p04    d  (4, 5)   0.89         3.13         Cluster1
p05    e  (4, 7)   1.22         4.45         Cluster1
p06    f  (5, 1)   5.01         3.05         Cluster2
p07    g  (5, 5)   1.57         2.30         Cluster1
p08    h  (7, 3)   4.37         0.56         Cluster2
p09    i  (7, 5)   3.43         1.52         Cluster2
p10    j  (8, 5)   4.41         1.95         Cluster2

K-Means Clustering
Stop when there are no more new assignments. The cluster assignments are unchanged from the previous iteration, so the algorithm terminates with:
Cluster1 = {a, b, c, d, e, g}, m1 = (3.67, 5.83)
Cluster2 = {f, h, i, j}, m2 = (6.75, 3.50)


Summary Cluster Analysis K-Means Clustering Source: Han & Kamber (2006)

References
- Jiawei Han and Micheline Kamber (2006), Data Mining: Concepts and Techniques, Second Edition, Elsevier.
- Efraim Turban, Ramesh Sharda, and Dursun Delen (2011), Decision Support and Business Intelligence Systems, Ninth Edition, Pearson.