MIS2502: Data Analytics Clustering and Segmentation

MIS2502: Data Analytics Clustering and Segmentation
Aaron Zhi Cheng
http://community.mis.temple.edu/zcheng/
acheng@temple.edu
Acknowledgement: David Schuff

Agenda
- What is cluster analysis?
- When to use this technique (applications)?
- What is the basic idea behind the K-means clustering algorithm?
  - K: the number of clusters
  - Centroid
- How to process data before clustering?
  - Normalization
  - Removing outliers
- Sum of squared errors (SSE) – to measure distance
  - Lower within-cluster SSE: higher cohesion (good)
  - Higher between-cluster SSE: higher separation (good)

What is Cluster Analysis? The process of grouping data so that:
- Observations within a group will be similar to one another (distance within clusters is minimized)
- Observations from different groups will be dissimilar (distance between clusters is maximized)
http://www.datadrivesmedia.com/two-ways-performance-increases-targeting-precision-and-response-rates/

Market Segmentation (for Carpet Fibers) [Scatter plot: client clusters plotted by Water Resistance (Importance) on the x-axis and Strength (Importance) on the y-axis.] Typical clients: A: schools, B: light commercial, C: indoor/outdoor carpeting, D: health clubs. https://www.stat.auckland.ac.nz/~balemi/Segmentation%20&%20Targeting.ppt

Applications
- Marketing (market segmentation): discover distinct customer groups for targeted promotions
- Political forecasting: group neighborhoods by demographics, lifestyles, and political views
- Biology: group similar plants into species
- Finance: define groups of similar stocks based on financial characteristics, and select stocks from different groups to create balanced portfolios

What cluster analysis is NOT
- Manual grouping: people simply place items into categories
- Classification (like decision trees)
- Simple categorization by attributes, such as dividing students into groups by last name
The clusters must be learned from the data, not from external specifications. Creating the "buckets" beforehand is categorization, but not clustering.

How would you design an algorithm to group these data points? Here is the initial data set

K-Means: Overview "K" stands for the number of clusters; we have to choose K first. The idea is to assign each observation to one and only one cluster, based on the similarity (or dissimilarity) among observations. We call the center of each cluster the "centroid".

K-Means Demonstration Here is the initial data set

K-Means Algorithm
1. Choose K clusters
2. Select K points as initial centroids
3. Assign each point to the cluster with the closest centroid
4. Recompute the centroid of each cluster (i.e., the centers)
5. Did the centers change? If yes, repeat from step 3; if no, DONE!
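The steps above can be sketched in a few lines of Python. This is a minimal illustration of the algorithm, not the course's implementation; the point names and sample data are made up for the demo.

```python
import math
import random

def kmeans(points, k, init=None, max_iters=100, seed=0):
    """Minimal K-means sketch following the steps above."""
    # Step 2: select K points as initial centroids (or use the ones given)
    centroids = list(init) if init is not None else random.Random(seed).sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # Step 3: assign each point to the cluster with the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: recompute the centroid (mean) of each cluster
        new_centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Step 5: if the centers did not change, we are done
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two visually obvious groups of points
points = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.1, 7.9), (7.9, 8.2)]
centroids, clusters = kmeans(points, k=2, init=[(0, 0), (10, 10)])
```

With well-separated initial centroids, the loop settles after the first reassignment: one cluster gathers the points near (1, 1) and the other the points near (8, 8).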

K-Means Demonstration Choose K points as initial centroids

K-Means Demonstration Assign each point to the cluster with the closest centroid

K-Means Demonstration Recalculate the centroids (i.e., the centers of each cluster)

K-Means Demonstration And re-assign the points

K-Means Demonstration And keep doing that until you settle on a final set of clusters

Choosing "K" and the initial centroids
Both choices matter:
- With the wrong number "K", the clusters won't make sense within the context of the problem
- With a poor initial location, unrelated data points will be included in the same group
Bad choices create bad groupings, and there's no single, best way to choose initial centroids.

Example of Poor Initialization This may “work” mathematically but the clusters don’t make much sense.

Before Clustering: Normalize (Standardize) the data Sometimes we have variables on very different scales. For example, income between $0 and $1 billion versus age between 0 and 100. Normalization (standardization) adjusts for differences in scale so that we can compare different variables.

How to normalize data? For all variables, we want to transform the data onto a "standard" scale: new average = 0, new standard deviation = 1. For each variable, compute the average and standard deviation. Then for each data point: subtract the average, then divide by the standard deviation.

Normalization: A Numeric Example
If we have 10 students' age values (average: 20.6, standard deviation: 1.51), the normalized values would be:

Age    Normalized
19     -1.06    ((19 - 20.6)/1.51 = -1.06)
22      0.93    ((22 - 20.6)/1.51 = 0.93)
23      1.59
20     -0.40
21      0.27
18     -1.73

New average: 0; new standard deviation: 1
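The transformation in the example can be sketched in Python. Note this sketch computes the average and standard deviation from only the six values shown, while the slide's figures (-1.06, 0.93, ...) use its own average (20.6) and standard deviation (1.51), so the exact numbers differ; the mechanics are the same.

```python
import statistics

def normalize(values):
    """Z-score normalization: subtract the average from each value,
    then divide by the standard deviation."""
    avg = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - avg) / sd for v in values]

ages = [19, 22, 23, 20, 21, 18]  # the six age values shown above
z = normalize(ages)
```

Whatever the input scale, the normalized values always end up with average 0 and standard deviation 1, which is the whole point of the transformation.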

Before Clustering: Remove outliers Find observations that are very different from the others:
- To reduce dispersion that can skew the cluster centroids
- And they don't represent the population anyway
http://web.fhnw.ch/personenseiten/taoufik.nouri/kbs/clustering/DM-Part5.htm
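One common way to operationalize "very different from others" is a z-score cutoff: drop observations more than some number of standard deviations from the mean. This is a sketch of that rule of thumb, not the only approach; the income figures are hypothetical.

```python
import statistics

def remove_outliers(values, threshold=2.5):
    """Keep only values within `threshold` standard deviations of the
    mean. The cutoff (2.5 here) is a judgment call, not a fixed rule."""
    avg = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - avg) <= threshold * sd]

# Hypothetical incomes: one observation is wildly different from the rest
incomes = [40_000, 52_000, 48_000, 45_000, 50_000, 47_000,
           51_000, 44_000, 49_000, 20_000_000]
cleaned = remove_outliers(incomes)  # the 20,000,000 observation is dropped
```

One caveat worth knowing: an extreme outlier inflates the standard deviation itself, so a very high cutoff may fail to flag anything in a small sample; that is why the sketch uses 2.5 rather than 3.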

Clusters can be ambiguous. How many clusters here: 2, 4, or 6? The difference is the threshold you set: how distinct must a cluster be to be its own cluster? (adapted from Tan, Steinbach, and Kumar. Introduction to Data Mining (2004))

Choosing the number of clusters (K) K can sometimes be determined by external reasons, but in many cases there's no single answer. Here's what we can do:
- Make sure the clusters are describing distinct groups (separation)
- Make sure that the data points within a cluster are not too dispersed to be useful (cohesion)
But how do we evaluate the separation and cohesion of the clusters?

Evaluating K-Means Clusters What do we mean by similarity/dissimilarity, mathematically? Sum-of-Squares Error (SSE): the sum of squared distances to the nearest cluster centroid, i.e., how close does each point get to its centroid?

SSE = Σᵢ Σ_{x ∈ Cᵢ} dist(mᵢ, x)²

This just means: for each cluster i, compute the distance from each point x to the cluster centroid mᵢ, square that distance (so sign isn't an issue), and add them all together. SSE is always non-negative (≥ 0).

Example: Within-Cluster SSE
Cluster 1 point-to-centroid distances: 1, 1.3, 2. Cluster 2 point-to-centroid distances: 3, 3.3, 1.5.
SSE₁ = 1² + 1.3² + 2² = 1 + 1.69 + 4 = 6.69
SSE₂ = 3² + 3.3² + 1.5² = 9 + 10.89 + 2.25 = 22.14
Reducing within-cluster SSE increases cohesion – how tightly grouped each cluster is (we want that)
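The arithmetic above is easy to check in code: square each point-to-centroid distance and add them up.

```python
def within_cluster_sse(distances):
    """Sum of squared distances from each point to its cluster's centroid."""
    return sum(d ** 2 for d in distances)

sse1 = within_cluster_sse([1, 1.3, 2])    # 1 + 1.69 + 4 = 6.69
sse2 = within_cluster_sse([3, 3.3, 1.5])  # 9 + 10.89 + 2.25 = 22.14
```

Cluster 1's lower SSE (6.69 versus 22.14) reflects its higher cohesion: its points sit closer to their centroid.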

Within-Cluster SSE Considerations
- Lower individual within-cluster SSE = a better cluster (higher cohesion)
- Lower total SSE = a better set of clusters
- More clusters will always reduce within-cluster SSE and increase cohesion (we want that)

SSE between clusters (Between-cluster SSE) What we are missing is the distance between centroids, which can also be measured with SSE. You'd want to maximize the SSE between clusters: increasing between-cluster SSE increases separation (we want that).
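One simple way to put a number on separation is to sum the squared distances between every pair of centroids. Note this pairwise version is a simplified illustration for intuition; textbooks often define between-cluster SSE instead as a size-weighted squared distance from each centroid to the overall mean.

```python
import math
from itertools import combinations

def between_cluster_sse(centroids):
    """Sum of squared distances between every pair of centroids:
    a larger value means the clusters are more spread apart."""
    return sum(math.dist(a, b) ** 2 for a, b in combinations(centroids, 2))

close_together = [(0, 0), (1, 0), (0, 1)]  # poorly separated clusters
far_apart = [(0, 0), (5, 0), (0, 5)]       # well separated clusters
```

Running both through the function shows the far-apart centroids score much higher, which is the direction we want to push.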

Trade-off: Cohesion versus Separation (e.g., 2 clusters versus 6 clusters)
- More clusters → higher cohesion (lower within-cluster SSE): good
- More clusters → lower separation (lower between-cluster SSE): bad

Figuring out if our clusters are good. "Good" means meaningful, useful, and providing insight. How to interpret the clusters? Obtain summary statistics and visualize the clusters. This is somewhat subjective and depends upon the expectations of the analyst.

The Keys to Successful Clustering
We want:
- High cohesion (low within-cluster SSE)
- High separation (high between-cluster SSE)
- The right number of clusters
- The right initial centroids
There's no easy way to do this: trial-and-error, knowledge of the problem, and looking at the output.

Limitations of K-Means Clustering
K-Means gives unreliable results when:
- Clusters vary widely in size
- Clusters vary widely in density
- Clusters are not in rounded shapes
- The data set has a lot of outliers
In those cases, the clusters may never make sense; the data may just not be well-suited for clustering!

Summary
- What is cluster analysis?
- When to use this technique (applications)?
- What is the basic idea behind the K-means clustering algorithm?
  - K: the number of clusters
  - Centroid
- How to process data before clustering?
  - Normalization
  - Removing outliers
- Sum of squared errors (SSE) – to measure distance
  - Lower within-cluster SSE: higher cohesion (good)
  - Higher between-cluster SSE: higher separation (good)

Classification versus Clustering: Differences
- Prior knowledge of classes (categories)? Classification: yes. Clustering: no.
- Goal: Classification classifies new cases into known classes; clustering suggests groups based on patterns in the data.
- Algorithm: Classification uses decision trees; clustering uses K-means.
- Type of model: Classification is predictive (to predict future data); clustering is descriptive (to understand or explore data).