Clustering analysis workshop
CITM, Lab 3, 18 Oct 2014
Facilitator: Hosam Al-Samarraie, PhD.

Outline
– The basic concepts of cluster analysis.
– The different types of clustering procedures.
– How to execute and generate clustering results.
– The SPSS clustering outputs.
– The learning machine outputs.

What Does Data Mining Do?
Data mining extracts patterns from data.
– Pattern? A mathematical (numeric and/or symbolic) relationship among data items.
Types of patterns:
– Association
– Prediction
– Cluster (segmentation)

Knowledge Discovery Steps in a Knowledge Discovery process

Supervised vs. Unsupervised Learning
Supervised learning (classification)
– Supervision: the output is known, and the aim is to examine the effect of the independent variable on the dependent one.
Unsupervised learning (clustering)
– The class or the nature of the variables is unknown.
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

The concept of cluster analysis
Cluster analysis is an unsupervised learning technique for identifying homogeneous groups of objects, called clusters. Objects in a cluster share many characteristics but are very dissimilar to objects not belonging to that cluster.

Cont…
– Measuring distances (differences or dissimilarities between subjects)
– Measuring proximities (similarity between subjects)
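
To make the two notions concrete, here is a minimal sketch in plain Python. The two subjects and their (height, weight) values are made up for illustration, and the distance-to-proximity conversion `1 / (1 + d)` is only one simple convention, not something the workshop or SPSS prescribes.

```python
import math

# Two hypothetical subjects described by (height in cm, weight in kg).
subject_a = (170.0, 65.0)
subject_b = (173.0, 69.0)


def euclidean_distance(p, q):
    """Straight-line distance between two equally long numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def similarity_from_distance(d):
    """One simple way to turn a distance into a proximity in (0, 1]."""
    return 1.0 / (1.0 + d)


distance = euclidean_distance(subject_a, subject_b)
proximity = similarity_from_distance(distance)
print(distance, proximity)
```

A small distance means the subjects are alike, so the proximity moves toward 1; identical subjects have distance 0 and proximity 1.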

Types of Data!! Gender…. Age group Length Numeric Not numeric Count

Typical research questions the Cluster Analysis answers are as follows:
Medicine – What are the diagnostic clusters? To answer this question the researcher would devise a diagnostic questionnaire that entails the symptoms (for example in psychology standardized scales for anxiety, depression etc.). The cluster analysis can then identify groups of patients that present with similar symptoms and simultaneously maximize the difference between the groups.
Marketing – What are the customer segments? To answer this question a market researcher conducts a survey most commonly covering needs, attitudes, demographics, and behavior of customers. The researcher then uses the cluster analysis to identify homogeneous groups of customers that have similar needs and attitudes but are distinctively different from other customer segments.
Education – What are student groups that need special attention? The researcher measures a couple of psychological, aptitude, and achievement characteristics. A cluster analysis then identifies what homogeneous groups exist among students (for example, high achievers in all subjects, or students that excel in certain subjects but fail in others, etc.). A discriminant analysis then profiles these performance clusters and tells us what psychological, environmental, aptitudinal, affective, and attitudinal factors characterize these student groups.

Types of clustering

Hierarchical Clustering
1. Uses agglomerative ("bottom-up") algorithms, which begin with each element as a separate cluster and merge them into successively larger clusters.
2. Handles continuous data.

Cont… Can be visualized as a dendrogram – A tree-like diagram that records the sequences of merges or splits
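
The bottom-up merging that a dendrogram records can be sketched in a few lines of plain Python. This is an illustration only: it assumes single linkage (distance between the two closest members) for simplicity, uses made-up points, and is not how SPSS implements the procedure.

```python
import math


def agglomerate(points):
    """Bottom-up clustering with single linkage.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram draws.
    """
    clusters = [(i,) for i in range(len(points))]  # each item starts alone
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history


# Made-up 2-D points, for illustration only.
points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.5), (9.0, 1.0)]
merges = agglomerate(points)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 3))
```

Reading the printed merges from top to bottom corresponds to reading a dendrogram from its leaves toward its root: the closest items join first, and the last line is the merge that puts everything in one cluster.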

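Non-hierarchical: K-means clustering
1. Begin with k starting center points (two in this example) and allocate each item to the nearest cluster center.
2. Recompute the cluster centers and reallocate items to the nearest center, repeating until no item changes cluster.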
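
A minimal sketch of k-means in plain Python, assuming two starting centers and made-up points. The function name `k_means` and the choice of the first two points as starting centers are illustrative assumptions; SPSS chooses and updates its centers in its own way.

```python
import math


def k_means(points, start_centers, max_iter=100):
    """Plain k-means: assign to nearest center, recompute centers, repeat."""
    centers = list(start_centers)  # copy so the caller's list is untouched
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no item changed cluster, so the solution is stable
        assignment = new_assignment
        # Update step: each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centers


# Made-up points forming two visible groups, for illustration only.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = k_means(points, start_centers=[points[0], points[1]])
print(labels, centers)
```

Both starting centers deliberately sit in the same group here, so the first pass misallocates one item and the second pass corrects it, which shows why the allocation step has to be repeated.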

Mixed: Two-Step Clustering
1. Designed to handle very large data sets.
2. Can handle both continuous and categorical variables or attributes.
3. Automatically selects the number of clusters.

Generate clustering 1

1. Decide on cluster variables At the beginning of the clustering process, we have to select appropriate variables for clustering.

Note!!! It is important to avoid using an abundance of clustering variables, as this increases the odds that the variables are no longer dissimilar. Meaning? If highly correlated variables are used for cluster analysis, the specific aspects covered by these variables will be overrepresented in the clustering solution. In this regard, absolute correlations above 0.90 are always problematic. An example is measuring both the happiness and the joy of a person.
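
One way to screen candidate variables before clustering is to compute all pairwise correlations and flag those above the 0.90 threshold. The sketch below uses plain Python; the variable names and scores in `survey` are invented for illustration.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(variables, threshold=0.90):
    """Return (name_a, name_b, r) for every pair with |r| above threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
        r = pearson_r(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical survey scores for six respondents.
survey = {
    "happiness": [1, 2, 3, 4, 5, 6],
    "joy":       [1, 2, 3, 5, 4, 6],
    "income":    [3, 1, 4, 1, 5, 2],
}
flagged = highly_correlated_pairs(survey)
print(flagged)
```

Any pair that is flagged measures nearly the same thing, so keeping both would count that aspect twice in the clustering solution.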

Insight!! When factor analysis is used to reduce the variables first, the factor solution usually leaves a certain amount of variance unexplained, so information is discarded before the segments are identified. In addition, removing variables with low loadings on all the extracted factors means that some potentially useful information for identifying segments is discarded. This in turn reduces the possibility of identifying different groups. Finally, the resulting factors, derived from the original variables, become questionable.

2

2. Decide on the Clustering Procedure
This refers to the process of forming the clusters.

Dataset
Let's say I have different people with different measures of height and weight (variables). If I want to group those people by weight and height into different groups, then I need to use cluster analysis.

The SPSS clustering Variables People to be clustered. It can be performance, achievement, etc…

Cont…
Hierarchical methods: if there is a limited number of observations, usually < 200.
▸ Analyze ▸ Classify ▸ Hierarchical Cluster
K-Means: if there are many observations, usually > 500.
▸ Analyze ▸ Classify ▸ K-Means Cluster
Two-step cluster: if there are many observations and the clustering variables are measured on different scale levels (5-point Likert scale, nominal, ordinal, etc.).
▸ Analyze ▸ Classify ▸ Two-Step Cluster

In Hierarchical: Select a Clustering Algorithm
Ward’s method (only hierarchical clustering)
▸ Analyze ▸ Classify ▸ Hierarchical Cluster ▸ Method ▸ Cluster Method

Select a Measure of Similarity (Hierarchical)
This only applies to the hierarchical and two-step methods.
Euclidean distance is the most commonly used measure when analyzing ratio or interval-scaled data.

Select a Measure of Similarity (Two-Step)
Two-step clustering:
▸ Analyze ▸ Classify ▸ Two-Step Cluster ▸ Distance Measure

Standardize (in Hierarchical only)
In both methods, convert variables with multiple categories to a common scale (a range of 0 to 1 or -1 to 1, or use z-scores).
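
The three rescaling options can be sketched with the Python standard library. The function names and the `heights_cm` values are assumptions for illustration, and the z-score here uses the sample standard deviation (n - 1).

```python
import statistics


def z_scores(values):
    """Center on the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]


def rescale_0_1(values):
    """Map the smallest value to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def rescale_minus1_1(values):
    """Map the smallest value to -1 and the largest to 1."""
    return [2 * v - 1 for v in rescale_0_1(values)]


# Made-up heights, for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
print(z_scores(heights_cm))
print(rescale_0_1(heights_cm))
print(rescale_minus1_1(heights_cm))
```

Whichever option is used, the point is the same: a variable measured in large units (such as height in cm) should not dominate the distance calculation over one measured in small units.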

3

Identifying the number of clusters?
For hierarchical clustering, by examining the dendrogram:
▸ Analyze ▸ Classify ▸ Hierarchical Cluster ▸ Plots ▸ Dendrogram
Not always recommended.

Alternative solution
Draw a scree plot (e.g., using Microsoft Excel) based on the coefficients in the agglomeration schedule (elbow method). In this example, 2 clusters are a possible choice.
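
The elbow can also be located numerically by looking for the largest jump between consecutive coefficients. The sketch below is a simple heuristic under that assumption; the coefficients in `schedule` are made up for illustration and do not come from the workshop data.

```python
def clusters_at_largest_jump(coefficients):
    """Suggest a cluster count from agglomeration coefficients.

    coefficients[i] is the coefficient of merge stage i + 1, so after that
    stage there are n - (i + 1) clusters, where n = len(coefficients) + 1.
    """
    n_cases = len(coefficients) + 1
    jumps = [b - a for a, b in zip(coefficients, coefficients[1:])]
    stage_before_jump = jumps.index(max(jumps)) + 1  # 1-based stage number
    return n_cases - stage_before_jump


# Made-up coefficients for 8 cases (7 merge stages), for illustration only.
schedule = [0.5, 0.9, 1.4, 2.0, 2.7, 3.5, 9.8]
suggested = clusters_at_largest_jump(schedule)
print(suggested)
```

The idea is to stop merging just before the step where very different clusters would be forced together; a scree plot shows the same jump visually and is still worth inspecting, since the largest jump is not always the only sensible cut.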

For two-step and k-means
Note: two-step clustering identifies the number of clusters automatically, whereas K-means uses a default of 2. The most commonly recommended number is 3-4 clusters, so you need to try both and see which one provides useful output.

Save membership
After identifying the number of clusters, we save the cluster membership of each case. Click Save and enter 2.

Membership to be used Here is the membership

4

Assess the solution’s stability
Run other clustering methods on the same data and compare the resulting solutions with each other.
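
One possible way to quantify how well two solutions agree is the Rand index: the share of case pairs that both solutions treat the same way (together in both, or apart in both). This measure is an assumption of this sketch, not something the workshop specifies, and the membership lists are invented.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of case pairs on which two cluster solutions agree."""
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b  # True counts as 1, False as 0
        pairs += 1
    return agree / pairs


# Hypothetical saved memberships for six cases from two methods.
hierarchical_labels = [1, 1, 1, 2, 2, 2]
k_means_labels = [2, 2, 2, 1, 1, 2]
agreement = rand_index(hierarchical_labels, k_means_labels)
print(agreement)
```

Because only pairs are compared, the cluster numbers themselves do not matter: cluster 1 in one method may correspond to cluster 2 in the other. A value of 1 means the two solutions are identical up to relabeling.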

Assess the solution’s validity
Criterion validity: evaluate whether there are significant differences between the segments resulting from the membership step. If p < 0.05, we are doing well…

Interpret the cluster solution Examine cluster centroids and assess whether these differ significantly from each other (e.g., by means of t-tests or ANOVA). As we did earlier. Identify names or labels for each cluster and characterize each cluster by means of observable variables, if necessary (cluster profiling).
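
The comparison of cluster centroids can be sketched as a one-way ANOVA F statistic on one variable, grouped by saved cluster membership. The scores below are made up, and the sketch stops at F: turning it into a p-value requires the F distribution, which a statistics package (for example `scipy.stats.f_oneway`, or SPSS itself) provides.

```python
def one_way_anova_f(groups):
    """F statistic: between-cluster variance over within-cluster variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical scores on one clustering variable, split by cluster membership.
cluster_1 = [2.0, 3.0, 4.0]
cluster_2 = [8.0, 9.0, 10.0]
f_value = one_way_anova_f([cluster_1, cluster_2])
print(f_value)
```

A large F means the cluster means are far apart relative to the spread inside each cluster, which is what a well-separated solution should show on its clustering variables.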

SPSS
That’s all. Now let's try it in SPSS.

Another example
Let's say I want to explore which children need special learning support, so I collected some data about children's reading and cognitive performance gain. Now I ask the question: which groups of children need extra learning?

For the data, go to this URL and download the cluster children data. Open the file in SPSS (or just double-click it). Now observe the data.

Thank you