
Clustering / Scaling

Cluster Analysis Objective: – Partitions observations into meaningful groups with individuals in a group being more “similar” to each other than to individuals in other groups

Cluster Analysis Similar to factor analysis (which groups IVs), but it groups people (cases) instead. Cluster analysis can also partition variables into groups (but FA is better for this)

Cluster Analysis Orders individuals into similarity groups while simultaneously ordering variables according to importance.

Cluster Analysis We are always trying to identify groups – Discriminant analysis (which we are going to do later) – we know who is in what group and figure out a good way to classify them – Then logistic regression – an alternative to discriminant analysis that makes fewer distributional assumptions

Cluster Analysis Cluster analysis tells you if there are groups in the data that you didn’t know about – If there are groups – are there differences in the means? ANOVA/MANOVA – If I have somebody new, what group do they go in? Discriminant analysis

What’s CA give us? Taxonomic description – Use this partitioning to generate hypotheses about how to group people (or how people should be grouped) – May then be used for classification (schools, military, memory, etc.)

What’s CA give us? Data simplification – Observations are no longer individuals but members of groups

What’s CA give us? Relationship identification – Reveals relationships among observations that are not immediately obvious when considering only one variable at a time

What’s CA give us? Outlier detection – Observations that are very different, in a multivariate sense, will not fit into any cluster

Several Approaches to Clustering Graphical approaches Distance approaches SPSS stuff

Graphical Objective: map variables to separate plot characteristics, then group observations visually – Approaches: profile plots, Andrews plots, faces, stars, trees
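To make "map variables to plot characteristics" concrete, here is a minimal sketch of how an Andrews plot turns one observation into a curve, using the standard Andrews formula f(t) = x1/√2 + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ... The function name `andrews_curve` and the sample values are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate the Andrews curve of one observation x at the angles t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    # The first variable sets the constant level of the curve.
    curve = np.full_like(t, x[0] / np.sqrt(2.0))
    for i, value in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2                    # 1, 1, 2, 2, 3, 3, ...
        wave = np.sin if i % 2 == 1 else np.cos    # alternate sin, cos
        curve = curve + value * wave(harmonic * t)
    return curve

# Hypothetical observation measured on three variables.
t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve([1.0, 0.5, -0.3], t)
print(curve.shape)
```

Plotting one such curve per observation over t in [-π, π] lets observations with similar profiles show up as curves that run close together.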

Graphical [Figure: graphical display of the cereal data]

Distance Approaches Inter-object similarity – measure of resemblance between individuals to be clustered. Dissimilarity – lack of resemblance between individuals. Distance measures are all dissimilarity measures

Distance Approaches For continuous variables – Euclidean or ruler distance – $d(x_i, x_j) = \sqrt{(x_i - x_j)^\top (x_i - x_j)}$, where $x_i$ and $x_j$ are the score vectors of two individuals
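As a quick illustration of the ruler distance, a minimal sketch in Python with NumPy follows. The function name and the two sample observations are assumptions made for the example.

```python
import numpy as np

def euclidean_distance(x, y):
    """Ruler distance: square root of (x - y) transpose (x - y)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ diff))

# Two hypothetical individuals measured on three continuous variables.
a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
print(euclidean_distance(a, b))  # 5.0
```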

Distance Approaches For data with different scales, it may be better to z-score the variables first, so that variables with larger scales do not dominate the distance – Normalized ruler distance (same formula with z-scores)
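The effect of z-scoring can be sketched with a small made-up dataset (the income and rating columns below are hypothetical). On the raw scale the income column swamps the rating. After standardizing, both variables contribute comparably.

```python
import numpy as np

def zscore(data):
    """Standardize each column to mean 0 and sample standard deviation 1."""
    data = np.asarray(data, dtype=float)
    return (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

def dist(matrix, i, j):
    """Euclidean distance between rows i and j of a data matrix."""
    return float(np.linalg.norm(matrix[i] - matrix[j]))

# Hypothetical data: column 0 is income in dollars, column 1 is a 1-5 rating.
raw = np.array([[30000.0, 1.0],
                [32000.0, 5.0],
                [90000.0, 1.0]])
z = zscore(raw)

# Raw scale: rows 0 and 1 look almost identical compared with rows 0 and 2,
# even though their ratings are at opposite ends of the scale.
print(dist(raw, 0, 1), dist(raw, 0, 2))
# Normalized ruler distance: the rating difference now counts too.
print(dist(z, 0, 1), dist(z, 0, 2))
```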

Distance Approaches Mahalanobis distance!
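The slide only names this measure. As a hedged sketch using the standard definition, Mahalanobis distance is the ruler distance with the inverse covariance matrix inserted in the middle, so it accounts for both scale and correlation. The covariance matrix and points below are assumed values chosen for illustration.

```python
import numpy as np

def mahalanobis_distance(x, y, cov):
    """Square root of (x - y)^T cov^{-1} (x - y)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Solve cov @ v = diff instead of forming the inverse explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Assumed covariance: two standardized variables correlated at 0.8.
cov = [[1.0, 0.8],
       [0.8, 1.0]]
origin = [0.0, 0.0]

# Both points are the same ruler distance from the origin, but the second
# one runs against the correlation and is therefore far more unusual.
d_along = mahalanobis_distance([1.0, 1.0], origin, cov)
d_against = mahalanobis_distance([1.0, -1.0], origin, cov)
print(d_along, d_against)
```

With an identity covariance matrix the result reduces to the ordinary Euclidean distance.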

How distance measures translate to ways to do this… Hierarchical approaches – Agglomerative methods – each object starts out as its own cluster. The two closest clusters are combined into a new aggregate cluster. This continues until the clusters no longer make sense

How distance measures translate to ways to do this… Hierarchical approaches – Divisive methods – opposite of agglomerative methods. All observations start as one cluster, and then each cluster is split until every observation is its own cluster

What does that mean? Most programs are agglomerative – They use the distance measures to figure out which individuals/clusters to combine
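The agglomerative procedure can be sketched in a few lines. This is a deliberately naive illustration, not how any particular package implements it. It assumes single linkage (cluster distance = distance between the closest members) and stops at a user-supplied number of clusters, both of which are choices made for this example.

```python
import numpy as np

def agglomerate(points, n_clusters):
    """Naive single-linkage agglomerative clustering.

    Every observation starts as its own cluster; the two closest clusters
    are merged repeatedly until n_clusters remain.
    """
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]
    # Pairwise Euclidean distances between all observations.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = dists[np.ix_(clusters[a], clusters[b])].min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), float(d)))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# Hypothetical observations on two variables.
pts = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5], [10.0, 0.0]]
clusters, merges = agglomerate(pts, n_clusters=3)
print(clusters)
print(merges)
```

The recorded merge distances are the information a dendrogram draws: each merge becomes a join in the tree at the height of its distance.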

K-means cluster analysis Uses squared Euclidean distance. Initial cluster centers are chosen in the “first pass” of the data – Assigns each case to the cluster with the nearest mean, then updates the means – Stops when the means do not change

K-Means cluster You need to have an idea of how many clusters you expect – then you can see if there are differences on the IVs when they are clustered into these groups
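A minimal sketch of the k-means loop follows. Using the first k observations as the initial centers is an assumption made to keep the example deterministic. Statistical packages use their own initialization rules.

```python
import numpy as np

def k_means(data, k, max_iter=100):
    """Basic k-means using squared Euclidean distance."""
    data = np.asarray(data, dtype=float)
    # Assumption: the first k observations serve as the initial centers.
    centers = data[:k].copy()
    for _ in range(max_iter):
        # Squared Euclidean distance from every observation to every center.
        sq_dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Recompute each center as the mean of its assigned observations;
        # an empty cluster keeps its previous center.
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the means no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical data with two visually obvious groups.
data = [[1.0, 1.0], [9.0, 9.0], [1.5, 2.0], [8.0, 8.0], [1.0, 0.5], [9.0, 10.0]]
labels, centers = k_means(data, k=2)
print(labels)
print(centers)
```

Note that k has to be supplied up front, which is why you need an idea of how many clusters to expect.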

Hierarchical clustering More common type of clustering analysis – Because it makes pretty pictures! – Dendrogram – tree diagram that represents the results of a cluster analysis

Hierarchical clustering Trees are usually depicted horizontally – Cases with high similarity are adjacent – Lines indicate the degree of similarity or dissimilarity between cases

2-step clustering Better with very large datasets. Works with both continuous and categorical data – Step one – pre-cluster the cases into many small clusters – Step two – combine those into the desired number of clusters. If you don’t know how many you want, the program will choose the best number for you.

Which one? K-means is much faster than hierarchical – Does not compute the distances between all pairs – Uses only Euclidean distance – Needs standardized data for best results

Which one? Hierarchical is much more flexible – All types of data, all types of distance measures – Don’t need to know the number of clusters – Save the cluster memberships and use them in follow-up analyses with ANOVA or crosstabs
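As a sketch of the ANOVA follow-up, the one-way F statistic can be computed directly from saved cluster memberships. The function name, scores, and labels below are hypothetical. In practice a statistics package would also report the p-value.

```python
import numpy as np

def one_way_f(values, labels):
    """One-way ANOVA F statistic for a variable split by cluster label."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == g] for g in np.unique(labels)]
    grand_mean = values.mean()
    # Between-cluster and within-cluster sums of squares.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return float((ss_between / df_between) / (ss_within / df_within))

# Hypothetical scores on one variable and saved cluster memberships.
scores = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
membership = [0, 0, 0, 1, 1, 1]
f_stat = one_way_f(scores, membership)
print(f_stat)  # 54.0
```

Because the clusters were formed to maximize separation on these same variables, a large F here describes the clusters rather than testing an independent hypothesis.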

Assumptions Data are truly continuous OR truly dichotomous. Same assumptions as correlation/regression – Outliers are OK. K-means needs big samples (> 200)

Issues Different methods (distance procedures) will give you drastically different results. Clustering is usually a descriptive procedure