Grouping Data: Methods of Cluster Analysis

Goals 1
1. We want to identify groups of similar artifacts, features, sites, graves, etc. that represent cultural, functional, or chronological differences.
2. We want to create groups as a measurement technique, to see how they vary with external variables.

Goals 2
3. We want to cluster artifacts or sites based on their locations to identify spatial clusters.

Real vs. Created Types
Differences in goals:
- Real types are the aim of Goal 1
- Created types are the aim of Goal 2
There is debate over whether real types can be discovered with any degree of certainty. Cluster analysis guarantees groups; you must confirm their utility.

Initial Decisions 1
What variables to use?
- All possible variables
- Constructed variables (from principal components, correspondence analysis, or multidimensional scaling)
- A restricted set of variables that supports the goal(s) of creating groups (e.g. functional groups, cultural or stylistic groups)

Initial Decisions 2
How to transform the variables?
- Log transforms
- Conversion to percentages (to weight rows equally)
- Size standardization (dividing by the geometric mean)
- Z-scores (to weight columns equally)
- Conversion of categorical variables
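
A minimal sketch of these transformations in base R, assuming a hypothetical data frame dat of positive measurements (one row per artifact):

logd <- log(dat)                  # log transform
pct  <- dat / rowSums(dat) * 100  # row percentages (weights rows equally)
gm   <- exp(rowMeans(log(dat)))   # geometric mean of each row
size <- dat / gm                  # size standardization
z    <- scale(dat)                # z-scores (weights columns equally)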

Initial Decisions 3
How to measure distance? Consider:
- The types of variables
- The goals of the analysis
- If uncertain, try multiple methods
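
A sketch of common distance measures, reusing the standardized matrix z from the sketch above; daisy() from the cluster package is one option when the variables are of mixed types (dat is still the hypothetical data frame):

d.euc <- dist(z, method = "euclidean")  # the usual default
d.man <- dist(z, method = "manhattan")  # less sensitive to outliers
library(cluster)
d.gow <- daisy(dat, metric = "gower")   # mixed numeric/categorical variables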

Methods of Grouping
Partitioning methods - divide the data into groups
Hierarchical methods:
- Agglomerative - from n clusters to 1 cluster
- Divisive - from 1 cluster to n clusters

Partitioning
K-means, K-medoids, Fuzzy
- Use a measure of distance, but do not need to compute the full distance matrix
- Require the number of groups to be specified in advance
- Minimize within-group variability
- Find spherical clusters

Procedure
Start with centers for k groups (user-supplied or random)
Repeat up to iter.max times (default 10):
- Allocate rows to their closest center
- Recalculate the center positions
Stop
Different methods use different criteria for allocation. Use multiple random starts (e.g. 5-15).
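
In base R this whole procedure is one call to kmeans(); a sketch, with z the standardized matrix from the earlier sketch and 3 groups chosen purely for illustration:

km <- kmeans(z, centers = 3, iter.max = 10, nstart = 10)
km$cluster       # group allocation for each row
km$tot.withinss  # total within-group sum of squares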

Evaluation 1
- Compute groups for a range of cluster sizes and plot the within-group sums of squares, looking for sharp increases
- Cluster randomized versions of the data and compare the results
- Examine a table of statistics by group

Evaluation 2
- Plot the groups in two dimensions with PCA, CA, or MDS
- Compare the groups using data or information not included in the analysis
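
A sketch of the first approach, plotting the k-means groups from the earlier sketch on the first two principal components:

pc <- prcomp(z)
plot(pc$x[, 1:2], col = km$cluster, pch = 16,
     xlab = "PC 1", ylab = "PC 2")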

Partitioning Using R
- Base R includes kmeans() for forming groups by partitioning
- Rcmdr includes KMeans(), which iterates kmeans() to find the best solution
- The cluster package includes pam(), which uses medoids for more robust grouping, and fanny(), which forms fuzzy clusters
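
A sketch of the cluster package alternatives, again with an illustrative 3 groups:

library(cluster)
pm <- pam(z, k = 3)    # partitioning around medoids; robust to outliers
fz <- fanny(z, k = 3)  # fuzzy clustering
pm$clustering          # hard group assignments
fz$membership          # degree of membership in each group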

Example
DarlPoints (not DartPoints) has 4 measurements for 23 Darl points.
Create Z-scores to weight the variables equally with Data | Manage variables in active data set | Standardize variables... (or use PCA and the PC scores).
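
Outside Rcmdr, scale() does the same standardization. A sketch; the unstandardized column names are assumptions inferred from the Z.* variables that appear later:

vars <- c("Length", "Width", "Thickness", "Weight")  # assumed column names
DarlPoints[, paste0("Z.", vars)] <- scale(DarlPoints[, vars])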

Example (cont)
Use Rcmdr to partition the data into 5, 4, 3, and 2 groups:
Statistics | Dimensional analysis | Cluster analysis | k-means cluster analysis...
TWSS = 15.42, 19.78, 25.83, ...
Select the group number and have Rcmdr add the group assignments to the data set.

Evaluation
Evaluate the groups against randomized data:
- Randomly permute each variable
- Run k-means on the permuted data
- Compare the random and non-random results
Also evaluate the groups against external criteria (location, material, age, etc.). The functions on the next slide implement the randomization comparison.

KMPlotWSS <- function(data, ming, maxg) {
  # Plot total within-group sum of squares for ming to maxg clusters
  WSS <- sapply(ming:maxg, function(x)
    kmeans(data, centers = x, iter.max = 10, nstart = 10)$tot.withinss)
  plot(ming:maxg, WSS, las = 1, type = "b", xlab = "Number of Groups",
       ylab = "Total Within Sum of Squares", pch = 16)
  print(WSS)
}

KMRandWSS <- function(data, samples, min, max) {
  # Quantiles of the within-group sum of squares for randomized data
  KRand <- function(data, min, max) {
    Rnd <- apply(data, 2, sample)  # permute each column independently
    sapply(min:max, function(y)
      kmeans(Rnd, y, iter.max = 10, nstart = 5)$tot.withinss)
  }
  Sim <- sapply(1:samples, function(x) KRand(data, min, max))
  t(apply(Sim, 1, quantile, c(0, .005, .01, .025, .5, .975, .99, .995, 1)))
}

# Compare the data to randomized sets
KMPlotWSS(DarlPoints[, 6:9], 1, 10)
Qtiles <- KMRandWSS(DarlPoints[, 6:9], 2000, 1, 10)
matlines(1:10, Qtiles[, c(1, 5, 9)], lty = c(3, 2, 3), lwd = 2,
         col = "dark gray")
legend("topright", c("Observed", "Median (Random)", "Max/Min Random"),
       col = c("black", "dark gray", "dark gray"),
       lwd = c(1, 2, 2), lty = c(1, 2, 3))

Hierarchical Methods
- Agglomerative - successive merging
- Divisive - successive splitting
  - Monothetic - binary data
  - Polythetic - interval/ratio data

Agglomerative
At the start, all rows are in separate groups (n groups or clusters).
At each stage two rows are merged, a row and a group are merged, or two groups are merged.
The process stops when all rows are in a single cluster.

Agglomeration Methods
How should clusters be formed?
- Single linkage - irregularly shaped groups
- Average linkage - spherical groups
- Complete linkage - spherical groups
- Ward's method - spherical groups
- Median - dendrogram inversions possible
- Centroid - dendrogram inversions possible
- McQuitty - similarity by reciprocal pairs
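
In base R the linkage is selected with the method argument to hclust(). A sketch on the standardized matrix z from the earlier sketches:

d <- dist(z)  # Euclidean distances
hc.single   <- hclust(d, method = "single")
hc.average  <- hclust(d, method = "average")
hc.complete <- hclust(d, method = "complete")
hc.ward     <- hclust(d, method = "ward.D2")  # Ward's method
hc.mcquitty <- hclust(d, method = "mcquitty")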

Agglomerating with R
- Base R includes hclust() for agglomerative hierarchical clustering
- The cluster package includes agnes()
- Rcmdr uses hclust() via Statistics | Dimensional analysis | Cluster analysis | Hierarchical cluster analysis...

HClust
Rcmdr menus provide:
- Cluster analysis and plot
- Summary statistics by group
- Adding the cluster to the data set
To get a traditional dendrogram:

plot(HClust.1, hang = -1, main = "Darl Points", xlab = "Catalog Number",
     sub = "Method=Ward; Distance=Euclidean")
rect.hclust(HClust.1, 3)

summary(as.factor(cutree(HClust.1, k = 3)))  # cluster sizes
by(model.matrix(~ -1 + Z.Length + Z.Thickness + Z.Weight + Z.Width,
                DarlPoints),
   as.factor(cutree(HClust.1, k = 3)), mean)  # cluster centroids

[Output: the size of each cluster, then the mean Z.Length, Z.Thickness,
Z.Weight, and Z.Width for INDICES 1, 2, and 3; the numeric values are not
recoverable.]

biplot(princomp(model.matrix(~ -1 + Z.Length + Z.Thickness + Z.Weight +
                               Z.Width, DarlPoints)),
       xlabs = as.character(cutree(HClust.1, k = 3)))

cbind(HClust.1$merge, HClust.1$height)

[Output: a 22 x 3 matrix; columns 1 and 2 identify the clusters merged at
each of the 22 agglomeration steps, and column 3 gives the merge height.
The numeric values are not recoverable.]

Divisive
At the start, all rows are considered to be a single group.
At each stage, a group is divided into two groups based on the average dissimilarities.
The process stops when all rows are in separate clusters.
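
A sketch of divisive clustering with diana() from the cluster package, again on the standardized matrix z:

library(cluster)
dv <- diana(z)
plot(dv)                      # banner and dendrogram
cutree(as.hclust(dv), k = 3)  # cut into an illustrative 3 groups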