Clustering Procedure Cheng Lei Department of Electrical and Computer Engineering University of Victoria April 16, 2015.


Outline
❖ Overview
❖ CLUSTER Procedure
❖ Clustering Methods

Overview
❖ Data
  - Distances
  - Coordinates
❖ Clustering methods
  - 11 methods supported
❖ FASTCLUS Procedure
  - CPU time is proportional to the number of observations
  - Use FASTCLUS for a preliminary cluster analysis
  - Use CLUSTER to cluster the preliminary clusters hierarchically
❖ Principles
  - Each observation begins in a cluster by itself
  - The two closest clusters are merged to form a new cluster that replaces the two old ones
  - The merging step is repeated until only one cluster is left
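The three principles above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how PROC CLUSTER is implemented: the function names (`euclidean`, `cluster_distance`, `agglomerate`) and the sample points are hypothetical, and single linkage (smallest pairwise distance) is assumed as the rule for "closest" purely to keep the sketch short.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_distance(points, c1, c2):
    """Assumed rule (single linkage): smallest distance between members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)


def agglomerate(points):
    """Merge the two closest clusters until one is left; return the history."""
    # Each observation begins in a cluster by itself.
    clusters = [(i,) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # Find the two closest clusters.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(points, pair[0], pair[1]),
        )
        # The merged cluster replaces the two old ones.
        clusters.remove(a)
        clusters.remove(b)
        merged = tuple(sorted(a + b))
        clusters.append(merged)
        history.append((a, b, merged))
    return history


# Hypothetical data: two tight pairs that are far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
history = agglomerate(points)
for a, b, merged in history:
    print(a, "+", b, "->", merged)
```

With n observations the loop always records n - 1 merges, which is the clustering history a dendrogram draws.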

Overview: CLUSTER Procedure
❖ Not practical for very large data sets, as CPU time is roughly proportional to the square or cube of the number of observations
❖ Displays a history of the clustering process
❖ Shows statistics for estimating the number of clusters
  - RMSSTD
  - Pseudo F
  - Pseudo T-square
❖ Creates a dendrogram
❖ Creates an output data set that the TREE procedure can use to output the cluster membership

CLUSTER Procedure
PROC CLUSTER METHOD=method-name;
  BY variables;
  COPY variables;
  FREQ variable;
  ID variable;
  RMSSTD variable;
  VAR variables;

Options
❖ RMSSTD: the root-mean-square standard deviation of a cluster
❖ Pseudo F: the ratio of between-cluster variance to within-cluster variance
❖ Pseudo T-square: a measure of the separation between the two clusters that are merged into a new cluster

RMSSTD
$\mathrm{RMSSTD}_k = \sqrt{\dfrac{W_k}{v\,(N_k - 1)}}$
❖ $W_k$: the within-group sum of squares of cluster k
❖ $N_k$: the number of elements in cluster k
❖ $v$: the number of variables
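A small worked example may help here. The sketch below assumes the usual definition, RMSSTD = sqrt(W_k / (v (N_k - 1))), where W_k is the within-group sum of squares, N_k the number of elements, and v the number of variables. The function names and the sample cluster are hypothetical.

```python
import math


def within_ss(cluster):
    """Within-group sum of squares: squared deviations from the cluster mean."""
    n = len(cluster)
    v = len(cluster[0])
    means = [sum(row[j] for row in cluster) / n for j in range(v)]
    return sum((row[j] - means[j]) ** 2 for row in cluster for j in range(v))


def rmsstd(cluster):
    """Root-mean-square standard deviation; needs at least two elements."""
    n = len(cluster)
    v = len(cluster[0])
    if n < 2:
        raise ValueError("RMSSTD is undefined for a single-element cluster")
    return math.sqrt(within_ss(cluster) / (v * (n - 1)))


# Hypothetical cluster with N_k = 3 elements and v = 2 variables.
cluster = [(1.0, 2.0), (3.0, 2.0), (5.0, 8.0)]
# Mean is (3, 4), so W_k = 32 and RMSSTD = sqrt(32 / (2 * 2)) = sqrt(8).
print(within_ss(cluster), rmsstd(cluster))
```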

Pseudo F
$F = \dfrac{B / (G - 1)}{W / (n - G)}$
❖ $B$: the between-group sum of squares
❖ $W$: the within-group sum of squares
❖ $G$: the number of clusters at a certain step
❖ $n$: the number of observations
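As a rough illustration, the pseudo F statistic can be computed by hand for a small partition. The sketch assumes the common form F = (B / (G - 1)) / (W / (n - G)), with B obtained as the total sum of squares minus the pooled within-group sum of squares. The helper names and the data are hypothetical.

```python
def sum_sq(points):
    """Sum of squared deviations of the points from their own mean."""
    n = len(points)
    v = len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(v)]
    return sum((p[j] - means[j]) ** 2 for p in points for j in range(v))


def pseudo_f(clusters):
    """Pseudo F for a partition given as a list of clusters (lists of points)."""
    all_points = [p for c in clusters for p in c]
    n = len(all_points)   # number of observations
    g = len(clusters)     # number of clusters at this step
    within = sum(sum_sq(c) for c in clusters)   # W
    between = sum_sq(all_points) - within       # B = total - W
    return (between / (g - 1)) / (within / (n - g))


# Hypothetical partition: two compact clusters far apart.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(4.0, 0.0), (4.0, 2.0)],
]
# Total SS = 20, W = 4, B = 16, so F = (16 / 1) / (4 / 2) = 8.
print(pseudo_f(clusters))
```

Larger values suggest more compact, better separated clusters, which is why the statistic is inspected across steps when choosing the number of clusters.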

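Pseudo T-Square
$t^2 = \dfrac{B_{KL}}{(W_K + W_L) / (N_K + N_L - 2)}$
❖ $W_K$, $W_L$: the within-cluster sums of squares of clusters K and L
❖ $N_K$, $N_L$: the numbers of observations in clusters K and L
❖ $B_{KL}$: the between-cluster sum of squares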
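A hand computation may make the statistic more concrete. The sketch assumes the common form t² = B_KL / ((W_K + W_L) / (N_K + N_L - 2)), and it assumes B_KL is the increase in within-cluster sum of squares caused by the merge, that is, the sum of squares of the merged cluster minus W_K and W_L. Function names and data are hypothetical.

```python
def sum_sq(points):
    """Sum of squared deviations of the points from their own mean."""
    n = len(points)
    v = len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(v)]
    return sum((p[j] - means[j]) ** 2 for p in points for j in range(v))


def pseudo_t2(cluster_k, cluster_l):
    """Pseudo t-squared for merging clusters K and L."""
    w_k = sum_sq(cluster_k)
    w_l = sum_sq(cluster_l)
    # Assumed definition: B_KL = W_merged - W_K - W_L.
    b_kl = sum_sq(cluster_k + cluster_l) - w_k - w_l
    dof = len(cluster_k) + len(cluster_l) - 2
    return b_kl / ((w_k + w_l) / dof)


# Hypothetical one-variable clusters.
cluster_k = [(0.0,), (2.0,)]
cluster_l = [(5.0,), (7.0,), (9.0,)]
# W_K = 2, W_L = 8, merged SS = 53.2, so B_KL = 43.2 and t^2 = 43.2 / (10 / 3).
print(pseudo_t2(cluster_k, cluster_l))
```

A large value at a given step indicates that the two clusters just merged were well separated, so the partition before that merge may be the better one.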

METHODS
❖ Average Linkage (AVE or AVERAGE)
❖ Centroid Method (CEN or CENTROID)
❖ Complete Linkage (COM or COMPLETE)
❖ Density Linkage (DEN or DENSITY)
❖ Maximum Likelihood (EML)
❖ Flexible-Beta Method (FLE or FLEXIBLE)
❖ McQuitty’s Similarity Analysis (MCQ or MCQUITTY)
❖ Median Method (MED or MEDIAN)
❖ Single Linkage (SIN or SINGLE)
❖ Two-Stage Density Linkage (TWO or TWOSTAGE)
❖ Ward’s Minimum-Variance Method (WAR or WARD)

Average Linkage
Idea: the distance between two clusters is defined as the average distance between pairs of observations, one in each cluster
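The idea can be sketched directly: take every pair with one observation from each cluster and average the pairwise distances. This sketch assumes plain Euclidean distance between observations; the function names and sample clusters are hypothetical, and it is not meant to reproduce the exact distances a SAS run would report.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(cluster_k, cluster_l):
    """Average distance over all pairs, one observation from each cluster."""
    total = sum(euclidean(p, q) for p in cluster_k for q in cluster_l)
    return total / (len(cluster_k) * len(cluster_l))


# Hypothetical clusters: the two pairwise distances are 3 and 5.
cluster_k = [(0.0, 0.0), (0.0, 4.0)]
cluster_l = [(3.0, 0.0)]
print(average_linkage(cluster_k, cluster_l))  # average of 3 and 5
```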

Centroid Method
Idea: the distance between two clusters is defined as the Euclidean distance between their centroids (cluster means)
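For comparison with average linkage, a minimal sketch of the centroid idea follows. It assumes the centroid of a cluster is the per-variable mean of its observations and measures the plain (unsquared) Euclidean distance between the two centroids. Function names and data are hypothetical.

```python
import math


def centroid(cluster):
    """Per-variable mean of the observations in a cluster."""
    n = len(cluster)
    return tuple(sum(p[j] for p in cluster) / n for j in range(len(cluster[0])))


def centroid_distance(cluster_k, cluster_l):
    """Euclidean distance between the centroids of two clusters."""
    ck = centroid(cluster_k)
    cl = centroid(cluster_l)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ck, cl)))


# Hypothetical clusters: centroids are (0, 2) and (3, 6).
cluster_k = [(0.0, 0.0), (0.0, 4.0)]
cluster_l = [(3.0, 6.0)]
print(centroid_distance(cluster_k, cluster_l))
```

Unlike average linkage, only the two cluster means matter here, so the spread of observations within each cluster does not affect the distance.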

Next week’s work
❖ Do examples with the SAS base language
❖ Read more about other procedures in SAS/STAT

Thank You!!!