Seven (plus or minus two) Clusters, A Monte Carlo Study Larry Hoyle, Policy Research Institute, The University of Kansas.

Presentation transcript:

Seven (plus or minus two) Clusters, A Monte Carlo Study Larry Hoyle, Policy Research Institute, The University of Kansas

1972 Kansas Statistical Abstract

Shading by Overprinting

Shading by Line Spacing

Line Shading Detail

What did they have in common? Neither method is “continuous”, so both methods required grouping or classes. Fixed number of combinations. Characters on a fixed grid. Integer number of lines in the polygon. Lines are relatively coarse.

How to Group for Shading: Equal intervals. Equal numbers (quantiles). By clusters. Don’t group (unclassed).
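The first two grouping rules can be sketched in a few lines. This is only an illustration: the function names and the `densities` values are hypothetical and are not taken from the Kansas data. It shows how one extreme value leaves most observations in the bottom equal-interval class, while quantile breaks fill every class evenly.

```python
from bisect import bisect_right

def equal_interval_breaks(values, k):
    """Interior class boundaries for k classes of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)]

def quantile_breaks(values, k):
    """Interior boundaries that put about the same number of values in each class."""
    ordered = sorted(values)
    n = len(ordered)
    # Each break is the first value of the next class.
    return [ordered[(n * i) // k] for i in range(1, k)]

def class_counts(values, breaks):
    """Number of values falling into each class defined by the interior breaks."""
    counts = [0] * (len(breaks) + 1)
    for v in values:
        # A value equal to a break is placed in the upper class.
        counts[bisect_right(breaks, v)] += 1
    return counts

# Hypothetical skewed "density" values: many small, a few very large.
densities = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 40, 150, 900]

equal_counts = class_counts(densities, equal_interval_breaks(densities, 3))
quantile_counts = class_counts(densities, quantile_breaks(densities, 3))
print("equal intervals:", equal_counts)
print("equal numbers:  ", quantile_counts)
```

With tied values the quantile classes will not come out exactly equal, because a tie cannot be split across a boundary.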

Population Density – 7 Equal Intervals: 100 counties fall into the bottom class

Population Density - Equal Numbers: 15 counties in each class, a very different picture

Population Density - Cluster Means: Group around the 7 values that “best” represent the data

Population Density - Unclassed: No classes, just shade in proportion to value

Clustering: Tries for the “best” grouping. Each member of a cluster can be represented by the mean of the group.

Proc Fastclus: You specify the number of clusters. Minimizes the cluster sum of squared distances (i.e., minimum within-cluster variance). Inspired by the k-means method (MacQueen) and the leader algorithm (Hartigan).
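To make the "minimize the within-cluster sum of squares" idea concrete, here is a minimal one-dimensional k-means sketch (Lloyd-style iteration). It is not PROC FASTCLUS and does not reproduce its seeding or options; the function names, the evenly spaced seed rule, and the example data are all assumptions made for illustration.

```python
def assign(values, means):
    """Put each value in the group of its nearest mean."""
    groups = [[] for _ in means]
    for v in values:
        nearest = min(range(len(means)), key=lambda j: (v - means[j]) ** 2)
        groups[nearest].append(v)
    return groups

def kmeans_1d(values, k, max_iter=100):
    """Alternate between assigning values and recomputing means until stable."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed seed rule: start from k values at evenly spaced ranks.
    means = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        groups = assign(ordered, means)
        # An empty group keeps its previous mean.
        new_means = [sum(g) / len(g) if g else m for g, m in zip(groups, means)]
        if new_means == means:
            break
        means = new_means
    return means, assign(ordered, means)

def within_ss(groups, means):
    """Within-cluster sum of squared distances, the quantity being minimized."""
    return sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

# Hypothetical data with three obvious groups.
data = [1, 2, 3, 11, 12, 13, 21, 22, 23]
means, groups = kmeans_1d(data, 3)
print(means, within_ss(groups, means))
```

Like any k-means variant, this converges to a local minimum that depends on the starting seeds.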

Example clustering - data

4 clusters [plot of data and cluster values, y against x]: R-squared = .9912

4 clusters: correlation between data and cluster values = .9956, R-squared = .9912

3 clusters [plot of data and cluster values, y against x]: R-squared = .9609
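The R-squared on these slides can be read as the share of variance kept when every observation is replaced by its cluster mean, and it equals the squared correlation between the data and those means (for the 4-cluster example, .9956 squared is about .9912). The sketch below checks that identity on a hypothetical nine-value data set, not on the slide's data; the function names are illustrative.

```python
from math import sqrt

def r_squared(data, fitted):
    """1 - (within-cluster SS / total SS), where fitted holds each value's cluster mean."""
    mean = sum(data) / len(data)
    total_ss = sum((x - mean) ** 2 for x in data)
    within_ss = sum((x - f) ** 2 for x, f in zip(data, fitted))
    return 1 - within_ss / total_ss

def pearson(xs, ys):
    """Pearson correlation coefficient, computed directly from its definition."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data and the mean of the cluster each value belongs to.
data = [1, 2, 3, 11, 12, 13, 21, 22, 23]
fitted = [2, 2, 2, 12, 12, 12, 22, 22, 22]

r2 = r_squared(data, fitted)
r = pearson(data, fitted)
print(round(r2, 4), round(r * r, 4))
```

The two printed numbers agree because the fitted values are group means, so the residuals are uncorrelated with them.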

How many clusters is enough?

Plot R-squared by number of clusters: sample of 300 observations, uniform distribution, 11 cluster analyses
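As a rough benchmark for reading such a plot, here is a worked calculation under an idealized assumption that is not on the slide: an exactly uniform population split into k equal-width classes. The symbols w (width of the range) and k (number of classes) are introduced only for this illustration.

```latex
\sigma^2_{\text{total}} = \frac{w^2}{12}, \qquad
\sigma^2_{\text{within}} = \frac{(w/k)^2}{12}, \qquad
R^2 = 1 - \frac{\sigma^2_{\text{within}}}{\sigma^2_{\text{total}}} = 1 - \frac{1}{k^2}
```

Under that assumption k = 4 gives R-squared = 0.9375 and k = 5 gives 0.96, so five classes are the first to pass 95%. Finite samples scatter around these values.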

What happens if there really aren’t any clusters? Let’s try 500 samples
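A scaled-down version of this kind of experiment can be sketched as follows. It is not the author's SAS program: the one-dimensional k-means routine, the seed rule, the range of k (2 through 12, chosen only because that makes 11 clusterings), and the reduced count of 50 samples are all assumptions. It draws samples with no real cluster structure and records the worst R-squared seen for each k.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def kmeans_r2(sample, k, max_iter=200):
    """R-squared of a 1-D Lloyd-style k-means fit (a sketch, not PROC FASTCLUS)."""
    xs = sorted(sample)
    n = len(xs)
    prefix = [0.0] + list(accumulate(xs))                    # running sums
    prefix_sq = [0.0] + list(accumulate(x * x for x in xs))  # running sums of squares
    means = [xs[(2 * i + 1) * n // (2 * k)] for i in range(k)]  # assumed seed rule
    cuts = []
    for _ in range(max_iter):
        # In one dimension each cluster is an interval whose edges are the
        # midpoints between neighbouring means.
        mids = [(a + b) / 2 for a, b in zip(means, means[1:])]
        new_cuts = [0] + [bisect_right(xs, m) for m in mids] + [n]
        if new_cuts == cuts:
            break
        cuts = new_cuts
        means = sorted((prefix[b] - prefix[a]) / (b - a) if b > a else m
                       for a, b, m in zip(cuts, cuts[1:], means))
    total_ss = prefix_sq[n] - prefix[n] ** 2 / n
    within_ss = sum(prefix_sq[b] - prefix_sq[a] - (prefix[b] - prefix[a]) ** 2 / (b - a)
                    for a, b in zip(cuts, cuts[1:]) if b > a)
    return 1.0 if total_ss == 0 else 1 - within_ss / total_ss

def monte_carlo_min_r2(draw, n_obs, n_samples, ks, seed=0):
    """Worst (minimum) R-squared over all samples, for each number of clusters."""
    rng = random.Random(seed)
    worst = {k: 1.0 for k in ks}
    for _ in range(n_samples):
        sample = [draw(rng) for _ in range(n_obs)]
        for k in ks:
            worst[k] = min(worst[k], kmeans_r2(sample, k))
    return worst

ks = range(2, 13)  # assumption: the slides do not say which 11 values of k were used
worst = monte_carlo_min_r2(lambda rng: rng.random(), n_obs=300, n_samples=50, ks=ks)
for k in ks:
    print(k, round(worst[k], 4))
```

Swapping the `draw` argument for `lambda rng: rng.gauss(0, 1)` or `lambda rng: rng.expovariate(1.0)` gives the normal and exponential cases, and raising `n_samples` to 500 matches the sample count on the slides.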

Uniform, 300 obs. per sample; 500 samples, 11 clusterings each

Uniform, 1000 obs. per sample; 500 samples, 11 clusterings each

Normal, 300 obs. per sample; 500 samples, 11 clusterings each

Normal, 1000 obs. per sample; 500 samples, 11 clusterings each

Exponential, 300 obs. per sample; 500 samples, 11 clusterings each

Distribution of worst sample

Exponential, 1000 obs. per sample; 500 samples, 11 clusterings each

So What’s with 7 ± 2?

Uniform, 7  samples, 11 clusterings each

Normal, 7  samples, 11 clusterings each

Exponential, 7  samples, 11 clusterings each

Minimum R-squared by sample size and distribution: at least 95% of the variance for all

Histograms: Equal intervals. Number of observations in each interval.

Needle Plot of Cluster Means

Bar chart needs more bars

The Magical Number Seven, Plus or Minus Two: Some Limits on our Capacity for Information Processing. George Miller, The Psychological Review, 1956, vol. 63, pp

Limits on Categories for Absolute Judgments: Pitch 6, Loudness 5, Visual position 9, Size of a square 5, Hue 8. Name the colors in this slide.

“And finally, what about the magical number seven?” George A. Miller

Miller – Quote 1: “What about the seven wonders of the world, seven seas, seven deadly sins, seven daughters of Atlas in the Pleiades, seven ages of man, seven levels of hell, seven primary colors, seven notes of the musical scale, seven days of the week”

Miller – Quote 2: “What about the seven-point rating scale, seven categories for absolute judgment, seven objects in the span of attention, seven digits in the span of immediate memory”

Miller – Quote 3: “…Perhaps there is something deep and profound behind all these sevens, something just calling out for us to discover it.”

Miller - close: “But I suspect that it is only a pernicious, Pythagorean coincidence.”

Coincidence or Nature’s Parsimony? Does our capacity match what’s needed for 95% of the variance? 95%? Hmmmm... Confidence intervals; an A; 19 fingers and toes; 970,000 web pages. Larry Hoyle, Policy Research Institute, University of Kansas