Part I. How to Perform A Cluster Analysis on Directional Data Phillip Hendrickson.


Caveat The first part of this presentation explains how a cluster analysis would work on directional data. Part II reviews my actual project, which is regrettably incomplete at this time due to lack of funding for the necessary datasets. Given this shortcoming, I will briefly describe how I would further my project by incorporating directional data.

Introduction Cluster analysis in the spatial realm has many applications, but the general principle behind any statistical analysis in the spatial world relates to the first law of geography. According to Waldo Tobler, "Everything is related to everything else, but near things are more related than distant things."

Introduction (con’t) Given the first law of geography, it is no surprise that cluster analysis is a frequently used exploratory procedure in the geophysical world, with arguably one of its biggest applications in climatology.

Recap of Cluster Analysis In class, we touched on how to perform a cluster analysis, in which we look at the observations and perform a sequence of steps to see which observations look alike. Remember: 1. Develop a classification. 2. Investigate grouping schemes. 3. Data exploration. 4. Hypothesis testing.

Steps involved in Cluster Analysis 1. Select a sample that reflects the population. 2. Select variables (what is important, what should you measure, what can you measure). 3. Create a similarity matrix (PMCC or covariance) that satisfies symmetry, equality, distinguishability, and the triangle inequality. 4. Select a grouping method (hierarchical agglomerative or k-means).

The Problem with Spatial Data (in some cases) The steps to conduct a spatial clustering are no different, except that you will automatically have two variables (one for latitude and one for longitude). So what is being introduced here that is different? How do you analyze directional data?

Directional Data In many situations a study may require the analysis of directional data (also referred to as aspect). For example, let’s say you want to track a storm event. Already you know you will need to handle wind vectors. Or perhaps you’re interested in geological events such as landslides or earthquakes, in which case chances are good that you will need to process slope angles.

The Problem With Directional Data Statistics for linear data will not work for directional data because they do not account for its circular nature (1° and 359° are only 2° apart). If you tried to force linearity, you would essentially cut the circle in half, which would negate the observations you cut out and prevent any proper clustering.
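The 1° and 359° example above can be checked with a few lines of Python. This is only a minimal sketch; the function name `circular_difference` is made up for illustration and is not taken from the presentation.

```python
def circular_difference(a_deg, b_deg):
    """Smallest angle (in degrees) separating two directions on a circle."""
    diff = abs(a_deg - b_deg) % 360.0
    # The two directions are joined by two arcs; keep the shorter one.
    return min(diff, 360.0 - diff)


# Treating the angles as linear values badly overstates the separation.
print(abs(1 - 359))                  # linear difference: 358
print(circular_difference(1, 359))   # circular difference: 2.0
```

Any distance used for clustering directions needs this wrap-around behaviour, which ordinary subtraction does not provide.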

Exploit Periodicity Lund 1999 suggests a method for univariate directional clustering that uses the circular nature of directional data to find an optimal number of clusters from the dataset.

Location and Dispersion Both are measured by the first trigonometric moment. (The related trigonometric moment problem asks whether there exists a positive finite Borel measure on the unit circle whose first n trigonometric moments take some specified values.)

Parameters μ: the mean direction, which measures the location of the distribution. ρ: the mean resultant length, also referred to as r bar, which measures the dispersion. Values close to 0 indicate large dispersion and values close to 1 signify a high degree of concentration in the data.
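Both parameters can be estimated from a sample by averaging the cosines and sines of the angles. The sketch below uses these standard sample estimates; the function name `circular_summary` and the example angles are assumptions for illustration, not data from the presentation.

```python
import math


def circular_summary(angles_deg):
    """Return (mean direction in degrees, mean resultant length) of a sample."""
    n = len(angles_deg)
    # Average the unit vectors that point in each observed direction.
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    r_bar = math.hypot(c, s)                          # length of the mean vector
    mean_dir = math.degrees(math.atan2(s, c)) % 360.0  # its direction
    return mean_dir, r_bar


# Concentrated sample straddling north: r_bar is close to 1 (about 0.985).
print(circular_summary([350, 10]))
# Evenly spread sample: r_bar is close to 0.
print(circular_summary([0, 90, 180, 270]))
```

When r bar is near 0 the mean direction is not meaningful, because the mean vector has almost no length to point with.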

How to Identify Clusters We use the measure of dispersion. Clusters are made with reference to the largest arc lengths between observations. In Lund 1999, Fig. 1, the mean resultant length of the entire dataset would be 0.05, but separated into two clusters, the mean resultant lengths are 0.88 and 0.87.
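The gap-based idea can be sketched as follows: sort the directions, find the k largest arcs between neighbouring observations, and cut the circle there. This is a simplified illustration of the description on the slide, not necessarily Lund's exact procedure; the function names and the sample angles are hypothetical.

```python
import math


def mean_resultant_length(angles_deg):
    """Mean resultant length (r bar) of a sample of directions in degrees."""
    n = len(angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    return math.hypot(c, s)


def split_at_largest_gaps(angles_deg, k):
    """Split directions into k clusters by cutting at the k largest arcs."""
    pts = sorted(a % 360.0 for a in angles_deg)
    n = len(pts)
    # gaps[i] is the arc from pts[i] to its next neighbour, wrapping past 360.
    gaps = [((pts[(i + 1) % n] - pts[i]) % 360.0, i) for i in range(n)]
    cut_after = sorted(i for _, i in sorted(gaps, reverse=True)[:k])
    clusters = []
    for j, cut in enumerate(cut_after):
        start = (cut + 1) % n            # first point after this cut
        end = cut_after[(j + 1) % k]     # last point before the next cut
        length = (end - start) % n + 1
        clusters.append([pts[(start + m) % n] for m in range(length)])
    return clusters


# Hypothetical sample: one group around north, one around south.
sample = [350, 355, 5, 10, 170, 175, 185, 190]
print(mean_resultant_length(sample))            # close to 0 for the whole set
for cluster in split_at_largest_gaps(sample, 2):
    print(cluster, round(mean_resultant_length(cluster), 3))
```

As in the figure cited on the slide, the whole sample looks highly dispersed while each cluster on its own is tightly concentrated.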

How To Measure Significance of Clusters The k clusters are determined by the k largest spaces between adjacent observations. To calculate the significance of the clusters: [equation not preserved in the transcript]. Large values of the significance measure represent a high level of clustering.

The Optimal Number of Clusters This is the number of clusters k that maximizes the significance. Values of significance can be negative, but only when the data are evenly distributed.

Part II. Spatial Clustering of Ice Storm Vulnerability in New England

Abstract The New England region of the United States is periodically hit by severe winter weather, which endangers county residents, damages transportation infrastructure, and devastates local economies. This project uses cluster analysis and GIS in an exploratory investigation to understand which counties within the New England states are under the most strain from these potentially disastrous events. The project looks at all recorded icing and glazing events from 1993 to 2009 in the New England region and forces regional clustering through spatial data.

Data My cluster analysis uses NOAA data covering a 16-year period, which lists the counties by state that were hit by each ice storm or glazing event. US census data was provided by John Mackenzie from the FREC Department.

Methods Extensive data mining was performed using Microsoft Excel. Storm events were matched and joined to counties by FIPS code using VLOOKUP. A pivot table was constructed to get total ice storm events per county in the New England region. Using JMP, a cluster analysis was performed. Variables included: population age 65 and up; population density; poverty; number of households; number of events in the 16-year period; and decimal degrees for the longitude and latitude of county centroids.
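The project did the join and the per-county totals in Excel. Purely as an illustration, the same two steps (a lookup by FIPS code, then a count per county) could be sketched in Python as below. The records, FIPS codes, and county labels are hypothetical placeholders, not the NOAA or census data.

```python
from collections import Counter

# Hypothetical storm records: one row per (event, county) pair.
storm_events = [
    {"event_id": 1, "fips": "99001"},
    {"event_id": 1, "fips": "99003"},
    {"event_id": 2, "fips": "99001"},
    {"event_id": 3, "fips": "99001"},
]

# Hypothetical county table keyed by FIPS code (the VLOOKUP target).
counties = {"99001": "County A", "99003": "County B", "99005": "County C"}


def events_per_county(events, county_names):
    """Count events per FIPS code and attach the county name to each total."""
    counts = Counter(e["fips"] for e in events)   # the pivot-table step
    return {
        fips: {"name": name, "events": counts.get(fips, 0)}
        for fips, name in county_names.items()    # the lookup/join step
    }


for fips, row in events_per_county(storm_events, counties).items():
    print(fips, row["name"], row["events"])
```

Counties with no recorded events keep a count of 0 rather than dropping out, so every county still appears in the table passed on to the cluster analysis.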

Methods (con’t) Using ArcMap, a Hot Spot analysis was performed and weighted by the clusters calculated from JMP.

Results Cluster analysis in JMP returned an optimal number of 10 clusters using the “Elbow Method” introduced by Robert Thorndike in 1953.

Results (con’t) Proposed vulnerability ranking regarding frequency of events:

Further Analysis Incorporate TIGER and DEM data into this analysis to find the most vulnerable counties with regard to paved road surfaces (a large factor that should dictate an area's vulnerability). Use DEM data to get aspect from the topography. With more time, use the directional clustering described in Part I to discover which elevations and slope faces are more frequently affected by ice storm events.

Conclusions In hindsight, I would have selected fewer clusters for my analysis, but as an exploratory project, I would say this cluster analysis was a success. In the future it may be interesting to perform this same analysis for other regions of the CONUS. With more time and reliable datasets, I would also like to perform a multivariate directional cluster analysis by incorporating wind data and slope aspect into this project.