Chapter 11: Multivariate Data, Multidimensional Space, and Spatialization. By Martin Bartnes
Chapter 11: Multivariate Data and Multidimensional Space; Distance, Difference, and Similarity; Cluster Analysis; Spatialization; Reducing the Number of Variables
Multivariate Data and Multidimensional Space: Multivariate data are data where more than one item is recorded for each observation. Such data are commonly represented in tabular form, with one row per observation and one column per variable.
One, Two and Three Variables in a Single Plot
Distance, Difference, and Similarity: Distance is a measure of the difference between pairs of observations. By Pythagoras's theorem, distances in multidimensional space are calculated in the same way as geographical distances. Each observation in a multidimensional space has a set of coordinates given by its value on each of the recorded variables. We use these coordinates to construct a distance matrix recording the distances between all pairs of observations. Other distance measures include the Minkowski and Manhattan distances.
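The distance measures named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the chapter: the function names and the three example observations are made up, and it uses the fact that the Minkowski distance of order p reduces to the Manhattan distance for p = 1 and to the Euclidean (Pythagorean) distance for p = 2.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two observations.

    a and b are equal-length tuples holding one value per recorded variable.
    """
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean(a, b):
    """Pythagorean distance: the Minkowski distance with p = 2."""
    return minkowski(a, b, 2)


def manhattan(a, b):
    """Sum of absolute differences: the Minkowski distance with p = 1."""
    return minkowski(a, b, 1)


def distance_matrix(observations, metric=euclidean):
    """Square matrix of distances between every pair of observations."""
    return [[metric(a, b) for b in observations] for a in observations]


# Hypothetical observations, each recorded on three variables.
observations = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]

euclidean_matrix = distance_matrix(observations)
manhattan_matrix = distance_matrix(observations, metric=manhattan)

for row in euclidean_matrix:
    print(row)
```

The matrix is symmetric with zeros on the diagonal, so in practice only the upper or lower triangle needs to be stored.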
Cluster Analysis: A cluster is a set of observations that are similar to each other and relatively different from other sets of observations. Cluster analysis can help to identify potential classifications in statistical data. Since we are thinking of each observation as a point in a multidimensional space defined by the variables, an obvious step is to look for clusters of observations. This may be an important first step in developing a theory.
Simple Cluster Technique
Hierarchical Cluster Analysis: Works by building a nested hierarchy of clusters. Small clusters consist of observations that are very similar to one another. Such small clusters are grouped into larger, looser associations higher up the hierarchy.
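One way to build such a nested hierarchy is agglomerative clustering: start with every observation in its own cluster and repeatedly merge the two closest clusters. The sketch below assumes Euclidean distance and the single-linkage rule (the distance between two clusters is the distance between their closest members). The slides do not say which linkage rule is intended, so that choice, the function names, and the example points are all illustrative assumptions.

```python
def euclidean(a, b):
    """Pythagorean distance between two observations."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def single_linkage(observations):
    """Agglomerative clustering with single linkage (an assumed choice).

    Returns the merge history as a list of (cluster_a, cluster_b, distance),
    where each cluster is a tuple of observation indices.
    """
    clusters = [(i,) for i in range(len(observations))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    euclidean(observations[p], observations[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical observations: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 20.0)]
merges = single_linkage(points)

for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 2))
```

The early merges join very similar observations at small distances, and the later merges form the larger, looser groups higher up the hierarchy. The merge history is the information a dendrogram displays.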
Spatialization: Mapping Multivariate Data. Multidimensional data cannot be visualized directly in more than three dimensions. Multivariate data may be 5-, 10-, or 20-dimensional.
Reducing the Number of Variables: Principal components analysis identifies a set of independent, uncorrelated variates, called the principal components, that can replace the original observed variables. Factor analysis looks for hidden factors and attempts to identify them. The values of the principal components for each observation are calculated from the original variables. There are as many principal components as there are original variables, but a subset of them that captures most of the variability in the original data can be used.
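To make the idea concrete, here is a minimal sketch of principal components analysis for the simplest case, two variables. The principal components are the eigenvectors of the covariance matrix, which for a 2 x 2 symmetric matrix can be found in closed form. The function name and the data values are made up for illustration, and real analyses would normally use a statistics library and often standardize the variables first.

```python
def pca_2d(data):
    """Principal components of two-variable data (illustrative sketch).

    Returns (eigenvalues, components, scores): the variance along each
    component, the unit-length component directions, and the component
    values for each observation.
    """
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    centered = [(x - mean_x, y - mean_y) for x, y in data]

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2 x 2 matrix, largest first.
    mid = (a + c) / 2
    half = (((a - c) / 2) ** 2 + b ** 2) ** 0.5
    eigenvalues = [mid + half, mid - half]

    if abs(b) < 1e-12:
        # Variables already uncorrelated: the components are the axes.
        components = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        components = []
        for lam in eigenvalues:
            vx, vy = b, lam - a  # eigenvector for eigenvalue lam
            length = (vx * vx + vy * vy) ** 0.5
            components.append((vx / length, vy / length))

    # Component values are calculated from the (centered) original variables.
    scores = [
        tuple(x * vx + y * vy for vx, vy in components) for x, y in centered
    ]
    return eigenvalues, components, scores


# Hypothetical observations on two strongly correlated variables.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
eigenvalues, components, scores = pca_2d(data)

total = sum(eigenvalues)
explained = [value / total for value in eigenvalues]
print("share of variability per component:", explained)
```

There are two components because there are two original variables, but with data this strongly correlated the first component alone carries almost all of the variability, so the second could be dropped.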
Questions!!! What are the main challenges with multidimensional data? What is the point of cluster analysis?