Data Projections & Visualization
Rajmonda Caceres, MIT Lincoln Laboratory
Reduce complexity (visual, computational). Identify the intrinsic dimensionality of the data. Identify the most relevant aspects of the data given a task.
[Figure panels: Lower Dimension; Higher Dimension]
Not all projections are equal [figure panels a) and b)]
Desired properties: a reduced, compressed representation; preserved useful/intrinsic properties of the data; amplified patterns of interest (e.g. outliers); simple and interpretable. There is a trade-off between simplicity and preservation of structure.
Helps us organize the data. Helps us discriminate patterns.
Manhattan distance (1-norm, taxicab distance). Euclidean distance (2-norm).
L-p distance: as p grows, the largest coordinate difference tends to dominate the overall distance.
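A minimal sketch of the L-p distance, to make the effect of p concrete. The function name `lp_distance` and the two example points are illustrative assumptions, not part of the slides.

```python
import numpy as np

def lp_distance(x, y, p):
    """L-p (Minkowski) distance between two vectors.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance,
    and p = np.inf the largest absolute coordinate difference.
    """
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))

# Illustrative points: the coordinate differences are 3, 1 and 1.
x = np.array([0.0, 0.0, 0.0])
y = np.array([3.0, 1.0, 1.0])

for p in (1, 2, 4, 16, np.inf):
    print(p, lp_distance(x, y, p))
```

For these points the distance shrinks from 5 at p = 1 toward 3, the largest coordinate difference, as p increases.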
Projective methods: preserve a property of the data. Principal Component Analysis (PCA). Many others: ICA, Factor Analysis. Manifold learning: Multidimensional Scaling (MDS), LLE, Isomap.
Goal: Find a linear projection that captures most of the variance. [Figure labels: 1st Principal Component; 2nd Principal Component]
PCA pseudocode:
1. Center the data by subtracting the mean.
2. Calculate the covariance matrix of the centered data.
3. Calculate the eigenvectors (principal components) of the covariance matrix.
4. Select the top few (2-3) eigenvectors, i.e. those with the highest eigenvalues.
5. Project the data using these eigenvectors as axes.
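The PCA steps can be sketched in NumPy roughly as follows. This is an illustration under assumptions rather than a reference implementation: the function name `pca`, the synthetic data, and the (n - 1) normalization used by `np.cov` are choices made here, not taken from the slides.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    # Step 1: center the data by subtracting the per-feature mean.
    X_centered = X - X.mean(axis=0)
    # Step 2: covariance matrix of the features (assumed n - 1 normalization).
    cov = np.cov(X_centered, rowvar=False)
    # Step 3: eigen-decomposition; eigh suits symmetric matrices and
    # returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: sort by decreasing eigenvalue and keep the top few.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    # Step 5: project the centered data onto the selected axes.
    projected = X_centered @ components
    # Fraction of the total variance carried by each component.
    explained = eigvals / eigvals.sum()
    return projected, components, explained

# Synthetic data (assumption): 200 points in 3-D, stretched along one axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

Z, W, ratio = pca(X, n_components=2)
print(Z.shape)
print(ratio)
```

The `explained` ratios are the quantities a scree plot typically displays, which helps when deciding how many components to keep.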
[Figures: scree plot; biplot]
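Goal: Find a lower-dimensional embedding of the data that preserves pairwise distances. Formally: given the input distance values (between the original data points), find an embedding whose output distance values match them as closely as possible.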
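One way to make this goal concrete is classical (Torgerson) MDS, sketched below. This is one specific variant and may not be the formulation the slides intend. The names `D_in` and `D_out`, the helper functions, and the raw-stress measure (sum of squared differences between input and output distances) are assumptions introduced for illustration. Classical MDS does not minimize raw stress directly. Stress is used here only to quantify how well distances are preserved.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, n_components=2):
    """Embed points from a distance matrix D via classical (Torgerson) MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by rounding.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

def raw_stress(D_in, D_out):
    """Sum of squared differences over all distinct pairs of points."""
    iu = np.triu_indices_from(D_in, k=1)
    return float(((D_in[iu] - D_out[iu]) ** 2).sum())

# Synthetic data (assumption): 50 points in 5-D, embedded into 2-D.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
D_in = pairwise_distances(X)    # input distances
Y = classical_mds(D_in, n_components=2)
D_out = pairwise_distances(Y)   # output distances
print(raw_stress(D_in, D_out))
```

Plotting the paired entries of `D_in` and `D_out` against each other gives the kind of scatter a Shepard diagram shows.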
Shepard diagram [figure axes: MDS Distances; Data Distances]
More features are not necessarily better. Understand the assumptions of different modeling choices. When choosing distance functions and projection methods, consider the characteristics of the data and the learning objective. Explore multiple choices simultaneously to gain better insight.
The effect of concentration of distances [figure panels: Lower Dimension; Higher Dimension]
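A small simulation can illustrate the concentration of distances. The setup below is an assumption made for illustration: points drawn uniformly from the unit cube, Euclidean distance, and a "relative contrast" measure (farthest minus nearest distance, divided by the nearest). None of these choices come from the slides.

```python
import numpy as np

def relative_contrast(n_points, dim, rng):
    """(max - min) / min of distances from a random query to random points."""
    X = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    d = np.sqrt(((X - query) ** 2).sum(axis=1))
    return float((d.max() - d.min()) / d.min())

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    print(dim, relative_contrast(1000, dim, rng))
```

In this setup the contrast is large in low dimensions and shrinks as the dimension grows, meaning the nearest and farthest points become hard to tell apart by distance alone.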