Explaining High-Dimensional Data Hoa Nguyen, Rutgers University Mentors: Ofer Melnik Kobbi Nissim
High-Dimensional Data A great deal of data from different domains (medicine, finance, science) is high-dimensional. High-dimensional data is hard to visualize and understand.
Example: Given a set of images (represented by 8x8 pixel matrices), we could consider each image as a point in 64-dimensional space.
Analyzing Data Instead of visualizing, we find properties of data to describe it. Statistics, Data Mining, Machine Learning: Finding properties of data directly Finding models to capture data
Classifier Given a set of data points with assigned labels. Build a model for the data. Use this model to label new unclassified data points.
The Geometrical View of Classifiers: Example: Given a set of data points in 2 dimensional-space with a + or – label for each point: (x i ,y i)± _ _ _ _ + + + _ + + + + _ _ _ We are interested in classifiers that take all the points of a class, and enclose them in a convex shape.
Goal To understand the properties of the data enclosed by a convex shape.
Convex Shapes The convex hull: Problem: Our solution: The convex hull C of a set of points is the smallest convex set that includes all the points. Problem: It is difficult to study the convex hull directly. Our solution: Instead of looking at the convex hull, we use the simpler convex shape to approximate the hull – the ellipsoid.
MVE (Minimum Volume Ellipsoid) Example: + + + + + + + + + + Advantage: Use an ellipsoid to approximate the convex region, or to bound the geometry of the convex hull.
Using MVE to approximate the convex hull [John 1948] has shown that if we shrink the minimum volume outer ellipsoid of a convex set C by a factor k about its center, we obtain an ellipsoid contained in C. (k is the dimension of the space) Example: + + + + + + + + + +
Calculate the MVE The MVE is described by the equation: v’ -1 v = k V={v1,v2,…,vh} Rk. v is an exterior point : the scatter matrix = wiviviT the eigen vectors of correspond to the directions of the ellipsoid axes. the eigen values of correspond to the half-lengths or radii of the axes. k: a constant equal to the dimension of the space wi: the weight of a point vi. h i=1
Calculate the MVE (cont.) [Titterington 1978] An algorithm to calculate the weights of MVE: A point has a positive weight if it lies on the surface of the ellipsoid. At least k+1 points have non-zero weights. At most k(k+3)/2+1 points have non-zero weights.
Use MVE for data analysis: Finding extreme points by looking at points on the ellipsoid surface. Finding the subspace of data by looking at the directions where the ellipsoid is thin.
Points on the surface of the ellipsoid i.e., points with non-zero weights. Example: In our hand-written digit file, there are 376 points which belong to class “0”. By using MVE, we find that there are about 178 points with non-zero weights, i.e., these points lie on the surface of the MVE. The mean-zero Some “0” points on the surface of the MVE
Directions where the ellipsoid is thin The directions and size of an ellipsoid’s axes correspond to the eigen vectors and values of its scatter matrix. Direction of thinness: A short axis defines a direction in which the data does not extend. If V is a zero-valued eigen vector, then it defines a constraint for any data point x: Vx=0
A simple Null Space Any basis for the Null Space of the scatter matrix is an equivalent set of constraints. In order to understand the data, we would like to find constraints that are easy to interpret. Goal: simplify the null space basis, e.g.,find a basis with many zeros.
The Null Space Problem (NSP) The Null Space Problem is defined as finding the basis with the maximal number of zeros. It is an NP-hard problem. [Pothen, Coleman 1986] An approach: Find a heuristic algorithm to simplify an existing basis of the null space of Class “0”: e.g., using Gaussian elimination to get a null space basis with more 0 components.
The null space basis = the set of eigenvectors with 0-eigenvalues The null space basis after using Gaussian elimination
Summary Data analysis Classifier with convex shape MVE Points on the surface of the MVE Simple basis for the null space