1
Geometry of data sets
Alexander Gorban
University of Leicester, UK
2
Plan
The problem
Approximation of multidimensional data by low-dimensional objects
Self-simplification of essentially high-dimensional sets
Terra Incognita between low-dimensional sets and self-simplified high-dimensional ones
3
Change of era
From Einstein’s “flight from miracle”: «… The development of this world of thought is in a certain sense a continuous flight from “miracle”.»
To struggle with complexity: "I think the next century will be the century of complexity." (Stephen Hawking)
4
Two main approaches in our struggle with complexity
Model reduction: from a large space with something interesting inside to a “minimal” space with this interesting content.
Self-simplification in large dim: in high dimensionality many different things become similar, if we choose the proper point of view.
5
Karl Pearson 1901
6
Principal Component Analysis
Approximation by straight lines:
1st principal axis: maximal dispersion of the projections, or equivalently minimal mean squared distance to the line.
Subtract the projection and repeat.
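The "fit a line, subtract the projection, repeat" recipe can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function name `pca_by_deflation`, the use of power iteration to find each axis, and the iteration count are choices made here, not taken from the slides.

```python
import numpy as np

def pca_by_deflation(X, n_components, n_iter=500, seed=0):
    """Find principal axes one at a time: fit a line, subtract projections, repeat."""
    rng = np.random.default_rng(seed)
    R = X - X.mean(axis=0)              # centre the data
    axes = []
    for _ in range(n_components):
        v = rng.normal(size=R.shape[1])
        v /= np.linalg.norm(v)
        for _ in range(n_iter):         # power iteration on R^T R
            w = R.T @ (R @ v)
            v = w / np.linalg.norm(w)
        axes.append(v)                  # direction of maximal dispersion
        R = R - np.outer(R @ v, v)      # subtract the projection onto this axis
    return np.array(axes)
```

Each returned row is a unit vector; up to sign, the rows should agree with the leading right singular vectors of the centred data matrix.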
7
Principal points (K-means)
Approximation by smaller finite sets (centres y(i), data points x(j)):
1. Select several centres;
2. Attach data points to the closest centres by springs;
3. Minimize energy;
4. Repeat 2 & 3 until convergence.
Steinhaus, 1956; Lloyd, 1957; MacQueen, 1967
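The four steps above can be sketched as a plain Lloyd-style iteration. The function name `k_means`, the random choice of initial centres from the data, and the stopping rule are assumptions of this sketch. The "energy" is taken to be the sum of squared spring lengths, which is minimized by moving each centre to the mean of its attached points.

```python
import numpy as np

def assign(X, centres):
    """Step 2: index of the closest centre for every data point."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select several centres (here: k distinct data points).
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(X, centres)
        new_centres = centres.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members):            # keep an empty centre where it is
                # Step 3: the spring energy is minimal at the mean.
                new_centres[i] = members.mean(axis=0)
        if np.allclose(new_centres, centres):
            break                       # Step 4: converged
        centres = new_centres
    return centres, assign(X, centres)
```

The result depends on the initial centres; in practice the procedure is often restarted from several initializations and the lowest-energy result is kept.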
8
Approximation by algebraic curves and surfaces
1st principal axis: are we happy with this approximation?
Extend the space by values of additional functions and apply PCA.
[Figure: data in the (x, y) plane extended with the coordinate x^2; fitted curve y + a + bx + cx^2 = 0.]
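One possible reading of "extend the space and apply PCA" for the quadratic curve is sketched below: map each point to (x, x^2, y), run PCA there, and take the direction of least variance as the normal of the best-fitting plane. That plane is the curve y + a + bx + cx^2 = 0 in the original coordinates. The function name `fit_quadratic_by_pca` and the use of the SVD are assumptions of this sketch.

```python
import numpy as np

def fit_quadratic_by_pca(x, y):
    """Fit y + a + b*x + c*x**2 = 0 via PCA in the extended space (x, x^2, y)."""
    Z = np.column_stack([x, x**2, y])       # extended coordinates
    mean = Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z - mean, full_matrices=False)
    n = Vt[-1]                              # direction of least variance = plane normal
    n = n / n[2]                            # scale so y has coefficient 1 (assumes n[2] != 0)
    b, c = n[0], n[1]
    a = -n @ mean                           # plane passes through the mean point
    return a, b, c
```

For points lying exactly on y = 1 + 2x - 3x^2 this should recover a = -1, b = -2, c = 3 up to rounding. Unlike ordinary regression of y on x, the fit minimizes orthogonal distances in the extended space.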
9
Illustration: Nonlinear happiness
Indicators: gross product per person ($/person); life expectancy (years); infant mortality (cases/1000); tuberculosis incidence (cases/100000).
x = (YEAR=1989,…,2005) (COUNTRY=1…192)
Linear index explains 76%; non-linear index explains 93%.
[Figure: principal curve with Quality of Life running from -1 to +1; Russia trajectory marked by year, 1989 to 2005.]
10
Constructing elastic nets
[Figure labels: grid node y; edge E with end nodes E(0), E(1); rib R with nodes R(1), R(0), R(2).]
11
Definition of elastic energy: we borrow this approach from splines
[Figure labels: data point X_j; grid node y; edge nodes E(0), E(1); rib nodes R(1), R(0), R(2).]
12
Definition of elastic energy
[Figure labels: data point X_j; grid node y; edge nodes E(0), E(1); rib nodes R(1), R(0), R(2).]
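The energy formulas themselves are not in the transcript. The sketch below assumes the form commonly used for elastic maps: a data approximation term, a stretching term over edges, and a bending term over ribs, here for the simplest case of a chain of nodes. The function name `elastic_energy`, the uniform coefficients `lam` and `mu`, and the unweighted averaging over data points are assumptions of this sketch.

```python
import numpy as np

def elastic_energy(X, Y, lam, mu):
    """Energy of an elastic chain with nodes Y (m x d) fitted to data X (N x d)."""
    # Approximation term: mean squared distance from each X_j to its closest node y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    U_Y = d2.min(axis=1).mean()
    # Stretching term: squared edge lengths |E(1) - E(0)|^2.
    U_E = lam * ((Y[1:] - Y[:-1]) ** 2).sum()
    # Bending term: |R(1) + R(2) - 2 R(0)|^2 over consecutive node triples (ribs).
    U_R = mu * ((Y[:-2] + Y[2:] - 2.0 * Y[1:-1]) ** 2).sum()
    return U_Y + U_E + U_R
```

The bending term vanishes when the nodes are equally spaced on a straight line, which is the spline-like behaviour the slide title alludes to.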
13
Are non-linear projections better than linear projections?
[Figure panels: breast cancer (Wang et al., 2005); bladder cancer (Dyrskjot et al., 2003).]
14
Principal graphs?
[Figure: a 2-star and a 3-star, each embedded in R^N; labels (V) and ?.]
15
Generalization: what is a principal graph?
Ideal object: pluriharmonic graph embedment.
Elastic k-star: k edges, k+1 nodes. The branching energy is [formula not preserved].
Primitive elastic graph: all non-terminal nodes with k edges are elastic k-stars. The graph energy is [formula not preserved].
Pluriharmonic graph embedments generalize the straight line, the rectangular grid (with proper choice of k-stars), etc.
[Figure: 2-stars (ribs) and 3-stars; star centre S0; ideal position of S0 (mean point of the star’s leaves); negative (repulsing) spring.]
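The branching and graph energies named above did not survive extraction. One form consistent with the note that the ideal position of S0 is the mean point of the star's leaves, and with the rib term for k = 2, is given below. It is an assumption based on the usual elastic-graph formulation, not a reconstruction of the slide; the symbols \mu_{S_k} (star elasticity), \lambda_E (edge elasticity), and S_k(0), S_k(1), ..., S_k(k) (centre and leaves) are notation chosen here.

```latex
U^{R}(S_k) = \mu_{S_k} \left\| \sum_{i=1}^{k} S_k(i) - k\, S_k(0) \right\|^{2},
\qquad
U(G) = \sum_{E} \lambda_{E} \left\| E(1) - E(0) \right\|^{2} + \sum_{S_k} U^{R}(S_k).
```

With this form U^R(S_k) = 0 exactly when S_k(0) is the mean of the leaves, which is the pluriharmonic condition at that node.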
16
Principal harmonic dendrites (trees) approximating complex data structures
[Figure panels: Linear PCA; Non-linear PCA; Branching PCA.]
17
Visualization of 7-cluster genome sequence structure
[Figure panels: algorithm iterations; 3D PCA plot; metro map.]
Clusters that overlap on the 3D PCA plot are in fact well separated, and the principal tree reveals this.
18
And much more for low-dimensional subsets: Locally Linear Embedding, Isomap, Laplacian Eigenmaps, Nonlinear Multidimensional Scaling, Independent Component Analysis, Persistent cohomology, …
19
Measure concentration effects (self-simplification in large dim)
For large n: B^n = S^n = S^(n-1)
Maxwell, Gibbs, Milman, Talagrand, Gromov, …
[Figure: projection of S^n; the density of the shadow is Gaussian (the Maxwell distribution).]
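The Gaussian shadow can be checked numerically. The sketch below samples points uniformly on the unit sphere in R^n, projects them onto one coordinate axis, and rescales by sqrt(n); for large n the result should look like a standard normal sample. The function name `sphere_shadow` and the sampling method (normalizing Gaussian vectors) are choices made for this sketch.

```python
import numpy as np

def sphere_shadow(n, n_points=20_000, seed=0):
    """Rescaled one-coordinate projection of uniform points on the unit sphere in R^n."""
    rng = np.random.default_rng(seed)
    g = rng.normal(size=(n_points, n))
    pts = g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform on the sphere
    return np.sqrt(n) * pts[:, 0]                        # shadow on the first axis
```

The rescaling reflects that each coordinate of a uniform point on the unit sphere has variance exactly 1/n, so the unscaled shadow concentrates near zero as n grows.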
20
Self-simplification in large dim
A 3D representation of an 8D hypercube. The body has the same radial distribution and the same number of vertices as the hypercube. A very small fraction of the mass lies near a vertex. Also, most of the interior is void. (Illustration by Hamprecht & Agrell, 2002)
21
Strange properties of high dimensional sets
Observable diameter of the sphere S^n, n = 3, 10, 100, 2500. (Illustrations by V. Pestov, 2005.)
Distribution of distances for pairs of points in the unit hypercube I^n, n = 3, 10, 100, 1000 (for random samples of 10,000 pairs).
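The hypercube experiment is easy to repeat. The sketch below draws random pairs of points in the unit hypercube I^n and returns the mean and standard deviation of their Euclidean distances; the function name `hypercube_distance_stats` and the default seed are assumptions of this sketch.

```python
import numpy as np

def hypercube_distance_stats(n, n_pairs=10_000, seed=0):
    """Mean and standard deviation of distances between random pairs in [0, 1]^n."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_pairs, n))
    b = rng.random((n_pairs, n))
    d = np.linalg.norm(a - b, axis=1)
    return d.mean(), d.std()
```

Calling it for n = 3, 10, 100, 1000 should show the mean growing roughly like sqrt(n/6) while the spread stays bounded, so the relative spread shrinks and almost all pairs end up at nearly the same distance.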
22
Three provinces of the Complexity Land
Reducible models (Princ. Comp. …)
Wild complexity ???
Self-simplification (Stat. Phys. …)