Geometry of data sets Alexander Gorban University of Leicester, UK.

Geometry of data sets Alexander Gorban University of Leicester, UK

Plan The problem Approximation of multidimensional data by low-dimensional objects Self-simplification of essentially high- dimensional sets Terra Incognita between low-dimensional sets and self-simplified high-dimensional ones.

3 Change of era From Einstein’s “flight from miracle.” «… The development of this world of thought is in a certain sense a continuous flight from “miracle”.» To struggle with complexity "I think the next century will be the century of complexity." Stephen Hawking

4 Two main approaches in our struggle with complexity A “minimal” space with this interesting content In high dimensionality many different things become similar, if we choose the proper point of view A large space with something interesting inside Model reduction Self-simplification in large dim

Karl Pearson 1901

Principal Component Analysis Maximal dispersion 1 st Principal axis Minimal squared averaged distance Approximation by straight lines: Subtract the projection and repeat

Principal points (K-means) Centres y (i) Data points x (j) Approximation by smaller finite sets: 1.Select several centres; 2.Attach datapoints to the closest centres by springs; 3.Minimize energy; 4.Repeat 2&3 until converges. Steinhaus, 1956; Lloyd, 1957; MacQueen, 1967

Approximation by algebraic curves and surfaces 1 st Principal axis: Are we happy with this approximation? Extend the space by values of additional functions and apply PCA y x x2x2 y+a+bx+cx 2 =0

Illustration: Nonlinear happiness Principal curve Quality of Life = -1 Quality of Life = +1 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 Russia trajectory Linear index explains 76% Non-linear index explains 93% Gross product per person, $/person Life expectancy, years Infant mortality, case/1000 Tuberculosis incidence, case/100000 x = (YEAR=1989,…,2005) (COUNTRY=1…192)

Constructing elastic nets y E (0) E (1) R (1) R (0) R (2)

Definition of elastic energy: we borrow this approach from splines E (0) E (1) R (1) R (0) R (2) y XjXj

Definition of elastic energy E (0) E (1) R (1) R (0) R (2) y XjXj

Are non-linear projections better than linear projections? Breast cancer Wang et al., 2005 Bladder cancer Dyrskjot et al., 2003

Principal graphs?  (V) 3-Star 2-Star RNRN RNRN ?

Generalization: what is principal graph? Ideal object: pluriharmonic graph embedment Elastic k-star (k edges, k+1 nodes). The branching energy is Primitive elastic graph: all non-terminal nodes with k edges are elastic k-stars. The graph energy is 2-stars (ribs) 3-stars Pluriharmonic graph embedments generalize straight line, rectangular grid (with proper choice of k-stars), etc. S0 Ideal position of S0 (mean point of the star’s leaves) negative (repulsing) spring

Principal harmonic dendrites (trees) approximating complex data structures Linear PCA Non-linear PCA Branching PCA

Visualization of 7-cluster genome sequence structure Algorithm iterations 3D PCA plot Metro map Here clusters overlapping on 3D PCA plot are in fact well-separated and the principal tree reveals this fact

And much more for low-dimensional subsets: Local Linear Embedding Isomap Laplace Eigenmaps Nonlinear Multidimensional Scaling Independent Component Analysis Persistent cohomology...................

19 Measure concentration effects Self-simplification in large dim BnBn SnSn S n-1 = = For large n Maxwell Gibbs Milman Talgrand Gromov ……….. Projection Density of shadow Gaussian The Maxwell distribution SnSn

20 A 3D representation of an 8D hypercube The body has the same radial distribution and the same number of vertices as the hypercube. A very small fraction of the mass lies near a vertex. Also, most of the interior is void. (Illustration by Hamprecht & Agrell, 2002) Self-simplification in large dim

Strange properties of high dimensional sets Observable diameter of the sphere S n, n = 3, 10, 100, 2500. Illustrations by V. Pestov, 2005 Distribution of distances for pairs of points in the unit hypercube I n, n = 3, 10, 100, 1000. (For random samples of 10,000 pairs.).

22 Three provinces of the Complexity Land Reducible models (Princ. Comp. …) Wild complexity ??? Self- simplification (Stat. Phys. …)

23 Three provinces of the Complexity Land Reducible models (Princ. Comp. …) Wild complexity ??? Self- simplification (Stat. Phys. …)

Geometry of data sets Alexander Gorban University of Leicester, UK.

Similar presentations

Presentation on theme: "Geometry of data sets Alexander Gorban University of Leicester, UK."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Geometry of data sets Alexander Gorban University of Leicester, UK.

Similar presentations

Presentation on theme: "Geometry of data sets Alexander Gorban University of Leicester, UK."— Presentation transcript:

Similar presentations

About project

Feedback