Geometry of data sets
Alexander Gorban, University of Leicester, UK

Plan
1. The problem
2. Approximation of multidimensional data by low-dimensional objects
3. Self-simplification of essentially high-dimensional sets
4. Terra Incognita between low-dimensional sets and self-simplified high-dimensional ones

Change of era. From Einstein's "flight from miracle": "… The development of this world of thought is in a certain sense a continuous flight from 'miracle'." To the struggle with complexity: "I think the next century will be the century of complexity." (Stephen Hawking)

Two main approaches in our struggle with complexity:
1. Model reduction: a large space with something interesting inside; find a "minimal" space with this interesting content.
2. Self-simplification in large dimension: in high dimensionality many different things become similar, if we choose the proper point of view.

Karl Pearson, 1901: "On lines and planes of closest fit to systems of points in space", the origin of Principal Component Analysis.

Principal Component Analysis: approximation by straight lines. The 1st principal axis is the direction of maximal dispersion; equivalently, it minimizes the averaged squared distance from the data points to the line. To obtain further axes, subtract the projection and repeat.
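This procedure translates directly into code. A minimal numpy sketch (illustrative, not the speaker's code): the leading right singular vector of the centred data is the direction of maximal dispersion, and deflation implements "subtract the projection and repeat".

```python
import numpy as np

def principal_axes(X, k):
    """Top-k principal axes of the data matrix X (rows are points)."""
    Xc = X - X.mean(axis=0)               # centre the cloud
    axes = []
    for _ in range(k):
        # leading right singular vector = direction of maximal dispersion
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        v = Vt[0]
        axes.append(v)
        Xc = Xc - np.outer(Xc @ v, v)     # subtract the projection and repeat
    return np.array(axes)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
print(principal_axes(X, 2))
```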

Principal points (K-means): approximation by smaller finite sets. Data points x(j) are approximated by centres y(i):
1. Select several centres;
2. Attach data points to the closest centres by springs;
3. Minimize the energy;
4. Repeat 2 & 3 until convergence.
(Steinhaus, 1956; Lloyd, 1957; MacQueen, 1967)
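A direct transcription of these four steps into numpy (a sketch under the slide's assumptions, not the speaker's code). For quadratic spring energy, step 3 places each centre at the mean of its attached points, which makes this exactly Lloyd's algorithm:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]  # 1. select centres
    for _ in range(n_iter):
        # 2. attach each data point to its closest centre
        labels = np.argmin(((X[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
        # 3. minimizing the quadratic spring energy puts each centre
        #    at the mean of its attached points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):     # 4. repeat 2 & 3 until convergence
            break
        centres = new
    return centres, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ((0, 0), (3, 0), (0, 3))])
centres, labels = k_means(X, 3)
print(centres)
```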

Approximation by algebraic curves and surfaces. The 1st principal axis is a straight line; are we happy with this approximation? If not, extend the space by the values of additional functions and apply PCA there. For example, appending the coordinate x² to (x, y) allows fitting curves of the form y + a + bx + cx² = 0.
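One way to read "apply PCA" here is to take the direction of minimal dispersion in the extended space, which gives the coefficients of the best-fitting linear relation. A numpy sketch with illustrative data (the coefficient names match the slide's curve):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, 300)
y = 0.5 * x**2 - x + 1.0 + rng.normal(scale=0.05, size=300)  # noisy parabola

Z = np.column_stack([x, x**2, y])       # extended feature space (x, x^2, y)
mu = Z.mean(axis=0)
_, _, Vt = np.linalg.svd(Z - mu, full_matrices=False)
v = Vt[-1]                              # direction of minimal dispersion
a = -v @ mu                             # constant term of v . (z - mu) = 0
# The fitted algebraic curve is b*x + c*x^2 + w*y + a = 0; up to scale this
# should recover (b, c, w, a) ~ (1, -0.5, 1, -1) for the data above.
print("(b, c, w, a) ~", v[0], v[1], v[2], a)
```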

Illustration: nonlinear happiness. [Figure: a principal curve fitted to country data, with Quality-of-Life values marked along the curve and the trajectory of Russia highlighted.] Each data vector x describes one country in one year (YEAR = 1989, …, 2005; COUNTRY = 1, …, 192) by four indicators: gross product per person ($/person), life expectancy (years), infant mortality (cases/1000), and tuberculosis incidence. The linear index explains 76%; the nonlinear index explains 93%.

Constructing elastic nets. [Figure: a net of nodes y embedded in the data space; an edge is a pair of nodes E(0), E(1), and a rib is a triple of nodes R(1), R(0), R(2) with R(0) in the centre.]

Definition of elastic energy (an approach we borrow from splines). Data points $X_j$ are attached to their closest nodes $y$; edges $E^{(i)}$ have endpoints $E^{(i)}(0)$, $E^{(i)}(1)$; ribs $R^{(j)}$ (pairs of adjacent edges) have central node $R^{(j)}(0)$ and endpoints $R^{(j)}(1)$, $R^{(j)}(2)$. The energy of the embedded net is

$U = U^{(Y)} + U^{(E)} + U^{(R)}$, where
$U^{(Y)} = \frac{1}{|X|}\sum_j \|X_j - y(X_j)\|^2$ (approximation),
$U^{(E)} = \lambda \sum_i \|E^{(i)}(1) - E^{(i)}(0)\|^2$ (stretching),
$U^{(R)} = \mu \sum_j \|R^{(j)}(1) + R^{(j)}(2) - 2R^{(j)}(0)\|^2$ (bending).
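A minimal numpy sketch of this energy, assuming the reconstruction above; the node positions, edge list, and rib list are illustrative placeholders:

```python
import numpy as np

def elastic_energy(X, nodes, edges, ribs, lam=0.01, mu=0.1):
    # approximation energy: mean squared distance from each data point
    # to its closest node
    d2 = ((X[:, None, :] - nodes[None]) ** 2).sum(-1)
    U_Y = d2.min(axis=1).mean()
    # stretching energy of the edges E(i) = (i0, i1)
    U_E = lam * sum(((nodes[i1] - nodes[i0]) ** 2).sum() for i0, i1 in edges)
    # bending energy of the ribs R(j) = (centre, end1, end2)
    U_R = mu * sum(((nodes[e1] + nodes[e2] - 2 * nodes[c]) ** 2).sum()
                   for c, e1, e2 in ribs)
    return U_Y + U_E + U_R

nodes = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.0]])
edges = [(0, 1), (1, 2)]
ribs = [(1, 0, 2)]                      # centre node 1, ends 0 and 2
X = np.array([[0.5, 0.2], [1.5, -0.1]])
print(elastic_energy(X, nodes, edges, ribs))
```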

Are nonlinear projections better than linear projections? [Figures: linear vs. nonlinear projections of microarray data: breast cancer (Wang et al., 2005) and bladder cancer (Dyrskjot et al., 2003).]

Principal graphs? A graph is embedded into the data space R^N by a map φ(V) of its vertex set V; the building blocks are 2-stars and 3-stars. Can such embedded graphs serve as principal objects for data approximation?

Generalization: what is a principal graph? The ideal object is a pluriharmonic graph embedment. An elastic k-star has k edges and k+1 nodes: a centre $S_0$ and leaves $S_1, \dots, S_k$. The ideal position of $S_0$ is the mean point of the star's leaves (a negative, repulsing spring pulls the centre towards this mean), and the branching energy is, up to normalization, $\mu \, \| \varphi(S_0) - \frac{1}{k} \sum_{i=1}^{k} \varphi(S_i) \|^2$. A primitive elastic graph is a graph in which every non-terminal node with k edges is an elastic k-star; the graph energy is the sum of the stretching energies of the edges and the branching energies of the stars (2-stars play the role of ribs, 3-stars of branching points). Pluriharmonic graph embedments (zero branching energy at every star) generalize the straight line, the rectangular grid (with a proper choice of k-stars), etc.
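The pluriharmonicity condition is easy to check numerically. A sketch assuming the branching-energy form above (normalization conventions vary):

```python
import numpy as np

def star_energy(centre, leaves, mu=1.0):
    """Branching energy of a k-star: zero iff the centre is the leaves' mean."""
    leaves = np.asarray(leaves)
    return mu * ((centre - leaves.mean(axis=0)) ** 2).sum()

leaves = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
print(star_energy(leaves.mean(axis=0), leaves))   # pluriharmonic 3-star: 0.0
print(star_energy(np.array([0.0, 0.0]), leaves))  # displaced centre: > 0
```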

Principal harmonic dendrites (trees) approximating complex data structures. [Figures: the same data approximated by linear PCA, nonlinear PCA, and branching PCA.]

Visualization of the 7-cluster genome sequence structure. [Figures: algorithm iterations; a 3D PCA plot; a "metro map" of the principal tree.] Clusters that overlap in the 3D PCA plot are in fact well separated, and the principal tree reveals this.

And much more for low-dimensional subsets: Locally Linear Embedding, Isomap, Laplacian Eigenmaps, nonlinear multidimensional scaling, Independent Component Analysis, persistent cohomology, …

Measure concentration effects: self-simplification in large dimension. [Figure: the ball B^n and the spheres S^n, S^{n-1}, with the one-dimensional projection ("shadow") of the sphere.] The density of the one-coordinate shadow of the uniform distribution on $S^{n-1}$ is proportional to $(1-x^2)^{(n-3)/2}$; for large $n$ it approaches a Gaussian: the Maxwell distribution. (Maxwell, Gibbs, …, Milman, Talagrand, Gromov, …)
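This is easy to see numerically (a sketch, not from the talk): sample uniformly from the sphere of radius √n in R^n and look at a single coordinate. Its standard deviation is 1 for every n; what changes is the shape, whose kurtosis moves from 1.8 (a uniform shadow at n = 3, by Archimedes' theorem) towards the Gaussian value 3:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (3, 10, 100, 1000):
    g = rng.normal(size=(20_000, n))
    # uniform sample on the sphere of radius sqrt(n) in R^n
    pts = np.sqrt(n) * g / np.linalg.norm(g, axis=1, keepdims=True)
    x = pts[:, 0]                       # the 1-D "shadow" of the sphere
    kurt = (x**4).mean() / (x**2).mean() ** 2
    print(f"n={n}: std={x.std():.3f}, kurtosis={kurt:.2f}")  # kurtosis -> 3
```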

A 3D representation of an 8D hypercube: the body has the same radial distribution of mass and the same number of vertices as the hypercube. A very small fraction of the mass lies near a vertex, and most of the interior is void. (Illustration by Hamprecht & Agrell, 2002)

Strange properties of high-dimensional sets. [Figures, by V. Pestov, 2005: the observable diameter of the sphere S^n for n = 3, 10, 100, …; the distribution of distances between pairs of points in the unit hypercube I^n for n = 3, 10, 100, …, estimated from random samples of 10,000 pairs.]
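The second picture can be reproduced in a few lines (a sketch mirroring the slide's sampling setup): the mean pairwise distance in I^n grows like √(n/6), while the relative spread shrinks, so distances concentrate:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (3, 10, 100, 1000):
    a = rng.random((10_000, n))
    b = rng.random((10_000, n))
    d = np.linalg.norm(a - b, axis=1)   # 10,000 random pairs, as on the slide
    print(f"n={n}: mean={d.mean():.3f}, relative std={d.std() / d.mean():.4f}")
```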

Three provinces of the Complexity Land: reducible models (principal components, …); self-simplification (statistical physics, …); and, between them, wild complexity: ???
