Extending metric multidimensional scaling with Bregman divergences

Extending metric multidimensional scaling with Bregman divergences Jigang Sun and Colin Fyfe

Visualising 18-dimensional data

Outline Bregman divergence. Multidimensional scaling (MDS). Extending MDS with Bregman divergences. Relating the Sammon mapping to mappings with Bregman divergences. Comparison of effects and explanation. Conclusion

Strictly Convex function Pictorially, a strictly convex function F(x) lies below the segment connecting any two points p and q on its graph.

Bregman Divergences d_φ(x, y) = φ(x) − φ(y) − ∇φ(y)·(x − y) is the Bregman divergence between x and y based on the convex function φ. The Taylor series expansion is φ(x) = φ(y) + ∇φ(y)·(x − y) + higher-order terms, so d_φ(x, y) is exactly the higher-order remainder of expanding φ(x) around y.
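As a concrete illustration of this definition (added here; the function and variable names are my own, not from the slides), a minimal Python sketch:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence d_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# Example convex function: phi(z) = sum(z log z) on positive vectors.
phi = lambda z: np.sum(z * np.log(z))
grad_phi = lambda z: np.log(z) + 1.0

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.3, 0.3, 0.4])
print(bregman(phi, grad_phi, x, y))   # non-negative, zero only when x == y
```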

Bregman Divergences

Euclidean distance is a Bregman divergence
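A short worked check (added here, not on the original slide): take the squared Euclidean norm as base function,

$$\varphi(z) = \|z\|^2, \qquad \nabla\varphi(z) = 2z,$$
$$d_\varphi(x, y) = \|x\|^2 - \|y\|^2 - \langle 2y,\, x - y\rangle = \|x\|^2 - 2\langle x, y\rangle + \|y\|^2 = \|x - y\|^2,$$

so the squared Euclidean distance is the Bregman divergence generated by $\varphi(z) = \|z\|^2$.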

Kullback Leibler Divergence

Generalised Information Divergence φ(z)=z log(z)
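A worked derivation (added): with $\varphi(z) = \sum_i z_i \log z_i$ the gradient is $\nabla\varphi(y)_i = \log y_i + 1$, and

$$d_\varphi(x, y) = \sum_i x_i \log x_i - \sum_i y_i \log y_i - \sum_i (\log y_i + 1)(x_i - y_i) = \sum_i x_i \log\frac{x_i}{y_i} - \sum_i x_i + \sum_i y_i,$$

which is the generalised information divergence; when $x$ and $y$ are probability vectors the last two sums cancel and it reduces to the Kullback-Leibler divergence.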

Other Divergences Itakura-Saito divergence, Mahalanobis distance, logistic loss, and in fact any convex function generates one.
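For example (added for concreteness), the Itakura-Saito divergence is the scalar Bregman divergence generated by $\varphi(z) = -\log z$:

$$d_\varphi(x, y) = -\log x + \log y + \frac{1}{y}(x - y) = \frac{x}{y} - \log\frac{x}{y} - 1.$$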

Some Properties d_φ(x, y) ≥ 0, with equality iff x = y. Not a metric, since in general d_φ(x, y) ≠ d_φ(y, x) (though d(x, y) = d_φ(x, y) + d_φ(y, x) is symmetric). Convex in the first argument. Linear in the base function: d_{φ+aγ}(x, y) = d_φ(x, y) + a·d_γ(x, y).

Multidimensional Scaling Creates one latent point for each data point. The latent space is often 2 dimensional. Positions the latent points so that they best represent the data distances. Two latent points are close if the two corresponding data points are close. Two latent points are distant if the two corresponding data points are distant.

Classical/Basic Metric MDS We minimise the stress function E = Σ_{i<j} (d_ij − L_ij)², where d_ij is the distance between points i and j in data space and L_ij is the distance between the corresponding points in latent space.

Sammon Mapping (1969) Focuses on small distances: E = (1 / Σ_{i<j} d_ij) Σ_{i<j} (d_ij − L_ij)² / d_ij, so for the same absolute error, the smaller distance is given the bigger stress.
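The two stress functions side by side, as a minimal Python sketch (assuming the standard formulations written above; the helper names are mine):

```python
import numpy as np
from scipy.spatial.distance import pdist

def basic_mds_stress(X, Y):
    """Basic metric MDS stress: sum of squared distance differences."""
    d = pdist(X)   # pairwise Euclidean distances in data space
    L = pdist(Y)   # pairwise Euclidean distances in latent space
    return np.sum((d - L) ** 2)

def sammon_stress(X, Y):
    """Sammon stress: same residuals, but errors on small distances weighted by 1/d."""
    d = pdist(X)
    L = pdist(Y)
    return np.sum((d - L) ** 2 / d) / np.sum(d)
```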

Possible Extensions Use Bregman divergences instead of Euclidean distances in the data space and/or the latent space, or even a Bregman divergence between the data-space and latent-space distances themselves.

Metric MDS with a Bregman divergence between distances: Euclidean distance on the latent points, any divergence on the data, and (for example) the Itakura-Saito divergence between the two sets of distances, giving a Sammon-like criterion to minimise.

Moving the Latent Points F1 for the Itakura-Saito divergence, F2 for Euclidean distance, F3 for any divergence.
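A rough sketch of this step (my own illustration, not the authors' algorithm): minimise the Bregman stress by moving the latent points downhill, here with the Itakura-Saito divergence between distances, the latent distance taken as the first argument, and a crude finite-difference gradient standing in for the analytic forms F1, F2, F3:

```python
import numpy as np
from scipy.spatial.distance import pdist

def itakura_saito(L, d):
    # Bregman divergence with base function phi(x) = -log x; argument order is my assumption.
    return L / d - np.log(L / d) - 1.0

def bmmds_stress(y_flat, d, n, q):
    L = pdist(y_flat.reshape(n, q)) + 1e-9   # latent distances; offset avoids log(0)
    return np.sum(itakura_saito(L, d))

def bmmds(X, q=2, iters=500, lr=1e-3, eps=1e-6, seed=0):
    """Move latent points downhill on the Bregman stress (finite-difference gradient)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    d = pdist(X) + 1e-9                      # data-space distances
    y = rng.normal(scale=1e-2, size=n * q)   # random initial latent configuration
    for _ in range(iters):
        base = bmmds_stress(y, d, n, q)
        grad = np.empty_like(y)
        for k in range(y.size):              # crude forward-difference gradient
            y2 = y.copy(); y2[k] += eps
            grad[k] = (bmmds_stress(y2, d, n, q) - base) / eps
        y -= lr * grad
    return y.reshape(n, q)
```

In practice one would use the analytic gradient and an adaptive step size; the finite-difference loop is only to keep the sketch short.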

The algae data set

The algae data set

Two representations The standard Bregman representation: d_φ(x, y) = φ(x) − φ(y) − φ′(y)(x − y). Concentrating on the residual errors (the Taylor remainder): d_φ(x, y) = Σ_{m≥2} φ^(m)(y)(x − y)^m / m!.

Basic MDS is a special BMMDS The base convex function is chosen as φ(x) = x², whose second derivative is the constant 2 and whose higher-order derivatives are all zero. So d_φ(L_ij, d_ij) = (d_ij − L_ij)² is derived, which is exactly the basic MDS stress term.

Sammon Mapping Select the weight 1/d_ij (with normaliser Σ_{i<j} d_ij); then the stress becomes the Sammon stress given earlier.

Example 2: Extended Sammon The base convex function is φ(x) = x log x. This is equivalent to applying the generalised information divergence to the distances, and the Sammon mapping is rewritten in a matching form so that the two can be compared term by term.

Sammon and Extended Sammon The two stresses share a common term; the Sammon mapping is thus an approximation to the Extended Sammon mapping via this common term, while the Extended Sammon mapping makes further adjustments based on the higher-order terms.
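To make the common term concrete, here is a sketch under two assumptions of mine: the base convex function is $\varphi(x) = x\log x$ and the latent distance is the first argument of the divergence. The residual-error form then gives

$$d_\varphi(L_{ij}, d_{ij}) = \sum_{m\ge 2} \frac{\varphi^{(m)}(d_{ij})}{m!}\,(L_{ij} - d_{ij})^m = \frac{(d_{ij} - L_{ij})^2}{2\,d_{ij}} - \frac{(L_{ij} - d_{ij})^3}{6\,d_{ij}^2} + \cdots,$$

whose leading term is proportional to the Sammon per-pair stress $(d_{ij} - L_{ij})^2 / d_{ij}$; the higher-order terms are the extra adjustments made by the Extended Sammon mapping.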

An Experiment on the Swiss roll data set

Distance preservation

Relative standard deviation

Relative standard deviation On short distances, the Sammon mapping has smaller variance than the Basic MDS, and the Extended Sammon mapping has smaller variance than the Sammon mapping, i.e. control of small distances is progressively enhanced. Large distances are given more and more freedom in the same order.

LCMC: local continuity meta-criterion (L. Chen, 2006) A common measure for assessing the projection quality of different MDS methods in terms of neighbourhood preservation. Its value lies between 0 and 1; the higher, the better.
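A minimal sketch of LCMC at a fixed neighbourhood size k (my own implementation of the usual definition: the average overlap between the k-nearest-neighbour sets in data space and latent space, corrected for the overlap expected by chance):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def lcmc(X, Y, k=10):
    """Local continuity meta-criterion: mean k-NN overlap minus k/(n-1)."""
    n = X.shape[0]
    DX = squareform(pdist(X))   # data-space distance matrix
    DY = squareform(pdist(Y))   # latent-space distance matrix
    np.fill_diagonal(DX, np.inf)   # exclude each point from its own neighbourhood
    np.fill_diagonal(DY, np.inf)
    overlap = 0
    for i in range(n):
        nx = set(np.argsort(DX[i])[:k])
        ny = set(np.argsort(DY[i])[:k])
        overlap += len(nx & ny)
    return overlap / (n * k) - k / (n - 1)
```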

Quality assessed by LCMC

Why the Extended Sammon mapping outperforms the Sammon mapping: stress formation

Features of the base convex function Recall that the base convex function for the Extended Sammon mapping is φ(x) = x log x. Its higher-order derivatives are φ^(m)(x) = (−1)^m (m − 2)! / x^(m−1) for m ≥ 2: the even-order derivatives are positive and the odd-order ones are negative.

Stress comparison between Sammon and Extended Sammon

Stress configured by Sammon, calculated and mapped by Extended Sammon

Stress configured by Sammon, calculated and mapped by Extended Sammon The Extended Sammon mapping calculates stress on the basis of the configuration found by the Sammon mapping. For the shorter distances, the mean stresses calculated by the Extended Sammon mapping are much higher than those under the Sammon mapping; for the longer distances, the calculated mean stresses are clearly lower than the Sammon mapping's. The Extended Sammon mapping makes short mapped distances even shorter and long ones even longer.

Stress formation by items

Generalisation: from MDS to Bregman divergences A group of MDS methods is generalised as E = (1/C) Σ_{i<j} w_ij (d_ij − L_ij)², where C is a normalisation scalar used for quantitative comparison purposes (it does not affect the mapping results) and w_ij is a weight function, which can also account for missing samples. The Basic MDS and the Sammon mapping belong to this group.

Generalisation: from MDS to Bregman divergences If C = 1 and the weight is set to w_ij = φ″(d_ij)/2, then the generalised MDS is the first term of BMMDS, so BMMDS is an extension of MDS. Recall that BMMDS is equivalent to the residual-error (Taylor series) form, a sum over all the higher-order terms.
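A sketch of why this holds (my reconstruction, under the same assumptions as the expansion above): truncating the residual-error form after its first term gives

$$\sum_{i<j} d_\varphi(L_{ij}, d_{ij}) \approx \sum_{i<j} \frac{\varphi''(d_{ij})}{2}\,(d_{ij} - L_{ij})^2,$$

which is the generalised MDS stress with $C = 1$ and weights $w_{ij} = \varphi''(d_{ij})/2$; BMMDS then adds the higher-order terms on top.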

Criterion for base convex function selection In order to focus on local distances and concentrate less on long distances, the base convex function must have a second derivative that decreases with distance. Not all convex functions can be considered: F(x) = exp(x), for example, fails this. The second-order derivative is the primary consideration: we wish it to be big for small distances and small for long distances, since it represents the focusing power on local distances.

Two groups of convex functions For these functions the even-order derivatives are positive and the odd-order ones are negative. Number 1 in the first group is the base convex function of the Extended Sammon mapping.

Focusing power

Different strategies for focusing power The vertical axis is the logarithm of the second-order derivative. The two groups use different strategies for increasing focusing power. In the first group, the second-order derivatives become ever higher for small distances and ever lower for long distances. In the second group, the second-order derivatives have a limited maximum value for very small distances, but fall ever more steeply for long distances as λ increases.

Two groups of Bregman divergences Elastic scaling (Victor E. McGee, 1966)

Experiment on Swiss roll: the First Group

Experiment on Swiss roll: First Group For the Extended Sammon, the Itakura-Saito, and the further divergences in the first group, local distances are mapped better and better, and long distances are stretched so that the unfolding trend is obvious.

Distance mapping: First Group

Standard deviation: First Group

LCMC measure: First Group

Experiment on Swiss roll: Second Group

Distance mapping: Second Group

Standard deviation: Second Group

LCMC: Second Group

OpenBox, Sammon and the First Group

Second Group on OpenBox

Distance mapping: two groups

LCMC: two groups

Standard deviation: two groups

Swiss roll distance distribution

OpenBox distance distribution

Swiss roll vs OpenBox Distance distributions: in the Swiss roll, the proportion of long distances is greater than that of short distances; in the OpenBox, there is a very large number of medium distances, with small distances making up much of the rest. Mapping results: on the Swiss roll, long distances are stretched and local distances are usually mapped shorter; on the OpenBox, the longest distances are not noticeably stretched, and are perhaps even compressed, while small distances are mapped longer than their original data-space values by some methods. Conclusion: there is a tug of war between local and long distances, each trying to be mapped to its original value in data space.

Left and right Bregman divergences All of the above uses left divergences, i.e. the latent points appear in the left (first) position of the divergence, ... We can show that right divergences produce extensions of curvilinear component analysis (Sun et al., ESANN 2010).

Conclusion Applied Bregman divergences to multidimensional scaling. Shown that the basic metric MDS is a special case and that the Sammon mapping approximates a BMMDS. Improved upon both with two families of divergences. Shown results on two artificial data sets.