Published by Lewis Little. Modified over 8 years ago.
1
Visual Correlation Analysis of Numerical and Categorical Data on the Correlation Map Zhiyuan Zhang, Kevin T. McDonnell, Erez Zadok, Klaus Mueller
2
What’s Correlation? A statistical measure that indicates the extent to which two or more variables fluctuate together
3
What’s the Problem With These Visualizations? It is really hard to tell exactly how strongly the variables are correlated. Yes, there have been papers that studied this. But can you tell which variable is 2nd-most correlated with ‘Income’? Yes, we can use a correlation matrix heat map. But brightness and color are poor visual variables for communicating quantitative information.
4
What’s the #1 Visual Variable for Quantitative Information (QI)? The spatial (planar) variables!! That’s why geographic maps work so well. Can we build a correlation map? You bet… (J. Bertin, ‘67)
5
It’s Actually Quite Simple… Create a correlation matrix. Run a mass-spring model. You can even use it to order your parallel coordinate axes via TSP: run Traveling Salesman on the correlation nodes. But is it really that simple?
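The three steps on this slide can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the spring rest length `1 - |r|`, the step size, the nearest-neighbour heuristic standing in for a real TSP solver, and the sample data are all hypothetical choices.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """columns: dict name -> list of numbers. Returns (names, matrix of r)."""
    names = list(columns)
    mat = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    return names, mat

def spring_layout(corr, iterations=500, step=0.05, seed=0):
    """Place one 2D node per variable; springs pull node distances
    towards 1 - |r| (assumed rest length, not taken from the slides)."""
    rng = random.Random(seed)
    n = len(corr)
    pos = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(iterations):
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9
                rest = 1.0 - abs(corr[i][j])
                force = dist - rest  # positive pulls together, negative pushes apart
                fx += force * dx / dist
                fy += force * dy / dist
            pos[i][0] += step * fx
            pos[i][1] += step * fy
    return pos

def greedy_axis_order(corr, start=0):
    """Nearest-neighbour TSP heuristic: repeatedly move to the unvisited
    variable that is most strongly correlated with the current one."""
    order = [start]
    remaining = set(range(len(corr))) - {start}
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda j: abs(corr[last][j]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Hypothetical sample data for five cars.
columns = {
    "weight": [2.0, 2.5, 3.0, 3.5, 4.0],
    "hp": [90, 110, 120, 160, 180],
    "mpg": [35, 30, 27, 20, 17],
    "noise": [3, 1, 4, 1, 5],
}
names, corr = correlation_matrix(columns)
positions = spring_layout(corr)
order = greedy_axis_order(corr)
print([names[i] for i in order])
```

A greedy tour is not an optimal Traveling Salesman solution; it is used here only to keep the sketch short.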
6
TM-FAQ … The Most Frequently Asked Q. Sure, I know about numerical variables. But how about categorical variables? And what about when there are both numerical and categorical variables in the data? Like a car’s mpg (numerical variable) and its color (categorical variable): how do they correlate?
7
Unifying Categorical & Numerical Variables. Two choices: (1) transform numerical to categorical and use Cramér’s V; (2) transform categorical to numerical and use Pearson’s r. Binning numerical variables to categories results in loss of resolution … not good. Better to use the second option … transform categorical to numerical. But: no known procedures.
8
The Coefficient of Determination r². Gauges how well the data fit a regression model. r² is the square of the correlation coefficient r. The similarity to correlation is no accident: good correlation ⇒ good (linear) regression model. [Figure: two scatterplots with regression lines, labeled "uncorrelated, poor fit" and "correlated, good fit"]
9
How Can This Help? Let’s plot a numerical (mpg) and a categorical variable (color). Assume we have 6 cars: color (= independent variable) and mpg (= dependent variable). [Figure: two plots of mpg against color, one with r² = 0.2 and one with r² = 0.9]
10
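Transforming the Categorical Variable. [Figure: regression plot with axes x and y, annotated with the residual sum of squares (RSS) and the total sum of squares (TSS)]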
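For readers who want the quantities on this slide spelled out, here is a small sketch of how r² follows from RSS and TSS in a simple least-squares fit, using the standard definition r² = 1 − RSS/TSS. The function names and the sample data are hypothetical, and the formulas on the original slide are not recoverable from this transcript.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(xs, ys):
    """Coefficient of determination: 1 - RSS / TSS."""
    slope, intercept = linear_fit(xs, ys)
    my = sum(ys) / len(ys)
    rss = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - my) ** 2 for y in ys)
    return 1.0 - rss / tss

def pearson(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: car weight vs. mpg.
weight = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [35, 30, 27, 20, 17]
print(r_squared(weight, mpg), pearson(weight, mpg) ** 2)
```

For a single-predictor linear regression the two printed numbers agree up to floating-point error, which is the link between r² and r that the previous slide points out.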
11
Regression With Categorical Variables
12
Efficiently Transforming X. There’s no need to compute the regression model. Instead, minimize RSS such that … After some manipulations… minimization occurs when the transformed X(i) equals the mean of all Y where X = level i. [Figure: plot of Y against X]
14
Efficiently Transforming X. Applied to the cars. [Figure: plot of mpg against color]
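A minimal sketch of the transformation described on these slides: each categorical level is replaced by the mean of the numerical variable over the rows with that level, and Pearson's r is then computed on the transformed values. The six cars, their colors, and their mpg values are made-up sample data, and the helper names are hypothetical.

```python
import math
from collections import defaultdict

def transform_levels(categories, ys):
    """Map each categorical level to the mean of Y over rows with that level."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, ys):
        sums[c] += y
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}

def pearson(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for 6 cars.
colors = ["red", "blue", "green", "red", "blue", "green"]
mpg = [30.0, 22.0, 26.0, 32.0, 20.0, 27.0]

level_values = transform_levels(colors, mpg)
# level_values == {"red": 31.0, "blue": 21.0, "green": 26.5}

# Data-driven coding: each color becomes the mean mpg of its level.
x_transformed = [level_values[c] for c in colors]
r_transformed = pearson(x_transformed, mpg)

# Arbitrary integer coding, for comparison only.
arbitrary = {"red": 0, "blue": 1, "green": 2}
r_arbitrary = pearson([arbitrary[c] for c in colors], mpg)

print(r_transformed, r_arbitrary)
```

On this sample the mean-based coding yields a much stronger correlation than the arbitrary integer coding, because the levels are both re-ordered and re-spaced by the data.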
15
Multivariate Regression / Correlation. Categorical variables may participate in more than one pair. This generalizes the problem to multivariate regression. Multivariate regression solves each variable separately, so the re-ordering/re-spacing scheme can also be applied separately. But note that the order/spacing of a categorical variable may be different in each N/C (numerical/categorical) pair. Note also that the order/spacing is data-driven: different data will produce different solutions.
16
First Transformation Results. Auto and car dataset visualized in parallel coordinates. Correlations can be observed much more clearly after the transformations.
17
Interaction with the Correlation Network. [Figure panels: all edges; filtered by strength; attribute-centric; subset of attributes]
18
Multiscale Zooming
19
Merging Operations. We choose the vertex with the largest accumulated correlation. Some edges cannot be merged.
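The slide does not define "accumulated correlation". One plausible reading, used here purely as an assumption, is the sum of absolute correlations between a vertex and all other vertices. The sketch below picks the vertex with the largest such sum; the matrix and the function names are hypothetical.

```python
def accumulated_correlation(corr, i):
    """Sum of |r| between vertex i and every other vertex (assumed definition)."""
    return sum(abs(corr[i][j]) for j in range(len(corr)) if j != i)

def pick_merge_vertex(corr):
    """Index of the vertex with the largest accumulated correlation."""
    return max(range(len(corr)), key=lambda i: accumulated_correlation(corr, i))

# Hypothetical 3-variable correlation matrix.
corr = [
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.1],
    [0.8, 0.1, 1.0],
]
chosen = pick_merge_vertex(corr)
print(chosen)
```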
20
Exploring Correlation Sensitivity. Correlation strength can often be improved by constraining a variable’s value range (bracketing). This limits the derived relationships to this value range. Such limits are commonplace in targeted marketing, etc. [Figure panels: no bracketing; lower price range; higher price range]
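Bracketing can be illustrated by recomputing the correlation on only the rows whose value falls inside a chosen range. The following is a sketch with made-up price and sales numbers; the function name `bracketed_correlation` is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def bracketed_correlation(xs, ys, low, high):
    """Correlation of x and y restricted to rows with low <= x <= high."""
    pairs = [(x, y) for x, y in zip(xs, ys) if low <= x <= high]
    if len(pairs) < 2:
        raise ValueError("bracket must contain at least two rows")
    bx = [p[0] for p in pairs]
    by = [p[1] for p in pairs]
    return pearson(bx, by)

# Hypothetical data: sales fall with price at the low end, then flatten out.
price = [1, 2, 3, 4, 5, 6, 7, 8]
sales = [8, 7, 6, 5, 5, 6, 5, 6]

r_all = pearson(price, sales)
r_low = bracketed_correlation(price, sales, 1, 4)
r_high = bracketed_correlation(price, sales, 5, 8)
print(r_all, r_low, r_high)
```

In this sample the lower price bracket shows a much stronger (negative) relationship than the full range, and the relationship found there applies only inside that bracket.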
21
Multivariate Analysis of University Data. Fused dataset of 50 US colleges. US News: academic rankings. College Prowler: survey on campus life attributes.
22
Integrating Data – the Subspace Scatterplot. Unify the correlation network with parallel coordinates. Steps: Delaunay triangulation; sort edges; sort correlations; threshold the edge list or interactively pop edges; this generates (concave) polygons; map points using … Each polygon represents a data subspace.
23
How to Read the Subspace Scatterplots. Generalizes Radviz from a circle to a polygon. The location of a projected point indicates how much it gravitates towards a particular attribute. Observe correlations with edges. Observe biases, trends, and tradeoffs with the scatterplot. Example observations: a diverse set of cars with a continuous distribution; a tradeoff between weight and horsepower; cars with lower weight and hp get better mpg.
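The mapping formula used on the slides is not recoverable from this transcript. As a rough illustration of the Radviz idea the slide refers to, the sketch below places a data point at the weighted average of the polygon's vertices, with each vertex weighted by the point's normalized value for that attribute. This is the classic Radviz rule applied to arbitrary anchor positions, an assumption rather than the authors' formula; the names and data are hypothetical.

```python
def radviz_project(values, vertices):
    """Project one data row into a polygon.

    values:   normalized attribute values in [0, 1], one per vertex
    vertices: (x, y) anchor position of each attribute
    Returns the weighted average of the vertices; rows whose values are
    all zero fall back to the plain centroid.
    """
    total = sum(values)
    if total == 0:
        n = len(vertices)
        return (sum(v[0] for v in vertices) / n, sum(v[1] for v in vertices) / n)
    x = sum(w * v[0] for w, v in zip(values, vertices)) / total
    y = sum(w * v[1] for w, v in zip(values, vertices)) / total
    return (x, y)

# Hypothetical triangle with one attribute anchored at each vertex.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

only_first = radviz_project([1.0, 0.0, 0.0], triangle)  # sits on the first vertex
balanced = radviz_project([1.0, 1.0, 1.0], triangle)    # sits at the centroid
between = radviz_project([0.0, 2.0, 2.0], triangle)     # midway between vertices 2 and 3
print(only_first, balanced, between)
```

A point with a high value for one attribute is pulled towards that attribute's vertex, which matches the reading advice on the slide.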
24
Example – Sales Campaign Dataset. [Figure labels: # opportunities, pipeline, revenue]
25
Conclusions. The correlation map is an integrative visualization of: a multi-scale correlation network; clusters in user-definable high-dimensional subspaces. It supports numerical and categorical variables in a unified way. It also enables interactive variable selection, interactive data brushing, and cluster analysis/sculpting. Future work: extend to causal networks and inference.
26
Questions? Supported by NSF, DOE, ITCCP (NIPA-Korea)