Visual Correlation Analysis of Numerical and Categorical Data on the Correlation Map Zhiyuan Zhang, Kevin T. McDonnell, Erez Zadok, Klaus Mueller.

1 Visual Correlation Analysis of Numerical and Categorical Data on the Correlation Map Zhiyuan Zhang, Kevin T. McDonnell, Erez Zadok, Klaus Mueller

2 What’s Correlation? A statistical measure that indicates the extent to which two or more variables fluctuate together
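As a minimal sketch of one such measure, the snippet below computes Pearson's r for two numerical variables. The `weight` and `mpg` values are hypothetical toy data, not taken from the presentation.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: heavier cars tend to have lower mpg.
weight = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [38.0, 33.0, 29.0, 24.0, 20.0]
r = pearson_r(weight, mpg)
print(f"r = {r:.3f}")  # strongly negative: the variables move in opposite directions
```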

3 What’s the Problem With These Visualizations? It is really hard to tell exactly how strongly the variables are correlated. Yes, there have been papers that studied this. But can you tell which variable is 2nd-most correlated with ‘Income’? Yes, we can use a correlation matrix heat map. But brightness and color are poor visual variables for communicating quantitative information.

4 What’s the #1 Visual Variable for Quantitative Information (QI)? The spatial (planar) variables!! That’s why geographic maps work so well. Can we build a correlation map? You bet… (J. Bertin, ‘67)

5 It’s Actually Quite Simple… Create a correlation matrix. Run a mass-spring model. You can even use it to order your parallel coordinate axes via TSP: run Traveling Salesman on the correlation nodes. But is it really that simple?
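The first two steps can be sketched as follows. This is an illustrative mass-spring layout, not the authors' implementation: it assumes that the rest length of the spring between two variables is 1 − |r|, so strongly correlated variables are pulled close together. The correlation matrix, step size, and iteration count are hypothetical choices.

```python
import math

def spring_layout(corr, iterations=500, step=0.05):
    """Place one 2D node per variable so that distance approximates 1 - |r|."""
    n = len(corr)
    # Start on a unit circle so the run is deterministic.
    pos = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i in range(n)]
    for _ in range(iterations):
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9
                target = 1.0 - abs(corr[i][j])  # assumed spring rest length
                # Hooke's law: pull together if too far apart, push apart if too close.
                f = (dist - target) / dist
                fx += f * dx
                fy += f * dy
            pos[i][0] += step * fx
            pos[i][1] += step * fy
    return pos

# Hypothetical correlation matrix for three variables A, B, C.
corr = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
pos = spring_layout(corr)

def node_distance(a, b):
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])

print(node_distance(0, 1), node_distance(0, 2))  # A-B ends up much closer than A-C
```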

6 TM-FAQ … The Most Frequently Asked Q. Sure, I know about numerical variables. But how about categorical variables? And what about when there are both numerical and categorical variables in the data? Like a car’s mpg and its color… how do they correlate? (Figure labels: numerical variable, categorical variable.)

7 Unifying Categorical & Numerical Variables Two choices: transform numerical to categorical → use Cramér’s V; transform categorical to numerical → use Pearson’s r. Binning numerical variables into categories results in loss of resolution … not good. Better to use the second option … transform categorical to numerical. No known procedures.

8 The Coefficient of Determination r² Gauges how well the data fit a regression model. r² is the square of the correlation coefficient r. The similarity to correlation is no accident: good correlation → good (linear) regression model. (Figure panels: uncorrelated, poor fit; correlated, good fit.)
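The relationship between the two quantities can be checked numerically. The sketch below fits a least-squares line, computes the coefficient of determination with the standard definition 1 − RSS/TSS (residual and total sum of squares), and compares it with the squared Pearson coefficient. The data values are hypothetical.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def r_squared(x, y):
    """Coefficient of determination of the least-squares line y = slope*x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    rss = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))  # residual SS
    tss = sum((b - my) ** 2 for b in y)                                   # total SS
    return 1.0 - rss / tss

# Hypothetical noisy linear data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]
print(r_squared(x, y), pearson_r(x, y) ** 2)  # the two values agree
```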

9 How Can This Help? Let’s plot a numerical variable (mpg) and a categorical variable (color). Assume we have 6 cars: color (= independent variable) and mpg (= dependent variable). (Figure: mpg plotted against color; panels labeled r² = 0.2 and r² = 0.9.)

10 Transforming the Categorical Variable (Figure: axes x and y, with RSS and TSS marked.)

11 Regression With Categorical Variables

12 Efficiently Transforming X There’s no need to compute the regression model. Instead, minimize RSS such that (constraint equation not preserved in the transcript). After some manipulations… minimization occurs when the transformed X(i) equals the mean of all Y where X = level i. (Figure axes: X, Y.)

14 Efficiently Transforming X Applied to the cars. (Figure axes: color, mpg.)
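A minimal sketch of this transformation, assuming that each categorical level is simply replaced by the mean of the numerical variable over the rows with that level. The six cars, their colors, and their mpg values are hypothetical, and the arbitrary integer coding is included only for comparison.

```python
import math
from collections import defaultdict

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def transform_levels(categories, values):
    """Replace each categorical level by the mean of `values` within that level."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, v in zip(categories, values):
        sums[c] += v
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories]

# Hypothetical data: 6 cars, color is categorical, mpg is numerical.
color = ["red", "red", "blue", "blue", "green", "green"]
mpg = [20.0, 22.0, 30.0, 32.0, 25.0, 27.0]

# Arbitrary integer coding (red=1, blue=2, green=3) for comparison.
arbitrary = [{"red": 1, "blue": 2, "green": 3}[c] for c in color]
transformed = transform_levels(color, mpg)

r_arbitrary = pearson_r(arbitrary, mpg)
r_transformed = pearson_r(transformed, mpg)
print(transformed)                 # per-level means, one entry per car
print(r_arbitrary, r_transformed)  # the mean-based coding correlates more strongly
```

Because the level values are derived from the data, a different dataset yields a different ordering and spacing of the same categories.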

15 Multivariate Regression / Correlation Categorical variables may participate in more than one pair. This generalizes the problem to multivariate regression. Multivariate regression solves each variable separately → the re-ordering/re-spacing scheme can also be applied separately. But note that the order/spacing of a categorical variable may be different in each numerical/categorical (N/C) pair. Note also that the order/spacing is data-driven → different data will produce different solutions.

16 First Transformation Results Auto and car datasets visualized in parallel coordinates. Correlations can be observed much more clearly after the transformations.

17 Interaction with the Correlation Network (Figure panels: all edges; filtered by strength; attribute-centric; subset of attributes.)
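These views can be thought of as filters over the edge list of the correlation network. The sketch below is a hypothetical illustration of three of them: the variable names, correlation values, helper functions, and threshold are assumptions, not part of the presented system.

```python
def edges_from_matrix(names, corr):
    """List every variable pair once as (name_a, name_b, correlation)."""
    n = len(names)
    return [(names[i], names[j], corr[i][j])
            for i in range(n) for j in range(i + 1, n)]

def filter_by_strength(edges, threshold):
    """Keep only edges whose absolute correlation reaches the threshold."""
    return [e for e in edges if abs(e[2]) >= threshold]

def attribute_centric(edges, attribute):
    """Keep only edges that touch the chosen attribute."""
    return [e for e in edges if attribute in (e[0], e[1])]

# Hypothetical variables and correlations.
names = ["mpg", "weight", "horsepower", "color"]
corr = [
    [1.0, -0.8, -0.7, 0.2],
    [-0.8, 1.0, 0.9, 0.1],
    [-0.7, 0.9, 1.0, 0.3],
    [0.2, 0.1, 0.3, 1.0],
]

all_edges = edges_from_matrix(names, corr)
strong_edges = filter_by_strength(all_edges, 0.5)
mpg_edges = attribute_centric(all_edges, "mpg")
print(len(all_edges), len(strong_edges), len(mpg_edges))
```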

18 Multiscale Zooming

19 Merging Operations We choose the vertex with the largest accumulated correlation. Some edges cannot be merged.
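The slide does not define "accumulated correlation". As a sketch under one plausible reading, the code below sums the absolute correlations of each vertex with all other vertices and picks the maximum. The variable names and matrix values are hypothetical.

```python
def accumulated_correlation(corr):
    """Sum of |r| between each vertex and every other vertex (assumed definition)."""
    n = len(corr)
    return [sum(abs(corr[i][j]) for j in range(n) if j != i) for i in range(n)]

def pick_merge_vertex(names, corr):
    """Return the vertex name with the largest accumulated correlation, plus its total."""
    totals = accumulated_correlation(corr)
    best = max(range(len(names)), key=lambda i: totals[i])
    return names[best], totals[best]

# Hypothetical variables and correlations.
names = ["mpg", "weight", "horsepower", "color"]
corr = [
    [1.0, -0.8, -0.7, 0.2],
    [-0.8, 1.0, 0.9, 0.1],
    [-0.7, 0.9, 1.0, 0.3],
    [0.2, 0.1, 0.3, 1.0],
]

vertex, total = pick_merge_vertex(names, corr)
print(vertex, total)
```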

20 Exploring Correlation Sensitivity Correlation strength can often be improved by constraining a variable’s value range (bracketing). This limits the derived relationships to this value range. Such limits are commonplace in targeted marketing, etc. (Figure panels: no bracketing; lower price range; higher price range.)
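A small sketch of bracketing, assuming it amounts to computing the correlation only over rows whose value falls inside a chosen range. The `price` and `sales` values are hypothetical and deliberately V-shaped, so the overall correlation vanishes while each bracket is perfectly correlated.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def bracketed_r(x, y, low, high):
    """Pearson's r restricted to rows where low <= x <= high."""
    pairs = [(a, b) for a, b in zip(x, y) if low <= a <= high]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    return pearson_r(xs, ys)

# Hypothetical data: sales fall with price in the low range, rise in the high range.
price = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
sales = [8.0, 7.0, 6.0, 5.0, 5.0, 6.0, 7.0, 8.0]

r_all = pearson_r(price, sales)
r_low = bracketed_r(price, sales, 1.0, 4.0)
r_high = bracketed_r(price, sales, 5.0, 8.0)
print(r_all, r_low, r_high)
```

Any relationship read off a bracket holds only inside that bracket, which is why the bracketed values differ so sharply from the unbracketed one here.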

21 Multivariate Analysis of University Data Fused dataset of 50 US colleges. US News: academic rankings. College Prowler: survey on campus life attributes.

22 Integrating Data – the Subspace Scatterplot Unify the correlation network with parallel coordinates. Steps: Delaunay triangulation; sort edges → sort correlations; threshold the edge list or interactively pop edges → generates (concave) polygons; map points using (formula not preserved in the transcript). Each polygon represents a data subspace.

23 How to Read the Subspace Scatterplots Generalizes Radviz from a circle to a polygon. The location of a projected point indicates how much it gravitates towards a particular attribute. Observe correlations with edges. Observe biases, trends, and tradeoffs with the scatterplot. Example: a diverse set of cars with a continuous distribution; a tradeoff between weight and horsepower; cars with lower weight and hp get better mpg.
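The exact mapping formula is not preserved in this transcript. As an illustration only, the sketch below uses the classic Radviz rule, where a point is the average of the anchor positions weighted by its attribute values, with the anchors placed at polygon vertices instead of on a circle. The anchor coordinates and attribute values are hypothetical, and values are assumed to be normalized to [0, 1].

```python
def radviz_point(values, anchors):
    """Weighted average of anchor positions; weights are normalized attribute values."""
    total = sum(values)
    if total == 0:
        # No attribute pulls on the point: fall back to the centroid of the anchors.
        n = len(anchors)
        return (sum(ax for ax, _ in anchors) / n, sum(ay for _, ay in anchors) / n)
    x = sum(v * ax for v, (ax, _) in zip(values, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(values, anchors)) / total
    return (x, y)

# Hypothetical triangle whose vertices stand for weight, horsepower, and mpg.
anchors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

heavy_car = radviz_point([1.0, 0.0, 0.0], anchors)     # sits on the weight vertex
balanced_car = radviz_point([0.5, 0.5, 0.5], anchors)  # sits at the centroid
efficient_car = radviz_point([0.1, 0.1, 0.8], anchors) # drawn towards the mpg vertex
print(heavy_car, balanced_car, efficient_car)
```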

24 Example – Sales Campaign Dataset (Figure labels: # opportunities, pipeline revenue.)

25 Conclusions The correlation map is an integrative visualization of multi-scale correlation network clusters in user-definable high-dimensional subspaces, and it supports numerical and categorical variables in a unified way. It also enables interactive variable selection, interactive data brushing, and cluster analysis/sculpting. Future work: extend to causal networks and inference.

26 Questions? Supported by NSF, DOE, ITCCP (NIPA-Korea)
