Visual Correlation Analysis of Numerical and Categorical Data on the Correlation Map Zhiyuan Zhang, Kevin T. McDonnell, Erez Zadok, Klaus Mueller.



What’s Correlation? A statistical measure that indicates the extent to which two or more variables fluctuate together.

What’s the Problem With These Visualizations? It is really hard to tell exactly how strongly the variables are correlated. Yes, there have been papers that studied this. But can you tell which variable is second-most correlated with ‘Income’? Yes, we can use a correlation matrix heat map. But brightness and color are poor visual variables for communicating quantitative information.

What’s the #1 Visual Variable for Quantitative Information (QI)? The spatial (planar) variables (J. Bertin, ‘67). That’s why geographic maps work so well. Can we build a correlation map? You bet…

It’s Actually Quite Simple… Create a correlation matrix. Run a mass-spring model. You can even use it to order your parallel coordinate axes via TSP: run Traveling Salesman on the correlation nodes. But is it really that simple?
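The recipe on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the column names and values are invented, the spring rest length 1 - |r| is an assumed mapping from correlation to distance, and a greedy nearest-neighbour walk stands in for a real Traveling Salesman solver.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return the variable names and the full matrix of pairwise correlations."""
    names = list(columns)
    return names, [[pearson(columns[a], columns[b]) for b in names] for a in names]

def spring_layout(corr, iterations=500, step=0.1):
    """Mass-spring layout: each pair is joined by a spring of rest length 1 - |r|."""
    n = len(corr)
    # Deterministic start: variables evenly spaced on the unit circle.
    pos = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i in range(n)]
    for _ in range(iterations):
        for i in range(n):
            for j in range(i + 1, n):
                target = 1.0 - abs(corr[i][j])  # assumed distance mapping
                dx, dy = pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9
                shift = step * (dist - target) / dist
                # Pull the pair together if too far apart, push apart if too close.
                pos[i][0] += shift * dx
                pos[i][1] += shift * dy
                pos[j][0] -= shift * dx
                pos[j][1] -= shift * dy
    return pos

def greedy_axis_order(corr, start=0):
    """Greedy stand-in for TSP: always step to the most correlated unvisited variable."""
    order = [start]
    while len(order) < len(corr):
        last = order[-1]
        candidates = [j for j in range(len(corr)) if j not in order]
        order.append(max(candidates, key=lambda j: abs(corr[last][j])))
    return order

# Hypothetical data, for illustration only.
columns = {
    "weight": [2.0, 2.5, 3.0, 3.5, 4.0],
    "horsepower": [90, 110, 150, 160, 200],
    "age_years": [3, 1, 4, 1, 5],
}
names, corr = correlation_matrix(columns)
layout = spring_layout(corr)
axis_order = [names[i] for i in greedy_axis_order(corr)]
```

In the resulting layout, strongly correlated variables end up close together, and the axis order places highly correlated variables on neighbouring parallel-coordinate axes.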

TM-FAQ … The Most Frequently Asked Question. Sure, I know about numerical variables. But how about categorical variables? And what happens when there are both numerical and categorical variables in the data? Take a car’s mpg (a numerical variable) and its color (a categorical variable): how do they correlate?

Unifying Categorical & Numerical Variables. Two choices: (1) transform numerical to categorical and use Cramér’s V; (2) transform categorical to numerical and use Pearson’s r. Binning numerical variables into categories results in a loss of resolution, which is not good. Better to use the second option and transform categorical to numerical. However, there are no known procedures for this.
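For reference, the first option relies on Cramér's V, which can be computed from a contingency table of two categorical variables. The sketch below uses the standard chi-squared based definition; the function name and the sample values are made up for illustration.

```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences (0 = independent, 1 = perfectly associated)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    chi2 = 0.0
    for a in row_totals:
        for b in col_totals:
            expected = row_totals[a] * col_totals[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Hypothetical values: color fully determines body type in the first case,
# and is unrelated to it in the second.
v_dependent = cramers_v(["red", "red", "blue", "blue"],
                        ["suv", "suv", "sedan", "sedan"])
v_independent = cramers_v(["red", "red", "blue", "blue"],
                          ["suv", "sedan", "suv", "sedan"])
```

Applying this to a numerical variable would first require binning it, which is exactly the loss of resolution the slide warns about.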

The Coefficient of Determination r². It gauges how well the data fit a regression model. r² is the square of the correlation coefficient r. The similarity to correlation is no accident: good correlation means a good (linear) regression model. [Figure panels: uncorrelated, poor fit; correlated, good fit]
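The link between r² and r can be checked numerically for a simple least-squares line. The sketch below computes r² as 1 - RSS/TSS and compares it with the squared Pearson coefficient; the weight and mpg values are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def r_squared(x, y):
    """Coefficient of determination, 1 - RSS / TSS, of the fitted line."""
    slope, intercept = fit_line(x, y)
    my = sum(y) / len(y)
    rss = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))  # residual sum of squares
    tss = sum((b - my) ** 2 for b in y)                                  # total sum of squares
    return 1.0 - rss / tss

# Hypothetical data.
weight = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [34, 30, 27, 22, 19]

r = pearson(weight, mpg)
r2 = r_squared(weight, mpg)
```

For a straight-line fit with an intercept, `r2` and `r * r` agree up to floating-point rounding.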

How Can This Help? Let’s plot a numerical variable (mpg) and a categorical variable (color). Assume we have 6 cars, with color as the independent variable and mpg as the dependent variable. [Figure: two plots of mpg against color, one with r² = 0.2 and one with r² = 0.9]

Transforming the Categorical Variable [Figure: plot of y against x annotated with RSS and TSS]

Regression With Categorical Variables

Efficiently Transforming X. There’s no need to compute the regression model. Instead, minimize RSS subject to a constraint [equation not recoverable]. After some manipulations, minimization occurs when the transformed value X(i) of level i equals the mean of all Y where X = level i.

Efficiently Transforming X. Applied to the cars. [Figure: mpg plotted against color]
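The transformation described on these slides, replacing each category level with the mean of Y over that level, can be sketched as follows. The six cars and their values are invented, and the arbitrary integer coding is included only as a baseline for comparison.

```python
import math
from collections import defaultdict

def pearson(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def level_means(levels, y):
    """Map each category level to the mean of y over the rows having that level."""
    groups = defaultdict(list)
    for level, value in zip(levels, y):
        groups[level].append(value)
    return {level: sum(values) / len(values) for level, values in groups.items()}

# Hypothetical data: six cars with a color and an mpg value.
color = ["red", "red", "green", "green", "blue", "blue"]
mpg = [30, 32, 18, 20, 25, 27]

# Baseline: an arbitrary order and equal spacing for the levels.
arbitrary = {"red": 0, "green": 1, "blue": 2}
r2_arbitrary = pearson([arbitrary[c] for c in color], mpg) ** 2

# Transformed: each level is placed at the mean mpg of its cars.
means = level_means(color, mpg)
r2_transformed = pearson([means[c] for c in color], mpg) ** 2
```

The level means both re-order and re-space the categories, and on this toy data `r2_transformed` is far higher than `r2_arbitrary`.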

Multivariate Regression / Correlation. Categorical variables may participate in more than one pair. This generalizes the problem to multivariate regression. Multivariate regression solves each variable separately, so the re-ordering/re-spacing scheme can also be applied separately. But note that the order/spacing of a categorical variable may be different in each numerical/categorical (N/C) pair. Note also that the order/spacing is data-driven, so different data will produce different solutions.

First Transformation Results. Auto and car dataset visualized in parallel coordinates. Correlations can be observed much more clearly after the transformations.

Interaction with the Correlation Network [Figure panels: all edges; filtered by strength; attribute centric; subset of attributes]

Multiscale Zooming

Merging Operations. We choose the vertex with the largest accumulated correlation. Some edges cannot be merged.
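One plausible reading of "largest accumulated correlation" is the vertex whose incident edges have the largest summed absolute correlation. The sketch below implements that reading only; the definition, the function names, and the matrix values are assumptions rather than details given in the slides.

```python
def accumulated_correlation(corr):
    """Sum of |r| over all edges incident to each vertex (self-correlation excluded)."""
    return [sum(abs(r) for j, r in enumerate(row) if j != i)
            for i, row in enumerate(corr)]

def pick_merge_vertex(names, corr):
    """Return the name and score of the vertex with the largest accumulated correlation."""
    totals = accumulated_correlation(corr)
    best = max(range(len(names)), key=totals.__getitem__)
    return names[best], totals[best]

# Hypothetical correlation matrix.
names = ["weight", "horsepower", "mpg", "price"]
corr = [
    [1.0, 0.9, -0.8, 0.3],
    [0.9, 1.0, -0.7, 0.2],
    [-0.8, -0.7, 1.0, -0.1],
    [0.3, 0.2, -0.1, 1.0],
]
vertex, total = pick_merge_vertex(names, corr)
```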

Exploring Correlation Sensitivity. Correlation strength can often be improved by constraining a variable’s value range (bracketing). This limits the derived relationships to this value range. Such limits are commonplace in targeted marketing, etc. [Figure panels: no bracketing; lower price range; higher price range]
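Bracketing can be illustrated by computing the correlation once over all rows and once over a restricted value range. The `bracket` helper, the variable names, and the values below are hypothetical and chosen so that the relationship holds only in the lower price range.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def bracket(x, y, low, high):
    """Keep only the rows whose x value lies in [low, high]."""
    pairs = [(a, b) for a, b in zip(x, y) if low <= a <= high]
    return [a for a, _ in pairs], [b for _, b in pairs]

# Hypothetical data: sales follow price closely at low prices, erratically at high prices.
price = [1, 2, 3, 4, 5, 6, 7, 8]
sales = [1, 2, 3, 4, 4, 1, 5, 2]

r_all = pearson(price, sales)                         # no bracketing
r_low = pearson(*bracket(price, sales, low=1, high=4))   # lower price range
r_high = pearson(*bracket(price, sales, low=5, high=8))  # higher price range
```

Any relationship read off `r_low` is valid only for prices inside the bracket, which is the limitation the slide points out.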

Multivariate Analysis of University Data. Fused dataset of 50 US colleges. US News: academic rankings. College Prowler: survey on campus life attributes.

Integrating Data – the Subspace Scatterplot. Unify correlation network with parallel coordinates. Steps: (1) Delaunay triangulation; (2) sort edges, that is, sort correlations; (3) threshold the edge list or interactively pop edges, which generates (concave) polygons; (4) map points using [equation not recoverable]. Each polygon represents a data subspace.

How to Read the Subspace Scatterplots. Generalizes Radviz from a circle to a polygon. The location of a projected point indicates how much it gravitates towards a particular attribute. Observe correlations with the edges. Observe biases, trends, and tradeoffs with the scatterplot. Observations for the cars: a diverse set of cars with a continuous distribution; a tradeoff between weight and horsepower; cars with lower weight and hp get better mpg.
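To make the "gravitates towards an attribute" reading concrete, here is the classic Radviz placement rule with the anchors moved from a circle to arbitrary polygon vertices. This is a sketch of standard Radviz only; the exact mapping used in the talk is not preserved in the transcript, and the anchor coordinates and weights are invented.

```python
def radviz_point(weights, anchors):
    """Place a data point at the weighted average of the attribute anchors.

    weights: non-negative attribute values, assumed already normalized to [0, 1].
    anchors: one (x, y) polygon vertex per attribute.
    """
    total = sum(weights)
    if total == 0:
        # No pull from any attribute: fall back to the plain vertex average.
        n = len(anchors)
        return (sum(ax for ax, _ in anchors) / n, sum(ay for _, ay in anchors) / n)
    x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
    y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
    return (x, y)

# Hypothetical polygon with one vertex per attribute.
anchors = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]

only_first = radviz_point([1.0, 0.0, 0.0, 0.0], anchors)  # pulled fully to vertex 0
balanced = radviz_point([1.0, 1.0, 1.0, 1.0], anchors)    # equal pull from all vertices
```

A weighted average always lands inside the convex hull of the anchors, so for a concave polygon this simple rule can place points outside the polygon; handling that case would need a different mapping.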

Example – Sales Campaign Dataset [Figure; attribute labels: # opportunities pipeline revenue]

Conclusions. The correlation map is an integrative visualization of a multi-scale correlation network and of clusters in user-definable high-dimensional subspaces. It supports numerical and categorical variables in a unified way. It also enables interactive variable selection, interactive data brushing, and cluster analysis/sculpting. Future work: extend to causal networks and inference.

Questions? Supported by NSF, DOE, ITCCP (NIPA-Korea)