Chapter 3 – Data Visualization


Chapter 3 – Data Visualization Data Mining for Business Intelligence Shmueli, Patel & Bruce © Galit Shmueli and Peter Bruce 2010

Graphs for Data Exploration. Basic plots: line graphs, bar charts, scatterplots. Distribution plots: boxplots, histograms. © Galit Shmueli and Peter Bruce 2010
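The slides build these plots in Excel and Spotfire. Below is a minimal pandas/matplotlib sketch of the same basic plot types, assuming the Boston Housing data is available as a CSV file named BostonHousing.csv with columns such as MEDV, CHAS, and LSTAT (the file name is an assumption).

```python
import pandas as pd
import matplotlib.pyplot as plt

housing_df = pd.read_csv("BostonHousing.csv")  # assumed file; columns include MEDV, CHAS, LSTAT

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Bar chart: share of tracts in each CHAS category (border Charles River or not)
housing_df["CHAS"].value_counts(normalize=True).plot.bar(ax=axes[0, 0], title="% of tracts by CHAS")

# Scatterplot: relationship between two numerical variables
housing_df.plot.scatter(x="LSTAT", y="MEDV", ax=axes[0, 1], title="MEDV vs. LSTAT")

# Histogram: distribution of the outcome variable (median house value)
housing_df["MEDV"].plot.hist(bins=20, ax=axes[1, 0], title="Histogram of MEDV")

# Side-by-side boxplots: distribution of MEDV for CHAS = 0 vs. CHAS = 1
housing_df.boxplot(column="MEDV", by="CHAS", ax=axes[1, 1])

plt.tight_layout()
plt.show()
```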

Line Graph for Time Series © Galit Shmueli and Peter Bruce 2010

Bar Chart for Categorical Variable. 95% of tracts do not border the Charles River. Excel can confuse: the y-axis is actually "% of records that have a value for CATMEDV" (i.e., "% of all records"). © Galit Shmueli and Peter Bruce 2010

Scatterplot Displays relationship between two numerical variables © Galit Shmueli and Peter Bruce 2010

Distribution Plots. Display "how many" of each value occur in a data set; or, for continuous data or data with many possible values, "how many" values fall in each of a series of ranges or "bins". © Galit Shmueli and Peter Bruce 2010

Histograms Boston Housing example: Histogram shows the distribution of the outcome variable (median house value) © Galit Shmueli and Peter Bruce 2010

Boxplots. Side-by-side boxplots are useful for comparing subgroups. Boston Housing example: display the distribution of the outcome variable (MEDV) for neighborhoods on the Charles River (1) and not on the Charles River (0). © Galit Shmueli and Peter Bruce 2010

Box Plot. Top outliers are defined as those above Q3 + 1.5(Q3 - Q1); "max" = maximum of the non-outliers. Analogous definitions apply for bottom outliers and for "min". Details may differ across software. [Figure: annotated boxplot marking outliers, "max", quartile 3, mean, median, quartile 1, "min"] © Galit Shmueli and Peter Bruce 2010
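A small sketch of the outlier rule just described, reusing the housing_df dataframe from the earlier sketch:

```python
# Q3 + 1.5*(Q3 - Q1) rule for top outliers; analogous rule for bottom outliers
q1, q3 = housing_df["MEDV"].quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

non_outliers = housing_df["MEDV"].between(lower_fence, upper_fence)
print("whisker 'max':", housing_df.loc[non_outliers, "MEDV"].max())   # maximum of non-outliers
print("whisker 'min':", housing_df.loc[non_outliers, "MEDV"].min())   # minimum of non-outliers
print("top outliers:", (housing_df["MEDV"] > upper_fence).sum())
```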

Heat Maps. Color conveys information. In data mining, heat maps are used to visualize correlations and missing data. © Galit Shmueli and Peter Bruce 2010

Heatmap to highlight correlations (Boston Housing) In Excel (using conditional formatting) In Spotfire © Galit Shmueli and Peter Bruce 2010
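The slide builds the correlation heatmap with Excel conditional formatting and in Spotfire; here is a rough pandas/matplotlib equivalent, assuming housing_df as before:

```python
import matplotlib.pyplot as plt

corr = housing_df.corr()                                  # correlation matrix of the numeric variables
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)    # color encodes sign and strength of correlation
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
plt.show()
```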

Multidimensional Visualization © Galit Shmueli and Peter Bruce 2010

Scatterplot with color added Boston Housing NOX vs. LSTAT Red = low median value Blue = high median value © Galit Shmueli and Peter Bruce 2010

Matrix Plot Shows scatterplots for variable pairs Example: scatterplots for 3 Boston Housing variables © Galit Shmueli and Peter Bruce 2010
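A matrix-plot sketch with pandas; the three variables chosen here are an assumption (the slide does not name them):

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Pairwise scatterplots for three Boston Housing variables, histograms on the diagonal
scatter_matrix(housing_df[["MEDV", "LSTAT", "NOX"]], figsize=(7, 7), diagonal="hist")
plt.show()
```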

Rescaling to log scale (on right) “uncrowds” the data © Galit Shmueli and Peter Bruce 2010
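A sketch of the rescaling idea: the same scatterplot drawn on the original scale and on a log scale. The choice of CRIM (crime rate) as the right-skewed variable is an assumption.

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
housing_df.plot.scatter(x="CRIM", y="MEDV", ax=ax1, title="Original scale")
housing_df.plot.scatter(x="CRIM", y="MEDV", ax=ax2, title="Log scale")
ax2.set_xscale("log")   # spreads out the crowded low end of a skewed variable
plt.show()
```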

Aggregation © Galit Shmueli and Peter Bruce 2010

Amtrak Ridership – Monthly Data © Galit Shmueli and Peter Bruce 2010

Aggregation – Monthly Average © Galit Shmueli and Peter Bruce 2010

Aggregation – Yearly Average © Galit Shmueli and Peter Bruce 2010
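A pandas sketch of both aggregations for the Amtrak series; the file and column names (Amtrak.csv, Month, Ridership) are assumptions.

```python
import pandas as pd

amtrak_df = pd.read_csv("Amtrak.csv", parse_dates=["Month"], index_col="Month")

# Monthly average: average ridership per calendar month (reveals the seasonal pattern)
monthly_avg = amtrak_df["Ridership"].groupby(amtrak_df.index.month).mean()

# Yearly average: average ridership per year (reveals the long-term trend)
yearly_avg = amtrak_df["Ridership"].groupby(amtrak_df.index.year).mean()

print(monthly_avg)
print(yearly_avg)
```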

Scatter Plot with Labels (Utilities) © Galit Shmueli and Peter Bruce 2010

Scaling: Smaller markers, jittering, color contrast (Universal Bank; red = accept loan) © Galit Shmueli and Peter Bruce 2010

Jittering Moving markers by a small random amount Uncrowds the data by allowing more markers to be seen © Galit Shmueli and Peter Bruce 2010
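A jittering sketch for a scatterplot like the Universal Bank one; the file name, the column names (Education, Income), and the jitter amounts are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

bank_df = pd.read_csv("UniversalBank.csv")   # assumed file name
rng = np.random.default_rng(1)

# Add a small random offset so overlapping markers become distinguishable
jitter_x = bank_df["Education"] + rng.uniform(-0.2, 0.2, len(bank_df))
jitter_y = bank_df["Income"] + rng.uniform(-0.5, 0.5, len(bank_df))

plt.scatter(jitter_x, jitter_y, s=5, alpha=0.5)   # small markers and transparency also help uncrowd the plot
plt.xlabel("Education (jittered)")
plt.ylabel("Income (jittered)")
plt.show()
```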

Without jittering (for comparison) © Galit Shmueli and Peter Bruce 2010

Parallel Coordinate Plot (Boston Housing). [Figure: side-by-side parallel coordinate plots filtered on CAT.MEDV = 1 and CAT.MEDV = 0] © Galit Shmueli and Peter Bruce 2010
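A parallel-coordinate sketch with pandas, coloring lines by the CAT.MEDV class; the exact column names and the chosen subset of variables are assumptions.

```python
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

subset = housing_df[["CRIM", "RM", "LSTAT", "NOX"]]
scaled = (subset - subset.min()) / (subset.max() - subset.min())   # rescale to 0-1 so axes are comparable
scaled["CAT. MEDV"] = housing_df["CAT. MEDV"]                      # assumed name of the categorical outcome

parallel_coordinates(scaled, class_column="CAT. MEDV", colormap="coolwarm")
plt.show()
```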

Linked plots (same record is highlighted in each plot) © Galit Shmueli and Peter Bruce 2010

Network Graph – eBay Auctions (sellers on left, buyers on right). Circle size = # of transactions for the node. Line width = # of auctions for the buyer-seller pair. Arrows point from buyer to seller. © Galit Shmueli and Peter Bruce 2010

Treemap – eBay Auctions (hierarchical eBay data: Category > Sub-category > Brand). Rectangle size = average closing price (= item value). Color = % of sellers with negative feedback (darker = more). © Galit Shmueli and Peter Bruce 2010

Map Chart (comparing countries' well-being with GDP). Darker = higher value. © Galit Shmueli and Peter Bruce 2010

Summary of Major Visualizations & Operations, According to Data Mining Goal © Galit Shmueli and Peter Bruce 2010

Chapter 4 – Dimension Reduction Data Mining for Business Intelligence Shmueli, Patel & Bruce © Galit Shmueli and Peter Bruce 2010

Exploring the Data. Statistical summary of the data, common metrics: average, median, minimum, maximum, standard deviation, counts & percentages. © Galit Shmueli and Peter Bruce 2010
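The same summary metrics computed in pandas, assuming housing_df as in the earlier sketches:

```python
print(housing_df.describe())                             # count, mean (average), std, min, quartiles (incl. median), max
print(housing_df["MEDV"].median())                       # median of the outcome variable
print(housing_df["CHAS"].value_counts())                 # counts for a categorical variable
print(housing_df["CHAS"].value_counts(normalize=True))   # ...and as percentages
```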

Summary Statistics – Boston Housing © Galit Shmueli and Peter Bruce 2010

Correlations Between Pairs of Variables: Correlation Matrix from Excel © Galit Shmueli and Peter Bruce 2010

Summarize Using Pivot Tables. Counts & percentages are useful for summarizing categorical data. Boston Housing example: 35 neighborhoods border the Charles River (1); 471 neighborhoods do not (0). © Galit Shmueli and Peter Bruce 2010

Pivot Tables - cont. Averages are useful for summarizing grouped numerical data Boston Housing example: Compare average home values in neighborhoods that border Charles River (1) and those that do not (0) © Galit Shmueli and Peter Bruce 2010

Pivot Tables, cont. Group by multiple criteria: By # rooms and location E.g., neighborhoods on the Charles with 6-7 rooms have average house value of 25.92 ($000) © Galit Shmueli and Peter Bruce 2010
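A pandas pivot-table sketch of the same two-way grouping (the slides build it in Excel); treating RM as the number-of-rooms column, and the bin edges, are assumptions.

```python
import pandas as pd

# Average MEDV by binned number of rooms (RM) and river location (CHAS)
housing_df["RM_bin"] = pd.cut(housing_df["RM"], bins=range(3, 10))
pivot = pd.pivot_table(housing_df, values="MEDV", index="RM_bin", columns="CHAS", aggfunc="mean")
print(pivot)
```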

Pivot Table - Hint To get counts, drag any variable (e.g. “ID”) to the data area Select “settings” then change “sum” to “count” © Galit Shmueli and Peter Bruce 2010

Correlation Analysis Below: Correlation matrix for portion of Boston Housing data Shows correlation between variable pairs © Galit Shmueli and Peter Bruce 2010

Reducing Categories. A single categorical variable with m categories is typically transformed into m-1 dummy variables. Each dummy variable takes the values 0 or 1: 0 = "no" for the category, 1 = "yes". Problem: you can end up with too many variables. Solution: reduce the number by combining categories that are close to each other; use pivot tables to assess the outcome variable's sensitivity to the dummies. Exception: Naïve Bayes can handle categorical variables without transforming them into dummies. © Galit Shmueli and Peter Bruce 2010
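A sketch of both steps in pandas: combining similar categories, then creating m-1 dummies. The column name ZONE and the particular grouping are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"ZONE": ["A", "B", "C", "A", "D", "C"]})

# Combine categories that behave similarly with respect to the outcome (mapping is illustrative)
df["ZONE_grouped"] = df["ZONE"].replace({"C": "B", "D": "B"})

# m categories -> m-1 dummy variables; drop_first=True drops the redundant m-th dummy
dummies = pd.get_dummies(df["ZONE_grouped"], prefix="ZONE", drop_first=True)
print(dummies)
```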

Combining Categories Many zoning categories are the same or similar with respect to CATMEDV © Galit Shmueli and Peter Bruce 2010

Principal Components Analysis. Goal: reduce a set of numerical variables. The idea: remove the overlap of information between these variables. ["Information" is measured by the sum of the variances of the variables.] Final product: a smaller number of numerical variables that contain most of the information. © Galit Shmueli and Peter Bruce 2010

Principal Components Analysis. How does PCA do this? Create new variables that are linear combinations of the original variables (i.e., they are weighted averages of the original variables). These linear combinations are uncorrelated (no information overlap), and only a few of them contain most of the original information. The new variables are called principal components. © Galit Shmueli and Peter Bruce 2010
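A scikit-learn sketch of the same idea on the two cereal variables discussed next (calories and rating); the file name cereals.csv is an assumption.

```python
import pandas as pd
from sklearn.decomposition import PCA

cereals_df = pd.read_csv("cereals.csv")          # assumed file name
pca = PCA(n_components=2)
scores = pca.fit_transform(cereals_df[["calories", "rating"]])   # principal component scores (Z1, Z2)

print(pca.components_)                 # weights defining each linear combination, e.g. roughly (-0.85, 0.53) for Z1
print(pca.explained_variance_ratio_)   # share of total variance captured by Z1 and Z2 (roughly 86% / 14%)
```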

Example – Breakfast Cereals © Galit Shmueli and Peter Bruce 2010

Description of Variables. name: name of cereal; mfr: manufacturer; type: cold or hot; calories: calories per serving; protein: grams; fat: grams; sodium: mg; fiber: grams; carbo: grams of complex carbohydrates; sugars: grams; potass: mg; vitamins: % of FDA recommendation; shelf: display shelf; weight: oz per serving; cups: cups in one serving; rating: Consumer Reports rating. © Galit Shmueli and Peter Bruce 2010

Consider calories & ratings. Total variance (= "information") is the sum of the individual variances: 379.63 + 197.32 = 576.95. Calories accounts for 379.63/576.95 ≈ 66% of the total. © Galit Shmueli and Peter Bruce 2010

First & Second Principal Components Z1 and Z2 are two linear combinations. Z1 has the highest variation (spread of values) Z2 has the lowest variation © Galit Shmueli and Peter Bruce 2010

PCA output for these 2 variables Top: weights to project original data onto Z1 & Z2 e.g. (-0.847, 0.532) are weights for Z1 Bottom: reallocated variance for new variables Z1 : 86% of total variance Z2 : 14% © Galit Shmueli and Peter Bruce 2010

Principal Component Scores. The weights are used to compute the above scores; e.g., the column-1 scores are the Z1 scores, computed using the weights (-0.847, 0.532). © Galit Shmueli and Peter Bruce 2010

Properties of the resulting variables New distribution of information: New variances = 498 (for Z1) and 79 (for Z2) Sum of variances = sum of variances for original variables calories and ratings New variable Z1 has most of the total variance, might be used as proxy for both calories and ratings Z1 and Z2 have correlation of zero (no information overlap) © Galit Shmueli and Peter Bruce 2010

Generalization. X1, X2, X3, …, Xp are the original p variables; Z1, Z2, Z3, …, Zp are weighted averages of the original variables. All pairs of Z variables have 0 correlation. Order the Z's by variance (Z1 largest, Zp smallest). Usually the first few Z variables contain most of the information, and so the rest can be dropped. © Galit Shmueli and Peter Bruce 2010

PCA on full data set First 6 components shown First 2 capture 93% of the total variation Note: data differ slightly from text © Galit Shmueli and Peter Bruce 2010

Normalizing Data. In these results, sodium dominates the first PC. Just because of the way it is measured (mg), its scale is greater than that of almost all the other variables; hence its variance is a dominant component of the total variance. Normalize each variable to remove the scale effect: divide by the standard deviation (you may subtract the mean first). Normalization (= standardization) is usually performed in PCA; otherwise measurement units affect the results. Note: in XLMiner, use the correlation matrix option to use normalized variables. © Galit Shmueli and Peter Bruce 2010
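A sketch of PCA on standardized variables (the equivalent of XLMiner's correlation-matrix option), assuming cereals_df from the earlier sketch:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

numeric_df = cereals_df.select_dtypes("number").dropna()   # numeric columns only
scaled = StandardScaler().fit_transform(numeric_df)        # subtract mean, divide by standard deviation
pca = PCA().fit(scaled)
print(pca.explained_variance_ratio_.cumsum())              # sodium no longer dominates the first component
```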

PCA using standardized variables First component accounts for smaller part of variance Need to use more components to capture same amount of information © Galit Shmueli and Peter Bruce 2010

PCA in Classification/Prediction. Apply PCA to the training data. Decide how many PCs to use. Use those PCs' variable weights with the validation/new data. This creates a new, reduced set of predictors in the validation/new data. © Galit Shmueli and Peter Bruce 2010
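A sketch of that workflow with scikit-learn: fit the scaler and the PCA weights on training data only, then reuse them on validation or new data. The 60/40 split and the choice of 5 components are assumptions; numeric_df is from the previous sketch.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_valid = train_test_split(numeric_df, test_size=0.4, random_state=1)

scaler = StandardScaler().fit(X_train)                      # fit the scaler on training data only
pca = PCA(n_components=5).fit(scaler.transform(X_train))    # decide how many PCs to keep

X_train_reduced = pca.transform(scaler.transform(X_train))
X_valid_reduced = pca.transform(scaler.transform(X_valid))  # same weights reused on validation data, not re-fit
```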

Regression-Based Dimension Reduction Multiple Linear Regression or Logistic Regression Use subset selection Algorithm chooses a subset of variables This procedure is integrated directly into the predictive task © Galit Shmueli and Peter Bruce 2010
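XLMiner integrates subset selection into its regression routines; an analogous (not identical) sketch using scikit-learn's forward selection wrapped around a linear regression, assuming housing_df as before:

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X = housing_df.drop(columns=["MEDV", "CAT. MEDV"], errors="ignore").select_dtypes("number")
y = housing_df["MEDV"]

# Forward selection: the algorithm chooses a subset of predictors for the regression
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5, direction="forward")
selector.fit(X, y)
print(list(X.columns[selector.get_support()]))   # the chosen subset of predictors
```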

Summary. Data summarization is an important tool for data exploration. Data summaries include numerical metrics (average, median, etc.) and graphical summaries. Data reduction is useful for compressing the information in the data into a smaller subset of variables. Categorical variables can be reduced by combining similar categories. Principal components analysis transforms an original set of numerical variables into a smaller set of weighted averages of the originals that contain most of the original information in fewer variables. © Galit Shmueli and Peter Bruce 2010