Exploratory data analysis (EDA) Detective Alex Yu

What isn't EDA EDA does not mean a lack of planning or sloppy planning.  “I don't know what I am doing; just ask as many questions as possible in the survey; I don't need a well-conceptualized research question or a well-planned research design. Just explore.” EDA is not opposed to confirmatory data analysis (CDA), e.g. checking assumptions, residual analysis, model diagnostics.

What is EDA? Pattern-seeking Skepticism (detective spirit) Abductive reasoning John Tukey (not Turkey): Explore the data in as many ways as possible until a plausible story of the data emerges.

Elements of EDA Velleman & Hoaglin (1981):  Residual analysis  Re-expression (data transformation)  Resistance  Display (revelation, data visualization)

Residual Data = fit + residual Data = model + error The residual is a modern concept. In the past many scientists ignored it and reported the “fit” only:  Johannes Kepler  Gregor Mendel
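The identity on this slide can be checked directly. A minimal Python sketch (the numbers are made up for illustration; any least-squares routine would do):

```python
import numpy as np

# Hypothetical data: five (x, y) points that roughly follow a line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares line: y = b0 + b1 * x
b1, b0 = np.polyfit(x, y, 1)
fit = b0 + b1 * x
residual = y - fit  # data = fit + residual

# The identity holds exactly, and OLS residuals (with an intercept) sum to ~0
print(np.allclose(y, fit + residual))
print(residual)
```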

Random residual plot No systematic pattern Normal distribution

Strange residual patterns Fitness data Residuals are not normally distributed. Explore another model!

Strange residual patterns Non-random, systematic Check the data!

Robust residual Robust regression in SAS The residual plot tags the influential points (less severe) and outliers (more severe).
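The slide shows robust regression in SAS. As a language-neutral sketch of the same idea, downweighting influential points and outliers can be written as Huber-weighted iteratively reweighted least squares in Python; the data, tuning constant, and iteration count below are illustrative, not SAS's actual algorithm:

```python
import numpy as np

def huber_weights(r, c=1.345):
    """Huber weights: 1 for small standardized residuals, c/|r| beyond c."""
    a = np.abs(r)
    w = np.ones_like(a)
    big = a > c
    w[big] = c / a[big]
    return w

def robust_line_fit(x, y, iters=50):
    """Iteratively reweighted least squares for y = b0 + b1*x."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from OLS
    for _ in range(iters):
        r = y - X @ beta
        # Robust scale estimate via the median absolute deviation (MAD)
        scale = np.median(np.abs(r - np.median(r))) / 0.6745
        scale = max(scale, 1e-8)
        sw = np.sqrt(huber_weights(r / scale))
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

# Nine points on the line y = 2x + 1 plus one gross outlier
x = np.arange(10.0)
y = 2.0 * x + 1.0
y[-1] = 100.0  # outlier (the clean value would be 19)
b0, b1 = robust_line_fit(x, y)
print(b0, b1)  # close to the clean intercept 1 and slope 2
```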

Re-expression or transformation Parametric tests require certain assumptions, e.g. normality, homogeneity of variances, linearity, etc. When your data structure cannot meet the requirements, you need a transformer (ask the Autobots, not the Decepticons)!

Transformers! Normalize the distribution: log transformation or inverse probability Stabilize the variance: square root transformation: y* = sqrt(y) Linearize the trend: log transformation (but sometimes it is better to leave the data alone and do a nonlinear fit, as discussed next)
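The effect of these transformers on skewness can be sketched in Python (simulated right-skewed data; skewness is computed by hand so the example stays self-contained):

```python
import numpy as np

def skewness(a):
    """Sample skewness: the third standardized moment."""
    z = (a - a.mean()) / a.std()
    return float((z ** 3).mean())

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # strongly right-skewed

log_y = np.log(y)    # log transform: normalizes this distribution
sqrt_y = np.sqrt(y)  # square-root transform: y* = sqrt(y), milder

# Raw data is heavily skewed; sqrt reduces it; log brings it near zero
print(skewness(y), skewness(sqrt_y), skewness(log_y))
```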

Skewed distribution The distributions of publication of scientific studies and patents are skewed. A few countries (e.g. US, Japan) have the most. Log transformation can normalize them.

JMP Create the transformed variable while doing analysis. Faster, but will not store the new variable. You cannot preview the distribution.

JMP Create a permanent new variable for re-analysis later.

Before and after Regression with transformed variables makes much more sense!

Example from JMP Corn.jmp DV: yield IV: nitrate

Skewed distributions Both DV and IV distributions are skewed. What regression result would you expect?

Remove outliers? Three observations are located outside the boundary of the 99% density ellipse (the majority of the data) Only one is considered an outlier.

Remove outliers? Removing the two observations at the lower left will not make things better. They fall along the nonlinear path.

Transform yield only Remove the outlier at the far right. It didn't look any better.

Transform nitrate only The regression model looks linear. It is acceptable, but the underlying pattern is really nonlinear.

Interactive nonlinear fit

The linear model is too simplistic and underfits the data

Overfit and complicated model

Smooth things out: Almost right Lambda: Smoothing parameter Not a bad model, but the data points at the lower left are neglected.

General Ambrose says:

Polynomial (nonlinear) fit A degree-n polynomial allows up to n-1 turns: Quadratic = 1 turn Cubic = up to 2 turns Quartic = up to 3 turns Quintic = up to 4 turns; the quintic takes the lower left into account, but it is too complicated (too many turns)
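A quick way to see the trade-off is to compare polynomial degrees on saturating data (hypothetical numbers shaped like the corn example): the residual sum of squares can only shrink as the degree rises, which is exactly why a low RSS alone cannot justify a complicated fit.

```python
import numpy as np

# Hypothetical data with a rising-then-flattening (saturating) trend
x = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
y = np.array([5, 9, 15, 22, 26, 28, 29], dtype=float)

rss = {}
for degree in (1, 2, 3, 4):
    coefs = np.polyfit(x, y, degree)
    rss[degree] = float(np.sum((y - np.polyval(coefs, x)) ** 2))

# RSS is non-increasing in degree for nested least-squares fits --
# a smaller RSS is not, by itself, evidence of a better model
print(rss)
```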

Fit spline Like Graph Builder, Fit Spline lets you control the curve interactively. It shows you the R-square (variance explained), too. It still does not take the lower left data into account.

Kernel Smoother Local smoother: take localized variations and patterns into account. Interactive, too But the line still does not go towards the data points at the lower left.

Fit nonlinear MM (Michaelis-Menten) has the lowest AICc and it takes the data points at the lower left into account. Should we take it? MM is a specific model of enzyme kinetics in biochemistry.
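The MM curve here is the Michaelis-Menten equation from enzyme kinetics, y = Vmax·x / (Km + x). A sketch of fitting it directly, assuming scipy is available (the data points are invented to mimic the saturating shape):

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(x, vmax, km):
    """Michaelis-Menten: rises quickly, then saturates toward vmax."""
    return vmax * x / (km + x)

# Hypothetical saturating data
x = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
y = np.array([5, 9, 15, 22, 26, 28, 29], dtype=float)

# p0 is an illustrative starting guess for the optimizer
(vmax, km), _ = curve_fit(michaelis_menten, x, y, p0=[30.0, 5.0])
print(vmax, km)  # vmax: the plateau; km: the x value at half the plateau
```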

Custom formula for data transformation

Custom transformation You need prior research to support it. You cannot make up a transformation or an equation. Because it is a linear model, it might distort the real (nonlinear) pattern.

Fit special It works! Now the line passes through all data points! Yeah!

I am the best transformer!

Resistance Resistance is not the same as robustness. Resistance: immune to outliers Robustness: immune to violations of parametric assumptions Use the median, trimean, winsorized mean, or trimmed mean to counter outliers, but this is less important today (explained next).
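A quick comparison of these resistant estimators against one gross outlier, assuming scipy is available (the data are made up):

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

# Six ordinary values plus one gross outlier
x = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 100.0])

print(np.mean(x))               # dragged far upward by the single outlier
print(np.median(x))             # resistant: 4.0
print(stats.trim_mean(x, 0.2))  # drop the lowest/highest 20%, then average
print(np.mean(mstats.winsorize(x, limits=0.2)))  # clamp tails, then average
```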

Data visualization: Revelation Data visualization is the primary tool of EDA. Without “seeing” the data pattern,...  how can you know whether the residuals are random or not?  how can you spot a skewed distribution or a nonlinear relationship and decide whether transformation is needed?  how can you detect outliers and decide whether you need resistant or robust procedures? Data visualization will be explained in detail in the next unit.

Data visualization One of John Tukey's great inventions in graphical techniques is the boxplot.  It is resistant against extreme cases (it uses the median)  It can easily spot outliers.  It can check distributional assumptions using a quick five-number summary.
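The boxplot's machinery is just the five-number summary plus Tukey's 1.5×IQR fences, which is easy to reproduce by hand (illustrative data with one planted outlier):

```python
import numpy as np

x = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 100.0])

# Five-number summary: min, Q1, median, Q3, max
q1, med, q3 = np.percentile(x, [25, 50, 75])
print(x.min(), q1, med, q3, x.max())

# Tukey's fences: points beyond 1.5 * IQR from the quartiles are outliers
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < low) | (x > high)]
print(outliers)  # the planted outlier, 100.0
```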

Classical EDA Some classical EDA techniques are less important because today many new procedures...  do not require parametric assumptions or are robust against violations (e.g. decision trees, generalized regression).  are immune to outliers (e.g. decision trees, two-step clustering).  can handle strange data structures or perform transformation during the process (e.g. artificial neural networks).

EDA and data mining Same:  Data mining is an extension of EDA: it inherits the exploratory spirit; don't start with a preconceived hypothesis.  Both rely heavily on data visualization. Different:  DM: machine learning and resampling  DM: more robust  DM: can reach the conclusion with CDA

Assignment 6.1 Download the World Bank data set from the Unit 6 folder. Use 2005 patents by residents to predict 2007 GNP per person employed. Make a regression model using log transformation and another one using log10 transformation. Which one is better? Copy and paste the graphs into a Word document, and explain your answer.

Assignment 6.2 Open the sample data set “US demographics” from JMP. Use college degrees to predict alcohol consumption. Use Fit Y by X or Fit Nonlinear to find the relationship between the two variables. You can try different transformation methods, too. What is the underlying relationship between college degrees and alcohol consumption? Copy and paste the graphs into the same document. Explain your answer and upload the file to Sakai.

Assignment 6.3 Transform yourself into a Pink Volkswagen or a GMC truck.