How to Do a Data Analysis
Stan Siranovich, Crucial Connection LLC
Prepared for SQL Saturday – Louisville 2018
The Story (based on a true-life adventure)
You log into your email first thing in the morning and the rumors are confirmed: your company is expanding with branch offices in three new cities. As you read, the Big Boss drops by your cubicle and says that she needs an analysis of the real estate situation in all three cities. The analysis needs to include summaries of prices based on factors such as number of bedrooms, number of bathrooms, and square footage. It should include lots of visualizations, be clear and easy to understand, and point out any interesting relationships that you have uncovered. And you need to have it done by 11:30 a.m.
Summary
Requirements: analysis for Louisville, Indianapolis, Cincinnati; beds, baths, sq. ft., etc.; clear visualizations; concise report; due in two hours
Plan of Attack: use JMP data analysis software from SAS; collect, clean, and examine data; summarize data; explore data visually; analyze data; prepare report
Residential Real Estate Data
The Software
By the Numbers
1. Download and concatenate
2. Use the Analyze > Distribution platform for visualization and data cleaning
3. Use the Recode function for further cleaning
4. Use the Analyze > Distribution platform for visualization and analysis
Concatenate Data in Analysis Software
Open files and import into JMP data table
Concatenate all three tables
Include Source Column
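For readers working outside JMP, the concatenate-with-Source-column step can be sketched in plain Python. This is only an illustration of the idea, not what JMP does internally. The function name `concatenate_with_source`, the column names, and the sample rows are all made up for the example.

```python
def concatenate_with_source(tables):
    """Stack several tables (lists of dict rows) and tag each row with its origin.

    `tables` maps a source name (here, a city) to that source's rows.
    """
    combined = []
    for source, rows in tables.items():
        for row in rows:
            tagged = dict(row)         # copy so the input rows are not modified
            tagged["Source"] = source  # the added Source column
            combined.append(tagged)
    return combined


# Made-up rows for illustration only; not the presentation's data.
tables = {
    "Louisville": [{"Price": 150000, "Beds": 3}],
    "Indianapolis": [{"Price": 175000, "Beds": 4}],
    "Cincinnati": [{"Price": 160000, "Beds": 3}, {"Price": 210000, "Beds": 5}],
}
main_table = concatenate_with_source(tables)
print(len(main_table), "rows; first row:", main_table[0])
```

Keeping the origin in its own column is what makes the later by-city comparisons possible from a single table.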
Main Table with Source Column
Visual Data Cleaning Use Analyze > Distribution platform for first pass at cleaning
Partial Result from Analyze > Distribution
Cleaned Result from Analyze > Distribution
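The first cleaning pass with the Distribution platform amounts to looking at how often each distinct value occurs, so that misspellings and stray variants stand out. A rough Python equivalent, assuming a hypothetical `City` column and made-up values, might look like this:

```python
from collections import Counter


def value_counts(rows, column):
    """Count how often each distinct value appears in one column."""
    return Counter(row[column] for row in rows)


# Made-up values; the inconsistent spellings are deliberate.
rows = [
    {"City": "Louisville"},
    {"City": "Louisville"},
    {"City": "louisville"},
    {"City": "Louisville "},
    {"City": "Cincinnati"},
]
counts = value_counts(rows, "City")
for value, n in counts.most_common():
    print(repr(value), n)  # repr() makes trailing spaces visible
```

Values with unexpectedly low counts are the usual candidates for cleanup.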
Recode Function
Recode Result with Formula Column Property: displays Match function; documentation; reproducible workflows
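The appeal of a recode stored as a formula is that the mapping itself stays visible and can be rerun. The sketch below imitates that idea in Python with an explicit lookup table. It is an analogy to a Match-style formula, not JMP's implementation. `RECODE_MAP`, `recode`, and the column names are hypothetical.

```python
# Hypothetical recode map: messy value -> cleaned value.
# Keeping it as data documents exactly what was changed.
RECODE_MAP = {
    "louisville": "Louisville",
    "Louisville ": "Louisville",
}


def recode(rows, column, mapping, new_column):
    """Write recoded values to a new column, leaving the original untouched."""
    recoded = []
    for row in rows:
        new_row = dict(row)
        # Values not listed in the mapping pass through unchanged.
        new_row[new_column] = mapping.get(row[column], row[column])
        recoded.append(new_row)
    return recoded


raw_rows = [{"City": "louisville"}, {"City": "Louisville "}, {"City": "Cincinnati"}]
clean_rows = recode(raw_rows, "City", RECODE_MAP, "City Clean")
print(clean_rows)
```

Writing to a new column rather than overwriting keeps the raw values available for checking the recode later.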
Analyze > Distribution Window Requested variables, all three cities
Result with Statistical Data and Boxplots
Box Plot Summary
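As a reminder of what a box plot encodes, here is a minimal Python sketch that computes the quartiles, the interquartile range, and the common 1.5 × IQR fences used to flag outliers. It assumes Python's "inclusive" quantile method, so the quartiles can differ slightly from the ones JMP reports. The values are made up.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, IQR, 1.5*IQR fences, and the points outside the fences."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "iqr": iqr,
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


# Made-up values with one obvious outlier.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(summary)
```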
Analyze > Distribution By Variable By Source Table
Result with Statistical Data and Boxplot
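Running Distribution with Source as the By variable produces one set of summary statistics per city. A hedged Python sketch of the same grouping idea follows. `summarize_by` and the sample rows are invented for illustration, and the statistics shown are only a small subset of what JMP reports.

```python
import statistics
from collections import defaultdict


def summarize_by(rows, value_column, by_column):
    """Group rows by `by_column` and summarize `value_column` within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by_column]].append(row[value_column])
    return {
        key: {
            "n": len(vals),
            "mean": statistics.mean(vals),
            "median": statistics.median(vals),
            # Sample standard deviation needs at least two values.
            "std_dev": statistics.stdev(vals) if len(vals) > 1 else None,
        }
        for key, vals in groups.items()
    }


# Made-up rows for illustration only.
sample_rows = [
    {"Source": "Louisville", "Price": 100},
    {"Source": "Louisville", "Price": 200},
    {"Source": "Cincinnati", "Price": 300},
]
by_city = summarize_by(sample_rows, "Price", "Source")
for city, stats in by_city.items():
    print(city, stats)
```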
Stacked Results Red Triangle > Stack
Easy to Read Table Right Click > Edit > Make table of graphs like this
Progress Summary of prices, beds, baths, sq. ft. Done Next Summary of prices, beds, baths, sq. ft. Visualizations – clear, easy to understand Analysis Visualize distributions Comparisons of two variables Fit Y by X platform Data types and statistical measures
Output Is Determined by Variable Type
Analyze > Fit Y by X
Examines the relationship between two variables
Output depends on the variable modeling type
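The way the modeling types of Y and X select the kind of report can be summarized as a small lookup. The sketch below is a plain-Python illustration of that decision rule. The function name `fit_y_by_x_report` is made up, and the returned strings are simply labels for the four standard cases.

```python
CATEGORICAL = {"nominal", "ordinal"}


def fit_y_by_x_report(y_type, x_type):
    """Return the kind of report implied by the modeling types of Y and X."""
    valid = CATEGORICAL | {"continuous"}
    if y_type not in valid or x_type not in valid:
        raise ValueError("modeling type must be continuous, nominal, or ordinal")
    y_cat = y_type in CATEGORICAL
    x_cat = x_type in CATEGORICAL
    if not y_cat and not x_cat:
        return "Bivariate"    # continuous Y, continuous X: scatterplot and fits
    if not y_cat and x_cat:
        return "Oneway"       # continuous Y, categorical X: group comparison
    if y_cat and not x_cat:
        return "Logistic"     # categorical Y, continuous X
    return "Contingency"      # categorical Y, categorical X


# Price (continuous) by Source (nominal), as in the price comparison by city.
print(fit_y_by_x_report("continuous", "nominal"))
```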
Price vs. Source
Statistical Results
Red Triangle > Means/Anova
Red Triangle > Compare Means > All Pairs, Tukey HSD
Multiple Variable vs. Source
Fit Y by X with Categorical and Continuous By Variables
Definitions
R-square: Measures the proportion of the variation accounted for by fitting means to each factor level. The remaining variation is attributed to random error. The R² value is 1 if fitting the group means accounts for all the variation with no error. An R² of 0 indicates that the fit serves no better as a prediction model than the overall response mean.
F ratio: Model mean square divided by the error mean square. If the analysis of variance model results in a significant reduction of variation from the total, the F ratio is higher than expected.
Mean Square: A sum of squares divided by its associated degrees of freedom.
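To make these definitions concrete, the following sketch computes R-square and the F ratio for a one-way layout directly from the sums of squares. It is a minimal illustration using made-up numbers, not a replacement for the JMP report. The function name `oneway_anova` is hypothetical, and the code assumes at least two groups and some within-group variation.

```python
import math


def oneway_anova(groups):
    """R-square and F ratio for a one-way analysis of variance.

    `groups` maps each factor level to its list of response values.
    """
    all_values = [v for vals in groups.values() for v in vals]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n

    ss_model = 0.0  # variation explained by the group means
    ss_error = 0.0  # variation left within the groups
    for vals in groups.values():
        group_mean = sum(vals) / len(vals)
        ss_model += len(vals) * (group_mean - grand_mean) ** 2
        ss_error += sum((v - group_mean) ** 2 for v in vals)

    # Mean square = sum of squares / its degrees of freedom.
    ms_model = ss_model / (k - 1)
    ms_error = ss_error / (n - k)
    return {
        "r_square": ss_model / (ss_model + ss_error),
        "f_ratio": ms_model / ms_error,
    }


# Made-up responses for three factor levels.
result = oneway_anova({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
print(result)
```

In this toy data the group means account for most of the variation, so R-square is close to 1 and the F ratio is large.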
THE END
How to Do a Data Analysis
Stan Siranovich, Principal Analyst
Crucial Connection LLC, Jeffersonville, IN
stan@CrucialConnection.com
www.CrucialConnection.com
www.StanSiranovich.com
This work is the copyright of Stan Siranovich and Crucial Connection LLC