Exploring Data in R
Introduction to R, Part II
Anna Blackstock
Statistician, Biostatistics and Information Management Office (BIMO)
NCEZID/DFWED
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a term used to describe the process of exploring general dataset characteristics.
Three realms of EDA*:
Transformation
Visualization
Modelling
*From the book “R for Data Science” (http://r4ds.had.co.nz/exploratory-data-analysis.html).
Exploratory Data Analysis
There is no single formula for doing EDA “the right way.”
Goals of EDA:
Checking data quality: Any mistakes? Any unusual or unexpected values?
Understanding distributions
Investigating relationships between variables
EDA Today
Will cover a few basic functions that you can use to get started with data exploration.
Will NOT provide in-depth instruction on EDA tools, especially data transformation (stay tuned for other courses).
Exploring Data
When you first read your data into R, you will want to do a few data checks.
Functions you may consider:
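As a minimal sketch of such first checks, the base R functions below are common choices. They are assumptions, not necessarily the functions shown on the original slide, and the built-in `mtcars` data frame stands in for your own data.

```r
# Built-in example data (an assumption; substitute your own data frame)
dat <- mtcars

dim(dat)       # number of rows and columns
names(dat)     # variable names
str(dat)       # type and first few values of each variable
head(dat)      # first six rows
summary(dat)   # per-variable summaries (min, quartiles, mean, max)

# Count of missing values in each column, a quick data-quality check
na_counts <- colSums(is.na(dat))
na_counts
```

Unexpected types in `str()` output (for example, a numeric variable read in as character) often point to data-entry problems worth investigating.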
EDA for Categorical Variables
Tables: univariate and multi-way tables
Plots: bar plots, stacked bar plots
Statistical tests: chi-square tests
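The tools listed for categorical variables can be sketched with base R as follows. The choice of `mtcars` and of the variables `cyl` and `am` is an assumption made for illustration only.

```r
# Built-in example data; cyl and am are treated as categorical (an assumption)
dat <- mtcars
dat$cyl <- factor(dat$cyl)
dat$am  <- factor(dat$am, labels = c("automatic", "manual"))

# Univariate table and two-way table
cyl_tab <- table(dat$cyl)
two_way <- table(cylinders = dat$cyl, transmission = dat$am)
cyl_tab
two_way
prop.table(two_way, margin = 1)   # row proportions

# Bar plot of one variable, then a stacked bar plot of the two-way table
barplot(cyl_tab, xlab = "Cylinders", ylab = "Count")
barplot(two_way, legend.text = TRUE, xlab = "Transmission", ylab = "Count")

# Chi-square test of independence; with counts this small R warns that
# the chi-squared approximation may be inaccurate
chi_res <- chisq.test(two_way)
chi_res
```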
EDA for Continuous Variables
Summary statistics: mean, variance, percentiles, ...
Plots: scatterplots, box plots
Statistical tests: t-tests
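A short base R sketch of these steps for continuous variables follows. The dataset (`mtcars`), the variables `mpg` and `wt`, and the null value `mu = 20` are all hypothetical choices for illustration.

```r
# Built-in example data (an assumption; substitute your own data frame)
dat <- mtcars

# Summary statistics
mpg_mean <- mean(dat$mpg)
mpg_var  <- var(dat$mpg)
quantile(dat$mpg, probs = c(0.05, 0.25, 0.50, 0.75, 0.95))
summary(dat$mpg)

# Scatterplot of two continuous variables, and a box plot of one
plot(dat$wt, dat$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
boxplot(dat$mpg, ylab = "Miles per gallon")

# One-sample t-test against a hypothetical mean of 20
tt_one <- t.test(dat$mpg, mu = 20)
tt_one
```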
EDA for Categorical + Continuous Variables
Summary statistics: mean, variance, etc. by category
Plots: box plots, bubble plots
Statistical tests: t-tests
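Combining a categorical and a continuous variable might look like the sketch below, which covers by-group summaries, a grouped box plot, and a two-sample t-test (bubble plots are not shown). The use of `mtcars`, with `am` as the grouping variable and `mpg` as the outcome, is an assumption for illustration.

```r
# Built-in example data; am is the grouping variable (an assumption)
dat <- mtcars
dat$am <- factor(dat$am, labels = c("automatic", "manual"))

# Mean and variance of mpg within each transmission type
group_means <- tapply(dat$mpg, dat$am, mean)
group_vars  <- tapply(dat$mpg, dat$am, var)
group_means
group_vars

# Side-by-side box plots, one per category
boxplot(mpg ~ am, data = dat, xlab = "Transmission", ylab = "Miles per gallon")

# Two-sample t-test; R's default is the Welch test (unequal variances)
tt_two <- t.test(mpg ~ am, data = dat)
tt_two
```

The formula interface (`mpg ~ am`) reads as "outcome by group" and is shared by `boxplot()` and `t.test()`, which keeps the plot and the test aligned on the same comparison.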
Where to next?
Take an online course
Keep an eye out for future CDC courses
See the book “R for Data Science” by Garrett Grolemund and Hadley Wickham: http://r4ds.had.co.nz/
Find an online guide to EDA