Managing and Curating Data
Chapter 8

Introduction
Data organization
Data management
Data curation
Raw data is required to repeat a scientific study
Any data collected with public funds is legally required to be made available to other scientists and the public

Step 1: Managing Raw Data
Various sources of data
  – Data loggers
  – Handwritten notes
This data must be transferred to an organized format, checked, and analyzed

Spreadsheets
Row: single observation
Column: single measured or observed variable
Enter data ASAP!
  – Detect mistakes
  – Memory (doesn’t last long)
  – 2 copies
  – Timely analysis
Proofread the data
Check it
[Table: Garden Yield, with columns Number and Biomass and rows Carrots, Peppers, Broccoli; cell values not recovered]

Metadata: Data about data
“Must have” metadata:
  – Name and contact info of collector
  – Location of data collection
  – Name of study
  – Source of funding
  – Description of the organization of the data file
      – Methods used to collect the data
      – Types of experimental units
      – Description of abbreviations
      – Explicit description of data in columns and rows
May be created before data collection in some cases
Very important to assemble promptly because the details are easily forgotten

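One way to keep the "must have" metadata next to the data file is a small machine-readable record. The sketch below is only an illustration: the field names, the `missing_fields` helper, and every value are hypothetical placeholders, not part of the chapter.

```python
import json

# Hypothetical metadata record; every value is a placeholder for illustration.
metadata = {
    "collector": {"name": "A. Researcher", "contact": "a.researcher@example.org"},
    "location": "Example garden plot",
    "study_name": "Example garden yield study",
    "funding_source": "Example funding agency",
    "file_organization": {
        "methods": "Describe how the data were collected",
        "experimental_units": "Describe the experimental units",
        "abbreviations": {"NA": "not available"},
        "rows": "one observation per row",
        "columns": "one measured or observed variable per column",
    },
}

# The five "must have" items from the slide, under assumed key names.
REQUIRED = ["collector", "location", "study_name", "funding_source", "file_organization"]


def missing_fields(record):
    """Return the required metadata fields that are absent or empty."""
    return [key for key in REQUIRED if not record.get(key)]


# Serialize to JSON text that can be saved alongside the data file.
metadata_text = json.dumps(metadata, indent=2)
```

Running `missing_fields(metadata)` before archiving a data file gives a quick check that none of the required items was forgotten.
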
Step 3: Checking the Data
Outliers: values of measurements or observations that are outside the range of the bulk of the data
Values beyond the upper or lower deciles (above the 90th or below the 10th percentile)
Outliers increase the variance in the data, which lowers statistical power and increases the chance of a Type II error

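The decile rule can be sketched in a few lines. This is a minimal illustration, assuming Python's `statistics.quantiles` with the "inclusive" interpolation method; other percentile definitions give slightly different cut points. The function name `flag_decile_outliers` is made up for this example, and the data are the vegetable biomass values from the stem-and-leaf example.

```python
import statistics


def flag_decile_outliers(values):
    """Return the values below the 10th or above the 90th percentile."""
    # quantiles(n=10) returns the nine cut points between deciles.
    cuts = statistics.quantiles(values, n=10, method="inclusive")
    low, high = cuts[0], cuts[-1]
    return [v for v in values if v < low or v > high]


biomass = [7, 15, 35, 36, 37, 23, 27, 21, 42]
candidates = flag_decile_outliers(biomass)
```

A percentile rule always flags some values, so the result is a list of candidates to examine, not a list of values to delete.
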
How to deal with outliers
Do not delete them; this could be considered fraud
Delete a value only if it is an error or the data are no longer valid
Think about them
  – Interesting hypotheses
  – A large body of science is devoted to outliers
  – What type of distribution does your data have?

Errors and Missing Data
Errors are often outliers and can be identified
Sources: mistyping (e.g., misplaced decimal points), instrument error, field entry
Checking data can reduce errors
Never leave blank cells in spreadsheets; enter a zero only for a true zero, and NA (not available) for a missing value

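The difference between a zero and NA matters as soon as statistics are computed. The sketch below is an assumption-laden illustration (the helper names `parse_cell` and `mean_ignoring_missing` are invented here): it rejects blank cells, treats NA as missing, and shows through the usage note how a zero entered in place of a missing value would change the mean.

```python
import statistics


def parse_cell(cell):
    """Convert a spreadsheet cell to a float, or None if it is marked NA."""
    text = cell.strip()
    if text == "":
        raise ValueError("blank cell: enter 0 for a true zero or NA for a missing value")
    if text.upper() == "NA":
        return None
    return float(text)


def mean_ignoring_missing(cells):
    """Mean of the cells that hold a value; NA cells are skipped."""
    values = [v for v in (parse_cell(c) for c in cells) if v is not None]
    return statistics.mean(values)
```

With the made-up cells `["10", "NA", "20"]` the mean is taken over two observations, whereas `["10", "0", "20"]` counts the zero as a real measurement and pulls the mean down.
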
Detecting Outliers and Errors
Three techniques
  – Calculating column statistics
  – Checking ranges and precision of column values
  – Graphical exploratory data analysis

Detecting Outliers and Errors cont.
Column stats:
  – Mean, median, standard deviation, variance
Logical functions to check your columns
Range checking your data

Carrot Id #   Length   Biomass
Mean
Median        18       12
St Dev
Variance
Min           10       5
Max           26       118

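Column statistics and range checks are easy to automate. The following is a minimal sketch using only the standard library; the function names, the biomass values, and the allowed range of 0 to 50 are all made up for illustration.

```python
import statistics


def column_summary(values):
    """Summary statistics for one spreadsheet column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "st_dev": statistics.stdev(values),      # sample standard deviation
        "variance": statistics.variance(values),  # sample variance
        "min": min(values),
        "max": max(values),
    }


def out_of_range(values, lower, upper):
    """Return (row_index, value) pairs that fall outside [lower, upper]."""
    return [(i, v) for i, v in enumerate(values) if not lower <= v <= upper]


# Made-up biomass column with one suspicious entry.
biomass = [5, 9, 12, 12, 14, 118]
summary = column_summary(biomass)
suspects = out_of_range(biomass, lower=0, upper=50)
```

A maximum far above the median, as in this made-up column, is the kind of signal that a range check turns into a specific row to proofread.
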
Graphical Exploratory Data Analysis
Box plots (univariate)
Stem-and-leaf plots (univariate)
Scatterplots (bivariate or multivariate)

Stem-and-leaf plots
Example: Vegetable biomass: 7, 15, 35, 36, 37, 23, 27, 21, 42

0 | 7
1 | 5
2 | 1 3 7
3 | 5 6 7
4 | 2

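A stem-and-leaf plot can be built by splitting each value into a tens digit (the stem) and a units digit (the leaf). The sketch below assumes non-negative integers and uses an invented function name, `stem_and_leaf`; it is one simple way to do it, not a prescribed method.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows as strings: tens are stems, units are leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    rows = []
    # Include empty stems so that gaps in the distribution stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row_leaves = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row_leaves}".rstrip())
    return rows


biomass = [7, 15, 35, 36, 37, 23, 27, 21, 42]
for row in stem_and_leaf(biomass):
    print(row)
```

Keeping the empty stems is a deliberate choice: a value separated from the rest by several empty rows stands out as a possible outlier.
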
Scatter plots
Use to see how traits relate to one another

Creating an Audit Trail
Examining data for outliers and errors is a form of quality assurance and quality control (QA/QC) for research
Document how you perform QA/QC in your metadata
Your audit trail allows others to reanalyze and recreate your results
May be required for legal documentation
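An audit trail can be as simple as an append-only log of each QA/QC action. The sketch below is hypothetical throughout: the `record_step` helper, the action names, and the logged details are invented to show the idea, and a real project may prefer a lab notebook or version control.

```python
import json
from datetime import datetime, timezone


def record_step(trail, action, detail):
    """Append one timestamped QA/QC step to the audit trail and return it."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "detail": detail,
    }
    trail.append(entry)
    return entry


# Hypothetical entries for illustration only.
trail = []
record_step(trail, "range_check", "biomass column checked against an allowed range")
record_step(trail, "correction", "one biomass value corrected after checking field notes")

# One JSON object per line, suitable for saving with the metadata.
audit_text = "\n".join(json.dumps(entry) for entry in trail)
```

Because entries are only ever appended, the log preserves the order in which checks and corrections were made, which is what lets someone else retrace the steps.
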