PART 2: DATA READINESS
CASRAI CONFERENCE RECONNECT BIG DATA: THE ADVANCE OF DATA-DRIVEN DISCOVERY
OCTOBER 16, 2013
JANE FRY
Research Data Management: planning and implementation
Agenda
  Before data collection and processing
    Planning and organizing
  Data collection and processing
  After data collection and processing
    Metadata
  Your turn
    No data expertise needed!
Moore and Fry, CASRAI 2013 (October 16, 2013)
Before: why?
  Why a research data management plan (RDMP)?
    Essential for any type of data
  Why plan and organize?
    Journal requirements: be proactive
    Safety: protect your data
    Efficiency: easier to write up analyses and reports
    Quality: high quality is ensured when guidelines are laid out at the beginning
  Make a checklist or a template
If no RDMP
  Potential problems
    Each type of data has its own ‘peculiarities’
      Will you remember them after 1, 2, 3, … years?
      What about other researchers?
    Loss of information
    Inability to share
    Inability to replicate
    May not receive all monies from the grant
    Less analysis can be conducted
    Cannot submit to journals
Before: plan and organize
  What type of data?
  How will the data be collected and processed?
  Where and how will they be stored?
  How will they be secured?
  Where will the back-up be kept?
  How will confidentiality be maintained?
  What metadata will be recorded?
Before: type of data?
  The type chosen will determine the format to be used for analysis
  Quantitative
    Microdata (.sav)
    Aggregate data (.xls)
  Qualitative (NVivo)
  Geospatial (vector and raster data)
  Digital images (.jpeg)
  Digital audio (.wav)
  Digital video (.mp4)
  Documentation, scripts (.doc)
Before: collection methods?
  Depends on type of data
    Questionnaires
    Interviews
    Focus groups
    Observations
    Transcripts
    Newspaper articles
    Journals
    Diaries
    …
Before: collection methods (cont’d)
  Partially determined by type of data
    Paper
    Face-to-face
    Web
    Telephone
    Snail mail
    Audio
    Video
    …
Before: storage?
  Where will it be stored?
    Your laptop, PC, smartphone
    Your researchers' laptops, PCs, smartphones
    The shared drive in the office
    A Dropbox
      Controlled by what country?
  What format will be used for storage?
    Proprietary?
  Preservation
    How?
      Repository
    Where?
      Your institution
Before: storage strategies
  Two different locations
    Two copies (at least)
  Keep original data with no manipulations
    2 copies
  What to keep
    Everything!
  Use meaningful file names
    Set out format to be used
    Everyone has to use this format
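The two storage rules on this slide (an agreed file-name format, and at least two copies of the original data) can be checked mechanically. The sketch below is only an illustration: the naming pattern, the function names, and the example file name are assumptions made here, not a convention prescribed by the presenters.

```python
import hashlib
import re
from pathlib import Path

# Hypothetical convention: <project>_<description>_<YYYYMMDD>_v<major>.<minor>.<ext>
NAME_PATTERN = re.compile(r"^[a-z0-9]+_[a-z0-9-]+_\d{8}_v\d+\.\d+\.[a-z0-9]+$")


def follows_convention(filename: str) -> bool:
    """Return True if the file name matches the agreed (assumed) pattern."""
    return NAME_PATTERN.match(filename) is not None


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large data files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original: Path, backup: Path) -> bool:
    """Return True if the two stored copies are byte-for-byte identical."""
    return sha256_of(original) == sha256_of(backup)
```

For example, `follows_convention("petsurvey_raw-microdata_19981015_v1.0.sav")` passes, while a name such as `final data (2).sav` does not. Comparing checksums of the copies in the two locations is one way to confirm that the untouched original really is untouched.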
Before: security issues?
  How to secure data
    Determine beforehand
    To prevent unauthorized access
      Intentional
      Unintentional
  Remote access: yes or no?
    Off-site investigators
    Off-site research team members
  Personal or sensitive data
    Separate location from the main dataset
    Limited, controlled access
    Encrypted
Before: back-up?
  Where will all information be backed up?
    If at your institution
      How often do they back up?
      What are their policies for data retention?
  How often will you back up?
    When the project is over
    After a year
    Monthly
    Weekly
Before: confidentiality?
  What procedures will be taken to ensure confidentiality?
    Data must be anonymised (unless permission has been granted)
      Not possible to identify any individual
    Aggregate certain variables
      e.g., no low levels of geography
    Hide outliers by recoding
    Record all decisions made
      Why this decision was made
      How the variable has been recoded
Before: confidentiality (cont’d)
  Disclosure processing
    At what point in the data collection/processing?
    Remove direct identifiers
      Names
      Addresses
      Telephone numbers
    Remove indirect identifiers
      Detailed geographic information
      Exact occupations
      Exact dates of events
        Birth
        Marriage
      Income
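A minimal sketch of disclosure processing for a single record, assuming the data are held as Python dictionaries. The field names (`name`, `address`, `telephone`, `birth_date`, `income`) and the income cap are hypothetical choices for illustration. Direct identifiers are dropped, an exact birth date is coarsened to a year, an outlying income is top-coded, and every decision is logged so that it can go into the metadata.

```python
from typing import Any

# Hypothetical field names; a real project would take these from its own codebook.
DIRECT_IDENTIFIERS = {"name", "address", "telephone"}
INCOME_CAP = 150_000  # assumed top-code threshold, chosen only for illustration


def deidentify(record: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Return a de-identified copy of the record plus a log of decisions made."""
    cleaned = dict(record)
    decisions: list[str] = []

    # Remove direct identifiers outright.
    for key in sorted(DIRECT_IDENTIFIERS):
        if key in cleaned:
            del cleaned[key]
            decisions.append(f"removed direct identifier '{key}'")

    # Coarsen an exact date of birth (ISO string, YYYY-MM-DD) to a year.
    if "birth_date" in cleaned:
        cleaned["birth_year"] = int(str(cleaned.pop("birth_date"))[:4])
        decisions.append("recoded birth_date to birth_year")

    # Hide income outliers by top-coding.
    if cleaned.get("income", 0) > INCOME_CAP:
        cleaned["income"] = INCOME_CAP
        decisions.append(f"top-coded income at {INCOME_CAP}")

    return cleaned, decisions
```

The returned log is the part that is easiest to forget: it records why and how each variable was changed, which is exactly what a later user of the data will need.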
Before: confidentiality (cont’d)
  Legal and ethical obligations in managing and sharing data
    Ethics approval of your institution
    National Data Policy (regarding sharing of data)
      Canada (FIPPA)
      UK (ESRC)
  How will confidentiality be maintained?
    How to protect the privacy of the respondents
    How the confidential information will be handled and managed
    How to store respondents’ identification, if necessary
    Disclosure only if agreed to by respondent
Before: metadata?
  Why keep metadata?
    Researchers re-use data
      Secondary analysis
      Comparative research
      Teaching
      Replicate a study
    Requirement of our funders
    Good research practice
  Start documenting at the very beginning of the project
  End goal
    For this data to be replicated, if needed
Before: metadata (cont’d)
  What to keep: everything!
    Research design
    Data collection
    Data preparation
    Questionnaires
    Interviewer instructions
    Meeting notes among researchers
    Details of decisions made
      Why certain decisions were made
      e.g., if data collection is not to be done on a certain date (Easter)
Before: metadata (cont’d)
  Processes
    What worked
    What didn’t work
    Changes made after pilots conducted
      Why they were made
      Was another pilot conducted after changes were made?
    Any and all changes that were made or not made
Before: metadata (cont’d)
  Consent of participant (if needed)
  Disclosure processing
  Names of everyone involved in the project
  Source of all funding
    Monetary
    In kind
  Source of any data used that is not from this data collection
    e.g., postal code conversion file
Before: a tip
  If contracting out data processing
    Specify deliverables
      User guide
        Date work performed
        Methodology of data cleaning, input, …
        Details of any new variables
          Reasons for making them
          Procedures, …
        Name and contact information
      Copy of questionnaire (if applicable)
      Raw data
        Questionnaires, interviews, …
  Example of incomplete deliverable
Data collection and processing
  Some of the steps are
    Transcribe
    Code
    Enter
    Check
    Validate
    Clean
    Anonymise
  The steps vary depending on the type of data collected
  One element in common with all types of data
    Must record metadata
And next
  All the decisions have been made
  Your checklist/template has been made
  The data have been collected and processed
  What now?
    Complete metadata on
      the data
      the documentation
After: data
  Metadata on data: must be well organized
    How they were created
    How they were digitized
    How they were anonymised
    Explanation of codes used
    Explanation of classification scheme(s) used
      e.g., occupation
    Any and all changes that were made
    Access conditions
      e.g., member of your institution
    Terms of use
      e.g., academic or teaching purposes
      e.g., non-profit
After: data (cont’d)
  Data metadata
    File names
      Meaningful
      Set up a system beforehand
      Make sure everyone sticks to it
    Versioning
      Set up a system beforehand
      What changes necessitate a new version number?
        Version 1 to Version 2
        e.g., one of the variables was coded incorrectly, therefore the dataset was replaced
      What changes necessitate only a minor version number?
        Version 1 to Version 1.1
        e.g., something small like a spelling mistake
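The versioning rule on this slide can be written down as a tiny function so that the whole team applies it the same way. This is a sketch under the assumption that versions are written as `major` or `major.minor`; the function and parameter names are hypothetical.

```python
def next_version(current: str, substantive: bool) -> str:
    """Return the next version label for a dataset.

    substantive=True  -> the data themselves changed (e.g., a variable was
                         recoded and the dataset replaced): 1 becomes 2.
    substantive=False -> a small fix (e.g., a spelling mistake in a label):
                         1 becomes 1.1, and 1.1 becomes 1.2.
    """
    major_text, _, minor_text = current.partition(".")
    major = int(major_text)
    minor = int(minor_text) if minor_text else 0
    if substantive:
        return str(major + 1)
    return f"{major}.{minor + 1}"
```

Usage: `next_version("1", substantive=True)` gives `"2"`, and `next_version("1", substantive=False)` gives `"1.1"`, matching the two cases on the slide.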
After: data (cont’d)
  Transcribing guidelines set up beforehand
    Transcribing conventions
    Instructions
    Guidelines
  Variables
    Names
    Labels
      Comprehensible
      Unique
    Description
    Value labels
      Comprehensible
      Complete
    Associated question
After: data (cont’d)
  Recoded variables
    Why they were needed (e.g., geographic location)
    Why they were done the way they were (e.g., age)
    All of the above list under variables
  Derived variables
    Derived from what
      Be specific
    Why was it done
    All of the above list under variables
  Missing values
    Codes used
      Should be consistent
    Reasons for missing values
  Weighting variable(s)
    Description
    Formula(s)
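To illustrate why missing-value codes must be consistent and why the weighting formula must be documented, here is a minimal sketch of a weighted mean that skips missing codes. The specific codes (-9, -8, -7) and their meanings are assumptions for this example only; the point is that one shared table of codes is used everywhere.

```python
# Hypothetical missing-value codes, defined once and used for every variable.
MISSING_CODES = {-9: "refused", -8: "don't know", -7: "not applicable"}


def weighted_mean(values: list[float], weights: list[float]) -> float:
    """Weighted mean of the valid values: sum(w * x) / sum(w).

    Observations whose value is one of MISSING_CODES are excluded from both
    the numerator and the denominator.
    """
    total = 0.0
    weight_sum = 0.0
    for value, weight in zip(values, weights):
        if value in MISSING_CODES:
            continue
        total += value * weight
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("no valid observations to average")
    return total / weight_sum
```

If the missing code were left undocumented, a later analyst could easily treat -9 as a real value and get a very different estimate, which is why both the codes and the formula belong in the metadata.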
After: documentation
  What to put in?
    Information for a researcher looking at your dataset for the first time with no prior knowledge
    As specific as possible
    All associated documentation about the research
After: documentation (cont’d)
  Study background
    Purpose
    Time frame
    Geographic location
    Creator, principal investigator(s), other investigator(s)
    Funders
    Sampling design
      Description
      Size
      Any changes that were made
After: documentation (cont’d)
  Study description
    Describes all aspects of the data collection and processing
    Data collection methodology
    Data preparation procedure
    Data validation protocols
    Instruments used
    Geographic coverage
    Temporal coverage
    Date of file creation
    Description of codes and classifications used
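The study background and study description items can double as a completeness checklist. The sketch below assumes the study-level documentation is kept as a simple dictionary; the field names are hypothetical labels for a subset of the items listed on the slides, not a standard metadata schema.

```python
# Hypothetical field names covering some of the study-level items listed above.
REQUIRED_FIELDS = [
    "purpose",
    "time_frame",
    "geographic_location",
    "principal_investigators",
    "funders",
    "sampling_design",
    "collection_methodology",
    "file_creation_date",
]


def missing_documentation(study: dict[str, str]) -> list[str]:
    """Return the required fields that are absent or blank in the study record."""
    return [
        name for name in REQUIRED_FIELDS
        if not str(study.get(name, "")).strip()
    ]
```

Running such a check before deposit gives the team a short, concrete list of what still has to be written down.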
After: documentation (cont’d)
  Codebook or user guide
    Original questionnaire/data collection instrument
    All interviewer instructions
    Any documentation describing variables
      Original ones
      Recoded
      Derived
      Weight
    Include formulas used to construct variables
A tip:
  Much of the information in the previous slides may seem like common sense
  You will be tempted not to follow it
    No time
    No facilities to record it
    Will do it later
    Minor change, therefore not important enough to mark down
    Of course, I will remember it!
  What if?
    You forget to mark it down
    You forget to tell the rest of the research team
  If you follow a checklist, neither you nor your team will be caught short!
In sum
  In this section you have learned
    What to do before data collection
      Plan and organize
        Data type, data collection and processing, storage, security, back-up, confidentiality, metadata
      Make a checklist or template
    About data collection and processing (in brief)
    What to do after data collection and processing
      Metadata: data, documentation
Research Data Management
Exercise #2: Data Readiness
  Is this data set ready for deposit? Why? Why not?
  Dataset Title: Attitudes of Pets towards their Owners (October 1998)
  Documentation available: The following text file:
    “This survey was conducted by the Pet Researchers of Canada and was analysed by the Acme Research Company. There is no documentation available for this survey. Use basic survey methodology if necessary. There are some interesting results in this survey.”
  Data available: A microdata file with some variable and value labels.
    Example 1: Name of variable: V35. Frequency: Yes = 35%, No = 47%
    Example 2: Name of variable: Region of Country. Frequency: 1 = 12%; 2 = 32%; 3 = 35%; 4 = 15%; 5 = 4%
Contact Information
  Pat Moore, Associate University Librarian: Research, Scholarship and Technology, Carleton University, x2745
  Jane Fry, Data Specialist, Carleton University, x1121