P20 Seminar November 12
Statistical Collaboration
Part 1: Working with Statisticians from Start to Finish
Part 2: Essentials of Data Management
Objectives
Participants will learn about:
- the process of consulting and collaborating with a statistician
- general principles of database setup, data entry, verification, cleaning and storage
Part 1: Working with Statisticians from Start to Finish
Kay Savik, MS
Collaboration
“Collaboration implies that statistician and researcher want to learn and exchange information. This exchange should be mutually beneficial.”
Gerald van Belle
Types of Consulting
- Cross-sectional: statistical advice for data already collected or analyzed
- Longitudinal: a long-term relationship between statistician and researcher
First Meeting
- Intent of study
- Source of data
- Sampling unit
- Randomization
- Model of effects
- Type of study
- Type of data
First Meeting
- What is the research question?
- What level of statistical knowledge does the researcher have?
- What are the data and what form are they in?
- What are the conventions in this specific area of study?
The Conversation
To prevent a type III error (the right answer to the wrong question!), clarify:
- Research aims
- Appropriate design
- Measurement
- Data management
- Analysis
Analysis Choice
- Sir David Cox: “Begin with very simple methods and, if possible, end with simple methods”
- Rindskopf’s Rules of Statistical Consulting: “Sometimes the ‘best’ or ‘right’ statistical procedure is not the best for a particular situation.”
Which Statistical Package?
- There is no single “perfect” software package for any procedure
- All standard packages have been tested and are reliable
- “Specialized” procedures are found in several packages
Collaborate Rather than Consult
- Collaboration is a communal activity
- Decide who is responsible for what at first meeting
- Politely and quickly leave a collaboration where any party seems misguided or unethical
- Decide on questions of authorship at first meeting
Part 2: Essentials of Data Management (DM)
Olga Gurvich, MA
Data Management
- Essential part of any research
- Interactive and collaborative venture of both investigator and statistician
- Requires a system that is well defined in advance and implemented consistently
Data Management Stages
- Database setup
- Raw data collection [who, what, when, how]
- Raw data entry, verification and cleaning
- Data storage
- [Data re-structuring for statistical analyses]
- [Data analysis]
- Data archiving
Database Setup - Software
Choice mainly depends on:
- Amount of data to be collected
- Complexity of data structure
- Type of data
- Export/import capabilities
- Planned statistical analyses and software
Software:
- Try to avoid Excel
- Alternatives: SPSS, Access, EpiInfo, output of survey software, plain text (ASCII)
Database Setup - Structure
- Participants => rows; variables => columns
- Logical Record: one row contains all data for a single study participant
- Multiple Record: multiple rows per single participant
- Relational: multiple data files that can be merged
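The three layouts above can be sketched with plain Python lists of dictionaries. This is only an illustration under stated assumptions: the variable names (`id`, `visit`, `sbp`, `age`), the values, and the helper functions `to_logical_record` and `merge_on_id` are all hypothetical and not part of the seminar material.

```python
# Hypothetical example data: two participants, each measured at two visits.
# Multiple-record layout: one row per participant per visit.
multiple_record = [
    {"id": 1, "visit": 1, "sbp": 120},
    {"id": 1, "visit": 2, "sbp": 118},
    {"id": 2, "visit": 1, "sbp": 135},
    {"id": 2, "visit": 2, "sbp": 130},
]

# Relational layout: a second data file keyed by the same participant ID.
demographics = [
    {"id": 1, "age": 54},
    {"id": 2, "age": 61},
]


def to_logical_record(rows):
    """Collapse one-row-per-visit data into one row per participant."""
    by_id = {}
    for row in rows:
        record = by_id.setdefault(row["id"], {"id": row["id"]})
        record[f"sbp_v{row['visit']}"] = row["sbp"]
    return [by_id[key] for key in sorted(by_id)]


def merge_on_id(left, right):
    """Merge two data files on the shared ID (keeps IDs present in both)."""
    lookup = {row["id"]: row for row in right}
    return [{**row, **lookup[row["id"]]} for row in left if row["id"] in lookup]


logical_record = to_logical_record(multiple_record)
merged = merge_on_id(logical_record, demographics)
```

Here `logical_record` holds one row per participant, and `merged` adds the demographic variables to each of those rows. The merge only works because both files carry the same unique ID.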
Database Setup - General
- Give the DB a short, meaningful and “dated” name
- A DB given to a statistician for cleaning and analyses should include:
  - ONLY collected raw data
  - NO graphs, comments, titles, summaries, hidden rows, split-spreadsheets, multiple spreadsheets, imposed “special” formats or highlighting
Database Setup - Variables
- Set unique numeric ID(s) in the first column(s)
- Identify types of variables, measurement units and type of recording [auto/manual]
- Carefully choose each variable’s format and length
- Date format: MM/DD/YYYY; if parts are missing, create three separate variables
- Time format: dd hh:mm:ss or similar
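As a rough sketch of the "three separate variables" advice for incomplete dates, the function below splits an MM/DD/YYYY string into month, day and year. It assumes, purely for illustration, that a missing part was left empty on the record (for example "03//2009"); the function name `split_date` and the output keys are hypothetical.

```python
def split_date(raw):
    """Split an MM/DD/YYYY string into three separate variables.

    Parts left empty on the original record (e.g. "03//2009") become None.
    """
    parts = raw.strip().split("/")
    if len(parts) != 3:
        raise ValueError(f"expected MM/DD/YYYY, got {raw!r}")
    # int() raises ValueError for non-numeric parts, which is what we want.
    month, day, year = (int(p) if p.strip() else None for p in parts)
    if month is not None and not 1 <= month <= 12:
        raise ValueError(f"month out of range in {raw!r}")
    if day is not None and not 1 <= day <= 31:
        raise ValueError(f"day out of range in {raw!r}")
    return {"month": month, "day": day, "year": year}


complete = split_date("03/15/2009")
partial = split_date("03//2009")
```

Keeping the parts separate means a known month and year are not lost just because the day was never recorded.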
Database Setup - Variables
- Create a separate variable for every separate piece of information
- Give unique, short [6-8 char], meaningful names
- No special characters [!, %, $, spaces]
- Do not start with a number
- Consider other restrictions of specific software [e.g., lower/upper case letters]
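The naming rules above are easy to check automatically before data entry begins. The sketch below is one possible check, not a standard tool. It assumes that letters, digits and the underscore are acceptable, that 8 characters is the limit, and that names differing only in case count as duplicates; the function name `check_variable_names` is hypothetical.

```python
import re

MAX_LENGTH = 8  # upper end of the 6-8 character guideline


def check_variable_names(names):
    """Return a dict mapping each problematic name to a list of reasons."""
    problems = {}
    seen = set()
    for name in names:
        reasons = []
        if len(name) > MAX_LENGTH:
            reasons.append("longer than 8 characters")
        if name[:1].isdigit():
            reasons.append("starts with a number")
        # Assumption: only letters, digits and underscore are allowed.
        if not re.fullmatch(r"[A-Za-z0-9_]+", name):
            reasons.append("empty or contains special characters")
        # Assumption: treat names differing only in case as duplicates.
        if name.lower() in seen:
            reasons.append("duplicate name")
        seen.add(name.lower())
        if reasons:
            problems[name] = reasons
    return problems


report = check_variable_names(
    ["id", "age", "1stvisit", "income$", "age", "bloodpressure"]
)
```

In this example `report` flags `1stvisit`, `income$`, the repeated `age` and the over-long `bloodpressure`, while `id` passes.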
Database Setup - Coding
- Assign short and meaningful codes, consistent across same-response variables
- Use numeric coding if possible; do not combine numeric and character codes within a numeric variable
- Address missing values
- Avoid using “N/A”, “?”, etc. entirely
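A minimal sketch of consistent numeric coding follows. The coding scheme (0 = no, 1 = yes), the set of raw entries treated as missing, and the choice to store missing values as `None` are all assumptions made for this illustration; the slides do not prescribe a particular missing-value convention.

```python
# Hypothetical coding scheme shared by every yes/no variable in the study.
YES_NO_CODES = {"no": 0, "yes": 1}

# Raw entries treated as missing (an assumption for this sketch).
MISSING_MARKERS = {"", "n/a", "na", "?", "."}


def encode(raw, codes):
    """Map a raw text response to its numeric code, or None if missing."""
    value = raw.strip().lower()
    if value in MISSING_MARKERS:
        return None
    if value not in codes:
        raise ValueError(f"unexpected response {raw!r}")
    return codes[value]


responses = {"smoker": "Yes", "diabetic": "no", "asthma": "N/A"}
coded = {name: encode(value, YES_NO_CODES) for name, value in responses.items()}
```

Because every yes/no variable goes through the same table, the codes stay consistent, and no "N/A" or "?" text ends up inside a numeric variable.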
Database Setup - Codebook/Data Dictionary
A written handbook with information on study data:
- Study title, PI name, date of last update, DB name and location
- Number of observations, number of variables
- Study variables and their attributes [name, label, location (ASCII), coding (values), format, measurement units]
- Other [formulae, weights, scoring documentation, etc.]
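One way to keep a codebook in step with the database is to store it as structured data and print it from there. The sketch below assumes a hypothetical study with three variables; every name, label and code in it is invented for illustration, and the `describe` helper is not part of any package.

```python
# Hypothetical codebook: study-level information plus one entry per variable.
codebook = {
    "study": {
        "title": "Hypothetical example study",
        "pi": "PI name",
        "db_name": "example_db",
        "n_observations": 2,
        "n_variables": 3,
    },
    "variables": [
        {"name": "id", "label": "Participant ID", "format": "numeric",
         "units": None, "codes": None},
        {"name": "sex", "label": "Sex", "format": "numeric",
         "units": None, "codes": {0: "female", 1: "male"}},
        {"name": "weight", "label": "Body weight", "format": "numeric",
         "units": "kg", "codes": None},
    ],
}


def describe(variable):
    """Format one variable entry as a single codebook line."""
    parts = [variable["name"], variable["label"], variable["format"]]
    if variable["units"]:
        parts.append(f"units={variable['units']}")
    if variable["codes"]:
        coding = ", ".join(
            f"{code}={text}" for code, text in variable["codes"].items()
        )
        parts.append(f"codes: {coding}")
    return " | ".join(parts)


lines = [describe(variable) for variable in codebook["variables"]]
```

Each element of `lines` is one printable codebook row, so the written handbook can be regenerated whenever a variable is added or recoded.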
Data Entry, Verification and Cleaning
The ultimate aim is a fully documented, backed-up archive of verified, validated and ready-for-use data
Data Entry
- “Do it promptly, completely and consistently”
- Preferably one trained data entry person [unless double entry]
- Unique ID(s)
- All data must be entered in “raw” form directly from the original records: NO hand calculations
- Frequent back-up
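Where double entry is used, the two independently keyed copies can be compared field by field. The following is a minimal sketch, assuming each copy is held as a dictionary keyed by the unique participant ID; the data and the function name `compare_entries` are hypothetical.

```python
def compare_entries(first, second):
    """Compare two independent entries keyed by participant ID.

    Returns a list of (id, variable, value_1, value_2) discrepancies,
    ordered by ID and then by variable name.
    """
    discrepancies = []
    for pid in sorted(set(first) | set(second)):
        row1 = first.get(pid, {})
        row2 = second.get(pid, {})
        for var in sorted(set(row1) | set(row2)):
            value1, value2 = row1.get(var), row2.get(var)
            if value1 != value2:
                discrepancies.append((pid, var, value1, value2))
    return discrepancies


# Hypothetical data: the second entry transposed two digits of sbp.
entry_a = {101: {"age": 54, "sbp": 120}, 102: {"age": 61, "sbp": 135}}
entry_b = {101: {"age": 54, "sbp": 102}, 102: {"age": 61, "sbp": 135}}
mismatches = compare_entries(entry_a, entry_b)
```

Each mismatch is then resolved by going back to the original record, not by guessing which entry is right.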
Data Verification and Cleaning
- Optimally done by a statistician or DM professional in close collaboration with the investigator
- Includes (but is not limited to) general and logic checks to detect errors and outliers, and verification of data completeness (subjects and variables)
- Audit trail/log book for a complete record of changes made
- Following all necessary corrections, ONE FINAL CLEAN DB is created
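The checks and the audit trail described above might look like the sketch below. It is an illustration only: the plausibility limits, the variable names, the single logic check (an end day before a start day) and the helpers `find_problems` and `correct` are all assumptions, and real limits would come from the investigator.

```python
# Hypothetical plausibility limits for the range checks.
RANGES = {"age": (18, 100), "sbp": (70, 250)}


def find_problems(rows, expected_ids):
    """Run range, logic and completeness checks; return (id, variable, issue)."""
    problems = []
    seen_ids = set()
    for row in rows:
        pid = row["id"]
        if pid in seen_ids:
            problems.append((pid, "id", "duplicate ID"))
        seen_ids.add(pid)
        for var, (low, high) in RANGES.items():
            value = row.get(var)
            if value is None:
                problems.append((pid, var, "missing"))
            elif not low <= value <= high:
                problems.append((pid, var, f"out of range: {value}"))
        # Logic check: follow-up cannot end before it starts.
        start, end = row.get("start_day"), row.get("end_day")
        if start is not None and end is not None and end < start:
            problems.append((pid, "end_day", "ends before it starts"))
    # Completeness of subjects: every enrolled ID should be in the DB.
    for pid in sorted(set(expected_ids) - seen_ids):
        problems.append((pid, "id", "participant absent from DB"))
    return problems


def correct(rows, pid, var, new_value, reason, audit_log):
    """Apply one correction and record it in the audit trail."""
    for row in rows:
        if row["id"] == pid:
            audit_log.append({"id": pid, "variable": var, "old": row.get(var),
                              "new": new_value, "reason": reason})
            row[var] = new_value
            return
    raise KeyError(pid)


raw_rows = [
    {"id": 1, "age": 54, "sbp": 120, "start_day": 0, "end_day": 30},
    {"id": 2, "age": 6, "sbp": 135, "start_day": 0, "end_day": 28},
    {"id": 3, "age": 47, "sbp": None, "start_day": 10, "end_day": 5},
]
clean_rows = [dict(row) for row in raw_rows]  # the raw copy stays untouched
problems = find_problems(clean_rows, expected_ids=[1, 2, 3, 4])

audit_log = []
correct(clean_rows, 2, "age", 60, "original form shows 60", audit_log)
```

The corrections are applied to a copy, so the initial raw data stay intact, and `audit_log` holds the old value, new value and reason for every change.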
Data Storage
Stored on a password-protected server are:
1. ONE INITIAL RAW DB
2. ONE FINAL CLEAN DB
3. CODEBOOK
4. Audit trail or log book [if used]
- Frequent BACK-UPs are performed
- All previous DB versions EXCEPT the initial raw one are destroyed
Data Re-Structuring
- If not foreseen in advance, re-structuring may be needed for certain analyses
- Usually can be done in statistical packages
- Keep a record of any re-structuring
- Use a “version-” or “date-numbering” system
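A common re-structuring is turning one row per participant into one row per participant per visit. Statistical packages have their own commands for this; the plain-Python sketch below only illustrates the idea, together with a date-numbered file name and a one-line record of what was done. The column prefix `sbp_v`, the helper names and the date are hypothetical.

```python
from datetime import date


def wide_to_long(rows, id_var, prefix):
    """Turn columns such as sbp_v1, sbp_v2 into one row per visit."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name.startswith(prefix):
                visit = int(name[len(prefix):])
                long_rows.append(
                    {id_var: row[id_var], "visit": visit, "value": value}
                )
    return sorted(long_rows, key=lambda r: (r[id_var], r["visit"]))


def dated_name(stem, when, extension="csv"):
    """Build a date-numbered file name, e.g. stem_YYYYMMDD.csv."""
    return f"{stem}_{when:%Y%m%d}.{extension}"


wide = [
    {"id": 1, "sbp_v1": 120, "sbp_v2": 118},
    {"id": 2, "sbp_v1": 135, "sbp_v2": 130},
]
long_rows = wide_to_long(wide, "id", "sbp_v")
file_name = dated_name("sbp_long", date(2020, 1, 31))

# Record of the re-structuring, kept alongside the data.
restructure_log = [f"{file_name}: wide to long on sbp_v*, {len(long_rows)} rows"]
```

Writing the date into the file name and logging each step makes it possible to tell later which file a given analysis was run on.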
Data Archiving
- At the end of a project, the data, codebook, log book and programs [syntax] must be archived
- The archive serves as permanent storage and gives access to all project-related information
- Keep a copy of the archive and a detailed report of the archive’s structure