
1 Replicating Results - Procedures and Pitfalls, June 1, 2005

2 The JMCB Data Storage and Evaluation Project
Project summary:
–Part 1: in July 1982, the JMCB started requesting programs/data from authors
–Part 2: attempted replication of published results based on the submissions
Results from Part 2 are reviewed in “Replication in Empirical Economics: The Journal of Money, Credit and Banking Project,” The American Economic Review, September 1986, by Dewald, Thursby, and Anderson

3 The JMCB Data Storage and Evaluation Project / Dewald et al
The paper focuses on Part 2:
–How people responded to the request
–The quality of the data that was submitted
–The actual success (or lack thereof) of replication efforts

4 The JMCB Data Storage and Evaluation Project / Dewald et al
Three groups:
–Group 1: Papers submitted and published prior to 1982. These authors did not know upon submission that they would subsequently be asked for programs/data.
–Group 2: Authors whose papers were accepted for publication beginning July 1982
–Group 3: Authors whose papers were under review beginning July 1982

5 Summary of Responses / Datasets Submitted (Dewald et al, p. 591)

                                 Group 1    Group 2    Group 3
Requests                              62         27         65
Responses:
  Total                               42         26         49
  Percent                            68%        96%        75%
  Mean response time (days)          217        125        130
Datasets Submitted               22 (35%)   21 (78%)   47 (72%)
Datasets Not Submitted                40          6         18
  Confidential Data                    2          1          0
  Lost or Destroyed Data              14          2          1
  Data Available, but Not Sent         4          2          1
  Nonrespondents                      20          1         16

6 Summary of Examined Datasets (Dewald et al, pp. 591-592)

                                              Group 1   Group 2   Group 3
Total Datasets Submitted                           22        20        47
Data Sets Examined                                 19        14        21
No Problems                                         1         3         4
Problems by type:
  Incomplete Submission                             6         3         5
  Sources Cited Incorrectly                         0         4         4
  Sources Cited Imprecisely                        11         7        10
  Data Transformations Described Incompletely       3         4         1
  Data Element Not Clearly Defined                  2         3         2
  Other                                             0         3         1
Total Problems                                     22        24        23

7 “Our findings suggest that inadvertent errors in published empirical articles are a commonplace rather than a rare occurrence.” – Dewald et al, pp. 587-588

“We found that the very process of authors compiling their programs and data for submission reveals to them ambiguities, errors, and oversights which otherwise would be undetected.” – Dewald et al, p. 589

8 Raw data to finished product
Raw data -> Analysis data -> Runs/results -> Finished product

9 Raw Data -> Analysis Data
Always have two distinct data files: the raw data and the analysis data
A program should completely re-create the analysis data from the raw data
NO interactive changes!! Final changes must go in a program!!
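A minimal sketch of such a program as a Stata do-file; the file and variable names (survey_raw.dta, income) are hypothetical, not from the slides:

* cleandata.do - re-creates the analysis data from the raw data (hypothetical names)
clear all
use "raw/survey_raw.dta"                  // read the raw file; never edit it in place
replace income = . if income < 0          // every change lives here, not in interactive edits
generate log_income = ln(income)          // derived variable used in the analysis
save "data/survey_analysis.dta", replace  // the distinct analysis file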

10 Raw Data -> Analysis Data
Document all of the following:
–Outliers?
–Errors?
–Missing data?
–Changes to the data?
Remember to check:
–Consistency across variables
–Duplicates
–Individual records, not just summary stats
–“Smell tests”
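These checks are easy to script; a hedged Stata sketch, with the variable names (id, age, income, employed) assumed for illustration:

use "data/survey_analysis.dta", clear
duplicates report id                      // any duplicated identifiers?
assert inrange(age, 0, 120)               // "smell test": ages must be plausible
assert !missing(income) if employed == 1  // consistency across variables
list id age income in 1/10                // inspect individual records, not just summaries
summarize                                 // summary statistics as a final pass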

11 Analysis Data -> Results
All results should be produced by a program
The program should use the analysis data (not the raw data)
Have a “translation” of raw variable names -> analysis variable names -> publication variable names
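One way to keep that translation current is to record it in the cleaning program itself; a sketch with hypothetical names:

rename v217 educ                          // raw name -> analysis name
label variable educ "Years of schooling"  // analysis name -> publication name
rename v9 income
label variable income "Annual income (USD)"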

12 Analysis Data -> Results
Document:
–How were variances estimated? Why?
–What algorithms were used, and why? Were results robust?
–What starting values were used? Was convergence sensitive to them?
–Did you perform diagnostics? Include them in programs/documentation.
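The program itself can record these choices; a sketch (continuing the hypothetical variables above) in which a diagnostic motivates the variance estimator:

regress log_income educ               // baseline fit
estat hettest                         // Breusch-Pagan test for heteroskedasticity
regress log_income educ, vce(robust)  // robust (Huber-White) variances, motivated by the test above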

13 Thinking ahead
Delete or archive old files as you go
Use a meaningful directory structure (/raw, /data, /programs, /logfiles, /graphs, etc.)
Use relative pathnames
Use meaningful variable names
Use a script to sequentially run programs

14 Example script to sequentially run programs

#! /bin/csh
# File location: /u/machine/username/project/scripts/myproj.csh
# Author: your name
# Date: 9/21/04
# This script runs a do-file in Stata which produces and saves a .dta file
# in the data directory. Stat/Transfer converts the .dta file to .sas7bdat
# and saves the file in the data folder. The program analyze.sas is then
# run on the new SAS data file.
cd /u/machine/username/project/
stata -b do programs/cleandata.do
st data/H00x_B.dta data/H00x_B.sas7bdat
sas programs/analyze.sas

15 Log files
Your log file should tell a story to the reader
As you print results to the log file, include words explaining the results
Don’t output everything to the log file: use quietly and noisily in a meaningful way
Include not only what your code is doing, but your reasoning and thought process
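In Stata this might look like the following sketch (names continue the hypothetical example above):

log using "logfiles/analyze.log", replace text
display "Regression of log income on schooling; we expect a positive coefficient"
quietly regress log_income educ        // suppress the routine output...
display "educ coefficient: " _b[educ]  // ...and narrate only the result that matters
log close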

16 Project Clean-up
Create a zip file that contains everything necessary for complete replication
Delete/archive unused or old files
Include any referenced files in the zip
When you have a final zip archive containing everything:
–Unzip it in its own directory and run the script
–Check that all the results match

17 When there are data restrictions…
Consider releasing:
–the subset of the raw data used
–your analysis data as opposed to the raw data
–(at a minimum) notes on the process from raw to analysis data, PLUS everything pertaining to the data analysis
Consider “internal” and “external” versions of your log-file:
–Do this via a variable at the top of your log-files:
local internal = 1
…
list if `internal' == 1
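A self-contained sketch of this pattern, using Stata’s built-in auto dataset as a stand-in for restricted data:

local internal = 1             // flip to 0 before producing the external log
sysuse auto, clear             // demo dataset stands in for the restricted data
summarize price mpg            // aggregate results: safe for both versions
if `internal' == 1 {
    list make price in 1/5     // record-level detail stays internal only
}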

18 Ethical Issues
All authors are responsible for proper clean-up of the project
Extremely important whether or not you plan on releasing data and programs
Motivation:
–self-interest
–honest research
–the scientific method
–allowing others to be critical of your methods/results
–furthering your field

19 Ethical Issues – for discussion
What if third-party redistribution of the data is not allowed?
Solutions for releasing data while protecting your time investment in data collection
Is it unfair to ask people to release data after a huge time investment in the collection?

