Erin Osborne Nishimura onish.web.unc.edu University of North Carolina at Chapel Hill.


1 Erin Osborne Nishimura onish.web.unc.edu University of North Carolina at Chapel Hill

2 Reproducible Research
Specific aim: The goal of reproducible research is to aid in the exact replication of scientific findings by an independent investigator.

3 Why reproducible research?
Accuracy
Efficiency
Accessibility

4 The key components of reproducible research
Organization
  – Structuring project folders in meaningful ways
  – Naming files
Documentation
  – Taking notes along the way
  – Reporting what your code is doing
Automation
  – Writing reproducible code
  – Making sure the code produces the same results each time
Publication
  – Making your data, analysis, and/or code available to collaborators or the public

5 1) ORGANIZATION
Best organizational practices
  – Structure projects into a directory/folder.
  – Employ a stereotyped file structure.
  – Document the project as you go.
  – Practice effective naming.

6 EXAMPLE: One person’s file structure
Six months ago, your lab read a paper from the Rick Young lab stating that histone-modifying enzymes affect gene expression. Your lab noticed that some of the enzymes with the biggest changes were positively charged. At that time, you started a project to plot the charges of different enzymes and then compare the results to the Rick Young dataset from the paper. Then, you got distracted. Now, six months later, you realize you should probably re-do the analysis. You open up the folder/directory where you were performing your work only to see…
    $ ls
    data_from_RickYoungLab_cleanversion.txt
    data_from_RickYoungLab.txt
    data_fromRickYoungLab.xlsx
    first_pass_plots.pdf
    firstpass_main_script.sh
    hg19_13947561833.gtf
    hg19_histoneacetyltransferases_only.gtf
    jan_22_2014_latest_main_script.sh
    latest_main_script.sh
    notes_to_myself.docx
    outputplot.pdf
    plot_enzyme_charge.R
    temp1.txt
    temp2.txt
    test1.txt
    test.gtf
What are the problems with this organization? Can you reorganize the folder more effectively?

7 EXAMPLE: flat organization
The same project folder from the previous example, reorganized as a flat structure with numbered, dated file names:
    $ ls
    00_histoneChargeProject_README.txt
    01_input_data_from_RickYoungLab_raw.txt
    01_input_data_from_RickYoungLab_raw.xlsx
    02_input_data_RYLab_processed.txt
    03_input_hg19_reference.gtf
    04_mainScript_14-01-22.sh
    04_mainScript_previous_versions/
        04_mainScript_14-01-20.sh
        04_mainScript_14-01-21.sh
    05_plotEnzymeCharge.R
    06_outputPlot_14-01-20.pdf
    06_outputPlot_14-01-22.pdf
    07_testFiles/
        temp1_14-01-20.txt
        temp2_14-01-20.txt
        test1_14-01-20.txt
        test_14-01-20.gtf

8 EXAMPLE: hierarchical organization
    $ ls
    00_documentation/
        histoneChargeProject_README.txt
    01_input/
        data_from_Rick_Young_lab/
            inputdata_RickYoungLab_raw.txt
            inputdata_RickYoungLab_raw.xlsx
            inputdata_RYLab_processed.txt
        hg19_annotations/
            hg19_13947561833.gtf
            hg19_histoneacetyltransferases_only.gtf
    02_mainScript/
        14-01-22_mainScript.sh
        previous_versions/
            14-01-20_mainScript.sh
            14-01-21_mainScript.sh
    03_auxiliary_scripts/
        plotEnzymeCharge.R
    04_output/
        14-01-22_outputPlot.pdf
        14-01-22_outputPlot.log
        previous_versions/
            14-01-20_outputPlot.pdf
            14-01-20_outputPlot.log
    05_testFiles/
        temp1_14-01-20.txt
        temp2_14-01-20.txt
        test1_14-01-20.txt
        test_14-01-20.gtf
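A hierarchy like the one on this slide can be created by a small script, so every new project starts with the same layout. The following is only a sketch: the function name `scaffold_project`, the README stub fields, and the use of a temporary directory for the demo are assumptions for illustration, not part of the original slides.

```shell
#!/bin/sh
# Sketch: create the numbered project hierarchy for a new project.
# scaffold_project is a hypothetical helper name, not from the slides.
set -eu

scaffold_project() {
    root="$1"
    for dir in 00_documentation 01_input \
               02_mainScript/previous_versions \
               03_auxiliary_scripts \
               04_output/previous_versions \
               05_testFiles; do
        mkdir -p "$root/$dir"
    done
    # Start the README immediately so documentation begins with the project.
    readme="$root/00_documentation/$(basename "$root")_README.txt"
    if [ ! -e "$readme" ]; then
        printf 'Author:\nLab:\nStartDate: %s\n' "$(date +%Y-%m-%d)" > "$readme"
    fi
}

# Demo in a throwaway directory so nothing real is touched.
demo_root="$(mktemp -d)/histoneChargeProject"
scaffold_project "$demo_root"
ls "$demo_root"
```

Running the function a second time on the same directory is harmless: `mkdir -p` accepts existing directories and the README is only written if it does not exist yet.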

9 Naming matters!
    $ ls
    00_documentation/
        histoneChargeProject_README.txt
    01_input/
        data_from_Rick_Young_lab/
            inputdata_RickYoungLab_raw.txt
            inputdata_RickYoungLab_raw.xlsx
            inputdata_RYLab_processed.txt
        hg19_annotations/
            hg19_13947561833.gtf
            hg19_histoneacetyltransferases_only.gtf
    02_mainScript/
        14-01-22_mainScript.sh
        previous_versions/
            14-01-20_mainScript.sh
            14-01-21_mainScript.sh
    03_auxiliary_scripts/
        plotEnzymeCharge.R
    04_output/
        14-01-22_outputPlot.pdf
        14-01-22_outputPlot.log
        previous_versions/
            14-01-20_outputPlot.pdf
            14-01-20_outputPlot.log
    05_testFiles/
        temp1_14-01-20.txt
        temp2_14-01-20.txt
        test1_14-01-20.txt
        test_14-01-20.gtf
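The dated names in this listing (for example `14-01-22_mainScript.sh`) use a YY-MM-DD prefix, which means a plain alphabetical sort is also a chronological sort. A minimal sketch of generating such names follows; the helper name `dated_name` is an assumption for illustration.

```shell
#!/bin/sh
# Sketch: build date-prefixed file names such as 14-01-22_mainScript.sh.
# dated_name is a hypothetical helper, not from the slides.
set -eu

dated_name() {
    base="$1"
    # Use the given date if there is one, otherwise today's date as YY-MM-DD.
    day="${2:-$(date +%y-%m-%d)}"
    printf '%s_%s\n' "$day" "$base"
}

name_today="$(dated_name mainScript.sh)"
name_fixed="$(dated_name mainScript.sh 14-01-22)"
echo "$name_today"
echo "$name_fixed"

# With this scheme, sorting the names puts the newest version last.
newest="$(printf '%s\n' 14-01-22_mainScript.sh 14-01-20_mainScript.sh \
    14-01-21_mainScript.sh | sort | tail -n 1)"
echo "newest: $newest"
```

A two-digit year keeps the names short, as on the slide; a four-digit year (`%Y`) is the safer choice for long-lived projects.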

10 A typical hierarchical structure
From: A quick guide to organizing computational biology projects. PLoS Computational Biology, 2009.

11 2) DOCUMENTATION
Write a project overview in a README file
Take notes along the way
  – Actual notes
  – Note-taking .txt files or software
Produce dynamic documentation of your project
  – Files that your code generates that weave together code and output
Write scripts that generate log files
  – Output files that your code generates to keep a record of software versions, commands used, input, and output
Write comments within code
Write pseudocode

12 EXAMPLE: a README file
    00_documentation/histoneChargeProject_README.txt
    ###########################################
    INFO:
    Author: Claude DeScientist
    Lab: Jason Lieb Lab
    StartDate: January 20, 2014
    ###########################################
    PROJECT:
    I want to understand the link between the charges associated with key histone-modifying enzymes and the gene expression changes that result when these histone-modifying enzymes are removed.
    ###########################################
    DIRECTORY:
    killdevil:/nas02/home/c/l/cdesci/
    ###########################################
    01_INPUT:
    I will use data from a Rick Young Lab paper and the human genome annotations as input.
    INPUT DATA: RICK YOUNG LAB:
    Information on the Rick Young Lab paper dataset is:
    Supplemental Materials 1 was downloaded on January 20, 2014 from:
    Devries, et al., Nature, 2013
    http://nature.uk.pub/134345234/sfig1
    INPUT DATA: HG ANNOTATION:
    …

13 EXAMPLE: Note taking
  – Date
  – Path where you were working
  – Project you were working on
  – Dates and URLs for downloads
  – Links to published datasets you are using
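The fields listed on this slide can be captured with very little effort if a small helper fills in the date and working path automatically. The sketch below is one possible approach; the function name `add_note`, the field labels, and the example URL are all assumptions for illustration.

```shell
#!/bin/sh
# Sketch: append one dated entry to a plain-text notes file.
# add_note and its field labels are hypothetical, not from the slides.
# Usage: add_note NOTES_FILE PROJECT "free text" [URL]
set -eu

add_note() {
    notes_file="$1"
    project="$2"
    text="$3"
    url="${4:-}"
    {
        printf 'DATE: %s\n' "$(date '+%Y-%m-%d %H:%M')"
        printf 'PATH: %s\n' "$(pwd)"
        printf 'PROJECT: %s\n' "$project"
        # The URL line is only written when a URL was supplied.
        if [ -n "$url" ]; then
            printf 'URL: %s\n' "$url"
        fi
        printf 'NOTE: %s\n\n' "$text"
    } >> "$notes_file"
}

# Demo with a temporary notes file and a placeholder URL.
notes="$(mktemp)"
add_note "$notes" histoneChargeProject "Downloaded supplemental table" \
    "https://example.org/placeholder-dataset"
add_note "$notes" histoneChargeProject "Re-ran the plotting script"
cat "$notes"
```

Because entries are appended rather than overwritten, the file becomes a chronological record of what was done and where.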

14 3) AUTOMATION
Try to minimize manual manipulations of data.
Set seeds on random number generators.
Write scripts that automatically generate output and log files in structured directories.
Keep track of ALL software and package versions.
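Setting a seed makes "random" steps repeatable: the same seed yields the same sequence of numbers on every run with the same software. In R this is done with `set.seed()`; the sketch below shows the same idea from a shell script using awk's `srand()`. The helper name `draw` and the seed value 42 are arbitrary choices for illustration.

```shell
#!/bin/sh
# Sketch: seeded pseudo-random numbers are identical across runs.
# draw is a hypothetical helper; 42 is an arbitrary seed.
set -eu

draw() {
    seed="$1"
    n="$2"
    # srand(seed) fixes the starting point of awk's generator.
    awk -v seed="$seed" -v n="$n" \
        'BEGIN { srand(seed); for (i = 0; i < n; i++) print rand() }'
}

run_a="$(draw 42 3)"
run_b="$(draw 42 3)"

if [ "$run_a" = "$run_b" ]; then
    echo "same seed, same numbers"
fi
```

The actual numbers depend on the awk implementation and version, which is one more reason to record software versions alongside the seed.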

15 EXAMPLE: An automatically generated log file
    ######################################################################
    2015-05-16_20:57 RUNNING
    ######################################################################
    2015-05-16_20:57 INITIATED autoAnalyzeChipseq.sh using command:
    chipSeqAutoAnalyzePipeline/chipSeqAnalyzeStep1.sh --multi ../01_input/mAR100_JM127_L007_001.fastq.gz --bar ../01_input/barcode_index_AR100.txt -p 4 --extension 150
    This pipeline will run with the following modules:
      1) null              4) bedtools/2.22.1   7) r/3.1.1
      2) git/1.8.5.3       5) bowtie/1.1.0      8) fastqc/0.11.3
      3) samtools/0.1.19   6) java/1.8.0_11
    RUNNING IN SPLIT-N-ALIGN MODE
    FILE TO SPLIT IS: ../01_input/mAR100_JM127_L007_001.fastq.gz
    BARCODE FILE IS: ../01_input/barcode_index_AR100.txt
    ######################################################################
    2015-05-16_20:57 SPLITTING
    ######################################################################
    zcat ../01_input/mAR100_JM127_L007_001.fastq.gz | fastx_barcode_splitter.pl --bcfile ../01_input/barcode_index_AR100.txt …
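A log like this one can be produced by wrapping each pipeline step in a function that writes the timestamp and the exact command before running it. The following sketch shows the pattern only; it is not the pipeline from the slide, and the name `run_logged` and the log field labels are assumptions.

```shell
#!/bin/sh
# Sketch: run a command and record when, where, and how it was run.
# run_logged is a hypothetical helper, not the pipeline from the slide.
set -eu

run_logged() {
    log_file="$1"
    shift
    {
        printf '######################################################\n'
        printf '%s RUNNING\n' "$(date '+%Y-%m-%d_%H:%M')"
        printf '######################################################\n'
        # "$*" joins the arguments with spaces (original quoting is not kept).
        printf 'COMMAND: %s\n' "$*"
        printf 'WORKING DIRECTORY: %s\n' "$(pwd)"
        printf 'SYSTEM: %s\n' "$(uname -sr)"
    } >> "$log_file"
    # Run the command itself; stdout and stderr both go to the log.
    "$@" >> "$log_file" 2>&1
}

# Demo with a harmless command and a temporary log file.
log="$(mktemp)"
run_logged "$log" echo "hello pipeline"
cat "$log"
```

In a real pipeline the same block would also record the version of each tool used, through whatever version flag or module listing that tool provides.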

16 The key components of reproducible research
Organization
  – Structuring project folders in meaningful ways
  – Naming files
Documentation
  – Taking notes along the way
  – Reporting what your code is doing
Automation
  – Writing reproducible code
  – Making sure the code produces the same results each time
Publication
  – Making your data, analysis, and/or code available to collaborators or the public

17 REFERENCES
A quick guide to organizing computational biology projects. PLoS Computational Biology, 2009.
Ten simple rules for reproducible computational research. PLoS Computational Biology, 2013.
Why do reproducible research? ropensci.github.io
Reproducible Research Blog

18 FURTHER LEARNING
Reproducible Research. Coursera. Next class August 3–29, 2015.
Reproducible Science Workshop – Tools, Resources, & Practices. Duke University. Periodically announced. Materials available for self-study.
Tool for Reproducible Research – University of Wisconsin. Materials available for self-study.

