How NOT to share your data: Avoiding data horror stories Rosie Higman Office of Scholarly Communication 8th March 2017
Where? What? File formats Formatting your spreadsheet* Document and describe your data! * Based on the Avoiding data disasters course by Mark Dunning, CRUK-CI http://bioinformatics-core-shared-training.github.io//avoid-data-disaster/
Warning! Every discipline is different These are general principles Application will vary according to your research
Where NOT to share your data Real examples – many found in various DAF surveys
Where SHOULD you share your data? Disciplinary repositories where possible, with Apollo as a repository of last resort for Cambridge people. DOI. Preservation policy. Well-indexed in search engines.
You’ve decided to put your data in a repository; now you need to decide what data to include and how to present it. Roche DG, Lanfear R, Binning SA, Haff TM, Schwanz LE, et al. (2014) Troubleshooting Public Data Archiving: Suggestions to Increase Participation. PLoS Biol 12(1): e1001779. doi:10.1371/journal.pbio.1001779, CC BY 4.0
What data should you include? Graphs without underlying data do not add much to your article
What data should you include? Numerical data in Word documents is not hugely helpful
What data should you include? More than your figures! Data and code necessary to recreate your results
PowerPoint is for presentations NOT data! Researchers often want to annotate images to highlight important areas and add metadata, but pasting images into PowerPoint loses the metadata embedded in the original image files. This is particularly important as some microscopes etc. automatically embed substantial metadata. PowerPoint is a poor format for both re-use and preservation.
PowerPoint is for presentations NOT data! Instead: Original image files. Appropriate formats. Annotations in a separate PDF/CSV/TXT file (README file).
Think! Preservation vs access/re-use. Find your file format: Textual data = XML, TXT, HTML, PDF/A (archival PDF); Tabular data (spreadsheets) = CSV; Databases = XML, CSV; Images = TIFF, PNG, JPEG*; Audio = FLAC, WAV, MP3. You may need to submit multiple copies of the same data: one which facilitates easy re-use and one which can be preserved. This is important for many research funders. *JPEG is a lossy format (some data is lost in compressing images into JPEGs), but it may be necessary in some cases, as TIFF files can be very large and so expensive to store.
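As a rough illustration, the format list above can be turned into a small lookup that tells you which formats to aim for when you prepare a file for deposit. The function name `suggest_formats` and the extension-to-type table are assumptions made for this sketch; they are not part of any repository's tooling.

```python
from pathlib import Path

# Preferred formats per data type, taken from the list above.
PREFERRED = {
    "text": ["XML", "TXT", "HTML", "PDF/A"],
    "tabular": ["CSV"],
    "database": ["XML", "CSV"],
    "image": ["TIFF", "PNG", "JPEG"],
    "audio": ["FLAC", "WAV", "MP3"],
}

# Hypothetical mapping from common file extensions to a data type.
EXTENSION_TYPE = {
    ".doc": "text", ".docx": "text", ".txt": "text", ".html": "text",
    ".xls": "tabular", ".xlsx": "tabular", ".csv": "tabular",
    ".mdb": "database", ".accdb": "database",
    ".tif": "image", ".tiff": "image", ".png": "image", ".jpg": "image",
    ".wav": "audio", ".flac": "audio", ".mp3": "audio",
}


def suggest_formats(filename):
    """Return (data_type, preferred_formats), or (None, []) if the type is unknown."""
    extension = Path(filename).suffix.lower()
    data_type = EXTENSION_TYPE.get(extension)
    if data_type is None:
        return None, []
    return data_type, PREFERRED[data_type]


print(suggest_formats("results.xlsx"))
```

An unknown extension returns `(None, [])`, which is a prompt to check with your repository rather than a verdict on the format.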
Once you’ve found an appropriate file format, don’t forget to publish in a way which allows for re-use; otherwise there is no point in publishing! Read-only files make it less likely that other researchers will re-use your data and so cite you.
Messy spreadsheets are harder to re-use
Keep your spreadsheets tidy Graphs in separate sheet No highlighting No colours No formulas* *In your raw data Graphs should not obscure the underlying data. Highlighting and colours are not saved in CSV format and are not understood by computers, so they are not useful if others are automating their analysis. If you have used lots of formulas and they add to the data, it may be helpful to include an xlsx file with the formulae as well as a CSV which contains just the raw data.
Keep your spreadsheets tidy No blank cells 1 piece of data per cell Keep units out of cells Use data validation Instead of leaving cells blank, choose a null value which cannot be confused with real data. Units should be in column titles, not individual cells.
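A minimal sketch of how these rules could be checked automatically before you deposit a CSV file. The function name `check_tidy`, the sample data and the unit-detection pattern are assumptions for illustration; the pattern is a simple heuristic and will also flag identifiers such as "3B".

```python
import csv
import io
import re

# Heuristic: a number followed by letters or %, e.g. "14 g" or "30%",
# suggests a unit has been typed into the cell.
UNIT_IN_CELL = re.compile(r"^\s*-?\d+(\.\d+)?\s*[A-Za-z%]+\s*$")


def check_tidy(csv_text):
    """Return a list of (row_number, column_name, problem) tuples."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, value in row.items():
            if column is None:
                # DictReader stores cells beyond the header width under the key None.
                problems.append((row_number, None, "more cells than column titles"))
            elif value is None or value.strip() == "":
                problems.append((row_number, column, "blank cell"))
            elif UNIT_IN_CELL.match(value):
                problems.append((row_number, column, "unit inside cell"))
    return problems


# Units live in the column titles (mass_g, length_mm); "NA" is the chosen null value.
sample = "sample_id,mass_g,length_mm\nA1,12.5,30\nA2,,31\nA3,14 g,NA\n"
for problem in check_tidy(sample):
    print(problem)
```

Here row 3 is reported for its blank cell and row 4 for the "14 g" entry, while the explicit "NA" passes.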
Keep your spreadsheets tidy A paper found that Excel was automatically converting gene names into dates (gene symbols SEPT2 (Septin 2) and MARCH1 are converted by default to ‘2-Sep’ and ‘1-Mar’). This has corrupted many spreadsheets submitted as supporting information in genomics. It is important to be aware of what is in your spreadsheet and to use data validation when appropriate.
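One way to catch this kind of corruption is to scan a gene-name column for values that look like Excel's short dates. The sketch below assumes the column has already been read into a list of strings; the function name `find_date_like` and the sample values are hypothetical.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Matches values such as "2-Sep" or "1-Mar", the form a converted gene symbol takes.
DATE_LIKE = re.compile(rf"^\d{{1,2}}-({MONTHS})$", re.IGNORECASE)


def find_date_like(values):
    """Return (index, value) pairs for entries that look like converted dates."""
    return [
        (index, value)
        for index, value in enumerate(values)
        if DATE_LIKE.match(value.strip())
    ]


genes = ["TP53", "2-Sep", "BRCA1", "1-Mar"]
print(find_date_like(genes))
```

A hit does not prove that a gene symbol was converted, only that the cell deserves a second look against the original data.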
How NOT to describe your data
How you SHOULD describe your data Tells you what is in the dataset and how the data were collected.
Remember to document your data A good README file makes data much more usable. It gives you space to describe your methods and the process of cleaning and analysing the data, and it is an opportunity to make your data a valuable resource.
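If it helps to start from a skeleton, a README can be generated with a few lines of code and then filled in by hand. This is only a sketch: the section headings, the function name `build_readme` and the example file name are assumptions, and your repository or discipline may expect a different layout.

```python
def build_readme(title, files):
    """Build README text; `files` maps each filename to a one-line description."""
    lines = [title, "=" * len(title), ""]
    lines += ["What is in this dataset", ""]
    for name, description in sorted(files.items()):
        lines.append(f"  {name}: {description}")
    lines.append("")
    # Placeholder sections to be completed by the researcher.
    for heading in (
        "How the data were collected",
        "How the data were cleaned",
        "How the data were analysed",
    ):
        lines += [heading, "", "  TODO", ""]
    return "\n".join(lines)


readme = build_readme(
    "Example dataset",
    {"measurements.csv": "raw measurements, one row per sample"},
)
print(readme)
```

Saving the result as a plain TXT file alongside the data keeps the documentation in an open, preservable format.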
Choose a repository. Choose open file formats. Choose sharing more than your figures. Choose a tidy spreadsheet. Choose to describe your data. Choose decent documentation so your research is reproducible. CHOOSE DATA SHARING info@data.cam.ac.uk @CamOpenData www.data.cam.ac.uk