How NOT to share your data: Avoiding data horror stories


1 How NOT to share your data: Avoiding data horror stories
Rosie Higman Office of Scholarly Communication 8th March 2017

2 Where? What? File formats. Formatting your spreadsheet.* Document and describe your data!
* Based on the Avoiding data disasters course by Mark Dunning, CRUK-CI

3 Warning! Every discipline is different. These are general principles; application will vary according to your research.

4 Where NOT to share your data
Real examples – many found in various DAF surveys

5 Where SHOULD you share your data?
Disciplinary repositories where possible; Apollo as a repository of last resort for Cambridge people. Look for a DOI, a preservation policy, and good indexing in search engines.

6 You’ve decided to put your data in the repository; now you need to decide what data to include and how to present it. Roche DG, Lanfear R, Binning SA, Haff TM, Schwanz LE, et al. (2014) Troubleshooting Public Data Archiving: Suggestions to Increase Participation. PLoS Biol 12(1): e doi: /journal.pbio , CC BY 4.0

7 What data should you include?
Graphs without underlying data do not add much to your article

8 What data should you include?

9 What data should you include?
Numerical data in Word documents is not hugely helpful

10 What data should you include?
More than your figures! Data and code necessary to recreate your results

11 PowerPoint is for presentations NOT data!
Researchers often want to annotate images to highlight important areas and add metadata. But putting images into PowerPoint loses the metadata embedded in the original image files, which is particularly important as some microscopes etc. automatically embed substantial metadata. PowerPoint is a poor format for both re-use and preservation.

12 PowerPoint is for presentations NOT data!
Instead: original image files, in appropriate formats, with annotations in a separate PDF/CSV/TXT file (README file).

13 Find your file format
Textual data = XML, TXT, HTML, PDF/A (Archival PDF)
Tabular data (spreadsheets) = CSV
Databases = XML, CSV
Images = TIFF, PNG, JPEG*
Audio = FLAC, WAV, MP3
Think! Preservation vs access/re-use. You may need to submit multiple copies of the same data: one which facilitates easy re-use and one which will be possible to preserve. This is important for many research funders.
* JPEG is a lossy format (some data is lost in compressing images into JPEGs), but it may be necessary in some cases, as TIFF files can be very large and so expensive to store.
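As a rough illustration of checking a deposit against the formats listed on this slide, here is a minimal Python sketch. The function name `classify` and the three labels it returns are hypothetical, chosen only for this example. The check looks at file extensions alone, so it cannot tell an archival PDF/A from an ordinary PDF.

```python
from pathlib import Path

# Extensions for the formats named on the slide.
# Note: ".pdf" cannot distinguish PDF/A from ordinary PDF.
RECOMMENDED = {
    ".xml", ".txt", ".html", ".pdf",   # textual data
    ".csv",                            # tabular data and databases
    ".tiff", ".tif", ".png",           # images
    ".flac", ".wav",                   # audio
}

# Lossy formats: convenient for access, weaker for preservation.
# The slide only flags JPEG; MP3 is included here because it is also lossy.
LOSSY = {".jpeg", ".jpg", ".mp3"}


def classify(filename):
    """Return 'recommended', 'lossy' or 'check' for a file name."""
    extension = Path(filename).suffix.lower()
    if extension in LOSSY:
        return "lossy"
    if extension in RECOMMENDED:
        return "recommended"
    return "check"


for name in ["scan.TIFF", "photo.jpg", "slides.pptx"]:
    print(name, "->", classify(name))
```

A file labelled "lossy" may still be the right access copy; the point is to consider depositing a preservation copy alongside it.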

14 Once you’ve found an appropriate file format, don’t forget to publish in a way which allows for re-use; otherwise there is no point in publishing! Read-only files make it less likely that other researchers will re-use your data, and so cite you.

15 Messy spreadsheets are harder to re-use

16 Keep your spreadsheets tidy
Graphs in a separate sheet. No highlighting. No colours. No formulas.* (*In your raw data.) Graphs should not obscure the underlying data. Highlighting and colours are not saved in CSV format and are not understood by computers, so they are not useful if others are automating their analysis. If you have used lots of formulas and they add to the data, it might be helpful to include an xlsx file with the formulae, as well as a CSV which contains just the raw data.

17 Keep your spreadsheets tidy
No blank cells. 1 piece of data per cell. Keep units out of cells. Use data validation. Rather than leaving cells blank, choose a null value which cannot be confused with real data. Units should be in column titles, not individual cells.
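Two of these rules (no blank cells, no units inside cells) can be checked automatically once the data is in CSV. The sketch below is one possible approach using only Python's standard library. The function name `find_problems`, the sample column names and the pattern used to spot a unit are all assumptions made for illustration, and the pattern will miss less regular cases.

```python
import csv
import io
import re

# A number followed by letters or a unit symbol, e.g. "65 kg" or "12%".
UNIT_IN_CELL = re.compile(r"^\s*-?\d+(\.\d+)?\s*[A-Za-zµ%°]+\s*$")


def find_problems(csv_text):
    """Return (line number, column name, problem) for each untidy cell."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first line holds the column titles
    for line_number, row in enumerate(reader, start=2):
        for column, cell in zip(header, row):
            if cell.strip() == "":
                problems.append((line_number, column, "blank cell"))
            elif UNIT_IN_CELL.match(cell):
                problems.append((line_number, column, "unit inside cell"))
    return problems


# Hypothetical data: units belong in the column titles, "NA" is the null value.
sample = "id,mass_kg,height_cm\n1,70,180\n2,65 kg,\n3,NA,175\n"
for problem in find_problems(sample):
    print(problem)
```

Here the explicit null value "NA" passes the check, while the empty cell and "65 kg" on line 3 are both reported.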

18 Keep your spreadsheets tidy
A paper found that Excel was automatically converting gene names into dates (the gene symbols SEPT2 (Septin 2) and MARCH1 are converted by default to ‘2-Sep’ and ‘1-Mar’). This has corrupted many spreadsheets submitted as supporting information in genomics. It is important to be aware of what is in your spreadsheet and to use data validation when appropriate.
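A simple way to spot this kind of damage before depositing is to scan a gene-name column for values shaped like Excel's day-month output. The following Python sketch assumes the converted values look like the slide's examples (‘2-Sep’, ‘1-Mar’). The function name `looks_date_converted` is hypothetical, and a match only means the value deserves a manual check, since a genuine date would match too.

```python
import re

# Matches values such as "2-Sep" or "1-Mar": a day number, a hyphen,
# then a three-letter English month abbreviation.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"
)


def looks_date_converted(value):
    """Return True if a cell looks like a gene symbol turned into a date."""
    return DATE_LIKE.match(value.strip()) is not None


# Hypothetical gene-name column containing two damaged entries.
column = ["SEPT2", "2-Sep", "MARCH1", "1-Mar"]
suspicious = [value for value in column if looks_date_converted(value)]
print(suspicious)
```

The sketch deliberately does not try to restore the original symbol, because the date alone does not say which gene name it came from.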

19 How NOT to describe your data

20 How you SHOULD describe your data
Tells you what is in the dataset and how the data were collected.

21 Remember to document your data
A good README file makes data much more usable. It gives you space to describe your methods and the process of cleaning and analysing the data, and it is an opportunity to make your data a valuable resource.
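One way to make sure a README covers what is in the dataset, how the data were collected, and how they were cleaned and analysed is to generate a skeleton and fill it in. The Python sketch below is only a suggestion: the function name `build_readme`, its parameters and the section headings are assumptions for this example, not a required template.

```python
def build_readme(title, contents, collection, processing, files):
    """Assemble a plain-text README from a few short descriptions."""
    lines = [
        title,
        "=" * len(title),
        "",
        "What is in this dataset",
        contents,
        "",
        "How the data were collected",
        collection,
        "",
        "Cleaning and analysis",
        processing,
        "",
        "Files",
    ]
    # One line per file, so readers know what each file holds.
    for name, description in files.items():
        lines.append(f"{name}: {description}")
    return "\n".join(lines) + "\n"


# Hypothetical example values.
readme_text = build_readme(
    title="Example dataset",
    contents="Measurements for each sample, one row per sample.",
    collection="Describe instruments, dates and protocols here.",
    processing="Describe cleaning steps and analysis code here.",
    files={"data.csv": "raw data, units given in the column titles"},
)
print(readme_text)
```

Writing `readme_text` to a file named README.txt next to the data keeps the description in an open, preservable format.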

22 Choose a repository. Choose open file formats. Choose sharing more than your figures. Choose a tidy spreadsheet. Choose to describe your data. Choose decent documentation so your research is reproducible. CHOOSE DATA SHARING @CamOpenData

