Presentation is loading. Please wait.

Presentation is loading. Please wait.

Copyright © Software Carpentry 2011 This work is licensed under the Creative Commons Attribution License See

Similar presentations


Presentation on theme: "Copyright © Software Carpentry 2011 This work is licensed under the Creative Commons Attribution License See"— Presentation transcript:

1 Copyright © Software Carpentry 2011 This work is licensed under the Creative Commons Attribution License See http://software-carpentry.org/license.html for more information. Data Management

2 1 How do we manage multiple versions of data files?

3 Data Management 2 data samples.mat So far, so good…

4 Data Management 3 data samples.mat new version Now what?

5 Data Management 4 data samples.old.mat samples.mat I guess this is alright?

6 Data Management 5 data samples.old.mat samples.old2.mat samples.mat Okay…

7 Data Management 6 One problem: #!/usr/bin/env bash cat data/samples.mat \ | tr “^M” “\n” \ | sort -k 2,2nr \ |... > results/samples.txt generate_results.sh Which version of samples.mat was used to generate our results?

8 Data Management 7 In general, don’t move or rename files unless absolutely necessary. -makes it harder to reproduce results -makes it harder to find the data later -breaks scripts -breaks symbolic links (ln -s)

9 Data Management 8 But adding new filenames without structure isn’t necessarily any better…

10 Data Management 9 data samples.mat samplesV2.mat samplesFinal.mat samplesFinalV2.mat samplesUSE_THIS_ONE.mat Which one is the most recent?

11 Data Management 10 You could… 1)look at the data and guess 2)check the last-modified date And don’t ever move or copy the data!

12 Data Management 11 sounds like a job for…

13 Data Management 12 sounds like a job for…

14 Data Management 13 sounds like a job for… …or is it?

15 Data Management 14 Subversion is good for: -simultaneous modification -archive -merge -branch -diff Why shouldn’t we just use Subversion?

16 Data Management 15 Subversion is good for: -simultaneous modification -archive -merge -branch -diff Why shouldn’t we just use Subversion? what makes sense for data?

17 Data Management 16 Why shouldn’t we just use Subversion? Subversion is good for: -simultaneous modification -archive -merge -branch -diff what makes sense for data?

18 Data Management 17 What we need… a clear, stable, accessible way to archive data

19 Data Management 18 Performance reasons for not using version control on data -We can no longer remove or compress old and obsolete data files. -Checking out the repository means copying all your data files. -Many version control tools store text-based differences between versions of a file. For binary files, they essentially store a new copy of the file for each version.

20 Data Management 19 Common approach: data samples.mat old old2 Archive old data into subdirectories: -maintains file names -keeps all data files in same “version” together -still moves files, so scripts and history now refer to different data

21 Data Management 20 Another common approach: data samples.mat final final2 samples.mat really-final

22 Data Management 21 Two big problems still: 1)What order do they go in? The naming can easily become ambiguous. 2)Hard to distinguish “top-level” data directories from archived or revised data directories. Is data/saved/ an old version of all the data or a current version of “saved” data? Both are easily solved with a little thinking ahead

23 Data Management 22 From the start: data samples.mat 2010-09-28 2011-03-31 “yyyy-mm-dd” is a nice date format since: -is easy to read (easier than yyyymmdd) -the alphabetic ordering is also chronological

24 Data Management 23 We’re still missing something big -We give a collaborator access to the data folder -A new student joins the project -It’s three years later and we’ve forgotten the details of the project

25 Data Management 24 We’re still missing something big -We give a collaborator access to the data folder -A new student joins the project -It’s three years later and we’ve forgotten the details of the project We need context. We need metadata.

26 Data Management 25 Metadata Header inside the file? (see Provenance essay) -Doesn’t work well with binary files Separate metadata file for each data file? -Can quickly get out of sync or out of hand -who is the data from? -when was it generated? -what were the experiment conditions?

27 Data Management 26 How I view metadata… A list of attributes that I might want to search by later

28 Data Management 27 data samples.mat 2010-09-28 2011-03-31 README 2010-09-28 - batch1 - 25*C - 5 hrs Initial QC: samples 1328, 1422 poor quality 2011-03-31 - batch1 - 28*C - 10 hrs Initial QC: okay, but 1 min power failure @ t=2.5 README

29 Data Management 28 data README 2010-09-28 2011-03-31 README Additional READMEs can be necessary when you have many data files per directory Many data files

30 Data Management 29 data README 2010-09-28 2011-03-31 README Under version control Many data files * * * *

31 Data Management 30 data README 2010-09-28 2011-03-31 README Under version control Many data files * * * * delete for space

32 Data Management 31 data README 2010-09-28 2011-03-31 README Under version control Many data files * * * * README should describe how to exactly reacquire data, if possible.

33 Data Management 32 data README * yyyy-mm- dd Many data files * Under version control * The big picture project

34 Data Management 33 data README * yyyy-mm- dd Many data files * Under version control * The big picture src * scripts, source code, etc * project

35 Data Management 34 data README * yyyy-mm- dd Many data files * Under version control * src * scripts, source code, etc * experiments results from running scripts on data yyyy-mm- dd project The big picture

36 Data Management 35 Why all the fuss? Sensible archiving: -make it clear and obvious for someone else -“someone else” will likely be you, several months or years from now

37 Data Management 36 Why all the fuss? targets filenameExactlyAsDownloaded.tkq raw targets.gff Example: How exactly did I create targets.gff? This would be very difficult without additional information.

38 Data Management 37 Why all the fuss? The directory structure: targets filenameExactlyAsDownloaded.tkq otherNecessaryFile.dat raw README process_targets.sh targets.gff README - when and where from files were downloaded process_targets.sh - exact commands used to generate targets.gff from downloaded files

39 Data Management 38 Why all the fuss? The directory structure: targets filenameExactlyAsDownloaded.tkq otherNecessaryFile.dat raw README process_targets.sh targets.gff This made it both 1) possible and 2) easy to know exactly where the data came from and how it was processed.

40 Data Management 39 The right layout for the right project -No hierarchy is perfect for all projects. -Thinking hard at the beginning can save pain and suffering later on -For example…

41 Data Management 40 The right layout for the right project What we’ve talked about so far is great for projects with a strong separation between data and analysis: experiment0001 experiment0002 experiment0003 someOtherData analysis0001 analysis0002 analysis0003 analysis0004 analysis0005

42 Data Management 41 The right layout for the right project But it doesn’t work as well for pipelines: parse align filter cluster

43 Data Management 42 The right layout for the right project One representation I have seen is: data align cluster 2010-09-28 filter parse align cluster 2011-03-31 filter parse

44 Data Management 43 The right layout for the right project But: 1)I doesn’t make the pipeline ordering clear 2)It doesn’t scale well with alternatives: parse align01 filter01 cluster filter02 align02 filter01 cluster

45 Data Management 44 The right layout for the right project A better approach captures the dependency structure of the pipeline: parse2010-09-28 samples.dat align01 samples.dat align02 samples.dat filter01 samples.dat

46 Data Management 45 -Think hard at the beginning -Version control metadata, not data files -An intelligent structure not only makes your data easier to archive, track, and manage, but also reduces the chance that the paths in the pipeline get crossed and the data out the end isn't what you think it is. Summary

47 Copyright © Software Carpentry 2011 This work is licensed under the Creative Commons Attribution License See http://software-carpentry.org/license.html for more information. Orion Buske May 2011 created by


Download ppt "Copyright © Software Carpentry 2011 This work is licensed under the Creative Commons Attribution License See"

Similar presentations


Ads by Google