Copyright © Software Carpentry 2011 This work is licensed under the Creative Commons Attribution License See

Slides:



Advertisements
Similar presentations
Week 2 DUE This Week: Safety Form and Model Release DUE Next Week: Project Timelines and Website Notebooks Lab Access SharePoint Usage Subversion Software.
Advertisements

Organising and Documenting Data Stuart Macdonald EDINA & Data Library DIY Research Data Management Training Kit for Librarians.
Drawing & Document Management System or DMS
Microsoft ® Office Outlook ® 2007 Training Retrieve, back up, or share messages Sweetwater ISD presents:
Version Control System (Sub)Version Control (SVN).
Backing Up a Hard Disk CGS2564. Why Backup Programs? Faster Optimized to copy files Can specify only files that have changed Safer Can verify backed up.
John Porter Why this presentation? The forms data take for analysis are often different than the forms data take for archival storage Spreadsheets are.
Concepts of Version Control A Technology-Independent View.
2/6/2008Prof. Hilfinger CS164 Lecture 71 Version Control Lecture 7.
Using subversion COMP 2400 Prof. Chris GauthierDickey.
University of Michigan Administrative Information Services BusinessObjects Report Conversion Jeanne Mackey MAIS Data Delivery June, 2006.
Version Control Systems Phil Pratt-Szeliga Fall 2010.
Source Code Management Or Configuration Management: How I learned to Stop Worrying and Hate My Co-workers Less.
UNIX By Darcy Tatlock. 1. Successful Log Into Unix To actively manipulate your website you need to be logged in. Without being logged in you cannot enter.
SubVersioN – the new Central Service at DESY by Marian Gawron.
GIT is for people who like SVN September 18 th 2012 Vladimir Kerkez and BK.
European Organization for Nuclear Research Source Control Management Service (Subversion) Brice Copy, Michel Bornand EN-ICE 13 May 2009.
Microsoft Office Illustrated Fundamentals Unit B: Understanding File Management.
Source Code Revision Control Software CVS and Subversion (svn)
11 Games and Content Session 4.1. Session Overview  Show how games are made up of program code and content  Find out about the content management system.
Computer Literacy BASICS: A Comprehensive Guide to IC 3, 5 th Edition Lesson 3 Windows File Management 1 Morrison / Wells / Ruffolo.
Unix Command Project Justin Rogers for LS 560 Spring 2015.
Subversion. What is Subversion? A Version Control System A successor to CVS and SourceSafe Essentially gives you a tracked, shared file system.
4 1 Operating System Activities  An operating system is a type of system software that acts as the master controller for all activities that take place.
علیرضا فراهانی استاد درس: جعفری نژاد مهر Version Control ▪Version control is a system that records changes to a file or set of files over time so.
Version Control with Subversion Quick Reference of Subversion.
1 Lecture 19 Configuration Management Software Engineering.
CVS 簡介 數位芝麻網路公司蔡志展 2001/8/18 大綱 • CVS 簡介 • CVS 安裝 • CVS 設定 (Linux/Windows) • CVS 指令簡介 • CVS 多人環境的應用.
Source Control How to be the 1337 in project management source control.
SharePoint document libraries I: Introduction to sharing files Sharjah Higher Colleges of Technology presents:
SWEN 302: AGILE METHODS Roma Klapaukh & Alex Potanin.
Information Systems and Network Engineering Laboratory II DR. KEN COSH WEEK 1.
Version Control.
Copyright © 2015 – Curt Hill Version Control Systems Why use? What systems? What functions?
Team 708 – Hardwired Fusion Created by Nam Tran 2014.
CSE 219 Computer Science III CVS
Introduction to Unix – CS 21 Lecture 4. Lecture Overview * cp, mv, and rm Looking into files The file command head and tail cat and more What we’ve seen.
Refactoring and Synchronization with the StarTeam Plug-in for Eclipse  Jim Wogulis  Principal Architect, Borland Software Corporation.
Chapter Five Advanced File Processing. 2 Lesson A Selecting, Manipulating, and Formatting Information.
Electronic labnotes Mari Wigham COMMIT/. Information WUR  Organising, sharing, finding and reusing data  Expertise in: ● Modelling data.
IT-IDT-5 Understand, communicate, and adapt to a digital world. File Management.
Presentation OLOMOLA,Afolabi( ). Update Changes in CSV/SVN.
Computer Literacy BASICS: A Comprehensive Guide to IC 3, 5 th Edition Lesson 3 Windows File Management 1 Morrison / Wells / Ruffolo.
Version Control and SVN ECE 297. Why Do We Need Version Control?
(1) Introduction to Subversion (SVN) and Google Project Hosting Philip Johnson Collaborative Software Development Laboratory Information and Computer Sciences.
Short-term storage and data documentation Mari Wigham COMMIT/
Unix tools Regular expressions grep sed AWK. Regular expressions Sequence of characters that define a search pattern banana matches the text banana
DIGITAL REPOSITORIES CGDD Job Description… Senior Tools Programmer – pulled August 4 th, 2011 from Gamasutra.
1 Subversion Kate Hedstrom April Version Control Software System for managing source files –For groups of people working on the same code –When.
Git A distributed version control system Powerpoint credited to University of PA And modified by Pepper 28-Jun-16.
Computers Are Your Future Tenth Edition Spotlight 4: File Management Copyright © 2009 Pearson Education, Inc. Publishing as Prentice Hall1.
Problem Solving With C++ SVN ( Version Control ) April 2016.
Documenting and organising your data For an easier life lib.uts.edu.au utslibrary.
Revision Control for Sysadmins
Subversion Subversion is a brand of version control software that is frequently used to store the code and documentation of a project so as to permit.
Information Systems and Network Engineering Laboratory II
Discussion #11 11/21/16.
Version Control with Subversion
SVN intro (review).
User Guide PrimePortal – File Archive
CSE 303 Concepts and Tools for Software Development
Best practices in R scripting
Introduction to Configuration Management
Lesson 9 Windows Management
User Guide PrimePortal – File Archive
Git CS Fall 2018.
Introduction to Version Control with Git
Systems Analysis and Design I
Microsoft Office Illustrated Fundamentals
1. GitHub.
Presentation transcript:

Copyright © Software Carpentry 2011 This work is licensed under the Creative Commons Attribution License See for more information. Data Management

1 How do we manage multiple versions of data files?

Data Management 2 data samples.mat So far, so good…

Data Management 3 data samples.mat new version Now what?

Data Management 4 data samples.old.mat samples.mat I guess this is alright?

Data Management 5 data samples.old.mat samples.old2.mat samples.mat Okay…

Data Management 6 One problem: #!/usr/bin/env bash cat data/samples.mat \ | tr “^M” “\n” \ | sort -k 2,2nr \ |... > results/samples.txt generate_results.sh Which version of samples.mat was used to generate our results?

Data Management 7 In general, don’t move or rename files unless absolutely necessary. -makes it harder to reproduce results -makes it harder to find the data later -breaks scripts -breaks symbolic links (ln -s)

Data Management 8 But adding new filenames without structure isn’t necessarily any better…

Data Management 9 data samples.mat samplesV2.mat samplesFinal.mat samplesFinalV2.mat samplesUSE_THIS_ONE.mat Which one is the most recent?

Data Management 10 You could… 1)look at the data and guess 2)check the last-modified date And don’t ever move or copy the data!

Data Management 11 sounds like a job for…

Data Management 12 sounds like a job for…

Data Management 13 sounds like a job for… …or is it?

Data Management 14 Subversion is good for: -simultaneous modification -archive -merge -branch -diff Why shouldn’t we just use Subversion?

Data Management 15 Subversion is good for: -simultaneous modification -archive -merge -branch -diff Why shouldn’t we just use Subversion? what makes sense for data?

Data Management 16 Why shouldn’t we just use Subversion? Subversion is good for: -simultaneous modification -archive -merge -branch -diff what makes sense for data?

Data Management 17 What we need… a clear, stable, accessible way to archive data

Data Management 18 Performance reasons for not using version control on data -We can no longer remove or compress old and obsolete data files. -Checking out the repository means copying all your data files. -Many version control tools store text-based differences between versions of a file. For binary files, they essentially store a new copy of the file for each version.

Data Management 19 Common approach: data samples.mat old old2 Archive old data into subdirectories: -maintains file names -keeps all data files in same “version” together -still moves files, so scripts and history now refer to different data

Data Management 20 Another common approach: data samples.mat final final2 samples.mat really-final

Data Management 21 Two big problems still: 1)What order do they go in? The naming can easily become ambiguous. 2)Hard to distinguish “top-level” data directories from archived or revised data directories. Is data/saved/ an old version of all the data or a current version of “saved” data? Both are easily solved with a little thinking ahead

Data Management 22 From the start: data samples.mat “yyyy-mm-dd” is a nice date format since: -is easy to read (easier than yyyymmdd) -the alphabetic ordering is also chronological

Data Management 23 We’re still missing something big -We give a collaborator access to the data folder -A new student joins the project -It’s three years later and we’ve forgotten the details of the project

Data Management 24 We’re still missing something big -We give a collaborator access to the data folder -A new student joins the project -It’s three years later and we’ve forgotten the details of the project We need context. We need metadata.

Data Management 25 Metadata Header inside the file? (see Provenance essay) -Doesn’t work well with binary files Separate metadata file for each data file? -Can quickly get out of sync or out of hand -who is the data from? -when was it generated? -what were the experiment conditions?

Data Management 26 How I view metadata… A list of attributes that I might want to search by later

Data Management 27 data samples.mat README batch1 - 25*C - 5 hrs Initial QC: samples 1328, 1422 poor quality batch1 - 28*C - 10 hrs Initial QC: okay, but 1 min power t=2.5 README

Data Management 28 data README README Additional READMEs can be necessary when you have many data files per directory Many data files

Data Management 29 data README README Under version control Many data files * * * *

Data Management 30 data README README Under version control Many data files * * * * delete for space

Data Management 31 data README README Under version control Many data files * * * * README should describe how to exactly reacquire data, if possible.

Data Management 32 data README * yyyy-mm- dd Many data files * Under version control * The big picture project

Data Management 33 data README * yyyy-mm- dd Many data files * Under version control * The big picture src * scripts, source code, etc * project

Data Management 34 data README * yyyy-mm- dd Many data files * Under version control * src * scripts, source code, etc * experiments results from running scripts on data yyyy-mm- dd project The big picture

Data Management 35 Why all the fuss? Sensible archiving: -make it clear and obvious for someone else -“someone else” will likely be you, several months or years from now

Data Management 36 Why all the fuss? targets filenameExactlyAsDownloaded.tkq raw targets.gff Example: How exactly did I create targets.gff? This would be very difficult without additional information.

Data Management 37 Why all the fuss? The directory structure: targets filenameExactlyAsDownloaded.tkq otherNecessaryFile.dat raw README process_targets.sh targets.gff README - when and where from files were downloaded process_targets.sh - exact commands used to generate targets.gff from downloaded files

Data Management 38 Why all the fuss? The directory structure: targets filenameExactlyAsDownloaded.tkq otherNecessaryFile.dat raw README process_targets.sh targets.gff This made it both 1) possible and 2) easy to know exactly where the data came from and how it was processed.

Data Management 39 The right layout for the right project -No hierarchy is perfect for all projects. -Thinking hard at the beginning can save pain and suffering later on -For example…

Data Management 40 The right layout for the right project What we’ve talked about so far is great for projects with a strong separation between data and analysis: experiment0001 experiment0002 experiment0003 someOtherData analysis0001 analysis0002 analysis0003 analysis0004 analysis0005

Data Management 41 The right layout for the right project But it doesn’t work as well for pipelines: parse align filter cluster

Data Management 42 The right layout for the right project One representation I have seen is: data align cluster filter parse align cluster filter parse

Data Management 43 The right layout for the right project But: 1)I doesn’t make the pipeline ordering clear 2)It doesn’t scale well with alternatives: parse align01 filter01 cluster filter02 align02 filter01 cluster

Data Management 44 The right layout for the right project A better approach captures the dependency structure of the pipeline: parse samples.dat align01 samples.dat align02 samples.dat filter01 samples.dat

Data Management 45 -Think hard at the beginning -Version control metadata, not data files -An intelligent structure not only makes your data easier to archive, track, and manage, but also reduces the chance that the paths in the pipeline get crossed and the data out the end isn't what you think it is. Summary

Copyright © Software Carpentry 2011 This work is licensed under the Creative Commons Attribution License See for more information. Orion Buske May 2011 created by