Published by Jaden Zachery. Modified over 9 years ago.
1
The Vault Data Manager
Derek Hower
2/10/2011
2
Summary
This talk:
  – Is: a conceptual overview & discussion
  – Is not: a Vault tutorial
  – Is not: polished. Interruptions will hide that.
Vault unifies:
  – Data storage
  – Data analysis
  – Job management
Features:
  – Designed for flexibility & sharing
  – Should be sufficient to meet NSF guidelines
Proposal (open to discussion):
  – The group should phase in Vault
3
Outline
Elephants
Motivation/Goals
Vault Overview
Discussion
NSF Status
4
An Aside on Ruby
Vault is written (mostly) in Ruby
  – You don’t have to use it
Has a command line & web interface
  – But…
      Not all operations are accessible from the command line
      You need to write submission/analysis scripts anyway
Will GEM5 stand for this “ruby” thing?
  – The simulator-side component is in C
Want it in Python?
  – I’m available for consultation
5
So you built a DBMS? (a.k.a. Dear Spyros,)
Vault does have elements of a DBMS
  – Serialized commit, file storage, etc.
But it is much more
  – Interface, job management, repository, etc.
Why not use a DBMS under the hood?
  – I think they are clumsy to work with
  – Some operations don’t map well (job stats, permissions)
6
Outline
Elephants
Motivation/Goals
Vault Overview
Discussion
NSF Status
7
Motivation
There is no unified data management plan
  – Collaborating can be a pain
  – Interpreting data can be a pain
  – Unstructured data is error-prone
      Custom parsers for every experiment, etc.
Loosely unified job management
  – Condor, but everyone has their own submission scripts
Some people (me) need enforced organization
  – Vault was made for me. Maybe you’ll like it too.
8
Goals
Repeatability
  – Don’t do anything until you know you can do it again
Flexibility
  – Multiple tools
  – Storage
  – Migration & compression
  – Scheduling
Promote collaboration
  – Share data, actively work together
  – Protect data with permissions
Data integrity
9
A Note on Storage
Why focus on storage reduction/management?
  – Aren’t stats just text files?
Case study: Rocks
  – Typical job:
      Stat file: 170K
      Stdout: 743
      Stderr: 27K
      Config: 17K
  – Total: 215K/job
  – 215K * 2000 jobs = 430M of text per experiment!!
Key: Most of the text is redundant
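The totals on this slide can be checked with a few lines of Ruby. This is only a sketch: the sizes are the slide's figures, and it assumes that K means 1,000 bytes and that the "743" for stdout is a byte count. The slide rounds the per-job total up to 215K before multiplying, which is where the 430M figure comes from.

```ruby
# Per-job output sizes from the Rocks case study, in bytes.
# Assumptions: K = 1000 bytes, and stdout's "743" is a byte count.
K = 1000
JOB_FILES = {
  stat:   170 * K,
  stdout: 743,
  stderr: 27 * K,
  config: 17 * K
}.freeze

per_job = JOB_FILES.values.sum        # bytes of text produced by one job
per_experiment = per_job * 2000       # 2000 jobs per experiment

puts "per job:        #{per_job} bytes"
puts "per experiment: #{per_experiment} bytes"
```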
10
Outline
Elephants
Motivation/Goals
Vault Overview
Discussion
NSF Status
11
What is Vault?
Demo Time!
12
Features
Search
Consistency
Repeatability
Flexible permissions
Multiple views
Flexible storage options
Documentation
Result parsing tools
Modular software architecture
Annotations
13
Vault Object Organization
[Diagram of Vault objects: Repository, Experiment, Run, Apparatus, Scheduler, Job Scaffold, Configuration, Job, Stat, MiscOut]
14
Vault Repositories
Three components:
  – One metafile
  – One or more storage directories
  – One or more sandbox directories
Access managed by filesystem
To share or not to share?
  – + Increases collaboration
  – - Hard to manage storage needs
  – - Limited data protection
  – Vault’s answer: repository linking
15
Repository Linking
Derek’s Repository: ~drh5/vault.storage (Perm: 744)
Polina’s Repository: ~pdudnik/vault.storage (Perm: 744)
Calvin Repository: …/projects/calvin/vault.storage (Perm: 774)
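One plausible reading of repository linking is sketched below: each repository writes only to its own storage but can resolve objects through the repositories it links to. The slides do not spell out the mechanism, so the SketchRepo class, its method names, and the in-memory hash standing in for a storage directory are all assumptions made for illustration.

```ruby
require "digest/sha1"

# Hypothetical model, not Vault's code: a repository owns its storage
# and may link to other repositories, which it reads but never writes.
class SketchRepo
  attr_reader :name

  def initialize(name)
    @name = name
    @blobs = {}   # hash => contents; stands in for a storage directory
    @links = []   # other SketchRepo objects this one can read from
  end

  def link(other)
    @links << other
  end

  # Store data locally and return the hash that identifies it.
  def put(data)
    key = Digest::SHA1.hexdigest(data)
    @blobs[key] ||= data
    key
  end

  # Look locally first, then follow links. `seen` guards against cycles.
  def get(key, seen = [])
    return @blobs[key] if @blobs.key?(key)
    seen << self
    @links.each do |repo|
      next if seen.include?(repo)
      found = repo.get(key, seen)
      return found if found
    end
    nil
  end
end

derek = SketchRepo.new("derek")
calvin = SketchRepo.new("calvin")
derek.link(calvin)                      # derek can read calvin's objects

shared_key = calvin.put("shared result")
private_key = derek.put("private scratch data")
```

In this sketch the link is one-way: derek can read what calvin stores, while calvin cannot see derek's objects unless it links back.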
16
Implementation Note
Vault uses a flat storage scheme
  – Every object is a “blob” identified by a hash of its contents
Benefits
  – Objects can be stored anywhere
      Repository linking is easy
      Storage management is flexible
  – Identical files are stored once
Hash collision?
  – Chance is on the order of 1:2^80. And it’s good enough for git.
[Diagram: ~/vault.storage holding blobs named by hash: 5CA…1AB1E0…BADCAF…EBABE0…111]
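A flat, content-addressed store of the kind described on this slide can be sketched in a few lines of Ruby. This is a minimal illustration rather than Vault's implementation: the SketchBlobStore class and its methods are hypothetical, and SHA-1 is assumed only because the slide's 2^80 figure and the git comparison suggest a 160-bit hash.

```ruby
require "digest/sha1"
require "fileutils"
require "tmpdir"

# Hypothetical flat blob store: every object is a file named by the
# SHA-1 hex digest of its contents. Not Vault's actual code.
class SketchBlobStore
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  # Store data and return its hash. Identical contents map to the same
  # file name, so storing them a second time writes nothing new.
  def put(data)
    key = Digest::SHA1.hexdigest(data)
    path = File.join(@dir, key)
    File.binwrite(path, data) unless File.exist?(path)
    key
  end

  # Fetch contents by hash, or nil if this store does not hold them.
  def get(key)
    path = File.join(@dir, key)
    File.exist?(path) ? File.binread(path) : nil
  end

  # Number of distinct blobs currently stored.
  def size
    Dir.children(@dir).size
  end
end

store = SketchBlobStore.new(Dir.mktmpdir("vault_sketch"))
first = store.put("identical stat header")
second = store.put("identical stat header")   # deduplicated
other = store.put("a different file")
```

Because the name depends only on the contents, a blob means the same thing in any directory, which is what makes it possible to move or share objects freely.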
17
Experiments
Complete description of an experiment
  – Copy of the tool (apparatus)
  – Copy of all inputs
  – Copy of commands
Becomes immutable once run
  – Exception: annotations
Key to repeatability
18
Apparatus
Describes how to control a tool
  – SCM control
  – Building
  – Running
Allows Vault to be used with many different tools
Apparati are Vault plugins
  – Ruby code
  – Saved with the experiment
19
Scheduler
Controls where and when jobs are run
Like Apparati, schedulers are Vault plugins
Two existing (more possible):
  – SerialScheduler
  – MultifacetCondorScheduler

Run
Container for a run of an experiment
Experiments may be run multiple times
Contains: Scheduler, Jobs
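The idea of schedulers as plugins can be illustrated with a small Ruby sketch. The class and method names below are hypothetical and are not taken from Vault; the sketch only shows the general shape of a base class that a serial scheduler fills in.

```ruby
# Hypothetical plugin interface; names are illustrative, not Vault's.
class SketchScheduler
  # Subclasses decide where and when the given jobs run.
  def run(jobs)
    raise NotImplementedError, "#{self.class} must implement run"
  end
end

# Runs each job in order, one at a time, in the current process.
class SketchSerialScheduler < SketchScheduler
  def run(jobs)
    jobs.map { |job| job.call }
  end
end

# Jobs are modeled as callables purely for the sake of the example.
jobs = [-> { 1 + 1 }, -> { "done" }]
results = SketchSerialScheduler.new.run(jobs)
```

A cluster-backed scheduler would override the same method but hand the jobs to a batch system instead of calling them directly.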
20
Job Scaffold
Describes how a job is configured & controlled
Elements:
  – Configuration
  – Command line
  – Repetitions

Configuration
Can be:
  A standard vault configuration : list
  A non-standard text file
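As a rough illustration of the three scaffold elements, the sketch below expands a scaffold into one job description per repetition. That expansion step, the SketchScaffold name, its fields, and the example command line are all assumptions made for this example; the slides do not show how Vault represents a scaffold internally.

```ruby
# Hypothetical scaffold; field and method names are illustrative only.
SketchScaffold = Struct.new(:config, :command, :repetitions) do
  # Expand the scaffold into one job description per repetition.
  def jobs
    (1..repetitions).map do |rep|
      { config: config, command: command, repetition: rep }
    end
  end
end

# The configuration keys and the command line are made up for the example.
scaffold = SketchScaffold.new({ "cores" => 16 }, "./sim --cores 16", 3)
jobs = scaffold.jobs
```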
21
Stats
All vault tools *must* use the vault stat infrastructure
C/C++ library
  – Collection of macros
      vs_new_signed_scalar(name, desc, data_ptr)
      vs_new_signed_sarray(name, desc, size, array_ptr)
      etc.
  – Sits below tool stat managers (e.g., the GEM5 stat class)
  – Includes a stat server for real-time updates
22
Stat File Format
Produces two files
  – Header
      XML description of stats
  – Data
      Binary data file
Most jobs from the same tool produce identical headers
  – Vault’s storage stores one copy
Data files are small
23
Views
Two (three?) views
  – Command line
  – Web server
  – Access through Ruby
PIs: only need to know one command
  – vault serve
Demo to follow
24
Vault Organization
[Diagram of Vault objects: Repository, Experiment, Run, Apparatus, Scheduler, Job Mold, Job Scaffold, Configuration, Job, Stat, MiscOut]
25
Data Analysis
Unified data storage/access leads to common analysis tools/techniques
Vault comes with a few neat parsing helpers
  – e.g., in Ruby:
      insns = repo.find(:config => some_config).insns.arith_mean
  – Finds all jobs matching config, gets the stat “insns” from each, and returns the arithmetic mean of all of them
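To make the chained call on this slide concrete, here is a sketch of how such a helper could be built in plain Ruby. It is not Vault's implementation: SketchJobSet, SketchStatValues, and the shape of the job records are hypothetical, and only the call pattern (find, then a stat name, then arith_mean) comes from the slide.

```ruby
# Hypothetical wrapper around the values of one stat across many jobs.
class SketchStatValues
  def initialize(values)
    @values = values
  end

  def arith_mean
    @values.sum.to_f / @values.size
  end
end

# Hypothetical result set; each job is { config: {...}, stats: {...} }.
class SketchJobSet
  def initialize(jobs)
    @jobs = jobs
  end

  # Keep only jobs whose config contains every requested key/value pair.
  def find(config:)
    matches = @jobs.select do |job|
      config.all? { |key, value| job[:config][key] == value }
    end
    SketchJobSet.new(matches)
  end

  # Treat an unknown method name as a stat name, e.g. jobs.insns.
  def method_missing(name, *args)
    return super unless args.empty? && stat?(name)
    SketchStatValues.new(@jobs.map { |job| job[:stats][name] })
  end

  def respond_to_missing?(name, include_private = false)
    stat?(name) || super
  end

  private

  def stat?(name)
    @jobs.all? { |job| job[:stats].key?(name) }
  end
end

# Made-up data, for illustration only.
repo = SketchJobSet.new([
  { config: { cores: 4 }, stats: { insns: 100 } },
  { config: { cores: 4 }, stats: { insns: 300 } },
  { config: { cores: 8 }, stats: { insns: 900 } }
])
insns = repo.find(:config => { cores: 4 }).insns.arith_mean
```

Resolving stat names through method_missing is one way to let any stat be addressed by name without defining a method for each one ahead of time.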
26
Outline
Elephants
Motivation/Goals
Vault Overview
Discussion
NSF Status
27
About Repeatability
Vault experiments are repeatable because:
  – Experiments are run from versioned source code
  – Inputs are logged
Vault experiments may not be repeatable if:
  – The SCM repository moves/disappears
  – Software is updated
But the original software can be reconstructed
28
Data Integrity
Vault behaves like an SCM/DBMS
  – Nothing is written to the repository until commit
Allows script development without polluting the repository
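The commit behavior described here can be sketched as a staging area in front of the repository. The SketchStagedRepo class below is a hypothetical illustration of the idea, not Vault's mechanism: writes are buffered and reach the committed state only when commit is called.

```ruby
# Hypothetical staging area, not Vault's code: writes are buffered and
# only become part of the repository on commit.
class SketchStagedRepo
  attr_reader :committed

  def initialize
    @committed = {}
    @pending = {}
  end

  def write(name, data)
    @pending[name] = data
  end

  # Make all pending writes permanent.
  def commit
    @committed.merge!(@pending)
    @pending.clear
  end

  # Throw away pending writes, e.g. after a failed script run.
  def discard
    @pending.clear
  end
end

repo = SketchStagedRepo.new
repo.write("exp1", "half-finished attempt")
repo.discard                        # nothing from the attempt is kept
repo.write("exp1", "final description")
repo.commit
```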
29
Best Practices
TBD
  – Storage structure?
  – Experiment naming convention?
  – What to do when something goes wrong? (experiment fails)
30
Outline
Elephants
Motivation/Goals
Vault Overview
Discussion
NSF Status
31
NSF Data Management Plans
the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project;
  – Vault stat files
the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies);
  – Vault can conform to *any* standard (stat templates)
32
NSF Data Management Plans
policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements;
  – Filesystem permissions
policies and provisions for re-use, re-distribution, and the production of derivatives; and
  – Vault’s emphasis on repeatability
33
NSF Data Management Plans
plans for archiving data, samples, and other research products, and for preservation of access to them.
  – Vault’s emphasis on repeatability
  – Data is backed up in AFS