The Vault Data Manager
Derek Hower, 2/10/2011


The Vault Data Manager Derek Hower 2/10/2011

Summary
This talk:
– Is: A conceptual overview & discussion
– Is not: A Vault tutorial
– Is not: Polished. Interruptions will hide that.
Vault unifies:
– Data storage
– Data analysis
– Job management
Features:
– Designed for flexibility & sharing
– Should be sufficient to meet NSF guidelines
Proposal (open to discussion):
– The group should phase in Vault

Outline Elephants Motivation/Goals Vault Overview Discussion NSF Status

An Aside on Ruby
Vault is written (mostly) in Ruby
– Don’t have to use it
  Has a command line & web interface
– But…
  Not all operations are accessible from the command line
  You need to write submission/analysis scripts anyway
Will GEM5 stand for this “ruby” thing?
– The simulator-side component is in C
Want it in Python?
– I’m available for consultation

So you built a DBMS? (a.k.a. Dear Spyros,)
Vault does have elements of a DBMS
– Serialized commit, file storage, etc.
But is much more
– Interface, job management, repository, etc.
Why not use a DBMS under the hood?
– I think they are clumsy to work with
– Some operations don’t map well (job stats, permissions)

Outline Elephants Motivation/Goals Vault Overview Discussion NSF Status

Motivation
There is no unified data management plan
– Collaborating can be a pain
– Interpreting data can be a pain
– Unstructured data is error prone
  Custom parsers for every experiment, etc.
Loosely unified job management
– Condor, but everyone has their own submission scripts
Some people (me) need enforced organization
– Vault was made for me. Maybe you’ll like it too.

Goals
Repeatability
– Don’t do anything until you know you can do it again
Flexibility
– Multiple tools
– Storage
– Migration & compression
– Scheduling
Promote Collaboration
– Share data, actively work together
– Protect data with permissions
Data Integrity

A Note on Storage
Why focus on storage reduction/management?
– Aren’t stats just text files?
Case Study: Rocks
– Typical job:
  Stat file: 170K
  Stdout: 743
  Stderr: 27K
  Config: 17K
– Total: 215K/job
– 215K * 2000 jobs = 430M of text per experiment!!
Key: Most of the text is redundant
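The arithmetic on this slide can be checked with a few lines of Ruby. This is a minimal sketch that assumes K means 1,000 bytes and that the unlabeled stdout size (743) is in bytes; neither is stated on the slide. The exact sum comes out slightly under the slide's rounded 215K and 430M.

```ruby
# Per-job output sizes from the slide, in bytes.
# Assumption: K = 1,000 bytes, and the stdout figure (743) is in bytes.
JOB_FILES = {
  stat:   170_000,
  stdout: 743,
  stderr: 27_000,
  config: 17_000
}.freeze

JOBS_PER_EXPERIMENT = 2000

per_job        = JOB_FILES.values.sum
per_experiment = per_job * JOBS_PER_EXPERIMENT

puts "per job:        #{per_job} bytes"
puts "per experiment: #{per_experiment} bytes"
```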

Outline Elephants Motivation/Goals Vault Overview Discussion NSF Status

What is Vault? Demo Time!

Features
– Search
– Consistency
– Repeatability
– Flexible permissions
– Multiple views
– Flexible storage options
– Documentation
– Result parsing tools
– Modular software architecture
– Annotations

Vault Object Organization
(Diagram; box labels: Configuration, Repository, Experiment, Job Scaffold, Run, Apparatus, Job, Stat, MiscOut, Scheduler)

Vault Repositories
Three components:
– One Metafile
– One or more Storage Directories
– One or more Sandbox Directories
Access managed by filesystem
To share or not to share?
– + Increase collaboration
– - Hard to manage storage needs
– - Limited data protection
– Vault’s answer: repository linking

Repository Linking
– Derek’s Repository: ~drh5/vault.storage, Perm: 744
– Polina’s Repository: ~pdudnik/vault.storage, Perm: 744
– Calvin Repository: …/projects/calvin/vault.storage, Perm: 774
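The permission numbers in the linking example are ordinary Unix modes. The sketch below decodes them to show who can write where: the two personal repositories are writable only by their owner, while the Calvin project repository is also writable by its group. The helper name `mode_string` is made up for this illustration and is not part of Vault.

```ruby
# Decode a Unix permission mode (e.g. 0o744) into the familiar "rwxr--r--" form.
# mode_string is a hypothetical helper for illustration only.
def mode_string(mode)
  flags = "rwxrwxrwx"
  9.times.map { |i| ((mode >> (8 - i)) & 1) == 1 ? flags[i] : "-" }.join
end

# True when the group write bit is set.
def group_writable?(mode)
  (mode & 0o020) != 0
end

# Paths and modes as listed on the slide (the truncated path is kept as-is).
REPOSITORIES = {
  "~drh5/vault.storage"             => 0o744,
  "~pdudnik/vault.storage"          => 0o744,
  "…/projects/calvin/vault.storage" => 0o774
}.freeze

REPOSITORIES.each do |path, mode|
  puts "#{mode_string(mode)}  group-writable=#{group_writable?(mode)}  #{path}"
end
```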

Implementation Note
Vault uses a flat storage scheme
– Every object is a “blob” identified by a hash of its contents
Benefits
– Objects can be stored anywhere
  Repository Linking is easy
  Storage management is flexible
– Identical files are stored once
Hash Collision?
– Chance is order 1:2^80. And it’s good enough for git.
(Diagram: ~/vault.storage 5CA…1AB1E0…BADCAF…EBABE0…111)
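A flat, content-addressed store like the one described here can be sketched in a few lines. This is not Vault's code: `BlobStore` is a hypothetical in-memory stand-in, and SHA-1 is an assumption (the slide only says "a hash" and compares it to git). The point is that identical content always maps to the same identifier, so it is stored once.

```ruby
require "digest/sha1"

# Hypothetical in-memory stand-in for a flat blob store.
# Assumption: SHA-1 is the content hash; the talk does not name one.
class BlobStore
  def initialize
    @blobs = {}
  end

  # Store content under the hash of its bytes and return that id.
  # Storing identical content again is a no-op.
  def put(content)
    id = Digest::SHA1.hexdigest(content)
    @blobs[id] ||= content.dup.freeze
    id
  end

  def get(id)
    @blobs.fetch(id)
  end

  def size
    @blobs.size
  end
end

store  = BlobStore.new
first  = store.put("<stats><stat name='insns'/></stats>")
second = store.put("<stats><stat name='insns'/></stats>")
other  = store.put("a different file")

puts "same id for identical content: #{first == second}"
puts "blobs stored: #{store.size}"
```

Because an id depends only on content, a blob can live in any storage directory or linked repository and still be found by the same name.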

Experiments
Complete description of an experiment
– Copy of the tool (apparatus)
– Copy of all inputs
– Copy of commands
Becomes immutable once run
– Exception: annotations
Key to repeatability
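One way to picture "immutable once run, except annotations" is the sketch below. The class, its fields, and its method names are all hypothetical and chosen only to mirror the bullets above; Vault's real objects may look quite different.

```ruby
# Hypothetical model of an experiment that locks its description after running.
class Experiment
  attr_reader :apparatus, :inputs, :commands, :annotations

  def initialize(apparatus:, inputs:, commands:)
    # Keep private copies so the caller's objects are not frozen later.
    @apparatus   = apparatus.dup
    @inputs      = inputs.dup
    @commands    = commands.dup
    @annotations = []
    @ran         = false
  end

  def ran?
    @ran
  end

  def add_command(command)
    raise "experiment is immutable once run" if ran?
    @commands << command
  end

  # Mark the experiment as run and freeze its description.
  def run!
    [@apparatus, @inputs, @commands].each(&:freeze)
    @ran = true
  end

  # Annotations are the one thing that may still change after a run.
  def annotate(text)
    @annotations << text
  end
end

exp = Experiment.new(apparatus: "simulator", inputs: ["input.dat"], commands: [])
exp.add_command("run input.dat")
exp.run!
exp.annotate("first run looked fine")
```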

Apparatus
Describes how to control a tool
– SCM control
– Building
– Running
Allows Vault to be used with many different tools
Apparati are Vault plugins
– Ruby code
– Saved with the experiment

Scheduler
Controls where and when jobs are run
Like Apparati, Schedulers are Vault plugins
Two existing (more possible):
– SerialScheduler
– MultifacetCondorScheduler

Run
Container for a run of an experiment
Experiments may be run multiple times
Contains: Scheduler, Jobs
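As a rough picture of how a scheduler plugin and a run might fit together, here is a minimal sketch. Only the name `SerialScheduler` comes from the slide; the `run` and `execute` methods, the use of callable jobs, and the `Run` struct layout are assumptions made for illustration.

```ruby
# Hypothetical plugin interface: a scheduler is handed callable jobs and
# decides where and when each one runs. This one runs them in order, locally.
class SerialScheduler
  def run(jobs)
    jobs.map(&:call)
  end
end

# Hypothetical container for one run of an experiment: a scheduler plus jobs.
Run = Struct.new(:scheduler, :jobs) do
  def execute
    scheduler.run(jobs)
  end
end

order = []
jobs = [
  -> { order << :first;  1 },
  -> { order << :second; 2 }
]

results = Run.new(SerialScheduler.new, jobs).execute
puts "results: #{results.inspect}, order: #{order.inspect}"
```

A Condor-backed scheduler would presumably implement the same entry point but submit jobs to a cluster instead of calling them directly.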

Job Scaffold
Describes how a job is configured & controlled
Elements:
– Configuration
– Command line
– Repetitions

Configuration
Can be:
– A standard vault configuration : list
– A non-standard text file
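The three scaffold elements suggest a simple expansion from one scaffold to several jobs. The sketch below is an assumption about how that could work, not Vault's implementation; the struct, its `jobs` method, and the example command line are all hypothetical.

```ruby
# Hypothetical scaffold: one configuration and command line, repeated N times.
JobScaffold = Struct.new(:configuration, :command_line, :repetitions, keyword_init: true) do
  # Expand the scaffold into one job description per repetition.
  def jobs
    (1..repetitions).map do |rep|
      { configuration: configuration, command_line: command_line, repetition: rep }
    end
  end
end

scaffold = JobScaffold.new(
  configuration: { "cores" => 4 },     # made-up configuration values
  command_line:  "./tool --cores 4",   # made-up command line
  repetitions:   3
)

scaffold.jobs.each { |job| puts job.inspect }
```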

Stats
All vault tools *must* use the vault stat infrastructure
C/C++ library
– Collection of macros
  vs_new_signed_scalar(name, desc, data_ptr)
  vs_new_signed_sarray(name, desc, size, array_ptr)
  etc.
– Below tool stat managers (e.g., GEM5 stat class)
– Includes stat server for real-time updates

Stat File Format
Produces two files
– Header: XML description of stats
– Data: binary data file
Most jobs from same tool produce identical headers
– Vault’s storage stores one copy
Data files are small

Views
Two (three?) views
– Command line
– Web server
– Access through Ruby
PIs: only need to know one command
– vault serve
Demo to follow

Vault Organization
(Diagram; box labels: Configuration, Repository, Experiment, Job Scaffold, Job Mold, Run, Apparatus, Job, Stat, MiscOut, Scheduler)

Data Analysis
Unified data storage/access leads to common analysis tools/techniques
Vault comes with a few neat parsing helpers
– e.g., in Ruby:
  insns = repo.find(:config => some_config).insns.arith_mean
– Finds all jobs matching config, gets the stat “insns” from each, and returns the arithmetic mean of all of them
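For readers wondering how a chain like `repo.find(...).insns.arith_mean` could work, here is one possible shape. Everything below is a guess written for illustration: the `Repo`, `JobSet`, and `StatSeries` classes and the job hash layout are hypothetical, and Vault's real helpers may be built differently.

```ruby
# Hypothetical series of one stat's values across several jobs.
class StatSeries
  def initialize(values)
    @values = values
  end

  def arith_mean
    @values.sum.to_f / @values.size
  end
end

# Hypothetical set of jobs; any stat name becomes a method via method_missing.
class JobSet
  def initialize(jobs)
    @jobs = jobs
  end

  def method_missing(name, *args)
    return super unless args.empty? && stat?(name)
    StatSeries.new(@jobs.map { |job| job[:stats][name.to_s] })
  end

  def respond_to_missing?(name, include_private = false)
    stat?(name) || super
  end

  private

  def stat?(name)
    !@jobs.empty? && @jobs.all? { |job| job[:stats].key?(name.to_s) }
  end
end

# Hypothetical repository holding job records as plain hashes.
class Repo
  def initialize(jobs)
    @jobs = jobs
  end

  def find(criteria)
    JobSet.new(@jobs.select { |job| job[:config] == criteria[:config] })
  end
end

repo = Repo.new([
  { config: "baseline", stats: { "insns" => 100 } },
  { config: "baseline", stats: { "insns" => 300 } },
  { config: "other",    stats: { "insns" => 900 } }
])

some_config = "baseline"
insns = repo.find(:config => some_config).insns.arith_mean
puts insns
```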

Outline Elephants Motivation/Goals Vault Overview Discussion NSF Status

About Repeatability
Vault experiments are repeatable because:
– Experiments are run from versioned source code
– Inputs are logged
Vault experiments may not be repeatable if:
– The SCM repository moves/disappears
– Software update
  But, can reconstruct the original software

Data Integrity
Vault behaves like an SCM/DBMS
– Nothing is written to the repository until commit
Allows script development without polluting repository
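The commit behavior can be pictured as a staging area in front of the repository. The sketch below is a toy model under that assumption; the class and its `add`, `commit`, and `discard` methods are hypothetical names, not Vault's interface.

```ruby
# Hypothetical repository with a staging area: nothing is visible in
# `committed` until commit is called.
class StagedRepository
  attr_reader :committed

  def initialize
    @committed = []
    @staged    = []
  end

  # Stage an object; the committed contents are untouched.
  def add(object)
    @staged << object
    self
  end

  # Move everything staged into the repository in one step.
  def commit
    @committed.concat(@staged)
    @staged = []
    @committed.size
  end

  # Throw away staged work, e.g. after a failed script run.
  def discard
    @staged = []
  end
end

repo = StagedRepository.new
repo.add("draft experiment")
repo.discard                      # a buggy script leaves no trace
repo.add("experiment v1").add("experiment v2")
repo.commit
puts repo.committed.inspect
```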

Best Practices
TBD
– Storage structure?
– Experiment naming convention?
– What to do when something goes wrong? (experiment fails)

Outline Elephants Motivation/Goals Vault Overview Discussion NSF Status

NSF Data Management Plans
the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project;
– Vault stat files
the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies);
– Vault can conform to *any* standard (stat templates)

NSF Data Management Plans
policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements;
– Filesystem permissions
policies and provisions for re-use, re-distribution, and the production of derivatives; and
– Vault’s emphasis on repeatability

NSF Data Management Plans
plans for archiving data, samples, and other research products, and for preservation of access to them.
– Vault’s emphasis on repeatability
– Data is backed up in AFS