Integrated Rule Oriented Data System (iRODS) Reagan W. Moore Arcot Rajasekar Mike Wan



Data Management Infrastructure
Assemble distributed data into a shared collection
– Manage properties of the collection
– Enforce management policies
– Validate assessment criteria
– Automate administrative tasks
Support wide range of management applications
– Data sharing, publication, preservation, analysis
– Works at scale (petabytes, hundreds of millions of files)

Data Management Challenges
Data driven research generates massive data collections
– Data sources are remote and distributed
– Collaborators are remote
– Wide variety of data types: observational data, experimental data, simulation data, real-time data, office products, web pages, multi-media
Collections contain millions of files
– Logical arrangement is needed for distributed data
– Discovery requires the addition of descriptive metadata
Long-term retention requires migration of output into a reference collection
– Automation of administrative functions is essential to minimize long-term labor support costs
– Creation of representation information for describing file context
– Validation of assessment criteria (authenticity, integrity)

Preservation Context
Preservation metadata
– Authenticity (provenance) information
– Representation information (structure, semantics)
– Administrative information (replication, checksums, access controls, retention, disposition)
Preservation procedures
– Administration procedures
– ISO MOIMS-rac assessment procedures
– Preservation procedures generate preservation metadata

Overview of iRODS Architecture
User can search, access, add and manage data & metadata*
Overview of iRODS Data System
– iRODS Data Server: disk, tape, etc.
– iRODS Metadata Catalog: track data
– iRODS Rule Engine: track policies
*Access data with Web-based browser, iRODS GUI, or command-line clients.

iRODS Distributed Data Management

iRODS Resource Server

Types of File Manipulation
– Replication
– Load leveling across storage systems
– Registration
– Synchronization
– Checksums
– Aggregation
– Metadata
– Access controls (time dependent)

iRODS Micro-Services
Function snippets that wrap a well-defined process
– Compute checksum
– Replicate file
– Integrity check
– Zoom image
– Get SDSS image cutout
– Search PubMed
Written in C or Python (PHP, Java soon)
– Recovery micro-services to handle failure
– Web services can be wrapped as micro-services
Can be chained to perform complex tasks
– Micro-services invoked by rule engine

iRODS Rules
Server-side workflows
Action | condition | workflow chain | recovery chain
Condition - test on any attribute:
– Collection, file name, storage system, file type, user group, elapsed time, IRB approval flag, descriptive metadata
Workflow chain:
– Micro-services / rules that are executed at the storage system
Recovery chain:
– Micro-services / rules that are used to recover from errors
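The four-part rule structure above (action, condition, workflow chain, recovery chain) can be illustrated with a minimal Python sketch. This is not the iRODS rule engine: the `Rule` class, `run_rule`, and the step functions are hypothetical names, and pairing each workflow step with one recovery step that runs in reverse order on failure is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Step = Callable[[Context], None]


@dataclass
class Rule:
    action: str                           # event that triggers the rule
    condition: Callable[[Context], bool]  # test on attributes of the request
    workflow: List[Step]                  # steps executed in order
    recovery: List[Step]                  # recovery[i] cleans up after workflow[i] (assumed pairing)


def run_rule(rule: Rule, ctx: Context) -> bool:
    """Run the workflow chain if the condition holds; return True on success."""
    if not rule.condition(ctx):
        return False
    for index, step in enumerate(rule.workflow):
        try:
            step(ctx)
        except Exception:
            # Recover the failed step first, then the earlier steps in reverse order.
            for undo in reversed(rule.recovery[: index + 1]):
                undo(ctx)
            return False
    return True


# Hypothetical stand-ins for micro-services; they only record what ran.
def checksum(ctx: Context) -> None:
    ctx["log"].append("checksum")


def replicate(ctx: Context) -> None:
    if ctx.get("resource_offline"):
        raise RuntimeError("replica resource offline")
    ctx["log"].append("replicate")


def undo_checksum(ctx: Context) -> None:
    ctx["log"].append("undo checksum")


def undo_replicate(ctx: Context) -> None:
    ctx["log"].append("undo replicate")


put_rule = Rule(
    action="put",
    condition=lambda ctx: ctx["path"].endswith(".tif"),
    workflow=[checksum, replicate],
    recovery=[undo_checksum, undo_replicate],
)
```

Calling `run_rule(put_rule, {"path": "scan.tif", "log": []})` runs both steps; adding `"resource_offline": True` to the context makes the second step fail and triggers the recovery chain.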

iput With Replication
[Diagram: Client (iput), Resource 1, Resource 2, iCAT, and two Rule Base boxes, connected by data and metadata flows. Annotation: Rule added to rule database.]

Policy-Virtualization: Automate Operations
System-centric Policies & Obligations:
– Manage retention, disposition, distribution, replication, integrity, authenticity, chain of custody, access controls, representation information, descriptive information requirement, logical arrangement, audit trails, authorization, authentication
Domain-specific Policies:
– Identification & Extraction of Metadata
– Ingestion Control for Provenance Attribution
– Processing of Data on Ingestion: creation of multi-resolution images, type-identification, anonymization, …
– Processing of Data on Access: IRB approval for data access, data sub-setting, merging of multiple images, conversion, redaction, …

Policy/rule execution
Immediate - enforced at time of action invocation
Deferred - applied at a future time
Periodic - applied at defined interval
Interactive - applied on demand
iSEC scheduler / batch system supports
– Local workflows
– Distributed workflows
– Deferred and periodic workflows
– (Launch micro-services on clusters, clouds, supercomputers)
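The difference between immediate, deferred, and periodic execution can be sketched with a small priority queue driven by a simulated clock. This is only an illustration of the three modes, not the iSEC scheduler; `RuleQueue`, `submit`, and `run_until` are hypothetical names, and times are plain numbers rather than real timestamps.

```python
import heapq
from typing import List, Optional, Tuple


class RuleQueue:
    """Toy queue: rules become runnable once the simulated clock reaches run_at."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str, Optional[float]]] = []
        self._seq = 0  # tie-breaker so equal times keep submission order

    def submit(self, name: str, run_at: float, every: Optional[float] = None) -> None:
        # every=None means run once; otherwise every must be > 0 (periodic).
        heapq.heappush(self._heap, (run_at, self._seq, name, every))
        self._seq += 1

    def run_until(self, now: float) -> List[Tuple[float, str]]:
        """Execute everything due at or before `now`; return (time, rule) pairs."""
        executed: List[Tuple[float, str]] = []
        while self._heap and self._heap[0][0] <= now:
            run_at, _, name, every = heapq.heappop(self._heap)
            executed.append((run_at, name))
            if every is not None:
                self.submit(name, run_at + every, every)  # reschedule periodic rule
        return executed


queue = RuleQueue()
queue.submit("checksum-on-ingest", run_at=0)        # immediate
queue.submit("replicate-later", run_at=5)           # deferred
queue.submit("integrity-audit", run_at=2, every=4)  # periodic
history = queue.run_until(10)
```

Here `history` lists the immediate rule first, the deferred rule once, and the periodic rule at every interval that falls within the simulated window.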

Checksum Validation Rule
myChecksumRule {
  msiMakeQuery("DATA_NAME, COLL_NAME, DATA_CHECKSUM", *Condition, *Query);
  msiExecStrCondQuery(*Query, *B);
  assign(*A, 0);
  forEachExec (*B) {
    msiGetValByKey(*B, COLL_NAME, *C);
    msiGetValByKey(*B, DATA_NAME, *D);
    msiGetValByKey(*B, DATA_CHECKSUM, *E);
    msiDataObjChksum(*B, *Operation, *F);
    ifExec (*E != *F) {
      writeLine(stdout, file *C/*D has registered checksum *E and computed checksum *F);
    }
    else {
      assign(*A, *A + 1);
    }
  }
  ifExec (*A > 0) {
    writeLine(stdout, have *A good files);
  }
}
*Condition can be COLL_NAME like '/ils161/home/moore/genealogy/%'
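The logic of the checksum rule (compare each registered checksum with a freshly computed one, report mismatches, count good files) can be mirrored in a short Python sketch. The two dictionaries are hypothetical stand-ins for the metadata catalog and the stored files, and SHA-256 is an assumption of the sketch; the slide does not name a hash algorithm.

```python
import hashlib
from typing import Dict, List, Tuple


def validate_checksums(
    registered: Dict[str, str],  # path -> checksum recorded in the catalog (stand-in)
    contents: Dict[str, bytes],  # path -> file bytes as currently stored (stand-in)
) -> Tuple[int, List[str]]:
    """Return the number of good files and one message per mismatch."""
    good = 0
    mismatches: List[str] = []
    for path, expected in registered.items():
        computed = hashlib.sha256(contents[path]).hexdigest()
        if computed != expected:
            mismatches.append(
                f"file {path} has registered checksum {expected} "
                f"and computed checksum {computed}"
            )
        else:
            good += 1
    return good, mismatches
```

A caller would print the mismatch messages and, if the good count is positive, a summary line, as the rule does with `writeLine`.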

Quota Checking Rule
mytestRule||
  assign(*A,0)##
  assign(*ContInx,1)##
  assign(*G,0)##
  msiMakeGenQuery("DATA_SIZE",*Condition,*Query)##
  msiExecGenQuery(*Query,*B)##
  forEachExec(*B,msiGetValByKey(*B,DATA_SIZE,*C)##assign(*A,*A + *C)##assign(*G,*G + 1),nop)##
  whileExec(*ContInx > 0,msiGetMoreRows(*Query,*B,*ContInx)##forEachExec(*B,msiGetValByKey(*B,DATA_SIZE,*C)##assign(*A,*A + *C)##assign(*G,*G + 1),nop),nop)##
  writeLine(stdout,Total size of data owned by *D on resource *E is *A)##
  writeLine(stdout,Number of files is *G)|nop
*D=rods%*E=renci-vault1%*Condition=DATA_OWNER_NAME = 'rods' AND RESC_NAME = 'renci-vault1'
ruleExecOut
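The quota rule totals file sizes while fetching query results one batch at a time. A minimal Python sketch of that accumulate-over-pages pattern follows; `paged` is a hypothetical stand-in for the query plus its continuation calls, and the sizes are made-up sample values, not real catalog data.

```python
from typing import Iterable, Iterator, List, Tuple


def paged(sizes: List[int], page_size: int) -> Iterator[List[int]]:
    """Yield query results one page at a time (stand-in for continuation fetches)."""
    for start in range(0, len(sizes), page_size):
        yield sizes[start:start + page_size]


def usage(pages: Iterable[List[int]]) -> Tuple[int, int]:
    """Return (total bytes, number of files) accumulated across all pages."""
    total = 0
    count = 0
    for page in pages:
        for size in page:
            total += size
            count += 1
    return total, count


total_bytes, file_count = usage(paged([10, 20, 30, 40, 50], page_size=2))
```

Comparing `total_bytes` against a per-user limit would turn the report into an enforcement check; the rule on the slide only reports the totals.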

Managing Structured Information
Information exchange between micro-services
– Parameter passing
– White board memory structures
– High performance message passing (iXMS)
– Persistent metadata catalog (iCAT)
Structured Information Resource Drivers
– Interact with remote structured information resource (HDF5, netCDF, tar file)

Structured Data
Aggregate data into a tar file
– Mount a tar file to enable manipulation of files within the tar file
Use HDF5 to manage aggregations of files
– Micro-services that apply HDF5 library calls at the remote storage location
Mount a remote directory
– Synchronize files in directory with files in iRODS collection

Micro-services vs Web Services
Micro-services
– Manage exchange of structured information between micro-services through memory
– Serialize information for transmission over a network
– Optimized protocol for data transmission
  – Single message for small files (<32 MB)
  – Parallel I/O for large files
Web Services
– SOAP/HTTP data transmission between services
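The size-based choice between a single message and parallel I/O can be sketched as a transfer planner. Only the 32 MB threshold comes from the slide; treating it as 32 MiB, the default of four streams, and the equal-sized byte ranges are assumptions of this sketch, and `plan_transfer` is a hypothetical name.

```python
from typing import List, Tuple

SMALL_FILE_LIMIT = 32 * 1024 * 1024  # 32 MB threshold from the slide (MiB assumed)


def plan_transfer(size: int, streams: int = 4) -> List[Tuple[int, int]]:
    """Return half-open byte ranges: one range for small files, several for large ones."""
    if size < SMALL_FILE_LIMIT:
        return [(0, size)]  # small file: send everything in a single message
    chunk = -(-size // streams)  # ceiling division so the ranges cover the whole file
    return [(start, min(start + chunk, size)) for start in range(0, size, chunk)]
```

For example, `plan_transfer(1000)` yields one range, while a 100 MiB file is split into four contiguous ranges that could be sent over parallel streams.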

Research Collaborations
NSF NARA - supports application of data grids to preservation environments
NSF SDCI - supports development of core iRODS data grid infrastructure
NSF OOI - future integration of data grids with real-time sensor data streams and grid computing
NSF TDLC - production TDLC data grid and extension to remaining 5 Science of Learning Centers (0.3 FTE)
NSF SCEC - current production environment (0.1 FTE)
NSF Teragrid - production environment (0.1 FTE)

NSF Software Development for Cyberinfrastructure
Conduct research on policy management in distributed data systems
– Collection oriented data management
– Adaptive middleware architecture
– Distributed rule engine
– Server-side (remote) workflow execution
– Transactional recovery semantics
– Automated validation
– Automation of large-scale data administrative functions
– Enforcement of management policies

Overview of iRODS Architecture
User with client views & manages data
iRODS shows unified "Virtual Collection"; archivist sees single "Virtual Collection"
– Processing Cache: disk, tape, database, file system, etc.
– Archive: disk, tape, database, file system, etc.
– Access Cache: disk, tape, database, file system, etc.
The iRODS Data Grid installs in a "layer" over existing or new data, letting you view, manage, and share part or all of diverse data in a unified Collection.

Generic Data Management Systems iRODS - integrated Rule-Oriented Data System

NARA Preservation Application
Transcontinental Persistent Archive Prototype
– Use data grid technology to build a preservation environment
– Conduct research on preservation concepts
  – Infrastructure independence
  – Enforcement of preservation properties
  – Automation of administrative preservation processes
  – Validation of preservation assessment criteria
– Demonstrate preservation on selected NARA digital holdings
  – Integration of generic infrastructure with preservation technologies (Cheshire, MVD, JHOVE, Pronom, Fedora, DSpace)

Preservation is an Integral Part of the Data Life Cycle
– Organize project data into a shared collection
– Publish data in a digital library for use by other researchers
– Enable data-discovery & data-driven analyses
– Preserve reference collections for use by future research initiatives
– Analyze new collection against prior state-of-the-art data
– Define & Enforce Policies for long-term management and curation

National Archives and Records Administration Transcontinental Persistent Archive Prototype
Federation of Seven Independent Data Grids
[Diagram nodes: U Md; UCSD MCAT; Georgia Tech MCAT; NARA II MCAT; NARA I MCAT; Rocket Center MCAT; U NC MCAT]
Extensible environment, can federate with additional research and education sites. Each data grid uses different vendor products.

To Manage Long-term Preservation
Define desired preservation properties
– Authenticity / Integrity / Chain of Custody / Original arrangement
– Life Cycle Data Requirements Guide
Implement preservation processes
– Appraisal / accession / arrangement / description / preservation / access
Manage preservation environment
– Minimize costs
– Validate assessment criteria to verify preservation properties

ISO MOIMS repository assessment criteria
Are developing 150 rules that implement the ISO assessment criteria
90 Verify descriptive metadata and source against SIP template and set SIP compliance flag
91 Verify descriptive metadata against semantic term list
92 Verify status of metadata catalog backup (create a snapshot of metadata catalog)
93 Verify consistency of preservation metadata after hardware change or error

Sustainability
Economic sustainability
– Reference collections
– Repurpose reference collections to support use by multiple communities
– Federate resources across multiple communities
Technological sustainability
– Open source software
– Support continued porting through international collaborations
Policy sustainability
– Evolve management policies to support new user communities
Access sustainability
– Support data manipulation and display by new communities

Data Virtualization
[Diagram layers: Access Interface, Standard Micro-services, Standard Operations, Storage Protocol, Storage System]
Data Grid: map from the actions requested by the access method to a standard set of micro-services. The standard micro-services are mapped to the operations supported by the storage system.
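The two mappings described above (client actions to standard operations, standard operations to a storage system's own protocol) can be sketched in Python. All class and method names here are hypothetical illustrations, not iRODS interfaces, and the two in-memory drivers merely imitate a file system and an object store.

```python
from typing import Dict, Protocol, Tuple


class StorageDriver(Protocol):
    """Standard operations every storage driver must support (assumed set)."""

    def write(self, path: str, data: bytes) -> None: ...
    def read(self, path: str) -> bytes: ...


class FileSystemLikeDriver:
    """Stores data under hierarchical paths."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data

    def read(self, path: str) -> bytes:
        return self.files[path]


class ObjectStoreLikeDriver:
    """Stores data under (bucket, key) pairs, a different storage protocol."""

    def __init__(self, bucket: str) -> None:
        self.bucket = bucket
        self.objects: Dict[Tuple[str, str], bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self.objects[(self.bucket, path.lstrip("/"))] = data

    def read(self, path: str) -> bytes:
        return self.objects[(self.bucket, path.lstrip("/"))]


class DataGrid:
    """Maps client actions to standard operations, independent of the driver."""

    ACTIONS = {"put": "write", "get": "read"}

    def __init__(self, driver: StorageDriver) -> None:
        self.driver = driver

    def perform(self, action: str, *args):
        operation = getattr(self.driver, self.ACTIONS[action])
        return operation(*args)
```

A client written against `DataGrid.perform` works unchanged with either driver, which is the point of the layered mapping.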

Migration of Parsing Routines
Data Grids minimize the effort needed to sustain parsing routines
– Parsing routine is encapsulated as a micro-service
– New clients can then be ported on top of the data grid without changing the parsing routine
  – Map from actions to standard actions
– New storage systems can be added to the data grid without changing the parsing routine
  – Map from standard operations to storage protocol

Clients
– Unix shell commands
– Java I/O library
– C I/O redirection library
– Windows browser
– WebDAV
– Kepler workflow
– HDF5 client
– DSpace
– Fedora
– Python library

Scale of iRODS Data Grid
Number of files
– Tens of millions to hundreds of millions of files
Size of data
– Hundreds of terabytes to petabytes of data
Number of policy enforcement points
– 20 actions define when policy is checked
Amount of metadata
– 112 metadata attributes for system information per file
Number of policies
– 150 policies
Number of data grids
– Federation of tens of data grids

Federation Across Spatial Scales
International collaborations
– Australian Research Collaboration Service (ARCS)
– Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN)
– Cinegrid
National collaborations
– Temporal Dynamics of Learning Center (TDLC)
– Ocean Observatories Initiative (OOI)
Regional collaborations
– LSU data grid
– HASTAC humanities data grid
– Distributed Custodial Archive Preservation Environment (DCAPE)
State collaborations
– RENCI data grid
– North Carolina State Library
Institutional repositories
– Carolina Digital Repository
– SIO Repository

Integrating across Supercomputer / Cloud / Grid
[Diagram: an iRODS Data Grid of six iRODS Server Software instances spanning a Supercomputer File System (Parallel Application), a Cloud Disk Cache (Virtual Machine Environment), and a Teragrid Node (Grid Services). Labels: OOI, SCEC, RENCI.]

ARCS Data Fabric

Davis – Modes

Temporal Dynamics of Learning Center
Scientist A adds data to Shared Collection
Scientist B accesses and analyzes shared data
[Diagram: iRODS Data System with iRODS Metadata Catalog, connecting Brain Data Server (CA), Audio Data Server (NJ), and Video Data Server (TN)]
Scientists can use iRODS as a "data grid" to share multiple types of data, near and far. iRODS Rules also enforce and audit human subjects access restrictions.

iRODS Evaluations
NASA Jet Propulsion Laboratory
– iRODS selected for managing distribution of Planetary Data System records
NASA National Center for Computational Sciences
– iRODS chosen to manage archive of simulation output and serve as access data cache for distribution
AVETEC appraisal for DoD HPC centers
– iRODS now provides all required capabilities
French National Library
– iRODS rules control ingestion, access, and audit functions
Australian Research Collaboration Service
– iRODS manages data distributed between academic institutions

Development Team
DICE team
– Arcot Rajasekar - iRODS development lead
– Mike Wan - iRODS chief architect
– Wayne Schroeder - iRODS developer
– Bing Zhu - Fedora, Windows
– Lucas Gilbert - Java (Jargon), DSpace
– Paul Tooby - documentation, foundation
– Sheau-Yen Chen - data grid administration
Preservation
– Richard Marciano - Preservation development lead
– Chien-Yi Hou - preservation micro-services
– Antoine de Torcy - preservation micro-services

Foundation
Data Intensive Cyber-environments
– Non-profit open source software development
– Promote use of iRODS technology
– Support standards efforts
– Coordinate international development efforts
  – IN2P3 - quota and monitoring system
  – King's College London - Shibboleth
  – Australian Research Collaboration Service - WebDAV
  – Academia Sinica - SRM interface

iRODS is a "coordinated NSF/OCI-Nat'l Archives research activity" under the auspices of the President's NITRD Program and is identified as among the priorities underlying the President's 2009 Budget Supplement in the area of Human and Computer Interaction Information Management technology research.
Reagan W. Moore
NSF OCI "NARA Transcontinental Persistent Archives Prototype"
NSF SDCI "Data Grids for Community Driven Applications"