Information Management and Distributed Data Reagan W. Moore Wayne Schroeder Mike Wan Arcot Rajasekar Richard Marciano {moore, schroede, mwan, sekar,



Data Grid Software
- Storage Resource Broker
- iRODS - integrated Rule-Oriented Data System
  - Version 0.9 released May 30, 2007
  - Version 1.0 scheduled for fall 2007
  - Open source - BSD license

Concepts
- Distributed Data Management Concepts
  - Data virtualization: infrastructure independence
  - Trust virtualization: administrative domain independence
  - Federation
- Rule-based Data Management
  - Management virtualization
    - Automating execution of management policies
    - Coupling management policies to assertions about data

Data Virtualization
- Manage properties of each digital entity independently of the remote storage systems
  - Infrastructure independence
- Properties
  - Name spaces
  - Persistent state information (location, size, …)
- Manage standard operations
  - Client actions
  - Operations performed at remote storage systems
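The separation between a logical name space and persistent state information can be sketched in a few lines of Python. This is an illustrative toy, not the SRB or iRODS catalog: the class names `Catalog`, `ObjectState`, and `Replica`, the resource names, and the paths are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Replica:
    resource: str       # logical resource name (hypothetical)
    physical_path: str  # where the bytes live on that resource


@dataclass
class ObjectState:
    size: int
    replicas: List[Replica] = field(default_factory=list)


class Catalog:
    """Toy catalog: persistent state is keyed by logical name only."""

    def __init__(self) -> None:
        self._state: Dict[str, ObjectState] = {}

    def register(self, logical_name: str, size: int, replica: Replica) -> None:
        state = self._state.setdefault(logical_name, ObjectState(size=size))
        state.replicas.append(replica)

    def migrate(self, logical_name: str, old_resource: str, new_replica: Replica) -> None:
        # Swap the storage location; the logical name and size are untouched.
        state = self._state[logical_name]
        state.replicas = [
            new_replica if r.resource == old_resource else r for r in state.replicas
        ]

    def size(self, logical_name: str) -> int:
        return self._state[logical_name].size

    def locate(self, logical_name: str) -> List[str]:
        return [f"{r.resource}:{r.physical_path}" for r in self._state[logical_name].replicas]


if __name__ == "__main__":
    catalog = Catalog()
    name = "/zone/home/alice/survey.dat"
    catalog.register(name, 1024, Replica("disk1", "/vault/01/survey.dat"))
    catalog.migrate(name, "disk1", Replica("tape1", "/tape/a7/survey.dat"))
    print(catalog.locate(name))
```

Clients keep using the same logical name after the migration, which is the infrastructure independence the slide describes.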

Data Virtualization
[Diagram labels: Access Interface, Standard Access Actions, Data Grid, Standard Micro-services, Storage Protocol, Storage System]
- The data grid maps from the actions requested by the access method to a standard set of micro-services used to interact with the storage system
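A minimal sketch of that mapping, assuming a table that expands each access action into an ordered list of micro-service calls. The action names (`get`, `put`) and the micro-service functions are hypothetical stand-ins; the slide does not list the real set.

```python
from typing import Callable, Dict, List

# Hypothetical micro-services: each takes a logical path and reports what it did.
MicroService = Callable[[str], str]


def open_object(path: str) -> str:
    return f"open {path}"


def read_object(path: str) -> str:
    return f"read {path}"


def write_object(path: str) -> str:
    return f"write {path}"


def close_object(path: str) -> str:
    return f"close {path}"


# Each standard access action expands to an ordered list of micro-services.
ACTION_MAP: Dict[str, List[MicroService]] = {
    "get": [open_object, read_object, close_object],
    "put": [open_object, write_object, close_object],
}


def perform(action: str, path: str) -> List[str]:
    """Translate one access action into micro-service calls and run them in order."""
    try:
        steps = ACTION_MAP[action]
    except KeyError:
        raise ValueError(f"unsupported action: {action}") from None
    return [step(path) for step in steps]


if __name__ == "__main__":
    print(perform("get", "/zone/home/alice/data.txt"))
```

Supporting a new access interface then means adding entries to the table, while the micro-services that talk to the storage system stay the same.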

Federation Between Data Grids
[Diagram: two data grids, one holding Data Collection A and one holding Data Collection B, with data access methods (Web Browser, DSpace, OAI-PMH)]
Each data grid maintains:
- Logical resource name space
- Logical user name space
- Logical file name space
- Logical rule name space
- Logical micro-service name
- Logical persistent state

Production Data Grids: Observations
- Data grids manage shared collections that are distributed across multiple storage systems and institutions
- Data grids are responsible for providing recovery mechanisms for all errors that occur in the distributed environment
- The number of observed problems is proportional to the size of the collections
- Need to minimize labor costs by automating:
  - Application of management policies
  - Execution of administrative functions for error recovery
  - Validation of preservation assessment criteria

Observations of Production Data Grids
- Each community implements different management policies
- Need a mechanism to support the socialization of shared collections
  - Community-specific preservation objectives
  - Community-specific assertions about properties of the shared collection
  - Community-specific management policies

iRODS
- What additional levels of virtualization are required to support advanced data management applications?
- Observe that each community imposes different management policies
  - Different criteria for data retention, disposition, access control, data caching, replication
  - Assertions on collection integrity and authenticity, such as required metadata
  - Assertions on data distribution, data transport
- Need the ability to characterize management policies, automate their application, and verify collection properties

Socialization of Data Collections
Management policies are a mechanism for the "socialization" of a collection. They describe how the collection can be accessed by a broader community, the internal consistency mechanisms that maintain the reputation of the builders of the collection, and the collection consistency properties that the broader community can expect when they access the data. Management policies translate the expectations of the designated community that built the collection into the expectations of the wider world that uses the collection. While management policies are unique for each record collection, generic management policies exist that can be tuned to represent the "socialization" of the collection.

Data Management iRODS - integrated Rule-Oriented Data System

Rule-based Data Management
- Map from management policies to rules controlling execution of remote micro-services
- Manage persistent state information for results of each micro-service execution
- Support an additional three logical name spaces
  - Rules
  - Micro-services
  - Persistent state information

iRODS - integrated Rule-Oriented Data System
[Architecture diagram labels: Client Interface; Admin Interface; Resources; Rule Invoker; Rule Engine; Current State; Service Manager; Rule Modifier Module; Config Modifier Module; Metadata Modifier Module; Consistency Check Modules; Rule Base; Confs; Metadata Persistent Repository; Micro Service Modules (Resource-based Services); Micro Service Modules (Metadata-based Services)]

Management Virtualization
- Standard policies expressed as rules
- Rules control execution of data management and access operations
- Integrity
  - Validation of checksums
  - Synchronization of replicas
  - Data distribution
  - Data retention
  - Access controls
- Authenticity
  - Chain of custody - audit trails
  - Required preservation metadata - templates
  - Generation of AIPs, DIPs
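Two of the integrity policies listed above, validation of checksums and synchronization of replicas, can be sketched as follows. This is an illustration only: the function names are hypothetical, replicas are modeled as in-memory bytes, and SHA-256 is an assumed choice of digest, not a statement about what iRODS uses.

```python
import hashlib
from typing import Dict, List


def checksum(data: bytes) -> str:
    # Assumption: SHA-256 stands in for whatever digest the grid records.
    return hashlib.sha256(data).hexdigest()


def find_bad_replicas(recorded: str, replicas: Dict[str, bytes]) -> List[str]:
    """Return the names of replicas whose content no longer matches the recorded checksum."""
    return sorted(name for name, data in replicas.items() if checksum(data) != recorded)


def synchronize(recorded: str, replicas: Dict[str, bytes]) -> Dict[str, bytes]:
    """Overwrite every replica with the content of one replica that still validates."""
    good = next((d for d in replicas.values() if checksum(d) == recorded), None)
    if good is None:
        raise ValueError("no replica matches the recorded checksum")
    return {name: good for name in replicas}


if __name__ == "__main__":
    original = b"archival record"
    recorded = checksum(original)
    replicas = {"resc-a": original, "resc-b": b"archival recorx"}
    print(find_bad_replicas(recorded, replicas))
    replicas = synchronize(recorded, replicas)
    print(find_bad_replicas(recorded, replicas))
```

Run periodically, a check like this is the kind of administrative function that the production observations suggest automating.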

Example Rules
- Rule composed of four parts:
    Name | condition | micro-service set | recovery
- Rule to automate replication of data for a specific collection:
    acPostProcForPut | $objPath like /tempZone/home/rods/nvo/* | msiSysReplDataObj(nvoReplResc,null) | nop
- Rule types
  - Internal, administrative, user-defined
  - Atomic, deferred, periodic
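The four-part rule format can be split mechanically. The sketch below is not the iRODS rule engine: it assumes that `|` never appears inside a condition or an argument list, that `##` separates entries in a chain (as in the administrative rules shown later in the deck), and that a `like` condition can be approximated with shell-style wildcard matching. `Rule`, `parse_rule`, and `applies` are hypothetical names.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List


@dataclass
class Rule:
    name: str
    condition: str
    micro_services: List[str]
    recovery: List[str]


def _split_chain(chain: str) -> List[str]:
    # Assumption: "##" separates the entries of a micro-service or recovery chain.
    return [item.strip() for item in chain.split("##") if item.strip()]


def parse_rule(text: str) -> Rule:
    """Split 'name | condition | micro-service set | recovery' into its four parts."""
    parts = [part.strip() for part in text.split("|")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts, got {len(parts)}")
    name, condition, services, recovery = parts
    return Rule(name, condition, _split_chain(services), _split_chain(recovery))


def applies(rule: Rule, obj_path: str) -> bool:
    """An empty condition always applies; only '$objPath like <pattern>' is handled here."""
    if not rule.condition:
        return True
    prefix = "$objPath like "
    if not rule.condition.startswith(prefix):
        raise ValueError("only '$objPath like <pattern>' is handled in this sketch")
    return fnmatchcase(obj_path, rule.condition[len(prefix):].strip())


if __name__ == "__main__":
    rule = parse_rule(
        "acPostProcForPut | $objPath like /tempZone/home/rods/nvo/* "
        "| msiSysReplDataObj(nvoReplResc,null) | nop"
    )
    print(rule.micro_services)
    print(applies(rule, "/tempZone/home/rods/nvo/image.fits"))
```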

Three Classes of Rules
- Internal rules
  - Used within iRODS for standard data manipulation services
- Administrator rules
  - Set by data grid administrator to enforce policies on shared collection
- User-defined rules
  - Support server-driven workflows

Rule-based Data Management
- Associate rules with combinations of name spaces
  - Rule set for a particular collection
  - Rule set for a particular user group
  - Rule set for a particular user group when accessing a particular collection
  - Rule set for a particular storage system
  - Rule set for a particular micro-service
- Generic rules based on SRB operations
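One way to picture rule sets keyed by combinations of name spaces is a lookup table that falls back from the most specific scope to the generic rules. The sketch covers only the user-group and collection name spaces, and the precedence order is an assumption made for illustration; the slides do not state how iRODS resolves overlapping rule sets. All names and paths are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (user group, collection); None means "any".
Scope = Tuple[Optional[str], Optional[str]]


def select_rules(
    rule_sets: Dict[Scope, List[str]], group: Optional[str], collection: Optional[str]
) -> List[str]:
    """Return the rule set for the most specific matching scope.

    Assumed precedence: group+collection, then group, then collection, then generic.
    """
    for scope in [(group, collection), (group, None), (None, collection), (None, None)]:
        if scope in rule_sets:
            return rule_sets[scope]
    return []


if __name__ == "__main__":
    rule_sets: Dict[Scope, List[str]] = {
        (None, None): ["generic-rules"],
        (None, "/zone/archive"): ["archive-collection-rules"],
        ("curators", None): ["curator-group-rules"],
        ("curators", "/zone/archive"): ["curators-on-archive-rules"],
    }
    print(select_rules(rule_sets, "curators", "/zone/archive"))
    print(select_rules(rule_sets, "students", "/zone/scratch"))
```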

Administrative Rules
- Currently 15 administrative rules
  - Administrative
  - Storage selection
  - Data pre-processing
  - Data post-processing
  - Data deletion
  - Parallel I/O

Administration Creation Rules
  acCreateUser | | msiCreateUser##acCreateDefaultCollections##msiCommit | msiRollback##msiRollback##nop
  acVacuum(*arg1) | | delayExec(msiVacuum,*arg1) | nop
  acCreateDefaultCollections | | acCreateUserZoneCollections | nop
  acCreateUserZoneCollections | | acCreateCollByAdmin(/$rodsZoneProxy/home,$otherUserName)##acCreateCollByAdmin(/$rodsZoneProxy/trash/home,$otherUserName) | nop##nop
  acCreateCollByAdmin(*parColl,*childColl) | | msiCreateCollByAdmin(*parColl,*childColl) | nop

Administration Deletion Rules
  acDeleteUser | | acDeleteDefaultCollections##msiDeleteUser##msiCommit | msiRollback##msiRollback##nop
  acDeleteDefaultCollections | | acDeleteUserZoneCollections | nop
  acDeleteUserZoneCollections | | acDeleteCollByAdmin(/$rodsZoneProxy/home,$otherUserName)##acDeleteCollByAdmin(/$rodsZoneProxy/trash/home,$otherUserName) | nop##nop
  acDeleteCollByAdmin(*parColl,*childColl) | | msiDeleteCollByAdmin(*parColl,*childColl) | nop
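In the user creation and deletion rules, each entry in the micro-service chain has a recovery entry at the same position. A plausible reading, and it is an assumption here rather than something the slides state, is that when a step fails, its recovery action and those of the steps already completed run in reverse order. The sketch below models that behavior with plain Python callables; `run_chain` and the step labels are hypothetical.

```python
from typing import Callable, List, Tuple

# (label, action, recovery): the recovery sits at the same position as its action.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_chain(steps: List[Step]) -> List[str]:
    """Run actions in order; on failure, run recoveries newest-first and stop."""
    log: List[str] = []
    for index, (label, action, _recovery) in enumerate(steps):
        try:
            action()
            log.append(f"ok {label}")
        except Exception:
            log.append(f"failed {label}")
            # Assumption: undo the failed step and every earlier step, in reverse.
            for done_label, _action, recovery in reversed(steps[: index + 1]):
                recovery()
                log.append(f"recover {done_label}")
            break
    return log


if __name__ == "__main__":
    users: List[str] = []

    def create_user() -> None:
        users.append("alice")

    def create_collections() -> None:
        raise RuntimeError("storage unavailable")

    def rollback() -> None:
        users.clear()

    def nop() -> None:
        pass

    steps: List[Step] = [
        ("msiCreateUser", create_user, rollback),
        ("acCreateDefaultCollections", create_collections, rollback),
        ("msiCommit", nop, nop),
    ]
    for line in run_chain(steps):
        print(line)
    print(users)
```

With the failing second step, the commit is never reached and the user created by the first step is rolled back.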

Data Manipulation Rules
- Rule for pre-processing on storage use
    acSetRescSchemeForCreate | | msiSetDefaultResc(demoResc,noForce)##msiSetRescSortScheme(random)##msiSetRescSortScheme(byRescType) | nop##nop##nop
- Rule for pre-processing on data reads
    acPreprocForDataObjOpen | | msiSortDataObj(random) | nop
- Rule for post-processing data writes
    acPostProcForPut | | nop | nop
    acPostProcForCopy | | nop | nop
- Rule for setting number of threads for parallel I/O
    acSetNumThreads | | msiSetNumThreads(default,default,default) | nop
- Rule for data deletion policy setting
    acDataDeletePolicy | | nop | nop

Planned Development
- Implement the rules and micro-services needed for the listed ERA capabilities
  - Have identified 174 micro-services
    - Data manipulation
    - Structured information manipulation
  - Have identified 212 persistent state attributes
- Implement the rules and micro-services needed to validate assessment criteria for trusted digital repositories
  - Have identified 176 rules

ERA Capability Categories
- Accession
- Arrangement
- Description
- Preservation
- Access
- Disposition
- Subscription
- Notification
- Task queuing
- Transformative migration
- Display transformation
- Automated client specification
- System performance and failure reports

Summary of Mapping ERA Capabilities to Management Rules
- ERA integrates capabilities of multiple systems
  - PAWN submission pipeline - 34 operations
  - Cheshire indexing system - 13 operations
  - Kepler workflow - 53 operations
  - iRODS data management operations
  - Operations facility - the remaining capabilities
- The 597 operations are executed by 174 generic rules
- The analysis identified five types of metadata attributes:
  - Collection metadata - 11 attributes
  - File metadata attributes
  - User metadata - 38 attributes
  - Resource metadata - 9 attributes
  - Rule metadata - 32 attributes

Example ERA Capabilities
Record manipulation
- List files
- Display file (template)
- Format file
- Delete file
- Delete file authorized
- Delete file copies
- Delete file versions
- Erase file
- Replace file
- Set file version
- Create soft link
- Replicate file
- Synchronize replicas
- Physically move file
- Annotate file
- Access URL
- Check vault
- Monitor space used
- Output file
- Register file
- Regenerate system metadata
- Set number of items per display page
Structured information
- DIP format template
- Disposition agreement format template
- Disposition action format template
- Physical location report template
- Inventory report template
- Data movement summary report template
- Access report template
- File migration report template
- Document internal access control template
- AIP format template
- Transfer format template
- Access review determination rule template
- Access review determination report template
- Validate access classification rule template
- File transfer discrepancy report template
- Notification review report template
- Redaction rule template
- Search display template
- File display template (file type)
- Format conversion format template
- Workbench display template
- Request help format template

Theory of Data Management
- Characterization
  - Persistent name spaces
  - Operations that are performed upon the persistent name spaces
  - Changes to the persistent state information associated with each persistent name space that occur for each operation
  - Transformations that are made to the records on each operation
- Completeness
  - Set of operations is complete, enabling the decomposition of every data management process onto the operation set.
  - Data management policies are complete, enabling the validation of all data assessment criteria.
  - Persistent state information is complete, enabling the validation of authenticity and integrity and management policies.
- Assertion
  - If the operations are reversible, then a future data management environment can recreate a record in its original form, maintain authenticity and integrity, support access, and display the record.
  - Such a system would allow records to be migrated between independent implementations of data management environments, while maintaining authenticity and integrity.