Interoperability of Digital Repositories Adil Hasan Univ of Liverpool.

Slides:



Advertisements
Similar presentations
Research Data Access and Preservation Summit Panel 2 - Promoting Re-Use of Scientific Collections Some responses to the questions posed... John Harrison.
Advertisements

Panel 2 – Promoting Re-Use of Scientific Collections John Harrison SHAMAN Project University of Liverpool
National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grids for Collection Federation Reagan W. Moore University.
GFS OGF-22 Global Resource Naming Developers: Reagan Moore Arcot Mike.
OGF-23 iRODS Metadata Grid File System Reagan Moore San Diego Supercomputer Center.
© 2006 Open Grid Forum OGF19 Federated Identity Rule-based data management Wed 11:00 AM Mountain Laurel Thurs 11:00 AM Bellflower.
An Introduction to Repositories Thornton Staples Director of Community Strategy and Alliances Director of the Fedora Project.
Database System Concepts and Architecture
Data Grid: Storage Resource Broker Mike Smorul. SRB Overview Developed at San Diego Supercomputing Center. Provides the abstraction mechanisms needed.
NATIONAL PARTNERSHIP FOR ADVANCED COMPUTATIONAL INFRASTRUCTURE SAN DIEGO SUPERCOMPUTER CENTER Particle Physics Data Grid PPDG Data Handling System Reagan.
A Very Brief Introduction to iRODS
iRODS: Interoperability in Data Management
DESIGNING A PUBLIC KEY INFRASTRUCTURE
Chronopolis: Preserving Our Digital Heritage David Minor UC San Diego San Diego Supercomputer Center.
Rutgers University Libraries What is RUcore? o An institutional repository, to preserve, manage and make accessible the research and publications of the.
NextGRID & OGSA Data Architectures: Example Scenarios Stephen Davey, NeSC, UK ISSGC06 Summer School, Ischia, Italy 12 th July 2006.
Brief Overview of Major Enhancements to PAWN. Producer – Archive Workflow Network (PAWN) Distributed and secure ingestion of digital objects into the.
PAWN: A Novel Ingestion Workflow Technology for Digital Preservation
UMIACS PAWN, LPE, and GRASP data grids Mike Smorul.
PAWN: A Novel Ingestion Workflow Technology for Digital Preservation Mike Smorul, Joseph JaJa, Yang Wang, and Fritz McCall.
US GPO AIP Independence Test CS 496A – Senior Design Team members: Antonio Castillo, Johnny Ng, Aram Weintraub, Tin-Shuk Wong Faculty advisor: Dr. Russ.
Understanding Active Directory
System Design/Implementation and Support for Build 2 PDS Management Council Face-to-Face Mountain View, CA Nov 30 - Dec 1, 2011 Sean Hardman.
Digital Object: A Virtual Online Storage Solution 598C Course Project Huajing Li.
DCC Conference, Glasgow November, Digital Archive Policies and Trusted Digital Repositories MacKenzie Smith, MIT Libraries Reagan Moore, San Diego.
San Diego Supercomputer CenterUniversity of California, San Diego Preservation Research Roadmap Reagan W. Moore San Diego Supercomputer Center
Information Management and Distributed Data Reagan W. Moore Wayne Schroeder Mike Wan Arcot Rajasekar Richard Marciano {moore, schroede, mwan, sekar,
Working Group: Practical Policy Rainer Stotzka, Reagan Moore.
OSG Public Storage and iRODS
A Metadata Catalog Service for Data Intensive Applications Presented by Chin-Yi Tsai.
Jan Storage Resource Broker Managing Distributed Data in a Grid A discussion of a paper published by a group of researchers at the San Diego Supercomputer.
Rule-Based Data Management Systems Reagan W. Moore Wayne Schroeder Mike Wan Arcot Rajasekar {moore, schroede, mwan, {moore, schroede, mwan,
Access Across Time: How the NAA Preserves Digital Records Andrew Wilson Assistant Director, Preservation.
Production Data Grids SRB - iRODS Storage Resource Broker Reagan W. Moore
OAIS Rathachai Chawuthai Information Management CSIM / AIT Issued document 1.0.
1 Schema Registries Steven Hughes, Lou Reich, Dan Crichton NASA 21 October 2015.
Working Group Practical Policy based on slides and latest documents from the PP WG chaired by Reagan Moore, Rainer Stotzka presented by Johannes Reetz.
Rule-Based Preservation Systems Reagan W. Moore Wayne Schroeder Mike Wan Arcot Rajasekar Richard Marciano {moore, schroede, mwan, sekar,
National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Persistent Management of Distributed Data Reagan W. Moore.
Policy Based Data Management Data-Intensive Computing Distributed Collections Grid-Enabled Storage iRODS Reagan W. Moore 1.
NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.
OAIS Rathachai Chawuthai Information Management CSIM / AIT Issued document 1.0.
How to Implement an Institutional Repository: Part II A NASIG 2006 Pre-Conference May 4, 2006 Technical Issues.
From SRB to IRODS: Policy Virtualization using Rule-Based Data Grids Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center.
Introduction to The Storage Resource.
The Project Three-year grant from the National Historical Publications and Records Commission (NHPRC), April 2010-March 2013 Develop electronic records.
©MIT LKTR Workshop, Digital Archive Policies and Trusted Digital Repositories MacKenzie Smith, MIT Libraries Reagan Moore, San Diego Supercomputer.
From Access to Archive Transforming Scholars Portal into an E-Journal Archive.
National Archives and Records Administration1 Integrated Rules Ordered Data System (“IRODS”) Technology Research: Digital Preservation Technology in a.
Rights Management for Shared Collections Storage Resource Broker Reagan W. Moore
Building Preservation Environments with Data Grid Technology Reagan W. Moore Presenter: Praveen Namburi.
Collection-Based Persistent Archives Arcot Rajasekar, Richard Marciano, Reagan Moore San Diego Supercomputer Center Presented by: Preetham A Gowda.
A possible SRB to iRODS Migration (a path from here to there)‏ Adil Hasan RAL.
Use of Policies to Enforce Collection Properties Richard Marciano Reagan Moore University of North Chapel Hill Data Intensive Cyber Environments.
Preservation Data Services Persistent Archive Research Group Reagan W. Moore October 1, 2003.
Store and exchange data with colleagues and team Synchronize multiple versions of data Ensure automatic desktop synchronization of large files B2DROP is.
Working Group: Data Foundations and Terminology (Practical Policy Considerations) Reagan Moore.
A Semi-Automated Digital Preservation System based on Semantic Web Services Jane Hunter Sharmin Choudhury DSTC PTY LTD, Brisbane, Australia Slides by Ananta.
Digital Library Storage using iRODS Data Grids Mark Hedges, Tobias Blanke Centre for e-Research, King’s College London Arts and Humanities Data Service.
Data Grids, Digital Libraries and Persistent Archives: An Integrated Approach to Publishing, Sharing and Archiving Data. Written By: R. Moore, A. Rajasekar,
An Overview of iRODS Integrated Rule-Oriented Data System
GGF OGSA-WG, Data Use Cases Peter Kunszt Middleware Activity, Data Management Cluster EGEE is a project funded by the European.
Policy-Based Data Management integrated Rule Oriented Data System
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 2 Database System Concepts and Architecture.
Implementing an Institutional Repository: Part II
Arcot Rajasekar Michael Wan Reagan Moore (sekar, mwan,
Implementing an Institutional Repository: Part II
Reportnet 3.0 Database Feasibility Study – Approach
How to Implement an Institutional Repository: Part II
Presentation transcript:

Interoperability of Digital Repositories Adil Hasan Univ of Liverpool

What's a Digital Repository? Place where your data is stored for current and future use. Contains upload tools so you can store and update your data. Contains discovery tools so you can find your data. Contains access tools so you can get at your data. Recognition that not only the publishable data is important.

Content Management System? Used to store a communities digital content (documents, audio-visual, etc). Primarily focussed on managing the content not on disseminating the content. Focus is on the content production side: e.g managing multiple updates to a digital object.

Digital Archive? Digital Archive: used to store a communities digital content that has been appraised as being worthy of long-term preservation. Requires consideration about storage and access. Now all repositories holdings considered important. Seems now that Digital Repository → Digital Archive.

What is Interoperability? Sorry, but many kinds: Share current data with collaborators within my project. Share current data with collaborators outside my project. Use old data with current data. Can be achieved by: Interoperation of data-sets within a repository Interoperation of data-sets in different repositories.

Intra-Repository Issues Can they correctly understand the data? Metadata, sufficient description Can they find the data? Flexible discovery, adequate indexing etc. Can they access the data? Proprietary tools, do they have the tools, know how to use them. Do they have permissions to all the data? What part of the data-set, when, for how long, issues of recognition.

Intra-Repository Issues Can they update data/metadata? Approval process,validation, versioning, provenance Is it possible to contact a responsible of the old data? For questions about old data that cannot be answered by the metadata. To use data for a different purpose than was intended.

Extra-Repository Issues Is there a uniform way to discover data in repositories? Same terminology used in different repositories? Is there a uniform way to access data in repositories? Is it possible to copy data from one repository to another? Because it's closer to me, or less used or has more space, etc. Is it possible to combine data-sets to make a new data-set?

Workshop Mission Are there more 'types' of interoperability we should think about? Are there more issues that we need to take into account? Are there tools that would be extremely desirable to have? How do we go about making datasets interoperable? Is interoperability really important??

IRODS Adil Hasan University of Liverpool Interoperability of Digital Repositories 2-4 Dec 2009

iRODS? Based on considerable experience from Storage Resource Broker (SRB) developed by DICE group. Found many groups used SRB to store large quantities of data. A lot of server-side post-processing of the data is required (e.g. replicate files, convert to different format, checksum etc). Almost all management is Policy driven.

iRODS? SRB experience motivated requirements for a new data management system: Contained all SRB functionality. Add work-flow to manage server-side post-processing. Configurable – only include the 'services' you need. Open-source – SRB license imposed sever restrictions on the academic community.

iRODS? integrated Rule Oriented Data Management System. Developed by Data Intensive Cyber Environments (DICE) group at UNC and UCSD. Can be seen as a basis for a Digital Repository/Archive. Digital Repository/Archive is a Policy Driven System.

iRODS Client-server middleware Consists of database holding metadata information. Server applications – one for each storage resource. Rule engine applications – one for each storage resource. One server application interfaces to database. Client applications/API: C, Java, Python, PHP.

iRODS Support for user-defined metadata –Useful for adding project-specific metadata –In triplets (attribute, value, unit). –Support schemas such as Dublin Core, FITS, DICOM. –Rules can extract metadata stored in XML files and populate user-defined metadata.

IRODS Provides features essential for Repositories: Storage virtualization Data location virtualization Policy virtualization Features provide a flexible, scalable system that is robust to change. All operations carried out by micro-services on objects in iRODS.

Where iRODS fits? Storage Digital Repository Access Client interacts with digital repository to access data IRODS Spans Digital Repository and Storage Domains

Where iRODS fits? IRODS provides infrastructure to manage data. Policies implemented as computer actionable rules which control the execution of remote micro- services. Micro-services interact with data. Covers the Storage Management and Storage part of the digital repository.

Storage Virtualization Problem: over time storage will change (e.g. new HSM, new tape systems, etc). Solution: insulate repository from changes through interface/driver. IRODS provides drivers to storage that expose POSIX standard API. Interaction with data performed by micro-services that communicate with data through the drivers.

Storage Virtualization Storage System Storage Driver Digital Repository Access Access to Storage through driver. Provides POSIX standard interface.

Storage Virtualization Also want to be insulated from changes to storage name/address. Provide logical storage resource name. Logical-to-physical resource name mapping. All iRODS interactions with Storage use logical name.

Data Location Virtualization Problem: physical structure of digital objects on storage may change in the future. Solution: create logical file-path to insulate from changes to physical path. IRODS provides logical-to-physical mapping to insulate from changes. All iRODS interactions use the logical name.

Data Virtualization Can group data objects into logical collections. Logical collections can span multiple resources. Can create a logical collection that spans zones. Can register events against collections. –Notified when data is updated/moved/etc.

Policy Virtualization Problem: management applications may change over time. Solution: abstract processes such that it is possible to replace processes without altering workflow. IRODS encodes process as a micro-service (C-application) Create workflow by compositing multiple micro-services. Identify locations in data management framework where policies should be checked. Specify a rule that is checked on each invocation of a framework management hook. Support pre-process hooks for authorization Support post-process hooks for audit trails

Storage Virtualization Storage System Storage Driver Access Access to Storage through driver. Provides POSIX standard interface. Digital Repository Rules Micro-Services Executes processes Implements Processes Rule Engine ICAT Logical namespace

Rules Policies are implemented in iRODS as rules. Rule is a series of logically connected steps. Each step realised as a micro-service. IRODS rules fully featured: Contain loops and branches. Can have rules contained within rules. IRODS rules read from a rule-file (called core.irb by default).

Rules Rule follows Event-Action-Recovery chain. Event, Action, Recovery domains separated by '||'. Rule executed from left-to-right. All micro-services in a rule separated by '##'. Each action micro-service must have a recovery (even if it's a nop). Input and output variables start with '*'.

Example Rule Look at an example rule: Rule to query the catalogue to find and print all data objects that are on the demoResc resource. Make use of the core.irb rule acGetIcatResults to return list of results. Use ForEachExec loop to loop over results and print the values.

Example Rule Myrule|| acGetIcatResults(list, "DATA_RESC_NAME = 'demoResc'", *out)## forEachExec(*out, msiPrintKeyValPair(stdout, *out), nop)| nop Rule Name Condition (event) Rule (action) to execute Rule within a rule Recovery chain Loop construct takes as input array *out, workflow: microservice1##microservice2, recovery chain: recovery2##recovery1

Rules Rules stored in a rule file (default is core.irb). Rules read from file top-to-bottom. First rule that satisfies event is executed. Only one successfully executed rule per event. Can override a rule, but overridden rule must appear later in the rule file.

Rule Engine Rules in rule file executed by the rule engine. Engine running on each iRODS resource. Rule engine triggered by any interaction with the iRODS server (copy, put, get, etc). Except for queries of the catalogue (listing). Mainly due to performance reasons. But can be overridden. Rule engine on server client connects to runs by default.

Rule Engine Also delayed execution rules supported. Can execute a rule later. Can execute a rule periodically. Delayed execution rules are run stored in catalogue. By default rule engine polls for delayed execution rules every 30secs. Can direct the rule engine closest to data to execute.

iCAT Client connects to non-icat server User Authorised Client wants to copy data to Octagonal server Rule engine triggers Locate Octagonal server Rule engine triggers Update catalogue

Repositories Requirements To reliably store data for a defined period of time.  Allows rules to be placed on data (replication, checksums). To ensure data are accessible.  Rules to migrate data.  To validate repository assessment criteria  Rules to parse audit trail, verify integrity, verify retention and disposition

Federation Union of independently administered repositories. Useful for: Interoperation with other remote repositories that are independently administered. Access to data in different repositories in seamless manner. High-availability system.

Federation Issues Rights to access remote repository data (all, some). Rights to store data in remote repository. Rights to access applications from remote repository. Rights to store applications in remote repository.

IRODS Federation IRODS system consists of one iCAT and 0 or more storage systems (each with its own iRODS server and rule engine). Each iRODS system has its own name-space called a Zone. IRODS allows interoperation between Zones (Federation). IRODS federation at the storage management level.

IRODS Federation Creation of iRODS federation essentially: Register zones. Register users as remote users. Grant access to data to remote-zone users. Remote user has access to local user data. Users authenticated locally then given remote access. Currently any interaction (except 'ls') will cause rule to trigger on remotely accessed data.

Resources 'federation' Useful if just want to interoperate storage repositories. Each repository part of iRODS system. Only one zone needed. Each site manages its own resource. But, each site needs admin privs to manage its resource.

Zone Federation Useful when 'organizations' wish to interoperate iRODS systems. Each Zone controls it's own storage and data. A Zone may house data that's part of more than one repository. Can more easily add new resources as opposed to 'resource federation'.

Producer-Reader Federation Effectively have one system that is filled with all the data (the Producer Zone). Reader Zones consist of just iCAT server. Replicate data of interest to Reader Zones. Useful for high-read rate systems where readers are not interested in cross-zone collections.

Hub-Spoke Federation Useful where there may be many producers and readers. Centralized model. Each iRODS Zone contains readers and writers. Central Zone has all Zones registered. Users access central and access data from remote Zones.

Repository Interoperation Federation very useful across iRODS systems. In the case of wishing to interoperate with a different type of system: Look at writing an iRODS driver that knows how to talk to the system so it can appear as an iRODS resource. Need an iRODS server running on resource. Look at writing an iRODS micro-service that can interact with the system. No iRODS server needed, but thought required about workflow.

IRODS and Fedora and Dspace Projects currently looking at integration of Fedora repository framework with iRODS manage the back- end storage. D-Grid, DARIAH Duke Medical Archives Carolina Digital Repository DICE provides an interface to Dspace to allow iRODS to be used as managed storage.

Client Dspace or Fedora IRODS System Dspace and Fedora use iRODS as distributed Back-end storage. Discovery and indexing handled by other tools.

Digital Archive The SHAMAN (Sustaining HeritAge through Multivalent ArchiviNg) project Looking at digital preservation. FP7 integrated project funded until end of EU partners. 2 US collaborators. Partners from academia and industry. Aim to provide a digital preservation framework.

SHAMAN Looking to describe the preservation environment sufficiently well. Such that it's possible to replace services without impacting preservation of the data. In addition looking at the use of Multivalent technology to 'render' the object stored in the original format. Multivalent Java-based 'render' tool has adaptors (media engines) capable of reading different formats.

SHAMAN Make use of Cheshire Digital Library tool-kit to index data. Make use of iRODS to provide a means of abstracting the preservation process and providing underlying storage.

SHAMAN CLOUDHSS... Storage Interface (Grid) Metadata Extraction... Content Mgmt Preservation Process Interface Administration Access Interface Grid interfaces to different Types of storage. Provides Uniform Interface Interface provides uniform access to different preservation processes. Provides uniform access to data.

SHAMAN To ensure data usable in the long-term: Insulate from hardware changes. Insulate from changes to processes. Insulate from changes to data format. Insulate from changes to description. Ensure as much information as possible about data is captured.  Ideally test data is understandable without ANY external dependencies. SHAMAN aims to provide a framework that accounts for these issues.

Summary IRODS can provide a basis for which digital repositories or archives can be constructed. Have illustrated some of the features of iRODS. Have illustrated how some of these features can be of use in repositories and archives. Have illustrated how iRODS can interoperate with existing systems.

Acknowledgements Thanks to: –Reagan Moore –Arcot Rajasekar (Raja) –Mike Wan –Wayne Schroeder –Paul Watry –Jean-Yves Nief