Towards a Federated Infrastructure for the Preservation and Analysis of Archival Data
Chien-Yi HOU, Richard MARCIANO
{chienyi,
School of Information and Library Science (SILS)
Sustainable Archives & Leveraging Technologies Group (SALT)
University of North Carolina at Chapel Hill
Preservation with Data Grid
Data grid systems provide data virtualization
– Users can access files seamlessly across distributed environments.
– The grid replicates, synchronizes, and archives data, connecting heterogeneous resources in a logical, abstracted manner.
In addition to the capabilities above, iRODS, the Integrated Rule-Oriented Data System, provides policy/service virtualization
– The Rule Engine applies user-defined policies and rules
– Policies can be coded as functions (micro-services)
– Remote micro-services can be chained
– Chains can be triggered on an event and condition (rules)
– Micro-services communicate through parameters, shared contexts, and out-of-band message queues.
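The event/condition/chain idea above can be illustrated with a toy rule engine. This is a minimal sketch in plain Python, not the iRODS rule language or API: the `Rule` and `RuleEngine` classes, the `on_ingest` event name, and both micro-services are hypothetical names chosen for illustration. It only shows micro-services chained behind a condition and communicating through a shared context.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]                 # shared context passed along the chain
MicroService = Callable[[Context], None]


@dataclass
class Rule:
    event: str                               # event that triggers the rule
    condition: Callable[[Context], bool]     # guard evaluated on the context
    chain: List[MicroService] = field(default_factory=list)


class RuleEngine:
    def __init__(self) -> None:
        self.rules: List[Rule] = []

    def register(self, rule: Rule) -> None:
        self.rules.append(rule)

    def fire(self, event: str, context: Context) -> Context:
        # Run every chain whose event matches and whose condition holds.
        for rule in self.rules:
            if rule.event == event and rule.condition(context):
                for service in rule.chain:
                    service(context)
        return context


# Hypothetical micro-services; each reads and writes the shared context.
def compute_checksum(ctx: Context) -> None:
    ctx["checksum"] = hashlib.sha256(ctx["data"]).hexdigest()


def replicate(ctx: Context) -> None:
    ctx["replicas"] = [f"{name}:{ctx['path']}" for name in ("resourceA", "resourceB")]


engine = RuleEngine()
engine.register(Rule(
    event="on_ingest",
    condition=lambda ctx: ctx["path"].endswith(".shp"),
    chain=[compute_checksum, replicate],
))

result = engine.fire("on_ingest", {"path": "/zone/maps/roads.shp", "data": b"example bytes"})
skipped = engine.fire("on_ingest", {"path": "/zone/maps/notes.txt", "data": b"x"})
```

In this sketch the second call leaves its context untouched because the condition fails, which mirrors a policy that applies only to selected data.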
Overview of iRODS Architecture
– User: can search, access, add, and manage data and metadata*
– iRODS Data System
  – iRODS Data Server: disk, tape, etc.
  – iRODS Metadata Catalog: track information
  – iRODS Rule Engine: track policies
*Access data with a Web-based browser, the iRODS GUI, or command-line clients.
Interoperability & Reproducibility
– e-Legacy: RSS Feed Reader, Data Grid
– PoDRI: Digital Repository, Data Grid
– T-RACES: Digital Library, GIS, Data Grid
CA’s Geospatial Records: Archival Appraisal, Accessioning, and Preservation
e-Legacy: RSS Feed Reader, Data Grid
– Deals with GIS data.
– Moves from “Expert Appraisal & Accessioning” to “Social Appraising and Automated Accessioning”.
– Incorporates RSS subscription into the appraisal process so that archivists can decide together what to preserve.
– Ingests data into the archive automatically once the criteria are satisfied.
Collaborators:
– California State Archives: Chris Garmire
– CERES: David Harris
– SALT/UNC: Richard Marciano, Chien-Yi Hou
Funded by NHPRC
e-Legacy: Shared Infrastructure
Diagram labels: CSA (California State Archives); CaSIL (California Spatial Information Library); DICE (UNC/ INC) Archive ICAT; Local Storage Resources; Shared Preservation Environment; Metadata Catalog (Oracle); Archival Storage (HPSS, Sam-QFS)
Where Is the Data?
Old Approach: Formulating Appraisal Rules
Retrieve root webpage ‘
For each entry:
  Create a “matching entry” collection on iRODS
  Add ‘entry description’ metadata to that collection
  Create “Documentation” subcollection
    Load web page
    Load all “.gif” | “.jpg” | “.jpeg” files
    Load all “.doc”
    Load metadata file
  Create “ArcINFO” subcollection
    Load all “.e00” | “.clr” | “.asc” | “.nit” | “.dlg” | “.txt” files
  Create “Shape” subcollection
    Load all “.shp” files
  Create “SDTS” subcollection
    Load all “.sdts” files
  Create “Others” subcollection
    Load “.tfw” | “.rdb” | “.clr” | “.asc” | “.prj” files
    DECOMPRESS & LOAD “.zip” | “.gz” | “.tgz” | “.tar” | “.tar.gz” files
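The extension-based sorting in these rules could be sketched as a small lookup function. This is an illustration only, not the project's code: the `classify` function and the `DECOMPRESS` marker are hypothetical. Because “.clr” and “.asc” appear under both “ArcINFO” and “Others”, the sketch assumes the first matching rule wins. It also treats archives as a separate step and does not say where the decompressed files end up, since the slide does not state it.

```python
from typing import Optional

# Extension lists copied from the appraisal rules. Order matters:
# ".clr" and ".asc" are listed under both "ArcINFO" and "Others",
# and this sketch assumes the first match wins.
SUBCOLLECTION_RULES = [
    ("Documentation", (".gif", ".jpg", ".jpeg", ".doc")),
    ("ArcINFO", (".e00", ".clr", ".asc", ".nit", ".dlg", ".txt")),
    ("Shape", (".shp",)),
    ("SDTS", (".sdts",)),
    ("Others", (".tfw", ".rdb", ".clr", ".asc", ".prj")),
]
ARCHIVE_EXTENSIONS = (".zip", ".gz", ".tgz", ".tar", ".tar.gz")


def classify(filename: str) -> Optional[str]:
    """Return the target subcollection, "DECOMPRESS" for archives, or None."""
    name = filename.lower()
    if name.endswith(ARCHIVE_EXTENSIONS):
        return "DECOMPRESS"  # hypothetical marker: unpack before loading
    for subcollection, extensions in SUBCOLLECTION_RULES:
        if name.endswith(extensions):
            return subcollection
    return None  # not covered by any rule
```

Hand-written rules of this kind must be revised whenever a provider adds a new file type, which is part of what motivates the move to feed-based appraisal.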
What is RSS?
– RSS is a web feed format.
– RSS is a standardized XML file format that content providers use to publish their content.
Example feed content:
  CERES Geospatial Service
  CERES Geospatial Service
  en-us
  Wed, 23 Jul :08:18 EDT
    CA Digital Raster Graphics 1x2 degree series
    CA Digital Raster Graphics 1x2 degree series(1:250K scale)
    Tue, 22 Jul :07:00 EDT
    CA Digital Raster Graphics 30x60 minute series
    CA Digital Raster Graphics 30x60 minute series(1:100K scale)
    Tue, 22 Jul :58:00 EDT
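Reading such a feed takes only a few lines with Python's standard library. The sketch below assumes an RSS 2.0 layout (`channel`, `item`, `title`, `description`). The slide does not show the feed's markup or version, so the XML in `SAMPLE_FEED` is a reconstruction for illustration that reuses only the titles and descriptions from the example above, and `read_entries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative RSS 2.0 document; the element layout is an assumption,
# the titles and descriptions are taken from the example feed content.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>CERES Geospatial Service</title>
    <description>CERES Geospatial Service</description>
    <language>en-us</language>
    <item>
      <title>CA Digital Raster Graphics 1x2 degree series</title>
      <description>CA Digital Raster Graphics 1x2 degree series(1:250K scale)</description>
    </item>
    <item>
      <title>CA Digital Raster Graphics 30x60 minute series</title>
      <description>CA Digital Raster Graphics 30x60 minute series(1:100K scale)</description>
    </item>
  </channel>
</rss>
"""


def read_entries(feed_xml: str) -> list:
    """Return one dict per <item> with its title and description."""
    channel = ET.fromstring(feed_xml).find("channel")
    return [
        {
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
        }
        for item in channel.findall("item")
    ]


entries = read_entries(SAMPLE_FEED)
```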
RSS Feed Reader
e-Legacy Workflow
Stages: Appraisal, Description, Arrangement, Preservation
Steps: Subscribe to RSS → Review Received Entry → Share and Tag → Meet Preservation Criteria? → (Yes) Preserve to iRODS
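The workflow steps can be sketched as a small loop over received feed entries. The slides do not define the preservation criteria, so this sketch assumes a hypothetical one: an entry qualifies once at least `min_votes` archivists have tagged it "preserve". `FeedEntry`, `meets_criteria`, and `run_workflow` are illustrative names, and the `preserve` callback stands in for the actual ingest into iRODS.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class FeedEntry:
    title: str
    tags: Dict[str, Set[str]] = field(default_factory=dict)  # archivist -> tags
    preserved: bool = False

    def tag(self, archivist: str, label: str) -> None:
        # "Share and Tag": record one archivist's tag on this entry.
        self.tags.setdefault(archivist, set()).add(label)


def meets_criteria(entry: FeedEntry, min_votes: int = 2) -> bool:
    # Assumed criterion: enough distinct archivists tagged the entry "preserve".
    votes = sum(1 for labels in entry.tags.values() if "preserve" in labels)
    return votes >= min_votes


def run_workflow(entries: List[FeedEntry],
                 preserve: Callable[[FeedEntry], None],
                 min_votes: int = 2) -> List[str]:
    """Preserve every not-yet-preserved entry that meets the criteria."""
    preserved_titles = []
    for entry in entries:
        if not entry.preserved and meets_criteria(entry, min_votes):
            preserve(entry)          # placeholder for the ingest into the data grid
            entry.preserved = True
            preserved_titles.append(entry.title)
    return preserved_titles


first = FeedEntry("CA Digital Raster Graphics 1x2 degree series")
second = FeedEntry("CA Digital Raster Graphics 30x60 minute series")
first.tag("archivist_a", "preserve")
first.tag("archivist_b", "preserve")
second.tag("archivist_a", "preserve")

archive: List[str] = []
done = run_workflow([first, second], preserve=lambda e: archive.append(e.title))
```

Only the first entry is preserved here; the second stays in review until another archivist tags it.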
e-Legacy RSS Tool
PoDRI: Policy-Driven Repository Interoperability
PoDRI: Digital Repository, Data Grid
– The PoDRI project investigates the requirements for policy-aware interoperability and demonstrates key features needed for its implementation.
– Combining iRODS and its rule engine with Fedora’s rich semantic object model for digital objects enables use of the best features of both products.
Collaborators:
– UNC: Richard Marciano, David Pcolar, Alex Chassanoff, Chien-Yi Hou
– Duraspace/Cornell: Daniel Davis
– DICE/UCSD: Bing Zhu
Funded by IMLS
PoDRI Use Cases
Fedora (Digital Repository) and iRODS (Data Grid):
– New content ingested via iRODS
– Bulk registration from iRODS into Fedora
– Update of content or metadata via Fedora
– Update of content or metadata via iRODS
– New content ingested via Fedora
PoDRI Use Case 1: New content ingested via Fedora
T-RACES: Testbed for the Redlining Archives of California’s Exclusionary Spaces
T-RACES: Digital Library, GIS, Data Grid
– Makes 1930s redlining files of eight California cities from the Federal Home Owners’ Loan Corporation accessible to the public.
– Integrates the data and maps and provides an interface for users to access and query the data easily.
– One of the first projects to use the HASS (Humanities, Arts, and Social Sciences) Grid, a cyberinfrastructure initiative organized by the University of California Humanities Research Institute (UCHRI) and partners.
Collaborators:
– SALT/UNC: Richard Marciano, Chien-Yi Hou
– UCHRI: David Theo Goldberg
Funded by IMLS
HASS Data Grid
HOLC Area Description Example
Redlining Map
How to use the data?
1. Query the database
2. Browse the map
3. Browse the PDF
Diagram: Original Paper Documents are processed (Scan, OCR, Parse) into a Database, GIS Maps, and PDF Documents. User Queries: Get Results on maps, Get Results on docs, Get Corresponding docs. User Browsing.
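The "query the database, then get the corresponding docs" path might look like the following sketch, using an in-memory SQLite database. The schema is an assumption made for illustration: the slides do not describe the T-RACES tables, so `area_description`, its columns, the placeholder cities, and the PDF paths are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per parsed area description,
# with a pointer to the scanned PDF it came from.
conn.execute(
    "CREATE TABLE area_description ("
    "area_id TEXT PRIMARY KEY, city TEXT, grade TEXT, pdf_path TEXT)"
)
conn.executemany(
    "INSERT INTO area_description VALUES (?, ?, ?, ?)",
    [
        ("X-1", "City X", "A", "docs/city_x/X-1.pdf"),  # placeholder rows, not real data
        ("X-2", "City X", "D", "docs/city_x/X-2.pdf"),
        ("Y-1", "City Y", "D", "docs/city_y/Y-1.pdf"),
    ],
)


def find_documents(connection: sqlite3.Connection, city: str, grade: str) -> list:
    """Return (area_id, pdf_path) pairs for the areas matching a user query."""
    rows = connection.execute(
        "SELECT area_id, pdf_path FROM area_description "
        "WHERE city = ? AND grade = ? ORDER BY area_id",
        (city, grade),
    )
    return rows.fetchall()


matches = find_documents(conn, "City X", "D")
```

The returned area identifiers could then be used to highlight areas on the map, and the paths to open the corresponding PDF documents.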
The Interface 1939
Thank you! More information? or