Jim Tuttle North Carolina State University Libraries


1 Tools Development and Demonstration: North Carolina Geospatial Data Archiving Project
Jim Tuttle
North Carolina State University Libraries

2 Process Overview
Data transfer
Threat and format analysis, validation
Archive package organization
Selective format migration
Metadata normalization and supplementation
Source metadata translation
Statistics collection
Extra-repository AIP management

This presentation discusses some of the tools created by the NCGDAP to facilitate the archive ingest workflow. They are organized here by process.

3 Data Transfer
Python
Md5sum comparison
'Transfer set' metadata capture in 'Seed file'

The copy process from external drives and optical disks is handled by a Python script. It first creates a manifest of the contents, including MD5 checksums. It then copies all files, computing and comparing the checksum of each file as it is copied. The operator is warned after 3 failed attempts to recopy a file that fails the comparison.

As the data set, known as a 'transfer set', is copied to the local server, the operator captures metadata about the set, such as from whom it was acquired, under what circumstances it was obtained, and the permissions that apply to it, and enters it into this form, which wraps it in XML. The community and collection menus are dynamically populated from the DSpace database. The form creates both machine-actionable, Unix-style permissions and text-based, human-readable permissions. Although we don't currently use the permissions, we expect that they might be important for controlling access in the future.
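
The checksum-verified copy described in the notes for this slide could look roughly like the following. This is a minimal sketch, not the NCGDAP script itself: the function names (`md5sum`, `build_manifest`, `copy_transfer_set`) and the returned list of failed files are assumptions made for illustration. Only the manifest-then-copy order, the MD5 comparison, and the limit of 3 attempts come from the presentation.

```python
import hashlib
import shutil
from pathlib import Path

MAX_ATTEMPTS = 3  # the notes mention a warning after 3 failed recopy attempts


def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(source_root):
    """Map each file's path (relative to source_root) to its MD5 checksum."""
    source_root = Path(source_root)
    return {
        str(path.relative_to(source_root)): md5sum(path)
        for path in sorted(source_root.rglob("*"))
        if path.is_file()
    }


def copy_transfer_set(source_root, dest_root):
    """Copy every file in the manifest, verifying each copy by checksum.

    Returns (manifest, failed), where failed lists the files that still
    did not match after MAX_ATTEMPTS copies (hypothetical return shape).
    """
    source_root, dest_root = Path(source_root), Path(dest_root)
    manifest = build_manifest(source_root)
    failed = []
    for rel_path, expected in manifest.items():
        target = dest_root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        for _attempt in range(MAX_ATTEMPTS):
            shutil.copy2(source_root / rel_path, target)
            if md5sum(target) == expected:
                break
        else:
            # No attempt produced a matching checksum: report to the operator.
            failed.append(rel_path)
    return manifest, failed
```

A caller would warn the operator whenever the returned `failed` list is non-empty.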

4 Threat and format analysis, validation
Python wrappers for the following:
Virus – ClamAV
Compressed files (tar, zip, gzip, bzip)
Geodatabases (extension and size)
Executable files (magic numbers)
JHOVE validation

We've written Python classes to wrap utilities such as the ClamAV anti-virus scanner; the Unix 'file' utility, which identifies executable files by their magic numbers; and JHOVE, which is used for format validation, although it doesn't currently recognize most geospatial formats. The script also scans for files that require early intervention, including compressed files and geodatabases. Compressed archives are handled manually because they unpack to inconsistent locations.
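
To illustrate what identification by magic numbers means, here is a small pure-Python sketch. It is a stand-in, not the project's wrapper: the presentation says the Unix 'file' utility does this job, whereas the code below checks a handful of well-known leading byte signatures directly. The names `classify_bytes` and `needs_intervention`, and the two labels, are assumptions for illustration.

```python
from pathlib import Path

# Well-known leading byte signatures. The zip signature is shared by other
# zip-based containers, so a match is a hint rather than a final verdict.
MAGIC_SIGNATURES = [
    (b"\x7fELF", "executable"),     # ELF binary (Linux and other Unix systems)
    (b"MZ", "executable"),          # DOS/Windows executable
    (b"PK\x03\x04", "compressed"),  # zip archive
    (b"\x1f\x8b", "compressed"),    # gzip
    (b"BZh", "compressed"),         # bzip2
]
TAR_MAGIC_OFFSET = 257  # POSIX tar headers carry "ustar" at this offset


def classify_bytes(header):
    """Classify a file from its first bytes (at least 262 are needed for tar)."""
    for signature, label in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return label
    if header[TAR_MAGIC_OFFSET:TAR_MAGIC_OFFSET + 5] == b"ustar":
        return "compressed"  # tar is grouped with compressed files on the slide
    return "other"


def needs_intervention(root):
    """Return {path: label} for files under root that should be flagged."""
    flagged = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with open(path, "rb") as handle:
            label = classify_bytes(handle.read(512))
        if label != "other":
            flagged[str(path)] = label
    return flagged
```

Reading only the first 512 bytes keeps the scan cheap even for large raster files.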

5 Archive package organization
ESRI ArcGIS toolbar for selected formats

We have a three-stage archive item organization process. The first stage is handled by a custom toolbar written for ESRI's ArcCatalog. The primary purpose of the toolbar is format conversion, but it also provides some metadata and organization support. We plan to rewrite this as a Visual Basic extension to make it more portable.

6 Archive package organization
Rule-based Python logic
filestem
extension relationships (multi-file format validation)
directory structure
Manual intervention
metadata.doc
NOID assignment

The second stage is a Python script that builds a comma-delimited file of suggested item groupings based on several factors, such as shared (non-unique) filestems and directory structure (for example, coverage 'info' directories). The script then validates complex, multi-file formats to ensure that the required files for each format are present. The operator can import the suggested item groupings into a spreadsheet to check their accuracy. The spreadsheet, with modifications if necessary, is then used to feed the item-building script. As items are grouped, a NOID persistent identifier is assigned to each item.
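
The grouping and multi-file validation step might be sketched as follows. This is an illustration under stated assumptions, not the NCGDAP script: the function names, the CSV column layout, and the rule table are hypothetical, and only one rule is shown (a shapefile needs its .shp, .shx, and .dbf parts). Paths are treated as POSIX-style strings, and NOID assignment is left out.

```python
import csv
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical rule table: if the key extension is present in a group,
# every extension in the value set must be present as well.
REQUIRED_PARTS = {".shp": {".shp", ".shx", ".dbf"}}


def suggest_groupings(paths):
    """Group file names by (directory, filestem); the filestem is the part
    of the name before the first dot, so roads.shp.xml joins roads.shp."""
    groups = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        stem = path.name.split(".", 1)[0].lower()
        groups[(str(path.parent), stem)].append(path.name)
    return dict(groups)


def missing_parts(file_names):
    """Return the required extensions that a suggested group lacks."""
    extensions = {PurePosixPath(name).suffix.lower() for name in file_names}
    missing = set()
    for trigger, required in REQUIRED_PARTS.items():
        if trigger in extensions:
            missing |= required - extensions
    return missing


def write_groupings_csv(groups, handle):
    """Write one row per suggested item for review in a spreadsheet."""
    writer = csv.writer(handle)
    writer.writerow(["directory", "filestem", "files", "missing"])
    for (directory, stem), names in sorted(groups.items()):
        writer.writerow([
            directory,
            stem,
            ";".join(sorted(names)),
            ";".join(sorted(missing_parts(names))),
        ])
```

A non-empty "missing" column is the cue for the operator to intervene before the item-building script runs.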

7 Selective Format Migration
Conversions using ArcGIS toolbar
e00 interchange to coverage to shapefile
geodatabase to raster, shapefile, etc.
Original files retained

We've identified the shapefile as the format likely to survive the longest while retaining the most functionality. We always retain the original formats.

8 Metadata Normalization & Supplementation
Agency-specific XML templates in ArcCatalog with synchronization flags
Provenance and curation metadata scripted

Often, when metadata is present, it is minimal or incorrect. By using synchronization flags in an agency-specific metadata template, we modify only selected elements. ArcCatalog automates some metadata augmentation, but some is handled by Python. All FGDC metadata is updated to reflect NCGDAP acquisition.

9 Source Metadata Translation
Hub-and-spoke model à la Echo Depository
repository agnostic modular conversion hub
facilitate repository software migration & inter-archive exchange

We've taken inspiration from UIUC and implemented a simple hub-and-spoke metadata translation process. We've created a central hub with which schema-specific spokes can interoperate, and we've tried to abstract out the repository as much as possible. This approach should facilitate exchange between archives. Currently we have input spokes for FGDC and our seed file metadata, and output spokes for QDC and our workflow management database.
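
The hub-and-spoke idea can be sketched in a few lines: each input spoke converts one source schema into a neutral hub record, and each output spoke converts that record into one target schema, so adding a schema means adding one spoke rather than one converter per pair. The hub record keys, the spoke registries, and the `translate` function below are assumptions for illustration, not the project's code. The FGDC element names and the Dublin Core namespace are the standard ones, and only three simple elements are mapped.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def fgdc_input_spoke(fgdc_xml):
    """Input spoke: FGDC XML -> hub record (a plain dict, hypothetical keys)."""
    root = ET.fromstring(fgdc_xml)

    def text(path):
        return (root.findtext(path, default="") or "").strip()

    return {
        "title": text("idinfo/citation/citeinfo/title"),
        "creator": text("idinfo/citation/citeinfo/origin"),
        "description": text("idinfo/descript/abstract"),
    }


def qdc_output_spoke(record):
    """Output spoke: hub record -> XML using Dublin Core elements."""
    root = ET.Element("record")
    for key in ("title", "creator", "description"):
        if record.get(key):
            child = ET.SubElement(root, "{%s}%s" % (DC_NS, key))
            child.text = record[key]
    return ET.tostring(root, encoding="unicode")


# The hub: registries of spokes keyed by schema name.
INPUT_SPOKES = {"fgdc": fgdc_input_spoke}
OUTPUT_SPOKES = {"qdc": qdc_output_spoke}


def translate(source, source_schema, target_schema):
    """Route a document through the hub: source -> hub record -> target."""
    hub_record = INPUT_SPOKES[source_schema](source)
    return OUTPUT_SPOKES[target_schema](hub_record)
```

Under this design, a seed file input spoke or a workflow database output spoke would be registered in the same dictionaries without touching the existing spokes.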

10 Statistics Collection
Python scripted statistics generation:
number of files by format
cumulative size by format
mean file size
collection size
agency contribution

The processing scripts contain functions to capture statistics about transfer sets and about the processes themselves. This information is stored in the workflow management database.
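
Several of the statistics listed on this slide can be gathered in a single pass over a transfer set. The sketch below assumes that a file's format can be approximated by its lowercased extension, which is a simplification; the function name and the keys of the returned dictionary are hypothetical, and agency contribution is omitted because it depends on metadata not shown here.

```python
from collections import defaultdict
from pathlib import Path


def collect_statistics(root):
    """Walk a transfer set and summarize file counts and sizes.

    Assumption: format is approximated by the lowercased file extension.
    """
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for path in Path(root).rglob("*"):
        if path.is_file():
            fmt = path.suffix.lower() or "(none)"
            counts[fmt] += 1
            sizes[fmt] += path.stat().st_size
    total_files = sum(counts.values())
    total_bytes = sum(sizes.values())
    return {
        "files_by_format": dict(counts),   # number of files by format
        "bytes_by_format": dict(sizes),    # cumulative size by format
        "collection_bytes": total_bytes,   # collection size
        "mean_file_bytes": total_bytes / total_files if total_files else 0.0,
    }
```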

11 Extra-repository AIP management
Workflow Management Database populated as a spoke on the metadata/ingest hub
External tracking of NOID, Handle, ISO keywords, other metadata for interaction with other systems

The workflow management database (WMD) is a simple MySQL database that serves several functions. First, it provides insurance against DSpace/Postgres failure: with the WMD and our files on disk, we can rebuild our archive. Second, it allows us to generate reports on the content of the archive. It may also eventually provide a means to integrate the data collected in the project into university collections by using existing search tools.
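
As a rough picture of external tracking, the sketch below keeps one row per archive item and runs a simple report query. It is not the project's schema: the table and column names are invented for illustration, and Python's built-in sqlite3 module stands in for MySQL so the example is self-contained (`INSERT OR REPLACE` is SQLite syntax).

```python
import sqlite3


def open_tracking_db(path=":memory:"):
    """Open the tracking database and create the hypothetical items table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        " noid TEXT PRIMARY KEY,"   # persistent identifier assigned at grouping
        " handle TEXT,"             # repository handle, once ingested
        " iso_keyword TEXT,"
        " agency TEXT)"
    )
    return conn


def record_item(conn, noid, handle, iso_keyword, agency):
    """Insert or update the tracking row for one item."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO items VALUES (?, ?, ?, ?)",
            (noid, handle, iso_keyword, agency),
        )


def items_per_agency(conn):
    """Example report: number of tracked items contributed by each agency."""
    rows = conn.execute(
        "SELECT agency, COUNT(*) FROM items GROUP BY agency"
    ).fetchall()
    return dict(rows)
```

Because the table lives outside the repository, a report like this keeps working even if the repository database has to be rebuilt.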

12 Questions?
Jim Tuttle
Geospatial Data Librarian & Project Coordinator
NCGDAP
NCSU Libraries
jim_tuttle at ncsu dot edu

