Metadata, Ingest, and Data Feeds: What we do with your data and why
Nicole Lawrence, DLG
Mike Kanning, GALILEO
GALILEO Users Conference, July 12, 2018
Your presenters
Nicole Lawrence, Project Manager, Digital Library of Georgia, nicole.lawrence@uga.edu
Mike Kanning, Developer, GALILEO, mkanning@uga.edu
What are we going to talk about?
The DLG data supply chain, including:
How we gather data
What we do with it
Batch processing
Spatial lookups and other improvements
Public websites
DPLA harvest process
The Data Supply Chain
[Diagram. Data sources: OAI-PMH Harvest, Exported Data, Locally Created. Processing and ingest: DLG Processing, DLGadmin. Public websites: Georgia Portal, Civil Rights Digital Library, Civil War in the American South. Outbound: DLG OAI-PMH Data Feed. Downstream: Digital Public Library of America, EBSCO.]
How we gather data
The Data Supply Chain: How we gather data
How we gather data: OAI-PMH harvest
How we gather data: Exported data
How we gather data: Locally created
What we do with your data
The Data Supply Chain: What we do with it
Steps in DLG processing
01 Normalize: system validation; faceting
02 Enhance: missing fields; DLG-specific fields
03 Map: crosswalk original scheme to DLG; ensure proper field headings and content
04 Convert: native format to active XML
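The four steps above can be sketched as a small pipeline. This is only an illustration, not DLG's actual code: the field names in `CROSSWALK`, the default values passed to `enhance`, and the `<item>` wrapper element are all hypothetical, chosen to show the shape of a normalize, enhance, map, convert sequence.

```python
import xml.etree.ElementTree as ET

# Hypothetical crosswalk: source field name -> DLG-style field name.
CROSSWALK = {
    "title": "dcterms_title",
    "creator": "dcterms_creator",
    "date": "dc_date",
    "language": "dcterms_language",
}


def normalize(record):
    """Trim whitespace and drop empty values so later steps see clean data."""
    cleaned = {}
    for field, values in record.items():
        kept = [v.strip() for v in values if v and v.strip()]
        if kept:
            cleaned[field] = kept
    return cleaned


def enhance(record, defaults):
    """Fill in fields the source did not supply (illustrative defaults only)."""
    enhanced = dict(record)
    for field, values in defaults.items():
        enhanced.setdefault(field, list(values))
    return enhanced


def crosswalk(record):
    """Rename source fields to target headings and report unmapped fields."""
    mapped, unmapped = {}, []
    for field, values in record.items():
        target = CROSSWALK.get(field)
        if target is None:
            unmapped.append(field)
        else:
            mapped[target] = values
    return mapped, unmapped


def to_xml(record):
    """Serialize a mapped record as a flat XML element, fields in sorted order."""
    root = ET.Element("item")
    for field in sorted(record):
        for value in record[field]:
            ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")


source = {
    "title": ["  Court house  "],
    "creator": ["", " "],
    "date": ["1903-05"],
    "notes": ["x"],
}
mapped, unmapped = crosswalk(enhance(normalize(source), {"language": ["eng"]}))
xml_text = to_xml(mapped)
```

Reporting unmapped fields instead of silently discarding them gives staff a chance to extend the crosswalk before any data is lost.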
Steps in DLG processing: Normalizing
Steps in DLG processing: Enhancement
Steps in DLG processing: Crosswalk
Steps in DLG processing: Data verification
Steps in DLG processing: Convert
Batch processing
The Data Supply Chain: Ingesting
DLGAdmin’s Batch System
[Diagram: Batch Import, Batch (Batch Items), Batch Commit, Items]
The DLGAdmin batch system is used by DLG staff to ingest, improve, and validate new records, as well as to update existing records. Batches are created as units of work and, when complete, are “committed” to the public index. Batches also serve as an audit trail for records. Batch processing is complex and can take up a lot of system time, so batch jobs are queued and run as background processes.
Populating Batches Form XML
Populating Batches Search Results
A Populated Batch
Committing a Batch
Commit jobs are submitted to a worker queue and worked one at a time. Status notifications are available via a Slack integration. Completed commits show in the list of batches. If a commit fails, the user can revise the batch and retry the commit job. Once a commit completes, item records are created or updated, and each new or changed record is added to the search index. The change is live on the public DLG site as soon as this happens. Viewing an item record shows the history of batch items that were used to create or update that record.
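The one-at-a-time commit queue with retry can be sketched as follows. This is a minimal illustration under stated assumptions, not DLGAdmin's implementation: the `Batch` class, the `commit` validation rule (every item needs an `id`), and the plain dictionary standing in for the search index are all hypothetical, and the Slack notification step is omitted.

```python
import queue
import threading


class Batch:
    """Hypothetical unit of work holding items waiting to be committed."""

    def __init__(self, name, items):
        self.name = name
        self.items = items  # list of dicts, each expected to carry an "id"
        self.status = "pending"
        self.error = None


def commit(batch, index):
    """Validate every item first, then write them all to the index."""
    for item in batch.items:
        if "id" not in item:
            raise ValueError(f"item without id in batch {batch.name!r}")
    for item in batch.items:
        index[item["id"]] = item  # creates a new record or updates an existing one


def worker(jobs, index):
    """Single background worker: commits are processed strictly one at a time."""
    while True:
        batch = jobs.get()
        if batch is None:  # sentinel value: no more jobs
            jobs.task_done()
            break
        try:
            commit(batch, index)
            batch.status = "committed"
            batch.error = None
        except ValueError as exc:
            # A failed batch keeps its items so it can be revised and retried.
            batch.status = "failed"
            batch.error = str(exc)
        finally:
            jobs.task_done()


def run_commits(batches, index):
    """Queue the given batches and wait until the worker has handled them all."""
    jobs = queue.Queue()
    thread = threading.Thread(target=worker, args=(jobs, index))
    thread.start()
    for batch in batches:
        jobs.put(batch)
    jobs.put(None)
    jobs.join()
    thread.join()


index = {}
good = Batch("a", [{"id": "x1", "title": "T"}])
bad = Batch("b", [{"title": "no id"}])
run_commits([good, bad], index)
first_pass = (good.status, bad.status, sorted(index))

# Revise the failed batch and retry the commit.
bad.items[0]["id"] = "x2"
run_commits([bad], index)
```

Validating all items before writing any of them means a failed batch leaves the index untouched, which keeps the retry simple.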
Spatial lookups (and other improvements)
GeoJSON
On import, DLGAdmin generates and indexes a GeoJSON object for each record with spatial metadata. GeoJSON is a standard format for plotting shapes on maps, like the one shown here. We hope to improve this process to allow higher-fidelity object mapping (getting the pin closer to where a photo was actually taken, for example) and to support coordinate lookups for novel locations.
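For readers unfamiliar with the format, here is a minimal sketch of building a GeoJSON point for a record. The function name, the `placename` property, and the approximate coordinates for Athens, Georgia are illustrative assumptions; this is not the structure DLGAdmin actually indexes. The one firm rule shown is from the GeoJSON standard itself: coordinates are ordered longitude first, then latitude.

```python
import json


def point_feature(latitude, longitude, properties=None):
    """Build a GeoJSON Feature containing a single Point geometry."""
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    return {
        "type": "Feature",
        # GeoJSON positions are [longitude, latitude], the reverse of the
        # "lat, long" order many metadata records use.
        "geometry": {"type": "Point", "coordinates": [longitude, latitude]},
        "properties": properties or {},
    }


# Approximate, illustrative coordinates for Athens, Georgia.
feature = point_feature(33.96, -83.38, {"placename": "Athens, Georgia"})
geojson_text = json.dumps(feature)
```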
Indexing Dates
Example dc_date values:
1732-02-03/1732-03-24
1732-06-09
1732/1783
0000/1885-06-10
1/31/1991
1903-05
approx. 1934
circa 1960-1969
July 1, 1997 - June 30, 1998
1/21/1999-4/4/2012
5/1986
1776-7
1920-00-00
On import, DLGAdmin also parses the dc_date field for the year values that apply to the record. Ranges are parsed to include all years within the range. A variety of formats commonly found in the metadata are handled. We are working to improve this process to make it easier to return items from a user-provided range (e.g., 1900-1930) and to increase the fidelity of the indexed date values (e.g., indexing a date down to the individual day or month rather than just the year).
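Year extraction of this kind can be approximated in a few lines. The sketch below is an assumption-laden illustration, not DLGAdmin's parser: it treats every four-digit number as a year, ignores `0000` as an unknown value, and expands any value containing more than one year into the full inclusive range. Abbreviated ranges such as `1776-7` yield only the first year here. The `in_range` helper is a hypothetical example of matching a user-provided range.

```python
import re

# Any standalone four-digit number is treated as a candidate year.
YEAR = re.compile(r"\b\d{4}\b")


def years_for(dc_date):
    """Return every year a dc_date value covers, expanding ranges inclusively."""
    found = [int(y) for y in YEAR.findall(dc_date)]
    found = [y for y in found if y != 0]  # 0000 is taken to mean "unknown"
    if not found:
        return []
    return list(range(min(found), max(found) + 1))


def in_range(dc_date, start, end):
    """True if any year covered by the value falls within start..end inclusive."""
    return any(start <= year <= end for year in years_for(dc_date))


examples = [
    "1732-02-03/1732-03-24",
    "0000/1885-06-10",
    "circa 1960-1969",
    "July 1, 1997 - June 30, 1998",
    "1776-7",
]
indexed = {value: years_for(value) for value in examples}
```

Because only years are kept, a value like `1903-05` matches any query that includes 1903. Indexing down to the month or day would need a richer representation than a list of integers.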
Public Websites
The Data Supply Chain: Public Access
DLG Public Website: dlg.usg.edu
Other Websites: CRDL and AMSO
DPLA harvest process
The Data Supply Chain: Harvesting
DLG’s OAI-PMH Feed
DPLA Metadata Application Profile
DLG in DPLA
Questions?