Data Wrangling: Developing Local Best Practice for Born Digital Metadata
  Tracy Popp, Digital Preservation Coordinator
  Ayla Stein, Metadata Librarian
  University Library, University of Illinois Urbana-Champaign
Intro
  What will be addressed:
    –Institutional context
    –Project needs
    –Challenges
    –Current progress
    –Future work
Institutional Context
  University Library
    –Campus-wide network of libraries
    –Largest public university research library in the U.S.
      »13 million volumes
      »24 million items and materials
      »Over 12 million digital files
  Image: Main Library building, East Entrance
Institutional Context
  Collaborative effort:
    –Content Access Management (Cataloging and Metadata)
      »Ayla, Metadata Librarian
    –Preservation Unit
      »Tracy, Born Digital Content Reformatting
    –Special Collections
      »University Archives, RBML, Sousa, etc.
    –Back to Preservation
      »Kyle Rimkus, Preservation Librarian
    –Digital Content Long-term Preservation (Medusa)
Project Needs
  Ayla (Metadata) and Tracy (Born Digital Content Reformatting)
  Identify
    –Metadata currently captured
  Make
    –Schema recommendations
      »Technical
      »Administrative
      »Descriptive
    –Controlled vocabulary recommendations
Overview of Challenges
  Behemoth spreadsheet
  Various reports not in a schema
  No controlled vocabulary
  Redundant data entry
  Goal: a solution that ideally aligns with Medusa data
Born Digital Reformatting
  Behemoth spreadsheet
    –Project tracking and data entry
  Reports
    –Structured, but not to a schema
    –From FTK Imager:
      »Directory list of media structure (created at time of disk imaging); item-level information
      »Hash list of exported files
    –From TreeSize Pro:
      »Media group level reports
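The hash list mentioned above is structured but not tied to a schema, so a first step toward reuse is reading it into a keyed structure. The following is a minimal sketch, assuming the report is a CSV with `MD5`, `SHA1`, and `FileNames` columns; those column names, the function name `load_hash_list`, and the sample row are assumptions for illustration and should be checked against an actual export.

```python
import csv
import io


def load_hash_list(csv_text, path_col="FileNames", md5_col="MD5", sha1_col="SHA1"):
    """Return {file path: {"md5": ..., "sha1": ...}} from a hash-list CSV.

    Column names are parameters because the real report layout is an assumption.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    index = {}
    for row in reader:
        index[row[path_col]] = {
            "md5": row[md5_col].lower(),
            "sha1": row[sha1_col].lower(),
        }
    return index


# Hypothetical one-row report; the digests are those of an empty file.
SAMPLE = (
    "MD5,SHA1,FileNames\n"
    '"D41D8CD98F00B204E9800998ECF8427E",'
    '"DA39A3EE5E6B4B0D3255BFEF95601890AFD80709",'
    '"disk01/empty.txt"\n'
)

index = load_hash_list(SAMPLE)
print(index["disk01/empty.txt"]["md5"])
```

Keying on the file path makes it straightforward to join the hash list with the directory list later, rather than retyping values into the spreadsheet.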
Challenges – Schema
  No one schema appropriate
    –Many layers of transformation
    –Varying types of metadata
  Workflow stages:
    –Born Digital Reformatting: recover from obsolete media
    –Collecting Unit: arrangement, description, access
    –Digital Preservation Repository: Medusa, long-term preservation
Challenges – Controlled Vocabulary
  Reformatting request form is paper
    –Project tracking system in the works
  No controlled vocabulary
  Reviewed: MANY
  Chose:
    –PBCore instantiationMediaType
    –PBCore instantiationPhysical
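Once a vocabulary is chosen, free-text entries can be checked against it at data entry time. The sketch below shows one way to do that; the term list is an illustrative placeholder subset, not the full PBCore instantiationMediaType vocabulary, and the function name `normalize_term` is hypothetical. Verify the terms against the published PBCore vocabulary before use.

```python
# Placeholder subset for illustration only; not the authoritative PBCore list.
MEDIA_TYPE_TERMS = {"Moving Image", "Sound", "Static Image", "Text", "Software", "Dataset"}


def normalize_term(value, vocabulary):
    """Return the vocabulary term matching `value`, ignoring case and
    extra whitespace, or None when the value is not in the vocabulary."""
    lookup = {term.casefold(): term for term in vocabulary}
    cleaned = " ".join(value.split()).casefold()
    return lookup.get(cleaned)


print(normalize_term("  moving   image ", MEDIA_TYPE_TERMS))
print(normalize_term("floppy", MEDIA_TYPE_TERMS))
```

Returning `None` for unknown values, instead of guessing, lets a tracking tool flag the entry for review.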
Schema Choices
  METS, MODS, and PREMIS
  Why?
    –MODS and PREMIS align with Medusa terms
Schema Choices
  PREMIS
    –Record technical information about the item before reformatting
    –Encode actions and digital forensics reports as ‘events’
    –Can hold the full provenance of a digital object in one cohesive record
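To make "actions as events" concrete, here is a minimal sketch that builds a single PREMIS event element with Python's standard library. It assumes the PREMIS 2.x namespace and element names; the identifier, event type, date, and detail values are hypothetical, and the output has not been validated against the PREMIS schema.

```python
import xml.etree.ElementTree as ET

PREMIS_NS = "info:lc/xml/premis-v2"  # assumption: PREMIS 2.x namespace
ET.register_namespace("premis", PREMIS_NS)


def premis_event(event_id, event_type, date_time, detail, outcome):
    """Build a premis:event element describing one reformatting action."""

    def child(parent, name, text=None):
        el = ET.SubElement(parent, f"{{{PREMIS_NS}}}{name}")
        if text is not None:
            el.text = text
        return el

    event = ET.Element(f"{{{PREMIS_NS}}}event")
    ident = child(event, "eventIdentifier")
    child(ident, "eventIdentifierType", "local")
    child(ident, "eventIdentifierValue", event_id)
    child(event, "eventType", event_type)
    child(event, "eventDateTime", date_time)
    child(event, "eventDetail", detail)
    outcome_info = child(event, "eventOutcomeInformation")
    child(outcome_info, "eventOutcome", outcome)
    return event


# Hypothetical values for illustration.
event = premis_event(
    "evt-0001",
    "disk imaging",
    "2014-01-15T10:30:00",
    "Disk image created with FTK Imager",
    "success",
)
print(ET.tostring(event, encoding="unicode"))
```

One event per action (imaging, hashing, export) keeps the provenance chain readable and lets reports be referenced from the event detail.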
Schema Choices
  The Catch:
    –Medusa supports limited metadata
      »Collection and file group level
      »Event info does not pre-date ingest into the repository
    –Metadata file as content
      »METS wraps up MODS and PREMIS info
      »Deposit METS record with content
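The "metadata file as content" approach can be sketched as a METS document that wraps a MODS record and PREMIS events, written out alongside the files it describes. This is a simplified illustration using the standard METS, MODS, and PREMIS 2.x namespaces; the section IDs, the helper names `wrap` and `build_mets`, and the sample payloads are assumptions, and the result is not a schema-validated record.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
    "premis": "info:lc/xml/premis-v2",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, name):
    """Qualified tag name in ElementTree's {namespace}name form."""
    return f"{{{NS[prefix]}}}{name}"


def wrap(parent, section, section_id, mdtype, payload):
    """Add <section><mdWrap><xmlData>payload</xmlData></mdWrap></section>."""
    sec = ET.SubElement(parent, q("mets", section), ID=section_id)
    md_wrap = ET.SubElement(sec, q("mets", "mdWrap"), MDTYPE=mdtype)
    ET.SubElement(md_wrap, q("mets", "xmlData")).append(payload)
    return sec


def build_mets(mods, premis_events):
    mets = ET.Element(q("mets", "mets"))
    wrap(mets, "dmdSec", "dmd-1", "MODS", mods)  # descriptive metadata
    amd = ET.SubElement(mets, q("mets", "amdSec"), ID="amd-1")
    for n, event in enumerate(premis_events, start=1):
        wrap(amd, "digiprovMD", f"event-{n}", "PREMIS:EVENT", event)
    struct_map = ET.SubElement(mets, q("mets", "structMap"))
    ET.SubElement(struct_map, q("mets", "div"), DMDID="dmd-1", ADMID="amd-1")
    return mets


# Hypothetical payloads for illustration.
mods = ET.Element(q("mods", "mods"))
title_info = ET.SubElement(mods, q("mods", "titleInfo"))
ET.SubElement(title_info, q("mods", "title")).text = "Example floppy disk"

event = ET.Element(q("premis", "event"))
ET.SubElement(event, q("premis", "eventType")).text = "disk imaging"

mets = build_mets(mods, [event])
print(ET.tostring(mets, encoding="unicode"))
```

Because the events travel inside the deposited METS file, provenance that pre-dates ingest is preserved even though the repository itself records events only from ingest onward.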
Good Practice
  Interoperability
    –At various levels, to assist throughout the digital preservation life cycle
Summary: Work In Progress
  Schema choice:
    –METS, MODS, and PREMIS
  Controlled vocabulary choices:
    –Data type: PBCore instantiationMediaType
    –Media type: PBCore instantiationPhysical
Future Work
  Creating a centralized, web-based tracking tool
    –Allow curating units to add descriptive information
    –Avoid data duplication
  Import metadata and reports
    –Structured in a schema
  More controlled vocabulary
    –Rights
Thank You!
  Tracy Popp, Digital Preservation Coordinator
  Ayla Stein, Metadata Librarian
  @TheStacksCat