Migrating Repository Metadata & Users: The Harvard DRS 2 Project
Andrea Goethals, Harvard Library IS&T
Archiving 2014, May
Library Digital Initiative Funds (1998-)
Build technical infrastructure: the Digital Repository Service (DRS)
Hire specialists
Build digital collections, via 49 internal grants, to be preserved in the DRS
DRS Users Grew to 55 Organizational Units at Harvard
DRS is Central to User Workflows
Ingest (deposit tools): reformatting labs; automated system deposits; library, archives and museum staff
Manage (cataloging & management tools): reformatting labs; library, archives and museum staff; repository managers
Access (discovery, search, delivery platforms): researchers, teachers, learners
Why a New DRS?
Upgrade to best-in-breed technologies
Adopt digital preservation best practices and standards
Preserve metadata better
Improve collection management
Support preservation planning & activities
Improve access to content & metadata
Support more formats & genres
Evolution of the DRS
[Timeline figure: DRS in production; New DRS in production; DRS enhancements; New DRS infrastructure development; New DRS metadata migration & user adoption]
New DRS – Completed
Tracks: Infrastructure Development; Metadata Migration & User Adoption
Fedora assessment
DuraCloud pilot test
convened DRS Advisory Group
early release, beta 1, beta 2, beta 3
hardware in production
migrated content to new hardware
software in production
users trained, phase 1
first object deposited to the new DRS
New DRS – In Progress
Tracks: Infrastructure Development; Metadata Migration & User Adoption
metadata migration tools created & tested
migrating metadata
moving users
Why “Metadata” Migration? Why not “content” migration?
Pre-migration: DRS Content; DRS Database
Post-migration: DRS Content; DRS Database; New DRS Database; New DRS Index; New DRS Object Descriptors
New DRS Data Model
Not a simple metadata conversion
A new DRS object is a logical intellectual entity that unifies multiple DRS files, for example:
– Still image objects: archival and production masters, and deliverables including thumbnails
– Audio objects: archival and production masters, and deliverables
– PDS objects: page image and text files
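The object model above can be illustrated with a minimal sketch. All class names, field names, role labels, and the `object_key` used to link related files are assumptions made for illustration; the slides do not describe the actual DRS schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DrsFile:
    file_id: str     # identifier carried over from the current DRS
    object_key: str  # hypothetical key shared by files of one intellectual entity
    role: str        # e.g. "archival_master", "production_master", "deliverable"


@dataclass
class DrsObject:
    object_key: str
    content_model: str  # e.g. "still_image", "audio", "pds" (hypothetical labels)
    files: list = field(default_factory=list)


def group_into_objects(files, content_model):
    """Group flat file records into objects by their shared object_key."""
    by_key = defaultdict(list)
    for f in files:
        by_key[f.object_key].append(f)
    # dicts preserve insertion order, so objects appear in first-seen order
    return [DrsObject(key, content_model, members) for key, members in by_key.items()]


files = [
    DrsFile("f1", "img-001", "archival_master"),
    DrsFile("f2", "img-001", "production_master"),
    DrsFile("f3", "img-001", "deliverable"),
    DrsFile("f4", "img-002", "archival_master"),
]
objects = group_into_objects(files, "still_image")
for obj in objects:
    print(obj.object_key, [f.role for f in obj.files])
```

Four flat file records become two objects here: one still image with three files in different roles, and one with a single archival master.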
Object Descriptors
METS files generated for each object
– Standards-based schemas (PREMIS, MODS, MIX, etc.)
Metadata gathered from multiple sources
– Current DRS database
– Every content file parsed using FITS
– In some cases catalog records, finding aids, legacy METS files
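To make the idea of a per-object METS descriptor concrete, here is a skeletal sketch using Python's standard library. It emits only a `fileSec` and a `structMap`; real descriptors would also embed PREMIS, MODS, MIX and other metadata sections, which are omitted. The input dictionary keys, identifiers, and URLs are hypothetical and are not taken from the DRS.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(tag):
    """Return a namespace-qualified METS tag name."""
    return f"{{{METS_NS}}}{tag}"


def build_descriptor(object_id, files):
    """Build a skeletal METS document listing an object's files.

    `files` is a list of dicts with hypothetical keys: id, role, mimetype, href.
    """
    mets = ET.Element(q("mets"), {"OBJID": object_id})
    file_sec = ET.SubElement(mets, q("fileSec"))
    struct_map = ET.SubElement(mets, q("structMap"))
    root_div = ET.SubElement(struct_map, q("div"), {"TYPE": "object"})

    groups = {}  # one fileGrp per role
    for f in files:
        grp = groups.get(f["role"])
        if grp is None:
            grp = ET.SubElement(file_sec, q("fileGrp"), {"USE": f["role"]})
            groups[f["role"]] = grp
        file_el = ET.SubElement(grp, q("file"), {"ID": f["id"], "MIMETYPE": f["mimetype"]})
        ET.SubElement(file_el, q("FLocat"),
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": f["href"]})
        # the structMap points back at each file by its ID
        ET.SubElement(root_div, q("fptr"), {"FILEID": f["id"]})
    return ET.tostring(mets, encoding="unicode")


descriptor = build_descriptor("img-001", [
    {"id": "FILE_1", "role": "archival_master", "mimetype": "image/tiff",
     "href": "https://example.org/img-001/master.tif"},
    {"id": "FILE_2", "role": "deliverable", "mimetype": "image/jpeg",
     "href": "https://example.org/img-001/deliverable.jpg"},
])
print(descriptor)
```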
Technical Challenges
Many formats
Unique migration rules per format
Preserving all identifiers
Uninterrupted access for end users
Large (>5000 file) page-turned documents
46+ million DRS files: at 1 sec/file, migration would take 530+ days!
Formulating a Migration Plan
Technical analysis
– DRS content
– Possible metadata sources
User analysis
– Management activity via system logs
– Preparation via training and testing registration lists
– Perceived preparation & concerns via survey of highest volume, active users
Migration Plan
Combines needs of users with technical requirements
– Respects all technical requirements
– Minimizes the time users need to work in two systems at the same time
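The 530+ day figure follows directly from the file count. The short calculation below reproduces it and shows how the estimate shrinks as throughput rises; the 4 and 35 files-per-second rates are the range reported in the Tuning Experiments slide, and the function name is an illustrative assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def migration_days(file_count, files_per_second):
    """Estimate wall-clock days needed to process file_count files."""
    return file_count / files_per_second / SECONDS_PER_DAY


FILES = 46_000_000

for rate in (1, 4, 35):
    print(f"{rate:>2} files/sec: {migration_days(FILES, rate):6.1f} days")
# 1 file/sec works out to roughly 532 days, matching the "530+ days" estimate
```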
Migrating Content in 5 Stages
Migrate 1st: Tier 1 content (simpler objects)
Migrate 2nd: Tier 2 content
Migrate 3rd: Tier 3 content
Migrate 4th: Tier 4 content
Migrate 5th: Tier 5 content (more complex objects)
Ordering accounts for dependencies between tiers and dependencies within tiers
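Ordering work so that dependencies migrate first is a topological sort. The sketch below uses Python's standard `graphlib` for this. The specific dependencies shown are hypothetical, since the slides do not list the actual ones; only the content type names come from the tier table.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key must migrate after everything in its set.
depends_on = {
    "Color Profile": set(),
    "Target Image": set(),
    "Still Image": {"Color Profile", "Target Image"},
    "PDS Document": {"Still Image"},
}

# static_order() yields every node only after all of its dependencies
order = list(TopologicalSorter(depends_on).static_order())
print(order)


def respects_dependencies(order, depends_on):
    """Check that every item appears after all of its dependencies."""
    position = {name: i for i, name in enumerate(order)}
    return all(position[dep] < position[item]
               for item, deps in depends_on.items()
               for dep in deps)


print(respects_dependencies(order, depends_on))
```

A cycle in the map would raise `graphlib.CycleError`, which makes it a useful early check that a proposed plan is feasible at all.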
Migrating Content in 5 Stages
Tier  Content
1     Text (Methodology, ESRI World File), Document, Color Profile, Target Image
2     PDS Document, Still Image
3     Audio, Text (SMIL)
4     Web Harvest, Opaque Container
5     Biomedical Image; Google Document Container 1, 2, 3
Tiers 1, 3, 4, 5: Migrate across all DRS owner codes at one time
Tier 2: Migrate one DRS owner code at a time *
* Minimizes the amount of time that the content users manage the most is in 2 different systems
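The owner-code rule can be expressed as a small planning function that expands tiers into migration sets. This is a sketch only: the function name, the `"ALL"` marker, and the owner codes are hypothetical.

```python
def plan_migration_sets(tiers, owner_codes, per_owner_tiers=frozenset({2})):
    """Return (tier, owner_scope) pairs in migration order.

    Tiers listed in per_owner_tiers are split into one set per owner code;
    every other tier becomes a single set covering all owner codes.
    """
    sets = []
    for tier in sorted(tiers):
        if tier in per_owner_tiers:
            sets.extend((tier, code) for code in owner_codes)
        else:
            sets.append((tier, "ALL"))
    return sets


# Owner codes below are made up for illustration.
plan = plan_migration_sets([1, 2, 3, 4, 5], ["OWNER.A", "OWNER.B"])
for tier, scope in plan:
    print(f"Tier {tier}: {scope}")
```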
Technical Strategies
Modular, parallelizable migration design
Delivery services made migration-aware
Test, test, test
Design for migration failures
– make do-overs possible
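One possible reading of "migration-aware" delivery and "do-overs" is sketched below: a lookup prefers the new system and falls back to the old one, so an object can be rolled back and re-migrated without breaking access. The slides do not describe the actual mechanism; the function names and the dictionary stand-ins for the two systems are assumptions.

```python
def resolve(identifier, new_drs, old_drs):
    """Return (system, location) for an identifier, preferring the new DRS."""
    if identifier in new_drs:
        return ("new", new_drs[identifier])
    if identifier in old_drs:
        return ("old", old_drs[identifier])
    raise KeyError(identifier)


def migrate(identifier, new_drs, old_drs):
    """Copy an entry to the new system; the old entry stays as a fallback."""
    new_drs[identifier] = old_drs[identifier]


def undo_migration(identifier, new_drs):
    """Do-over: drop the migrated entry so delivery falls back to the old system."""
    new_drs.pop(identifier, None)


old_drs = {"obj-1": "/old/store/obj-1"}  # hypothetical locations
new_drs = {}

print(resolve("obj-1", new_drs, old_drs))  # served from the old system
migrate("obj-1", new_drs, old_drs)
print(resolve("obj-1", new_drs, old_drs))  # served from the new system
undo_migration("obj-1", new_drs)
print(resolve("obj-1", new_drs, old_drs))  # old system again after the do-over
```

Because the identifier never changes, end users see the same link throughout, which is in keeping with the stated goals of preserving all identifiers and keeping access uninterrupted.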
Technical Strategy – Modular, Parallelizable
START
1) Group files into objects → Objects queue
2) Run FITS, combine with metadata to generate object descriptors → Descriptors ready queue
3) Ingest into new DRS
END
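The three stages and two queues can be sketched with threads and `queue.Queue` from the standard library. This is an illustration of the pattern under stated assumptions, not the project's implementation: the record shapes are hypothetical, and the descriptor step is a stand-in that does not call FITS.

```python
import queue
import threading

SENTINEL = object()  # marks the end of a stream


def run_pipeline(file_records, n_descriptor_workers=4):
    """Run a three-stage pipeline over (object_key, file_id) pairs."""
    objects_q = queue.Queue()      # stage 1 -> stage 2
    descriptors_q = queue.Queue()  # stage 2 -> stage 3
    ingested = []

    def group_files():
        # Stage 1: group files into objects, then signal each worker to stop.
        groups = {}
        for object_key, file_id in file_records:
            groups.setdefault(object_key, []).append(file_id)
        for item in groups.items():
            objects_q.put(item)
        for _ in range(n_descriptor_workers):
            objects_q.put(SENTINEL)

    def build_descriptors():
        # Stage 2: runs in several threads; a stand-in for FITS plus METS generation.
        while True:
            item = objects_q.get()
            if item is SENTINEL:
                descriptors_q.put(SENTINEL)
                return
            object_key, file_ids = item
            descriptors_q.put({"object": object_key, "files": file_ids})

    def ingest():
        # Stage 3: stop only after every stage 2 worker has finished.
        finished = 0
        while finished < n_descriptor_workers:
            item = descriptors_q.get()
            if item is SENTINEL:
                finished += 1
            else:
                ingested.append(item)

    threads = [threading.Thread(target=group_files), threading.Thread(target=ingest)]
    threads += [threading.Thread(target=build_descriptors)
                for _ in range(n_descriptor_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ingested


records = [("a", "f1"), ("a", "f2"), ("b", "f3")]
result = run_pipeline(records)
print(sorted(d["object"] for d in result))
```

Because the stages communicate only through queues, the number of workers in the middle stage can be changed without touching the others, which is what makes the design tunable.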
Tuning Experiments
Single powerful computer
– Dell R720 Server using Intel(R) Xeon(R) CPU E GHz CPUs with 16 cores, 64 GB of memory and 1 TB of internal disk
– Various thread counts
– 4-35 files processed per second
Next:
– RAM disk
– Multiple computers
User Strategies
Advisors: DRS Advisory Group
Minimize disruption
– Tier 2 migration: one owner at a time
– Close partners: Imaging Services
Tapping help of experts
– “pioneer” depositors, beta testers, trainers
Regular communications: monthly via HL Update
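An experiment of this kind can be approximated with a small harness that times the same workload at several thread counts. The sketch below uses a sleep as a stand-in for per-file work, so its numbers say nothing about real DRS throughput; the function names are illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(process_file, files, thread_counts):
    """Return {thread_count: files processed per second} for each thread count."""
    files = list(files)
    results = {}
    for n in thread_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            # list() forces all tasks to finish before the timer stops
            list(pool.map(process_file, files))
        elapsed = time.perf_counter() - start
        results[n] = len(files) / elapsed
    return results


def fake_process(_file):
    """Stand-in for I/O-bound per-file work such as parsing a file."""
    time.sleep(0.01)


rates = measure_throughput(fake_process, range(40), [1, 4, 8])
for n, rate in rates.items():
    print(f"{n} threads: {rate:.0f} files/sec")
```

Threads help here because the stand-in work is I/O-bound; for CPU-bound work in Python, a process pool would be the more appropriate comparison.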
Migration State Diagram
Migration Set Checklist
Description of the affected content
List of steps needing human intervention, who will do them, date of completion
– includes communication, migration kickoff and post-migration verification tasks
Final step: manager signs off on completion
Checklist is preserved
Learned So Far
Can migrate in sub-second/file time
User-contributed metadata varies in quality
– Should automate more and/or put more validation checks in place
– Useful exercise to analyze metadata values and elements periodically: errors in metadata values; value vs. effort of metadata elements
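A periodic analysis of metadata values can start with something as simple as counting distinct values per field and flagging near-duplicates. The sketch below is one hedged example of such a check; the field name and sample records are hypothetical.

```python
from collections import Counter, defaultdict


def profile_field(records, field_name):
    """Count distinct non-empty values of a field and how often it is missing."""
    values = Counter()
    missing = 0
    for record in records:
        value = record.get(field_name)
        if value is None or not str(value).strip():
            missing += 1
        else:
            values[str(value).strip()] += 1
    return values, missing


def case_variants(values):
    """Group values that differ only by letter case, a common entry error."""
    groups = defaultdict(list)
    for value in values:
        groups[value.casefold()].append(value)
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Hypothetical user-contributed records
records = [
    {"producer": "Imaging Services"},
    {"producer": "imaging services"},
    {"producer": "  "},
    {},
]
values, missing = profile_field(records, "producer")
print(values, missing)
print(case_variants(values))
```

Counts like these speak to both questions on the slide: variant spellings reveal errors in values, and a field that is mostly empty suggests its value may not justify the effort of collecting it.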
Preservation Capability Before and After the DRS2 Project
Based on the NDSA Levels of Digital Preservation
[Figure: grid of NDSA categories (Storage & Geographic Location; File Fixity and Data Integrity; Information Security; Metadata; File Formats) against Level One through Level Four. Legend: already compliant; will be compliant after the DRS2 project]
Q & A
Thanks!
DRS Advisory Group, DRS beta testers, DCSWG, Bobbi Fox, Franziska Frey, Andrea Goethals, Wendy Gogel, Chip Goines, HUIT Security, Jonathan Kennedy, LTS Operations, Spencer McEwen, Grainne Reilly, Tracey Robinson, Randy Stern, Janet Taylor, Chris Vicary, Robin Wendler, Julie Wetherill, Vitaly Zakuta