NLM Digital Collections
Update for DC Fedora Users Group
January 22, 2013
John Doyle
National Library of Medicine
The Story So Far
Texts
  – 7,866 books, incl. 225 multi-vol sets
  – Medical Heritage Library 1.7m pages In-house digitization
  – 1 multi-part report
Audiovisuals
  – 70 films
  – 2 thematic collections
The Saga Continues
Serials
  – NIH Institute annual reports
  – 61-volume printed index of historical citations
  – Journals may be coming soon
Oral Histories
Still Images
Born-digital resources
Citation dataset
Public Interface: “Digital Collections”
Browse & Search (Muradora)
  – Supports multiple collections, diverse content
  – Resource display page: metadata, datastreams
Book Viewer (NWU)
  – Open source software from Northwestern University
  – Open source JPEG2000 server (Djatoka)
Video Player with Search (NLM)
  – Features video transcript search and play-ahead jump
  – HHS Innovates finalist (top 6), Fall
Replacing Muradora
Muradora codebase is aging
  – No community development or support
Newer community projects reaching maturity
  – Islandora
  – Hydra
Priority is to preserve/enhance resource search and browse
Probably retain the book and video viewing applications
Current Developments
Workflows
  – Increasingly concurrent content projects
  – Moving from project-specific to project-agnostic
Data Services
  – Programmatic access – search web service
  – Bulk data
  – Need to pin down use cases
Fedora framework upgrading
  – Journaling for propagating changes across multiple Fedora instances
Current Developments
Periodic checksum checking
  – Make use of recent Fedora enhancements in this area
Third copy of content
  – “Just in case” copy, not primary disaster recovery
  – Amazon Glacier seems to be a good fit
Descriptive Metadata
  – More automated updating of ILS
  – Need to update Fedora/Solr post-ingest
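As an illustration of what a periodic checksum check does, here is a minimal Python sketch. It is not Fedora's own checksum feature and not NLM's implementation: the manifest format (file path mapped to an expected hex digest) and the function names are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest, algorithm="sha256"):
    """Compare stored checksums against freshly computed ones.

    `manifest` maps file path -> expected hex digest (hypothetical format).
    Returns a list of (path, problem) tuples; an empty list means every
    file was found and matched its stored checksum.
    """
    failures = []
    for path, expected in manifest.items():
        if not Path(path).is_file():
            failures.append((path, "missing"))
        elif file_checksum(path, algorithm) != expected:
            failures.append((path, "mismatch"))
    return failures
```

Run on a schedule, a non-empty result from `audit` would be the signal to restore the affected file from another copy.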
Related Activities
Internet Archive
  – Over 6,500 books uploaded as part of MHL project
  – Only selected datastreams going up
  – Expect to continue sending books to IA going forward
Hathi Trust
  – Working group delivered recommendations last year
  – Participation could involve an IA-to-HT path
  – Some bibliographic challenges to be met
NLM Digital Collections
Support for Multi-volume texts
January 22, 2013
Nancy Fallgren, Doron Shalvi
National Library of Medicine
Outline
Regular book processing
Regular book data model and presentation
What is a multi-volume?
Multi-volume metadata issues
Multi-volume scanning and identifiers
Multi-volume metadata generation and workflow
Asynchronous volume processing (a.k.a. Jail)
Multi-volume data model and presentation
Software adjustments
Questions
Regular book processing
Voyager record
  – One to one relationship between BIB record and digital object
Metadata processing
  – MARCXML to OAI-DC and DMDINDEX
Preingest process
  – Create derivatives
  – Generate FOXML
  – Locate files
Ingest into Fedora
Regular book data model
ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
METS      M     text/xml             METS file for entire book
OCR       E     text/plain           Book OCR - full text of entire book
PDF       E     application/pdf      PDF of entire book
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in book
Preview   E     image/jpeg           JPG Preview image of selected page in book
Regular book presentation
What is a Multi-volume?
Multiple volume monographic series
  – All volumes share the same series title
  – Each volume may or may not have a unique title
  – The series has a finite beginning and end
Unanalyzed cataloging, i.e., the entire set is cataloged as a single unit; individual volumes do not have their own catalog/BIB records
Not journals or serials
Multi-volume metadata issues
One to many relationship between the Voyager BIB record (for the series) and the digital objects (each volume)
  – NLM UID (MARC 035$9) is the basis for each digital object’s PID
  – Disambiguating volume titles
Distinguishing multi-vol pre- and post-ingest processing workflows from monograph workflows
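To make the UID-to-PID step concrete, the sketch below pulls field 035 subfield 9 out of a MARCXML record using Python's standard library. The `nlm_uid` and `pid_for` helpers and the `example` PID namespace are hypothetical; the slides do not give the actual PID format, only that the UID is its basis.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML ("MARC21 slim") namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def nlm_uid(marcxml_text):
    """Return the value of MARC 035$9 from a MARCXML record, or None."""
    root = ET.fromstring(marcxml_text)
    node = root.find(
        ".//marc:datafield[@tag='035']/marc:subfield[@code='9']", MARC_NS
    )
    if node is None or not node.text:
        return None
    return node.text.strip()


def pid_for(uid, namespace="example"):
    """Build a Fedora-style 'namespace:id' PID (format is an assumption)."""
    return f"{namespace}:{uid}"
```

Because every volume in a set shares one BIB record, and therefore one 035$9 value, the UID on its own cannot distinguish volumes; a per-volume qualifier is needed before it can serve as a unique PID.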
Scanning Spreadsheets: UIDs and volume nos.
From spreadsheet to XML
Set/Parent MARCXML
New child/volume MARCXML
Set/Parent DC
Child/Volume DC
Disambiguating Multi-volume workflows
Transform pre-ingest manifests (UID lists)
  – Remove all UIDs with “X#” suffix
Transform post-ingest manifests
  – Remove all “X#” suffixes from UIDs
  – De-dupe the remaining list
  – Add only set/parent URL to BIB records
DREPSERIES code
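The two manifest transforms can be sketched in a few lines of Python. This assumes that “X#” means the set UID followed by a literal "X" and a volume number (for example "1234567X2"); the real UID pattern may differ, and the function names are illustrative only.

```python
import re

# Assumption: a volume UID is the set UID plus "X" and a volume number,
# e.g. "1234567X2". Adjust the pattern if the real convention differs.
VOLUME_SUFFIX = re.compile(r"X\d+$")


def transform_pre_ingest(uids):
    """Drop every volume-level UID, keeping UIDs without an X# suffix."""
    return [uid for uid in uids if not VOLUME_SUFFIX.search(uid)]


def transform_post_ingest(uids):
    """Strip X# suffixes, then de-duplicate, preserving first-seen order."""
    stripped = (VOLUME_SUFFIX.sub("", uid) for uid in uids)
    return list(dict.fromkeys(stripped))
```

After the post-ingest transform each set appears once, under its parent UID, which matches the goal of adding only the set/parent URL to the BIB record.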
Asynchronous Volume processing a.k.a. Jail
Do not pass GO, do not collect $200
Volumes are scanned and processed asynchronously
Set object created for first child part
Standard processing and review workflow
Volumes held in Jail – no further processing – until all volumes pass manual review on Fedora QA system
Once all volumes reviewed, full set promoted to Production
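The Jail rule is a simple gate: nothing is promoted until every volume in the set has passed review. A minimal Python sketch follows, assuming the full list of volume numbers for a set is known in advance (for instance from the scanning spreadsheet). The `JailedSet` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class JailedSet:
    """Tracks manual-review status for one multi-volume set on the QA system."""
    set_uid: str
    expected_volumes: frozenset          # volume numbers that make up the set
    passed_review: set = field(default_factory=set)

    def record_review(self, volume_no, passed):
        """Record the outcome of a manual review for one volume."""
        if volume_no not in self.expected_volumes:
            raise ValueError(
                f"volume {volume_no} is not part of set {self.set_uid}"
            )
        if passed:
            self.passed_review.add(volume_no)
        else:
            self.passed_review.discard(volume_no)

    def can_promote(self):
        """True only when every expected volume has passed review."""
        return self.passed_review == set(self.expected_volumes)


jail = JailedSet("1234567", frozenset({1, 2, 3}))
jail.record_review(1, passed=True)
jail.record_review(3, passed=True)
print(jail.can_promote())   # False: volume 2 is still outstanding
jail.record_review(2, passed=True)
print(jail.can_promote())   # True: the full set can go to Production
```

Volumes can arrive and be reviewed in any order; only the final check depends on the set being complete.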
Multi-volume set data model
ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in set
Preview   E     image/jpeg           JPG Preview image of selected page in set
Same data model as book, but no METS, OCR or PDF
Multi-volume part data model
ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
METS      M     text/xml             METS file for entire book
OCR       E     text/plain           Book OCR - full text of entire book
PDF       E     application/pdf      PDF of entire book
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in book
Preview   E     image/jpeg           JPG Preview image of selected page in book
Same data model as book
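The three data models differ only in whether the full-text datastreams (METS, OCR, PDF) are present, which makes them easy to check mechanically. The Python sketch below is one way to express that. The datastream IDs come from the tables above and the `mvset` and `mvpart` names from the software adjustments slide; the `book` key and the `check_datastreams` helper are assumptions for illustration.

```python
# Datastreams shared by every model, and those carried only by objects
# that hold scanned content.
COMMON = {"DC", "RELS-EXT", "MARCXML", "DMDINDEX", "THUMB", "Preview"}
FULL_TEXT = {"METS", "OCR", "PDF"}

EXPECTED = {
    "book": COMMON | FULL_TEXT,    # regular single-volume book
    "mvset": COMMON,               # set/parent: no METS, OCR or PDF
    "mvpart": COMMON | FULL_TEXT,  # volume/child: same as a regular book
}


def check_datastreams(model, present):
    """Return (missing, unexpected) datastream IDs for the given model."""
    expected = EXPECTED[model]
    present = set(present)
    return sorted(expected - present), sorted(present - expected)
```

For example, `check_datastreams("mvset", COMMON | {"PDF"})` reports no missing datastreams and flags `PDF` as unexpected on a set object.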
Multi-volume relationships
Set --fedora:hasPart--> Part
Part --fedora:isPartOf--> Set
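In Fedora these relationships are stored as RDF in each object's RELS-EXT datastream. The Python sketch below generates such a datastream, assuming the slide's `fedora:` prefix refers to the Fedora relationship ontology namespace `info:fedora/fedora-system:def/relations-external#`. The `rels_ext` helper and the `example:` PIDs are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed to be the namespace behind the slide's "fedora:" prefix.
REL_NS = "info:fedora/fedora-system:def/relations-external#"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("fedora", REL_NS)


def rels_ext(subject_pid, predicate, object_pids):
    """Build RELS-EXT RDF/XML linking one object to others.

    predicate is "hasPart" (on the set) or "isPartOf" (on a part).
    """
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{subject_pid}"},
    )
    for pid in object_pids:
        ET.SubElement(
            description,
            f"{{{REL_NS}}}{predicate}",
            {f"{{{RDF_NS}}}resource": f"info:fedora/{pid}"},
        )
    return ET.tostring(rdf, encoding="unicode")


# The set points at each part; each part points back at the set.
print(rels_ext("example:set", "hasPart", ["example:vol1", "example:vol2"]))
print(rels_ext("example:vol1", "isPartOf", ["example:set"]))
```

Storing the link in both directions lets the set page list its parts and each part page link back to its set without a reverse lookup.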
Multi-volume presentation - set
Multi-volume presentation - part
Software adjustments
Creation of new content models – mvset, mvpart
New process to generate FOXML, capture thumb
New relationships in RELS-EXT
Adjustment of UI and business logic to handle sets – link to all parts, query part names from Solr
Adjustment of UI to handle child parts – link back to set
Hide basic display of dc.relation – info in hotlinks instead
More abstract content models, to reduce redundant changes, would have helped
Demonstration