Published by Eustace Taylor. Modified over 9 years ago.
1
NLM Digital Collections Update
for DC Fedora Users Group
January 22, 2013
John Doyle
National Library of Medicine
2
The Story So Far
Texts
  – 7,866 books, incl. 225 multi-vol sets
  – Medical Heritage Library 1.7m pages In-house digitization
  – 1 multi-part report
Audiovisuals
  – 70 films
  – 2 thematic collections
3
The Saga Continues
Serials
  – NIH Institute annual reports
  – 61 volume printed index of historical citations
  – Journals may be coming soon
Oral Histories
Still Images
Born-digital resources
Citation dataset
4
Public Interface: “Digital Collections”
Browse & Search (Muradora)
  – Supports multiple collections, diverse content
  – Resource display page: metadata, datastreams
Book Viewer (NWU)
  – Open source software from Northwestern University
  – Open source JPEG2000 server (Djatoka)
Video Player with Search (NLM)
  – Features video transcript search and play-ahead jump
  – HHS Innovates finalist (top 6), Fall 2011
5
Replacing Muradora
Muradora codebase is aging
  – No community development or support
Newer community projects reaching maturity
  – Islandora
  – Hydra
Priority is to preserve/enhance resource search and browse
Probably retain the book and video viewing applications
6
Current Developments
Workflows
  – Increasingly concurrent content projects
  – Moving from project-specific to project-agnostic
Data Services
  – Programmatic access – search web service
  – Bulk data
  – Need to pin down use cases
Fedora framework upgrading
  – Journaling for propagating changes across multiple Fedora instances
7
Current Developments
Periodic checksum checking
  – Make use of recent Fedora enhancements in this area
Third copy of content
  – “Just in case” copy, not primary disaster recovery
  – Amazon Glacier seems to be a good fit
Descriptive Metadata
  – More automated updating of ILS
  – Need to update Fedora/Solr post-ingest
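Periodic checksum checking boils down to recomputing a digest for each stored file and comparing it with the value recorded at ingest. The following is a minimal sketch of that idea only. The SHA-256 algorithm, the function names, and the dict of expected digests are all assumptions for illustration; the slides do not say which algorithm or which Fedora feature NLM uses.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_fixity(expected):
    """Return the paths that are missing or whose digest has changed.

    `expected` maps file path -> hex digest recorded at ingest
    (a hypothetical structure, not a Fedora API).
    """
    failures = []
    for path, digest in expected.items():
        if not Path(path).is_file() or file_digest(path) != digest:
            failures.append(path)
    return failures
```

A scheduled job could call `verify_fixity` over the recorded digests and report any non-empty result for manual follow-up.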
8
Related Activities
Internet Archive
  – Over 6,500 books uploaded as part of MHL project
  – Only selected datastreams going up
  – Expect to continue sending books to IA going forward
Hathi Trust
  – Working group delivered recommendations last year
  – Participation could involve an IA-to-HT path
  – Some bibliographic challenges to be met
9
NLM Digital Collections Support for Multi-volume texts
January 22, 2013
Nancy Fallgren, Doron Shalvi
National Library of Medicine
10
Outline
Regular book processing
Regular book data model and presentation
What is a multi-volume?
Multi-volume metadata issues
Multi-volume scanning and identifiers
Multi-volume metadata generation and workflow
Asynchronous volume processing (a.k.a. Jail)
Multi-volume data model and presentation
Software adjustments
Questions
11
Regular book processing
Voyager record
  – One to one relationship between BIB record and digital object
Metadata processing
  – MARCXML to OAI-DC and DMDINDEX
Preingest process
  – Create derivatives
  – Generate FOXML
  – Locate files
Ingest into Fedora
12
Regular book data model
ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
METS      M     text/xml             METS file for entire book
OCR       E     text/plain           Book OCR - full text of entire book
PDF       E     application/pdf      PDF of entire book
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in book
Preview   E     image/jpeg           JPG Preview image of selected page in book
13
Regular book presentation
14
What is a Multi-volume?
Multiple volume monographic series
  – All volumes share the same series title
  – Each volume may or may not have a unique title
  – The series has a finite beginning and end
Unanalyzed cataloging, i.e., the entire set is cataloged as a single unit, individual volumes do not have their own catalog/BIB records
Not journals or serials
15
Multi-volume metadata issues
One to many relationship between the Voyager BIB record (for the series) and the digital objects (each volume)
  – NLM UID (MARC 035$9) is the basis for each digital object’s PID
  – Disambiguating volume titles
Distinguishing multi-vol pre- and post-ingest processing workflows from monograph workflows
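To make the UID-to-PID step concrete, here is a small sketch that reads the 035 $9 subfield from a MARCXML record and builds an identifier from it. The MARCXML namespace is the standard Library of Congress one; the `nlm:nlmuid-` prefix, the function names, and the way the volume suffix is appended are assumptions for illustration, not the scheme documented in the slides.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def nlm_uid(marcxml):
    """Return the first 035 $9 value found in a MARCXML record, or None."""
    root = ET.fromstring(marcxml)
    sub = root.find(
        ".//marc:datafield[@tag='035']/marc:subfield[@code='9']", MARC_NS
    )
    return sub.text.strip() if sub is not None and sub.text else None


def pid_for(uid, volume_suffix=""):
    """Build a PID string from a UID and an optional volume suffix.

    The "nlm:nlmuid-" prefix is an illustrative assumption.
    """
    return f"nlm:nlmuid-{uid}{volume_suffix}"
```

Because one BIB record covers the whole set, each volume needs its own suffix so that the derived PIDs stay unique.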
16
Scanning Spreadsheets: UIDs and volume nos.
17
From spreadsheet to XML
18
Set/Parent MARCXML
19
New child/volume MARCXML
20
Set/Parent DC
21
Child/Volume DC
22
Disambiguating Multi-volume workflows
Transform pre-ingest manifests (UID lists)
  – Remove all UIDs with “X#” suffix
Transform post-ingest manifests
  – Remove all “X#” suffixes from UIDs
  – De-dupe the remaining list
  – Add only set/parent URL to BIB records
DREPSERIES code
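The two manifest transforms can be sketched in a few lines. This assumes that “X#” means a capital X followed by a volume number at the end of the UID (for example `101222222X2`), which is an interpretation of the slide rather than a documented format; the function names are hypothetical.

```python
import re

# Assumption: a volume UID is the set UID followed by "X" and a volume number.
VOLUME_SUFFIX = re.compile(r"X\d+$")


def pre_ingest_manifest(uids):
    """Drop every UID that carries a volume suffix."""
    return [u for u in uids if not VOLUME_SUFFIX.search(u)]


def post_ingest_manifest(uids):
    """Strip volume suffixes, then de-duplicate, keeping first-seen order."""
    stripped = (VOLUME_SUFFIX.sub("", u) for u in uids)
    return list(dict.fromkeys(stripped))
```

For a manifest of `["101111111", "101222222X1", "101222222X2"]`, the pre-ingest transform keeps only the monograph UID, while the post-ingest transform collapses both volumes to the single set UID so that only the set/parent URL is added to the BIB record.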
23
Asynchronous Volume processing a.k.a. Jail
Do not pass GO, do not collect $200
Volumes are scanned and processed asynchronously
Set object created for first child part
Standard processing and review workflow
Volumes held in Jail – no further processing – until all volumes pass manual review on Fedora QA system
Once all volumes reviewed, full set promoted to Production
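The Jail rule is essentially a gate: nothing in a set moves on until every volume has passed review. A minimal sketch of that gate follows. The class names, the `reviewed` flag, and the `expected_volumes` count are hypothetical; the slides do not describe how the workflow tracks review status or knows how many volumes a set has.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    pid: str
    reviewed: bool = False  # True once manual QA review has passed


@dataclass
class MultiVolumeSet:
    pid: str
    expected_volumes: int  # assumed to be known up front
    volumes: dict = field(default_factory=dict)

    def add(self, volume):
        """Register a processed volume; it stays in Jail for now."""
        self.volumes[volume.pid] = volume

    def ready_for_production(self):
        """True only when every expected volume is present and reviewed."""
        return (
            len(self.volumes) == self.expected_volumes
            and all(v.reviewed for v in self.volumes.values())
        )
```

Volumes can arrive in any order; the set is promoted only after `ready_for_production()` returns `True`.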
24
Multi-volume set data model
ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in set
Preview   E     image/jpeg           JPG Preview image of selected page in set
Same data model as book, but no METS, OCR or PDF
25
Multi-volume part data model
ID        TYPE  MIMETYPE             LABEL
PID       -     -                    Fedora persistent identifier
DC        X     text/xml             Dublin Core metadata for this object
RELS-EXT  X     application/rdf+xml  RDF statements about this object
MARCXML   M     text/xml             MARCXML metadata
DMDINDEX  X     text/xml             DMDINDEX descriptive metadata
METS      M     text/xml             METS file for entire book
OCR       E     text/plain           Book OCR - full text of entire book
PDF       E     application/pdf      PDF of entire book
THUMB     E     image/jpeg           JPG Thumbnail image of selected page in book
Preview   E     image/jpeg           JPG Preview image of selected page in book
Same data model as book
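The difference between the set and part models comes down to which datastreams each object must carry. The sketch below encodes that as a lookup and a check. The datastream IDs come from the tables in the slides, and `mvset` and `mvpart` are the content model names mentioned under Software adjustments; the `book` key and the function name are hypothetical.

```python
# Datastream IDs as listed in the book data model.
BOOK_DATASTREAMS = frozenset(
    ["DC", "RELS-EXT", "MARCXML", "DMDINDEX", "METS", "OCR", "PDF", "THUMB", "Preview"]
)

# "book" is a hypothetical label for the regular book content model.
REQUIRED = {
    "book": BOOK_DATASTREAMS,
    "mvpart": BOOK_DATASTREAMS,  # same data model as book
    "mvset": BOOK_DATASTREAMS - {"METS", "OCR", "PDF"},  # no METS, OCR or PDF
}


def missing_datastreams(model, present):
    """Return the sorted datastream IDs a model requires but the object lacks."""
    return sorted(REQUIRED[model] - set(present))
```

A pre-ingest check could call `missing_datastreams` for each object and refuse to ingest anything with a non-empty result.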
26
Multi-volume relationships
Set fedora:hasPart Part
Part fedora:isPartOf Set
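These relationships are stored as RDF in each object's RELS-EXT datastream. The sketch below generates such a fragment with the standard library. The namespace URIs are the RDF syntax namespace and the Fedora 3 external relations namespace; the function name and the example PIDs are hypothetical, and the output is a simplified illustration rather than the exact RELS-EXT that NLM produces.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL = "info:fedora/fedora-system:def/relations-external#"
ET.register_namespace("rdf", RDF)
ET.register_namespace("fedora", REL)


def rels_ext(subject_pid, predicate, object_pids):
    """Serialize one rdf:Description with a statement per object PID.

    `predicate` is "hasPart" (on the set) or "isPartOf" (on a part).
    """
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(
        root,
        f"{{{RDF}}}Description",
        {f"{{{RDF}}}about": f"info:fedora/{subject_pid}"},
    )
    for pid in object_pids:
        ET.SubElement(
            desc,
            f"{{{REL}}}{predicate}",
            {f"{{{RDF}}}resource": f"info:fedora/{pid}"},
        )
    return ET.tostring(root, encoding="unicode")
```

The set object would carry one `hasPart` statement per volume, for example `rels_ext("example:set", "hasPart", ["example:setX1", "example:setX2"])`, and each volume a single `isPartOf` statement pointing back to the set.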
27
Multi-volume presentation - set
28
Multi-volume presentation - part
29
Software adjustments
Creation of new content models – mvset, mvpart
New process to generate FOXML, capture thumb
New relationships in RELS-EXT
Adjustment of UI and business logic to handle sets – link to all parts, query part names from Solr
Adjustment of UI to handle child parts – link back to set
Hide basic display of dc.relation – info in hotlinks instead
More abstract content models, to reduce redundant changes, would have helped
30
Demonstration http://collections.nlm.nih.gov