PIALA 2010 UH Manoa Hamilton Library
Chronicling America and the National Digital Newspaper Program: Technical Aspects
- Part 1: Newspapers and Microfilm (Challenges, USNP)
- Part 2: Technical Details (Image views, Text searching, Indexing)
- Part 3: Managing a newspaper digitization project
Challenges
- Newspapers are a difficult medium
- Never meant to last, made for daily use and disposal
- Pages crumble and acid corrodes the materials
- Tracking serial publications over time
- Patron demand increased, storage space grew scarce, binding costs rose
Microfilm
- Adopted in the 1920s as a standard
- Turns newspapers from a storage nightmare into a relatively easy medium to handle
- Libraries had to decide what to do with the hard copy: keep it in holdings or deaccession it?
United States Newspaper Program (USNP)
- Began in 1982
- Funded by the National Endowment for the Humanities (NEH), managed by the Library of Congress
- For Hawai’i, the University of Hawai’i contributed together with the Hawaiian Historical Society, the Hawai’i State Archives, and the State Library
- By the mid-2000s, the USNP had received over $54 million in NEH support and non-federal contributions of approximately $19.6 million
- Bibliographic records for over 140,000 newspaper titles; access to 70 million pages of newsprint on microfilm
USNP
- Goal: locate, catalog, and microfilm newspapers
- Hawai’i microfilmed 260,000 pages and cataloged 476 titles
- Program ended in 2007
USNP Preservation Microfilming Guidelines
- Optimum legibility: image orientation and reduction ratios chosen to fill the frame and obtain the greatest degree of legibility in public use copies
- Quality: each roll of first-generation film shall be inspected frame by frame by both the filming agency and the project for density and resolution, and to determine that the film is free of emulsion scratches, abrasions, fingerprints, spots, fog, and other defects
USNP Preservation Microfilming Guidelines: Density
- No fewer than five readings at the start, middle, and end of each reel, taken with a transmission densitometer calibrated daily
- Maximum density (Dmax) measurements taken on an exposed image area with no words or graphics
- Background densities no lower than 0.80 and no higher than 1.20; lower densities preferred for older pages and to facilitate production of reader-printer and enlargement prints
- Base-plus-fog density (Dmin) on the master negative shall not exceed 0.10
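The density limits on this slide lend themselves to a simple automated check of the readings recorded for a reel. The following is a minimal Python sketch, not part of the USNP guidelines or of any Library of Congress tool. The function name `check_density_readings` and its inputs are assumptions made for illustration, and it reads the guideline as requiring at least five background readings per measured position on the reel.

```python
# Limits taken from the guideline text; everything else is illustrative.
BACKGROUND_MIN = 0.80   # background density must not fall below this
BACKGROUND_MAX = 1.20   # background density must not rise above this
DMIN_MAX = 0.10         # base-plus-fog density limit for the master negative
MIN_READINGS = 5        # assumed: at least five readings per position


def check_density_readings(background_readings, dmin):
    """Return a list of problems found; an empty list means the readings pass."""
    problems = []
    if len(background_readings) < MIN_READINGS:
        problems.append(
            f"only {len(background_readings)} readings, need {MIN_READINGS}"
        )
    for number, value in enumerate(background_readings, start=1):
        if not BACKGROUND_MIN <= value <= BACKGROUND_MAX:
            problems.append(f"reading {number} out of range: {value:.2f}")
    if dmin > DMIN_MAX:
        problems.append(f"Dmin too high: {dmin:.2f}")
    return problems


if __name__ == "__main__":
    # Hypothetical readings for the start of one reel.
    for problem in check_density_readings([0.85, 0.90, 1.00, 1.10, 1.25], dmin=0.08):
        print(problem)
```

A check like this could run over the spreadsheet of density readings before reels are sent for duplication, so that out-of-range reels are flagged early.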
National Endowment for the Humanities and Library of Congress created the National Digital Newspaper Program (NDNP)
- No single US collection of newspapers
- Every institution focuses on particular themes relating to its collecting plans
- Thousands of volumes of newspapers spread across the country
- Goal: enhance access to newspapers, building on the foundation of the United States Newspaper Program
NDNP Overview
- Two-year awards to state projects, renewable
- Digitize 100,000 pages of microfilmed newspapers
- Newspapers selected must date from between 1836 and 1922
- Historical essays on each newspaper
- Collation and quality control on all papers
NDNP Goals
- 20-year span with phased, sustainable development of a 30-million-page database
- Establish technical conversion specs and practices for efficient basic discovery and access
- Develop production tools to ensure good digital objects that can be managed and preserved long-term
- Provide public access to and take preservation responsibility for the digitized newspapers
- Create a national resource of historically significant newspapers from all the states and U.S. territories
NDNP Microfilm-related Challenges
- Where are the master reels?
- Copyright issues (who filmed the newspapers and who owns the master microfilm?)
- Technical specifications (poorly filmed reels, low density readings, etc.)
- Microfilm standards applied vary widely
No universally accepted metadata standard for historical newspapers
- Online historical newspapers produced by the public or private sector existed as discrete systems, with metadata structures not designed for interoperability
- Titles, issues, pages, and reels all need to be represented as different yet related classes of objects
NDNP Digital Deliverables
- Images scanned at dpi
- Three formats:
  - Grayscale, uncompressed TIFF 6.0 images
  - Compressed JPEG2000 images
  - PDF image with hidden text
- Accompanying structural and technical metadata
- OCR text for all pages
NDNP Scanning specifications
- De-skew images with a skew greater than 3 degrees
- Crop to the visible edge of the page
- Capture grayscale preservation microfilm targets
NDNP OCR specifications
- Conform to the ALTO XML schema
- ALTO (Analyzed Layout and Text Object) is an XML (Extensible Markup Language) schema that defines technical metadata for describing the layout and content of physical text resources
- Bounding box coordinate data: each column is sectioned, and coordinates are used to place words
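To make the bounding box idea concrete, here is a minimal Python sketch that reads word coordinates out of an ALTO-style file. The sample document is invented for illustration and is far smaller than a real NDNP OCR file; the namespace URI in it is illustrative, and the function deliberately ignores namespaces. It relies on the `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, and `HEIGHT` attributes that ALTO defines on `String` elements. The function name `word_boxes` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample, trimmed to the elements this sketch needs.
SAMPLE_ALTO = """<alto xmlns="http://schema.ccs-gmbh.com/ALTO">
  <Layout>
    <Page ID="P1" HEIGHT="9000" WIDTH="6000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Honolulu" HPOS="120" VPOS="340" WIDTH="410" HEIGHT="60"/>
            <SP/>
            <String CONTENT="Harbor" HPOS="560" VPOS="340" WIDTH="300" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def word_boxes(alto_xml):
    """Return (word, hpos, vpos, width, height) for every String element."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for element in root.iter():
        # Compare the local name only, so any ALTO namespace version works.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        boxes.append((
            element.get("CONTENT"),
            float(element.get("HPOS")),
            float(element.get("VPOS")),
            float(element.get("WIDTH")),
            float(element.get("HEIGHT")),
        ))
    return boxes


if __name__ == "__main__":
    for word, hpos, vpos, width, height in word_boxes(SAMPLE_ALTO):
        print(f"{word}: left={hpos} top={vpos} width={width} height={height}")
```

Coordinates like these are what allow a viewer to highlight a search hit on the page image.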
NDNP Metadata requirements
- METS (Metadata Encoding and Transmission Standard) format records preservation metadata
- Structural metadata to relate pages to title, date, and edition; to sequence pages within an issue or section; and to identify image and OCR files
- Technical metadata to support the functions of the Library of Congress repository
- (Metadata is information about information)
XML Rules
- Single, unique root element
- Matching open/close tags
- Consistent capitalization
- Correctly nested elements (no overlapping elements)
- Attribute values enclosed in quotes
- No repeating attributes in an element
- Provides an international, vendor-independent standard for describing information
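These rules are what an XML parser enforces when it checks that a document is well-formed. As a minimal sketch, Python's standard `xml.etree.ElementTree` module can be used to test each rule; the helper name `well_formed_error` and the tiny sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def well_formed_error(xml_text):
    """Return None if xml_text is well-formed, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return str(error)
    return None


# One invented sample per rule on the slide.
SAMPLES = {
    "well-formed": '<issue date="1900-01-01"><page n="1"/></issue>',
    "two root elements": "<issue/><issue/>",
    "inconsistent capitalization": "<Title>Evening Bulletin</title>",
    "overlapping elements": "<issue><page></issue></page>",
    "unquoted attribute value": "<page n=1/>",
    "repeated attribute": '<page n="1" n="2"/>',
}

if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(f"{label}: {well_formed_error(text) or 'OK'}")
```

Well-formedness is only the first layer; checking a file against a schema such as METS or ALTO is a separate validation step that this sketch does not attempt.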
Family of XML data standards includes:
- METS: Metadata Encoding and Transmission Standard
- MODS: Metadata Object Description Schema
- PREMIS: PREservation Metadata Implementation Strategies
- EAD: Encoded Archival Description
METS (Metadata Encoding and Transmission Standard)
An XML Schema for creating XML files that define:
- the hierarchical structure of digital library objects (images, text files, etc.)
- the names and locations of the files
- the associated metadata (e.g., MODS)
Metadata Object Description Schema (MODS)
- An XML Schema designed for expressing bibliographic data (think of it as an alternative to the MARC format)
Sections of a METS file
- METS header (document talks about itself)
- Descriptive metadata (MODS, etc.)
- Administrative metadata (copyright info., etc.)
- File section (names and locations of files)
- Structural map (relationships of the parts)
- Linking information
- Binding executables/actions to object
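Each section named on this slide corresponds to a top-level element in the METS schema (`metsHdr`, `dmdSec`, `amdSec`, `fileSec`, `structMap`, `structLink`, `behaviorSec`). The Python sketch below builds an empty skeleton with those elements and lists which sections a document contains. It is an illustration only: the skeleton is not schema-valid (a real `structMap`, for example, needs child `div` elements), and the helper names `build_skeleton` and `sections_present` are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# METS element name -> the label used on the slide.
SECTIONS = {
    "metsHdr": "METS header",
    "dmdSec": "Descriptive metadata",
    "amdSec": "Administrative metadata",
    "fileSec": "File section",
    "structMap": "Structural map",
    "structLink": "Linking information",
    "behaviorSec": "Binding executables/actions to object",
}


def build_skeleton():
    """Build an empty METS document containing one element per section."""
    ET.register_namespace("mets", METS_NS)
    root = ET.Element(f"{{{METS_NS}}}mets")
    for name in SECTIONS:
        ET.SubElement(root, f"{{{METS_NS}}}{name}")
    return root


def sections_present(root):
    """List the slide labels of the METS sections found under the root."""
    found = []
    for child in root:
        local_name = child.tag.rsplit("}", 1)[-1]
        if local_name in SECTIONS:
            found.append(SECTIONS[local_name])
    return found


if __name__ == "__main__":
    skeleton = build_skeleton()
    print(ET.tostring(skeleton, encoding="unicode"))
    print(sections_present(skeleton))
```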
Title METS
- Combines bibliographic and holdings data in a single title record, converted from MARC to MARC XML format
- Digitized titles will have additional data (descriptive essays, more precise geographic coverage data), which is put in a Metadata Object Description Schema (MODS) object within the larger METS document
Issue and Reel METS
- Issue METS: issue data, page data
- Reel METS: reel data, target data
WHY?
- XML structure used by software for creation of multiple outputs: HTML/XHTML for Web display; PDF for printing
- Ease of editing (single records or batches of records)
- Ability to validate data
- Ease of data management and publishing
- Interoperability
- Repository submission and OAI harvesting
- Geographic metadata
- Title metadata
- Date metadata
All that coding pays off for the user when SEARCHING
Keyword searching
- OCR/OWR does not yield article “transcriptions”; text OCR’d from images of newspapers is used for searching purposes
- Several options:
  - ANY of the words
  - ALL of the words
  - EXACT PHRASE
  - Proximity search: look for words within 5, 10, 50, or 100 words of one another
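The four search options can be illustrated on a single string of OCR text. The Python sketch below is a toy model of what each option means, not how Chronicling America implements search (a production system would query a full-text index rather than scan text). All function names and the sample sentence are invented for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def match_any(text, words):
    tokens = set(tokenize(text))
    return any(word.lower() in tokens for word in words)


def match_all(text, words):
    tokens = set(tokenize(text))
    return all(word.lower() in tokens for word in words)


def match_phrase(text, phrase):
    tokens, wanted = tokenize(text), tokenize(phrase)
    if not wanted:
        return False
    last_start = len(tokens) - len(wanted)
    return any(tokens[i:i + len(wanted)] == wanted for i in range(last_start + 1))


def match_proximity(text, first, second, max_distance):
    """True if the two words occur within max_distance words of each other."""
    tokens = tokenize(text)
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


if __name__ == "__main__":
    ocr_text = ("The steamer arrived in Honolulu harbor on Tuesday "
                "morning with sugar from Maui")
    print(match_any(ocr_text, ["whale", "sugar"]))
    print(match_all(ocr_text, ["whale", "sugar"]))
    print(match_phrase(ocr_text, "Honolulu harbor"))
    print(match_proximity(ocr_text, "steamer", "sugar", 5))
    print(match_proximity(ocr_text, "steamer", "sugar", 10))
```

Because the matching runs on raw OCR output, a misrecognized word simply fails to match, which is why OCR accuracy matters even though no transcription is shown to the user.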
Page thumbnail view
- Click on a thumbnail or the description of a page to view a larger version
Page view
- A different format can be selected with one click
Browse Issues
- A calendar view indicating which issues have been digitized
- Can change which year you’re viewing
- Browse First Pages
From Microfilm to Digital Images: Managing a Newspaper Conversion Project
- Project Management
NDNP & University of Hawai’i
- UH’s first grant began in July 2008 and ran until June 2010
- Grant renewed: July 2010-June 2012
- Utilizing the microfilm created under the USNP
- Excellent quality microfilm (in theory)
- Fewer problems with cataloging/description and with acquiring 2N duplicates (in theory)
Project Management
- Request for Proposals (RFP): include all LC technical specifications
- Position description(s): coordinator, students
- Hiring and training
Project components
- Microfilm identification and duplication
- Digitization
- Metadata creation and validation
Microfilm selection
- Choose what is important to your institution(s) if possible
- Copyright:
  - Reels created by or for your institution
  - For reels by ProQuest, etc., you may have to ask for permission and pay much higher duplication fees
- Decide: complete runs of a few titles, or many short/incomplete runs of a lot of titles
Vendors
- iArchives: leaders in the field, lots of experience
- OCLC/BSLW (Backstage Library Works)
- Apex CoVantage
- Northern Micrographics (NMT)
- Local or national microfilm duplication companies
Equipment
- GB External Hard Drives (Western Digital MyBooks) and Pelican cases
- 1 PC with double monitor
- Software: Library of Congress’ Digital Viewer and Validator (DVV)
- Densitometer
- Microfilm reader/scanner
Our Stuff
- Densitometer
- Pelican cases
- Microfilm scanner
- PC with 2 monitors and portable HDs (red)
Staffing
- Project Coordinator
- Quality Control Technician
- Graduate students
- Advisory Board: subject/history/newspaper specialists
Metadata Collection
- Density readings recorded onto a spreadsheet
Preparing the Microfilm: Metadata
- Data from the OCLC MARC record and local holdings
Preparing the Microfilm: Collation
- Review the use copy of the reel and record the data on a spreadsheet
- Missing issues or pages
- Duplicate issues or pages
- Mutilated pages
- Other abnormalities (e.g., pages out of order, incorrect dates)
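Once issue dates have been recorded on the collation spreadsheet, some of these checks can be automated. The following Python sketch flags missing, duplicate, and out-of-order issue dates. It assumes a fixed publication interval (daily by default), which will not hold for every title, and the function name `collate` and the sample dates are invented for illustration. Mutilated pages still need a human looking at the film.

```python
from collections import Counter
from datetime import date, timedelta


def collate(issue_dates, interval_days=1):
    """Report missing, duplicate, and out-of-order dates for one reel.

    issue_dates: dates in the order the issues appear on the reel.
    interval_days: assumed gap between issues (1 = daily, 7 = weekly).
    """
    counts = Counter(issue_dates)
    duplicates = sorted(d for d, seen in counts.items() if seen > 1)
    out_of_order = list(issue_dates) != sorted(issue_dates)

    missing = []
    if counts:
        current, last = min(counts), max(counts)
        while current <= last:
            if current not in counts:
                missing.append(current)
            current += timedelta(days=interval_days)

    return {
        "missing": missing,
        "duplicates": duplicates,
        "out_of_order": out_of_order,
    }


if __name__ == "__main__":
    # Hypothetical reel: 3 January is absent and 2 January was filmed twice.
    reel = [date(1900, 1, 1), date(1900, 1, 2), date(1900, 1, 2), date(1900, 1, 4)]
    print(collate(reel))
```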
iArchives Digitization Workflow [diagram]
- Stages shown: Film Scanning; Split, De-Skew, Crop; Image Processing; Image Metadata; Page/Reel Metadata; OCR Framework; Post Process; Customer Deliverables
- Supporting components: Workflow Manager DB, Shared Storage (NAS), Automated Processing Cloud
- Key: automatic process [image processing, OCR, …]; manual process [image + page metadata]; Quality Control (QC) checkpoints between stages
Scan QC
Split, Crop & DeSkew
iArchives OWR Framework [diagram]
- 3 leading OCR software programs
- 2,000,000-word dictionary
- 2,000,000-name dictionary
How does OWR™ work? [diagram]
- Three OCR engines read the same word image, each proposing a word with a predicted accuracy: apple (99%), epple (73%), opple (88%)
- The dictionary choice, "apple", becomes the output text
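The apple/epple/opple example suggests a voting step in which candidate words from several OCR engines are checked against a dictionary. The Python sketch below shows one plausible way such a step could work. It is an assumption made for illustration, not iArchives' actual OWR algorithm, which the slides do not describe in detail; the function name `choose_word` and the fallback rule are hypothetical.

```python
def choose_word(candidates, dictionary):
    """Pick one word from several OCR engine outputs.

    candidates: list of (word, predicted_accuracy) pairs, one per engine.
    dictionary: set of known lowercase words.

    Assumed rule: prefer candidates found in the dictionary; among the
    preferred pool (or all candidates if none is in the dictionary),
    take the one with the highest predicted accuracy.
    """
    in_dictionary = [c for c in candidates if c[0].lower() in dictionary]
    pool = in_dictionary or candidates
    best_word, _ = max(pool, key=lambda candidate: candidate[1])
    return best_word


if __name__ == "__main__":
    engine_outputs = [("apple", 0.99), ("epple", 0.73), ("opple", 0.88)]
    print(choose_word(engine_outputs, {"apple", "harbor", "sugar"}))
```

A name dictionary would matter for newspapers in the same way: personal and place names are common in the text and absent from ordinary word lists.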
Post-vendor validation
- Once the hard drive is returned, we verify/validate the batch using the DVV program
- Verification compares the metadata listed in the master XML file to the metadata found in the issue XML files for correctness
- Validation is done if a new master XML file needs to be created; it creates checksums for each file and records them in the subsequent metadata
- Copy the contents of the hard drive onto our server
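The checksum idea behind validation can be shown in a few lines. The Python sketch below computes a checksum for each file in a batch and compares it with a recorded value, which is also a useful check after copying a hard drive onto a server. It does not reproduce what the DVV does internally: the SHA-256 algorithm, the function names, and the dictionary of recorded checksums are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large page images are never loaded whole."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(batch_dir, recorded):
    """Compare files under batch_dir with recorded checksums.

    recorded: dict mapping a relative file name to its expected checksum.
    Returns a list of (file name, problem) pairs; empty means all match.
    """
    mismatches = []
    for relative_name, expected in recorded.items():
        path = Path(batch_dir) / relative_name
        if not path.is_file():
            mismatches.append((relative_name, "missing"))
        elif file_checksum(path) != expected:
            mismatches.append((relative_name, "checksum differs"))
    return mismatches
```

In use, `recorded` would be filled from the checksums stored in the batch metadata, and `find_mismatches` would be run once against the vendor's drive and again against the server copy.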
Quality Control
- Image quality: Too dark? Too light? Skewed? Correct image?
- Compare digitized image to microfilmed image
- No Missing Issue/Page tags
- Review metadata: dates, LCCN #, locations
Thumbnail View
- Can use DVV or any graphics program
Quality Control
- LC Digital Viewer and Validator (DVV)
Metadata Viewer
OCR
Headers
Title Essays
- words
- Describes the newspaper’s history:
  - Date of establishment
  - Editors
  - Type of news reported
  - Political viewpoint
  - Where is the paper today?
- Published to Chronicling America
Links
- Chronicling America:
- Library of Congress:
- National Endowment for the Humanities:
- Hawai’i Newspapers: a union list
- Using and to Create XML Standards-based Digital Library Applications ts-mods-morgan-ala07/
Thank You! Mahalo! Kinisou Chapur!
Questions? Comments?