PIALA 2010 UH Manoa Hamilton Library Chronicling America and the National Digital Newspaper Program: Technical Aspects  Part 1: Newspapers and Microfilm.


1 PIALA 2010 UH Manoa Hamilton Library. Chronicling America and the National Digital Newspaper Program: Technical Aspects
- Part 1: Newspapers and Microfilm
  - Challenges
  - USNP
- Part 2: Technical Details
  - Image views
  - Text searching
  - Indexing
- Part 3: Managing a newspaper digitization project

2 Challenges
- Newspapers are a difficult medium
- Never meant to last; made for daily use and disposal
- Pages crumble and acid corrodes the materials
- Tracking serial publications over time
- Patron demand increased, storage space grew scarce, binding costs rose

3 Microfilm
- Adopted in the 1920s as a standard
- Turns newspapers from a storage nightmare into a relatively easy medium to handle
- Libraries had to decide what to do with the hard copy
  - Keep in holdings?
  - Deaccession?

4 United States Newspaper Program (USNP)
- Began in 1982
- Funded by the National Endowment for the Humanities (NEH), managed by the Library of Congress
- For Hawai’i, the University of Hawai’i contributed along with the Hawaiian Historical Society, Hawai’i State Archives, and State Library
- By the mid-2000s, the USNP had received over $54 million in NEH support and approximately $19.6 million in non-federal contributions
- Bibliographic records for over 140,000 newspaper titles; access to 70 million pages of newsprint on microfilm

5 USNP
- Goal: locate, catalog, and microfilm newspapers
- Hawai’i microfilmed 260,000 pages and cataloged 476 titles
- Program ended in 2007

6 USNP Preservation Microfilming Guidelines
- Optimum legibility: image orientation and reduction ratios are chosen to fill the frame and obtain the greatest degree of legibility in public use copies
- Quality: each roll of first-generation film shall be inspected frame by frame by both the filming agency and the project for density and resolution, and to determine that the film is free of emulsion scratches, abrasions, fingerprints, spots, fog, and other defects
- Source: http://www.loc.gov/preserv/usnpguidelines.html

7 USNP Preservation Microfilming Guidelines: Density
- No fewer than five readings at the start, middle, and end of each reel, taken with a transmission densitometer calibrated daily
- Maximum density (Dmax) measurements taken on an exposed image area with no words or graphics
- Background densities no lower than 0.80 and no higher than 1.20; lower densities are preferred for older pages and to facilitate production of reader-printer and enlargement prints
- Base-plus-fog density (Dmin) on the master negative shall not exceed 0.10
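
The density limits on this slide lend themselves to an automated check of the readings recorded for each reel. The following is a minimal sketch, not part of the USNP guidelines themselves: the function name and constants are illustrative, and it assumes the "five readings" rule is applied as a minimum count per reel.

```python
# Limits quoted on the slide; the constant names are illustrative only.
BACKGROUND_MIN = 0.80
BACKGROUND_MAX = 1.20
DMIN_MAX = 0.10
MIN_READINGS = 5  # assumption: at least five background readings per reel


def check_reel_density(background_readings, dmin):
    """Return a list of problems for one reel (an empty list means it passes)."""
    problems = []
    if len(background_readings) < MIN_READINGS:
        problems.append(
            f"only {len(background_readings)} readings; "
            f"at least {MIN_READINGS} required"
        )
    for number, value in enumerate(background_readings, start=1):
        if not BACKGROUND_MIN <= value <= BACKGROUND_MAX:
            problems.append(f"reading {number} ({value:.2f}) outside 0.80-1.20")
    if dmin > DMIN_MAX:
        problems.append(f"Dmin {dmin:.2f} exceeds {DMIN_MAX:.2f}")
    return problems
```

Calling `check_reel_density([0.85, 0.90, 1.00, 1.10, 1.20], 0.08)` returns an empty list, so a spreadsheet of readings could be screened reel by reel before film is sent for scanning.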

8 National Endowment for the Humanities and Library of Congress created the National Digital Newspaper Program (NDNP)
- No single US collection of newspapers
- Every institution focuses on particular themes relating to its collecting plans
- Thousands of volumes of newspapers spread across the country
- Enhance access to newspapers, building on the foundation of the United States Newspaper Program

9 NDNP Overview
- Two-year awards to state projects, renewable
- Digitize 100,000 pages of microfilmed newspapers
- Newspapers selected must date from between 1836 and 1922
- Historical essays on each newspaper
- Collation and quality control on all papers

10 NDNP Goals
- 20-year span with phased, sustainable development of a 30-million-page database
- Establish technical conversion specifications and practices for efficient basic discovery and access
- Develop production tools to ensure good digital objects that can be managed and preserved long-term
- Provide public access to, and take preservation responsibility for, the digitized newspapers
- Create a national resource of historically significant newspapers from all the states and U.S. territories

11 NDNP Microfilm-related Challenges
- Where are the master reels?
- Copyright issues (who filmed the newspapers and owns the master microfilm?)
- Technical specifications (poorly filmed reels, low density readings, etc.)
- Microfilm standards applied vary widely

12 No universally accepted metadata standard for historical newspapers
- Online historical newspapers produced by the public or private sector existed as discrete systems, with metadata structures not designed for interoperability
- Titles, issues, pages, and reels all need to be represented as different yet related classes of objects
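
One way to picture "different yet related classes of objects" is as a small set of linked record types. The sketch below is an illustration only, not the NDNP data model: the class and field names are assumptions, and the LCCN and reel number are placeholders.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Reel:
    reel_number: str                 # identifier of the microfilm reel


@dataclass
class Page:
    sequence: int                    # position of the page within its issue
    reel: Reel                       # reel the page image was scanned from


@dataclass
class Issue:
    issue_date: date
    edition: int = 1
    pages: List[Page] = field(default_factory=list)


@dataclass
class Title:
    lccn: str                        # Library of Congress Control Number
    name: str
    issues: List[Issue] = field(default_factory=list)

    def pages_on_reel(self, reel_number: str) -> List[Page]:
        """Collect every page of this title that was scanned from one reel."""
        return [
            page
            for issue in self.issues
            for page in issue.pages
            if page.reel.reel_number == reel_number
        ]


# Placeholder values for illustration only.
reel = Reel("reel-0001")
title = Title(lccn="sn00000000", name="Example Gazette")
title.issues.append(Issue(date(1900, 1, 1), pages=[Page(1, reel), Page(2, reel)]))
```

Here a page belongs to an issue and also points at the reel it came from, so the same page can be reached either through the title's publication history or through the physical microfilm.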

13 NDNP Digital Deliverables
- Images scanned at 300-400 dpi
- Three formats:
  - Grayscale, uncompressed TIFF 6.0 images
  - Compressed JPEG 2000 images
  - PDF image with hidden text
- Accompanying structural and technical metadata
- OCR text for all pages
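
Because every page must arrive in all three image formats plus OCR text, a simple completeness check on a vendor delivery is easy to script. This is a sketch under assumptions: the file extensions (.tif, .jp2, .pdf, .xml) and the one-base-name-per-page convention are illustrative and are not stated on the slide.

```python
from collections import defaultdict
from pathlib import PurePath

# Assumed extensions for TIFF, JPEG 2000, PDF, and OCR text files.
REQUIRED_EXTENSIONS = {".tif", ".jp2", ".pdf", ".xml"}


def missing_deliverables(filenames):
    """Map each page base name to the required extensions it lacks."""
    found = defaultdict(set)
    for name in filenames:
        path = PurePath(name)
        found[path.stem].add(path.suffix.lower())
    return {
        page: sorted(REQUIRED_EXTENSIONS - extensions)
        for page, extensions in found.items()
        if REQUIRED_EXTENSIONS - extensions
    }
```

Pages with a complete set of files are left out of the result, so an empty dictionary means nothing is missing.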

14 NDNP Scanning Specifications
- De-skew images with a skew greater than 3 degrees
- Crop to visible edge of page
- Capture grayscale preservation microfilm targets

15 NDNP OCR Specifications
- Conform to the ALTO XML schema
  - ALTO (Analyzed Layout and Text Object) is an XML (Extensible Markup Language) schema that details technical metadata for describing the layout and content of physical text resources
- Bounding box coordinate data
  - Each column is sectioned, and coordinates are used to place words
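
To show what bounding box coordinate data looks like in practice, here is a small sketch that reads word positions out of an ALTO fragment with Python's standard library. The sample document is hand-written and simplified (it is not schema-validated and not taken from an NDNP batch), and `word_boxes` is a hypothetical helper name. The `String` element and its `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, and `HEIGHT` attributes are part of ALTO.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written ALTO fragment for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="6000" HEIGHT="8000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Honolulu" HPOS="120" VPOS="340" WIDTH="410" HEIGHT="60"/>
            <SP/>
            <String CONTENT="harbor" HPOS="560" VPOS="342" WIDTH="300" HEIGHT="58"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def word_boxes(alto_xml):
    """Return (word, hpos, vpos, width, height) for every String element."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for element in root.iter():
        # Compare local names so the code works with any ALTO namespace version.
        if element.tag.rsplit("}", 1)[-1] == "String":
            boxes.append((
                element.get("CONTENT"),
                float(element.get("HPOS")),
                float(element.get("VPOS")),
                float(element.get("WIDTH")),
                float(element.get("HEIGHT")),
            ))
    return boxes
```

A viewer can use these coordinates to highlight a search hit on the page image, which is why each word carries its own box.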

16 NDNP Metadata Requirements
- METS (Metadata Encoding and Transmission Standard) format records preservation metadata
- Structural metadata to relate pages to title, date, and edition; to sequence pages within an issue or section; and to identify image and OCR files
- Technical metadata to support the functions of the Library of Congress repository
- (Metadata is information about information)

17 XML Rules
- Single, unique root element
- Matching open/close tags
- Consistent capitalization (element names are case-sensitive)
- Correctly nested elements (no overlapping elements)
- Attribute values enclosed in quotes
- No repeating attributes in an element
- Provides an international, vendor-independent standard for describing information
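
These rules are what an XML parser enforces as well-formedness, so each one can be demonstrated by feeding a parser a small violation. The sketch below assumes Python's standard `xml.etree.ElementTree` parser as the checker (the slides do not name a tool), and `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# One example per rule on the slide; each string on the right breaks one rule.
examples = {
    "single root element": "<issue/><issue/>",
    "matching open/close tags": "<issue><page></issue>",
    "consistent capitalization": "<issue></Issue>",
    "correct nesting": "<issue><page></issue></page>",
    "quoted attribute values": "<page number=1/>",
    "no repeated attributes": '<page number="1" number="2"/>',
}

for rule, broken in examples.items():
    print(f"{rule}: well-formed = {is_well_formed(broken)}")
```

Well-formedness is only the first gate. Checking a record against a schema such as METS or ALTO is a separate validation step that this sketch does not cover.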

18 Family of XML data standards includes:
- METS – Metadata Encoding and Transmission Standard
- MODS – Metadata Object Description Schema
- PREMIS – PREservation Metadata: Implementation Strategies
- EAD – Encoded Archival Description

19 METS (Metadata Encoding and Transmission Standard)
- XML Schema for the purpose of creating XML files that define:
  - the hierarchical structure of digital library objects (images, text files, etc.)
  - the names and locations of the files
  - the associated metadata (e.g., MODS)

20 Metadata Object Description Schema (MODS)
- An XML Schema designed for expressing bibliographic data
- (Think of it as an alternative to the MARC format)

21 Sections of a METS file
- METS header (document talks about itself)
- Descriptive metadata (MODS, etc.)
- Administrative metadata (copyright info., etc.)
- File section (names and locations of files)
- Structural map (relationships of the parts)
- Linking information
- Binding executables/actions to object
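
The seven sections listed here correspond to the top-level elements of a METS document. The following sketch builds an empty skeleton with Python's standard library to show how they sit under one root. It is deliberately incomplete: a schema-valid METS file needs required attributes and child elements that are omitted here, and `build_mets_skeleton` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# METS element names, in the order the slide lists the sections.
SECTIONS = [
    "metsHdr",      # METS header
    "dmdSec",       # descriptive metadata
    "amdSec",       # administrative metadata
    "fileSec",      # file section
    "structMap",    # structural map
    "structLink",   # linking information
    "behaviorSec",  # binding executables/actions to the object
]


def build_mets_skeleton():
    """Build an empty, not schema-valid METS outline with one element per section."""
    ET.register_namespace("mets", METS_NS)
    root = ET.Element(f"{{{METS_NS}}}mets")
    for name in SECTIONS:
        ET.SubElement(root, f"{{{METS_NS}}}{name}")
    return root


if __name__ == "__main__":
    print(ET.tostring(build_mets_skeleton(), encoding="unicode"))
```

In a real record the descriptive section would wrap a MODS or MARC XML record, and the structural map would hold the nested divisions that tie pages to files.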

22 Title METS
- Combines bibliographic and holdings data in a single title record, converted from MARC to MARC XML format
- Digitized titles will have additional data (descriptive essays, more precise geographic coverage data), which is put in a Metadata Object Description Schema (MODS) object within the larger METS document

23 Issue and Reel METS
- Issue METS
  - Issue data
  - Page data
- Reel METS
  - Reel data
  - Target data

24 WHY?
- XML structure is used by software to create multiple outputs: HTML/XHTML for Web display; PDF for printing
- Ease of editing (single records or batches of records)
- Ability to validate data
- Ease of data management and publishing
- Interoperability: repository submission and OAI harvesting

25 All that coding pays off for the user when SEARCHING
- Geographic metadata
- Title metadata
- Date metadata

26 Keyword searching
- OCR/OWR does not yield article “transcriptions”; text OCR’d from images of newspapers is used for searching purposes
- Several options:
  - ANY of the words
  - ALL of the words
  - EXACT PHRASE
  - Proximity search: look for words within 5, 10, 50, or 100 words of one another
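
The four search options differ only in how the query words are matched against the OCR text of a page. The sketch below illustrates that logic on a plain string. It is not how Chronicling America implements search (the slides do not describe the search engine), and the function names and tokenizer are assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def match_any(text, words):
    """ANY of the words: at least one query word appears."""
    tokens = set(tokenize(text))
    return any(word.lower() in tokens for word in words)


def match_all(text, words):
    """ALL of the words: every query word appears, in any order."""
    tokens = set(tokenize(text))
    return all(word.lower() in tokens for word in words)


def match_phrase(text, phrase):
    """EXACT PHRASE: the query words appear consecutively and in order."""
    tokens = tokenize(text)
    target = tokenize(phrase)
    size = len(target)
    return size > 0 and any(
        tokens[start:start + size] == target
        for start in range(len(tokens) - size + 1)
    )


def match_proximity(text, word_a, word_b, distance):
    """Proximity: the two words occur within `distance` words of one another."""
    tokens = tokenize(text)
    positions_a = [i for i, token in enumerate(tokens) if token == word_a.lower()]
    positions_b = [i for i, token in enumerate(tokens) if token == word_b.lower()]
    return any(abs(a - b) <= distance for a in positions_a for b in positions_b)
```

Because the text comes from OCR rather than transcription, a misread word simply fails to match, which is one reason proximity and ANY searches can find pages that an exact phrase search misses.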

27 Page thumbnail view
- Click on the thumbnail or description of a page to view a larger version

28 Page view
- A different format can be selected with one click

29 Browse Issues
- A calendar view indicates which issues have been digitized
- Can change which year you’re viewing
- Browse First Pages

30 From Microfilm to Digital Images: Managing a Newspaper Conversion Project
- Project Management

31 NDNP & University of Hawai’i
- UH’s first grant began in July 2008 and ran until June 2010
- Grant renewed: July 2010-June 2012
- Utilizing the microfilm created under the USNP
  - Excellent quality microfilm (in theory)
  - Fewer problems with cataloging/description and with acquiring 2N (second-generation negative) duplicates (in theory)

32 Project Management
- Request for Proposals (RFP)
  - Include all LC technical specifications
- Position description(s)
  - Coordinator, students
- Hiring and training

33 Project components
- Microfilm identification and duplication
- Digitization
- Metadata creation and validation

34 Microfilm selection
- Choose what is important to your institution(s), if possible
- Copyright
  - Reels created by or for your institution
  - For reels by ProQuest, etc., you may have to ask for permission and pay much higher duplication fees
- Decide: complete runs of a few titles, or many short/incomplete runs of a lot of titles

35 Vendors
- iArchives
  - Leaders in the field
  - Lots of experience
- OCLC/BSLW (Backstage Library Works)
- Apex CoVantage
- Northern Micrographics (NMT)
- Local or national microfilm duplication companies

36 Equipment
- Ten 500 GB external hard drives (Western Digital My Book) and Pelican cases
- One PC with dual monitors
- Software: Library of Congress’ Digital Viewer and Validator (DVV)
- Densitometer
- Microfilm reader/scanner

37 Our Stuff
[Photos: densitometer; Pelican cases; microfilm scanner; PC with 2 monitors and portable hard drives (red)]

38 Staffing
- Project Coordinator
- Quality Control Technician
- Graduate students
- Advisory Board
- Subject/history/newspaper specialists

39 Metadata Collection
- Density readings
- Recorded onto a spreadsheet

40 Preparing the Microfilm: Metadata
- Data from the OCLC MARC record and local holdings

41 Preparing the Microfilm: Collation
- Review use copy of reel for:
  - Missing issues or pages
  - Duplicate issues or pages
  - Mutilated pages
  - Other abnormalities (e.g., pages out of order, incorrect dates)
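
Once the issue dates seen on a reel have been recorded, the missing and duplicate issues in the list above can be found mechanically. The sketch below is an illustration, not the project's actual procedure: the function name is hypothetical, and it assumes the publishing days of the week are known in advance (every day, by default).

```python
from collections import Counter
from datetime import date, timedelta


def collate(issue_dates, start, end, publishing_weekdays=frozenset(range(7))):
    """Return (missing, duplicates) for issues recorded between start and end.

    publishing_weekdays uses date.weekday() numbering (Monday is 0).
    """
    counts = Counter(issue_dates)
    duplicates = sorted(day for day, seen in counts.items() if seen > 1)
    missing = []
    day = start
    while day <= end:
        if day.weekday() in publishing_weekdays and day not in counts:
            missing.append(day)
        day += timedelta(days=1)
    return missing, duplicates
```

For a weekly paper published on Saturdays, passing `publishing_weekdays={5}` limits the expected dates to Saturdays. Mutilated pages and pages out of order still need a person looking at the film.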

42 Preparing the Microfilm: Collation
- Review use copy, record data on spreadsheet

43 iArchives Digitization Workflow
[Diagram. Boxes shown: Film Scanning; Split, De-Skew, Crop; Page/Reel Metadata; Image Metadata; Image Processing; OCR Framework; Post Process; Customer Deliverables. Supporting components: Workflow Manager DB; Shared Storage (NAS); Automated Processing Cloud. Key: automatic process (image processing, OCR, ...); manual process (image + page metadata); Quality Control (QC), marked at several points in the workflow.]

44 Scan QC

45 Split, Crop & DeSkew

46 iArchives OWR Framework
[Diagram: a 2,000,000-word dictionary, a 2,000,000-name dictionary, and 3 leading OCR software programs feed into OWR.]

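47 How does OWR™ work?
[Diagram: a word in the text image is read by OCR Engine 1, OCR Engine 2, and OCR Engine 3. Each candidate word carries a predicted accuracy: apple (99%), epple (73%), opple (88%). The output is "apple", labeled as the dictionary choice.]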
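
The diagram suggests a vote among several OCR engines that is checked against a dictionary. The vendor's actual method is not described in the slides, so the following is only a guess at the general idea: prefer candidates found in the dictionary, then take the one with the highest predicted accuracy. The function name and the fallback rule are assumptions.

```python
def choose_word(candidates, dictionary):
    """Pick one word from (word, predicted_accuracy) pairs, one per OCR engine.

    Candidates found in the dictionary are preferred. If none are found,
    fall back to the candidate with the highest predicted accuracy.
    """
    in_dictionary = [pair for pair in candidates if pair[0].lower() in dictionary]
    pool = in_dictionary or candidates
    return max(pool, key=lambda pair: pair[1])[0]


# The example from the slide, with a one-word dictionary for illustration.
readings = [("apple", 0.99), ("epple", 0.73), ("opple", 0.88)]
print(choose_word(readings, {"apple"}))
```

With the slide's numbers the dictionary word also has the highest score, so both rules agree. The dictionary matters most when a misreading scores higher than the correct word.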

48 Post-vendor validation
- Once the hard drive is returned, we verify/validate the batch using the DVV program
- Verification compares the metadata listed in the master XML file to the metadata found in the issue XML files for correctness
- Validation is done if a new master XML file needs to be created; it creates checksums for each file and records them in the resulting metadata
- Copy the contents of the hard drive onto our server
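
The checksum step can be illustrated without the DVV itself. The sketch below computes a checksum per file and compares a batch against a recorded list. It is not the DVV's implementation: the function names are hypothetical, and SHA-1 is an assumed default because the slides do not say which algorithm is used.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha1"):
    """Compute a hex digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_batch(batch_dir, expected):
    """Compare files under batch_dir with a {relative_path: checksum} mapping.

    Returns a list of (relative_path, problem) tuples; empty means all match.
    """
    problems = []
    for relative_path, checksum in expected.items():
        path = Path(batch_dir) / relative_path
        if not path.is_file():
            problems.append((relative_path, "missing"))
        elif file_checksum(path) != checksum:
            problems.append((relative_path, "checksum mismatch"))
    return problems
```

Running the same comparison again after copying the drive to the server confirms that no file was damaged in transit.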

49 Quality Control
- Image quality
  - Too dark? Too light? Skewed?
- Correct image?
  - Compare digitized image to microfilmed image
  - No Missing Issue/Page tags
- Review metadata
  - Dates
  - LCCN #
  - Locations

50 Thumbnail View
- Can use DVV or any graphics program

51 Quality Control: LC Digital Viewer and Validator (DVV)

52 Metadata Viewer

53 OCR

54 Headers

55 Title Essays - 500 words
- Describes the newspaper’s history
  - Date of establishment
  - Editors
  - Type of news reported
  - Political viewpoint
  - Where is the paper today?
- Published to Chronicling America

56 Links
- Chronicling America: http://chroniclingamerica.loc.gov/
- Library of Congress: http://www.loc.gov/ndnp/
- National Endowment for the Humanities: http://www.neh.gov/projects/ndnp.html
- Hawai’i Newspapers: a union list: http://evols.library.manoa.hawaii.edu/handle/10524/2089
- Using METS and MODS to Create XML Standards-based Digital Library Applications: http://www.loc.gov/standards/mods/presentations/mets-mods-morgan-ala07/

57 Thank You! Mahalo! Kinisou Chapur!
- Questions? Comments?
- Email us at:
  - chantiny@hawaii.edu
  - erenst@hawaii.edu
- https://sites.google.com/a/hawaii.edu/ndnp-hawaii/

