1
These ain’t “Old News”! Creating access to historic newspapers
Christine Guenther, OCLC Product Manager, Digital Services
Preservation Service Centers, Bethlehem, PA
CALIFA – January 9, 2009
2
Objective
Learn about the workflow that turns historical newspapers into a searchable online collection, starting from preservation microfilm or original paper.
Prepare for the key decisions that lead to success and help you define your vision and expectations.
3
Outline
Scanning workflow
Metadata decisions
Access options
Your digital newspaper project
4
Getting Started
Develop a vision & plan (goal, scope, budget/funding, stakeholders…)
Select content (titles, date range, page count, quality/completeness, copyright, …)
Select format to digitize: film or original?
Assess film quality (imaging, collation, film generation)
No film available? Consider analog preservation as part of the digital project
5
Film generations
Archive Master: 1st generation
Print Master: 2nd generation
Service Copies: 3rd generation
Best choice
6
Example: Heavily scratched Service Copy
Image callouts: line that separates columns; heavy scratches
7
Example: Uneven lighting – lost text
8
Acetate or Polyester?
Acetate: not stable for long-term preservation! (Caring for your film may be a byproduct of a digital project.)
Image labels: Polyester; Acetate: light blocked
9
Scanning from Paper
Key points:
Bound/disbound
Collation?!
Conservation?
Cost of imaging
Color!
10
The Roadmap
IMAGE QUALITY
METADATA QUALITY
ACCESS QUALITY
Selection
11
Section 1: Digitization
Digitization options are relatively simple (1-bit vs. 8-bit, film vs. original)
Recommended: 300-400 ppi
Best-quality digital image: the master file is typically a TIFF file.
12
Section 1: Digitization
1-bit (= bitonal)
8-bit (= grayscale)
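To get a feel for what the bit-depth and resolution choices mean for storage, here is a rough sketch of the arithmetic. The page size of 15 x 22 inches is a hypothetical value chosen for illustration, not a figure from the presentation, and the function name is made up for this example. The numbers are for uncompressed pixel data only; real TIFF masters carry header overhead and are often compressed.

```python
def uncompressed_bytes(width_in, height_in, ppi, bits_per_pixel):
    """Approximate size of an uncompressed raster scan, ignoring file headers."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    # Each row is padded to a whole number of bytes, which matters for 1-bit images.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Hypothetical page size in inches (an assumption); measure your own titles.
PAGE_W_IN, PAGE_H_IN = 15, 22

for ppi in (300, 400):
    for bits, label in ((1, "bitonal"), (8, "grayscale")):
        size_mb = uncompressed_bytes(PAGE_W_IN, PAGE_H_IN, ppi, bits) / 1_000_000
        print(f"{ppi} ppi, {bits}-bit ({label}): {size_mb:.1f} MB per page")
```

Multiplying the per-page figure by the page count of the selected titles gives a first, rough estimate for storage planning. Grayscale needs roughly eight times the raw data of bitonal at the same resolution.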
13
Section 2: Content Conversion
Content conversion is a major intersection, and it is tied to your vision for access (the presentation system).
Determine what digital building blocks are needed for the planned presentation system:
METADATA CREATION/COLLECTION (incl. text recognition - OCR)
JPEG/JPEG2000
XML (METS/ALTO or other)
PDF
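As an illustration of what the XML building block carries, the sketch below reads words and their page coordinates from an ALTO-style fragment. The sample is deliberately simplified and is an assumption for this example: real ALTO files declare an XML namespace and contain many more elements and attributes, so production code would need namespace-aware lookups. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free ALTO-style fragment (an assumption for illustration).
SAMPLE_ALTO = """
<alto>
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine>
            <String CONTENT="Ambler" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="20"/>
            <SP/>
            <String CONTENT="Gazette" HPOS="190" VPOS="200" WIDTH="95" HEIGHT="20"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def words_with_boxes(xml_text):
    """Return (word, x, y, width, height) for every String element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        (
            s.get("CONTENT"),
            int(s.get("HPOS")),
            int(s.get("VPOS")),
            int(s.get("WIDTH")),
            int(s.get("HEIGHT")),
        )
        for s in root.iter("String")
    ]


def page_text(xml_text):
    """Join the recognized words into plain text suitable for a search index."""
    return " ".join(word for word, *_ in words_with_boxes(xml_text))


print(page_text(SAMPLE_ALTO))
for entry in words_with_boxes(SAMPLE_ALTO):
    print(entry)
```

Keeping the coordinates alongside the text is what lets a presentation system highlight search hits on the page image rather than only returning the page.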
14
OCR - Optical Character Recognition
Simple OCR (uncorrected) vs. enhancements (headline/byline correction, article classification, text correction)
15
OCR – the rocky road to “99%” (?)
Input: “photo” of the page
Zoning: columns & reading order
Recognition: analyze characters/words
All-caps fonts (major headlines) yield low accuracy
OCR is a cost-effective tool to gain “full-text” searchability.
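One way to make an accuracy figure such as “99%” concrete is to compare OCR output against a hand-keyed transcription of the same text. The sketch below uses edit distance to compute a character accuracy, and then shows why a high character accuracy still leaves many words wrong. It is an illustration only: the function names are hypothetical, the sample strings are invented, and the word-level estimate assumes character errors are independent, which is a simplification.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_accuracy(truth, ocr):
    """Share of ground-truth characters recovered, floored at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1 - edit_distance(truth, ocr) / len(truth))


def expected_word_accuracy(char_accuracy, word_length):
    """Chance that every character of a word is right, assuming independent errors."""
    return char_accuracy ** word_length


# Invented example: one misread character in an eight-character headline word.
print(character_accuracy("HEADLINE", "HEADL1NE"))
# With 99% character accuracy, a five-letter word is fully correct about 95% of the time.
print(round(expected_word_accuracy(0.99, 5), 3))
```

Because full-text search matches whole words, the word-level figure is usually the one that users experience, which is part of why the road to “99%” is rocky.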
16
Main Choices for Content Conversion
Image-only approach (aka digital microfilm)
vs. PDF-based
vs. integrated model, where page images and metadata are integrated via a presentation system
17
PDF based presentation
PRO: Common format; OCR; Multi-page; Free Reader; Printing
CON: Slow; Not suitable for 8-bit; Secondary searches; Not scalable; Hidden searchable text
18
Integrated Presentation: Page Level Access
Example: CONTENTdm
FEATURES:
Bitonal or gray
Search across collections
Primary hits in JPEG 2000
Clipping tool
Rich metadata, not only from OCR, but also Dublin Core
19
Integrated Presentation: Article Level
With article segmentation
20
Section 3: Presentation
Digital newspaper collections go live!
Page Level Access in CONTENTdm:
AccessPA group license: www.accesspadigital.org
Lycoming College, PA
Wissahickon Valley PL, PA – Ambler Gazette
Article Level Access in CONTENTdm:
Seattle Spectator
21
Outlook – The Challenges
Analog preservation (film) vs. electronic preservation: file sizes and storage costs; scanning with digital preservation in mind creates loads of data
“If you give the mouse a cookie….” (aka setting expectations)
Regaining full-text logic from a photograph of a page
Newspapers are oversize and in portrait format; the screen is landscape. Zooming improves legibility, but then the full page is not visible at the same time.
Access without a digital asset management (DAM) system is not practical, but such a system has associated costs
22
Resources
National Digital Newspaper Program (NDNP), a partnership of the Library of Congress and the National Endowment for the Humanities
http://www.loc.gov/ndnp/
23
Questions?
Today: Break-out sessions
“Tomorrow”: Contact Christine Guenther, guenthec@oclc.org
OCLC Preservation Service Center, Bethlehem, PA
1-800-773 7222
Thank you!