Digital Reformatting of Text Aaron Choate Digital Library Production Services The University of Texas Libraries.

Presentation transcript:

Digital Reformatting of Text Aaron Choate Digital Library Production Services The University of Texas Libraries

From last time: Calculating potential file size (no really… this time we got it!) file size = (height x width x bit-depth x dpi^2) / 8 bits per byte
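A minimal sketch of the file-size formula above, for checking the arithmetic. It assumes height and width are given in inches (the slide does not state units) and that the result is the uncompressed size in bytes; the function name is only illustrative.

```python
def file_size_bytes(height_in, width_in, bit_depth, dpi):
    """Uncompressed size in bytes: (height x width x bit-depth x dpi^2) / 8."""
    # Assumption: height and width are in inches, so each dimension
    # contributes (inches * dpi) pixels.
    total_bits = height_in * width_in * bit_depth * dpi ** 2
    return total_bits / 8  # 8 bits per byte


# An 8.5 x 11 inch page at 300 dpi, 8-bit grayscale
grayscale = file_size_bytes(11, 8.5, 8, 300)
# The same page at 600 dpi, 1-bit bitonal
bitonal = file_size_bytes(11, 8.5, 1, 600)

print(f"grayscale: {grayscale / 1_000_000:.1f} MB")
print(f"bitonal:   {bitonal / 1_000_000:.1f} MB")
```

Because dpi is squared, doubling the resolution quadruples the file size, while bit depth scales it only linearly.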

imaging Benchmarking Subjective evaluation becomes more problematic when the goal is legibility rather than fidelity.

imaging Benchmarking Physical Type, size and presentation

imaging Benchmarking Physical condition Darkening pages Fading ink Stains Bleed-through Uneven printing Fold lines Smearing

imaging Benchmarking Document classification Simple text / printed line art Distinct-edge based representation Bitonal? Manuscripts Soft-edge-based Grayscale / color Mixed material

imaging Benchmarking Medium and support Support – (paper, clay tablet, etc.) Thin paper? (bleed through) Medium – (graphite pencil, inks, etc) Fading of ink Variations in color or density

imaging Benchmarking Tonal Representation

imaging Benchmarking Color Appearance Is color reproduction necessary to the document’s meaning? What purpose does the color serve? How important is maintaining the color appearance?

imaging Benchmarking Detail Printed text – Measure the height of the smallest lowercase letter that typifies the item or group of items. Manuscripts, line art – Measure the finest stroke-width that must be represented and characterize the needed level of quality

imaging Benchmarking QI…(Quality Index) Defining detail as character height ANSI/AIIM preservation microfilming standard for determining requirements for text legibility Defines a range from barely legible through excellent that maps to technical test targets

imaging Benchmarking Line pairs Excellent = 8 line pairs Good = 5 line pairs Marginal = 3.6 line pairs Barely legible = 3.0 line pairs

imaging Benchmarking Digital QI Bitonal (black-and-white pixels only): QI = (dpi x 0.039h)/3; h = 3QI/(0.039 x dpi); dpi = 3QI/(0.039h). Tonal images (grayscale for printed text): QI = (dpi x 0.039h)/2; h = 2QI/(0.039 x dpi); dpi = 2QI/(0.039h)
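The Digital QI formulas above can be sketched as two small helpers. This assumes h is the character height in millimetres (0.039 is roughly the number of inches in a millimetre) and uses the divisor 3 for bitonal and 2 for grayscale scans, as on the slide. The function names are illustrative only.

```python
def digital_qi(dpi, h_mm, bitonal=True):
    """Quality Index for a given resolution and character height."""
    divisor = 3 if bitonal else 2  # 3 for bitonal, 2 for grayscale
    return dpi * 0.039 * h_mm / divisor


def required_dpi(qi, h_mm, bitonal=True):
    """Resolution needed to reach a target QI for a given character height."""
    divisor = 3 if bitonal else 2
    return divisor * qi / (0.039 * h_mm)


# Target QI of 8 ("excellent") for 1 mm lowercase letters
print(round(required_dpi(8, 1.0, bitonal=True)))
print(round(required_dpi(8, 1.0, bitonal=False)))
```

For the same character height and target QI, the grayscale case needs two thirds of the bitonal resolution, since only the divisor differs.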

Text Capture Methods Rekeying OCR Accuracy …

Software ScanSoft – OmniPage Pro ABBYY – FineReader Adobe Acrobat … Prime Recognition – PrimeOCR

Encoding

XML vs SGML SGML (Standard Generalized Markup Language) is the grand-daddy of all markup languages. XML is a subset of SGML intended to be the format for use on the Internet. XML attempts to fill the gap between SGML, which can be used for just about anything, and HTML, which is severely limited and is currently being abused because of this (table structures for layout, clear 1-pixel GIFs, etc.).
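As a rough illustration of markup that describes structure rather than layout, the sketch below parses a tiny XML fragment with Python's standard library. The element names and the sample text are invented for this example and are not taken from any DTD or from the presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tags name what each part *is*,
# not how it should look on screen.
SAMPLE = """<letter>
  <salutation>Dear Sir,</salutation>
  <paragraph>I write to report on the survey.</paragraph>
  <signature>J. Smith</signature>
</letter>"""


def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(SAMPLE)
for child in root:
    print(child.tag, "->", child.text)

print(is_well_formed(SAMPLE))
print(is_well_formed("<p>unclosed"))
```

Unlike typical HTML of the period, an XML parser rejects markup with unclosed tags outright, which is why the second check fails.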

xml DTDs vs Schemas

xml TEI Text Encoding Initiative Initially launched in 1987, the TEI is an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent.

xml TEI Levels of encoding Level 1: Fully Automated Conversion and Encoding Level 2: Minimal Encoding Level 3: Simple Analysis Level 4: Basic Content Analysis Level 5: Scholarly Encoding Projects

Character sets Unicode – Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
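A short sketch of what "a unique number for every character" looks like in practice, using Python's standard library. The helper name and the sample characters are chosen only for illustration; the UTF-8 bytes are included to show that the code point and its stored encoding are separate things.

```python
import unicodedata


def describe(ch):
    """Return the code point, official Unicode name, and UTF-8 bytes of one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8")


# One Latin, one Greek, and one Coptic letter
for ch in "Aαϣ":
    print(ch, *describe(ch))
```

The code point is the same on every platform and in every program; only the byte encoding chosen for storage (UTF-8 here) can vary.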

character sets Unicode Greek & Coptic

Software XMetal Oxygen Cooktop

Software MetaE