Delivering textual and visual resources
Overview
- Case studies
- Methods for providing access
- Structures for delivery
  - Full text
  - Marked-up
  - Image and text
  - Indexed
- How-to guidance for:
  - Rekeying
  - OCR
ContentDM
- Off-the-shelf delivery system
- Easy to implement
- Can also be customized to some degree
- Handles images and metadata using Dublin Core
- Has useful features such as lightbox, collections and my favorites
- Proprietary solution but uses strong standards
- Requires no plug-ins
Greenstone
- Open source and open standards
- Developed by New Zealand Digital Library
- Supported by UNESCO and available free from them
- Used by a number of digital library projects
- Strong metadata and image support
- Multilingual
Custom built solution: CVMA
- Corpus Vitrearum Medii Aevi: Medieval Stained Glass in Great Britain
- Technical solution developed by the Centre for Computing in the Humanities, King’s College London
- Database for complex information relationships
- XML for documents and text
- Clickable maps as navigation aid
Getting the text ready - decisions
Choices:
- Full text: every character and word searchable, viewable and reusable in digital form
- Marked-up: as above, but with markup added to enable structured searches and use (e.g. XML, SGML)
- Image and text: an image is all the viewer sees; the text is fully searchable but is not seen or reusable
- Indexed: images/files attached to an index or catalogue
Getting the text ready - costs
- Full text: generally expensive in time and resources, but this depends on the source; for born-digital material it is very cheap
- Marked-up: usually the most expensive, because skilled staff are needed for intellectual content markup; some automated systems exist for format-based markup
- Image and text: comparatively cheap, but with some usability downsides
- Indexed: great if an index or catalogue already exists and files can simply be linked to records (e.g. MARC)
Full text
- Files (e.g. PDF, Word)
- Formatted text (e.g. HTML)
- Fully searchable
- Reusable: copy, edit, share
- Very high accuracy, i.e. 100% expected by the user
- Unstructured searches
- Results can be overwhelming
- Born digital: reformatting for delivery to be considered
Markup
- Advantage of structured search and use
- Complex to create specifications and workflow from scratch
- Delivery requires a description of the codes, rules and documents used
- Most projects will adapt a specification that already exists:
  - TEI: Text Encoding Initiative
  - EAD: Encoded Archival Description
- Some automation is possible, and some system solutions enable this
Markup: examples
Thomas Knight was indicted for the wilful murder of Robert Ball. He stood charged on the Coroner's inquest for manslaughter, September 7.
Michael Ball. The deceased Robert Ball was my son; he was a clock-case maker; the prisoner and he had been fighting some time; he stood up against a wall; I said, Robert, will you fight any more? he said, yes; they fought again. I saw but little of it.
Markup
Two forms commonly used:
- Layout and structure based (format)
Thomas Knight was indicted for the wilful murder of Robert Ball. He stood charged on the Coroner's inquest for manslaughter, September 7.
Michael Ball. The deceased Robert Ball was my son; he was a clock-case maker; the prisoner and he had been fighting some time; he stood up against a wall; I said, Robert, will you fight any more? he said, yes; they fought again. I saw but little of it.
Markup
- Content based (function)
Thomas Knight was indicted for the wilful murder of Robert Ball. He stood charged on the Coroner's inquest for manslaughter, September 7.
Michael Ball. The deceased Robert Ball was my son; he was a clock-case maker; the prisoner and he had been fighting some time; he stood up against a wall; I said, Robert, will you fight any more? he said, yes; they fought again. I saw but little of it.
The two forms can be combined to deliver function and format at the same time.
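The difference between the two forms of markup can be sketched in a few lines of Python using the standard library's xml.etree.ElementTree. This is only an illustration: the tag and attribute names used here (b, i, person, role, offence, occupation) are assumptions made for the example, not the tag set used in the original slides.

```python
import xml.etree.ElementTree as ET

# Format-based markup: tags only say how the text should look.
# (Hypothetical tag names, chosen for illustration.)
FORMAT_XML = """<div>
  <p><b>Thomas Knight</b> was indicted for the wilful murder of Robert Ball.</p>
  <p><i>Michael Ball.</i> The deceased Robert Ball was my son; he was a clock-case maker.</p>
</div>"""

# Content-based markup: tags say what the text is.
# (Hypothetical tag and attribute names, chosen for illustration.)
CONTENT_XML = """<trial>
  <p><person role="defendant">Thomas Knight</person> was indicted for the wilful
  <offence>murder</offence> of <person role="victim">Robert Ball</person>.</p>
  <p><person role="witness">Michael Ball</person>. The deceased Robert Ball was my son;
  he was a <occupation>clock-case maker</occupation>.</p>
</trial>"""


def people_with_role(xml_text, role):
    """Structured search: names of all people tagged with the given role."""
    root = ET.fromstring(xml_text)
    return [p.text for p in root.iter("person") if p.get("role") == role]


def bold_text(xml_text):
    """All a format-based document can tell us: which text is bold."""
    root = ET.fromstring(xml_text)
    return [b.text for b in root.iter("b")]


defendants = people_with_role(CONTENT_XML, "defendant")
occupations = [o.text for o in ET.fromstring(CONTENT_XML).iter("occupation")]
```

With content-based markup a question such as "who was the defendant?" becomes a simple query, whereas the format-based version can only report which words were set in bold or italic.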
Markup languages
- Markup is a language, not a programming tool
- All use tags or elements: software interprets those tags for display purposes and/or for search and retrieval
- Allows users (or communities of users) to create their own tag sets
- Markup can encode both logical and physical features of text
Markup languages
- SGML: Standard Generalised Markup Language (ISO standard in 1986)
  - Parent of both HTML and XML
- HTML: Hypertext Markup Language (first described in 1991)
  - Markup of ‘physical’ features of articles to enable Internet sharing of content; it is about the format of content
- XML: Extensible Markup Language (W3C Recommendation in 1998)
  - "SGML lite", enabling generic Web use of powerful SGML features; it is about the function of content
XML: bits and pieces
- XML content (.xml)
- XML rules (.dtd)
  - Schemas, e.g. TEI, METS
  - DTDs = Document Type Definitions
- Namespaces (used when you want to combine sets of rules together in a single document)
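As a rough illustration of namespaces, the sketch below combines two tag sets in one document and reads them back with Python's xml.etree.ElementTree. The Dublin Core namespace URI is the real one; the second namespace, the record structure and the title values are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace
LOCAL = "http://example.org/trial"         # hypothetical project namespace

# Both tag sets define an element called "title"; the prefixes keep them apart.
RECORD = f"""<record xmlns:dc="{DC}" xmlns:t="{LOCAL}">
  <dc:title>Trial of Thomas Knight</dc:title>
  <t:title>Proceedings heading</t:title>
</record>"""

root = ET.fromstring(RECORD)

# ElementTree identifies a namespaced element as "{namespace-uri}localname".
dc_title = root.find(f"{{{DC}}}title").text

# A prefix map can be supplied instead of spelling out the full URI.
local_title = root.find("t:title", {"t": LOCAL}).text
```

Without the namespaces the two title elements would be indistinguishable, which is the problem namespaces solve when sets of rules are combined.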
DTD explained
- A DTD is the formal definition of the elements, structures, and rules for marking up a given type of XML document
- Think of it as an abstraction of the document structure
  - What tags and elements must/can be used
  - How these tags and elements are structured in relation to each other
- Allows Internet browsers and other software to understand how to interpret XML content
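To make the idea concrete, here is a small hypothetical DTD for a trial document, followed by a hand-written Python check that restates the same rule. The element names are assumptions for illustration. Note that Python's standard library parses XML but does not validate against a DTD, so the function below is a simplified stand-in for what a validating parser would do, not real DTD validation.

```python
import xml.etree.ElementTree as ET

# A hypothetical DTD, shown for reference only. It says: a <trial> must contain
# one <defendant>, then one <offence>, then one or more <testimony> elements.
TRIAL_DTD = """
<!ELEMENT trial (defendant, offence, testimony+)>
<!ELEMENT defendant (#PCDATA)>
<!ELEMENT offence (#PCDATA)>
<!ELEMENT testimony (#PCDATA)>
"""


def follows_trial_rules(xml_text):
    """Hand-written check of the <trial> content rule stated in TRIAL_DTD.

    Simplification: only the order of the direct children is checked.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "trial":
        return False
    children = [child.tag for child in root]
    return (
        len(children) >= 3
        and children[0] == "defendant"
        and children[1] == "offence"
        and all(tag == "testimony" for tag in children[2:])
    )


VALID = """<trial>
  <defendant>Thomas Knight</defendant>
  <offence>murder</offence>
  <testimony>The deceased Robert Ball was my son.</testimony>
</trial>"""

# Breaks the rules: <offence> is missing.
INVALID = """<trial>
  <defendant>Thomas Knight</defendant>
  <testimony>The deceased Robert Ball was my son.</testimony>
</trial>"""

valid_ok = follows_trial_rules(VALID)
invalid_ok = follows_trial_rules(INVALID)
```

In practice the rules would live in the .dtd file and a validating parser would apply them, so that every document of this type is guaranteed to have the same structure.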
XML: further bits and pieces
- Entities (.ent)
  - Reusable data inside a DTD or within markup
  - Think of entities as variables that can be used to define common text (e.g. copyright information). You can then use the entity anywhere you would normally use the text.
- Display (.css and .xsl)
  - Cascading Style Sheets (.css)
  - Extensible Stylesheet Language (.xsl)
    - Used for transforming data to another structure
    - Used for formatting objects
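The "entities as variables" idea can be sketched as follows. The entity name and the copyright wording are hypothetical; the entity is declared once in the document's internal DTD subset and then used in two places. Python's xml.etree.ElementTree expands internally declared entities while parsing.

```python
import xml.etree.ElementTree as ET

# The entity "copyright" is declared once and reused twice.
# (Entity name and text are hypothetical, for illustration only.)
DOC = """<?xml version="1.0"?>
<!DOCTYPE page [
  <!ENTITY copyright "Copyright the project team">
]>
<page>
  <footer>&copyright;</footer>
  <note>Images: &copyright;</note>
</page>"""

root = ET.fromstring(DOC)

# After parsing, each entity reference has been replaced by its text.
footer = root.find("footer").text
note = root.find("note").text
```

Changing the wording in the entity declaration changes it everywhere the entity is used, which is the main reason to define common text this way.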
Image and text
- Image delivered and text is fully searchable but not viewable
- Text usually created by uncorrected OCR
- Different ways to do this:
  - Use a PDF document with image and text
  - Deliver an image with text that has been extracted to a searchable database, e.g. JSTOR
  - Deliver an image with text that has very basic markup (possibly just pages defined) and is searched as XML
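A minimal sketch of the image-and-text approach, assuming a simple in-memory list rather than a real database: each page pairs the image the user sees with hidden OCR text used only for searching. The file names are hypothetical, and the first page deliberately contains a typical OCR misreading ("rnurder" for "murder") to show one of the usability downsides of uncorrected text.

```python
from dataclasses import dataclass


@dataclass
class Page:
    image_file: str  # what the user is shown
    ocr_text: str    # hidden, uncorrected OCR text used only for searching


# Hypothetical pages; "rnurder" imitates a common OCR misreading of "murder".
PAGES = [
    Page("page001.tif", "Thomas Knight was indicted for the wilful rnurder of Robert Ball"),
    Page("page002.tif", "he was a clock-case maker; the prisoner and he had been fighting"),
]


def search(pages, term):
    """Return the image files of pages whose hidden text contains the term."""
    term = term.lower()
    return [p.image_file for p in pages if term in p.ocr_text.lower()]


hits_maker = search(PAGES, "clock-case")
hits_knight = search(PAGES, "knight")
hits_murder = search(PAGES, "murder")  # missed because of the OCR error
```

The user only ever receives page images as results, and a search for "murder" finds nothing here because the error in the hidden text is never seen or corrected.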
Indexed
- Basically just linking text or document formats to a subject index or resource catalogue
- Makes sense and is low cost where the index resource already exists
- Not so good if the index/catalogue has to be created, as this part is costly; in that circumstance XML might be better
- Delivered as a link within the index/catalogue that directs the user to the single text/document file
- Often used with MARC records or museum Content Management Systems
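The indexed approach can be sketched as a catalogue in which each record carries a link to its file. The records, field names and file paths below are hypothetical; in a MARC record the link would typically be held in field 856, subfield u.

```python
# Hypothetical catalogue records; only the index is searched, never the files.
CATALOGUE = [
    {
        "id": "rec-001",
        "title": "Proceedings, September sessions",
        "subjects": ["trials", "murder"],
        "file_url": "files/rec-001.pdf",
    },
    {
        "id": "rec-002",
        "title": "Clock-case makers' accounts",
        "subjects": ["trades"],
        "file_url": "files/rec-002.pdf",
    },
]


def find_by_subject(catalogue, subject):
    """Return (title, link) pairs for records indexed under the subject."""
    return [
        (record["title"], record["file_url"])
        for record in catalogue
        if subject in record["subjects"]
    ]


trial_results = find_by_subject(CATALOGUE, "trials")
```

The user searches the index and follows the link to the single file; the quality of retrieval therefore depends entirely on the quality of the existing index or catalogue.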