September 1st, 2010, IGeLU, Ghent
The on-the-fly conversion circus
Matthias Gross (Bavarian State Library)
Introduction
Often the original version of an object is not what the user wants, or not what we want the user to get.
Two basic strategies:
- Store additional versions
- Offer them virtually: create them on the fly
What's common…
Common aspect: something has to be converted.
a) Content, e.g. tiff / PDF → jpg (for viewing), tiff / jpg → PDF (for download)
b) Structural metadata (METS)
c) Bibliographic metadata, e.g. MARC21 → DC
Common side effect of conversion: loss of information, sometimes of functionality. In most cases this is affordable (hi-resolution scans), in some it is even wanted:
- for legal reasons
- to reduce file size and speed up transfer
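To make the "loss of information" point concrete for the MARC21 → DC case, here is a minimal sketch. The mapping table is a small illustrative subset chosen for this example, not an authoritative crosswalk, and the record layout (tag → subfield → value) and the sample values are made up. Anything without a mapping is simply dropped, which is exactly the side effect named above.

```python
# Illustrative subset only (assumption): (MARC tag, subfield) -> DC element.
MARC_TO_DC = {
    ("100", "a"): "creator",
    ("245", "a"): "title",
    ("260", "b"): "publisher",
    ("260", "c"): "date",
    ("650", "a"): "subject",
}


def marc_to_dc(record):
    """Return (dc, dropped): the DC fields and the MARC subfields with no mapping.

    `record` is a hypothetical, simplified structure: tag -> {subfield: value}.
    Real MARC allows repeated fields and subfields; this sketch ignores that.
    """
    dc, dropped = {}, []
    for tag, subfields in record.items():
        for code, value in subfields.items():
            element = MARC_TO_DC.get((tag, code))
            if element is None:
                dropped.append((tag, code))  # information lost in conversion
            else:
                dc.setdefault(element, []).append(value)
    return dc, dropped


# Made-up sample record for illustration.
sample = {
    "100": {"a": "Example, Author"},
    "245": {"a": "A journey", "c": "by Author Example"},
    "260": {"b": "Example Press", "c": "1815"},
    "300": {"a": "2 v."},
}
dc, dropped = marc_to_dc(sample)
```

In this sample the statement of responsibility (245 $c) and the physical description (300 $a) end up in `dropped`: they have no place in the simple target format.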
… and what's different
Benefits of on-the-fly conversion:
- Reduce storage costs
- Keep data structure simple
- Reduce migration costs when specifications of desired target formats change
Price:
- Server load
- Runtime for conversion, which becomes waiting time for the end user; e.g. tiff → jpg: waiting time is usually too long to be acceptable
Some formats are built to be delivered primarily via on-the-fly conversion: j2k (→ jpg)
Normal presentation
First example: serving the DFG viewer
In this example, we have single-page PDFs in a METS structure. Let us look at the content first.
To serve the so-called DFG viewer (which projects funded by the Deutsche Forschungsgemeinschaft, DFG, are obliged to do), three different JPEG versions of each page, each with a given resolution, are needed.
This would not be easy to implement within DigiTool: how would one encode the resolution of a VIEW manifestation so that it can be addressed from outside the system? (Basically, all VIEW manifestations are equal, with an optional VIEW MAIN.)
The conversion is implemented via a special viewer which calls Ghostscript/ImageMagick. Simple caching shortens the waiting time when the user returns to a previously accessed page.
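The convert-on-request-plus-cache idea can be sketched as follows. This is not the viewer described above, only an illustration under assumptions: the function names, the cache layout (one file per PDF path and resolution) and the choice of calling Ghostscript directly are all hypothetical. The Ghostscript switches used are standard ones for rendering to JPEG at a given resolution.

```python
import hashlib
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, jpg_path, dpi):
    """Build a Ghostscript call that renders a PDF page to JPEG at `dpi`."""
    return ["gs", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-sDEVICE=jpeg",
            f"-r{dpi}", f"-sOutputFile={jpg_path}", str(pdf_path)]


def render_with_ghostscript(pdf_path, jpg_path, dpi):
    """Run the conversion; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(ghostscript_command(pdf_path, jpg_path, dpi), check=True)


def cached_jpeg(pdf_path, dpi, cache_dir, render=render_with_ghostscript):
    """Return the JPEG path for (pdf_path, dpi), converting only on a cache miss.

    Hypothetical cache layout: the file name is a hash of source path and
    resolution, so each resolution of each page gets its own cache entry.
    """
    key = hashlib.sha1(f"{pdf_path}|{dpi}".encode()).hexdigest()
    jpg_path = Path(cache_dir) / f"{key}.jpg"
    if not jpg_path.exists():
        Path(cache_dir).mkdir(parents=True, exist_ok=True)
        render(pdf_path, jpg_path, dpi)  # the slow step, paid once per entry
    return jpg_path
```

The sketch never expires or invalidates cache entries; a real deployment would need some cleanup policy and a way to notice that the source page has changed.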
Presentation in the DFG viewer
DFG viewer, continued
Now let us look at the structural metadata (METS).
- The DFG viewer expects a different type of METS: the fileSec has to be changed significantly
- METS (digital entity XML) can be quite large (>3 MB for 1700 pages)
- Conversion takes up to 7 seconds
On-the-fly conversion is therefore not reasonable, so we put the converted METS in the file system; access is via PID (how would you store additional METS files within the DigiTool data model?)
Next example: PDF download
Let us look at the content again. We want to offer a "PDF download" option.
- An additional manifestation (a single PDF file for an IE) would need a significant amount of extra storage
- This is even true for caching these PDF files; besides that, each read/write operation for such big files is expensive
- Best performance comes from using the streams as they are in the repository, embedding them in a PDF and giving that as the HTTP response
- Further optimization: use a fast Perl library and access the Oracle DB directly
For a 350-page book, 1 minute can be reached. Usually the internet connection of the end user is the bottleneck.
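The "embed the streams as they are" idea can be illustrated with a minimal PDF writer. This is a sketch in Python, not the Perl implementation mentioned above, and it rests on assumptions: each page is taken to be an 8-bit RGB JPEG stream whose pixel dimensions are already known (for example from technical metadata), and one pixel is mapped to one PDF point. The JPEG bytes are copied into the PDF unchanged (`/DCTDecode`), so no image is decoded or re-encoded, and the result is built in memory rather than stored as an extra file.

```python
def build_pdf(pages):
    """Assemble a PDF from JPEG pages without re-encoding them.

    pages: list of (jpeg_bytes, width_px, height_px) tuples (assumed input shape).
    Returns the complete PDF as bytes, ready to be sent as a response body.
    """
    objects = []  # object bodies; PDF object number = list index + 1

    def add(body):
        objects.append(body)
        return len(objects)

    add(b"<< /Type /Catalog /Pages 2 0 R >>")  # object 1
    add(b"")                                   # object 2: page tree, filled in below
    kids = []
    for jpeg, w, h in pages:
        # The JPEG stream is embedded byte for byte as an image XObject.
        image = add(
            b"<< /Type /XObject /Subtype /Image /Width %d /Height %d "
            b"/ColorSpace /DeviceRGB /BitsPerComponent 8 /Filter /DCTDecode "
            b"/Length %d >>\nstream\n" % (w, h, len(jpeg)) + jpeg + b"\nendstream"
        )
        # Page content: scale the unit-square image to fill the page.
        draw = b"q %d 0 0 %d 0 0 cm /Im0 Do Q" % (w, h)
        content = add(b"<< /Length %d >>\nstream\n" % len(draw) + draw + b"\nendstream")
        page = add(
            b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %d %d] "
            b"/Resources << /XObject << /Im0 %d 0 R >> >> /Contents %d 0 R >>"
            % (w, h, image, content)
        )
        kids.append(b"%d 0 R" % page)
    objects[1] = b"<< /Type /Pages /Count %d /Kids [%s] >>" % (len(kids), b" ".join(kids))

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_at = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objects) + 1)
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1, xref_at)
    return bytes(out)
```

Because the per-page work is mostly copying bytes, the cost is dominated by reading the streams and sending the result. A production version would more likely write pieces to the response as they are produced instead of holding the whole book in memory.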
PDF download dialogue
PDF download result
Next example: EuropeanaTravel
EuropeanaTravel
The overall objective of this project is to digitise content on the theme of travel and tourism, to be made accessible via Europeana, the European digital library, museum and archive. Launched in November 2008, Europeana provides integrated access to digital treasures from museums, archives, audio-visual archives and libraries of Europe.
EuropeanaTravel officially started on 1st May 2009 and will last for two years.
Contribution of the University Library of Regensburg via DigiTool:
- 400 books
- 200 maps
- 600 illustrations from books (+ detailed metadata)
OAI
Problem with transporting the metadata to Europeana via OAI: Europeana expects extended DC with special europeana: elements.
Yet another replica set? How often should we replicate information with only small differences in format?
Replica sets demand
1.) database space
2.) significant Java resources during replication
Solution: build a set that contains all the information needed for the different demands (granular bibliographic data, delivery URL, thumbnail URL, file type, object type, …).
And then…
… filter the OAI-XML through a stylesheet on the fly!
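A rough illustration of this filtering step follows. Python's standard library has no XSLT processor, so a small rule table applied with ElementTree stands in for the stylesheet here; it is a substitute for illustration, not the technique used in the talk. The layout of the rich source record, the rule table, and the europeana: element names and namespace URI are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ESE = "http://www.europeana.eu/schemas/ese/"  # assumed namespace for europeana: elements
ET.register_namespace("dc", DC)
ET.register_namespace("europeana", ESE)

# Hypothetical rules: element in the rich record -> (namespace, target element).
RULES = [
    ("title", DC, "title"),
    ("creator", DC, "creator"),
    ("deliveryUrl", ESE, "isShownAt"),
    ("thumbnailUrl", ESE, "object"),
    ("objectType", ESE, "type"),
]


def to_europeana(rich_xml):
    """Filter one rich record into the harvester-specific form at request time."""
    source = ET.fromstring(rich_xml)
    target = ET.Element("record")
    for src_name, namespace, dst_name in RULES:
        for node in source.findall(src_name):
            ET.SubElement(target, f"{{{namespace}}}{dst_name}").text = node.text
    return ET.tostring(target, encoding="unicode")


# Made-up record in the hypothetical rich format.
rich = (
    "<record><title>A journey</title><creator>Example, Author</creator>"
    "<deliveryUrl>http://example.org/delivery/1</deliveryUrl>"
    "<thumbnailUrl>http://example.org/thumb/1</thumbnailUrl>"
    "<objectType>TEXT</objectType><fileType>pdf</fileType></record>"
)
europeana_xml = to_europeana(rich)
```

Only one stored set is needed; elements the harvester does not want (here `fileType`) are simply not copied, and a change in the target format means editing the rules, not re-replicating the data.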
OAI for Europeana
Travel …
Facet for EuropeanaTravel
Search for "Reise" (which means "travel" in German)
… and get … link to DigiTool delivery
… back home
Further application of this technique
Just 1 OAI record per object, filtered through a different stylesheet for each purpose:
- OAI for harvester 1: stylesheet 1
- OAI for harvester 2: stylesheet 2
- Report, list of objects (HTML): stylesheet 3
- Report, list of objects (Excel): stylesheet 4
- HTML page for starting services: stylesheet 5
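The one-source, many-outputs pattern can be sketched as a dispatch table in which each entry plays the role of one stylesheet. Everything here is hypothetical: the record fields (`pid`, `title`, `url`), the target names, and the use of CSV as a stand-in for the Excel report.

```python
import csv
import html
import io


def as_html(records):
    """Report as an HTML list of links (role of one stylesheet)."""
    items = "".join(
        f'<li><a href="{html.escape(r["url"])}">{html.escape(r["title"])}</a></li>'
        for r in records
    )
    return f"<ul>{items}</ul>"


def as_csv(records):
    """Report as CSV, which spreadsheet programs can open (stand-in for Excel)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["pid", "title", "url"])
    for r in records:
        writer.writerow([r["pid"], r["title"], r["url"]])
    return buffer.getvalue()


# One entry per output format; adding a format means adding one function.
RENDERERS = {"html": as_html, "csv": as_csv}


def render(records, target):
    """Pick the transformation at request time; the stored records never change."""
    return RENDERERS[target](records)


# Made-up record for illustration.
records = [{"pid": "1", "title": "A & B", "url": "http://example.org/1"}]
html_report = render(records, "html")
csv_report = render(records, "csv")
```

The design choice mirrors the slide: the data is kept once, in its richest form, and every consumer-specific view is a cheap transformation applied on request.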
Example: thumbnail selection service (post-ingest)
Current / planned activities (1)
A secure PDF display option which prohibits download and printout for copyright material (or at least makes that very wearisome, as you can't prevent screen snapshots).
- Assume multi-page PDF as storage format
- A normal PDF viewer (client-side plugin) has too many options which can't be disabled reliably
We look at:
- multi-page jpg (conversion via Ghostscript)
- Flash approach (browser plugin needed)
- applet approach
Observation: the images somehow do not look as nice without using the Adobe stuff.
Current / planned activities (2)
We will evaluate conversion to ePUB as an alternative download format, possibly on the fly. Who has experience already?
THE END …
Thank you very much for your patience and attention!
Special credits to Petra Schröder for most of the work, to Joe Getty for thumbnail inspiration, and to Wikimedia Commons for some pictures!