DML-CZ: Scanning and adjusting the images Martin Lhoták Academy of Sciences Library Launching the DML-CZ Prague
DML-CZ Workflow 1. Preparation 2. Scanning and adjusting the images 3. OCR 4. Metadata harvesting (MR, ZBL) 5. Integration 6. Digital Library
Content 1. Digitization Centre of the AS Library 2. Scanning 3. Adjusting the images 4. Basic metadata 5. OCR 6. Back up and movement of the data 7. Production till now
Digitization Centre of the AS Library In operation since Built with support from the EU Solidarity Fund after the floods in Czechia in 2002 Main aim - to build a digital library of scientific publications published by the Academy of Sciences of the Czech Republic Digital Library of ASCR Partner of the DML-CZ project since 2005
The Academy of Sciences of the Czech Republic > 50 scientific institutes 7500 employees (4000 R&D) > articles, reports, etc. a year publishes > 90 journals (circa 3000 articles) > 100 years of history
Digitization Centre of the AS Library 2 x A2 bw scanners Zeutschel OS x A1 color scanner Digibook x A4 fast production scan. Panasonic Staff – 8 to 10 people Monthly production pages Overall production > pages
DML-CZ: Scanning 2 x A2 bw scanners Zeutschel OS DPI 4 bit greyscale 1 page = 1 file usually A5 TIFF with lossless LZW compression circa 10 MB
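The file-size figure above can be sanity-checked with a little arithmetic. The following is a minimal sketch of the uncompressed size of a greyscale page image. The A5 dimensions (148 x 210 mm) are the standard ISO values. The function name and the resolution parameter are assumptions introduced for illustration, since the scanning resolution is not stated here. LZW compression and TIFF headers are ignored, so the result is only a rough estimate of the raw pixel data.

```python
MM_PER_INCH = 25.4

def uncompressed_page_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Estimate the raw pixel-data size of one scanned page in bytes.

    Hypothetical helper: rows are padded to a whole number of bytes,
    as in an uncompressed TIFF strip; headers and LZW are ignored.
    """
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px

if __name__ == "__main__":
    # The dpi values below are illustrative only, not taken from the slides.
    for dpi in (300, 600):
        size = uncompressed_page_bytes(148, 210, dpi, bits_per_pixel=4)
        print(f"A5, 4-bit greyscale, {dpi} DPI: {size / 1e6:.1f} MB raw")
```

Doubling the resolution roughly quadruples the raw size, which is why the resolution and bit depth dominate storage planning for a page-per-file workflow.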
Image Adjusting Software Book Restorer from i2S Designed to process scanned books Geometrical correction Crop Blur Binarization Despeckle
Basic Metadata XML (DTD of the Czech National Library) Title basic bibliographic data Physical size of the journal Numbers of pages Software Sirius (CZ)
OCR FineReader runs: - 1. to recognize the language of each paragraph - 2. to do OCR with the right language OCR workflow developed by the team of Dr. P. Sojka Output – double-layer PDF: - 1. layer: scanned picture - 2. layer: „OCRed“ text
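The two-run idea above can be sketched as follows. This is not the project's actual workflow and does not call FineReader. The stopword lists, the function names, and the `ocr_with_language` callable are all assumptions standing in for the real engine. The sketch only shows the control flow: a rough first pass yields text per paragraph, a language is guessed for each paragraph, and the second pass is run with that language.

```python
from collections import defaultdict

# Tiny illustrative stopword lists (an assumption); a real system would
# rely on the OCR engine's own language recognition.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "that", "for", "with"},
    "de": {"der", "die", "und", "das", "ist", "für", "mit"},
    "fr": {"le", "la", "et", "les", "des", "est", "pour"},
    "cs": {"je", "že", "se", "na", "pro", "jsou", "nebo"},
}

def guess_language(text, default="en"):
    """Pick the language whose stopwords occur most often in the text."""
    words = [w.strip(".,;:()") for w in text.lower().split()]
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

def plan_second_pass(rough_paragraphs):
    """Pass 1 result: map each guessed language to paragraph indices."""
    plan = defaultdict(list)
    for idx, text in enumerate(rough_paragraphs):
        plan[guess_language(text)].append(idx)
    return dict(plan)

def run_second_pass(rough_paragraphs, ocr_with_language):
    """Pass 2: call the (hypothetical) OCR engine once per paragraph,
    with the language chosen in pass 1."""
    results = [None] * len(rough_paragraphs)
    for lang, indices in plan_second_pass(rough_paragraphs).items():
        for idx in indices:
            results[idx] = ocr_with_language(idx, lang)
    return results
```

Grouping paragraphs by language before the second pass makes it easy to batch all paragraphs of one language into a single engine configuration, which matters for journals that mix several languages on one page.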
Back up and movement of the data Main steps and outputs: 1. scanning – TIFF 2. image adjust. and basic metadata – TIFF, XML 3. OCR – PDF After each step above: One copy to server in Brno Two copies on LTO tapes
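The rule "after each step, one copy to the server and two copies on tape" can be sketched as a small script. This is a minimal illustration, not the centre's actual tooling. The function names are hypothetical, and the destinations are assumed to be ordinary directories (for example a mounted server share and tape staging areas), since writing to LTO drives directly depends on the tape software in use.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def back_up_step(step_dir, destinations):
    """Copy every file under step_dir to each destination and verify it.

    Returns a dict mapping each destination to the number of files copied.
    Raises OSError if a copy does not match its source checksum.
    """
    step_dir = Path(step_dir)
    sources = sorted(p for p in step_dir.rglob("*") if p.is_file())
    report = {}
    for dest in destinations:
        dest_root = Path(dest) / step_dir.name
        for src in sources:
            target = dest_root / src.relative_to(step_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, target)
            if sha256_of(src) != sha256_of(target):
                raise OSError(f"checksum mismatch for {target}")
        report[str(dest)] = len(sources)
    return report
```

Calling `back_up_step` once per finished step (scanning, image adjustment with metadata, OCR) with three destinations mirrors the one-plus-two copy scheme, and the checksum comparison catches truncated or corrupted copies before the working data is removed.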
Production for DML-CZ till now Scanning: pages Image adjust.: pages Basic metadata: pages OCR: pages Disproportion: some data was obtained from GDZ Goettingen
Alternative output of the Acad. of Sci. mathematic
Thank you! Questions? Martin Lhoták