Million Book Bibliotheca Alexandrina Youssef Eldakar 19 November 2006
BA Digitization Workflow
Image Processing: Image to Better Image
Image Processing Sequence
  ScanFix (automated processing)
    –Deskew
    –Despeckle
    –Rotation
  Adobe Photoshop (manual processing)
    –Noise removal
    –Black edge removal
    –Page resize
    –Center text to page
  ScanFix (automated processing)
    –Enhance text quality (Grow & Erode)
  ACDSee (automated processing)
    –Renaming files
    –File compression (CCITT Group 4)
ScanFix: Deskew (figure: before and after)
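The slides do not say how ScanFix estimates skew. As a rough illustration only, the sketch below uses a projection-profile search, a common textbook approach: try a range of candidate angles, rotate the black-pixel coordinates by each, and keep the angle at which the pixels pile up into the fewest rows. The function names, the angle range, and the step size are all assumptions made for this example.

```python
import math


def profile_sharpness(black_pixels, angle_degrees):
    """Score an angle: sum of squared per-row pixel counts after rotation.

    black_pixels is a list of (x, y) coordinates. The score is highest when
    text lines collapse onto as few rows as possible.
    """
    a = math.radians(angle_degrees)
    counts = {}
    for x, y in black_pixels:
        # Row index of the pixel once the page is rotated back by the angle.
        row = round(y * math.cos(a) - x * math.sin(a))
        counts[row] = counts.get(row, 0) + 1
    return sum(n * n for n in counts.values())


def estimate_skew(black_pixels, max_angle=5.0, step=0.5):
    """Return the candidate angle (degrees) with the sharpest row profile."""
    steps = int(round(max_angle / step))
    candidates = [i * step for i in range(-steps, steps + 1)]
    return max(candidates, key=lambda angle: profile_sharpness(black_pixels, angle))
```

Once the angle is known, the page would be rotated by the opposite amount. A finer step gives a more precise estimate at the cost of more passes over the pixels.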
ScanFix: Despeckle (figure: before and after)
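ScanFix's despeckle filter is a black box here, so the following is only a minimal sketch of the general idea: find connected groups of black pixels and erase those that are too small to be part of the text. The image representation (rows of 0 for white and 1 for black) and the max_speck_size parameter are assumptions for this example.

```python
def despeckle(image, max_speck_size=2):
    """Erase connected groups of black pixels with at most max_speck_size pixels.

    image is a list of rows; each pixel is 0 (white) or 1 (black).
    Returns a cleaned copy and leaves the input untouched.
    """
    rows, cols = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Flood fill (8-connectivity) to collect one connected component.
            stack = [(r, c)]
            seen[r][c] = True
            component = []
            while stack:
                y, x = stack.pop()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            if len(component) <= max_speck_size:
                for y, x in component:
                    result[y][x] = 0
    return result
```

Because Arabic script relies on small marks such as dots, the size threshold would need to stay well below the size of a printed dot at the scanning resolution.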
ScanFix: Rotation
Photoshop: Noise removal (figure: before and after)
Photoshop: Black edge removal (figure: before and after)
Photoshop: Page resize
Photoshop: Center text to page (figure: before and after)
ScanFix: Enhance text quality with Grow and Erode, horizontal or vertical (figure: before and after)
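Grow and Erode correspond to the standard morphological operations of dilation and erosion. The sketch below shows them on a tiny binary image with separate horizontal and vertical switches, mirroring the options named on the slide. It is an illustration of the technique, not ScanFix's implementation; the function names and the 0/1 pixel representation are assumptions.

```python
def _neighbours(r, c, horizontal, vertical):
    """Neighbouring cells to inspect for the selected directions."""
    cells = []
    if horizontal:
        cells += [(r, c - 1), (r, c + 1)]
    if vertical:
        cells += [(r - 1, c), (r + 1, c)]
    return cells


def _is_black(image, y, x):
    """True for a black pixel; anything off the page counts as white."""
    return 0 <= y < len(image) and 0 <= x < len(image[0]) and image[y][x] == 1


def grow(image, horizontal=True, vertical=True):
    """Dilation: a white pixel turns black if any selected neighbour is black."""
    out = [row[:] for row in image]
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            if pixel == 0 and any(_is_black(image, y, x)
                                  for y, x in _neighbours(r, c, horizontal, vertical)):
                out[r][c] = 1
    return out


def erode(image, horizontal=True, vertical=True):
    """Erosion: a black pixel stays black only if all selected neighbours are black."""
    out = [row[:] for row in image]
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            if pixel == 1 and not all(_is_black(image, y, x)
                                      for y, x in _neighbours(r, c, horizontal, vertical)):
                out[r][c] = 0
    return out
```

Applying grow and then erode in the same direction closes small gaps in broken strokes while restoring the original stroke thickness. For the row 0 1 1 0 1 1 0, a horizontal grow followed by a horizontal erode yields 0 1 1 1 1 1 0.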
ACDSee: Renaming files
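The naming convention applied in ACDSee is not given in the slides. Purely as an illustration of batch renaming, the sketch below plans new names from a book identifier and a zero-padded page number. The pattern, the book_id value, and the assumption that sorted scan names reflect page order are all hypothetical.

```python
from pathlib import PurePath


def plan_renames(filenames, book_id, digits=4):
    """Return (old_name, new_name) pairs; nothing is renamed on disk.

    Assumes that sorting the scan names alphabetically gives page order.
    """
    plan = []
    for page_number, old_name in enumerate(sorted(filenames), start=1):
        suffix = PurePath(old_name).suffix.lower()
        new_name = f"{book_id}_{page_number:0{digits}d}{suffix}"
        plan.append((old_name, new_name))
    return plan
```

Producing a plan first, rather than renaming directly, lets an operator review the mapping before any file is touched.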
ACDSee: Compression to TIFF (CCITT Group 4)
OCR: Image to Text
OCR - Arabic
  Poses unique challenges
    –Written cursively, with blocks of connected characters
    –A block of characters can have more than one baseline
    –Uses external objects such as dots, hamza, and madda
    –Diacritization
    –Characters can have more than one shape according to their position
    –Overlapping makes it difficult to determine the spacing
  Sakhr Automatic Reader is used
  Tricky with old books
  Requires learning
Arabic Script Is Cursive
Old, Smudgy, and Stuck Together
Use of Diacritics
Pre-OCR Text Enhancement
  Condition of Arabic printings varies
    –Old/new
    –Light/heavy
    –Solid/dot-matrix
  ScanFix's smoothing and completion features improve recognition accuracy
  Separate from actual processing phase
    –Must be tested under OCR right away
    –OCR specialists have a better feel for “good text”
Text Repair in ScanFix
Font Libraries
  Improvement of Arabic OCR results through
    –Tweaking of OCR engine settings
    –Learning
  Libraries for different fonts have been built to achieve higher recognition rates
  Databases of character glyphs that describe a particular type of script and improve OCR accuracy
  Built on a carefully selected and classified high-variety set of scanned images belonging to a batch of about 1000 books that boiled down to 15 font groups
Font Classification
  Classification criteria:
    –Script type
        TA: Traditional Arabic
        AR: Arabic Transparent
        DT: DecoType Naskh and DecoType Naskh Extensions
    –Printing quality: High (H), Medium (M), and Low (L)
    –Font size: 1 (largest) to 5 (smallest)
  “Group X”: virtual font to tag unclassifiable printings and handwriting
  Minimum accuracy number assigned to each group based on testing results
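One way to picture the classification criteria is as a short tag built from the three codes. The sketch below is an assumption about how such a tag could be formed; the slides give the codes but not the tag format, and only a subset of the possible combinations ended up as actual font groups.

```python
# Codes taken from the classification criteria; the tag layout is hypothetical.
SCRIPT_TYPES = {
    "TA": "Traditional Arabic",
    "AR": "Arabic Transparent",
    "DT": "DecoType Naskh / DecoType Naskh Extensions",
}
PRINT_QUALITIES = {"H": "High", "M": "Medium", "L": "Low"}
FONT_SIZES = range(1, 6)  # 1 (largest) to 5 (smallest)
UNCLASSIFIED = "X"  # stands for "Group X"


def font_group_tag(script=None, quality=None, size=None):
    """Build a tag such as 'TA-H-1', or fall back to Group X."""
    if script in SCRIPT_TYPES and quality in PRINT_QUALITIES and size in FONT_SIZES:
        return f"{script}-{quality}-{size}"
    return UNCLASSIFIED
```

Anything that cannot be described by all three criteria, such as handwriting, falls through to the Group X tag.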
16 Font Groups
Learning
  Train the engine on two representative pages of the book to build upon an initial font file picked from a set of pre-built font libraries
  Use a different page to manually calculate OCR accuracy before and after learning
  Batch OCR book using learned font file and save to ART
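The slides do not define the accuracy measure used for the before-and-after comparison. A common choice, assumed here, is character accuracy: one minus the edit distance between the OCR output and a hand-typed reference, divided by the reference length. The function names below are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Fraction of the reference text recovered by OCR, floored at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1 - edit_distance(reference, hypothesis) / len(reference))
```

Running character_accuracy on the same test page before and after learning gives two comparable numbers, for example 0.75 when one character in a four-character reference is wrong.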
Learning in Sakhr's Automatic Reader
VERUS from NovoDynamics
  Preliminary evaluation on two data sets is promising
    –Challenge: difficult to OCR, degraded images
    –Normal: known to return acceptable accuracy
  No learning capabilities, hence no human operators
  VERUS uses an XML format to store recognition data
  BA and NovoDynamics entered into a research agreement
Evaluation of VERUS and AR (Sakhr's Automatic Reader)
Encoding: Image on Text
Challenges in Publishing
  Preservation of layout
  Searchability of content and metadata
  Efficient image compression
  Easy browsing of books
  Accommodating low-bandwidth users
  Multilingual text support
  Multipaging
Image-on-Text
  Multilayered:
    –Visible page image
    –Hidden OCR text
  View exact original layout while searching and highlighting
  Supported with some OCR suites only
  Supported formats: DjVu and PDF
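The two layers can be pictured as a page image paired with a list of recognized words and their positions. The sketch below is a simplified model of that idea, not the DjVu or PDF data structures; the class names, fields, and file names are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    box: tuple  # (left, top, right, bottom) in image pixels


@dataclass
class Page:
    image_file: str                            # visible layer: the page scan
    words: list = field(default_factory=list)  # hidden layer: OCR text


def find_hits(pages, query):
    """Return (page_index, box) for every hidden word containing the query."""
    hits = []
    for page_index, page in enumerate(pages):
        for word in page.words:
            if query in word.text:
                hits.append((page_index, word.box))
    return hits
```

A viewer would draw a highlight rectangle at each returned box on top of the page image, so the reader sees the original layout while the search runs against the hidden text.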
UDBE: Universal Digital Book Encoder
  A framework for integrating many OCR engines and supporting many target formats in a system for encoding image-on-text documents for publishing
  Made possible through the use of a Common OCR Format (COF)
UDBE: Built around a Common OCR Format (COF)
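The benefit of a common intermediate format is that each OCR engine needs one reader and each target format needs one writer, instead of one converter per engine and format pair. The sketch below illustrates that design only. The COF schema is not described in the slides, so the CofWord record, both engine output layouts, and the two stand-in writers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CofWord:
    """Hypothetical common record: recognized text plus its position."""
    text: str
    box: tuple  # (left, top, right, bottom)


def read_tab_separated(raw):
    """Hypothetical engine A output: one 'text<TAB>l,t,r,b' line per word."""
    words = []
    for line in raw.splitlines():
        text, coords = line.split("\t")
        words.append(CofWord(text, tuple(int(v) for v in coords.split(","))))
    return words


def read_semicolon_list(raw):
    """Hypothetical engine B output: 'l t r b text' records separated by ';'."""
    words = []
    for record in raw.split(";"):
        left, top, right, bottom, text = record.split(maxsplit=4)
        words.append(CofWord(text, (int(left), int(top), int(right), int(bottom))))
    return words


def write_hidden_text(words):
    """Stand-in writer: the searchable text layer as one string."""
    return " ".join(word.text for word in words)


def write_word_boxes(words):
    """Stand-in writer: the highlight rectangles."""
    return [word.box for word in words]


READERS = {"engine_a": read_tab_separated, "engine_b": read_semicolon_list}
WRITERS = {"hidden_text": write_hidden_text, "word_boxes": write_word_boxes}


def encode(raw, engine, target):
    """Convert engine output to the common format, then to the target."""
    return WRITERS[target](READERS[engine](raw))
```

Adding a new engine or a new target format then means registering one more function in READERS or WRITERS, with no change to the existing ones.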
Performance: Arabic B&W
Performance: Latin B&W
Quality Assurance
Q/A - Common Errors
  No missing cover or pages
  All pages are in order
  Text quality
  Image quality
  Page quality
  PDF quality
Q/A - Common Errors: Text Quality
  Pale text
  Toothed text
  Curved text
Q/A - Common Errors: Page Quality
  Cut pages
  Fingers
  Noise and page edges
  Page size
  Skew
Q/A - Common Errors: PDF Quality
  Image on text
  Search hits
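The first two checks, no missing pages and all pages in order, lend themselves to automation. The sketch below is a minimal example under the assumption that a page number can be read from each file in scan order and that the expected page count is known from the catalog record; the function name is illustrative.

```python
def check_page_sequence(page_numbers, expected_count):
    """Return (missing pages, out-of-order neighbour pairs).

    page_numbers lists the page number of each file in scan order.
    """
    expected = set(range(1, expected_count + 1))
    missing = sorted(expected - set(page_numbers))
    out_of_order = [(first, second)
                    for first, second in zip(page_numbers, page_numbers[1:])
                    if second < first]
    return missing, out_of_order
```

For a six-page book scanned as pages 1, 2, 4, 3, 6, the check reports page 5 as missing and the pair (4, 3) as out of order. The visual checks on text, image, and page quality still need a human reviewer.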
DAR: Digital Assets Repository
System Architecture
DAK - Metadata
  Descriptive metadata
  Administrative metadata
  Technical metadata
DAK Publishing Module
  Provides access to the repository content through search and browse facilities
  Multilingual full-text search
DAK Publishing Module: Functionalities
    –Browse the repository contents by collection, subject, creator, and title
    –Search content by an indexed metadata field
    –Multilingual full-text search using both exact and morphological matching
    –Display brief record information
    –Display full record information with links to digital objects
    –Display MARC and DC formats
DAR: Future Work
  Consider MODS and METS standards in the new system data model
  Enhance the functionalities of the Books Viewer with more security and copyright management
  Join the open source community by building DAR modules with open source technologies and languages
  Provide support for the currently available digital library interoperability protocols
Books from India: Towards Better Collaboration
Books From India
  Language           Number of books
  Arabic             832
  Arabic + French    3
  Arabic + German    1
  Persian            101
  French             2
  English            1
  Spanish            1
  German             1
  Total              942
Progress
  Phase name    Done as of November 1, 2006    Expected to be finished by    Comments
  Cataloging                                                                 Have metadata problems
  Processing    742                            November 20, 2006
  OCRing        200                            March 1, 2007
  Encoding      171                            --
  Publishing    171                            --
Metadata Problems
Processing
OCR: Using VERUS or AR?
  Calculated accuracy for a small sample
    –Images processed once with darkening effect and once without
    –VERUS likes darkening, AR does not
    –Overall, AR won 70% of cases
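The slides do not say exactly how a "win" was scored. One plausible reading, assumed here, is that each engine is credited with its better result across the two image variants and the engines are then compared image by image. The record layout and key names below are hypothetical.

```python
def best_accuracy(scores):
    """Best accuracy an engine reached across the two image variants."""
    return max(scores["plain"], scores["darkened"])


def win_rate(sample, engine, rival):
    """Fraction of images on which engine strictly beats rival."""
    if not sample:
        raise ValueError("sample must not be empty")
    wins = sum(1 for image in sample
               if best_accuracy(image[engine]) > best_accuracy(image[rival]))
    return wins / len(sample)
```

Each element of sample would hold the measured accuracies for one image, for example {"AR": {"plain": 0.90, "darkened": 0.85}, "VERUS": {"plain": 0.80, "darkened": 0.88}}, where the numbers are placeholders and not results from the evaluation. Ties count as a win for neither engine.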