Published by James Stanley. Modified over 8 years ago.
1
Million Book Project @ Bibliotheca Alexandrina
Youssef Eldakar
19 November 2006
2
BA Digitization Workflow
4
Image Processing: Image to Better Image
5
Image Processing Sequence
ScanFix (Automated processing)
– Deskew
– Despeckle
– Rotation
Adobe Photoshop (Manual processing)
– Noise Removal
– Black Edge Removal
– Page resize
– Center Text to page
ScanFix (Automated processing)
– Enhance text quality [Grow & Erode]
ACDSee (Automated processing)
– Renaming Files
– File compression (CCITT – Group 4)
7
ScanFix: Deskew (before and after)
8
ScanFix: Despeckle (before and after)
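The slides do not describe how ScanFix despeckles internally. As a minimal sketch of the general idea, assuming a speckle is a small isolated blob of ink pixels on a bilevel page, the following removes every 8-connected blob smaller than a threshold. The function name `despeckle` and the parameter `min_size` are illustrative, not ScanFix settings.

```python
from collections import deque


def despeckle(image, min_size=3):
    """Return a copy of a binary image (1 = ink, 0 = paper) without small blobs.

    Assumption: a "speckle" is any 8-connected group of ink pixels with
    fewer than min_size pixels. This is a generic technique, not ScanFix's.
    """
    rows, cols = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Flood-fill one connected component of ink pixels.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the component if it is too small to be part of a glyph.
            if len(component) < min_size:
                for y, x in component:
                    result[y][x] = 0
    return result


page = [
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
cleaned = despeckle(page, min_size=3)
```

For Arabic text the threshold matters: dots and diacritics are themselves small blobs, so a value that is too large would erase them along with the noise.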
9
ScanFix: Rotation
11
Photoshop: Noise removal (before and after)
12
Photoshop: Black edge removal (before and after)
13
Photoshop: Page resize
14
Photoshop: Center text to page (before and after)
16
ScanFix: Enhance text quality with Grow and Erode (horizontal / vertical), before and after
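Grow and erode correspond to the standard morphological operations of dilation and erosion. The sketch below illustrates them on a bilevel image with a 3-pixel line as the structuring element, horizontal or vertical. It assumes 1 = ink and 0 = paper and treats pixels outside the page as paper. The helper names are illustrative and are not ScanFix functions.

```python
def _morph(image, horizontal, combine):
    """Apply combine() over each pixel's 3-pixel horizontal or vertical line."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            line = []
            for d in (-1, 0, 1):
                y, x = (r, c + d) if horizontal else (r + d, c)
                inside = 0 <= y < rows and 0 <= x < cols
                # Assumption: anything outside the page is paper (0).
                line.append(image[y][x] if inside else 0)
            out[r][c] = combine(line)
    return out


def grow(image, horizontal=True):
    """Dilation: a pixel becomes ink if any pixel on its line is ink."""
    return _morph(image, horizontal, max)


def erode(image, horizontal=True):
    """Erosion: a pixel stays ink only if every pixel on its line is ink."""
    return _morph(image, horizontal, min)


# A broken horizontal stroke: grow bridges the gap, erode restores the length.
stroke = [[0, 1, 0, 1, 0]]
repaired = erode(grow(stroke))
```

Growing and then eroding (a morphological closing) fills small breaks in strokes, which is one plausible reading of how the two steps improve pale or broken text; the order and kernel sizes used at BA are not stated in the slides.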
18
ACDSee: Renaming files
19
ACDSee: Compression to TIFF (CCITT Group 4)
20
OCR: Image to Text
21
OCR - Arabic
Poses unique challenges
– Written cursively, with blocks of connected characters
– A block of characters can have more than one baseline
– Uses external objects such as dots, 'Hamza' and 'Madda'
– Diacritization
– Characters can have more than one shape according to their position
– Overlapping makes it difficult to determine the spacing
Sakhr Automatic Reader is used
Tricky with old books
Requires learning
22
Arabic Script Is Cursive
23
Old, Smudgy, and Stuck Together
24
Use of Diacritics
25
Pre-OCR Text Enhancement
Condition of Arabic printings varies
– Old/new
– Light/heavy
– Solid/dot-matrix
ScanFix’s smoothing and completion features improve recognition accuracy
Separate from actual processing phase
– Must be tested under OCR right away
– OCR specialists have a better feel for “good text”
26
Text Repair in ScanFix
27
Font Libraries
Improvement of Arabic OCR results through
– Tweaking of OCR engine settings
– Learning
Libraries for different fonts have been built to achieve higher recognition rates
Databases of character glyphs that describe a particular type of script and improve OCR accuracy
Built on a carefully selected and classified high-variety set of scanned images from a batch of about 1000 books that boiled down to 15 font groups
28
Font Classification
Classification criteria:
– Script type
  TA: Traditional Arabic
  AR: Arabic Transparent
  DT: Deco type Naskh and Deco type Naskh extension
– Printing quality: High (H), Medium (M), and Low (L)
– Font size: 1 (largest) to 5 (smallest)
“Group X” – virtual font to tag unclassifiable printings and handwriting
Minimum accuracy number assigned to each group based on testing results
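The three criteria above can be modelled as a small record. The sketch below is only an illustration: the slides give the codes (TA/AR/DT, H/M/L, 1 to 5) but not how they are written together, so the hyphenated tag form `TA-H-3`, the bare `X` for Group X, and the names `FontClass` and `parse_font_tag` are all assumptions.

```python
from dataclasses import dataclass

# Codes taken from the classification criteria on the slide.
SCRIPTS = {
    "TA": "Traditional Arabic",
    "AR": "Arabic Transparent",
    "DT": "Deco type Naskh",
}
QUALITIES = {"H": "High", "M": "Medium", "L": "Low"}


@dataclass(frozen=True)
class FontClass:
    script: str   # TA, AR or DT
    quality: str  # H, M or L
    size: int     # 1 (largest) to 5 (smallest)


def parse_font_tag(tag):
    """Parse a hypothetical tag such as 'TA-H-3'.

    Returns None for 'X', standing in for the virtual "Group X" that
    marks unclassifiable printings and handwriting.
    """
    if tag == "X":
        return None
    parts = tag.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected SCRIPT-QUALITY-SIZE, got {tag!r}")
    script, quality, size_text = parts
    if script not in SCRIPTS:
        raise ValueError(f"unknown script type {script!r}")
    if quality not in QUALITIES:
        raise ValueError(f"unknown printing quality {quality!r}")
    size = int(size_text)
    if not 1 <= size <= 5:
        raise ValueError(f"font size must be 1..5, got {size}")
    return FontClass(script, quality, size)


example = parse_font_tag("TA-H-3")
```

A per-group minimum accuracy, as mentioned on the slide, could then be stored in a dictionary keyed by `FontClass`.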
29
16 Font Groups
30
Learning
Train the engine on two representative pages of the book to build upon an initial font file picked from a set of pre-built font libraries
Use a different page to manually calculate OCR accuracy before and after learning
Batch OCR the book using the learned font file and save to ART
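The slides do not say which formula is used for the manual accuracy calculation. One common choice, shown here as an assumption, is character accuracy: one minus the edit distance between the OCR output and a hand-typed reference, divided by the reference length. The function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """Fraction of reference characters recovered, clamped to [0, 1].

    Assumption: accuracy = 1 - edit_distance / len(reference).
    """
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# Compare the same held-out page before and after learning.
reference = "abcd"
before = character_accuracy(reference, "axxd")
after = character_accuracy(reference, "abxd")
```

Measuring on a page that was not used for training, as the slide prescribes, keeps the before/after comparison from being inflated by the training pages themselves.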
31
Learning in Sakhr’s Automatic Reader
32
VERUS from NovoDynamics
Preliminary evaluation on two data sets is promising
– Challenge: difficult to OCR, degraded images
– Normal: known to return acceptable accuracy
No learning capabilities, hence no human operators
VERUS uses an XML format to store recognition data
BA and NovoDynamics entered into a research agreement
33
Evaluation of VERUS and AR
34
Encoding: Image on Text
35
Challenges in Publishing
Preservation of layout
Searchability of content and metadata
Efficient image compression
Easy browsing of books
Accommodating low-bandwidth users
Multilingual text support
Multipaging
36
Image-on-Text
Multilayered:
– Visible page image
– Hidden OCR text
View exact original layout while searching and highlighting
Supported with some OCR suites only
Supported formats: DjVu and PDF
37
UDBE: Universal Digital Book Encoder
A framework for integrating many OCR engines and supporting many target formats into a system for encoding image-on-text documents for publishing
Made possible through the use of a Common OCR Format (COF)
38
UDBE: Built around a Common OCR Format (COF)
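The slides do not specify COF itself. The sketch below only illustrates why a common intermediate format helps: with N engines and M target formats, each engine needs one reader and each format one writer, instead of N × M direct converters. Everything in it is hypothetical, including the `Word` record, the toy engine output format and the two writers; none of it is the real COF schema or the UDBE API.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical common record: recognized text plus its box on the page."""
    text: str
    x: int
    y: int
    width: int
    height: int


def read_toy_engine(raw):
    """Reader for a made-up engine output: 'text x y width height' per line."""
    words = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        text, x, y, width, height = line.split()
        words.append(Word(text, int(x), int(y), int(width), int(height)))
    return words


def write_plain_text(words):
    """Writer for a text-only target: words in reading order."""
    return " ".join(word.text for word in words)


def write_text_layer(words):
    """Writer for a positioned hidden layer: 'x,y,width,height<TAB>text'."""
    return "\n".join(
        f"{w.x},{w.y},{w.width},{w.height}\t{w.text}" for w in words
    )


# One reader per engine, one writer per target format.
READERS = {"toy-engine": read_toy_engine}
WRITERS = {"text": write_plain_text, "layer": write_text_layer}


def encode(engine, target, raw):
    """Convert raw engine output to a target format via the common records."""
    words = READERS[engine](raw)
    return WRITERS[target](words)


raw_output = "hello 10 20 50 12\nworld 70 20 48 12"
plain = encode("toy-engine", "text", raw_output)
layer = encode("toy-engine", "layer", raw_output)
```

Supporting a new engine then means adding one entry to `READERS`; every existing target format becomes available to it without further work.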
39
Performance – Arabic B&W
40
Performance – Latin B&W
41
Quality Assurance
42
Q/A - Common Errors
No missing cover or pages
All pages are in order
Text quality
Image quality
Page quality
PDF quality
45
Q/A - Common Errors: Text quality
– Pale text
– Toothed text
– Curved text
47
Q/A - Common Errors: Page quality
– Cut pages
– Fingers
– Noise and page edges
– Page size
– Skew
48
Q/A - Common Errors: PDF quality
– Image on text
– Searching hits
49
DAR: Digital Assets Repository
50
System Architecture
51
DAK - Metadata
Descriptive Metadata
Administrative Metadata
Technical Metadata
52
DAK Publishing Module
Providing access to the repository content through search and browse facilities
Multilingual full-text search
53
DAK Publishing Module
Functionalities
– Browse the repository contents by Collection, Subject, Creator and Title
– Search content by an indexed metadata field
– Multilingual full-text search using both exact and morphological matching
54
DAK Publishing Module
Functionalities (cont’d)
– Display brief record information
– Display full record information with links to digital objects
– Display MARC and DC format
61
DAR: Future Work
Consider MODS and METS standards in the new system data model
Enhance the functionalities of the Books Viewer with more security and copyright management
Join the Open Source community by building DAR modules with open source technologies and languages
Provide support for the currently available digital library interoperability protocols
62
Books from India: Towards Better Collaboration
63
Books From India
Language          Number of Books
Arabic            832
Arabic + French   3
Arabic + German   1
Persian           101
French            2
English           1
Spanish           1
German            1
Total             942
64
Progress
Phase Name   Done as of November 1, 2006   Expected to be finished by   Comments
Cataloging   801                           -                            35 have metadata problems
Processing   742                           November 20, 2006
OCRing       200                           March 1, 2007
Encoding     171                           -                            -
Publishing   171                           -                            -
65
Metadata Problems
66
Processing
67
OCR Using VERUS or AR?
Calculated accuracy for a small sample
– Images processed once with darkening effect and once without
– VERUS likes darkening, AR does not
– Overall, AR won 70% of cases