Million Book Bibliotheca Alexandrina Noha Adly 20 November 2006
BA Digitization Workflow
Statistics - November 2006

                     Arabic       Latin       Total
Scanned    Books     22,023       4,646      26,669
           Pages  7,003,185   1,350,688   8,353,873
Processed  Books     21,947       4,642      26,589
           Pages  6,987,392   1,348,900   8,336,292
OCRed      Books     16,652       4,600      21,252
           Pages  5,248,337   1,327,385   6,575,722

Total Archived Data: 1,500 GB
Statistics (Contd)
Daily Rates
–Scan: ≈ 1800 pages/person
–Process: ≈ 1800 pages/person
–Latin OCR: ≈ 4000 pages/person
–Arabic OCR: ≈ 1500 pages/person
Five Minolta scanners
2 shifts, 7 days a week
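Combined with the November 2006 page counts, the daily rates give a rough idea of the remaining OCR backlog. The sketch below is only an illustration: it assumes every processed page is meant to be OCRed and reads each rate as pages per person per day. The function name `ocr_backlog_person_days` is hypothetical.

```python
# Figures copied from the November 2006 statistics slide.
PROCESSED_PAGES = {"Arabic": 6_987_392, "Latin": 1_348_900}
OCRED_PAGES = {"Arabic": 5_248_337, "Latin": 1_327_385}
# Approximate OCR rates from the slide, read here as pages per person per day.
OCR_RATE = {"Arabic": 1500, "Latin": 4000}


def ocr_backlog_person_days(script: str) -> float:
    """Person-days needed to OCR pages that are processed but not yet OCRed."""
    backlog_pages = PROCESSED_PAGES[script] - OCRED_PAGES[script]
    return backlog_pages / OCR_RATE[script]


for script in ("Arabic", "Latin"):
    backlog = PROCESSED_PAGES[script] - OCRED_PAGES[script]
    days = ocr_backlog_person_days(script)
    print(f"{script}: {backlog:,} pages, about {days:,.0f} person-days")
```

Under these assumptions the Arabic backlog dominates, which is consistent with Arabic OCR having the lowest daily rate on the slide.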
OCR Image to Text
OCR - Arabic
Poses unique challenges
–Written cursively, with blocks of connected characters
–A block of characters can have more than one baseline
–Uses external objects such as dots, 'Hamza' and 'Madda'
–Diacritization
–Characters can have more than one shape according to their position
–Overlapping makes it difficult to determine the spacing
Sakhr Automatic Reader is used
Tricky with old books
Requires learning
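Two of the challenges above, position-dependent shapes and diacritics, can be seen directly in Unicode data. The sketch below uses Python's standard `unicodedata` module; it illustrates the script properties only and says nothing about how Sakhr Automatic Reader works internally. The helper names `positional_forms` and `strip_diacritics` are hypothetical.

```python
import unicodedata


def positional_forms(letter: str) -> dict[str, str]:
    """Map a form name (isolated/final/initial/medial) to its presentation glyph."""
    forms = {}
    # Arabic Presentation Forms-B block: U+FE70..U+FEFF.
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Entries look like "<initial> 0628"; skip ligatures and unassigned points.
        if len(parts) == 2 and parts[0].startswith("<"):
            if int(parts[1], 16) == ord(letter):
                forms[parts[0].strip("<>")] = chr(cp)
    return forms


def strip_diacritics(text: str) -> str:
    """Drop nonspacing marks (category Mn), which include the Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


beh = "\u0628"   # ARABIC LETTER BEH: connects on both sides
alef = "\u0627"  # ARABIC LETTER ALEF: connects only to the preceding letter
print(sorted(positional_forms(beh)))
print(sorted(positional_forms(alef)))
print(strip_diacritics("\u0643\u064E\u062A\u064E\u0628\u064E"))
```

A recognizer therefore has to treat several glyph shapes as the same letter, and decide whether small marks above or below the line are diacritics, dots that belong to a letter, or noise.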
Arabic Script Is Cursive
Old, Smudgy, and Stuck Together
Use of Diacritics
16 Font Groups
Evaluation of VERUS and AR
Research agreement with NovoDynamics
Preliminary evaluation on two data sets is promising
–Challenge: difficult to OCR, degraded images
–Normal: known to return acceptable accuracy
Encoding Image on Text
Image-on-Text
Multilayered:
–Visible page image
–Hidden OCR text
View exact original layout while searching and highlighting
Supported by some OCR suites only
Supported formats: DjVu and PDF
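The two-layer idea can be sketched as a small data model: the page image is what the reader sees, and each OCR word carries the box it occupies on that image, so a search hit can be highlighted in place. This is a conceptual sketch only. It does not reproduce the internal structure of DjVu or PDF files, and the `Word` and `Page` classes and the file name are hypothetical.

```python
from dataclasses import dataclass, field

Box = tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels


@dataclass
class Word:
    text: str
    box: Box


@dataclass
class Page:
    image_path: str                                   # visible layer: scanned image
    words: list[Word] = field(default_factory=list)   # hidden layer: OCR text

    def find(self, query: str) -> list[Box]:
        """Return the boxes of words containing the query, for highlighting."""
        q = query.casefold()
        return [w.box for w in self.words if q in w.text.casefold()]


page = Page(
    "page_0001.tif",  # hypothetical file name
    [Word("Bibliotheca", (120, 80, 420, 130)),
     Word("Alexandrina", (440, 80, 760, 130))],
)
print(page.find("alex"))
```

The viewer draws the highlight rectangles over the image, so the reader searches text while still seeing the original layout.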
Quality Assurance
No missing cover or pages
All pages are in order
Text quality
Image quality
PDF quality
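The first two checks lend themselves to automation. The sketch below assumes each scanned image is labelled with a page number from 1 to a known page count, in scan order. The slides do not describe how BA performs these checks, so the function `check_page_sequence` is a hypothetical illustration.

```python
def check_page_sequence(page_numbers: list[int], expected_count: int) -> dict[str, list[int]]:
    """Report missing pages and pages that appear out of order."""
    expected = set(range(1, expected_count + 1))
    missing = sorted(expected - set(page_numbers))
    # A page is out of order when it is smaller than the page scanned before it.
    out_of_order = [
        cur for prev, cur in zip(page_numbers, page_numbers[1:]) if cur < prev
    ]
    return {"missing": missing, "out_of_order": out_of_order}


# Page 5 was never scanned and page 3 was scanned after page 4.
print(check_page_sequence([1, 2, 4, 3, 6], expected_count=6))
```

Text, image, and PDF quality are harder to reduce to a simple rule and would still need sampling or manual review.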
DAR Digital Assets Repository
System Architecture
DAK Publishing Module
Show notes
Transfer of Digitized Books
Challenges
–Storage: CD vs Online
–Bandwidth: 10 Mbps vs 155 Mbps
–Copyright: not published
Actions:
–Transferred 8,500+ books to the Internet Archive
–Process is still going on
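The bandwidth gap is easier to appreciate as transfer time. The sketch below takes the 1,500 GB archive total from the statistics slide and assumes, purely for illustration, that the whole archive is sent over a fully utilized link with decimal units (1 GB = 10^9 bytes). Real throughput would be lower, and the function `transfer_days` is hypothetical.

```python
def transfer_days(size_gb: float, link_mbps: float) -> float:
    """Days needed to move size_gb (decimal GB) over a link of link_mbps."""
    bits = size_gb * 8e9                  # 1 GB = 8 * 10^9 bits
    seconds = bits / (link_mbps * 1e6)    # 1 Mbps = 10^6 bits per second
    return seconds / 86_400               # seconds per day


for mbps in (10, 155):
    print(f"{mbps} Mbps: {transfer_days(1500, mbps):.1f} days")
```

Under these idealized assumptions the slower link needs about two weeks of continuous transfer while the faster one needs under a day, which helps explain why shipping physical media was weighed against online transfer.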
Books From India Towards better collaboration
Books From India

Language          Number of Books
Arabic                        832
Arabic + French                 3
Arabic + German                 1
Persian                       101
French                          2
English                         1
Spanish                         1
German                          1
Total                         942
Progress

Phase Name   Done as of November 1, 2006   Expected to finish by   Comments
Cataloging                                                         Has metadata problems
Processing   742                           November 20, 2006
OCRing       200                           March 1, 2007
Encoding     171                           --
Publishing   171                           --
Metadata Problems
Processing
OCR Using VERUS or AR?
Calculated accuracy for a small sample
–Images processed once with darkening effect and once without
–VERUS benefits from darkening, AR does not
–Overall, AR won 70% of cases
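The slide does not say how accuracy was calculated. One common choice, shown here only as an assumed stand-in, is character accuracy: one minus the edit distance between the OCR output and a hand-corrected ground truth, divided by the ground-truth length. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one previous row of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                # delete a character from a
                cur[j - 1] + 1,             # insert a character into a
                prev[j - 1] + (ca != cb),   # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Character accuracy in [0, 1] relative to the ground-truth length."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1 - errors / len(ground_truth))


print(char_accuracy("abcd", "abxd"))
```

Running each engine on both the darkened and the original image, scoring every output this way, and counting which engine scores higher per page would produce a "won N% of cases" figure of the kind reported on the slide.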
Thank You