Published by Blaise Hunt. Modified over 9 years ago.
1
Million Book Project @ Bibliotheca Alexandrina
Noha Adly
20 November 2006
2
Bibliotheca Alexandrina
4
BA Digitization Workflow
5
Statistics - November 2006
                        Arabic       Latin       Total
Scanned     Books       22,023       4,646      26,669
            Pages    7,003,185   1,350,688   8,353,873
Processed   Books       21,947       4,642      26,589
            Pages    6,987,392   1,348,900   8,336,292
OCRed       Books       16,652       4,600      21,252
            Pages    5,248,337   1,327,385   6,575,722
Total Archived Data: 1,500 GB
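The table invites a few derived figures, such as how far OCR lags behind scanning for each script. The sketch below only re-adds and divides the numbers on the slide; the names `STATS`, `total`, and `share_of_scanned` are illustrative and not part of the presentation.

```python
# Figures copied from the November 2006 statistics slide, as (Arabic, Latin).
STATS = {
    "scanned":   {"books": (22_023, 4_646), "pages": (7_003_185, 1_350_688)},
    "processed": {"books": (21_947, 4_642), "pages": (6_987_392, 1_348_900)},
    "ocred":     {"books": (16_652, 4_600), "pages": (5_248_337, 1_327_385)},
}
ARABIC, LATIN = 0, 1


def total(stage, unit):
    """Arabic + Latin count for one stage ('scanned', ...) and unit ('books'/'pages')."""
    return sum(STATS[stage][unit])


def share_of_scanned(stage, unit, script):
    """Fraction of the scanned items of one script that have reached `stage`."""
    return STATS[stage][unit][script] / STATS["scanned"][unit][script]


if __name__ == "__main__":
    print("Scanned books, total:", total("scanned", "books"))
    print("Arabic books OCRed: {:.1%}".format(share_of_scanned("ocred", "books", ARABIC)))
    print("Latin books OCRed:  {:.1%}".format(share_of_scanned("ocred", "books", LATIN)))
```

The totals computed this way agree with the Total column on the slide, which is a useful sanity check on the reconstructed table.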
6
Statistics (Contd)
Daily Rates
– Scan: ≈ 1800 pages/person
– Process: ≈ 1800 pages/person
– Latin OCR: ≈ 4000 pages/person
– Arabic OCR: ≈ 1500 pages/person
Five Minolta scanners
2 shifts, 7 days a week
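Combined with the page counts from the November 2006 statistics, the daily rates give a rough idea of the remaining OCR effort. This is a back-of-the-envelope sketch: it assumes the backlog is simply processed pages minus OCRed pages and that the approximate per-person rates hold steady. The slides do not state staffing levels, so the result is in person-days, not calendar days.

```python
# Approximate daily OCR rates from the slide (pages per person per day).
OCR_RATE = {"arabic": 1500, "latin": 4000}

# Page counts from the November 2006 statistics slide.
PROCESSED_PAGES = {"arabic": 6_987_392, "latin": 1_348_900}
OCRED_PAGES = {"arabic": 5_248_337, "latin": 1_327_385}


def ocr_backlog_pages(script):
    """Pages that have been processed but not yet OCRed (assumed definition of backlog)."""
    return PROCESSED_PAGES[script] - OCRED_PAGES[script]


def person_days_to_clear(script):
    """Backlog divided by the per-person daily rate for that script."""
    return ocr_backlog_pages(script) / OCR_RATE[script]


if __name__ == "__main__":
    for script in ("arabic", "latin"):
        print(script, ocr_backlog_pages(script), round(person_days_to_clear(script), 1))
```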
7
OCR
Image to Text
8
OCR - Arabic
Poses unique challenges
– Written cursively, with blocks of connected characters
– A block of characters can have more than one baseline
– Uses external marks such as dots, hamza, and madda
– Diacritization
– Characters can have more than one shape according to their position
– Overlapping makes it difficult to determine the spacing
Sakhr Automatic Reader is used
Tricky with old books
Requires learning
9
Arabic Script Is Cursive
10
Old, Smudgy, and Stuck Together
11
Use of Diacritics
12
16 Font Groups
13
Evaluation of VERUS and AR (Automatic Reader)
Research agreement with NovoDynamics
Preliminary evaluation on two data sets is promising
– Challenge: difficult to OCR, degraded images
– Normal: known to return acceptable accuracy
14
Encoding
Image on Text
15
Image-on-Text
Multilayered:
– Visible page image
– Hidden OCR text
View exact original layout while searching and highlighting
Supported with some OCR suites only
Supported formats: DjVu and PDF
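The two-layer idea can be illustrated with a small data model: the reader sees only the page image, while a search runs over the hidden OCR words and returns the rectangles to highlight on that image. This is a conceptual sketch only. The `Word` and `Page` classes are hypothetical and do not reflect how DjVu or PDF actually store their text layers.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Word:
    """One OCRed word and its bounding box on the page image (pixels)."""
    text: str
    x: int
    y: int
    width: int
    height: int


@dataclass
class Page:
    image_path: str                             # visible layer: the scanned page image
    words: list = field(default_factory=list)   # hidden layer: OCR words with positions

    def highlight_boxes(self, query):
        """Return (x, y, width, height) for every hidden word matching the query."""
        wanted = query.casefold()
        return [
            (w.x, w.y, w.width, w.height)
            for w in self.words
            if w.text.casefold() == wanted
        ]


if __name__ == "__main__":
    page = Page("page_0001.tif", [Word("Alexandria", 10, 20, 80, 12),
                                  Word("library", 100, 20, 50, 12)])
    print(page.highlight_boxes("Library"))
```

Because the boxes refer to coordinates on the original scan, the highlight lands on the word as printed, which is what lets the viewer keep the exact original layout.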
16
Quality Assurance
No missing cover or pages
All pages are in order
Text quality
Image quality
PDF quality
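The first two checks (no missing pages, pages in order) lend themselves to automation. The sketch below assumes each page is a file whose name ends in a page number, such as `0001.tif`. That naming scheme is hypothetical, since the slides do not describe how the scans are stored.

```python
import re

# Assumed naming scheme: the page number is the digits just before the extension.
PAGE_RE = re.compile(r"(\d+)\.\w+$")


def page_numbers(filenames):
    """Extract the page number from each filename, preserving the given order."""
    numbers = []
    for name in filenames:
        match = PAGE_RE.search(name)
        if match is None:
            raise ValueError(f"no page number found in {name!r}")
        numbers.append(int(match.group(1)))
    return numbers


def missing_pages(filenames, expected_count):
    """Page numbers in 1..expected_count that have no corresponding file."""
    present = set(page_numbers(filenames))
    return [n for n in range(1, expected_count + 1) if n not in present]


def in_order(filenames):
    """True if the files are listed in strictly increasing page order."""
    numbers = page_numbers(filenames)
    return all(a < b for a, b in zip(numbers, numbers[1:]))


if __name__ == "__main__":
    scans = ["0001.tif", "0002.tif", "0004.tif"]
    print(missing_pages(scans, expected_count=4), in_order(scans))
```

Text, image, and PDF quality are harder to reduce to a rule like this and presumably still need human review.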
17
DAR
Digital Assets Repository
18
System Architecture
19
DAK Publishing Module
20
DAK Publishing Module
21
DAK Publishing Module
22
DAK Publishing Module
24
Show notes
26
Transfer of Digitized Books
Challenges
– Storage: CD vs. online
– Bandwidth: 10 Mbps vs. 155 Mbps
– Copyright: not published
Actions
– Transferred 8,500+ books to the Internet Archive
– Process is still going on
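To see why the two bandwidth figures matter, it helps to estimate how long a bulk transfer would take on each link. The sketch below is an idealized calculation. It assumes decimal gigabytes and a fully utilized link by default, and uses the 1,500 GB archive size from the statistics slide purely as an example payload. The slides do not say how much data was actually sent or over which link.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_gb, link_mbps, utilization=1.0):
    """Idealized transfer time in days for `size_gb` decimal gigabytes.

    `utilization` is the assumed fraction of the link that is usable (0 < u <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    bits = size_gb * 1e9 * 8                      # gigabytes -> bits
    seconds = bits / (link_mbps * 1e6 * utilization)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for mbps in (10, 155):
        print(f"{mbps} Mbps: {transfer_days(1500, mbps):.1f} days")
```

Under these assumptions the slower link needs roughly two weeks for the example payload while the faster one needs under a day, which suggests why shipping physical media stays on the table when bandwidth is limited.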
27
Books From India
Towards better collaboration
28
Books From India
Language          Number of Books
Arabic                        832
Arabic + French                 3
Arabic + German                 1
Persian                       101
French                          2
English                         1
Spanish                         1
German                          1
Total                         942
29
Progress
Phase Name   Done as of November 1, 2006   Expected to be finished by   Comments
Cataloging   801                           -                            35 have metadata problems
Processing   742                           November 20, 2006
OCRing       200                           March 1, 2007
Encoding     171                           --
Publishing   171                           --
30
Metadata Problems
31
Processing
32
OCR Using VERUS or AR?
Calculated accuracy for a small sample
– Images processed once with darkening effect and once without
– VERUS likes darkening, AR does not
– Overall, AR won 70% of cases
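The slide does not say how accuracy was calculated. One common choice is character accuracy based on edit distance against a hand-corrected transcription, and the sketch below uses that metric as an assumption. The function names and the sample layout are illustrative only, and no real VERUS or AR output is reproduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(ocr_text, truth):
    """1 - (edit distance / length of ground truth), clamped at 0."""
    if not truth:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1.0 - edit_distance(ocr_text, truth) / len(truth))


def win_rate(samples):
    """Fraction of samples where engine A is strictly more accurate than engine B.

    `samples` is a list of (truth, engine_a_output, engine_b_output) tuples.
    """
    wins = sum(char_accuracy(a, t) > char_accuracy(b, t) for t, a, b in samples)
    return wins / len(samples)
```

To compare the darkening and non-darkening runs, the same functions could be applied per engine with the two image variants as the two outputs.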
42
Thank You