Million Book Bibliotheca Alexandrina Noha Adly 20 November 2006.

Presentation transcript:

Million Book Bibliotheca Alexandrina Noha Adly 20 November 2006

4 BA Digitization Workflow

5 Statistics - November 2006

                      Arabic       Latin       Total
Scanned    Books      22,023       4,646      26,669
           Pages   7,003,185   1,350,688   8,353,873
Processed  Books      21,947       4,642      26,589
           Pages   6,987,392   1,348,900   8,336,292
OCRed      Books      16,652       4,600      21,252
           Pages   5,248,337   1,327,385   6,575,722

Total Archived Data: 1,500 GB
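
The page counts in the statistics table can be turned into an OCR completion rate and a remaining backlog per script. The following is only an illustrative sketch: the figures are copied from the slide, while the names `PAGES`, `ocr_share`, and `ocr_backlog` are hypothetical and not part of the presentation.

```python
# Page counts copied from the November 2006 statistics slide.
PAGES = {
    "Arabic": {"scanned": 7_003_185, "processed": 6_987_392, "ocred": 5_248_337},
    "Latin": {"scanned": 1_350_688, "processed": 1_348_900, "ocred": 1_327_385},
}


def ocr_share(script: str) -> float:
    """Fraction of scanned pages that have already been OCRed."""
    counts = PAGES[script]
    return counts["ocred"] / counts["scanned"]


def ocr_backlog(script: str) -> int:
    """Number of scanned pages still waiting for OCR."""
    counts = PAGES[script]
    return counts["scanned"] - counts["ocred"]


for script in PAGES:
    print(f"{script}: {ocr_share(script):.1%} OCRed, "
          f"{ocr_backlog(script):,} pages pending")
```

Under these figures, the Arabic OCR backlog is much larger than the Latin one, which is consistent with the lower Arabic OCR rate reported on the next slide.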

6 Statistics (Contd)
• Daily Rates
  – Scan: ≈ 1800 pages/person
  – Process: ≈ 1800 pages/person
  – Latin OCR: ≈ 4000 pages/person
  – Arabic OCR: ≈ 1500 pages/person
• Five Minolta scanners
• 2 shifts – 7 days a week
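
The daily rates lend themselves to a rough capacity estimate. The sketch below assumes one operator per scanner per shift and reads "pages/person" as pages per person per shift; neither assumption is stated on the slide, and the function names are hypothetical.

```python
import math

SCAN_RATE = 1800        # pages per person, from the slide
SCANNERS = 5            # five Minolta scanners
SHIFTS_PER_DAY = 2      # two shifts, seven days a week


def daily_scan_capacity(operators_per_scanner: int = 1) -> int:
    """Pages scanned per day, assuming every scanner is staffed in every shift."""
    return SCAN_RATE * SCANNERS * SHIFTS_PER_DAY * operators_per_scanner


def days_needed(pages: int, rate_per_person: int, people: int) -> int:
    """Whole days needed for `people` workers to handle `pages` at a given rate."""
    return math.ceil(pages / (rate_per_person * people))


print(daily_scan_capacity())
# Hypothetical scenario: 10 people doing Arabic OCR on one million pages.
print(days_needed(1_000_000, 1500, 10))
```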

OCR Image to Text

8 OCR - Arabic
• Poses unique challenges
  – Written cursively, with blocks of connected characters
  – A block of characters can have more than one baseline
  – Uses external marks such as dots, hamza, and madda
  – Diacritization
  – Characters can have more than one shape, depending on their position
  – Overlapping makes it difficult to determine the spacing
• Sakhr Automatic Reader is used
• Tricky with old books
• Requires learning
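
Two of the challenges above, position-dependent letter shapes and diacritics, are visible directly in Unicode. The sketch below uses only Python's standard `unicodedata` module; it illustrates the script properties and is not a description of how the Sakhr engine works. The helper `strip_diacritics` is a hypothetical name.

```python
import unicodedata

# The letter BEH (U+0628) has four positional shapes in the
# Arabic Presentation Forms-B block: isolated, final, initial, medial.
BEH_FORMS = "\uFE8F\uFE90\uFE91\uFE92"

for ch in BEH_FORMS:
    # NFKC normalization folds every positional shape back to the base letter.
    base = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(base):04X}")


def strip_diacritics(text: str) -> str:
    """Drop all nonspacing combining marks (category Mn), e.g. fatha or kasra.

    This is deliberately broad: it removes every Mn character, not only
    the short-vowel marks.
    """
    return "".join(c for c in text if unicodedata.category(c) != "Mn")
```

One glyph-level consequence is that an OCR engine has to map several visual shapes to a single character code, and decide whether small marks above or below the line are dots, diacritics, or noise.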

9 Arabic Script Is Cursive

10 Old, Smudgy, and Stuck Together

11 Use of Diacritics

12 Sixteen Font Groups

13 Evaluation of VERUS and AR
• Research agreement with NovoDynamics
• Preliminary evaluation on two data sets is promising
  – Challenge: difficult to OCR, degraded images
  – Normal: known to return acceptable accuracy

Encoding Image on Text

15 Image-on-Text
• Multilayered:
  – Visible page image
  – Hidden OCR text
• View the exact original layout while searching and highlighting
• Supported by some OCR suites only
• Supported formats: DjVu and PDF
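
The two-layer idea can be modelled as a page image plus a list of OCR words with their positions on that image. The sketch below is a simplified, hypothetical data model (the classes `Word` and `Page` and the file name are illustrative only); real DjVu and PDF files store the hidden text in their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    box: tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels


@dataclass
class Page:
    image_path: str    # visible layer: the scanned page image
    words: list[Word]  # hidden layer: OCR text with word positions

    def find(self, query: str) -> list[tuple[int, int, int, int]]:
        """Return the boxes to highlight on the image for an exact word match."""
        q = query.casefold()
        return [w.box for w in self.words if w.text.casefold() == q]


page = Page(
    "page_0001.tif",
    [Word("Million", (10, 10, 90, 30)), Word("Book", (100, 10, 150, 30))],
)
print(page.find("book"))
```

The reader only ever sees the image; the search runs on the hidden words, and the returned boxes tell the viewer where to draw the highlight.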

16 Quality Assurance
• No missing cover or pages
• All pages are in order
• Text quality
• Image quality
• PDF quality
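
The first two checks (no missing pages, pages in order) could be automated along the following lines. This is a minimal sketch that assumes page files carry their page number in the file name, such as `p0001.tif`; the naming scheme and the function `check_page_sequence` are hypothetical and not taken from the presentation.

```python
import re


def check_page_sequence(filenames: list[str], expected_pages: int) -> dict:
    """Check a book's page files, given in the order they appear in the book.

    Assumes each name contains the page number, e.g. 'p0001.tif'.
    """
    numbers = [int(re.search(r"(\d+)", name).group(1)) for name in filenames]
    missing = sorted(set(range(1, expected_pages + 1)) - set(numbers))
    in_order = numbers == sorted(numbers)
    return {"missing": missing, "in_order": in_order}


print(check_page_sequence(["p0001.tif", "p0003.tif", "p0002.tif"], 4))
```

Text, image, and PDF quality are harder to reduce to a rule like this and would typically still need sampling by a human reviewer.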

DAR Digital Assets Repository

18 System Architecture

19-22 DAK Publishing Module

26 Transfer of Digitized Books
• Challenges
  – Storage: CD vs. online
  – Bandwidth: 10 Mbps vs. 155 Mbps
  – Copyright: not published
• Actions:
  – Transferred 8,500+ books to the Internet Archive
  – Process is still ongoing
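
The bandwidth challenge is easier to appreciate with a back-of-the-envelope calculation. The sketch below assumes an ideal, fully utilized link with no protocol overhead and decimal units (1 GB = 10^9 bytes); the 1,500 GB figure is the total archived data from the statistics slide, used here only as an example payload, and `transfer_days` is a hypothetical helper.

```python
def transfer_days(size_gb: float, link_mbps: float) -> float:
    """Ideal transfer time in days for `size_gb` gigabytes over a `link_mbps` link."""
    bits = size_gb * 1e9 * 8
    seconds = bits / (link_mbps * 1e6)
    return seconds / 86_400


for mbps in (10, 155):
    print(f"{mbps} Mbps: {transfer_days(1500, mbps):.1f} days")
```

Even under these optimistic assumptions the slower link needs roughly two weeks for the full archive, while the faster one needs less than a day.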

Books From India Towards better collaboration

28 Books From India

Language          Number of Books
Arabic                        832
Arabic + French                 3
Arabic + German                 1
Persian                       101
French                          2
English                         1
Spanish                         1
German                          1
Total                         942

29 Progress

Phase Name   Done as of November 1, 2006   Expected to be finished by   Comments
Cataloging                                                              have metadata problems
Processing   742                           November 20, 2006
OCRing       200                           March 1, 2007
Encoding     171                           --
Publishing   171                           --

30 Metadata Problems

31 Processing

32 OCR Using VERUS or AR?
• Calculated accuracy for a small sample
  – Images processed once with darkening effect and once without
  – VERUS likes darkening, AR does not
  – Overall, AR won in 70% of cases
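
The slide does not say how accuracy was calculated. One common approach, shown here purely as an assumption, is character accuracy based on edit distance against a hand-corrected ground truth, followed by a per-page comparison of the two engines. All function names below are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_accuracy(ground_truth: str, ocr_output: str) -> float:
    """1 minus the edit distance normalized by ground-truth length, floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1 - errors / len(ground_truth))


def win_rate(scores_a: list[float], scores_b: list[float]) -> float:
    """Share of pages on which engine A scores strictly higher than engine B."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    return wins / len(scores_a)
```

With per-page scores for each engine, once with and once without the darkening step, `win_rate` gives the kind of "won in X% of cases" figure quoted above.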

42 Thank You