Million Book Bibliotheca Alexandrina Youssef Eldakar 19 November 2006.

Presentation transcript:

Million Book Bibliotheca Alexandrina Youssef Eldakar 19 November 2006

BA Digitization Workflow

Image Processing: Image to Better Image

Image Processing Sequence
ScanFix (automated processing):
- Deskew
- Despeckle
- Rotation
Adobe Photoshop (manual processing):
- Noise removal
- Black edge removal
- Page resize
- Center text to page
ScanFix (automated processing):
- Enhance text quality (Grow & Erode)
ACDSee (automated processing):
- Renaming files
- File compression (CCITT Group 4)


ScanFix: Deskew (before and after)

ScanFix: Despeckle (before and after)

ScanFix: Rotation
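
To make the despeckle step above concrete, here is a minimal sketch in plain Python. It is not ScanFix's algorithm, which the slides do not describe. It assumes a black-and-white page held as a list of rows with 1 for ink, and it treats any 8-connected ink blob of at most `max_speck_size` pixels as a speck; both the representation and the threshold are illustrative assumptions.

```python
from collections import deque


def despeckle(image, max_speck_size=2):
    """Return a copy of a binary image (1 = ink) with small ink blobs removed.

    Assumption: a "speck" is any 8-connected ink component whose pixel
    count is at most max_speck_size.
    """
    rows, cols = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects the whole connected component.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the component only if it is small enough to be noise.
            if len(component) <= max_speck_size:
                for y, x in component:
                    result[y][x] = 0
    return result


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],  # isolated speck
    [0, 0, 0, 1, 1, 1],  # part of a real stroke
    [0, 0, 0, 1, 1, 1],
]
cleaned = despeckle(page)
```

In this toy page the single pixel on the second row is erased while the six-pixel stroke is kept. A threshold that is too high would start eating dots and diacritics, which matter for Arabic text.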


Photoshop: Noise removal (before and after)

Photoshop: Black edge removal (before and after)

Photoshop: Page resize

Photoshop: Center text to page (before and after)


ScanFix: Enhance text quality with Grow and Erode, horizontal or vertical (before and after)
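
Grow and Erode correspond to the standard morphological operations of dilation and erosion. The sketch below illustrates the idea on a binary image (1 = ink) with a three-pixel neighbourhood along the chosen axes. The function names, the neighbourhood size, and the choice to treat pixels outside the image as background are assumptions for illustration, not ScanFix settings.

```python
def _morph(image, horizontal, vertical, combine):
    """Apply `combine` (any = grow, all = erode) over each pixel's neighbourhood."""
    rows, cols = len(image), len(image[0])
    offsets = [(0, 0)]
    if horizontal:
        offsets += [(0, -1), (0, 1)]
    if vertical:
        offsets += [(-1, 0), (1, 0)]

    out = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            # Pixels outside the image count as background (assumption).
            values = [
                image[r + dr][c + dc] == 1
                if 0 <= r + dr < rows and 0 <= c + dc < cols else False
                for dr, dc in offsets
            ]
            new_row.append(1 if combine(values) else 0)
        out.append(new_row)
    return out


def grow(image, horizontal=True, vertical=True):
    """Dilation: a pixel becomes ink if it or any chosen neighbour is ink."""
    return _morph(image, horizontal, vertical, any)


def erode(image, horizontal=True, vertical=True):
    """Erosion: a pixel stays ink only if it and all chosen neighbours are ink."""
    return _morph(image, horizontal, vertical, all)


# A stroke with a one-pixel break in the middle.
broken = [[0, 1, 1, 0, 1, 1, 0]]
repaired = erode(grow(broken, vertical=False), vertical=False)
```

Growing horizontally and then eroding horizontally closes the one-pixel break without making the stroke longer, which is the kind of repair that helps on broken or dot-matrix print.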


ACDSee: Renaming files

ACDSee: Compression to TIFF (CCITT Group 4)

OCR: Image to Text

OCR - Arabic
- Poses unique challenges
  – Written cursively, with blocks of connected characters
  – A block of characters can have more than one baseline
  – Uses external objects such as dots, 'Hamza' and 'Madda'
  – Diacritization
  – Characters can have more than one shape according to their position
  – Overlapping makes it difficult to determine the spacing
- Sakhr Automatic Reader is used
- Tricky with old books
- Requires learning

Arabic Script Is Cursive

Old, Smudgy, and Stuck Together

Use of Diacritics

Pre-OCR Text Enhancement
- Condition of Arabic printings varies
  – Old/new
  – Light/heavy
  – Solid/dot-matrix
- ScanFix's smoothing and completion features improve recognition accuracy
- Separate from the actual processing phase
  – Must be tested under OCR right away
  – OCR specialists have a better feel for "good text"

Text Repair in ScanFix

Font Libraries
- Improvement of Arabic OCR results through
  – Tweaking of OCR engine settings
  – Learning
- Libraries for different fonts have been built to achieve higher recognition rates
- Databases of character glyphs that describe a particular type of script and improve OCR accuracy
- Built on a carefully selected and classified high-variety set of scanned images from a batch of about 1000 books, which boiled down to 15 font groups

Font Classification
- Classification criteria:
  – Script type
      TA: Traditional Arabic
      AR: Arabic Transparent
      DT: Deco type Naskh and Deco type Naskh extension
  – Printing quality: High (H), Medium (M), and Low (L)
  – Font size: 1 (largest) to 5 (smallest)
- "Group X": virtual font to tag unclassifiable printings and handwriting
- Minimum accuracy number assigned to each group based on testing results
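
The three classification criteria above can be modelled as a small record with a fallback to Group X. The sketch below is only an illustration. The label format `TA-H-1` and the names `FontClass` and `classify` are hypothetical, and it does not reproduce how the criteria were reduced to the font groups reported in the slides.

```python
from dataclasses import dataclass

SCRIPT_TYPES = {"TA", "AR", "DT"}   # codes listed on the slide
PRINT_QUALITIES = {"H", "M", "L"}   # High, Medium, Low
FONT_SIZES = range(1, 6)            # 1 (largest) to 5 (smallest)
GROUP_X = "X"                       # unclassifiable printings and handwriting


@dataclass(frozen=True)
class FontClass:
    script: str
    quality: str
    size: int

    def label(self):
        # Hypothetical label format; the slides do not show the real one.
        return f"{self.script}-{self.quality}-{self.size}"


def classify(script, quality, size):
    """Return a label for a valid combination, or Group X otherwise."""
    if (script in SCRIPT_TYPES and quality in PRINT_QUALITIES
            and size in FONT_SIZES):
        return FontClass(script, quality, size).label()
    return GROUP_X


print(classify("TA", "H", 1))
print(classify("handwritten", "L", 3))
```

A per-group minimum accuracy, as the last bullet describes, could be kept in a dictionary keyed by these labels.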

16 Font Groups

Learning
- Train the engine on two representative pages of the book to build upon an initial font file picked from a set of pre-built font libraries
- Use a different page to manually calculate OCR accuracy before and after learning
- Batch OCR the book using the learned font file and save to ART
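
The slides do not say how accuracy is calculated. One common choice, assumed here, is character accuracy: one minus the edit distance between the OCR output and a hand-typed reference, divided by the reference length. The sketch below applies it to the same held-out page before and after learning; the sample strings are invented for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Fraction of reference characters recovered, floored at zero."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# Invented sample: the same test page recognized before and after learning.
reference = "digital library"
before = character_accuracy(reference, "dlgital libnary")
after = character_accuracy(reference, "digital library")
print(f"before learning: {before:.2%}, after learning: {after:.2%}")
```

The functions work on any Unicode strings, so the same measure applies to Arabic text. Using a page that was not part of the training set, as the slide prescribes, keeps the comparison honest.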

Learning in Sakhr's Automatic Reader

VERUS from NovoDynamics
- Preliminary evaluation on two data sets is promising
  – Challenge: difficult to OCR, degraded images
  – Normal: known to return acceptable accuracy
- No learning capabilities, no human operators
- VERUS uses an XML format to store recognition data
- BA and NovoDynamics entered into a research agreement

Evaluation of VERUS and AR

Encoding: Image on Text

Challenges in Publishing
- Preservation of layout
- Searchability of content and metadata
- Efficient image compression
- Easy browsing of books
- Accommodating low-bandwidth users
- Multilingual text support
- Multipaging

Image-on-Text
- Multilayered:
  – Visible page image
  – Hidden OCR text
- View exact original layout while searching and highlighting
- Supported by some OCR suites only
- Supported formats: DjVu and PDF
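
The image-on-text idea can be pictured as a page object that pairs the visible image with hidden words, each carrying its position on that image. The sketch below is a simplified model, not the DjVu or PDF structure. The class names, the pixel-box convention, and the sample file name are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels


@dataclass
class Word:
    text: str
    box: Box


@dataclass
class Page:
    image_path: str                                   # visible layer
    words: List[Word] = field(default_factory=list)   # hidden OCR layer

    def find(self, query):
        """Return the boxes to highlight for words containing the query."""
        needle = query.casefold()
        return [w.box for w in self.words if needle in w.text.casefold()]


page = Page(
    image_path="page_0001.tif",  # hypothetical file name
    words=[
        Word("Digital", (40, 30, 120, 52)),
        Word("Library", (128, 30, 210, 52)),
        Word("of", (218, 30, 240, 52)),
        Word("Alexandria", (248, 30, 370, 52)),
    ],
)
hits = page.find("library")
```

A viewer would draw the page image unchanged and overlay a highlight on each returned box, so the reader sees the original layout while the search runs against the hidden text.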

UDBE
- Universal Digital Book Encoder
- A framework for integrating many OCR engines and supporting many target formats into a system for encoding image-on-text documents for publishing
- Built around a Common OCR Format (COF), which makes this integration possible
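
One way to picture a framework built around a common format is a registry of readers, one per OCR engine, and writers, one per target format, that only ever meet through the shared structure. The sketch below is a guess at the shape of such a design, not UDBE's code or the real COF schema. `CofPage`, `CofWord`, the toy tab-separated input, and the plain-text writer are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class CofWord:
    text: str
    box: Tuple[int, ...]  # (left, top, right, bottom)


@dataclass
class CofPage:
    number: int
    words: List[CofWord]


class BookEncoder:
    """Routes engine output through the common format to a target format."""

    def __init__(self):
        self.readers: Dict[str, Callable[[str], List[CofPage]]] = {}
        self.writers: Dict[str, Callable[[List[CofPage]], str]] = {}

    def register_reader(self, engine, reader):
        self.readers[engine] = reader

    def register_writer(self, target, writer):
        self.writers[target] = writer

    def encode(self, engine, raw_output, target):
        pages = self.readers[engine](raw_output)  # engine output -> common format
        return self.writers[target](pages)        # common format -> target


def read_toy_tsv(raw):
    """Toy engine output: one 'page<TAB>word<TAB>l,t,r,b' line per word."""
    pages = {}
    for line in raw.strip().splitlines():
        number, text, coords = line.split("\t")
        box = tuple(int(v) for v in coords.split(","))
        pages.setdefault(int(number), []).append(CofWord(text, box))
    return [CofPage(n, words) for n, words in sorted(pages.items())]


def write_plain_text(pages):
    """Toy target format: one line of text per page."""
    return "\n".join(" ".join(w.text for w in p.words) for p in pages)


encoder = BookEncoder()
encoder.register_reader("toy-tsv", read_toy_tsv)
encoder.register_writer("plain-text", write_plain_text)

raw = "1\tDigital\t0,0,50,10\n1\tLibrary\t55,0,110,10\n2\tAlexandria\t0,0,80,10"
text = encoder.encode("toy-tsv", raw, "plain-text")
```

With this shape, adding an engine or a target format means writing one adapter rather than one converter per engine and format pair.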

Performance – Arabic B&W

Performance – Latin B&W

Quality Assurance

Q/A - Common Errors
- No missing cover or pages
- All pages are in order
- Text quality
- Image quality
- Page quality
- PDF quality


Q/A - Common Errors: Text quality
- Pale text
- Toothed text
- Curved text


Q/A - Common Errors: Page quality
- Cut pages
- Fingers
- Noise and page edges
- Page size
- Skew

Q/A - Common Errors: PDF quality
- Image on text
- Searching hits
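
The first two checks, no missing pages and all pages in order, lend themselves to automation. The sketch below assumes each scanned file has already been mapped to a page number and that the expected page count is known from the catalog record. The function name and the report layout are illustrative.

```python
from collections import Counter


def check_page_sequence(page_numbers, expected_count):
    """Report missing, duplicated and out-of-order pages for one book."""
    expected = set(range(1, expected_count + 1))
    counts = Counter(page_numbers)
    return {
        "missing": sorted(expected - set(page_numbers)),
        "duplicates": sorted(n for n, count in counts.items() if count > 1),
        # Strictly increasing means every page follows a lower-numbered one.
        "in_order": all(a < b for a, b in zip(page_numbers, page_numbers[1:])),
    }


report = check_page_sequence([1, 2, 4, 3, 3], expected_count=5)
print(report)
```

Checks such as pale text, fingers in the frame, or skew still need a human eye or image analysis; this only covers the bookkeeping part of the list.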

DAR: Digital Assets Repository

System Architecture

DAK - Metadata
- Descriptive metadata
- Administrative metadata
- Technical metadata

DAK Publishing Module
- Provides access to the repository content through search and browse facilities
- Multilingual full-text search

DAK Publishing Module
- Functionalities
  – Browse the repository contents by Collection, Subject, Creator and Title
  – Search content by an indexed metadata field
  – Multilingual full-text search using both exact and morphological matching
- Functionalities (cont'd)
  – Display brief record information
  – Display full record information with links to digital objects
  – Display MARC and DC format
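
The difference between exact and morphological matching in full-text search can be shown with a toy example. The sketch below is not the DAK search engine. `toy_stem` is a crude English suffix stripper standing in for a real morphological analyzer, which for Arabic would have to handle prefixes, suffixes and root patterns.

```python
SUFFIXES = ("ing", "es", "s")


def toy_stem(token):
    """Strip one common English suffix; a stand-in for real morphology."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def search(documents, query, morphological=False):
    """Return ids of documents containing the query word.

    Exact mode compares whole lowercased tokens; morphological mode
    compares their stems instead.
    """
    key = toy_stem if morphological else (lambda token: token)
    wanted = key(query.casefold())
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(key(token) == wanted for token in text.casefold().split())
    ]


documents = {
    "b1": "digitizing old books",
    "b2": "the book of optics",
}
exact_hits = search(documents, "book")
morph_hits = search(documents, "book", morphological=True)
```

Exact matching finds only the document containing "book", while morphological matching also finds the one containing "books". Offering both lets a user choose between precision and recall.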


DAR: Future Work
- Consider MODS and METS standards in the new system data model
- Enhance the functionalities of the Books Viewer with more security and copyright management
- Join the open source community by building DAR modules with open source technologies and languages
- Provide support for the currently available digital library interoperability protocols

Books from India: Towards Better Collaboration

Books From India

Language          Number of Books
Arabic            832
Arabic + French   3
Arabic + German   1
Persian           101
French            2
English           1
Spanish           1
German            1
Total             942

Progress

Phase Name   Done as of November 1, 2006   Expected to be finished by   Comments
Cataloging                                                              have metadata problems
Processing   742                           November 20, 2006
OCRing       200                           March 1, 2007
Encoding     171                           --
Publishing   171                           --

Metadata Problems

Processing

OCR Using VERUS or AR?
- Calculated accuracy for a small sample
  – Images processed once with darkening effect and once without
  – VERUS likes darkening, AR does not
  – Overall, AR won 70% of cases
