BCE-Arabic-v1 dataset: Towards interpreting Arabic document images for people with visual impairments
R.S. Saad, R.I. Elanwar, N.S. Abdel Kader, S. Mashali, and M. Betke
Contents
- Reading challenge for Arabs with visual impairments (VI)
- Solving document inaccessibility
- Resources needed
- The BCE-Arabic benchmark project
- Case study: BCE-Arabic dataset annotation
Reading Challenge
People with visual impairments (VI) need text interpretation daily: education, shopping, health care, navigation, etc.

People with VI need to recognize text on the objects they deal with every day: printed text in books and newspapers, text on signs and street ads, text on cans and medication bottles, and so on. Most printed text is not transcribed into Braille, and not every person with VI is Braille literate, so interpreting printed text into natural language is a must. Daily access to printed text therefore requires reliable assistive technology with scanning and/or image-capture capability, together with a high-accuracy OCR solution for the text.
Reading Challenge
An MIT Media Lab study of English OCR products found:
- Unsatisfactory OCR accuracy
- Unsatisfactory processing speed
- Poor performance on curved surfaces and fragmented text

What makes OCR challenging? An ongoing MIT Media Lab project showed that users with VI are dissatisfied with the word-recognition accuracy and processing speed of English OCR software and would like cutting-edge tools for reading fragmented text or text on curved surfaces. Text access is even harder when the text is stored in a file that is not digitally derived (i.e., it was digitized by a scanner), contains non-text elements (such as photos), or has a complex layout (tables, multiple columns), decorative backgrounds (watermarks), or decayed paper material (cuts and stains).
Reading Challenge: What about Arabic?
- Millions with significant visual impairment
- Problems with assistive technology (AT): some products do not support Arabic, performance is poor compared to English, and AT prices are high
- The problem also affects speakers of other languages written in the Arabic script
- Need for digitization solutions

The situation is much direr for individuals with visual impairments in the Arabic-speaking world. No exact recent census could be found, but the Arab world (22 countries) has around 7 million people with VI. Natural-language-processing solutions such as OCR, text-to-speech, and machine translation generally perform poorly for Arabic compared to Latin-script languages, English in particular. The Arabic script serves not only individuals in the Arab world but also speakers of Urdu, Persian, Pashto, Kurdish, and Jawi. The Arab world is also trying to increase Arabic content on the internet by digitizing and archiving both contemporary data and documentation of ancient heritage, all of which needs to be accessible to people with VI.
Reading Challenge: Alternatives?
- Offline help: audio-book recording
- Online help: crowdsourced question answering
- Pros: quick and free
- Cons: cannot cover daily needs; require internet coverage; demand technical skills; limited language support

The lack of satisfactory text-access solutions has led to a proliferation of temporary alternatives based on human collaboration, which offer quick, no-cost help in everyday situations. Offline help takes the form of individual and group volunteering to record audio books; online help takes the form of crowd collaboration for visual question answering. Although such alternatives respond quickly and at no cost, the cons are many: they cannot meet the daily need to access a printed book or newspaper; recipients may not have internet coverage at an affordable cost or the technical skills to use online collaboration; and, on top of all this, their language might not be supported.
Document Inaccessibility
Developing intelligent systems for document image analysis is a must. "The objective of document image analysis is to recognize the text and graphics components in images of documents, and to extract the intended information as a human would." — R. Kasturi, L. O'Gorman, and V. Govindaraju (2002)
Document Inaccessibility
Document images coming from a scanner are usually stored in PDF format, and PDFs are either accessible or inaccessible. Accessibility features:
- Interpreting textual content for users with disabilities via AT (a text stream is available)
- Searching and navigating through the content (tags are available)
Document Inaccessibility
PDF types:
1. Fully-tagged, digitally-born PDFs
2. Untagged searchable scanned PDFs
3. Untagged unsearchable scanned (raster-image) PDFs

Untagged unsearchable scanned PDFs (raster images) are inaccessible; they need both document layout analysis (DLA) and OCR.

1. The first category includes formatted text-and-graphics PDF files, which are fully tagged and carry both layout and text information. They originate from word-processing software and are converted to PDF format.
2. The second category consists of searchable image PDFs: scanned copies with a hidden text layer produced by an OCR engine for accessibility, but with no information about the layout.
3. The third category includes raster-image PDFs: scanned copies of documents (i.e., images of text) that contain no tags and no associated text.

Raster-image PDFs need DLA to identify the textual content in the image and OCR to interpret that content into natural language accessible to VI assistive technology.
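The three categories reduce to a simple decision rule over two properties: whether structure tags are present and whether a text layer is present. A minimal sketch, with a function name and boolean inputs that are illustrative rather than from the paper:

```python
def classify_pdf(has_tags: bool, has_text_layer: bool) -> str:
    """Map a PDF's accessibility features to one of the three categories."""
    if has_tags and has_text_layer:
        # Born-digital: layout and text information both available.
        return "fully-tagged digitally-born"
    if has_text_layer:
        # Scanned, with a hidden OCR text layer but no layout tags.
        return "untagged searchable scan"
    # Scanned image of text with no tags and no text: needs DLA + OCR.
    return "untagged unsearchable (raster-image) scan"
```

In practice the two flags could be derived with a PDF library, for example by checking whether text extraction from a page returns any characters.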
Document Inaccessibility
Importance of DLA (pilot study): We tested the performance of an open-source OCR engine (Tesseract) (1) on a raster-image PDF directly and (2) on a pre-segmented image (mimicking the DLA function). Observation: the inaccessibility of Arabic image documents is mostly due to the OCR engine's failure to understand the page layout.
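The pilot setup can be sketched with the pytesseract wrapper around Tesseract: run OCR once on the raw page, and once per ground-truth text region (mimicking a DLA stage). This is an illustrative sketch, not the paper's code; the page path and region boxes are hypothetical, and Tesseract with the Arabic traineddata (`lang="ara"`) must be installed for the OCR calls to work.

```python
def box_size(box):
    """Width and height of a (left, top, right, bottom) region box."""
    left, top, right, bottom = box
    return (right - left, bottom - top)

def ocr_with_and_without_dla(page_path, region_boxes, lang="ara"):
    """OCR a page image directly, then per ground-truth region (mimicking DLA)."""
    # Imported lazily so box_size stays usable without the OCR stack installed.
    from PIL import Image
    import pytesseract

    page = Image.open(page_path)
    # (1) Without DLA: feed the whole raster page to the OCR engine.
    without_dla = pytesseract.image_to_string(page, lang=lang)
    # (2) With DLA: OCR each non-empty ground-truth region crop separately.
    crops = [page.crop(box) for box in region_boxes if min(box_size(box)) > 0]
    with_dla = "\n".join(pytesseract.image_to_string(c, lang=lang) for c in crops)
    return without_dla, with_dla
```

Comparing the two returned strings against the ground-truth text reproduces the with/without-DLA comparison of the pilot study.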
Document Inaccessibility
DLA importance (pilot study) Sample images with ground-truth text regions (left) were interpreted by the Tesseract OCR engine without DLA (middle) and with DLA (right) support. Without DLA, Tesseract could not recognize most words (black). With DLA, Tesseract understood most words correctly (red) or with at most 2 character errors (green). Occasionally Tesseract became stuck interpreting a word (blue). These results suggest that the inaccessibility of Arabic image documents is in large part due to the failure of the OCR engine to understand the page layout of the documents.
Resources Needed
Current status of available DLA research datasets:
1. Available DLA research datasets are small and often private.
2. Benchmarking datasets for English are much larger than those for other languages (Arabic, Spanish, Chinese, etc.).
3. Benchmarking datasets for OCR (text only) are more readily available than those for DLA (text and non-text elements).
4. Datasets are application-dependent (printed vs. handwritten, newspapers vs. forms vs. books, etc.).
Resources Needed
Development and evaluation of intelligent DLA systems still require a large number of labeled document samples. But ground truthing (labeling), whether manual or semi-automatic, is expensive, time-consuming, and application-dependent.
Resources Needed
Many Arabic documents (scanned books and journals) are available on the internet, but they suffer from limited layouts, low quality (black-and-white scans), copyright restrictions, watermarks, illegal uploads, and predominantly religious content:
1. The layout variability of web documents is limited.
2. The document images are of low quality due to low-resolution black-and-white scanning intended to minimize upload file size.
3. Most of the books are copyrighted and access-restricted (e.g., viewable but not downloadable).
4. Many images contain digital-library watermarks added by publishers or libraries.
5. It is not easy to discern whether an upload infringes copyright.
6. Most uploaded Arabic image content is religious in nature and uses a script with a multitude of diacritical marks (small glyphs used as phonetic guides). Documents on science, literature, or art do not use diacritics and are therefore easier to annotate.
BCE-Arabic benchmark project
An ongoing collaboration between team members from Boston University, Cairo University, and the Electronics Research Institute. The work was partially funded by a National Science Foundation grant (to M.B.) and the Cairo Initiative Scholarship Program (to R.E.).
BCE-Arabic benchmark project
Project objectives: (1) accelerate the development of automated DLA solutions for Arabic document images, and (2) help with benchmarking and comparative evaluation of research efforts.
BCE-Arabic benchmark project
Method:
1. Sample selection: choosing a representative variety of document content (text and non-text) that suits DLA research needs
2. Metadata and representation: deciding the best representation (ground-truth metadata, storage format, and hierarchy) for the samples
3. Acquisition: acquiring samples under different conditions
4. Annotation: labeling the acquired samples in the selected representation
BCE-Arabic benchmark project
Milestones:
- Phase 1: Launch the pilot version, BCE-Arabic v1 (with some limitations)
- Phase 2: Massive acquisition of new samples (large variety)
- Phase 3: Crowdsourcing for annotation
- Phase 4: Constructing a searchable sample database for user customization
BCE-Arabic v1 specs
- 1,850 images of book pages in total, from 180 books by the same publisher
- Fair-use scanning of the available layouts at 400 dpi resolution (grayscale)
- Stored in raster-image PDF format
- Layout annotations available; text ground truth in progress
- Annotations given in PAGE XML representation, produced with the Aletheia tool

Sample counts across layouts are not uniform because of availability constraints. Under fair use, random pages were scanned per book so as not to violate ownership rights.
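PAGE XML stores each region as a polygon of points. A minimal reader using only the standard library might look like the sketch below; the namespace URL assumes the 2013-07-15 revision of the PAGE schema (Aletheia can emit several revisions), and the sample document is constructed for illustration:

```python
import xml.etree.ElementTree as ET

# PAGE XML namespace; the schema date is an assumption, not from the paper.
PC = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"

def parse_text_regions(page_xml: str):
    """Return a list of (region_id, [(x, y), ...]) for every TextRegion."""
    root = ET.fromstring(page_xml)
    regions = []
    for region in root.iter(f"{{{PC}}}TextRegion"):
        coords = region.find(f"{{{PC}}}Coords")
        points = [tuple(map(int, pt.split(",")))
                  for pt in coords.get("points").split()]
        regions.append((region.get("id"), points))
    return regions

# A tiny hand-written PAGE file with one rectangular text region.
sample = f"""<PcGts xmlns="{PC}">
  <Page imageFilename="page_001.png" imageWidth="2480" imageHeight="3508">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,400 100,400"/>
    </TextRegion>
  </Page>
</PcGts>"""
```

Calling `parse_text_regions(sample)` yields one region, `"r1"`, with its four corner points, which is the information a DLA benchmark needs to score predicted zones against the ground truth.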
BCE-Arabic v1 specs: 1,850 samples
a) 1,235 images containing only text, whose components are titles, page headers, body text, footers, footnotes, and captions in various font sizes and with a range of formats
b) 383 pages with text and images
c) 179 pages with text and graphic elements (charts and diagrams)
d) 24 pages with text and tables
e) 29 images with text in mixed single and double columns
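The layout breakdown can be sanity-checked in a few lines; the category counts below are taken from the slide, and they do sum to the stated 1,850-image total:

```python
# Per-layout sample counts of BCE-Arabic v1, as reported on the slide.
layout_counts = {
    "text only": 1235,
    "text + images": 383,
    "text + charts/diagrams": 179,
    "text + tables": 24,
    "mixed single/double columns": 29,
}

total = sum(layout_counts.values())            # 1850
share_text_only = layout_counts["text only"] / total
```

Roughly two-thirds of the samples are text-only pages, which reflects the availability constraint mentioned above rather than a design choice.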
Case study: BCE-Arabic dataset annotation
We compared six state-of-the-art document annotation tools, on documents at different levels of layout complexity, regarding:
(1) support for segmenting image regions of interest,
(2) support for metadata annotation (definition and entry),
(3) annotation time consumed,
(4) the resulting output format (representation), and
(5) ease of use.
Case study: BCE-Arabic dataset annotation
The tools used by DLA researchers are MS Paint, PixLabeler, DIVADIA, GEDI, TrueViz, and Aletheia. Our results indicate that GEDI, Aletheia, and TrueViz share many annotation and zoning features that make them the preferred tools for ground truthing documents. In terms of tool flexibility, average annotation time, and metadata availability and representation, Aletheia outperformed its peers, so we annotated the dataset with it.
Finally, Phase 2 of the BCE-Arabic project is in progress and will extend the work described in this paper. We share our image dataset and annotations, BCE-Arabic v1, with the research community to support applications and future extensions of this work, and we hope that the current and future versions of BCE-Arabic will serve the community of researchers building solutions to assist people with visual impairments.