Pre-SWOT Report. Printed Arabic OCR Dr. Mohamed El-Mahallawy Eng. Hesham Osman Eng. Rana Abdou Dr. Mohamed Waleed Fakhr Dr. Mohsen Rashwan.

Slides:



Advertisements
Similar presentations
1 of 18 Information Dissemination New Digital Opportunities IMARK Investing in Information for Development Information Dissemination New Digital Opportunities.
Advertisements

Review of AI from Chapter 3. Journal May 13  What advantages and disadvantages do you see with using Expert Systems in real world applications like business,
Part 1: Basic Fashion and Business Concepts
StormingForce.com Motion. StormingForce.com StormingForce’s technology is significantly increasing productivity and quality of manual repetitive tasks.
1 Egyptian Ministry of Communications and Information Technology Research and Development Centers of Excellence Initiative Data Mining and Computer Modeling.

25 Mar 10 – WDM 204 – Session Two. Cape Area Management Program (CAMP) Sponsored by the Cape & Islands Workforce Investment Board.
ELPUB 2006 June Bansko Bulgaria1 Automated Building of OAI Compliant Repository from Legacy Collection Kurt Maly Department of Computer.
Input to the Computer * Input * Keyboard * Pointing Devices
Pre-SWOT Report. Online Handwritten Arabic OCR (Online Handwritten Recognition: OHR) Dr. Ashraf Al-Marakby Eng. Hesham Osman Eng. Randa Al-Anwar Dr. Mohamed.
Pre-SWOT Report. Offline Handwritten Arabic OCR (Intelligent Character Recognition ICR) Dr. Mohamed El-Mahallawy Eng. Hesham Osman Eng. Rana Abdou Dr.
1 PAPER : PRINTING AND SCANNING Presented by: NaziaMohammedAnil.
SESSION 10 MANAGING KNOWLEDGE FOR THE DIGITAL FIRM.
Artificial Neural Networks (ANNs)
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
California Car License Plate Recognition System ZhengHui Hu Advisor: Dr. Kang.
8 Systems Analysis and Design in a Changing World, Fifth Edition.
Iris Recognition By Mohammed, Ashfaq Ahmed. Introduction Iris Recognition is a Biometric Technology which deals with identification based on the human.
Advanced Workgroup System. RED Advanced Workgroup Systems: Scan Features Copy Print Scan DNSG Software Our Customers Documents Our Customers Documents.
Database Design IST 7-10 Presented by Miss Egan and Miss Richards.
Census Data Capture Challenge Intelligent Document Capture Solution UNSD Workshop - Minsk Dec 2008 Amir Angel Director of Government Projects.
ادارة الوثائق الالكترونية Naji Shukri Alzaza University of Palestine February 2010.
Evaluating the use of OCR on a Mobile Device Presented by : Hamed Alharbi Supervisor by :Dr Brett Wilkinson.
IIIT HyderabadUMASS AMHERST Robust Recognition of Documents by Fusing Results of Word Clusters Venkat Rasagna 1, Anand Kumar 1, C. V. Jawahar 1, R. Manmatha.
A Search Engine for Historical Manuscript Images Toni M. Rath, R. Manmatha and Victor Lavrenko Center for Intelligent Information Retrieval University.
At the North of England Institute of Mining and Mechanical Engineers Library, Newcastle upon Tyne.
TERMS TO KNOW. Programming Language A vocabulary and set of grammatical rules for instructing a computer to perform specific tasks. Each language has.
Gesture Recognition Using Laser-Based Tracking System Stéphane Perrin, Alvaro Cassinelli and Masatoshi Ishikawa Ishikawa Namiki Laboratory UNIVERSITY OF.
Topics Covered: Data preparation Data preparation Data capturing Data capturing Data verification and validation Data verification and validation Data.
Slide Image Retrieval: A Preliminary Study Guo Min Liew and Min-Yen Kan National University of Singapore Web IR / NLP Group (WING)
Quality Attributes of Web Software Applications – Jeff Offutt By Julia Erdman SE 510 October 8, 2003.
© 2015 Nuance Communications, Inc. All rights reserved. What’s new in AutoStore 7 March 2015.
Million Book Bibliotheca Alexandrina Noha Adly 20 November 2006.
WERCS Upgrade 5.X – 6.1 Steve Giamalis. Major Changes This upgrade is very significant in terms of technology, functionality, structure, and environment.
10/10: Input & Output Definitions Input devices: examples
Intro to Computers Understanding Computers and Computer Literacy.
The Dark Side of Document Imaging: ‘The Hidden Cost of Capture’
1 UNOG Library Digitization and Microform Unit (DMU) – December 2009.
Million Book Bibliotheca Alexandrina Youssef Eldakar 19 November 2006.
PLANNING ENGINEERING AND PROJECT MANAGEMENT By Lec. Junaid Arshad 1 Lecture#03 DEPARTMENT OF ENGINEERING MANAGEMENT.
Computer Fundamental MSCH 233 Lecture 6. Printers Printer is an output device which convert data into printed form. The output from a printer is referred.
ITGS Databases.
EXPLOITING DYNAMIC VALIDATION FOR DOCUMENT LAYOUT CLASSIFICATION DURING METADATA EXTRACTION Kurt Maly Steven Zeil Mohammad Zubair WWW/Internet 2007 Vila.
CS 127 Introduction to Computer Science. What is a computer?  “A machine that stores and manipulates information under the control of a changeable program”
Robert Crawford, MBA West Middle School.  Explain how input devices are suited to certain kinds of data.  Distinguish between RAM and ROM.  Identify.
Nikola Tesla Museum Clipping Library Saša Malkov Nenad Mitić Žarko Mijajlović 3 rd SEEDI Int.Conf. Cetinje, Montenegro 14. September 2007.
Performance Comparison of Speaker and Emotion Recognition
OCR Software Architecture for Embedded Device Seho Kim', Jaehwa Park Computer Science, Chung-Ang University, Seoul, Korea
Developing a Marketing Plan
Census Processing Baku Training Module.  Discuss:  Processing Strategies  Processing operations  Quality Assurance for processing  Technology Issues.
Unit 1—Computer Basics Lesson 1 Understanding Computers and Computer Literacy.
Scanners. Using a Scanner Scanners are used to digitize any flat object. Several types of scanners- flatbed, sheet fed, handheld, film. Most common is.
1 Chapter 12 Configuration management This chapter is extracted from Sommerville’s slides. Text book chapter 29 1.
Scanned Documents INST 734 Module 10 Doug Oard. Agenda Document image retrieval  Representation Retrieval Thanks for David Doermann for most of these.
Product Design and Development Chapter 3
Arabic Handwriting Recognition Thomas Taylor. Roadmap  Introduction to Handwriting Recognition  Introduction to Arabic Language  Challenges of Recognition.
The Big Picture Things to think about What different ways are there to collect information automatically? What are the advantages and disadvantages of.
1 MIT 5316 Web-Based Computing Lecture 1. 2 Welcome Introduction Syllabus.
June 12, 2016CITALA'121 Cloud Computing Technology For Large Scale and Efficient Arabic Handwriting Recognition System HAMDI Hassen, KHEMAKHEM Maher
1 A Statistical Matching Method in Wavelet Domain for Handwritten Character Recognition Presented by Te-Wei Chiang July, 2005.
1 Creating Situational Awareness with Data Trending and Monitoring Zhenping Li, J.P. Douglas, and Ken. Mitchell Arctic Slope Technical Services.
Using the Gamera framework for the recognition of cultural heritage materials Levy Project II Digital Knowledge Center, Sheridan Libraries, Michael Droettboom,
Fusion of Multiple Corrupted Transmissions and its effect on Information Retrieval Walid Magdy Kareem Darwish Mohsen Rashwan.
Chapter 8 Environments, Alternatives, and Decisions.
Automatic Speech Recognition
Mousavi,Seyed Muhammad – Lyashenko, Vyacheslav
Statistical Methods for Text Error Correction
Overview of Machine Learning
Improving assessment and feedback processes with OCR technology
Presented by: Jeff Moore – Artsyl Technologies, Inc.
Presentation transcript:

Pre-SWOT Report. Printed Arabic OCR Dr. Mohamed El-Mahallawy Eng. Hesham Osman Eng. Rana Abdou Dr. Mohamed Waleed Fakhr Dr. Mohsen Rashwan

1-Introduction and challenges These systems recognize text that has been previously written or printed on a page and then optically converted into a bit image. Offline devices include optical scanners of the flatbed, paper fed and handheld types. Arabic printed script is more difficult than Latin script for the following reasons:

challenges Connectivity problem: segmentation and recognition Dotting problem Multiple Grapheme shapes depending on the position. Ligatures: To make things even more complex, certain compounds of characters at certain positions of the Arabic word segments are represented by single atomic graphemes called ligatures. Overlapping problem Diacritics problem Fonts families and size variations: نسخ، كوفى، رقعة Each has sub-families (Mac versus Windows) Finally, the font size problem: Different Arabic graphemes do not have a fixed height or a fixed width. Moreover, neither the different nominal sizes of the same font scale linearly with their actual line heights, nor the different fonts with the same nominal size have a fixed line height.

2- Applications Digitizing billions of books for digital library storage, archival, retrieval, and classification. Digitizing historical documents

3- State of the art in products (Latin script) OCR is a highly mature technology for Latin script with excellent performance. The main challenges are in the pre-processing, page segmentation, speed of batch processing and post-processing. OmniPage-17 by Nuance is an example of such a product with less than 1% WER:

4- State of the art in products (Arabic script) 1- Sakhr: 1% WER for good quality documents but may drop significantly with poor quality documents. (Best speed, and best output layout) 2- VERUS: a little lower than Sakhr for good quality but significantly better for poor quality. (bibliotheca alexandrina uses both engines for its digitization project). 3- Readiris: Lower performance than the other two.

5- State of the art in Research and Competitions Focus mainly on producing true Omni OCR for different font families, font sizes (specially the large), document pre-processing and framing, noise robustness, and batch-mode speed. Significant recent efforts: Most recent research employ HMMs, and fusion between multiple OCR systems targeting Omni font performance.

6- Required Modules ScanFix pre-processing tool (or similar): 15$ per license. Nuance document analysis tool (Framing tools) (or similar): 30$ per license. Word based language model Character based language model Grapheme to ligature and ligature to grapheme convertor: Need to build a tool Statistical training tools: HTK, SRI, Matlab, and many neural network tools. Error analysis tools: Need to be implemented. Diacritic Preprocessing tool Language Recognition tool

7- Required Resources Word annotated corpus (estimated 5000 pages of different quality-resolution and font styles). Character/Ligature annotated corpus for initial models (estimated 8 pages covering all shapes, with about 25 instance per shape). Character-based language models (use digital resources). Word-based language models (use digital resources). Dictionaries with transcriptions

8- Available Resources and Gaps We need some tools to be available (error analysis, grapheme to character/ligature, pre- processing). No available database so we need to do data collection very soon. Character/Ligature-based Language models have to be trained and made available for researchers.

9- LR proposed by ALTEC: Training We need to focus on the Naskh fonts family. Within Naskh, there may be about 6 families. Each would have 6 different font sizes (8,10,12,14,16,18). The rule is to have about 25 instances for each shape in each case. We assumed to have about 300 different shapes (characters and ligatures). So we need 300*25=7500 instances. This is about 8 pages. This should be done for each font family and for each font size as follows: 8pages*6faontsfamilies*6fontsizes= around 300 pages total (Clean=Excellent Quality).

LR proposed by ALTEC (cont.) These pages (for clean high quality training data) will be generated artificially, by balancing the data to cover all the 300 shapes. Then, to generate lower quality training data: a- The 300 pages will be outputted from a Fax machine (once) b- The 300 pages will be copied once (one output), then twice (second output). c- The same process will be done for 600, 300, and 200 dpi. (This gives 3600 pages: 300 clean, 300 from Fax, 300 copied once, 300 copied twice) multiplied by 3 for the 3 different resolutions. We will also obtain 2000 transcribed pages from Alex. Bib. with low quality old books, etc.).

LR proposed by ALTEC (Benchmarking) The recommended Benchmarking must be two- folded; one is to measure robustness and reliability of the product (software) and this requires 40,000 documents in one batch. These should include simple and complex documents, different qualities, etc. The second test, for accuracy, should include at least 600 pages (200 high quality, 200 medium, and 200 poor quality) coming from books, newspapers, Fax outputs, Typewriters, etc. It is highly recommended to have an OCR competition co-organized by ALTEC.

10- Preliminary SWOT analysis Strengths: 1.The expertise, in DSP, pattern recognition, image processing, NLP, and stochastic methods 2.Potential to have huge amounts of annotated data. Weaknesses: 1.The tight time & budget of the intended required products. 2.No benchmarking available for printed Arabic OCR 3.No training database available for research community for Arabic OCR

Opportunities: 1.Truly reliable & robust Arabic Omni OCR systems are a much needed essential technology for the Arabic language to be fully launched in the digital age. 2.No existing product is yet satisfactory enough 3.The Arabic language has a huge heritage to be digitized. 4.Large market of such a tech. of over 300 million native speakers, plus other numerous interested parties (for reasons such as security, commerce, cultural interaction, etc.). Threats: 1.Back firing against Arabic OCR technologies in the perception of customers, due to a long history of unsatisfactory performance of past and current Arabic OCR/ICR products. 2.Other R&D groups all over the world (esp. in the US) is working hard and racing for a radical solution of the problem.

11- Survey Specify the application that OCR recognition will be used for What is the data used/intended to train the system? What is the benchmark to test your system on? Would you be interested to contribute in the data collection. At what capacity? Would you be interested to buy Arabic OCR annotated data? Would you be interested to contribute in a competition How many persons working in this area in your team? What are their qualifications? What are the platforms supported/targeted in your application? What is the market share anticipated in your application? Would your application support any other languages? Explain.

List of Survey Targets Sakhr RDI ImagiNet Orange- Cairo IBM- Cairo Cairo University Ain Shams University Arab academy (AAST) AUC GUC Nile University Azhar university Helwan university Assuit university Other Centers outside Egypt Other companies that are users of the technology

12- Key Figures in this Field NovoDynamics (VERUS) research team: Dr. Steve Schlosser et. al. Dr. John Makhoul (BBN) Dr. Hazem AbdelAzeem (Egypt)