Electronic Document Management. Naji Shukri Alzaza, University of Palestine, February 2010.


Optical Character Recognition (OCR)
Naji Shukri Alzaza, EDM, University of Palestine, February 2010

What is OCR?
Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text. It is used to convert paper books and documents into electronic files, for instance to computerize an old record-keeping system in an office or to serve them on a website.

What is OCR?...
By replacing each block of pixels that resembles a particular character (such as a letter, digit or punctuation mark) or word with that character or word, OCR makes it possible to edit printed text, search it for a given word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply such techniques as machine translation, text-to-speech and text mining to it.

What is OCR?...
OCR is a field of research in pattern recognition, artificial intelligence and computer vision. Though academic research in the field continues, the focus has shifted to implementing proven techniques. Optical character recognition (using optical techniques such as mirrors and lenses) and digital character recognition (using scanners and computer algorithms) were originally considered separate fields.

What is OCR?...
Because very few applications survive that use true optical techniques, the term OCR has now been broadened to include digital image processing as well. Early systems required training to read a specific font: they had to be programmed with images of each character, and they worked on only one font at a time. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common.
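
The single-font behaviour of early systems can be illustrated with a toy template matcher. This is only a sketch under stated assumptions: the 3x3 bitmaps, the `TEMPLATES` table and the function names are invented for illustration and do not describe any particular historical system.

```python
# Hypothetical 3x3 "font": one stored bitmap per character ("1" = ink, "0" = paper).
TEMPLATES = {
    "I": ("010",
          "010",
          "010"),
    "L": ("100",
          "100",
          "111"),
    "T": ("111",
          "010",
          "010"),
}

def pixel_mismatches(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))

def recognize(glyph, templates=TEMPLATES):
    """Return the character whose stored bitmap differs from the glyph in the fewest pixels."""
    return min(templates, key=lambda ch: pixel_mismatches(glyph, templates[ch]))

# A scanned "L" that lost one pixel in its bottom row.
noisy_L = ("100",
           "100",
           "110")

print(recognize(noisy_L))
```

A glyph from a different font would need its own set of templates, which is why such a matcher handles only one font at a time.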

What is OCR?...
Some systems are even capable of reproducing formatted output that closely approximates the original scanned page including images, columns and other non-textual components.

OCR Technology's Current State
The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem in applications where clear imaging is available, such as scanning of printed documents. Typical accuracy rates on these exceed 99%; total accuracy can only be achieved by human review. Other areas, including recognition of hand printing, cursive handwriting, and printed text in other scripts (especially those with a very large number of characters), are still the subject of active research.

OCR Technology…
Accuracy rates can be measured in several ways, and how they are measured can greatly affect the reported accuracy rate. For example, if word context (basically a lexicon of words) is not used to correct non-existent words produced by the software, a character error rate of 1% (99% accuracy) may result in an error rate of 5% (95% accuracy) or worse if the measurement is based on whether each whole word was recognized with no incorrect letters.
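
The jump from a 1% character error rate to roughly 5% at word level can be reproduced with a short calculation. The sketch below assumes that character errors are independent and that a word counts as correct only if every character in it is correct; both are simplifications, and the function name is illustrative.

```python
def word_error_rate(char_accuracy, word_length):
    """Probability that at least one character in a word is wrong,
    assuming each character is recognized independently."""
    return 1 - char_accuracy ** word_length

# 99% character accuracy, evaluated for a few word lengths.
for length in (3, 5, 8):
    print(length, round(word_error_rate(0.99, length), 4))
```

For five-letter words this gives a word error rate of about 4.9%, and longer words fare worse, which matches the "5% or worse" figure above.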

On-line Character Recognition
On-line character recognition is sometimes confused with OCR. OCR is an instance of off-line character recognition, where the system recognizes the fixed static shape of the character, while on-line character recognition instead recognizes the dynamic motion during handwriting. For example, on-line recognition, such as that used on a PDA or Tablet PC, can tell whether a horizontal mark was drawn right-to-left or left-to-right.

On-line Character Recognition…
On-line character recognition is also referred to by other terms such as dynamic character recognition, real-time character recognition, and Intelligent Character Recognition or ICR. On-line systems for recognizing hand-printed text on the fly have become well-known as commercial products in recent years. Among these are the input devices for personal digital assistants such as those running Palm OS. The Apple Newton pioneered this product category.

On-line Character Recognition…
The algorithms used in these devices take advantage of the fact that the order, speed, and direction of individual line segments at input are known. Also, the user can be retrained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem.
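
The following sketch shows the kind of stroke information an on-line recognizer has and a scanner does not. It assumes each stroke arrives as a time-ordered list of `(x, y, t)` pen samples in screen coordinates (y grows downward); this data format and the function name are assumptions made for illustration, not the interface of any real device.

```python
import math

def stroke_features(samples):
    """Return (direction, average speed) for one pen stroke.

    samples: list of (x, y, t) tuples in the order they were captured.
    """
    (x0, y0, t0), (x1, y1, t1) = samples[0], samples[-1]
    dx, dy = x1 - x0, y1 - y0
    if dx == 0 and dy == 0:
        direction = "dot"
    elif abs(dx) >= abs(dy):
        direction = "left-to-right" if dx > 0 else "right-to-left"
    else:
        direction = "top-to-bottom" if dy > 0 else "bottom-to-top"
    # Path length is summed over consecutive samples, so curved strokes count fully.
    path_length = sum(math.dist(a[:2], b[:2]) for a, b in zip(samples, samples[1:]))
    duration = t1 - t0
    speed = path_length / duration if duration > 0 else 0.0
    return direction, speed

# Two strokes that leave the same horizontal mark on the page.
forward = [(0, 0, 0.0), (5, 0, 0.1), (10, 0, 0.2)]
backward = [(10, 0, 0.0), (5, 0, 0.1), (0, 0, 0.2)]

print(stroke_features(forward)[0])
print(stroke_features(backward)[0])
```

A scanned image of either stroke would be identical; only the time-ordered samples distinguish them.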

OCR Accuracy
Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications. Recognition of cursive text is an active area of research, with recognition rates even lower than those for hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information.
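
To see how 80% to 90% accuracy becomes "dozens of errors per page", the arithmetic can be written out. The figure of 300 characters on a hand-printed page is an assumption chosen purely for illustration; the source does not state a page size.

```python
def expected_errors(chars_per_page, accuracy):
    """Expected number of misrecognized characters on one page."""
    return chars_per_page * (1 - accuracy)

CHARS_PER_PAGE = 300  # hypothetical page of hand-printed text

for accuracy in (0.90, 0.80):
    print(accuracy, round(expected_errors(CHARS_PER_PAGE, accuracy)))
```

Under this assumption the page contains between 30 and 60 errors, each of which a person would still have to find and correct.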

OCR Accuracy…
For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the Amount line of a cheque (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. Knowledge of the grammar of the language being scanned can also help determine if a word is likely to be a verb or a noun, for example, allowing greater accuracy.
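
One simple way to exploit a small dictionary is to snap each recognized token to the closest word in the lexicon. The sketch below uses Levenshtein edit distance for "closest"; the word list, the sample tokens and the function names are assumptions for illustration, and real systems may combine the lexicon with recognition scores in other ways.

```python
# Hypothetical lexicon for the written-out amount on a cheque.
AMOUNT_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
    "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
    "hundred", "thousand", "and",
]

def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]

def snap_to_lexicon(token, lexicon=AMOUNT_WORDS):
    """Replace a raw recognizer token with the nearest lexicon word."""
    return min(lexicon, key=lambda word: edit_distance(token, word))

raw_tokens = ["f0rty", "tw0", "hundrcd"]   # made-up character-level output
corrected = [snap_to_lexicon(t) for t in raw_tokens]
print(corrected)
```

With only about twenty candidate words, a token with one or two wrong characters is usually still closest to the intended word, which is why a restricted vocabulary raises recognition rates.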

OCR Accuracy…
The shapes of individual cursive characters themselves simply do not contain enough information to accurately (greater than 98%) recognize all handwritten cursive script. It is necessary to understand that OCR technology is a basic technology also used in advanced scanning applications. For more complex recognition problems, intelligent character recognition systems are generally used, as artificial neural networks can be made indifferent to both affine and non-linear transformations.

OCR Accuracy…
A technique that is having considerable success in recognizing difficult words and character groups, within documents otherwise amenable to computer OCR, is to submit them automatically to humans through the reCAPTCHA system. reCAPTCHA was originally developed to help digitize the text of books while protecting websites from bots attempting to access restricted areas. On September 16, 2009, Google acquired reCAPTCHA.

Sakhr OCR (Automatic Reader)
Sakhr OCR converts scans of Arabic printed documents into digital text. Sakhr is rated #1 in recognizing clean copy Arabic text, with an output accuracy of 99%. Sakhr is the leading OCR provider for security and business needs in the Middle East, the U.S., and Europe.

Sakhr OCR (Automatic Reader)
High performance
- 99.8% accuracy for high-quality documents
- 96% accuracy for low-quality documents
- Supports Arabic, Farsi, Pashto, Jawi, and Urdu
- Auto-detects translation language
- Supports bilingual documents

Sakhr OCR (Automatic Reader)
Features
- Available as a standalone SDK, or integrated with document management systems
- User-friendly output editor (WYSIWYG)
- Robust zoning with individual settings
- Multithreaded with concurrent recognition sessions

Sakhr OCR (Automatic Reader)
Challenges of Arabic OCR
Sakhr's OCR engine overcomes numerous complexities of Arabic fonts and language, including:
- Arabic is written cursively, where several characters are connected to form "blocks of characters".
- Arabic can be written in many fonts, so that a "block of characters" has more than one base line.
- Arabic uses many types of external objects such as dots, "Hamza" and "Madda".

Sakhr OCR (Automatic Reader)
Challenges of Arabic OCR …
- Arabic characters can have more than one shape according to their position inside the block of characters (initial, middle, final or standalone block of characters).
- Overlapping also makes it difficult to determine the spacing between blocks of characters and words.
- Arabic font suppliers do not follow a common standard.
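
The position-dependent shapes can be seen directly in Unicode, which encodes the isolated, final, initial and medial glyphs of each Arabic letter as separate "presentation form" code points. The sketch below uses Python's standard `unicodedata` module on the letter BEH; it only illustrates the many-shapes-to-one-letter mapping an Arabic OCR system has to resolve and says nothing about how Sakhr's engine works internally.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract letter

# The four positional glyphs of BEH in the Arabic Presentation Forms-B block.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

def to_base_letters(text):
    """Fold positional presentation forms back to their base letters (NFKC normalization)."""
    return unicodedata.normalize("NFKC", text)

for position, glyph in BEH_FORMS.items():
    print(position, unicodedata.name(glyph), to_base_letters(glyph) == BEH)
```

All four glyphs normalize to the same base letter, so a recognizer must treat four visually different shapes as one character, and this holds for most letters of the alphabet.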