S.Rajeswari Head , Scientific Information Resource Division

Slides:



Advertisements
Similar presentations
Don’t Type it! OCR it! How to use an online OCR..
Advertisements

DOCUMENT TYPES. Digital Documents Converting documents to an electronic format will preserve those documents, but how would such a process be organized?
KompoZer. This is what KompoZer will look like with a blank document open. As you can see, there are a lot of icons for beginning users. But don't be.
Segmentation of Touching Characters in Devnagari & Bangla Scripts Using Fuzzy MultiFactorial Analysis Presented By: Sanjeev Maharjan St. Xavier’s College.
Chapter 3 Creating a Business Letter with a Letterhead and Table
OCRdroid : A Framework to Digitize Text Using Mobile Phones  Authors  Mi Zhang, Anand Joshi, Ritesh Kadmawala, Karthik Dantu, Sameera Poduri, and Gaurav.
Extraction of text data and hyperlink structure from scanned images of mathematical journals Ann Arbor, March 19, 2002 Masakazu Suzuki (Kyushu University)
Processing Digital Images. Filtering Analysis –Recognition Transmission.
Document Image Analysis CSE 717 An Introduction. Document Image Analysis  DIA is the theory and practice of recovering the symbol structures of digital.
S OFTWARE AND M ULTIMEDIA Chapter 6 Created by S. Cox.
Developing a Basic Web Page with HTML
The University of Adelaide Table Talk: Using tables in Word Peter Murdoch March 2014 PREPARING GOOD LOOKING DOCUMENTS.
Braille Converter For Exam Background What is Braille? Braille is a series of raised dots that can be read with the fingers by people who are.
Creating Accessible PDF’s in Adobe Acrobat Professional 7.0.
Abstract # 0000 Make the Main Title with Large Bold Type Your Name Here Your Department Here Texas A&M Health Science Center Make the Main Title with Large.
VEHICLE NUMBER PLATE RECOGNITION SYSTEM. Information and constraints Character recognition using moments. Character recognition using OCR. Signature.
Acadia Institute for Teaching and Technology Basic Web Page Design for Courses.
Unit 30 P1 – Hardware & Software Required For Use In Digital Graphics
2 pt 3 pt 4 pt 5pt 1 pt 2 pt 3 pt 4 pt 5 pt 1 pt 2pt 3 pt 4pt 5 pt 1pt 2pt 3 pt 4 pt 5 pt 1 pt 2 pt 3 pt 4pt 5 pt 1pt Terms 2 Terms 3 Terms 4 Terms 5 Terms.
Make the Main Title with Large Bold Type Your Name and Title Here Your Department Here Texas A&M Health Science Center Make the Main Title with Large Bold.
Abstract # 0000 Make the Main Title with Large Bold Type Use Smaller Type for the Subtitle. Above Type is 105pt. This Type is 70pt. Make authors’ names.
CHAPTER FIVE TEXT.
BACKGROUND LEARNING AND LETTER DETECTION USING TEXTURE WITH PRINCIPAL COMPONENT ANALYSIS (PCA) CIS 601 PROJECT SUMIT BASU FALL 2004.
Chapter 10-Basic Software Tools. Overview Text-based editing tools. Graphical tools. Sound editing tools. Animation, video, and digital movie tools. Video.
Standard Grade Computing General Purpose Packages WORD-PROCESSING WORD-PROCESSING Chapter 2.
S EGMENTATION FOR H ANDWRITTEN D OCUMENTS Omar Alaql Fab. 20, 2014.
Braille Converter For Exam Agenda 1.Introduction 2.Research Problem 3.Objectives 4.Methodology 5.Users & Benefits 6.Expected Outputs 7.References.
Introduction to Interactive Media Interactive Media Components: Text.
Copyright © 2009 Pearson Education, Inc. Publishing as Prentice Hall. 1 Skills for Success with Microsoft ® Office 2007 PowerPoint Lecture to Accompany.
ITGS Application Software. ITGS Application software (productivity software) –Allows the user to perform tasks to solve problems, such as creating documents,
Word 2003 The Word Screen. Word 2003 Screen File Menu –Holds the options for creating a new document, opening a document, saving a document, printing.
Scanned Documents INST 734 Module 10 Doug Oard. Agenda Document image retrieval  Representation Retrieval Thanks for David Doermann for most of these.
Unit 14 - Desktop Publishing Part 2 Desktop Publishing InDesign – Inside Pages
HOW SCANNERS WORK A scanner is a device that uses a light source to electronically convert an image into binary data (0s and 1s). This binary data can.
Preliminary Transformations Presented By: -Mona Saudagar Under Guidance of: - Prof. S. V. Jain Multi Oriented Text Recognition In Digital Images.
The Big Picture Things to think about What different ways are there to collect information automatically? What are the advantages and disadvantages of.
COM: 111 Introduction to Computer Applications Department of Information & Communication Technology Panayiotis Christodoulou.
Optical Character Recognition
Make the Main Title with Large Bold Type Use Smaller Type for the Subtitle. Above Type is 110pt. This Type is 80pt. Make authors’ names smaller. This is.
Chapter 3 – Part 1 Word Processing Pages for Mac CMPF 112 : COMPUTING SKILLS.
Section 8.1 Section 8.2 Create a custom theme Design a color scheme
2.01 Understand Digital Raster Graphics
OCR Reading.
CNIT 131 HTML5 - Tables.
MSU Libraries’ Course Materials Program:
Optical Character Recognition
Lesson 16 Enhancing Documents
2.01 Understand Digital Raster Graphics
Shelly Cashman: Microsoft Word 2016
Adobe Flash Professional CS5 – Illustrated
The How-to-Guide for Using Word
Software and Multimedia
Lesson 17 Working with Graphics
Software and Multimedia
Getting Started with DTP
UN Workshop on Data Capture, Bangkok Session 7 Data Capture
Graphic Communication
eCopy PDF Pro Office Integration with iManage Work.
2.01 Understand Digital Raster Graphics
Terms 1 Terms 2 Terms 3 Terms 4 Terms 5 1pt 1 pt 1 pt 1pt 1 pt 2 pt
Aniko T. Valko, Keymodule Ltd.
UN Workshop on Data Capture, Dar es Salaam Session 7 Data Capture
Tutorial Developing a Basic Web Page
Training & Development
ICT Word Processing Lesson 1: Introduction to Word Processing
Word Processing Software Photo credit: © 2007 JupiterImagesCorporation.
Ann Arbor, March 19, 2002 Masakazu Suzuki (Kyushu University)
5.00 Apply procedures to organize content by using Dreamweaver. (22%)
Layout Terms Visual Hierarchy
Quick and Dirty: the art of OCR
Presentation transcript:

Implementation of a Web based Text Extraction Tool using Open Source Object Models S.Rajeswari Head , Scientific Information Resource Division Indira Gandhi Centre for Atomic Research Kalpakkam. 603102 raj@igcar.gov.in

Motivation Scanned documents older than 2 decades are available in libraries and institutional repositories in large number. To retrieve meta data from these documents, text extraction tools are necessary. To extract text from such scanned documents, the tool used is Optical Character Recognition (OCR).

Scope of this work Open source libraries were used in the implementation of a web based OCR tool to extract text from scanned documents. Developing an in-house OCR tool would make it possible to customize Text extraction from documents and store the extracted text in the format that is required for further processing. In addition to an OCR tool, a pre processing tool to remove skew introduced in the document during scanning was developed using open source tools

Scanner – The tool that generates scanned documents Scanner involves recording changes in light intensity reflected from the document as a matrix of dots. The light/ colour values of each dot is stored in binary digits, one bit being required for each dot in a black/white scan and up to 32 bits for a colour scan. - Reference : Haigh, Susan. Optical character recognition (OCR) as a digitization technology. Network Notes #37. November 15, 1996.

Optical Character Recognition Optical Character Recognition (OCR) is the ability of a computer to read the content of a document from a scanned image. Normally a scanned document is one large graphic file. OCR enables the computer to recognize letters and turn the document into regular text. An OCR document is smaller to store on the computer. It can be searched or edited like other text documents.

What is OCR Ack: https://abbyy.technology/en:features:ocr:start

OCR Step 1 : Pre-Processing Image cleaning routines Skew correction Binarisation

OCR Step 2 : Layout Analysis Page analysis is the next major routine used in character recognition process. Find page elements like blocks of text, tables, lines, single characters . Line segmentation The word segmentation. The character segmentation separates the various letters of a word. If the characters have fixed pitch, character segmentation is easy. Page analysis is the next major routine used in character recognition process. It is finding page elements like blocks of text, tables, lines, single characters . Line segmentation consists of slicing a page of text into its different lines. The word segmentation isolates one word from another. The character segmentation separates the various letters of a word. If the characters have the same width (fixed pitch), character segmentation is easy.

OCR Step 3 : Character Recognition The pixels of a scanned image has to be organised into characters. To do this a library of characters in pixel form is necessary. This library should support different fonts, styles and sizes. At this point either a feature extraction routine or matrix matching routine compares the characters in the scanned image file to the characters in the library.

Microsoft Object Document Imaging (MODI) It was found MODI consisted of object libraries for Binarization Segmentation Pixel Library for a lot of fonts, libraries for font expansion, glyphs, inversions, font styles(italics, bold,..) were available in MODI So it was decided to use MODI – a script engine using ASP.Net A web based tool using MODI was developed.

Evaluation of OCRs It was decided to evaluate other OCR tools with the OCR tool developed in-house. Two OCRs (Adobe Acrobat Pro, Abby Fine Reader) and open source OCR ( Tesseract ) was selected for the study . Different Documents were chosen and used for evaluating the OCR engine . The documents are chosen with o many columns, o multi font, o blurred, o bevel, o invoices, o text embedded within images , o distorted in different formats .

Documents used to check their accuracy 1. 3 column 2. Bevel

4. Conversation 3. Blur

6. Distorted 5. Different Fonts & Colours

9.Less image 8. Isometric rt up 7. Invoice

11. More image 10. More image

12. Shadow

ABBY Fine Reader OCR Adobe Acrobat Tesseract MODI OCR With the above 12 documents accuracy was tested with in house developed MODI OCR and ABBY Fine Reader OCR , Adobe Acrobat and TessNet OCR and the results are given below. ABBY Fine Reader OCR Adobe Acrobat Tesseract MODI OCR S No Documents Name No of Word No .of ocr word Accuracy in Percentage 1 3-cols 433 421 97.22 2 Bevel 134 132 98.50 3 Blur 202 172 85.14 4 Converse 35 26 74.28 5 Diff_font 111 100 6 Distorted 539 470 87.14 7 Invoice 93 82.79 8 Iso-metric-right-up 133 26.31 9 Less_image_text 10 More-image 105 98 93.33 11 More-image-1 118 107 90.67 12 Shadow 123 34 27.64  

Inference On developing the text extraction tool (MODI OCR), it was found, the OCR engine extracts documents with 90 , 180 degrees skew perfectly but has some difficulty with less than 20 degree skew. So survey & study of skew correction algorithms was done.

Skew During the scanning of the document, skew is being inevitably introduced in the document image. A skew in the scanned document reduces the accuracy in the text extraction of the document.

Skew correction algorithms used in our project Projection Profile algorithm Hough Transform algorithm

Projection Profile algorithm To determine the skew of a document, the projection profile is computed at a number of angles and for each angle, a measure of difference of peak and trough height is made. The maximum difference corresponds to the best alignment with the text line direction which, in turn, determines the skew angle. The projection profile methods are, however, well suited for skew angle less than ±10° B.B.Chaudhuri and U.Pal, (1997) “Skew Detection of Digitized Indian Script Documents” IEEE Transactions on pattern analysis and machine intelligence, Vol. 19, pp. 182-186.

Hough Transform In general, the straight line y = mx + b can be represented as a point (b, m) in the parameter space. The (b,m) plane is sometimes referred to as Hough space for the set of straight lines in two dimensions. Given a single point in the plane, then the set of all straight lines going through that point corresponds to a sinusoidal curve in the (b,m) plane, which is unique to that point. A set of two or more points that form a straight line will produce sinusoids which cross at the (b,m) for that line. Thus, the problem of detecting collinear points can be converted to the problem of finding concurrent curves.

Features of the developed Tool Facility to Upload Files from anywhere Retrieves Text and displays Text– all through a standard web browser. Text File saved to Desktop OCR Facility OCR Skew correction (Preprocessing)

Architecture of the Tool Windows Server Web Server : IIS Script engine : Dot Net

Thank You S.Rajeswari Head, Scientific Information Resource Division Indira Gandhi Centre for Atomic Research, Kalpakkam Tamil nadu raj@igcar.gov.in 044-27480281