OCR at INIS. Branko Krznarić. INIS Training Seminar, 12-16 October 2015, Vienna.

Presentation transcript:

OCR at INIS Branko Krznarić

Outline
• What is OCR?
• OCR Objectives
• Principles
• Techniques
• Software
INIS Training Seminar, 12-16 October 2015, Vienna, Austria

What is OCR? (source: pcmag.com)

Optical Character Recognition (OCR)
• OCR is the “conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text.” [1]
• Make digitized images of printed documents searchable.
• Font encoding issues.

OCR Objectives
• Data entry from printed records.
• OCR adds extra value to your image.
• OCR brings your digitized collection to life. We can “find the needle in the haystack”.

OCR Objectives (contd.)
OCR is a method of digitizing printed texts so that they can be:
• Electronically edited
• Searched
• Stored more compactly
• Displayed online
• Used in machine processes

OCR Techniques
• Pre-processing
   • De-skew
   • Despeckle
   • Binarization
   • Line removal
• Layout analysis (zoning)
• Post-processing (dictionary)
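As an illustration of the binarization step listed above, here is a minimal sketch in Python. It assumes the page is already available as a grid of 8-bit grey values and uses Otsu's method to pick a global threshold. That choice is an assumption made for the example; the slides do not say which binarization method the OCR software at INIS applies. The names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_dark = 0   # number of pixels at or below the candidate level
    sum_dark = 0      # sum of grey values of those pixels
    for level in range(256):
        weight_dark += hist[level]
        if weight_dark == 0:
            continue
        weight_light = total - weight_dark
        if weight_light == 0:
            break
        sum_dark += level * hist[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: large when the two groups are far apart.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def binarize(image):
    """Turn a grey image (list of rows) into 1 = ink, 0 = paper."""
    threshold = otsu_threshold([value for row in image for value in row])
    return [[1 if value <= threshold else 0 for value in row] for row in image]


# Hypothetical 2 x 3 scan: two dark (ink) pixels on a light background.
scan = [[200, 210, 20],
        [10, 220, 205]]
print(binarize(scan))
```

A global threshold like this works best on evenly lit scans. Pages with stains or shadows usually need a locally adaptive threshold instead.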

Scanned vs. Vector Image

“Do not look at the trees (letters); try to see the forest (sentences).”
F0R 488UR1N6 7H3 L0N63V17Y 0F 1NF0RM4710N, P3RH4P8 7H3 M087 1MP0R74N7 R0L3 1N 7H3 0P3R4710N 0F 4 D16174L 4RCH1V3 18 M4N461N6 7H3 1D3N717Y, 1N736R17Y 4ND QU4L17Y 0F 7H3 4RCH1V38 1783LF 48 4 7RU873D 80URC3 0F 7H3 CUL7UR4L R3C0RD.

Verdana Font
FOR ASSURING THE LONGEVITY OF INFORMATION, PERHAPS THE MOST IMPORTANT ROLE IN THE OPERATION OF A DIGITAL ARCHIVE IS MANAGING THE IDENTITY, INTEGRITY AND QUALITY OF THE ARCHIVES ITSELF AS A TRUSTED SOURCE OF THE CULTURAL RECORD.
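The digit-for-letter text on the previous slide resembles a common OCR failure, where look-alike characters are confused (0 for O, 1 for I, 8 for S). The sketch below shows one way a dictionary-based post-processing step might undo such confusions. The substitution table is read off the slide text. The word list `WORDS` and the function names are hypothetical and stand in for a real dictionary.

```python
# Digit -> letter pairs taken from the scrambled slide text.
LOOKALIKES = str.maketrans({
    "0": "O", "1": "I", "3": "E", "4": "A",
    "6": "G", "7": "T", "8": "S",
})

# Hypothetical mini dictionary; a real one would hold a full word list.
WORDS = {"THE", "LONGEVITY", "OF", "INFORMATION", "IS"}


def correct_token(token, dictionary):
    """Accept the substituted form only if it is a known word."""
    candidate = token.translate(LOOKALIKES)
    return candidate if candidate in dictionary else token


def correct_line(line, dictionary=WORDS):
    return " ".join(correct_token(token, dictionary) for token in line.split())


print(correct_line("7H3 L0N63V17Y 0F 1NF0RM4710N 18 2015"))
# THE LONGEVITY OF INFORMATION IS 2015
```

Checking each candidate against the dictionary leaves genuine numbers such as "2015" untouched, which a blind character substitution would corrupt.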

Brush Script MT (Windows Font)
FOR ASSURING THE LONGEVITY OF INFORMATION, PERHAPS THE MOST IMPORTANT ROLE IN THE OPERATION OF A DIGITAL ARCHIVE IS MANAGING THE IDENTITY, INTEGRITY AND QUALITY OF THE ARCHIVES ITSELF AS A TRUSTED SOURCE OF THE CULTURAL RECORD.

PCs ≠ Humans
• OCR compares patterns and selects the closest match. It can be constrained to a specific context, but that requires customization.
• People adapt to circumstances and can read past misspellings if the context is clear.
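"Compares patterns and selects the closest match" can be sketched as template matching. The example below is a deliberately tiny illustration, not how any particular OCR product works. It assumes each character is a 3 x 5 bitmap and measures closeness as the number of differing pixels (Hamming distance). The templates and the function names are hypothetical.

```python
# Hypothetical 3 x 5 templates: "1" is ink, "0" is paper.
TEMPLATES = {
    "I": ("111", "010", "010", "010", "111"),
    "L": ("100", "100", "100", "100", "111"),
    "T": ("111", "010", "010", "010", "010"),
}


def hamming(glyph, template):
    """Count the pixels where the scanned glyph and the template differ."""
    return sum(
        a != b
        for glyph_row, template_row in zip(glyph, template)
        for a, b in zip(glyph_row, template_row)
    )


def closest_match(glyph, templates=TEMPLATES):
    """Return (label, distance) for the template nearest to the glyph."""
    scored = ((label, hamming(glyph, tpl)) for label, tpl in templates.items())
    return min(scored, key=lambda pair: pair[1])


# A "T" with one pixel of the stem lost in scanning.
noisy = ("111", "010", "010", "000", "010")
print(closest_match(noisy))
# ('T', 1)
```

The matcher always returns some answer, even for a smudge that resembles no letter at all. This is why a person using context can outperform it on degraded text.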

True or false? Usually, printed text is adequately sampled if each line is at least two pixels wide.
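To get a feel for the two-pixel rule in the question above, the arithmetic can be worked through for a given scan resolution. The sketch assumes that "line" means a character stroke, and it uses a hypothetical stroke width of 0.2 mm. Neither figure comes from the slides.

```python
MM_PER_INCH = 25.4


def pixels_across(stroke_mm, dpi):
    """Number of pixels that span a stroke of the given width."""
    return stroke_mm * dpi / MM_PER_INCH


def min_dpi(stroke_mm, min_pixels=2):
    """Lowest resolution at which the stroke covers min_pixels pixels."""
    return min_pixels * MM_PER_INCH / stroke_mm


stroke = 0.2  # hypothetical stroke width in millimetres
print(f"{pixels_across(stroke, 300):.2f}")  # 2.36 pixels at 300 dpi
print(f"{pixels_across(stroke, 200):.2f}")  # 1.57 pixels at 200 dpi
print(f"{min_dpi(stroke):.0f}")             # 254 dpi needed for two pixels
```

Under these assumptions a 300 dpi scan just satisfies the rule and a 200 dpi scan does not.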

Zoom in

Zoom in

Results from OCR
It is in this context that I…
… and an additional protocol on the basis…

Chinese Raster Image (scanned)

Chinese Vector Image (OCR) 滤器

Arabic Raster Image (scanned)

Arabic Vector Image (OCR) هذ ا وشملت

Japanese Raster Image (scanned)

Japanese Vector Image (OCR)

Font Encoding

Font Encoding (cont.)

OCR Software
• High degree of recognition accuracy
• Reproducing formatted output
• OCR Software at INIS:
   • Abbyy FineReader (multilingual OCR)
   • Adobe Acrobat
   • InftyReader

Abbyy FineReader (interface)

InftyReader - an OCR System for Math Documents

Reference
[1] “Optical character recognition”.

Thank you!