Presentation transcript:

Document Image Retrieval David Kauchak cs160 Fall 2009 adapted from: David Doermann

Assign 4 writeups Overall, I was very happy See how big a difference the modifications make! Some general comments –explain data set and characteristics –explain your evaluation measure(s) –think about the points you’re trying to make, then use the data to make that point –comment on anything abnormal or surprising in the data –dig deeper if you need to –if you have multiple evaluation measures, use them to explain/understand different behavior –try and explain why you got the results you obtained

Information retrieval systems Spend 15 minutes playing with three different image retrieval systems – has a number –What works well? –What doesn’t work well? –Anything interesting you noticed? You won’t hand anything in, but we’ll start class on Monday with a discussion of the systems

Image Retrieval

Image Retrieval Problems

Different Systems

Information retrieval: data
Amount of data
–Text retrieval: trillions of web pages; within an order of magnitude in “private” data
–Audio retrieval: order of a few billion? last fm has 150M songs
–Image retrieval: somewhere in between
Data characteristics
–Text retrieval: user generated; some semi-structured; link structure
–Audio retrieval: mostly professionally generated; co-occurrence statistics
–Image retrieval: user generated; becoming more prevalent; some tagging; incorporated into web pages (context)

Information retrieval: challenges
–Text retrieval: scale; ambiguity of language; link structure; spam
–Audio retrieval: query language; user interface; features/pre-processing
–Image retrieval: query language; user interface; features/pre-processing; ambiguity of pictures
Other dimensions?

What’s in a document? I give you a file I downloaded You know it has text in it What are the challenges in determining what characters are in the document? –File format:

What is a document?

Document Images A document image is a document that is represented as an image, rather than some predefined format Like normal images, contain pixels –often binary-valued (black, white) –But greyscale or color sometimes 300 dots per inch (dpi) gives the best results –But images are quite large (1 MB per page) –Faxes are normally 72 dpi Usually stored in TIFF or PDF format Want to be able to process them like text files
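
The "1 MB per page" figure above can be checked with a little arithmetic. The sketch below assumes a US Letter page (8.5 x 11 inches) stored uncompressed; the page size and the function name `page_size_bytes` are illustrative assumptions, not something stated on the slide.

```python
def page_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return total_bits // 8


# Assumed page size: US Letter, 8.5 x 11 inches.
binary_300 = page_size_bytes(8.5, 11, 300, 1)  # black/white, 1 bit per pixel
gray_300 = page_size_bytes(8.5, 11, 300, 8)    # greyscale, 8 bits per pixel
binary_72 = page_size_bytes(8.5, 11, 72, 1)    # the 72 dpi figure quoted for faxes

print(binary_300)  # -> 1051875 (about 1 MB)
print(gray_300)    # -> 8415000 (about 8.4 MB)
print(binary_72)   # -> 60588 (about 60 KB)
```

Under these assumptions a binary page at 300 dpi comes out at roughly 1 MB, which matches the slide; greyscale multiplies that by eight before any compression.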

Sources of document images Web – –Arabic news stories are often GIF images –Google Books, Project Gutenberg (though these are a bit different) Library archives Other –Tobacco Litigation Documents 49 million page images

Document Image Database Collection of scanned images Need to be available for indexing and retrieval, abstracting, routing, editing, dissemination, interpretation NOTE: more needs than just searching!

What are the challenges? What are the sub-problems?

Document images So far, we’ve only been interested in documents as strings of text Document images contain additional information –embedded images –formatting –handwritten annotations –figures/diagrams/tables –classes of documents (memo, newspaper article, book page)

Challenges They’re an image Quality –scan orientation –noise –contrast Hand-written text Hand-written diagrams

Sub-problems Classification - what type of document image is this? Page segmentation –structure –identify images –identify text –identify handwritten text –diagram identification Meta-data identification –title, author –language OCR Reading order Indexing

Problems we’ll discuss today… Preprocessing issues –Page Layer Segmentation –OCR –Reading order IR issues

Problem: Page Layer Segmentation A document consists of many layers, such as handwriting, machine printed text, background patterns, tables, figures, noise, etc.

Step 1 - segmentation

Segmentation

Step 2 – classify the segments Printed text Handwriting Noise We can use features of the “segment” as well as positional information about the other segments

Segmentation Classification (figure: before enhancement, after enhancement)

Problem: OCR One of the more successful applications of computer vision How does this happen?

OCR: One solution Pattern-matching approach –Standard approach in commercial systems –Segment individual characters –Recognize using a neural network classifier

OCR Ideas?

Optical Character Recognition Hidden Markov model approach –Experimental approach developed at BBN –Segment into sub-character slices –Limited lookahead to find best character choice Determining character segmentation is difficult! - Uniform slices - View as a sequential prediction problem

OCR Accuracy Problems Character segmentation errors –In English, segmentation often changes “m” to “rn” Character confusion –Characters with similar shapes often confounded OCR on copies is much worse than on originals –Pixel bloom, character splitting, binding bend Uncommon fonts can cause problems –If not used to train a neural network

Improving OCR Accuracy Image preprocessing –Mathematical morphology for bloom and splitting –Particularly important for degraded images “Voting” between several OCR engines helps –Individual systems depend on specific training data Linguistic analysis can correct some errors –Use confusion statistics, word lists, syntax, … –But more harmful errors might be introduced
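
The linguistic correction idea above (confusion statistics plus a word list) can be sketched in a few lines. This is a toy illustration only: the confusion pairs, the word list, and the function names are all hypothetical, and a real system would weight candidates by how often each confusion actually occurs.

```python
# Hypothetical confusion pairs: (what OCR produced, what was probably printed).
CONFUSIONS = [("rn", "m"), ("m", "rn"), ("1", "l"), ("0", "o"), ("cl", "d")]

# Hypothetical word list standing in for a real dictionary.
WORD_LIST = {"modern", "information", "retrieval", "document", "corner"}


def candidates(word):
    """Yield the word itself, then each variant that undoes one confusion at one position."""
    yield word
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            yield word[:start] + right + word[start + len(wrong):]
            start = word.find(wrong, start + 1)


def correct(word):
    """Return the first candidate found in the word list, else the word unchanged."""
    for cand in candidates(word):
        if cand in WORD_LIST:
            return cand
    return word


print(correct("inforrnation"))  # -> information
print(correct("retrieva1"))     # -> retrieval
print(correct("corner"))        # -> corner (already a known word, left alone)
```

Checking the unmodified word first is what keeps "corner" from being rewritten; without a word list, blindly undoing "rn" would introduce exactly the kind of new error the slide warns about.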

OCR Speed Neural networks take about 10 seconds a page –Hidden Markov models are slower Voting can improve accuracy –But at a substantial speed penalty Easy to speed things up with several machines –For example, by batch processing - using desktop computers at night The challenge with OCR is that there is often a trade-off between speed and accuracy
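
Voting between OCR engines can be sketched as a per-character majority vote. The version below assumes the engines' outputs are already aligned and of equal length, which real outputs usually are not (insertions and deletions need an alignment step first); the function name `vote` is illustrative.

```python
from collections import Counter


def vote(outputs):
    """Majority vote per character position over equal-length OCR outputs."""
    if len({len(o) for o in outputs}) != 1:
        raise ValueError("this sketch assumes the outputs are already aligned")
    result = []
    for chars in zip(*outputs):
        # most_common(1) gives the top (char, count) pair; ties go to the first seen.
        result.append(Counter(chars).most_common(1)[0][0])
    return "".join(result)


# Three hypothetical engine outputs for the same word, each with a different error.
print(vote(["retrieval", "retrieva1", "rctrieval"]))  # -> retrieval
```

The speed penalty is visible in the setup itself: every page has to be run through each engine before a vote can be taken.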

Problem: Reading Order What is the sequence of words from this document? Ideas?

Logical Page Analysis Can be hard to guess in some cases –Newspaper columns, figure captions, appendices, … Sometimes there are explicit guides –“Continued on page 4” (but page 4 may be big!) Structural cues can help –Column 1 might continue to column 2 Content analysis is also useful –Word co-occurrence statistics, syntax analysis
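
The structural cue that column 1 continues into column 2 can be turned into a simple ordering rule. The sketch below assumes text blocks have already been segmented, that each is described by the (x, y) of its top-left corner, and that the page has a known number of equal-width columns; all of these are simplifying assumptions, and the function name is hypothetical.

```python
def reading_order(blocks, page_width, n_columns):
    """Order (x, y, text) blocks column by column, top to bottom within a column."""
    column_width = page_width / n_columns

    def key(block):
        x, y, _ = block
        # Clamp so a block starting at the right edge still lands in the last column.
        column = min(int(x // column_width), n_columns - 1)
        return (column, y)

    return [text for _, _, text in sorted(blocks, key=key)]


# Hypothetical two-column page, 600 units wide, blocks given in scan order.
blocks = [(320, 50, "C"), (20, 400, "B"), (20, 50, "A"), (320, 400, "D")]
print(reading_order(blocks, page_width=600, n_columns=2))  # -> ['A', 'B', 'C', 'D']
```

A rule like this handles plain multi-column text but not figure captions or "continued on page 4" jumps, which is where the content analysis mentioned on the slide comes in.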

Traditional Approach (pipeline diagram): Document, Scanner, Page Image, Page Decomposition (text regions; structure, images, etc.), Optical Character Recognition

Remember our goal Create an IR system over image documents Challenge: OCR is not perfect –Success for high quality OCR (Croft et al 1994, Taghva 1994) –Limited success for poor quality OCR (1996 TREC, UNLV) Ideas?

Proposed Solutions Improve OCR Again, speed is always a concern Similar to spelling correction –Automatic correction –Character n-grams Statistically robust to small numbers of errors Rapid indexing and retrieval Works from 70%-85% character accuracy where traditional IR fails
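
A small example shows why character n-grams tolerate a few OCR errors: one wrong character only damages the n-grams that overlap it. The sketch below uses trigrams with "_" as a word-boundary marker and the Dice coefficient as the similarity measure; both choices are assumptions made for illustration, not something the slide specifies.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of a word, padded with '_' at both ends."""
    padded = f"_{word}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b, n=3):
    """Dice coefficient between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# Exact matching fails on the OCR error, but most trigrams still agree.
print("retrieval" == "retrieva1")                 # -> False
print(round(dice("retrieval", "retrieva1"), 3))   # -> 0.778
print(dice("retrieval", "document"))              # -> 0.0
```

Here 7 of the 9 trigrams survive the "l" to "1" confusion, so the damaged word still scores far above an unrelated one.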

Matching with OCR errors Example: a character recognized as “5” with confidence X% –X > 80%: keep base system answer –75% - 80%: character n-grams –X < 75%: more intensive image techniques (e.g. shape codes)
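
The confidence thresholds on this slide amount to a three-way dispatch. The sketch below encodes them directly; how the boundary values (exactly 75% and exactly 80%) are handled is an assumption, as are the function name and the returned labels.

```python
def choose_strategy(confidence):
    """Pick a matching strategy from an OCR confidence given in percent."""
    if not 0 <= confidence <= 100:
        raise ValueError("confidence must be a percentage between 0 and 100")
    if confidence > 80:
        return "keep base system answer"
    if confidence >= 75:  # assumed: both ends of the 75-80 band fall here
        return "character n-grams"
    return "image techniques (e.g. shape codes)"


for value in (92, 78, 60):
    print(value, choose_strategy(value))
```

The cheap path is taken whenever the OCR output can be trusted, and the more expensive techniques are reserved for the low-confidence cases.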

Conversion to Text? Full Conversion often required Conversion is difficult! –Noisy data –Complex Layouts –Non-text components Points to Ponder –Do we really need to convert? –Can we expect to fully describe documents without assumptions?

Idea: do processing on images Characteristics –Does not require expensive OCR/Conversion –Applicable to filtering applications –May be more robust to noise Possible Disadvantages –Application domain may be very limited –Indexing?

Shape Coding Approach –Use of Generic Character Descriptors –Map Character based on Shape features including ascenders, descenders, punctuation and character with holes

Shape Codes Group all characters that have similar shapes –{a, c, e, n, o, r, s, u, v, x, z} –{b, d, h, k, } –{f, t} –{g, p, q, y} –{i, j, l, 1, I} –{m, w} A shape code records which of these character sets a sub-image belongs to Later sub-processing can then draw on linguistic analysis and/or OCR
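
The character groups on this slide can be turned into a word-level code by replacing each character with a label for its group. In the sketch below the groups are taken from the slide, while the one-letter labels ("x", "A", "t", "g", "i", "m"), the "?" fallback for characters outside every group, and the function name are hypothetical choices.

```python
# Groups from the slide; the label on the left of each entry is an arbitrary choice.
SHAPE_GROUPS = {
    "x": "acenorsuvxz",  # x-height characters
    "A": "bdhk",         # ascenders
    "t": "ft",
    "g": "gpqy",         # descenders
    "i": "ijl1I",
    "m": "mw",           # wide characters
}
CHAR_TO_CODE = {ch: code for code, chars in SHAPE_GROUPS.items() for ch in chars}


def shape_code(word):
    """Map each character to its shape-group label; unknown characters become '?'."""
    return "".join(CHAR_TO_CODE.get(ch, "?") for ch in word)


print(shape_code("cat"))  # -> xxt
print(shape_code("oat"))  # -> xxt (collides with "cat")
print(shape_code("dog"))  # -> Axg
print(shape_code("bug"))  # -> Axg (collides with "dog")
```

The collisions show the trade-off directly: a query for "cat" will never miss "cat", but it will also match "oat", so a second pass is needed to separate them.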

Why Use Shape Codes? Can recognize shapes faster than characters –Seconds per page, and very accurate Preserves recall, but with lower precision –Useful as a first pass in any system Easily extracted from JPEG-2 images –Because JPEG-2 uses object-based compression

Evaluation The usual approach: Model-based evaluation –Apply confusion statistics to an existing collection A bit better: Print-scan evaluation –Scanning is slow, but availability is no problem Best: Scan-only evaluation –Few existing IR collections have printed materials
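
Model-based evaluation can be sketched as corrupting clean text with a confusion model and then running retrieval over the corrupted copy. The confusion table and its probabilities below are invented for illustration; in practice they would be estimated from real OCR output. The function name `degrade` is also hypothetical.

```python
import random

# Hypothetical confusion model: character -> list of (replacement, probability).
CONFUSION_MODEL = {
    "m": [("rn", 0.10)],
    "l": [("1", 0.05)],
    "o": [("0", 0.05)],
    "e": [("c", 0.05)],
}


def degrade(text, rng, model=CONFUSION_MODEL):
    """Return a copy of text with characters randomly replaced per the confusion model."""
    out = []
    for ch in text:
        replacement = ch
        roll = rng.random()
        cumulative = 0.0
        for wrong, prob in model.get(ch, []):
            cumulative += prob
            if roll < cumulative:
                replacement = wrong
                break
        out.append(replacement)
    return "".join(out)


rng = random.Random(0)  # fixed seed so an experiment can be repeated
noisy = degrade("document image retrieval from a modern collection", rng)
print(noisy)
```

Because the corruption is synthetic, any existing IR test collection with relevance judgments can be reused, which is the attraction of this approach; the cost is that the noise is only as realistic as the confusion model.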

Summary Many applications benefit from image based indexing –Less discriminatory features –Features may therefore be easier to compute –More robust to noise –Often computationally more efficient Many classical IR techniques have application for DIR Structure as well as content are important for indexing Preservation of structure is essential for in-depth understanding

Closing thoughts…. What else is useful? –Document Metadata? – Logos? Signatures? Where is research heading? –Cameras to capture Documents? What massive collections are out there? –Google Books –Other Digital Libraries

Additional Reading
–A. Balasubramanian, et al. Retrieval from Document Image Collections. Document Analysis Systems VII, pages 1-12.
–D. Doermann. The Indexing and Retrieval of Document Images: A Survey. Computer Vision and Image Understanding, 70(3), 1998.

Fun Stuff