Published by Mariah Barton. Modified over 8 years ago.
1
Scanned Documents
INST 734 Module 10
Doug Oard
2
Agenda
Document image retrieval
Representation
Retrieval
Thanks to David Doermann for most of these slides
3
Character N-Grams
Approach
– Split up document into overlapping n-character sequences
    May or may not break on space between words
– “DOCUMENT” -> DOC, OCU, CUM, UME, MEN, ENT
Somewhat language-neutral
– Use average word length to get a stemming-like effect
Statistically robust to small numbers of errors
– Good choice between 70% and 85% character accuracy
    Above 85%, use words, dictionary-based correction, and stemming
    Below 70%, use shape codes
– Modeling character confusion can add a bit more robustness
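The n-gram split on this slide can be sketched in a few lines of Python. This is only an illustration, not the code behind the slides: the function name `char_ngrams` and the `cross_spaces` flag (modeling the "may or may not break on space" choice) are assumptions made for the example.

```python
def char_ngrams(text, n=3, cross_spaces=False):
    """Return the overlapping n-character sequences found in text.

    cross_spaces=False: split on whitespace first, so no n-gram spans two words.
    cross_spaces=True:  slide the window over the whole string, spaces included.
    """
    units = [text] if cross_spaces else text.split()
    grams = []
    for unit in units:
        # A unit shorter than n yields an empty range and so no n-grams.
        grams.extend(unit[i:i + n] for i in range(len(unit) - n + 1))
    return grams


clean = char_ngrams("DOCUMENT")
noisy = char_ngrams("DOCUMFNT")  # hypothetical OCR output with one wrong character

print(clean)                       # ['DOC', 'OCU', 'CUM', 'UME', 'MEN', 'ENT']
print(sorted(set(clean) & set(noisy)))  # the n-grams that survive the error
```

A single character error damages only the n-grams that overlap it, which is why matching on n-grams degrades gradually rather than failing outright the way whole-word matching does.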
4
Shape Codes
Group all characters that have similar shapes
– {A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, 2, 3, 4, 5, 6, 7, 8, 9, 0}
– {a, c, e, n, o, r, s, u, v, x, z}
– {b, d, h, k}
– {f, t}
– {g, p, q, y}
– {i, j, l, 1}
– {m, w}
Faster than character recognition
– Seconds per page, and very accurate
Preserves recall, but with lower precision
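As a rough sketch of how these groups might be used for indexing, the Python below maps each character to a one-letter label for its group. The groups are the ones listed on the slide; the labels ("A", "x", "b", and so on) and the function name `shape_code` are invented here for illustration. A real system would derive the code from the image, not from recognized text.

```python
# Shape groups from the slide; the single-letter labels are hypothetical.
SHAPE_CLASSES = {
    "A": "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567890",
    "x": "acenorsuvxz",
    "b": "bdhk",
    "f": "ft",
    "g": "gpqy",
    "i": "ijl1",
    "m": "mw",
}

# Invert the table: character -> label of the group it belongs to.
CHAR_TO_CODE = {ch: code for code, chars in SHAPE_CLASSES.items() for ch in chars}


def shape_code(text):
    """Replace each character with its group label; unknown characters pass through."""
    return "".join(CHAR_TO_CODE.get(ch, ch) for ch in text)


print(shape_code("Document"))
print(shape_code("dog") == shape_code("hop"))  # distinct words can share a code
```

The last line illustrates the trade-off on the slide: a query word always matches its own shape code, so recall is preserved, but unrelated words can collapse to the same code, which lowers precision.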
5
Retrieval Using Image Features
Advantages
– Efficiency: process only what you need
    Particularly important for filtering tasks
– Effectiveness: limit effect of cascaded errors
Applications
– Handwriting
    Character segmentation is hard
– Keyword spotting
    Capitalization of names is easily seen
6
Evaluation
The simplest approach: Model-based evaluation
– Apply character confusion statistics to an existing collection
A bit better: Print-then-scan evaluation
– Scanning is slow, but availability is no problem
Best: Scan-only evaluation
– Requires relevance judgments on document images
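Model-based evaluation can be illustrated with a small simulation: take clean text from an existing collection and substitute characters according to a confusion table. The sketch below assumes a substitution-only error model, and the probabilities in `CONFUSIONS` are made up for the example; real statistics would be measured from an actual OCR system. The names `corrupt` and `char_accuracy` are likewise hypothetical.

```python
import random

# Hypothetical confusion statistics: character -> [(replacement, probability), ...]
CONFUSIONS = {
    "e": [("c", 0.05), ("o", 0.03)],
    "l": [("1", 0.08), ("i", 0.04)],
    "m": [("n", 0.06)],
}


def corrupt(text, confusions, rng):
    """Simulate OCR substitution errors by sampling from the confusion table."""
    out = []
    for ch in text:
        r = rng.random()          # one draw per character, in [0, 1)
        cumulative = 0.0
        replacement = ch          # default: character is recognized correctly
        for alt, p in confusions.get(ch, []):
            cumulative += p
            if r < cumulative:
                replacement = alt
                break
        out.append(replacement)
    return "".join(out)


def char_accuracy(clean, noisy):
    """Fraction of positions that match (valid because only substitutions occur)."""
    matches = sum(a == b for a, b in zip(clean, noisy))
    return matches / len(clean)


rng = random.Random(734)          # fixed seed so the simulation is repeatable
clean = "document image retrieval works on scanned pages"
noisy = corrupt(clean, CONFUSIONS, rng)
print(noisy)
print(round(char_accuracy(clean, noisy), 3))
```

Running retrieval experiments on the corrupted copy lets existing relevance judgments be reused, which is what makes this the simplest of the three approaches. It captures only the errors the model describes, not segmentation failures or page-level degradation.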
7
Key Points
Converting images to text is a good start
– But there is more to think about
Image processing yields useful features
Structure and content are both important
– Always true for any collection
– Especially clear with document images