Scanned Documents INST 734 Module 10 Doug Oard

Agenda
Document image retrieval
Representation
Retrieval
Thanks to David Doermann for most of these slides

Character N-Grams
Approach
  –Split up document into overlapping n-character sequences
    May or may not break on space between words
  –“DOCUMENT” -> DOC, OCU, CUM, UME, MEN, ENT
Somewhat language-neutral
  –Use average word length to get a stemming-like effect
Statistically robust to small numbers of errors
  –Good choice between 70%-85% character accuracy
    Above 85%, use words, dictionary-based correction, and stemming
    Below 70%, use shape codes
  –Modeling character confusion can add a bit more robustness
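The n-gram splitting on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own implementation: the function names `char_ngrams` and `shared_ngrams`, the `cross_spaces` flag, and the misrecognized word "DOCUNENT" are all assumptions made for the example.

```python
def char_ngrams(text, n=3, cross_spaces=False):
    """Return the overlapping n-character sequences of `text`.

    cross_spaces is a hypothetical flag: when False, n-grams are built
    inside each whitespace-separated word; when True, they may span the
    space between words.
    """
    units = [text] if cross_spaces else text.split()
    grams = []
    for unit in units:
        # Slide a window of width n across the unit, one character at a time.
        grams.extend(unit[i:i + n] for i in range(len(unit) - n + 1))
    return grams


def shared_ngrams(a, b, n=3):
    """Return the set of n-grams that two strings have in common."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))


print(char_ngrams("DOCUMENT"))
# A single (made-up) OCR error still leaves several n-grams intact.
print(sorted(shared_ngrams("DOCUMENT", "DOCUNENT")))
```

The second call shows why the representation tolerates a small number of character errors: one wrong character damages only the n-grams that overlap it, so the remaining n-grams can still match.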

Shape Codes
Group all characters that have similar shapes
  –{A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, 2, 3, 4, 5, 6, 7, 8, 9, 0}
  –{a, c, e, n, o, r, s, u, v, x, z}
  –{b, d, h, k, }
  –{f, t}
  –{g, p, q, y}
  –{i, j, l, 1}
  –{m, w}
Faster than character recognition
  –Seconds per page, and very accurate
Preserves recall, but with lower precision
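The grouping above can be illustrated by mapping text to shape codes. This sketch works on already-known characters only to show the resulting representation; a real system would assign the class directly from the image. The digit labels 0 to 6, the names `SHAPE_CLASSES` and `shape_code`, and the choice to pass unlisted characters through unchanged are assumptions for the example. The third class is used exactly as it appears on the slide.

```python
# Shape classes as listed on the slide; the numeric labels are arbitrary.
SHAPE_CLASSES = [
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567890",  # tall characters
    "acenorsuvxz",                          # x-height characters
    "bdhk",                                 # ascenders
    "ft",
    "gpqy",                                 # descenders
    "ijl1",
    "mw",                                   # wide characters
]

# Lookup table from each character to the label of its class.
CHAR_TO_CODE = {
    ch: str(label)
    for label, group in enumerate(SHAPE_CLASSES)
    for ch in group
}


def shape_code(text):
    """Replace each character with its class label.

    Characters outside every class (spaces, punctuation) are kept as-is,
    which is an assumption of this sketch.
    """
    return "".join(CHAR_TO_CODE.get(ch, ch) for ch in text)


print(shape_code("Document"))
# Different words can collapse to the same code.
print(shape_code("cat"), shape_code("eat"))
```

Because "cat" and "eat" receive the same code, a query for one also matches the other. That is the recall-preserving, lower-precision behavior the slide describes.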

Retrieval Using Image Features
Advantages
  –Efficiency: process only what you need
    Particularly important for filtering tasks
  –Effectiveness: Limit effect of cascaded errors
Applications
  –Handwriting
    Character segmentation is hard
  –Keyword spotting
    Capitalization of names is easily seen

Evaluation
The simplest approach: Model-based evaluation
  –Apply character confusion statistics to an existing collection
A bit better: Print-then-scan evaluation
  –Scanning is slow, but availability is no problem
Best: Scan-only evaluation
  –Requires relevance judgments on document images
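Model-based evaluation can be sketched as a function that corrupts clean text according to a character confusion table, so that an existing test collection can be reused with simulated recognition errors. Everything specific here is hypothetical: the `CONFUSIONS` table and its probabilities are invented for illustration, and `degrade` is just a name chosen for this example. Real statistics would be estimated from an actual recognizer.

```python
import random

# Hypothetical confusion statistics: for each true character, the possible
# recognizer outputs and their probabilities. The values are made up.
CONFUSIONS = {
    "m": [("m", 0.90), ("rn", 0.07), ("n", 0.03)],
    "l": [("l", 0.92), ("1", 0.05), ("i", 0.03)],
    "e": [("e", 0.95), ("c", 0.05)],
}


def degrade(text, confusions, rng):
    """Return a copy of `text` with simulated recognition errors.

    Each character listed in `confusions` is replaced by an output sampled
    from its distribution; all other characters pass through unchanged.
    """
    out = []
    for ch in text:
        options = confusions.get(ch)
        if options is None:
            out.append(ch)
            continue
        outputs, weights = zip(*options)
        out.append(rng.choices(outputs, weights=weights, k=1)[0])
    return "".join(out)


# A fixed seed keeps the simulated collection reproducible across runs.
rng = random.Random(734)
print(degrade("model based evaluation of document image retrieval",
              CONFUSIONS, rng))
```

Running the same queries against the clean and degraded versions of a collection then gives a rough estimate of how much retrieval effectiveness is lost to recognition errors, without scanning anything.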

Key Points
Converting images to text is a good start
  –But there is more to think about
Image processing yields useful features
Structure and content are both important
  –Always true for any collection
  –Especially clear with document images
