Document Analysis: Text Recognition
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008
© Prof. Rolf Ingold
Outline
- Objectives
- Typology
- Processing chain
- Methodology
- Character recognition
- Word recognition
- An OCR experiment
- Conclusion
Objectives
- Text recognition is the most advanced domain of document analysis.
- It aims at analyzing images to extract text, i.e. sequences of character codes (ASCII/Unicode).
Character recognition typology
- Machine-printed vs. handwritten text
  - on-line vs. off-line handwriting
- Isolated characters, connected characters, cursive text
- Various alphabets
  - Western languages (Roman, Greek, ...)
  - Asian (Chinese, Japanese, Korean, Thai, ...)
  - Arabic alphabets
- Limited vocabulary
  - only numbers, only uppercase
  - text with/without diacritics/punctuation
  - restricted vocabulary (city names, street names, ...)
  - language contextual knowledge
Factors influencing performance
- Variability in style
  - single writer, omni-writer
  - mono-font, multi-font, omni-font
- Geometrical variability
  - in size
  - in orientation (rotated)
  - in transformations (slanted, perspective view, ...)
- Image resolution
  - binary images, starting at 200 dpi
  - grey-level images, starting at 150 dpi
- Image quality
  - degraded support (historical documents)
  - acquisition conditions (bad illumination, optical aberration, noise, ...)
European languages
European text is characterized by
- a limited set of characters (26 to ~100 classes)
- diacritics and punctuation
- isolated characters and cursive scripts
- left-to-right and top-to-bottom writing
- a large variety of fonts
- different handwriting styles
Arabic
Arabic text is characterized by
- right-to-left writing
- a limited set of characters
- context-dependent glyphs
- connected characters
- diacritics
- justification by word stretching
Asian scripts
Asian text is characterized by
- numerous scripts (hanzi, kanji, hanja, han tu, hiragana, katakana, ...)
- horizontal and vertical writing
- very large alphabets
- structured characters
- grid-based layout
OCR Methodologies
OCR systems are very complex and combine several steps:
- image segmentation into characters or isolated shapes
- preprocessing of hypothetical segmented characters
  - size normalization, morphological filtering, thinning, ...
- feature extraction of isolated shapes
  - global measures (width-to-height ratio, density, center of gravity, moments, ...)
  - local properties (stroke thickness, ...)
- shape identification using
  - a single classifier (neural network, support vector machine, ...)
  - multiple classifiers and fusion methods
- word validation using contextual information
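As an illustration of the global measures listed above, the following minimal sketch computes the width-to-height ratio, the ink density and the center of gravity of one isolated shape. The image representation (a list of rows of 0/1 pixels, 1 = ink), the function name `global_features` and the sample shape are assumptions made for this example, not part of the course material.

```python
def global_features(img):
    """Global measures of a binary shape (list of rows, 1 = ink pixel)."""
    ink = [(x, y) for y, row in enumerate(img) for x, v in enumerate(row) if v]
    if not ink:
        raise ValueError("empty shape")
    xs = [x for x, _ in ink]
    ys = [y for _, y in ink]
    # Bounding box of the ink pixels
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1
    return {
        "aspect_ratio": width / height,            # width-to-height ratio
        "density": len(ink) / (width * height),    # ink pixels / bounding-box area
        "centroid": (sum(xs) / len(ink), sum(ys) / len(ink)),  # center of gravity
    }

# Hypothetical 3x5 letter "L"
L_SHAPE = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
print(global_features(L_SHAPE))
```

Such measures are cheap to compute but only coarsely discriminative, which is why they are usually combined with other features.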
Character vs. word recognition
There is a paradox with cursive text or connected characters:
- character recognition supposes prior character segmentation
- character segmentation requires prior character recognition
Several approaches bypass this paradox:
- entire word modeling and recognition
- multiple hypothesis generation and testing
- combined character segmentation and recognition (HMM)
Processing chain
[Diagram; box labels: normalization, line segmentation, word segmentation, character segmentation, isolated character recognition (feature/primitive extraction, identification), word recognition (feature/primitive extraction, identification), post-analysis, recognized word]
Isolated character recognition
Isolated character recognition is applicable to
- high-quality printed text
- constrained handwriting (forms)
The challenge is to take into account the variability within each class.
Performance depends on
- the size of the alphabet (number of classes)
- image quality
Several classification strategies
- Direct comparison with class model
- Statistical pattern recognition using features
- Structural pattern recognition using primitives
- Hybrid approaches combining statistical and structural approaches
- Use of multiple classifiers and fusion of their results
Comparison with class model or class samples
The unknown pattern is compared with one representative of each class (model):
- a similarity measure is returned
- the decision is determined by the most similar sample
- rejection may occur if the best similarity is below a threshold
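The comparison scheme above can be sketched as follows. This is a minimal illustration that assumes same-size binary images and uses the Hamming distance (number of differing pixels) as an inverse similarity, so a small distance means a high similarity and rejection happens when even the best distance is too large. The names `hamming`, `classify`, `MODELS` and the threshold value are hypothetical.

```python
def hamming(a, b):
    """Number of differing pixels between two same-size binary images."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def classify(pattern, models, max_distance):
    """Return the label of the closest model, or None to reject the pattern."""
    label, dist = min(
        ((lab, hamming(pattern, model)) for lab, model in models.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= max_distance else None

# One hypothetical 3x3 representative per class
MODELS = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}

noisy_i = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]   # an "I" with one flipped pixel
blob = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]      # resembles neither model

print(classify(noisy_i, MODELS, max_distance=2))  # I
print(classify(blob, MODELS, max_distance=2))     # None (rejected)
```

Raising `max_distance` trades rejections for substitution errors; where to set it depends on the cost of each in the application.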
Similarity measures
- Hamming distance
- Warping distance
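A warping distance can be illustrated with classic dynamic time warping (DTW) between two 1-D feature sequences, for instance column profiles of two character images of different widths. This is a sketch under that assumption; the slides do not specify which warping formulation is meant, and the function name `dtw` is chosen for this example.

```python
def dtw(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = cost of the best alignment of a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the second sequence is a stretched copy
print(dtw([0, 0], [1, 1]))           # 2.0
```

Unlike the Hamming distance, this measure tolerates local stretching or compression, which makes it better suited to shapes whose width varies.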
Features for statistical approaches
Sometimes preprocessing at image level is required:
- smoothing
- size normalization
- stroke normalization
- skeletonization
- ...
Features are extracted:
- horizontal and vertical projection profiles
- central moments
- intersections with lines
- global transforms (Hough, Fourier, ...)
- local features (densities, moments, ...)
- ...
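Projection profiles are among the simplest of these features. The sketch below assumes a binary image stored as a list of rows (1 = ink) and adopts the convention that the horizontal profile holds one ink count per row and the vertical profile one ink count per column; the function name is illustrative.

```python
def projection_profiles(img):
    """Return (horizontal, vertical) projection profiles of a binary image."""
    horizontal = [sum(row) for row in img]        # ink pixels in each row
    vertical = [sum(col) for col in zip(*img)]    # ink pixels in each column
    return horizontal, vertical

# Hypothetical 3x3 letter "T"
T_SHAPE = [
    [1, 1, 1],
    [0, 1, 0],
    [0, 1, 0],
]
print(projection_profiles(T_SHAPE))  # ([3, 1, 1], [1, 3, 1])
```

After size normalization the two profiles can be concatenated into a fixed-length feature vector for a statistical classifier.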
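Central moments
Central moments are shift-invariant properties defined, in the standard formulation for an image f(x, y), as
\[ \mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q f(x, y) \]
with
\[ \bar{x} = \frac{m_{10}}{m_{00}}, \quad \bar{y} = \frac{m_{01}}{m_{00}}, \quad m_{pq} = \sum_x \sum_y x^p y^q f(x, y) \]
They can be computed from the raw moments m_{pq} using formulas such as
\[ \mu_{00} = m_{00}, \quad \mu_{10} = \mu_{01} = 0, \quad \mu_{11} = m_{11} - \bar{x} m_{01}, \quad \mu_{20} = m_{20} - \bar{x} m_{10}, \quad \mu_{02} = m_{02} - \bar{y} m_{01} \]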
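The shift invariance of central moments can be checked numerically. The sketch below uses the standard definition (raw moments m_pq, centroid m_10/m_00 and m_01/m_00, moments taken about the centroid), which is assumed here rather than taken from the slide; the image is a list of rows of pixel values and the function names are illustrative.

```python
def raw_moment(img, p, q):
    """Raw moment m_pq = sum of x^p * y^q * f(x, y)."""
    return sum((x ** p) * (y ** q) * v
               for y, row in enumerate(img) for x, v in enumerate(row))

def central_moment(img, p, q):
    """Central moment mu_pq, taken about the center of gravity."""
    m00 = raw_moment(img, 0, 0)
    xc = raw_moment(img, 1, 0) / m00
    yc = raw_moment(img, 0, 1) / m00
    return sum(((x - xc) ** p) * ((y - yc) ** q) * v
               for y, row in enumerate(img) for x, v in enumerate(row))

# The same small shape at two different positions in a 4x4 image
SHAPE_A = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
SHAPE_B = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]

for p, q in [(2, 0), (0, 2), (1, 1)]:
    print((p, q), central_moment(SHAPE_A, p, q), central_moment(SHAPE_B, p, q))
```

Each pair of printed values agrees up to floating-point rounding, whereas the raw moments of the two images differ.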
Primitives for structural approaches
Shapes are decomposed into strokes and several properties are extracted:
- number of connected components
- number of holes
- number and relative position of singular points
  - extremities
  - connections
  - crossings
- concavities, convexities
- ...
These primitives are represented as
- strings
- trees
- graphs
and used for comparison.
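The first two primitives, the number of connected components and the number of holes, can be obtained by flood filling. The sketch below is one possible approach, assuming a binary image (1 = ink), 8-connectivity for ink and 4-connectivity for background; a hole is counted as a background region that does not touch the image border. All names are chosen for this example.

```python
N4 = [(1, 0), (-1, 0), (0, 1), (0, -1)]
N8 = N4 + [(1, 1), (1, -1), (-1, 1), (-1, -1)]

def count_regions(img, value, neighbours):
    """Count connected regions of pixels equal to `value` using flood fill."""
    h, w = len(img), len(img[0])
    seen, count = set(), 0
    for y in range(h):
        for x in range(w):
            if img[y][x] != value or (x, y) in seen:
                continue
            count += 1
            seen.add((x, y))
            stack = [(x, y)]
            while stack:
                cx, cy = stack.pop()
                for dx, dy in neighbours:
                    nx, ny = cx + dx, cy + dy
                    if (0 <= nx < w and 0 <= ny < h
                            and img[ny][nx] == value and (nx, ny) not in seen):
                        seen.add((nx, ny))
                        stack.append((nx, ny))
    return count

def components_and_holes(img):
    """Return (number of ink components, number of holes)."""
    w = len(img[0])
    # Pad with background so the outer background forms exactly one region
    padded = [[0] * (w + 2)] + [[0] + list(r) + [0] for r in img] + [[0] * (w + 2)]
    components = count_regions(padded, 1, N8)
    holes = count_regions(padded, 0, N4) - 1   # subtract the outer background
    return components, holes

EIGHT = [[1, 1, 1], [1, 0, 1], [1, 1, 1], [1, 0, 1], [1, 1, 1]]
LETTER_I = [[1], [0], [1], [1]]   # dot plus stem

print(components_and_holes(EIGHT))     # (1, 2)
print(components_and_holes(LETTER_I))  # (2, 0)
```

These two counts already separate many character classes (for example "8", "0" and "1") without any metric comparison.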
Identification
For statistical approaches, different classifiers are used:
- discriminant functions
- kNN classifier
- multi-layer perceptron
- support ...
Structural approaches use
- hierarchical classification
- string distances
- graph matching
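Of the statistical classifiers listed, kNN is the easiest to sketch. The example below assumes that each character has already been reduced to a numeric feature vector; the two-dimensional features and the training samples are invented for illustration only.

```python
import math
from collections import Counter

def knn_classify(x, training, k=3):
    """Assign x to the majority class among its k nearest training samples."""
    nearest = sorted(training, key=lambda sample: math.dist(x, sample[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (aspect ratio, density) feature vectors with their labels
TRAINING = [
    ((0.30, 0.90), "I"), ((0.35, 0.85), "I"), ((0.25, 0.95), "I"),
    ((0.90, 0.50), "O"), ((1.00, 0.45), "O"), ((0.95, 0.55), "O"),
]

print(knn_classify((0.32, 0.88), TRAINING))  # I
print(knn_classify((0.92, 0.50), TRAINING))  # O
```

kNN needs no training phase, but every classification scans the whole sample set, so its cost grows with the number of stored samples.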
Multilayer perceptron
- Information is propagated through a layered network of "neurons"
- The decision is given by the highest activation on the output layer
- Weights of connections are computed in a training phase
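The propagation and decision steps can be sketched as a forward pass. In this minimal example the weights are hand-picked for illustration rather than learned in a training phase, and the sigmoid activation, network size and class labels are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """Activations of one fully connected layer."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def mlp_predict(x, params, labels):
    """Propagate x through all layers; decide by the highest output activation."""
    activations = x
    for weights, biases in params:
        activations = layer(activations, weights, biases)
    best = max(range(len(activations)), key=activations.__getitem__)
    return labels[best], activations

# Hypothetical 2-2-2 network: (weights, biases) for the hidden and output layers
PARAMS = [
    ([[4.0, -4.0], [-4.0, 4.0]], [0.0, 0.0]),
    ([[4.0, -4.0], [-4.0, 4.0]], [0.0, 0.0]),
]
LABELS = ["I", "O"]

print(mlp_predict([1.0, 0.0], PARAMS, LABELS)[0])  # I
print(mlp_predict([0.0, 1.0], PARAMS, LABELS)[0])  # O
```

The output activations can also serve as a confidence value: a small gap between the two highest activations is a natural criterion for rejection.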
Hierarchical classification
Structural pattern recognition can be performed hierarchically.
OCR Difficulties
The main sources of errors are
- variability of character shapes (special fonts, handwriting)
- image defects: noise and distortions
- broken or touching characters
- shape similarity ("0" and "O"; "1", "I" and "l"; "5" and "S"; ...)
- small shapes: punctuation, accents, superscripts ("er", "ème")
- special characters ("©", "½", "±", ...) or bullets
Word recognition
Text recognition at word level makes sense
- in case of restricted vocabulary
- for language-driven approaches
- for knowledge-based approaches
- for keyword spotting
Word recognition is typically used for
- cursive scripts
- handwriting
- noisy text that is difficult to segment
- low-resolution text
Word Recognition specificities
Word recognition is more complex than character recognition:
- the number of classes is usually much higher
- more features are needed
Word recognition can take into account external information:
- language-based knowledge
  - dictionary
  - character frequencies, bigrams, trigrams, etc.
- structural constraints (security number, dates, ...)
- restricted vocabulary (e.g. city names, street names, ...)
- redundancy (e.g. zip codes and city names)
Hidden Markov Models (HMM) are suited for word recognition.
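To make the use of bigram statistics concrete, the sketch below estimates character bigram probabilities from a small word list and scores candidate words with them. The lexicon of city names, the add-one smoothing and the assumed alphabet size are illustrative choices, not taken from the course.

```python
import math
from collections import Counter

def train_bigrams(words):
    """Count character bigrams, with ^ and $ marking word start and end."""
    pairs, contexts = Counter(), Counter()
    for word in words:
        padded = "^" + word.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            pairs[a, b] += 1
            contexts[a] += 1
    return pairs, contexts

def log_prob(word, model, alphabet_size=28):
    """Log-probability of a word under the bigram model (add-one smoothing)."""
    pairs, contexts = model
    padded = "^" + word.lower() + "$"
    return sum(math.log((pairs[a, b] + 1) / (contexts[a] + alphabet_size))
               for a, b in zip(padded, padded[1:]))

# Hypothetical restricted vocabulary
LEXICON = ["fribourg", "bern", "basel", "lausanne", "geneve", "zurich", "lugano", "luzern"]
MODEL = train_bigrams(LEXICON)

# Two competing readings of the same word image ("e" confused with "c")
for candidate in ("bern", "bcrn"):
    print(candidate, round(log_prob(candidate, MODEL), 2))
```

The reading "bern" receives the higher score because the bigrams "be" and "er" occur in the lexicon while "bc" and "cr" do not.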
Hidden Markov models (1)
Each class is modeled by a two-stage stochastic process using hidden and visible states.
A model λ = (A, B, π) is composed of
- A, the matrix of transition probabilities
- B, the matrix of observation probabilities
- π, the vector of the initial state probabilities
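Hidden Markov models (2)
In the standard formulation, the probability of an observation sequence O = o_1 ... o_T given a model λ = (A, B, π) can be computed by summing over all hidden state sequences Q = q_1 ... q_T:
\[ P(O \mid \lambda) = \sum_{Q} \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t) \]
where a_{ij} and b_j(o) are the entries of A and B.
A pattern is assigned to the model with the highest posterior probability (i.e., the model that best explains the pattern).
The parameters of the model (probabilities) are determined in a training phase using training samples.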
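In practice the observation probability is computed with the forward algorithm instead of enumerating every state sequence. The sketch below assumes discrete observation symbols and two hand-made models with invented parameters; it selects the model with the highest likelihood, which coincides with the highest posterior only when all classes have equal prior probability.

```python
def forward(obs, A, B, pi):
    """P(obs | model) for a discrete HMM, computed with the forward algorithm."""
    n = len(pi)
    # alpha[i] = probability of the observations so far, ending in state i
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

def classify(obs, models):
    """Return the name of the model that best explains the observations."""
    return max(models, key=lambda name: forward(obs, *models[name]))

# Hypothetical left-to-right models over the symbols {0, 1}: (A, B, pi)
A = [[0.5, 0.5], [0.0, 1.0]]
PI = [1.0, 0.0]
MODELS = {
    "a": (A, [[0.9, 0.1], [0.1, 0.9]], PI),   # mostly 0s first, then 1s
    "b": (A, [[0.1, 0.9], [0.9, 0.1]], PI),   # mostly 1s first, then 0s
}

print(classify([0, 0, 1, 1], MODELS))  # a
print(classify([1, 1, 0, 0], MODELS))  # b
```

For long sequences the products underflow, so real implementations rescale alpha at each step or work with log-probabilities.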
Post-analysis
Post-analysis is performed with the aim of validating or correcting character recognition, based on
- dictionaries
- bigrams, trigrams
- confidence of character recognition (if available)
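Dictionary-based correction can be sketched with an edit distance: a recognized word that is not in the dictionary is replaced by the closest entry, provided it is close enough. The Levenshtein distance, the `max_edits` limit and the tiny dictionary are assumptions for this example; a real system would also weigh the character recognition confidences.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimal number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def correct(word, dictionary, max_edits=1):
    """Return the closest dictionary word, or the word itself if none is close."""
    if word in dictionary:
        return word
    best = min(sorted(dictionary), key=lambda entry: edit_distance(word, entry))
    return best if edit_distance(word, best) <= max_edits else word

DICTIONARY = {"Fribourg", "Bern", "Basel"}

print(correct("Fr1bourg", DICTIONARY))  # Fribourg ("i" misread as "1")
print(correct("Xyz", DICTIONARY))       # Xyz (no entry within one edit)
```

Leaving distant words unchanged avoids replacing correctly recognized out-of-vocabulary words, such as proper names, with dictionary entries.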
Character Recognition Performance
OCR (Optical Character Recognition) is the most mature technique of document analysis.
For most applications, very high accuracy is required:
- a 99% recognition rate still means one error per 100 characters, i.e. several errors on every page
- 99.9% to 99.99% is often requested
OCR systems may be designed for
- standard OCR-A and OCR-B fonts, specially designed for OCR
- mono-font recognition, specialized for typewriting
- omni-font or multi-font text recognition (the most popular)
- trainable systems, tuned for specific fonts and styles
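The effect of the recognition rate on a full page is simple arithmetic. The sketch below assumes a page of 2000 characters, a figure chosen only for illustration and not taken from the slides.

```python
def expected_errors(chars_per_page, recognition_rate):
    """Expected number of misrecognized characters on one page."""
    return chars_per_page * (1.0 - recognition_rate)

CHARS_PER_PAGE = 2000  # assumption for illustration, not a figure from the slides

for rate in (0.99, 0.999, 0.9999):
    errors = expected_errors(CHARS_PER_PAGE, rate)
    print(f"{rate:.2%}: about {errors:.1f} errors per page")
```

Under this assumption the three rates give roughly 20, 2 and 0.2 errors per page, which shows why rates of 99.9% and above are requested.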
OCR experiment
One magazine page [Hebdo 18, 2007, Editorial]
- good-quality printed text
- medium layout complexity
First Experiment (Standard MS-Office OCR)
- Layout not understood
- Many OCR errors
=> Unusable!
Second Experiment (ReadIris)
- Page layout correctly recognized
- Italics correctly detected, with one false detection
- A few word segmentation errors
- Correct de-hyphenation
- A few OCR errors, often as a consequence of segmentation errors!
Conclusion on OCR technologies
- Imperfect results in printed character recognition
- Recognition of uncontrolled handwriting is not mature
- Practical problems with
  - mathematics (symbols and formulas)
  - special fonts or scripts
  - logos
=> Perfect document recognition is not achievable!
- Some applications can deal with approximate results
- Recognition algorithms should be tuned to prefer rejections to errors
- Include manual correction in the processing step