Prénom Nom Document Analysis: TextRecognition Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.

© Prof. Rolf Ingold 4 Character recognition typology  Machine printed vs. handwritten text  on-line vs. off-line handwriting  Isolated characters, connected characters, cursive text  Various alphabets  Western languages (Roman, Greek,...)  Asian (Chinese, Japanese, Korean, Thai,...)  Arabic alphabets  Limited vocabulary  only numbers, only uppercase text  with/without diacritics/punctuation  restricted vocabulary (city names, street names,...)  language  contextual knowledge

© Prof. Rolf Ingold 5 Factors influencing performance  Variability in style  single writer, omni-writer  mono-font, multi-font, omni-font  Geometrical variability  in size  in orientation (rotated)  in transformations (slanted, perspective view,...)  Image resolution  binary images, starting at 200 dpi  grey-level images, starting at 150 dpi  Image quality  degraded support (historical documents)  acquisition conditions (bad illumination, optical aberration, noise,...)

© Prof. Rolf Ingold 6 European languages  European text is characterized by  limited set of characters (26 to ~100 classes)  diacritics and punctuation  isolated characters and cursive scripts  left-to-right and top-to-bottom writing  large variety of fonts  different handwriting styles

© Prof. Rolf Ingold 7 Arabic  Arabic text is characterized by  right-to-left writing  limited set of characters  context-dependent glyphs  connected characters  diacritics  justification by word stretching

© Prof. Rolf Ingold 8 Asian scripts  Asian text is characterized by  numerous scripts (hanzi, kanji, hanja, han tu, hiragana, katakana,...)  horizontal and vertical writing  very large alphabets  structured characters  grid-based layout

© Prof. Rolf Ingold 9 OCR Methodologies  OCR systems are very complex and combine several steps  image segmentation into characters or isolated shapes  preprocessing of hypothetical segmented characters  size normalization, morphological filtering, thinning,...  feature extraction of isolated shapes  global measures (width to height ratio, density, center of gravity, moments,...)  local properties (stroke thickness,...)  shape identification using  a single classifier (neural network, support vector machine,...)  multiple classifiers and fusion methods  word validation using contextual information
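
As a rough illustration of the global measures listed above, the following Python sketch computes the width-to-height ratio, ink density and center of gravity of a binary character image. The function name `global_features`, the list-of-rows image representation and the sample glyph are assumptions made for this example; they do not come from the course material.

```python
def global_features(image):
    """Compute a few global measures of a binary character image.

    `image` is a list of rows; each pixel is 0 (background) or 1 (ink).
    """
    ink = [(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v]
    if not ink:
        raise ValueError("image contains no ink pixels")
    rows = [r for r, _ in ink]
    cols = [c for _, c in ink]
    # Bounding box of the ink pixels
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return {
        "aspect_ratio": width / height,                 # width-to-height ratio
        "density": len(ink) / (width * height),         # ink share of the box
        "centroid": (sum(rows) / len(ink), sum(cols) / len(ink)),  # (row, col)
    }


# Hypothetical 5x3 glyph shaped like the letter "L"
LETTER_L = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
features = global_features(LETTER_L)
```

The resulting dictionary can serve as a small feature vector for a statistical classifier; real systems typically use many more features.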

© Prof. Rolf Ingold 10 Character vs. word recognition  There is a paradox with cursive text or connected characters:  character recognition supposes prior character segmentation  character segmentation requires prior character recognition  Several approaches to bypass this paradox  entire word modeling and recognition  multiple hypothesis generation and testing  combined character segmentation and recognition (HMM)

© Prof. Rolf Ingold 11 Processing chain [diagram; box labels: normalization, line segmentation, word segmentation, character segmentation, isolated character recognition, word recognition, feature/primitive extraction and identification (shown twice), post-analysis, recognized word]

© Prof. Rolf Ingold 12 Isolated character recognition  Isolated character recognition is applicable  on high quality printed text  on constrained handwriting (forms)  The challenge is to take into account the variability of the class  Performance depends on  size of alphabet (number of classes)  image quality

© Prof. Rolf Ingold 13 Several classification strategies  Direct comparison with class model  Statistical pattern recognition  using features  Structural pattern recognition  using primitives  Hybrid approaches combining statistical and structural approaches  Use of multiple classifiers and fusion of their results

© Prof. Rolf Ingold 14 Comparison with class model or class samples  The unknown pattern is compared with one representative of each class (model)  a similarity measure is returned  the decision is determined by the most similar sample  rejection may occur if the similarity is below a threshold
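
The comparison with class models can be sketched as follows, assuming same-sized binary images and a simple pixel-agreement similarity. The names `similarity` and `classify`, the 0.8 threshold and the tiny 3x3 models are all hypothetical choices for this example. The pattern is rejected when even the best similarity is too low.

```python
def similarity(a, b):
    """Fraction of pixels on which two same-sized binary images agree."""
    total = sum(len(row) for row in a)
    agree = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return agree / total


def classify(pattern, models, reject_below=0.8):
    """Return (label, score); label is None when the best match is too weak."""
    label, score = max(
        ((name, similarity(pattern, model)) for name, model in models.items()),
        key=lambda item: item[1],
    )
    return (label if score >= reject_below else None), score


# Hypothetical class models, one representative per class
MODELS = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}
noisy_l = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]  # an "L" with one pixel missing
blob = [[1, 1, 1], [1, 1, 1], [0, 0, 0]]     # resembles neither model

label_l, score_l = classify(noisy_l, MODELS)    # accepted as "L"
label_blob, score_blob = classify(blob, MODELS)  # rejected (label is None)
```

Raising `reject_below` trades recognition errors for rejections, which can then be handled manually.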

© Prof. Rolf Ingold 16 Features for statistical approaches  Sometimes preprocessing at image level is required  smoothing  size normalization  stroke normalization  skeletonization ...  Features are extracted  horizontal and vertical projection profile  central moments  intersections with lines  global transforms (Hough, Fourier,...)  local features (densities, moments,...) ...
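
Two of the features named above are easy to sketch in Python: projection profiles (ink counts per row and per column) and the number of intersections of the shape with a horizontal scan line. The function names and the sample glyph are assumptions made for this illustration.

```python
def projection_profiles(image):
    """Return (horizontal, vertical) ink counts of a binary image."""
    horizontal = [sum(row) for row in image]        # ink pixels per row
    vertical = [sum(col) for col in zip(*image)]    # ink pixels per column
    return horizontal, vertical


def crossings(row):
    """Number of background-to-ink transitions along one scan line."""
    return sum(1 for prev, cur in zip([0] + row, row) if prev == 0 and cur == 1)


# Hypothetical 5x5 glyph shaped like the letter "H"
LETTER_H = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]
horizontal, vertical = projection_profiles(LETTER_H)
top_crossings = crossings(LETTER_H[0])     # scan line through the two stems
middle_crossings = crossings(LETTER_H[2])  # scan line through the crossbar
```

Concatenating such values gives a fixed-length feature vector, provided the images have been size-normalized first.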

© Prof. Rolf Ingold 18 Primitives for structural approaches  Shapes are decomposed into strokes and several properties are extracted  number of connected components  number of holes  number and relative position of singular points  extremities  connections  crossings  concavities, convexities ...  These primitives are represented as  strings  trees  graphs and used for comparison
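
The first two properties, the number of connected components and the number of holes, can be obtained with a flood fill. The sketch below is one possible way to do it, assuming 8-connectivity for ink and 4-connectivity for background; the function names and sample glyphs are hypothetical.

```python
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_regions(grid, target, neighbours):
    """Count connected regions of cells equal to `target` using flood fill."""
    h, w = len(grid), len(grid[0])
    seen = set()
    count = 0
    for r in range(h):
        for c in range(w):
            if grid[r][c] != target or (r, c) in seen:
                continue
            count += 1
            stack = [(r, c)]
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                for dy, dx in neighbours:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and grid[ny][nx] == target and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return count


def components_and_holes(image):
    """Return (ink components, holes) of a binary image."""
    w = len(image[0])
    # Pad with background so the outer background forms exactly one region
    padded = [[0] * (w + 2)] + [[0] + row + [0] for row in image] + [[0] * (w + 2)]
    ink = count_regions(padded, 1, N8)
    background = count_regions(padded, 0, N4)
    return ink, background - 1  # every background region except the outer one


DIGIT_8 = [[1, 1, 1], [1, 0, 1], [1, 1, 1], [1, 0, 1], [1, 1, 1]]
LETTER_I = [[1], [0], [1], [1]]  # lowercase "i": dot plus stem
```

With these glyphs, `components_and_holes(DIGIT_8)` gives one component with two holes, and `components_and_holes(LETTER_I)` gives two components without holes.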

© Prof. Rolf Ingold 19 Identification  For statistical approaches, different classifiers are used  discriminant functions  kNN classifier  multi-layer perceptron  support ...  For structural approaches, use  hierarchical classification  string distances  graph matching
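
As one example of a statistical classifier from this list, a k-nearest-neighbour (kNN) classifier can be sketched in a few lines of Python. The two-dimensional feature vectors and their labels below are made-up values for illustration only.

```python
import math
from collections import Counter


def knn_classify(sample, training, k=3):
    """Label `sample` by majority vote among its k nearest training vectors."""
    nearest = sorted(training, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (aspect ratio, density) -> class label
TRAINING = [
    ((0.20, 0.90), "I"),
    ((0.25, 0.85), "I"),
    ((0.30, 0.80), "I"),
    ((0.90, 0.50), "O"),
    ((1.00, 0.45), "O"),
    ((0.95, 0.55), "O"),
]
narrow = knn_classify((0.28, 0.88), TRAINING)
round_shape = knn_classify((0.92, 0.50), TRAINING)
```

Here `narrow` is classified as "I" and `round_shape` as "O". An odd `k` avoids ties in the two-class case.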

© Prof. Rolf Ingold 20 Multilayer perceptron  Information is propagated through a layered network of "neurons"  the decision is given by the highest activation on the output layer  the weights of the connections are computed in a training phase
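
The forward propagation and the decision rule can be sketched as below, assuming sigmoid activations. The weights here are hand-picked for illustration rather than learned in a training phase, and the names `forward`, `decide`, `LAYERS` and `LABELS` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, layers):
    """Propagate `inputs` through `layers`; each layer is (weights, biases)."""
    activations = inputs
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


def decide(inputs, layers, labels):
    """Return the label of the output neuron with the highest activation."""
    outputs = forward(inputs, layers)
    return labels[outputs.index(max(outputs))]


# Hand-picked (untrained) weights: 2 inputs -> 2 hidden -> 2 outputs
LAYERS = [
    ([[4.0, -4.0], [-4.0, 4.0]], [0.0, 0.0]),  # hidden layer
    ([[3.0, -3.0], [-3.0, 3.0]], [0.0, 0.0]),  # output layer
]
LABELS = ["A", "B"]
first = decide([1.0, 0.0], LAYERS, LABELS)
second = decide([0.0, 1.0], LAYERS, LABELS)
```

With these weights, `first` is "A" and `second` is "B". In practice the weights would be obtained by training, for example with backpropagation.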

© Prof. Rolf Ingold 22 OCR Difficulties  The main sources of errors are  variability of character shapes (special fonts, handwriting)  image defects: noise and distortions  broken or touching characters  shape similarity ("0" and "O", "1", "I" and "l", "5" and "S",...)  small shapes: punctuation, accents, superscripts (" er ", " ème ")  special characters ("©", "½", "±",...) or bullets

© Prof. Rolf Ingold 23 Word recognition  Text recognition at word level makes sense  in case of restricted vocabulary  for language-driven approaches  for knowledge-based approaches  for keyword spotting  Word recognition is typically used for  cursive scripts  handwriting  noisy text, difficult to segment  low-resolution text

© Prof. Rolf Ingold 24 Word Recognition specificities  Word recognition is more complex than character recognition  usually the number of classes is much higher  more features are needed  Word recognition can take into account external information  language-based knowledge  dictionary  character frequencies, bigrams, trigrams, etc.  structural constraints (security number, dates,...)  restricted vocabulary (e.g. city names, street names,...)  redundancy (e.g. zip codes and city names)  Hidden Markov Models (HMM) are suited for word recognition

© Prof. Rolf Ingold 25 Hidden Markov models (1)  Each class is modeled by a two-stage stochastic process using hidden and visible states  A model λ = (A, B, π) is composed of  A, the matrix of transition probabilities  B, the matrix of observation probabilities  π, the vector of the initial state probabilities

© Prof. Rolf Ingold 26 Hidden Markov models (2)  The probability of an observation can be computed using [formula not preserved in the transcript]  A pattern is assigned to the model with the highest posterior probability (i.e., the model that best explains the pattern)  The parameters of the model (probabilities) are determined in a training phase using training samples
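
One standard way to compute the probability of an observation sequence under a discrete HMM is the forward algorithm; the sketch below uses it, without claiming that it is the formula shown on the original slide. It assumes equal class priors, so picking the highest likelihood coincides with picking the highest posterior. The two toy models and all names are hypothetical.

```python
def forward_likelihood(observations, A, B, pi):
    """P(observations | model) for a discrete HMM, via the forward algorithm.

    A[i][j]: transition probability from state i to state j
    B[i][o]: probability of emitting symbol o in state i
    pi[i]:   probability of starting in state i
    """
    n = len(pi)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs]
            for j in range(n)
        ]
    return sum(alpha)


def classify(observations, models):
    """Return the name of the model that best explains the observations."""
    return max(models, key=lambda name: forward_likelihood(observations, *models[name]))


# Two hypothetical left-to-right models over the symbols {0, 1}
A = [[0.5, 0.5], [0.0, 1.0]]
PI = [1.0, 0.0]
MODELS = {
    "zeros_then_ones": (A, [[0.9, 0.1], [0.1, 0.9]], PI),
    "ones_then_zeros": (A, [[0.1, 0.9], [0.9, 0.1]], PI),
}
winner = classify([0, 0, 1, 1], MODELS)
```

Here `winner` is "zeros_then_ones". For long sequences, implementations usually work with scaled values or log-probabilities to avoid numerical underflow.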

© Prof. Rolf Ingold 27 Post-analysis  Post-analysis is performed with the aim of validating / correcting character recognition  based on dictionaries  bigrams, trigrams  confidence of character recognition (if available)
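
A dictionary-based post-analysis can be sketched with an edit distance: a recognized word that is not in the dictionary is replaced by the closest entry if that entry is close enough. The word list, the `max_distance` parameter and the function names are assumptions made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def correct(word, dictionary, max_distance=1):
    """Return the closest dictionary entry, or `word` itself if none is close."""
    if word in dictionary:
        return word
    best = min(dictionary, key=lambda entry: edit_distance(word, entry))
    return best if edit_distance(word, best) <= max_distance else word


# Hypothetical restricted vocabulary (city names)
CITIES = ["Fribourg", "Bern", "Basel"]
fixed = correct("Fr1bourg", CITIES)   # "1" misread for "i"
unknown = correct("Xyzzy", CITIES)    # nothing close: left unchanged
```

Here `fixed` is "Fribourg" and `unknown` stays "Xyzzy". A fuller system could weight substitutions by typical OCR confusions and by the recognizer's confidence values, when available.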

© Prof. Rolf Ingold 28 Character Recognition Performance  OCR (Optical Character Recognition) is the most mature technique of document analysis  For most applications, very high accuracy is required  a 99% recognition rate would generate 30-50 errors per page  99.9% to 99.99% is often requested  OCR systems may be designed for  standard OCR-A, OCR-B fonts, specially designed for OCR  mono-font recognition, specialized for typewriting  omni-font or multi-font text recognition (the most popular)  trainable systems, being tuned for specific fonts and styles
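
The figure of 30 to 50 errors per page is consistent with a page of roughly 3000 to 5000 characters; that page size is inferred here and is not stated on the slide. Under that assumption, the expected number of errors for different recognition rates can be worked out as follows.

```python
def expected_errors(chars_per_page, accuracy):
    """Expected number of misrecognized characters on one page."""
    return chars_per_page * (1.0 - accuracy)


# Assumed page sizes (characters per page); not given in the source
for accuracy in (0.99, 0.999, 0.9999):
    low = expected_errors(3000, accuracy)
    high = expected_errors(5000, accuracy)
    print(f"{accuracy:.2%}: {low:.1f} to {high:.1f} errors per page")
```

At 99.99% the expectation drops below one error per page, which helps explain why such rates are often requested.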

© Prof. Rolf Ingold 31 Second Experiment (ReadIris)  Page layout correctly recognized  Italic correctly detected  including one false detection  A few word segmentation errors  Correct de-hyphenation  A few OCR errors  often as a consequence of segmentation errors!

© Prof. Rolf Ingold 32 Conclusion on OCR technologies  Imperfect results in printed character recognition  Recognition of uncontrolled handwriting not mature  Practical problems with  mathematics (symbols and formulas)  special fonts or scripts  logos  => perfect document recognition is not achievable!  Some applications can deal with approximate results  Recognition algorithms should be tuned to prefer rejections to errors  Include manual correction in the processing step
