Document Analysis: Fundamentals of pattern recognition
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008
© Prof. Rolf Ingold
Outline
- Introduction
- Feature extraction and decision
- Role of training
- Feature selection
- Example: Font recognition
- Bayesian decision theory
- Evaluation
Goals of Pattern Recognition
- Pattern recognition aims at discovering and identifying patterns in raw data
  - it consists of assigning symbols to data (patterns)
  - it is based on a priori knowledge, often statistical information
- Pattern recognition is used for computer perception (image/sound analysis)
  - in a preliminary step, a sensor captures raw information
  - this information is then interpreted to make decisions
- Pattern recognition can be thought of as a methodical way of reducing information in order to keep only the relevant meaning
Pattern Recognition Applications
- Pattern recognition is involved in many applications
  - seismological survey
  - speech recognition
  - scientific imagery (biology, health care, physics, ...)
  - satellite-based observation (military and civil applications, ...)
  - document analysis, with several components:
    - optical character recognition (OCR)
    - font identification
    - handwriting recognition (off-line)
    - graphics recognition
  - computer vision (3D scene analysis)
  - biometry: person identification and authentication
  - ...
- Pattern recognition methodologies rely on other scientific domains: statistics, operations research, graph theory, artificial intelligence, ...
Origin of Difficulties
- Pattern recognition is mainly an information overload problem
- The difficulty stems from
  - variability of objects belonging to the same class
  - distortion of captured data (noise, degradations, ...)
Steps Involved in Pattern Recognition
- Pattern recognition is basically a two-stage process:
  - Feature extraction, which aims at removing redundancy while keeping significant information
  - Classification, which makes a decision by assigning a class label to the feature vector
- [Diagram: observation → feature vector → class]
Role of Training
- [Diagram labels: extraction, features, training, models, decision, classes]
- Classifiers (tools that perform classification tasks) are generally designed to be trained
- Each class is characterized by a model
- Models are built with representative training data
Supervised vs. Unsupervised Training
- Two different situations may occur regarding training material:
  - Supervised training is performed when the training samples are labeled with the class they belong to
    - each class is associated with a set of training samples $T_i = \{x_{i1}, x_{i2}, \ldots, x_{iN_i}\}$, supposed to be statistically representative of the class
  - Unsupervised training is performed when the training samples are statistically representative but mixed over all classes: $T = \{x_1, x_2, \ldots, x_n\}$
Feature Selection
- Features are selected according to the application
- Features should be chosen carefully by considering
  - discrimination power between classes
  - robustness to intra-class distortions and noise
  - global statistical independence (spread over the entire feature space)
  - fast computation
  - reasonable dimension (number of features)
Features for Character Recognition
- Given a binary image of a character, many features can be used for character recognition
  - Size, i.e., width and height of the bounding box
  - Position of the baseline (if available)
  - Weight (number of black pixels)
  - Perimeter (length of the contours)
  - Center of gravity
  - Moments (second and third order in both directions)
  - Distributions of horizontal and vertical runs
  - Number of intersections with a (possibly random) set of lines
  - Length and structure (singular points, holes) of the skeleton
  - ...
- Local features computed on sub-images
  - …
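A few of the features listed above can be sketched in a handful of lines. The example below is only an illustration under stated assumptions: the image is a list of rows with 1 for black and 0 for white, and the names `basic_features` and `PLUS` are hypothetical, not taken from the course material.

```python
def basic_features(image):
    """Compute a few simple features of a binary character image.

    `image` is a list of rows; each pixel is 1 (black) or 0 (white).
    This pixel convention is an assumption made for the example.
    """
    # Coordinates (row, column) of every black pixel
    black = [(r, c) for r, row in enumerate(image)
             for c, pixel in enumerate(row) if pixel]
    if not black:
        raise ValueError("image contains no black pixels")
    rows = [r for r, _ in black]
    cols = [c for _, c in black]
    weight = len(black)  # number of black pixels
    return {
        "width": max(cols) - min(cols) + 1,    # bounding box width
        "height": max(rows) - min(rows) + 1,   # bounding box height
        "weight": weight,
        "center_of_gravity": (sum(rows) / weight, sum(cols) / weight),
    }


# A 5x5 image of a plus sign (hypothetical test pattern)
PLUS = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
features = basic_features(PLUS)
```

For the plus sign, the bounding box is 5 by 5, the weight is 9, and the center of gravity falls on the central pixel.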
Font Recognition: Goal
- Goal: recognize the fonts of synthetically generated isolated words, given as binary (black & white) or grey-level images at 300 dpi
- 12 standard font classes are considered
  - 3 families: Arial, Courier New, Times New Roman
  - 4 styles: Plain, Italic, Bold, Bold Italic
  - single size: 12 pt
Font Recognition: Extracted Features
- Words are segmented with a surrounding white border of 1 pixel
- Some preprocessing steps are used
  - Horizontal projection profile (hp)
  - Derivative of horizontal projection profile (hpd)
- The following features are calculated
  - hp-mean (or density): mean of hp
  - hpd-stdev (or slanting): standard deviation of hpd
  - hr-mean: mean of horizontal runs (up to length 12)
  - hr-stdev: standard deviation of horizontal runs (up to length 12)
  - vr-mean: mean of vertical runs (up to length 12)
  - vr-stdev: standard deviation of vertical runs (up to length 12)
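The six features above could be computed along the following lines. This is a sketch, not the course's implementation, and it rests on several assumptions the slides do not spell out: the projection profile counts black pixels per row without normalization, the derivative is the difference between consecutive rows, runs are sequences of consecutive black pixels, runs longer than 12 are simply ignored, and the population standard deviation is used. All function names are hypothetical.

```python
from statistics import mean, pstdev


def horizontal_profile(image):
    """Number of black pixels in each row (assumed definition of hp)."""
    return [sum(row) for row in image]


def derivative(profile):
    """Difference between consecutive profile values (assumed definition of hpd)."""
    return [b - a for a, b in zip(profile, profile[1:])]


def run_lengths(lines, max_len=12):
    """Lengths of black runs along each line, ignoring runs longer than max_len."""
    runs = []
    for line in lines:
        length = 0
        for pixel in line:
            if pixel:
                length += 1
            else:
                if 0 < length <= max_len:
                    runs.append(length)
                length = 0
        if 0 < length <= max_len:  # run touching the end of the line
            runs.append(length)
    return runs


def font_features(image):
    hp = horizontal_profile(image)
    hpd = derivative(hp)
    h_runs = run_lengths(image)        # runs along rows
    v_runs = run_lengths(zip(*image))  # runs along columns
    return {
        "hp-mean": mean(hp),
        "hpd-stdev": pstdev(hpd),
        "hr-mean": mean(h_runs),
        "hr-stdev": pstdev(h_runs),
        "vr-mean": mean(v_runs),
        "vr-stdev": pstdev(v_runs),
    }


# Tiny hypothetical "word" image with a 1-pixel white border (1 = black)
WORD = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
word_features = font_features(WORD)
```

On real 300 dpi word images the profiles and run distributions would of course be much longer; the tiny image only makes the computation easy to follow by hand.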
Font Recognition: Illustration of Features
- Basic image processing features used are
  - horizontal projection profile
  - distribution of horizontal runs (from 1 to 11)
  - distribution of vertical runs (from 1 to 11)
Font Recognition: decision boundaries on single feature (1)
- Some single features are highly discriminant for some font sets
  - hpd-stdev discriminates roman from italic fonts
  - hr-mean discriminates normal from bold fonts
Font Recognition: decision boundaries on single feature (2)
- Other features may partly discriminate font sets
  - hr-mean can partly discriminate Arial, Courier and Times
Font Recognition: decision boundaries on multiple features (1)
- By combining two features, font discrimination is improved
  - the pair (hpd-stdev, vr-stdev) discriminates roman from italic fonts
- [Figure: axes hpd-stdev and vr-stdev]
Font Recognition: decision boundaries on multiple features (2)
- Font family discrimination (Arial, Courier and Times) becomes possible by combining several pairs of features
Bayesian Decision Theory
- Bayesian decision theory assumes that all information contributing to the decision can be stated in the form of probabilities
  - $P(\omega_i)$: the a priori probability (or prior) of each class
  - $p(x \mid \omega_i)$: the class-conditional density function of the feature vector $x$, also called the likelihood of the class $\omega_i$ with respect to $x$
- The goal is to determine the class $\omega_i$ for which the a posteriori probability (or posterior) $P(\omega_i \mid x)$ is the highest
Bayesian Rule
- The Bayes rule makes it possible to calculate the a posteriori probability of each class as a function of priors and likelihoods
  $P(\omega_i \mid x) = \frac{p(x \mid \omega_i) \, P(\omega_i)}{p(x)}$
- where $p(x)$ is called the evidence and can be considered as a normalization factor, i.e.,
  $p(x) = \sum_j p(x \mid \omega_j) \, P(\omega_j)$
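As a small numeric illustration of the Bayes rule, the sketch below multiplies each prior by the corresponding likelihood and divides by the evidence (the sum of these products). The function name `posteriors` and the numbers are made up for the example.

```python
def posteriors(priors, likelihoods):
    """Posterior probability of each class from priors and likelihoods.

    priors[i]      : a priori probability of class i
    likelihoods[i] : class-conditional density of the observed x for class i
    """
    joint = [prior * lik for prior, lik in zip(priors, likelihoods)]
    evidence = sum(joint)  # normalization factor p(x)
    return [j / evidence for j in joint]


# Hypothetical values: equal priors, class 2 explains x three times better
post = posteriors([0.5, 0.5], [0.2, 0.6])
```

Here the evidence is 0.5 × 0.2 + 0.5 × 0.6 = 0.4, so the posteriors are 0.25 and 0.75 and sum to one.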
Influence of Posterior Probabilities
- Example with a single feature: posterior probabilities in two different cases regarding a priori probabilities
- [Figure, two panels: $P(\omega_1) = 0.5$, $P(\omega_2) = 0.5$ and $P(\omega_1) = 0.1$, $P(\omega_2) = 0.9$]
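The effect of the priors on the posteriors can be reproduced numerically. The sketch below assumes, purely for illustration, two classes with Gaussian class-conditional densities of unit variance centered at 0 and 2; these parameters and the function names are hypothetical and are not those of the original figure. Only the two prior settings (0.5/0.5 and 0.1/0.9) come from the slide.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def posterior_class1(x, prior1, mu1=0.0, mu2=2.0, sigma=1.0):
    """Posterior of class 1 for a two-class problem with Gaussian likelihoods."""
    joint1 = gaussian_pdf(x, mu1, sigma) * prior1
    joint2 = gaussian_pdf(x, mu2, sigma) * (1.0 - prior1)
    return joint1 / (joint1 + joint2)


# At x = 1 both likelihoods are equal, so the posterior equals the prior
equal_priors = posterior_class1(1.0, prior1=0.5)
skewed_priors = posterior_class1(1.0, prior1=0.1)
```

With equal priors the decision boundary lies at x = 1; with priors 0.1 and 0.9 it moves toward the mean of class 1, so that even at x = 0 the posterior of class 1 stays below 0.5.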
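Probability of Error
- Given a feature $x$ of a given sample, the probability of error for a decision $\alpha(x) = \omega_i$ is equal to
  $P(\mathrm{error} \mid x) = 1 - P(\omega_i \mid x)$
- The overall probability of error is given by
  $P(\mathrm{error}) = \int P(\mathrm{error} \mid x) \, p(x) \, dx$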
Optimal Decision Boundaries
- The minimal error is obtained by the decision $\alpha(x) = \omega_i$ with
  $P(\omega_i \mid x) \ge P(\omega_j \mid x)$ for all $j$
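When the class with the highest posterior is chosen at every x, the probability of error at x is one minus that posterior, and the overall error is its average over x weighted by the evidence. The sketch below estimates this minimal error by numerical integration for a hypothetical one-dimensional problem with Gaussian likelihoods of equal variance; the parameters, the integration range and the function names are assumptions made for the example.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def bayes_error(priors, means, sigma=1.0, lo=-10.0, hi=12.0, steps=22000):
    """Minimal probability of error, integrated with the midpoint rule.

    At each x the optimal decision keeps the largest joint term
    p(x | class) * P(class); all the other terms contribute to the error.
    """
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        joints = [prior * gaussian_pdf(x, mu, sigma)
                  for prior, mu in zip(priors, means)]
        total += (sum(joints) - max(joints)) * dx
    return total


# Two equally likely classes centered at 0 and 2 (hypothetical values)
error = bayes_error([0.5, 0.5], [0.0, 2.0])
```

For this symmetric case the boundary is at x = 1 and the result can be checked against the closed form, the standard normal tail probability beyond one standard deviation (about 0.159).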
Decision Theory
- In the simplest case, a decision consists of assigning a class label to an observation $x$: $\omega_i = \alpha(x)$
- A natural extension consists of adding a "rejection class" $\omega_R$, so that $\alpha(x)$ may also take the value $\omega_R$
- In the most general case, the decision results in an action $\alpha_i = \alpha(x)$
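Optimal Decision Theory
- Let us consider a loss function $\lambda(\alpha_i \mid \omega_j)$ defining the loss incurred by taking action $\alpha_i$ when the true state of nature is $\omega_j$; usually [condition missing in the source]
- The risk of taking an action $\alpha_i$ for a particular sample $x$ is
  $R(\alpha_i \mid x) = \sum_j \lambda(\alpha_i \mid \omega_j) \, P(\omega_j \mid x)$
- The optimal decision consists of choosing the action $\alpha_i$ that minimizes the risk $R(\alpha_i \mid x)$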
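The risk of an action is the loss of that action averaged over the true states, weighted by their posterior probabilities; the optimal decision picks the action with the smallest risk. The sketch below illustrates this with a hypothetical loss table that also contains a rejection action with a fixed cost of 0.3. The table, the cost and the function names are assumptions made for the example.

```python
def conditional_risks(loss, posteriors):
    """Risk of each action: sum over states j of loss[i][j] * posteriors[j]."""
    return [sum(l * p for l, p in zip(row, posteriors)) for row in loss]


def best_action(loss, posteriors):
    """Index of the action with minimal conditional risk."""
    risks = conditional_risks(loss, posteriors)
    return min(range(len(risks)), key=risks.__getitem__)


# Rows are actions, columns are true states (hypothetical values)
LOSS = [
    [0.0, 1.0],  # action 0: decide class 0
    [1.0, 0.0],  # action 1: decide class 1
    [0.3, 0.3],  # action 2: reject, at a fixed cost
]

confident = best_action(LOSS, [0.9, 0.1])  # clear case: decide class 0
ambiguous = best_action(LOSS, [0.6, 0.4])  # unclear case: rejection is cheaper
```

With posteriors 0.9 and 0.1 the risks are 0.1, 0.9 and 0.3, so class 0 is chosen; with 0.6 and 0.4 they are 0.4, 0.6 and 0.3, so the sample is rejected.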
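Optimal decision
- When $\lambda_{ii} = 0$ and $\lambda_{ij} = 1$ for all $j \ne i$, the optimal decision consists of minimizing the probability of error
- The minimal error is obtained by the decision $\alpha(x) = \omega_i$ with
  $P(\omega_i \mid x) \ge P(\omega_j \mid x)$ for all $j$
- or equivalently
  $p(x \mid \omega_i) \, P(\omega_i) \ge p(x \mid \omega_j) \, P(\omega_j)$ for all $j$
- In the case when all a priori probabilities are equal
  $p(x \mid \omega_i) \ge p(x \mid \omega_j)$ for all $j$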
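Minimum Risk for Two Classes
- Let $\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$ be the loss of action $\alpha_i$ when the true state is $\omega_j$
- The conditional risk of each decision is expressed as
  $R(\alpha_1 \mid x) = \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x)$
  $R(\alpha_2 \mid x) = \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)$
- Then, the optimal decision rule becomes: decide $\omega_1$ if
  $(\lambda_{21} - \lambda_{11}) \, P(\omega_1 \mid x) > (\lambda_{12} - \lambda_{22}) \, P(\omega_2 \mid x)$
- or equivalently (for $\lambda_{21} > \lambda_{11}$)
  $\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$
- And in the case of $\lambda_{11} = \lambda_{22} = 0$
  $\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{\lambda_{12} \, P(\omega_2)}{\lambda_{21} \, P(\omega_1)}$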
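For two classes, comparing the two conditional risks amounts to comparing the likelihood ratio with a threshold that depends only on the losses and the priors. The sketch below illustrates this, with `loss[i][j]` holding the loss of action i when the true state is j (0-based indices for classes 1 and 2). It assumes that a wrong decision always costs more than a correct one, so that the denominator is positive; the function names and numbers are hypothetical.

```python
def likelihood_ratio_threshold(loss, prior1, prior2):
    """Threshold on p(x | class 1) / p(x | class 2) above which class 1 is chosen.

    Assumes loss[1][0] > loss[0][0], i.e. an error on class 1 costs more
    than a correct decision for class 1.
    """
    (l11, l12), (l21, l22) = loss
    return (l12 - l22) / (l21 - l11) * (prior2 / prior1)


def decide(likelihood1, likelihood2, loss, prior1, prior2):
    """Return 1 or 2, the class with the smaller conditional risk."""
    ratio = likelihood1 / likelihood2
    return 1 if ratio > likelihood_ratio_threshold(loss, prior1, prior2) else 2


ZERO_ONE = [[0.0, 1.0], [1.0, 0.0]]
# Hypothetical asymmetric case: wrongly deciding class 1 costs ten times more
ASYMMETRIC = [[0.0, 10.0], [1.0, 0.0]]

symmetric_choice = decide(0.6, 0.2, ZERO_ONE, 0.5, 0.5)     # ratio 3 > 1
asymmetric_choice = decide(0.6, 0.2, ASYMMETRIC, 0.5, 0.5)  # ratio 3 < 10
```

The same observation is assigned to class 1 under the zero-one loss but to class 2 once a wrong decision for class 1 becomes expensive.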
Discriminant Functions
- In the case of multiple classes, a pattern classifier can be specified by a set of discriminant functions $g_i(x)$ such that the decision $\omega_i$ corresponds to
  $g_i(x) > g_j(x)$ for all $j \ne i$
- Thus, a Bayesian classifier is naturally represented by
  $g_i(x) = -R(\alpha_i \mid x)$
- The choice of discriminant functions is not unique
  - $g_i(x)$ can be replaced by $f(g_i(x))$ for any monotonically increasing function $f$
- A minimum error-rate classifier can be obtained with
  $g_i(x) = P(\omega_i \mid x)$
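Because any monotonically increasing function preserves the ordering of the discriminant functions, a minimum error-rate classifier can work with the logarithm of likelihood times prior instead of the posterior itself. The sketch below does this for three hypothetical one-dimensional Gaussian classes; the class parameters and function names are assumptions made for the example.

```python
import math


def log_discriminant(x, mu, sigma, prior):
    """ln p(x | class) + ln P(class) for a Gaussian class-conditional density."""
    log_likelihood = (-0.5 * ((x - mu) / sigma) ** 2
                      - math.log(sigma * math.sqrt(2 * math.pi)))
    return log_likelihood + math.log(prior)


def classify(x, classes):
    """Index of the class whose discriminant function is largest at x."""
    scores = [log_discriminant(x, mu, sigma, prior)
              for mu, sigma, prior in classes]
    return max(range(len(scores)), key=scores.__getitem__)


# (mean, standard deviation, prior) of each class; hypothetical values
CLASSES = [(0.0, 1.0, 0.5), (2.0, 1.0, 0.3), (5.0, 1.0, 0.2)]

labels = [classify(x, CLASSES) for x in (0.5, 2.2, 4.5)]
```

Working in the log domain avoids computing the evidence, which is the same for all classes and therefore does not affect the decision.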
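Bayesian Rule in Higher Dimensions
- The Bayesian rule can easily be generalized to the multidimensional case, where features are represented by a vector $\mathbf{x}$
  $P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i) \, P(\omega_i)}{p(\mathbf{x})}$
- where
  $p(\mathbf{x}) = \sum_j p(\mathbf{x} \mid \omega_j) \, P(\omega_j)$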
Conclusion about Bayesian Decision
- Bayesian decision theory provides a theoretical framework for statistical pattern recognition
- This theory supposes the following probabilistic information to be known:
  - the number of classes
  - the a priori probability of each class
  - the class-conditional feature distribution of each class
- The remaining problem is how to estimate all these quantities
  - feature distributions are hard to estimate
  - priors are seldom known
  - even the number of classes is not always given
Performance Evaluation
- Performance evaluation is a very important issue in pattern recognition
  - it gives an objective measure of the performance
  - it makes it possible to compare different methods
- Performance evaluation requires correctly labeled test data
  - test data should be different from training data
  - a common strategy consists of cyclically using 80% of the data for training and the remaining 20% for evaluation
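The cyclic 80%/20% strategy can be sketched as follows: the data is cut into five parts, and each part serves once as test set while the other four are used for training (this scheme is commonly called 5-fold cross-validation). The function name `cyclic_splits` and the interleaved assignment of samples to folds are choices made for the example; in practice the samples would usually be shuffled first.

```python
def cyclic_splits(samples, folds=5):
    """Yield (train, test) pairs; each sample is used for testing exactly once."""
    for k in range(folds):
        test = samples[k::folds]  # every folds-th sample, starting at index k
        train = [s for i, s in enumerate(samples) if i % folds != k]
        yield train, test


# Ten hypothetical samples, identified by their index
samples = list(range(10))
splits = list(cyclic_splits(samples))
```

With ten samples, each of the five rounds trains on eight samples and evaluates on the remaining two; the reported performance is then typically the average over the rounds.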
Performance Measures: Recognition / Error Rates
- Performance evaluation uses several measures
  - recognition rate corresponds to the ratio: number of correct answers / total number of answers
  - error rate corresponds to the ratio: number of incorrect answers / total number of answers
  - rejection rate corresponds to the ratio: number of rejections / total number of answers
- recognition rate = 1 - (rejection rate + error rate)
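The three rates can be computed directly from a list of classifier answers and the corresponding true labels. In the sketch below, a rejection is represented by `None`; this convention, the function name `rates` and the sample data are assumptions made for the example.

```python
def rates(answers, truths, reject=None):
    """Return (recognition rate, error rate, rejection rate)."""
    total = len(answers)
    rejected = sum(1 for a in answers if a is reject)
    correct = sum(1 for a, t in zip(answers, truths)
                  if a is not reject and a == t)
    incorrect = total - rejected - correct
    return correct / total, incorrect / total, rejected / total


# Five hypothetical answers: three correct, one wrong, one rejected
answers = ["A", "B", None, "A", "B"]
truths = ["A", "A", "B", "A", "B"]
recognition, error, rejection = rates(answers, truths)
```

The three rates sum to one, which is the relation recognition rate = 1 - (rejection rate + error rate).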
Performance Measures: Recall & Precision
- On binary decisions (a sample belongs to the class or not), two other measures are frequently used
  - recall corresponds to the ratio of correctly assigned samples to the size of the class
  - precision corresponds to the ratio of correctly assigned samples to the number of assigned samples
- Recall and precision tend to vary in opposite directions
  - the equal error rate is sometimes considered to be the best trade-off
- Additionally, the harmonic mean of precision and recall, called the F-measure, is frequently used
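Recall, precision and the F-measure can be sketched for one class as follows. Samples are identified by arbitrary ids; `assigned` holds the samples the classifier put into the class and `relevant` the samples that truly belong to it. The function name, the ids and the convention of returning 0.0 for empty sets are assumptions made for the example.

```python
def precision_recall_f(assigned, relevant):
    """Return (precision, recall, F-measure) for a single class."""
    assigned, relevant = set(assigned), set(relevant)
    correct = len(assigned & relevant)  # correctly assigned samples
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical case: the class contains samples 1 to 4,
# the classifier assigned samples 3, 4 and 5 to it
precision, recall, f_measure = precision_recall_f({3, 4, 5}, {1, 2, 3, 4})
```

Two of the three assigned samples are correct (precision 2/3) and two of the four class members are found (recall 1/2), which gives an F-measure of 4/7.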