Optical Character Recognition Using Hidden Markov Models
Jan Rupnik

Outline
- HMMs
- Model parameters
- Left-right models
- Problems
- OCR: idea
- Symbolic example
- Training
- Prediction
- Experiments

HMM
A discrete Markov model is a probabilistic finite state machine. The random process is a random, memoryless walk on a graph whose nodes are called states.
Parameters:
- A set of states S = {S_1, ..., S_n} that form the nodes. Let q_t denote the state the system is in at time t.
- Transition probabilities between states, which form the edges: a_ij = P(q_t = S_j | q_t-1 = S_i), 1 ≤ i, j ≤ n.
- Initial state probabilities: π_i = P(q_1 = S_i), 1 ≤ i ≤ n.
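To make the random walk concrete, here is a minimal NumPy sketch (not part of the original slides); the three-state chain and its probabilities are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a 3-state Markov chain.
pi = np.array([0.6, 0.3, 0.1])           # initial state probabilities pi_i
A = np.array([[0.7, 0.2, 0.1],           # transition probabilities a_ij
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])

def sample_markov_chain(pi, A, T):
    """Sample a state sequence q_1, ..., q_T from the Markov chain (pi, A)."""
    q = [rng.choice(len(pi), p=pi)]                    # pick q_1 according to pi
    for _ in range(T - 1):
        q.append(rng.choice(A.shape[1], p=A[q[-1]]))   # q_t according to a_{q_{t-1}, .}
    return q

print(sample_markov_chain(pi, A, 10))                  # a list of 10 state indices
```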

Markov process (figure): a six-state chain S_1, ..., S_6 with initial probabilities π_1, ..., π_6 on the states and transition probabilities a_ij on the edges.
Example walk: pick q_1 according to the distribution π (q_1 = S_2), then repeatedly move to a new state according to the distribution a_ij (q_2 = S_3, q_3 = S_5, ...).

Hidden Markov Process
A hidden Markov random process is a partially observable random process.
- Hidden part: the Markov process q_t.
- Observable part: a sequence of random variables v_t over the same domain, where, conditioned on the variable q_i, the distribution of v_i is independent of every other variable, for all i = 1, 2, ...
Parameters:
- Markov process parameters π_i, a_ij for the generation of q_t.
- Observation symbols V = {V_1, ..., V_m}. Let v_t denote the observation symbol emitted at time t.
- Observation emission probabilities b_i(k) = P(v_t = V_k | q_t = S_i). Each state defines its own distribution over observation symbols.
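Continuing the sketch above (again with illustrative values only, not the talk's parameters), sampling from an HMM alternates between a hidden state step and an emission step:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative HMM parameters: 2 states, 3 observation symbols.
pi = np.array([0.5, 0.5])                 # initial state probabilities pi_i
A = np.array([[0.8, 0.2],                 # transition probabilities a_ij
              [0.3, 0.7]])
B = np.array([[0.9, 0.1, 0.0],            # emission probabilities b_i(k) over V_1..V_3
              [0.1, 0.4, 0.5]])

def sample_hmm(pi, A, B, T):
    """Sample hidden states q_1..q_T and observations v_1..v_T."""
    states, obs = [], []
    q = rng.choice(len(pi), p=pi)                      # q_1 according to pi
    for _ in range(T):
        states.append(q)
        obs.append(rng.choice(B.shape[1], p=B[q]))     # v_t according to b_q(.)
        q = rng.choice(A.shape[1], p=A[q])             # q_{t+1} according to a_{q, .}
    return states, obs

print(sample_hmm(pi, A, B, 8))
```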

Hidden Markov process (figure): the same six-state chain, where each state S_i additionally defines an emission distribution b_i(1), b_i(2), b_i(3) over the observation symbols V_1, V_2, V_3.
Example walk: pick q_1 according to the distribution π (q_1 = S_3), emit v_1 according to b_3 (v_1 = V_1), move to q_2 according to the distribution a_ij (q_2 = S_4), emit v_2 according to b_4, and so on.

Left-Right HMM
Sparse and easy to train.
Parameters: π_1 = 1 and π_i = 0 for i > 1; a_ij = 0 if i > j.
Left-right HMM example: a chain of states S_1 → S_2 → ... → S_n.
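A sketch of how such a left-right parameter set can be built; the self-loop probability of 0.5 is an assumed illustrative value, not taken from the talk:

```python
import numpy as np

def left_right_hmm(n_states, p_stay=0.5):
    """Left-right parameters: pi_1 = 1, pi_i = 0 for i > 1, and a_ij = 0 if i > j."""
    pi = np.zeros(n_states)
    pi[0] = 1.0
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = p_stay            # stay in the same state
        A[i, i + 1] = 1 - p_stay    # or move one state to the right
    A[-1, -1] = 1.0                 # placeholder self-loop for the final state
                                    # (redistributed to character-to-character
                                    #  transitions once models are joined into a word model)
    return pi, A

pi, A = left_right_hmm(5)
print(A)
```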

Problems
- Decoding: given a sequence of observations v_1, ..., v_T, find the sequence of hidden states q_1, ..., q_T that most likely generated it. Solution: the Viterbi algorithm (dynamic programming, complexity O(T n^2)).
- Training: how do we determine the model parameters π_i, a_ij, b_i(k)? Solution: the Expectation-Maximization (EM) algorithm, which finds a local maximum of the likelihood starting from a set of initial parameters π_i^0, a_ij^0, b_i(k)^0.
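A log-space Viterbi sketch in NumPy, following the standard dynamic-programming recursion (O(T n^2)); it works with any (pi, A, B), such as the illustrative parameters from the sampling sketch above:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence for an observation sequence.
    pi: (n,), A: (n, n), B: (n, m), obs: list of observation symbol indices."""
    n, T = len(pi), len(obs)
    logpi, logA, logB = (np.log(x + 1e-300) for x in (pi, A, B))
    delta = np.zeros((T, n))            # best log-probability of a path ending in state j at time t
    psi = np.zeros((T, n), dtype=int)   # back-pointers
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA          # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[T - 1].argmax())]                # backtrack the best path
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Example with the 2-state illustrative HMM from the previous sketch:
pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1, 0.0], [0.1, 0.4, 0.5]])
print(viterbi(pi, A, B, [0, 0, 2, 1, 2]))
```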

Optical Character Recognition
Input for training the OCR system: pairs of word images and their textual strings (e.g. images of the words “THE”, “revenue”, “of”, ...).
Input for the recognition process: a word image whose text is unknown.

Idea
- Model the generation of a character image as the generation of a sequence of thin vertical image slices (segments).
- Build an HMM for each character separately (modelling the generation of images of the characters ‘a’, ‘b’, ...).
- Merge the character HMMs into a word HMM (modelling the generation of sequences of characters and their images).
- Given a new image of a word, use the word HMM to predict the most likely sequence of characters that generated the image.

Symbolic Example
Example of word image generation for a two-character alphabet {‘n’, ‘u’}.
- Set of states S = {S_u1, ..., S_u5, S_n1, ..., S_n5}.
- Set of observation symbols V = {V_1, V_2, V_3, V_4}; assign to each V_i a thin vertical image.
- The word model (for words like ‘unnunuuu’) is constructed by joining two left-right character models.

Word Model
Word model state transition architecture (figure): the character ‘n’ model (states S_n1, ..., S_n5) and the character ‘u’ model (states S_u1, ..., S_u5) are left-right models with self-transitions and forward transitions (e.g. A_{S_n1,S_n1}, A_{S_n1,S_n2}, A_{S_n2,S_n2}, A_{S_n2,S_n3}). Character-to-character transitions go from the last state of one character model to the first state of another: A_nn := A_{S_n5,S_n1}, A_nu := A_{S_n5,S_u1}, A_un := A_{S_u5,S_n1}, A_uu := A_{S_u5,S_u1}.
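A sketch of assembling the word-model transition matrix from left-right character models and character-to-character (bigram) probabilities; the 5-state models, the 0.5 self-loop and the uniform bigrams are illustrative assumptions, not the talk's values:

```python
import numpy as np

def join_character_models(char_As, bigram):
    """Join left-right character transition matrices into one word-model matrix.
    char_As: dict char -> (k, k) left-right transition matrix
    bigram:  dict (char_from, char_to) -> character transition probability"""
    chars = sorted(char_As)
    sizes = {c: char_As[c].shape[0] for c in chars}
    offsets, total = {}, 0
    for c in chars:
        offsets[c], total = total, total + sizes[c]
    A = np.zeros((total, total))
    for c in chars:
        k, o = sizes[c], offsets[c]
        A[o:o + k, o:o + k] = char_As[c]     # within-character transitions
        last = o + k - 1
        leave = A[last, last]                # probability mass held by the final state
        A[last, last] = 0.0
        for c2 in chars:                     # final state of c -> first state of c2
            A[last, offsets[c2]] = leave * bigram[(c, c2)]
    return A, offsets

def lr(k, p_stay=0.5):
    """k-state left-right matrix: stay with p_stay, otherwise move one state right."""
    return np.diag([p_stay] * (k - 1) + [1.0]) + np.diag([1 - p_stay] * (k - 1), 1)

char_As = {'n': lr(5), 'u': lr(5)}
bigram = {(a, b): 0.5 for a in 'nu' for b in 'nu'}   # uniform, illustrative bigrams
A_word, offsets = join_character_models(char_As, bigram)
print(A_word.shape, offsets)                          # (10, 10) {'n': 0, 'u': 5}
```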

Transition matrix A and emission matrix B for the two-character word model (table, only partially recoverable): the entries follow the left-right structure, with transition probability 0.5 for the non-final states of each character model and 0.33 for the final states S_n5 and S_u5 (which can continue into either character model), while the emission matrix over V_1, ..., V_4 assigns each state one or two symbols (e.g. S_n1 emits V_1 with probability 1).
A sequence of observation symbols such as V_1 V_4 V_2 V_4 V_1 corresponds to an “image” of a word.

Examples of images of generated words (figure): generated word images for the words ‘nunnnuuun’ and ‘uuuunununu’.

Recognition
- Find the best matching patterns: the word image is turned into the observation sequence, e.g. V_1 V_4 V_4 V_2 V_4 V_1 V_1 V_1 V_4 V_2 V_2 V_2 V_4 V_4 V_1 V_1 V_1 V_4 V_2 V_4 V_4 V_4 V_4 V_4 V_4 V_1 V_1.
- Viterbi decoding produces a state sequence in 1-1 correspondence with the observations, e.g. S_n1 S_n2 S_n2 S_n3 S_n4 S_n5 S_n1 S_n1 S_n2 S_n3 S_n3 S_n3 S_n4 S_n4 S_n5 S_n1 S_n1 S_n2 S_n3 S_n4 S_n4 S_n4 S_n4 S_n4 S_n4 S_n5 S_n1 S_n1.
- Look at transitions of type S_*5 → S_**1 to find the transitions from character to character.
- Prediction: ‘nnn’.
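A sketch of this last step, turning a Viterbi state path over the word model into a character string by detecting the S_*5 → S_**1 transitions; the offsets/sizes layout matches the word-model sketch above, and the 5-state ‘n’/‘u’ models are illustrative:

```python
def states_to_characters(path, offsets, sizes):
    """Convert a word-model state path into a character string.
    A new character starts whenever the path enters the first state of some
    character model coming from the final state of a character model."""
    owner = {}                                   # global state index -> (character, local index)
    for c, o in offsets.items():
        for i in range(sizes[c]):
            owner[o + i] = (c, i)
    chars = [owner[path[0]][0]]                  # the first state opens the first character
    for prev, cur in zip(path, path[1:]):
        c_prev, i_prev = owner[prev]
        c_cur, i_cur = owner[cur]
        if i_prev == sizes[c_prev] - 1 and i_cur == 0:   # S_*5 -> S_**1 transition
            chars.append(c_cur)
    return ''.join(chars)

# Example with hypothetical 5-state models for 'n' (states 0-4) and 'u' (states 5-9):
offsets, sizes = {'n': 0, 'u': 5}, {'n': 5, 'u': 5}
path = [0, 1, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(states_to_characters(path, offsets, sizes))        # 'nnu'
```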

Feature Extraction, Clustering, Discretization
- Discretization: given a set of basic patterns (thin images corresponding to the observation symbols), any sequence of thin images can be transformed into a sequence of symbols (previous slide), which is the input for our HMM.
- We do not work with the images of thin slices directly, but with feature vectors computed from them (and compare vectors instead of matching images).
- The basic patterns (feature vectors) can be found with k-means clustering over a large set of feature vectors.

Feature Extraction
The image is transformed into a sequence of 20-dimensional feature vectors: thin, overlapping rectangles are split into 20 vertical cells, and the feature vector for each rectangle is obtained by computing the average luminosity of each cell.
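A sketch of this feature extraction step; the 20 cells and the averaged luminosity follow the slide, while the window width and step (overlap) are assumed values:

```python
import numpy as np

def extract_features(img, win_width=8, step=4, n_cells=20):
    """img: 2-D array of grayscale pixel values (rows = y, columns = x).
    Slide a thin, overlapping window across the word image and, for each
    window position, average the luminosity inside each of n_cells vertical cells."""
    h, w = img.shape
    cell_edges = np.linspace(0, h, n_cells + 1).astype(int)
    feats = []
    for x in range(0, max(w - win_width, 0) + 1, step):
        window = img[:, x:x + win_width]
        vec = [window[cell_edges[i]:cell_edges[i + 1]].mean() for i in range(n_cells)]
        feats.append(vec)
    return np.array(feats)              # shape: (number of windows, 20)

img = np.random.rand(40, 120)           # placeholder for a real word image
print(extract_features(img).shape)
```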

Clustering
Take a large set of feature vectors (about 100k) extracted from the training set of images and compute a k-means clustering with 512 clusters. (The slide shows eight of the typical feature vectors, i.e. cluster centroids.)
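A sketch of the codebook construction and the discretization that follows it; scikit-learn's KMeans is used here as one possible implementation (an assumption, any k-means would do), and the random arrays stand in for real training features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for the ~100k 20-dimensional feature vectors from the training images.
all_feats = np.random.rand(100_000, 20)

kmeans = KMeans(n_clusters=512, n_init=4, random_state=0).fit(all_feats)
centroids = kmeans.cluster_centers_      # the set C of basic patterns (512 x 20)

def discretize(feature_seq, centroids):
    """Map each feature vector to the index of its nearest centroid (= observation symbol)."""
    d = ((feature_seq[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

symbols = discretize(np.random.rand(30, 20), centroids)   # sequence of symbols in 0..511
print(symbols[:10])
```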

Training
Pipeline (each step with its inputs and outputs):
- Gathering of training examples. Input: images of words. Output: instances of character images.
- Feature extraction. Input: instances of character images. Output: a sequence of 20-dimensional vectors per character instance.
- Clustering. Input: all feature vectors computed in the previous step and the number of clusters. Output: a set of centroid vectors C.
- Discretization. Input: a sequence of feature vectors for every character instance, and C. Output: a sequence of discrete symbols for every character instance.
- Character models (‘a’ HMM, ‘b’ HMM, ..., ‘z’ HMM). Input: a sequence of observation symbols for each character instance. Output: separately trained character HMMs for each character.
- Character transition probabilities. Input: a large corpus of text. Output: character transition probabilities.
- Word model. Input: character transition probabilities and character HMMs. Output: word HMM.

Prediction
Pipeline (each step with its inputs and outputs):
- Feature extraction. Input: an image of the word that is to be recognized. Output: a sequence of 20-dimensional vectors.
- Discretization. Input: the sequence of feature vectors for the word image and the set of 20-dimensional centroid vectors C computed in the training phase. Output: a sequence of symbols from a finite alphabet for the word image.
- Viterbi decoding. Input: the word HMM computed in the training phase and the sequence of observation symbols. Output: the sequence of states that most likely emitted the observations.
- Prediction. Input: the sequence of states. Output: the sequence of characters, e.g. ‘c’ ‘o’ ‘n’ ‘t’ ‘i’ ‘n’ ‘u’ ‘i’ ‘t’ ‘y’.

Experiments
- Book on the French Revolution by John Emerich Edward Dalberg-Acton (1910) (source: archive.org).
- Word error rate is tested on words containing lower-case letters only.
- Approximately 22k words on the first 100 pages.

Experiments – Design Choices
- Extract 100 images of each character.
- Use 512 clusters for discretization.
- Build 14-state models for all characters, except for ‘i’, ‘j’ and ‘l’, where 7-state models were used, and ‘m’ and ‘w’, where 28-state models were used.
- Use character transition probabilities from [1].
- Word error rate: 2%.
[1] Michael N. Jones and D. J. K. Mewhort, Case-sensitive letter and bigram frequency counts from large-scale English corpora, Behavior Research Methods, Instruments, & Computers, Volume 36, Number 3, August 2004.

Typical Errors
- Predicted: ‘proposled’. There is a short strip between ‘o’ and ‘s’ where the system predicted an ‘l’.
- Predicted: ‘inifluenoes’. The system interpreted the character ‘c’ and the beginning of the character ‘e’ as an ‘o’, and it predicted an extra ‘i’.
- Predicted: ‘dennocratic’. The character ‘m’ was predicted as a sequence of two ‘n’ characters.
(The figures mark the places where the system predicted transitions between character models.)

Questions?