Language-Independent Text Line Extraction from Historical Document Images
- Presented by: Abed Asi
- First International Workshop on Historical Document Imaging and Processing 2011, Beijing, China
Motivation
- Historical handwritten manuscripts are valuable cultural heritage, providing insights into both tangible and intangible cultural aspects of the past
- Efforts are under way to understand, manipulate and archive historical manuscripts
- Digitization increases accessibility and allows automatic processing
- *Courtesy: wadod.com; Genizah Project
Outline
- Background
- Challenges
- Seam Carving
- Text line representation by seams
- Energy Map
- Seam Generation
- Experimental Results
- Summary
Image representation
- An image is represented as an N x M matrix
Binarization
- [Figure: intensity histogram; axes: intensity vs. number of pixels]
Connectivity & Components
- We can define 4-paths or 8-paths, depending on the type of connectivity specified
- A set of pixels S is a connected component if, for each pair of pixels (x1,y1) ∈ S and (x2,y2) ∈ S, there is a path between them such that every two successive pixels on the path are in S and are X-neighbors (X = 4 or 8)
- [Figure: 4-neighborhood and 8-neighborhood]
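The definition above can be illustrated with a short labeling sketch. It is a minimal breadth-first labeling, not the method used in the presented work; the function name `label_components` and the list-of-lists image format (1 = ink, 0 = background) are assumptions made for this example.

```python
from collections import deque

# Neighbor offsets (row, col) for the two connectivity types.
OFFSETS_4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
OFFSETS_8 = OFFSETS_4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def label_components(image, connectivity=8):
    """Label foreground pixels (value 1); return (labels, component_count)."""
    offsets = OFFSETS_8 if connectivity == 8 else OFFSETS_4
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or labels[r][c] != 0:
                continue  # background, or already part of a component
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # flood the component through X-neighbors
                cr, cc = queue.popleft()
                for dr, dc in offsets:
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and image[nr][nc] == 1 and labels[nr][nc] == 0):
                        labels[nr][nc] = count
                        queue.append((nr, nc))
    return labels, count


# Two pixels that touch only diagonally.
diagonal = [[1, 0],
            [0, 1]]
```

With `diagonal`, the two pixels form a single component under 8-connectivity but two components under 4-connectivity, which is why the choice of X matters for handwritten strokes.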
Connected Component
- One word, but 3 connected components
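Distances
- Given two points P = (u,v) and Q = (x,y):
- Euclidean distance: $D_e(P,Q) = \sqrt{(u-x)^2 + (v-y)^2}$
- City block distance: $D_4(P,Q) = |u-x| + |v-y|$
- Chessboard distance: $D_8(P,Q) = \max(|u-x|, |v-y|)$
- Example: for P = (1,8) and Q = (4,1), $D_e = \sqrt{58} \approx 7.62$, $D_4 = 10$, $D_8 = 7$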
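The three metrics can be checked numerically on the example points of this slide. The function names below are chosen for this sketch only.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def city_block(p, q):
    """Sum of the absolute coordinate differences (4-connected steps)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def chessboard(p, q):
    """Largest absolute coordinate difference (8-connected steps)."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


P, Q = (1, 8), (4, 1)
print(euclidean(P, Q), city_block(P, Q), chessboard(P, Q))
```

The city block and chessboard values equal the lengths of the shortest 4-path and 8-path between the two pixels, respectively.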
Distance transform
- Given a set of pixels S, calculate the distance of every other pixel to S
- The pixels in S are the reference pixels
- Let $D(p) = 0$ for $p \in S$ and $D(p) = \infty$ otherwise
- We scan the image with a pre-defined connectivity
- First pass: consider the green pixels (N1)
Distance transform
- In the reverse scan, consider the blue pixels (N2)
- [Figure: result of the first scan and the final distance transform]
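The two scans can be sketched as follows for the chessboard metric. This is a minimal sketch under assumptions: N1 is taken to be the already-visited neighbors in raster order (upper row and left pixel), N2 the mirrored set, and every neighbor step costs 1. The names `FORWARD`, `BACKWARD` and `chessboard_distance_transform` are hypothetical.

```python
INF = float("inf")

# ASSUMPTION: N1 = neighbors already visited in a raster scan,
# N2 = neighbors already visited in the reverse scan.
FORWARD = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]
BACKWARD = [(1, 1), (1, 0), (1, -1), (0, 1)]


def chessboard_distance_transform(reference):
    """reference[r][c] is truthy for reference pixels (the set S)."""
    rows, cols = len(reference), len(reference[0])
    dist = [[0 if reference[r][c] else INF for c in range(cols)]
            for r in range(rows)]

    def relax(r, c, offsets):
        # Keep the smaller of the current value and neighbor value + 1.
        for dr, dc in offsets:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                dist[r][c] = min(dist[r][c], dist[nr][nc] + 1)

    for r in range(rows):                      # first pass: top-left to bottom-right
        for c in range(cols):
            relax(r, c, FORWARD)
    for r in range(rows - 1, -1, -1):          # reverse pass: bottom-right to top-left
        for c in range(cols - 1, -1, -1):
            relax(r, c, BACKWARD)
    return dist


# A single reference pixel in the middle of a 5 x 5 image.
single = [[0] * 5 for _ in range(5)]
single[2][2] = 1
```

For `single`, the result is a set of concentric square rings around the center, which is the expected shape of equal-distance contours under the chessboard metric.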
Distance transform – (cont’d)
- [Figure: the Arabic letter Alef, printed and handwritten: binary representation and distance transform under the chessboard metric, with the reference pixels marked]
Sign Distance transform
- [Figure: the letter Alef, printed and handwritten: Sign distance transform under the chessboard metric]
Sign Distance transform – (cont’d)
- The brighter the color, the larger the distance from the reference pixels
- [Figure: original document image and its Sign distance transform (SDT)]
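Gradient
- A gray-scale image I is defined as a two-dimensional function I(x,y) = gray level
- The gradient of the image is given by $\nabla I = \left( \frac{\partial I}{\partial x}, \frac{\partial I}{\partial y} \right)$
- where $\frac{\partial I}{\partial x}$ is the derivative of the image in the horizontal direction and $\frac{\partial I}{\partial y}$ is the derivative of the image in the vertical direction
- The magnitude of the gradient is defined by $|\nabla I| = \sqrt{\left(\frac{\partial I}{\partial x}\right)^2 + \left(\frac{\partial I}{\partial y}\right)^2}$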
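A discrete version of the gradient magnitude can be sketched with finite differences. The choice of central differences in the interior and one-sided differences at the border is an assumption of this example, as is the name `gradient_magnitude`; the slides do not say which derivative filter is used.

```python
import math


def gradient_magnitude(image):
    """image[y][x] holds gray levels; returns the per-pixel gradient magnitude."""
    rows, cols = len(image), len(image[0])
    magnitude = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            # Clamp the neighbors at the border (one-sided difference there).
            x0, x1 = max(x - 1, 0), min(x + 1, cols - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, rows - 1)
            gx = (image[y][x1] - image[y][x0]) / max(x1 - x0, 1)  # horizontal
            gy = (image[y1][x] - image[y0][x]) / max(y1 - y0, 1)  # vertical
            magnitude[y][x] = math.hypot(gx, gy)
    return magnitude


# Intensity rises by one gray level per column and is constant along rows.
ramp = [[0, 1, 2],
        [0, 1, 2]]
```

On `ramp` the horizontal derivative is 1 everywhere and the vertical derivative is 0, so the magnitude is 1 at every pixel.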
Gradient
Background
- Pre-Processing
- De-noising
- Binarization
- Page Layout Analysis
- Text-line and word Segmentation
- Indexation and Recognition
- [Figure: original page and its segmentation]
- *Courtesy: Islamic manuscript, Leipzig University Library, Germany
Text-line Extraction
- Assigning the same color to each text line
- [Figure: original manuscript and processed manuscript]
- *Courtesy: Juma Al-majid Center for Culture and Heritage, Dubai
Challenges
- Historical handwritten documents pose different challenges than machine-printed documents:
- Looser layout format
- Line proximity
- Multi-oriented lines
- Touching components
- Different slopes (within the same line)
- Delayed strokes
- Overlapping components
- [Figure: a 19th-century master's thesis, SAAB Medical Library, American University of Beirut]
Seam Carving
- Content-aware image resizing
- An energy function defines an energy value for each pixel
- A seam is an optimal 8-connected path of low-energy pixels
- [Figure: original image, calculated seams, gradient image, resized image]
Seam Carving – (cont’d)
- Let I be an n x m image. Define a vertical seam to be $s^x = \{s_i^x\}_{i=1}^{n} = \{(x(i), i)\}_{i=1}^{n}$, s.t. $\forall i,\ |x(i) - x(i-1)| \le 1$, where x is a mapping $x : [1, \ldots, n] \to [1, \ldots, m]$
- A seam contains one, and only one, pixel in each row of the image; otherwise a distorted image might be obtained
- The pixels on the path of a seam are therefore $I_s = \{I(s_i)\}_{i=1}^{n} = \{I(x(i), i)\}_{i=1}^{n}$
- One can replace the bound 1 in the constraint with a value k, and get either a simple column for k = 0 or, for larger k, even a completely disconnected set of pixels
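Seam Carving – (cont’d)
- Given an energy function e, the cost of a seam is $E(s) = \sum_{i=1}^{n} e(I(s_i))$
- We look for the optimal seam s* that minimizes this cost: $s^* = \arg\min_s E(s) = \arg\min_s \sum_{i=1}^{n} e(I(s_i))$
- The optimal seam can be found using dynamic programming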
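The dynamic programming step can be sketched as below for a vertical seam. It assumes the usual formulation in which each row's cumulative cost is the pixel energy plus the cheapest of the up to three connected entries in the row above; the function name `find_vertical_seam` and the list-of-lists energy format are illustrative.

```python
def find_vertical_seam(energy):
    """Return (seam, cost): seam[i] is the column chosen in row i."""
    rows, cols = len(energy), len(energy[0])

    # Forward pass: cost[i][j] = cheapest connected path from row 0 to (i, j).
    cost = [list(energy[0])]
    for i in range(1, rows):
        prev = cost[-1]
        row = []
        for j in range(cols):
            lo, hi = max(j - 1, 0), min(j + 1, cols - 1)
            row.append(energy[i][j] + min(prev[lo:hi + 1]))
        cost.append(row)

    # Backward pass: start at the cheapest entry of the last row and
    # repeatedly step to the cheapest connected entry in the row above.
    j = min(range(cols), key=lambda c: cost[-1][c])
    total = cost[-1][j]
    seam = [j]
    for i in range(rows - 1, 0, -1):
        lo, hi = max(j - 1, 0), min(j + 1, cols - 1)
        j = min(range(lo, hi + 1), key=lambda c: cost[i - 1][c])
        seam.append(j)
    seam.reverse()
    return seam, total


energy = [[1, 4, 3],
          [5, 2, 6],
          [7, 8, 1]]
```

For `energy` the cheapest seam runs along the diagonal (columns 0, 1, 2) with total cost 1 + 2 + 1 = 4. The forward pass visits each pixel once, so the search is linear in the number of pixels.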
Text line representation by seams
- Human perception of text lines: a reader tracks text lines by ink concentration and the spaces between lines
- Two types of seams have been defined
- *Courtesy: Wadod Center for Manuscripts
Text line representation by seams – (cont’d)
- The medial seam crosses the text area of a text line
- A separating seam is a path that passes between two consecutive text lines
- [Figure: original document image and processed image, showing the seam seed, medial seam and separating seam]
- *Courtesy: Wadod Center for Manuscripts
Energy Map
- We use the Sign distance transform (SDT) as the energy map
- In the SDT, pixel values are assigned according to their distance from the nearest reference pixel
- Recall that distance values are negative inside connected components and positive in between
- Intuition: local minima and maxima determine the medial and separating seams, respectively
- [Figure: original document image and its Sign Distance Transform (SDT)]
- *Courtesy: Wadod Center for Manuscripts
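One way to obtain such a map from a binary image is sketched below. It rests on assumptions that the slides do not spell out: the chessboard metric is used, an ink pixel receives minus its distance to the nearest background pixel, and a background pixel receives its distance to the nearest ink pixel. The names `distance_to` and `signed_distance_transform` are hypothetical.

```python
INF = float("inf")
FORWARD = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]   # visited before, raster order
BACKWARD = [(1, 1), (1, 0), (1, -1), (0, 1)]      # visited before, reverse order


def distance_to(mask):
    """Two-pass chessboard distance to the nearest True pixel of mask."""
    rows, cols = len(mask), len(mask[0])
    dist = [[0 if mask[r][c] else INF for c in range(cols)] for r in range(rows)]
    passes = (
        (FORWARD, range(rows), range(cols)),
        (BACKWARD, range(rows - 1, -1, -1), range(cols - 1, -1, -1)),
    )
    for offsets, row_order, col_order in passes:
        for r in row_order:
            for c in col_order:
                for dr, dc in offsets:
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols:
                        dist[r][c] = min(dist[r][c], dist[nr][nc] + 1)
    return dist


def signed_distance_transform(binary):
    """binary[r][c] == 1 marks ink. ASSUMPTION: sign convention as in the lead-in."""
    ink = [[value == 1 for value in row] for row in binary]
    paper = [[not value for value in row] for row in ink]
    to_ink = distance_to(ink)
    to_paper = distance_to(paper)
    return [[-to_paper[r][c] if ink[r][c] else to_ink[r][c]
             for c in range(len(binary[0]))]
            for r in range(len(binary))]


# A 3 x 3 blob of ink surrounded by background.
blob = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
```

For `blob`, the center of the ink gets the most negative value and the surrounding background is positive, matching the intuition that minima lie along the middle of the text and maxima lie between lines.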
Seam Generation – (cont’d)
- The SDT is traversed horizontally to compute a cumulative energy map, the Seam Map, for all possible connected seams, for each entry (i,j)
- The SDT is traversed in two passes, left-to-right and right-to-left, to enhance text line patterns
- The two resulting maps are bi-linearly interpolated
- [Figure: Sign distance transform, left-to-right pass, right-to-left pass, interpolated map]
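The two passes can be sketched as follows. The accumulation rule (energy plus the minimum of the up to three connected entries in the previously processed column) follows the standard seam carving recurrence and is an assumption here, since the slide's formula is not preserved. The blending weights are also an assumption: they vary linearly with the column index and sum to one, which is only one possible reading of "interpolate". The function names are hypothetical.

```python
def cumulative_pass(sdt, right_to_left=False):
    """Accumulate the cheapest connected-path energy column by column."""
    rows, cols = len(sdt), len(sdt[0])
    columns = range(cols - 1, -1, -1) if right_to_left else range(cols)
    seam_map = [[0] * cols for _ in range(rows)]
    previous = None  # index of the column processed just before this one
    for j in columns:
        for i in range(rows):
            best = 0
            if previous is not None:
                lo, hi = max(i - 1, 0), min(i + 1, rows - 1)
                best = min(seam_map[k][previous] for k in range(lo, hi + 1))
            seam_map[i][j] = sdt[i][j] + best
        previous = j
    return seam_map


def two_pass_seam_map(sdt):
    """Blend the left-to-right and right-to-left maps."""
    rows, cols = len(sdt), len(sdt[0])
    ltr = cumulative_pass(sdt)
    rtl = cumulative_pass(sdt, right_to_left=True)
    blended = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # ASSUMPTION: linear, position-dependent weights that sum to one.
            w_ltr = (j + 1) / (cols + 1)
            w_rtl = (cols - j) / (cols + 1)
            row.append(w_ltr * ltr[i][j] + w_rtl * rtl[i][j])
        blended.append(row)
    return blended


# A zero-energy row between two rows of energy 1.
stripes = [[1, 1, 1],
           [0, 0, 0],
           [1, 1, 1]]
```

In `stripes`, both passes keep the middle row at zero because a seam can follow it at no cost, so the blended map preserves the horizontal pattern.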
Seam Generation – (cont’d)
- The minimal entry of the last column is detected
- Backtracking from the minimal entry finds the medial seam
- [Figure: original document image, Seam Map after one pass, Seam Map after two passes]
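The detection and backtracking steps can be sketched on a small energy map. For brevity this example builds a single left-to-right seam map itself, using the standard seam carving recurrence as an assumption; `medial_seam` is a hypothetical name, and the input stands in for an SDT in which negative values mark ink.

```python
def medial_seam(energy):
    """Return the row index of the seam in every column, left to right."""
    rows, cols = len(energy), len(energy[0])

    # Left-to-right cumulative map over connected (8-neighbor) paths.
    seam_map = [[energy[i][0]] + [0] * (cols - 1) for i in range(rows)]
    for j in range(1, cols):
        for i in range(rows):
            lo, hi = max(i - 1, 0), min(i + 1, rows - 1)
            seam_map[i][j] = energy[i][j] + min(
                seam_map[k][j - 1] for k in range(lo, hi + 1))

    # Detect the minimal entry of the last column.
    i = min(range(rows), key=lambda r: seam_map[r][cols - 1])
    seam = [i]

    # Backtrack: in each earlier column pick the cheapest connected entry.
    for j in range(cols - 1, 0, -1):
        lo, hi = max(i - 1, 0), min(i + 1, rows - 1)
        i = min(range(lo, hi + 1), key=lambda r: seam_map[r][j - 1])
        seam.append(i)
    seam.reverse()
    return seam


# Negative entries trace a slightly wavy "text line".
wavy = [[2, 2, 2, 2],
        [2, -1, 2, 2],
        [-1, 2, -1, -1],
        [2, 2, 2, 2]]
```

For `wavy`, the seam follows the negative entries through rows 2, 1, 2, 2, showing how a connected seam can track a line whose baseline is not straight.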
Seam Generation – (cont’d)
- Iteratively, all text lines are extracted
Seam Generation – (cont’d)
- Then why are separating seams needed?
- They avoid recalculating the energy and seam maps after each line extraction
- They avoid additional stroke classification (post-processing)
Seam Generation – (cont’d)
- Separating seams define the boundaries of text lines
- They are generated with respect to the medial seam of the corresponding text line
- They are grown from seam seeds toward the two sides of the image, guided by the SDT
Seam Generation – (cont’d)
- A seam fragment is a connected group of pixels defined as the closest local maxima along the vertical direction
- Seam fragments with low priority are discarded
- A candidate set of seeds is constructed
- The seed that generates the optimal (maximal-cost) seam is chosen
- [Figure: medial seam, Seam Map, Sign Distance Transform]
Seam Generation – (cont’d)
- The separating seams may diverge from the medial seam where ridges fork
- A spring force anchored at the medial seam guides the separating seams
- [Figure: before and after]
Touching/Overlapping Components
- Usually, crossing overlapping components is avoided gracefully
- Touching components are split too, but not necessarily at the optimal position
- [Figure: two processed examples]
Experimental Results
  Dataset | Description | Lines | Overlapping Components | Language
  Wadod | Wadod Center for Manuscripts | 1050 | 516 | Arabic and Spanish
  Al-Majid | Al-Majid Center for Culture and Heritage, Dubai | 900 | 258 | Arabic
  AUB | American University of Beirut | 420 | 485 | English
  Congress Library | | 150 | 317 |
  Total | | 2520 | 1576 |
Experimental Results – (cont’d)
- Table 1: correctness of text line extraction (%), with columns Medial, Upper, Lower and Line; values per dataset, column assignment not preserved
  Wadod: 98, 97, 99
  Al-Majid: 96
  AUB: 95, 94
  Congress library: 94.25, 93
- Table 2: crossed components
  Dataset | Overlapping Components | Stroke Crossing (%)
  Wadod | 516 | 9
  Al-Majid | 258 | 2
  AUB | 485 |
  Congress library | 317 | 10
Experimental Results – (cont’d)
Summary
- Language-independent approach
- Dynamic programming is used to find text lines
- Re-computing the energy map after each text line extraction is avoided
- Post-processing steps are avoided
- Crossing overlapping components is avoided in most cases
- More research is still needed to split touching components optimally
Thank you