Zone Identification in the Printed Gujarati Text

Jignesh Dholakia
Department of Applied Mathematics, The M.S. University of Baroda, Baroda, India
jignesh_dholakia@yahoo.com

Atul Negi
Department of Computer and Information Sciences, University of Hyderabad, Hyderabad, India
atulcs@cs.uohyd.ernet.in

S. Rama Mohan
Department of Applied Mathematics, The M.S. University of Baroda, Baroda, India
srm@msubaroda.ac.in
Abstract
- Gujarati: a language of the Indo-Aryan family, used by 50 million people in the western part of India.
- The script used to write the Gujarati language is also called Gujarati.
- It is a multilevel script, written in three zones: the base character zone, the upper modifier zone and the lower modifier zone.
- Several characters are distinguished only by the modifiers that appear in the upper and lower zones.
- Zone boundary detection is therefore an important task in Gujarati OCR.
Abstract
- The Gujarati script is related in some respects to the Devanagari script, but has certain peculiar differences.
- Known techniques for zone boundary detection in other scripts (Bengali, Assamese and Devanagari) therefore cannot be used.
- Gujarati OCR is at a preliminary stage: there is only one previously documented effort, an approach that recognizes a small subset of the Gujarati alphabet.
- This work presents a method for accurate zone detection in images of printed Gujarati.
- It should smooth the way for the design and development of Gujarati OCR systems for complete character sets.
Introduction
- OCR for Indian scripts: a maturing technology for most of the scripts
  - Telugu (a South Indian script): Atul Negi et al. [1]
  - Bangla (an East Indian script): Chaudhuri and Pal [2]
  - Devanagari (a script used for many Indic languages): Bansal and Sinha [6]
- Gujarati OCR: under development
  - The script has peculiar characteristics and hence needs to be treated differently from other Indo-Aryan scripts.
  - Only one documented effort: Samir Antani and Lalitha Agnihotri [4]
  - Limitations in the approach proposed there
The Script
- Gujarati: the script used for writing the Gujarati language
- Belongs to the Indo-Aryan family of languages
- Used by 50 million people of Gujarat, a state in western India, and by many others spread across the globe.
- Vowels: અ આ ઇ ઈ ઉ ઊ ઋ એ ઐ ઓ ઔ
- Consonants: ક ખ ગ ઘ ઙ ચ છ જ ઝ ઞ ટ ઠ ડ ઢ ણ ત થ દ ધ ન પ ફ બ ભ મ ય ર લ વ શ ષ સ હ ળ ક્ષ જ્ઞ
- ક્ષ and જ્ઞ are conjuncts considered to be part of the consonant set.
The Script
- Vowel modifiers (matras): ા િ ી ુ ૂ ૃ ે ૅ ૉ ો ૈ ૌ
- Other modifier-like symbols: #Å #Æ #Ã #&
- In the glyphs shown with #, the # is to be replaced by some consonant or conjunct.
- Numerals: 0 1 2 3 4 5 6 7 8 9
- Some conjuncts: GlÉ Á Jí l©É v `Äò mÉ s }É i«É
The Script
- The shapes of Gujarati characters are similar to those of the phonetically similar Devanagari characters: ટ ट, ઠ ठ
- No lower and upper case
- Prominent distinction from Devanagari: no shirorekha (header line)
- Consonant-vowel combination and use of matra: ક + એ = ક + ે = કે
- Printed Gujarati text: three zones
Need for Zone Identification
- Two possible strategies for recognition:
  1. Recognize the complete consonant-vowel cluster as a distinct symbol. Number of glyphs to be identified = (36 consonants + approx. 250 conjuncts) * 11 dependent vowel modifiers * 2 other symbols + 11 independent vowels.
  2. First segment the consonants from the dependent vowel modifiers, then recognize them separately.
- The second approach is the feasible one, as follows from the count above and from the experience of Chaudhuri [5] for Bangla and of Bansal and Sinha [6] for Devanagari.
- Hence the need for a robust algorithm to find the zone boundaries accurately.
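The glyph count for the first strategy can be worked out directly. The short sketch below only evaluates the expression given on the slide; the function name is illustrative, and the conjunct count is the slide's approximate figure.

```python
# Counts as stated on the slide; the conjunct count is approximate.
CONSONANTS = 36
CONJUNCTS = 250
DEPENDENT_VOWEL_MODIFIERS = 11
OTHER_SYMBOLS = 2
INDEPENDENT_VOWELS = 11

def whole_cluster_glyph_count():
    """Glyph classes needed if every consonant-vowel cluster is one symbol."""
    base_forms = CONSONANTS + CONJUNCTS
    clusters = base_forms * DEPENDENT_VOWEL_MODIFIERS * OTHER_SYMBOLS
    return clusters + INDEPENDENT_VOWELS

print(whole_cluster_glyph_count())  # 6303
```

A classifier with several thousand classes is what the segmentation-based strategy avoids.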
Why a New Algorithm?
- [5] and [6] describe a method of zone identification for the Bangla and Devanagari scripts.
- The task is simpler for Bangla and Devanagari because of the shirorekha: the horizontal projection (number of black pixels in a pixel row) shows a prominent peak at the header line.
- There is no shirorekha in Gujarati => no prominent peak in the horizontal projection => the above algorithms will not work well with Gujarati.
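To make the horizontal projection concrete, here is a minimal sketch in plain Python. The image representation (a list of rows, with 1 for a black pixel) and the two toy images are assumptions made for illustration; they are not taken from the authors' data.

```python
def horizontal_projection(image):
    """Return the number of black pixels (value 1) in each pixel row."""
    return [sum(row) for row in image]

# Toy glyph row WITH a header line: row 1 is entirely black.
with_header = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]

# Toy glyph row WITHOUT a header line: no row dominates the profile.
without_header = [
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 0, 1, 1, 1],
]

print(horizontal_projection(with_header))     # [0, 6, 2, 2]
print(horizontal_projection(without_header))  # [1, 4, 3, 5]
```

In the first profile the header row stands out clearly; in the second there is no single row that could serve as a reliable zone boundary.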
Why a New Algorithm?
- A possible argument: instead of a peak at the header line, a trough in the horizontal projection may be used.
- Instances where this will not work:
  - The number of modifiers is significantly large: the trough will not be very prominent.
  - The number of modifiers is small: the trough will be shallow and hence almost undetectable.
  - Misaligned text: a significant part of a glyph in the middle zone is cut off, which affects the recognition accuracy. Example: ય and વ look similar when the first one is over-cut.
- Hence the need for a new algorithm for the Gujarati script.
Proposed Algorithm
Line Detection
- Matras may be connected to or disconnected from the base characters.
- The image is therefore smeared in the vertical direction before line detection.
- The horizontal projection of the vertically smeared image is used for line detection.
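A minimal sketch of this line-detection idea follows. The slides do not give the smearing rule or its threshold, so the gap-filling rule used here (fill white vertical runs of at most `max_gap` pixels between two black pixels) and the parameter name are assumptions.

```python
def smear_vertically(image, max_gap):
    """Fill short white gaps between black pixels in every column (assumed rule)."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for c in range(cols):
        last_black = None
        for r in range(rows):
            if image[r][c]:
                if last_black is not None and 0 < r - last_black - 1 <= max_gap:
                    for k in range(last_black + 1, r):
                        out[k][c] = 1
                last_black = r
    return out

def detect_lines(image, max_gap=2):
    """Return (first_row, last_row) of each text line found in the smeared image."""
    profile = [sum(row) for row in smear_vertically(image, max_gap)]
    lines, start = [], None
    for r, count in enumerate(profile):
        if count and start is None:
            start = r
        elif not count and start is not None:
            lines.append((start, r - 1))
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines

# Toy page: a detached upper matra (row 0) above its base line (rows 2-3),
# then a second text line (rows 7-8).
page = [
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
]

print(detect_lines(page, max_gap=2))  # [(0, 3), (7, 8)]
print(detect_lines(page, max_gap=0))  # [(0, 0), (2, 3), (7, 8)]
```

With smearing, the detached matra stays attached to its text line; without it, the matra is wrongly reported as a line of its own.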
Proposed Algorithm
Steps in the procedure of the new algorithm:
1. Identify the potential connected components (CCs) within a text line.
2. Compute the slopes of the imaginary lines joining the top left corners of the bounding boxes of all possible pairs of CCs.
3. Consider the lines having the minimum slope.
4. Take the row coordinate of the pixels through which the maximum number of minimum-slope lines pass. This gives the row of separation between the upper zone and the middle zone.
5. Similar operations on the bottom right corners give the row separating the middle zone from the lower zone.
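The first step, connected component extraction with bounding boxes, might be sketched as follows. The slides do not say which connectivity is used; 8-connectivity, the breadth-first labelling and the `(top, left, bottom, right)` box layout are assumptions of this sketch.

```python
from collections import deque

def connected_components(image):
    """Return bounding boxes (top, left, bottom, right) of 8-connected black regions."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or seen[r][c]:
                continue
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                # Visit all eight neighbours of the current pixel.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# Toy text line: one detached modifier (row 0) and two base components.
line_image = [
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
]

print(connected_components(line_image))
# [(0, 1, 0, 1), (2, 0, 3, 1), (2, 3, 3, 4)]
```

The top left corner of a box is `(top, left)` and the bottom right corner is `(bottom, right)`; these are the points between which the slopes are computed.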
Novel Algorithm for Zone Boundary Detection
Input: Image of a line of Gujarati text
Output: Row numbers of the two lines that separate the upper and lower modifiers from the middle zone.
Step 1: Extract the connected components in the line, together with the information about their bounding boxes.
Step 2: For each pair of distinct connected components, compute the following:
- Identify the coordinates (u1, v1) and (u2, v2) of the top left corners of the bounding boxes of the two components.
Algorithm
- Identify the coordinates (l1, m1) and (l2, m2) of the bottom right corners of the bounding boxes of the two components.
- Find the absolute values S1 and S2 of the slopes of the lines connecting (u1, v1) to (u2, v2) and (l1, m1) to (l2, m2):
  S1 = |(u2 - u1) / (v2 - v1)| and S2 = |(l2 - l1) / (m2 - m1)|
- In these formulas the first coordinate is the row and the second the column, so a horizontal line has slope 0.
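Step 2 can be sketched as a function over all pairs of bounding boxes. The box layout `(top, left, bottom, right)` and the treatment of a zero denominator (reported as an infinite slope) are assumptions; the slides do not say how corners in the same column are handled.

```python
from itertools import combinations

def pair_slopes(boxes):
    """For every pair of boxes return (i, j, S1, S2).

    S1 uses the top left corners (u, v) = (top, left);
    S2 uses the bottom right corners (l, m) = (bottom, right).
    """
    result = []
    for (i, a), (j, b) in combinations(enumerate(boxes), 2):
        u1, v1, l1, m1 = a
        u2, v2, l2, m2 = b
        # Assumption: corners in the same column give an infinite slope.
        s1 = abs((u2 - u1) / (v2 - v1)) if v2 != v1 else float("inf")
        s2 = abs((l2 - l1) / (m2 - m1)) if m2 != m1 else float("inf")
        result.append((i, j, s1, s2))
    return result

# One detached modifier and two base components, as (top, left, bottom, right).
boxes = [(0, 1, 0, 1), (2, 0, 3, 1), (2, 3, 3, 4)]

for entry in pair_slopes(boxes):
    print(entry)
# (0, 1, 2.0, inf)
# (0, 2, 1.0, 1.0)
# (1, 2, 0.0, 0.0)
```

The pair of base components (1, 2) gives slope 0 for both corners, because their tops and bottoms lie on the same rows; pairs involving the modifier give larger slopes.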
Algorithm
Step 3: Identify the lines that give the minimum of the slopes S1. Those lines that fall in the region between 15% and 40% of the line height below the top of the text line are candidates for the separator of the upper zone from the middle zone. If more than one line satisfies this criterion, choose the line that occurs the maximum number of times as the zone separator.
Step 4: Identify the lines that give the minimum of the slopes S2. Those lines that fall in the region between 15% and 40% of the line height above the bottom of the text line are candidates for the separator of the lower zone from the middle zone. If more than one line satisfies this criterion, choose the line that occurs the maximum number of times as the zone separator.
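Steps 3 and 4 differ only in which corners are used and from which edge the 15%-40% band is measured, so one function can cover both. This is a sketch under stated assumptions, not the authors' implementation: a minimum-slope line is identified by a single row (the rounded mean of its two corner rows, which is exact when the slope is 0), and pairs of corners in the same column are skipped.

```python
from collections import Counter
from itertools import combinations

def zone_separator(corners, line_top, line_bottom, from_top=True, lo=0.15, hi=0.40):
    """Pick a separator row from (row, col) corner points, or None if no candidate.

    Pass top left corners with from_top=True (Step 3) and
    bottom right corners with from_top=False (Step 4).
    """
    height = line_bottom - line_top + 1
    lines = []
    for (r1, c1), (r2, c2) in combinations(corners, 2):
        if c1 == c2:
            continue  # assumption: ignore pairs with corners in the same column
        slope = abs((r2 - r1) / (c2 - c1))
        row = round((r1 + r2) / 2)  # assumption: one row represents the line
        lines.append((slope, row))
    if not lines:
        return None
    min_slope = min(slope for slope, _ in lines)
    votes = Counter()
    for slope, row in lines:
        if slope != min_slope:
            continue
        distance = row - line_top if from_top else line_bottom - row
        if lo * height <= distance <= hi * height:
            votes[row] += 1
    return votes.most_common(1)[0][0] if votes else None

# Toy text line spanning rows 0..19 (height 20, so the band is 3..8 pixels).
top_left = [(5, 0), (0, 4), (5, 8), (5, 12), (0, 16)]
bottom_right = [(14, 3), (14, 7), (19, 11), (14, 15)]

print(zone_separator(top_left, 0, 19, from_top=True))       # 5
print(zone_separator(bottom_right, 0, 19, from_top=False))  # 14
```

In the toy data, the two components whose tops lie on row 0 also form a horizontal line, but row 0 is outside the band, so it is rejected and row 5 wins.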
Analysis
- Errors are possible where words are not horizontally aligned, so line-level detection alone is not sufficient: the same process is repeated at word level.
- The analyses at the two levels may disagree:
  - Word level detects a new location for either of the two separators: consider it as final.
  - Word level removes a zone detected at line level: consider the line-level decision as final.
- Word-level analysis alone is not sufficient either: it can over-cut the base character.
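One possible reading of the disagreement rule is sketched below. Representing "no separator found" as `None`, and the function name, are assumptions; the slides state the rule only in words.

```python
def merge_separator(line_level, word_level):
    """Combine line-level and word-level results for one separator.

    Each argument is a row number, or None when no separator was found.
    """
    if word_level is None:
        # Word level would remove the zone: keep the line-level decision.
        return line_level
    # Word level found a (possibly new) location: it is taken as final.
    return word_level

print(merge_separator(5, 6))     # 6
print(merge_separator(5, None))  # 5
```

The function would be applied once for the upper separator and once for the lower separator of every word.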
Analysis
- Hence the same process needs to be carried out first on the connected components of the image of a text line and then on the CCs of the image of each word extracted from that line.
- Computational efficiency:
  - CC extraction is a time-consuming process. It is not repeated for the word-level analysis; the appropriate subset of the line's CCs is used.
  - Another place where efficiency seems to be lost is the independent calculation of the slopes and identification of the zone boundaries on the connected components of lines and of words. This has already been justified.
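Reusing the line-level components for a word might look like the sketch below. It assumes that a word is described by its left and right column limits and that boxes are stored as `(top, left, bottom, right)`; the slides do not describe how words are delimited.

```python
def components_in_word(boxes, word_left, word_right):
    """Select the already extracted boxes that lie inside a word's column range."""
    return [b for b in boxes if b[1] >= word_left and b[3] <= word_right]

# Boxes extracted once for the whole line, as (top, left, bottom, right).
line_boxes = [(0, 1, 0, 1), (2, 0, 3, 1), (2, 3, 3, 4), (2, 9, 3, 11)]

print(components_in_word(line_boxes, 0, 4))   # [(0, 1, 0, 1), (2, 0, 3, 1), (2, 3, 3, 4)]
print(components_in_word(line_boxes, 8, 12))  # [(2, 9, 3, 11)]
```

Filtering a list of boxes is far cheaper than labelling the word image again.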
Results and Discussion
- Zone identification of text is necessary for
  - reducing the size of the stored glyph database
  - reducing the possibility of misclassification due to similarity in the glyphs.
- Methods used for other Indian scripts similar to Gujarati are not sufficient for the Gujarati script.
- The algorithm described above handles the special properties of the Gujarati script.
- It was applied to 20 lines extracted from 3 different document images; in 19 cases the zone boundary was detected correctly.
Results and Discussion
- Zone separation by applying the method used for scripts with a shirorekha (notice the white gap at the top and bottom of the line): notice the over-cutting of the characters in the middle zone.
- Zone separation by applying the approach proposed in the paper.
Results and Discussion
- It is necessary to do zone boundary detection on connected components at both line and word level.
Sample Paragraph
Result of Zone detection