Zone Identification in the Printed Gujarati Text


Jignesh Dholakia, Department of Applied Mathematics, The M.S. University of Baroda, Baroda, India. jignesh_dholakia@yahoo.com
Atul Negi, Department of Computer and Information Sciences, University of Hyderabad, Hyderabad, India. atulcs@cs.uohyd.ernet.in
S. Rama Mohan, Department of Applied Mathematics, The M.S. University of Baroda, Baroda, India. srm@msubaroda.ac.in

Abstract
Gujarati is a language of the Indo-Aryan family, used by 50 million people in the western part of India.
The script used to write the Gujarati language is also called Gujarati.
It is a multilevel script, written in three zones: the base character zone, the upper modifier zone and the lower modifier zone.
Several characters are distinguished only by the specific modifiers that appear in the upper and lower zones.
Zone boundary detection is therefore an important task in Gujarati OCR.

Abstract
The Gujarati script is related in some respects to the Devanagari script, but has certain peculiar differences.
Known techniques for zone boundary detection in other scripts (Bengali, Assamese and Devanagari) cannot be used.
Gujarati OCR is at a preliminary stage: there is only one previously documented effort, an approach that recognizes a small subset of the Gujarati alphabet.
This work presents a method for accurate zone detection in images of printed Gujarati text.
It should smooth the way for the design and development of Gujarati OCR systems for complete character sets.

Introduction
OCR for Indian scripts is a maturing technology for most of the scripts:
Telugu (a South Indian script): Atul Negi et al. [1]
Bangla (an East Indian script): Chaudhuri and Pal [2]
Devanagari (a script used for many Indic languages): Bansal and Sinha [6]
Gujarati OCR is still under development.
The script has peculiar characteristics and hence needs to be treated differently from other Indo-Aryan scripts.
The only documented effort is by Samir Antani and Lalitha Agnihotri [4], and the approach they propose has limitations.

The Script
Gujarati is the script used for writing the Gujarati language, which belongs to the Indo-Aryan family of languages.
It is used by 50 million people in Gujarat, a state in western India, and by many others spread across the globe.
Vowels: અ આ ઇ ઈ ઉ ઊ ઋ એ ઐ ઓ ઔ
Consonants: ક ખ ગ ઘ ઙ ચ છ જ ઝ ઞ ટ ઠ ડ ઢ ણ ત થ દ ધ ન પ ફ બ ભ મ ય ર લ વ શ ષ સ હ ળ ક્ષ જ્ઞ
The conjuncts ક્ષ and જ્ઞ are considered to be part of the consonant set.

The Script
Vowel modifiers (matras): ા િ ી ુ ૂ ૃ ે ૅ ૉ ો ૈ ૌ
Other modifier-like symbols: #Å #Æ #Ã #&
(# is to be replaced by some consonant or conjunct.)
Numerals: 0 1 2 3 4 5 6 7 8 9
Some conjuncts: GlÉ Á Jí l©É v `Äò mÉ s }É i«É

The Script
The shapes of Gujarati characters are similar to those of the phonetically similar Devanagari characters, for example ટ and ट, ઠ and ठ.
There is no distinction between lower and upper case.
The prominent distinction from Devanagari is that there is no shirorekha (header line).
Consonant-vowel combinations use a matra: ક + એ = ક + ે = કે
Printed Gujarati text is written in three zones.

Need for Zone Identification
There are two possible strategies for recognition.
First strategy: recognize the complete consonant-vowel cluster as a distinct symbol. Number of glyphs to be identified = (36 consonants + approx. 250 conjuncts) * 11 dependent vowel modifiers * 2 other symbols + 11 independent vowels, which is roughly 6,300 glyphs.
Second strategy: first segment the consonants from the dependent vowel modifiers, then recognize them separately.
The second approach is the feasible one, given the glyph count above and the experience of Chaudhuri [5] for Bangla and of Bansal and Sinha [6] for Devanagari.
Hence the need for a robust algorithm to find the zone boundaries accurately.
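The glyph count on this slide can be checked with a few lines of Python. This is only an arithmetic sketch: the counts are the slide's own (the conjunct count is approximate), while the function names and the "segmented" count are assumptions added here for comparison, taking each component as a class recognized once on its own.

```python
# Counts quoted on the slide; the conjunct count is approximate.
CONSONANTS = 36
CONJUNCTS = 250
DEPENDENT_VOWEL_MODIFIERS = 11
OTHER_SYMBOLS = 2
INDEPENDENT_VOWELS = 11


def cluster_glyph_count():
    """Glyph classes if every consonant-vowel cluster is a distinct symbol."""
    bases = CONSONANTS + CONJUNCTS
    return bases * DEPENDENT_VOWEL_MODIFIERS * OTHER_SYMBOLS + INDEPENDENT_VOWELS


def segmented_glyph_count():
    """Assumed count if bases and modifiers are recognized separately."""
    return (CONSONANTS + CONJUNCTS + DEPENDENT_VOWEL_MODIFIERS
            + OTHER_SYMBOLS + INDEPENDENT_VOWELS)


print(cluster_glyph_count())    # (36 + 250) * 11 * 2 + 11
print(segmented_glyph_count())  # 36 + 250 + 11 + 2 + 11
```

Under these assumptions the first strategy needs about twenty times as many glyph classes as the second, which is the motivation for segmenting by zone.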

Why New Algorithm?
[5] and [6] describe a method of zone identification for the Bangla and Devanagari scripts.
The task is simpler for Bangla and Devanagari because of the presence of the shirorekha: the horizontal projection (the number of black pixels in each pixel row) shows a prominent peak.
Gujarati has no shirorekha, so there is no prominent peak in the horizontal projection, and the above algorithms will not work well with Gujarati.
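A minimal sketch of the horizontal projection argument, using toy binary images (1 = black pixel). The function names, the `ratio` threshold and the two tiny images are illustrative assumptions, not part of the presentation.

```python
def horizontal_projection(image):
    """Number of black pixels (value 1) in each pixel row."""
    return [sum(row) for row in image]


def prominent_peak_row(profile, ratio=1.5):
    """Row index of the peak if it stands out from the inked rows, else None.

    `ratio` is a hypothetical threshold: the peak must be at least `ratio`
    times the mean count of the non-empty rows.
    """
    inked = [count for count in profile if count > 0]
    if not inked:
        return None
    mean = sum(inked) / len(inked)
    peak = max(profile)
    return profile.index(peak) if peak >= ratio * mean else None


# Toy line with a header line (shirorekha) in row 1.
with_header = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 0, 1],
]
# The same strokes without a header line, as in Gujarati.
without_header = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 0, 1],
]

print(prominent_peak_row(horizontal_projection(with_header)))
print(prominent_peak_row(horizontal_projection(without_header)))
```

With the header line the projection has one row that clearly dominates; without it no row qualifies, so a peak-based zone detector has nothing to lock onto.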

Why New Algorithm?
A possible argument: instead of a peak at the header line, a trough in the horizontal projection may be used.
Instances where this will not work:
When the number of modifiers is large, the trough will not be very prominent.
When the number of modifiers is small, the trough will be shallow and hence almost undetectable.
When the text is misaligned, a significant part of a glyph in the middle zone is cut off, which affects the recognition accuracy. For example, ય and વ look similar when the first one is over-cut.
Hence the need for a new algorithm for the Gujarati script.

Proposed Algorithm
Line detection:
Matras may be connected to or disconnected from the base characters.
The image is therefore smeared in the vertical direction before line detection.
The horizontal projection of the vertically smeared image is used for line detection.
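The line detection step can be sketched as follows. The slide does not say how the smearing is done; this sketch assumes a simple run-length style smear that fills short vertical white gaps between black pixels, and the `max_gap` parameter and function names are hypothetical.

```python
def smear_vertically(image, max_gap):
    """Fill vertical white runs of at most `max_gap` pixels that lie
    between two black pixels in the same column (assumed smearing rule)."""
    height = len(image)
    width = len(image[0]) if image else 0
    out = [row[:] for row in image]
    for x in range(width):
        last_black = None
        for y in range(height):
            if image[y][x]:
                if last_black is not None and 0 < y - last_black - 1 <= max_gap:
                    for k in range(last_black + 1, y):
                        out[k][x] = 1
                last_black = y
    return out


def detect_lines(image, max_gap=1):
    """Return (first_row, last_row) for each text line, found as runs of
    non-empty rows in the horizontal projection of the smeared image."""
    profile = [sum(row) for row in smear_vertically(image, max_gap)]
    lines, start = [], None
    for y, count in enumerate(profile):
        if count and start is None:
            start = y
        elif not count and start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines


# Toy page: row 0 holds a detached upper matra, rows 2-3 its base
# characters, rows 7-8 a second text line.
page = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
]
print(detect_lines(page, max_gap=0))  # no smearing
print(detect_lines(page, max_gap=1))  # with smearing
```

Without smearing the detached matra is reported as a text line of its own; with a one-pixel smear it is merged with its base characters while the wider inter-line gap stays open.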

Proposed Algorithm
Steps in the procedure of the new algorithm:
Identify the potential connected components (CCs) within a text line.
Compute the slopes of the imaginary lines joining the top-left corners of the bounding boxes of all possible pairs of CCs.
Consider the lines having the minimum slope.
Consider the row coordinate through which the maximum number of minimum-slope lines pass. This gives the row of separation between the upper zone and the middle zone.
Similar operations on the bottom-right corners give the row separating the middle and lower zones.

Novel Algorithm for Zone Boundary Detection
Input: image of a line of Gujarati text.
Output: row numbers of the two lines that separate the upper and lower modifiers from the middle zone.
Step 1: Extract the connected components in the line, together with the information about their bounding boxes.
Step 2: For each pair of distinct connected components, compute the following:
Identify the coordinates (u1, v1) and (u2, v2) of the top-left corners of the bounding boxes of the two components.

Algorithm
Identify the coordinates (l1, m1) and (l2, m2) of the bottom-right corners of the bounding boxes of the two components.
Find the absolute values S1 and S2 of the slopes of the lines connecting (u1, v1) to (u2, v2) and (l1, m1) to (l2, m2):
S1 = |(u2 - u1) / (v2 - v1)| and S2 = |(l2 - l1) / (m2 - m1)|
(The formulas take u and l as row coordinates and v and m as column coordinates, so a slope of 0 corresponds to a horizontal line.)

Algorithm
Step 3: Identify the lines that give the minimum of the slopes S1. Those lines that fall in the region between 15% and 40% of the line height below the top of the text line are candidates for the separator of the upper zone from the middle zone. If more than one line satisfies this criterion, choose the line that occurs the maximum number of times as the zone separator.
Step 4: Identify the lines that give the minimum of the slopes S2. Those lines that fall in the region between 15% and 40% of the line height above the bottom of the text line are candidates for the separator of the lower zone from the middle zone. If more than one line satisfies this criterion, choose the line that occurs the maximum number of times as the zone separator.

Analysis
Errors are possible where words are not horizontally aligned, so line-level detection alone is not sufficient. The same process is repeated at word level.
The analyses at the two levels may disagree:
If the word-level analysis detects a new location for either of the two separators, consider the new location as final.
If the word-level analysis removes a zone detected in the line-level execution, consider the line-level decision as final.
Word-level analysis alone is not sufficient either, because it can over-cut the base character.

Analysis
Hence the same process needs to be carried out first with the connected components of the image of a text line and then with the CCs of the image of each word extracted from the line.
Computational efficiency:
CC extraction is a time-consuming process. It is not repeated for the word-level analysis; the appropriate subset of the line's CCs is used.
Efficiency also seems to be lost in the independent calculation of slopes and identification of zone boundaries on the connected components of lines and words, but this has already been justified above.
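The two-level procedure described in the analysis can be sketched as a subset selection followed by a reconciliation rule. The representation is an assumption of this sketch: boxes are (top, left, bottom, right) tuples, each result is an (upper, lower) pair of row numbers, and `None` stands for "no separator detected". The function names are hypothetical.

```python
def boxes_in_word(line_boxes, word_left, word_right):
    """Reuse the line's connected components: keep the boxes that lie
    inside the column range of one word instead of re-extracting CCs."""
    return [box for box in line_boxes
            if box[1] >= word_left and box[3] <= word_right]


def reconcile(line_level, word_level):
    """Combine line-level and word-level separators.

    A separator found at word level overrides the line-level one; if the
    word-level analysis finds none, the line-level decision is kept.
    """
    return tuple(word if word is not None else line
                 for line, word in zip(line_level, word_level))


line_boxes = [(5, 0, 14, 8), (5, 10, 14, 18), (6, 30, 15, 38), (6, 40, 15, 48)]
print(boxes_in_word(line_boxes, 30, 48))  # CCs of the second word
print(reconcile((5, 14), (6, None)))      # new upper row, lower row kept
```

Keeping the line-level value when the word level reports nothing mirrors the rule that a word-level run must not remove a zone already detected for the line.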

Results and Discussion
Zone identification of text is necessary for reducing the size of the stored glyph database and for reducing the possibility of misclassification due to similarity between glyphs.
Methods used for other Indian scripts similar to Gujarati are not sufficient for the Gujarati script; the algorithm described above handles the special properties of the Gujarati script.
The algorithm was applied to 20 lines extracted from 3 different document images. In 19 cases the zone boundaries were detected correctly.

Results and Discussion
Zone separation by applying the method used for scripts with a shirorekha: notice the white gap at the top and bottom of the line, and the over-cutting of the characters in the middle zone.
Zone separation by applying the approach proposed in the paper.

Results and Discussion
It is necessary to do zone boundary detection on connected components at both the line level and the word level.

Sample Paragraph

Result of Zone detection