Optical Character Recognition

Qurat-ul-Ain (Ainie) Akram
Sarmad Hussain
Center for Language Engineering
Al-Khawarizmi Institute of Computer Science
University of Engineering and Technology, Lahore, Pakistan

Lecture 8
Syllable String Creation Using a Lookup Table

Syllable String | Main body ID | Diacritics1_ID | …
تا | 500 | 2
و | 501 |

پتھر

ISSALE
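A minimal sketch of the lookup step, assuming the table maps a recognized main body ID plus an ordered list of diacritic IDs to a syllable string. The function name, the key layout, and the ID values are illustrative assumptions, not the course's actual table.

```python
from typing import Dict, Optional, Sequence, Tuple

# Key: (main body ID, tuple of diacritic IDs in recognition order).
# Value: the syllable string to write to the OCR output.
# The IDs here are illustrative placeholders (assumption).
SyllableKey = Tuple[int, Tuple[int, ...]]

LOOKUP_TABLE: Dict[SyllableKey, str] = {
    (500, (2,)): "تا",  # main body with one diacritic
    (501, ()): "و",     # main body with no diacritics
}


def make_syllable_string(
    main_body_id: int, diacritic_ids: Sequence[int] = ()
) -> Optional[str]:
    """Return the syllable string for a main body and its diacritics.

    Returns None when the combination is missing from the table, so the
    caller can decide how to report an unrecognized syllable.
    """
    key = (main_body_id, tuple(diacritic_ids))
    return LOOKUP_TABLE.get(key)
```

Returning None for an unknown combination keeps recognition errors visible instead of silently emitting a wrong syllable.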
Project Presentation

1. Front Page
– Optical Character Recognition (in English)
– Optical Character Recognition (in your language)
– Document image
– Output of OCR (recognized syllable strings)
– Syllable string recognition accuracy (correctly recognized syllables / total syllables × 100)
– Group members' names

ISSALE 2014
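The accuracy figure requested on the front page is a plain percentage. A small sketch follows; the function name and the input checks are assumptions added for illustration.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Percentage of correctly recognized items out of the total."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    # Multiply first so whole-number percentages stay exact.
    return 100.0 * correct / total
```

The same helper can serve every accuracy table in the presentation (lines, syllables, main bodies, diacritics), since each one reports correct items over total items.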
1. Pre-processing
– Line Segmentation
  » Samples of line segmentation
  » Line segmentation accuracy results
  » Samples of incorrect line segmentation
– Syllable/Ligature Segmentation
  » Samples of syllable/ligature segmentation
  » Syllable/ligature segmentation accuracy results
  » Samples of incorrect syllable/ligature segmentation

Total Lines | Correctly Segmented Lines | Incorrect Lines | % Accuracy
Total Syllables | Correctly Segmented Syllables | Incorrect Syllables | % Accuracy
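The slides do not name a line segmentation method. One common approach is a horizontal projection profile: count the ink pixels in each row and treat runs of non-empty rows as text lines. The sketch below assumes a binarized image given as nested lists (1 = ink, 0 = background); the function name and the min_ink parameter are hypothetical.

```python
from typing import List, Tuple


def segment_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges for each text line, bottom exclusive.

    A row belongs to a line when it holds at least min_ink ink pixels.
    """
    lines: List[Tuple[int, int]] = []
    start = None  # first row of the line currently being scanned
    for y, row in enumerate(image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                  # a new line begins
        elif not has_ink and start is not None:
            lines.append((start, y))   # the line ended on the previous row
            start = None
    if start is not None:              # line touches the bottom edge
        lines.append((start, len(image)))
    return lines


# Tiny example page: two text lines separated by a blank row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
lines = segment_lines(page)  # [(1, 3), (4, 5)]
```

Raising min_ink makes the rule ignore rows that contain only a few stray pixels, at the risk of cutting off thin strokes.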
Pre-processing
– Main body and diacritics disambiguation

Total Main Bodies | Correctly Classified as Main Bodies | % Accuracy
Total Diacritics | Correctly Classified as Diacritics | % Accuracy
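The slides do not specify how main bodies and diacritics are told apart. As a rough sketch, one simple heuristic labels a connected component as a diacritic when it is much shorter than the tallest component in the same line. The Component type, the function name, and the 0.4 ratio are all assumptions for illustration; a practical system would likely also use the component's position relative to the baseline.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Component(NamedTuple):
    """Bounding box of one connected component (hypothetical structure)."""
    x: int
    y: int
    width: int
    height: int


def split_main_bodies_and_diacritics(
    components: Sequence[Component], ratio: float = 0.4
) -> Tuple[List[Component], List[Component]]:
    """Split the components of one text line into (main_bodies, diacritics).

    A component counts as a diacritic when its height is below
    ratio * (height of the tallest component in the line).
    """
    if not components:
        return [], []
    threshold = ratio * max(c.height for c in components)
    main_bodies: List[Component] = []
    diacritics: List[Component] = []
    for comp in components:
        if comp.height < threshold:
            diacritics.append(comp)
        else:
            main_bodies.append(comp)
    return main_bodies, diacritics
```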
Classification and Recognition
– Data Description
  15 main body types (DataSet-1)
  – Training data (35 tokens)
  – Testing data (15 tokens)
  – Image samples
  Document images (DataSet-2)
  – Testing data
    » X tokens of Y main body types
    » X tokens of Y diacritics types
    » Image sample

Main Body Type | Total Tokens in Document Images | Total Unique Syllables in Document Images
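A sketch of how DataSet-1 could be divided, assuming the 35 training and 15 testing tokens are counted per main body type (the slide does not say so explicitly). The function name, the fixed seed, and the file names in the test data are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def split_tokens(
    tokens_by_type: Dict[str, List[str]],
    n_train: int = 35,
    n_test: int = 15,
    seed: int = 0,
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split each type's tokens into disjoint training and testing lists."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train: Dict[str, List[str]] = {}
    test: Dict[str, List[str]] = {}
    for mb_type, tokens in tokens_by_type.items():
        if len(tokens) < n_train + n_test:
            raise ValueError(f"not enough tokens for {mb_type}")
        shuffled = list(tokens)  # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        train[mb_type] = shuffled[:n_train]
        test[mb_type] = shuffled[n_train:n_train + n_test]
    return train, test
```

Shuffling before the split avoids putting all tokens from one scan or writer into the same partition.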
Classification and Recognition Results
– Recognition results on DataSet-1 using decision trees
  » Main body recognition accuracy
  » Diacritics recognition accuracy
– Recognition results on DataSet-1 using Tesseract
  » Main body recognition accuracy
  » Diacritics recognition accuracy

Two result tables, each with columns:
Class Type | Total Samples | Test Data (15 Tokens) | Correctly Recognized | % Accuracy
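The per-class result tables can be filled in from any classifier's output, whether it comes from a decision tree or from Tesseract. The sketch below assumes two parallel lists of true and predicted class labels; the function name and the label strings used in the test are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def per_class_accuracy(
    true_labels: Sequence[str], predicted_labels: Sequence[str]
) -> List[Tuple[str, int, int, float]]:
    """Return rows of (class type, total samples, correctly recognized, % accuracy)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    totals = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    rows = []
    for cls in sorted(totals):
        n_total = totals[cls]
        n_correct = correct[cls]  # Counter gives 0 for classes never matched
        rows.append((cls, n_total, n_correct, 100.0 * n_correct / n_total))
    return rows
```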
Classification and Recognition Results
– Recognition results on DataSet-2 using decision trees
  » Main body recognition accuracy
  » Diacritics recognition accuracy
OR
– Recognition results on DataSet-2 using Tesseract
  » Main body recognition accuracy
  » Diacritics recognition accuracy

Two result tables, each with columns:
Class Type | Total Samples | Correctly Recognized | % Accuracy
Post-processing
– Syllable String Creation
– Syllable String Recognition Accuracy

Syllable String | Main body ID | Diacritics1_ID | …
تا | 500 | 2
و | 501 |

Syllable Type | Total Samples | Correctly Recognized | % Accuracy
Output of OCR
[Figure: input document image]
[Figure: OCR output]
Deliverables to Submit
1. Presentation slides
2. Complete OCR code
   1. Line segmentation
   2. Syllable segmentation
   3. Recognition of diacritics and main bodies
   4. Syllable string creation using lookup table
   5. Output.txt file generation
3. DataSet-1
4. DataSet-2
5. Tesseract traineddata file
Good Luck
Document Image Creation

Syllable_of_MB1_Samples_1 Syllable_of_MB2_Samples_1 Syllable_of_MB3_Samples_1 Syllable_of_MB4_Samples_1 Syllable_of_MB5_Samples_1 ... Syllable_of_MB15_Samples_1
Syllable_of_MB1_Samples_2 Syllable_of_MB2_Samples_2 Syllable_of_MB3_Samples_2 Syllable_of_MB4_Samples_2 Syllable_of_MB5_Samples_2 ... Syllable_of_MB15_Samples_2
Syllable_of_MB1_Samples_3 Syllable_of_MB2_Samples_3 Syllable_of_MB3_Samples_3 Syllable_of_MB4_Samples_3 Syllable_of_MB5_Samples_3 ... Syllable_of_MB15_Samples_3
Syllable_of_MB1_Samples_4 Syllable_of_MB2_Samples_4 Syllable_of_MB3_Samples_4 Syllable_of_MB4_Samples_4 Syllable_of_MB5_Samples_4 ... Syllable_of_MB15_Samples_4
...
Syllable_of_MB1_Samples_15 Syllable_of_MB2_Samples_15 Syllable_of_MB3_Samples_15 Syllable_of_MB4_Samples_15 Syllable_of_MB5_Samples_15 ... Syllable_of_MB15_Samples_15

Syllable = MB + Diacritics, or Syllable = MB
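The layout above can be read as a grid with one row per sample and one column per main body type. A short sketch that generates the placeholder names in that order follows; the function name is hypothetical, and the defaults of 15 main bodies and 15 samples are taken from the grid itself.

```python
from typing import List


def document_layout(n_main_bodies: int = 15, n_samples: int = 15) -> List[List[str]]:
    """Return the grid of syllable placeholders, one row per sample index."""
    return [
        [f"Syllable_of_MB{mb}_Samples_{sample}" for mb in range(1, n_main_bodies + 1)]
        for sample in range(1, n_samples + 1)
    ]
```

Each placeholder stands for one syllable image, which is either a main body alone or a main body with its diacritics.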
Examples of Document Image