Using Word Based Features for Word Clustering
The Thirteenth Conference on Language Engineering, 11-12 December 2013
Department of Electronics and Communications, Faculty of Engineering, Cairo University
Research Team: Farhan M. A. Nashwan, Prof. Dr. Mohsen A. A. Rashwan
Presented By: Farhan M. A. Nashwan
Contribution:
- Reduce vocabulary
- Increase speed
Proposed Approach:
Generated word image -> Preprocessing and word segmenter -> Word grouping -> Clustering -> Groups and clusters for holistic recognition
Grouping:
- Extract the subwords (PAWs, pieces of Arabic words)
- Extract the dots and diacritics
- Use them to select the group
Grouping:
Generated word image -> Preprocessing and word segmenter -> Secondaries separation using contour analysis -> Secondaries recognition using SVM -> Grouping process -> Groups
Grouping Example:
- Grouping code (1,21,2): PAW=1, Down Sec.=2 & 1, Upper Sec.=2
- Grouping code (3,0,2): PAW=3, Down Sec.=0, Upper Sec.=2
- Grouping code (4,11,12): PAW=4, Down Sec.=1 & 1, Upper Sec.=1 & 2
- Grouping code (3,2,21): PAW=3, Down Sec.=2, Upper Sec.=2 & 1
- Grouping code (2,0,2): PAW=2, Down Sec.=0, Upper Sec.=2
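The examples above suggest that a grouping code is the triple (PAW count, down secondaries, upper secondaries), with the labels of several secondaries written side by side (2 & 1 becomes 21) and 0 standing for "none". A minimal sketch under that reading follows; the function name `grouping_code` is hypothetical, and the slide does not say what the labels 1 and 2 themselves denote.

```python
def grouping_code(paw_count, down_secondaries, upper_secondaries):
    """Build a (PAW, down, upper) triple in the style of the slide's examples.

    Assumption: each secondary is a small integer label and several labels
    are concatenated digit by digit; an empty list is encoded as 0.
    """
    def pack(labels):
        # [] -> 0, [2] -> 2, [2, 1] -> 21
        return int("".join(str(label) for label in labels)) if labels else 0

    return (paw_count, pack(down_secondaries), pack(upper_secondaries))


# The five examples from the slide, as (PAW, down labels, upper labels).
slide_examples = [
    (1, [2, 1], [2]),
    (3, [], [2]),
    (4, [1, 1], [1, 2]),
    (3, [2], [2, 1]),
    (2, [], [2]),
]
codes = [grouping_code(*example) for example in slide_examples]
```

Words that share the same triple fall into the same group, so the code can serve directly as a dictionary key.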
Grouping based on:
- PAWs
- Down secondaries
- Upper secondaries
Challenges:
- Sticking
- Sensitive to noise
Treatments:
- Overlapping
- SVM
Clustering:
- Complementary to grouping
- LBG algorithm used
- Applied to groups that contain a large number of words
- Euclidean distance used
Groups -> Feature Extraction -> Clustering using LBG -> Clusters & Groups
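As an illustration of LBG (Linde-Buzo-Gray) clustering with Euclidean distance, here is a minimal pure-Python sketch. It is the textbook splitting scheme, not necessarily the authors' implementation: the additive perturbation `eps`, the iteration cap `max_iter`, and all function names are assumptions. The codebook doubles at each split, so its final size is the smallest power of two that is at least `n_codes`.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def lbg(vectors, n_codes, eps=0.01, max_iter=50):
    """Return (codebook, labels) after repeated split-and-refine steps."""
    codebook = [mean_vector(vectors)]
    labels = [0] * len(vectors)
    while len(codebook) < n_codes:
        # Split every code vector into two slightly perturbed copies.
        codebook = [[c + s for c in code] for code in codebook for s in (eps, -eps)]
        previous = None
        for _ in range(max_iter):
            # Assign each vector to its nearest code vector.
            labels = [
                min(range(len(codebook)), key=lambda k: euclidean(v, codebook[k]))
                for v in vectors
            ]
            # Move each code vector to the mean of its members (if any).
            for k in range(len(codebook)):
                members = [v for v, lab in zip(vectors, labels) if lab == k]
                if members:
                    codebook[k] = mean_vector(members)
            if labels == previous:
                break
            previous = labels
    return codebook, labels


features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
codebook, labels = lbg(features, n_codes=2)
```

In the setting of the slide, `features` would hold the feature vectors of the words in one large group, and each resulting label would identify a cluster inside that group.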
Features:
1- ICC: Image Centroid and Cells
2- DCT: Discrete Cosine Transform
3- BDCT: Block Discrete Cosine Transform
4- DCT-4B: Discrete Cosine Transform 4-Blocks
5- BDCT+ICC: Hybrid BDCT with ICC
6- ICC+DCT: Hybrid DCT with ICC
7- ICZ: Image Centroid and Zone
8- DCT+ICZ: Hybrid DCT and ICZ
9- DTW: Dynamic Time Warping
10- The Moment Invariant Features
Results:
TABLE 1: CLUSTERING RATE OF SIMPLIFIED ARABIC FONT USING DIFFERENT FEATURES
Columns: Features | Cluster Rate (%) | Total ER (%) | Clustering ER (%) | Group ER (%) | Word/Cluster
Rows: ICC, BDCT, DCT, DCT-4B, ICC+BDCT, ICC+DCT, ICZ, ICZ+DCT, DTW, Moments
TABLE 2: PROCESSING TIME FOR FEATURE EXTRACTION AND CLUSTERING OF SIMPLIFIED ARABIC FONT USING DIFFERENT FEATURES
Columns: Features | Cluster Rate (%) | Word/Cluster | Feat_Ext_Time (ms) | Clus_Ave_Time (ms) | To_Ave_Time (ms)
Rows: ICC, BDCT, DCT, DCT-4B, ICC+BDCT, ICC+DCT, ICZ, ICZ+DCT, DTW, Moments
Conclusion:
- Words are grouped and clustered based on their holistic features: recognition speed is increased and unnecessary entries in the vocabulary are removed.
- The total average time of ICC or Moments (0.29 ms) is better than that of the other methods, but their clustering rates are not the best (98.69% for ICC and 82.61% for Moments).
- The clustering rate of DCT (99.19%) is the best, but its time is the worst (~12 ms).
- Considering both parameters (clustering rate and time), ICC may be a good compromise.
Thanks for your attention.
- Counting the number of black pixels
- Vertical transitions from black to white
- Horizontal transitions from black to white
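The three counts listed above can be sketched as follows, assuming they are taken over one cell of a binarized word image stored as a list of rows with 1 for black and 0 for white. The function name `cell_counts` and the scan directions (top to bottom, left to right) are assumptions.

```python
def cell_counts(cell):
    """Return (black_pixels, vertical_transitions, horizontal_transitions).

    A transition is counted whenever a black pixel (1) is immediately
    followed by a white pixel (0) along the scan direction.
    """
    black = sum(sum(row) for row in cell)
    # Horizontal: walk each row from left to right.
    horizontal = sum(
        1 for row in cell for a, b in zip(row, row[1:]) if a == 1 and b == 0
    )
    # Vertical: walk each column from top to bottom.
    columns = [list(column) for column in zip(*cell)]
    vertical = sum(
        1 for col in columns for a, b in zip(col, col[1:]) if a == 1 and b == 0
    )
    return black, vertical, horizontal


cell = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]
counts = cell_counts(cell)  # (3, 2, 2)
```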
DCT
- Apply the DCT to the whole word image
- Extract the features as a vector by reading the DCT coefficients in zigzag order
- Usually only the most significant DCT coefficients are kept (160 coefficients)
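The steps above can be sketched in pure Python as a separable orthonormal 2-D DCT-II followed by a JPEG-style zigzag read-out. This is a slow reference version for illustration only; the normalization, the exact zigzag direction, and the function names are assumptions, and the default of 160 coefficients is the figure quoted on the slide.

```python
import math


def dct_1d(signal):
    """Orthonormal 1-D DCT-II of a sequence of numbers."""
    n = len(signal)
    out = []
    for u in range(n):
        scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        total = sum(
            signal[x] * math.cos((2 * x + 1) * u * math.pi / (2 * n))
            for x in range(n)
        )
        out.append(scale * total)
    return out


def dct_2d(image):
    """2-D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major


def zigzag_indices(height, width):
    """(row, col) pairs ordered along anti-diagonals, alternating direction."""
    return sorted(
        ((i, j) for i in range(height) for j in range(width)),
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]),
    )


def dct_features(image, n_coeffs=160):
    """First n_coeffs DCT coefficients of the image in zigzag order."""
    coeffs = dct_2d(image)
    order = zigzag_indices(len(image), len(image[0]))
    return [coeffs[i][j] for i, j in order[:n_coeffs]]


flat = [[1.0] * 4 for _ in range(4)]
features = dct_features(flat, n_coeffs=5)  # only the DC term is non-zero
```

Low-frequency coefficients come first in zigzag order, which is why truncating the vector keeps the coarse shape of the word.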
Block Discrete Cosine Transform (BDCT)
- Apply the DCT to each cell
- Take the average of the differences between all the DCT coefficients
Discrete Cosine Transform 4-Blocks (DCT-4B)
1- Compute the center of gravity of the input image.
2- Divide the word image into 4 parts, taking the center of gravity as the origin point.
3- Apply the DCT to each part.
4- Concatenate the features taken from each part to form the feature set of the given word.
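A minimal sketch of steps 1, 2 and 4 follows. The per-part transform of step 3 is passed in as a function (`part_features`, a hypothetical parameter) so that any DCT routine can be plugged in. It assumes a binarized image with 1 for black, at least one black pixel, and a center of gravity rounded to the nearest pixel; a part can be empty when the center lies on the image border.

```python
def center_of_gravity(image):
    """Centroid (row, col) of the black pixels of a binary image."""
    points = [
        (r, c) for r, row in enumerate(image) for c, value in enumerate(row) if value
    ]
    count = len(points)
    return (sum(r for r, _ in points) / count, sum(c for _, c in points) / count)


def split_in_four(image):
    """Cut the image into four parts around the center of gravity."""
    cy, cx = center_of_gravity(image)
    r0, c0 = int(round(cy)), int(round(cx))
    top, bottom = image[:r0], image[r0:]
    return [
        [row[:c0] for row in top],      # upper-left part
        [row[c0:] for row in top],      # upper-right part
        [row[:c0] for row in bottom],   # lower-left part
        [row[c0:] for row in bottom],   # lower-right part
    ]


def four_block_features(image, part_features):
    """Concatenate the feature vectors computed on each of the four parts."""
    features = []
    for part in split_in_four(image):
        features.extend(part_features(part))
    return features


full = [[1] * 4 for _ in range(4)]
# Stand-in for the per-part DCT: count the black pixels of each part.
ink_per_part = four_block_features(full, lambda part: [sum(map(sum, part))])
```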
Image Centroid and Zone (ICZ)
- Compute the average distance between the points in a given zone and the centroid of the word image
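The ICZ computation can be sketched as below. The zone grid size (`zone_rows` by `zone_cols`), the choice of black pixels as "the points", and the value 0.0 for an empty zone are assumptions that the slide does not specify.

```python
import math


def icz_features(image, zone_rows, zone_cols):
    """Average distance from each zone's black pixels to the image centroid.

    image: list of rows, 1 = black, 0 = white, with at least one black pixel.
    Returns one value per zone in row-major order.
    """
    height, width = len(image), len(image[0])
    points = [(r, c) for r in range(height) for c in range(width) if image[r][c]]
    cy = sum(r for r, _ in points) / len(points)
    cx = sum(c for _, c in points) / len(points)

    sums = [0.0] * (zone_rows * zone_cols)
    counts = [0] * (zone_rows * zone_cols)
    for r, c in points:
        # Map the pixel to its zone index on a regular grid.
        zone = (r * zone_rows // height) * zone_cols + (c * zone_cols // width)
        sums[zone] += math.hypot(r - cy, c - cx)
        counts[zone] += 1
    return [s / n if n else 0.0 for s, n in zip(sums, counts)]


diagonal = [
    [1, 0],
    [0, 1],
]
zones = icz_features(diagonal, zone_rows=2, zone_cols=2)
```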
DTW (Dynamic Time Warping) Features
Three types of features are extracted from the binarized images and used in our DTW techniques:
- X-axis and Y-axis histogram profile
- Profile features (upper, down, left and right)
- Foreground/background transition
DTW is an algorithm for measuring the similarity between two sequences. The distance between two time series x1 ... xM and y1 ... yN is D(M,N), which is calculated with a dynamic programming approach.
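For reference, the textbook DTW recurrence is D(i,j) = d(x_i, y_j) + min(D(i-1,j), D(i,j-1), D(i-1,j-1)), with D(0,0) = 0 and all other border entries set to infinity. The sketch below implements that standard form; the recurrence shown on the original slide may differ in its local cost or step pattern, and the absolute-difference cost used as the default here is an assumption.

```python
def dtw_distance(x, y, cost=lambda a, b: abs(a - b)):
    """Dynamic Time Warping distance D(M, N) between sequences x and y."""
    m, n = len(x), len(y)
    inf = float("inf")
    # d[i][j] holds the best cumulative cost of aligning x[:i] with y[:j].
    d = [[inf] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = cost(x[i - 1], y[j - 1]) + min(
                d[i - 1][j],      # x advances, y stays
                d[i][j - 1],      # y advances, x stays
                d[i - 1][j - 1],  # both advance
            )
    return d[m][n]


stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # 0.0: same shape, different length
shifted = dtw_distance([1, 2, 3], [2, 3, 4])       # 2.0
```

Because the warping absorbs differences in length, two profiles of the same word at different widths can still be compared directly.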
DTW (Dynamic Time Warping) Features
Figure 1: The four profile features: (A) left profile, (B) up profile, (C) down profile, (D) right profile.
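One common way to compute such profiles is to record, for every row or column, the distance from the corresponding image edge to the first black pixel. The sketch below follows that definition, which is an assumption since the figure itself is not reproduced here; lines without any black pixel get the full line length.

```python
def profiles(image):
    """Left, up, down and right profiles of a binary image (1 = black)."""

    def first_black(line):
        # Index of the first black pixel, or len(line) if there is none.
        return next((k for k, value in enumerate(line) if value), len(line))

    columns = [list(column) for column in zip(*image)]
    return {
        "left": [first_black(row) for row in image],
        "right": [first_black(row[::-1]) for row in image],
        "up": [first_black(col) for col in columns],
        "down": [first_black(col[::-1]) for col in columns],
    }


glyph = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]
result = profiles(glyph)
```

Each profile is a one-dimensional sequence, so it can be fed straight into a DTW comparison.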
The Moment Invariant Features
Hu moments: Hu defined seven values, computed from central moments through order three.
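The seven Hu invariants can be sketched from their standard definitions in terms of normalized central moments. The version below assumes a binary image with 1 for black and at least one black pixel; the function name `hu_moments` is an assumption, and this plain-Python form favors readability over speed.

```python
def hu_moments(image):
    """Seven Hu moment invariants of a binary image (1 = black)."""
    points = [(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v]
    m00 = float(len(points))
    xc = sum(x for x, _ in points) / m00
    yc = sum(y for _, y in points) / m00

    def eta(p, q):
        # Normalized central moment of order (p, q).
        mu = sum((x - xc) ** p * (y - yc) ** q for x, y in points)
        return mu / m00 ** (1 + (p + q) / 2.0)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03
    c, d = n30 - 3 * n12, 3 * n21 - n03
    return [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ]


shape = [
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
# The same shape moved one row down and two columns right.
moved = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
original_values = hu_moments(shape)
moved_values = hu_moments(moved)
```

The two vectors agree up to floating-point error, which illustrates the translation invariance that makes these values usable as holistic word features.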
Moments
The moment invariant descriptors are calculated and fed to the feature vector.