Form Image Compression using Template Extraction and Matching
Jianguo Wang and Hong Yan
School of Electrical and Information Engineering, University of Sydney, NSW 2006, Australia
Multi-copy Form Images Redundancy Analysis
Local redundancy (CCITT Group 3, Group 4, JBIG)
Global redundancy
–Component-level redundancy (JBIG2)
–Pattern assemblage redundancy in similar images (TEM)
Flow chart of the TEM form compression scheme
Template extraction
image de-skewing and locating,
distortion adjusting,
template extraction,
–generating a greyscale image
–thresholding to get two pre-templates
–getting the template by comparing the pre-templates
template refining.
A set of adjusted binary form images is overlapped to generate a greyscale image. The grey level of each pixel is determined by the number of images in which that pixel is black.
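The overlapping step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the binary images are already de-skewed and aligned, that black pixels are stored as 1, and that the two pre-templates come from a low and a high threshold on the per-pixel black count. The function names and the fractions `low_frac` and `high_frac` are hypothetical, and the rule for comparing the two pre-templates is not given on the slides, so it is left out.

```python
import numpy as np

def overlap_to_greyscale(binary_images):
    """Count, per pixel, how many aligned binary images are black (1) there."""
    stack = np.stack([np.asarray(img, dtype=np.uint16) for img in binary_images])
    return stack.sum(axis=0)

def pre_templates(counts, n_images, low_frac=0.5, high_frac=0.9):
    """Threshold the black counts twice (hypothetical fractions, not from the slides)."""
    loose = counts >= low_frac * n_images    # pixels black in many of the images
    strict = counts >= high_frac * n_images  # pixels black in nearly all images
    return loose, strict

# Three tiny "forms": pixels shared by all copies behave like template pixels.
forms = [
    [[1, 0, 1], [0, 0, 1]],
    [[1, 0, 1], [0, 1, 1]],
    [[1, 0, 0], [0, 0, 1]],
]
counts = overlap_to_greyscale(forms)
loose, strict = pre_templates(counts, n_images=len(forms))
```

Pixels that are black in every copy, such as printed lines and labels, reach the maximum count, while filled-in data that differs between copies stays low.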
Examples of the compression approach (a) an original form image; (b) template extracted from a set of filled-in forms
Compression
image de-skewing and locating,
distortion adjusting,
filled-in data extraction,
–three possible situations
–two types of prototypes: SCC and CCC
compression with Group 4, stored as TIFF files.
Decompression
Two types of prototypes:
–SCC: performed in the rectangle area
–CCC: performed in the pixel set of the prototype
Three possible situations:
–blank: copy the corresponding prototype
–different: no substitution occurs
–exactly the same: delete the component
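The three situations can be illustrated with a small classifier over a rectangular region, roughly in the spirit of SCC-type matching. This is a sketch under assumptions: the slides do not define how the comparison is carried out, so the function name `classify_region`, the use of exact pixel equality, and the reading of "blank" as "no black pixel in the region" are all hypothetical. The action strings are taken from the slide.

```python
import numpy as np

# Actions as listed on the slide, keyed by the situation detected.
ACTIONS = {
    "blank": "copy the corresponding prototype",
    "different": "no substitution occurs",
    "same": "delete the component",
}

def classify_region(region, prototype):
    """Compare a rectangular image region with a template prototype (1 = black)."""
    region = np.asarray(region, dtype=bool)
    prototype = np.asarray(prototype, dtype=bool)
    if not region.any():
        return "blank"          # assumed meaning: the region has no black pixels
    if np.array_equal(region, prototype):
        return "same"           # assumed meaning: pixel-for-pixel identical
    return "different"

prototype = [[1, 0], [0, 1]]
situation = classify_region([[0, 0], [0, 0]], prototype)
action = ACTIONS[situation]
```

A CCC-type variant would presumably run the same test over only the pixel set of the prototype instead of its bounding rectangle.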
(c) the reconstructed image (d) the filled-in data extracted from (a).
Sample forms used for testing
Form Document Compression Experiment Results
Conclusion
TEM reduces pattern assemblage redundancy in similar images;
–can be combined with any current standard (CCITT G3, G4, JBIG) to reduce local redundancy
–can be combined with JBIG2 to reduce component-level redundancy within the same image;
a statistical template extraction algorithm that overlaps binary images into a greyscale image;
form image de-skewing, locating and distortion adjusting;
pattern matching rules for SCC and CCC.