Groundtruthing for Performance Evaluation of Document Image Analysis Systems: a primer
Mathieu Delalandre
Pattern Recognition and Image Analysis Group, Laboratory of Computer Science
François Rabelais University, Tours, France
Digidoc meeting, 6th of January

“Introduction”
Performance evaluation is a cross-disciplinary research field that appears in a variety of domains. Its purpose is to develop frameworks that evaluate and compare a set of methods in order to select the one best suited to a given application.
Groundtruth must be reliable (i.e. 100% recognition rate) and exhaustive (label, localization, geometric transforms, noise estimation, metadata, etc.).
In the document image analysis field (apart from graphics), five main approaches exist.

[Figure: performance evaluation workflow, with labels Data, Degradation, Training data, System, Results, Characterisation, Groundtruthing, Groundtruth, Performance evaluation]

Approach                          Speed / Reliability  Type             Constraint
GUI based groundtruthing          --                   Real (any)       None
Semi-automatic transcription      ++                   Real (any)       None
Electronic document mapping       +++                  Real (modern)    Electronic document
Transcript mapping                ++                   Real (old)       Transcription
Generation of synthetic document  ++                   Synthetic (any)  None

“GUI based groundtruthing”
Principles: A GUI is plugged into a DIA system, and the groundtruth is obtained through user correction. e.g. TrueViz [Kan’01], Xmillum [Hitz’00], PinkPanther [Yanikoglu’01], PerfectDoc [Yacoub’05], etc.
Pros: Discussion about the groundtruth formalism.
Cons: User correction is time consuming, a specific DIA chain must be designed for every application, and the groundtruth is still not reliable.

“Semi-automatic transcription”
Principles: Exploit the context and user interaction to make the recognition process more robust. Transcription is achieved at the metadata level, without considering the images. e.g. [Bal’08], [Lebourgeois’01]
Algorithms: binarization and connected component labeling, shape context, image distance, clustering, etc.
Pros: Interesting idea: labeling 5% of the data could yield a 95% correct transcription.
Cons: Robustness of the approach is not proved yet, a complete transcription is not guaranteed, and the impact of the segmentation is unclear.

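
The idea of labeling a small fraction of the data and propagating the labels can be sketched as follows. This is a minimal illustration, not the method of the cited works: it assumes each glyph image has already been reduced to a feature vector, uses a simple leader clustering as a stand-in for the clustering step, and the feature values, the threshold and the function names are all hypothetical.

```python
import math

def leader_clustering(features, threshold):
    """Greedy clustering: a sample joins the first cluster whose leader is
    within `threshold`, otherwise it becomes the leader of a new cluster."""
    leaders = []      # index of the representative sample of each cluster
    assignment = []   # cluster id of each sample
    for i, f in enumerate(features):
        for c, lead in enumerate(leaders):
            if math.dist(f, features[lead]) <= threshold:
                assignment.append(c)
                break
        else:
            leaders.append(i)
            assignment.append(len(leaders) - 1)
    return leaders, assignment

def propagate_labels(leaders, assignment, user_labels):
    """Give every sample the label the user typed for its cluster leader."""
    return [user_labels[leaders[c]] for c in assignment]

# Hypothetical 2-D shape descriptors for six glyph images.
features = [(0.10, 0.90), (0.12, 0.88), (0.80, 0.20),
            (0.79, 0.22), (0.11, 0.91), (0.82, 0.18)]
leaders, assignment = leader_clustering(features, threshold=0.1)

# The user labels only the cluster leaders (2 samples out of 6).
user_labels = {0: "a", 2: "e"}
transcription = propagate_labels(leaders, assignment, user_labels)
```

Here the user types two labels and the remaining four are propagated. A glyph wrongly merged into a cluster would silently receive a wrong label, which is the robustness concern listed in the cons.
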
“Electronic document mapping”
Principles: A registration algorithm estimates the global geometric transformation and then performs a robust local bitmap match to register an ideal document image to its corresponding scanned version. e.g. [Kan’96], [Hobby’98], [Beusekom’08], [Kim’02]
Algorithms: Registration for transformation estimation, RAST (Recognition using Adaptive Subdivision of Transformation space), branch-and-bound algorithm.
Pros: The strongest approach in the literature.
Cons: Cannot be applied to “old” documents, as an electronic version is mandatory.

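
The global transformation estimation step can be illustrated with a small sketch. It is a simplification under stated assumptions, not RAST and not the method of the cited works: point correspondences between the ideal and the scanned image are assumed to be known already, the transformation is restricted to a similarity (scale, rotation, translation), and it is fitted by ordinary least squares with points written as complex numbers. The landmark coordinates and the function name are hypothetical.

```python
import cmath
import math

def estimate_similarity(src, dst):
    """Least-squares fit of dst ~ a * src + b, with points as complex numbers.
    abs(a) is the scale, phase(a) the rotation and b the translation."""
    n = len(src)
    mean_src = sum(src) / n
    mean_dst = sum(dst) / n
    num = sum((d - mean_dst) * (s - mean_src).conjugate()
              for s, d in zip(src, dst))
    den = sum(abs(s - mean_src) ** 2 for s in src)
    a = num / den
    b = mean_dst - a * mean_src
    return a, b

# Hypothetical landmarks in the ideal (electronic) page, in pixels.
ideal = [0 + 0j, 100 + 0j, 100 + 150j, 0 + 150j]

# Simulated scan: 1% enlargement, 2 degree skew, shift of (5, -3) pixels.
true_a = cmath.rect(1.01, math.radians(2.0))
true_b = 5 - 3j
scanned = [true_a * z + true_b for z in ideal]

a, b = estimate_similarity(ideal, scanned)
scale = abs(a)
angle_deg = math.degrees(cmath.phase(a))
```

Once `a` and `b` are known, every groundtruth coordinate of the electronic document can be mapped into the scanned image with `a * z + b`. The local bitmap matching described above would then refine this global estimate.
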
“Transcript mapping”
Principles: Transcript mapping eases the construction of document image segmentation groundtruth that includes text-image alignment. e.g. [Stamatopoulos’10], [Zinger’09], [Jawahar’07], etc.
Algorithms: HMM, DTW
Pros: When no electronic document exists, this is certainly the only valid way to obtain a groundtruth at the graphical level.
Cons: Depends on the quality of the transcriptions, producing transcriptions is time consuming, and the approach is more sensitive to segmentation errors.

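
As an illustration of the DTW alignment mentioned above, here is a minimal sketch of classic dynamic time warping between two one-dimensional feature sequences. It is not the implementation of the cited works: the feature values are hypothetical (for instance, word lengths from the transcript against width estimates of segmented image regions), and the absolute difference is assumed as the local cost.

```python
def dtw_align(a, b, cost=lambda x, y: abs(x - y)):
    """Classic DTW: return (total_cost, path), where path is a list of
    (index_in_a, index_in_b) pairs aligning the two sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(a[i - 1], b[j - 1])
            d[i][j] = c + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((d[i - 1][j - 1], i - 1, j - 1),
                      (d[i - 1][j], i - 1, j),
                      (d[i][j - 1], i, j - 1))
    path.reverse()
    return d[n][m], path

# Hypothetical features: three transcript items against four image regions.
transcript_features = [3, 5, 3]
image_features = [3.2, 4.7, 5.1, 2.9]
total_cost, path = dtw_align(transcript_features, image_features)
```

In this example the second transcript item is matched to two image regions, which is how DTW absorbs a mismatch in the number of segments. A poor segmentation changes the image features themselves, which is why the approach remains sensitive to segmentation errors.
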
“Generation of synthetic document”
Principles: The test documents are generated by an automatic system that combines pre-defined models of document components in a pseudo-random way. As the documents are synthetically generated, the groundtruth becomes automatically available. e.g. [Heroux’07], [Zi’05], etc.
Pros: No previous data is required; an efficient and exhaustive groundtruth is generated automatically.
Cons: Synthetic data is not real data, and proving the similarity between synthetic and real data is not simple.

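
The principle can be sketched in a few lines. This is a toy illustration under stated assumptions, not the generator of the cited works: the component models, their sizes, the parameter ranges and the function name are hypothetical, and no image is rendered. The sketch only shows that the groundtruth (label, localization, geometric transform) is a by-product of the generation itself.

```python
import random

# Hypothetical component models: label -> (width, height) of the template.
MODELS = {"door": (40, 10), "window": (60, 8), "table": (50, 50)}

def generate_document(n_symbols, page_size=(1000, 1000), seed=0):
    """Place n_symbols randomly chosen components on a page and return the
    groundtruth record of each placement."""
    rng = random.Random(seed)  # seeded, so a document can be regenerated
    page_w, page_h = page_size
    groundtruth = []
    for _ in range(n_symbols):
        label = rng.choice(sorted(MODELS))
        w, h = MODELS[label]
        scale = rng.uniform(0.5, 2.0)
        angle = rng.choice([0, 90, 180, 270])
        # Bounding box after rotation by a multiple of 90 degrees and scaling.
        bw, bh = (w, h) if angle in (0, 180) else (h, w)
        bw, bh = bw * scale, bh * scale
        x = rng.uniform(0, page_w - bw)
        y = rng.uniform(0, page_h - bh)
        groundtruth.append({"label": label, "x": x, "y": y,
                            "width": bw, "height": bh,
                            "angle": angle, "scale": scale})
    return groundtruth

document = generate_document(5, seed=42)
```

Each record describes exactly what was placed and where, so no manual annotation is needed. Whether such placements resemble real documents is the open question raised in the cons.
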