Automated Form Processing for DTIC Documents. March 20, 2006. Presented by K. Maly, M. Zubair, S. Zeil
Outline: Overall process for handling documents in batches; Issues; Results; Conclusion
Overall process for handling documents in batches
Figure: Flowchart of Processing One Document
1. Start.
2. Read the next page (note 1). If no pages are left, go to step 5.
3. Match the page against all form templates.
4. If the number of matched templates is > 0, add the page with its templates into the candidate set (note 2). In either case, return to step 2.
5. Have candidates? If no, move the document to the “unresolved” folder and end.
6. If yes, get the best one (note 3), extract metadata, store the “resolved” results, and end.
Notes:
1. The input is an OmniPage XML document having 10 pages (first 5 and last 5).
2. Possibly, more than one page will match more than one template. At this point, we do not check how well they matched.
3. Determined by the ratio of the number of fields matched over the total number of fields.
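The flowchart can be sketched in Python roughly as follows. This is a minimal illustration and not the project's code: the `Template` class, `match_ratio`, and `process_document` are hypothetical names, pages are assumed to be plain strings, and a template is assumed to "match" a page when at least one of its field names occurs in the page text.

```python
from dataclasses import dataclass


@dataclass
class Template:
    """Hypothetical form template: a name plus the field names it expects."""
    name: str
    fields: list


def match_ratio(page_text, template):
    """Ratio of the template's fields found on the page over its total fields."""
    if not template.fields:
        return 0.0
    found = sum(1 for field in template.fields if field in page_text)
    return found / len(template.fields)


def process_document(pages, templates):
    """Return ("resolved", page_index, template_name) or ("unresolved", None, None)."""
    candidates = []
    for index, page_text in enumerate(pages):
        for template in templates:
            ratio = match_ratio(page_text, template)
            if ratio > 0:  # assumption: any matched field counts as a template match
                candidates.append((ratio, index, template.name))
    if not candidates:
        return ("unresolved", None, None)
    # Best candidate: highest ratio; ties go to the earliest page.
    ratio, index, name = max(candidates, key=lambda c: (c[0], -c[1]))
    return ("resolved", index, name)
```

Metadata extraction would then run only on the selected page with the selected template; documents without any candidate are set aside for manual review.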
Issues in form-based metadata extraction (each issue is followed by its solution)
1) Illegal characters in the OmniPage XML document cause fatal parser exceptions. Solution: before processing, the XML pages are cleaned to remove these illegal characters.
2) Forms may miss some obvious fields. Solution: not yet addressed.
3) Incorrect form captions due to OCR errors. Solution: resolved using an edit distance algorithm.
4) Forms miss captions. Solution: these forms are detected using the metadata key field names.
5) Forms may span multiple pages. Solution: the code processes the pages subsequent to the form page to find out whether the form spans multiple pages.
6) Word length boundary detection. Solution: solved to a certain limit by using different string matching algorithms.
7) Metadata field names have different variations. Solution: match the field name part by part.
8) Coverage type is missing. Solution: not yet addressed.
9) Document is wrongly identified. Solution: not yet addressed.
10) OCR error. Solution: not yet addressed.
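The cleaning step for issue 1 could look like the sketch below. It assumes that "illegal characters" means raw characters outside the XML 1.0 character range (for example NUL or vertical tab produced by OCR); the function names are hypothetical, and illegal numeric character references such as `&#x1;` are not handled here.

```python
import re
import xml.etree.ElementTree as ET

# Everything NOT allowed by the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_xml_text(xml_text):
    """Remove characters that would make an XML 1.0 parser fail."""
    return _ILLEGAL_XML_CHARS.sub("", xml_text)


def parse_cleaned(xml_text):
    """Clean the text first, then parse it into an element tree."""
    return ET.fromstring(clean_xml_text(xml_text))
```

Removing the characters (instead of replacing them with a space) follows the wording of the slide; either choice lets the parser continue.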
Forms are missing some obvious fields: for example, in the following document, the POINT page (first page) has the author, but the form does not.
In the following form, the caption “REPORT DOCUMENTATION PAGE” is OCRed incorrectly as “REPORT DOCUMENTA110N PAGE”. This type of OCR error is resolved using edit distance.
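A minimal sketch of caption matching with edit distance (Levenshtein distance) is shown below. The function names and the threshold `max_distance=3` are assumptions for illustration; the slides do not state which threshold is used.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_caption(ocr_text, caption="REPORT DOCUMENTATION PAGE", max_distance=3):
    """True if the OCRed text is within max_distance edits of the caption."""
    return edit_distance(ocr_text.strip().upper(), caption) <= max_distance
```

In the example above, "DOCUMENTA110N" differs from "DOCUMENTATION" by three substitutions (T, I, O read as 1, 1, 0), so the caption is still accepted under this threshold.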
The following form has no caption. If the caption of a form page is missing, we recognize the page as a form if more than 10 metadata field names have been found.
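The rule can be sketched as a simple count. The field-name list below is only an illustrative subset chosen for this example, and `looks_like_form` is a hypothetical name; plain substring search stands in for the real field name matching.

```python
# Illustrative field names only; the actual templates define their own lists.
FIELD_NAMES = [
    "REPORT DATE",
    "TITLE AND SUBTITLE",
    "AUTHOR(S)",
    "FUNDING NUMBERS",
    "SUPPLEMENTARY NOTES",
    "SUBJECT TERMS",
    "NUMBER OF PAGES",
    "PRICE CODE",
    "LIMITATION OF ABSTRACT",
    "SPONSORING/MONITORING AGENCY REPORT NUMBER",
    "PERFORMING ORGANIZATION REPORT NUMBER",
    "DISTRIBUTION/AVAILABILITY STATEMENT",
]


def looks_like_form(page_text, field_names=FIELD_NAMES, min_fields=10):
    """True if MORE than min_fields known field names occur on the page."""
    text = page_text.upper()
    found = sum(1 for name in field_names if name in text)
    return found > min_fields
```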
The following form spans two pages. After finding a form page, we check the subsequent pages using field name matching to see whether they are part of the form.
In the following form we have word boundary detection errors for metadata field names. For example, “4. TITLE AND SUBTITLE” appears as “4. T ITL E A ND SUB TI TL E”. We use the following sequence for matching field names: exact match, match after removing white spaces, and similar match (using edit distance).
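The three-stage sequence could be sketched as below. The function names, the threshold `max_distance=2`, and the choice to run the edit distance on the whitespace-free strings are assumptions made for this illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def match_field_name(ocr_text, field_name, max_distance=2):
    """Return the stage that matched: "exact", "no_whitespace", "similar", or None."""
    if ocr_text == field_name:
        return "exact"
    # Stage 2: drop all whitespace so wrongly split words compare equal.
    squeezed_text = "".join(ocr_text.split())
    squeezed_name = "".join(field_name.split())
    if squeezed_text == squeezed_name:
        return "no_whitespace"
    # Stage 3: tolerate a few OCR character errors.
    if edit_distance(squeezed_text, squeezed_name) <= max_distance:
        return "similar"
    return None
```

Trying the cheap comparisons first means the edit distance, which is the most expensive stage, only runs for text that the first two stages could not match.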
The following are parts of two forms, where we can see the variations for the field “17. LIMITATION OF ABSTRACT”. Here we recognize the field name by matching it part by part. If the cell boundary information is available (i.e. "17. LIMITATION OF ABSTRACT" is in one cell), we also rebuild the field name by connecting the texts in the cell (e.g. "17.", "LIMITATION", "OF", "ABSTRACT" ===> "17. LIMITATION OF ABSTRACT") and match it against the defined field name directly. It is worth noting that not all form pages have cell boundary information.
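Both strategies can be sketched as follows, assuming the OCR output is available as a list of text tokens in reading order and that the variations are layout differences (the parts of a field name arriving as separate text runs). `rebuild_from_cell` and `match_parts` are hypothetical names.

```python
def rebuild_from_cell(cell_tokens):
    """Join the text pieces found in one cell into a single field name."""
    return " ".join(token.strip() for token in cell_tokens if token.strip())


def match_parts(tokens, field_name):
    """Return the index where the field name's parts start in tokens, or -1.

    The field name is split on whitespace and its parts must appear as
    consecutive tokens, regardless of how the page layout grouped them.
    """
    parts = field_name.split()
    for start in range(len(tokens) - len(parts) + 1):
        if tokens[start:start + len(parts)] == parts:
            return start
    return -1
```

When cell boundaries are known, the rebuilt string can be compared directly with the defined field name; otherwise the part-by-part search is the fallback.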
Coverage type missing in the original document: the title of the third field is missing in the PDF document; it should contain “REPORT TYPE AND DATES COVERED”.
Identified as sf298_1: the current templates identified this form but failed to extract the metadata because it is a new kind of form. We can handle this case by writing a new template.
OCR error in the Dates Covered field: OCR has produced garbage for the third field (From - To) in the Dates Covered field.
Results of 264 Documents
We currently handle six types of forms through templates: five are variations of the sf298 form (Report Documentation Page) and one is another type of form. Templates can be written to handle any new forms. The following recall and precision results are based on 264 documents.
class | # documents | Recall | Precision
sf298_1 | 92 | 95% | 97%
sf298_2 | 137 | 92% | 98%
sf298_3 | 2 | 93% | 100%
sf298_4 | 24 | 100%
citation_1 | 9 | 100%
Results of 100 Documents
class | # documents | Recall | Precision
sf298_1 | 30 | 91% | 95%
sf298_2 | 30 | 98% | 99%
sf298_3 | 10 | 68% (problem with sf298_3) | 96%
sf298_4 | 10 | 100%
Control | 10 | 96% | 100%
citation_1 | 10 | 100%
Conclusion
Execution time: the code took 21 hours and 58 minutes to process our testbed of 10K PDF documents, which is roughly 8 seconds per document on average. For these 10K documents we get good results for most of the form classes and relatively poor performance for sf298_3 due to OCR errors.