Automated Form Processing for DTIC Documents. March 20, 2006. Presented by K. Maly, M. Zubair, S. Zeil.



Outline
- Overall process for handling documents in batches
- Issues
- Results
- Conclusion

Overall process for handling documents in batches

Figure: Flowchart of processing one document.

Steps shown in the flowchart:
1. Start; read the next page.
2. Match the page against all form templates.
3. If the number of matched templates is greater than 0, add the page with its templates to the candidate set.
4. Repeat until no pages are left.
5. If there are candidates, get the best one, extract the metadata, and store the "resolved" results. If there are none, move the document to the "unresolved" folder.
6. End.

Notes on the flowchart:
1. The input is an OmniPage XML document having 10 pages (first 5 and last 5).
2. Possibly, more than one page will match more than one template. At this time, we do not check how well they matched.
3. The best candidate is determined by the ratio of the number of fields matched over the total number of fields.
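The flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the project: the `Template` class, the rule that a template "matches" when its caption occurs on the page, and all names are hypothetical. Only the selection rule (fields matched over total fields) comes from the notes on the flowchart.

```python
from dataclasses import dataclass


@dataclass
class Template:
    # Hypothetical template: a caption plus the field names it expects.
    name: str
    caption: str
    field_names: list


def match_ratio(page_text, template):
    """Number of template fields found on the page over the total number of fields."""
    found = sum(1 for field in template.field_names if field in page_text)
    return found / len(template.field_names)


def process_document(pages, templates):
    """Return (page_index, template) for the best candidate, or None if unresolved."""
    candidates = []
    for index, page_text in enumerate(pages):
        for template in templates:
            # Assumed matching rule: the template caption occurs on the page.
            if template.caption in page_text:
                candidates.append((index, template))
    if not candidates:
        return None  # the caller would move the document to the "unresolved" folder
    # Pick the candidate with the highest ratio of matched fields.
    return max(candidates, key=lambda c: match_ratio(pages[c[0]], c[1]))
```

A caller would extract metadata with the returned template and store the result as "resolved", or move the document to the "unresolved" folder when `None` comes back.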

Issues in form-based metadata extraction

Issue | Solution
1) Illegal characters in the OmniPage XML document cause fatal parser exceptions. | Before processing, the XML pages are cleaned to remove these illegal characters.
2) Forms may miss some obvious fields. | Not yet addressed.
3) Incorrect form captions due to OCR errors. | Resolved using an edit distance algorithm.
4) Forms miss captions. | These forms are detected using the metadata key field names.
5) Forms may span multiple pages. | The code processes the pages subsequent to the form page to find out whether the form spans multiple pages.
6) Word length boundary detection. | Solved to a certain limit by using different string matching algorithms.
7) Metadata field names have different variations. | Match the field name part by part.
8) Coverage type is missing. | Not yet addressed.
9) Document is wrongly identified. | Not yet addressed.
10) OCR error. | Not yet addressed.

Results: results of 264 documents; results of 100 documents.
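The cleaning step for issue 1 could look like the sketch below. It assumes that "illegal characters" means characters outside the range allowed by XML 1.0; the slides do not say which characters were removed, and `clean_xml_text` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 character range (tab, LF, CR, and the
# printable Unicode ranges) is treated as illegal here.
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_xml_text(raw):
    """Remove characters that an XML 1.0 parser rejects."""
    return _ILLEGAL_XML_CHARS.sub("", raw)


# Usage: a form feed and a NUL byte are stripped before parsing.
raw_page = "<page>REPORT\x0c DOCUMENTATION\x00</page>"
root = ET.fromstring(clean_xml_text(raw_page))
```

This only removes raw characters; numeric character references to illegal code points would need separate handling.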

Forms are missing some obvious fields. For example, in the following document, the POINT page (first page) has the author, but the form doesn't.

In the following form, the caption “REPORT DOCUMENTATION PAGE” is OCRed incorrectly as “REPORT DOCUMENTA110N PAGE”. These types of OCR errors are resolved using edit distance.
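As an illustration of caption matching with edit distance, here is a minimal sketch. It uses the standard Levenshtein distance; the threshold `max_errors=3` and the function names are assumptions, since the slides do not state which tolerance was used.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, computed with a rolling row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def caption_matches(ocr_text, caption, max_errors=3):
    """Accept the OCR text as the caption if it is within max_errors edits (assumed threshold)."""
    return edit_distance(ocr_text.strip().upper(), caption.upper()) <= max_errors
```

For the example above, “TIO” was read as “110”, which is three substitutions, so the OCRed caption is accepted with this threshold.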

The following form has no caption. If the caption of a form page is missing, we recognize the page as a form if more than 10 metadata field names have been found.
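A rough sketch of this rule follows. The whitespace-insensitive substring test and the names `normalize` and `looks_like_form` are assumptions for illustration; only the "more than 10 field names" threshold comes from the slide.

```python
def normalize(text):
    """Drop all whitespace and upper-case, so OCR spacing does not matter."""
    return "".join(text.split()).upper()


def looks_like_form(page_text, field_names, min_fields=10):
    """Treat a caption-less page as a form if more than min_fields field names occur on it."""
    page = normalize(page_text)
    found = sum(1 for name in field_names if normalize(name) in page)
    return found > min_fields
```

In practice the list of field names would come from the form templates.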

The following form spans two pages. After finding a form page, we check the following pages using field name matching to see whether they are part of the form.
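The continuation check can be sketched as follows. The stopping rule (a following page belongs to the form while it contains at least `min_fields` known field names) is an assumption, as are the function names; the slide only says that field name matching is used.

```python
def count_field_names(page_text, field_names):
    """Count the known field names that occur on the page, ignoring whitespace and case."""
    page = "".join(page_text.split()).upper()
    return sum(1 for name in field_names if "".join(name.split()).upper() in page)


def form_page_span(pages, form_index, field_names, min_fields=1):
    """Return the indexes of the form page and the following pages that continue it."""
    end = form_index
    # Extend the span while the next page still carries form field names.
    while end + 1 < len(pages) and count_field_names(pages[end + 1], field_names) >= min_fields:
        end += 1
    return list(range(form_index, end + 1))
```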

In the following form we have word boundary detection errors for metadata field names. For example, “4. TITLE AND SUBTITLE” appears as “4. T ITL E A ND SUB TI TL E”. We use the following sequence for matching field names: exact match, match after removing white space, similar match (using edit distance).
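The matching sequence can be sketched like this. The order of the three stages comes from the slide; applying the edit distance to the whitespace-free strings and the threshold `max_errors=2` are assumptions, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def match_field_name(ocr_text, field_name, max_errors=2):
    """Try exact, whitespace-free, then similar matching; return the stage that succeeded."""
    if ocr_text == field_name:
        return "exact"
    squeezed_ocr = "".join(ocr_text.split())
    squeezed_name = "".join(field_name.split())
    if squeezed_ocr == squeezed_name:
        return "no-whitespace"
    if edit_distance(squeezed_ocr, squeezed_name) <= max_errors:
        return "similar"
    return None
```

With this sketch, “4. T ITL E A ND SUB TI TL E” is caught at the second stage, before any edit distance is computed.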

Following are parts of two forms, where we can see the variations for the field “17. LIMITATION OF ABSTRACT”. Here we recognize the field name by matching it part by part. If the cell boundary information is available (i.e. "17. LIMITATION OF ABSTRACT" is in one cell), we also rebuild the field name by connecting the texts in the cell (e.g. "17.", "LIMITATION", "OF", "ABSTRACT" ===> "17. LIMITATION OF ABSTRACT") and match it against the defined field name directly. It's worth noting that not all form pages have cell boundary information.
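Both strategies can be sketched briefly. This assumes that "part by part" means the words of the field name must appear in order among the OCR tokens, possibly with other tokens in between; the slides do not spell this out, and the function names are hypothetical.

```python
def rebuild_cell_text(tokens):
    """Connect the texts found in one cell into a single field-name string."""
    return " ".join(token.strip() for token in tokens if token.strip())


def matches_part_by_part(tokens, field_name):
    """True if every part of field_name occurs, in order, among the tokens."""
    remaining = iter(token.strip().upper() for token in tokens)
    # "part in remaining" consumes the iterator up to the match, which enforces order.
    return all(part.upper() in remaining for part in field_name.split())
```

When cell boundaries are known, `rebuild_cell_text` gives a string that can be compared with the defined field name directly; otherwise `matches_part_by_part` works on the raw token stream.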

Coverage type missing in the original document. The title of the third field is missing in the PDF document; it should contain “REPORT TYPE AND DATES COVERED”.

Identified as sf298_1. The current templates identified this form but failed to extract the metadata because this was a new kind of form. We can handle this case by writing a new template.

OCR error in the Dates Covered field. OCR has produced garbage for the third field (From - To) in the Dates Covered field.


We currently handle six types of forms (through templates): five are variations of the sf298 form (Report Documentation Page) and one is another type of form. For any new form, a template can be written to handle it. The following recall and precision results are based on 264 documents.

Results of 264 Documents

class | # documents | Recall | Precision
sf298_1 | 92 | 95% | 97%
sf298_2 | 137 | 92% | 98%
sf298_3 | 2 | 93% | 100%
sf298_4 | 24 | 100%
citation_1 | 9 | 100%

Results of 100 Documents

class | # documents | Recall | Precision
sf298_1 | 30 | 91% | 95%
sf298_2 | 30 | 98% | 99%
sf298_3 | 10 | 68% (problem with sf298_3) | 96%
sf298_4 | 10 | 100%
Control | 10 | 96% | 100%
citation_1 | 10 | 100%

Conclusion
- Execution time: the code took 21 hours, 58 minutes to process our testbed of 10K PDF documents (roughly 8 seconds per document).
- For the 10K documents, we are getting good results for most of the form classes and relatively poor performance for sf298_3 due to OCR errors.