Metadata Extraction Progress Report 12/14/2006
Outline
  System Overview
  Detailed Structure with Recent Changes
    IDM representation of documents
    Validation & post-hoc classification
  Status of Recent & Upcoming Deliverables
  Future Directions
System Overview
Detailed Structure with Recent Changes
  Input Processing
  Form Processing
  Post Processing
  Nonform Processing
Input Processing
  OCR: Omnipage update radically changed the XML output
    Details later
  Study of 10188 DTIC documents found none with POINT (Page Of INTerest) pages outside the first and last 5 pages
    Suspended efforts at more sophisticated POINT page location
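The study's finding suggests a very simple candidate-page rule. The following is a minimal sketch, assuming the slide means the first five and the last five pages and that pages are numbered from 1. The function name `candidate_point_pages` and the `window` parameter are hypothetical and not part of the system described here.

```python
def candidate_point_pages(page_count, window=5):
    """Return 1-based page numbers within the first and last `window` pages."""
    if page_count <= 0:
        return []
    # Pages at the front of the document.
    head = range(1, min(window, page_count) + 1)
    # Pages at the back; clamp so short documents do not go below page 1.
    tail = range(max(page_count - window + 1, 1), page_count + 1)
    # Merge and de-duplicate (short documents overlap head and tail).
    return sorted(set(head) | set(tail))


pages_for_12_page_doc = candidate_point_pages(12)
```

For documents of ten pages or fewer, this rule simply returns every page.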
Form Processing
  Bug fixes and tuning
  Omnipage XML converted to IDM
  Main form template engine rewritten to work from IDM
Independent Document Model (IDM)
  Platform-independent document model
  Motivation
    Dramatic XML schema change between Omnipage 14 and 15
    Ties the template engine to a stable specification
    Protects from linking directly to a specific OCR product
    Allows us to include statistics for enhanced feature usage
      Statistics (e.g., avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount)
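To illustrate the kind of statistics mentioned above, here is a hedged sketch that computes wordCount, avgDocFontSize, and avgPageFontSize from a small XML fragment. The element and attribute names (`document`, `page`, `word`, `fontSize`) are assumptions made for this example; the actual names are defined in the IDM specification and may differ. Averaging per word is also an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical IDM-like fragment; real IDM element names may differ.
SAMPLE_IDM = """
<document>
  <page>
    <word fontSize="12">Metadata</word>
    <word fontSize="12">Extraction</word>
  </page>
  <page>
    <word fontSize="10">Progress</word>
    <word fontSize="14">Report</word>
  </page>
</document>
""".strip()


def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def document_statistics(xml_text):
    """Compute the word count and average font sizes for one document."""
    root = ET.fromstring(xml_text)
    # One list of font sizes per page, one entry per word.
    page_sizes = [
        [float(word.get("fontSize")) for word in page.iter("word")]
        for page in root.iter("page")
    ]
    all_sizes = [size for sizes in page_sizes for size in sizes]
    return {
        "wordCount": len(all_sizes),
        "avgDocFontSize": mean(all_sizes),
        "avgPageFontSize": [mean(sizes) for sizes in page_sizes],
    }


stats = document_statistics(SAMPLE_IDM)
```

A template rule could then compare a line's font size against `avgDocFontSize`, for example to spot title text.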
Generating IDM
  Use XSLT 2.0 stylesheets to transform OCR output XML into IDM
    Supporting a new OCR schema only requires a new XSLT stylesheet; the engine does not change
    Chain a series of stylesheets to add functionality (CleanML)
  Schema specification available (http://dtic.cs.odu.edu/devzone/IDM_Specification.doc)
IDM Usage
  [Diagram: OmniPage 14 XML Doc -> docTreeModelOmni14.xsl -> IDM XML Doc; OmniPage 15 XML Doc -> docTreeModelOmni15.xsl -> IDM XML Doc; Other OCR Output XML Doc -> docTreeModelOther.xsl -> IDM XML Doc; IDM XML Doc -> Form Based Extraction; IDM XML Doc -> docTreeModelCleanML.xsl -> CleanML XML Doc -> Non Form Extraction]
  Each incoming XML schema requires a specific XSLT 2.0 stylesheet
  Resulting IDM doc used for “Form Based” templates
  IDM transformed into CleanML for “Non-form” templates
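The routing on this slide can be sketched as a small lookup that returns the stylesheets to apply, in order, for a given input schema and target engine. The stylesheet file names are taken from the slide. The schema keys (`omnipage14`, `omnipage15`, `other`), the engine labels, and the function name are hypothetical, and the sketch assumes docTreeModelCleanML.xsl is the IDM-to-CleanML step. It only selects stylesheets and does not run XSLT.

```python
# Stylesheet names come from the slide; dictionary keys are hypothetical labels.
TO_IDM = {
    "omnipage14": "docTreeModelOmni14.xsl",
    "omnipage15": "docTreeModelOmni15.xsl",
    "other": "docTreeModelOther.xsl",
}
# Assumed to be the stylesheet that turns IDM into CleanML.
IDM_TO_CLEANML = "docTreeModelCleanML.xsl"


def stylesheet_chain(source_schema, engine):
    """List the stylesheets to apply, in order, for one document."""
    if source_schema not in TO_IDM:
        raise ValueError(f"no IDM stylesheet for schema {source_schema!r}")
    # Every input is first converted to IDM.
    chain = [TO_IDM[source_schema]]
    if engine == "nonform":
        # Non-form templates work from CleanML, so add one more step.
        chain.append(IDM_TO_CLEANML)
    elif engine != "form":
        raise ValueError(f"unknown engine {engine!r}")
    return chain


form_chain = stylesheet_chain("omnipage14", "form")
nonform_chain = stylesheet_chain("omnipage15", "nonform")
```

Supporting another OCR product would then mean adding one entry to `TO_IDM`, which mirrors the point that the engine itself does not change.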
IDM Tool Status
  Converters completed to generate IDM from Omnipage 14 and 15 XML
    Omnipage 15 proved to have numerous errors in its representation of an OCR’d document
    Consequently, not recommended
  Form-based extraction engine revised to work from IDM
  Non-form engine still works from our older “CleanXML”
    Converter from IDM to CleanXML completed as a stop-gap measure
    Direct use of IDM deferred pending review of other engine modifications
Post Processing
  No significant changes
Nonform Processing
  Bug fixes & tuning
  Added new validation component
  Post-hoc classification replaces former a priori classification schemes
Validation
  Given a set of extracted metadata
    mark each field with a confidence value indicating how trustworthy the extracted value is
    mark the set with a composite confidence score
  Fields and sets with low confidence scores may be referred for additional processing
    automated post-processing
    human intervention and correction
Validating Extracted Metadata
  Techniques must be independent of the extraction method
  A validation specification is written for each collection, combining
    Field-specific validation rules
      statistical models derived for each field of
        text length
        % of words from English dictionary
        % of phrases from knowledge base prepared for that field
      pattern matching
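Two of the field-level measures listed above, text length and the percentage of dictionary words, can be sketched as follows. This is an illustration under stated assumptions, not the project's validator: the Gaussian-style length score, the tokenization, and the names `length_score` and `dictionary_score` are all hypothetical.

```python
import math


def length_score(text, mean_len, std_len):
    """Score in (0, 1]: 1.0 at the expected length, falling off with distance.

    Assumes field lengths are roughly normal with the given mean and
    standard deviation (an assumption of this sketch).
    """
    if std_len <= 0:
        raise ValueError("std_len must be positive")
    z = (len(text) - mean_len) / std_len
    return math.exp(-0.5 * z * z)


def dictionary_score(text, dictionary):
    """Fraction of words in `text` that appear in `dictionary`."""
    # Strip common punctuation and compare case-insensitively.
    words = [w.strip(".,:;()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(1 for w in words if w in dictionary) / len(words)


toy_dictionary = {"thesis", "title", "the"}
title_score = dictionary_score("Thesis Title: The Military", toy_dictionary)
```

Because both functions look only at the extracted string, they are independent of the extraction method, as the first bullet requires.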
Sample Validation Specification
  Combines results from multiple fields

<val:validate collection="dtic"
    xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">
  <val:average>
    <val:field name="UnclassifiedTitle">...</val:field>
    <val:field name="PersonalAuthor">...</val:field>
    <val:field name="CorporateAuthor">...</val:field>
    <val:field name="ReportDate">...</val:field>
  </val:average>
</val:validate>
Validation Spec: Field Tests
  Each field is subjected to one or more tests

…
<val:field name="PersonalAuthor">
  <val:average>
    <val:length/>
    <val:max>
      <val:phrases length="1"/>
      <val:phrases length="2"/>
      <val:phrases length="3"/>
    </val:max>
  </val:average>
</val:field>
<val:field name="ReportDate">
  <val:reportFormat/>
  ...
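The nesting in the PersonalAuthor rule can be read as an expression tree. The sketch below evaluates such a tree, assuming `val:average` is the arithmetic mean of its children and `val:max` is their maximum. The tuple representation, the test names, and the scores are made up for illustration; the real tag library's behavior may differ.

```python
def evaluate(node, test_scores):
    """Evaluate a nested (kind, payload) rule against individual test scores."""
    kind, payload = node
    if kind == "test":
        # Leaf: look up the score produced by a single test.
        return test_scores[payload]
    children = [evaluate(child, test_scores) for child in payload]
    if kind == "average":
        return sum(children) / len(children)
    if kind == "max":
        return max(children)
    raise ValueError(f"unknown combinator {kind!r}")


# Mirrors the PersonalAuthor rule: average(length, max(phrases 1, 2, 3)).
PERSONAL_AUTHOR_RULE = (
    "average",
    [
        ("test", "length"),
        ("max", [("test", "phrases1"), ("test", "phrases2"), ("test", "phrases3")]),
    ],
)

# Illustrative scores only, not measured values.
scores = {"length": 0.8, "phrases1": 0.2, "phrases2": 0.5, "phrases3": 0.4}
author_confidence = evaluate(PERSONAL_AUTHOR_RULE, scores)
```

Taking the maximum over phrase lengths lets a field score well if phrases of any one length match the knowledge base, while the average still penalizes an implausible text length.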
Sample Input Metadata Set

<metadata>
  <UnclassifiedTitle>Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
  <PersonalAuthor>Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>
  <ReportDate>Accepted this 18th day of June 2004 by:</ReportDate>
</metadata>
Sample Validator Output

<metadata confidence="0.522">
  <UnclassifiedTitle confidence="0.943">Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
  <PersonalAuthor confidence="0.622">Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>
  <ReportDate confidence="0.0" warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>
</metadata>
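The composite score in this sample is consistent with a plain average of the three field confidences that are present: (0.943 + 0.622 + 0.0) / 3 ≈ 0.522. The sketch below recomputes it from the sample output. It assumes the composite is the mean over the fields actually present in the set, which matches this sample but is not confirmed as the general rule.

```python
import xml.etree.ElementTree as ET

# The validator output from the slide, reproduced as-is.
VALIDATOR_OUTPUT = """
<metadata confidence="0.522">
  <UnclassifiedTitle confidence="0.943">Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
  <PersonalAuthor confidence="0.622">Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>
  <ReportDate confidence="0.0" warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>
</metadata>
""".strip()


def recompute_composite(xml_text):
    """Average the per-field confidences, rounded to three decimals."""
    root = ET.fromstring(xml_text)
    field_scores = [float(field.get("confidence")) for field in root]
    return round(sum(field_scores) / len(field_scores), 3)


reported = float(ET.fromstring(VALIDATOR_OUTPUT).get("confidence"))
recomputed = recompute_composite(VALIDATOR_OUTPUT)
```

Here `recomputed` equals the reported 0.522, so the failed ReportDate pattern pulls the whole set down even though the title scores well.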
Classification (a priori)
  Previously, we had attempted various schemes for a priori classification
    x-y trees
    bin classification
  Still investigating some visual recognition
Post-Hoc Classification
  Apply all templates to the document
    results in multiple candidate sets of metadata
  Score each candidate using the validator
  Select the best-scoring set
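The three steps above can be sketched as a short selection loop. The templates and the validator here are toy stand-ins: `thesis_template`, `report_template`, and `toy_validate` are hypothetical names, and the toy validator only measures the fraction of non-empty fields rather than applying a real validation specification.

```python
def classify_post_hoc(document, templates, validate):
    """Apply every template, score each result, and return the best one.

    Returns a (score, template_name, metadata) tuple.
    """
    if not templates:
        raise ValueError("at least one template is required")
    best = None
    for name, extract in templates.items():
        metadata = extract(document)   # candidate metadata set
        score = validate(metadata)     # composite confidence
        if best is None or score > best[0]:
            best = (score, name, metadata)
    return best


def thesis_template(document):
    """Toy template: reads the title from a 'title_line' entry."""
    return {"UnclassifiedTitle": document.get("title_line", "")}


def report_template(document):
    """Toy template: reads the title from a 'header_line' entry."""
    return {"UnclassifiedTitle": document.get("header_line", "")}


def toy_validate(metadata):
    """Toy validator: fraction of fields that are non-empty."""
    values = list(metadata.values())
    return sum(1 for v in values if v) / len(values) if values else 0.0


document = {"title_line": "The Military Extraterritorial Jurisdiction Act"}
score, template_name, metadata = classify_post_hoc(
    document,
    {"thesis": thesis_template, "report": report_template},
    toy_validate,
)
```

The document's class is then simply the name of the winning template, so no separate classifier has to run before extraction.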
Demo & Experimental Results
  Results of 157 documents
  http://128.82.7.147:8080/dtic/validsum157.jsp

  Class   Hand Classification   Validation
  Au      86
  Eagle   47
  Title   24
  Total   157
Future Directions
Status of Recent & Upcoming Deliverables
  DTIC - Classifier Development (9/19/06)
  NASA - Enhance classification algorithm for two specific classes (10/31/2006)
  NASA - Process study for inter-organizational collections (configuration software) (12/1/2006)
  NASA - Enhance engine to recognize two major classes (12/15/2006)
Classifier Development
  DTIC - Classifier Development (9/19/06)
  NASA - Enhance classification algorithm for two specific classes (10/31/2006)
    Delayed by difficulties with a priori classification schemes
    Now replaced by post hoc validation-based classification
      some tuning of validation spec required
      cleaning of metadata sources for statistical models
    Demo posted 11/15/2006
Configuration
  NASA - Process study for inter-organizational collections (12/1/2006)
    extraction engines differentiate by collection-dependent template sets
    validation specifications take collection name as a required attribute
      used to locate distinct statistical models built for that collection
    Regression test framework established
      protects against changes or tuning to one collection degrading performance on others
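The purpose of the regression framework can be illustrated with a small check that compares per-collection scores before and after a change. This is a sketch of the idea only, not the framework itself: the function name, the `tolerance` parameter, the `nasa` collection key, and all scores are illustrative assumptions rather than measured results.

```python
def find_regressions(baseline, current, tolerance=0.0):
    """Return collections whose score dropped by more than `tolerance`.

    A collection missing from `current` is also reported.
    """
    regressions = []
    for collection, old_score in sorted(baseline.items()):
        new_score = current.get(collection)
        if new_score is None or new_score < old_score - tolerance:
            regressions.append(collection)
    return regressions


# Illustrative numbers: tuning helped one collection but hurt the other.
baseline = {"dtic": 0.90, "nasa": 0.85}
after_tuning = {"dtic": 0.93, "nasa": 0.80}
regressed = find_regressions(baseline, after_tuning, tolerance=0.01)
```

Running such a check after every template or validation-spec change flags the case where tuning for one collection degrades another.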
Engine Enhancements
  NASA - Enhance engine to recognize two major classes (12/15/2006)
    in many ways, already satisfied
    most planned enhancements deferred due to work on IDM
    in short term, emphasis will be on expanding the template set to exploit existing engine features and availability of new post-hoc classifier
END Questions?
Current System (Detailed)