Metadata Extraction Progress Report 12/14/2006.

Outline
System Overview
Detailed Structure with Recent Changes
  IDM representation of documents
  validation & post-hoc classification
Status of Recent & Upcoming Deliverables
Future Directions

System Overview

Detailed Structure with Recent Changes
Input Processing
Form Processing
Post Processing
Nonform Processing

Input Processing
OCR: Omnipage update radically changed XML output
  Details later
Study of DTIC documents found none with POINT (Page Of INTerest) pages outside 1st and last 5
  suspended efforts at more sophisticated POINT page location

Form Processing
Bug fixes and tuning
Omnipage XML converted to IDM
Main form template engine rewritten to work from IDM

Independent Document Model (IDM)
Platform independent Document Model
Motivation
  Dramatic XML Schema Change between Omnipage 14 and 15
  Tie the template engine to stable specification
  Protects from linking directly to specific OCR product
  Allows us to include statistics for enhanced feature usage
    Statistics (e.g. avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount, etc.)
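As an illustration of the kind of statistics an IDM document could carry, here is a minimal Python sketch. The `Word` record, the page layout as nested lists, and the word-weighted definition of each average are assumptions made for the example; the slides name the statistics but do not define them, and avgDocWordCount is left out for that reason.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Word:
    """Hypothetical stand-in for one OCR'd word in a document model."""
    text: str
    font_size: float  # assumed to be in points


def document_statistics(pages):
    """Compute a few document-level statistics.

    pages is a list of pages, each a list of Word objects.
    Averages are word-weighted (an assumption, not the IDM definition).
    """
    all_words = [w for page in pages for w in page]
    return {
        "wordCount": len(all_words),
        "avgDocFontSize": (
            mean([w.font_size for w in all_words]) if all_words else None
        ),
        "avgPageFontSize": [
            mean([w.font_size for w in page]) if page else None
            for page in pages
        ],
    }
```

A template rule could then compare a line's font size against avgPageFontSize to spot a title, which is one plausible reading of "enhanced feature usage".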

Generating IDM
Use XSLT 2.0 stylesheets to transform
  Supporting new OCR schema only requires generation of new XSLT stylesheet; engine does not change
  Chain a series of sheets to add functionality (CleanML)
Schema Specification Available (

IDM Usage
[Diagram: conversion paths]
  OmniPage 14 XML Doc -> docTreeModelOmni14.xsl -> IDM XML Doc
  OmniPage 15 XML Doc -> docTreeModelOmni15.xsl -> IDM XML Doc
  Other OCR Output XML Doc -> docTreeModelOther.xsl -> IDM XML Doc
  IDM XML Doc -> Form Based Extraction
  IDM XML Doc -> docTreeModelCleanML.xsl -> CleanML XML Doc -> Non Form Extraction
Each incoming XML schema requires specific XSLT 2.0 Stylesheet
Resulting IDM Doc used for “Form Based” templates
IDM transformed into CleanML for “Non-form” templates
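The routing described on this slide can be sketched as a small lookup in Python. The stylesheet file names come from the slide; the format keys, the function name, and the idea of returning an ordered list of sheets are assumptions for illustration. Actually applying XSLT 2.0 stylesheets would need an XSLT 2.0 processor, which is not shown here.

```python
# Stylesheet names are from the slide; the dictionary keys are hypothetical.
TO_IDM_STYLESHEETS = {
    "omnipage14": "docTreeModelOmni14.xsl",
    "omnipage15": "docTreeModelOmni15.xsl",
    "other": "docTreeModelOther.xsl",
}
CLEANML_STYLESHEET = "docTreeModelCleanML.xsl"


def stylesheet_chain(source_format, engine):
    """Return the ordered stylesheets to apply for a given input and engine.

    engine is "form" (works from IDM) or "nonform" (works from CleanML).
    """
    if source_format not in TO_IDM_STYLESHEETS:
        raise ValueError(f"no stylesheet registered for {source_format!r}")
    chain = [TO_IDM_STYLESHEETS[source_format]]
    if engine == "nonform":
        # The non-form engine needs one extra step: IDM -> CleanML.
        chain.append(CLEANML_STYLESHEET)
    elif engine != "form":
        raise ValueError(f"unknown engine {engine!r}")
    return chain
```

Supporting a new OCR product then means adding one entry to the table, which matches the point that the engine itself does not change.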

IDM Tool Status
Converters completed to generate IDM from Omnipage 14 and 15 XML
  Omnipage 15 proved to have numerous errors in its representation of an OCR’d document
  Consequently, not recommended
Form-based extraction engine revised to work from IDM
Non-form engine still works from our older “CleanXML”
  converter from IDM to CleanXML completed as stop-gap measure
  direct use of IDM deferred pending review of other engine modifications

Post Processing
No significant changes

Nonform Processing
Bug fixes & tuning
Added new validation component
Post-hoc classification replaces former a priori classification schemes

Validation
Given a set of extracted metadata
  mark each field with a confidence value indicating how trustworthy the extracted value is
  mark the set with a composite confidence score
Fields and Sets with low confidence scores may be referred for additional processing
  automated post-processing
  human intervention and correction

Validating Extracted Metadata
Techniques must be independent of the extraction method
A validation specification is written for each collection, combining
  Field-specific validation rules
  statistical models derived for each field of
    text length
    % of words from English dictionary
    % of phrases from knowledge base prepared for that field
  pattern matching
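The kinds of field tests listed above might look roughly like this in Python. This is a sketch under stated assumptions: the function names, the linear fall-off used for the length score, and the tokenization are all invented for illustration and are not taken from the project's validator.

```python
import re


def dictionary_fraction(text, dictionary):
    """Fraction of alphabetic words in text that appear in dictionary."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)


def length_score(text, mean_len, std_len):
    """Score 1.0 at the mean length, falling linearly to 0.0 at 3 std devs.

    mean_len and std_len stand in for a per-field statistical model.
    """
    if std_len <= 0:
        return 1.0 if len(text) == mean_len else 0.0
    z = abs(len(text) - mean_len) / std_len
    return max(0.0, 1.0 - z / 3.0)


def pattern_score(text, pattern):
    """1.0 if the whole field matches the regular expression, else 0.0."""
    return 1.0 if re.fullmatch(pattern, text) else 0.0
```

None of these functions looks at how the value was extracted, which is the sense in which such tests stay independent of the extraction method.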

Sample Validation Specification
Combines results from multiple fields
<val:validate collection="dtic"
    xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">
  <val:average>
    <val:field name="UnclassifiedTitle">...</val:field>
    <val:field name="PersonalAuthor">...</val:field>
    <val:field name="CorporateAuthor">...</val:field>
    <val:field name="ReportDate">...</val:field>
  </val:average>
</val:validate>

Validation Spec: Field Tests
Each field is subjected to one or more tests
…
<val:field name="PersonalAuthor">
  <val:average>
    <val:length/>
    <val:max>
      <val:phrases length="1"/>
      <val:phrases length="2"/>
      <val:phrases length="3"/>
    </val:max>
  </val:average>
</val:field>
<val:field name="ReportDate">
  <val:reportFormat/>
  ...
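The nesting of average and max in the PersonalAuthor spec can be mirrored with composable functions. The Python sketch below is only an analogy to the Jelly tags: `average`, `maximum`, `phrases`, the toy length test, and the knowledge base contents are all hypothetical, and the real tag semantics may differ.

```python
from statistics import mean


def average(*tests):
    """Combine tests by taking the mean of their scores."""
    return lambda value: mean([t(value) for t in tests])


def maximum(*tests):
    """Combine tests by taking the best score among them."""
    return lambda value: max(t(value) for t in tests)


def phrases(length, knowledge_base):
    """Fraction of length-word phrases in the value found in knowledge_base."""
    def test(value):
        words = value.split()
        grams = [
            " ".join(words[i:i + length])
            for i in range(len(words) - length + 1)
        ]
        if not grams:
            return 0.0
        return sum(g in knowledge_base for g in grams) / len(grams)
    return test


def length_ok(value):
    """Toy stand-in for a length test: accept 5 to 60 characters."""
    return 1.0 if 5 <= len(value) <= 60 else 0.0


# Hypothetical knowledge base of known author-name phrases.
kb = {"Kathleen", "Kathleen A.", "Kerrigan"}
personal_author = average(
    length_ok,
    maximum(phrases(1, kb), phrases(2, kb), phrases(3, kb)),
)
```

Taking the max over phrase lengths lets whichever phrase length matches the knowledge base best speak for the field, while the outer average still lets the length test pull the score down.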

Sample Input Metadata Set
<metadata>
  <UnclassifiedTitle>Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
  <PersonalAuthor>Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>
  <ReportDate>Accepted this 18th day of June 2004 by:</ReportDate>
</metadata>

Sample Validator Output
<metadata confidence="0.522">
  <UnclassifiedTitle confidence="0.943">Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
  <PersonalAuthor confidence="0.622">Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>
  <ReportDate confidence="0.0" warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>
</metadata>
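The composite score in this sample is consistent with a plain average of the three field confidences that are present. The short Python check below reproduces the arithmetic; that the validator skips the absent CorporateAuthor field in general, rather than only in this example, is an assumption.

```python
from statistics import mean

# Field confidences copied from the sample validator output.
field_confidences = {
    "UnclassifiedTitle": 0.943,
    "PersonalAuthor": 0.622,
    "ReportDate": 0.0,
}

# (0.943 + 0.622 + 0.0) / 3 = 0.5216..., which rounds to 0.522.
composite = round(mean(list(field_confidences.values())), 3)
```

The failed ReportDate pattern test alone pulls the set down from what would otherwise be about 0.78, which shows how one bad field can flag a whole record for review.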

Classification (a priori)
Previously, we had attempted various schemes for a priori classification
  x-y trees
  bin classification
Still investigating some visual recognition

Post-Hoc Classification
Apply all templates to document
  results in multiple candidate sets of metadata
Score each candidate using the validator
Select the best-scoring set
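The three steps above can be sketched in a few lines of Python. Here templates are modeled as plain functions from a document to a metadata dictionary and the validator as a function returning a composite confidence; both interfaces and the function name are assumptions for illustration.

```python
def classify_post_hoc(document, templates, validate):
    """Pick the template whose extracted metadata validates best.

    templates: dict mapping a template (class) name to an extract function.
    validate: function mapping a metadata set to a confidence in [0, 1].
    Returns (best template name, its metadata, its confidence).
    """
    if not templates:
        raise ValueError("at least one template is required")
    # Step 1: apply every template, giving one candidate set per template.
    candidates = {name: extract(document) for name, extract in templates.items()}
    # Step 2: score each candidate with the validator.
    scores = {name: validate(metadata) for name, metadata in candidates.items()}
    # Step 3: select the best-scoring candidate.
    best = max(scores, key=scores.get)
    return best, candidates[best], scores[best]
```

The document's class falls out as a by-product: it is simply the name of the winning template, so no separate classifier has to run before extraction.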

Demo & Experimental Results
Results of 157 documents
Class   Hand Classification   Validation
Au      86
Eagle   47
Title   24
Total   157

Future Directions

Status of Recent & Upcoming Deliverables
DTIC - Classifier Development (9/19/06)
NASA - Enhance classification algorithm for two specific classes (10/31/2006)
NASA - Process study for inter-organizational collections, configuration software (12/1/2006)
NASA - Enhance engine to recognize two major classes (Dec 15, 2006)

Classifier Development
DTIC - Classifier Development (9/19/06)
NASA - Enhance classification algorithm for two specific classes (10/31/2006)
Delayed by difficulties with a priori classification schemes
Now replaced by post hoc validation-based classification
  some tuning of validation spec required
  cleaning of metadata sources for statistical models
Demo posted 11/15/2006

Configuration
NASA - Process study for inter-organizational collections (12/1/2006)
extraction engines differentiate by collection-dependent template sets
validation specifications take collection name as a required attribute
  used to locate distinct statistical models built for that collection
Regression test framework established
  protects against changes or tuning to one collection degrading performance on others
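One way a cross-collection regression check could work is to compare per-collection scores before and after a change. The Python sketch below assumes a single numeric quality score per collection (for example, extraction accuracy on a fixed test set); the metric, the function name, and the tolerance parameter are hypothetical and not taken from the project's framework.

```python
def regressions(baseline, current, tolerance=0.0):
    """Return the collections whose score dropped by more than tolerance.

    baseline and current map a collection name to a quality score.
    A collection missing from current is treated as scoring 0.0.
    """
    return sorted(
        name
        for name, old_score in baseline.items()
        if current.get(name, 0.0) < old_score - tolerance
    )
```

For example, after tuning templates for one collection, a non-empty result for any other collection would signal that the change degraded performance elsewhere.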

Engine Enhancements
NASA - Enhance engine to recognize two major classes (12/15/2006)
in many ways, already satisfied
most planned enhancements deferred due to work on IDM
in short term, emphasis will be on expanding the template set to exploit existing engine features and availability of new post-hoc classifier

END Questions?

Current System (Detailed)
