A Scriptable, Statistical Oracle for a Metadata Extraction System
Kurt J. Maly, Steven J. Zeil, Mohammad Zubair, Ashraf Amrou, Ali Aazhar, Naveen Ratkal
STEV 2007, Portland OR, Oct. 12, 2007
The Problem
Dynamic validation of a program that
– mimics human behavior
– is imprecisely specified
– will vary widely in behavior
  – by user/installation
  – over time
Overall Approach
Apply a wide variety of tests on selected output properties
– deterministic
– statistical
Combine tests heuristically
– combination controlled by scripts for flexibility
Outline
The Application: Metadata Extraction
Dynamic Validation of the Extractor
Evaluating the Validator
Conclusions
The Application: Metadata Extraction
Large, diverse, growing government document collections
– DTIC, NASA, GPO (EPA & Congress)
Automated system to extract metadata from documents
– Input: scanned page images or “text” PDF
– Output: XML containing metadata fields
  – e.g., titles, authors, dates of publication, abstracts, release rights
Approach
Classify documents by layout similarity
Templates contain rules for extracting metadata from a specific layout
– To keep templates simple, layout classes must be fairly detailed and specific
Process Overview
[figure]
Sample Metadata Record (including mistakes)
<metadata>
Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius
Name of Candidate: Major Matthew H. Fath
Accepted this 18th day of June 2004 by:
Approved by: Thesis Committee Chair Jack D. Kem, Ph.D.; Member Mr. Charles S. Soby, M.B.A.; Member Lieutenant Colonel John A. Suprin, M.A.
Robert F. Baumann, Ph.D.
</metadata>
Rationale for Dynamic Validation
Sources of error
– Document flaws
– OCR software failures
– Mis-classified layouts
– Template faults
– Extraction engine faults
Software replaces an expensive, human-intensive process
A moderately high (10-20%) failure rate is tolerable if we can identify which output sets are failures
– route those sets to humans for inspection and correction
Process Overview
[figure]
Dynamic Validation
Challenges:
– imprecise specification
– low-level internal state not trusted as an indicator of correct progress
– input characteristics vary from one document collection to another
– input characteristics may vary over time
Approach
Wide battery of basic tests can be applied to metadata fields
– deterministic
– statistical
Basic test results combined heuristically
– under control of a custom scripting language
Basic Tests – Deterministic
date formats
regular expressions
– structured fields, e.g., report numbers
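A deterministic basic test is a simple pass/fail check of a field value. The sketch below shows one plausible shape for such a test in Python; the "Month YYYY" date pattern is a hypothetical illustration, not the production DTIC rule.

```python
import re

# Hypothetical pattern for a ReportDate field of the form "June 2004".
# Deterministic tests like this simply pass or fail.
DATE_PATTERN = re.compile(
    r"^(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{4}$"
)

def check_report_date(value: str) -> bool:
    """Return True if the extracted ReportDate matches the expected format."""
    return DATE_PATTERN.match(value.strip()) is not None
```

For the sample record above, `check_report_date("June 2004")` passes, while the mis-extracted "Accepted this 18th day of June 2004 by:" fails, which is what triggers the ReportDate warning shown later.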
Basic Tests – Statistical
Reference models from prior metadata (human-extracted)
– 850,000 records in DTIC collection
– 20,000 records in NASA collection
Measured field lengths
Phrase dictionaries constructed for fields with specialized vocabularies
– e.g., author, organization
Statistics Collected (mean & std. dev.)
Field lengths
– title, abstract, author, ...
Dictionary detection rates for words in natural-language fields
– abstract, title, ...
Phrase recurrence rates for fields with specialized vocabularies
– author and organization
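Building a reference model from the human-extracted records amounts to computing a mean and standard deviation per statistic. A minimal sketch for field lengths, assuming records are simple field-name-to-text mappings (a hypothetical representation, not the system's actual data model):

```python
from statistics import mean, stdev

def field_length_stats(records, field):
    """Build a reference model (mean and std. dev. of length in words)
    for one metadata field from human-extracted records."""
    lengths = [len(r[field].split()) for r in records if field in r]
    return mean(lengths), stdev(lengths)
```

The same pattern applies to dictionary detection rates and phrase recurrence rates: measure the statistic per record, then summarize with mean and standard deviation.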
Field Length (in words), DTIC collection
[figure]
Dictionary Detection (% of recognized words), DTIC collection
[figure]
Phrase Dictionary Recurrence Rate, DTIC collection

Field            Phrase Length   Mean   Std. Dev.
PersonalAuthor   1                97%   11%
                 2                83%   32%
                 3                71%   45%
CorporateAuthor  1               100%   2.0%
                 2                99%   6.0%
                 3                99%   10%
                 4                99%   13%
Validation Procedure
Selected basic tests are applied to extracted metadata field values
– deterministic tests will pass or fail
– statistical tests compare to norms from the reference models
  – standard score computed
Test results for the same field are combined to form a field confidence
Field confidences are combined to form an overall confidence
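A statistical test compares the observed value of a field statistic against the reference model via a standard score, then maps that score into a confidence. The linear falloff and the cutoff of 3 standard deviations below are assumptions for illustration; the actual scaling is configured in the validation spec.

```python
def standard_score(observed, mean, std_dev):
    """z-score of an observed field statistic against the reference model."""
    return (observed - mean) / std_dev

def score_to_confidence(z, cutoff=3.0):
    """Map a standard score to a [0, 1] confidence.
    A linear falloff to zero at |z| = cutoff is one plausible scaling;
    the system's scripts make this choice configurable."""
    return max(0.0, 1.0 - abs(z) / cutoff)
```

For example, a title whose word count sits 1.5 standard deviations from the collection mean would receive a confidence of 0.5 under this scaling.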
Combining Basic Test Scores
Validation specification describes
– which tests to apply to which fields
– how to normalize/scale test scores prior to combination
– how to combine field tests into a field confidence
– how to combine field confidences into an overall confidence
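The combination rules named in the spec can be sketched as a small set of reducers; minimum, average, and summation all appear in this talk (summation is the variant used in the experiment's spec). The function names and the `mode` parameter are illustrative, not the spec's actual vocabulary.

```python
def combine_scores(scores, mode="min"):
    """Combine several normalized test scores into one value.
    Which mode applies to which field is dictated by the validation spec."""
    if mode == "min":      # conservative: a field is only as good as its worst test
        return min(scores)
    if mode == "avg":
        return sum(scores) / len(scores)
    if mode == "sum":      # used in the experiment's spec
        return sum(scores)
    raise ValueError(f"unknown combination mode: {mode}")

def overall_confidence(field_confidences, mode="avg"):
    """Fold per-field confidences into an overall record confidence."""
    return combine_scores(list(field_confidences.values()), mode)
```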
Partial Validation Spec – DTIC
[figure]
Validation Script
Specification combined with extracted data to form an executable script
– Apache Jelly project
Script executed to produce a metadata record annotated with
– confidence values for each field
– warnings/explanations for low-scoring fields
– overall confidence for the output record
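The real system generates the annotated record through an Apache Jelly script; the Python sketch below only mimics the shape of that output (field elements carrying `confidence` and optional `warning` attributes inside a `<metadata>` root). Averaging the field confidences into the record confidence is an assumption here, since the combination rule comes from the spec.

```python
import xml.etree.ElementTree as ET

def annotate(field_results):
    """Build an annotated metadata record resembling the validator's output.
    field_results maps field name -> (text, confidence, warning or None).
    The overall confidence is taken as the average field confidence,
    one of the combination rules the spec can select."""
    root = ET.Element("metadata")
    confidences = []
    for name, (text, conf, warning) in field_results.items():
        elem = ET.SubElement(root, name, confidence=f"{conf:.1f}")
        if warning:
            elem.set("warning", warning)
        elem.text = text
        confidences.append(conf)
    root.set("confidence", f"{sum(confidences) / len(confidences):.3f}")
    return ET.tostring(root, encoding="unicode")
```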
Sample Output From Validator
<metadata confidence="0.460"
          warning="ReportDate field does not match required pattern">
Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius
<PersonalAuthor confidence="0.4"
                warning="PersonalAuthor: unusual number of words">
Name of Candidate: Major Matthew H. Fath
</PersonalAuthor>
<ReportDate confidence="0.0"
            warning="ReportDate field does not match required pattern">
Accepted this 18th day of June 2004 by:
</ReportDate>
Approved by: Thesis Committee Chair Jack D. Kem, Ph.D.; Member Mr. Charles S. Soby, M.B.A.; Member Lieutenant Colonel John A. Suprin, M.A.
Robert F. Baumann, Ph.D.
</metadata>
Experimental Design
How effective is post-hoc classification?
Selected 2000 documents recently added to the DTIC collection
– visually classified by humans against the 10 most common layouts from studies of earlier documents
– discarded documents not in one of those classes; 646 documents remained
Applied all templates, validated the extracted metadata, and selected the highest confidence as the validator’s choice
Compared the validator’s preferred layout to the human choices
Exp. Design Justification
Directly models one source of error
– Document flaws
– OCR software failures
– Mis-classified layouts
– Template faults
– Extraction engine faults
Layouts involved include some that are very similar
– single-field failures typical of other error sources
Minimizes disputes among human judges
Relatively unaffected by continuing changes to the extraction software
Validation Spec for Experiment
Similar to the production spec except
– field scores combined by summation rather than by minimum or average
Simulated post-processing of extracted values
– extractor is WYSIWYG
– not always what is desired, e.g., “Major Matthew H. Fath” => “Fath, Matthew H.”
Automatic vs. Human Classifications
Post-hoc classifier agreed with humans on 91% of cases
Conclusions
Important characteristics of this approach:
Aggressively opportunistic
– lots of small, simple tests
Pragmatic
– heuristic combination of simple test results
Flexible
– scripting aids in tuning heuristics and adapting to different installations & input sets
Conclusions: Exploiting Validation Internally
Agreement rate between validator and humans far exceeds our best prior classifier algorithms
– based on geometric layout of text and graphic blocks
New classifier:
– apply all available templates to the document
– score all outputs using the validator
– choose the top-scoring output set