Oct. 12, 2007 STEV 2007, Portland OR A Scriptable, Statistical Oracle for a Metadata Extraction System Kurt J. Maly, Steven J. Zeil, Mohammad Zubair, Ashraf Amrou, Ali Aazhar, Naveen Ratkal

Presentation transcript:

A Scriptable, Statistical Oracle for a Metadata Extraction System
Kurt J. Maly, Steven J. Zeil, Mohammad Zubair, Ashraf Amrou, Ali Aazhar, Naveen Ratkal
Oct. 12, 2007, STEV 2007, Portland OR

The Problem
Dynamic validation of a program that
mimics human behavior
is imprecisely specified
will vary widely in behavior
–by user/installation
–over time

Overall Approach
Apply a wide variety of tests on selected output properties
–deterministic
–statistical
Combine tests heuristically
–combination controlled by scripts for flexibility

Outline
The Application: Metadata Extraction
Dynamic Validation of the Extractor
Evaluating the Validator
Conclusions

The Application: Metadata Extraction
Large, diverse, growing government document collections
–DTIC, NASA, GPO (EPA & Congress)
Automated system to extract metadata from documents
–Input: scanned page images or “text” PDF
–Output: XML containing metadata fields
  e.g., titles, authors, dates of publication, abstracts, release rights

Approach
Classify documents by layout similarity
Templates contain rules for extracting metadata from a specific layout
–To keep templates simple, layout classes must be fairly detailed and specific

Process Overview [diagram]

Sample Metadata Record (including mistakes)
<metadata>
Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius
Name of Candidate: Major Matthew H. Fath
Accepted this 18th day of June 2004 by:
Approved by: Thesis Committee Chair Jack D. Kem, Ph.D., Member Mr. Charles S. Soby, M.B.A., Member Lieutenant Colonel John A. Suprin, M.A.
Robert F. Baumann, Ph.D.
</metadata>

Rationale for Dynamic Validation
Sources of error
–Document flaws
–OCR software failures
–Mis-classified layouts
–Template faults
–Extraction engine faults
Software replaces expensive, human-intensive process
Moderately high (10-20%) failure rate is tolerable if we can identify which output sets are failures
–route those sets to humans for inspection and correction

Process Overview [diagram]

Dynamic Validation
Challenges:
imprecise specification
low-level internal state not trusted as indicator of correct progress
input characteristics vary from one document collection to another
input characteristics may vary over time

Approach
Wide battery of basic tests can be applied to metadata fields
–deterministic
–statistical
Basic test results combined heuristically
–under control of custom scripting language

Basic Tests – Deterministic
date formats
regular expressions
–structured fields, e.g., report numbers
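Deterministic tests like these amount to simple pass/fail predicates over a field value. A minimal sketch, assuming hypothetical date and report-number formats (the actual patterns used by the system are not given in the slides):

```python
import re

# Hypothetical patterns, for illustration only; the real system's
# accepted date formats and report-number grammar are not shown here.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}(-\d{2})?$")            # e.g., 2004-06 or 2004-06-18
REPORT_NO_PATTERN = re.compile(r"^[A-Z]{2,6}-\d{2,4}(-\d+)?$")  # e.g., ABC-2007-12

def check_date(value: str) -> bool:
    """Pass iff the value matches an accepted date format."""
    return DATE_PATTERN.match(value) is not None

def check_report_number(value: str) -> bool:
    """Pass iff the value looks like a structured report number."""
    return REPORT_NO_PATTERN.match(value) is not None
```

A field value that fails such a test (e.g., free text captured where a date was expected) contributes a failing result to that field's confidence.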

Basic Tests – Statistical
Reference models from prior metadata (human extracted)
–850,000 records in DTIC collection
–20,000 records in NASA collection
Measured field lengths
Phrase dictionaries constructed for fields with specialized vocabularies
–e.g., author, organization

Statistics Collected (mean & std dev)
Field lengths
–title, abstract, author, …
Dictionary detection rates for words in natural language fields
–abstract, title, …
Phrase recurrence rates for fields with specialized vocabularies
–author and organization
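Each statistical measure is summarized as a mean and standard deviation over the human-extracted reference records, and a new value is then scored by how far it deviates from that norm. A minimal sketch (the reference title lengths below are invented for illustration, not drawn from the DTIC data):

```python
import statistics

def fit_reference(values):
    """Estimate (mean, std dev) of a measure, e.g. title length in words,
    from prior human-extracted metadata records."""
    return statistics.mean(values), statistics.stdev(values)

def standard_score(value, mean, stdev):
    """Number of standard deviations the observed value lies from the norm."""
    return (value - mean) / stdev

# Illustrative reference sample: title lengths (in words) from old records.
title_lengths = [8, 10, 12, 9, 11, 10, 10, 12, 8, 10]
mean, stdev = fit_reference(title_lengths)
z = standard_score(25, mean, stdev)   # a suspiciously long 25-word "title"
```

A large standard score (here, many deviations above the mean) suggests the extractor captured the wrong text, e.g., a title field that swallowed part of the abstract.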

Field Length (in words), DTIC collection [chart]

Dictionary Detection (% of recognized words), DTIC collection [chart]

Phrase Dictionary Recurrence Rate, DTIC collection

Field             Phrase Length   Mean   Std. Dev.
PersonalAuthor    1               97%    11%
PersonalAuthor    2               83%    32%
PersonalAuthor    3               71%    45%
CorporateAuthor   1               100%   2.0%
CorporateAuthor   2               99%    6.0%
CorporateAuthor   3               99%    10%
CorporateAuthor   4               99%    13%

Validation Procedure
Selected basic tests are applied to extracted metadata field values
–deterministic tests will pass or fail
–statistical tests compare to norms from reference models
  standard score computed
Test results for same field are combined to form field confidence
Field confidences are combined to form overall confidence
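The combination step can be sketched as follows. The mapping from a standard score to a [0,1] confidence, the cutoff of 3 standard deviations, and the use of the minimum across fields are plausible illustrative choices; since the production combination is script-controlled, the real formulas may differ:

```python
def stat_confidence(z: float, cutoff: float = 3.0) -> float:
    """Map a standard score to [0,1]: values near the norm score near 1,
    values beyond `cutoff` standard deviations score 0."""
    return max(0.0, 1.0 - abs(z) / cutoff)

def field_confidence(deterministic_results, z_scores) -> float:
    """Combine one field's test results: any failed deterministic test
    drives the field to 0; otherwise average the statistical scores."""
    if not all(deterministic_results):
        return 0.0
    if not z_scores:
        return 1.0
    scores = [stat_confidence(z) for z in z_scores]
    return sum(scores) / len(scores)

def overall_confidence(field_confidences) -> float:
    """One badly failed field should flag the whole record, so take
    the minimum (one of the combining rules mentioned in the slides)."""
    return min(field_confidences)
```

Taking the minimum makes the oracle conservative: a single implausible field pulls the whole record's confidence down, routing it to a human reviewer.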

Combining Basic Test Scores
Validation specification describes
–which tests to apply to which fields
–how to normalize/scale test scores prior to combination
–how to combine field tests into field confidence
–how to combine field confidences into overall confidence

Partial Validation Spec – DTIC [figure]

Validation Script
Specification combined with extracted data to form an executable script
–Apache Jelly project
Script executed to produce metadata record annotated with
–confidence values for each field
–warnings/explanations for low-scoring fields
–overall confidence for output record

Sample Output From Validator
<metadata confidence="0.460" warning="ReportDate field does not match required pattern">
Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius
<PersonalAuthor confidence="0.4" warning="PersonalAuthor: unusual number of words">
Name of Candidate: Major Matthew H. Fath
</PersonalAuthor>
<ReportDate confidence="0.0" warning="ReportDate field does not match required pattern">
Accepted this 18th day of June 2004 by:
</ReportDate>
Approved by: Thesis Committee Chair Jack D. Kem, Ph.D., Member Mr. Charles S. Soby, M.B.A., Member Lieutenant Colonel John A. Suprin, M.A.
Robert F. Baumann, Ph.D.
</metadata>

Experimental Design
How effective is post-hoc classification?
Selected 2000 documents recently added to DTIC collection
–Visually classified by humans, comparing to the 10 most common layouts from studies of earlier documents
–discarded documents not in one of those classes
–646 documents remained
Applied all templates, validated extracted metadata, selected highest confidence as the validator’s choice
Compared validator’s preferred layout to human choices

Exp. Design Justification
Directly models one source of error
–Document flaws
–OCR software failures
–Mis-classified layouts
–Template faults
–Extraction engine faults
Layouts involved include some that are very similar
–single-field failures typical of other error sources
Minimizes disputes among human judges
Relatively unaffected by continuing changes to extraction software

Validation Spec for Experiment
Similar to production spec except
–field scores combined by summation rather than by minimum or average
Simulated post-processing of extracted values
–extractor is WYSIWYG
–not always what is desired
  e.g., “Major Matthew H. Fath” => “Fath, Matthew H.”

Automatic vs. Human Classifications
Post-hoc classifier agreed with human on 91% of cases

Conclusions
Important characteristics of this approach:
Aggressively opportunistic
–lots of small, simple tests
Pragmatic
–heuristic combination of simple test results
Flexible
–scripting aids in
  tuning heuristics
  adaptation to different installations & input sets

Conclusions: Exploiting Validation Internally
Agreement rate between validator and humans far exceeds our best prior classifier algorithms
–based on geometric layout of text and graphic blocks
New classifier:
–apply all available templates to document
–score all outputs using validator
–choose top-scoring output set