The ODU Metadata Extraction Project March 28, 2007 Dr. Steven J. Zeil

Outline
1. Overview
2. Recent Developments
   A. Independent Document Model
   B. Validation
   C. Diversifying – NASA & GPO collections
3. New Issues & Future Directions
   A. Post-processing
   B. Image-Based Classification

1. Overview

Input Processing & OCR
Select pages of interest
Apply off-the-shelf OCR software
Convert OCR output to XML model format

Form Processing
Scan document for form names
  – Select form template
Apply form extraction engine to document and template

Sample RDP

Sample RDP (cont.)

Metadata Extracted from Sample RDP (1/3) Final Report 1 April August 2003 VALIDATION OF IONOSPHERIC MODELS F C F Patricia H. Doherty Leo F. McNamara Susan H. Delay Neil J. Grossbard 1010 IM AC Boston College / Institute for Scientific Research 140 Commonwealth Avenue Chestnut Hill, MA

Metadata Extracted from Sample RDP (2/3) Air Force Research Laboratory 29 Randolph Road Hanscom AFB, MA VSBP AFRL-VS-TR Approved for public release; distribution unlimited. This document represents the final report for work performed under the Boston College contract F I C This contract was entitled Validation of Ionospheric Models. The objective of this contract was to obtain satellite and ground-based ionospheric measurements from a wide range of geographic locations and to utilize the resulting databases to validate the theoretical ionospheric models that are the basis of the Parameterized Real-time Ionospheric Specification Model (PRISM) and the Ionospheric Forecast Model (IFM). Thus our various efforts can be categorized as either observational databases or modeling studies.

Metadata Extracted from Sample RDP (3/3) Ionosphere, Total Electron Content (TEC), Scintillation, Electron density, Parameterized Real-time Ionospheric Specification Model (PRISM), Ionospheric Forecast Model (IFM), Paramaterized Ionosphere Model (PIM), Global Positioning System (GPS) John Retterer U U SAR

Non-Form Processing
Classification: compare document against known document layouts
  – Select template written for closest matching layout
Apply non-form extraction engine to document and template

Non-Form Sample (1/2)

Non-Form Sample (2/2)

Template Used for Sample Document AU/ onesection AIR COMMAND | AIR WAR AIR UNIVERSITY CorporateAuthor by …

Metadata Extracted From the Title Page of the Sample Document AU/ACSC/012/ AIR COMMAND AND STAFF COLLEGE AIR UNIVERSITY INTEGRATING COMMERCIAL ELECTRONIC EQUIPMENT TO IMPROVE MILITARY CAPABILITIES Jeffrey A. Bohler LCDR, USN Advisor: CDR Albert L. St.Clair April 1999

Post-Processing
Coerce extracted values into standard formats

Validation
Estimate quality of extracted metadata
Untrusted outputs referred (to humans) for review and correction

2. Recent Developments
   A. Independent Document Model
   B. Validation
   C. Diversifying – NASA and GPO Collections

A. Independent Document Model (IDM)
Platform-independent document model
Motivation
  – Dramatic XML schema change between OmniPage 14 and 15
  – Ties the template engine to a stable specification
  – Protects against linking directly to a specific OCR product
  – Allows us to include statistics for enhanced feature usage
      Statistics (e.g., avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount)

Documents in IDM
A document consists of pages
Pages are divided into regions
Regions may be divided into
  – blocks of vertical whitespace
  – paragraphs
  – tables
  – images
Paragraphs are divided into lines
Lines are divided into words
All of these carry standard attributes for size, position, font, etc.
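
The containment hierarchy described on this slide can be pictured as nested record types. The following is only a sketch: the class and attribute names (Box, font_size, children, and so on) are assumptions made for illustration and are not taken from the actual IDM schema.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Box:
    """Standard attributes shared by every element (attribute names are assumed)."""
    x: float = 0.0
    y: float = 0.0
    width: float = 0.0
    height: float = 0.0
    font_size: float = 0.0


@dataclass
class Word(Box):
    text: str = ""


@dataclass
class Line(Box):
    words: List[Word] = field(default_factory=list)


@dataclass
class Paragraph(Box):
    lines: List[Line] = field(default_factory=list)


@dataclass
class Whitespace(Box):
    """A block of vertical whitespace."""


@dataclass
class Table(Box):
    """Table contents are not modelled in this sketch."""


@dataclass
class Image(Box):
    """Image contents are not modelled in this sketch."""


@dataclass
class Region(Box):
    children: List[Union[Whitespace, Paragraph, Table, Image]] = field(default_factory=list)


@dataclass
class Page(Box):
    regions: List[Region] = field(default_factory=list)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)


def word_count(doc: Document) -> int:
    """Count words by walking pages, regions, paragraphs, and lines."""
    return sum(
        len(line.words)
        for page in doc.pages
        for region in page.regions
        for child in region.children
        if isinstance(child, Paragraph)
        for line in child.lines
    )
```

A statistic such as wordCount then falls out of a simple walk over the tree, which is one reason a stable, product-independent model is convenient.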

Generating IDM
Use XSLT 2.0 stylesheets to transform OCR output into IDM
  – Supporting a new OCR schema only requires writing a new XSLT stylesheet
  – The engine does not change

IDM Usage
[Diagram: OmniPage 14 XML Doc, OmniPage 15 XML Doc, and Other OCR Output XML Doc are transformed by docTreeModelOmni14.xsl, docTreeModelOmni15.xsl, and docTreeModelOther.xsl into an IDM XML Doc, which feeds Form Based Extraction and Non Form Extraction]
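
One way to read the diagram is as a lookup from OCR product to stylesheet, with the extraction engines seeing only IDM. The sketch below assumes hypothetical source labels ("omnipage14" and so on) and a hypothetical function name; only the stylesheet file names come from the slide. It selects a stylesheet and does not run the XSLT 2.0 transform itself.

```python
# Stylesheet file names come from the slide; the dictionary keys are assumed labels.
STYLESHEETS = {
    "omnipage14": "docTreeModelOmni14.xsl",
    "omnipage15": "docTreeModelOmni15.xsl",
    "other": "docTreeModelOther.xsl",
}


def select_stylesheet(ocr_source: str) -> str:
    """Return the XSLT stylesheet that converts this OCR output to IDM."""
    try:
        return STYLESHEETS[ocr_source.lower()]
    except KeyError:
        raise ValueError(f"no IDM stylesheet registered for {ocr_source!r}") from None
```

Under this reading, supporting a new OCR product means adding one entry and one stylesheet, leaving the engines untouched.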

IDM Tool Status
Converters completed to generate IDM from OmniPage 14 and 15 XML
  – OmniPage 15 proved to have numerous errors in its representation of an OCR’d document
  – Consequently, not recommended
Form-based extraction engine revised to work from IDM
Non-form engine still works from our older “CleanXML”
  – converter from IDM to CleanXML completed as stop-gap measure
  – direct use of IDM deferred pending review of other engine modifications

B. Validation
Given a set of extracted metadata
  – mark each field with a confidence value indicating how trustworthy the extracted value is
  – mark the set with a composite confidence score
Fields and sets with low confidence scores may be referred for additional processing
  – automated post-processing
  – human intervention and correction

Validating Extracted Metadata
Techniques must be independent of the extraction method
A validation specification is written for each collection, combining field-specific validation rules
  – statistical models derived for each field of
      text length
      % of words from English dictionary
      % of phrases from knowledge base prepared for that field
  – pattern matching
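
To make the idea of field-level statistical tests concrete, here is a minimal sketch of two of the tests named above (text length and percentage of dictionary words) plus a composite score for the set. The scoring formulas, the 3-standard-deviation cutoff, the averaging of the two tests, and the use of the weakest field as the composite are all assumptions for illustration; the slides do not say how the project's validator combines its tests.

```python
from statistics import mean
from typing import Dict, Set


def length_score(text: str, mean_len: float, std_len: float) -> float:
    """1.0 at the expected length, falling linearly to 0.0 at 3 standard deviations."""
    z = abs(len(text) - mean_len) / std_len
    return max(0.0, 1.0 - z / 3.0)


def dictionary_score(text: str, dictionary: Set[str]) -> float:
    """Fraction of whitespace-separated words found in the dictionary."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)


def field_confidence(text: str, mean_len: float, std_len: float,
                     dictionary: Set[str]) -> float:
    """Average the individual test scores (an assumed combination rule)."""
    return mean([length_score(text, mean_len, std_len),
                 dictionary_score(text, dictionary)])


def set_confidence(field_scores: Dict[str, float]) -> float:
    """Composite confidence for a metadata set: here, the weakest field."""
    return min(field_scores.values())
```

Because these tests look only at the extracted text and collection statistics, they stay independent of whichever extraction engine produced the values.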

Sample Validation Specification Combines results from multiple fields <val:validate collection="dtic" xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary" >...

Validation Spec: Field Tests Each field is subjected to one or more tests …...

Sample Input Metadata Set Thesis Title: The Military Extraterritorial Jurisdiction Act Name of Candidate: LCDR Kathleen A. Kerrigan Accepted this 18th day of June 2004 by:

Sample Validator Output Thesis Title: The Military Extraterritorial Jurisdiction Act Name of Candidate: LCDR Kathleen A. Kerrigan Accepted this 18th day of June 2004 by:

Classification (a priori)
Previously, we had attempted various schemes for a priori classification
  – x-y trees
  – bin classification
Still investigating some
  – image-based recognition

Post-Hoc Classification
Apply all templates to document
  – results in multiple candidate sets of metadata
Score each candidate using the validator
  – Select the best-scoring set
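
The post-hoc scheme amounts to an argmax over templates. A minimal sketch follows, in which templates are modelled as plain functions from a document to a metadata dictionary and the validator as a function returning a single score; both interfaces are assumptions for illustration, not the project's actual API.

```python
from typing import Callable, Dict, Optional, Tuple

Metadata = Dict[str, str]


def classify_post_hoc(
    document: str,
    templates: Dict[str, Callable[[str], Metadata]],
    validate: Callable[[Metadata], float],
) -> Tuple[Optional[str], Metadata, float]:
    """Apply every template, score each candidate, and keep the best one."""
    best_name: Optional[str] = None
    best_meta: Metadata = {}
    best_score = float("-inf")
    for name, extract in templates.items():
        candidate = extract(document)   # one candidate metadata set per template
        score = validate(candidate)     # composite confidence from the validator
        if score > best_score:
            best_name, best_meta, best_score = name, candidate, score
    return best_name, best_meta, best_score
```

The name of the winning template doubles as the document's layout class, which is why this counts as classification after the fact.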

Experimental Results Manually Assigned Class Number of Documents Validator PreferredTotal Au86000 Eagle Rand Title

Interpretation of Results
Validator agreed with human on 125 out of 167 cases
Of 42 cases where they disagreed
  – 37 were due to “extra” words in extracted metadata (e.g., military ranks in author names)
      highlights need for post-processing to clean up metadata
  – 2 were mistakes by template
  – 2 were due to garbled characters by OCR
  – 1 due to a bug in the validator

C. Diversifying – NASA and GPO Collections
Document collections differ in
  – whether forms are used, and form layout
  – document layout
  – what metadata fields are present & which ones are collected

Changing Collections
Porting to a new document collection
  – identify pages of interest
  – train classifiers to recognize new document layouts (?)
  – templates for forms & document layouts
  – new validation scripts
      collect statistics for collection model
  – new post-processing rules
No changes required to core engines & other software

NASA Technical Reports
Different layouts than DTIC
  – fewer total
  – tend to be visually more similar
  – mixture with and without RDPs

NASA Sample Document

Extracted Metadata for NASA Sample A Computationally Efficient Meshless Local Petrov-Galerkin Method for Axisymmetric Problems I.S. Raju* and T. Chen? NASA Langley Research Center Hampton, VA The Meshless Local Petrov-Galerkin (MLPG) method is one of the recently developed element-free …

Govt. Printing Office
Congressional acts & reports
EPA reports
Preliminary study with samples of Acts of Congress and EPA reports suggests layouts are more diverse than DTIC or NASA
  – metadata actually present in document varies widely

GPO Sample – Act of Congress

Metadata Extracted for Act of Congress 118 STAT PUBLIC LAW 108–493 – DEC. 23, 2004 [H.R ] components. 108th Congress An Act Dec. 23, 2004 To amend the Internal Revenue Code of 1986 to modify the taxation of arrow [H.R ] components.

GPO sample report

Metadata Extracted from GPO Sample Report CHINA'S PROLIFERATION PRACTICES AND ROLE IN THE NORTH KOREA CRISIS HEARING BEFORE THE U.S.-CHINA ECONOMIC AND SECURITY REVIEW COMMISSION ONE HUNDRED NINTH CONGRESS FIRST SESSION MARCH 10, 2005 Printed for the use of the U.S.-China Economic and Security Review Commission Available via the World Wide Web:

3. New Issues and Future Directions
   A. Post-Processing
   B. Image-Based Classification

Post-processing
WYSIWYG
  – What You See is What You Get
WYG != WYW
  – What You Get is not What You Want

Example – DTIC Date Format
Document may contain:
  – March 28, 2007
  – 3/28/2007
  – 3/28/07
DTIC requires:
  – 28 MAR 2007
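
A post-processing rule for this case could try each known input format in turn and emit the DTIC form. The sketch below covers only the three input formats shown on the slide; the function name is hypothetical, and leaving single-digit days unpadded is an assumption, since the slide shows only a two-digit day.

```python
from datetime import datetime

# Input formats taken from the slide's examples. The four-digit year format is
# tried before the two-digit one so that "3/28/2007" is not misread.
_INPUT_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%m/%d/%y")

# Month abbreviations are spelled out so the output does not depend on locale.
# (Parsing "%B" month names still assumes an English locale.)
_MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
           "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def to_dtic_date(text: str) -> str:
    """Coerce a date string into the 'DD MON YYYY' form shown on the slide."""
    for fmt in _INPUT_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # not this format; try the next one
        # Assumption: days below 10 are not zero-padded.
        return f"{parsed.day} {_MONTHS[parsed.month - 1]} {parsed.year}"
    raise ValueError(f"unrecognized date format: {text!r}")
```

Dates that match none of the formats raise an error rather than passing through silently, so they can be referred for review.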

Example – Personal Authors

Example – Personal Authors (cont.)
We extract: Patricia H. Doherty Leo F. McNamara Susan H. Delay Neil J. Grossbard
DTIC requires: Patricia H. Doherty ;Leo F. McNamara ;Susan H. Delay ;Neil J. Grossbard
NASA requires: Patricia H. Doherty Leo F. McNamara Susan H. Delay Neil J. Grossbard

Post-Processing Requirements
Post-processing rules must vary by
  – metadata field
  – collection
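
Since rules vary by both field and collection, one natural arrangement is a table keyed on the (collection, field) pair. The sketch below is an assumption about how such a table might look, not the project's design: the key names, function names, and space-joined fallback are hypothetical, and it assumes the author names have already been split into a list. Only the " ;" separator for DTIC authors is taken from the slides.

```python
from typing import Callable, Dict, List, Tuple

Rule = Callable[[List[str]], str]


def join_with_semicolons(values: List[str]) -> str:
    """DTIC-style author list, using the ' ;' separator shown on the slide."""
    return " ;".join(values)


def join_with_spaces(values: List[str]) -> str:
    """Fallback used when no specific rule is registered (an assumed default)."""
    return " ".join(values)


# Rules are looked up by (collection, field); the key names are hypothetical.
RULES: Dict[Tuple[str, str], Rule] = {
    ("dtic", "personal_author"): join_with_semicolons,
}


def post_process(collection: str, field_name: str, values: List[str]) -> str:
    """Apply the rule registered for this collection and field, if any."""
    rule = RULES.get((collection, field_name), join_with_spaces)
    return rule(values)
```

Porting to a new collection then means registering new entries in the table rather than changing the code that applies them.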

Post-Processing Architecture

Image-Based Classification
Filter to find likely candidates for validator-based selection of template
Looking at a variety of techniques inspired by work in image recognition

Example: Image-Based Classification
Represent a page using various colors to denote images, text, bold text, etc.
Find visually most similar pages in documents of known classes
  – “vote” based on 5 most similar documents
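
The voting step resembles a k-nearest-neighbour classifier with k = 5. In the sketch below a page is reduced to a small grid of one-character codes standing in for the colors (for instance "B" for bold text, "T" for text, "I" for image, "." for blank), and similarity is the fraction of matching cells. The grid encoding and the similarity measure are assumptions for illustration; the slides do not specify how visual similarity is computed.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Grid = Sequence[str]  # one string per row, one character code per cell


def similarity(a: Grid, b: Grid) -> float:
    """Fraction of cells carrying the same code; grids must have the same shape."""
    cells = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(x == y for x, y in cells) / len(cells)


def classify_page(page: Grid, labelled_pages: List[Tuple[Grid, str]], k: int = 5) -> str:
    """Vote among the k most similar pages of known class."""
    ranked = sorted(labelled_pages,
                    key=lambda item: similarity(page, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Used as a filter, the classes that collect votes would be the candidates passed on to validator-based template selection.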

Visual Matching Example (1/2)

Visual Matching Example (2/2)

Conclusions
Automated metadata extraction can be performed effectively on a wide variety of documents
  – Coping with heterogeneous collections is a major challenge
Much attention must be paid to “support” issues
  – validation, post-processing, etc.