Metadata Extraction Progress Report 12/14/2006.


Outline
- System Overview
- Detailed Structure with Recent Changes
  - IDM representation of documents
  - validation & post-hoc classification
- Status of Recent & Upcoming Deliverables
- Future Directions

System Overview

Detailed Structure with Recent Changes
- Input Processing
- Form Processing
- Post Processing
- Nonform Processing

Input Processing
- OCR: Omnipage update radically changed XML output
  - Details later
- Study of 10188 DTIC documents found none with POINT (Page Of INTerest) pages outside the first and last 5 pages
  - Suspended efforts at more sophisticated POINT page location
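The page-window finding above can be sketched as a simple candidate filter. This is a minimal illustration, not the project's code: the function name and the `window` parameter are hypothetical, and it assumes the window of 5 applies to both the front and the back of the document.

```python
from __future__ import annotations


def candidate_point_pages(page_count: int, window: int = 5) -> list[int]:
    """Return 1-based page numbers within the first and last `window` pages.

    Assumption: the same window size is used at both ends of the document.
    """
    if page_count <= 0:
        return []
    head = range(1, min(window, page_count) + 1)
    tail = range(max(page_count - window + 1, 1), page_count + 1)
    # Merge and deduplicate so short documents do not list a page twice.
    return sorted(set(head) | set(tail))
```

For a 20-page document this yields pages 1 to 5 and 16 to 20; for documents of 10 pages or fewer, every page is a candidate.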

Form Processing
- Bug fixes and tuning
- Omnipage XML converted to IDM
- Main form template engine rewritten to work from IDM

Independent Document Model (IDM)
- Platform-independent document model
- Motivation
  - Dramatic XML schema change between Omnipage 14 and 15
  - Ties the template engine to a stable specification
  - Protects against linking directly to a specific OCR product
  - Allows us to include statistics for enhanced feature usage
    - Statistics (e.g. avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount)

Generating IDM
- Use XSLT 2.0 stylesheets to transform OCR output into IDM
- Supporting a new OCR schema only requires a new XSLT stylesheet; the engine does not change
- Chain a series of stylesheets to add functionality (CleanML)
- Schema specification available (http://dtic.cs.odu.edu/devzone/IDM_Specification.doc)

IDM Usage
[Diagram]
  OmniPage 14 XML Doc -> docTreeModelOmni14.xsl -> IDM XML Doc
  OmniPage 15 XML Doc -> docTreeModelOmni15.xsl -> IDM XML Doc
  Other OCR Output XML Doc -> docTreeModelOther.xsl -> IDM XML Doc
  IDM XML Doc -> Form Based Extraction
  IDM XML Doc -> docTreeModelCleanML.xsl -> CleanML XML Doc -> Non Form Extraction
- Each incoming XML schema requires a specific XSLT 2.0 stylesheet
- Resulting IDM doc used for “Form Based” templates
- IDM transformed into CleanML for “Non-form” templates
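One way to picture the stylesheet selection on this slide is a lookup from source format to a chain of stylesheets. The sketch below is illustrative only: the stylesheet file names come from the slide, but the dictionary keys, the function name, and the reading of docTreeModelCleanML.xsl as the IDM-to-CleanML step are assumptions. It only selects stylesheet names and does not run XSLT.

```python
from __future__ import annotations

# Stylesheet names are from the slide; the keys are hypothetical labels.
TO_IDM = {
    "omnipage14": "docTreeModelOmni14.xsl",
    "omnipage15": "docTreeModelOmni15.xsl",
    "other": "docTreeModelOther.xsl",
}
# Assumed to be the IDM -> CleanML step used by the non-form engine.
IDM_TO_CLEANML = "docTreeModelCleanML.xsl"


def stylesheet_chain(source_format: str, engine: str) -> list[str]:
    """Return the ordered stylesheets to apply for a source format and engine."""
    if source_format not in TO_IDM:
        raise ValueError(f"no IDM converter for {source_format!r}")
    chain = [TO_IDM[source_format]]
    if engine == "nonform":
        # Non-form templates work from CleanML, so add one more transform.
        chain.append(IDM_TO_CLEANML)
    elif engine != "form":
        raise ValueError(f"unknown engine {engine!r}")
    return chain
```

Under this reading, supporting a new OCR product means adding one entry to the lookup, which matches the point that the engine itself does not change.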

IDM Tool Status
- Converters completed to generate IDM from Omnipage 14 and 15 XML
  - Omnipage 15 proved to have numerous errors in its representation of an OCR’d document
  - Consequently, not recommended
- Form-based extraction engine revised to work from IDM
- Non-form engine still works from our older “CleanXML”
  - Converter from IDM to CleanXML completed as stop-gap measure
  - Direct use of IDM deferred pending review of other engine modifications

Post Processing
- No significant changes

Nonform Processing
- Bug fixes & tuning
- Added new validation component
- Post-hoc classification replaces former a priori classification schemes

Validation
- Given a set of extracted metadata
  - mark each field with a confidence value indicating how trustworthy the extracted value is
  - mark the set with a composite confidence score
- Fields and sets with low confidence scores may be referred for additional processing
  - automated post-processing
  - human intervention and correction

Validating Extracted Metadata
- Techniques must be independent of the extraction method
- A validation specification is written for each collection, combining
  - Field-specific validation rules
  - Statistical models derived for each field of
    - text length
    - % of words from English dictionary
    - % of phrases from knowledge base prepared for that field
  - Pattern matching
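Two of the statistical tests listed above can be sketched as follows. The slides do not say how a statistic is turned into a score, so the Gaussian-style falloff in `length_score`, the tokenizer, and all function names are assumptions made for illustration.

```python
import math
import re


def tokens(text):
    """Lower-cased alphabetic tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def dictionary_fraction(text, dictionary):
    """Fraction of tokens found in `dictionary` (a set of lower-case words)."""
    found = tokens(text)
    if not found:
        return 0.0
    return sum(word in dictionary for word in found) / len(found)


def length_score(text, mean, stdev):
    """Score in (0, 1]: 1.0 at the mean length, lower further from it.

    Assumption: a Gaussian-shaped falloff measured in standard deviations.
    """
    if stdev <= 0:
        return 1.0 if len(text) == mean else 0.0
    z = (len(text) - mean) / stdev
    return math.exp(-0.5 * z * z)
```

The `mean` and `stdev` values would come from the per-field statistical model built for a collection; the phrase test against a knowledge base could follow the same shape as `dictionary_fraction`, with n-word phrases in place of single words.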

Sample Validation Specification
- Combines results from multiple fields

<val:validate collection="dtic"
    xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">
  <val:average>
    <val:field name="UnclassifiedTitle">...</val:field>
    <val:field name="PersonalAuthor">...</val:field>
    <val:field name="CorporateAuthor">...</val:field>
    <val:field name="ReportDate">...</val:field>
  </val:average>
</val:validate>

Validation Spec: Field Tests
- Each field is subjected to one or more tests

…
<val:field name="PersonalAuthor">
  <val:average>
    <val:length/>
    <val:max>
      <val:phrases length="1"/>
      <val:phrases length="2"/>
      <val:phrases length="3"/>
    </val:max>
  </val:average>
</val:field>
<val:field name="ReportDate">
  <val:reportFormat/>
  ...
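The nesting of `val:average` and `val:max` suggests that test scores are combined recursively. The sketch below mirrors that structure with plain tuples; it is an interpretation of the tag names, not the Jelly tag library's implementation, and the leaf scores in the usage note are made-up illustrative values.

```python
def combine(node):
    """Evaluate a nested (operator, children) tree of confidence scores.

    A node is either a number (the score from a single test) or a tuple
    ("average" | "max", [child nodes]).
    """
    if isinstance(node, (int, float)):
        return float(node)
    operator, children = node
    scores = [combine(child) for child in children]
    if not scores:
        raise ValueError("combinator has no children")
    if operator == "average":
        return sum(scores) / len(scores)
    if operator == "max":
        return max(scores)
    raise ValueError(f"unknown combinator {operator!r}")
```

For the PersonalAuthor field above, a tree such as `("average", [0.5, ("max", [0.25, 0.5, 1.0])])` would stand for a length score of 0.5 averaged with the best of the three phrase-length scores.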

Sample Input Metadata Set

<metadata>
  <UnclassifiedTitle>Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
  <PersonalAuthor>Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>
  <ReportDate>Accepted this 18th day of June 2004 by:</ReportDate>
</metadata>

Sample Validator Output

<metadata confidence="0.522">
  <UnclassifiedTitle confidence="0.943">Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
  <PersonalAuthor confidence="0.622">Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>
  <ReportDate confidence="0.0" warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>
</metadata>
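The composite confidence of 0.522 in this output is consistent with a plain average of the three field confidences that are present. The check below only reproduces that arithmetic; whether the validator really skips the absent CorporateAuthor field rather than scoring it is an assumption, since the slides do not say.

```python
# Field confidences copied from the sample validator output.
field_confidences = {
    "UnclassifiedTitle": 0.943,
    "PersonalAuthor": 0.622,
    "ReportDate": 0.0,
}

# Unweighted mean over the fields present, rounded to three decimals.
composite = round(sum(field_confidences.values()) / len(field_confidences), 3)
print(composite)  # 0.522
```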

Classification (a priori)
- Previously, we had attempted various schemes for a priori classification
  - x-y trees
  - bin classification
- Still investigating some visual recognition

Post-Hoc Classification
- Apply all templates to document
  - results in multiple candidate sets of metadata
- Score each candidate using the validator
- Select the best-scoring set
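The three steps above can be sketched as a short selection loop. This is a minimal illustration under stated assumptions: templates are modelled as callables that return a metadata dictionary, the validator as a callable that returns a composite confidence, and all names are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

Metadata = Dict[str, str]


def classify_post_hoc(
    document: str,
    templates: Dict[str, Callable[[str], Metadata]],
    validate: Callable[[Metadata], float],
) -> Tuple[Optional[str], Metadata, float]:
    """Apply every template, score each candidate, and keep the best one.

    Returns (template name, extracted metadata, confidence); the name is
    None when no templates are supplied.
    """
    best: Tuple[Optional[str], Metadata, float] = (None, {}, float("-inf"))
    for name, extract in templates.items():
        candidate = extract(document)
        score = validate(candidate)
        if score > best[2]:
            best = (name, candidate, score)
    return best
```

The winning template name doubles as the document's class, which is what lets validation replace a separate a priori classifier.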

Demo & Experimental Results
- Results of 157 documents: http://128.82.7.147:8080/dtic/validsum157.jsp

Class    Hand Classification    Validation
Au       86
Eagle    47
Title    24
Total    157

Future Directions

Status of Recent & Upcoming Deliverables
- DTIC - Classifier Development (9/19/06)
- NASA - Enhance classification algorithm for two specific classes (10/31/2006)
- NASA - Process study for inter-organizational collections (configuration software) (12/1/2006)
- NASA - Enhance engine to recognize two major classes (12/15/2006)

Classifier Development
- DTIC - Classifier Development (9/19/06)
- NASA - Enhance classification algorithm for two specific classes (10/31/2006)
- Delayed by difficulties with a priori classification schemes
- Now replaced by post hoc validation-based classification
  - some tuning of validation spec required
  - cleaning of metadata sources for statistical models
- Demo posted 11/15/2006

Configuration
- NASA - Process study for inter-organizational collections (12/1/2006)
- Extraction engines differentiate by collection-dependent template sets
- Validation specifications take collection name as a required attribute
  - used to locate distinct statistical models built for that collection
- Regression test framework established
  - protects against changes or tuning to one collection degrading performance on others
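The cross-collection protection described above can be pictured as a comparison of per-collection scores before and after a change. The sketch below is an assumption about how such a check might look, not the project's framework; the score metric, the function name, and the `tolerance` parameter are all hypothetical.

```python
def regressions(baseline, current, tolerance=0.0):
    """Return collections whose score dropped by more than `tolerance`.

    `baseline` and `current` map collection names to a quality score
    (higher is better). A collection missing from `current` is reported too.
    """
    dropped = {}
    for collection, old_score in baseline.items():
        new_score = current.get(collection)
        if new_score is None or old_score - new_score > tolerance:
            dropped[collection] = (old_score, new_score)
    return dropped
```

Running such a check over every collection after tuning one of them would flag the case where, for example, a DTIC-specific change lowers the score for a NASA collection.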

Engine Enhancements
- NASA - Enhance engine to recognize two major classes (12/15/2006)
  - in many ways, already satisfied
  - most planned enhancements deferred due to work on IDM
  - in short term, emphasis will be on expanding the template set to exploit existing engine features and availability of new post-hoc classifier

END Questions?

Current System (Detailed)