EXPLOITING DYNAMIC VALIDATION FOR DOCUMENT LAYOUT CLASSIFICATION DURING METADATA EXTRACTION Kurt Maly Steven Zeil Mohammad Zubair WWW/Internet 2007 Vila.


EXPLOITING DYNAMIC VALIDATION FOR DOCUMENT LAYOUT CLASSIFICATION DURING METADATA EXTRACTION Kurt Maly Steven Zeil Mohammad Zubair WWW/Internet 2007 Vila Real, Portugal October 5-8, 2007

OUTLINE
1. Background: Robust automatic extraction of metadata from heterogeneous collections
2. Validation of extracted metadata
3. Post-hoc classification of document layouts
4. Conclusions

1. Background
Diverse, growing government document collections
Amount of metadata available varies considerably
Automated system to extract metadata from new documents
– Classify documents by layout similarity
– Template defines extraction process for a layout class

Process Overview

Sample Metadata Record (including mistakes)
Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius
Name of Candidate: Major Matthew H. Fath
Accepted this 18th day of June 2004 by: Approved by: Thesis Committee Chair Jack D. Kem, Ph.D., Member Mr. Charles S. Soby, M.B.A., Member Lieutenant Colonel John A. Suprin, M.A. Robert F. Baumann, Ph.D.

Issue: Layout Classification
Key to keeping extraction templates simple
Previously explored a variety of techniques based upon geometric position of text and graphics
– e.g., MX-Y trees, learning machines(??)
Generally unsatisfactory in either accuracy or in compatibility with template approach

Issue: Robustness
Sources of errors
– OCR software failures
– Poor document quality
– Classification errors
– Template errors
– Extraction engine faults
Need to detect dubious outputs
– refer to human for inspection & correction

2. Validation
Exploit statistical and heuristic approaches to evaluate quality of extracted metadata
Reference Models
Validation Process
– tests
– specifications

Reference Models
From previously extracted metadata
– specific to document collection
Phrase dictionaries constructed for fields with specialized vocabularies
– e.g., author, organization
Statistics collected
– mean and standard deviation
– permits detection of outputs that are significantly different from collection norms

Statistics collected
Field length statistics
– title, abstract, author, ...
Phrase recurrence rates for fields with specialized vocabularies
– author and organization
Dictionary detection rates for words in natural language fields
– abstract, title, ...
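The field length statistics above can be illustrated with a minimal sketch. It assumes metadata records are plain dictionaries and measures length in words; the function names `build_length_model` and `z_score` and the sample records are hypothetical and are not taken from the system described in the slides.

```python
import statistics

def build_length_model(records, field):
    """Collect mean and standard deviation of word counts for one field."""
    lengths = [len(r[field].split()) for r in records if r.get(field)]
    return {"mean": statistics.mean(lengths), "stdev": statistics.pstdev(lengths)}

def z_score(value, model):
    """Distance of a value's word count from the collection mean, in standard deviations."""
    n = len(value.split())
    if model["stdev"] == 0:
        return 0.0 if n == model["mean"] else float("inf")
    return abs(n - model["mean"]) / model["stdev"]

# Hypothetical previously extracted records (illustration only).
records = [
    {"author": "Jack Kem"},
    {"author": "Matthew H. Fath"},
    {"author": "Robert F. Baumann"},
    {"author": "Charles Soby"},
]
model = build_length_model(records, "author")

# A value far from the collection norm gets a large z-score.
suspicious = z_score("Name of Candidate: Major Matthew H. Fath", model)
typical = z_score("Kurt Maly", model)
```

A large z-score marks an output as significantly different from collection norms; how large is "too large" would be a tuning decision for each collection.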

Field Length (in words), DTIC collection

Dictionary Detection (% of recognized words), DTIC collection

Phrase Dictionary Hit Percentage, DTIC collection

Validation Process
Extracted outputs for fields are subjected to a variety of tests
– Test results are normalized to a confidence value in a fixed range
Test results for same field are combined to form field confidence
Field confidences are combined to form overall confidence

Validation Tests
Deterministic
– Regular patterns such as date, report numbers
Probabilistic
– Length: if value of metadata is close to average -> high score
– Vocabulary: recurrence rate according to field’s phrase dictionary
– Dictionary: detection rate of words in English dictionary
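Three of these tests might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the date pattern, the linear scoring rule for length, the 0.0 to 1.0 score range, and the function names. The slides do not show the actual patterns or normalization used.

```python
import re

# Hypothetical report-date pattern: optional day, month name, four-digit year.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RE = re.compile(rf"^(\d{{1,2}} )?({MONTHS}) \d{{4}}$")

def pattern_test(value):
    """Deterministic test: 1.0 if the value matches the pattern, else 0.0."""
    return 1.0 if DATE_RE.match(value.strip()) else 0.0

def length_test(value, mean, stdev):
    """Probabilistic test: 1.0 at the mean, falling linearly to 0.0
    at three standard deviations (assumed scoring rule)."""
    n = len(value.split())
    if stdev == 0:
        return 1.0 if n == mean else 0.0
    z = abs(n - mean) / stdev
    return max(0.0, 1.0 - z / 3.0)

def dictionary_test(value, dictionary):
    """Probabilistic test: fraction of words found in the dictionary."""
    words = re.findall(r"[A-Za-z]+", value.lower())
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)
```

A vocabulary test could follow the same shape as `dictionary_test`, with the field's phrase dictionary in place of the English word list.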

Combining results
Validation specification describes
– which tests to apply to which fields
– how to combine field tests into field confidence
– how to combine field confidences into overall confidence
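One way to picture such a specification is as a small table of weights. The sketch below assumes a weighted average at both levels; the real specification format and combination rules are not reproduced in this transcript, so `SPEC`, its weights, and the function names are all hypothetical.

```python
# Hypothetical specification: which tests apply to which fields, with weights.
SPEC = {
    "PersonalAuthor": {"weight": 1.0, "tests": {"length": 0.5, "vocabulary": 0.5}},
    "ReportDate": {"weight": 1.0, "tests": {"pattern": 1.0}},
}

def weighted_mean(scores, weights):
    """Weighted average of the scores, using only the weights of keys present."""
    total = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total

def field_confidence(field, test_scores, spec=SPEC):
    """Combine the test scores of one field into a field confidence."""
    return weighted_mean(test_scores, spec[field]["tests"])

def overall_confidence(all_scores, spec=SPEC):
    """Combine field confidences into an overall confidence.
    Returns (overall, per-field confidences)."""
    confs = {f: field_confidence(f, s, spec) for f, s in all_scores.items()}
    weights = {f: spec[f]["weight"] for f in confs}
    return weighted_mean(confs, weights), confs

# Illustrative test scores, not measured values.
scores = {
    "PersonalAuthor": {"length": 0.2, "vocabulary": 0.6},
    "ReportDate": {"pattern": 0.0},
}
overall, per_field = overall_confidence(scores)
```

Other combination rules, such as taking the minimum so that one failed test dominates, would slot into the same structure by swapping `weighted_mean`.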

Validation Specification for DTIC Collection

Validation Specification - continued

Sample Output from the Validator
<metadata confidence="0.460" warning="ReportDate field does not match required pattern">
Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius
<PersonalAuthor confidence="0.4" warning="PersonalAuthor: unusual number of words"> Name of Candidate: Major Matthew H. Fath
<ReportDate confidence="0.0" warning="ReportDate field does not match required pattern"> Accepted this 18th day of June 2004 by: Approved by: Thesis Committee Chair Jack D. Kem, Ph.D., Member Mr. Charles S. Soby, M.B.A., Member Lieutenant Colonel John A. Suprin, M.A. Robert F. Baumann, Ph.D.

3. Classification
Post hoc classification
Experimental Results

Post hoc Classification
Previously attempted a priori classification
– choose one layout based on geometry of page
– apply template for that chosen layout
Alternative: exploit validator for post hoc selection of layout
– Apply all templates to given document
– Score each output using validator
– Select template which scored highest
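The three steps of post hoc selection can be sketched as a short loop. This is a minimal illustration that assumes each template is a callable returning a metadata dictionary and that the validator returns a single confidence score; `classify_post_hoc`, the toy templates, and `toy_validate` are hypothetical stand-ins, not the project's code.

```python
def classify_post_hoc(document, templates, validate):
    """Apply every template, score each output with the validator,
    and return (template name, metadata, confidence) for the best one."""
    best = None
    for name, extract in templates.items():
        metadata = extract(document)
        confidence = validate(metadata)
        if best is None or confidence > best[2]:
            best = (name, metadata, confidence)
    return best

# Toy templates: each takes the title from a different line of the document.
templates = {
    "thesis": lambda doc: {"title": doc.split("\n")[0]},
    "report": lambda doc: {"title": doc.split("\n")[-1]},
}

def toy_validate(metadata):
    """Toy validator: prefers titles of 3 to 12 words."""
    n = len(metadata["title"].split())
    return 1.0 if 3 <= n <= 12 else 0.1

doc = "Intrepidity, Iron Will, and Intellect\nJune 2004"
choice = classify_post_hoc(doc, templates, toy_validate)
```

The cost is one extraction and one validation per template for every document, in exchange for dropping the separate geometric classifier.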

Experimental Design
How effective is post-hoc classification?
Selected several hundred documents recently added to DTIC collection
– Visually classified by humans, comparing to the 4 most common layouts from studies of earlier documents
– Discarded documents not in one of those classes
– 167 documents remained
Applied all templates, validated extracted metadata, and selected the highest-confidence output as the validator’s choice
Compared validator’s preferred layout to human choices

Automatic vs. Human Classifications
Post-hoc classifier agreed with human in 74% of cases

Post hoc Classification
Problem:
– WYSIWYG extraction often results in extra words in extracted data
  E.g., in author field (‘Name of Candidate’, ‘Major’)
– Not desired in final output; post-processing to remove these is anticipated but not yet implemented
– Artificially reduces validator scores, because the extra words are not part of the phrase dictionary
Solution:
– Post-processing must be done prior to validation

Re-interpreting the experiment
Subjected author metadata to simulated post-processing
– scripts to remove known extraneous phrases specific to the document layouts, military ranks, and other honorifics
Agreement between post-hoc classifier and human classification rose to 99%
– far exceeds our best a priori classifiers to date
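A rough idea of what such a post-processing script could do is sketched below. The phrase and honorific lists are hypothetical examples drawn from the sample record; the scripts actually used in the experiment are not shown in the slides.

```python
import re

# Hypothetical lists, for illustration only.
LAYOUT_PHRASES = ["Name of Candidate:"]
HONORIFICS = ["Major", "Lieutenant Colonel", "Colonel", "Dr.", "Mr.", "Ms."]

def clean_author(value):
    """Strip layout-specific phrases and honorifics from an author field."""
    for phrase in LAYOUT_PHRASES:
        value = value.replace(phrase, " ")
    # Longest first, so "Lieutenant Colonel" is removed before "Colonel".
    for title in sorted(HONORIFICS, key=len, reverse=True):
        # Lookarounds keep names such as "Majors" intact.
        value = re.sub(rf"(?<!\w){re.escape(title)}(?!\w)", " ", value)
    # Collapse the whitespace left behind by the removals.
    return " ".join(value.split())

cleaned = clean_author("Name of Candidate: Major Matthew H. Fath")
```

Running this before validation means the author field is scored on the name alone, which is what the phrase dictionary was built from.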

Conclusions
Creating a statistical model of existing metadata is a very useful tool for validating metadata extracted from new documents
Validation can be used to classify documents and select the right template for the automated extraction process