ICCTA 2006, 5-7 September, Alexandria
Automated Metadata Extraction
July 17-20, 2006
Kurt Maly


Outline
Background and Motivation
Challenges and Approaches
Metadata Extraction Experience at ODU CS
Architecture for Metadata Extraction
Experiments with DTIC Documents
Experiments with Limited GPO Documents
Conclusions

Digital Library Research at ODU
Digital Libraries
Content Creation
– New content publication tools: Kepler, Compopt (NSF, US Navy)
– Process existing content (DTIC)
Content Sharing
– Centralized model: harvesting via OAI-PMH; Arc/Archon (NSF), Kepler (NSF), TRI (NASA, LANL, SANDIA), DL Grid (Andrew Mellon), Secure DL (NSF, IBM), Real Time LFDL
– Distributed model: P2P (NSF)

Motivation
Metadata enhances the value of a document collection
– Using metadata helps resource discovery: a company may save about $8,200 per employee by using metadata in its intranet to reduce the time employees spend searching, verifying, and organizing files (estimate by Mike Doane at the DCMI 2003 workshop)
– Using metadata helps make collections interoperable via OAI-PMH
Manual metadata extraction is costly and time-consuming
– It would take about 60 employee-years to create metadata for 1 million documents (estimate by Lou Rosenfeld at the DCMI 2003 workshop)
– Automatic extraction tools are essential to reduce cost and allow rapid dissemination
OCR alone is not sufficient for making 'legacy' documents searchable

Challenges
A successful metadata extraction system must:
– extract metadata accurately
– scale to large document collections
– cope with heterogeneity within a collection
– maintain accuracy, with minimal reprogramming/training cost, as the collection evolves over time
– have a validation/correction process

Approaches
Machine Learning
– HMM
– SVM
Rule-Based
– Ad hoc
– Expert systems
– Template-based (ODU CS)

Comparison
Machine-learning approach
– Good adaptability, but it has to be trained from samples, which is very time-consuming
– Performance degrades with increasing heterogeneity
– Difficult to add new fields to be extracted
– Difficult to select the right features for training
Rule-based approach
– No need for training from samples
– Can extract different metadata from different documents
– Rule writing may require significant technical expertise

Metadata Extraction Experience at ODU CS
DTIC (2004, 2005)
– Developed software to automate the task of extracting metadata and basic structure from DTIC PDF documents
– Explored alternatives including SVM, HMM, and expert systems
– Origin of the ODU template-based engine
GPO (in progress)
NASA (in progress)
– Feasibility study applying the template-based approach to the CASI collection

Meeting the Challenges
All techniques achieved reasonable accuracy for small collections
– Possible to scale to large homogeneous collections
Heterogeneity remains a problem
– Ad hoc rule-based systems tend toward complex monoliths
– Expert systems tend toward large rule sets with complex, poorly understood interactions
– Machine learning must choose between reduced accuracy and confidence or state explosion
Evolution is problematic for machine-learning approaches
– Older documents may have a higher rate of OCR errors
– Expensive retraining is required to accommodate changes in the collection
– Potential lag time during which accuracy decays until sufficient training instances are acquired
Validation: a largely unexplored area
– Machine-learning approaches offer some support via confidence measures

Architecture for Metadata Extraction

Our Approach: Meeting the Challenges
Bi-level architecture
– Classification based upon document similarity
– Simple templates (rule-based) written for each emerging class

Our Approach: Meeting the Challenges
Heterogeneity
– Classification, in effect, reduces the problem to multiple homogeneous collections
– Multiple templates are required, but each template is comparatively simple: it only needs to accommodate one class of documents that share a common layout and style
Evolution
– New classes of documents are accommodated by writing a new template; templates are comparatively simple, no lengthy retraining is required, and the response to changes in the collection is potentially rapid
– Enriching the template engine with new features reduces the complexity of templates
Validation
– Exploring a variety of techniques drawn from automated software testing and validation

Metadata Extraction: Template-Based
Template-based approach
– Classify documents into classes based on similarity
– For each document class, create a template, i.e., a set of rules
– Decouple rules from code: a template is kept in a separate file
Advantages
– Easy to extend: for a new document class, just create a template
– Rules are simpler
– Rules can be refined easily
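The decoupling of rules from code can be sketched as follows. The JSON format, field names, and rule vocabulary here are illustrative assumptions only; the actual ODU engine keeps templates as XML files with a richer rule language, as described later in the deck.

```python
import json

# Hypothetical template file contents, illustrating rules kept in a
# separate file that the engine loads at run time (the real templates
# are XML, not JSON).
template_json = '''
{
  "class": "report_with_cover_page",
  "rules": [
    {"field": "title", "region": "top", "feature": "largest_font"},
    {"field": "date",  "pattern": "[1-2][0-9][0-9][0-9]"}
  ]
}
'''

template = json.loads(template_json)
print(template["class"], len(template["rules"]))
```

Refining a rule or supporting a new document class then means editing or adding a template file, with no change to the engine itself.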

Classes of documents

Template engine

Document Features
Layout features
– Boldness, i.e., whether or not text is in a bold font
– Font size, i.e., the font size used in the text, e.g., 12 or 14
– Alignment, i.e., whether text is left-, right-, center-, or fully justified
– Geometric location, for example, a block starting at coordinates (0, 0) and ending at coordinates (100, 200)
– Geometric relation, for example, a block located below the title block

Document Features
Textual features
– Special words, for example, a string starting with "abstract"
– Special patterns, for example, a string matching the regular expression "[1-2][0-9][0-9][0-9]"
– Statistical features, for example, a string with more than 20 words, more than 100 letters, or more than 50% of its letters in upper case
– Knowledge features, for example, a string containing a last name from a name dictionary
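As a rough illustration, textual features like those above reduce to ordinary string and regular-expression checks. These helper functions are hypothetical and are not part of the ODU engine:

```python
import re

def starts_with_abstract(s):
    # Special-word feature: a string starting with "abstract"
    return s.lower().startswith("abstract")

def matches_year_pattern(s):
    # Special-pattern feature: the slide's example regex for a 4-digit year
    return re.search(r"[1-2][0-9][0-9][0-9]", s) is not None

def mostly_uppercase(s):
    # Statistical feature: more than 50% of the letters in upper case
    letters = [c for c in s if c.isalpha()]
    return bool(letters) and sum(c.isupper() for c in letters) / len(letters) > 0.5

print(starts_with_abstract("Abstract: We present an approach..."))  # True
print(matches_year_pattern("Alexandria, 2006"))                     # True
print(mostly_uppercase("AUTOMATED METADATA EXTRACTION"))            # True
```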

Template Language
– XML based
– Related to document features
– Defined by an XML schema
– Simple document model: document, page, zone, region, column, row, paragraph, line, word, character

Template sample

Sample document (PDF)

Scan/OCR output

Cleaned XML output

Template (part)

Metadata extracted

Results Summary from the DTIC Project

Experiment with Limited GPO Documents
– 14 GPO documents having a Technical Report Documentation Page
– 57 GPO documents without a Technical Report Documentation Page
– 16 Congressional Reports
– 16 Public Law documents

GPO Report Documentation Page

GPO Document

Congressional Report

Public Law Document

Conclusions
– OCR software works very well on current documents
– The template-based approach allows automatic metadata extraction with a high degree of accuracy from dynamically changing collections and large, heterogeneous collections, including report documentation pages
– Extraction of structural metadata (e.g., table of contents, tables, equations, sections) is feasible

Metadata Extraction, Part II: Automatic Categorization

Document Categorization
Problem: given
– a collection of documents, and
– a taxonomy of subject areas
Classification: determine the subject area(s) most pertinent to each document
Indexing: select a set of keywords / index terms appropriate to each document

Classification Techniques
Manual (a.k.a. knowledge engineering)
– Typically rule-based expert systems
Machine Learning
– Probabilistic (e.g., naïve Bayesian)
– Decision structures (e.g., decision trees)
– Profile-based: compare a document to profile(s) of the subject classes, using similarity rules like those employed in information retrieval
– Support vector machines (SVM)

Classification via Machine Learning
Usually train-and-test
– Exploit an existing collection in which documents have already been classified: a portion is used as the training set, another portion as a test set
– Permits measurement of classifier effectiveness
– Allows tuning of classifier parameters to yield maximum effectiveness
Single- vs. multi-label
– Can one document be assigned to multiple categories?
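The train-and-test protocol can be sketched in a few lines. The 70/30 split below is an arbitrary illustrative choice, not the split used in the experiments described later:

```python
import random

# A previously classified collection: (document_id, class_label) pairs.
random.seed(0)
labeled = [(f"doc{i:03d}", i % 2) for i in range(100)]

random.shuffle(labeled)                  # avoid ordering bias
split = int(0.7 * len(labeled))          # illustrative 70/30 split
training_set = labeled[:split]           # used to train the classifier
test_set = labeled[split:]               # held out to measure effectiveness

print(len(training_set), len(test_set))  # 70 30
```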

Automatic Indexing
Assign to each document up to k terms drawn from a controlled vocabulary
Typically reduced to a multi-label classification problem
– Each keyword corresponds to a class of documents for which that keyword is an appropriate descriptor
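A toy version of this reduction is sketched below. The three-term vocabulary is invented, and the per-keyword "classifiers" are simple term-count tests standing in for real trained models:

```python
# One binary decision per controlled-vocabulary keyword; a document is
# indexed with up to k keywords whose classifiers fire. Each "classifier"
# here is just a term-count test (a stand-in, not a trained model).
vocabulary = ["aerodynamics", "helicopters", "oceanography"]

def index_document(text, k=2):
    scores = {term: text.lower().count(term) for term in vocabulary}
    positive = [t for t, s in sorted(scores.items(), key=lambda x: -x[1]) if s > 0]
    return positive[:k]

print(index_document("helicopters and aerodynamics of rotor blades"))
# ['aerodynamics', 'helicopters']
```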

Case Study: SVM Categorization
Document collection from DTIC
– 10,000 documents previously classified manually
– Taxonomy of 25 broad subject fields, divided into a total of 251 narrower groups
– Document lengths average 2705 ± 1464 words, with 623 ± 274 significant unique terms per document

Document Collection


Sample: Broad Subject Fields
01 Aviation Technology
02 Agriculture
03 Astronomy and Astrophysics
04 Atmospheric Sciences
05 Behavioral and Social Sciences
06 Biological and Medical Sciences
07 Chemistry
08 Earth Sciences and Oceanography

Sample: Narrow Subject Groups
Aviation Technology
– 01 Aerodynamics
– 02 Military Aircraft Operations
– 03 Aircraft
  – 0301 Helicopters
  – 0302 Bombers
  – 0303 Attack and Fighter Aircraft
  – 0304 Patrol and Reconnaissance Aircraft

Distribution among Categories


Baseline
Establish a baseline for state-of-the-art machine learning techniques
– Classification: train an SVM for each subject area
– Use "off-the-shelf" document modelling and SVM libraries

Why SVM?
– Prior studies have suggested good results with SVM
– Relatively immune to "overfitting" (fitting to coincidental relations encountered during training)
– Few model parameters, which avoids problems of optimizing in high-dimensional space

Machine Learning: Support Vector Machines
Binary classifier
– Finds the hyperplane with the largest margin separating the two classes of training samples
– Subsequently classifies items based on which side of the hyperplane they fall
(Figure: training points plotted by font size vs. line number, separated by a hyperplane with its margin)

SVM Evaluation

Baseline SVM Evaluation (Interim Report)
– Training and testing process repeated for multiple subject categories
– Determine accuracy:
  – overall
  – positive (ability to recognize new documents that belong in the class the SVM was trained for)
  – negative (ability to reject new documents that belong to other classes)
– Explore training issues

SVM "Out of the Box"
– 16 broad categories with 150 or more documents
– Lucene library for extracting terms and forming weighted term vectors
– LibSVM for SVM training and testing, with no normalization or parameter tuning
– Training set of 100/100 (positive/negative samples)
– Test set of 50/50
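The shape of this pipeline (term vectors feeding a linear classifier) can be sketched in pure Python. A perceptron stands in for the SVM so the example is self-contained; the study itself used Lucene and LibSVM, and the corpus, labels, and set sizes below are toy stand-ins for the 100/100 and 50/50 sets:

```python
# Term vectors (Lucene's role) plus a linear classifier. A perceptron
# stands in for the SVM here; the actual experiments used LibSVM.

def term_vector(text, vocab):
    words = text.lower().split()
    return [words.count(t) for t in vocab]

train = [("rotor blade lift", 1), ("wing aerodynamics", 1),
         ("ocean salinity", -1), ("deep sea current", -1)]
test = [("blade aerodynamics", 1), ("sea salinity", -1)]

vocab = sorted({w for doc, _ in train for w in doc.lower().split()})

# Perceptron training: adjust weights whenever a sample is misclassified.
w = [0.0] * len(vocab)
b = 0.0
for _ in range(10):
    for doc, y in train:
        x = term_vector(doc, vocab)
        if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
            b += y

def predict(doc):
    score = sum(wi * xi for wi, xi in zip(w, term_vector(doc, vocab))) + b
    return 1 if score > 0 else -1

correct = sum(predict(doc) == y for doc, y in test)
print(correct / len(test))  # accuracy on the toy test set: 1.0
```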

Accuracy

"Out of the Box" Interpretation
Reasonable performance on broad categories given the modest training set size
– Accuracy measured as (number of correct decisions / test set size)
A related experiment showed that with normalization and optimized parameter selection, accuracy could be improved by as much as an additional 10%

Training Set Size

Training Set Size
Accuracy plateaus at training set sizes well under the number of terms in the document model

Training Issues
Training set size
– Concern: detailed subject groups may have too few known examples for effective SVM training in that subject
– Possible mitigation: the collection may have few positive examples, but it has many, many negative examples
Positive/negative training mixes
– Effects on accuracy

Increased Negative Training

Training Set Composition
Experiment performed with 50 positive training examples and out-of-the-box SVM training
– Increasing the number of negative training examples has little effect on overall accuracy, but positive accuracy is reduced

Interpretation
May indicate a weakness in SVM
– or simply further evidence of the importance of optimizing SVM parameters
May indicate the unsuitability of treating SVM output as a simple boolean decision
– might do better as a "best fit" in a multi-label classifier

Conclusions
State of the art for DTIC-like collections gives on the order of 75% accuracy
Key problems that need to be addressed
– Establish a baseline for other methods
– Validation: recognize trusted results, falling back on human intervention otherwise
– Improve on the baseline with more sophisticated methods; knowledge bases are a possible application

Additional Slides

Metadata Extraction: Machine-Learning Approach
– Learn the relationship between input and output from samples, and make predictions for new data
– Good adaptability, but the system has to be trained from samples
– Examples: HMM (Hidden Markov Model) and SVM (Support Vector Machine)

Machine Learning: Hidden Markov Models
"Hidden Markov Modeling is a probabilistic technique for the study of observed items arranged in a discrete-time series." -- Alan B. Poritz, "Hidden Markov Models: A Guided Tour," ICASSP 1988
An HMM is a probabilistic finite state automaton
– It transits from state to state
– It emits a symbol when visiting each state
– The states themselves are hidden
(Figure: example automaton with states A, B, C, D)

Hidden Markov Models
A Hidden Markov Model consists of:
– A set of hidden states (e.g., coin1, coin2, coin3)
– A set of observation symbols (e.g., H and T)
– Transition probabilities: the probability of moving from one state to another
– Emission probabilities: the probability of emitting each symbol in each state
– Initial probabilities: the probability of each state being chosen as the first state

HMM Metadata Extraction
– A document is a sequence of words produced by some hidden states (title, author, etc.)
– The parameters of the HMM are learned from samples in advance
– Metadata extraction then finds the most probable sequence of states (title, author, etc.) for a given sequence of words
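The most probable state sequence is typically found with the Viterbi algorithm. In the sketch below, the states, vocabulary, and all probabilities are invented for illustration; they are not learned from DTIC samples:

```python
# Viterbi: find the most probable hidden state sequence (title, author)
# for an observed word sequence. All parameters here are illustrative.
states = ["title", "author"]
start = {"title": 0.8, "author": 0.2}
trans = {"title": {"title": 0.7, "author": 0.3},
         "author": {"title": 0.1, "author": 0.9}}
emit = {"title": {"metadata": 0.5, "extraction": 0.4, "maly": 0.1},
        "author": {"metadata": 0.1, "extraction": 0.1, "maly": 0.8}}

def viterbi(words):
    # v[s] = probability of the best path ending in state s; paths[s] tracks it
    v = {s: start[s] * emit[s][words[0]] for s in states}
    paths = {s: [s] for s in states}
    for w in words[1:]:
        v, paths = (
            {s: max(v[p] * trans[p][s] for p in states) * emit[s][w] for s in states},
            {s: paths[max(states, key=lambda p: v[p] * trans[p][s])] + [s] for s in states},
        )
    best = max(states, key=lambda s: v[s])
    return paths[best]

print(viterbi(["metadata", "extraction", "maly"]))  # ['title', 'title', 'author']
```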

Machine Learning: Support Vector Machines
Binary classifier (classifies data into two classes)
– Represents data with pre-defined features
– Finds the hyperplane with the largest margin separating the two classes of training samples
– Classifies new data into the two classes based on which side of the hyperplane it lies
(Figure: an SVM classifying each line of a document as title or not-title using two features, font size and line number (1, 2, 3, etc.); each dot represents a line, red for title, blue for not-title)

SVM Metadata Extraction
Widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, etc.
Basic idea
– Classes correspond to metadata elements
– Extracting metadata from a document means classifying each line (or block) into the appropriate classes
– For example, to extract the document title, classify each line as to whether or not it is part of the title

Metadata Extraction: Rule-Based
Basic idea
– Use a set of rules, based on human observation, to define how to extract metadata
– For example, a rule may be "the first line is the title"
Advantages
– Can be implemented straightforwardly
– No need for training
Disadvantages
– Lack of adaptability (works only for similar documents)
– Difficult to work with a large number of features
– Difficult to tune the system when errors occur, because rules are usually fixed
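A toy rule-based extractor in the spirit of the example rule above; the rules and field names are illustrative, not the ODU system's actual rule set:

```python
# Illustrative rules: first line is the title; a line starting with
# "by " names the author; a line starting with "abstract" opens the abstract.
def extract_metadata(lines):
    metadata = {"title": lines[0].strip()}
    for line in lines[1:]:
        if line.lower().startswith("by "):
            metadata["author"] = line[3:].strip()
        elif line.lower().startswith("abstract"):
            metadata["abstract_start"] = line.strip()
    return metadata

doc = ["Automated Metadata Extraction",
       "by Kurt Maly",
       "Abstract: We describe a template-based approach."]
print(extract_metadata(doc))
```

The fragility noted above is visible even here: a document whose title spans two lines, or whose author line lacks the "by" prefix, silently breaks the rules.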

Metadata Extraction: Rule-Based (Expert Systems)
Expert system approach
– Build a large rule base using standard languages such as Prolog
– Use an existing expert system engine (for example, SWI-Prolog)
Advantages
– Can use an existing engine
Disadvantages
– Building the rule base is time-consuming
(Figure: document → parser → facts → expert system engine with knowledge base → metadata)

Metadata Extraction Experience at ODU CS
We have knowledge databases obtained from analyzing the Arc and DTIC collections:
– Authors (4 million strings from …)
– Organizations (79 from DTIC250, 200 from DTIC600)
– Universities (52 from DTIC250)