Tools for Extracting Metadata and Structure from DTIC Documents. Digital Library Group, Department of Computer Science, Old Dominion University, December 2004

Presentation transcript:

1 Tools for Extracting Metadata and Structure from DTIC Documents Digital Library Group Department of Computer Science Old Dominion University December, 2004

2 Problem Statement
- Manual extraction of metadata and logical structure is expensive.
- Metadata improves discovery and interoperability (OAI-PMH).
- Logical structure supports preservation and different presentation formats (e.g., mobile devices in the future).

3 Motivations – Metadata Extraction
- Metadata helps resource discovery.
  - A company may save about $8,200 per employee by using metadata in its intranet to reduce the time employees spend searching for, verifying, and organizing files (estimate by Mike Doane at the DCMI 2003 workshop).
- Metadata helps make collections interoperable through OAI-PMH.
- On the other hand, creating metadata manually for a large collection is expensive.
  - It would take about 60 employee-years to create metadata for 1 million documents (estimate by Lou Rosenfeld at the DCMI 2003 workshop).

4 Motivations – Logical Structure Extraction
- Converting a document into XML with logical structure helps information preservation.
  - The information in a document remains accessible, and the document can still be presented appropriately, when the software needed to open it is no longer available.
- Converting a document into XML with logical structure helps information presentation.
  - With different XSL stylesheets, an XML document can be presented differently.
  - An XML document can be presented differently on different devices such as web browsers, PDAs, etc.
  - It allows different users to have different levels of access. For example, a registered user may see all parts of a document while a guest can access only the introduction section.
- Converting a document into XML with logical structure helps information discovery.
  - It allows retrieval based on logical components, for example, searching only in the introduction.
  - It allows special searches such as equation search.
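As a rough illustration of component-based retrieval, the sketch below searches only inside one named section of an XML document. The element and attribute names (`document`, `section`, `name`, `para`) are assumptions made for this example, not the markup used by the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical logical-structure markup; the element names are illustrative only.
DOC = """<document>
  <section name="introduction"><para>Metadata improves discovery.</para></section>
  <section name="methods"><para>We train an SVM on labeled lines.</para></section>
</document>"""


def search_section(xml_text, section_name, term):
    """Return paragraphs containing `term`, looking only inside one named section."""
    root = ET.fromstring(xml_text)
    hits = []
    for section in root.findall("section"):
        if section.get("name") != section_name:
            continue  # skip every logical component except the requested one
        for para in section.iter("para"):
            if term.lower() in (para.text or "").lower():
                hits.append(para.text)
    return hits
```

Calling `search_section(DOC, "introduction", "svm")` finds nothing, because the only paragraph mentioning SVM sits in the methods section.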

5 Objectives
- To develop a flexible and adaptable approach for extracting metadata from physical collections, with a focus on DTIC (Defense Technical Information Center) collections.
- To develop techniques to extract the basic logical structure of scanned full-text documents.
- To develop techniques to extract and represent complex objects such as equations, figures, etc.

6 Background
- Metadata Extraction
  - Rule-based approach
  - Machine-learning approach
    - Hidden Markov Model
    - Support Vector Machine
- Logical Structure Extraction
  - Basic logical structure extraction
  - Reference extraction and reference linking
- OAI and Digital Library
Note: OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting, is a framework that provides interoperability among distributed collections.

7 Background - Metadata Extraction
- Rule-based approach
- Machine-learning approach
  - Hidden Markov Model
  - Support Vector Machine

8 Metadata Extraction: Rule-based
- Basic idea: use a set of rules, based on human observation, that define how to extract metadata. For example, a rule may be "The first line is the title".
- Advantages
  - Straightforward to implement
  - No training needed
- Disadvantages
  - Lacks adaptability (works only for similar documents)
  - Difficult to use with a large number of features
  - Difficult to tune when errors occur, because the rules are usually fixed
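A minimal sketch of the rule-based idea follows. The first rule is the one quoted on the slide; the second (a line consisting only of a month and a year is the date) is an assumption added for illustration and is not taken from the project.

```python
import re

# Rule 2 pattern (assumed for illustration): a line that is only "Month, YYYY".
DATE_LINE = re.compile(
    r"^(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December),?\s+\d{4}$"
)


def extract_by_rules(lines):
    """Apply two fixed, hand-written rules to the text lines of a first page."""
    nonblank = [line.strip() for line in lines if line.strip()]
    metadata = {}
    # Rule 1 (from the slide): the first line is the title.
    if nonblank:
        metadata["title"] = nonblank[0]
    # Rule 2 (hypothetical): the first line matching DATE_LINE is the date.
    for line in nonblank[1:]:
        if DATE_LINE.match(line):
            metadata["date"] = line
            break
    return metadata
```

The sketch also shows the weakness listed above: a page whose first line is a report number rather than a title would be mislabeled, and the only remedy is to edit the rules by hand.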

9 Metadata Extraction: Rule-based
Related work:
- Automated labeling algorithms for biomedical document images (Kim J, 2003)
  - Extracts metadata from the first pages of biomedical journals
  - Accuracy: title 100%, author 95.64%, abstract 95.85%, affiliation 63.13% (76 articles were used for testing)
- Document Structure Analysis Based on Layout and Textual Features (Stefan Klink, 2000)
  - Extracts metadata from the U-Wash document corpus of 979 journal pages
  - Good results for some elements (page number: 90% recall and 98% precision) but poor results for others (abstract: 35% recall and 90% precision; biography: 80% recall and 35% precision)

10 Metadata Extraction: Machine-Learning Approach
- Basic idea: learn the relationship between input and output from samples, and make predictions for new data.
- This approach has good adaptability, but it has to be trained on samples.

11 HMM - Metadata Extraction
- A document is a sequence of words produced by some hidden states (title, author, etc.).
- The parameters of the HMM are learned from samples in advance.
- Metadata extraction means finding the most probable sequence of states (title, author, etc.) for a given sequence of words.
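Finding the most probable state sequence is commonly done with the Viterbi algorithm; the slides do not say which decoder was used, so the sketch below is only one plausible reading. All probabilities are invented toy values, and `unseen` is an assumed floor probability for words a state has never emitted.

```python
import math


def viterbi(words, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most probable state sequence for `words` (log-space Viterbi)."""
    # best[i][s] = (log-probability of the best path ending in state s at word i,
    #               previous state on that path)
    best = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s].get(words[0], unseen)), None)
        for s in states
    }]
    for word in words[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + math.log(emit_p[s].get(word, unseen)), prev)
        best.append(column)
    # Backtrack from the best final state to recover the whole path.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Toy parameters, invented for this example (not learned from any corpus).
STATES = ["title", "author"]
START_P = {"title": 0.9, "author": 0.1}
TRANS_P = {
    "title": {"title": 0.7, "author": 0.3},
    "author": {"title": 0.05, "author": 0.95},
}
EMIT_P = {
    "title": {"challenges": 0.3, "in": 0.2, "building": 0.3, "federation": 0.2},
    "author": {"kurt": 0.5, "maly": 0.5},
}
```

With these toy values, `viterbi(["challenges", "in", "building", "kurt", "maly"], STATES, START_P, TRANS_P, EMIT_P)` labels the first three words as title and the last two as author.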

12 [Figure: example word sequence "Challenges in Building Federation Services over Harvested Metadata, Kurt Maly, Mohammad Zubair, 2003"]

13 HMM - Metadata Extraction
Related work:
- K. Seymore, A. McCallum, and R. Rosenfeld. Learning hidden Markov model structure for information extraction.
- Result: an overall accuracy of 90.1% was reported.

14 Support Vector Machine - general
Overview:
- It was introduced by Vapnik in the late 70s.
- It is now receiving increasing attention.
- It is widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, etc. A list of SVM applications is available at
- It is also used in text analysis (Joachims 1998, etc.) and metadata extraction (Han 2003).

15 Support Vector Machine - general
- Many decision boundaries can separate these two classes. Which one should we choose?
[Figure: points of Class 1 and Class 2. Courtesy: Martin Law]

16 Support Vector Machine - general
- Basic idea: choose the boundary that separates the two classes with the largest margin.
[Figure: Class 1 and Class 2 with the hyperplane, margin, and support vectors marked]

17 Support Vector Machine - general
- An SVM is a binary classifier (it classifies data into two classes).
  - It represents data with predefined features.
  - From samples, it finds the plane with the largest margin that separates the two classes.
  - It classifies data into the two classes based on which side of the plane they fall.
- The figure shows an SVM example that classifies a line into two classes, title and not title, using two features: font size and line number (1, 2, 3, etc.). Each dot represents a line. Red dot: title; blue dot: not title.
[Figure: axes Font size and Line number, with the hyperplane and margin marked]
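The sketch below shows only the classification step: deciding which side of a hyperplane a line falls on, and measuring the margin of a hyperplane over a set of points. It does not train an SVM. The weights `W` and bias `B` are hand-picked assumptions for the two features on the slide (font size, line number), not learned values.

```python
import math

# Hypothetical hyperplane: font_size - line_number - 10 = 0 (hand-picked, not trained).
W = (1.0, -1.0)
B = -10.0


def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Label a line, given as (font size, line number), by its side of the hyperplane."""
    return "title" if decision_value(w, b, x) >= 0 else "not title"


def geometric_margin(w, b, points):
    """Smallest distance from any point to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(decision_value(w, b, x)) / norm for x in points)
```

For instance, a line in an 18-point font on line 1 scores 18 - 1 - 10 = 7 and is labeled as title, while a 10-point line on line 5 scores -5 and is labeled as not title. Training an SVM amounts to choosing the `w` and `b` that make `geometric_margin` as large as possible over the training samples.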

18 Multi-Class SVMs
Combining binary SVMs into a multi-class classifier:
- One-vs-rest
  - Classes: in this class or not in this class
  - Positive training samples: data in this class
  - Negative training samples: the rest
  - K binary SVMs (K is the number of classes)
- One-vs-one
  - Classes: in class one or in class two
  - Positive training samples: data in this class
  - Negative training samples: data in the other class
  - K(K-1)/2 binary SVMs
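To make the two counts concrete, the following sketch enumerates the binary problems each scheme would create. The class names are examples of metadata elements chosen for illustration.

```python
from itertools import combinations


def one_vs_rest_tasks(classes):
    """One binary problem per class: (positive class, list of negative classes)."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def one_vs_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


# Example metadata classes (illustrative).
CLASSES = ["title", "author", "abstract", "affiliation"]
```

With K = 4 classes, `one_vs_rest_tasks(CLASSES)` yields 4 binary problems and `one_vs_one_tasks(CLASSES)` yields 4 * 3 / 2 = 6.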

19 SVM - Metadata Extraction
- Basic idea
  - Classes → metadata elements
  - Extracting metadata from a document → classifying each line (or block) into the appropriate class
  - For example: extracting the document title → classifying each line as part of the title or not
- Related work
  - Automatic Document Metadata Extraction Using Support Vector Machine (H. Han, 2003)
  - An overall accuracy of 92.9% was reported

20 Logical Structure Extraction Physical Structure

21 Structure Extraction Logical Structure

22 Digital Library and OAI
- Digital Library (DL)
  - A DL is a network-accessible and searchable collection of digital information.
  - A DL provides a way to store, organize, preserve, and share information.
- Interoperability problem
  - DLs are usually created separately, using different technologies and different metadata schemas.

23 Open Archives Initiative (OAI)
- The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework that provides interoperability among heterogeneous DLs.
- It is based on metadata harvesting: a service provider can harvest metadata from a data provider.
  - A data provider accepts OAI-PMH requests and provides metadata over the network.
  - A service provider issues OAI-PMH requests to get metadata and builds services on them.
- Each data provider can support its own metadata formats, but it has to support at least the Dublin Core (DC) metadata set.
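From the service provider's side, a harvesting request is an ordinary HTTP URL. The sketch below builds one; `ListRecords` and `metadataPrefix=oai_dc` are standard OAI-PMH request parameters, while the base URL is a placeholder assumed for the example.

```python
from urllib.parse import urlencode

# Placeholder data-provider endpoint (hypothetical, not a real repository).
BASE_URL = "http://example.org/oai"


def build_request(base_url, verb, **arguments):
    """Encode an OAI-PMH verb and its arguments as an HTTP GET URL."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"
```

For example, `build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")` asks the data provider for all of its records in Dublin Core format.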

24

25 Dublin Core Metadata Set
- It supports 15 elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
- All fields are optional
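A sketch of mapping extracted fields onto Dublin Core elements might look as follows. The two namespace URIs are the standard ones for unqualified Dublin Core in OAI-PMH; the field names on the left of `FIELD_TO_DC` are assumptions about what an extractor might produce, not the project's actual schema.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Assumed mapping from extractor field names (hypothetical) to Dublin Core elements.
FIELD_TO_DC = {
    "title": "title",
    "author": "creator",
    "abstract": "description",
    "date": "date",
}


def to_dublin_core(extracted):
    """Build an oai_dc record from a dict of extracted fields; unmapped fields are dropped."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    record = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, value in extracted.items():
        dc_name = FIELD_TO_DC.get(field)
        if dc_name is None:
            continue  # no Dublin Core counterpart in this mapping
        values = value if isinstance(value, list) else [value]
        for item in values:  # Dublin Core elements are repeatable
            ET.SubElement(record, f"{{{DC_NS}}}{dc_name}").text = item
    return record
```

Because every Dublin Core element is optional and repeatable, a document with two authors simply produces two `creator` elements, and a document with no abstract produces no `description` element.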

26 What does it mean to make an existing digital library OAI-enabled?
[Figure labels: Digital Library Storage; OAI Layer]
- Exposing metadata to OAI service providers: DC and parallel metadata sets
- ONLY METADATA

27 OAI Request and OAI Response
- An OAI request for metadata is embedded in HTTP.
- The OAI response to an OAI request is encoded in XML.
- The XML Schema specification for the OAI response is provided in the OAI-PMH document.
RCDL 2003, St. Petersburg

28 OAI Mechanics
- The request is encoded in HTTP.
- The response is encoded in XML.
- XML Schemas for the responses are defined in the OAI-PMH document.
Courtesy: Michael Nelson
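On the receiving end, a harvester parses the XML response. The sketch below reads identifiers and titles from a simplified, invented `ListRecords` response; a real response carries additional elements (such as the response date and the echoed request) that are omitted here. The namespace URIs are the standard OAI-PMH and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Simplified sample response, invented for this example.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample Report</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return (identifier, title) pairs for every record in a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iter(f"{{{NS['oai']}}}record"):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        results.append((identifier, title))
    return results
```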

29 Overall Approach & Architecture*
*This is our overall vision; only some components of this architecture are being implemented as part of the current contract.

30 Overall Approach & Architecture
Main components:
- Scan and OCR: Commercial OCR software is used to scan the documents.
- Metadata Extractor: Extracts metadata by using rules and machine learning techniques. The extracted metadata are stored in a local database. To support Dublin Core, it may be necessary to map the extracted metadata to the Dublin Core format.
- Object Digitization: Converts documents into XML format for better preservation and presentation. The main tasks:
  - Extraction of complex objects such as figures
  - Extraction of document logical structure
  - Extraction of references and reference linking
- OAI layer: Makes the digital collection interoperable. The OAI layer accepts all OAI requests, gets the information from the database, and encodes the metadata into XML format as responses.
- Search Engine

31 Metadata Extraction
- A challenge is how to reach the desired accuracy for a large heterogeneous collection.
  - Defining by hand, in advance, a set of rules that covers all situations is difficult.
  - Machine learning requires many labeled samples. For example, an HMM-based name recognizer used about 1.2 million words of training data to achieve high accuracy (according to Douglas E. Appelt).
- Accuracy is the ratio of the number of items tagged correctly to the total number of items.
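The accuracy definition above translates directly into a few lines of code. The labels in the usage note are made-up examples.

```python
def accuracy(predicted, actual):
    """Fraction of items whose predicted tag equals the true tag."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("accuracy is undefined for an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

If three of four lines are tagged correctly, for example predicted `["title", "author", "author", "abstract"]` against actual `["title", "author", "affiliation", "abstract"]`, the accuracy is 3/4 = 0.75.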

32 Metadata Extraction
Feasible solution:
- Classify documents into classes; documents in the same class have a similar layout.
- Work on each document class separately instead of on the whole large collection.

33 Metadata Extraction (Cont.)
Overall approach for handling a large collection:
- Manual classification
  - This approach assumes that people can classify the large set of documents into classes (based on time period, source organization, etc.).
  - For each class, randomly select, say, 100 documents and develop a template.
  - Evaluate the template by statistical sampling and refine it until the error is under a tolerance level.
  - Then apply the refined template to the whole set.
- Auto-classification
  - This approach assumes that it is not humanly possible to classify the large set of documents.
  - In this case, we develop a higher-level set of rules for classification on a smaller sample.
  - Evaluate the classification approach by statistical sampling.
  - Then develop the template for each class, apply it, and refine it as outlined in the manual classification approach.
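The evaluate-and-refine loop can be sketched as below. In the approach described, refinement is done by a person; here it is modeled, purely as an assumption, by a list of successive template versions, each a function from a document to its extracted value. The names `extract`, `is_correct`, and `templates` are hypothetical.

```python
import random


def estimate_error_rate(documents, extract, is_correct, sample_size, rng):
    """Estimate a template's error rate on a random sample of one document class."""
    sample = rng.sample(documents, min(sample_size, len(documents)))
    errors = sum(not is_correct(doc, extract(doc)) for doc in sample)
    return errors / len(sample)


def refine_until_tolerable(documents, templates, is_correct, tolerance,
                           sample_size=100, seed=0):
    """Try successive template versions until the sampled error is within tolerance.

    Returns (index of the accepted version, its sampled error rate), or
    (None, last sampled error rate) if no version is good enough.
    """
    rng = random.Random(seed)  # fixed seed so the sampling is repeatable
    rate = None
    for version, extract in enumerate(templates):
        rate = estimate_error_rate(documents, extract, is_correct, sample_size, rng)
        if rate <= tolerance:
            return version, rate
    return None, rate
```

Only the accepted version would then be applied to the whole class. The sampled error rate is an estimate, so a larger sample gives more confidence that the true error is under the tolerance.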

34 Metadata Extraction (Cont.)

35 Preliminary Experiments
- Performance measures
- SVM experiments with different data sets
- Pure rule-based experiment

36 DEMO