Converting Existing Corpus to an OAI Compliant Repository
J. Tang, K. Maly, and M. Zubair
Department of Computer Science, Old Dominion University
February 2005
ICM 2005, Mumbai, Feb 21-25, 2005

Contents
- Introduction
- Background
- Overall Architecture
- Metadata Extraction Approach
- Experiments
- Conclusion

[Figure: Documents -> Scan & OCR -> Online documents -> ?]

Introduction
- Why we need to go further
  - The lack of metadata for these resources hampers their discovery and dissemination over the Web.
  - The lack of metadata also hampers interoperability between these resources and resources from other organizations.
- Benefits of using metadata
  - Metadata helps resource discovery.
    - A company may save about $8,200 per employee by using metadata in its intranet to reduce the time employees spend searching, verifying, and organizing files (estimate by Mike Doane at the DCMI 2003 workshop).
  - Metadata helps make collections interoperable through OAI-PMH.

Introduction (cont.)
- How to get this metadata
  - Creating metadata manually for a large collection is expensive.
    - It would take about 60 employee-years to create metadata for 1 million documents (estimate by Lou Rosenfeld at the DCMI 2003 workshop).
  - The enormous cost of manual metadata creation creates strong demand for automated metadata extraction tools.

Introduction (cont.)
- Our main objective is to automate the task of building an interoperable digital library, starting from a legacy collection of printed documents or of documents scanned into TIFF or PDF format. Specifically:
  - To develop a flexible and adaptable approach for extracting metadata from physical collections, with a focus on DTIC (Defense Technical Information Center) collections.
  - To develop efficient ways of integrating OCR extraction processes with an interoperable digital library.
  - To integrate the techniques and tools developed for metadata extraction into a test bed that moves the DTIC legacy collection into an interoperable digital library framework.
  - To evaluate the effectiveness of the automation process.

Background
- OAI and Digital Library
- Metadata Extraction Approaches

Digital Library and OAI
- Digital Library (DL)
  - A DL is a network-accessible and searchable collection of digital information.
  - A DL provides a way to store, organize, preserve, and share information.
- Interoperability problem
  - DLs are usually created separately, using different technologies and different metadata schemas.

Open Archives Initiative (OAI)
- The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework that provides interoperability among heterogeneous DLs.
  - It is based on metadata harvesting: a service provider harvests metadata from a data provider.
  - A data provider accepts OAI-PMH requests and serves metadata over the network.
  - A service provider issues OAI-PMH requests to get metadata and builds services on it.
- Each data provider can support its own metadata formats, but it must support at least the Dublin Core (DC) metadata set.
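
As a rough illustration of what a service provider sends to a data provider, the sketch below builds an OAI-PMH request URL. The six verb names and the `oai_dc` metadata prefix come from the OAI-PMH 2.0 specification; the endpoint `http://example.org/oai` and the helper name `build_oai_request` are assumptions made for this example, not part of the system described in these slides.

```python
from urllib.parse import urlencode

# Hypothetical data-provider endpoint (an assumption for illustration).
BASE_URL = "http://example.org/oai"

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_oai_request(base_url, verb, **arguments):
    """Return the URL a service provider would fetch for one request."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    params.update(arguments)
    return base_url + "?" + urlencode(params)


# Ask the data provider for all records in Dublin Core format.
print(build_oai_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
```

The data provider would answer such a request with an XML document; harvesting is then a matter of issuing these requests and storing the returned metadata.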

Dublin Core Metadata Set
- It supports 15 elements:
  - Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
- All fields are optional.
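
To make the element set concrete, here is a minimal sketch that serializes a record into the `oai_dc` XML form used by OAI-PMH. The two namespace URIs are the standard ones; the function name `to_oai_dc` and the sample record are invented for illustration. Because every field is optional, elements missing from the record are simply skipped.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_oai_dc(record):
    """Encode a dict of {element name: list of values} as oai_dc XML."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name in DC_ELEMENTS:
        # Every element is optional and may repeat.
        for value in record.get(name, []):
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample record, not taken from any real collection.
sample = {
    "title": ["An Example Technical Report"],
    "creator": ["A. Example", "B. Sample"],
    "date": ["1999-03"],
}
print(to_oai_dc(sample))
```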

Metadata Extraction: Rule-based
- Basic idea
  - Use a set of rules, based on human observation, to define how to extract metadata. For example, a rule may be "the first line is the title".
- Advantages
  - Straightforward to implement
  - No need for training
- Disadvantages
  - Lack of adaptability (works only for similar documents)
  - Difficult to work with a large number of features
  - Difficult to tune the system when errors occur, because rules are usually fixed
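
The example rule above can be written in a few lines, which also shows why fixed rules adapt poorly. This is only a sketch: the function name `extract_title` and both sample pages are made up for illustration.

```python
def extract_title(page_text):
    """Rule: the first non-blank line of the page is the title."""
    for line in page_text.splitlines():
        if line.strip():
            return line.strip()
    return None


# A page laid out the way the rule expects.
page_a = "A Study of Widget Fatigue\nA. Example\nAbstract ..."
# A page from a different layout, with a report number on top.
page_b = "REPORT NO. 1234\nA Study of Widget Fatigue\nA. Example"

print(extract_title(page_a))
print(extract_title(page_b))
```

On the second page the rule returns the report number instead of the title, and nothing in the rule can detect or correct that.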

Metadata Extraction: Rule-based
- Related work
  - Automated labeling algorithms for biomedical document images (Kim J, 2003)
    - Extracts metadata from the first pages of biomedical journals
    - Accuracy: title 100%, author 95.64%, abstract 95.85%, affiliation 63.13% (tested on 76 articles)
  - Document Structure Analysis Based on Layout and Textual Features (Stefan Klink, 2000)
    - Extracts metadata from the U-Wash document corpus of 979 journal pages
    - Good results for some elements (page number: 90% recall, 98% precision) but poor results for others (abstract: 35% recall, 90% precision; biography: 80% recall, 35% precision)

Metadata Extraction: Machine-Learning Approach
- Basic idea
  - Learn the relationship between input and output from samples, then make predictions for new data.
  - This approach has good adaptability, but it has to be trained on samples.

System Architecture

System Architecture (cont.)
- Main components:
  - Scan and OCR: commercial OCR software is used to scan the documents.
  - Metadata Extractor: extracts metadata using rules and machine-learning techniques. The extracted metadata is stored in a local database. To support Dublin Core, the extracted metadata may need to be mapped to Dublin Core format.
  - OAI layer: makes the digital collection interoperable. It accepts all OAI requests, gets the information from the database, and encodes the metadata in XML for the responses.
  - Search Engine
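
The mapping step mentioned for the Metadata Extractor could look roughly like the sketch below. The slides do not give the local field names or the actual mapping, so `LOCAL_TO_DC`, its entries, and `map_to_dublin_core` are all assumptions chosen for illustration.

```python
# Hypothetical local field names mapped to Dublin Core element names.
# The real mapping used by the system is not given in the slides.
LOCAL_TO_DC = {
    "title": "title",
    "author": "creator",
    "abstract": "description",
    "report_date": "date",
    "report_number": "identifier",
}


def map_to_dublin_core(extracted):
    """Split extracted fields into a DC record and leftover fields."""
    dc_record = {}
    unmapped = {}
    for field, value in extracted.items():
        dc_name = LOCAL_TO_DC.get(field)
        if dc_name is None:
            # No DC counterpart chosen: keep it for the local database only.
            unmapped[field] = value
        else:
            # DC elements are repeatable, so store values in a list.
            dc_record.setdefault(dc_name, []).append(value)
    return dc_record, unmapped


# Made-up extraction result for one document.
extracted = {
    "title": "A Study of Widget Fatigue",
    "author": "A. Example",
    "report_date": "March 1999",
    "page_count": "42",
}
dc_record, unmapped = map_to_dublin_core(extracted)
print(dc_record)
print(unmapped)
```

Keeping the unmapped fields separate means the local database can retain richer metadata than the Dublin Core view exposed through the OAI layer.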

Metadata Extraction (cont.)

Metadata Extraction (cont.)
- Rule-based module
  - Classify documents into classes based on similarity.
  - For each document class, create a template (a set of rules).
  - Decouple rules from code: a template is kept in a separate file.
- Benefits
  - Easy to extend: for a new document class, just create a template.
  - Rules are simpler.
  - Rules can be refined easily.
[Figure: each document is paired with the template for its class (Doc1 with template1, Doc2 and Doc3 with template2) and passed to metadata extraction, which outputs metadata.]
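
One way to picture rules decoupled from code is a small generic engine that reads a per-class template. The sketch below is an assumption throughout: the JSON template format, the rule types (`line` and `pattern`), and the function `apply_template` are invented here and are not the template language used in the system. The template text is inlined as a string so the example runs on its own; in practice it would live in its own file.

```python
import json
import re

# Hypothetical template for one document class.
TEMPLATE_TEXT = """
{
  "class": "example-class",
  "rules": [
    {"field": "title", "line": 0},
    {"field": "author", "pattern": "^Authors?: (.+)$"},
    {"field": "date", "pattern": "^Report Date: (.+)$"}
  ]
}
"""


def apply_template(template, lines):
    """Apply every rule of a template to the lines of one document."""
    metadata = {}
    for rule in template["rules"]:
        if "line" in rule:
            # Positional rule: take the line at a fixed index.
            index = rule["line"]
            if index < len(lines):
                metadata[rule["field"]] = lines[index].strip()
        elif "pattern" in rule:
            # Pattern rule: first line matching the regular expression.
            regex = re.compile(rule["pattern"])
            for line in lines:
                match = regex.search(line)
                if match:
                    metadata[rule["field"]] = match.group(1).strip()
                    break
    return metadata


template = json.loads(TEMPLATE_TEXT)
document = [
    "A Study of Widget Fatigue",
    "Authors: A. Example and B. Sample",
    "Report Date: March 1999",
]
print(apply_template(template, document))
```

Supporting a new document class then means writing another template, with no change to `apply_template`.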

Metadata Extraction (cont.)
- Machine-learning module: SVM with HMM
  - SVM is good at working with a large number of features but is not good at capturing correlations between features.
    - For example, a section that precedes an author section is most likely a title section.
  - HMM is good at working with events in a sequence but is expensive with a large number of features.
- Integration
  - SVM works with the large feature set and produces probabilistic results (e.g., title 54%, author 30%, abstract 16%).
  - HMM combines the SVM results with the probabilities of transitioning from one metadata element to another to produce the final results.
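
The integration can be pictured as Viterbi decoding over per-line SVM probabilities. The sketch below is a simplification under stated assumptions: it treats the SVM probabilities directly as emission scores, and the `START` and `TRANS` values are invented for illustration rather than learned from data. Only the first row of `svm_probs` (54%, 30%, 16%) comes from the slide; the other rows are made up.

```python
import math

LABELS = ["title", "author", "abstract"]

# Assumed start and transition probabilities (illustrative values only).
START = {"title": 0.8, "author": 0.1, "abstract": 0.1}
TRANS = {
    "title": {"title": 0.3, "author": 0.6, "abstract": 0.1},
    "author": {"title": 0.05, "author": 0.35, "abstract": 0.6},
    "abstract": {"title": 0.05, "author": 0.05, "abstract": 0.9},
}


def viterbi(svm_probs):
    """Return the most likely label sequence; all probabilities must be > 0."""
    scores = {
        label: math.log(START[label]) + math.log(svm_probs[0][label])
        for label in LABELS
    }
    backpointers = []
    for probs in svm_probs[1:]:
        new_scores, pointers = {}, {}
        for cur in LABELS:
            # Best previous label for reaching `cur` at this position.
            prev = max(LABELS, key=lambda p: scores[p] + math.log(TRANS[p][cur]))
            new_scores[cur] = (
                scores[prev] + math.log(TRANS[prev][cur]) + math.log(probs[cur])
            )
            pointers[cur] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(LABELS, key=lambda label: scores[label])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# One probability row per line of the document header.
svm_probs = [
    {"title": 0.54, "author": 0.30, "abstract": 0.16},
    {"title": 0.45, "author": 0.40, "abstract": 0.15},
    {"title": 0.10, "author": 0.20, "abstract": 0.70},
]
svm_only = [max(LABELS, key=lambda label: row[label]) for row in svm_probs]
print("SVM alone:", svm_only)
print("SVM + HMM:", viterbi(svm_probs))
```

With these toy numbers the SVM alone labels the second line as a title, while the decoded sequence labels it as an author line, because a title is likely to be followed by an author section.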

Metadata Extraction Approach (cont.)
- Integrating the rule-based approach with the machine-learning approach
  - Integrating the machine-learning approach with our rule-based approach aims to overcome two drawbacks of a rule-based system:
    - Lack of auto-correction ability
    - Lack of statistical foundation
  - The results from the two modules are integrated directly.

Experiments
- Performance measures
- SVM experiments with different data sets
- Pure rule-based experiment

Performance Measure
- For an individual metadata element:
  - Precision = TT / (TT + FT)
  - Recall = TT / (TT + TF)
  - Accuracy = (TT + FF) / (TT + TF + FT + FF)
- Overall accuracy is the ratio of the number of items classified correctly to the total number of items.

                          Classified
                          In class   Not in class
Original   In class       TT         TF
           Not in class   FT         FF
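
The three per-element measures translate directly into code. In the sketch below the function name `measures` and the counts are made up for illustration; they are not results from the experiments.

```python
def measures(tt, tf, ft, ff):
    """Precision, recall, and accuracy from the four confusion counts.

    tt: in class, classified in class
    tf: in class, classified not in class
    ft: not in class, classified in class
    ff: not in class, classified not in class
    """
    precision = tt / (tt + ft) if (tt + ft) else 0.0
    recall = tt / (tt + tf) if (tt + tf) else 0.0
    accuracy = (tt + ff) / (tt + tf + ft + ff)
    return precision, recall, accuracy


# Hypothetical counts for one metadata element over 100 items.
precision, recall, accuracy = measures(tt=18, tf=2, ft=3, ff=77)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```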

SVM Experiments with different data sets
- Objective: evaluate the performance of SVM on different data sets to see how well SVM works for metadata extraction.
- Data sets
  - Data Set 1: Seymore935
    - Downloaded, manually tagged document headers
    - The first 500 are used for training and the rest for testing
  - Data Set 2: DTIC100
    - 100 PDF files selected from the DTIC website, based on the Z39.18 standard
    - The first pages were OCRed and converted to text format
    - The 100 document headers were tagged manually
    - The first 75 are used for training and the rest for testing

SVM Experiments with different data sets
  - Data Set 3: DTIC33
    - A subset of the tagged DTIC document headers with identical layout
    - The first 24 are used for training and the rest for testing
[Figure: the data sets ordered by increasing heterogeneity: DTIC33, Seymore935, DTIC100]

SVM Experiments with different data sets
[Figure: overall accuracy of title, author, affiliation, and date]

Pure rule-based experiment
- Objective: evaluate the performance of our rule-based approach, which defines a template for each class.
- Experiment
  - Uses data set DTIC100: 100 XML files with font size and bold information.
  - The set is divided into 7 classes according to layout information.
  - For each class, a template is developed after checking the first one or two documents in the class.
  - The template is then applied to the remaining documents to obtain performance data.

Pure rule-based experiment

Screenshots – OAI

Screenshots – Search Engine

Conclusion
- It is feasible to extract metadata with higher accuracy from scanned documents of a homogeneous collection.
- Future issues:
  - Heterogeneous collections
  - Extracting the whole structure, including complex objects