ELPUB 2006, June 14-16, Bansko, Bulgaria. Automated Building of OAI Compliant Repository from Legacy Collection. Kurt Maly, Department of Computer Science, Old Dominion University.

Presentation transcript:

Automated Building of OAI Compliant Repository from Legacy Collection
Kurt Maly, Department of Computer Science, Old Dominion University
May 2006
ELPUB 2006, June 14-16, Bansko, Bulgaria

Contents
- Introduction
- Background
- System Architecture
- Metadata Extraction Approach
- Experiments
- Screenshots

Introduction
- Key problem: extracting metadata from a legacy collection.
- OCR alone is not sufficient to make legacy documents searchable.
- Manual metadata extraction is costly and time-consuming: creating metadata for 1 million documents would take about 60 employee-years (estimate by Lou Rosenfeld at the DCMI 2003 workshop).
- Automatic extraction tools are essential for rapid dissemination at reasonable cost.
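To put the 60 employee-year estimate in per-document terms, the arithmetic can be sketched as below. The figure of 2000 working hours per employee-year is an assumption made here for illustration and does not come from the slides.

```python
def minutes_per_document(employee_years, documents, hours_per_year=2000):
    """Average manual effort per document, in minutes.

    hours_per_year is an assumed value (2000 h), not a number from the talk.
    """
    total_minutes = employee_years * hours_per_year * 60
    return total_minutes / documents


# 60 employee-years spread over 1 million documents
print(minutes_per_document(60, 1_000_000))
```

Under that assumption the estimate corresponds to roughly seven minutes of manual work per document.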

Background: Digital Library and OAI-PMH
- Digital Library (DL): a network-accessible, searchable collection of digital information. A DL provides a way to store, organize, preserve, and share information.
- The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework that provides interoperability among heterogeneous DLs.
  - Based on metadata harvesting
  - Two roles: Data Providers and Service Providers
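As a rough illustration of metadata harvesting, a service provider issues HTTP requests that carry one of the six OAI-PMH verbs to a data provider. The sketch below only builds such a request URL; the base URL is a placeholder and the helper name build_oai_request is hypothetical, not part of the system described in the talk.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def build_oai_request(base_url, verb, **arguments):
    """Return the URL a harvester would fetch for the given verb."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    params.update(arguments)
    return base_url + "?" + urlencode(params)


# Placeholder data provider address, used only for illustration.
print(build_oai_request("http://repository.example.org/oai",
                        "ListRecords", metadataPrefix="oai_dc"))
```

A harvester would fetch this URL and parse the XML response; that step is left out here.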

Background: Metadata Extraction
Rule-based approach
- Basic idea: use a set of rules, written from human observation, that define how to extract metadata. For example, a rule may be "The first line is the title".
- Pros and cons:
  - No need for training from samples
  - Can extract different metadata from different documents
  - Rule writing may require significant technical expertise
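A minimal sketch of what a rule-based extractor might look like follows. The first rule is the one quoted on the slide; the second rule (a line starting with "by" names the author) and the function name extract_by_rules are assumptions added for illustration.

```python
import re


def extract_by_rules(text):
    """Apply two hand-written rules to the OCR text of a page."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    metadata = {}
    if lines:
        # Rule from the slide: the first (non-empty) line is the title.
        metadata["title"] = lines[0]
    for line in lines[1:]:
        # Hypothetical rule: a line of the form "by <name>" gives the author.
        match = re.match(r"^by\s+(.+)$", line, re.IGNORECASE)
        if match:
            metadata["author"] = match.group(1)
            break
    return metadata


sample = "Automated Building of Repositories\n\nby Kurt Maly\nOld Dominion University"
print(extract_by_rules(sample))
```

The sketch also shows the drawback noted above: every new layout needs someone to write and debug further rules of this kind.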

Background: Metadata Extraction (cont.)
Machine-learning approach
- Basic idea: learn the relationship between input and output from samples, then make predictions for new data.
- Pros and cons:
  - Good adaptability, but the model has to be trained from samples, which is time-consuming
  - Performance degrades with increasing heterogeneity
  - Difficult to add new fields to be extracted
  - Difficult to select the right features for training
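To make "learn from samples, then predict" concrete, here is a toy sketch that labels a line as "title" or "body" with a nearest-centroid classifier. The classifier, the three features, and the training lines are all assumptions chosen for illustration; the slides do not say which learning method was used.

```python
def line_features(line, position):
    """Hand-picked features: line position, word count, share of capitalized words."""
    words = line.split()
    capitalized = sum(1 for w in words if w[:1].isupper())
    return (float(position), float(len(words)), capitalized / max(len(words), 1))


def train_centroids(samples):
    """Average the feature vectors of each label; samples are (line, position, label)."""
    sums, counts = {}, {}
    for line, position, label in samples:
        vec = line_features(line, position)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}


def predict(centroids, line, position):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    vec = line_features(line, position)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Made-up training samples, for illustration only.
training = [
    ("Automated Metadata Extraction From Legacy Documents", 0, "title"),
    ("Digital Libraries And Interoperability", 0, "title"),
    ("this report describes the experiments we ran on the collection", 5, "body"),
    ("the results show that accuracy improves with more templates", 8, "body"),
]
model = train_centroids(training)
print(predict(model, "Building An OAI Repository", 0))
```

Choosing the three features by hand is exactly the feature-selection difficulty listed above, and adding a new field means collecting new labeled samples and retraining.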

Background: Document Classification
Classify document pages into groups based on their visual similarity:
- the geometrical arrangement of components
- typographic features such as font
Existing approaches:
- MXY-Tree: recursively cuts a page into blocks by separators (e.g. lines) as well as white space. A page is converted to a tree.
- M*N bins: cuts a page into m*n equal-size bins; a bin is either a text bin (if more than half of it is text) or a white-space bin.
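The m*n bins idea can be sketched as follows. The page is assumed to be a small binary raster (1 marks a text pixel, 0 white space) whose height and width divide evenly by m and n; that representation and the function name page_to_bins are assumptions for illustration.

```python
def page_to_bins(page, m, n):
    """Cut a binary page raster into m*n equal bins; True marks a text bin."""
    height, width = len(page), len(page[0])
    bin_h, bin_w = height // m, width // n
    bins = []
    for i in range(m):
        row = []
        for j in range(n):
            cells = [
                page[y][x]
                for y in range(i * bin_h, (i + 1) * bin_h)
                for x in range(j * bin_w, (j + 1) * bin_w)
            ]
            # A bin is a text bin only if more than half of its cells are text.
            row.append(sum(cells) * 2 > len(cells))
        bins.append(row)
    return bins


page = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(page_to_bins(page, 2, 2))
```

Two pages with the same grid of text and white-space bins have a similar coarse layout, which is what makes the representation usable for grouping pages.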

System Architecture

System Architecture (cont.)
Main components:
- Scan and OCR: commercial OCR software is used to scan the documents.
- Metadata Extractor: extracts metadata using rules and machine-learning techniques. The extracted metadata are stored in a local database. To support Dublin Core, it may be necessary to map the extracted metadata to Dublin Core format.
- OAI layer: makes the digital collection interoperable. The OAI layer accepts all OAI requests, gets the information from the database, and encodes the metadata in XML as responses.
- Search Engine
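The mapping to Dublin Core and the XML encoding done by the OAI layer might look roughly like the sketch below. The local field names in FIELD_MAP and the function name to_oai_dc are hypothetical; only the two namespace URIs are the standard ones for the oai_dc metadata format.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)

# Hypothetical local database fields mapped to Dublin Core element names.
FIELD_MAP = {
    "report_title": "title",
    "personal_author": "creator",
    "report_date": "date",
}


def to_oai_dc(record):
    """Encode one extracted record as an oai_dc XML fragment."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for local_name, dc_name in FIELD_MAP.items():
        value = record.get(local_name)
        if value:
            child = ET.SubElement(root, f"{{{DC_NS}}}{dc_name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up record, for illustration only.
record = {"report_title": "A Sample Technical Report", "personal_author": "Doe, J."}
print(to_oai_dc(record))
```

In a full OAI layer this fragment would sit inside the metadata element of a GetRecord or ListRecords response; the envelope is omitted here.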

Template-Based Metadata Extraction

Template-Based Metadata Extraction: Document Classification
- Classify documents into groups based on the visual similarity of their metadata pages (pages rich in metadata):
  - the geometrical arrangement of metadata fields on the metadata page
  - typographic features such as font size, text alignment, and text height
- Metadata pages are identified by a set of rules.

Template-Based Metadata Extraction: Document Classification
[Figure: document pages; MXY-Tree; m*n bins; similarity; integration]
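The slide names two page representations (MXY-Tree and m*n bins) and an integration step, but does not say how the two similarity scores are combined. The sketch below assumes a simple weighted average followed by greedy grouping against a threshold; the weight, the threshold, and both function names are hypothetical.

```python
def combined_similarity(tree_sim, bins_sim, weight=0.5):
    """Assumed integration rule: weighted average of the two similarity scores."""
    return weight * tree_sim + (1 - weight) * bins_sim


def group_pages(page_ids, similarity, threshold=0.8):
    """Put each page into the first group whose first member is similar enough."""
    groups = []
    for page_id in page_ids:
        for group in groups:
            if similarity(page_id, group[0]) >= threshold:
                group.append(page_id)
                break
        else:
            groups.append([page_id])
    return groups


# Made-up (tree, bins) similarity scores for three pages.
scores = {
    frozenset(("a", "b")): (0.9, 0.9),
    frozenset(("a", "c")): (0.2, 0.4),
    frozenset(("b", "c")): (0.3, 0.3),
}


def page_similarity(p, q):
    tree_sim, bins_sim = scores[frozenset((p, q))]
    return combined_similarity(tree_sim, bins_sim)


print(group_pages(["a", "b", "c"], page_similarity))
```

Each resulting group would then get its own extraction template.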

Template sample

Experiments: Document Classification
- Downloaded 7413 documents from the DTIC collection
- Randomly selected 200, 400, 800, 1200, 2000, 3000, 4000, 5000, and 6000 documents and classified them into groups

Experiments: Metadata Extraction
Selected 100 documents from DTIC, divided them into 7 classes, and created a template for each class.

Template-based experiment

Screenshots: OAI

Screenshots: Search Engine

Conclusions
- We describe how to automate the task of converting an existing corpus into an OAI-compliant repository.
- We propose a metadata extraction approach that addresses the challenge of achieving the desired accuracy on a large, heterogeneous collection of documents.