ELPUB 2006, June 14-16, Bansko, Bulgaria

Slide 1. Automated Building of OAI Compliant Repository from Legacy Collection
Kurt Maly (Maly@cs.odu.edu)
Department of Computer Science, Old Dominion University
May 2006

Slide 2. Contents
- Introduction
- Background
- System Architecture
- Metadata Extraction Approach
- Experiments
- Screenshots

Slide 3. Introduction
- Key problem: extracting metadata from a legacy collection
- OCR is not sufficient for making ‘legacy’ documents searchable.
- Manual metadata extraction is costly and time-consuming.
  - It would take about 60 employee-years to create metadata for 1 million documents (estimate by Lou Rosenfeld at the DCMI 2003 workshop).
- Automatic extraction tools are essential for rapid dissemination at reasonable cost.

Slide 4. Background: Digital Library and OAI-PMH
- Digital Library (DL)
  - A DL is a network-accessible and searchable collection of digital information.
  - A DL provides a way to store, organize, preserve, and share information.
- Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
  - A framework to provide interoperability among heterogeneous DLs
  - Based on metadata harvesting
  - Two roles: Data Providers (expose metadata) and Service Providers (harvest it to build services)
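As an illustration of metadata harvesting, the following minimal sketch shows how a Service Provider might form OAI-PMH requests to a Data Provider. The verbs (`ListRecords`, `GetRecord`) and the `metadataPrefix=oai_dc` argument come from the OAI-PMH protocol; the base URL, the record identifier, and the helper name `oai_request_url` are assumptions made for this example and do not appear in the slides.

```python
from urllib.parse import urlencode

# Hypothetical base URL of a Data Provider (an assumption, not from the slides).
BASE_URL = "http://example.org/oai"


def oai_request_url(verb, **params):
    """Build an OAI-PMH request URL for the given verb and its arguments."""
    query = {"verb": verb}
    query.update(params)
    return BASE_URL + "?" + urlencode(query)


# Harvest all records in unqualified Dublin Core.
list_records = oai_request_url("ListRecords", metadataPrefix="oai_dc")

# Fetch one record; the identifier is a made-up example.
get_record = oai_request_url(
    "GetRecord", identifier="oai:example.org:1", metadataPrefix="oai_dc"
)
```

The sketch only builds the URLs; an actual harvester would issue the HTTP request and parse the XML response.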

Slide 5. Background: Metadata Extraction
Rule-Based Approach
- Basic idea
  - Use a set of rules, based on human observation, that define how to extract metadata.
  - For example, a rule may be "The first line is the title".
- Pros and cons
  - No need for training from samples
  - Can extract different metadata from different documents
  - Rule writing may require significant technical expertise
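A rule-based extractor can be sketched in a few lines. The first rule below is the slide's own example ("the first line is the title"); the second rule (a line consisting only of a four-digit year is the date) and the function name `extract_by_rules` are hypothetical, added only to show how further rules would be stacked.

```python
import re


def extract_by_rules(lines):
    """Apply simple hand-written rules to the text lines of one page."""
    nonblank = [ln.strip() for ln in lines if ln.strip()]
    metadata = {}

    # Rule from the slide: the first (non-blank) line is the title.
    if nonblank:
        metadata["title"] = nonblank[0]

    # Hypothetical rule: a later line that is only a 4-digit year is the date.
    for ln in nonblank[1:]:
        if re.fullmatch(r"(19|20)\d{2}", ln):
            metadata["date"] = ln
            break

    return metadata


page = ["", "  A Report on X ", "John Doe", "1998"]
result = extract_by_rules(page)
```

Each new field needs a new hand-written rule, which is where the technical expertise noted on the slide comes in.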

Slide 6. Background: Metadata Extraction (cont.)
Machine-Learning Approach
- Basic idea
  - Learn the relationship between input and output from samples and make predictions for new data.
- Pros and cons
  - Good adaptability, but it has to be trained from samples, which is time-consuming
  - Performance degrades with increasing heterogeneity
  - Difficult to add new fields to be extracted
  - Difficult to select the right features for training

Slide 7. Background: Document Classification
- Classify document pages into groups based on their visual similarity:
  - the geometrical arrangement of components
  - the typographic features, such as font
- Existing approaches
  - MXY-Tree: recursively cuts a page into blocks by separators (e.g. lines) as well as white spaces. A page is converted to a tree.
  - M*N bins: cuts a page into m*n equal-size bins; a bin is either a text bin (if more than half of it is text) or a white-space bin.
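The M*N bins representation can be sketched as follows. The slide only states that a page is cut into m*n equal-size bins and that a bin counts as text when more than half of it is text; the input format (a 2D grid of 0/1 cells where 1 marks text), the function name `page_to_bins`, and the requirement that the page divide evenly are assumptions of this example.

```python
def page_to_bins(page, m, n):
    """Cut a page into m x n equal-size bins and label each bin.

    `page` is a 2D list of 0/1 cells (1 = text). Returns an m x n grid of
    booleans: True for a text bin, False for a white-space bin.
    Assumes the page height is divisible by m and the width by n.
    """
    rows, cols = len(page), len(page[0])
    bin_h, bin_w = rows // m, cols // n
    bins = []
    for i in range(m):
        row_bins = []
        for j in range(n):
            cells = [
                page[r][c]
                for r in range(i * bin_h, (i + 1) * bin_h)
                for c in range(j * bin_w, (j + 1) * bin_w)
            ]
            # Text bin only if strictly more than half of the cells are text.
            row_bins.append(sum(cells) * 2 > len(cells))
        bins.append(row_bins)
    return bins


page = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
bins = page_to_bins(page, 2, 2)
```

Two pages can then be compared bin by bin; how the slides' system scores that comparison is not specified here.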

Slide 8. System Architecture

Slide 9. System Architecture (cont.)
Main components:
- Scan and OCR: commercial OCR software is used to scan the documents.
- Metadata Extractor: extracts metadata using rules and machine-learning techniques. The extracted metadata are stored in a local database. To support Dublin Core, it may be necessary to map the extracted metadata to Dublin Core format.
- OAI layer: makes the digital collection interoperable. The OAI layer accepts all OAI requests, gets the information from the database, and encodes the metadata into XML format as responses.
- Search Engine
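The mapping to Dublin Core and the XML encoding done by the OAI layer might look like the sketch below. The two namespace URIs are the standard `oai_dc` and Dublin Core element namespaces; the local field names (`author`, `report_date`, `abstract`), the mapping table `FIELD_TO_DC`, and the function `to_oai_dc` are hypothetical, since the slides do not describe the database schema.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from local database fields to Dublin Core elements.
FIELD_TO_DC = {
    "title": "title",
    "author": "creator",
    "report_date": "date",
    "abstract": "description",
}


def to_oai_dc(record):
    """Encode one extracted record as an oai_dc XML fragment."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, value in record.items():
        dc_name = FIELD_TO_DC.get(field)
        if dc_name is None:
            continue  # Field has no Dublin Core equivalent in this mapping.
        child = ET.SubElement(root, f"{{{DC_NS}}}{dc_name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


xml_fragment = to_oai_dc({"title": "T", "author": "A", "pages": "12"})
```

In a full OAI-PMH response this fragment would sit inside the `metadata` element of a record, alongside the record header.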

Slide 10. Template-Based Metadata Extraction

Slide 11. Template-Based Metadata Extraction - Document Classification
- Classify documents into groups based on the visual similarity of their metadata pages (pages rich in metadata):
  - the geometrical arrangement of metadata fields on the metadata page
  - the typographic features, such as font size, text alignment, and text height
- Metadata pages are identified by a set of rules.

Slide 12. Template-Based Metadata Extraction - Document Classification
[Figure labels: Document Pages, MXY-Tree, m*n bins, Similarity Integration]

Slide 13. Template sample

Slide 14. Experiments - Document Classification
- Downloaded 7413 documents from the DTIC collection
- Randomly selected 200, 400, 800, 1200, 2000, 3000, 4000, 5000, and 6000 documents and classified them into groups

Slide 15. Experiments - Metadata Extraction
- Selected 100 documents from DTIC
- Divided them into 7 classes
- Created a template for each class

Slide 16. Template-based experiment

Slide 17. Screenshots – OAI

Slide 18. Screenshots – Search Engine

Slide 19. Conclusions
- We describe how to automate the task of converting an existing corpus into an OAI-compliant repository.
- We propose a metadata extraction approach that addresses the challenge of achieving the desired accuracy for a large, heterogeneous collection of documents.
