Metadata Extraction for NASA Collection June 21, 2007 Kurt Maly, Steve Zeil, Mohammad Zubair {maly, zeil,

Outline
- Metadata Extraction Project
  - System overview
  - Demo
- What can ODU do for NASA
- Current status and required enhancements
- Why ODU
- Cost estimate

ODU Metadata Extraction System
- Input: PDF documents
  - processed through OCR (Optical Character Recognition)
- Output: metadata in XML format
  - easily processed for uploading into any database
(demo: 1st document)
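To illustrate the output side of this slide, here is a minimal sketch of writing extracted fields as an XML record. The root element name `metadata`, the field names, and the sample values are assumptions made for illustration; the slides do not show the actual output schema.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(fields: dict) -> str:
    """Serialize extracted metadata fields as a small XML record."""
    root = ET.Element("metadata")  # hypothetical root element name
    for name, value in fields.items():
        child = ET.SubElement(root, name)  # one element per metadata field
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values, not taken from any real document.
record = metadata_to_xml({
    "title": "Example Report Title",
    "author": "Doe, Jane",
    "date": "2007-06-21",
})
print(record)
```

A flat record like this can be parsed by any standard XML library before loading into a database, which is the property the slide emphasizes.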

System Overview
Processing has two main branches:
- Documents with forms (RDPs)
- Documents without forms
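The two branches can be pictured as a simple dispatch step. The sketch below assumes that a form page can be recognized by the phrase "REPORT DOCUMENTATION PAGE" in the OCR text; that heuristic and the function name `choose_branch` are hypothetical, since the slides do not say how the system tells the two kinds of document apart.

```python
def choose_branch(ocr_text: str) -> str:
    """Return 'form' if the OCR text appears to contain an RDP, else 'non-form'."""
    marker = "REPORT DOCUMENTATION PAGE"  # assumed marker phrase, not from the slides
    # Collapse whitespace so line breaks introduced by OCR do not hide the phrase.
    normalized = " ".join(ocr_text.upper().split())
    return "form" if marker in normalized else "non-form"


print(choose_branch("Report  Documentation\nPage\nForm Approved"))
print(choose_branch("A Study of Wing Loads"))
```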

System Overview

Demo (additional documents)

What Can ODU Do for NASA
- Automate processing of form-containing documents at the NASA site
- Automate document processing for 80% of the collection with a minimal set of metadata
- Provide an interface for human intervention for the remaining 20%
- Develop a general reporting tool for management on the accuracy of the process

Current Status
- Completely automated software for:
  - Drop in PDF file
  - Process and produce output metadata in XML format
- Easy (less than 5 minutes) installation process
- Default set of templates for:
  - RDP-containing documents
  - Non-form documents
- Statistical models of NASA collection (30,000 documents)
  - Phrase dictionaries: personal authors, corporate authors
  - Length and English word presence for title and abstract
  - Structure of dates, report numbers
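As an illustration of the "length and English word presence" idea for titles, the following sketch checks a candidate title against a length range and a minimum share of known English words. The thresholds, the tiny vocabulary, and the function name are assumptions for illustration only; the slides describe models built from the collection but give none of their parameters.

```python
import re

# Tiny stand-in vocabulary; a real check would use a full English word list.
ENGLISH_WORDS = {
    "a", "an", "the", "of", "for", "and", "in", "on",
    "analysis", "study", "flight", "data", "wing", "report",
}


def title_looks_plausible(title, min_len=10, max_len=300, min_english_ratio=0.5):
    """Return True if the title has a plausible length and enough English words."""
    if not (min_len <= len(title) <= max_len):
        return False
    words = re.findall(r"[A-Za-z]+", title.lower())
    if not words:
        return False
    english = sum(1 for w in words if w in ENGLISH_WORDS)
    return english / len(words) >= min_english_ratio


print(title_looks_plausible("Analysis of Flight Data for the Wing"))
print(title_looks_plausible("Xq9 zzkt 00123 prrt vvvv"))
```

In a collection-specific model, the length bounds and ratio would presumably be estimated from known-good records rather than hard-coded as they are here.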

Current Status
Metadata extraction results for 25 documents that were randomly selected from the NASA collection*
* Notes
1. Accuracy is defined as successful completion of the extractor with reasonable metadata values extracted
2. "Reasonable" implies that values could be automatically processed (see required enhancements) into standard format
3. Accuracy for documents without RDP could be enhanced with additional templates (see required enhancements)

Current Status
- Documents with RDP forms
  - Extracts high-quality metadata for 2 variants of SF-298
  - Tested on 154 NASA documents
- Documents without RDP forms
  - Extracts moderate-quality metadata for 9 common document layouts
  - Tested on 574 NASA documents

Required Enhancements
- Develop complete template set
- Standardize output and integrate with existing process at NASA site
- Provide tutorial for operation and template writing

Required Enhancements
- Develop statistical model of target collection
- Write default template set to cover at least 80% of known collection
- Provide oracle for detection of problem cases
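One simple form an oracle for problem cases could take is a check that flags records with missing required fields for human review. The required-field list and function names below are hypothetical; the slides do not specify which fields make up the minimal set or how the oracle decides.

```python
REQUIRED_FIELDS = ("title", "author", "date")  # hypothetical minimal metadata set


def find_problems(record: dict) -> list:
    """List the required fields that are absent or blank in an extracted record."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = (record.get(field) or "").strip()
        if not value:
            problems.append(f"missing {field}")
    return problems


def needs_human_review(record: dict) -> bool:
    """A record is routed to a person if the oracle found any problem."""
    return bool(find_problems(record))


print(find_problems({"title": "X", "author": " ", "date": None}))
```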

Required Enhancements
- Develop interface for showing scoring of output and location in document
- Develop interactive modules for correcting metadata
- Develop driver for creating output in desired format

Required Enhancements
- Develop statistical description of input flow of documents
- Develop statistical descriptions of output flow of metadata records
  - Accuracy
  - Computer time to process
  - Human time to validate/correct
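The three measures named for the output flow could be summarized per batch roughly as follows. The record keys (`accurate`, `cpu_seconds`, `human_seconds`) and the sample numbers are invented placeholders for illustration, not measurements from the project.

```python
from statistics import mean


def summarize(runs):
    """Summarize accuracy, computer time, and human time over processed documents."""
    return {
        "accuracy": sum(1 for r in runs if r["accurate"]) / len(runs),
        "mean_cpu_seconds": mean(r["cpu_seconds"] for r in runs),
        "mean_human_seconds": mean(r["human_seconds"] for r in runs),
    }


# Placeholder inputs only; these are not project results.
sample_runs = [
    {"accurate": True, "cpu_seconds": 10, "human_seconds": 0},
    {"accurate": False, "cpu_seconds": 20, "human_seconds": 60},
]
summary = summarize(sample_runs)
print(summary)
```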

Why - software from ODU
Research, new technology
- ODU digital library research group is world class and has made many contributions to advancing the field.
  - $2.5M funding in the last five years from various agencies
  - National Science Foundation, Andrew Mellon Foundation, Los Alamos, Sandia National Laboratory, Air Force Research Laboratory, NASA Langley, DTIC, and IBM
- State of the art in automated metadata extraction is good for homogeneous collections but not effective for large, evolving, heterogeneous collections (such as NASA’s)
- Need for new methods, techniques and processes

Why - software from ODU
Inexpensive (relatively)
- ODU is a university with low overhead (43%)
- Universities can use students and pay them assistantships rather than full-time salaries
- Department adds matching tuition waivers for research assistants, which is a big incentive for students to apply for research work
- Faculty are among the best in the field and require only partial funding.

Why - software from ODU
Long-term software maintenance through department
- Department commits to continuity independent of the faculty on projects
- Department will find and assign faculty and students who can become conversant with the code and maintain it (not evolve it)
- Likely that there would be other faculty who are interested in evolving the code for appropriate funding

Cost of Possible Project
- For a 15-month project for a significant collection, best estimate if it were done in isolation, cost for NASA: $160,000
- For the same 15-month project if done in parallel with DTIC (and possibly GPO), cost for NASA: $90,000