Published by Ashley Harrison. Modified over 9 years ago.
1
Metadata Extraction for NASA Collection
June 21, 2007
Kurt Maly, Steve Zeil, Mohammad Zubair
{maly, zeil, zubair}@cs.odu.edu
2
Outline
  Metadata Extraction Project
  System overview
  Demo
  What can ODU do for NASA
  Current status and required enhancements
  Why ODU
  Cost estimate
3
ODU Metadata Extraction System
  Input: PDF documents processed through OCR (Optical Character Recognition)
  Output: metadata in XML format, easily processed for uploading into any database
  (demo: 1st document)
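The output side of this slide can be illustrated with a minimal sketch. The slides do not give the actual XML schema, so the root element `metadata`, the field names, and the sample values below are all hypothetical placeholders; the point is only that a flat set of extracted fields maps naturally onto XML that a database loader can parse.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(fields):
    """Serialize a flat dict of extracted metadata fields into an XML string."""
    root = ET.Element("metadata")  # hypothetical root element name
    for name, value in fields.items():
        child = ET.SubElement(root, name)  # one child element per field
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder record: these values are invented for illustration only.
record = {
    "title": "Example Report Title",
    "author": "Doe, J.",
    "report_number": "EX-0000",
}
xml_text = metadata_to_xml(record)
```

Because the result is ordinary XML, a downstream loader could read it back with `ET.fromstring(xml_text)` and map each child element to a database column.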
4
System Overview
  Processing has two main branches:
    Documents with forms (RDPs)
    Documents without forms
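The two-branch split can be sketched as a simple dispatch step. The slides do not say how the system decides whether a document contains a form. The sketch below assumes, purely for illustration, that an RDP page can be recognized by the heading text "Report Documentation Page" (the title printed on the SF-298 form) surviving OCR. The function name `choose_branch` is hypothetical.

```python
def choose_branch(ocr_pages):
    """Return 'form' if any OCR'd page looks like an RDP, else 'non-form'.

    Assumption: the form heading survives OCR well enough for a
    case-insensitive, whitespace-tolerant substring match.
    """
    marker = "report documentation page"
    for page in ocr_pages:
        # Collapse line breaks and repeated spaces introduced by OCR.
        normalized = " ".join(page.lower().split())
        if marker in normalized:
            return "form"
    return "non-form"
```

A real detector would likely need to tolerate OCR misspellings as well; exact substring matching is the simplest possible stand-in.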
5
System Overview
6
Demo (additional documents)
7
What Can ODU Do for NASA
  Automate processing of form-containing documents at the NASA site
  Automate document processing for 80% of the collection with a minimal set of metadata
  Provide an interface for human intervention for the remaining 20%
  Develop a general reporting tool for management on the accuracy of the process
8
Current Status
  Completely automated software:
    Drop in a PDF file
    Process and produce output metadata in XML format
  Easy installation process (less than 5 minutes)
  Default set of templates for:
    RDP-containing documents
    Non-form documents
  Statistical models of the NASA collection (30,000 documents):
    Phrase dictionaries: personal authors, corporate authors
    Length and English word presence for title and abstract
    Structure of dates, report numbers
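The statistical checks listed on this slide (length and English word presence for titles, structure of dates) can be sketched as small scoring functions. This is an illustration under stated assumptions, not the ODU implementation: the word list is a tiny stand-in for a real dictionary, the length bounds are arbitrary, and the "Month YYYY" date pattern is just one plausible structure.

```python
import re

# Tiny stand-in vocabulary; a real system would use a full English word list.
ENGLISH_WORDS = {"a", "and", "analysis", "flight", "for", "in", "of", "study", "the"}

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_PATTERN = re.compile(rf"^({MONTHS}) \d{{4}}$")  # assumed 'Month YYYY' form


def title_score(text, min_words=3, max_words=30):
    """Score a candidate title in [0, 1] from its length and vocabulary.

    Returns 0.0 when the word count is outside the assumed bounds,
    otherwise the fraction of words found in ENGLISH_WORDS.
    """
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not (min_words <= len(words) <= max_words):
        return 0.0
    known = sum(1 for w in words if w in ENGLISH_WORDS)
    return known / len(words)


def looks_like_date(text):
    """Check whether a candidate value has the assumed date structure."""
    return bool(DATE_PATTERN.match(text.strip()))
```

Scores like these let the extractor rank competing candidates for a field, or reject a value that is structurally implausible.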
9
Current Status
  Metadata extraction results for 25 documents randomly selected from the NASA collection*
  * Notes
    1. Accuracy is defined as successful completion of the extractor with reasonable metadata values extracted.
    2. "Reasonable" implies that values could be automatically processed into a standard format (see required enhancements).
    3. Accuracy for documents without an RDP could be enhanced with additional templates (see required enhancements).
10
Current Status
  Documents with RDP forms
    Extracts high-quality metadata for 2 variants of SF-298
    Tested on 154 NASA documents
  Documents without RDP forms
    Extracts moderate-quality metadata for 9 common document layouts
    Tested on 574 NASA documents
11
Required Enhancements
  Develop a complete template set
  Standardize output and integrate with the existing process at the NASA site
  Provide a tutorial for operation and template writing
12
Required Enhancements
  Develop a statistical model of the target collection
  Write a default template set to cover at least 80% of the known collection
  Provide an oracle for detection of problem cases
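The "oracle" for problem cases can be pictured as a check that routes doubtful records to a human reviewer. The slides do not describe how it would work; the sketch below assumes each extracted field carries a score in [0, 1] and flags any required field that is missing or below a threshold. The function name, the required-field list, and the threshold of 0.5 are all hypothetical.

```python
def needs_review(field_scores, required=("title", "author", "date"), threshold=0.5):
    """Return a list of problems found in one extracted record.

    An empty list means the record passes automatically; a non-empty
    list means it should be routed to a human for correction.
    """
    problems = []
    for name in required:
        score = field_scores.get(name)
        if score is None:
            problems.append(f"{name}: missing")
        elif score < threshold:
            problems.append(f"{name}: low score {score:.2f}")
    return problems
```

For example, `needs_review({"title": 0.2, "author": 0.8})` reports a low-scoring title and a missing date, while a record with all three fields above the threshold yields an empty list.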
13
Required Enhancements
  Develop an interface for showing the scoring of output and its location in the document
  Develop interactive modules for correcting metadata
  Develop a driver for creating output in the desired format
14
Required Enhancements
  Develop a statistical description of the input flow of documents
  Develop statistical descriptions of the output flow of metadata records:
    Accuracy
    Computer time to process
    Human time to validate/correct
15
Why - software from ODU
  Research, new technology
    The ODU digital library research group is world class and has made many contributions to advancing the field.
    $2.5M in funding over the last five years from various agencies: National Science Foundation, Andrew Mellon Foundation, Los Alamos, Sandia National Laboratory, Air Force Research Laboratory, NASA Langley, DTIC, and IBM
    The state of the art in automated metadata extraction is good for homogeneous collections but not effective for large, evolving, heterogeneous collections (such as NASA's)
    Need for new methods, techniques, and processes
16
Why - software from ODU
  Inexpensive (relatively)
    ODU is a university with low overhead (43%)
    Universities can use students and pay them assistantships rather than full-time salaries
    The department adds matching tuition waivers for research assistants, which is a big incentive for students to apply for research work
    Faculty are among the best in the field and require only partial funding
17
Why - software from ODU
  Long-term software maintenance through the department
    The department commits to continuity independent of the faculty on the project
    The department will find and assign faculty and students who can become conversant with the code and maintain it (not evolve it)
    It is likely that other faculty would be interested in evolving the code given appropriate funding
18
Cost of Possible Project
  For a 15-month project on a significant collection, the best estimate of the cost to NASA if done in isolation: $160,000
  For the same 15-month project if done in parallel with DTIC (and possibly GPO), the cost to NASA: $90,000