A Framework for Relationship Discovery Among Files of Different Types Michal Ondrejcek, Jason Kastner and Peter Bajcsy National Center for Supercomputing.


A Framework for Relationship Discovery Among Files of Different Types

Michal Ondrejcek, Jason Kastner and Peter Bajcsy
National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign (UIUC)
{mondrejc,jkastner,

Acknowledgments
This research was partially supported by a National Archives and Records Administration supplement to NSF PACI cooperative agreement CA #SCI

Abstract
We present a framework for relationship discovery in heterogeneous data systems. The framework consists of modules for automated file system analysis, file content analysis, integration of the results of both analyses, storage of metadata, and data-driven decision support for discovering relationships among files. The file content analysis includes filtering for file type detection (e.g., file format identification using DROID and PRONOM) and type-specific content analysis (such as information extraction from 2D engineering drawings using Optical Character Recognition (OCR) and keyword-based extraction of information from 3D CAD models). The integration component consolidates the metadata extracted from the file system and from the file content using Resource Description Framework (RDF)-based metadata representations. These are stored using Tupelo in an underlying content repository.

We report our preliminary design of the framework and the performance of prototype modules on a test collection of electronic records documenting the Torpedo Weapon Retriever (TWR 841). The relationships among the files in this test collection, which currently includes 784 2D image drawings and 22 CAD models, are unknown. Aperture, a Java framework, was used for metadata extraction from file systems. It saves the metadata following the Nepomuk ontology. We studied the size of the extracted metadata and developed prediction capabilities to estimate the additional storage requirements.
Framework Design
[Figure: an overall design for discovering relationships among multiple sources of electronic records.]

File System Information Extraction
[Figure: metadata size as a function of the number of files in a file system.] The test systems were divided by operating system (OS) type into (c1) a Linux-based system with an 8-CPU Intel Xeon at 2.5 GHz and 8 GB RAM and (c2) a Windows XP system with a single 2 GHz Intel CPU and 2 GB RAM. The dots correspond to concrete file systems; the blue line represents the metadata size prediction based on a simulated file system topology.

Content Information Extraction
We study the extraction of content information to discover relationships between engineering drawings (TIFF files) that carry a title block and the corresponding AutoCAD 3D models (DWG files) of the TWR841 ship deck. The STEP metadata table shows an example of information extracted from a 3D CAD model of the TWR841 ship deck stored in the STEP file format.
[Figure: cropped title block and its RDF/XML representation. The RDF header declares the rdf, rdfs, dc and tdrw namespaces; only the tdrw path fragment 'path'/NARA/titleBlockRDF/ survives in the transcript. Recoverable title-block text: 120 TORPEDO WEAPONS RETRIEVER; TRANSVERSE BULKHEADS BELOW MAIN DECK; NAVAL SEA SYSTEMS COMMAND; 1/2"-1'-0" & AS SHOWN; H A LDOBSON; 4-I0-86. One further line of OCR output is unreadable.]

File Format Identification
This component calls DROID, a file format identification program. The result is metadata about each file, including its registered PRONOM unique ID. PRONOM is a registry of information about file formats, software products and other technical components. Several 3D file formats are not supported by PRONOM, and for those files DROID returns the unidentified file format flag. Such files are then checked against an internal list of 3D file types. The results are converted into RDF triples and stored in a metadata context repository. A UUID is used as the key for storing the set of triples about the same file.
[Figure: RDF triples generated for two engineering drawings in TIFF and AutoCAD formats, with the PRONOM unique IDs highlighted. Recoverable values: U2110_BHD12_Autocad.dwg; Positive; AutoCAD Drawing; image/vnd.dwg.]
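The identify-then-fall-back step described under File Format Identification can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the poster's implementation: it does not call DROID, the `droid_puid` argument merely stands in for whatever PRONOM ID DROID reports, and the fallback list, the `ex:` predicate names, the `"Fallback"` status label and the `fmt/0000` identifier are all hypothetical placeholders.

```python
import uuid
from pathlib import PurePath

# Hypothetical internal list of 3D file types, consulted only when
# DROID reports the file as unidentified.
FALLBACK_3D_TYPES = {".stp": "STEP 3D model", ".step": "STEP 3D model"}


def identify(path, droid_puid=None):
    """Return (status, format_label) for one file.

    droid_puid stands in for the PRONOM ID reported by DROID;
    None means DROID returned the unidentified flag.
    """
    if droid_puid is not None:
        return "Positive", droid_puid
    ext = PurePath(path).suffix.lower()
    if ext in FALLBACK_3D_TYPES:
        return "Fallback", FALLBACK_3D_TYPES[ext]
    return "Unidentified", None


def to_triples(path, status, label):
    """Encode the result as (subject, predicate, object) triples.

    A fresh UUID is the shared subject, so every triple about the
    same file can be stored and retrieved under one key.
    """
    subject = f"urn:uuid:{uuid.uuid4()}"
    triples = [
        (subject, "ex:fileName", PurePath(path).name),
        (subject, "ex:identificationStatus", status),
    ]
    if label is not None:
        triples.append((subject, "ex:format", label))
    return triples


# "fmt/0000" is a placeholder, not a real PRONOM identifier.
for name, puid in [("U2110_BHD12_Autocad.dwg", "fmt/0000"),
                   ("U2110_BHD12_2007_05_09.stp", None)]:
    status, label = identify(name, puid)
    for triple in to_triples(name, status, label):
        print(triple)
```

Keeping the fallback separate from the DROID result makes it visible in the stored metadata which files were identified through PRONOM and which only through the internal list.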
STEP metadata table

STEP metadata specification:
    FILE_DESCRIPTION( /* description */ (''), /* implementation_level */ '2;1');
    FILE_NAME( /* name */ '', /* time_stamp */ '', /* author */ (''), /* organization */ (''), /* preprocessor_version */ ' ', /* originating_system */ '', /* authorization */ ' ');

Expected STEP metadata:
    FILE_DESCRIPTION((''), /* implementation_level */ '2;1');
    FILE_NAME( '120 TORPEDO WEAPONS RETRIEVER, TRANSVERSE BULKHEADS BELOW, MAIN DECK', ' ', ('LDOBSON'), ('NAVAL SEA SYSTEMS COMMAND'), ' ', 'IDA-STEP', ' ');

Parsed STEP metadata:
    FILE_DESCRIPTION((''), '2;1');
    FILE_NAME( 'D:\\NARA\\Archieve_data_samples\\BHD_FR12 \\U2110_BHD12_2007_05_09.stp', ' T13:45:37', ('rakowpj'), (''), 'Autodesk Inventor 11', '');

[Figure: an engineering drawing and a 3D CAD model linked by a RELATIONSHIP.]

Information in engineering drawings: The title block is cropped, and its information is extracted using optical character recognition (OCR) software. The extracted information is corrected and encoded into about ... RDF triples using an ontology developed for this purpose.
[Figure: processing steps for a drawing: cropped title block; information from OCR; editing and ontology definition; RDF representation of the extracted information.]

Information in 3D CAD files: The 3D CAD models in the STEP file format are searched for any ASCII strings that match words in an English dictionary. The information is again encoded in about 8-10 RDF triples.

Conclusions
We have prototyped a framework for file system and file content metadata extraction. Relationship discovery from the metadata is in progress. We developed a metadata size prediction capability for file systems. We empirically observed that the number of RDF triples generated for relationship discovery is on average about ... per file, leading to a total of 8-12 million RDF triples for an average-size server.
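The keyword-based extraction from STEP files described under Content Information Extraction might look roughly like the sketch below. It is an illustration under assumptions rather than the authors' parser: it reads only the FILE_NAME header entity, maps the quoted strings positionally onto the seven field names listed in the STEP metadata specification, and assumes the author and organization lists each hold exactly one string. The function names and the tiny vocabulary are hypothetical.

```python
import re

# Field order as listed in the STEP metadata specification above.
FIELDS = ("name", "time_stamp", "author", "organization",
          "preprocessor_version", "originating_system", "authorization")


def parse_file_name(header):
    """Map the quoted strings of the FILE_NAME entity onto FIELDS.

    Assumes one string per list-valued field (author, organization).
    """
    text = re.sub(r"/\*.*?\*/", "", header, flags=re.S)  # drop comments
    match = re.search(r"FILE_NAME\s*\((.*?)\)\s*;", text, flags=re.S)
    if match is None:
        return {}
    # STEP strings are single-quoted; a doubled quote escapes a quote.
    strings = re.findall(r"'((?:[^']|'')*)'", match.group(1))
    return {field: value.strip() for field, value in zip(FIELDS, strings)}


def dictionary_words(text, vocabulary):
    """Keep only the alphabetic tokens that occur in the vocabulary."""
    return [word for word in re.findall(r"[A-Za-z]+", text)
            if word.lower() in vocabulary]


# Header text taken from the "expected STEP metadata" example.
header = (
    "FILE_DESCRIPTION((''), /* implementation_level */ '2;1'); "
    "FILE_NAME( '120 TORPEDO WEAPONS RETRIEVER, TRANSVERSE BULKHEADS "
    "BELOW, MAIN DECK', ' ', ('LDOBSON'), ('NAVAL SEA SYSTEMS COMMAND'), "
    "' ', 'IDA-STEP', ' ');"
)
metadata = parse_file_name(header)
print(metadata["author"], "|", metadata["organization"])

# Hypothetical stand-in for an English dictionary.
vocabulary = {"torpedo", "weapons", "retriever", "bulkheads", "deck"}
print(dictionary_words(metadata["name"], vocabulary))
```

A positional mapping is fragile when a list field holds several strings, so a production parser would need to tokenize the parenthesized lists explicitly.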