Robust Technologies for Automated Ingestion and Long-Term Preservation of Digital Information PI: Joseph JaJa Co-PIs: Allison Druin and Doug Oard Major.

1 Robust Technologies for Automated Ingestion and Long-Term Preservation of Digital Information PI: Joseph JaJa Co-PIs: Allison Druin and Doug Oard Major Collaborators: Library of Congress, National Archives, Shoah Visual History Foundation, ICDL, SDSC, Georgia Tech, SLAC

2 Scientific Research Objectives: Development of tools and technologies for automated ingestion and management of preservation processes. Evaluation and demonstration of the tools on widely different collections. An overall layered architecture based on distributed repositories using open standards, web, and data grid technologies. The overall approach captures all essential elements of the Open Archival Information System (OAIS) Reference Model.

3 Accomplishments Development of a Global Digital Format Registry prototype based on scalable and secure web technologies – FOCUS (FOrmat CUration Service). Automated ingestion tools and testing on the ICDL collection – ICDL Book Builder. Preliminary design of policy driven integrity auditing for distributed archives. Detailed design of a deep archive based on erasure-resilient codes.
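The slide does not say which erasure-resilient code the deep archive design uses. As a minimal sketch of the general idea only, the simplest such code is a single XOR parity block, which lets any one lost block be rebuilt from the survivors. The function names below are hypothetical and the scheme is an illustration, not the project's design.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stored, missing_index):
    """Rebuild the single block at missing_index from the surviving blocks."""
    survivors = [b for i, b in enumerate(stored) if i != missing_index]
    return xor_blocks(survivors)


blocks = [b"page", b"data", b"tiff"]
stored = encode(blocks)
# Pretend the storage site holding block 1 has been lost.
rebuilt = recover(stored, 1)
```

Single parity tolerates only one loss per stripe; codes that survive several simultaneous losses need more parity blocks and more involved arithmetic.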

4 Format Obsolescence: Handling of digital formats is an essential part of long-term preservation. Preservation of any object must include ways to render and, if necessary, transform the object. Needs: preserving the different essential aspects of objects, and tools for capturing the essential format characteristics of information stored as digital objects.

5 FOrmat CUration Service: Maintains persistent information on digital formats and the applications used to access and manipulate them. Accessible either directly through LDAP or indirectly through SOAP (Web Services). [Diagram labels: Web Service Agent, Format Registry, LDAP, SOAP]
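The two access paths can be sketched with the Python standard library alone: an LDAP search filter for the direct path and a SOAP 1.1 envelope for the indirect one. The slides do not give the registry's attribute names or Web Service operations, so the `cn` attribute, the `lookupFormat` operation, and the `urn:example:focus` namespace are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
FOCUS_NS = "urn:example:focus"  # hypothetical service namespace


def ldap_filter(attribute, value):
    """Build an equality filter, escaping special characters per RFC 4515."""
    escapes = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}
    escaped = "".join(escapes.get(ch, ch) for ch in value)
    return f"({attribute}={escaped})"


def soap_request(format_name):
    """Wrap a hypothetical lookupFormat call in a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{FOCUS_NS}}}lookupFormat")
    ET.SubElement(call, f"{{{FOCUS_NS}}}name").text = format_name
    return ET.tostring(envelope, encoding="unicode")


direct_query = ldap_filter("cn", "Adobe PDF v1.4")    # sent to the LDAP server
indirect_query = soap_request("Adobe PDF v1.4")       # posted to the Web Service Agent
```

Neither string is sent anywhere here; an LDAP client library and an HTTP client would carry them to the registry and the agent respectively.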

6 FOCUS on LDAP/SOAP. Interoperability: LDAP and SOAP provide standard, platform-independent models and protocols. Scalability: LDAP is a proven scalable technology; the LDAP schema can be extended and the server replicated with ease, and the SOAP server side can be extended without affecting clients. Security: SOAP can run on top of SSL (HTTPS); LDAP also provides its own secure authentication and authorization methods.

7 FOCUS Data Model
dc=umiacs, dc=umd, dc=edu
  ou=Format-Registry
    ou=Applications
      Adobe Acrobat v6.0
      Adobe Photoshop v7.0
      Jhove 1.0
    ou=Formats
      Adobe PDF v1.4
      CompuServe GIF 1989a
      JPEG Image Format 2000
Annotations attached to the two branches (as transcribed):
  - General descriptive properties. Processing: rendering, editing, conversion and validation services/systems.
  - General descriptive properties. Processing: format taken as input and/or output.
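A rough sketch of how entries in this directory tree might be modeled follows. It assumes that ou=Applications and ou=Formats sit directly under ou=Format-Registry, that entries are named by `cn`, and that the "rendering, editing, conversion and validation" annotation describes format entries while "format taken as input and/or output" describes application entries. None of these details is confirmed by the slide, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, field

BASE_DN = "ou=Format-Registry,dc=umiacs,dc=umd,dc=edu"


@dataclass
class FormatEntry:
    name: str
    description: str = ""
    # Role ("rendering", "validation", ...) -> names of applications filling it.
    processing: dict = field(default_factory=dict)

    @property
    def dn(self):
        """Distinguished name of this entry under ou=Formats (assumed layout)."""
        return f"cn={self.name},ou=Formats,{BASE_DN}"


@dataclass
class ApplicationEntry:
    name: str
    description: str = ""
    inputs: list = field(default_factory=list)   # formats the application reads
    outputs: list = field(default_factory=list)  # formats the application writes

    @property
    def dn(self):
        """Distinguished name of this entry under ou=Applications (assumed layout)."""
        return f"cn={self.name},ou=Applications,{BASE_DN}"


# Illustrative entries using names from the slide.
pdf = FormatEntry(
    "Adobe PDF v1.4",
    processing={"rendering": ["Adobe Acrobat v6.0"], "validation": ["Jhove 1.0"]},
)
acrobat = ApplicationEntry("Adobe Acrobat v6.0", inputs=["Adobe PDF v1.4"])
```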

8 FOCUS Service Model
[Diagram labels: Web Service Agent, Format Registry, Identification Service, Validation Service, Rendering Service, Conversion Service]
  - Identification Service: identifies the format of a specific digital object (DO) using its internal signature.
  - Validation Service: determines a verification service to verify the format of a specific DO.
  - Rendering Service: identifies current rendering conditions for a specific digital format.
  - Conversion Service: locates transformation services to convert a DO from its source format to the format of interest.
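Two of these services lend themselves to a short sketch: identification by internal signature (the leading "magic" bytes of a file) and locating a chain of conversion services. The signatures below are the well-known ones for PDF, GIF, TIFF, and JPEG 2000. The conversion table, service names, and function names are hypothetical, and a breadth-first search is simply one reasonable way to find the shortest chain; the slide does not describe how FOCUS does it.

```python
from collections import deque

# Well-known leading byte signatures.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
    (bytes.fromhex("0000000c6a5020200d0a870a"), "JPEG 2000"),  # JP2 signature box
]

# Hypothetical registry content: source format -> {target format: service name}.
CONVERTERS = {
    "GIF": {"TIFF": "gif-to-tiff"},
    "TIFF": {"JPEG 2000": "tiff-to-jp2"},
}


def identify(data):
    """Return the format whose signature starts the byte string, or None."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def conversion_path(source, target):
    """Breadth-first search for the shortest chain of conversion services."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, services = queue.popleft()
        if fmt == target:
            return services
        for next_fmt, service in CONVERTERS.get(fmt, {}).items():
            if next_fmt not in seen:
                seen.add(next_fmt)
                queue.append((next_fmt, services + [service]))
    return None  # no chain of registered services reaches the target
```

For example, `conversion_path("GIF", "JPEG 2000")` chains the two hypothetical services through TIFF.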

10 International Children’s Digital Library (ICDL): Joint project between UMD and the Internet Archive, funded by NSF and IMLS (Allison Druin). Goal: efficient search, browsing, and reading of a collection of 10,000 books in 100 languages. Current holdings: almost 1,000 books in over 30 languages, with innovative book readers and browsing tools. Books are digitized in TIFF format, and each page of each book is processed into six sizes of JPEG 2000.

11 Producer – Archive Workflow Network (PAWN): Distributed and secure ingestion of digital objects into the archive. Use of platform-independent web/grid technologies. Ease of integration with data grids or digital libraries. XML representation of metadata and bitstream. Self-describing bitstream submissions. Accountability of transfer and guarantee of data integrity. Currently being used to ingest SLAC data into the National Archives.
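One common way to get an integrity guarantee on transfer is for the producer to send an XML manifest of checksums that the archive recomputes on receipt. The sketch below shows that general technique only; the slides do not describe PAWN's actual manifest format or hash algorithm, so the element names, attribute names, and the choice of SHA-256 are assumptions.

```python
import hashlib
import xml.etree.ElementTree as ET


def build_manifest(files):
    """files maps a relative path to its bytes; returns an XML manifest string."""
    root = ET.Element("submission")  # hypothetical element name
    for path, content in sorted(files.items()):
        ET.SubElement(
            root,
            "file",
            path=path,
            size=str(len(content)),
            sha256=hashlib.sha256(content).hexdigest(),
        )
    return ET.tostring(root, encoding="unicode")


def verify(manifest_xml, received):
    """Return the paths whose received bytes are missing or do not match."""
    failures = []
    for entry in ET.fromstring(manifest_xml).iter("file"):
        content = received.get(entry.get("path"))
        if content is None or hashlib.sha256(content).hexdigest() != entry.get("sha256"):
            failures.append(entry.get("path"))
    return failures


sent = {"book1/page1.tif": b"example page bytes"}
manifest = build_manifest(sent)
problems = verify(manifest, sent)  # empty list when everything arrived intact
```

Keeping the manifest alongside the stored objects also gives later audits something to check against.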

12 More About PAWN [Diagram labels: Bitstream Validation Service, Digital Archive, Scheduler, Producer 1, Producer 2, Producer n]

13 ICDL Book Builder: Purpose: archive the digital book collection of the ICDL (International Children’s Digital Library). The Builder allows users to: select books from ICDL; map metadata from the ICDL database; create Submission Information Packages (SIPs) and transfer them into PAWN.

14 ICDL Ingestion Steps: Five-step process:
  1. User queries the ICDL database using given criteria (e.g., book ID, title, number of pages).
  2. Select books to ingest.
  3. Choose the mapping of ICDL metadata into Dublin Core.
  4. Download book contents and create the SIP.
  5. Send the packet to PAWN.
PAWN then transfers it to the archive.
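Step 3, the metadata mapping, can be sketched as a lookup table from database columns to Dublin Core elements. The ICDL database schema is not shown in the slides, so the column names below are invented for illustration; only the Dublin Core element names (title, creator, language, identifier) are standard.

```python
# Hypothetical ICDL database column -> Dublin Core element.
DEFAULT_MAPPING = {
    "book_title": "title",
    "author": "creator",
    "lang": "language",
    "book_id": "identifier",
}


def to_dublin_core(record, mapping=DEFAULT_MAPPING):
    """Map one database record to Dublin Core, skipping empty or unmapped fields."""
    dc = {}
    for column, element in mapping.items():
        value = record.get(column)
        if value not in (None, ""):
            # Dublin Core elements are repeatable, so each one holds a list.
            dc.setdefault(element, []).append(str(value))
    return dc


record = {"book_id": 42, "book_title": "Example Book", "author": "", "pages": 30}
dublin_core = to_dublin_core(record)
```

Passing a different `mapping` corresponds to the user choosing a mapping in the Builder; columns absent from the mapping, such as `pages` here, are simply left out.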

15 Submission Information Package (SIP): METS handles all areas of a SIP except Physical Object and Descriptive Information. Descriptive Information can be embedded into METS as a third-party XML schema. The submission agreement constrains how a SIP is structured and described.
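Embedding descriptive information as third-party XML can be sketched with the standard library: a METS `dmdSec` wraps Dublin Core elements in `mdWrap`/`xmlData`, while `fileSec` and `structMap` inventory and order the page files. This is a minimal sketch under stated assumptions: the function name, ID values, and file names are hypothetical, the output is not validated against the METS schema, and the real Book Builder output may be structured differently.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
DC = "http://purl.org/dc/elements/1.1/"
XLINK = "http://www.w3.org/1999/xlink"
for prefix, uri in (("mets", METS), ("dc", DC), ("xlink", XLINK)):
    ET.register_namespace(prefix, uri)


def build_sip(dc_fields, page_files):
    """Return a small METS document describing one book and its page files."""
    mets = ET.Element(f"{{{METS}}}mets")

    # Descriptive metadata: Dublin Core embedded as third-party XML.
    dmd = ET.SubElement(mets, f"{{{METS}}}dmdSec", ID="DMD1")
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")
    xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    for element, value in dc_fields.items():
        ET.SubElement(xml_data, f"{{{DC}}}{element}").text = value

    # File inventory and the structural map that puts pages in order.
    file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS}}}fileGrp")
    struct_map = ET.SubElement(mets, f"{{{METS}}}structMap")
    book = ET.SubElement(struct_map, f"{{{METS}}}div", TYPE="book", DMDID="DMD1")
    for n, href in enumerate(page_files, start=1):
        file_el = ET.SubElement(file_grp, f"{{{METS}}}file", ID=f"FILE{n}")
        ET.SubElement(
            file_el,
            f"{{{METS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK}}}href": href},
        )
        page = ET.SubElement(book, f"{{{METS}}}div", TYPE="page", ORDER=str(n))
        ET.SubElement(page, f"{{{METS}}}fptr", FILEID=f"FILE{n}")

    return ET.tostring(mets, encoding="unicode")


sip_xml = build_sip({"title": "Example Book"}, ["page1.jp2", "page2.jp2"])
```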

16 Collaboration Examples and Success Stories: A prototype Producer-Archive Workflow Network (PAWN) is currently being used to ingest the SLAC collection at NARA-II. Several parties have expressed interest in collaborating with us to further develop the design and implementation of the Global Digital Format Registry. We designed and built a “grid brick” for NARA-I, which is currently in use for demonstrating the distributed pilot persistent archive linking UMD, SDSC, NARA-II, and Georgia Tech.

17 Broad Impact A workshop organized by R. Moore, J. JaJa, and A. Rajasekar to assess the suitability of the SRB for long term preservation was held on Dec 8-9, 2005. Over 70 people from the archiving, digital library, and grid communities participated in the workshop. Interactions with NDIIPP partners, NARA partners, Don Sawyer’s group at NASA Goddard, etc.

18 Challenges: We have to work with constantly changing requirements and assumptions, as most of the non-technical issues are still open-ended, in addition to the core problem of dealing with technology evolution. Graduate students would rather work in core disciplines. Open-ended research issues: there is no rigorous methodology to distinguish between different approaches, and no clear way to measure progress.

