Automated Archiving of DVD Content Esteva, Vega, Nieto, Scott, Gunnels, Kumar, Lamphear, Henriksen, Lee, Martin TCDL 2013
Motivation and perspective Find a file based preservation solution for video art in DVD media Create a SIP including files to fulfill museum functions Availability of high performance distributed storage Next generation display systems This project presents the design and implementation of an automated workflow that produces Submission Information Packages (SIP) for archiving video art in DVD only format. The project was elicited by the need of the Blanton Museum of Art at UT Austin to preserve their video art collection and deal with the risk of obsolescence, corruption and security of the media but also to fulfill with exhibition, outreach, publication, loans, research and education functions at a university museum. The project was designed in the context of the availability of high performance distributed storage for UT Austin and UT System faculty and researchers which is provided through the Texas Advanced Computing Center. It is also informed by the center’s experience in visualization technologies and work in research and development of next generation display systems.
Considerations Study the role of the DVD in the works technical history Options for conversion Usage of the DVD in the museum To begin the project we had various issues to consider. DVDs are an optical storage medium in which video data is encoded in MPEG 2 lossy format [1]. Popular due to their high storage capacity and portability, a little more than a decade after their release DVDs are becoming obsolete. In terms of their life expectancy , and despite different manufacturing qualities, DVDs do not constitute a long-term archival storage solution due to their vulnerability and risk of obsolescence [1] [6] In addition, from a digital assets management perspective, DVDs’ physicality is inconvenient in contrast to the benefits of file-based archival solutions now preferred in cultural heritage and commercial settings for storage, long term preservation and distribution. All of these considerations justified moving the media-dependent collection to a file based system, but it was the media dependency of the DVDs and their compression which raised a set of questions. Was it reasonable to preserve works that had been already compressed? Some of these videos are already on Yu tube? Couldn’t we find the masters? To what kind of digital files we were going to convert the DVDs? We needed to understand the technical trajectory of the collection, the role of the DVDs and their usage at the museum.
Considerations UT Research Storage Infrastructure and Services Example of next generation display systems What we did have was the infrastructure required to store the files. The University of Texas at Austin through TACC, designs and maintains hp storage resources that are accessible remotely and can be configured by the user or with aid from the data and collections management team. In this environment users can“build their collections” using different technologies available as such as rdbms, gis, irods, web services and different access protocols, security and 24/7 maintenance and replication facilities. Researchers and curators can create, develop, and manage data and collections throughout their lifecycle and create functioanlities needed for access, research, analysis, and display functions. We decided to develop a workflow to convert and ingest the collection into the repository where it could be shared, documented and move forward. We also had the example of what next generation display tools may look like in the near future. One of TACC’s key resources is its scientific visualization laboratory (Vislab), whose 2900-square foot space contains several cutting-edge display systems. These include the highest resolution tiled-display in the world (nicknamed Stallion) a 4K projection system a multi touch tiled display which was developed through a research collaboration with NARA as well as stereoscopic displays capable of playing video and rendered graphics in 3D using active shutter glasses. All of the display systems in the lab can be controlled through a single computer or node, and through this “head node” all other machines including storage resources can be accessed.
Workflow requirements Only for DVD-based works Transcoding should not involve further compression Fixity to establish authenticity of disk image and production quality/preservation master Quality metric to assess the quality of the transcoding Metadata and provenance documentation Easy to implement SIP contents should fulfill preservation and access functions Using the information gathered from the survey and after conducting some transcoding tests we established requirements for the workflow. The workflow would be used only for DVD based works for which no other high quality media existed. We definitely did not want to further compress the DVD media. In our transcoding tests we realized that using default settings would reduce the quality of the final product, we wanted to preserve the content and to introduce a quality metric to asses the outcome of the process. We wanted to provide the museum with a functional best quality copy that they could use for exhibition purposes as well with streaming or other low quality accessible files that they needed to distribute to curators or to teachers if needed. Having gone through difficulties finding out the technical transitions of the pieces, we wanted to make sure that we were recording our intervention. Such workflow has many steps and involves different pieces of software, we wanted to automate the process as much as possible, to make it easy for museum staff to undertake it.
Preservation roadmap Best practices in video and digital preservation Few references for DVD preservation ISO disk images as identical copy of the DVD Preservation master Base for metadata extraction and further conversions Not playable High quality access file (also preservation file) Matroska container and ffv1 codec Quality metric Structural Similarity Index (SSIM) Documentation – metadata and processing provenance In the literature we did not find many recommendations in terms of what to do with DVD based art. To design our workflow we adapted recommendations from outstanding digital preservation and from analogue video reformatting projects. As we move through the workflow implementation and we learn more about the process, we are still considering the components of the Submission Information Packages (SIP), and we have changed our minds since we wrote the paper as to the terminology and roles of the different components. So, this is our roadmap for now. ISO disk images are open standard, uncompressed files, considered archival because they constitute an identical copy of the content and structure of an optical disc medium. we capture a disk image of the DVD and use it as the source for transcoding to other preservation and access formats. The ISO image is not playable but it can be burn into another DVD which at the museum is important given that some of the exhibit copies are already unreadable in some DVD players and computer drivers. To generate a highest possible quality preservation and access file, we transcode the ISO image containing all the media streams to a Matroska container with FFV1 codec to create a lossless conversion with respect to the lossy MPEG2. The choice of wrappers and codecs in different institutions is decided by emerging standards and usage needs. We are aware of recommendations to use Material eXchange Format (MXF) wrapper and JPG2000 codec [12]. However, our choice of open source software for transcoding limits our possibilities as it does not support MXF or Jpeg2000 at the moment. Like MXF, the Matroska container allows encoding of audio and any accompanying materials. Matroska is an open standard free container format for general use. We expect to provide the MXF and JPG2000 combination in the future. Optionally, the workflow can generate online access and streaming files, in which case a lossy codec is acceptable. Visual quality control is routinely used to determine the differences between the analog original and the digital file in digitization projects. In the same manner, we use a full reference video quality metric, which assesses the quality of a transcoded video using the original video as reference, to assess the quality of the transcoding step. The metric selected is the Structure Similarity (SSIM) index The DVDs are already a highly compressed format, the video quality metric allows us to determine if the conversion was lossy or lossless by providing a measurement of quality between the original and the converted file. Documentation is produced at every step.
Workflow step by step Metadata extraction ISO disk image Checksum Transcoding Conversion to other access files Quality metric Visual evaluation To illustrate a workflow we use the case of the video “Winchester” part of the Winchester trilogy” by Jeremy Blake The artist describes the three works as time-based paintings containing digital animations and mash-ups of retouched images. The master DVDs are signed, dated, and numbered and come with a corresponding viewing disc. The Blanton’s files contain instructions on displaying the piece in 16:9 aspect ratio with a 42-inch plasma screen. The artist also indicated that this aspect ratio should be preserved in any future conversions. Extract MetaData w/ MediaInfo Technical metadata is extracted from the disk image and the reformatted files using MediaInfo [17]. This includes the container, frame size, frame rate, overall bit rate, duration, aspect ratio, video and audio format and types and number of streams. Through this step the script can later verify the exact technical characteristics of the content so that these are not changed during the transcoding process. Provenance metadata, containing information about the kinds of conversion and their outcomes as results of the video quality control metrics, are generated by the workflow script as each step begins and ends. Generate ISO & MD5 checksum A disk image of the DVD and its checksum are produced. The disk image contains all the contents of the optical disc and can be used to reproduce a new copy of the master DVD and to convert it to preservation and access files. It does not allow for straightforward content playback. Its checksum are created. Generate MKV using FFMPEG In order to create the Matroska high quality preservation and access file, we use the metadata of the original file to ensure that the technical characteristics of the video are preserved accurately . We had to make sure though that the right information is parsed, as the technical metadata also includes aspect ratios for other video streams such as the top menu. We use the FFV1 codec, a lossless video codec, to allow exact original data to be reconstructed from the compressed DVD. If available, other streams such as audio and subtitles are copied in their original format. Generate MP4 streaming file Finally, the metadata of the master file is also extracted. The streaming is created to give the museum the option to stream the content electronically. The transcoding software, HandBrakeCLI, is used to generate an MP4 container along with a lossy codec such as H264. Quality Control The last step is the video quality metric is used to compare the original disk image to the preservation files and optionally to the streaming file.. The output of the SSIM algorithm is recorded as metadata.
Quality Control Full reference quality metric Structural Similarity Index Aim for result of 1 Algorithm available in different programming languages Modeled closely to human perception Visual inspection of the Matroska file The first table shows the output of the video quality metric for one of the test cases. The table shows the average SSIM index for the production quality file (MKV) and the streaming copy (MP4). A SSIM Index of one means that the two images being compared have the same quality, and thus that the encoding was lossless. In the conversion from disk image to production quality file we aim to for a result of one. A SSIM Index below one indicates a loss of quality. However, in our experience we still have to determine an acceptable threshold for lossy conversion. As seen in the table, the streaming copy has a small loss in quality due to the lossy encoder used to create it. When visually inspected, the loss in quality is imperceptible.
Software choices Informed by testing Open source, command line tools Support technical documentation Our choice of software was informed by testing, preference for open source well documented command line tools and cross platform pieces that are actively contributed by the community and support varied codecs and containers, can extract metadata, can handle audio files and menus and are highly configurable beyond their default settings. The main software pieces behind the program are mediainfo, ffmpeg, and python
Testing and Implementing the Automated Workflow at UT Libraries
UT Libraries and Audiovisual Digitization Audiovisual Digitization Unit DVD Archiving Preservation Reformatting Digital Preservation Example DVD archiving projects Benson Latin American Collection Fine Arts Library Streaming Video Project
DVD Archiving Requirements ISO image of original Streaming MP4 DVD access copy 2 checksums Descriptive metadata Source metadata Technical metadata
Current Workflow SIP BagIt Copy Create streaming MP4 from main ISO media stream using Handbrake Quality control of MP4 file Move and delete file from local workstation to remote server Compile descriptive metadata as MODS XML from catalog MARC record using MARCedit Compile source metadata as XML from Digitization record using Oxygen Compile technical metadata as XML from ISO and MP4 using MediaInfo Move all XML documents to remote server Create decrypted ISO disc image using DVDShrink Quality control of ISO file Move and delete file from local workstation to remote server Update Project Record with transfer notes and status Compile ISO file, MP4 file, and all XML files into a single directory folder Run BagIt script to contribute verifyvalid data, checksum data, manifest and tagmanifest data, as well as well baginfo data. Systems confederates XML into single UTVideoMD XML document Create physical derivatives from ISO Quality control of physical derivatives Label physical derivatives Send physical derivatives and originals to cataloging or return to branch and denote return in Project Record SIP BagIt Copy
Partnering with TACC TACC approached UT Libraries to expand testing and implementation of automated workflow Benefits of automation for the Libraries: Streamline and unify current workflow Less staff-intensive Reduce opportunities for error
Testing and Results Test DVDs Workstation Results 18 single-stream, unencrypted discs 3 multi-stream, encrypted discs Workstation Mac Pro running OS 10.8.3 Mountain Lion Results Encryption errors MP4 streaming compatibility Saving files to host directory SSIM and Python Library limitations Checksums Enthusiasm for the next phase!
UT Libraries Modifications Addition of drag and drop functionality DVD Project Folder Addition of web-optimized MP4 Elimination of Matroska file
Future Modifications Decryption 2 Checksums SSIM quality measure Expand metadata collection UTVideoMD Integrate BagIt
Manual Workflow SIP BagIt Copy Create streaming MP4 from main ISO media stream using Handbrake Quality control of MP4 file Move and delete file from local workstation to remote server Compile descriptive metadata as MODS XML from catalog MARC record using MARCedit Compile source metadata as XML from Digitization record using Oxygen Compile technical metadata as XML from ISO and MP4 using MediaInfo Move all XML documents to remote server Create decrypted ISO disc image using DVDShrink Quality control of ISO file Move and delete file from local workstation to remote server Update Project Record with transfer notes and status Compile ISO file, MP4 file, and all XML files into a single directory folder Run BagIt script to contribute verifyvalid data, checksum data, manifest and tagmanifest data, as well as well baginfo data. Systems confederates XML into single UTVideoMD XML document Create physical derivatives from ISO Quality control of physical derivatives Label physical derivatives Send physical derivatives and originals to cataloging or return to branch and denote return in Project Record SIP BagIt Copy
Automated Workflow BagIt SIP
Thank You Maria Esteva and Karla Vega Texas Advanced Computing Center Vandy Henriksen, Jennifer Lee, Wendy Martin University of Texas Libraries Sue Ellen Jeffers and Meredith Sutton Blanton Museum of Art Bethany Scott Charlotte Mecklenburg Library Kertana Kumar University of Texas College of Natural Sciences Summer Gunnels University of Texas Cockrell School of Engineering