Data Compression for PDS4
Lisa Gaddis, Sue LaVoie, Jeff Anderson, Elizabeth Rye
PDS Imaging Node
March 26, 2010
Syntax
  Data Compression
    Encodes information using fewer bits
    Reduces consumption of expensive resources: data storage and/or transmission bandwidth
    Requires decompression
    Trade-offs: degree of compression, amount of ‘distortion’ introduced, computational resources required for decompression
  Image Compression
    Application of data compression to digital images
    Reduces redundancy in images to improve efficiency of storage and transmission
    Lossless and lossy methods
    Preserve image quality at a given bit- or compression-rate
  File Compression
    Reduces redundancy at the file level
    Many available tools: ZIP, GZIP, BZIP2
Imaging Node Data Compression
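The trade-off between degree of compression and computational cost can be seen with two of the file compression tools named above. This is a minimal sketch using Python's standard `gzip` and `bz2` modules; the payload is synthetic and highly redundant (an assumption for illustration), so the sizes and timings it prints say nothing about real image data.

```python
import bz2
import gzip
import time

# Synthetic, highly redundant payload (an assumption; not real PDS data).
payload = b"PDS4 image line " * 50_000


def measure(name, compress, decompress):
    """Return (tool name, compressed size in bytes, seconds spent compressing)."""
    start = time.perf_counter()
    packed = compress(payload)
    elapsed = time.perf_counter() - start
    # Lossless: decompression must give back the original bytes exactly.
    assert decompress(packed) == payload
    return name, len(packed), elapsed


results = [
    measure("gzip", gzip.compress, gzip.decompress),
    measure("bzip2", bz2.compress, bz2.decompress),
]

for name, size, seconds in results:
    print(f"{name}: {len(payload)} -> {size} bytes in {seconds:.4f} s")
```

Both tools are lossless, so the only quantities being traded here are output size and run time.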
Why image compression?
  Image compression for data providers and archivists
    NASA missions deliver significant numbers of large image files
    Need to support and/or reduce storage costs and data transmission times of images
    Promotes exchange between different users and systems
  Although falling in cost, storage is expensive for many TB of data and multiple copies
    FY10: ~$750/TB for RAID storage with network infrastructure
Image Compression
  Lossless compression
    Exploits data redundancy
    Image can be recovered exactly
    ‘Run-length encoding’ makes use of redundant patterns or ‘runs’
    ‘LZW (Lempel-Ziv-Welch) encoding’ also addresses strings of characters; it builds up a table of strings and their corresponding codes
    ‘Huffman coding’ uses a binary encoding tree to represent commonly occurring values in few bits and less frequently occurring values in more bits
    Best for documents, computer programs, line drawings, etc.
    JPEG2000 has a lossless option, approved for use by PDS
  Lossy compression
    Exploits data redundancy and ‘irrelevant’ data
    Image data are not recovered exactly
    JPEG, JPEG2000 (lossy)
    Best for digital images, audio, video
    Not approved for PDS archive data
    Exceptions: Browse and some EDR images (e.g., Clementine UVVIS and NIR) are lossy JPEG images (average compression rate of 5.5)
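Two of the lossless techniques listed above can be sketched in a few lines. The following is an illustrative Python sketch, not any PDS or mission implementation: `rle_encode`/`rle_decode` show run-length encoding, and `huffman_code_lengths` (a hypothetical helper name) builds a Huffman tree only far enough to report how many bits each value would receive. The sample row of pixel values is made up.

```python
import heapq
from collections import Counter
from itertools import groupby


def rle_encode(pixels):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


def huffman_code_lengths(values):
    """Return {value: code length in bits} for a Huffman code over `values`."""
    counts = Counter(values)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries are (weight, tiebreak, {value: depth}); the unique
    # tiebreak keeps Python from ever comparing the dictionaries.
    heap = [(w, i, {v: 0}) for i, (v, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every value in them one level deeper.
        merged = {v: d + 1 for v, d in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


row = [0, 0, 0, 0, 7, 7, 255, 0, 0]  # made-up pixel values

encoded = rle_encode(row)
print(encoded)  # [(0, 4), (7, 2), (255, 1), (0, 2)]
assert rle_decode(encoded) == row  # lossless round trip

lengths = huffman_code_lengths(row)
print(lengths)  # the most common value (0) gets the shortest code
```

In this row the value 0 occurs six times and receives a 1-bit code, while 7 and 255 each receive 2-bit codes.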
MRO and LRO images: Not your typical images
  MESSENGER MDIS, Viking Orbiter, Galileo SSI, etc.
    Framing cameras
    800 samples x 800 lines to 1024 samples x 1024 lines
    Roughly one megabyte (MB) per observation
    PDS Imaging Node combined archive requirement for all missions other than LRO and MRO is <25 TB
  MRO/HiRISE, LRO/LROC
    Line-scan cameras
    10,000-20,000 samples x 50,000-100,000 lines
    Roughly 500 to 2,000 MB per observation
    Combined expected archive total for MRO and LRO is 500 TB
    20X larger than sum total of all other Imaging Node holdings
Image Compression for HiRISE RDRs
  Why image compression was needed
    Enormous volume of HiRISE archive, 1 yr
      EDR – 12,100 Gb (~1.5 TB)
      RDR – 92,500 Gb (11.3 TB)
    Very large Standard Data Products
      EDR (2048 x 64,000, 16-bit) = 262 MB
      RDR (40,000 x 64,000, after reprojection, 16-bit) = ~500 to 1000 MB
  Advantages for delivery of RDR data in JPEG2000 format
    Losslessly recompressed format
    Wavelet compression greatly improves speed of web access
    Fast browse, zoom, pan capabilities for handling large files
  Volume projections
    EDR DVD volumes: 321 (losslessly recompressed) vs 482 (uncompressed) (1.5 compression ratio)
    RDR DVD volumes: 2400 (losslessly compressed) vs 7300 (uncompressed) (assuming 3.0 compression ratio)
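Several of the figures above can be re-derived with simple arithmetic. The sketch below makes two assumptions that are not stated on the slide: that "Gb" means gigabits, and that 1 TB is counted as 1024 GB. Under those assumptions the quoted values are reproduced; the helper names are hypothetical.

```python
def raw_image_bytes(samples, lines, bits_per_pixel):
    """Size in bytes of an uncompressed image with no header or padding."""
    return samples * lines * bits_per_pixel // 8


def gigabits_to_terabytes(gigabits):
    """Assumes 8 bits per byte and 1024 GB per TB (an assumption)."""
    return gigabits / 8 / 1024


# EDR: 2048 samples x 64,000 lines at 16 bits per pixel.
edr_bytes = raw_image_bytes(2048, 64_000, 16)
print(edr_bytes / 1e6)  # 262.144 (MB)

# One-year archive volumes quoted in gigabits.
edr_tb = gigabits_to_terabytes(12_100)
rdr_tb = gigabits_to_terabytes(92_500)
print(round(edr_tb, 1), round(rdr_tb, 1))  # 1.5 11.3

# DVD volume counts imply the stated compression ratios.
edr_ratio = 482 / 321
rdr_ratio = 7300 / 2400
print(round(edr_ratio, 1), round(rdr_ratio, 1))  # 1.5 3.0
```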
HiRISE Example
  JPEG2000 image compression applied to map-projected RDR images only
  Lots of null pixels
  Nulls are highly compressed as a result of the lossless compression using JPEG2000
  Projected ~3:1 compression ratios
  Achieved 15:1 in recent tests
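The effect of null pixels on a lossless compression ratio can be illustrated with a small experiment. This sketch does not use JPEG2000; it substitutes `zlib` (DEFLATE) from the Python standard library, and the image is synthetic: random bytes stand in for valid data, and a hypothetical null value of 0 fills the rest of each row. Only the qualitative trend carries over.

```python
import random
import zlib

random.seed(0)          # fixed seed so runs are repeatable
WIDTH, HEIGHT = 512, 512
NULL = 0                # hypothetical null pixel value


def synthetic_image(valid_fraction):
    """Build rows whose leading pixels are random data and the rest are null."""
    valid = int(WIDTH * valid_fraction)
    rows = []
    for _ in range(HEIGHT):
        data = bytes(random.randrange(1, 256) for _ in range(valid))
        rows.append(data + bytes([NULL]) * (WIDTH - valid))
    return b"".join(rows)


def compression_ratio(image):
    """Uncompressed size divided by zlib-compressed size."""
    return len(image) / len(zlib.compress(image))


full = compression_ratio(synthetic_image(1.0))    # no null pixels
sparse = compression_ratio(synthetic_image(0.2))  # 80% null pixels

print(f"no nulls:  {full:.2f}:1")
print(f"80% nulls: {sparse:.2f}:1")
```

Random data is essentially incompressible, so nearly all of the gain in the second case comes from the runs of null pixels.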
Past Experience: Problems with compression
  Voyager, Viking, and MGS-MOC PDS archives contain losslessly compressed data
  Decompression algorithms (e.g., in ISIS) break due to:
    New compilers
    New operating systems
    Changes in hardware architecture (32-bit vs 64-bit)
  JPEG2000 compressed HiRISE RDR images are supported by ISIS3
  However, when the JPEG2000 format reaches end-of-life, maintaining software to read it will be much more difficult than maintaining the existing Voyager/Viking/MGS-MOC algorithms
  A proliferation of image compression formats in PDS would be a problem for long-term archiving and usability of the images
Data Storage Costs: MRO & LRO
  Expected PDS storage requirements for the MRO nominal mission are 75 TB
    High capacity RAID storage & network infrastructure costs ~$750 per TB
    The hardware cost to store a single copy of the MRO data is ~$56K
    Only one copy of the three required by PDS
    Does not include data from an extended mission
    Archive includes JPEG2000 compressed images
  LRO archive volume is projected to be ~400 TB
    Hardware cost for one copy is ~$300K
    Same caveats as above apply
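The cost figures above follow directly from the per-terabyte rate. The sketch below reproduces the single-copy numbers and, as a simple extrapolation that is not on the slide, multiplies by the three copies PDS requires, assuming every copy costs the same per TB.

```python
COST_PER_TB_USD = 750   # FY10 RAID storage with network infrastructure
REQUIRED_COPIES = 3     # copies required by PDS


def hardware_cost(volume_tb, copies=1):
    """Hardware cost in USD, assuming a flat per-TB rate for every copy."""
    return volume_tb * COST_PER_TB_USD * copies


mro_one_copy = hardware_cost(75)
lro_one_copy = hardware_cost(400)
print(mro_one_copy, lro_one_copy)  # 56250 300000

# Extrapolation (assumption): all three copies at the same unit cost.
mro_all = hardware_cost(75, REQUIRED_COPIES)
lro_all = hardware_cost(400, REQUIRED_COPIES)
print(mro_all, lro_all)  # 168750 900000
```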
PDS3 Compressed Image Formats
  Clem-JPEG (not in PDS Standards Reference)
  Huffman First Difference (“)
  JPEG2000
    Improved compression efficiency (vs. JPEG)
    Highly scalable embedded data streams
    Progressive lossy to lossless compression within a single data stream
    Arbitrarily crop images in the compressed domain
    Selectively enhance quality of spatial “regions of interest”
    Support for very large images
    Used for HiRISE & LROC RDRs
  Previous Pixel (“)
  Run Length (“)
  Zip, gzip = GNU zip
    Widely used open-source tool
    Runs on a variety of common computer platforms
    Available since 1992
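Format names such as "First Difference" and "Previous Pixel" refer to predicting each pixel from its neighbor. The sketch below shows only that general idea in Python; it is not the PDS3 encoding itself, and the function names and sample pixel values are made up. Neighboring pixels tend to be similar, so the differences cluster near zero, which is what makes a subsequent entropy coder such as Huffman coding effective.

```python
from itertools import accumulate


def first_difference(pixels):
    """Keep the first pixel, then store each pixel minus its predecessor."""
    if not pixels:
        return []
    return [pixels[0]] + [b - a for a, b in zip(pixels, pixels[1:])]


def undo_first_difference(diffs):
    """Running sum of the differences restores the original pixels."""
    return list(accumulate(diffs))


line = [100, 101, 103, 103, 102, 104]  # made-up pixel values
diffs = first_difference(line)
print(diffs)  # [100, 1, 2, 0, -1, 2]
assert undo_first_difference(diffs) == line  # lossless round trip
```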
Possible Solution for PDS4
  Allow File Compression
    Use standard, non-patented algorithms (e.g., Lempel-Ziv 77, Huffman coding)
    Use stable, open-source, well-maintained software (e.g., gzip)
  Tests using gzip, HiRISE data
    RDRs
      HiRISE RDR, JPEG2000 = 454 MB
      Uncompressed, converted to raw format = 6.6 GB (15x larger)
      Compressed using gzip = 1.1 GB (2.5x larger)
    EDRs
      Not compressed, typical file size = 250 MB
      gzipped versions = 100 MB (2.5x smaller)
    Overall the HiRISE archive would be 5% smaller
      gzip EDRs
      Convert RDRs to raw, then gzip
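A file-level gzip round trip of the kind tested above can be sketched with Python's standard `gzip` module. The file names and the synthetic contents (mostly zero bytes plus some random bytes) are hypothetical stand-ins for a raw image product; the checksum comparison shows that the restored file is bit-for-bit identical to the original.

```python
import gzip
import hashlib
import os
import shutil
import tempfile


def sha256_of(path):
    """SHA-256 of a file, read in 1 MiB chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_file(src, dst):
    """Stream-compress src into a gzip file at dst."""
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def gunzip_file(src, dst):
    """Stream-decompress a gzip file at src into dst."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


workdir = tempfile.mkdtemp()
raw = os.path.join(workdir, "product.raw")        # hypothetical file name
packed = raw + ".gz"
restored = os.path.join(workdir, "restored.raw")  # hypothetical file name

# Synthetic stand-in for a raw image: 2 MB of zeros plus 100 KB of noise.
with open(raw, "wb") as fh:
    fh.write(bytes(1000) * 2000 + os.urandom(100_000))

gzip_file(raw, packed)
gunzip_file(packed, restored)

ratio = os.path.getsize(raw) / os.path.getsize(packed)
identical = sha256_of(raw) == sha256_of(restored)
print(f"compression ratio {ratio:.1f}:1, round trip identical: {identical}")

shutil.rmtree(workdir)  # clean up the temporary files
```

Streaming with `shutil.copyfileobj` avoids loading a multi-gigabyte product into memory at once.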
Recommendation
  Allow file-based compression (such as gzip, bzip2) in PDS4
    Stable, free, widely used open-source software tools
    Work on a variety of common computer platforms: Macs, PCs, Solaris, MSDOS, VAX, etc.
    Maintained by open-source community
    Consistent with PDS3 history, PDS4 plans for simplification
  Reduces storage costs
  Improves data transfer rates over the internet
  Supports management and delivery of high-volume data sets for providers and users
Policy Questions
  Do we permit compression at all in the PDS4 archive? If so:
    Do we want a mixture of compressed and uncompressed data?
      One copy is uncompressed, two are compressed
    Do we distinguish between EDRs and RDRs and other derived products?
    Do we distinguish between frequently accessed data and those offline and/or in ‘deep archive’ storage?
      Store deep archive data in uncompressed form or use one approved compression format (e.g., gzip)
      Permit nodes to use and maintain other compression methods as needed for one or more copies
  Whatever we decide, do we require older, compressed data to be ‘restored’ to meet requirements of the new compression policy?