Lossy Compression of Structured Scientific Data Sets
Shreya Mittapalli, New Jersey Institute of Technology
Friday, Jul 31, 2015, NCAR
Supervisors: John Clyne and Alan Nortan, HSS, CISL-NCAR
Problem we are trying to solve: Advances in technology mean that supercomputers, satellites, and similar sources now collect very large volumes of data. Big Data poses two problems:
1) The hard disk that stores the data might not have enough space.
2) The speed at which the data can be read might be much lower than the required speed.
To tackle this problem, we compress the data. One way to do so is with wavelets. Because of their multi-resolution and information compaction properties, wavelets are widely used for lossy compression in numerous consumer multimedia applications (e.g. images, music, and video). For example:
The parrot is compressed at a ratio of 1:35 and the rose at 1:18 using wavelets. Source: http://arxiv.org/ftp/arxiv/papers/1004/1004.3276.pdf
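To make the information compaction idea concrete, here is a minimal sketch. It assumes a one-dimensional orthonormal Haar transform, which is far simpler than the wavelets behind the images above and is used here only because it fits in a few lines. It keeps the largest-magnitude coefficients and zeroes the rest. The function names and the sine test signal are illustrative choices, not part of the project.

```python
import math

def haar_forward(signal):
    """Multi-level orthonormal Haar transform; len(signal) must be a power of two."""
    coeffs = list(signal)
    n = len(coeffs)
    s = 1 / math.sqrt(2)
    while n > 1:
        half = n // 2
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) * s for i in range(half)]
        det = [(coeffs[2 * i] - coeffs[2 * i + 1]) * s for i in range(half)]
        coeffs[:n] = avg + det  # averages first, then details
        n = half
    return coeffs

def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    out = list(coeffs)
    n = 1
    s = 1 / math.sqrt(2)
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged.append((a + d) * s)
            merged.append((a - d) * s)
        out[:2 * n] = merged
        n *= 2
    return out

def compress(signal, ratio):
    """Keep len(signal) // ratio largest-magnitude coefficients, zero the rest."""
    coeffs = haar_forward(signal)
    keep = max(1, len(coeffs) // ratio)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:keep])
    truncated = [c if i in kept else 0.0 for i, c in enumerate(coeffs)]
    return haar_inverse(truncated)

# Illustrative signal: one period of a sine wave sampled at 64 points.
signal = [math.sin(2 * math.pi * i / 64) for i in range(64)]
approx = compress(signal, ratio=4)
max_err = max(abs(a - b) for a, b in zip(signal, approx))
print(f"max abs error at 4:1 = {max_err:.4f}")
```

Because the transform is orthonormal, discarding the smallest coefficients first gives the smallest possible root-mean-square error for a given number of retained coefficients, which is why smooth data can tolerate high ratios.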
What is Lossy Compression? Lossy compression is the class of data encoding methods that use inexact approximations (or partial data discarding) to represent the content. These techniques are used to reduce data size for storing, handling, and transmitting content. Source: Wikipedia
In lossy compression:
Advantage: The compressed data takes less space on the hard disk and also saves a lot of computation time.
Disadvantage: When the data is reconstructed, some of it is permanently lost.
Project Goal: To determine compression parameters that:
1) minimize distortion for a desired output file size;
2) reduce the computation time while giving the best possible outcome.
Experiments done: To achieve the project goal, we have been experimentally determining the optimal parameter choices for compressing numerical simulation data using wavelets. We experimented on three large data sets: two WRF hurricane data sets, Katrina and Sandy, and one turbulence data set, Taylor-Green (TG).
Sandy
Grid resolution: 5320 x 5000 x 149 (= 16 Gigabytes / 3D variable)
Number of 3D variables: 15
Time steps: ~100
Total data set size: ~24 Terabytes

Katrina
Grid resolution: 316 x 310 x 35 (= 10 Megabytes / 3D variable)
Number of 3D variables: 12
Time steps: ~60
Total data set size: ~9 Gigabytes

TG
Grid resolution: 1024^3 (= 4 Gigabytes / 3D variable)
Number of 3D variables: 6
Total data set size: ~2.5 Terabytes
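The per-variable sizes follow from the grid dimensions. The sketch below assumes each grid value is stored as a 4-byte float, which the slides do not state, so treat the results as estimates. The slide figures are rounded and may use a different convention.

```python
def variable_size_bytes(nx, ny, nz, bytes_per_value=4):
    """Size of one 3D variable; 4 bytes per value is an assumption (float32)."""
    return nx * ny * nz * bytes_per_value

# Grid resolutions as listed for the three data sets.
grids = {
    "Sandy": (5320, 5000, 149),
    "Katrina": (316, 310, 35),
    "TG": (1024, 1024, 1024),
}

sizes = {name: variable_size_bytes(*shape) for name, shape in grids.items()}
for name, size in sizes.items():
    print(f"{name}: {size / 1e6:.1f} MB per 3D variable")
```

Multiplying a per-variable size by the number of variables and time steps gives a rough total for each data set.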
Images of Hurricane Katrina, which occurred on August 29, 2005.
Images of Hurricane Sandy, which occurred on October 25, 2012.
Image of vortex iso-surfaces in a viscous flow starting from Taylor-Green initial conditions. Source: http://www.galcit.caltech.edu/research/highlights
We constructed a Python framework that allowed us to vary compression parameters such as wavelet type and block size from run to run.
Measurements: lmax, rmse, time
Compression ratios: 1, 2, 4, 16, 32, 64, 128, 256
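The three measurements could be computed along the following lines. This is a sketch and not the project's framework: it assumes that lmax means the largest absolute pointwise error and that rmse is the unnormalized root-mean-square error. The helper names and sample values are hypothetical.

```python
import math
import time

def lmax_error(original, reconstructed):
    """Largest absolute pointwise difference (assumed meaning of 'lmax')."""
    return max(abs(a - b) for a, b in zip(original, reconstructed))

def rmse(original, reconstructed):
    """Root-mean-square error over all points (no normalization assumed)."""
    squared = sum((a - b) ** 2 for a, b in zip(original, reconstructed))
    return math.sqrt(squared / len(original))

def timed(func, *args):
    """Run func(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

# Hypothetical values standing in for original and reconstructed data.
original = [1.0, 2.0, 3.0, 4.0]
reconstructed = [1.0, 2.5, 3.0, 3.0]

err_max = lmax_error(original, reconstructed)
err_rms, seconds = timed(rmse, original, reconstructed)
print(err_max, err_rms, seconds)
```

Lmax captures the single worst point, while RMSE summarizes the error over the whole field, so the two can rank parameter choices differently.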
Compression parameters we wanted to explore: 1) Compare the wavelet types Bior3.3 and Bior4.4. The Bior4.4 wavelet, also called the CDF 9/7 wavelet, is widely used in digital signal processing and image compression. The Bior3.3 wavelet is traditionally used in the Vapor software. Goal: Determine whether Bior4.4 is better than Bior3.3.
Compression parameters we wanted to explore: 2) Compare block size 64x64x64 with other block sizes.
[Figure: blocks with edge lengths 64 and 256.]
a) Determine whether smaller blocks are better than larger blocks. There are two competing considerations: smaller blocks are more computationally efficient than larger blocks, while larger blocks introduce fewer artefacts than smaller blocks.
b) If the grid dimensions are not integral multiples of the block size (64), extra data is added to fill the gap. This is called padding. The problem with padding is that, although we are trying to compress the data, it introduces extra data. The TG data needs no padding, but the Katrina and Sandy data have 50% and 30% padding respectively. Goal: Determine whether the aligned data has errors comparable to those of the padded data. Example to illustrate padding:
[Figure: padding example with block edges of 64 and 50 and extents labelled 149, 150 and 196.]
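The padding described above can be estimated from the grid and block dimensions. The sketch below rounds each dimension up to the next multiple of the block edge. It reports padding as the share of the padded volume that is filler, which is only one possible convention, so its percentages need not match the 50% and 30% quoted above. The function names are illustrative.

```python
import math

def padded_shape(shape, block):
    """Round each grid dimension up to a whole number of blocks."""
    return tuple(math.ceil(n / b) * b for n, b in zip(shape, block))

def padding_fraction(shape, block):
    """Share of the padded volume that is filler rather than data."""
    padded = padded_shape(shape, block)
    return 1 - math.prod(shape) / math.prod(padded)

def block_count(shape, block):
    """Number of blocks needed to cover the (padded) grid."""
    return math.prod(math.ceil(n / b) for n, b in zip(shape, block))

katrina = (316, 310, 35)
sandy = (5320, 5000, 149)
tg = (1024, 1024, 1024)

for name, shape, block in [
    ("Katrina, 64x64x64", katrina, (64, 64, 64)),
    ("Katrina, 64x64x35", katrina, (64, 64, 35)),
    ("Sandy, 64x64x64", sandy, (64, 64, 64)),
    ("Sandy, 64x64x50", sandy, (64, 64, 50)),
    ("TG, 64x64x64", tg, (64, 64, 64)),
]:
    print(name, padded_shape(shape, block),
          f"{padding_fraction(shape, block):.1%}", block_count(shape, block))
```

Choosing a block depth of 35 for Katrina or 50 for Sandy removes most of the vertical padding, which is the motivation for comparing those block shapes against 64x64x64.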
We did the following three experiments:
1) We compared the wavelet types Bior3.3 and Bior4.4 for all three data sets.
2) We compared larger blocks with smaller blocks. For TG: 64x64x64 vs 128x128x128 vs 256x256x256.
3) We compared padded data with aligned data.
a) For Katrina: 64x64x64 vs 64x64x35
b) For Sandy: 64x64x64 vs 64x64x50
Plots for the Katrina data illustrating Experiment 1: Bior3.3 vs Bior4.4.
Plot for the Sandy data illustrating Experiment 3: aligned data vs padded data.
Plots for the TG data illustrating Experiment 2: bigger blocks vs smaller blocks.
Lmax error for the wx variable of the TG data set for block sizes 64x64x64, 128x128x128 and 256x256x256.
RMSE for the wx variable of the TG data set for block sizes 64x64x64, 128x128x128 and 256x256x256.
When using a larger block size (256^3 vs 64^3) for the vx component of the TG data set (compressed 512:1), we see improved compression quality, as illustrated above. Source: Pablo Mininni, U. of Buenos Aires.
Time taken to construct the raw data for the wx variable of the TG data set for block sizes 64x64x64, 128x128x128 and 256x256x256.
Conclusion:
1) Bior4.4 is in some cases better than Bior3.3.
2) Surprisingly, a larger block (say 256x256x256) is better than 64x64x64 in terms of both computation time and error.
3) The errors of the aligned data and the padded data are comparable.
Acknowledgements: My supervisors, John Clyne and Alan Nortan, for their continued support. Dongliang Chu, Samuel Li and Kim Zhang. Delilah Gail Rutledge. NCAR.