Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications
Tekin Bicer, Jian Yin, David Chiu, Gagan Agrawal, and Karen Schuchardt
Ohio State University, Washington State University, Pacific Northwest National Laboratory
IPDPS, Boston

Introduction
Scientific simulations and instruments can generate large amounts of data
– E.g., the Global Cloud Resolving Model (GCRM) produces 1 PB of data for a 4 km grid-cell resolution
– Higher resolutions mean more and more data
– I/O operations become the bottleneck
Problems: storage and I/O performance
Compression can address both

Motivation
Generic compression algorithms
– Good for low-entropy byte sequences
– Scientific datasets are hard to compress: floating-point numbers consist of an exponent and a mantissa, and the mantissa can be highly entropic
Using compression in applications is challenging
– Choosing suitable compression algorithms
– Utilizing the available resources
– Integrating compression algorithms into applications

Outline
Introduction
Motivation
Compression Methodology
Online Compression Framework
Experimental Results
Related Work
Conclusion

Compression Methodology
Common properties of scientific datasets:
– Multidimensional arrays
– Consist of floating-point numbers
– Neighboring values are related
Domain-specific solutions can help
Approach: prediction-based differential compression
– Predict the values of neighboring cells
– Store only the difference (see the sketch after the example below)

Example: GCRM Temperature Variable Compression
The values of neighboring cells in a temperature record are highly related
[Figure: X table after prediction and the compressed values]
– 5 bits for the prediction code plus the difference
Supports both lossless and lossy compression
Fast, with good compression ratios

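A minimal sketch of this prediction-based differential scheme, assuming a simple last-value predictor, an XOR difference, and byte-granularity truncation; the actual coder's predictor and its 5-bit packed header may differ, and all names below are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Encode one cell against a predicted value (e.g., a neighboring
 * cell): XOR the raw doubles and store only the nonzero low-order
 * bytes of the difference. Returns the number of bytes written. */
static size_t encode_cell(double value, double predicted, uint8_t *out)
{
    uint64_t v, p;
    memcpy(&v, &value, 8);
    memcpy(&p, &predicted, 8);

    uint64_t diff = v ^ p;              /* 0 when the prediction is exact */

    int zero_bytes = 0;                 /* leading zero bytes of diff     */
    while (zero_bytes < 8 &&
           ((diff >> (56 - 8 * zero_bytes)) & 0xFF) == 0)
        zero_bytes++;

    int nbytes = 8 - zero_bytes;
    /* Header: how many difference bytes follow (the slide packs the
     * prediction code into 5 bits; a full byte is used here for
     * simplicity). Decoding XORs the bytes back onto the prediction. */
    out[0] = (uint8_t)nbytes;
    for (int i = 0; i < nbytes; i++)
        out[1 + i] = (uint8_t)(diff >> (8 * (nbytes - 1 - i)));
    return 1 + (size_t)nbytes;
}
```

Well-predicted values collapse to a single header byte, which is what makes the method both fast and space efficient on smooth fields like temperature.
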
Compression Framework
Goals:
– Improve end-to-end application performance
– Minimize application I/O time by pipelining I/O and (de)compression operations
– Hide computational overhead by overlapping application computation with the compression framework
– Allow easy implementation of different differential compression algorithms
– Allow easy integration with applications (API similar to POSIX I/O)

A Compression Framework for Data-Intensive Applications
Chunk Resource Allocation (CRA) Layer
– Initializes the system
– Generates chunk requests and enqueues them for processing
– Converts original offset and data-size requests into compressed ones
Parallel Compression Engine (PCE)
– Applies the encode()/decode() functions to chunks
– Manages an in-memory cache with informed prefetching
– Creates I/O requests
Parallel I/O Layer (PIOL)
– Creates parallel chunk requests to the storage medium
– Each chunk request is handled by a group of threads
– Provides an abstraction over different data transfer protocols

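One way to picture how a request moves through the three layers; the framework's internal types are not shown in the talk, so this descriptor and its fields are purely illustrative.

```c
#include <stdint.h>

/* Illustrative descriptor for one chunk as it flows CRA -> PCE -> PIOL.
 * The CRA layer translates the application's (offset, size) into
 * compressed coordinates and enqueues the request; PCE threads issue
 * I/O through PIOL and run decode() on each arriving chunk; PIOL
 * serves every request with a group of I/O threads. */
typedef struct chunk_request {
    uint64_t original_offset;    /* what the application asked for     */
    uint64_t original_size;
    uint64_t compressed_offset;  /* translated by the CRA layer        */
    uint64_t compressed_size;
    void    *buffer;             /* filled by PIOL, decoded by the PCE */
    struct chunk_request *next;  /* queue link between the layers      */
} chunk_request;
```
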
Compression Framework API
User-defined functions:
– encode_t(...): (required) code for compression
– decode_t(...): (required) code for decompression
– prefetch_t(...): (optional) informed-prefetching function
Functions available to the application:
– comp_read: applies decode_t to a compressed chunk
– comp_write: applies encode_t to an original chunk
– comp_seek: mimics fseek and also utilizes prefetch_t
– comp_init: initializes the system (thread pools, cache, etc.)

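A sketch of how an application might drive this API. The slide lists only the function names, so every prototype below (the buffer/length arguments, the callback shapes, the integer file handle) is an assumption.

```c
#include <fcntl.h>    /* open     */
#include <stdio.h>    /* SEEK_SET */
#include <stddef.h>

/* Assumed callback and API shapes; the real signatures may differ. */
typedef size_t (*encode_t)(const void *in, size_t n, void *out);
typedef size_t (*decode_t)(const void *in, size_t n, void *out);
typedef void   (*prefetch_t)(const size_t *history, size_t n,
                             size_t *prospective, size_t *count);

int  comp_init(encode_t enc, decode_t dec, prefetch_t pf);
long comp_read(int fd, void *buf, size_t n);
long comp_seek(int fd, long offset, int whence);

extern size_t my_encode(const void *in, size_t n, void *out);
extern size_t my_decode(const void *in, size_t n, void *out);

int main(void)
{
    /* Register the user-defined coders; a NULL prefetcher falls
     * back to the default LRU-managed cache. */
    comp_init(my_encode, my_decode, NULL);

    int fd = open("gcrm_temperature.dat", O_RDONLY);
    comp_seek(fd, 0, SEEK_SET);            /* may trigger prefetch_t */

    double buf[4096];
    while (comp_read(fd, buf, sizeof buf) > 0) {
        /* Application computation here overlaps with the framework's
         * pipelined I/O and decompression of upcoming chunks. */
    }
    return 0;
}
```
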
Prefetching and In-Memory Cache
Overlaps application-layer computation with I/O
Reuse of already-accessed data is small, so the framework prefetches and caches prospective chunks instead
– The default replacement policy is LRU
– The user can analyze the access history and provide a list of prospective chunks (informed prefetching via prefetch(...))
The cache uses a row-based locking scheme for efficient consecutive chunk requests

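A sketch of an informed-prefetching callback under the assumed prefetch_t shape from the API example above: the user inspects the recent chunk-access history and names the chunks the framework should stage next.

```c
#include <stddef.h>

/* Hypothetical informed prefetcher: assume the application scans
 * chunks with a fixed stride, so the next accesses can be predicted
 * from the last two entries of the history. */
void strided_prefetch(const size_t *history, size_t n,
                      size_t *prospective, size_t *count)
{
    if (n < 2) { *count = 0; return; }          /* not enough history */

    size_t stride = history[n - 1] - history[n - 2];
    size_t depth  = 4;                          /* chunks to stage    */
    for (size_t i = 0; i < depth; i++)
        prospective[i] = history[n - 1] + (i + 1) * stride;
    *count = depth;
}
```
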
Integration with a Data-Intensive Computing System
MapReduce-style API
– Remote data processing
– Sensitive to I/O bandwidth
Processes data in a local cluster, in the cloud, or in both (hybrid cloud)

Experimental Setup
Two datasets:
– GCRM: 375 GB (L: 270 GB + R: 105 GB)
– NPB: 237 GB (L: 166 GB + R: 71 GB)
16 × 8 cores (Intel Xeon 2.53 GHz)
Storage of datasets:
– Lustre FS (14 storage nodes)
– Amazon S3 (Northern Virginia)
Compression algorithms: CC, FPC, LZO, bzip2, gzip, LZMA
Applications: AT, MMAT, KMeans

Performance of MMAT
[Figure: breakdown of MMAT performance]
– Overhead (local): 15.41%
– Read speedup: 1.96×

Lossy Compression (MMAT)
[Figure: lossy compression performance]
– Lossy #e: the number of dropped bits
– Error bound: 5 × 10^-5

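A sketch of what dropping e bits can mean, assuming it refers to zeroing the low-order bits of the double's 52-bit mantissa before differential encoding; the zeroed bits compress away, and for normalized values the relative error stays below 2^(e-52), so e can be chosen to respect a bound like the 5 × 10^-5 above.

```c
#include <stdint.h>
#include <string.h>

/* Zero the e low-order mantissa bits of a double (0 <= e < 52).
 * The cleared bits become long runs of zeros that the differential
 * coder removes; for normalized values the relative error
 * introduced is below 2^(e-52). */
static double drop_bits(double value, int e)
{
    uint64_t bits;
    memcpy(&bits, &value, 8);
    bits &= ~((UINT64_C(1) << e) - 1);
    memcpy(&value, &bits, 8);
    return value;
}
```
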
Performance of KMeans
– NPB dataset
– Compression ratio: 24.01% (180 GB)
– More computation gives more opportunity to overlap fetching and decompression

Conclusion
Management and analysis of scientific datasets are challenging
– Generic compression algorithms are inefficient for scientific datasets
We proposed a compression framework and methodology
– Domain-specific compression algorithms are fast and space efficient: 51.68% compression ratio and 53.27% improvement in execution time
– Easy plug-and-play of compression algorithms
– Integrated the proposed framework and methodology with a data-analysis middleware

Thanks!

Multithreading & Prefetching
Varying the number of PCE and I/O threads
– "2P – 4IO" means 2 PCE threads and 4 I/O threads
– One core is assigned to the compression framework

Related Work
(Scientific) data management:
– NetCDF, PNetCDF, HDF5
– Nicolae et al. (BlobSeer): a distributed data-management service for efficient reading, writing, and appending operations
Compression:
– Generic: LZO, bzip2, gzip, szip, LZMA, etc.
– Scientific:
  – Schendel, Jin, et al. (ISOBAR): organizes highly entropic data into compressible data chunks
  – Burtscher et al. (FPC): efficient double-precision floating-point compression
  – Lakshminarasimhan et al. (ISABELA)