Published by Mercedes Tolliver. Modified over 10 years ago.
1
IPDPS 2013 - Boston
Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications
Tekin Bicer, Jian Yin, David Chiu, Gagan Agrawal and Karen Schuchardt
Ohio State University, Washington State University, Pacific Northwest National Laboratory
2
Introduction
Scientific simulations and instruments can generate large amounts of data
  –E.g. Global Cloud Resolving Model: 1PB of data for a 4km grid cell
  –Higher resolutions produce more and more data
  –I/O operations become the bottleneck
Problems
  –Storage, I/O performance
Compression
3
Motivation
Generic compression algorithms
  –Good for low-entropy sequences of bytes
  –Scientific datasets are hard to compress
      Floating point numbers: exponent and mantissa
      Mantissa can be highly entropic
Using compression in applications is challenging
  –Suitable compression algorithms
  –Utilization of available resources
  –Integration of compression algorithms
4
Outline
Introduction
Motivation
Compression Methodology
Online Compression Framework
Experimental Results
Related Work
Conclusion
5
Compression Methodology
Common properties of scientific datasets
  –Multidimensional arrays
  –Consist of floating point numbers
  –Relationship between neighboring values
Domain specific solutions can help
Approach:
  –Prediction-based differential compression
      Predict the values of neighboring cells
      Store the difference
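The prediction-based differential approach listed on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes 32-bit floats, uses the left neighbor as the predicted value, and takes the XOR of the two bit patterns as the stored difference. The function names are hypothetical.

```python
import struct

def float_to_bits(x):
    """Reinterpret a value as its 32-bit IEEE 754 bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b):
    """Inverse of float_to_bits."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def encode(values):
    """Predict each cell from its left neighbor; keep only the XOR residual."""
    residuals = []
    prev = 0  # assumed seed: the first cell is predicted as all-zero bits
    for v in values:
        bits = float_to_bits(v)
        residuals.append(bits ^ prev)  # small when neighbors are similar
        prev = bits
    return residuals

def decode(residuals):
    """Rebuild the original values by re-applying the same predictor."""
    values = []
    prev = 0
    for r in residuals:
        bits = r ^ prev
        values.append(bits_to_float(bits))
        prev = bits
    return values
```

When neighboring cells share sign, exponent and leading mantissa bits, the residuals have long runs of leading zeros, which is what makes them cheap to store.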
6
Example: GCRM Temperature Variable Compression
E.g.: Temperature record
The values of neighboring cells are highly related
X table (after prediction):
X compressed values
  –5 bits for prediction + difference
Lossless and lossy compression
Fast, with good compression ratios
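One way to read "5 bits for prediction + difference" is a 5-bit header that records how many leading bits of a 32-bit residual are zero (that is, how many leading bits the prediction got right), followed by the remaining bits. The slide does not give the exact layout, so the sketch below is an assumption; it uses a string of '0'/'1' characters instead of real bit packing to stay short.

```python
def pack(residuals, width=32):
    """Encode each residual as a 5-bit leading-zero count plus the remaining bits."""
    out = []
    for r in residuals:
        leading = min(width - r.bit_length(), 31)  # cap so the count fits in 5 bits
        remaining = width - leading
        out.append(format(leading, "05b"))
        out.append(format(r, "0{}b".format(remaining)))
    return "".join(out)

def unpack(bitstring, count, width=32):
    """Inverse of pack: read the 5-bit count, then that record's remaining bits."""
    pos = 0
    residuals = []
    for _ in range(count):
        leading = int(bitstring[pos:pos + 5], 2)
        pos += 5
        remaining = width - leading
        residuals.append(int(bitstring[pos:pos + remaining], 2))
        pos += remaining
    return residuals
```

Under this layout a perfectly predicted value costs 6 bits instead of 32, while a value with no matching leading bits costs 37.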
7
Compression Framework
Improve end-to-end application performance
Minimize the application I/O time
  –Pipelining I/O and (de)compression operations
Hide computational overhead
  –Overlapping application computation with the compression framework
Easy implementation of different compression algorithms
Easy integration with applications
  –Similar API to POSIX I/O
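The pipelining idea on this slide, where I/O for the next chunk proceeds while the current one is decompressed, might look like the sketch below. It is illustrative only: zlib stands in for the compression algorithm, an in-memory list stands in for storage, and the function name and queue depth are assumptions.

```python
import queue
import threading
import zlib

def pipelined_read(compressed_chunks, queue_depth=4):
    """Fetch chunks on one thread while the caller's thread decompresses them."""
    fetched = queue.Queue(maxsize=queue_depth)

    def io_worker():
        for chunk in compressed_chunks:  # stands in for a storage read
            fetched.put(chunk)
        fetched.put(None)                # sentinel: no more chunks

    worker = threading.Thread(target=io_worker)
    worker.start()
    pieces = []
    while True:
        chunk = fetched.get()
        if chunk is None:
            break
        pieces.append(zlib.decompress(chunk))  # overlaps with the next fetch
    worker.join()
    return b"".join(pieces)
```

The bounded queue keeps the I/O thread at most a few chunks ahead, so memory use stays fixed regardless of dataset size.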
8
A Compression Framework for Data Intensive Applications
Chunk Resource Allocation (CRA) Layer
  –Initializes the system
  –Generates chunk requests and enqueues them for processing
  –Converts original offset and data size requests to their compressed equivalents
Parallel Compression Engine (PCE)
  –Applies encode(), decode() functions to chunks
  –Manages in-memory cache with informed prefetching
  –Creates I/O requests
Parallel I/O Layer (PIOL)
  –Creates parallel chunk requests to storage medium
  –Each chunk request is handled by a group of threads
  –Provides abstraction for different data transfer protocols
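The CRA layer's conversion of an original (offset, size) request into compressed chunk requests could be sketched as below. The slide does not describe the bookkeeping, so this assumes fixed-size chunks in the original data and a table of per-chunk compressed sizes; the function name and tuple layout are hypothetical.

```python
def chunk_requests(offset, size, chunk_size, compressed_sizes):
    """Map a byte range in the original data to per-chunk compressed requests.

    Returns tuples of (chunk_id, compressed_offset, compressed_size,
    start_in_chunk, end_in_chunk), where start/end index the decoded chunk.
    """
    if size <= 0:
        return []
    # Prefix sums give each chunk's offset inside the compressed file.
    comp_offsets = [0]
    for s in compressed_sizes:
        comp_offsets.append(comp_offsets[-1] + s)
    first = offset // chunk_size
    last = (offset + size - 1) // chunk_size
    requests = []
    for cid in range(first, last + 1):
        chunk_start = cid * chunk_size
        lo = max(offset, chunk_start) - chunk_start
        hi = min(offset + size, chunk_start + chunk_size) - chunk_start
        requests.append((cid, comp_offsets[cid], compressed_sizes[cid], lo, hi))
    return requests
```

For example, with 100-byte chunks compressed to 40, 55, 30 and 60 bytes, a request for 120 bytes at offset 150 touches chunks 1 and 2 only.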
9
Compression Framework API
User defined functions:
  –encode_t(…): (R) Code for compression
  –decode_t(…): (R) Code for decompression
  –prefetch_t(…): (O) Informed prefetching function
Applications can use the functions below
  –comp_read: Applies decode_t to a compressed chunk
  –comp_write: Applies encode_t to an original chunk
  –comp_seek: Mimics fseek, also utilizes prefetch_t
  –comp_init: Initializes the system (thread pools, cache etc.)
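To show how the four calls fit together, here is a toy, single-threaded stand-in written in Python. Only the function names come from the slide; the signatures, the dictionary handle, the in-memory chunk list and the chunk-aligned write assumption are all hypothetical, and the real framework's thread pools and cache are omitted.

```python
import zlib

def comp_init(encode, decode, chunk_size=4096, prefetch=None):
    """Create a handle holding the user-defined functions (signature assumed)."""
    return {"encode": encode, "decode": decode, "prefetch": prefetch,
            "chunk_size": chunk_size, "chunks": [], "pos": 0, "hints": []}

def comp_write(h, data):
    """Split data into chunks and store each one encoded.

    Assumes every write except the last is a multiple of chunk_size.
    """
    cs = h["chunk_size"]
    for i in range(0, len(data), cs):
        h["chunks"].append(h["encode"](data[i:i + cs]))
    return len(data)

def comp_seek(h, offset):
    """Move the read position; pass the target chunk id to the prefetch hook."""
    h["pos"] = offset
    if h["prefetch"] is not None:
        h["hints"] = h["prefetch"](offset // h["chunk_size"])
    return offset

def comp_read(h, size):
    """Decode the chunks covering [pos, pos + size) and return the bytes."""
    cs = h["chunk_size"]
    out = bytearray()
    pos, end = h["pos"], h["pos"] + size
    while pos < end and pos // cs < len(h["chunks"]):
        chunk = h["decode"](h["chunks"][pos // cs])
        start = pos % cs
        piece = chunk[start:start + (end - pos)]
        if not piece:  # read position is past the end of the data
            break
        out += piece
        pos += len(piece)
    h["pos"] = pos
    return bytes(out)
```

A caller would pass its own encode and decode functions (zlib.compress and zlib.decompress work as placeholders), write once, then seek and read in original, uncompressed offsets.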
10
Prefetching and In-Memory Cache
Overlapping application layer computation with I/O
Reusability of already accessed data is small
Prefetching and caching the prospective chunks
  –Default is LRU
  –User can analyze history and provide a prospective chunk list
Cache uses a row-based locking scheme for efficient consecutive chunk requests
Informed prefetching: prefetch(…)
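A small sketch of an LRU chunk cache with a user-supplied prefetch hook follows. It is an assumption-laden illustration: the class and parameter names are invented, prefetching runs synchronously rather than on background threads, and the row-based locking mentioned on the slide is not modeled.

```python
from collections import OrderedDict

class ChunkCache:
    """LRU cache of decoded chunks with an optional informed-prefetch hook."""

    def __init__(self, capacity, fetch, prefetch=None):
        self.capacity = capacity
        self.fetch = fetch        # hypothetical loader: chunk_id -> bytes
        self.prefetch = prefetch  # hypothetical hook: access history -> chunk ids
        self.entries = OrderedDict()
        self.history = []

    def _load(self, cid):
        if cid in self.entries:
            self.entries.move_to_end(cid)      # mark as most recently used
            return self.entries[cid]
        data = self.fetch(cid)
        self.entries[cid] = data
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        return data

    def get(self, cid):
        """Return a chunk, then load whatever the hook predicts comes next."""
        self.history.append(cid)
        data = self._load(cid)
        if self.prefetch is not None:
            for nxt in self.prefetch(self.history):
                self._load(nxt)
        return data
```

With a hook such as `lambda history: [history[-1] + 1]`, a sequential scan finds every chunk after the first already cached.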
11
Integration with a Data-Intensive Computing System
MapReduce style API
  –Remote data processing
  –Sensitive to I/O bandwidth
Processes data in…
  –local cluster
  –cloud
  –or both (Hybrid Cloud)
12
Outline
Introduction
Motivation
Compression Methodology
Online Compression Framework
Experimental Results
Related Work
Conclusion
13
Experimental Setup
Two datasets:
  –GCRM: 375GB (L:270 + R:105)
  –NPB: 237GB (L:166 + R:71)
16x8 cores (Intel Xeon 2.53GHz)
Storage of datasets
  –Lustre FS (14 storage nodes)
  –Amazon S3 (Northern Virginia)
Compression algorithms
  –CC, FPC, LZO, bzip, gzip, lzma
Applications: AT, MMAT, KMeans
14
Performance of MMAT
Breakdown of Performance
Overhead (Local): 15.41%
Read Speedup: 1.96
15
Lossy Compression (MMAT)
Lossy #e: number of dropped bits
Error bound: 5 × 10^-5
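Dropping e bits can be illustrated by clearing the e lowest mantissa bits of a 32-bit float. This is a sketch under that assumption; the slide does not say which bits are dropped or how its error bound was derived, and the function names are hypothetical. For normalized values, truncating e of the 23 mantissa bits keeps the relative error below 2^(e-23).

```python
import struct

def drop_bits(x, e):
    """Clear the e lowest mantissa bits of x stored as a 32-bit float (0 <= e <= 23)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    mask = (0xFFFFFFFF << e) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]

def relative_error_bound(e):
    """Upper bound on relative error after truncating e mantissa bits."""
    return 2.0 ** (e - 23)
```

The cleared bits are always zero after truncation, so they no longer need to be stored, which is where the extra space savings come from.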
16
Performance of KMeans
NPB dataset
Compression ratio: 24.01% (180GB)
More computation
  –More opportunity to fetch and decompress
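The two figures on this slide are consistent with one reading: 24.01% is the fraction of space saved on the 237GB NPB dataset from the experimental setup, and 180GB is the resulting compressed size. That interpretation is an assumption, checked by the arithmetic below with a hypothetical helper.

```python
def compressed_size_gb(original_gb, saved_fraction):
    """Size after compression when saved_fraction of the space is removed."""
    return original_gb * (1.0 - saved_fraction)

# 237 GB with 24.01% saved gives roughly 180 GB.
npb_compressed = compressed_size_gb(237, 0.2401)
```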
17
Conclusion
Management and analysis of scientific datasets are challenging
  –Generic compression algorithms are inefficient for scientific datasets
We proposed a compression framework and methodology
  –Domain specific compression algorithms are fast and space efficient
      51.68% compression ratio
      53.27% improvement in execution time
  –Easy plug-and-play of compression
  –Integration of the proposed framework and methodology with a data analysis middleware
18
Thanks!
19
Multithreading & Prefetching
Different numbers of PCE and I/O threads
2P – 4IO
  –2 PCE threads, 4 I/O threads
One core is assigned to the compression framework
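A "2P – 4IO" configuration could be approximated with two separate thread pools, one for I/O and one for decompression. The sketch below is an assumption about structure, not the framework's code: zlib stands in for the decoder, and `fetch` is a hypothetical callable that returns one compressed chunk.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

def read_all(chunk_ids, fetch, pce_threads=2, io_threads=4):
    """Fetch chunks on the I/O pool and decompress them on the PCE pool."""
    with ThreadPoolExecutor(io_threads) as io_pool, \
         ThreadPoolExecutor(pce_threads) as pce_pool:
        fetched = [io_pool.submit(fetch, cid) for cid in chunk_ids]
        # Each decode task waits for its own fetch; the pools are separate,
        # so waiting decode tasks cannot starve the fetches they depend on.
        decoded = [pce_pool.submit(lambda f=f: zlib.decompress(f.result()))
                   for f in fetched]
        return [d.result() for d in decoded]
```

Varying `pce_threads` and `io_threads` is the kind of tuning the slide's configurations compare.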
20
Related Work
(Scientific) data management
  –NetCDF, PNetCDF, HDF5
  –Nicolae et al. (BlobSeer)
      Distributed data management service for efficient reading, writing and appending operations
Compression
  –Generic: LZO, bzip, gzip, szip, LZMA etc.
  –Scientific
      Schendel and Jin et al. (ISOBAR)
        –Organizes highly entropic data into compressible data chunks
      Burtscher et al. (FPC)
        –Efficient double-precision floating point compression
      Lakshminarasimhan et al. (ISABELA)