IMPACT OF ORC COMPRESSION BUFFER SIZE
Prasanth Jayachandran, Member of Technical Staff – Apache Hive
ORC Layout
- An ORC writer contains one or more child tree writers, one tree writer per primitive column.
- Each tree writer has one or more streams (ByteBuffers) depending on the column type:
  - Integers: row index stream, present stream (absent if the column has no nulls), data stream
  - Strings: row index stream, present stream (absent if the column has no nulls), data stream, length stream, dictionary data stream
- Each stream has the following buffers:
  - Uncompressed buffer
  - Compressed buffer (created only if compression is enabled)
  - Overflow buffer (created only if the compression buffer overflows)
- Runtime memory requirement = compression buffer size * number of columns * number of streams * number of partitions (in case of dynamic partitioning) * number of buckets * 2 (if compression is enabled); a sketch of this calculation follows below.
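The formula above as a minimal Python sketch; the function name and parameters are illustrative only and are not part of any ORC API.

def orc_writer_memory_bytes(buffer_size, num_columns, streams_per_column,
                            num_partitions=1, num_buckets=1, compressed=True):
    # Hypothetical helper implementing the slide's formula:
    # buffer size * columns * streams * partitions * buckets * 2 (if compressed).
    total = buffer_size * num_columns * streams_per_column * num_partitions * num_buckets
    if compressed:
        # Compression keeps both an uncompressed and a compressed buffer per stream.
        total *= 2
    return total

# Example: the test setup used later in this deck (14 string columns, 4 streams,
# 100 dynamic partitions, 8KB buffers, compression enabled).
print(orc_writer_memory_bytes(8 * 1024, 14, 4, num_partitions=100))  # ~92 MB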
Test Setup
- Test data: 10 million rows, 14 string columns
- Test environment: single node, 16GB RAM
- Default JVM heap sizes used for Hive and Hadoop:
  - Hive default: 256MB
  - Hadoop default: 1000MB (child JVMs inherit this)
Impact on file size
Explanation
- Each compressed block is preceded by a 3-byte header that contains the length of the compressed block.
- The smaller the compression buffer size, the more compressed blocks there are, and hence the larger the file (additional bytes for headers); a rough sketch of the overhead follows below.
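A back-of-the-envelope sketch of that header overhead in Python; the 1GB data size and the assumption that each buffer becomes exactly one compressed block are illustrative, not measurements from this test.

raw_bytes = 1 << 30  # hypothetical ~1GB of column data
for buffer_kb in (8, 16, 32, 64, 128, 256):
    num_blocks = raw_bytes // (buffer_kb * 1024)   # one compressed block per buffer
    header_bytes = num_blocks * 3                  # 3-byte header per block
    print(f"{buffer_kb:>4} KB buffer: {num_blocks:>6} blocks, "
          f"{header_bytes / 1024:.0f} KB of headers")

Smaller buffers produce more blocks and therefore more header bytes, which matches the file-size trend shown in the chart.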
Impact on load time
Explanation
- ZLIB uses the DEFLATE compression method with a default window size of 32KB [1]; DEFLATE [2] = LZ77 + Huffman coding.
- When the ORC compression buffer size is >32KB, multiple windows need to be processed, hence increased compression time and load time.
- From the graph there is a ~10s increase for buffer sizes >32KB.
- SNAPPY is LZ77 only [3]; it compresses the complete buffer (no window requirement), so compression time and load time are almost the same for all buffer sizes.
- A small experiment along these lines is sketched below.
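A minimal sketch of how one might observe per-buffer ZLIB cost as the buffer grows past the 32KB window, using Python's zlib module directly; the synthetic data and buffer sizes are assumptions, and this is not how ORC itself invokes the codec.

import os, time, zlib

# Hypothetical mix of incompressible and highly repetitive data.
data = os.urandom(4 << 20) + b"A" * (4 << 20)

for buffer_kb in (8, 16, 32, 64, 128, 256):
    chunk = buffer_kb * 1024
    start = time.perf_counter()
    compressed_bytes = sum(len(zlib.compress(data[i:i + chunk]))
                           for i in range(0, len(data), chunk))
    elapsed = time.perf_counter() - start
    print(f"{buffer_kb:>4} KB buffers: {compressed_bytes} compressed bytes, {elapsed:.3f}s")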
Impact on query execution time
Explanation
- ZLIB decompression (INFLATE) is fast [4].
- Query used: insert overwrite directory '/tmp/foo' select c10, c11, c12, c13, c1, c2, c3, c4, c5, c6, c7, c8, c9 from test_8k_zlib where c14 > '0';
- Compression buffer size does not have a significant impact on query execution time.
Impact on runtime memory
Explanation
- Max JVM heap memory = 1000MB
- 14 string columns, 4 streams per column (no null values, so the present stream is suppressed), 100 partitions
- 8KB compression buffer size: memory requirement = 8 * 1024 * 14 * 4 * 100 * 2 ≈ 92MB
- 16KB compression buffer size: memory requirement = 16 * 1024 * 14 * 4 * 100 * 2 ≈ 184MB
- 256KB compression buffer size: memory requirement > 1000MB, hence the job failed with an OOM exception
- The worked numbers are checked below.
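The same arithmetic, checked with a few standalone lines of Python (repeating the formula from the ORC Layout slide; the 1000MB heap limit is the Hadoop child JVM default from the test setup).

HEAP_MB = 1000
for buffer_kb in (8, 16, 256):
    # 14 columns * 4 streams * 100 partitions * 2 (uncompressed + compressed buffer)
    required_mb = buffer_kb * 1024 * 14 * 4 * 100 * 2 / 1e6
    verdict = "OOM" if required_mb > HEAP_MB else "fits"
    print(f"{buffer_kb:>4} KB buffer -> ~{required_mb:.0f} MB ({verdict})")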
References
1. http://tools.ietf.org/html/rfc1950
2. http://tools.ietf.org/html/rfc1951
3. https://code.google.com/p/snappy/source/browse/trunk/format_description.txt
4. http://bashitout.com/2009/08/30/Linux-Compression-Comparison-GZIP-vs-BZIP2-vs-LZMA-vs-ZIP-vs-Compress.html