IMPACT OF ORC COMPRESSION BUFFER SIZE
Prasanth Jayachandran
Member of Technical Staff – Apache Hive
ORC Layout
- ORC writer contains 1 or more child tree writers
  - 1 tree writer per primitive column
- Each tree writer has 1 or more streams (ByteBuffers) depending on the type
  - Integers
    - Row index stream
    - Present stream (will be absent if there are no nulls in the column)
    - Data stream
  - Strings
    - Row index stream
    - Present stream (will be absent if there are no nulls in the column)
    - Data stream
    - Length stream
    - Dictionary data stream
- Each stream has the following buffers
  - Uncompressed buffer
  - Compressed buffer (created only if compression is enabled); the buffer size is configured on the writer, see the sketch below
  - Overflow buffer (created only if the compression buffer overflows)
- Runtime memory requirement = compression buffer size * number of columns * number of streams * number of partitions (in case of dynamic partitioning) * number of buckets * 2 (if compression is enabled)
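The compression buffer size that drives all of the numbers in this deck is a writer-level setting. Below is a minimal sketch, assuming the standalone org.apache.orc Java writer API (the tests in this deck were driven through Hive tables, where the equivalent knob is the orc.compress.size table property); the path, schema, and 8KB value are illustrative only:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcBufferSizeDemo {
  public static void main(String[] args) throws Exception {
    // Two string columns; each gets its own tree writer and set of streams
    TypeDescription schema = TypeDescription.fromString("struct<c1:string,c2:string>");

    Writer writer = OrcFile.createWriter(new Path("/tmp/buffer_size_demo.orc"),
        OrcFile.writerOptions(new Configuration())
            .setSchema(schema)
            .compress(CompressionKind.ZLIB)  // compression enabled -> compressed buffers are created
            .bufferSize(8 * 1024));          // compression buffer size under test (8KB here)

    VectorizedRowBatch batch = schema.createRowBatch();
    BytesColumnVector c1 = (BytesColumnVector) batch.cols[0];
    BytesColumnVector c2 = (BytesColumnVector) batch.cols[1];
    for (int i = 0; i < 10; i++) {
      int row = batch.size++;
      byte[] v1 = ("value-" + i).getBytes(StandardCharsets.UTF_8);
      byte[] v2 = ("other-" + i).getBytes(StandardCharsets.UTF_8);
      c1.setRef(row, v1, 0, v1.length);
      c2.setRef(row, v2, 0, v2.length);
    }
    writer.addRowBatch(batch);
    writer.close();
  }
}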
Test Setup
- Test data
  - 10 million rows
  - 14 string columns
- Test environment
  - Single node, 16GB RAM
  - Default JVM heap size used for Hive and Hadoop
    - Default for Hive – 256MB
    - Default for Hadoop – 1000MB (child JVMs inherit this)
Impact on file size
Explanation
- Each compressed block is preceded by a 3-byte header that contains the length of the compressed block
- The smaller the compression buffer size, the more compressed blocks there are and hence the larger the file (additional bytes for headers); see the sketch below
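A back-of-the-envelope sketch of that header overhead. The 10MB stream size is a made-up figure for illustration, and real files also differ because smaller buffers give the codec less data to work with per block:

public class OrcHeaderOverhead {
  // Every compressed block in an ORC stream is preceded by a 3-byte header holding its length
  static final int HEADER_BYTES = 3;

  // Approximate header overhead for `dataBytes` of stream data split into compression buffers
  static long headerOverhead(long dataBytes, int bufferSize) {
    long blocks = (dataBytes + bufferSize - 1) / bufferSize;  // ceiling division
    return blocks * HEADER_BYTES;
  }

  public static void main(String[] args) {
    long dataBytes = 10L * 1024 * 1024;  // hypothetical 10MB of stream data
    for (int bufKb : new int[] {4, 8, 32, 64, 256}) {
      System.out.printf("%3dKB buffer -> %d header bytes%n",
          bufKb, headerOverhead(dataBytes, bufKb * 1024));
    }
  }
}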
Impact on load time
Explanation
- ZLIB
  - Uses the DEFLATE compression method with a default window size of 32KB [1]
  - DEFLATE [2] = LZ77 + Huffman coding
  - When the ORC compression buffer size is >32KB, multiple windows need to be processed, hence the increased compression and load time (illustrated by the sketch below)
  - From the graph, there is a ~10s increase in load time for buffer sizes >32KB
- SNAPPY
  - Only LZ77 [3]; compresses the complete buffer (no window requirement)
  - Compression time/load time is almost the same for all buffer sizes
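A standalone way to get a feel for this, assuming nothing ORC-specific: compress the same data with java.util.zip.Deflater (zlib's DEFLATE) in independent chunks of different sizes, roughly mirroring how ORC compresses each compression buffer on its own. The data and chunk sizes are illustrative, and how pronounced the >32KB step is will depend on the data and the zlib build:

import java.util.zip.Deflater;

public class DeflateChunkTiming {
  // Compress `data` in independent chunks of `chunkSize` bytes, the way ORC
  // compresses each compression buffer separately; returns compressed output size
  static long compressInChunks(byte[] data, int chunkSize) {
    Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
    byte[] out = new byte[chunkSize + 1024];
    long total = 0;
    for (int off = 0; off < data.length; off += chunkSize) {
      deflater.reset();
      deflater.setInput(data, off, Math.min(chunkSize, data.length - off));
      deflater.finish();
      while (!deflater.finished()) {
        total += deflater.deflate(out);
      }
    }
    deflater.end();
    return total;
  }

  public static void main(String[] args) {
    // Mildly repetitive data so DEFLATE's LZ77 stage has matches to find
    byte[] data = new byte[64 * 1024 * 1024];
    for (int i = 0; i < data.length; i++) {
      data[i] = (byte) "abcdefghij".charAt(i % 10);
    }
    for (int chunkKb : new int[] {8, 16, 32, 64, 128, 256}) {
      long start = System.nanoTime();
      long size = compressInChunks(data, chunkKb * 1024);
      System.out.printf("%3dKB chunks -> %d compressed bytes in %d ms%n",
          chunkKb, size, (System.nanoTime() - start) / 1_000_000);
    }
  }
}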
Impact on query execution time
Explanation
- ZLIB decompression (INFLATE) is fast [4]
- Query used:
  insert overwrite directory '/tmp/foo'
  select c10, c11, c12, c13, c1, c2, c3, c4, c5, c6, c7, c8, c9
  from test_8k_zlib where c14 > '0';
- Compression buffer size does not have a significant impact on query execution time
Impact on runtime memory
Explanation
- Max JVM heap memory = 1000MB
- 14 string columns, 4 streams each (no null values, so the present stream is suppressed), 100 partitions
- 8KB compression buffer size
  - Memory requirement = 8 * 1024 * 14 * 4 * 100 * 2 ~= 92MB
- 16KB compression buffer size
  - Memory requirement = 16 * 1024 * 14 * 4 * 100 * 2 ~= 184MB
- 256KB compression buffer size
  - Memory requirement = 256 * 1024 * 14 * 4 * 100 * 2 ~= 2.9GB > 1000MB, hence the job failed with an OOM exception (worked out in the sketch below)
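The same numbers, worked out with the formula from the ORC Layout slide; a minimal sketch in which buckets is fixed at 1 because this test did not use bucketing:

public class OrcMemoryEstimate {
  // Runtime memory requirement = compression buffer size * columns * streams
  //                              * partitions * buckets * 2 (when compression is enabled)
  static long estimateBytes(long bufferSize, int columns, int streams,
                            int partitions, int buckets, boolean compressed) {
    return bufferSize * columns * streams * partitions * buckets * (compressed ? 2 : 1);
  }

  public static void main(String[] args) {
    long heapBytes = 1000L * 1000 * 1000;  // 1000MB child JVM heap
    for (int bufKb : new int[] {8, 16, 256}) {
      // 14 string columns, 4 streams each (present stream suppressed), 100 partitions, no buckets
      long bytes = estimateBytes(bufKb * 1024L, 14, 4, 100, 1, true);
      System.out.printf("%3dKB buffer -> ~%dMB%s%n", bufKb, Math.round(bytes / 1e6),
          bytes > heapBytes ? " (exceeds the 1000MB heap -> OOM)" : "");
    }
  }
}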
References
3. format_description.txt (Snappy format description)
4. Comparison-GZIP-vs-BZIP2-vs-LZMA-vs-ZIP-vs-Compress.html