IMPACT OF ORC COMPRESSION BUFFER SIZE
Prasanth Jayachandran, Member of Technical Staff – Apache Hive



ORC Layout
An ORC writer contains one or more child tree writers, one tree writer per primitive column. Each tree writer has one or more streams (ByteBuffers) depending on the column type:
- Integers: row index stream, present stream (absent if the column has no nulls), data stream
- Strings: row index stream, present stream (absent if the column has no nulls), data stream, length stream, dictionary data stream
Each stream has the following buffers:
- Uncompressed buffer
- Compressed buffer (created only if compression is enabled)
- Overflow buffer (created only if the compression buffer overflows)
Runtime memory requirement = compression buffer size * number of columns * number of streams * number of partitions (in case of dynamic partitioning) * number of buckets * 2 (if compression is enabled)
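As a sanity check, the runtime memory formula above can be written as a small function (a sketch of the slide's formula, not Hive's actual implementation; the function name and defaults are my own):

```python
def orc_writer_memory_bytes(buffer_size, columns, streams_per_column,
                            partitions=1, buckets=1, compression=True):
    """Estimate ORC writer buffer memory per the formula above (sketch).

    With compression enabled each stream holds both an uncompressed and
    a compressed buffer, hence the factor of 2.
    """
    total = buffer_size * columns * streams_per_column * partitions * buckets
    return total * 2 if compression else total

# Example: 14 string columns, 4 streams each, 100 dynamic partitions,
# 8KB compression buffers
print(orc_writer_memory_bytes(8 * 1024, 14, 4, partitions=100))  # 91750400 (~92MB)
```

This makes the multiplicative blow-up easy to see: doubling the buffer size, the partition count, or the column count each doubles the writer's memory footprint.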

Test Setup
Test data: 10 million rows, 14 string columns
Test environment: single node, 16GB RAM
Default JVM heap sizes used for Hive and Hadoop:
- Hive: 256MB
- Hadoop: 1000MB (child JVMs inherit this)

Impact on file size

Explanation
Each compressed block is preceded by a 3-byte header that contains the length of the compressed block. The smaller the compression buffer size, the more compressed blocks there are, and hence the larger the file (additional header bytes per block).
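The header overhead can be estimated directly (a sketch; the 3-byte header per compressed block is taken from the explanation above):

```python
import math

def block_header_overhead(stream_bytes, buffer_size, header_bytes=3):
    """Bytes spent on 3-byte block headers for a stream of the given size."""
    num_blocks = math.ceil(stream_bytes / buffer_size)
    return num_blocks * header_bytes

# For a 1MB stream: smaller buffers mean more blocks, hence more header bytes
print(block_header_overhead(1 << 20, 8 * 1024))    # 384 bytes (128 blocks)
print(block_header_overhead(1 << 20, 256 * 1024))  # 12 bytes (4 blocks)
```

The absolute overhead is small, but it is one of the reasons the file size curve grows as the buffer size shrinks.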

Impact on load time

Explanation
ZLIB uses the DEFLATE compression method with a default window size of 32KB [1]. DEFLATE [2] = LZ77 + Huffman coding. When the ORC compression buffer size is >32KB, multiple windows need to be processed, which increases compression and load time. From the graph, there is a ~10s increase for buffer sizes >32KB.
SNAPPY is LZ77 only [3] and compresses the complete buffer (no window requirement), so compression time/load time is almost the same for all buffer sizes.
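The per-buffer behaviour can be illustrated with Python's zlib module (a sketch: zlib is the same DEFLATE codec ORC's ZLIB option uses, but this is not ORC's actual write path). Each buffer is compressed independently, so a fresh LZ77 window starts with every chunk:

```python
import zlib

def compress_in_chunks(data, chunk_size):
    """Compress data one buffer at a time, mimicking how ORC compresses
    each compression buffer independently."""
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

payload = b"highly repetitive sample payload " * 8192  # ~264KB of sample data
for size in (8 * 1024, 32 * 1024, 256 * 1024):
    chunks = compress_in_chunks(payload, size)
    # Every chunk round-trips on its own, independent of its neighbours
    assert b"".join(zlib.decompress(c) for c in chunks) == payload
    print(size, sum(len(c) for c in chunks))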

Impact on query execution time

Explanation
ZLIB decompression (INFLATE) is fast [4]. Query used:
insert overwrite directory '/tmp/foo' select c10, c11, c12, c13, c1, c2, c3, c4, c5, c6, c7, c8, c9 from test_8k_zlib where c14 > '0';
Compression buffer size does not have a significant impact on query execution time.

Impact on runtime memory

Explanation
Max JVM heap memory = 1000MB; 14 string columns; 4 streams per column (no null values, so the present stream is suppressed); 100 partitions.
- 8KB compression buffer size: memory requirement = 8 * 1024 * 14 * 4 * 100 * 2 ~= 92MB
- 16KB compression buffer size: memory requirement = 16 * 1024 * 14 * 4 * 100 * 2 ~= 184MB
- 256KB compression buffer size: memory requirement > 1000MB, and hence the job failed with an OOM exception
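The arithmetic above can be checked directly (the 1000MB heap and the column/stream/partition counts are the figures from the test setup; the helper name is my own):

```python
MB = 1024 * 1024
heap = 1000 * MB                  # child-JVM heap from the test setup
columns, streams, partitions = 14, 4, 100
compression_factor = 2            # uncompressed + compressed buffer per stream

def requirement(buffer_size):
    return buffer_size * columns * streams * partitions * compression_factor

print(requirement(8 * 1024))          # 91750400  (~92MB)  -> fits in heap
print(requirement(16 * 1024))         # 183500800 (~184MB) -> fits in heap
print(requirement(256 * 1024) > heap)  # True -> exceeds heap, OOM as observed
```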

References
3. format_description.txt
4. Comparison-GZIP-vs-BZIP2-vs-LZMA-vs-ZIP-vs-Compress.html