File Format Benchmark - Avro, JSON, ORC, & Parquet

File Format Benchmark - Avro, JSON, ORC, & Parquet. Owen O’Malley, owen@hortonworks.com, @owen_omalley. September 2016.

Who Am I? Worked on Hadoop since Jan 2006: MapReduce, Security, Hive, and ORC. Worked on different file formats: Sequence File, RCFile, ORC File, T-File, and Avro requirements.

Goal: Seeking to discover unknowns. How do the different formats perform? What could they do better? The best part of open source is looking inside! Use real & diverse data sets; over-reliance on similar datasets leads to weakness. Open & reviewed benchmarks.

The File Formats

Avro: A cross-language file format for Hadoop. Schema evolution was the primary goal. The schema is segregated from the data, unlike Protobuf and Thrift. Row-major format.
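To make the row-major layout and the segregated schema concrete, here is a minimal sketch using the fastavro Python library (not part of the original benchmark); the record schema, field names, and file name are invented for illustration.

```python
import fastavro

# Hypothetical record schema; Avro writes it once into the file header,
# segregated from the data that follows.
schema = {
    "type": "record",
    "name": "TaxiRide",
    "fields": [
        {"name": "vendor_id", "type": "string"},
        {"name": "fare", "type": "double"},
    ],
}

records = [
    {"vendor_id": "CMT", "fare": 12.5},
    {"vendor_id": "VTS", "fare": 7.0},
]

# Each record is appended as a compact binary row after the header.
with open("rides.avro", "wb") as out:
    fastavro.writer(out, schema, records, codec="deflate")
```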

JSON: A serialization format for HTTP & JavaScript. Text format with MANY parsers. The schema is completely integrated with the data. Row-major format. Compression is applied on top.
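As a small illustration of a text format with the schema embedded in the data and compression layered on top, this sketch writes gzip-compressed newline-delimited JSON; the field names are the same invented ones as above.

```python
import gzip
import json

rows = [
    {"vendor_id": "CMT", "fare": 12.5},
    {"vendor_id": "VTS", "fare": 7.0},
]

# Every row repeats its own field names (the schema lives inside the data);
# gzip is applied on top of the text rather than being part of the format.
with gzip.open("rides.json.gz", "wt") as out:
    for row in rows:
        out.write(json.dumps(row) + "\n")
```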

ORC: Originally part of Hive, built to replace RCFile; now a top-level project. The schema is segregated into the footer. Column-major format with stripes. Rich type model, stored top-down. Integrated compression, indexes, & stats.

Parquet: Design based on Google’s Dremel paper. The schema is segregated into the footer. Column-major format with stripes. Simpler type model with logical types; all data is pushed to the leaves of the tree. Integrated compression and indexes.
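For comparison with the row-major sketches above, here is a minimal sketch of writing the same toy rows to ORC and Parquet with the pyarrow Python bindings (the benchmark itself used the Java readers and writers); it assumes a pyarrow build with ORC support, and the table contents are invented.

```python
import pyarrow as pa
import pyarrow.orc as orc
import pyarrow.parquet as pq

# Hypothetical table for illustration.
table = pa.table({
    "vendor_id": ["CMT", "VTS", "CMT"],
    "fare": [12.5, 7.0, 9.25],
})

# Both formats lay the data out column by column and put the schema,
# statistics, and stripe/row-group locations in a footer.
orc.write_table(table, "rides.orc", compression="zlib")
pq.write_table(table, "rides.parquet", compression="snappy")
```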

Data Sets

NYC Taxi Data: Every taxi cab ride in NYC from 2009. Publicly available at http://tinyurl.com/nyc-taxi-analysis. 18 columns with no null values: doubles, integers, decimals, & strings. 2 months of data – 22.7 million rows.

Github Logs: All actions on Github public repositories. Publicly available at https://www.githubarchive.org/. 704 columns with a lot of structure & nulls; pretty much the kitchen sink. 1/2 month of data – 10.5 million rows.

Finding the Github Schema: The data is all in JSON, and no schema for the data is published. We wrote a JSON schema discoverer: it scans the documents and figures out the types. Available at https://github.com/hortonworks/hive-json. The resulting schema is huge (12k).
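The idea behind the discoverer can be sketched in a few lines. This toy version (not the actual hive-json code) walks one document and maps each value to a type, whereas the real tool also merges the schemas of millions of documents and handles nulls, unions, and numeric widening.

```python
import json

def infer_type(value):
    # Toy JSON type discovery: walk the document and name each leaf type.
    if isinstance(value, dict):
        return {key: infer_type(val) for key, val in value.items()}
    if isinstance(value, list):
        return [infer_type(value[0])] if value else ["unknown"]
    if isinstance(value, bool):   # must be checked before int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if value is None:
        return "null"
    return "string"

doc = json.loads('{"repo": {"id": 3, "name": "hive"}, "public": true}')
print(infer_type(doc))
# {'repo': {'id': 'bigint', 'name': 'string'}, 'public': 'boolean'}
```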

Sales: Generated data with a real schema from a production Hive deployment; the random data is based on that table’s data statistics. 55 columns with lots of nulls and a little structure: timestamps, strings, longs, booleans, list, & struct. 25 million rows.

Storage costs

Compression: Data size matters! Hadoop stores all your data, but requires hardware; size is also one factor in read speed. ORC and Parquet use RLE & dictionaries. All the formats support general compression: ZLIB (GZip), tight compression but slower; Snappy, some compression but faster.
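A quick way to see the interplay of column encodings and general codecs is to write one low-cardinality column with different codecs and compare file sizes. Below is a hedged sketch with pyarrow’s ORC writer (assuming ORC support is built in); the column name and values are made up.

```python
import os
import pyarrow as pa
import pyarrow.orc as orc

# Low-cardinality column: dictionary encoding and RLE shrink it
# before the general-purpose codec even runs.
table = pa.table({"payment_type": ["CASH", "CARD", "NO CHARGE"] * 100_000})

for codec in ("uncompressed", "snappy", "zlib"):
    path = f"payments_{codec}.orc"
    orc.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```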

Taxi Size Analysis: Don’t use JSON. Use either Snappy or Zlib compression. Avro’s small compression window hurts. Parquet with Zlib is smaller than ORC. Group the column sizes by type.

Sales Size Analysis: ORC did better than expected: the string columns have small cardinality, there are lots of timestamp columns, and no doubles. Need to revalidate the results against the original data, improve the random data generator, and add non-smooth distributions.

Github Size Analysis: A surprising win for JSON and Avro: they are worst when uncompressed but best with zlib. The data has many partially shared strings, and ORC and Parquet don’t compress across columns. Need to investigate shared dictionaries.

Use Cases

Full Table Scans: Read all columns & rows. All formats except JSON are splittable, so different workers can process different parts of the same file.
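Splittability comes from the formats having internal units (ORC stripes, Parquet row groups, Avro blocks) that workers can claim independently. The sketch below simply iterates a Parquet file’s row groups with pyarrow to show the unit a split-aware engine hands to each worker; the file name is the hypothetical one from the earlier sketch.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("rides.parquet")

# Each row group can be read on its own, which is what lets different
# workers process different parts of the same file.
for i in range(pf.metadata.num_row_groups):
    chunk = pf.read_row_group(i)
    print("row group", i, "->", chunk.num_rows, "rows")
```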

Taxi Read Performance Analysis: JSON is very slow to read: it has a large storage size for this data set and needs to do a LOT of string parsing. There is a tradeoff between space & time: less compression is sometimes faster to read.

Sales Read Performance Analysis: Read performance is dominated by format; compression matters less for this data set. Straight ordering: ORC, Avro, Parquet, & JSON. Garbage collection is important: ORC 0.3 to 1.4% of time, Avro < 0.1% of time, Parquet 4 to 8% of time.

Github Read Performance Analysis: Garbage collection is critical: ORC 2.1 to 3.4% of time, Avro 0.1% of time, Parquet 11.4 to 12.8% of time. Having a lot of columns needs more space; suspect that we need bigger stripes (rows/stripe: ORC 18.6k, Parquet 88.1k).

Column Projection: Often you just need a few columns. Only ORC & Parquet are columnar, so only the projected columns are read, decompressed, & deserialized.

Dataset  Format   Compression  Full scan (µs/row)  Projection (µs/row)  Projection as % of full scan
github   orc      zlib         21.319              0.185                0.87%
github   parquet  zlib         72.494              0.585                0.81%
sales    orc      zlib         1.866               0.056                3.00%
sales    parquet  zlib         12.893               0.329                2.55%
taxi     orc      zlib         2.766               0.063                2.28%
taxi     parquet  zlib         3.496               0.718                20.54%
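As a rough illustration of projection (again with pyarrow rather than the Java readers used for the numbers above), only the requested column chunks are read, decompressed, and deserialized; the file and column names are the invented ones from the earlier sketch.

```python
import pyarrow.parquet as pq

# Full scan: every column is read, decompressed, and deserialized.
full = pq.read_table("rides.parquet")

# Projection: only the 'fare' column chunks are touched.
fares = pq.read_table("rides.parquet", columns=["fare"])

print(full.num_columns, "columns vs", fares.num_columns, "column")
```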

Projection & Predicate Pushdown: Sometimes you have a filter predicate on the table, so the reader selects a superset of the rows that match. Selective filters have a huge impact. This improves the data layout options: with sorting, it is better than partition pruning. ORC has added optional bloom filters.
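A hedged pyarrow sketch of predicate pushdown follows: the footer’s min/max statistics let the reader skip row groups that cannot possibly match before decompressing them, and the surviving superset is then filtered row by row. The file and column names are the invented ones from earlier.

```python
import pyarrow.parquet as pq

# Row groups whose min/max statistics rule out fare > 10.0 are skipped
# entirely; the remaining rows are filtered after decoding.
expensive = pq.read_table(
    "rides.parquet",
    columns=["vendor_id", "fare"],
    filters=[("fare", ">", 10.0)],
)
print(expensive.num_rows, "matching rows")
```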

Metadata Access: ORC & Parquet store metadata in the file footer: the file schema, the number of records, and the min, max, & count of each column. This provides O(1) access to the metadata.
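For example, pyarrow exposes the Parquet footer directly, so the schema, row count, and per-column statistics come back without scanning any data pages; this is a sketch against the hypothetical rides.parquet from the earlier example.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("rides.parquet")

# Everything below is served from the footer.
print(pf.schema_arrow)            # file schema
print(pf.metadata.num_rows)       # number of records
stats = pf.metadata.row_group(0).column(1).statistics
print(stats.min, stats.max, stats.num_values)   # per-column min/max/count
```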

Conclusions

Recommendations: Disclaimer – everything changes! Both these benchmarks and the formats will change. Don’t use JSON for processing. If your use case needs column projection or predicate pushdown: ORC or Parquet.

Recommendations: For complex tables with common strings, Avro with Snappy is a good fit (w/o projection). For other tables, ORC with Zlib or Snappy is a good fit. Tweet benchmark suggestions to @owen_omalley.

Fun Stuff: Built an open benchmark suite for files. Built pieces of a tool to convert files between Avro, CSV, JSON, ORC, & Parquet. Built a random parameterized generator: easy to model arbitrary tables; can write to Avro, ORC, or Parquet.
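The generator idea can be sketched as a table of per-column value factories that replay statistics observed in a real table. This toy version is not the benchmark’s actual generator; the column names and distributions are hand-picked for illustration.

```python
import random

# Hypothetical per-column factories standing in for learned statistics.
columns = {
    "fare":      lambda: round(random.lognormvariate(2.3, 0.6), 2),
    "vendor_id": lambda: random.choice(["CMT", "VTS", "DDS"]),
    "tip_given": lambda: random.random() < 0.6,
}

def generate_rows(n):
    # Yield n random rows shaped like the modelled table.
    for _ in range(n):
        yield {name: make() for name, make in columns.items()}

for row in generate_rows(3):
    print(row)
```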

Remaining work: Extend the benchmark with LZO and LZ4. Finish the predicate pushdown benchmark. Add a C++ reader for ORC, Parquet, & Avro. Add a Presto ORC reader.

Thank you! Twitter: @owen_omalley Email: owen@hortonworks.com