Performance and Insights on File Formats – 2.0 Luca Menichetti, Vag Motesnitsalis.


Design and Expectations

Two use cases:
- Exhaustive (an operation using all values of a record)
- Selective (an operation using a limited set of values of a record)

Five data formats: CSV, Parquet, serialized RDD objects, JSON, Apache Avro.

The tests gave insights on specific advantages and disadvantages of each format, as well as on their time and space performance.
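The two access patterns can be sketched in a few lines of Python. This is illustrative only; the record fields below are hypothetical, not the actual EOS or Dashboard schemas:

```python
# Toy records standing in for rows of a monitored dataset.
# Field names are hypothetical, not the real EOS/Dashboard schemas.
records = [
    {"host": "a", "bytes": 100, "ms": 5, "user": "u1"},
    {"host": "b", "bytes": 300, "ms": 9, "user": "u2"},
    {"host": "a", "bytes": 200, "ms": 7, "user": "u1"},
]

# UC1 "exhaustive": the operation touches every value of each record,
# e.g. a size-like aggregate over all fields.
exhaustive = sum(len(str(v)) for r in records for v in r.values())

# UC2 "selective": the operation touches only a few values per record,
# e.g. summing a single column.
selective = sum(r["bytes"] for r in records)

print(exhaustive, selective)
```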

Experiment Descriptions

For the "exhaustive" use case (UC1) we used processed EOS log data.
- Current default data format: CSV.
For the "selective" use case (UC2) we used experiment Job Monitoring data from the Dashboard.
- Current default data format: JSON.
For each use case, all formats were generated a priori (from the default format) and then the tests were executed.
Technology: Spark (Scala) with the SparkSQL library.
No tests were performed with compression.

Formats

CSV – text files, comma-separated values, one record per line
JSON – text files, JavaScript objects, one per line
Serialized RDD Objects (SRO) – Spark datasets serialized to text files
Avro – serialization format with binary encoding
Parquet – columnar format with binary encoding
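The space gap between the two text formats is easy to see with standard-library Python: JSON repeats the field names in every record, while CSV stores them once in a header. This is a toy illustration of that overhead, not the benchmark itself:

```python
import csv
import io
import json

# Synthetic rows; field names are invented for the example.
rows = [{"path": f"/eos/file{i}", "bytes": i * 1024, "ms": i % 7}
        for i in range(1000)]

# CSV: one header line, then values only.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["path", "bytes", "ms"])
writer.writeheader()
writer.writerows(rows)
csv_size = len(buf.getvalue())

# JSON lines: every record repeats every key.
json_size = sum(len(json.dumps(r)) + 1 for r in rows)

print(csv_size, json_size)  # JSON is noticeably larger for the same data
```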

Space Requirements (in GB) [chart not transcribed]

Spark Executions

for i in {1..50}
  for format in {CSV, JSON, SRO, Avro, Parquet}
    for UC in {UC1, UC2}
      spark-submit --num-executors 2 --executor-cores 2 --executor-memory 2G \
                   --class ch.cern.awg.Test$UC$format \
                   formats-analyses.jar input-$UC-$format > output-$UC-$format-$i

We took the times from all (UC, format) jobs and computed an average for each type of execution (after deleting outliers). Times include both reading and computation (the test jobs do not write any files; they just print the result to stdout).
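The averaging step can be sketched in Python. The 1.5×IQR rule used here is an assumption: the slides only say that outliers were deleted before averaging, not which criterion was used.

```python
import statistics

def robust_mean(times):
    """Mean of the run times after discarding IQR outliers.

    The 1.5*IQR cutoff is an assumed choice; the slides do not
    specify the outlier-rejection rule that was actually applied.
    """
    q1, _, q3 = statistics.quantiles(times, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [t for t in times if lo <= t <= hi]
    return statistics.mean(kept)

# Run times (seconds) of one (UC, format) job, with one obvious outlier.
runs = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 95.0]
print(robust_mean(runs))  # close to 10, the 95.0 outlier is dropped
```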

Times: UC1 "Exhaustive" [chart not transcribed]

Times: UC2 "Selective" [chart not transcribed]

Time Comparison between UC1 and UC2 [chart not transcribed]

Space and Time Performance Gain/Loss (compared to the current default format)

                                       CSV     JSON     SRO     Avro    Parquet
Space, UC1 [EOS logs; default CSV]      =      +84%    +56%     -8%     -51%
Time,  UC1                              =      +215%   +93%      =      +35%
Space, UC2 [Job Monitoring; JSON]     -54%      =      -40%    -51%    -84%
Time,  UC2                              ?       =      -35%    -54%    -79%

(? = value not recoverable from the transcript)

Pros and Cons

CSV
  Pros: Always supported and easy to use. Efficient.
  Cons: No schema changes allowed. No type definitions. No declaration control.
JSON
  Pros: Encoded in plain text (easy to use). Schema changes allowed.
  Cons: Inefficient. High space consumption. No declaration control.
Serialized RDD Objects (SRO)
  Pros: Declaration control. A middle ground between CSV and JSON (for space and time). Good for storing aggregated results.
  Cons: Spark only. No compression. Schema changes allowed, but they must be implemented manually.
Avro
  Pros: Schema changes allowed. Efficiency comparable to CSV. Compression definition included in the schema.
  Cons: Space consumption similar to CSV (not really a negative). Needs a plugin (we found an incompatibility between our Spark version and the Avro library, and had to fix and recompile it).
Parquet
  Pros: Low space consumption (RLE). Extremely efficient for "selective" use cases, with good performance in other cases too.
  Cons: Needs a plugin. Slow to generate.

Data Formats – Overview

                           CSV     JSON                     SRO          Avro                    Parquet
Supports schema changes    NO      YES                      YES          YES                     YES
Primitive/complex types    -       YES (generic numerics)   YES          YES                     YES
Declaration control        -       NO                       YES          YES                     YES
Supports compression       YES     YES                      NO           YES                     YES
Storage consumption        Medium  High                     Medium/High  Medium                  Low (RLE)
Supported technologies     All     All (parsed from text)   Spark only   All (needs a plugin)    All (needs a plugin)
Can print a sample snippet YES     YES                      NO           YES (with avro-tools)   NO (unofficial tools only)

Conclusions

There is no "ultimate" file format, but:
- Avro shows promising results for exhaustive use cases, with performance comparable to CSV.
- Parquet shows extremely good results for selective use cases, with very low space consumption.
- JSON is good for directly storing (without any additional effort) data coming from web-like services that might change their format in the future, but it is too inefficient and consumes too much space.
- CSV is still quite efficient in time and space, but the schema is frozen and validation is left to the user.
- Serialized Spark RDDs are a good solution for storing Scala objects that need to be reused soon (such as aggregated results to plot, or intermediate results saved for future computation), but they are not advisable as a final format, since they are not a general-purpose format.
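Parquet's advantage in the selective case comes from its columnar layout: reading one field touches only that column's bytes, not whole records. A stdlib-only toy model of the idea (a sketch of the layout principle, not Parquet's actual on-disk format):

```python
# Row layout: records stored one after another, so a selective read
# must scan past every field of every record. Field values are invented.
row_store = [("file%d" % i, i * 1024, i % 7) for i in range(1000)]

# Columnar layout: the same data pivoted so each field is contiguous.
paths, sizes, ms = (list(col) for col in zip(*row_store))

# Selective query "total size": both layouts give the same answer...
assert sum(sizes) == sum(r[1] for r in row_store)

# ...but the bytes touched (rough proxy: stringified value lengths)
# differ: one column versus all three columns.
col_bytes = sum(len(str(v)) for v in sizes)
row_bytes = sum(len(str(v)) for r in row_store for v in r)

print(col_bytes, row_bytes)  # the columnar scan touches far fewer bytes
```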

Thank You

Spark UC1 Executions [backup slide; chart not transcribed]

Spark UC2 Executions [backup slide; chart not transcribed]