Performance and Insights on File Formats – 2.0 Luca Menichetti, Vag Motesnitsalis
Design and Expectations
2 Use Cases:
  Exhaustive (operation using all values of a record)
  Selective (operation using limited values of a record)
5 Data Formats: CSV, Parquet, serialized RDD objects, JSON, Apache Avro
The tests gave insights into the specific advantages and disadvantages of each format, as well as their time and space performance.
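To make the two use cases concrete, here is a minimal sketch in Python. The records and their field names (`user`, `bytes_read`, `bytes_written`, `duration`) are hypothetical and do not come from the EOS or Dashboard data; the two operations are illustrations, not the actual test jobs.

```python
# Hypothetical records; field names and values are illustrative only.
records = [
    {"user": "a", "bytes_read": 100, "bytes_written": 20, "duration": 3},
    {"user": "b", "bytes_read": 250, "bytes_written": 0, "duration": 7},
    {"user": "a", "bytes_read": 50, "bytes_written": 10, "duration": 1},
]


def exhaustive(recs):
    """Use every value of every record (here: count all non-empty values)."""
    return sum(1 for r in recs for v in r.values() if v not in ("", None))


def selective(recs, field):
    """Use a single field of every record (here: sum one numeric field)."""
    return sum(r[field] for r in recs)


def fields_touched(recs, used_fields):
    """Fraction of the stored values that an operation actually needs."""
    total = sum(len(r) for r in recs)
    needed = sum(1 for r in recs for k in r if k in used_fields)
    return needed / total
```

In this sketch the selective operation needs only a quarter of the stored values (`fields_touched(records, {"bytes_read"})`), which is the kind of workload where a columnar layout can avoid reading the rest.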
Experiment descriptions
For the “exhaustive” use case (UC1) we used EOS logs “processed” data. The current default data format is CSV.
For the “selective” use case (UC2) we used experiment Job Monitoring data from the Dashboard. The current default data format is JSON.
For each use case, all formats were generated in advance from the default format, and then the tests were executed.
Technology: Spark (Scala) with the SparkSQL library.
No tests were performed with compression.
Formats
CSV – text files, comma-separated values, one record per line
JSON – text files, JavaScript objects, one per line
Serialized RDD Objects (SRO) – Spark dataset serialized on text files
Avro – serialization format with binary encoding
Parquet – columnar format with binary encoding
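The difference between the two text formats can be sketched with the Python standard library. The rows and field names below are hypothetical; the point is only that JSON, with one object per line, repeats every field name in every record, while CSV stores the values alone.

```python
import csv
import io
import json

# Hypothetical rows; the field names are illustrative only.
rows = [
    {"job_id": 1, "site": "CERN", "status": "done", "cpu_time": 12.5},
    {"job_id": 2, "site": "CERN", "status": "failed", "cpu_time": 3.0},
]


def as_csv(rows):
    """One record per line, values only (no header line)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow(row.values())
    return buf.getvalue()


def as_json_lines(rows):
    """One JSON object per line, field names repeated in every record."""
    return "".join(json.dumps(row) + "\n" for row in rows)


csv_size = len(as_csv(rows).encode("utf-8"))
json_size = len(as_json_lines(rows).encode("utf-8"))
```

Comparing `csv_size` with `json_size` on the same rows gives a feel for where the extra space of JSON comes from; the actual ratios for the test data are those reported in the deck, not the ones of this toy input.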
Space Requirements (in GB) [figure]
Spark executions
    for i in {1..50}; do
      for format in CSV JSON SRO Avro Parquet; do
        for UC in UC1 UC2; do
          # 2 executors, 2 cores and 2 GB of memory per executor
          spark-submit --num-executors 2 --executor-cores 2 --executor-memory 2G \
            --class "ch.cern.awg.Test${UC}${format}" formats-analyses.jar \
            "input-${UC}-${format}" > "output-${UC}-${format}-${i}"
        done
      done
    done
We took the times from all (UC, format) jobs and calculated an average for each type of execution, after removing outliers.
Times include reading and computation (the test jobs do not write any files; they only print the result to stdout).
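The deck does not say how outliers were identified. As a sketch only, the averaging step could look like the following, assuming the common 1.5 × IQR rule; the function name, the parameter `k`, and the sample times are all hypothetical.

```python
import statistics


def mean_without_outliers(times, k=1.5):
    """Average after dropping values outside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule is an assumption; the slides only say outliers were deleted.
    """
    q1, _, q3 = statistics.quantiles(times, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [t for t in times if low <= t <= high]
    return statistics.mean(kept)


# Hypothetical run times in seconds; 95.0 stands for one disturbed run.
sample = [40.1, 41.3, 39.8, 40.6, 95.0, 40.9, 41.0]
```

Calling `mean_without_outliers(sample)` averages the six regular runs and ignores the disturbed one, so a single slow execution does not distort the figure for that (UC, format) pair.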
Times: UC1 "Exhaustive" [figure; label: GB]
Times: UC2 "Selective" [figure; label: GB]
Time Comparison between UC1 and UC2 [figure]
Space and Time Performance Gain/Loss [compared to current default format]

                                            CSV        JSON     SRO      Avro     Parquet
Space UC1 [EOS logs], default CSV           =          +84%     +56%     -8%      -51%
Time performance UC1                        =          +215%    +93%     =        +35%
Space UC2 [Job Monitoring], default JSON    -54%       =        -40%     -51%     -84%
Time performance UC2                        [missing]  =        -35%     -54%     -79%
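The slides do not state the formula behind these percentages. A plausible reading, given here as an assumption, is the signed relative change with respect to the default format, so that a positive value means more space or time than the default. The function names below are hypothetical.

```python
def relative_change(value, baseline):
    """Signed percentage change of `value` with respect to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100.0


def format_change(value, baseline):
    """Render the change as '=', '+ N %' or '- N %', rounded to a whole percent."""
    pct = round(relative_change(value, baseline))
    if pct == 0:
        return "="
    sign = "+" if pct > 0 else "-"
    return f"{sign} {abs(pct)} %"
```

For example, with a made-up baseline of 100 GB, a format taking 184 GB would be shown as `+ 84 %` and one taking 49 GB as `- 51 %`.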
Pros and Cons
CSV
  Pros: Always supported and easy to use. Efficient.
  Cons: No schema change allowed. No type definitions. No declaration control.
JSON
  Pros: Encoded in plain text (easy to use). Schema changes allowed.
  Cons: Inefficient. High space consumption. No declaration control.
Serialized RDD Objects
  Pros: Declaration control. Choice “between” CSV and JSON (for space and time). Good for storing aggregate results.
  Cons: Spark only. No compression. Schema changes allowed, but they must be implemented manually.
Avro
  Pros: Schema changes allowed. Efficiency comparable to CSV. Compression definition included in the schema.
  Cons: Space consumption like CSV (not really a negative). Needs a plugin (we found an incompatibility between our Spark version and the Avro library; we had to fix and recompile it).
Parquet
  Pros: Low space consumption (RLE). Extremely efficient for “selective” use cases, with good performance in other cases as well.
  Cons: Needs a plugin. Slow to generate.
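The "(RLE)" note refers to run-length encoding. The sketch below shows the basic idea on a single hypothetical column; it is not Parquet's actual encoder, which combines RLE with bit-packing and other encodings.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical column with long runs of repeated values.
status = ["done"] * 5 + ["failed"] * 2 + ["done"] * 3
encoded = rle_encode(status)
```

Here ten stored values shrink to three pairs. The benefit depends on how many consecutive repeats a column contains, which is why storing values column by column tends to help.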
Data Formats - Overview

                                          CSV     JSON                            SRO          Avro                   Parquet
Support change of schema                  NO      YES                             YES          YES                    YES
Primitive/complex types                   -       YES (but with general numeric)  YES          YES                    YES
Declaration control                       -       NO                              YES          YES                    YES
Support compression                       YES     YES                             NO           YES                    YES
Storage consumption                       Medium  High                            Medium/High  Medium                 Low (RLE)
Supported by which technologies?          All     All (to be parsed from text)    Spark only   All (needs plugin)     All (needs plugin)
Possibility to print a snippet as sample  YES     YES                             NO           YES (with avro tools)  NO (yes with unofficial tools)
Conclusions
There is no “ultimate” file format, but…
Avro shows promising results for exhaustive use cases, with performance comparable to CSV.
Parquet shows extremely good results for selective use cases and very low space consumption.
JSON is good for storing directly (without any additional effort) data coming from web-like services that might change their format in the future, but it is too inefficient and space-consuming.
CSV is still quite efficient in time and space, but the schema is frozen and validation is left to the user.
Serialized Spark RDD is a good solution for storing Scala objects that need to be reused soon (such as aggregated results to plot or intermediate results to save for future computation), but it is not advisable as a final format since it is not a general-purpose format.
Thank You
Spark UC1 executions [figure]
Spark UC2 executions [figure]