SETL: Efficient Spark ETL on Hadoop
Agenda
Context
Idea 1: Reducing memory footprint
Idea 2: File size management
Idea 3: Low-level file-format APIs
Performance numbers: 83% reduction in task hours, 87% reduction in file count
Context: A common Hadoop ETL workload
Input: Log data from Web servers, beacons, IoT devices, and so on
Processing: Transformation into a fixed target schema; cleansing
Output: Clean granular data; clean aggregations; exception report of rows that could not be imported
Running example for this talk
Input: Beacon records imported once an hour
Granular data: Avro output for exporting to external systems; Parquet output for the AWS Athena query engine; 45-column schema with 3 map columns, one array column, and the rest scalar
Aggregate data: Aggregated on date, hour, country, event-type, and two other fields, and combinations thereof; CSV files
Exception report: CSV files with the error message and the erroneous row
Size: 1 billion events per day; Avro ~80 GB per day, Parquet ~60 GB
Traditional approach using Spark
Problems with the traditional approach
Memory requirement is proportional to input size
If the number of containers is too small, data is serialized and deserialized in each stage
Need to provision for the peak input size
Input size varies with the hour of the day and/or the day of the week
If a large backlog needs to be processed, the job may fail with "out of memory"
Multiple scans of the entire dataset lead to inefficiency
SETL: Save output as you scan
Benefits
Granular data is scanned just once
Saved as a side effect of processing rather than as a separate action on the RDD
Only a small amount of data is kept in memory
Memory requirements grow sublinearly with input size, depending on the cardinality of the aggregation dimensions
Common objection: side effects are not resilient against task failures
Not in our case: if a task is rerun, its outputs are overwritten
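The save-as-you-scan pattern can be sketched in plain Python (a conceptual sketch only: the record layout, the `setl_pass` name, and the stand-in cleansing step are assumptions; the real SETL runs inside Spark tasks). Granular rows are written out as a side effect of the single scan, while only the small additive-aggregate state stays in memory:

```python
import csv
from collections import defaultdict

def setl_pass(records, granular_out, agg_dims):
    """Single scan: write each clean record as a side effect and fold it into
    in-memory aggregates whose size depends only on dimension cardinality."""
    writer = csv.writer(granular_out)
    aggregates = defaultdict(int)   # additive aggregate: event count per key
    errors = []                     # rows for the exception report
    for rec in records:
        try:
            clean = {k: str(v) for k, v in rec.items()}  # stand-in cleansing step
            writer.writerow(clean.values())              # granular output, saved as we scan
            key = tuple(clean[d] for d in agg_dims)
            aggregates[key] += 1
        except (KeyError, TypeError) as e:
            errors.append((str(e), rec))
    return aggregates, errors
```

Memory holds only the aggregate dictionary and the current record, so it grows with the number of distinct dimension combinations, not with the number of input rows.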
Limitations
Limited benefits for non-additive aggregates, e.g. count-distinct, median, percentiles
Workaround: approximate algorithms, e.g. HyperLogLog for count-distinct, the Munro-Paterson algorithm for quantiles
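The count-distinct workaround can be illustrated with a minimal HyperLogLog sketch (simplified for exposition: a production implementation adds bias correction and a faster hash; the parameter choices here are illustrative):

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch for approximate count-distinct."""

    def __init__(self, p: int = 12):
        self.p = p                # 2**p registers; more registers, lower error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        # 64-bit hash: the first p bits pick a register, the rest feed the rank.
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:              # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw
```

The state is a fixed array of 2^p registers regardless of how many events are added, which is exactly what lets a non-additive aggregate fit the save-as-you-scan model.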
Problem: Hard to control the size of output files
Desired: The programmer specifies the file size, and the input size determines the number of files
Spark: The programmer specifies the number of files, and the input size determines their sizes
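A common workaround in stock Spark is to invert the relationship by hand: estimate the input size and derive the partition count to pass to `repartition`/`coalesce`. A minimal sketch (the function name and byte-size inputs are assumptions):

```python
import math

def partitions_for(input_bytes: int, target_file_bytes: int) -> int:
    """Derive the output file count from a desired file size, inverting
    Spark's "you pick the count, the input picks the size" model."""
    return max(1, math.ceil(input_bytes / target_file_bytes))
```

For example, `partitions_for(80 * 2**30, 512 * 2**20)` yields 160 files of roughly 512 MiB for an 80 GiB input. The size estimate is itself approximate (compression ratios vary), which is why the block-level merge below is the more robust fix.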
Manage file sizes by moving blocks
Avro, ORC, and Parquet files consist of large relocatable blocks
These files can be merged into larger files without looking inside the blocks
CSV and JSON files are wholly relocatable
Simply concatenating these files produces larger files
The merge can be done as a Spark job
Multiple file formats can be merged in a single Spark job
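For the wholly relocatable formats (CSV, JSON lines), the merge can be sketched as a greedy grouping plus byte-level concatenation (names are assumptions; Avro, ORC, and Parquet additionally require copying blocks and rewriting file metadata such as the footer, not raw bytes):

```python
import shutil

def plan_merge(part_sizes, target_bytes):
    """Greedily group (name, size) part files so each merged output
    lands close to, but not far above, the target size."""
    groups, current, current_size = [], [], 0
    for name, size in part_sizes:
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

def concat_group(paths, out_path):
    """CSV/JSON-lines parts are wholly relocatable: byte-level
    concatenation of whole files yields a valid larger file."""
    with open(out_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as f:
                shutil.copyfileobj(f, out)
```

Each group becomes one task in the merge Spark job, so outputs of several formats can be merged in a single pass over the file listing.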
Example: Parquet file merge operation
Problem: High-level APIs are inefficient
Multiple layers, designed for generality rather than performance
Every layer walks through the schema object once per record
Boxing and unboxing cause performance problems too
A large number of intermediate objects are created and destroyed
Using the lowest-level API to save the contents gives up to 10x write performance
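The per-record schema walk can be illustrated with a toy serializer (the flat `(name, type)` schema model is a deliberate simplification of Avro's real schema objects):

```python
def generic_write(row, schema):
    """Interpreter-style serializer: re-inspects the schema for every
    record, which is the cost layered, general-purpose writers pay."""
    out = []
    for (name, typ), value in zip(schema, row):
        if typ == "string":
            out.append(str(value))
        elif typ in ("int", "long", "double"):
            out.append(repr(value))
        else:
            raise ValueError(f"unsupported type: {typ}")
    return ",".join(out)
```

The type dispatch and the intermediate list are repeated for every record; over a billion events per day, that per-record interpretation dominates write cost.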
Solution: Use custom writers
How about maintenance? Every time the schema changes, do you end up coding new writers?
An improved Avro compiler generates writers for the given schema
The Avro schema, being an LL(1) grammar, lends itself to robust generated writers
E.g. a Parquet writer generated for a given Avro schema
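The generated-writer idea can be sketched in Python: walk the schema once, at compile time, and emit a specialized function, analogous to what the improved Avro compiler does when generating writers (the names and the flat schema model are assumptions):

```python
def compile_writer(schema):
    """Walk the schema once and emit a specialized serializer, so no
    per-record schema interpretation remains at write time."""
    parts = []
    for i, (name, typ) in enumerate(schema):
        if typ == "string":
            parts.append(f"str(row[{i}])")
        elif typ in ("int", "long", "double"):
            parts.append(f"repr(row[{i}])")
        else:
            raise ValueError(f"unsupported type: {typ}")
    src = "def write(row):\n    return ','.join([" + ", ".join(parts) + "])"
    namespace = {}
    exec(src, namespace)   # generate the writer for this schema only
    return namespace["write"]
```

The schema loop now runs once per schema change rather than once per record, which is also why maintenance is cheap: regenerating the writer is automatic when the schema evolves.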
Example: Layers of Avro file format for Spark
SETL Performance

                                         Traditional   Traditional        Traditional        Full SETL
                                                       w/ low-level IO    w/ low-level IO
                                                                          + file-merge
Executor memory-seconds (GB-second)      198693        149798             162384             33259
% reduction vs. Traditional              -             25%                18%                83%
# of output files per format/partition   100           100                13                 13
Number of executors                      70            70                 70                 20
Memory per executor (MB)                 5120          5120               5120               900
Cores per executor                       2             2                  2                  2
Elapsed time (seconds)                   731           555                600                547
Vcore-seconds                            56754         42775              46376              22155
Conclusion
A new approach to the ETL job
Efficient use of CPU and memory
Predictable container sizes; the duration of the job varies with the input
Desired sizes for the output files without affecting other tuning parameters