SETL: Efficient Spark ETL on Hadoop
Agenda
Context
Idea 1: Reducing memory footprint
Idea 2: File size management
Idea 3: Low-level file-format APIs
Performance numbers: 83% reduction in task hours, 87% reduction in file count
Context: A common Hadoop ETL workload
Input: Log data from Web servers, beacons, IoT devices, and so on
Processing: Transformation into a fixed target schema; cleansing
Output: Clean granular data; clean aggregations; exception report of rows that could not be imported
Running example for this talk
Input: Beacon records imported once an hour
Granular data: Avro output for exporting to external systems; Parquet output for the AWS Athena query engine; 45-column schema with 3 map columns, one array column, and the rest scalar
Aggregate data: Aggregated on date, hour, country, event-type, and two other fields, and combinations thereof; CSV files
Exception report: CSV files with the error message and the erroneous row
Size: 1 billion events per day; Avro ~80 GB per day, Parquet ~60 GB
Traditional approach using Spark
Problems with the traditional approach
Memory requirement is proportional to input size
If the number of containers is too small, data is serialized and deserialized in each stage
Need to provision for the peak input size
Input size varies with the hour of the day and/or the day of the week
If a large backlog needs to be processed, the job may fail with "out of memory"
Multiple scans of the entire dataset lead to inefficiency
SETL: Save output as you scan
Benefits
Granular data is scanned just once
Saved as a side effect of processing rather than as a separate action on the RDD
Only a small amount of data is kept in memory
Memory requirements grow sublinearly with input size, depending on the cardinality of the aggregation dimensions
Common objection: side effects are not resilient against task failures
Not in our case: if a task is rerun, its outputs are overwritten
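The save-as-you-scan pattern can be sketched in plain Python (a conceptual sketch only: the record layout, the `setl_pass` name, and the stand-in cleansing step are assumptions; the real SETL runs inside Spark tasks). Granular rows are written out as a side effect of the single scan, while only the small additive-aggregate state stays in memory:

```python
import csv
from collections import defaultdict

def setl_pass(records, granular_out, agg_dims):
    """Single scan: write each clean record as a side effect and fold it into
    in-memory aggregates whose size depends only on dimension cardinality."""
    writer = csv.writer(granular_out)
    aggregates = defaultdict(int)   # additive aggregate: event count per key
    errors = []                     # rows for the exception report
    for rec in records:
        try:
            clean = {k: str(v) for k, v in rec.items()}  # stand-in cleansing step
            writer.writerow(clean.values())              # granular output, saved as we scan
            key = tuple(clean[d] for d in agg_dims)
            aggregates[key] += 1
        except (KeyError, TypeError) as e:
            errors.append((str(e), rec))
    return aggregates, errors
```

Memory holds only the aggregate dictionary and the current record, so it grows with the number of distinct dimension combinations, not with the number of input rows.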
Limitations
Limited benefits for non-additive aggregates, e.g. count-distinct, median, percentiles
Workaround: approximate algorithms, e.g. HyperLogLog for count-distinct, the Munro-Paterson algorithm for quantiles
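The count-distinct workaround can be illustrated with a minimal HyperLogLog sketch (simplified for exposition: a production implementation adds bias correction and a faster hash; the parameter choices here are illustrative):

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch for approximate count-distinct."""

    def __init__(self, p: int = 12):
        self.p = p                # 2**p registers; more registers, lower error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        # 64-bit hash: the first p bits pick a register, the rest feed the rank.
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:              # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw
```

The state is a fixed array of 2^p registers regardless of how many events are added, which is exactly what lets a non-additive aggregate fit the save-as-you-scan model.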
Problem: Hard to control the size of output files
Desired: The programmer specifies the file size, and the input size determines the number of files
Spark: The programmer specifies the number of files, and the input size determines their sizes
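A common workaround in stock Spark is to invert the relationship by hand: estimate the input size and derive the partition count to pass to `repartition`/`coalesce`. A minimal sketch (the function name and byte-size inputs are assumptions):

```python
import math

def partitions_for(input_bytes: int, target_file_bytes: int) -> int:
    """Derive the output file count from a desired file size, inverting
    Spark's "you pick the count, the input picks the size" model."""
    return max(1, math.ceil(input_bytes / target_file_bytes))
```

For example, `partitions_for(80 * 2**30, 512 * 2**20)` yields 160 files of roughly 512 MiB for an 80 GiB input. The size estimate is itself approximate (compression ratios vary), which is why the block-level merge below is the more robust fix.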
Manage file sizes by moving blocks
Avro, ORC, and Parquet files consist of large relocatable blocks
These files can be merged into larger files without looking inside the blocks
CSV and JSON files are wholly relocatable
Simply concatenating these files produces larger files
The merge can be done as a Spark job
Multiple file formats can be merged in a single Spark job
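For the wholly relocatable formats (CSV, JSON lines), the merge can be sketched as a greedy grouping plus byte-level concatenation (names are assumptions; Avro, ORC, and Parquet additionally require copying blocks and rewriting file metadata such as the footer, not raw bytes):

```python
import shutil

def plan_merge(part_sizes, target_bytes):
    """Greedily group (name, size) part files so each merged output
    lands close to, but not far above, the target size."""
    groups, current, current_size = [], [], 0
    for name, size in part_sizes:
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

def concat_group(paths, out_path):
    """CSV/JSON-lines parts are wholly relocatable: byte-level
    concatenation of whole files yields a valid larger file."""
    with open(out_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as f:
                shutil.copyfileobj(f, out)
```

Each group becomes one task in the merge Spark job, so outputs of several formats can be merged in a single pass over the file listing.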
Example: Parquet file merge operation
Problem: High-level APIs are inefficient
Multiple layers, designed for generality rather than performance
Every layer walks through the schema object once per record
Boxing and unboxing cause performance problems too
A large number of intermediate objects are created and destroyed
Using the lowest-level API to save the contents gives up to 10x write performance
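The per-record schema walk can be illustrated with a toy serializer (the flat `(name, type)` schema model is a deliberate simplification of Avro's real schema objects):

```python
def generic_write(row, schema):
    """Interpreter-style serializer: re-inspects the schema for every
    record, which is the cost layered, general-purpose writers pay."""
    out = []
    for (name, typ), value in zip(schema, row):
        if typ == "string":
            out.append(str(value))
        elif typ in ("int", "long", "double"):
            out.append(repr(value))
        else:
            raise ValueError(f"unsupported type: {typ}")
    return ",".join(out)
```

The type dispatch and the intermediate list are repeated for every record; over a billion events per day, that per-record interpretation dominates write cost.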
Solution: Use custom writers
How about maintenance? Every time the schema changes, do you end up coding new writers?
An improved Avro compiler generates writers for the given schema
The Avro schema, being an LL(1) grammar, lends itself to robust generated writers
E.g. a Parquet writer generated for a given Avro schema
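The generated-writer idea can be sketched in Python: walk the schema once, at compile time, and emit a specialized function, analogous to what the improved Avro compiler does when generating writers (the names and the flat schema model are assumptions):

```python
def compile_writer(schema):
    """Walk the schema once and emit a specialized serializer, so no
    per-record schema interpretation remains at write time."""
    parts = []
    for i, (name, typ) in enumerate(schema):
        if typ == "string":
            parts.append(f"str(row[{i}])")
        elif typ in ("int", "long", "double"):
            parts.append(f"repr(row[{i}])")
        else:
            raise ValueError(f"unsupported type: {typ}")
    src = "def write(row):\n    return ','.join([" + ", ".join(parts) + "])"
    namespace = {}
    exec(src, namespace)   # generate the writer for this schema only
    return namespace["write"]
```

The schema loop now runs once per schema change rather than once per record, which is also why maintenance is cheap: regenerating the writer is automatic when the schema evolves.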
Example: Layers of Avro file format for Spark
SETL Performance

                                         Traditional   Traditional        Traditional        Full SETL
                                                       w/ low-level IO    w/ low-level IO
                                                                          + file-merge
Executor memory-seconds (GB-second)      198693        149798             162384             33259
% reduction vs. Traditional              -             25%                18%                83%
# of output files per format/partition   100           100                13                 13
Number of executors                      70            70                 70                 20
Memory per executor (MB)                 5120          5120               5120               900
Cores per executor                       2             2                  2                  2
Elapsed time (seconds)                   731           555                600                547
Vcore-seconds                            56754         42775              46376              22155
Conclusion
A new approach to the ETL job
Efficient use of CPU and memory
Predictable container sizes; the duration of the job varies with the input
Desired sizes for the output files without affecting other tuning parameters