1
Scaling Spark on HPC Systems
Presented by: Jerrod Dixon
2
Outline: HDFS vs Lustre, MapReduce vs Spark, Spark on HPC, Experimental Setup, Results, Conclusions
3
HDFS vs Lustre
4
Hadoop HDFS: a distributed filesystem with multi-node replication.
Clients communicate directly with the NameNode for block locations, then with the DataNodes for the data itself.
5
Lustre: a very popular filesystem for HPC systems. Leverages a
Management Server (MGS), a Metadata Server (MDS), and Object Storage Servers.
6
Lustre provides full POSIX support.
The Metadata Server informs clients where the objects making up a file are located; clients then connect directly to the Object Storage Servers.
7
MapReduce vs Spark
8
MapReduce: the typical method of processing data on HDFS.
The map phase turns file data into key-value pairs; the reduce phase collapses each unique key to a single value (see the sketch below).
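A minimal word-count sketch of the map-to-key-value-pairs / reduce-by-key pattern described above, written against the Spark RDD API rather than Hadoop MapReduce; the input and output paths are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCountSketch"))

    // Map phase: each line of the file becomes (word, 1) key-value pairs.
    val pairs = sc.textFile("hdfs:///data/input.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // Reduce phase: collapse to a single value per unique key.
    val counts = pairs.reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///data/output")        // hypothetical output path
    sc.stop()
  }
}
```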
9
Spark: similar in overall methodology to MapReduce,
but it keeps working data in memory and distributes it across global and local scopes.
10
Spark – Vertical Data: reads from disk only when final results are requested; pulls from the filesystem and works against the data in a batch methodology.
11
Spark – Horizontal Data
Distributes work across nodes as data is processed; the distribution is similar to HDFS replication, but the data is force-kept in memory.
12
Spark operates primarily on Resilient Distributed Datasets (RDDs).
Map transformations can be chained but are lazy; a reduce (action) forces processing, and a caching method forces the mapped data into memory. "Lazy" here means that Spark does not execute transformations until the data is actually needed (see the sketch below).
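A small sketch of lazy evaluation and caching with the RDD API, assuming the shared filesystem is mounted at an illustrative POSIX path.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyEvalSketch"))

    // Transformations are lazy: building this chain does no work yet.
    val lengths = sc.textFile("file:///scratch/data.txt")   // hypothetical Lustre-mounted path
      .filter(_.nonEmpty)
      .map(_.length)

    // cache() asks Spark to keep the computed partitions in memory
    // the first time an action materializes them.
    lengths.cache()

    // Actions force execution of the whole lineage.
    val total = lengths.reduce(_ + _)   // first action: reads from disk, fills the cache
    val lines = lengths.count()         // second action: served from the cached partitions
    println(s"total=$total lines=$lines")

    sc.stop()
  }
}
```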
13
Spark on HPC
14
Spark on HPC: Spark was designed for HDFS and works on data in batches.
It expects part of the data to reside on local disk, executes jobs only when results are requested, and relies on vertical data movement (a configuration sketch follows).
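A hedged sketch of how Spark might be pointed at an HPC node layout where there is no local disk: `spark.local.dir` (a real Spark property) is directed at a memory-backed tmpfs, and input is read from the shared filesystem through its POSIX mount. The paths are illustrative, not the authors' configuration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HpcLaunchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkOnLustre")
      // Shuffle and spill files go to a memory-backed tmpfs because HPC
      // compute nodes often have no local disk (path is illustrative).
      .set("spark.local.dir", "/dev/shm/spark-scratch")

    val sc = new SparkContext(conf)

    // Input read straight from the Lustre mount via a POSIX path
    // (hypothetical location on the shared scratch filesystem).
    val data = sc.textFile("file:///scratch/myuser/input/*.txt")
    println(s"partitions = ${data.getNumPartitions}")

    sc.stop()
  }
}
```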
15
Experimental Setup
16
Hardware: Edison and Cori, Cray XC supercomputers at NERSC.
Edison has 5,576 compute nodes, each with two 2.4 GHz 12-core Intel "Ivy Bridge" processors. Cori has 1,630 compute nodes, each with two 2.3 GHz 16-core Intel "Haswell" processors.
17
Edison cluster: leverages Lustre in its standard implementation,
with a single MDS and a single MDT.
18
Cori cluster: leverages Lustre plus a BurstBuffer,
which accelerates I/O performance.
19
BurstBuffer Sits between memory and Lustre
Stores frequently accessed files to improve I/O
20
Results
21
Single node: clear bottleneck in communicating with disk.
22
Multi-node file I/O
23
BurstBuffer
24
GroupBy benchmark: 16 nodes (384 cores) on Edison, weak scaling.
Partitions must be exchanged with other partitions in an all-to-all shuffle; "shm" denotes memory-mapped (shared-memory) storage. A benchmark sketch follows.
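A minimal sketch of a GroupBy-style microbenchmark that forces the all-to-all shuffle described above; the key range, data sizes, and partition count are illustrative, not the exact benchmark used in the paper.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object GroupByBenchSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GroupByBenchSketch"))

    val numPartitions     = 384     // e.g. one per core on 16 Edison nodes (illustrative)
    val pairsPerPartition = 100000  // illustrative size

    // Generate random (key, value) pairs on every partition.
    val pairs = sc.parallelize(0 until numPartitions, numPartitions).flatMap { _ =>
      val rng = new Random()
      (0 until pairsPerPartition).map(_ => (rng.nextInt(1000), rng.nextInt()))
    }

    // groupByKey forces an all-to-all shuffle: every partition exchanges data
    // with every other partition through the shuffle storage (local disk,
    // /dev/shm, or the Lustre/BurstBuffer path under test).
    val t0      = System.nanoTime()
    val keys    = pairs.groupByKey().count()
    val elapsed = (System.nanoTime() - t0) / 1e9
    println(f"grouped $keys%d keys in $elapsed%.2f s")

    sc.stop()
  }
}
```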
25
GroupBy benchmark: Cori-specific results.
26
Impact of the BurstBuffer: an increase in mean time per operation,
but lower variability in access time.
27
Conclusions
28
No mention of .persist() or .cache():
Spark's memory-management methods for preserving processed partitions against eviction. .cache() is simply a mask over .persist() with the basic default parameters (MEMORY_ONLY mode); see the sketch below.
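A short sketch showing that .cache() and .persist(StorageLevel.MEMORY_ONLY) are equivalent, and that other storage levels can spill evicted partitions to disk instead of forcing recomputation; the data here is synthetic and purely illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistVsCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PersistVsCacheSketch"))

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
    val inMemory = sc.parallelize(1 to 1000000).map(x => x.toLong * x)
    inMemory.cache()                                // same effect as the commented line below
    // inMemory.persist(StorageLevel.MEMORY_ONLY)

    // persist() also exposes levels that spill evicted partitions to disk,
    // avoiding a full recomputation (or re-read from a slow shared filesystem)
    // when executor memory runs out.
    val spillable = sc.parallelize(1 to 1000000).map(x => x.toLong * x)
    spillable.persist(StorageLevel.MEMORY_AND_DISK)

    println(inMemory.count() + spillable.count())   // actions force materialization
    sc.stop()
  }
}
```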
29
Conclusions: clear limitations to using Lustre as the filesystem.
Access times increase and processing throughput decreases; the BurstBuffer helps, but only at certain node counts. There is no discussion of Spark methods to overcome these issues.
30
Issues: weak scaling is covered extensively,
but strong scaling is hardly covered at all. There are no comparisons to equivalent work on an HDFS system; since Spark is designed for HDFS, comparing the HPC results against a standard HDFS implementation seems intuitive.
31
Questions?