Scaling Spark on HPC Systems Presented by: Jerrod Dixon
Outline HDFS vs Lustre MapReduce vs Spark Spark on HPC Experimental Setup Results Conclusions
HDFS vs Lustre
Hadoop HDFS Distributed filesystem Multi-node block replication Clients communicate directly with the NameNode for metadata
Lustre Very popular filesystem for HPC systems Leverages a Management Server (MGS), Metadata Server (MDS), and Object Storage Servers (OSS)
Lustre Full POSIX support Metadata Server tells clients where the objects that make up a file are located Clients then connect directly to the Object Storage Servers
MapReduce vs Spark
MapReduce Typical method of interacting with data on HDFS Maps records in files to key-value pairs Reduces each unique key to an aggregated value
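A conceptual sketch of that flow in plain Scala collections (the word-count input is made up for illustration): records are mapped to key-value pairs, then each unique key is reduced to a single aggregated value.

```scala
// Conceptual word count mirroring the MapReduce flow:
// map records to key-value pairs, then reduce per unique key.
object MapReduceSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("spark on hpc", "spark on lustre")  // illustrative input

    // Map phase: emit (word, 1) pairs.
    val pairs = lines.flatMap(_.split("\\s+")).map(w => (w, 1))

    // Reduce phase: combine the values for each unique key.
    val counts = pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

    counts.foreach(println)  // e.g. (spark,2), (on,2), (hpc,1), (lustre,1)
  }
}
```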
Spark Similar overall methodology to MapReduce Keeps intermediate results in memory Distributes data across global and local scopes
Spark – Vertical Data Processes from disk only when final results are requested Pulls from the filesystem and works against data in a batch methodology
Spark – Horizontal Data Distributes work across nodes as data is processed Distribution similar to HDFS replication, but partitions are kept in memory
Spark Operates primarily on Resilient Distributed Datasets (RDDs) Map transformations can be nested but are lazy A reduce (action) forces processing A caching method forces a mapped RDD to stay in memory Note: 'lazy' means Spark does not execute transformations until the data is needed
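A minimal Spark sketch of this lazy-evaluation behavior (the input path and app name are assumptions): the map transformations only build a lineage, cache() marks the RDD to be kept in memory once computed, and the reduce action is what finally triggers execution.

```scala
import org.apache.spark.sql.SparkSession

object LazyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LazyDemo").getOrCreate()
    val sc = spark.sparkContext

    // Transformations are lazy: nothing executes yet, only the lineage is recorded.
    val squares = sc.textFile("numbers.txt")   // hypothetical input file
      .map(_.trim.toInt)
      .map(n => n * n)

    // cache() marks the RDD to be kept in memory after its first computation.
    squares.cache()

    // An action (reduce) forces the whole lineage to execute.
    val total = squares.reduce(_ + _)
    println(s"sum of squares = $total")

    spark.stop()
  }
}
```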
Spark on HPC
Spark on HPC Spark designed for HDFS Works on data in batches Expects partial data on local disk Executes jobs as results are requested Vertical data movement
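One common way to handle Spark's local-disk expectation on diskless HPC nodes is to redirect its scratch space; a minimal sketch, where the /dev/shm and BurstBuffer paths are assumptions rather than the paper's exact configuration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch: point Spark's shuffle/spill directory away from Lustre and at
// node-local, memory-backed storage instead (paths are illustrative).
object HpcLocalDirs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("HpcLocalDirs")
      .set("spark.local.dir", "/dev/shm/spark")       // node-local, memory-mapped
      // .set("spark.local.dir", "/bb/spark-scratch") // or a BurstBuffer mount (hypothetical path)

    val spark = SparkSession.builder.config(conf).getOrCreate()
    // ... job body ...
    spark.stop()
  }
}
```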
Experimental Setup
Hardware Edison and Cori Cray XC supercomputers at NERSC Edison uses 5,576 compute nodes Each has two 2.4 GHz 12-core Intel “Ivy Bridge” processors Cori uses 1,630 compute nodes Each has two 2.3 GHz 16-core Intel “Haswell” processors.
Edison cluster Leverages Lustre Standard implementation Single MDS, single MDT
Cori Cluster Leverages Lustre Leverages BurstBuffer Accelerates I/O performance
BurstBuffer Sits between memory and Lustre Stores frequently accessed files to improve I/O
Results
Single Node Clear bottleneck in communicating with disk
Multi-node file I/O
BurstBuffer
GroupBy Benchmark 16 nodes (384 cores) on Edison, weak scaling Partitions must be exchanged between nodes (all-to-all shuffle) shm – memory-mapped storage
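A hedged sketch of a GroupBy-style shuffle benchmark (key range, pair counts, and partition count are illustrative, not the paper's exact parameters); groupByKey forces every partition to exchange data with every other, which is what stresses the shuffle storage (Lustre, shm, or BurstBuffer):

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

object GroupByBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("GroupByBenchmark").getOrCreate()
    val sc = spark.sparkContext

    val numPartitions = 384        // e.g. one per core on 16 Edison nodes
    val pairsPerPartition = 100000 // illustrative size

    // Generate random (key, value) pairs on every partition.
    val pairs = sc.parallelize(0 until numPartitions, numPartitions).flatMap { _ =>
      val rng = new Random()
      (0 until pairsPerPartition).map(_ => (rng.nextInt(1000), 1))
    }

    // groupByKey triggers an all-to-all shuffle through spark.local.dir.
    val t0 = System.nanoTime()
    val numKeys = pairs.groupByKey().count()
    val elapsed = (System.nanoTime() - t0) / 1e9
    println(f"groupByKey over $numKeys%d keys took $elapsed%.2f s")

    spark.stop()
  }
}
```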
GroupBy Benchmark Cori-specific results
Impact of BurstBuffer Increase in mean operation time Lower variability in access time
Conclusions
No mention of .persist() or .cache() Spark's memory-management hooks for preserving processed partitions against eviction .cache() is shorthand for .persist() with the default MEMORY_ONLY storage level
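For context, a sketch of what those calls look like (dataset size and app name are made up): .cache() is equivalent to .persist(StorageLevel.MEMORY_ONLY), while .persist() with another storage level can spill evicted partitions to node-local disk instead of recomputing them.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PersistSketch").getOrCreate()
    val sc = spark.sparkContext

    val data = sc.parallelize(1 to 1000000).map(n => (n % 100, n.toLong))

    data.cache()                                   // same as persist(StorageLevel.MEMORY_ONLY)
    // data.persist(StorageLevel.MEMORY_AND_DISK)  // alternative: spill evicted partitions to disk

    // The first action materializes and caches; later actions reuse the cached RDD.
    println(data.reduceByKey(_ + _).count())
    println(data.values.sum())

    spark.stop()
  }
}
```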
Conclusions Clear limitations to using Lustre as the filesystem Increases in access time and decreases in processing throughput BurstBuffer helps, but only up to a certain number of nodes No discussion of Spark-level methods to overcome these issues
Issues Weak scaling covered extensively Strong scaling barely covered at all No comparisons to equivalent work on an HDFS system Since Spark is designed for HDFS, comparing the HPC results against a standard HDFS deployment would seem intuitive
Questions?