Contention-Aware Resource Scheduling for Burst Buffer Systems

Weihao Liang (1,2), Yong Chen (2), Jialin Liu (3), Hong An (1)
1. University of Science and Technology of China
2. Texas Tech University
3. Lawrence Berkeley National Laboratory

August 13th, 2018
Scientific Applications Trend
- Scientific applications tend to be data intensive
  - A GTC run on 29K cores on the Titan machine at OLCF manipulated over 54 terabytes of data in a 24-hour period over a decade ago
- "Big data": data-driven discovery; computing-based innovation and discovery have become highly data intensive
- Data growth rate is higher than the growth in computational capability

Data requirements for selected projects at NERSC:
Application   Category              Data
FLASH         Nuclear Physics       10 PB
Nyx           Cosmology
VPIC          Plasma Physics        1 PB
LHC           High Energy Physics   100 PB
LSST                                60 PB
Meraculous    Genomics              100 TB ~ 1 PB

Source: S. Byna et al., Lawrence Berkeley National Laboratory, 2017
I/O Bottleneck
- Increasing performance gap between computing and I/O capability
- Multicore/manycore computing power has grown drastically, while data-access performance has improved only slowly

Source: DataCore Software Corp., 2016
Storage System Trend
- Newly emerged SSD-based burst buffers provide a promising solution for bridging the I/O gap
- Increase the depth of the storage hierarchy
Burst Buffer System of Cori@NERSC
Features:
- Dedicated nodes with SSDs, 288 nodes in total
- Each BB node has two Intel P3608 3.2TB NAND flash SSDs: 6.4TB capacity and 6.5GB/s R/W bandwidth per node
- Quickly absorbs I/O traffic
- Asynchronous data flush
- In-transit data analysis
- Shared among users and applications

Source: W. Bhimji et al., Lawrence Berkeley National Laboratory, 2016
Allocation Strategies for Burst Buffer
Bandwidth Strategy
- Allocate as many BB nodes as possible
- Assign them in a round-robin manner
- BB nodes are shared by multiple jobs

Interference Strategy
- Allocate as few BB nodes as possible
- Each allocation is exclusively accessed by its job
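To make the contrast concrete, here is a minimal Python sketch of the two existing policies. It is our illustration rather than code from the paper, and the `GRANULARITY` and `CAP_PER_BB` values are assumed, Cori-like figures.

```python
import itertools

GRANULARITY = 200    # GB allocated on each chosen node per job (assumed value)
CAP_PER_BB = 6400    # usable GB on one BB node (Cori-like figure, assumed)
_rr = itertools.count()   # round-robin starting offset for the bandwidth strategy

def bandwidth_strategy(bb_nodes, request_gb):
    """Spread the allocation over as many BB nodes as the request allows,
    starting at a round-robin offset; nodes end up shared by multiple jobs."""
    width = min(len(bb_nodes), max(1, -(-request_gb // GRANULARITY)))  # ceiling division
    start = next(_rr) % len(bb_nodes)
    return [bb_nodes[(start + i) % len(bb_nodes)] for i in range(width)]

def interference_strategy(bb_nodes, request_gb, busy):
    """Use as few BB nodes as possible and reserve them exclusively;
    `busy` is the set of nodes already reserved by running jobs."""
    needed = max(1, -(-request_gb // CAP_PER_BB))   # fewest nodes that hold the data
    chosen = [n for n in bb_nodes if n not in busy][:needed]
    busy.update(chosen)
    return chosen
```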
Limitations of Existing Strategies
Bandwidth Strategy
- Shares BB nodes but ignores I/O contention: Job1 and Job3 land on the same BB nodes
- Imbalanced utilization: BB1 and BB2 carry a higher load than BB3 and BB4

Interference Strategy
- Low bandwidth gain: each job gets at most one BB node's bandwidth
- Limited by the scale/number of BB nodes instead of the total capacity/bandwidth available
Motivating Example
Experiment with three concurrent jobs writing to 4 BB nodes on Cori under different allocation strategies.

Jobs   # Procs   Write Data
Job1   16        160 GB
Job2   8
Job3   8
Proposed Approach
Contention-Aware Resource Scheduling strategy (CARS)
- Allocate as many BB nodes as possible to meet the user's request
- Assign the most underutilized (lowest-load) BB nodes to each new job
- Use the number of concurrent I/O processes to quantify the load of each BB node (see the sketch after this slide)

Jobs   # Procs   Write Data
Job1   16        160 GB
Job2   8
Job3

Example: BB nodes allocated for the jobs under the contention-aware strategy
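The following is a minimal Python sketch of the CARS selection step; it is our illustration rather than the authors' implementation, and `GRANULARITY` is an assumed per-node allocation unit.

```python
GRANULARITY = 200   # GB allocated on each chosen node per job (assumed value)

def cars_allocate(load, request_gb, num_procs):
    """load: dict mapping BB node id -> number of concurrent I/O processes.
    Returns the BB nodes chosen for the new job and updates `load` in place."""
    # The requested capacity bounds how many nodes the job may span.
    max_nodes = min(len(load), max(1, -(-request_gb // GRANULARITY)))
    # Pick the most underutilized (lowest-load) nodes first.
    chosen = sorted(load, key=load.get)[:max_nodes]
    # The job's I/O processes are assumed to spread evenly over its nodes.
    per_node = num_procs / len(chosen)
    for node in chosen:
        load[node] += per_node
    return chosen
```

With a pool of 4 BB nodes, a first job would be striped over the least-loaded nodes, and later jobs would be steered toward whichever nodes remain underutilized, which is the balancing effect the slide's example illustrates.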
Design Overview
Components: load monitor, job tracer, burst buffer scheduler.

Load monitor
- Maintains the I/O load of each BB node
- Updates status dynamically

Job tracer
- Collects the distribution/concurrency of jobs on the entire burst buffer system

Burst buffer scheduler
- Performs the allocation algorithm
- Allocates BB resources for jobs
- Communicates with the load monitor and job tracer
Design and Implementation
Modeling
Job and allocation model
- Assumption 1: the maximum number of BB nodes assigned to a job is decided by the requested capacity
- Assumption 2: the I/O processes of a job are evenly distributed across all BB nodes assigned to it

I/O contention model
- Assumption 1: one I/O process can transfer data to a BB node at up to bw GB/s, and the peak bandwidth of one BB node is B_M
- Assumption 2: the actual bandwidth of a BB node is evenly divided across all concurrent I/O processes (N)
- Bandwidth of one process: b = bw when N * bw <= B_M (no contention), and b = B_M / N when N * bw > B_M (contention)
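A small Python sketch of this contention model follows, assuming illustrative values of bw = 0.5 GB/s per process and B_M = 6.5 GB/s per node; it is our reading of the two assumptions, not code from the paper.

```python
def node_bandwidth(n_procs, bw=0.5, B_M=6.5):
    """Aggregate bandwidth (GB/s) delivered by one BB node with n_procs
    concurrent I/O processes; it saturates at the node peak B_M."""
    return min(n_procs * bw, B_M)

def per_process_bandwidth(n_procs, bw=0.5, B_M=6.5):
    """Bandwidth seen by each process: bw while the node is unsaturated
    (no contention), B_M / n_procs once it saturates (contention)."""
    return node_bandwidth(n_procs, bw, B_M) / n_procs

# Example: 8 processes demand 8 * 0.5 = 4 GB/s, below the 6.5 GB/s peak, so
# each still gets 0.5 GB/s; with 20 processes the node saturates and each
# process drops to 6.5 / 20 = 0.325 GB/s.
```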
Metrics
Average job efficiency
- TE_i: I/O time of the i-th job when it runs exclusively on its BB nodes
- TC_i: I/O time of the i-th job when it runs on one or more BB nodes shared with other jobs simultaneously (with possible contention)

Average system utilization
- Separate the I/O time into multiple phases (t_i); during the i-th time step, the aggregate I/O bandwidth of all currently active jobs is BW_i
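Written out from these definitions, the two metrics could plausibly take the form below. This is a hedged reconstruction (the slide omits the formulas), and the normalization of utilization by the aggregate peak bandwidth B_peak of the burst buffer pool is our assumption.

```latex
% Average job efficiency over n jobs: exclusive vs. contended I/O time
\bar{E} \;=\; \frac{1}{n}\sum_{i=1}^{n}\frac{T_{E_i}}{T_{C_i}}

% Average system utilization over I/O phases of length t_i with aggregate
% delivered bandwidth BW_i, normalized (assumption) by the peak B_{peak}
U \;=\; \frac{\sum_{i} BW_i \, t_i}{B_{peak}\,\sum_{i} t_i}
```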
Emulation Experiments on Cori
Experimental Platform
- 8 burst buffer nodes on Cori
- 2x Intel Xeon E5 processors, 16 cores @ 2.3 GHz
- 2x Intel P3608 3.2TB SSDs, 128GB memory
- 6.5GB/s peak read/write bandwidth per node

Workload Setting
- IOR benchmark
- Tests with 10 jobs (each job configuration was selected twice)
- Jobs assigned in random order

Jobs   # of Procs   Write Data
Job1   128          4TB
Job2   64
Job3                2TB
Job4
Job5                1TB
Preliminary Results
(Figures: average job efficiency and average system utilization)
- The contention-aware strategy achieved the highest job efficiency and system utilization
- The differences between the analytical and experimental results are fairly small (2% ~ 8%)
Simulation Experiments
Workload Setting
- ~100 randomly generated jobs
- Jobs assigned in different orders:
  - FCFS (first come, first served)
  - SJF (shortest job first)
  - PS (priority scheduling)

Burst Buffer System Setting
- Close to the configuration of Cori's burst buffer system

Job Config   # Procs   Write Data
Job1         8192      80TB
Job2         4096      40TB
Job3         2048
Job4                   20TB
Job5         1024
Job6                   10TB
Job7         512       8TB
Job8         256       4TB

BB Config      Value
# of BBs       256
BW/BB          6.5GB/s
Capacity/BB    6.4TB
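As an illustration of how such a workload could be ordered before being handed to an allocator, here is a small Python sketch; the `Job` fields and the `priority` attribute are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    procs: int
    write_gb: int
    arrival: float = 0.0   # submission time, for FCFS
    priority: int = 0      # hypothetical priority, for PS

def order_jobs(jobs, policy="FCFS"):
    """Return jobs in the dispatch order implied by the queueing policy."""
    if policy == "FCFS":   # first come, first served
        return sorted(jobs, key=lambda j: j.arrival)
    if policy == "SJF":    # shortest job (smallest write volume) first
        return sorted(jobs, key=lambda j: j.write_gb)
    if policy == "PS":     # priority scheduling: highest priority first
        return sorted(jobs, key=lambda j: j.priority, reverse=True)
    raise ValueError(f"unknown policy: {policy}")
```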
Results
(Figures: average job efficiency and average system utilization)
- The contention-aware strategy achieved an improvement of more than 20% over the bandwidth strategy
- The highest utilization is observed with SJF, where the contention-aware strategy achieved nearly 90% system utilization
Conclusions
- Burst buffer is an attractive solution for mitigating the I/O gap for increasingly critical data-intensive sciences
- Existing resource allocation strategies for burst buffers are rather rudimentary and have limitations
- We propose a contention-aware resource scheduling strategy to coordinate concurrent I/O-intensive jobs
- Preliminary results have shown improvements in both job performance and system utilization
Ongoing and Future Work
- Further optimization of CARS with consideration of other I/O patterns
- Explore efficient and powerful data movement strategies among different layers of the deep storage hierarchy
Thank You and Q&A
Weihao Liang   lwh1990@mail.ustc.edu.cn
Yong Chen      yong.chen@ttu.edu
Jialin Liu     jalnliu@lbl.gov
Hong An        han@ustc.edu.cn

Data Intensive Scalable Computing Lab
http://discl.cs.ttu.edu