Contention-Aware Resource Scheduling for Burst Buffer Systems

Presentation transcript:

Contention-Aware Resource Scheduling for Burst Buffer Systems
Weihao Liang (1,2), Yong Chen (2), Jialin Liu (3), Hong An (1)
1. University of Science and Technology of China  2. Texas Tech University  3. Lawrence Berkeley National Laboratory
August 13th, 2018

Scientific Applications Trend
- Scientific applications tend to be data intensive: a GTC run on 29K cores on the Titan machine at OLCF manipulated over 54 terabytes of data in a 24-hour period over a decade ago
- "Big data": data-driven discovery and computing-based innovation have become highly data intensive
- Data growth rate is higher than the computational growth rate

Data requirements for selected projects at NERSC:
Application | Category            | Data
FLASH       | Nuclear Physics     | 10 PB
Nyx         | Cosmology           |
VPIC        | Plasma Physics      | 1 PB
LHC         | High Energy Physics | 100 PB
LSST        |                     | 60 PB
Meraculous  | Genomics            | 100 TB ~ 1 PB

Source: S. Byna et al., Lawrence Berkeley National Laboratory, 2017

I/O Bottleneck
- Increasing performance gap between computing and I/O capability
- Multicore/manycore computing power has improved drastically, while data-access capability has improved only slowly

Source: DataCore Software Corp., 2016

Storage System Trend
- Newly emerged SSD-based burst buffers provide a promising solution for bridging the I/O gap
- They increase the depth of the storage hierarchy

Burst Buffer System of Cori@NERSC
Features:
- Dedicated nodes with SSDs, 288 nodes in total
- Each BB node has two Intel P3608 3.2TB NAND flash SSDs: 6.4 TB capacity and 6.5 GB/s read/write bandwidth per node
- Quickly absorbs I/O traffic
- Asynchronous data flush
- In-transit data analysis
- Shared among users and applications

Source: W. Bhimji et al., Lawrence Berkeley National Lab, 2016

Allocation Strategies for Burst Buffer
Bandwidth strategy:
- As many BBs as possible
- Round-robin manner
- Shared by multiple jobs
Interference strategy:
- As few BBs as possible
- Exclusively accessed by each job
(A sketch contrasting the two strategies follows.)
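The following is a minimal sketch of how the two existing strategies differ. The function names, node IDs, request size, and the mapping from requested capacity to node count are illustrative assumptions, not details from the talk.

```python
import math

CAPACITY_PER_BB_TB = 6.4  # per-node capacity on Cori's burst buffer

def bandwidth_strategy(bb_nodes, request_tb):
    """Bandwidth strategy: stripe the job round-robin across as many BB
    nodes as possible, so nodes end up shared by multiple jobs."""
    return list(bb_nodes)

def interference_strategy(bb_nodes, busy_nodes, request_tb):
    """Interference strategy: use as few BB nodes as possible, and only
    nodes not already assigned to another job (exclusive access)."""
    needed = math.ceil(request_tb / CAPACITY_PER_BB_TB)
    free = [n for n in bb_nodes if n not in busy_nodes]
    return free[:needed]

nodes = ["bb1", "bb2", "bb3", "bb4"]
print(bandwidth_strategy(nodes, 10))              # ['bb1', 'bb2', 'bb3', 'bb4']
print(interference_strategy(nodes, {"bb1"}, 10))  # ['bb2', 'bb3']
```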

Limitations of Existing Strategies
Bandwidth strategy:
- Jobs share BB nodes, but I/O contention is ignored (e.g., Job1 and Job3 land on the same BBs)
- Imbalanced utilization (e.g., BB1 and BB2 carry a higher load than BB3 and BB4)
Interference strategy:
- Low bandwidth gain: each job only gets up to one BB node's bandwidth
- Limited by the scale/number of BB nodes instead of the total capacity/bandwidth available

Motivating Example
Experiment: three concurrent jobs writing to 4 BB nodes on Cori, under different allocation strategies.

Jobs | # Procs | Write Data
Job1 | 16      | 160 GB
Job2 | 8       |
Job3 | 8       |

Proposed Approach
Contention-Aware Resource Scheduling Strategy (CARS):
- Allocate as many BBs as possible to meet the user's request
- Assign the most underutilized (lowest-load) BBs to each new job
- Use the number of concurrent I/O processes to quantify the load of each BB node

Jobs | # Procs | Write Data
Job1 | 16      | 160 GB
Job2 | 8       |
Job3 |         |

Example: BB nodes allocated for jobs under the contention-aware strategy (a sketch of the selection logic follows).
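Below is a minimal sketch of the contention-aware selection described above, assuming load is tracked as a per-node count of concurrent I/O processes and that the number of nodes a job receives is derived from its requested capacity. The data structures, names, example numbers, and that capacity-to-node mapping are assumptions.

```python
import math

CAPACITY_PER_BB_TB = 6.4
# load[node] = number of concurrent I/O processes currently hitting that node
load = {"bb1": 16, "bb2": 16, "bb3": 0, "bb4": 0}

def cars_allocate(load, request_tb, num_procs):
    """Contention-aware allocation: pick the least-loaded BB nodes for the
    new job, then charge the job's processes to them."""
    n = min(len(load), max(1, math.ceil(request_tb / CAPACITY_PER_BB_TB)))
    chosen = sorted(load, key=load.get)[:n]   # lowest-load nodes first
    per_node = num_procs // n                 # processes spread evenly across nodes
    for node in chosen:
        load[node] += per_node
    return chosen

# A new 8-process job writing ~10 TB lands on the two idle nodes
print(cars_allocate(load, 10, 8))   # ['bb3', 'bb4']
```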

Design Overview
Three components: a load monitor, a job tracer, and a burst buffer scheduler.
Load monitor:
- Maintains the I/O load of each BB node
- Updates status dynamically
Job tracer:
- Collects the distribution/concurrency of jobs across the entire burst buffer system
Burst buffer scheduler:
- Performs the allocation algorithm
- Allocates BB resources to jobs
- Communicates with the load monitor and job tracer

Design and Implementation

Modeling
Job and allocation model:
- Assumption 1: The maximum number of BB nodes assigned to a job is decided by the requested capacity
- Assumption 2: The I/O processes of a job are evenly distributed across all BB nodes assigned to it
I/O contention model:
- Assumption 1: One I/O process can transfer data to a BB node at a maximum of bw GB/s, and the peak bandwidth of one BB node is BM
- Assumption 2: The actual bandwidth of a BB node is evenly shared across all concurrent I/O processes (N)
- The per-process bandwidth therefore differs between the no-contention case and the case where I/O contention happens (see the reconstruction below)
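The formula itself did not survive the transcript; the following is a plausible reconstruction from the two contention-model assumptions above, together with a worked example whose numbers (bw = 1 GB/s, N = 4 and 16) are assumptions chosen only for illustration.

```latex
% Per-process bandwidth on one BB node shared by N concurrent I/O processes
b(N) =
  \begin{cases}
    bw,      & N \cdot bw \le B_M \quad \text{(no contention)} \\
    B_M / N, & N \cdot bw > B_M   \quad \text{(I/O contention happens)}
  \end{cases}

% Worked example with assumed values bw = 1~\mathrm{GB/s}, B_M = 6.5~\mathrm{GB/s}:
%   N = 4:  4 \cdot 1 \le 6.5,  so each process gets 1 GB/s (no contention)
%   N = 16: 16 \cdot 1 > 6.5,   so each process gets 6.5/16 \approx 0.41 GB/s
```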

Metrics
Average job efficiency:
- TEi: I/O time of the i-th job when it runs exclusively on its BB nodes
- TCi: I/O time of the i-th job when it runs on one or more BB nodes shared with other jobs simultaneously (with possible contention)
Average system utilization:
- Separate the I/O time into multiple phases (ti); during the i-th time step, the aggregate I/O bandwidth of all currently active jobs is BWi
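The metric formulas are not present in the transcript; the following is a plausible reconstruction from the definitions above. Normalizing utilization by the aggregate peak bandwidth of M burst buffer nodes (peak B_M each) is an assumption.

```latex
% Average job efficiency over n jobs: ratio of exclusive-run to shared-run I/O time
\text{Avg. job efficiency} = \frac{1}{n} \sum_{i=1}^{n} \frac{T_{E_i}}{T_{C_i}}

% Average system utilization over time phases t_i with aggregate bandwidth BW_i,
% normalized by the assumed peak bandwidth M \cdot B_M of the burst buffer system
\text{Avg. system utilization} = \frac{\sum_i BW_i \, t_i}{M \cdot B_M \cdot \sum_i t_i}
```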

Emulation Experiments on Cori
Experimental platform:
- 8 burst buffer nodes on Cori
- 2x Intel Xeon E5 processors, 16 cores @ 2.3 GHz
- 2x Intel P3608 3.2TB SSDs, 128 GB memory
- 6.5 GB/s peak read/write bandwidth
Workload setting:
- IOR benchmark
- Tests with 10 jobs (each job configuration was selected twice)
- Jobs assigned in random order

Jobs | # of Procs | Write Data
Job1 | 128        | 4 TB
Job2 | 64         |
Job3 |            | 2 TB
Job4 |            |
Job5 |            | 1 TB

Preliminary Results
(Charts: average job efficiency and average system utilization)
- The contention-aware strategy achieved the highest job efficiency and system utilization
- The differences between analytical and experimental results are fairly small (2%–8%)

Simulation Experiments
Workload setting:
- ~100 randomly generated jobs
- Jobs assigned in different orders: FCFS (first come, first served), SJF (shortest job first), PS (priority scheduling); a sketch of the three orders follows
Burst buffer system setting:
- Close to the configuration of Cori's burst buffer system

Job Config | # Procs | Write Data
Job1       | 8192    | 80 TB
Job2       | 4096    | 40 TB
Job3       | 2048    |
Job4       |         | 20 TB
Job5       | 1024    |
Job6       |         | 10 TB
Job7       | 512     | 8 TB
Job8       | 256     | 4 TB

BB Config   | Value
# of BBs    | 256
BW/BB       | 6.5 GB/s
Capacity/BB | 6.4 TB
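A minimal sketch of the three submission orders, assuming the per-job write volume stands in for job length under SJF and that jobs carry an integer priority field for PS; both are illustrative assumptions, not details from the talk.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    procs: int
    write_tb: float
    priority: int = 0  # assumed field, used only by the PS order

def order_jobs(jobs, policy):
    """Return jobs in the order they are handed to the burst buffer scheduler."""
    if policy == "FCFS":  # first come, first served: keep submission order
        return list(jobs)
    if policy == "SJF":   # shortest job first: smallest write volume first
        return sorted(jobs, key=lambda j: j.write_tb)
    if policy == "PS":    # priority scheduling: highest priority first
        return sorted(jobs, key=lambda j: j.priority, reverse=True)
    raise ValueError(f"unknown policy: {policy}")

jobs = [Job("Job1", 8192, 80), Job("Job7", 512, 8), Job("Job8", 256, 4, priority=2)]
print([j.name for j in order_jobs(jobs, "SJF")])  # ['Job8', 'Job7', 'Job1']
```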

Results
(Charts: average job efficiency and average system utilization)
- The contention-aware strategy achieved more than a 20% improvement over the bandwidth strategy
- The largest utilization is observed with SJF, where the contention-aware strategy achieved nearly 90% system utilization

Conclusions
- Burst buffers are an attractive solution for mitigating the I/O gap for increasingly critical data-intensive sciences
- Existing resource allocation strategies for burst buffers are rather rudimentary and have limitations
- We propose a contention-aware resource scheduling strategy to coordinate concurrent I/O-intensive jobs
- Preliminary results have shown improvements in both job performance and system utilization

Ongoing and Future Work
- Further optimization of CARS, taking other I/O patterns into consideration
- Explore efficient and powerful data-movement strategies among the different layers of the deep storage hierarchy

Thank You and Q&A
Weihao Liang  lwh1990@mail.ustc.edu.cn
Yong Chen  yong.chen@ttu.edu
Jialin Liu  jalnliu@lbl.gov
Hong An  han@ustc.edu.cn
Data Intensive Scalable Computing Lab  http://discl.cs.ttu.edu