DOSAS: Mitigating the Resource Contention in Active Storage Systems
Chao Chen (1), Yong Chen (1) and Philip C. Roth (2)
(1) Texas Tech University
(2) Oak Ridge National Laboratory
Cluster-12
Outline
  Background
  Active Storage
  Motivation
  DOSAS (Dynamic Operation Scheduling Active Storage)
  Evaluation
  Conclusion and future work
Background
Applications from areas such as climate science and astrophysics are becoming increasingly data intensive: they read and write large amounts of data.
  FLASH: Buoyancy-Driven Turbulent Nuclear Burning (75TB-300TB)
  Climate science (10TB-355TB)
  GTC: 56TB per 100-hour run, generating 260GB every 120 seconds
  S3D: 90TB per 120-hour run
Background
Processing model in the current architecture:
  Data must be transferred from the storage nodes to the compute nodes over the network
  This transfer is very time consuming
  I/O operations can dominate system performance
[Figure: Compute Node (Application, Analysis kernel) and Storage Node (Disk); labels: I/O request, Data]
Active Storage
Active Storage was proposed to mitigate this issue and has attracted considerable attention.
It moves appropriate computations close to where the data is stored, as opposed to moving the data to the compute devices.
Network bandwidth cost is reduced.
[Figure: Compute Node (Application) and Storage Node (Analysis kernel, Disk); labels: I/O request, Data, Result]
Active Storage
Two examples of Active Storage:
Felix et al. proposed the first prototype, based on Lustre [1,2,3]
  Implemented in kernel space first
  Improved in user space later
[Figure: components NAL, OST, ASOBD, OBDfilter, ext3, ASDEV, and a Processing Component in user space]

1. Evan J. Felix, Kevin Fox, Kevin Regimbal, and Jarek Nieplocha. Active Storage Processing in a Parallel File System. In 6th LCI International Conference on Linux Clusters: The HPC Revolution, Chapel Hill, North Carolina.
2. Juan Piernas, Jarek Nieplocha, and Evan J. Felix. Evaluation of Active Storage Strategies for the Lustre Parallel File System. In Proceedings of the Supercomputing'07 Conference, November.
3. Juan Piernas and Jarek Nieplocha. Efficient Management of Complex Striped Files in Active Storage. In Proc. Europar'
Active Storage
Son et al. proposed another prototype, based on PVFS [4]
  It is a more sophisticated prototype built on MPI
  Users can register their own processing kernels
[Figure: Clients 1..n (Application, Parallel File System API, Active Storage API, Parallel File System Client) connected through an interconnection network to Servers 1..n (Parallel File System API, Active Storage API, Kernels, Disk, GPU)]

4. Seung Woo Son, Samuel Lang, Philip Carns, Robert Ross, and Rajeev Thakur. Enabling Active Storage on Parallel I/O Software Stacks. In 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST).
Performance Improvement of Active Storage
[Figure: performance of the SUM operation for the Traditional Storage (TS) and Active Storage (AS) schemes [4]]
Active Storage: 50.9% improvement [4]
Contention: A Problem for Active Storage
A high performance computing system may run dozens or even hundreds of applications simultaneously.
The system needs to give good performance to each of the running applications.
[Figure: applications APP1, APP2, ..., APP m, each with processes p1, p2, ..., pn, send I/O requests to Server 1, ..., Server k, ..., Server m; the I/O queue of each server mixes AI (Active I/O) and NI (Normal I/O) requests; m < n]
Contention: A Problem for Active Storage
Offloading computation to storage nodes can improve performance, but offloading too much computation causes resource contention and degrades overall performance.
[Figure: performance degradation]
DOSAS is proposed to balance the performance gain of active storage against its overhead.
  It coordinates compute nodes and storage nodes to complete Active I/O requests automatically and achieve the best system performance
  It extends the MPI-IO library and is easy for application programmers to use
DOSAS Architecture
1. Active Storage Client
     Active API
     Processing Kernels
2. Active Storage Server
     Contention Estimator
     Active I/O runtime
     Processing Kernels
Processing Kernels: a collection of predefined analysis operations that are widely used in data-intensive applications (such as k-means and Gaussian filter)
[Figure: on the client side, Applications issue Normal I/O and Active I/O through the Extended Parallel File System API to the Active Storage Client (Active API, Processing Kernels) and the Parallel File System Client; on the server side, the Parallel File System API, the Active Storage Server (Contention Estimator, Active I/O runtime, Processing Kernels), and Disks]
Active Storage Client
  Runs on each compute node
  Serves as an interface through the enhanced MPI-IO interface (Active API)
  Assists the storage nodes in completing the I/O without the intervention of applications (Processing Kernels)
Active API
  An operation parameter is added to the MPI-IO functions to invoke the related analysis kernel
  A structure is used for returning the result or the operation
Active Storage Server
  Runs on each storage node
  Schedules the Active I/O requests between compute node and storage node (Contention Estimator)
  Collects the result (Active I/O runtime)
  Serves the Active I/O requests (Processing Kernels)
Contention Estimator
The task of the Contention Estimator is to schedule the I/O requests between the compute node and the storage node.
The scheduling algorithm decides whether or not a request can run using Active Storage.
[Figure: I/O queue holding a mix of Active I/O and Normal I/O requests]
Which Active I/O requests should be served, and which should be rejected?
Contention Estimator
Notations:

Table 1. Notations
Symbol     Meaning
n          The number of I/O requests in the I/O queue
k          The number of active I/O requests in the I/O queue
d_i        The request data size of the i-th I/O request
D_A        The total data size requested by active I/O requests. Thus D_A = \sum_i d_i (if the i-th I/O is active I/O)
D_N        The total data size requested by normal I/O requests. Thus D_N = \sum_i d_i (if the i-th I/O is normal I/O)
D          The total request data size in the I/O queue. D = D_A + D_N
S_{C,op}   The computation capability of each storage node given operation op
C_{C,op}   The computation capability of each compute node given operation op
f(x)       The time needed to compute on data of size x
g(x)       The time needed to transfer data of size x from storage node to compute node
h(x)       The data size of the result computed by active I/O on data of size x
bw         The bandwidth of the compute-storage network
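The aggregate quantities in Table 1 can be derived directly from the contents of an I/O queue. The following is a minimal sketch; the names `IORequest` and `queue_totals` are hypothetical and are not part of DOSAS, and sizes are taken to be in MB purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IORequest:
    """One queued request (hypothetical type, for illustration only)."""
    size_mb: float  # d_i: requested data size
    active: bool    # True for active I/O, False for normal I/O


def queue_totals(queue):
    """Compute n, k, D_A, D_N and D for a list of IORequest objects."""
    d_a = sum(r.size_mb for r in queue if r.active)
    d_n = sum(r.size_mb for r in queue if not r.active)
    return {
        "n": len(queue),                        # all requests
        "k": sum(1 for r in queue if r.active),  # active requests only
        "D_A": d_a,
        "D_N": d_n,
        "D": d_a + d_n,                         # D = D_A + D_N
    }


# Example queue: two active requests and one normal request.
queue = [IORequest(128, True), IORequest(256, False), IORequest(128, True)]
totals = queue_totals(queue)
```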
Contention Estimator
Based on the above notations, the execution time of a given schedule can be estimated.
All active I/Os are served: [equation], the sum of three terms:
  Time for serving active I/O
  Time of transferring data of normal I/O
  Time for transferring result of active I/O
All active I/Os are rejected: [equation]
Here: [equation] (in storage node) or [equation] (in compute node)
EX: [equation]
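A minimal sketch of how the two extreme schedules could be estimated with the notation of Table 1. It assumes that f and g are linear in the data size, which is an assumption for illustration and not necessarily the exact model used by DOSAS. The subscripted names f_S, f_C and the two T labels are introduced here only for readability.

```latex
\begin{align*}
f_S(x) &= \frac{x}{S_{C,op}} && \text{(in storage node)} \\
f_C(x) &= \frac{x}{C_{C,op}} && \text{(in compute node)} \\
g(x)   &= \frac{x}{bw} \\
T_{\text{all served}}   &= f_S(D_A) + g(D_N) + g\bigl(h(D_A)\bigr) \\
T_{\text{all rejected}} &= g(D) + f_C(D_A)
\end{align*}
```

For a reduction such as SUM, h(x) is only a few bytes regardless of x, so the result-transfer term of the all-served estimate nearly vanishes.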
Contention Estimator
The scheduling problem is modeled as a binary optimization problem:
  For an active I/O, the storage node has two choices: accept or reject
  For a normal I/O, the storage node processes it as normal
Goal: minimize the total time over all accept/reject combinations: [equation]
Where: a binary variable for the i-th active I/O indicates that it is accepted; otherwise it is rejected
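The accept/reject formulation can be illustrated with an exhaustive search over all binary assignments. This is a sketch under stated assumptions, not the DOSAS scheduler: the additive cost model, the function names, the compute-node rates, and the result-size functions are all hypothetical. Only the 118 MB/s bandwidth and the 860 MB/s and 80 MB/s processing rates come from the evaluation setup.

```python
from itertools import product


def schedule_time(sizes, active, accept, storage_rate, compute_rate, bw, result_size):
    """Estimated total time of one schedule under an additive cost model (assumed)."""
    total = 0.0
    for d, is_active, x in zip(sizes, active, accept):
        if is_active and x:
            # Accepted active I/O: compute on the storage node, ship only the result.
            total += d / storage_rate + result_size(d) / bw
        elif is_active:
            # Rejected active I/O: ship the raw data, compute on the compute node.
            total += d / bw + d / compute_rate
        else:
            # Normal I/O: ship the raw data.
            total += d / bw
    return total


def best_schedule(sizes, active, storage_rate, compute_rate, bw, result_size):
    """Try every accept/reject assignment of the active requests; keep the fastest."""
    active_idx = [i for i, a in enumerate(active) if a]
    best_time, best_accept = None, None
    for bits in product((0, 1), repeat=len(active_idx)):
        accept = [0] * len(sizes)  # normal requests always stay at 0
        for i, b in zip(active_idx, bits):
            accept[i] = b
        t = schedule_time(sizes, active, accept, storage_rate, compute_rate, bw, result_size)
        if best_time is None or t < best_time:
            best_time, best_accept = t, accept
    return best_time, best_accept


sizes = [128.0, 128.0, 256.0]   # MB
active = [True, False, True]

# SUM-like kernel: tiny result, same (assumed) rate on both node types.
sum_time, sum_accept = best_schedule(sizes, active, 860.0, 860.0, 118.0, lambda d: 8e-6)

# Gaussian-filter-like kernel: result as large as the input, and a
# hypothetical compute node that is twice as fast as the storage node.
gauss_time, gauss_accept = best_schedule(sizes, active, 80.0, 160.0, 118.0, lambda d: d)
```

With this purely additive model each decision is independent of the others, so the search is more than the model strictly needs. Enumeration becomes useful once the cost includes terms that couple requests, such as contention for the storage node's CPU, and it is only practical for short queues because the number of assignments grows as 2^k.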
Active I/O runtime
  Executes the scheduling policy of the Contention Estimator
  Interacts with the Active Storage Client (ASC) and the Processing Kernels (PKs) to return the result by filling the buf argument of struct result
Evaluation
Experiment Platform and Evaluated Operations:

Platform                     Discfarm Cluster at Texas Tech
# of I/O requests per node   1, 2, 4, 8, 16, 32, 64
Network bandwidth            118 MB/s
Data size of each I/O        128MB, 256MB, 512MB and 1GB
Total data size              8GB, 16GB, 32GB and 64GB
Evaluated schemes            TS: traditional storage; AS: current active storage; DOSAS: proposed approach

Operation            Computation complexity                                                   Processing rate
SUM                  1 addition operation per data item                                       860 MB/s
2D Gaussian Filter   9 multiplication, 9 addition and 1 divide operations per data item       80 MB/s
Impact of resource contention
Network bandwidth: 118 MB/s
[Figure: execution time of SUM under the AS and TS schemes with an increasing number of I/O requests, each I/O request for 128MB of data; processing rate 860 MB/s]
[Figure: execution time of 2D Gaussian Filter under the AS and TS schemes with an increasing number of I/O requests, each I/O request for 128MB of data; processing rate 80 MB/s]
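The effect of the two processing rates can be illustrated with a back-of-the-envelope model. This is a toy sketch, not a reproduction of the measured results. It assumes that a storage node processes its queued requests one after another, that the network link is shared serially, and that under TS each request is processed by its own compute node in parallel at the same rate as the storage node. All function names are hypothetical.

```python
REQUEST_MB = 128.0                       # data size of each I/O request
BW_MB_S = 118.0                          # network bandwidth
REQUEST_COUNTS = [1, 2, 4, 8, 16, 32, 64]


def as_time(k, rate):
    """Active storage: the storage node processes all k requests itself."""
    return k * REQUEST_MB / rate


def ts_time(k, rate):
    """Traditional storage: transfer k requests, then compute in parallel on compute nodes."""
    return k * REQUEST_MB / BW_MB_S + REQUEST_MB / rate


def first_count_where_ts_wins(rate):
    """Smallest evaluated request count at which TS beats AS, or None if it never does."""
    for k in REQUEST_COUNTS:
        if ts_time(k, rate) < as_time(k, rate):
            return k
    return None


sum_crossover = first_count_where_ts_wins(860.0)   # SUM
gauss_crossover = first_count_where_ts_wins(80.0)  # 2D Gaussian Filter
```

Under these assumptions SUM never reaches a crossover, because its 860 MB/s processing rate is well above the 118 MB/s network bandwidth. The Gaussian filter, at 80 MB/s, is slower than the network, and the toy model puts its crossover at 4 requests per node.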
DOSAS Performance
Performance Improvement
[Figure: performance comparison of TS, AS and DOSAS, each I/O request for 256MB of data]
Scheduling Algorithm
Evaluate the correctness of the scheduling algorithm

Table 2. Partial Scheduling Algorithm Evaluation Result
Case #   Algorithm Decision   Practice   Judgment
1        Active                          TRUE
2        Active                          TRUE
3        Active               Normal     FALSE
4        Normal                          TRUE
5        Normal                          TRUE
6        Normal                          TRUE
7        Normal                          TRUE
8        Active                          TRUE
9        Active                          TRUE
10       Active               Normal     FALSE
11       Normal                          TRUE
12       Normal                          TRUE
13       Normal                          TRUE
14       Normal                          TRUE
15       Active                          TRUE
16       Active                          TRUE
17       Active               Normal     FALSE
18       Normal                          TRUE
19       Normal                          TRUE

Correctness: 95%
Bandwidth
[Figure: bandwidth comparison of TS, AS and DOSAS]
Conclusion and Future Work
This study:
  Demonstrated that resource contention has a great impact on the performance of active storage
  Introduced DOSAS to mitigate this issue
  Carried out experimental tests, whose results show that DOSAS outperforms existing active storage architectures
  Evaluated the impact of the computation complexity of operators
Near-future exascale systems are likely to exhibit even more serious resource contention issues.
Further research is required to address these challenges.
Thank You
For more information
The paper has been authored by a contractor of the U.S. Government under Contract No. DE-AC05-00OR
Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes