- Extreme scale
- Lack of decomposition for insight
- Many services have centralized designs
- The impacts of service architectures remain an open question
- Design modular components for composable services
- Explore the design space for HPC services
- Evaluate the impacts of different design choices
Contributions:
- A taxonomy for classifying HPC system services
- A simulation tool to explore design choices of distributed key-value stores (KVS) for large-scale system services
- An evaluation of KVS design choices for extreme-scale systems using both synthetic and real workload traces
Outline:
- Introduction & Motivation
- Key-Value Store Taxonomy
- Key-Value Store Simulation
- Evaluation
- Conclusions & Future Work
Introduction & Motivation
HPC system services include:
- Job launch and resource management systems
- System monitoring
- I/O forwarding and file systems
- Function call shipping
- Key-value stores
Key requirements at extreme scale:
- Scalability
- Dynamicity
- Fault tolerance
- Consistency
- Services manage a large volume of data and state information
- Distributed NoSQL data stores are used as building blocks
- Examples:
  - Resource management (job and node status info)
  - Monitoring (system activity logs)
  - File systems (metadata)
- Systems: SLURM++, MATRIX [1], FusionFS [2]

[1] K. Wang, I. Raicu. "Paving the Road to Exascale through Many-Task Computing", Doctoral Showcase, IEEE/ACM Supercomputing 2012 (SC12).
[2] D. Zhao, I. Raicu. "Distributed File Systems for Exascale Computing", Doctoral Showcase, IEEE/ACM Supercomputing 2012 (SC12).
Key-Value Store Taxonomy
The taxonomy serves four purposes: decomposition, categorization, suggestion, and implication.
The taxonomy decomposes a service into five models:
- Service model: the functionality the service provides
- Data model: the distribution and management of data
- Network model: dictates how the components are connected
- Recovery model: how to deal with component failures
- Consistency model: how rapidly data modifications propagate
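To make the taxonomy concrete, here is a minimal sketch (my illustration, not code from the paper or its simulator) of how the five models could be captured as a declarative configuration; the `ServiceSpec` name and the enum values are assumptions drawn from the examples on the next slides.

```python
# Hypothetical encoding of the five taxonomy models as a simulator-style
# configuration object. Names and values are illustrative, not the paper's.
from dataclasses import dataclass
from enum import Enum

class DataModel(Enum):
    CENTRALIZED = "centralized"
    DISTRIBUTED_PARTITIONED = "distributed with partitioning"

class NetworkModel(Enum):
    AGGREGATION_TREE = "aggregation tree"
    FULLY_CONNECTED = "fully connected"
    PARTIALLY_CONNECTED = "partially connected"

class RecoveryModel(Enum):
    FAIL_OVER = "fail-over"
    CONSECUTIVE_REPLICAS = "consecutive replicas (n-way replication)"

class ConsistencyModel(Enum):
    STRONG = "strong"
    EVENTUAL = "eventual"

@dataclass
class ServiceSpec:
    service: str                   # service model: the functionality provided
    data: DataModel                # distribution and management of data
    network: NetworkModel          # how the components are connected
    recovery: RecoveryModel        # how component failures are handled
    consistency: ConsistencyModel  # how rapidly modifications propagate
```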
Example: a centralized service
- Data model: centralized
- Network model: aggregation tree
- Recovery model: fail-over
- Consistency model: strong
Example: distributed key-value stores
- Data model: distributed with partitioning
- Network model: fully connected or partially connected (partial knowledge)
- Recovery model: consecutive replicas
- Consistency model: strong or eventual

              Voldemort           Pastry                ZHT
Data          distributed         distributed           distributed
Network       fully connected     partially connected   fully connected
Recovery      n-way replication   n-way replication     n-way replication
Consistency   eventual            strong                eventual
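Using the hypothetical `ServiceSpec` sketch from above, the table's classifications read directly as code, for example:

```python
# Two rows of the table expressed with the illustrative ServiceSpec above.
zht = ServiceSpec("key-value store",
                  DataModel.DISTRIBUTED_PARTITIONED,
                  NetworkModel.FULLY_CONNECTED,
                  RecoveryModel.CONSECUTIVE_REPLICAS,
                  ConsistencyModel.EVENTUAL)

pastry = ServiceSpec("key-value store",
                     DataModel.DISTRIBUTED_PARTITIONED,
                     NetworkModel.PARTIALLY_CONNECTED,
                     RecoveryModel.CONSECUTIVE_REPLICAS,
                     ConsistencyModel.STRONG)
```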
Key-Value Store Simulation
- Discrete event simulation built on PeerSim
  - Other simulators evaluated: OMNeT++, OverSim, SimPy
- Configurable number of servers and clients
- Supports different architectures
- Two parallel queues in each server (see the sketch below):
  - Communication queue (sends/receives requests)
  - Processing queue (processes requests locally)
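The two-queue server model can be sketched in a few lines of discrete event simulation. The snippet below uses SimPy (one of the alternatives the authors evaluated, not the PeerSim code actually used), and the delay constants are placeholders, not measured costs.

```python
# Minimal two-queue server: a communication queue for message send/receive
# and a processing queue for resolving requests locally, running in parallel.
import simpy

COMM_DELAY = 2   # placeholder send/receive cost
PROC_DELAY = 5   # placeholder local-processing cost

def comm_loop(env, comm_q, proc_q):
    """Receive requests and hand them to the processing queue."""
    while True:
        req = yield comm_q.get()
        yield env.timeout(COMM_DELAY)      # pay the communication cost
        yield proc_q.put(req)

def proc_loop(env, proc_q, done):
    """Resolve requests from the processing queue."""
    while True:
        req = yield proc_q.get()
        yield env.timeout(PROC_DELAY)      # pay the processing cost
        done.append((req, env.now))

env = simpy.Environment()
comm_q, proc_q, done = simpy.Store(env), simpy.Store(env), []
env.process(comm_loop(env, comm_q, proc_q))
env.process(proc_loop(env, proc_q, done))
for i in range(3):                         # three client requests at t=0
    comm_q.put(f"req-{i}")
env.run(until=50)
print(done)   # [('req-0', 7), ('req-1', 12), ('req-2', 17)]
```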
The time to resolve a query locally (t_LR) and the time to resolve a remote query (t_RR) are given by:

t_LR = CS + SR + LP + SS + CR

For a fully connected topology:
t_RR = t_LR + 2 × (SS + SR)

For a partially connected topology:
t_RR = t_LR + 2k × (SS + SR)

where k is the number of hops to find the predecessor.
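Plugging placeholder numbers into these formulas shows how the hop count k penalizes partial connectivity. The cost values, and the reading of CS, SR, LP, SS, CR as client-send, server-receive, local-processing, server-send, and client-receive times, are my assumptions for illustration.

```python
# Worked example of the latency formulas with made-up cost values.
CS, SR, LP, SS, CR = 1, 2, 5, 2, 1       # assumed per-step costs

t_LR = CS + SR + LP + SS + CR            # local query: 11
t_RR_full = t_LR + 2 * (SS + SR)         # fully connected: 11 + 8 = 19
k = 3                                    # hops to find the predecessor
t_RR_part = t_LR + 2 * k * (SS + SR)     # partially connected: 11 + 24 = 35
print(t_LR, t_RR_full, t_RR_part)        # 11 19 35
```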
Recovery model:
- Defines what to do when a node fails
- Defines how a node's state recovers when it rejoins after a failure

[Figure: a ring of six servers s0..s5, each holding replicas r_{i,1} and r_{i,2} of its two predecessors' data. When s0 fails, the event manager (EM) is notified and s0's data is re-replicated as its first and second replicas go down; when s0 comes back, it notifies the EM, recovers the s0, s4, and s5 data it should hold, and the stale s0 and s5 replicas are removed.]
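A minimal sketch of the consecutive-replica placement behind this figure (my illustration under assumed parameters, not the simulator's code): server i's partition is replicated on the next live servers around the ring, so a failure simply slides the replica set past the dead node.

```python
# Consecutive replicas on a ring of N_SERVERS nodes with N_REPLICAS copies.
N_SERVERS = 6
N_REPLICAS = 2

def replica_set(primary, alive):
    """Servers holding replicas of `primary`'s partition (live nodes only)."""
    replicas, candidate = [], primary
    while len(replicas) < N_REPLICAS:
        candidate = (candidate + 1) % N_SERVERS
        if candidate == primary:      # wrapped around: not enough live nodes
            break
        if candidate in alive:
            replicas.append(candidate)
    return replicas

alive = set(range(N_SERVERS))
print(replica_set(0, alive))          # [1, 2]
alive.remove(1)                       # s0's first replica fails
print(replica_set(0, alive))          # [2, 3]: s0's data re-replicated to s3
```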
Strong consistency:
- Every replica observes every update in the same order
- Clients send requests to a dedicated server (the primary replica)

Eventual consistency:
- Requests are sent to a randomly chosen replica (the coordinator)
- Governed by three key parameters N, R, and W, satisfying R + W > N (see the sketch below)
- Uses Dynamo-style version clocks [G. DeCandia, 2007] to track different versions of data and detect conflicts
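The quorum rule fits in a few lines. This is a hedged sketch of the N/R/W idea (after Dynamo), not the simulator's implementation; the replica-selection policy is deliberately simplified so the quorum overlap is visible.

```python
# N/R/W quorums: with R + W > N, every read quorum intersects every write
# quorum, so a read always sees the latest acknowledged write.
N, R, W = 3, 2, 2
assert R + W > N, "read and write quorums must overlap"

replicas = [{} for _ in range(N)]     # each replica: key -> (version, value)

def put(key, value, version):
    for rep in replicas[-W:]:         # simplified: ack from the last W replicas
        rep[key] = (version, value)

def get(key):
    # Simplified: consult the first R replicas, keep the highest version.
    reads = [rep[key] for rep in replicas[:R] if key in rep]
    return max(reads) if reads else None

put("job-42", "running", version=1)   # written to replicas 1 and 2
print(get("job-42"))                  # (1, 'running'): quorums overlap at replica 1
```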
Evaluation
- Evaluate the overheads of different architectures (focusing on distributed ones) and of the different models
- Light-weight simulations: the largest experiments used 25 GB of RAM and 40 minutes of walltime
- Workloads:
  - A synthetic workload over a 64-bit key space
  - Real workload traces from three representative system services: job launch, system monitoring, and I/O forwarding
Validation against ZHT [1] and Voldemort:
- ZHT: BG/P, up to 8K nodes (32K cores)
- Voldemort: PROBE Kodiak cluster, up to 800 nodes

[1] T. Li, X. Zhou, K. Brandstatter, D. Zhao, K. Wang, A. Rajendran, Z. Zhang, I. Raicu. "ZHT: A Light-weight Reliable Persistent Dynamic Scalable Zero-hop Distributed Hash Table", IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2013.
- Partial connectivity yields higher latency due to the additional routing
- The fully connected topology responds faster (about twice as fast at extreme scale)
- Adding replicas always incurs overhead
- Replication has a larger impact on fully connected topologies than on partially connected ones
- Higher failure frequency introduces more overhead, but the dominating factor is still the client request-processing messages
- Eventual consistency has more overhead than strong consistency
Results for both fully connected and partially connected topologies:
- For job launch and I/O forwarding, eventual consistency performs worse: both the request types and the keys are almost uniformly randomly distributed (URD)
- For monitoring, eventual consistency works better: all requests are "put"
- ZHT (distributed key-value storage): the DKVS implementation
- MATRIX (runtime system): DKVS keeps task metadata
- SLURM++ (job management system): DKVS stores task and resource information
- FusionFS (distributed file system): DKVS maintains file and directory metadata
Conclusions & Future Work
Contributions:
- A taxonomy for classifying HPC system services
- A simulation tool to explore KVS design choices for large-scale system services
- An evaluation of KVS design choices for extreme-scale systems using both synthetic and real workload traces
- Key-value stores are a building block for extreme-scale system services
- A service taxonomy is important
- A simulation framework enables the study of services at scale
- Distributed architectures are in demand
- Replication adds overhead
- A fully connected topology is a good choice, as long as the request-processing messages dominate
- Consistency involves tradeoffs: eventual consistency favors write intensity and availability, strong consistency favors read intensity and performance, with weak consistency in between
- Extend the simulator to cover more of the taxonomy
- Explore other recovery models (log-based recovery, information dispersal algorithms)
- Explore other consistency models
- Explore using a DKVS in the development of:
  - A general building-block library
  - A distributed monitoring system service
  - A distributed message queue system
Acknowledgments:
- DOE contract DE-FC02-06ER25750
- Part of NSF award CNS (PRObE)
- Collaboration with the FusionFS project under an NSF grant
- BG/P resources from ANL
- Thanks to Tonglin Li, Dongfang Zhao, and Hakan Akkan
More information: –
Contact: –

Questions?
Related work:
- Service simulation: peer-to-peer network simulation, telephony simulations, simulation of consistency
  - Problem: these do not focus on HPC, or do not combine the distributed features
- Taxonomies: investigations of distributed hash tables with an algorithm taxonomy; grid computing workflow taxonomies
  - Problem: none of them drive the features in a simulation