Seven O’Clock: A New Distributed GVT Algorithm using Network Atomic Operations
David Bauer, Garrett Yaun, Christopher Carothers (Computer Science); Murat Yuksel, Shivkumar Kalyanaraman (ECSE)

Global Virtual Time
Defines a lower bound on the timestamp of any unprocessed event in the system. Defines the point beyond which events should not be reclaimed: memory used by events with timestamps below GVT can safely be freed (fossil collection). It is imperative that the GVT computation operate as efficiently as possible.
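Stated as a formula (our notation; the slides define GVT only in prose), at wall-clock instant T:

    GVT(T) = \min_i \; \min_{e \in U_i(T) \cup M_i(T)} ts(e)

where U_i(T) is the set of unprocessed events at processor i and M_i(T) is the set of messages sent by processor i that are still in transit at time T.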

Key Problems
Simultaneous Reporting Problem: arises “because not all processors will report their local minimum at precisely the same instant in wall-clock time”.
Transient Message Problem: a message is delayed in the network and neither the sender nor the receiver considers that message in their respective GVT calculations.
Asynchronous solution: create a synchronization, or “cut”, across the distributed simulation that divides events into two categories: past and future.
Consistent cut: a cut where no message is scheduled in the future of the sending processor but received in the past of the destination processor.

Mattern’s GVT Algorithm
Constructs the cut via message passing. Cost: O(log N) with a tree topology, O(N) with a ring. Drawback: with a large number of processors, the free event pool may be exhausted while waiting for the GVT computation to complete.

Fujimoto’s GVT Algorithm
Constructs the cut using a shared-memory flag. Cost: O(1). Drawback: limited to shared-memory architectures, where a sequentially consistent memory model ensures proper causal order.

Memory Model
Sequentially consistent does not mean instantaneous: memory events are only guaranteed to be causally ordered. Is there a method to achieve sequentially consistent shared memory in a loosely coordinated, distributed environment?

GVT Algorithm Differences

                                 Fujimoto            7 O’Clock        Mattern            Samadi
  Cost of cut calculation        O(1)                O(1)             O(N) or O(log N)   O(N) or O(log N)*
  Parallel / distributed         P                   P+D              P+D                P+D
  Global invariant               shared-memory flag  real-time clock  message passing    message passing
  Independent of event memory    N                   Y                N                  N

* cost of this algorithm is much higher

Network Atomic Operations
Goal: each processor observes the “start” of the GVT computation at the same instant of wall-clock time.
Definition: an NAO is an agreed-upon frequency in wall-clock time at which some event is logically observed to have happened across a distributed system.

[Figure: a wall-clock timeline in which each NAO expiration triggers an operation such as “compute GVT” or “update tables”; the NAO provides the possible operations of a complete sequentially consistent memory model.]
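Because the NAO period is agreed upon in advance, each processor can compute the next expiration locally, with no cut messages. A minimal sketch in C (our illustration; wallclock_now() is an assumed synchronized-clock routine, not a name from the slides):

    #include <math.h>

    double wallclock_now(void);  /* assumed: synchronized wall-clock time, in seconds */

    /* Next NAO expiration: the first multiple of nao_period strictly
     * after the current wall-clock time. Every processor computes the
     * same instant because the period is agreed upon in advance. */
    double next_nao(double nao_period) {
        double now = wallclock_now();
        return (floor(now / nao_period) + 1.0) * nao_period;
    }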

Clock Synchronization
Assumption: all processors share a highly accurate, common view of wall-clock time.
Basic building block: the CPU timestamp counter.
– Measures time in terms of clock cycles, so a gigahertz CPU clock has a granularity of 10^-9 seconds.
– Sending events across the network has a much larger granularity, depending on the technology: ~10^-6 seconds on 1000base/T.
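As a concrete sketch (GCC/Clang on x86-64; our illustration, not code from the slides), the counter can be read and converted to seconds given a calibrated CPU frequency:

    #include <stdint.h>

    /* Read the x86 timestamp counter (cycle count since reset). */
    static inline uint64_t read_tsc(void) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Convert a cycle count to seconds; at cpu_hz = 1e9 (1 GHz),
     * one cycle is 1e-9 s, the granularity quoted above. */
    double tsc_seconds(uint64_t cycles, double cpu_hz) {
        return (double)cycles / cpu_hz;
    }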

Clock Synchronization
Issues: clock synchronization, drift, and jitter.
Ostrovsky and Patt-Shamir give provably optimal clock synchronization even when clocks have drift and message latency may be unbounded.
Clock synchronization is a well-researched problem in distributed computing; we used a simplified approach, which was also helpful in determining whether the system was working properly.
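The slides do not spell out the simplified approach; one common round-trip scheme (in the spirit of Cristian’s algorithm, shown here only as an assumed illustration) estimates the offset between two nodes as:

    /* Estimate the offset of a remote clock from the local one.
     * t0: local time the ping was sent, t1: local time the reply arrived,
     * t_remote: remote clock reading carried in the reply.
     * Assumes symmetric network delay, so the remote reading corresponds
     * to the midpoint of the round trip. */
    double clock_offset(double t0, double t1, double t_remote) {
        return t_remote - (t0 + t1) / 2.0;
    }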

Max Send Δt
Definition: max_send_delta_t is the maximum of:
– the worst-case bound on the time to send an event through the network
– twice the synchronization error
– twice the maximum clock drift over the simulation time
This adds a small amount of time to the NAO expiration, similar in spirit to sequentially consistent memory. It overcomes the transient message problem, clock drift/jitter, and clock synchronization error.

Max Send Δt: Clock Drift
Clock drift causes CPU clocks to become unsynchronized.
– Long-running simulations may require multiple re-synchronizations, or we account for the drift in the NAO.
Max Send Δt overcomes clock drift by ensuring no event “falls between the cracks”.

Max Send Δt
What if the clocks are not well synchronized?
– Let ΔD_max be the maximum clock drift.
– Let ΔS_max be the maximum synchronization error.
Solution: redefine Δt_max as

    Δt'_max = max(Δt_max, 2·ΔD_max, 2·ΔS_max)

In practice both ΔD_max and ΔS_max are very small in comparison to Δt_max.
[Figure: wall-clock timelines for LP1 and LP2 showing the GVT cut point extended by Δt_max, with the ΔD_max drift between the two clocks absorbed by the extension.]
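As a one-function sketch in C (our illustration of the formula above):

    /* Adjusted NAO slack: worst-case network send time, guarded against
     * clock drift and synchronization error (all values in seconds). */
    double adjusted_dt_max(double dt_max, double dd_max, double ds_max) {
        double m = dt_max;
        if (2.0 * dd_max > m) m = 2.0 * dd_max;
        if (2.0 * ds_max > m) m = 2.0 * ds_max;
        return m;
    }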

Transient Message Problem
Max Send Δt is a worst-case bound on the time to send an event through the network, which guarantees that every event is accounted for by either the sender or the receiver.

Simultaneous Reporting Problem
The problem arises when processors do not start the GVT computation simultaneously. Seven O’Clock starts simultaneously across all CPUs, so the problem cannot occur.

[Figure: worked example of a GVT computation around an NAO. Events with timestamps 5, 7, 9, and 10 flow among processors A–E; one LP reports LVT = 7, another reports LVT = min(5, 9) = 5, giving GVT = min(5, 7) = 5.]

Simulation: Seven O’Clock GVT Algorithm
Assumptions:
– Each processor has a highly accurate clock.
– A message-passing interface without acknowledgements is available.
– The worst-case bound Δt_max on the time to transmit a message through the network is known.
Properties:
– A clock-based algorithm for distributed processors.
– Creates a sequentially consistent view of distributed memory.
[Figure: wall-clock timelines for LP1–LP4 with cut points at GVT #1 and GVT #2; each cut is defined by the NAO plus Δt_max, and the LPs report LVT = min(5, 9) and LVT = min(7, 9), giving GVT = min(5, 7).]
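A minimal sketch of one GVT round under these assumptions (our reconstruction for illustration; the function names and the use of MPI_Allreduce are ours, not taken from the slides or the ROSS implementation):

    #include <mpi.h>

    double wallclock_now(void);              /* assumed: synchronized wall-clock time */
    double local_min_lvt(double cut_start);  /* assumed: min timestamp over unprocessed
                                                events and events sent since cut_start */

    /* One Seven O'Clock GVT round: wait for the agreed NAO instant
     * (no cut messages are exchanged), form the local minimum over
     * unprocessed events plus events sent within the last dt_max
     * seconds (covering transient messages), then reduce globally. */
    double seven_oclock_gvt(double nao_expiry, double dt_max) {
        while (wallclock_now() < nao_expiry)
            ;  /* the cut is defined purely by wall-clock time */
        double lvt = local_min_lvt(nao_expiry - dt_max);
        double gvt;
        MPI_Allreduce(&lvt, &gvt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
        return gvt;
    }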

Limitations
– NAOs cannot be “forced”: the agreed-upon intervals cannot change mid-run.
– Simulation end time: in the worst case, a complete NAO elapses with only one event remaining to process; amortized over the entire run time, the cost is O(1).
– Exhausted event pool: requires tuning to ensure enough optimistic event memory is available.

Uniqueness
– The only real-time-based GVT algorithm.
– Zero-cost consistent cut → truly scalable; O(1) cost → optimal.
– The only algorithm entirely independent of available event memory: event memory is only loosely tied to the GVT algorithm.

Performance Analysis: Models
r-PHOLD: PHOLD with reverse computation, modified to control the percentage of remote events (normally 75%). Destinations are still chosen using a uniform random number generator, so all LPs are possible destinations.
TCP-Tahoe: a ring of campus networks, using the same topology design as PDNS in MASCOTS ’03. Model limitations required us to increase the number of LAN routers in order to simulate the same network.

Performance Analysis: Clusters
Itanium Cluster (RPI): 4 nodes, 16 CPUs, 64 GB RAM; quad Itanium-2 1.3 GHz per node; Myrinet 1000base/T network.
NetSim Cluster (RPI): 40 nodes, 80 CPUs, 20 GB RAM; dual Intel 800 MHz per node; half 100base/T, half 1000base/T network.
Sith Cluster (Georgia Tech): 30 nodes, 60 CPUs, 180 GB RAM; dual Itanium-2 900 MHz per node; 1000base/T Ethernet.

Itanium Cluster: r-PHOLD, CPUs allocated round-robin

Maximize distribution (round-robin among nodes) versus maximize parallelization (use all CPUs before using additional nodes)

NetSim Cluster: Comparing 10- and 25% remote events (using 1 CPU per node)

TCP Model Topology
[Figure: left, a single campus network; right, 10 campus networks connected in a ring.] Our model contained 1,008 campus networks in a ring, simulating more than 540,000 nodes.

Itanium Cluster: TCP results using 2- and 4-nodes

Sith Cluster: TCP Model using 1 CPU per node and 2 CPU per node

Future Work & Conclusions
Investigate the “power” of different models by computing a spectral analysis:
– GVT now in the frequency domain
– determine the maximum length of rollbacks
Investigate new ways of measuring performance:
– models too large to run sequentially
– account for hardware effects (even in a NOW there are fluctuations in hardware performance)
– account for the model-to-LP mapping
– account for different cases, i.e., 4 CPUs distributed across 1, 2, and 4 nodes