D. Becker, M. Geimer, R. Rabenseifner, and F. Wolf, Laboratory for Parallel Programming | September 21, 2010
Synchronizing the timestamps of concurrent events in traces of hybrid MPI/OpenMP applications


Slide 1: Synchronizing the timestamps of concurrent events in traces of hybrid MPI/OpenMP applications
D. Becker, M. Geimer, R. Rabenseifner, and F. Wolf, Laboratory for Parallel Programming

Slide 2: Cluster systems
- Cluster systems represent the majority of today's supercomputers
  - Availability of inexpensive commodity components
- Vast diversity in architecture, interconnect technology, and software environment
- Message-passing and shared-memory programming models for communication and synchronization
- Resulting in hybrid MPI/OpenMP applications and the need for generic software tools

Slide 3: Event tracing
- Application areas: performance analysis (time-line visualization, wait-state analysis), performance modeling, performance prediction, and debugging
- Events are recorded at runtime to enable post-mortem analysis of dynamic program behavior
- An event includes at least a timestamp, a location, and an event type
[Diagram: send, receive, and barrier events recorded per location, optionally merged and written to a trace record]
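To make the event model concrete, here is a minimal sketch of such a record in Python; the field names and event-type strings are illustrative assumptions, not the format of any particular tracing tool:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    timestamp: float        # local clock reading when the event occurred
    location: int           # process/thread rank that produced the event
    etype: str              # illustrative types: "SEND", "RECV", "ENTER", "EXIT"
    partner: Optional[int] = None   # peer location for point-to-point events

# With non-synchronized clocks, a receive can appear to precede its send:
send = Event(timestamp=0.10, location=0, etype="SEND", partner=1)
recv = Event(timestamp=0.08, location=1, etype="RECV", partner=0)
violated = recv.timestamp < send.timestamp  # clock condition violation
```

This apparent reversal of cause and effect is exactly the non-synchronized-clocks problem the deck addresses next.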

Slide 4: Problem: non-synchronized clocks

Slide 5: Outline
- Motivation
- Clock synchronization
- Logical synchronization
- Algorithmic extensions
- Parallel synchronization
- Experimental evaluation
- Summary

Slide 6: Clock synchronization
- Network-based synchronization: query time from reference clocks synchronized at regular intervals (Mills)
- Error estimation and offset interpolation: measure offset values and determine an interpolation function (Dunigan; Maillet, Tron; Doleschal); determine a medial smoothing function based on send/receive differences (Duda; Hofman, Hilgers)
- Logical synchronization: restore and preserve logical correctness (Lamport, Mattern, Fidge, Rabenseifner)

Slide 7: Controlled logical clock
- Local correction of clock condition violations
- Forward amortization: smooth subsequent events
- Backward amortization: smooth preceding events
[Diagram: a receive event violating the clock condition is advanced past its matching send by at least the minimal message latency µ_min]
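A heavily simplified sketch of the idea follows; the function names and the uniform shift are my assumptions, since the actual controlled logical clock scales the amortization down gradually rather than shifting all later events uniformly:

```python
MU_MIN = 1e-6  # assumed minimal message latency in seconds

def correct_receive(send_ts, recv_ts, mu_min=MU_MIN):
    """Clock condition: a receive must not occur before its matching
    send plus the minimal message latency."""
    return max(recv_ts, send_ts + mu_min)

def forward_amortize(timestamps, i, shift):
    """Naive forward amortization: shift event i and every later local
    event by the same amount so local event order is preserved.
    (The real CLC algorithm reduces the shift gradually instead.)"""
    return timestamps[:i] + [t + shift for t in timestamps[i:]]

recv = correct_receive(send_ts=2.0, recv_ts=1.5)   # 1.5 < 2.0: violated
shift = recv - 1.5
local = forward_amortize([1.0, 1.5, 1.7], 1, shift)
```

After the correction, the receive at index 1 is timestamped just after the send, and later local events are shifted with it so their order survives.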

Slide 8: MPI semantics
[Diagram: happened-before relations induced by MPI point-to-point messages and collective operations]

Slide 9: Limitations of the CLC algorithm
- Neither restores nor preserves the clock condition under OpenMP event semantics
- May introduce violations at locations that were previously intact
[Diagram: a receive and a send separated by an omp_barrier]

Slide 10: Collective communication
- Consider OpenMP constructs as composed of multiple logical messages
- Define logical send/receive pairs for each flavor
[Diagram: an omp_barrier modeled as an exchange of logical messages]
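One way to read "composed of multiple logical messages" is that every thread entering a barrier acts as a logical sender to every thread leaving it, so no exit may be timestamped before the latest entry. The following is a hypothetical sketch of that reading, not the tool's actual implementation:

```python
def barrier_logical_messages(n_threads):
    """Model an n-thread barrier as n*n logical send/receive pairs:
    each thread's barrier-enter is a logical send to every thread's
    barrier-exit."""
    return [(s, r) for s in range(n_threads) for r in range(n_threads)]

def corrected_exits(enter_ts, exit_ts, mu_min=1e-6):
    """Enforce the clock condition: no barrier exit may precede the
    latest barrier entry plus the minimal latency."""
    latest_enter = max(enter_ts)
    return [max(t, latest_enter + mu_min) for t in exit_ts]

pairs = barrier_logical_messages(3)                       # 9 logical messages
exits = corrected_exits([1.0, 1.2, 0.9], [1.3, 1.1, 1.25])
```

Here the second thread's exit (1.1) precedes the latest entry (1.2) and is pulled forward; the others already satisfy the condition and are left alone.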

Slide 11: OpenMP semantics
[Diagram: happened-before relations for fork/join, barrier, lock/unlock, and tasking constructs]

Slide 12: Happened-before relation
- An operation may have multiple logical receive and send events
- Multiple receives are used to synchronize multiple clocks
- The latest send event is the relevant send event
[Diagram: a receive event constrained by several logical send events]
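In code, the "latest send is the relevant send" rule might look like this (a minimal sketch with assumed names):

```python
def corrected_receive(recv_ts, logical_send_ts, mu_min=1e-6):
    """A receive with several logical senders (e.g. the exit of a
    collective operation) is constrained only by the latest of them."""
    relevant_send = max(logical_send_ts)
    return max(recv_ts, relevant_send + mu_min)

# Receive at 3.0 with three logical sends; only the latest (3.4) matters
ts = corrected_receive(3.0, [2.8, 3.4, 3.1])
```

A receive that already follows all of its logical sends is returned unchanged, so intact parts of the trace are not disturbed.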

Slide 13: Parallelization
- Correct local traces in parallel: keep the whole trace in memory and exploit distributed memory and processing capabilities
- Replay communication: traverse the trace in parallel, exchange data at synchronization points, and use an operation of the same type (MPI functions, OpenMP constructs)
- Phases: linear offset interpolation, forward replay, backward replay
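The linear offset interpolation step can be sketched as follows; this is a simplified illustration with assumed names, where the offset between a local clock and a reference clock is measured at two synchronization points and interpolated linearly in between:

```python
def interpolate_offset(t, t0, off0, t1, off1):
    """Linearly interpolate the clock offset at local time t between
    two offset measurements (t0, off0) and (t1, off1)."""
    slope = (off1 - off0) / (t1 - t0)
    return off0 + slope * (t - t0)

def to_global_time(t, t0, off0, t1, off1):
    """Map a local timestamp onto the reference timescale."""
    return t + interpolate_offset(t, t0, off0, t1, off1)

# Offset measured as 1.0 ms at t=0 and 3.0 ms at t=10 (the clock drifts),
# so halfway through the interval the interpolated offset is 2.0 ms
g = to_global_time(5.0, 0.0, 0.001, 10.0, 0.003)
```

Because drift is rarely perfectly linear, residual violations remain after this step, which is why the logical forward and backward replays follow.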

Slide 14: Forward replay
[Diagram: three locations traverse their traces in parallel and re-enact an omp_barrier]

Slide 15: Backward amortization
- Avoid new violations: do not advance a send event farther than its matching receive
- Backward replay: exchange remote event data and store it temporarily
- Piece-wise correction: determine the correction and apply it independently
[Diagram: send/receive pairs bounding the backward amortization]

Slide 16: Backward replay
- Data from the sender side is needed
- Communication direction: communication proceeds in the backward direction, so the roles of sender and receiver are inverted
- Traversal direction: start at the end of the trace to avoid deadlocks
[Diagram: send/receive pairs replayed in reverse order]

Slide 17: Piece-wise correction
[Plot: differences to the amortized clock LC_i^b over the amortization interval, comparing the controlled logical clock without jump discontinuities (LC_i' - LC_i^b), with jump discontinuities (LC_i^A' - LC_i^b), a linear interpolation for backward amortization, and a piecewise linear interpolation for backward amortization (LC_i^A - LC_i^b); the interval is bounded by min(LC_k'(corresponding receive event) - µ - LC_i^b)]
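A strongly simplified sketch of the piece-wise idea: the correction applied at the end of the amortization interval is faded in linearly over the preceding events, and each send event is additionally capped so it cannot be advanced past its matched receive. The names and the single linear ramp are my assumptions; the slide's piecewise interpolation handles each cap segment separately:

```python
def backward_amortize(ts, jump, interval, caps):
    """Spread a correction `jump` (applied at the last event) linearly
    over the preceding `interval` of local time, but never move an
    event past its cap (e.g. a send past its matched receive minus
    the minimal latency)."""
    end = ts[-1]
    out = []
    for t, cap in zip(ts, caps):
        if t < end - interval:
            out.append(t)                           # outside the interval
        else:
            frac = (t - (end - interval)) / interval  # 0 at start, 1 at end
            out.append(min(t + frac * jump, cap))
    return out

INF = float("inf")
ts = [0.0, 6.0, 8.0, 10.0]
caps = [INF, INF, 8.5, INF]          # the event at 8.0 is a capped send
adjusted = backward_amortize(ts, jump=1.0, interval=10.0, caps=caps)
```

The event at 8.0 would linearly move to 8.8 but is clamped at its cap of 8.5, which is exactly the kind of jump the piecewise interpolation smooths out.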

Slide 18: Experimental evaluation
- Platform: Nicole cluster with 32 compute nodes, 2 quad-core Opterons running at 2.4 GHz, InfiniBand interconnect
- Applications: PEPC (4 threads per process) and a Jacobi solver (2 threads per process)
- Evaluation focused on the frequency of clock condition violations and on the accuracy and scalability of the correction
- A significant percentage of messages was violated (up to 5%)
- After correction, all traces were free of clock condition violations

Slide 19: Accuracy of the algorithm
- Event position: absolute deviations correspond to the magnitude of the clock condition violations; relative deviations are negligible
- Event distance: larger relative deviations are possible, but the impact on analysis results is negligible
- The correction changed the length of local intervals only marginally

Slide 20: Synchronizing hybrid codes
- Only MPI semantics were violated in the original trace
- Roughly half of the corrections correspond to OpenMP semantics
- The algorithm preserved OpenMP semantics
[Diagram: receives and a send around an omp_barrier]

Slide 21: Scalability
- The correction easily scaled with the target application

Slide 22: Summary
- Problem characterized: the controlled logical clock algorithm is limited
- Algorithm extended: identified happened-before relations in OpenMP semantics; coverage of realistic hybrid codes
- Algorithm parallelized: parallel forward and backward replay; good accuracy and scalability

Slide 23: Outlook
- Exploit knowledge of MPI-internal messaging inside collective operations using PERUSE
- Leverage periodic offset measurements at global synchronization points: measure (indirect) offsets periodically; CLC can increase accuracy between measurements
- A combined method is desirable

Slide 24: Thanks!