What is the Cost of Determinism?

What is the Cost of Determinism?
Cedomir Segulja, Tarek S. Abdelrahman
University of Toronto


Non-Determinism
Same program + same input ≠ same output
This is bad for:
- Testing: too many interleavings to test
- Debugging: hard to debug when behavior is not repeatable
- Selling: users of CAD tools expect each run to produce the same circuit

Deterministic Schedulers
Determinism is good, but costly.

Deterministic scheduler                     Maximum slowdown*
DMP [Devietti et al. 2009]                  1.7x
Kendo [Olszewski et al. 2009]               1.6x
Grace [Berger et al. 2009]                  3.6x
CoreDet [Bergan et al. 2010]                10x
Calvin [Hower et al. 2011]
RCDC [Devietti et al. 2011]
Dthreads [Liu et al. 2011]                  4x
Conversion [Merrifield and Eriksson 2013]   5x
Parrot [Cui et al. 2013]                    3.8x
RFDet [Lu et al. 2014]                      2.6x

1. What is the fundamental cost of determinism?
2. What is this cost across various execution environments ("determinism in the field")?

Source: [Bergan et al. 2011] and the respective papers
* Only to show that determinism comes at a cost, and not to be used for a direct comparison (different features, benchmarks, number of threads, etc.)

What is Determinism?
Determinism is the property that requires observing the same output whenever the program runs with the same input.

SyncOrder determinism [Lu and Scott 11]
- Requires the same program result and the same order of synchronization
- More flexible than internal determinism
- Still greatly eases testing [Cui et al. 13]

We assume data-race-freedom
- Determinism during debugging is needed, but the cost of determinism matters the most in production
- All data races are bugs [Boehm 2008, S. Adve 2010, Marino et al. 2010, Lucia et al. 2010, …]
- Data races in general do not help performance [Boehm 12]

[Diagram: range of determinism definitions, labeled External, SyncOrder, Internal]
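The difference between requiring only the same output and additionally requiring the same synchronization order can be sketched as a pair of checks over two recorded runs. This is a minimal illustration, not the authors' tooling: the `Run` record, its fields, and both function names are hypothetical, and a synchronization order is assumed to be a sequence of (thread id, operation) pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One recorded execution (hypothetical record format)."""
    output: str
    sync_order: tuple  # (thread id, operation) pairs in the order they occurred


def same_output(a: Run, b: Run) -> bool:
    """Weaker requirement: only the program result must match."""
    return a.output == b.output


def same_output_and_sync_order(a: Run, b: Run) -> bool:
    """SyncOrder requirement as stated on the slide: same program result
    and same order of synchronization operations."""
    return a.output == b.output and a.sync_order == b.sync_order


run_a = Run("42", ((0, "lock"), (1, "lock")))
run_b = Run("42", ((1, "lock"), (0, "lock")))
print(same_output(run_a, run_b))                 # True
print(same_output_and_sync_order(run_a, run_b))  # False
```

Two runs can agree on the output while still differing in their synchronization order, which is why the second check is the stricter one.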

What is the impact of enforcing a fixed synchronization order on program execution time?

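Schedule-Record-Replay Framework
[Diagram: an application with threads thread1 and thread2; a scheduler (serial, round-robin, dynamic-A, dynamic-S, hybrid); a recorder that produces the schedule; a replayer that consumes it; and a perturber (idle, small perturbations, background processes, DVFS, NUMA, asymmetric architectures). Parts of the diagram are marked 1 and 2.]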

Replayer
- Forces threads to wait only when absolutely necessary under the schedule
- Does so with as little overhead as possible
- Non-deterministic execution vs. non-deterministic execution with the replayer's overhead
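A replayer of this kind can be sketched with a condition variable: a thread blocks before a synchronization operation only while the recorded schedule says another thread must go first. The sketch below is an assumption-laden illustration rather than the authors' implementation; the `Replayer` class, the `replay` helper, and the representation of a schedule as a list of thread ids (one per synchronization operation) are all hypothetical.

```python
import threading


class Replayer:
    """Enforces a recorded order of synchronization operations (sketch)."""

    def __init__(self, schedule):
        self.schedule = list(schedule)  # recorded thread ids, one per sync op
        self.pos = 0                    # index of the next sync op allowed to run
        self.cond = threading.Condition()

    def sync_op(self, tid, action):
        # Assumes each thread performs exactly as many sync ops as the
        # schedule records for it, so self.pos stays in range here.
        with self.cond:
            # Wait only while the schedule says another thread goes first.
            while self.schedule[self.pos] != tid:
                self.cond.wait()
            action()
            self.pos += 1
            self.cond.notify_all()


def replay(schedule, ops_per_thread):
    """Run one worker per thread id and return the observed sync order."""
    replayer = Replayer(schedule)
    observed = []

    def worker(tid, n_ops):
        for _ in range(n_ops):
            replayer.sync_op(tid, lambda: observed.append(tid))

    threads = [threading.Thread(target=worker, args=(tid, n))
               for tid, n in ops_per_thread.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed


recorded = [0, 1, 1, 0, 2, 0, 1, 2]
print(replay(recorded, {0: 3, 1: 3, 2: 2}) == recorded)  # True
```

A thread whose turn has already arrived never blocks, which mirrors the goal of waiting only when the schedule requires it; a low-overhead replayer would likely avoid the single global lock used here.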

Deterministic Schedulers
Schedules

Schedule      Deterministic schedulers
serial        Grace [Berger et al. 2009]
round-robin   Dthreads [Liu et al. 2011], Conversion [Merrifield and Eriksson 2013], Parrot [Cui et al. 2013]
dynamic       Kendo [Olszewski et al. 2009], RCDC [Devietti et al. 2011], RFDet [Lu et al. 2014]
hybrid        DMP [Devietti et al. 2009], CoreDet [Bergan et al. 2010], Calvin [Hower et al. 2011]

When does a thread pass its turn?
- At the end: serial
- After each synchronization operation: round-robin
- After each instruction/store: dynamic-A/dynamic-S
- After N instructions: hybrid (N = 100,000; no "reduced serial mode")
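The turn-passing rules can be made concrete with a small simulation that reports the order in which synchronization operations occur. This is a simplified model under stated assumptions, not any of the cited schedulers: the `sync_order` function and the trace format (for each thread, a list of instruction counts executed before each synchronization operation) are hypothetical, the turn moves between threads in a fixed ring, and only the serial, round-robin, and hybrid rules are modeled.

```python
from collections import deque


def sync_order(traces, policy, quantum=100_000):
    """Return the thread ids in the order their sync operations occur.

    traces[tid] lists the instruction counts a thread executes before each
    of its synchronization operations. policy is "serial", "round-robin",
    or "hybrid"; quantum is N for the hybrid rule and must be positive.
    """
    remaining = {tid: deque(trace) for tid, trace in enumerate(traces)}
    ring = deque(tid for tid in remaining if remaining[tid])
    order = []
    while ring:
        tid = ring[0]
        work = remaining[tid]
        budget = quantum              # instructions left in this turn (hybrid)
        while work:
            if policy == "hybrid" and work[0] > budget:
                work[0] -= budget     # quantum expires before the next sync op
                break
            if policy == "hybrid":
                budget -= work[0]
            work.popleft()            # run up to the sync op and perform it
            order.append(tid)
            if policy == "round-robin":
                break                 # pass the turn after every sync op
        ring.popleft()
        if work:
            ring.append(tid)          # unfinished thread rejoins the ring
    return order


traces = [[5, 5], [3, 3]]
print(sync_order(traces, "serial"))             # [0, 0, 1, 1]
print(sync_order(traces, "round-robin"))        # [0, 1, 0, 1]
print(sync_order(traces, "hybrid", quantum=6))  # [0, 1, 1, 0]
```

Each policy yields a different but repeatable order for the same traces, which is the property a recorder would capture as the schedule.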

Platform
- 8-core Xeon E5-2660
- 24 SPLASH-2 and PARSEC benchmarks, 8 threads
- Deterministic slowdown = $\frac{\text{deterministic execution time}}{\text{non-deterministic execution time}}$
- Data races in general do not help performance [Boehm 12]: 15 benchmarks had races, with performance degradation in only 3: barnes (11%), radiosity (5%), raytrace_parsec (8%)

Deterministic slowdown per benchmark
Columns: serial, round-robin, dynamic-S, dynamic-A, hybrid
Only part of the values survive; each row lists them in their original left-to-right order, so column positions are not preserved.

splash
  barnes: 1.10 0.98 0.95 0.96 0.99
  cholesky: 3.39 2.39 1.07 1.05
  fft: 4.36 1.02 1.01
  fmm: 6.34 1.33 1.16 1.13 1.19
  lu_cb: 1.00
  lu_ncb:
  ocean_cp:
  ocean_ncp:
  radiosity: 7.58 3.04 1.09 1.08 2.67
  radix:
  raytrace: 7.72 2.93 1.88
  volrend: 6.12 1.91 1.67
  water_nsquared:
  water_spatial:
parsec
  blackscholes:
  bodytrack: 5.87 1.04
  dedup: 5.04 1.77 1.63 1.34
  facesim: 6.19
  ferret: 3.19 1.58 1.23 1.25
  fluidanimate: 1.81 0.97 7.26 1.52 1.06
  streamcluster:
  swaptions:
  vips: 7.61 5.27 1.31
average slowdown: 3.61 1.60 1.17
maximum slowdown:

For this set of benchmarks and our platform, and with implementation overhead set aside, the fundamental cost of determinism is small.

What is the performance cost of insisting on the same schedule across different environments?

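Schedule-Record-Perturb-Replay Framework
[Diagram: an application with threads thread1 and thread2; a scheduler (serial, round-robin, dynamic-A, dynamic-S, hybrid); a recorder that produces the schedule; a replayer that consumes it; and a perturber (idle, small perturbations, background processes, DVFS, NUMA, asymmetric architectures). Parts of the diagram are marked 1 and 2.]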

Perturber
- Small perturbations (context switches, thread migrations, page faults): simulate first-order effects by inserting small delays (μs and ms)
- Background processes: spawn additional threads and control their work-to-sleep ratio
- Dynamic voltage and frequency scaling (DVFS): use Linux's cpufreq system to explore different DVFS policies
- Non-uniform memory access (NUMA): spread threads over two NUMA nodes
- Asymmetric architectures: use DVFS to create asymmetry [Shelepov et al. 2009]
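The background-process perturbation can be sketched as extra threads that alternate between busy work and sleep according to a work-to-sleep ratio. Everything here is illustrative: `duty_cycle`, `background_worker`, `start_background_load`, and the default period are hypothetical names and values, not the authors' perturber. Note also that CPython threads share the global interpreter lock, so a realistic load generator would more likely use native threads or separate processes.

```python
import threading
import time


def duty_cycle(period_s, work_to_sleep):
    """Split one period into (work, sleep) seconds for a given work:sleep ratio."""
    work_s = period_s * work_to_sleep / (1.0 + work_to_sleep)
    return work_s, period_s - work_s


def background_worker(stop, period_s, work_to_sleep):
    """Alternate busy-waiting and sleeping until asked to stop."""
    work_s, sleep_s = duty_cycle(period_s, work_to_sleep)
    while not stop.is_set():
        deadline = time.perf_counter() + work_s
        while time.perf_counter() < deadline:
            pass                      # busy-wait to occupy the CPU
        time.sleep(sleep_s)


def start_background_load(n_threads, period_s=0.01, work_to_sleep=1.0):
    """Spawn n_threads background workers; set the returned event to stop them."""
    stop = threading.Event()
    threads = [threading.Thread(target=background_worker,
                                args=(stop, period_s, work_to_sleep),
                                daemon=True)
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    return stop, threads


stop, workers = start_background_load(2, period_s=0.002, work_to_sleep=3.0)
time.sleep(0.01)                      # the benchmark under test would run here
stop.set()
for w in workers:
    w.join()
```

Raising `work_to_sleep` makes the background threads compete harder for the cores; how such a knob maps to the balanced and unbalanced settings in the results is not specified in the slides.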

Metric
Deterministic slowdown = $\frac{\text{deterministic execution time}}{\text{non-deterministic execution time}}$
Same conditions during both runs, for example:
Deterministic slowdown = $\frac{\text{deterministic execution time with background processes}}{\text{non-deterministic execution time with background processes}}$
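As a worked illustration of the metric, the ratio and its per-suite summary can be computed as below. The function names and the timings are made up for the example, and the use of an arithmetic mean for the average is an assumption; the slides do not say how the average is formed.

```python
def deterministic_slowdown(det_time_s, nondet_time_s):
    """Ratio of deterministic to non-deterministic execution time.

    Both times must be measured under the same conditions, for example
    both with background processes running.
    """
    if nondet_time_s <= 0:
        raise ValueError("non-deterministic execution time must be positive")
    return det_time_s / nondet_time_s


def summarize(slowdowns):
    """Return (average, maximum) over per-benchmark slowdowns.

    The average is an arithmetic mean, which is an assumption here.
    """
    values = list(slowdowns.values())
    return sum(values) / len(values), max(values)


# Hypothetical timings in seconds: (deterministic, non-deterministic).
timings = {"bench_a": (12.0, 10.0), "bench_b": (9.0, 9.0), "bench_c": (15.0, 10.0)}
slowdowns = {name: deterministic_slowdown(d, n) for name, (d, n) in timings.items()}
average, maximum = summarize(slowdowns)
print(slowdowns["bench_a"], maximum)  # 1.2 1.5
```

A value below 1.0 means the deterministic run was faster than the non-deterministic one, which is how entries such as 0.95 in the results should be read.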

Deterministic slowdown per benchmark across execution environments
Column headers: Quiet, Small perturbations, Background proc., DVFS, NUMA, Asym. Arch.
Sub-headers: balanced, unbalanced, auto, manual, 4/4, 1/7
Only part of the values survive; each row lists them in their original left-to-right order, so column positions are not preserved.

splash
  barnes: 0.96 0.95 0.97 0.92 0.91 0.94
  cholesky: 1.05 1.06 1.25 1.02 1.08 1.03 1.09
  fft: 1.01 1.07 1.00
  fmm: 1.13 1.19 1.24 1.14 1.15
  lu_cb: 0.99 0.98
  lu_ncb:
  ocean_cp:
  ocean_ncp:
  radiosity: 1.94 1.11 1.46 1.71
  radix:
  raytrace: 1.92 1.44 1.69
  volrend: 1.38 1.55
  water_nsquared:
  water_spatial:
parsec
  blackscholes:
  bodytrack: 1.04 1.51 1.33 1.56
  dedup: 1.35 1.31 1.29 1.32 1.64
  facesim:
  ferret: 1.23 1.21 1.37 1.10
  fluidanimate: 1.77 1.39 1.63
  streamcluster:
  swaptions:
  vips: 1.43 1.53
avg. slowdown: 1.17
max. slowdown:

Insisting on the same schedule in the presence of skewed conditions can slow down execution by almost 2x.

Conclusions
- Employed the schedule-record-replay framework to divorce implementation overhead from the fundamental cost of enforcing deterministic execution
  - The fundamental cost of determinism is small (4% on avg., 33% max.)
  - There is room for lowering overheads in current deterministic systems
- Measured this fundamental cost across a range of execution environments
  - The cost rises to almost 2x when threads face skewed conditions
  - Do we need a more relaxed definition of determinism?
- Quantified various sources of non-determinism
  - Deterministic logical clocks are not deterministic (and not only because of performance-counter imperfections [Weaver et al. 2013])

Thank you!