OSDI ’10 Research Visions, 3 October 2010. Epoch parallelism: One execution is not enough. Jessica Ouyang, Kaushik Veeraraghavan, Dongyoon Lee, Peter Chen,

Presentation transcript:

OSDI ’10 Research Visions, 3 October 2010
Epoch parallelism: One execution is not enough
Jessica Ouyang, Kaushik Veeraraghavan, Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy
University of Michigan

Motivation
Write a single program that is both fast & correct
Make it easier for programmers
– Change approach to programming
– Write program that is fast or correct, not both
Combine multiple, specialized executions
– Fast/buggy accelerates slow/correct
– Slow/correct checks fast/buggy
[Figure labels: Fast & Correct; Slow & Correct; Fast & Buggy; Slow & Correct]

Epoch parallelism
1. Checkpoint state
2. Start epoch
3. Check state
4. Roll back & re-execute
[Figure: timeline of epochs E0 to E3 for the fast & buggy and slow & correct executions, with state comparisons marked "==?" and "!=".]
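The four steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' system: the names `run_epoch_parallel`, `fast`, and `slow` are hypothetical, program state is modeled as a plain comparable value, and a thread pool stands in for running epochs on separate cores.

```python
from concurrent.futures import ThreadPoolExecutor


def run_epoch_parallel(initial_state, epochs, fast, slow):
    """Run every epoch and return (final_state, rollbacks).

    fast(state, epoch) -> predicted end state (hypothetical; may be wrong)
    slow(state, epoch) -> correct end state (hypothetical)
    """
    # 1. Checkpoint state: the fast execution runs ahead and records the
    #    state it predicts at every epoch boundary.
    checkpoints = [initial_state]
    for epoch in epochs:
        checkpoints.append(fast(checkpoints[-1], epoch))

    # 2. Start epochs: each slow epoch starts from a predicted checkpoint,
    #    so the epochs do not have to wait for one another.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(slow, checkpoints[:-1], epochs))

    state = initial_state
    rollbacks = 0
    for i, epoch in enumerate(epochs):
        # 3. Check state: was the checkpoint this epoch started from correct?
        if checkpoints[i] == state:
            state = results[i]  # prediction held, commit the epoch
        else:
            # 4. Roll back & re-execute the epoch from the correct state.
            rollbacks += 1
            state = slow(state, epoch)
    return state, rollbacks
```

In this simplified sketch the fast execution is never restarted, so once it diverges every later epoch is likely to be re-executed as well. For example, with `slow = lambda s, e: s + e` and a `fast` that forgets to add epoch 3, running epochs `[1, 2, 3, 4]` from state 0 yields the correct total with one rollback.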

Uniprocessor execution
Nice properties of uniprocessor
- Fewer races
- Stronger memory consistency model
- Easier to replay
[Figure labels: CPU 0 to CPU 3; epochs E0 and E1 with threads A0, A1, B0, B1; Multiprocessor; Uniprocessor; Performance]

Using epoch parallelism
Challenges
- Importing state to start epochs
- Checking state
[Figure labels: CPU 0 to CPU 3; epochs E0 and E1; Multi-threaded (A0, A1, B0, B1); Single-threaded (S0, S1); Transform function]

Conclusion
Rethink having a single program/execution be both fast & correct
Use separate, specialized executions to achieve different goals


Related Work
Master/Slave Speculative Parallelization
– Zilles, Sohi, IEEE ’02
Thread-Level Data Speculation
– Steffan, Mowry, HPCA ’98
Enhancing Software Reliability with Speculative Threads
– Oplinger, Lam, ASPLOS ’02
BASE
– Castro, Rodrigues, Liskov, TOCS ’03
GRACE
– Berger, Yang, Liu, Novark, OOPSLA ’09

More uses of epoch parallelism
Uniprocessor execution
– Deterministic replay
– Data race detection/avoidance
Optimistic concurrency
– Lock elision
– Transactional memory
Additional runtime checks
– Assertions, bounds checking
– Security checks

Programming effort
Write one program
– Compiler/runtime/hardware optimizes aggressively
– Original program checks correctness
Write 2 versions of same program
– One with checks (assertions, security) and one without
Write 2 versions + transform function
– Arbitrary implementations
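As a small illustration of the "2 versions of same program" option, the sketch below pairs a version without checks with a version that adds a bounds check. All function names here are hypothetical, and Python's silent wrap-around for negative list indices stands in for the kind of bug an unchecked version can hide.

```python
def lookup_unchecked(table, indices):
    # Version without checks: a negative index silently wraps around to
    # the end of the list instead of being rejected.
    return [table[i] for i in indices]


def lookup_checked(table, indices):
    # Version with checks: same logic plus an explicit bounds check.
    result = []
    for i in indices:
        if not 0 <= i < len(table):
            raise IndexError(f"index {i} out of range")
        result.append(table[i])
    return result


def run_both(table, indices):
    """Return (result, ok); ok is False when the checked version
    rejects the input or disagrees with the unchecked version."""
    try:
        fast_result = lookup_unchecked(table, indices)
        checked_result = lookup_checked(table, indices)
    except IndexError:
        return None, False
    return fast_result, fast_result == checked_result
```

Here the two versions run back to back for simplicity; under epoch parallelism the checked version would instead verify the unchecked one epoch by epoch.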

Programming effort
Single-threaded & multi-threaded use case
– Need additional transform function
– Generate input to start epochs
Is this really less work than 1 correct & fast multi-threaded program?

Redundancy & efficiency
Baseline overhead is 2x throughput
Acceptable for some applications
– Core counts increasing
– Using cores is hard
Can make it more efficient
– Remove redundant instructions
– Use fast & buggy as software predictor for slow & correct (branches, load values)

Epoch parallelism
[Figure: timeline of epochs E0 to E3 for the fast, buggy and correct, slow executions.]

Misspeculation in epoch parallelism
Slow and correct E0 has completed
Check thread-parallel checkpoint
Checkpoint doesn’t match!
Use result from epoch-parallel
Restart execution of epoch 1
Continue executing
[Figure: timeline of epochs E0 to E3 for the fast, buggy and correct, slow executions.]
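The recovery sequence on this slide can be modeled sequentially as follows. This is a sketch under stated assumptions rather than the authors' implementation: `run_with_recovery`, `fast`, and `slow` are hypothetical names, state is a plain comparable value, and the two executions are interleaved in one thread for clarity.

```python
def run_with_recovery(initial_state, epochs, fast, slow):
    """Return (final_state, restarts).

    fast(state, epoch) -> predicted end state (hypothetical; may be wrong)
    slow(state, epoch) -> correct end state (hypothetical)
    """
    committed = initial_state
    restarts = 0
    i = 0
    while i < len(epochs):
        # The fast execution speculates ahead from the last committed
        # state, recording a checkpoint after each remaining epoch.
        predicted = [committed]
        for epoch in epochs[i:]:
            predicted.append(fast(predicted[-1], epoch))

        for k, epoch in enumerate(epochs[i:]):
            # The slow execution starts from the checkpoint at the epoch
            # start, which equals the committed state whenever we get here.
            committed = slow(predicted[k], epoch)
            i += 1
            if committed != predicted[k + 1]:
                # Checkpoint doesn't match: keep the slow result, discard
                # later speculation, and restart the fast execution.
                restarts += 1
                break
    return committed, restarts
```

Unlike a scheme that only re-executes the failing epoch, restarting the fast execution from the corrected state gives later epochs a fresh chance to be predicted correctly.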