SIMD Lane Decoupling Improved Timing-Error Resilience

Presentation transcript:

SIMD Lane Decoupling Improved Timing-Error Resilience
Evgeni Krimer (UT Austin), Patrick Chiang (Oregon State), Mattan Erez (UT Austin)

All systems power/energy bound
SIMD Lane Decoupling (C) M. Erez, E. Krimer
- The good: Transistor still following Moore’s Law
- The bad: Transistor power efficiency improving too slowly
  - Larger fraction of power to non-compute resources
- The conclusion:
  - Better algorithms
  - More efficient architectures
  - Proportionality: waste less of what you have
- This paper: SIMD + timing speculation
  - Efficient architecture + proportional guardbands

Outline
- Setup: efficient architecture + proportional margining
- Proportional margining w/ timing speculation
- Timing speculation with SIMD
- Problem and DPSP solution
- Methodology and modeling
- Evaluation

Voltage/timing margins “waste” energy
[Figure, illustrative only, not to scale: one clock cycle, labeled "time (1 cycle)", divided into typical logic delay, maximum logic delay, and guard-bands for process variation, noise, wearout, temperature, ... Two panels: "Today" and "Future".]

Timing speculation to the rescue [Ernst04]
- Razor latches
- Speculate low delay
- Detect violations
  - Early/late mismatch
- Recover by stalling
  - Requires fast “global” signal
  - Alternative: flush
- Requires extra ~10% logic
- Path delay restrictions: Δ < t < Δ+cycle
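
The path delay restriction above can be read as a classification of each path evaluation. The following is a minimal sketch, assuming a main flip-flop that samples at the clock edge and a shadow latch that samples Δ later; the function name, the return labels, and the numeric values are hypothetical and not taken from the slides.

```python
def razor_sample(path_delay, cycle, delta):
    """Classify one path evaluation under Razor-style timing speculation.

    path_delay: time for the logic output to settle (same units as cycle).
    cycle:      clock period; the main flip-flop samples at t = cycle.
    delta:      speculation window; the shadow latch samples at t = cycle + delta.
    """
    if path_delay <= delta:
        # Violates delta < t: next-cycle data could reach the shadow latch
        # before it has captured the current value.
        return "short-path violation"
    if path_delay <= cycle:
        # Output settled before the clock edge: main and shadow agree.
        return "ok"
    if path_delay < cycle + delta:
        # Main flip-flop captured a stale value, shadow latch has the right
        # one: the early/late mismatch flags a recoverable error.
        return "error detected"
    # Violates t < delta + cycle: even the shadow latch sampled too early.
    return "undetected"


if __name__ == "__main__":
    for d in (0.1, 0.5, 1.1, 1.3):
        print(d, razor_sample(d, cycle=1.0, delta=0.2))
```

Only the "error detected" case triggers recovery (stall or flush); the two boundary cases are what the restriction Δ < t < Δ+cycle is meant to rule out at design time.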

Outline
- Setup: SIMD architecture + proportional margining
- Proportional margining w/ timing speculation
- Timing speculation with SIMD
- Problem and DPSP solution
- Methodology and modeling
- Evaluation

SIMD leads to inefficient timing speculation
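
One way to see the inefficiency, as a sketch under stated assumptions rather than the authors' model: suppose lanes run in lockstep, a timing error in any lane stalls the whole SIMD group, and each lane errs independently with per-instruction probability `p_lane`. The group then stalls with probability 1 - (1 - p_lane)^N, which grows quickly with the lane count N. The function name and the example values are hypothetical.

```python
def simd_stall_probability(p_lane, lanes):
    """Probability that at least one of `lanes` independent lanes has a
    timing error on a given instruction, assuming each lane errs with
    probability `p_lane` (independence is an assumption of this sketch)."""
    return 1.0 - (1.0 - p_lane) ** lanes


if __name__ == "__main__":
    for n in (1, 8, 32):
        print(n, simd_stall_probability(0.01, n))
```

With a 1% per-lane error rate, a single lane stalls on 1% of instructions, while a 32-lane lockstep group stalls on more than a quarter of them.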

Decoupled Parallel SIMD Pipeline (DPSP)
- Shallow FIFO for control (or between stages)
- Decoupling mitigates SIMD impact
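
The effect of decoupling can be illustrated with a toy cycle-level simulation. This is a rough sketch, not the DPSP design or the authors' simulator: it assumes one instruction issued per cycle, a one-cycle recovery stall per timing error, independent errors per lane, and a per-lane control FIFO of depth `fifo_depth`. All names and parameters are hypothetical.

```python
import random


def run_lockstep(n_instr, lanes, p_err, rng):
    """Cycles for lockstep SIMD: an error in any lane stalls every lane."""
    cycles = 0
    for _ in range(n_instr):
        cycles += 1
        if any(rng.random() < p_err for _ in range(lanes)):
            cycles += 1  # whole group spends one cycle recovering
    return cycles


def run_decoupled(n_instr, lanes, p_err, fifo_depth, rng):
    """Cycles when each lane drains its own shallow control FIFO."""
    if fifo_depth < 1:
        raise ValueError("fifo_depth must be at least 1")
    issued = 0                  # instructions pushed into every lane's FIFO
    done = [0] * lanes          # instructions completed per lane
    stalled = [False] * lanes   # lane is spending this cycle on recovery
    cycles = 0
    while min(done) < n_instr or any(stalled):
        cycles += 1
        # Control issues only while every lane's FIFO has a free slot.
        if issued < n_instr and all(issued - d < fifo_depth for d in done):
            issued += 1
        for i in range(lanes):
            if stalled[i]:
                stalled[i] = False          # recovery cycle, lane-local
            elif done[i] < issued:
                done[i] += 1
                if rng.random() < p_err:
                    stalled[i] = True       # only this lane will stall
    return cycles


if __name__ == "__main__":
    n, lanes, p = 2000, 32, 0.01
    print("lockstep :", run_lockstep(n, lanes, p, random.Random(0)))
    print("decoupled:", run_decoupled(n, lanes, p, 4, random.Random(0)))
```

In this toy model the lockstep pipeline pays for the union of all lanes' errors, while the decoupled pipeline lets recovery stalls in different lanes overlap, so its run time tracks the slowest lane instead.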

DPSP challenge 1: inter-lane communication
- Decoupling may delay producer (store)
- Micro barriers
  - Enforce SIMD semantics
- Not a problem in practice with GPUs
  - Execution model requires explicit sync across CTAs / work-groups

DPSP challenge 2: memory access locality
- Loads and stores no longer aligned
  - Memory “divergence”
  - May increase pressure on on-chip memory access
  - May impact off-chip access
- Old NVIDIA hardware had memory coalescing issues
- No problem with coalescing buffers and caches
- Micro-barriers if problematic
  - Can be done implicitly or explicitly in hardware
  - Sync before every load
  - Prediction
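
As a software analogy for the "sync before every load" option (an illustration only, since the slides describe a hardware mechanism), the sketch below uses threads as stand-ins for decoupled lanes and Python's `threading.Barrier` as the micro-barrier. The function name, the sleep-based drift, and the log format are hypothetical.

```python
import random
import threading
import time


def run_lanes(num_lanes=4, seed=0):
    """Let lanes drift apart, then re-align them before a 'load'."""
    rng = random.Random(seed)
    # Per-lane delay stands in for lane-local recovery stalls.
    delays = [rng.uniform(0.0, 0.01) for _ in range(num_lanes)]
    barrier = threading.Barrier(num_lanes)
    log = []
    lock = threading.Lock()

    def lane(i):
        time.sleep(delays[i])            # lanes finish compute at different times
        with lock:
            log.append(("compute", i))
        barrier.wait()                   # micro-barrier: wait for every lane
        with lock:
            log.append(("load", i))      # loads are issued together again

    threads = [threading.Thread(target=lane, args=(i,)) for i in range(num_lanes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for entry in run_lanes():
        print(entry)
```

Every lane logs its compute step before reaching the barrier, so no load can appear in the log until all compute steps have, which mirrors how a micro-barrier restores aligned memory accesses at the cost of waiting for the slowest lane.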

Outline
- Setup: efficient architecture + proportional margining
- Proportional margining w/ timing speculation
- Timing speculation with SIMD
- Problem and DPSP solution
- Methodology and modeling
- Evaluation

Evaluation flow
[Flow diagram with stages: Error Measurements, Error Probability Model, Energy-Efficiency Model, Design Space Exploration, Arch Sim. Validation]

Measuring error rate
[Figure: Pawlowski ISSCC’12]
- Inherently circuit and implementation dependent
- Used 3 exemplary circuits
  - SPICE-simulated adder [Ernst04]
  - FPGA-modeled multiplier [Ernst04]
  - Multiplier fabricated in our IBM 45nm SOI test chip [Pawlowski12]

Modeling the error rate function
- 2-parameter model
[Plots: Adder [Ernst04], Mul. [Ernst04]]
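
The transcript does not preserve the model's formula. Purely as an illustration of what a two-parameter fit could look like, the sketch below assumes an error rate that is exponential in Vdd, p(V) = exp(a - b*V) clipped to 1, and fits `a` and `b` by least squares on log p. The functional form, the parameter names, and the values in the demo are assumptions, not the authors' model or data.

```python
import math


def error_rate(vdd, a, b):
    """Assumed two-parameter model: exponential in Vdd, clipped to 1."""
    return min(1.0, math.exp(a - b * vdd))


def fit_log_linear(vdds, rates):
    """Least-squares fit of log(rate) = a - b * vdd.

    Expects rates strictly between 0 and 1 (the unclipped region).
    Returns the pair (a, b).
    """
    ys = [math.log(r) for r in rates]
    n = len(vdds)
    mean_x = sum(vdds) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in vdds)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(vdds, ys))
    slope = sxy / sxx            # slope of log(rate) versus vdd equals -b
    a = mean_y - slope * mean_x  # intercept equals a
    return a, -slope


if __name__ == "__main__":
    # Round-trip check on points generated from the assumed model itself,
    # not on measured data.
    vdds = [0.85, 0.90, 0.95, 1.00]
    rates = [error_rate(v, a=20.0, b=25.0) for v in vdds]
    print(fit_log_linear(vdds, rates))
```

Fitting in log space keeps the sketch dependency-free; a real fit to measured error rates would need to handle zero-error points and measurement noise.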

ET² energy-efficiency metric
- Energy × (execution time)²
- In circuit context: time = delay -> ED²
- Isolates architecture efficiency
  - Independent of DVFS
  - Shows improvements in addition to DVFS
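
A first-order argument for why this metric is insensitive to DVFS, offered as a sketch rather than the authors' derivation: assume dynamic energy dominates, so energy scales with the square of the supply voltage, and assume clock frequency scales roughly linearly with supply voltage. The symbols C (switched capacitance), V (supply voltage), and f (clock frequency) are introduced here for illustration.

```latex
E \propto C V^{2}, \qquad T \propto \frac{1}{f} \propto \frac{1}{V}
\quad\Longrightarrow\quad
E\,T^{2} \;\propto\; C V^{2} \cdot \frac{1}{V^{2}} \;=\; C
```

Under these assumptions, moving along the voltage/frequency curve trades E against T but leaves the product of energy and time squared unchanged, so a change in the metric reflects the architecture rather than the operating point.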

Simple ET² model
- Throughput (1/T)
- Relative energy: dynamic and static
[Equations not preserved in the transcript]

GP-GPU simulation adds some realism
- Baseline uses ideal margins without speculation
  - Only max delay vs. typical delay left on table
- Timing speculation overhead is 0 – 15% ET²
- GPGPUSim (version 2.1)
  - Cycle-based extendable GP-GPU simulator from UBC
  - Developer-recommended parameters
- Extended to DPSP
  - Recovery through stall
  - Micro-barrier options
    - Explicit CTA/workgroup synchronization only (no mbarriers)
    - Implicit sync before every memory operation
- Power model based on Hong & Kim, ISCA’10

Outline
- Setup: efficient architecture + proportional margining
- Proportional margining w/ timing speculation
- Timing speculation with SIMD
- Problem and DPSP solution
- Methodology and modeling
- Evaluation
  - Design-space exploration
  - Architecture effects

ET² vs. SIMD (no spec.)
[Plots: DPSP for Adder [Ernst04] and Mul. [Ernst04]. Relative ET²; lower elevation is better.]

DPSP vs. SIMD (w/ spec.)
[Plots: SIMD – DPSP for Adder [Ernst04] and Mul. [Ernst04]. ET² difference; higher elevation is better.]

Bringing in architecture effects
[Plots: Fabricated MUL, Adder]

Summary
- Design margins → inefficiency
- Naive timing speculation with SIMD is inefficient
- DPSP enables efficient speculation in SIMD
  - Microbarriers maintain semantics when necessary
  - With GPU, frequent mbarriers help memory access
- Simple models can capture error response
  - Error rate exponential with Vdd
  - Dependent on circuit and implementation
- Design-space exploration shows potential
  - When and why timing speculation should (not) be used
  - DPSP consistently improves ET² (10 – 45%)
  - DPSP achieves 10 – 20% better ET² than SIMD w/ spec.

Backup

Detailed ET² vs. Vdd behavior
[Plots: NN, AES, BFS, MUM]

Frequent micro-barriers improve ET²
[Plots: Adder, Multiplier, Fab.]

Modeling the error rate function
[Plots: Adder [Ernst04], Mul. [Ernst04]]

Proportional margining
- Static margin control
  - Binning
  - Vdd/frequency/biasing adjustment
- Dynamic margin control
  - Vdd/frequency/biasing for slowly varying effects
    - Temperature and aging
  - Clocking tricks
    - From GALS to dynamic and elastic clocking
[Figure: time axis showing typical logic delay, maximum logic delay, and guard-bands for process variation, noise, wearout, clock skew and jitter, and other.]

Detailed results summary
- BFS
  - High divergence rate
  - Requires implicit synchronizations
  - Limits DPSP opportunities
- CP, DG, RAY
  - Sensitive to memory coalescing
  - Synchronization between memory operations solves it
- MUM
  - Low SIMD occupancy limits the benefit of decoupling
- WP
  - Not enough registers, lots of memory spills
  - Extremely sensitive to memory latency and the exact scheduling (disturbed by DPSP)