SIMD Lane Decoupling Improved Timing-Error Resilience

Evgeni Krimer (UT Austin) Patrick Chiang (Oregon State) Mattan Erez (UT Austin)

All systems power/energy bound
The good:
  Transistor still following Moore’s Law
The bad:
  Transistor power efficiency improving too slowly
  Larger fraction of power to non-compute resources
The conclusion:
  Better algorithms
  More efficient architectures
  Proportionality: waste less of what you have
This paper: SIMD + timing speculation
  Efficient architecture + proportional guardbands
SIMD Lane Decoupling (C) M. Erez, E. Krimer

Outline
  Setup: efficient architecture + proportional margining
  Proportional margining w/ timing speculation
  Timing speculation with SIMD
  Problem and DPSP solution
  Methodology and modeling
  Evaluation

Voltage/timing margins “waste” energy
[Figure, illustrative only and not to scale: one clock cycle today on a time axis, split into typical logic delay, maximum logic delay, and guard-bands for process variation, noise, wearout, temperature, …]

Voltage/timing margins “waste” energy
[Figure, illustrative only and not to scale: one clock cycle on a time axis, shown for today and for the future, each split into typical logic delay, maximum logic delay, and guard-bands for process variation, noise, wearout, temperature, …]

Timing speculation to the rescue [Ernst04]
Razor latches
  Speculate low delay
  Detect violations
    Early/late mismatch
  Recover by stalling
    Requires fast “global” signal
  Alternative: flush
Requires extra ~10% logic
Path delay restrictions: Δ < t < Δ+cycle
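The path-delay window on this slide can be sketched as a small classifier. This is only an illustration, assuming a Razor-style stage in which the main flip-flop samples at the end of the cycle and a shadow latch samples Δ later. The function name `classify_path` and the example numbers are hypothetical and do not come from the slides.

```python
def classify_path(t, cycle, delta):
    """Classify a path delay t for a Razor-style stage (illustrative only).

    Assumptions (not from the slides): the main flip-flop samples at `cycle`,
    the shadow latch samples at `cycle + delta`, and all values share one unit.
    """
    if t <= delta:
        # Too short: next-cycle data could reach the shadow latch too early.
        return "short-path violation"
    if t <= cycle:
        # Main flip-flop and shadow latch capture the same value.
        return "on time"
    if t < cycle + delta:
        # Main flip-flop missed the value but the shadow latch caught it.
        return "late, detected"
    # The value arrives after the shadow latch has sampled as well.
    return "late, undetected"


# Hypothetical delays, normalized to a cycle of 1.0 with delta = 0.3.
for t in (0.1, 0.8, 1.1, 1.4):
    print(t, classify_path(t, cycle=1.0, delta=0.3))
```

Only the two middle cases fall inside the window Δ < t < Δ+cycle; the other two are the cases the path-delay restrictions are meant to rule out.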

Outline
  Setup: SIMD architecture + proportional margining
  Proportional margining w/ timing speculation
  Timing speculation with SIMD
  Problem and DPSP solution
  Methodology and modeling
  Evaluation

SIMD leads to inefficient timing speculation
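One way to see the inefficiency: in a lockstep SIMD pipeline that recovers by stalling, a timing error in any single lane stalls every lane. The sketch below assumes that each lane errs independently with the same per-instruction probability, which is a simplification made for illustration and not a claim from the slides. `simd_stall_probability` is a hypothetical helper name.

```python
def simd_stall_probability(p_lane, lanes):
    """Probability that at least one of `lanes` independent lanes errs."""
    return 1.0 - (1.0 - p_lane) ** lanes


# Hypothetical per-lane error probability of 1% at several SIMD widths.
for lanes in (1, 8, 32):
    print(lanes, simd_stall_probability(0.01, lanes))
```

Under this assumption the stall probability grows with SIMD width even though each lane is no more error-prone than before.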

Decoupled Parallel SIMD Pipeline (DPSP)
Shallow FIFO for control (or between stages)

Decoupled Parallel SIMD Pipeline (DPSP)
Decoupling mitigates SIMD impact
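The mitigation can be illustrated with a toy cycle count. This sketch is not the authors' model: it assumes independent errors, a fixed recovery penalty, unbounded FIFOs (the slides call for shallow ones), and a single synchronization point at the very end. The function `simulate` and all parameter values are hypothetical.

```python
import random


def simulate(lanes, instructions, p_error, penalty=1, seed=0):
    """Return (lockstep_cycles, decoupled_cycles) for one random error pattern.

    Illustrative assumptions: independent errors, a fixed recovery penalty,
    unbounded FIFOs, and a single synchronization point at the very end.
    """
    rng = random.Random(seed)
    lockstep = 0
    per_lane = [0] * lanes
    for _ in range(instructions):
        errors = [rng.random() < p_error for _ in range(lanes)]
        # Lockstep: one failing lane stalls every lane.
        lockstep += 1 + (penalty if any(errors) else 0)
        # Decoupled: each lane pays only for its own errors.
        for lane, err in enumerate(errors):
            per_lane[lane] += 1 + (penalty if err else 0)
    # Lanes re-synchronize at the end, so the slowest lane sets the time.
    return lockstep, max(per_lane)


lockstep, decoupled = simulate(lanes=32, instructions=10_000, p_error=0.01)
print(lockstep, decoupled)
```

With the same error pattern, the decoupled count can never exceed the lockstep count, because no lane sees more errors than the number of instructions in which some lane erred. A finite FIFO depth or more frequent synchronization would narrow the gap.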

DPSP challenge 1: inter-lane communication
Decoupling may delay producer (store)
Micro barriers
  Enforce SIMD semantics
Not a problem in practice with GPUs
  Execution model requires explicit sync across CTAs / work-groups
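As a software analogy only, the role of a micro-barrier can be sketched with threads standing in for decoupled lanes: each lane stores a value, waits at a barrier, and only then loads its neighbor's value. The slides describe a hardware mechanism; `run_lanes` and the neighbor-exchange pattern are hypothetical and chosen just to show the ordering guarantee.

```python
import threading


def run_lanes(num_lanes):
    """Each lane stores its own value, then loads its neighbor's value."""
    shared = [None] * num_lanes
    loaded = [None] * num_lanes
    barrier = threading.Barrier(num_lanes)  # stands in for a micro-barrier

    def lane(i):
        shared[i] = i * i                        # producer store
        barrier.wait()                           # wait until every lane has stored
        loaded[i] = shared[(i + 1) % num_lanes]  # consumer load from the neighbor

    threads = [threading.Thread(target=lane, args=(i,)) for i in range(num_lanes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return loaded


print(run_lanes(4))
```

Without the barrier, a lane that runs ahead could load from a neighbor that has not stored yet, which is the delayed-producer hazard named on the slide.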

DPSP challenge 2: memory access locality
Loads and stores no longer aligned
  Memory “divergence”
  May increase pressure on on-chip memory access
  May impact off-chip access
    Old NVIDIA hardware had memory coalescing issues
No problem with coalescing buffers and caches
Micro-barriers if problematic
  Can be done implicitly or explicitly in hardware
    Sync before every load
    Prediction

Outline
  Setup: efficient architecture + proportional margining
  Proportional margining w/ timing speculation
  Timing speculation with SIMD
  Problem and DPSP solution
  Methodology and modeling
  Evaluation

Evaluation flow
[Flow diagram boxes: Error Measurements; Error Probability Model; Energy-Efficiency Model; Design Space Exploration; Arch Sim. Validation]

Measuring error rate
Inherently circuit and implementation dependent
Used 3 exemplary circuits
  SPICE-simulated adder [Ernst04]
  FPGA-modeled multiplier [Ernst04]
  Multiplier fabricated in our IBM 45nm SOI test chip [Pawlowski12]
[Figure credit: Pawlowski ISSCC’12]

Modeling the error rate function
2-parameter model
[Figure panels: Adder [Ernst04]; Mul. [Ernst04]]
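The transcript does not preserve the functional form of the 2-parameter model. As a hedged illustration, the sketch below assumes an error rate that is exponential in Vdd, p(Vdd) = min(1, exp(a - b·Vdd)), and fits the two parameters by least squares on the logarithm of the rate. The parameter names `a` and `b`, the helper names, and the data points are all hypothetical; the points are generated from the assumed model itself and are not measurements.

```python
import math


def error_rate(vdd, a, b):
    """Assumed two-parameter form: p(Vdd) = min(1, exp(a - b * Vdd))."""
    return min(1.0, math.exp(a - b * vdd))


def fit_log_linear(vdds, rates):
    """Least-squares fit of ln(rate) = a - b * Vdd; returns (a, b)."""
    ys = [math.log(r) for r in rates]
    n = len(vdds)
    mean_x = sum(vdds) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in vdds)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(vdds, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, -slope


# Synthetic points generated from the assumed model itself (not measured data).
true_a, true_b = 20.0, 30.0
vdds = [0.80, 0.85, 0.90, 0.95, 1.00]
rates = [error_rate(v, true_a, true_b) for v in vdds]
a, b = fit_log_linear(vdds, rates)
print(a, b)
```

The fit recovers the generating parameters up to floating-point error, which only checks that the fitting routine matches the assumed form. It says nothing about how well this form describes the circuits in the slides.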

ET2 energy-efficiency metric
Energy x (execution) Time^2
  In circuit context: time = delay -> ED2
Isolates architecture efficiency
  Independent of DVFS
  Shows improvements in addition to DVFS
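The DVFS-independence claim can be checked numerically under a first-order scaling model. The sketch assumes that energy scales with Vdd squared and that frequency scales linearly with Vdd, so execution time scales with 1/Vdd. That textbook approximation is an assumption here, not the slides' model, and it ignores static power and threshold effects. The helper names are hypothetical.

```python
def et2(energy, time):
    """Energy x (execution time)^2."""
    return energy * time ** 2


def scale_voltage(energy, time, s):
    """First-order DVFS model (an assumption): E ~ V^2 and f ~ V, so T ~ 1/V."""
    return energy * s ** 2, time / s


base = (1.0, 1.0)                     # normalized energy and time
scaled = scale_voltage(*base, s=0.8)  # run at 80% of nominal Vdd
print(et2(*base), et2(*scaled))
```

Under this model ET2 is unchanged by voltage scaling alone, so a change in ET2 reflects the architecture, not the chosen operating point.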

Simple ET2 model
Throughput (1/T): [equation not preserved in the transcript]
Relative energy: dynamic and static terms [equations not preserved in the transcript]

GP-GPU simulation adds some realism
Baseline uses ideal margins without speculation
  Only max delay vs. typical delay left on table
  Timing speculation overhead is 0 to 15% ET2
GPGPUSim (version 2.1)
  Cycle-based extendable GP-GPU simulator from UBC
  Developer-recommended parameters
Extended to DPSP
  Recovery through stall
  Micro-barrier options
    Explicit CTA/workgroup synchronization only (no mbarriers)
    Implicit sync before every memory operation
Power model based on Hong & Kim, ISCA’10

Outline
  Setup: efficient architecture + proportional margining
  Proportional margining w/ timing speculation
  Timing speculation with SIMD
  Problem and DPSP solution
  Methodology and modeling
  Evaluation
    Design-space exploration
    Architecture effects

ET2 vs. SIMD (no spec.)
[Figure: relative ET2 of DPSP for the adder [Ernst04] and the multiplier [Ernst04]; lower elevation is better]

DPSP vs. SIMD (w/ spec.)
[Figure: ET2 difference, SIMD minus DPSP, for the adder [Ernst04] and the multiplier [Ernst04]; higher elevation is better]

Bringing in architecture effects
[Figure panels: Fabricated MUL; Adder]

Summary
Design margins lead to inefficiency
  Naive timing speculation with SIMD is inefficient
DPSP enables efficient speculation in SIMD
  Microbarriers maintain semantics when necessary
  With GPU, frequent mbarriers help memory access
Simple models can capture error response
  Error rate exponential with Vdd
  Dependent on circuit and implementation
Design-space exploration shows potential
  When and why timing speculation should (not) be used
  DPSP consistently improves ET2 (10 to 45%)
  DPSP achieves 10 to 20% better ET2 than SIMD w/ spec.

backup

Detailed ET2 vs. Vdd behavior
[Figure panels: NN; AES; BFS; MUM]

Frequent micro-barriers improve ET2
[Figure panels: Adder; Multiplier; Fab.]

Proportional margining
Static margin control
  Binning
  Vdd/frequency/biasing adjustment
Dynamic margin control
  Vdd/frequency/biasing for slowly varying effects
    Temperature and aging
  Clocking tricks
    From GALS to dynamic and elastic clocking
[Figure: time axis showing typical logic delay, maximum logic delay, and guard-bands for process variation, noise, wearout, clock skew and jitter, and other effects]

Detailed results summary
BFS
  High divergence rate
  Requires implicit synchronizations
  Limits DPSP opportunities
CP, DG, RAY
  Sensitive to memory coalescing
  Synchronization between memory operations solves it
MUM
  Low SIMD occupancy limits the benefit of decoupling
WP
  Not enough registers, lots of memory spills
  Extremely sensitive to memory latency and the exact scheduling, both of which DPSP disturbs
