Decoupled Access-Execute: Pioneering Compilation for Energy Efficiency

Presentation transcript:

Decoupled Access-Execute: Pioneering Compilation for Energy Efficiency. Previous decades: time is money → performance. Current decade: power is money → energy. This work improves energy efficiency without damaging performance. Alexandra Jimborean, Kim-Anh Tran, Konstantinos Koukos, Per Ekemark, Vasileios Spiliopoulos, Magnus Själander, Trevor Carlson, Stefanos Kaxiras

Efficiency vs Performance: compiler techniques to deliver high performance at low energy costs! [Chart: hardware spectrum from OoO (performance) through limited OoO to InO (energy efficiency).]

Roadmap (compiler sophistication vs. CPU abilities):
- Software Decoupled Access-Execute (DAE) for energy efficiency. Goal: reduce energy. Target: large, power-hungry OoO cores.
- Clairvoyance. Goal: increase performance via memory- and instruction-level parallelism. Target: limited OoO cores.
- SWOOP. Goal: make in-order (IO) cores run like OoO cores. Target: in-order cores.

Roadmap (repeated, annotated with publication venues): CGO'14 (best presentation award), CC'16 (best paper award), and HiPEAC'16.

Traditional HW technique: DVFS (Dynamic Voltage and Frequency Scaling). DAE on OoO cores:
- Generate large memory-bound and compute-bound phases and scale frequency (DVFS)
- Optimize the software to match the hardware's DVFS capabilities
- Automatically tune the code at compile time!

Coupled Execution: [Chart: a program alternates memory-bound and compute-bound phases. Running coupled at the optimal frequency fopt loses performance in compute-bound phases; running coupled at the maximum frequency fmax wastes energy in memory-bound phases.]

Decoupled Execution (DAE):
- Decouple the compute (Execute) and memory (Access) parts
- Access: prefetches the data to the L1 cache; runs at fmin
- Execute: computes using the data from the L1 cache; runs at fmax
[Chart: ideal DVFS (fmin to fmax) under coupled execution vs. the decoupled schedule, Access at fmin and Execute at fmax.]

How do we implement DAE?
Access phase: prefetches the data.
- Remove arithmetic computation
- Keep (replicate) address calculation
- Turn loads into prefetches
- Remove stores to globals (no side effects)
Execute phase: the original (unmodified) task, scheduled to run on the same core right after Access.
DVFS Access and Execute independently: run Access at fLow to save energy, and Execute at fHigh to compute fast.
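The recipe above can be sketched on a toy reduction loop. This is only an illustrative sketch, not the compiler's actual output: the function names and chunking are made up here, and a real DAE compiler also sizes chunks to fit the L1 cache and applies per-phase DVFS, both omitted.

```c
#include <stddef.h>

/* Access phase: keeps the address calculation of the original loop,
 * but turns loads into prefetches and drops all arithmetic and stores.
 * Intended to run at low frequency (it is memory bound). */
static void access_phase(const int *idx, const double *data, size_t n) {
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(&data[idx[i]], 0 /* read */, 3 /* keep in L1 */);
}

/* Execute phase: the original, unmodified computation. The data is now
 * (hopefully) in L1, so it can run at high frequency without stalling. */
static double execute_phase(const int *idx, const double *data, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
    return sum;
}

/* Decoupled loop: slice the iteration space into chunks and run Access
 * right before Execute on the same core, as described on the slide. */
double dae_sum(const int *idx, const double *data, size_t n, size_t chunk) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i += chunk) {
        size_t len = (n - i < chunk) ? n - i : chunk;
        access_phase(idx + i, data, len);
        sum += execute_phase(idx + i, data, len);
    }
    return sum;
}
```

Because Access has no side effects, the transformation preserves the original result regardless of whether the prefetches hit.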

Understanding DAE: evaluation, modeling zero-latency, per-core DVFS. [Charts: energy (J) and time (s) for coupled vs. decoupled execution while sweeping frequency from fmin to fmax; in the decoupled case Access runs at fmin while Execute sweeps fmin to fmax.]

Understanding DAE: slightly faster and 25% lower energy. Performance is "unaffected" while energy is reduced by 25%. [Charts: energy (J) and time (s), coupled vs. decoupled.]

SPEC CPU 2006 / Parboil, measured on real hardware (Intel Sandy Bridge & Haswell): 5% performance improvement and 22% EDP improvement. [Charts: time and energy, each split into Access and Execute phases; lower is better.]

Take-away message:
- Goal: save energy, keep performance
- Tool: DVFS
- Problem: memory- and compute-bound phases are too fine-grained → impossible to adjust f at this rate
- Solution: Decoupled Access-Execute
- Contribution: automatic DAE (in the compiler)
- Results: 22%-25% EDP improvement compared to the original
- Open source: https://github.com/etascale/daedal

Roadmap (repeated): Clairvoyance appeared at CGO'17 (artifact evaluated).

Clairvoyance: the instruction window limits MLP. [Diagram: of all program instructions, the processor can only "see" those inside its instruction window.]

Target: processors with a limited instruction window. Clairvoyance exposes memory-level parallelism (MLP) by reordering instructions so that long-latency loads overlap. [Diagram: instruction reordering groups long-latency loads.]

Clairvoyance: look-ahead instruction scheduling, combining global instruction scheduling, advanced software pipelining, and further optimizations.

Clairvoyance transformations. This is NOT about DVFS anymore. Goal: hide cache misses, increase memory-level and instruction-level parallelism, and improve the performance of more energy-efficient but limited OoO processors.

Break Dependent Chains (original):
  ACCx:
    t1 = load A
    t2 = load t1
    t3 = load B
    t4 = load t3
  EXE:
    t5 = t2 + t4

Break Dependent Chains (transformed):
  ACCx_1:
    t1 = load A
    t3 = load B
  ACCx_2:
    t2 = load t1
    t4 = load t3
  EXE:
    t5 = t2 + t4
Multiple access phases maximize the distance between loads and their uses.
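The same chain-splitting can be written in C for the slide's two pointer chains. This is a hand-written sketch of the schedule the pass produces in the compiler IR (the function names are illustrative, and an optimizing C compiler may of course reschedule the source order itself):

```c
/* Two dependent chains, as on the slide: t2 = *(*A), t4 = *(*B). */

/* Naive order: each chain completes before the next one starts, so if
 * all four loads miss, the misses largely serialize. */
long chained(long **A, long **B) {
    long t2 = **A;   /* load A, then the dependent load *A */
    long t4 = **B;   /* load B, then the dependent load *B */
    return t2 + t4;
}

/* Clairvoyance order: group the independent loads first (ACCx_1) so
 * both potential misses are in flight at once, then the dependent
 * loads (ACCx_2), and finally the use (EXE). */
long clairvoyant(long **A, long **B) {
    long *t1 = *A;   /* ACCx_1 */
    long *t3 = *B;   /* ACCx_1 */
    long t2 = *t1;   /* ACCx_2 */
    long t4 = *t3;   /* ACCx_2 */
    return t2 + t4;  /* EXE    */
}
```

Both versions compute the same value; only the distance between each load and its first use changes, which is exactly the MLP the transformation is after.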

Memory-intensive benchmarks: Clairvoyance delivers ~15% performance improvement, on real ARM hardware (ARMv8-64, APM X-Gene). [Chart: execution time; lower is better.]

All remaining benchmarks: performance is not affected, on the same ARM hardware (ARMv8-64, APM X-Gene). [Chart: execution time; lower is better.]

Take-away message (Clairvoyance):
- Limited OoO cores have lower tolerance to overhead, so the optimizations must be aggressive
- Fine-grained memory-bound phases increase memory-level parallelism
- Fine-grained compute-bound phases increase instruction-level parallelism
- Hiding long-latency misses in software yields 15% performance improvement
- Open source: https://github.com/ktran/clairvoyance

Roadmap (repeated): SWOOP appeared in CAL 2017.

SWOOP: add just a bit of hardware to eliminate overheads.
- Upon a cache miss, run more Access phases "out-of-order" (controlled by hardware)
- Virtual register contexts provide more registers and reduce register pressure

SWOOP vs. IO vs. OoO, with cutting-edge optimizations. Caution: these are speedups, so higher is better. [Chart: speedups for A7-like, A15-like, and A7-SWOOP cores.] SWOOP is a good trade-off between InO (efficient) and OoO (performance).

Conclusions (SWOOP):
- In-order cores: keep the energy efficiency, increase the performance
- In-order cores have no tolerance to instruction-count overhead
- Achieved with compiler optimizations and minimal hardware support

Efficiency vs Performance: compiler techniques to deliver high performance at low energy costs! [Chart: DAE on OoO, Clairvoyance on limited OoO, and SWOOP on InO, spanning the range from performance to energy efficiency.]

Decoupled Access-Execute: Pioneering Compilation for Energy Efficiency. Previous decades: time is money → performance. Current decade: power is money → energy. This work improves energy efficiency without damaging performance. Alexandra Jimborean, Alexandra.Jimborean@it.uu.se