Analyzing Behavior Specialized Acceleration

Presentation transcript:

Analyzing Behavior Specialized Acceleration
Tony Nowatzki, Karthikeyan Sankaralingam

My presentation today is about a new paradigm for building efficient general-purpose cores, and a novel modeling technique for exploring this paradigm.

Architectural Specialization

(Slide figure: high-importance kernels/domains such as deep neural nets, video encoding, speech recognition, and image filtering, each mapped to its own accelerator (Acc. 1-4); "general purpose" codes such as phone apps, web apps, servers, and compilers, shown as dynamic execution.)

This all comes down to performing architectural specialization. The way it normally works is that you find high-importance kernels or domains and build specific hardware for executing each workload. This achieves very high performance and low power by focusing the hardware resources on exactly the work that needs to be performed. General-purpose codes, however, don't fit well into this paradigm: there are too many applications, there is too much code in each one, and the code changes too quickly. So how should we specialize? If we look at the phases of execution inside these codes, they exhibit broad behaviors that are common across applications and domains, for example data parallelism, control criticality, or memory regularity.

Observation: Broad program behaviors span application domains.

Behavior Specialization

(Slide figure: applications spanning domains, including phone apps, compilers, web servers, game AI, analytics, web apps, productivity software, and graph processing, dispatched across a general core and BSAs 1-4.)

For behavior specialization to work, two things have to be true: the behaviors must be exploitable by specialized architectures, and the exploitable behaviors must cover the majority of applications. (BSA: Behavior Specialized Accelerator; ExoCore: a modular, behavior-specialized core.)

Behavior Specialization in Practice

(Slide figure: an ExoCore-enabled multicore. A high-level program is compiled to a binary, and its dynamic execution is dispatched across each ExoCore's general core and BSAs 1-4, alongside domain-specific accelerators, all behind a shared cache / NoC.)

Behavior Specialization Challenges

Methodological: How do we study these designs effectively? If we want to explore the inclusion of 10-20 different possible accelerators, we need to build simulators and compilers for each of them, which is incredibly time consuming. Our answer is a new modeling methodology that reduces the months or years of compiler-plus-simulator development to days or weeks.

Design/Analysis: How do we compose and organize accelerators? Our answer is an accelerator composition that significantly pushes speedup and energy efficiency: 1.4-2.0× speedup and 1.5-1.7× energy efficiency, without programmer involvement.

What is a programmable accelerator?

Definition: Hardware which performs work, instead of the general-purpose processor, to specialize elements of its execution.

To model the general-purpose processor, model all elements in the processor (events and dependencies). To model the accelerator, model the specialization of execution (removing or transforming events).

Background: μArch Dependence Graphs [Fields, ISCA 2001]

Represent an execution trace as a series of event nodes; edges between events describe microarchitectural dependencies.

(Slide figure: four instructions, each with a Fetch (F), Dispatch (D), Execute (E), Complete (P), and Commit (C) event. Edges model in-order fetch/dispatch/commit, fetch-to-dispatch latency, dispatch-before-execute, execution latency, and complete-before-commit.)

Background: μArch Dependence Graphs

Edges also capture data dependences, branch mispredictions, and resource dependences. The length of the critical path through the graph gives the execution time.

(Slide figure: the four-instruction sequence bgz r1; r2 = ld[r3]; r4 = r2 * 2; r5 = r6 * r7, where the 6-cycle load latency on the critical path dominates the 1-cycle multiply.)

Background: μArch Dependence Graphs

Summing the costs of the microarchitectural events gives the energy: ∑ μArch events → energy.

(Slide figure: each event is tagged with a cost, e.g. I$ read for fetch, RF read and IW read for dispatch, ALU for execute, ROB write for complete, and ROB read for commit.)
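
To make this background concrete, here is a minimal runnable sketch of the model (hypothetical data structures and illustrative latencies and energy numbers, not the framework's actual implementation): events are graph nodes, the longest weighted path gives the modeled execution time, and summing per-event costs gives the modeled energy.

```python
from collections import defaultdict

# Illustrative per-event energy costs (made-up numbers).
EVENT_ENERGY = {"F": 2.0, "D": 1.0, "E": 3.0, "P": 0.5, "C": 0.5}

class DepGraph:
    def __init__(self):
        self.nodes = []                     # kept in topological order
        self.edges = defaultdict(list)      # node -> [(successor, latency)]

    def add_node(self, name):
        self.nodes.append(name)

    def add_edge(self, src, dst, latency):
        self.edges[src].append((dst, latency))

    def critical_path(self):
        """Longest weighted path through the DAG = modeled cycle count."""
        dist = {n: 0 for n in self.nodes}
        for n in self.nodes:                # safe: nodes are topological
            for dst, lat in self.edges[n]:
                dist[dst] = max(dist[dst], dist[n] + lat)
        return max(dist.values())

    def energy(self):
        """Sum of per-event costs = modeled energy."""
        return sum(EVENT_ENERGY[n[0]] for n in self.nodes)

# Two instructions: r2 = ld[r3] (6-cycle load) feeding r4 = r2 * 2.
g = DepGraph()
for i in (1, 2):
    prev = None
    for stage in "FDEPC":
        g.add_node(f"{stage}{i}")
        if prev:
            g.add_edge(prev, f"{stage}{i}", 1)  # pipeline-stage flow
        prev = f"{stage}{i}"
g.add_edge("F1", "F2", 1)   # in-order fetch dependence
g.add_edge("E1", "E2", 6)   # data dependence via the load's latency
print(g.critical_path(), g.energy())            # -> 10 14.0
```

The load's 6-cycle latency dominates the critical path, exactly as the slide's figure shows.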

What is a programmable accelerator? (revisited)

Definition: Hardware which performs work, instead of the general-purpose processor, to specialize elements of its execution. Again, we model all elements in the processor (events and dependencies), and we model the accelerator's specialization of execution (removing or transforming events).

Big Idea: Transformable Dependence Graph (TDG)

(Slide figure: the core's dependence graph plus the program IR is graph-transformed into an accelerator+core dependence graph.)

A single graph transform models three things: the accelerator's compiler, the accelerator's microarchitecture, and the interaction between the accelerator and the general-purpose core.

Example Accelerator: BERET [MICRO 2011] (Bundled Execution of Recurring Traces)

BERET is a simple accelerator whose goal is to execute program phases containing hot loop traces on very power-efficient hardware: a set of compound functional units (CFUs) connected to a register file. To map a program onto BERET, take the hot loop trace, convert its branches into asserts, and assign its instructions to the CFUs.

(Slide figure: the original program's CFG, where BB1 (ld (r0), r3; add r3, r0, r3; bge r3, r2) falls through to BB2 (sub r2, r3, r2; bge r2, r4) 98% of the time, branches to BB3 (add r2, r3, r2) 2% of the time, and exits the loop with ~0% probability. The BERET program maps this trace onto Comp. FU 0 (ld (r0), r3; add r3, r0, r3) and Comp. FU 1 (assert r3 > r2; sub r2, r3, r2; assert r2 < r4), using configurable compound FUs such as ld+<< and assert+×.)

Original TDG: Control Mispeculation

(Slide figure: the TDG of three iterations of the loop (ld, add, branch; sub, branch) running on the general core, each instruction with its F/D/E/P/C events. In iteration 3 the trace's branch resolves the other way, causing a control mispeculation.)

BERET TDG Transformation

(Slide figure: on BERET + the general core, each iteration's in-trace instructions collapse into CFU1 and CFU2 events connected by serialization edges, with a serialization edge at accelerator entry. The mispeculated iteration 3 is replayed on the general core.)
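
As a rough illustration of this transform, the following sketch (with invented helper names and made-up CFU latencies, not the real pass) rebuilds the accelerated graph: in-trace instructions become CFU events chained by serialization edges, and the mispeculated final iteration is replayed on the general core.

```python
def longest_path(nodes, edges):
    """nodes: topological order; edges: {src: [(dst, latency)]}."""
    dist = {n: 0 for n in nodes}
    for n in nodes:
        for dst, lat in edges.get(n, []):
            dist[dst] = max(dist[dst], dist[n] + lat)
    return max(dist.values())

def beret_tdg(n_iters, cfu_lats, replay_insts, stage_lat=1):
    """Build the accelerated graph and return its critical path."""
    nodes, edges = ["ENTRY"], {}
    prev = "ENTRY"                        # accelerator-entry serialization
    def chain(node, lat):
        nonlocal prev
        nodes.append(node)
        edges.setdefault(prev, []).append((node, lat))
        prev = node
    for it in range(n_iters):             # CFUs linked by serialization edges
        for c, lat in enumerate(cfu_lats):
            chain(f"CFU{c}_iter{it}", lat)
    for i in range(replay_insts):         # replay mispeculated iter on core
        for stage in "FDEPC":
            chain(f"{stage}{i}_replay", stage_lat)
    return longest_path(nodes, edges)

# Three iterations of the slide's hot trace: two CFUs, five instructions.
print(beret_tdg(n_iters=3, cfu_lats=[2, 3], replay_insts=5))   # -> 40
```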

TDG Framework

(Slide figure: the gem5 simulator produces the program IR and the core's dependence graph; a TDG constructor builds the TDG; a TDG analyzer and transformer (performing accelerator analysis and graph transformation, with McPAT/CACTI for power) produces the accelerator+core dependence graph and reports accelerator+core cycle time and power.)

So that was an example, and our TDG modeling framework looks like this: we use gem5 to produce the original TDG, and we implement our own TDG analysis and transformation library. This is easier than a compiler+simulator approach, because adding an accelerator is merely a matter of adding an accelerator analysis pass and a graph transformation pass. The framework is coming to a Docker container soon!
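
The flow might be sketched like this (every name below is invented for illustration; the real framework is built on gem5 with McPAT/CACTI cost models):

```python
def build_tdg(gem5_trace):
    # Stub: in the real flow, gem5 emits the observed events + dependencies.
    return {"events": list(gem5_trace)}

def drop_fetch_pass(tdg):
    # Toy transform standing in for an accelerator model: remove all fetch
    # events, as an accelerator with pre-configured datapaths might.
    tdg["events"] = [e for e in tdg["events"] if not e.startswith("F")]
    return tdg

def evaluate(gem5_trace, transform_passes):
    """Adding an accelerator = adding one analysis/transform pass here,
    rather than writing a new compiler and simulator."""
    tdg = build_tdg(gem5_trace)
    for xform in transform_passes:
        tdg = xform(tdg)
    return tdg   # then report cycle time (critical path) + power (events)

print(evaluate(["F1", "D1", "E1", "F2", "D2", "E2"], [drop_fetch_pass]))
# -> {'events': ['D1', 'E1', 'D2', 'E2']}
```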

Does TDG-modeling work? (compared to the compiler+simulator approach)

Accelerator     | Avg. Error in Speedup | Avg. Error in Energy Reduction
Conserv. Cores  | 5%                    | 10%
BERET           | 8%                    | 7%
SIMD            | 12%                   |
DySER           | 15%                   |

(Compared to published numbers.)

Behavior Specialization Challenges (revisited)

Methodological: How to study these designs effectively? Answered by the Transformable Dependence Graph (TDG).

Design/Analysis: How to compose and organize accelerators? Having covered the TDG modeling methodology for behavior-specialized accelerators, I now want to talk about how we can create an effective core organization.

Choosing Synergistic Accelerators

The intuitive way to choose accelerators is to classify program phases into different behaviors and create an accelerator for each phase type. This slide describes one possible program-phase classification, the one we use in this work.

(Slide figure: general-purpose code splits into data-parallel and non-data-parallel phases. Data-parallel phases with low control map to SIMD, and those with some control (a separable datapath) map to the DP-CGRA. Non-data-parallel, high-control phases with non-critical control and high ILP map to NS-DF, those with repetitive control map to the trace processor, and phases with unpredictable control and low ILP remain on the general core.)

Choosing Synergistic Accelerators (continued)

Accelerator | Target Properties                    | Related Publication                        | Essential Mechanism
SIMD        | Data-parallel, low control           |                                            | Short vector instructions
DP-CGRA     | Separable datapath                   | DySER, HPCA 2011                           | Compute-slice offloading to a CGRA
NS-DF       | Irregular code, non-critical control | SEED, ISCA 2015                            | Configurable dataflow, no control speculation
Trace-P     | Repetitive control                   | BERET, MICRO 2011 (modified for dataflow)  | Speculates the trace path, re-executes on exit

Example workloads across these accelerators include error checking, sparse matrix multiply, histogram, graph processing, stencil, dynamic programming, matrix multiply, and image processing.

This is a high-level view of the architecture: a general-purpose processor and four accelerators integrated behind a private cache, which lets us share data and switch between accelerators quickly. Without getting into the architecture, briefly: most of us know what SIMD is; it targets data-parallel workloads with low control. The data-parallel CGRA can efficiently execute data-parallel regions with control flow, gaining performance by offloading large compute datapaths onto a coarse-grained reconfigurable architecture. The non-speculative dataflow processor is good for irregular code phases where control decisions are not on the critical path. The trace processor is a dataflow extension of the BERET design shown earlier, and is good at executing code with very repetitive control decisions.
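
The mapping from phase behavior to accelerator can be read as a simple decision procedure. The sketch below is a hypothetical distillation of that intuition, with made-up threshold values, not the paper's actual phase analysis:

```python
def choose_accelerator(data_parallel, control_frac,
                       control_repetitive, control_on_critical_path):
    """control_frac: fraction of dynamic instructions that are branches
    (illustrative threshold below)."""
    if data_parallel:
        # Low control -> plain SIMD; some control -> the data-parallel CGRA.
        return "SIMD" if control_frac < 0.05 else "DP-CGRA"
    if not control_on_critical_path:
        return "NS-DF"       # non-speculative dataflow for irregular code
    if control_repetitive:
        return "Trace-P"     # BERET-style speculative trace processor
    return "General Core"    # unpredictable, critical control: stay on OOO

print(choose_accelerator(False, 0.20, True, True))   # -> Trace-P
```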

Evaluation Methodology

Workloads:
Regular: Intel microbenchmarks, Parboil
Semi-regular: TPC-H queries, MediaBench, SPECfp
Irregular: SPECint

General core configurations:

                       | Little (IO2) | Medium (OOO2) | Big (OOO4) | Very Big (OOO6)
Issue Width            | 2            | 2             | 4          | 6
ROB Size               | -            | 64            | 168        | 192
Instr. Window          | 32           | 32            | 48         | 52
Dcache Ports           | 1            | 1             | 3          | 3
FUs (ALU, Mul/Div, FP) | 2,1,1        | 2,1,1         | 3,2,2      | 4,2,3

We study this architecture with a wide variety of workloads and general-purpose cores.

Evaluation Questions

1. Does ExoCore offer high potential speedup or energy-efficiency improvements?
2. Do program behaviors span multiple domains, and how much coverage do they provide?
3. What new design-space opportunities does an ExoCore approach enable?

Overall Results

(Slide figure: performance vs. energy frontier, annotated "ExoCore >2× Energy Benefit".)

1. ExoCore pushes the frontier by 1.4-2.0× speedup and 1.5-1.7× energy efficiency.

Per-Benchmark Execution Time

(Slide figure: execution time of OOO2 + 4 BSAs relative to OOO2, across Parboil, MediaBench, TPC-H, SPECfp, SPECint, and the Intel microbenchmarks.)

2. Exploited program behaviors span across domains (~80% coverage of program execution time).

Design Space

Cores (4): Tiny Inorder (IO2), Small OOO (OOO2), Medium OOO (OOO4), Large OOO (OOO6)
× Accelerators (16 combinations): SIMD (S), DP-CGRA (D), Trace-P (T), NS-DF (N)
× Metrics: Speedup, Energy, Area

Finally, the ExoCore organization opens up a huge design space of combinations of accelerators and cores. We have explored this design space across different metrics.
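
Concretely, the cross product is small enough to enumerate exhaustively; here is a sketch (core and accelerator names from the slide):

```python
# 4 cores crossed with all 2^4 = 16 subsets of the four accelerators gives
# 64 design points, each evaluated for speedup, energy, and area.
from itertools import combinations

CORES = ["IO2", "OOO2", "OOO4", "OOO6"]
BSAS = ["S", "D", "T", "N"]          # SIMD, DP-CGRA, Trace-P, NS-DF

design_points = [
    (core, subset)
    for core in CORES
    for r in range(len(BSAS) + 1)
    for subset in combinations(BSAS, r)
]
print(len(design_points))            # -> 64
```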

Design Space Results

Without boring you with data, I'll just summarize the results here. The best design depends on the metric:

Performance: Large OOO + SIMD, DP-CGRA, Trace-P, NS-DF
Energy-Efficiency: Tiny IO + SIMD, DP-CGRA, NS-DF
Energy-Delay: Small/Medium OOO + SIMD, DP-CGRA, Trace-P, NS-DF
Energy-Delay-Area: Small OOO + DP-CGRA, NS-DF

3. Many new design tradeoffs are enabled.

Conclusions

ExoCore architecture organization:
Common behaviors span domains and program execution.
Significant performance/energy advantages, plus a new design space.
The ExoCore organization is a promising way forward for improving the performance and energy efficiency of general-purpose processors without relying on technology scaling.

Transformable Dependence Graph (TDG):
A common framework for accelerator modeling and comparisons.
Higher abstraction while remaining detailed and accurate.
The TDG is a modeling framework which greatly simplifies early design-space exploration of such designs.

(Slide figures: the ExoCore modular behavior-specialized core and the TDG graph transform, repeated from earlier slides.)