Analyzing Behavior Specialized Acceleration
Tony Nowatzki, Karthikeyan Sankaralingam
My presentation today is about a new paradigm for building efficient general-purpose cores, and a novel modeling technique to explore this paradigm.
Architectural Specialization
Domain Specific Acceleration: High-importance Kernels/Domains (Deep Neural, Video Enc., Speech Rec., Image Filter), each mapped to its own accelerator (Acc. 1, Acc. 2, Acc. 3, Acc. 4).
“General Purpose” Codes (Phone App., Web App, Server, Compiler; Dynamic Execution): How to specialize? Too many applications. Too much code in each one. Code changes too quickly.
Observation: Broad program behaviors span application domains.
This all comes down to performing architectural specialization. The way this normally works is that you find high-importance kernels or domains and build specific hardware for the execution of each workload. This achieves very high performance and low power by focusing the hardware resources on exactly the work that needs to be performed. General-purpose codes don't fit well into this paradigm. If we look at the phases of execution inside these codes, though, they exhibit broad behaviors that are common across applications and domains, for example data parallelism, control criticality, or memory regularity. This suggests that maybe instead of doing ...
Behavior Specialization
Behaviors exploitable by specialized architectures
Exploitable behaviors cover majority of applications
Figure: an ExoCore (General Core with BSA 1, BSA 2, BSA 3, BSA 4) under dynamic execution, serving Phone App., Compiler, Web Server, Game AI, Analytics, Web App, Productivity, Graph Proc.
(BSA: Behavior Specialized Accelerator) ExoCore: Modular Behavior Specialized Core
For behavior specialization to work, two things have to be true. Behaviors should be exploitable by specialized architectures, and the exploitable behaviors should cover the majority of applications.
Behavior Specialization in Practice
Figure labels: High-level Program, Compiler, Binary, Dynamic Program Execution, ExoCore (General Core, BSA 1, BSA 2, BSA 3, BSA 4), Shared Cache / NoC, Domain-Specific Accelerators, ExoCore-Enabled Multicore.
(BSA: Behavior Specialized Accelerator) ExoCore: Modular Behavior Specialized Core
Behavior Specialization Challenges
Methodological: How to study these in an effective way? Months/years of modeling effort to days/weeks.
Design/Analysis: How to compose and organize accelerators? 1.4-2.0× Speedup, 1.5-1.7× Energy Efficiency without programmer involvement.
If we want to explore the inclusion of 10-20 different possible accelerators, we need to build simulators and compilers for them, which is incredibly time consuming. We contribute a new modeling methodology that can reduce the months or years of compiler+simulator development to days or weeks, and an accelerator composition that significantly pushes speedup and energy efficiency without user involvement.
What is a programmable accelerator?
Definition: Hardware which performs work, instead of the general purpose processor, to specialize elements of its execution.
General Purpose Processor: Model all elements in the processor (events and dependencies).
Accelerator: Model specialization of execution (remove/transform events).
Background: μArch Dependence Graphs [Fields, ISCA 2001]
Represent execution trace as a series of event nodes.
Edges between events describe μArch dependencies.
Event nodes per instruction (Inst. 1 through Inst. 4): F (Fetch), D (Dispatch), E (Execute), P (Complete), C (Commit).
Edges: In-order Fetch/Dispatch/Commit; Fetch-Dispatch Latency; Dispatch Before Execute; Execution Latency; Complete Before Commit.
Background: μArch Dependence Graphs
Further edge types: Data Dependence, Branch Misprediction, Resource Dependence.
Example trace, with nodes F, D, E, P, C per instruction: bgz r1; r2 = ld[r3]; r4 = r2 * 2; r5 = r6 * r7.
Execution latencies labeled in the figure: 6 and 1.
The critical path through the graph gives the execution time.
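The critical-path idea on this slide can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the model from the talk: a single-issue pipeline with 1-cycle in-order edges between consecutive F, D, and C nodes, 1-cycle F-to-D, D-to-E, and P-to-C edges, an execute latency of 6 for the load and 1 for the other instructions, and one data dependence from the load to the multiply that consumes r2. The function names are hypothetical.

```python
from collections import defaultdict

# Assumed execute latencies for: bgz r1; r2 = ld[r3]; r4 = r2 * 2; r5 = r6 * r7
LATENCIES = [1, 6, 1, 1]
# (producer, consumer) instruction indices: r4 = r2 * 2 waits on the load
DATA_DEPS = [(1, 2)]


def build_edges(latencies, data_deps):
    """Build (src, dst, latency) edges between (stage, instruction) nodes."""
    edges = []
    for i, lat in enumerate(latencies):
        edges.append((("F", i), ("D", i), 1))    # fetch-dispatch latency
        edges.append((("D", i), ("E", i), 1))    # dispatch before execute
        edges.append((("E", i), ("P", i), lat))  # execution latency
        edges.append((("P", i), ("C", i), 1))    # complete before commit
        if i > 0:
            for stage in ("F", "D", "C"):        # in-order fetch/dispatch/commit
                edges.append(((stage, i - 1), (stage, i), 1))
    for producer, consumer in data_deps:         # data dependence
        edges.append((("P", producer), ("E", consumer), 0))
    return edges


def critical_path(edges):
    """Return (length, path) of the longest path in a DAG of weighted edges."""
    succs, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for src, dst, weight in edges:
        succs[src].append((dst, weight))
        indeg[dst] += 1
        nodes.update((src, dst))
    dist = {n: 0 for n in nodes}
    pred = {}
    ready = [n for n in nodes if indeg[n] == 0]
    visited = 0
    while ready:                                 # Kahn-style topological sweep
        node = ready.pop()
        visited += 1
        for dst, weight in succs[node]:
            if dist[node] + weight > dist[dst]:
                dist[dst] = dist[node] + weight
                pred[dst] = node
            indeg[dst] -= 1
            if indeg[dst] == 0:
                ready.append(dst)
    if visited != len(nodes):
        raise ValueError("dependence graph contains a cycle")
    end = max(nodes, key=lambda n: dist[n])
    path = [end]
    while path[-1] in pred:                      # walk predecessors back to the source
        path.append(pred[path[-1]])
    return dist[end], path[::-1]


length, path = critical_path(build_edges(LATENCIES, DATA_DEPS))
print(length)  # 12 cycles under the assumptions above
print(path)
```

Under these assumptions the load dominates: the critical path runs through its execute and complete nodes before reaching the final commit.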
Background: μArch Dependence Graphs
Energy = ∑ μArch events
Events attached to the nodes of the same trace (bgz r1; r2 = ld[r3]; r4 = r2 * 2; r5 = r6 * r7): I$ read (fetch), RF read and IW read (dispatch), ALU (execute), ROB write, ROB read (commit).
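A minimal sketch of the "energy as a sum of events" idea follows. The per-event energies are made-up placeholder numbers, not values from McPAT, CACTI, or the talk, and the assumption that every instruction triggers all six events is a simplification. The `offload` helper is hypothetical and only illustrates how removing events changes the total.

```python
from collections import Counter

# Placeholder per-event energies in picojoules (assumed, for illustration only)
EVENT_ENERGY_PJ = {
    "icache_read": 20,
    "rf_read": 5,
    "iw_read": 4,
    "alu": 10,
    "rob_write": 6,
    "rob_read": 6,
}
PIPELINE_EVENTS = list(EVENT_ENERGY_PJ)


def count_events(per_instruction_events):
    """Tally how often each event occurs across the whole trace."""
    counts = Counter()
    for events in per_instruction_events:
        counts.update(events)
    return counts


def total_energy(counts, table):
    """Energy is the sum over events of (occurrences x per-event energy)."""
    return sum(table[event] * n for event, n in counts.items())


def offload(events, removed):
    """Drop the events that a (hypothetical) accelerator no longer performs."""
    return [e for e in events if e not in removed]


# Four instructions on the general core, each triggering every event
core_trace = [list(PIPELINE_EVENTS) for _ in range(4)]
core_energy = total_energy(count_events(core_trace), EVENT_ENERGY_PJ)

# Same trace, with fetch and instruction-window events removed for the last three
accel_trace = [core_trace[0]] + [
    offload(events, {"icache_read", "iw_read"}) for events in core_trace[1:]
]
accel_energy = total_energy(count_events(accel_trace), EVENT_ENERGY_PJ)

print(core_energy, accel_energy)  # 204 132 with the placeholder numbers
```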
What is a programmable accelerator?
Definition: Hardware which performs work, instead of the general purpose processor, to specialize elements of its execution.
General Purpose Processor: Model all elements in the processor (events and dependencies).
Accelerator: Model specialization of execution (remove/transform events).
Big Idea: Transformable Dependence Graph (TDG)
Transformable Dependence Graph (TDG): Program IR plus the Core's Dep. Graph. A graph transform turns it into the Accelerator+Core Dependence Graph.
Model three things:
Accelerator's compiler
Accelerator's microarchitecture
Interaction between accelerator and general purpose core
Example Accelerator: BERET [MICRO 2011] (Bundled Execution of Recurring Traces)
Original Program:
  BB1: ld (r0), r3; add r3, r0, r3; bge r3 r2
  BB2: sub r2 r3 r2; bge r2 r4
  BB3: add r2 r3 r2
  Branch edges are labeled T/F with frequencies 2%, 98%, and ~0%.
BERET Program (Hot Loop Trace):
  Comp. FU 0: ld (r0), r3; add r3, r0, r3
  Comp. FU 1: assert r3 > r2; sub r2 r3 r2; assert r2 < r4
BERET Hardware: Configuration, Control, Reg. File, Compound FU 0 through Compound FU 3 (operations shown inside the compound FUs: ld, +, <<, assert, +, ×).
BERET is a simple accelerator. Its goal is to execute program phases that have hot-loop traces on very power-efficient hardware. The CFUs are connected to a register file. To map a program onto BERET, take the hot loop trace and assign its instructions to the CFUs.
Original TDG
Figure: dependence graph on the General Core, with F, D, E, P, C nodes for every instruction of three loop iterations (Iter 1, Iter 2, Iter 3). Iter 1 and Iter 2 execute ld, +, br, -, br; Iter 3 executes ld, +, br, +, br, which is marked as a control misspeculation.
Program IR: BB1: ld (r0), r3; add r3, r0, r3; bge r3 r2. BB2: sub r2 r3 r2; bge r2 r4. BB3: add r2 r3 r2. Branch edges are labeled T/F with frequencies 2% and 98%.
BERET TDG Transformation
Figure: the General Core graph for Iter 1, Iter 2, and Iter 3 (F, D, E, P, C nodes per instruction) is transformed into a BERET + General Core graph. After the Accelerator Entry, the ld, +, br, -, br operations of Iter 1, Iter 2, and Iter 3 execute on CFU1 and CFU2 with serialization edges between them, and Iter 3 is replayed on the general core (Iter 3 (Replay)).
TDG Framework
TDG: Transformable Dependence Graph
Figure: gem5 Simulator, then TDG Constructor, then the TDG (Program IR + Dep. Graph), then TDG Analyzer & Transformer (Accelerator Analysis + Graph Transformation), then the Accelerator+Core Dependence Graph, then Accelerator+Core Cycle Time & Power (McPAT/CACTI).
So that was an example, and our TDG modeling framework looks like this. We use gem5 to produce the original TDG, and we implement our own TDG analysis and transformation library. This is easier than a compiler+simulator approach, because adding an accelerator is merely a matter of adding an accelerator analysis pass and a graph transformation pass.
Framework coming to a Docker container soon!
Does TDG-modeling work? (compared to the Compiler+Simulator Approach)
Accelerator       Average Error in Speedup    Average Error in Energy Reduction
Conserv. Cores    5%                          10%
BERET             8%                          7%
SIMD: 12%
DySER: 15%
(compared to published numbers)
Behavior Specialization Challenges
Methodological: How to study these in an effective way? Transformable Dependence Graph (TDG)
Design/Analysis: How to compose and organize accelerators?
Having covered the transformable dependence graph modeling methodology for behavior specialized accelerators, I now want to talk about how we can create an effective core organization.
Choosing Synergistic Accelerators
Figure: classification tree rooted at General Purpose Code. Labels: Data Parallel, Non-Data Parallel; Low Control, Some Control, High Control; High ILP, Low ILP; Separable Datapath, Non-critical Control, Repetitive Control, Unpredictable Control. Accelerators at the leaves: SIMD, DP-CGRA, NS-DF, Trace Proc.
The intuitive way to choose accelerators is to classify program phases into different behaviors, and create an accelerator for each phase type. So in this slide I'll describe one possible program phase classification, which we use in this work. …
SIMD
  Target properties: Data-Parallel, Low-control
  Essential mechanism: Short vector instructions
DP-CGRA
  Target properties: Separable Datapath
  Related publication: DySER, HPCA 2011
  Essential mechanism: Compute slice to CGRA offloading
NS-DF
  Target properties: Irregular Code, Non-critical Control
  Related publication: SEED, ISCA 2015
  Essential mechanism: Config. dataflow, no control speculation
Trace-P
  Target properties: Repetitive Control
  Related publication: BERET, MICRO 2011 (modified for dataflow)
  Essential mechanism: Speculates trace path, re-execution
Example workloads: Error Checking, Sparse MM, Histogram, Graph Processing, Stencil, Dynamic Prog., Mat. Mult., Image Process.
This is a high-level view of the architecture: a general purpose processor and 4 accelerators integrated behind a private cache, which allows us to share data and switch between accelerators quickly. Without getting into the architecture, I want to briefly describe these. Most of us know what SIMD is; it targets data-parallel workloads with low control. The data-parallel CGRA can efficiently execute data-parallel regions with control flow, and gains performance by offloading large compute datapaths onto a coarse-grain reconfigurable architecture. The non-speculative dataflow processor is good for irregular code phases where the control decisions are not on the critical path. The trace processor is a dataflow extension of the BERET processor I showed earlier; it's good at executing codes with very repetitive control decisions.
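The target properties above can be read as a rough decision procedure for where a program phase should run. The sketch below is only an illustration of that reading: the `Phase` fields, the rule order (for example, checking repetitive control before non-critical control), and the fallback to the general core are all assumptions, not the selection policy used in this work.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """Hypothetical summary of one program phase's behavior."""
    data_parallel: bool
    control: str                      # "low", "some", or "high"
    control_critical: bool = False    # are control decisions on the critical path?
    control_repetitive: bool = False  # does control follow one hot trace?


def choose_target(phase):
    """Map a phase to an execution target using assumed, simplified rules."""
    if phase.data_parallel and phase.control == "low":
        return "SIMD"      # data-parallel, low control
    if phase.data_parallel and phase.control == "some":
        return "DP-CGRA"   # data-parallel regions with some control flow
    if phase.control_repetitive:
        return "Trace-P"   # repetitive control: speculate on the trace path
    if not phase.control_critical:
        return "NS-DF"     # irregular code, non-critical control
    return "General Core"  # nothing matched: stay on the core


examples = {
    "dense loop": Phase(data_parallel=True, control="low"),
    "filtered loop": Phase(data_parallel=True, control="some"),
    "hot trace": Phase(False, "high", control_critical=True, control_repetitive=True),
    "irregular": Phase(False, "high", control_critical=False),
    "unpredictable": Phase(False, "high", control_critical=True),
}
for name, phase in examples.items():
    print(name, "->", choose_target(phase))
```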
Evaluation Methodology
Workloads
  Regular: Intel Microbenchmarks, Parboil
  Semi-Regular: TPCH Queries, Mediabench, SpecFP
  Irregular: SpecINT
General Core Configurations: Little (IO2), Medium (OOO2), Big (OOO4), Very Big (OOO6)
  Issue Width: 2, 4, 6
  ROB Size: -, 64, 168, 192
  Instr. Window: 32, 48, 52
  Dcache Ports: 1, 3
  FUs (ALU, Mul/Div, FP): 2,1,1; 2,1,1; 3,2,2; 4,2,3
We study this architecture with a wide variety of workloads and general purpose cores.
Evaluation Questions
1. Does ExoCore have high potential speedup or energy efficiency improvements?
2. Do program behaviors span multiple domains (and how much coverage)?
3. What new design-space opportunities does an ExoCore approach enable?
Overall Results
ExoCore pushes frontier by 1.4-2.0× Speedup, 1.5-1.7× Energy Eff.
ExoCore: >2× Energy Benefit
Per-Benchmark Execution Time
Figure: execution time of OOO2 + 4-BSA relative to OOO2, per benchmark, grouped by suite (Parboil, Mediabench, TPCH, SPECFP, SpecINT, Intel µbench).
2. Exploited program behaviors span across domains (~80% coverage of program execution time)
Design Space
Cores × Accelerators × Metrics
Cores (4 Cores): Tiny Inorder (IO2), Small OOO (OOO2), Medium OOO (OOO4), Large OOO (OOO6)
Accelerators (16 Combinations): SIMD (S), DP-CGRA (D), Trace-P (T), NS-DF (N)
Metrics: Speedup, Energy, Area
Finally, the ExoCore organization opens up a huge design space of accelerator and core combinations. We have explored this design space across different metrics.
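The size of this design space follows from simple counting: each of the four accelerators is either included or not, giving 2^4 = 16 combinations, and each combination can be paired with any of the 4 cores, giving 64 design points. A minimal sketch that enumerates them is below; the `label` naming scheme is a hypothetical convention built from the single-letter abbreviations on the slide.

```python
from itertools import combinations, product

CORES = ["IO2", "OOO2", "OOO4", "OOO6"]
ACCELERATORS = ["S", "D", "T", "N"]  # SIMD, DP-CGRA, Trace-P, NS-DF


def accelerator_subsets(accels):
    """Every subset of the accelerators, from none to all of them."""
    return [
        subset
        for size in range(len(accels) + 1)
        for subset in combinations(accels, size)
    ]


def design_points(cores, accels):
    """Pair every core with every accelerator subset."""
    return list(product(cores, accelerator_subsets(accels)))


def label(core, subset):
    """Hypothetical naming scheme, e.g. 'OOO2-SD' for OOO2 + SIMD + DP-CGRA."""
    return core + ("-" + "".join(subset) if subset else "")


points = design_points(CORES, ACCELERATORS)
print(len(accelerator_subsets(ACCELERATORS)))  # 16
print(len(points))                             # 64
print([label(core, subset) for core, subset in points[:3]])
```

Each design point would then be evaluated on the three metrics (speedup, energy, area); that evaluation is outside the scope of this sketch.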
Design Space Results
Best design for each metric:
  Performance: Large OOO with SIMD, DP-CGRA, Trace-P, NS-DF
  Energy-Efficiency: Tiny IO with SIMD, DP-CGRA, NS-DF
  Energy-Delay: Small/Medium OOO with SIMD, DP-CGRA, Trace-P, NS-DF
  Energy-Delay-Area: Small OOO with DP-CGRA, NS-DF
Without boring you with data, I'll just summarize the results here.
3. Many new design tradeoffs enabled
Conclusions
ExoCore Architecture Organization
  Common behaviors spanning domains and program execution
  Significant performance/energy advantages + new design space
Transformable Dependence Graph (TDG)
  Common framework for accelerator modeling and comparisons
  Higher abstraction while remaining detailed and accurate
The ExoCore architecture organization is a promising way forward to improve the performance and energy efficiency of general purpose processors without relying on technology scaling. The transformable dependence graph is a modeling framework which greatly simplifies early design-space exploration of such designs.