Efficient Complex Operators for Irregular Codes


1 Efficient Complex Operators for Irregular Codes
Jack Sampson, Ganesh Venkatesh, Nathan Goulding-Hotta, Saturnino Garcia, Steven Swanson, Michael Bedford Taylor
Department of Computer Science and Engineering, University of California, San Diego

Today's concerns about power will lead us to put into hardware codes beyond those traditionally targeted by accelerators, and to convert software into hardware for reasons of energy and power on equal terms with conversions motivated by speedup. Today, I'm going to discuss two new techniques, selective de-pipelining and cachelets, which can improve both the performance and energy efficiency of specialized hardware targeting irregular codes. Heterogeneous platforms featuring such specialized hardware are likely to become increasingly common, because we, as computer architects, have run into the [next slide] Utilization Wall.

2 The Utilization Wall With each successive process generation, the percentage of a chip that can actively switch drops exponentially due to power constraints. [Venkatesh, Chakraborty]

3 The Utilization Wall
Scaling theory: transistor and power budgets are no longer balanced, and the gap grows exponentially with each process generation.

                        Classical scaling    Leakage-limited scaling
  Device count          S^2                  S^2
  Device frequency      S                    S
  Device power (cap)    1/S                  1/S
  Device power (Vdd)    1/S^2                ~1
  Utilization           1                    1/S^2

Observed impact (experimental results): flat frequency curve, increasing cache/processor ratio, "Turbo Boost". [Venkatesh, Chakraborty]

6 Dealing with the Utilization Wall
Insights:
Power is now more expensive than area.
Specialized logic has been shown to be an effective way to improve energy efficiency.
Our approach:
Use area for specialized cores to save energy on common apps.
The power savings can then be applied to other programs, increasing throughput.
Specialized coprocessors provide an architectural way to trade area for an effective increase in power budget.
Challenge: building coprocessors for all types of applications.

7 Specializing Irregular Codes
The effectiveness of specialization depends on coverage: we need to handle many types of code, both regular and irregular.
What is irregular code? (an example follows below)
Lacks easily exploited structure / parallelism
Found broadly across desktop workloads
How can we make it efficient?
Reduce per-op overheads with complex operators
Improve latency for serial portions
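For concreteness, here is a hypothetical C snippet (mine, not from the talk) with the pointer chasing and data-dependent control flow typical of irregular code; each load depends on the previous one, so there is little parallelism to exploit:

    /* Hypothetical irregular code: a linked-list walk with a
     * data-dependent branch. Pointer chasing serializes the loads. */
    typedef struct node {
        int key;
        int value;
        struct node *next;
    } node_t;

    int lookup_sum(node_t *head, int lo, int hi) {
        int sum = 0;
        for (node_t *n = head; n != NULL; n = n->next) {
            if (n->key >= lo && n->key <= hi)   /* data-dependent branch */
                sum += n->value;
        }
        return sum;
    }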

8 Candidates for Irregular Codes
Microprocessors
Handle all codes
Poor scaling of performance vs. energy
Utilization wall aggravates scaling problems
Accelerators
Require parallelizable, highly structured code, so they are a poor fit for irregular code
Memory system challenging to integrate with conventional memory
Target performance over energy
Conservation Cores (C-Cores) [Venkatesh, et al. ASPLOS 2010]
Handle arbitrary code
Share L1 cache with host processor
Target energy over performance

9 Conservation Cores (C-Cores)
Automatically generated from hot regions of program source: hot code is implemented by the C-Core, cold code runs on the host CPU.
Profiler selects regions
C-to-Verilog compiler converts source regions to C-Cores
Drop-in replacements for code
No algorithmic changes required
Software compatible in the absence of an available C-Core (see the sketch below)
Toolchain handles HW generation / SW integration
[Figure: C-Core coupled to a general-purpose host CPU, sharing the I-cache and D-cache]
[Venkatesh, et al. ASPLOS 2010]
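A minimal sketch of what the drop-in replacement could look like at the software level. All names here (ccore_present, ccore_invoke, the ID macro) are hypothetical stand-ins, not the toolchain's actual interface:

    /* Hypothetical dispatch stub for a patched hot function. If a
     * matching C-Core exists on this chip, invoke it; otherwise fall
     * back to the unmodified software version. */
    #define HOT_FUNCTION_CCORE_ID 7                     /* assumed ID */

    extern int ccore_present(int ccore_id);             /* assumed runtime query */
    extern int ccore_invoke(int ccore_id, void *args);  /* assumed HW call */

    int hot_function(int *A, int n);    /* original software implementation */

    int hot_function_dispatch(int *A, int n) {
        struct { int *A; int n; } args = { A, n };
        if (ccore_present(HOT_FUNCTION_CCORE_ID))
            return ccore_invoke(HOT_FUNCTION_CCORE_ID, &args);
        return hot_function(A, n);      /* software fallback */
    }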

10 This Paper: Two Techniques for Efficient Irregular Code Coprocessors
Selective De-Pipelining (SDP)
Form long combinational paths for non-pipeline-parallel codes
Run logic at a slow frequency while improving throughput!
Challenge: handling memory operations
Cachelets
L1 access is a large fraction of the critical path for irregular codes
Can we make a cache hit take only 0.5 cycles?
Specialize individual loads and stores
Applying both to the C-Core platform gives:
Up to 2.5x speedup vs. an efficient in-order processor
Up to 22x EDP improvement
General applicability

11 Outline
Efficiency through specialization
Baseline C-Core microarchitecture
Selective De-Pipelining
Cachelets
Conclusion

12 Constructing a C-Core
C-Cores start with source code
Parallelism agnostic
Function call interface
Code supported
Arbitrary memory access patterns
Data structures
Complex control flow
No parallelizing compiler required
Example code:

    for (i = 0; i < N; i++) {
        x = A[i];
        y = B[i];
        C[x] = D[y] + x + y + x*y;
    }

13 Constructing a C-Core (cont.)
The compiler first recovers the control-flow graph of the example.
[Figure: CFG with basic blocks BB0, BB1, BB2]

14 Constructing a C-Core (cont.)
Each basic block is then expanded into its data-flow graph.
[Figure: DFG for the loop body BB1: loads, adds, a multiply, the <N? comparison, the +1 increment, and a store]

15 Constructing a C-Core (cont.)
Schedule memory operations on the L1
Add pipeline registers to match the host processor frequency
[Figure: the DFG datapath (+, *, LD, +1, <N?, ST) with pipeline registers inserted]

16 Observation
The pipeline registers exist just for timing: there is no actual overlap in execution between pipeline stages.
[Figure: the same pipelined datapath (+, *, LD, +1, <N?, ST)]

17 Outline
Efficiency through specialization
Baseline C-Core microarchitecture
Selective De-Pipelining
Cachelets
Conclusion

18 Meeting the Needs of Datapath and Memory
Datapath: easy to replicate operators in space; energy-efficient when operators feed directly into other operators.
Memory: the interface is inherently centralized; performance-efficient when it can be rapidly multiplexed.
Can we serve both at once?

19 Constructing Efficient Complex Operators
Direct mapping from the CFG and DFG produces large, complex operators (one per CFG node).
[Figure: CFG (BB0, BB1, BB2) mapped onto a complex operator containing the whole DFG (+, *, LD, +1, <N?, ST)]

20 Selective De-Pipelining (SDP)
SDP addresses the needs of both datapath and memory:
Fast, pipelined memory
Slow, aperiodic datapath clock
[Figure: the datapath (+, *, LD, +1, <N?, ST) with a memory mux on the fast clock]

21 Selective De-Pipelining (SDP)
Intra-basic-block registers for memory run on the fast clock
Registers between basic blocks are clocked on the slow clock
[Figure: the same datapath, showing which registers sit on each clock]

22 Selective De-Pipelining (SDP)
Constructs large, energy-efficient operators
Combinational paths span an entire basic block
Memory stays pipelined in order, preserving dependences (a toy timing model follows below)
[Figure: the same datapath with the memory mux clock]
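To make the timing discipline concrete, here is a toy C model of SDP's throughput (my sketch, not from the paper). Assume the fast memory clock runs R sub-cycles per slow datapath tick; a basic block evaluates as one combinational operation per slow tick while issuing its memory ops one per fast cycle:

    #include <stdio.h>

    #define R 8   /* assumed fast-to-slow clock ratio */

    typedef struct {
        int num_mem_ops;   /* loads/stores the block must schedule */
    } basic_block_t;

    /* Fast cycles consumed per execution of a basic block: at least
     * one full slow tick, and at least one fast cycle per memory op
     * (memory ops are pipelined in order on the fast clock). */
    int fast_cycles_per_iteration(const basic_block_t *bb) {
        return bb->num_mem_ops > R ? bb->num_mem_ops : R;
    }

    int main(void) {
        basic_block_t loop_body = { .num_mem_ops = 3 };
        printf("fast cycles per iteration: %d\n",
               fast_cycles_per_iteration(&loop_body));
        return 0;
    }

Larger basic blocks amortize the slow tick over more work, which is consistent with the later observation that SDP helps most for apps with larger basic blocks.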

23 SDP Benefits
Reduced clock power
Reduced area
Improved inter-operator optimization
Easier to meet timing

24 SDP Results (EDP improvement)
SDP creates more energy-efficient coprocessors

25 SDP Results (speedup)
The new design is faster than both the original C-Cores and the host processor.
SDP is most effective for apps with larger basic blocks.

26 Outline
Efficiency through specialization
Baseline C-Core microarchitecture
Selective De-Pipelining
Cachelets
Conclusion

27 Motivation for Cachelets
Relative to a processor:
ALU operations are ~3x faster
Many more ALU operations execute in parallel
L1 cache latency has not improved
L1 cache latency is therefore more critical for C-Cores: an L1 access is 9x longer than an ALU op!
Can we make L1 accesses faster?

28 Cache Access Latency
Cache access latency is a limiting factor for performance: it accounts for 50% of the scheduling latency of the last operation on the critical path.
Caches placed closer to the datapath could reduce this latency, but they must be very small.

29 Cachelets
Integrated into the datapath for low-latency access
Several 1-4 line, fully-associative arrays, built with latches
Each services a subset of the loads/stores
Coherent: MEI states only (no shared lines)
Checkout/shootdown via the L1 offloads coherence complexity
(A software model of a single cachelet follows below.)
[Figure: cachelets embedded in the C-Core datapath, backed by the L1]
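As a reading of this slide, here is a toy software model of one cachelet: a tiny fully-associative array with MEI line states that falls back to the L1 on a miss. The structure and names are my own sketch, not the hardware's interface:

    #include <stdint.h>
    #include <stdbool.h>

    /* Toy model of one cachelet: up to 4 fully-associative lines with
     * MEI states (Modified, Exclusive, Invalid; no Shared state). */
    enum mei { INVALID, EXCLUSIVE, MODIFIED };

    #define LINES 4
    #define LINE_WORDS 8

    typedef struct {
        enum mei  state[LINES];
        uintptr_t tag[LINES];
        uint32_t  data[LINES][LINE_WORDS];
    } cachelet_t;

    /* Returns true on a hit; on a miss the real hardware would check
     * the line out from the L1 (not modeled here). */
    bool cachelet_load(cachelet_t *c, uintptr_t addr, uint32_t *out) {
        uintptr_t tag  = addr / (LINE_WORDS * sizeof(uint32_t));
        unsigned  word = (addr / sizeof(uint32_t)) % LINE_WORDS;
        for (int i = 0; i < LINES; i++) {
            if (c->state[i] != INVALID && c->tag[i] == tag) {
                *out = c->data[i][word];
                return true;               /* sub-cycle hit */
            }
        }
        return false;                      /* fall back to the L1 */
    }

    /* L1-initiated shootdown: invalidate a checked-out line so the L1
     * can reclaim ownership. A Modified line would first be written
     * back, which this sketch omits. */
    void cachelet_shootdown(cachelet_t *c, uintptr_t tag) {
        for (int i = 0; i < LINES; i++)
            if (c->state[i] != INVALID && c->tag[i] == tag)
                c->state[i] = INVALID;
    }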

30 Cachelet Insertion Policies
Each memory operation is mapped to a cachelet or to the L1, using profile-based assignment (a sketch follows below).
Two policies, Private and Shared; fewer than 16 lines per C-Core, on average.
Private: one operation per cachelet
Average of 8.4 cachelets per C-Core
Area overhead of 13.4%
Shared: several operations per cachelet
6.2 cachelets per C-Core, average sharing factor of 10.3
Area overhead of 16.8%
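One way the two policies could be realized as an assignment pass; the hit-rate threshold and stream grouping below are assumptions for illustration, not the paper's actual heuristic:

    /* Hypothetical profile-based cachelet assignment.
     * Private: each selected memory op gets its own cachelet.
     * Shared:  ops from the same profiled access stream share one. */
    typedef struct {
        double small_hit_rate;  /* profiled hit rate in a tiny array */
        int    stream;          /* profiled access-stream id         */
        int    cachelet;        /* assigned cachelet, -1 = use L1    */
    } mem_op_t;

    #define MAX_STREAMS 64      /* assumed bound for the sketch */

    void assign_cachelets(mem_op_t *ops, int n, int shared) {
        int next = 0;
        int stream_map[MAX_STREAMS];
        for (int s = 0; s < MAX_STREAMS; s++) stream_map[s] = -1;

        for (int i = 0; i < n; i++) {
            if (ops[i].small_hit_rate <= 0.5) {   /* assumed threshold */
                ops[i].cachelet = -1;             /* keep on the L1 */
            } else if (!shared) {
                ops[i].cachelet = next++;         /* Private policy */
            } else {                              /* Shared policy */
                int s = ops[i].stream;
                if (stream_map[s] < 0)
                    stream_map[s] = next++;
                ops[i].cachelet = stream_map[s];
            }
        }
    }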

31 Cachelet Impact on Critical Path
Cachelets provide the majority of the utility of a full-sized L1, at cachelet latency.
They improve EDP: the reduction in latency is worth the extra energy.
[Figure: critical-path breakdown; the leftmost and rightmost bars match the motivation graphs]

32 Cachelet Speedup over SDP
The benefit of cachelets depends on the application:
Best when there are several disjoint memory access streams
Usually deployed for spatial rather than temporal locality

33 C-Cores with SDP and Cachelets vs. Host Processor
Average speedup of 1.61x over in-order host processor

34 C-Cores with SDP and Cachelets vs. Host Processor
10.3x EDP improvement over in-order host processor

35 Conclusion
Achieving high coverage with specialization requires handling both irregular and regular codes.
Selective De-Pipelining addresses the divergent needs of memory and datapath.
Cachelets reduce cache access time by a factor of 6 for a subset of memory operations.
Using SDP and cachelets together, we obtain both a 10.3x EDP improvement and a 1.6x speedup for irregular code.


37 Backup Slides

38 Application Level, with both SDP and Cachelets
57% EDP reduction over in-order host processor

39 Application Level, with both SDP and Cachelets
Average application speedup of 1.33x over host processor

40 SDP Results (Application EDP improvement)
SDP creates more energy-efficient coprocessors

41 SDP Results (Application Speedup)
The new design is faster than both the original C-Cores and the host processor.
SDP is most effective for apps with larger basic blocks.

42 Cachelet Speedup over SDP (Application Level)
The benefit of cachelets depends on the application:
Best when there are several disjoint memory access streams
Usually deployed for spatial rather than temporal locality

