A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors Josep M. Codina, Jesús Sánchez and Antonio González Dept. of Computer.


1 A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors Josep M. Codina, Jesús Sánchez and Antonio González Dept. of Computer Architecture Universitat Politècnica de Catalunya Barcelona, SPAIN E-mail: {jmcodina,fran,antonio}@ac.upc.es UPC UNIVERSITAT POLITÈCNICA DE CATALUNYA

2 Why Clustered Architectures? [INTRODUCTION]
- Semiconductor technology is continuously improving
- New technologies pack more logic in a single chip
- Exploit more ILP
- More functional units, registers, etc.
- Faster clock cycles
- However, new problems may arise
- Delay of signals or data movement from one part to another
- Power consumption
- Solution: exploit communication locality
- Divide the system into several "units"
- They can work almost independently and at very high frequency
- Some communication channels are used to exchange signals/data
- CLUSTERING

3 Current Trends in Clustered Architectures [INTRODUCTION]
- Partition the register file & functional units
- For embedded/DSP processors: VLIW design
- C6000 DSP of Texas Instruments
- TigerSharc of Analog Devices
- Lx of HP/ST, etc.
- Code generation
- Cluster assignment
- Instruction Scheduling
- Register Allocation
- For loops: modulo scheduling

4 Previous work on modulo scheduling [INTRODUCTION]
- Several works for non-clustered VLIW architectures
- Iterative MS, Slack MS, Swing MS, IRIS MS, etc.
- Some works for clustered VLIW architectures
- E. Nystrom and A. E. Eichenberger [MICRO '98]
- M. M. Fernandes et al. [HPCA '99]
- J. Sánchez and A. González [ICPP '00]
- J. Sánchez and A. González [MICRO '00]
- None of them takes register constraints into account
- Slide annotations: Shared memory, Distributed memory

5 How to deal with register constraints? [INTRODUCTION]
- Add spill code and/or increment II
- Eisenbeis et al. [MICRO '94]
- Ruttenberg et al. [PLDI '96]
- Zalamea et al. [PLDI '00]
- In these previous works:
- Non-clustered
- Spill after scheduling
- List Scheduling
- K. Kailas, K. Ebcioglu and A. Agrawala [HPCA '01]
- In this work:
- Clustered
- Spill during scheduling
- Modulo Scheduling

6 Talk Outline
- Clustered VLIW Architecture
- Our previous work
- URACAM
- Basic Ideas
- Algorithm
- Example
- Evaluation
- Conclusions

7 Talk Outline
- Clustered VLIW Architecture
- Our previous work
- URACAM
- Basic Ideas
- Algorithm
- Example
- Evaluation
- Conclusions

8 Architecture Overview [CLUSTERED VLIW ARCHITECTURE]
[Figure labels: L1 cache; per cluster, a local register file and FUs; bus/es]

9 Detailed Cluster [CLUSTERED VLIW ARCHITECTURE]
[Figure labels: FU, local register file, IVR, L1 cache, bus, bus output, bus input]

10 Talk Outline
- Clustered VLIW Architecture
- Our previous work
- URACAM
- Basic Ideas
- Algorithm
- Example
- Evaluation
- Conclusions

11 Our previous work
- Features of the basic scheduling algorithm (SA+GO, ICPP '00)
- Unified assign-and-schedule approach
- Cluster assignment heuristics to reduce the number of communications
- Loop unrolling to reduce the number of communications
- Main drawbacks
- It does not deal with spill code
- Unrolling could increase code size

12 Talk Outline
- Clustered VLIW Architecture
- Our previous work
- URACAM
- Basic Ideas
- Algorithm
- Example
- Evaluation
- Conclusions

13 Basic Ideas [URACAM]
- Main factors in Modulo Scheduling for Clustered Architectures
- Communications
- Register requirements
- Memory pressure
- A good scheme has to take all of them into account

14 Algorithm Overview [URACAM]
[Flowchart boxes: START; Compute MII; Sort DDG nodes; Next node; Try to schedule in cluster 0 ... Try to schedule in cluster N; New State; Try to Improve; New State; Best State; No Feasible State; + II]

15 Algorithm Overview [URACAM] (flowchart repeated; step: Compute MII)
- Like a monolithic architecture
- Recurrences
- Resources
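Note: the slide only says that MII is computed as for a monolithic machine, from recurrences and resources. A minimal sketch of the textbook bound, MII = max(ResMII, RecMII), follows. The function names, the dictionary layout and the sample numbers are illustrative assumptions, not taken from the talk.

```python
from math import ceil

def res_mii(op_counts, unit_counts):
    """Resource bound: the most heavily loaded resource class decides.

    op_counts[r]   = operations per iteration that need resource class r
    unit_counts[r] = units of class r in the (unified) machine
    """
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

def rec_mii(recurrences):
    """Recurrence bound.

    Each entry is (total_latency, total_distance), summed over the edges
    of one dependence cycle in the DDG. A loop without recurrences gets 1.
    """
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)

def mii(op_counts, unit_counts, recurrences):
    """Minimum initiation interval: the larger of the two bounds."""
    return max(res_mii(op_counts, unit_counts), rec_mii(recurrences))

# Hypothetical loop: 6 integer ops on 4 units, 3 memory ops on 2 ports,
# and one recurrence with total latency 5 spanning 2 iterations.
example_mii = mii({"int": 6, "mem": 3}, {"int": 4, "mem": 2}, [(5, 2)])
```

Here the resource bound is 2 and the recurrence bound is 3, so the scheduler would start its search at II = 3.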

16 Algorithm Overview [URACAM] (flowchart repeated; step: Sort DDG nodes)
- According to SMS (Llosa et al., PACT '96)
- Priority to nodes in recurrences
- Avoids ordering both predecessors and successors of a node before it

17 Algorithm Overview [URACAM] (flowchart repeated; step: START + Next node)
- All nodes are handled following the computed order

18 Algorithm Overview [URACAM] (flowchart repeated; step: Try to schedule in all clusters)
- Generation of a possible partial schedule (new state)
- Schedule the operation as close as possible to scheduled ones
- Resource constraints
- Communications are scheduled
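Note: placing an operation "as close as possible" under resource constraints is commonly done with a modulo reservation table, in which a resource used at cycle c is busy at row c mod II. The sketch below shows only a forward search from the earliest legal cycle; the table layout and the names `find_slot` and `reserve` are assumptions made for illustration, not the talk's data structures.

```python
def find_slot(mrt, resource, earliest, ii):
    """Return the first cycle in [earliest, earliest + ii) whose modulo row
    still has a free unit of `resource`, or None if all II rows are full.

    mrt[resource][row] holds the number of free units in that row.
    """
    for cycle in range(earliest, earliest + ii):
        if mrt[resource][cycle % ii] > 0:
            return cycle
    return None

def reserve(mrt, resource, cycle, ii):
    """Book one unit of `resource` in the row that `cycle` maps to."""
    mrt[resource][cycle % ii] -= 1

# Hypothetical cluster with 2 FUs and II = 2: two rows, two free units each.
table = {"fu": [2, 2]}
first = find_slot(table, "fu", 3, 2)     # cycle 3 maps to row 1
reserve(table, "fu", first, 2)
reserve(table, "fu", first, 2)           # row 1 is now full
second = find_slot(table, "fu", 3, 2)    # falls through to cycle 4 (row 0)
```

Searching only II consecutive cycles is enough, because after that the rows repeat.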

19 Algorithm Overview [URACAM] (flowchart repeated; step: Try to Improve)
- Adding spill code to reduce register requirements
- Spill code to reduce communications: memory-based communications
- Communications to reduce memory pressure
- Undoing spill code to reduce memory pressure

20 Algorithm Overview [URACAM] (flowchart repeated; step: Best State)
- Invalid candidates are discarded
- If there is no feasible state: increase II
- The best candidate among the valid ones is chosen
- Figure of Merit
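Note: the flowchart steps on slides 14 to 20 can be read as the driver loop sketched below. This is a simplified reading under stated assumptions: `try_schedule`, `improve` and `merit` are hypothetical callbacks standing in for the talk's scheduling step, transformations and figure of merit, and the toy callbacks at the end exist only to make the sketch runnable.

```python
def schedule_loop(nodes, clusters, mii, try_schedule, improve, merit, max_ii=64):
    """Assign-and-schedule driver. Returns (ii, state) or None.

    nodes        : DDG nodes, already sorted
    try_schedule : (state, node, cluster, ii) -> new state, or None if infeasible
    improve      : (state, ii) -> list of valid candidate states
    merit        : state -> comparable value, lower is better (assumption)
    """
    for ii in range(mii, max_ii + 1):
        state = {}                        # partial schedule: node -> (cluster, cycle)
        for node in nodes:
            candidates = []
            for cluster in clusters:      # try the node in every cluster
                new_state = try_schedule(state, node, cluster, ii)
                if new_state is not None:
                    candidates.extend(improve(new_state, ii))
            if not candidates:
                break                     # no feasible state: retry with a larger II
            state = min(candidates, key=merit)
        else:
            return ii, state              # every node was placed
    return None

# Toy callbacks (hypothetical): one FU per cluster, so at most II ops per cluster.
def toy_try(state, node, cluster, ii):
    used = sum(1 for c, _ in state.values() if c == cluster)
    return None if used >= ii else {**state, node: (cluster, used)}

def toy_improve(state, ii):
    return [state]                        # no transformations in the toy

def toy_merit(state):
    loads = {}
    for cluster, _ in state.values():
        loads[cluster] = loads.get(cluster, 0) + 1
    return max(loads.values())            # prefer balanced clusters

result = schedule_loop(["D", "B", "A", "C"], [0, 1], 1,
                       toy_try, toy_improve, toy_merit)
```

With four operations, two single-FU clusters and a starting II of 1, the toy run fails at II = 1 and succeeds at II = 2.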

21 Figure of Merit [URACAM]
- Used to choose the best alternative in every partial schedule
- A single criterion to evaluate a schedule
- Measures the utilization of the most critical resources
- Underlying concepts:
- Scarce resources are more valuable than abundant ones
- Maximize the available resources of the most used ones
- Set of 2N+1 percentages (N = num_clusters): index 0: % Com; indices 1..N: % Mem; indices N+1..N+N: % Regs

22 Using Figure of Merit [URACAM]
- Comparing two new states
- Compute, for each resource, the percentage of the remaining resources that the state uses
- Compare from the highest to the lowest percentages
- Figure of Merit in transformations gives
- Best candidate
- Benefit of the transformation
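Note: one way to read the comparison rule is as a lexicographic comparison of the percentages sorted from highest to lowest. The sketch below assumes that a lower percentage is better, which matches the stated goal of keeping the most used resources available, and it charges the register to the cluster that receives the operation. Both points, and the function names, are assumptions. The numbers are those shown for node B in the example (6.25% versus 25% and 6.25%).

```python
def merit_vector(used, free):
    """Percentage of each resource's remaining (free) amount that a candidate
    state would consume, sorted from highest to lowest."""
    pct = []
    for resource, available in free.items():
        need = used.get(resource, 0)
        if need == 0:
            pct.append(0.0)
        elif available == 0:
            pct.append(float("inf"))      # needs a resource that has run out
        else:
            pct.append(100.0 * need / available)
    return sorted(pct, reverse=True)

def better(a, b):
    """True if vector `a` is preferable to `b`. Python compares lists
    lexicographically, so the highest percentages are compared first."""
    return a < b

# Free resources before node B is placed (values from the example slides).
free = {"bus": 4, "mem1": 4, "regs1": 16, "mem2": 4, "regs2": 16}

in_cluster_1 = merit_vector({"regs1": 1}, free)             # one register
in_cluster_2 = merit_vector({"bus": 1, "regs2": 1}, free)   # plus one bus slot
```

Under these assumptions cluster 1 wins for B, because its worst percentage (6.25%) is lower than the 25% of bus capacity that the communication would take.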

23 An Example [URACAM]
- [Figure: DDG with nodes A, B, C, D; a label "2" appears in the graph]
- 2 clusters
- 2 general-purpose FUs per cluster
- 2 memory ports per cluster, latency = 1
- Unified mII = 2 cycles
- 8 registers per cluster
- 2 buses, latency = 1
- A, B, C, D: non-memory operations with a latency of 1
- Free resources: Bus 4, Mem Clust 1 4, Regs Clust 1 16, Mem Clust 2 4, Regs Clust 2 16
- Node order in the table: D, B, A, C (columns: Cluster 1, Cluster 2)
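Note: the initial resource values in the figure (4, 4, 16, 4, 16) are consistent with each resource offering units x II slots per iteration at II = 2. The slide does not state this rule, so the sketch below is an inference, and the key names are hypothetical.

```python
def free_slots(units, ii):
    """Slots available per iteration, assuming each unit offers one slot
    in each of the II cycles of the kernel."""
    return {name: count * ii for name, count in units.items()}

# Machine of the example: 2 buses, 2 memory ports and 8 registers per cluster.
machine = {"bus": 2, "mem_c1": 2, "regs_c1": 8, "mem_c2": 2, "regs_c2": 8}
initial = free_slots(machine, ii=2)
```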

24 An Example [URACAM] (configuration as on slide 23)
- Free resources: Bus 4, Mem Clust 1 4, Regs Clust 1 16, Mem Clust 2 4, Regs Clust 2 16
- Figure of merit per node: D: 0%, 0%

25 An Example [URACAM] (configuration as on slide 23)
- Free resources: Bus 4, Mem Clust 1 4, Regs Clust 1 16, Mem Clust 2 4, Regs Clust 2 16
- Figure of merit per node: D: 0%, 0%; B: 6.25%

26 An Example [URACAM] (configuration as on slide 23)
- Free resources: Bus 4, Mem Clust 1 4, Regs Clust 1 16, Mem Clust 2 4, Regs Clust 2 16
- Figure of merit per node: D: 0%, 0%; B: 25% - 6.25% (Communication)

27 An Example [URACAM] (configuration as on slide 23)
- Free resources: Bus 4, Mem Clust 1 4, Regs Clust 1 15, Mem Clust 2 4, Regs Clust 2 16
- Figure of merit per node: D: 0%, 0%; B: 6.25%, 25% - 6.25%

28 An Example [URACAM] (configuration as on slide 23)
- Free resources: Bus 4, Mem Clust 1 4, Regs Clust 1 15, Mem Clust 2 4, Regs Clust 2 16
- Figure of merit per node: D: 0%, 0%; B: 6.25%, 25% - 6.25%; A: 20%

29 An Example [URACAM] (configuration as on slide 23)
- Free resources: Bus 4, Mem Clust 1 4, Regs Clust 1 15, Mem Clust 2 4, Regs Clust 2 16
- Figure of merit per node: D: 0%, 0%; B: 6.25%, 25% - 6.25%; A: 50% - 13.33% (Communications)

30 An Example [URACAM] (configuration as on slide 23)
- Free resources: Bus 4, Mem Clust 1 4, Regs Clust 1 12, Mem Clust 2 4, Regs Clust 2 16
- Figure of merit per node: D: 0%, 0%; B: 6.25%, 25% - 6.25%; A: 20%, 50% - 13.33%

31 An Example [URACAM] (configuration as on slide 23)
- Free resources: Bus 4, Mem Clust 1 4, Regs Clust 1 12, Mem Clust 2 4, Regs Clust 2 16
- Figure of merit per node: D: 0%, 0%; B: 6.25%, 25% - 6.25%; A: 20%, 50% - 13.33%; C: 83.33%

32 An Example [URACAM] (configuration as on slide 23)
- Free resources: Bus 4, Mem Clust 1 4, Regs Clust 1 12, Mem Clust 2 4, Regs Clust 2 16
- Figure of merit per node: D: 0%, 0%; B: 6.25%, 25% - 6.25%; A: 20%, 50% - 13.33%; C: 83.33%, 50% - 8.33% (Spill Code)

33 An Example [URACAM] (configuration as on slide 23)
- Free resources: Bus 4, Mem Clust 1 4, Regs Clust 1 12, Mem Clust 2 4, Regs Clust 2 16
- Figure of merit per node: D: 0%, 0%; B: 6.25%, 25% - 6.25%; A: 20%, 50% - 13.33%; C: 25%-25%-... (Communication through mem.)

34 An Example [URACAM] (configuration as on slide 23)
- Free resources: Bus 3, Mem Clust 1 3, Regs Clust 1 12, Mem Clust 2 3, Regs Clust 2 15
- Figure of merit per node: D: 0%, 0%; B: 6.25%, 25% - 6.25%; A: 20%, 50% - 13.33%; C: 83.33%, 25%-25%-..., 50% - 8.33%
- [Schedule figure labels: St, Ld, Com; Cluster 1, Cluster 2]

35 Memory operations [URACAM]
- Additional memory operations
- Spill Code
- Communications through memory
- As a result, operations from the original DDG might not be schedulable
- Solution:
- Differentiate memory pressure in the figure of merit
- Global
- Original memory operations
- Local
- [Figure: vector of 2N+2 percentages, N = num_clusters; entries labelled % Com, % Local Mem, % Global Mem, % Regs; index labels 0, 1, N, N+1, N+2, 2N+1]

36 Talk Outline
- Clustered VLIW Architecture
- Our previous work
- URACAM
- Basic Ideas
- Algorithm
- Example
- Evaluation
- Conclusions

37 Evaluation [URACAM]
- Evaluated using SPECfp95
- Using graphs generated by the ICTINEO compiler

38 Configuration [PERFORMANCE EVALUATION]
Latencies        INT    FP
MEM               2      2
ARITH/ABS         1      3
MUL               2      6
DIV/SQR/TRG       6     18
Resources      Unified  2-cluster  4-cluster
INT/cluster       4         2          1
FP/cluster        4         2          1
MEM/cluster       4         2          1
REGS/cluster    64/32     32/16      16/8
Buses                   2-cluster  4-cluster
Comm Buses                 1/4        1/4
Bus Latency                 1          1

39 IPC - 64 registers [PERFORMANCE EVALUATION]

40 IPC - 64 registers [PERFORMANCE EVALUATION]

41 IPC - 32 registers [PERFORMANCE EVALUATION]

42 IPC - 32 registers [PERFORMANCE EVALUATION]

43 URACAM Performance - 1 bus, 64 registers [PERFORMANCE EVALUATION]

44 URACAM Performance - 4 buses, 64 registers [PERFORMANCE EVALUATION]

45 Talk Outline
- Clustered VLIW Architecture
- Our previous work
- URACAM
- Basic Ideas
- Algorithm
- Evaluation
- Conclusions

46 Conclusions
- URACAM handles communications, memory pressure and registers at the same time
- Search for the best overall solution
- Figure of Merit: a single criterion to compare partial schedules
- Transformations to improve partial schedules
- Spill Code to reduce register pressure
- Communications through memory to reduce bus pressure
- Communications through bus to reduce memory pressure
- Undo Spill Code to reduce memory pressure
- Spill Code for Clustered VLIW Architecture
- Done during the scheduling

47 Conclusions
- URACAM achieves better schedules than previous work on Modulo Scheduling for a Clustered VLIW Architecture
- Speedup of 18% for 2 clusters and 22% for 4 clusters [for 1 inter-register bus with 1-cycle latency and 32 registers]
- Degradation with respect to a non-clustered architecture
- 3% for 2 clusters and 10% for 4 clusters [for 4 inter-register buses with 1-cycle latency and 32 registers]
- URACAM is an adaptive and powerful technique
- Figure of Merit
- Transformations

48 A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors Josep M. Codina, Jesús Sánchez and Antonio González Dept. of Computer Architecture Universitat Politècnica de Catalunya Barcelona, SPAIN E-mail: {jmcodina,fran,antonio}@ac.upc.es UPC UNIVERSITAT POLITÈCNICA DE CATALUNYA

