Published by Godfrey Pierce. Modified over 8 years ago.
1
A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors
  Josep M. Codina, Jesús Sánchez and Antonio González
  Dept. of Computer Architecture, Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
  E-mail: {jmcodina,fran,antonio}@ac.upc.es
2
Introduction: Why Clustered Architectures?
  Semiconductor technology is continuously improving
    New technologies pack more logic in a single chip
      Exploit more ILP: more functional units, registers, etc.
    Faster clock cycles
  However, new problems may arise
    Delay of signals or data movement from one part of the chip to another
    Power consumption
  Solution: exploit communication locality (clustering)
    Divide the system into several "units"
    They can work almost independently and at very high frequency
    Some communication channels are used to exchange signals/data
3
Introduction: Current Trends in Clustered Architectures
  Partition the register file and functional units
  For embedded/DSP processors: VLIW design
    C6000 DSP of Texas Instruments
    TigerSHARC of Analog Devices
    Lx of HP/ST, etc.
  Code generation
    Cluster assignment
    Instruction scheduling
    Register allocation
  For loops: modulo scheduling
4
Introduction: Previous Work on Modulo Scheduling
  Several works for non-clustered VLIW architectures
    Iterative MS, Slack MS, Swing MS, IRIS MS, etc.
  Some works for clustered VLIW architectures
    E. Nystrom and A. E. Eichenberger [MICRO '98]
    M. M. Fernandes et al. [HPCA '99]
    J. Sánchez and A. González [ICPP '00]
    J. Sánchez and A. González [MICRO '00]
  None of them takes register constraints into account
  Shared memory, distributed memory
5
Introduction: How to Deal with Register Constraints?
  Add spill code and/or increase the II
    Eisenbeis et al. [MICRO '94]
    Ruttenberg et al. [PLDI '96]
    Zalamea et al. [PLDI '00]
  In these previous works: non-clustered, spill after scheduling
  List scheduling: K. Kailas, K. Ebcioglu and A. Agrawala [HPCA '01]
  In this work: clustered, spill during scheduling, modulo scheduling
6
Talk Outline
  Clustered VLIW Architecture
  Our previous work
  URACAM
  Basic Ideas
  Algorithm
  Example
  Evaluation
  Conclusions
8
Clustered VLIW Architecture: Architecture Overview
  [Diagram labels: L1 cache; in each cluster, a local register file and FUs; bus/es connecting the clusters]
9
Clustered VLIW Architecture: Detailed Cluster
  [Diagram labels: FU, local register file, IVR, L1 cache, bus, bus output, bus input]
11
Our Previous Work
  Features of the basic scheduling algorithm (SA+GO, ICPP '00)
    Unified assign-and-schedule approach
    Cluster assignment heuristics to reduce the number of communications
    Loop unrolling to reduce the number of communications
  Main drawbacks
    It does not deal with spill code
    Unrolling can increase code size
13
URACAM: Basic Ideas
  Main factors in modulo scheduling for clustered architectures
    Communications
    Register requirements
    Memory pressure
  A good scheme has to take all of them into account
14
URACAM: Algorithm Overview
  [Flowchart: Compute MII -> Sort DDG nodes -> START -> Next node -> Try to schedule in cluster 0 ... Try to schedule in cluster N -> New State -> Try to Improve -> New State -> Best State; on No Feasible State, increase the II]
15
URACAM: Algorithm Overview, Compute MII
  Computed as for a monolithic architecture
    Recurrences
    Resources
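The two bounds named on this slide are conventionally combined by taking their maximum. The sketch below assumes the usual textbook definitions of the resource and recurrence bounds; the function names and the input encoding (operation counts per resource class, and one (latency, distance) pair per recurrence) are illustrative choices and do not come from the talk.

```python
from math import ceil

def res_mii(op_counts, unit_counts):
    """Resource bound: the busiest resource class limits the II.

    op_counts[k]   -- operations in the loop body that need class k
    unit_counts[k] -- units of class k summed over all clusters
                      (the machine is treated as monolithic)
    """
    return max((ceil(n / unit_counts[k]) for k, n in op_counts.items()),
               default=1)

def rec_mii(recurrences):
    """Recurrence bound: a dependence cycle with total latency `lat` that
    spans `dist` iterations needs at least ceil(lat / dist) cycles."""
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)

def mii(op_counts, unit_counts, recurrences):
    """Minimum initiation interval: the larger of the two bounds."""
    return max(1, res_mii(op_counts, unit_counts), rec_mii(recurrences))

# Hypothetical loop (not from the talk): 6 FU operations and 3 memory
# operations on a machine with 4 FUs and 4 memory ports in total, plus
# one recurrence of total latency 4 and iteration distance 2.
example_mii = mii({"fu": 6, "mem": 3}, {"fu": 4, "mem": 4}, [(4, 2)])
```

Here both bounds evaluate to 2, so `example_mii` is 2.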
16
URACAM: Algorithm Overview, Sort DDG nodes
  According to SMS (Llosa et al., PACT '96)
  Priority to nodes in recurrences
  Avoids having both predecessors and successors of a node ordered before it
17
URACAM: Algorithm Overview, START and Next node
  All nodes are handled in the computed order
18
URACAM: Algorithm Overview, Try to schedule in all clusters
  Each attempt generates a possible partial schedule (a new state)
  The operation is scheduled as close as possible to the already scheduled ones
  Resource constraints are respected
  Communications are scheduled
19
URACAM: Algorithm Overview, Try to Improve
  Adding spill code to reduce register requirements
  Spill code to reduce communications: memory-based communications
  Communications to reduce memory pressure
  Undoing spill code to reduce memory pressure
20
URACAM: Algorithm Overview, Best State
  Non-valid candidates are discarded
  If there is no feasible state, the II is increased
  The best of the valid candidates is chosen using the Figure of Merit
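The control flow of the flowchart can be sketched as a driver loop. This is a simplified reading, not the authors' implementation: it assumes scheduling restarts from an empty schedule whenever the II grows, it leaves out the Try to Improve step, and `try_schedule`, `merit`, `toy_try_schedule` and `toy_merit` are hypothetical stand-ins for the real per-cluster scheduler and Figure of Merit.

```python
def modulo_schedule(nodes, num_clusters, mii, try_schedule, merit, max_ii=64):
    """Place every node, trying all clusters; raise the II when stuck."""
    ii = mii
    while ii <= max_ii:
        state = {}                          # empty partial schedule
        for node in nodes:                  # nodes arrive already sorted
            candidates = []
            for cluster in range(num_clusters):
                new_state = try_schedule(state, node, cluster, ii)
                if new_state is not None:   # non-valid candidates are dropped
                    candidates.append(new_state)
            if not candidates:              # no feasible state at this II
                break
            state = min(candidates, key=merit)   # best state so far
        else:
            return ii, state                # every node was placed
        ii += 1
    raise RuntimeError("no schedule found up to max_ii")

def toy_try_schedule(state, node, cluster, ii):
    """Toy model: one FU per cluster, so a cluster holds at most ii nodes."""
    load = sum(1 for c in state.values() if c == cluster)
    if load + 1 > ii:
        return None
    new_state = dict(state)
    new_state[node] = cluster
    return new_state

def toy_merit(state):
    """Toy merit: load of the busiest cluster (lower is better)."""
    loads = {}
    for c in state.values():
        loads[c] = loads.get(c, 0) + 1
    return max(loads.values())

ii, placement = modulo_schedule(["D", "B", "A", "C"], 2, 1,
                                toy_try_schedule, toy_merit)
```

In this toy run the first attempt at II = 1 fails on the third node, so the II is raised to 2 and the four nodes are split evenly between the two clusters.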
21
URACAM: Figure of Merit
  Used to choose the best alternative at every partial schedule
  A single criterion to evaluate a schedule
  Measures the utilization of the most critical resources
  Underlying concepts
    Scarce resources are more valuable than abundant ones
    Maximize the available resources of the most used ones
  A set of 2N+1 percentages (N = num_clusters)
    Index 0: communications (Com)
    Indices 1 to N: memory (Mem), one per cluster
    Indices N+1 to 2N: registers (Regs), one per cluster
22
URACAM: Using the Figure of Merit
  Comparing two new states
    Compute the percentage of the remaining resources that each state uses
    Compare the percentages from the highest to the lowest
  The Figure of Merit in transformations gives
    The best candidate
    The benefit of the transformation
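The comparison described on this slide can be sketched in a few lines. The sketch assumes each percentage is the amount a candidate would consume divided by what is still free for that resource, and that the lower sorted vector wins; this reading reproduces the 6.25% and 25% values on the example slides, but the exact formula is not stated in the talk. The resource keys and function names are illustrative.

```python
def percent(needed, free):
    """Share of the remaining capacity consumed, as a percentage."""
    if free == 0:
        return float("inf") if needed else 0.0
    return 100.0 * needed / free

def figure_of_merit(needed, free):
    """Percentages for every resource, sorted from highest to lowest."""
    return sorted((percent(needed.get(r, 0), free[r]) for r in free),
                  reverse=True)

def better(needed_a, needed_b, free):
    """Pick the candidate whose most stressed resources are less loaded.
    Python compares the sorted lists element by element, highest first."""
    fom_a = figure_of_merit(needed_a, free)
    fom_b = figure_of_merit(needed_b, free)
    return needed_a if fom_a <= fom_b else needed_b

# Free slots as on the example slides: bus 4, memory 4 and registers 16
# per cluster.
free = {"bus": 4, "mem1": 4, "regs1": 16, "mem2": 4, "regs2": 16}
same_cluster = {"regs1": 1}              # one register, no communication
other_cluster = {"bus": 1, "regs2": 1}   # one bus slot plus one register

fom_same = figure_of_merit(same_cluster, free)    # [6.25, 0.0, 0.0, 0.0, 0.0]
fom_other = figure_of_merit(other_cluster, free)  # [25.0, 6.25, 0.0, 0.0, 0.0]
choice = better(same_cluster, other_cluster, free)
```

Sorting before comparing means the candidate that stresses its scarcest resource the least wins, which matches the stated principle that scarce resources are more valuable than abundant ones.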
23
URACAM: An Example
  [DDG figure: nodes A, B, C and D; the label 2 appears in the graph]
  Machine
    2 clusters
    2 general-purpose FUs per cluster
    2 memory ports per cluster, latency 1
    8 registers per cluster
    2 buses, latency 1
  Unified mII = 2 cycles
  A, B, C and D are non-memory operations with a latency of 1
  Node order: D, B, A, C
  Free resources: Bus 4; Mem Clust 1 4; Regs Clust 1 16; Mem Clust 2 4; Regs Clust 2 16
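The free-resource counts on this slide are consistent with multiplying each per-cycle capacity by the II of 2 cycles. The talk does not state this formula, so the sketch below is an inference from the numbers; `free_slots` and its dictionary keys are hypothetical names.

```python
def free_slots(ii, num_clusters, mem_ports, regs, buses):
    """Slots available in one II-cycle window of the modulo schedule
    (assumed: per-cycle capacity times II)."""
    free = {"bus": buses * ii}
    for c in range(1, num_clusters + 1):
        free[f"mem{c}"] = mem_ports * ii   # memory-port slots in cluster c
        free[f"regs{c}"] = regs * ii       # register slots in cluster c
    return free

# Machine from the slide: 2 clusters, 2 memory ports and 8 registers per
# cluster, 2 buses, II = 2.
initial = free_slots(ii=2, num_clusters=2, mem_ports=2, regs=8, buses=2)
# {'bus': 4, 'mem1': 4, 'regs1': 16, 'mem2': 4, 'regs2': 16}
```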
24
URACAM: An Example
  Free resources: Bus 4; Mem Clust 1 4; Regs Clust 1 16; Mem Clust 2 4; Regs Clust 2 16
  Figure of Merit per node (Cluster 1 | Cluster 2)
    D: 0% | 0%
25
URACAM: An Example
  Free resources: Bus 4; Mem Clust 1 4; Regs Clust 1 16; Mem Clust 2 4; Regs Clust 2 16
  Figure of Merit per node (Cluster 1 | Cluster 2)
    D: 0% | 0%
    B: 6.25%
26
URACAM: An Example
  Free resources: Bus 4; Mem Clust 1 4; Regs Clust 1 16; Mem Clust 2 4; Regs Clust 2 16
  Figure of Merit per node (Cluster 1 | Cluster 2)
    D: 0% | 0%
    B, with a communication: 25% - 6.25%
27
URACAM: An Example
  Free resources: Bus 4; Mem Clust 1 4; Regs Clust 1 15; Mem Clust 2 4; Regs Clust 2 16
  Figure of Merit per node (Cluster 1 | Cluster 2)
    D: 0% | 0%
    B: 6.25% | 25% - 6.25%
28
URACAM: An Example
  Free resources: Bus 4; Mem Clust 1 4; Regs Clust 1 15; Mem Clust 2 4; Regs Clust 2 16
  Figure of Merit per node (Cluster 1 | Cluster 2)
    D: 0% | 0%
    B: 6.25% | 25% - 6.25%
    A: 20%
29
URACAM: An Example
  Free resources: Bus 4; Mem Clust 1 4; Regs Clust 1 15; Mem Clust 2 4; Regs Clust 2 16
  Figure of Merit per node (Cluster 1 | Cluster 2)
    D: 0% | 0%
    B: 6.25% | 25% - 6.25%
    A, with communications: 50% - 13.33%
30
URACAM: An Example
  Free resources: Bus 4; Mem Clust 1 4; Regs Clust 1 12; Mem Clust 2 4; Regs Clust 2 16
  Figure of Merit per node (Cluster 1 | Cluster 2)
    D: 0% | 0%
    B: 6.25% | 25% - 6.25%
    A: 20% | 50% - 13.33%
31
URACAM: An Example
  Free resources: Bus 4; Mem Clust 1 4; Regs Clust 1 12; Mem Clust 2 4; Regs Clust 2 16
  Figure of Merit per node (Cluster 1 | Cluster 2)
    D: 0% | 0%
    B: 6.25% | 25% - 6.25%
    A: 20% | 50% - 13.33%
    C: 83.33%
32
URACAM: An Example
  Free resources: Bus 4; Mem Clust 1 4; Regs Clust 1 12; Mem Clust 2 4; Regs Clust 2 16
  Figure of Merit per node (Cluster 1 | Cluster 2)
    D: 0% | 0%
    B: 6.25% | 25% - 6.25%
    A: 20% | 50% - 13.33%
    C: 83.33%; with spill code: 50% - 8.33%
33
URACAM: An Example
  Free resources: Bus 4; Mem Clust 1 4; Regs Clust 1 12; Mem Clust 2 4; Regs Clust 2 16
  Figure of Merit per node (Cluster 1 | Cluster 2)
    D: 0% | 0%
    B: 6.25% | 25% - 6.25%
    A: 20% | 50% - 13.33%
    C, with a communication through memory and a bus communication: 25% - 25% - ...
34
URACAM: An Example
  Free resources: Bus 3; Mem Clust 1 3; Regs Clust 1 12; Mem Clust 2 3; Regs Clust 2 15
  Figure of Merit per node (Cluster 1 | Cluster 2)
    D: 0% | 0%
    B: 6.25% | 25% - 6.25%
    A: 20% | 50% - 13.33%
    C: 83.33%; with spill code: 50% - 8.33%; with communication through memory: 25% - 25% - ...
  [Schedule figure labels: St, Ld, Com; Cluster 1, Cluster 2]
35
URACAM: Memory Operations
  Additional memory operations
    Spill code
    Communications through memory
  With these, operations from the original DDG might not be schedulable
  Solution: differentiate memory pressure in the Figure of Merit
    Global (original memory operations)
    Local
  2N+2 percentages (N = num_clusters): Com, Global Mem, Local Mem per cluster, Regs per cluster
37
URACAM: Evaluation
  Evaluated using SPECfp95
  Using graphs generated by the ICTINEO compiler
38
Performance Evaluation: Configuration

  Latencies        INT    FP
  MEM               2      2
  ARITH/ABS         1      3
  MUL               2      6
  DIV/SQR/TRG       6     18

  Resources      Unified  2-cluster  4-cluster
  INT/cluster       4        2          1
  FP/cluster        4        2          1
  MEM/cluster       4        2          1
  REGS/cluster    64/32    32/16      16/8

  Communication  2-cluster  4-cluster
  Comm buses        1/4        1/4
  Bus latency        1          1
39
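Performance Evaluation: IPC, 64 registers [chart not reproduced in this transcript]
40
Performance Evaluation: IPC, 64 registers [chart not reproduced in this transcript]
41
Performance Evaluation: IPC, 32 registers [chart not reproduced in this transcript]
42
Performance Evaluation: IPC, 32 registers [chart not reproduced in this transcript]
43
Performance Evaluation: URACAM performance, 1 bus, 64 registers [chart not reproduced in this transcript]
44
Performance Evaluation: URACAM performance, 4 buses, 64 registers [chart not reproduced in this transcript]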
46
Conclusions
  URACAM handles communications, memory pressure and registers at the same time
    It searches for the best overall solution
  Figure of Merit: a single criterion to compare partial schedules
  Transformations to improve partial schedules
    Spill code to reduce register pressure
    Communications through memory to reduce bus pressure
    Communications through the bus to reduce memory pressure
    Undoing spill code to reduce memory pressure
  Spill code for a clustered VLIW architecture
    Done during scheduling
47
Conclusions
  URACAM achieves better schedules than previous work on modulo scheduling for a clustered VLIW architecture
    Speedup of 18% for 2 clusters and 22% for 4 clusters
    [for 1 inter-register bus with 1-cycle latency and 32 registers]
  Degradation with respect to a non-clustered architecture
    3% for 2 clusters and 10% for 4 clusters
    [for 4 inter-register buses with 1-cycle latency and 32 registers]
  URACAM is an adaptive and powerful technique
    Figure of Merit
    Transformations