A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors. Josep M. Codina, Jesús Sánchez and Antonio González, Dept. of Computer Architecture, Universitat Politècnica de Catalunya.


A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors
Josep M. Codina, Jesús Sánchez and Antonio González
Dept. of Computer Architecture, Universitat Politècnica de Catalunya, Barcelona, SPAIN

Why Clustered Architectures? (INTRODUCTION)
- Semiconductor technology is continuously improving
  - New technologies pack more logic in a single chip
  - Exploit more ILP: more functional units, registers, etc.
  - Faster clock cycles
- However, new problems may arise
  - Delay of signals or data movement from one part of the chip to another
  - Power consumption
- Solution: exploit communication locality
  - Divide the system into several "units"
  - They can work almost independently and at very high frequency
  - Some communication channels are used to exchange signals/data
  - CLUSTERING

Current Trends in Clustered Architectures (INTRODUCTION)
- Partition the register file & functional units
- For embedded/DSP processors: VLIW design
  - C6000 DSP of Texas Instruments
  - TigerSharc of Analog Devices
  - Lx of HP/ST, etc.
- Code generation
  - Cluster assignment
  - Instruction scheduling
  - Register allocation
- For loops: modulo scheduling

Previous work on modulo scheduling (INTRODUCTION)
- Several works for non-clustered VLIW architectures
  - Iterative MS, Slack MS, Swing MS, IRIS MS, etc.
- Some works for clustered VLIW architectures (shared-memory and distributed-memory designs)
  - E. Nystrom and A. E. Eichenberger [MICRO '98]
  - M. M. Fernandes et al. [HPCA '99]
  - J. Sánchez and A. González [ICPP '00]
  - J. Sánchez and A. González [MICRO '00]
- None of them takes register constraints into account

How to deal with register constraints? (INTRODUCTION)
- Add spill code and/or increase the II
  - Eisenbeis et al. [MICRO '94]
  - Ruttenberg et al. [PLDI '96]
  - Zalamea et al. [PLDI '00]
  - In these previous works: non-clustered, spill after scheduling
- List scheduling
  - K. Kailas, K. Ebcioglu and A. Agrawala [HPCA '01]
- In this work: clustered, spill during scheduling, modulo scheduling

Talk Outline
- Clustered VLIW Architecture
- Our previous work
- URACAM
  - Basic Ideas
  - Algorithm
  - Example
- Evaluation
- Conclusions


Architecture Overview (CLUSTERED VLIW ARCHITECTURE)
[Figure: several clusters, each with a local register file and functional units, sharing an L1 cache through one or more buses]

Detailed Cluster (CLUSTERED VLIW ARCHITECTURE)
[Figure: one cluster in detail - FUs, local register file, IVR, L1 cache, and the bus input/output ports]


Our previous work
- Features of the basic scheduling algorithm (SA+GO, ICPP '00)
  - Unified assign-and-schedule approach
  - Cluster assignment heuristics to reduce the number of communications
  - Loop unrolling to reduce the number of communications
- Main drawbacks
  - It does not deal with spill code
  - Unrolling can increase code size


Basic Ideas (URACAM)
- Main factors in modulo scheduling for clustered architectures:
  - Communications
  - Register requirements
  - Memory pressure
- A good scheme has to take all of them into account

Algorithm Overview (URACAM)
[Flowchart: Compute MII, sort the DDG nodes, then for each node try to schedule it in every cluster (0..N), try to improve each resulting new state, and keep the best state plus II; if no feasible state exists, increase the II]

Algorithm Overview: Compute MII
- As for a monolithic (non-clustered) architecture, from recurrences and resources
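The MII computation referred to here is standard modulo-scheduling practice: the II is bounded below both by resource usage (ResMII) and by dependence recurrences (RecMII). A minimal Python sketch, where the data shapes (usage/units dictionaries, a list of (latency, distance) pairs per recurrence) are illustrative and not taken from the paper:

```python
from math import ceil

def res_mii(op_usage, num_units):
    # ResMII: the most heavily used resource class bounds the II
    # (operations needing resource r / units of r, rounded up).
    return max(ceil(op_usage[r] / num_units[r]) for r in op_usage)

def rec_mii(recurrences):
    # RecMII: each dependence cycle needs at least
    # ceil(sum of latencies / sum of loop-carried distances) cycles.
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)

def compute_mii(op_usage, num_units, recurrences):
    return max(res_mii(op_usage, num_units), rec_mii(recurrences))
```

For example, 6 ALU operations on 4 ALUs give ResMII = 2, which dominates a recurrence of total latency 4 with distance 2.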

Algorithm Overview: Sort DDG nodes
- According to SMS (Llosa et al., PACT '96)
- Priority to nodes in recurrences
- Avoids having both predecessors and successors already scheduled before a node
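The full SMS ordering interleaves forward and backward traversals of the DDG; as a rough illustration of the priority idea only (recurrence nodes first, then the more critical nodes), with hypothetical field names:

```python
def priority_order(nodes, in_recurrence, depth):
    # Simplified stand-in for the SMS ordering: recurrence nodes come
    # first, then deeper (more critical) nodes. The real SMS order also
    # walks the graph so that no node ends up with both a predecessor
    # and a successor already scheduled.
    return sorted(nodes, key=lambda n: (not in_recurrence[n], -depth[n]))
```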

Algorithm Overview: START + Next node
- All nodes are handled following the computed order

Algorithm Overview: Try to schedule in all clusters
- Generation of a possible partial schedule (new state) per cluster
- The operation is scheduled as close as possible to the already scheduled ones
- Resource constraints are respected, and communications are scheduled

Algorithm Overview: Try to improve
- Adding spill code to reduce register requirements
- Spill code to reduce communications (memory-based communications)
- Communications to reduce memory pressure
- Undoing spill code to reduce memory pressure

Algorithm Overview: Best State
- Non-valid candidates are discarded
- If there is no feasible state, the II is increased
- The best candidate among the valid ones is chosen using the Figure of Merit
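The whole flow sketched on these slides can be written as a driver loop. In the sketch below, `schedule_in`, `improve` and `merit` stand for the per-cluster scheduling attempt, the improvement transformations, and the Figure of Merit; their signatures are our assumptions, not the paper's interfaces:

```python
def schedule_loop(nodes, clusters, schedule_in, improve, merit, max_ii):
    """Outer loop of the unified assign-and-schedule flow:
    increase II until every node (in priority order) finds a slot."""
    for ii in range(1, max_ii + 1):
        state, ok = {}, True
        for node in nodes:                       # nodes in priority order
            candidates = []
            for c in clusters:                   # try every cluster
                new = schedule_in(node, c, ii, state)
                if new is not None:              # feasible partial schedule
                    candidates.append(improve(new))  # spill/comm transforms
            if not candidates:                   # no feasible state: retry
                ok = False                       # with a larger II
                break
            state = min(candidates, key=merit)   # best state by the merit
        if ok:
            return ii, state
    return None, None
```

With a toy `schedule_in` that caps each cluster at II operations, three operations on two clusters only fit once II reaches 2, mirroring how the II grows when no feasible state exists.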

Figure of Merit (URACAM)
- Used to choose the best alternative for every partial schedule
- A single criterion to evaluate a schedule, measuring the utilization of the most critical resources
- Underlying concepts:
  - Scarce resources are more valuable than abundant ones
  - Maximize the availability of the most heavily used resources
- Represented as a set of 2N+1 percentages (N = number of clusters): register usage per cluster, memory usage per cluster, and communication (bus) usage

Using the Figure of Merit (URACAM)
- Comparing two new states:
  - Compute the percentage of remaining resource usage
  - Compare from the highest to the lowest percentages
- In transformations, the Figure of Merit gives:
  - The best candidate
  - The benefit of the transformation
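The highest-to-lowest comparison described above amounts to a lexicographic comparison of the usage percentages sorted in descending order; a small sketch, assuming each state's Figure of Merit is available as a plain list of percentages:

```python
def better(usage_a, usage_b):
    # Lexicographic comparison from the most-stressed resource down:
    # the state whose worst percentage is lower wins; on a tie, the
    # comparison moves on to the next-worst percentage.
    return sorted(usage_a, reverse=True) < sorted(usage_b, reverse=True)
```

So a state whose worst resource sits at 25% beats one with a 50% hotspot, even if the latter is idle everywhere else.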

An Example (URACAM)
Setup:
- DDG: four operations A, B, C and D (one edge with latency 2); all are non-memory operations with latency 1
- Machine: 2 clusters; per cluster, 2 general-purpose FUs, 2 memory ports (latency 1) and 8 registers; 2 buses with latency 1; unified mII = 2 cycles
- With II = 2, the available slots per II window are: 4 bus slots, and per cluster, 4 memory slots and 16 register-cycles

Scheduling steps (each percentage is the share of a resource's slots that a decision consumes):
- D is scheduled first at no cost (0% in either cluster)
- B follows, consuming 6.25% of a cluster's register-cycles (1 of 16)
- A: placing it in the other cluster requires a communication, costing 25% of the bus slots against a 6.25% register cost; the alternatives with larger lifetimes reach 50% bus usage and 13.33% register usage, and all candidates are compared through the Figure of Merit
- C: the direct schedule drives register usage up to 83.33%; "try to improve" considers spill code (register usage drops to 50% at an 8.33% memory cost) and a communication through memory (two memory operations at 25% each instead of a bus transfer)
- Final schedule: the operations are distributed over both clusters, with the inserted St (store), Ld (load) and Com (bus communication) operations visible in the resulting II = 2 schedule
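The percentages in the example follow from slot counts over one II window (II = 2 here): 8 registers per cluster give 16 register-cycles, so one live value for one cycle costs 6.25%, and 2 buses give 4 transfer slots, so one communication costs 25%. A quick arithmetic check (the helper function is ours, not from the paper):

```python
def usage_pct(used_slots, units, ii):
    # Fraction of a resource's slots consumed over one II window:
    # each unit contributes one slot per cycle of the II.
    return 100.0 * used_slots / (units * ii)

# Assumed from the example: II = 2, 8 registers per cluster, 2 buses.
print(usage_pct(1, 8, 2))   # one register-cycle out of 16 -> 6.25
print(usage_pct(1, 2, 2))   # one bus transfer out of 4 slots -> 25.0
```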

Memory operations (URACAM)
- Additional memory operations come from:
  - Spill code
  - Communications through memory
- These may prevent memory operations from the original DDG from being scheduled
- Solution: differentiate memory pressure in the Figure of Merit
  - Global: original memory operations
  - Local: added (spill/communication) memory operations
- The Figure of Merit then has 2N+2 percentages (N = number of clusters): registers per cluster, local memory per cluster, global memory, and communications


Evaluation (URACAM)
- Evaluated using SPECfp95
- Using graphs generated by the ICTINEO compiler

Configuration (PERFORMANCE EVALUATION)

Latencies      INT   FP
MEM             2     2
ARITH/ABS       1     3
MUL             2     6
DIV/SQR/TRG     6    18

Resources     Unified  2-cluster  4-cluster
INT/cluster      4         2          1
FP/cluster       4         2          1
MEM/cluster      4         2          1
REGS/cluster   64/32     32/16      16/8

              2-cluster  4-cluster
Comm Buses      1/4        1/4
Bus Latency      1          1

IPC - 64 registers (PERFORMANCE EVALUATION) [figure]


IPC - 32 registers (PERFORMANCE EVALUATION) [figure]


URACAM Performance – 1 bus (PERFORMANCE EVALUATION) [figure; 64 registers]

URACAM Performance – 4 buses (PERFORMANCE EVALUATION) [figure; 64 registers]


Conclusions
- URACAM handles communications, memory pressure and registers at the same time
  - It searches for the best overall solution
- Figure of Merit: a single criterion to compare partial schedules
- Transformations to improve partial schedules:
  - Spill code to reduce register pressure
  - Communications through memory to reduce bus pressure
  - Communications through the bus to reduce memory pressure
  - Undoing spill code to reduce memory pressure
- Spill code for a clustered VLIW architecture, generated during scheduling

Conclusions
- URACAM achieves better schedules than previous work on modulo scheduling for clustered VLIW architectures
  - Speedup of 18% for 2 clusters and 22% for 4 clusters [for 1 inter-cluster bus with 1-cycle latency and 32 registers]
- Degradation with respect to the non-clustered architecture:
  - 3% for 2 clusters and 10% for 4 clusters [for 4 inter-cluster buses with 1-cycle latency and 32 registers]
- URACAM is an adaptive and powerful technique
  - Figure of Merit
  - Transformations
