1-XII-98 Micro-31 1 Widening Resources: A Cost-effective Technique for Aggressive ILP Architectures
David López, Josep Llosa, Mateo Valero and Eduard Ayguadé
Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya

1-XII-98 Micro-31 2 Goals
Modify resources to exploit ILP in
– VLIW architectures
– numerical code
– innermost loops
A study of performance and cost
Technological projection

1-XII-98 Micro-31 3 Outline
Replication and widening
Performance
– maximum ILP achievable
– effects of spill code
Design considerations
Performance under a technological limit
Conclusions

1-XII-98 Micro-31 4 Basic architecture
1 bus between the register file (RF) and the first-level cache
1 general-purpose floating-point functional unit (FPU)
[Figure: FPU, register file and bus]

1-XII-98 Micro-31 5 Basic architecture
2 operations can be issued per cycle:
– 1 memory
– 1 FPU
[Figure: FPU, register file and bus; VLIW instruction with memory, FPU and other fields]

1-XII-98 Micro-31 6 Replication
2 buses + 2 FPUs
4 operations can be issued per cycle:
– 2 memory (independent)
– 2 FPU (independent)
[Figure: two FPUs, register file and two buses; VLIW instruction with memory, FPU and other fields]

1-XII-98 Micro-31 7 An alternative: widening
The bus, the FPU and the RF are widened
4 operations can be issued per cycle:
– 2 memory (in consecutive memory addresses)
– 2 FPU (the same operation)
Less versatile than replication
[Figure: widened bus, FPU and register file; VLIW instruction with memory, FPU and other fields]

1-XII-98 Micro-31 8 Software pipelined loops
Loop performance is limited by
– recurrences
– resources
Software pipelining overlaps the execution of several consecutive iterations
With a perfect schedule, at least one resource is occupied 100% of the time (unless the loop is recurrence-bound)

1-XII-98 Micro-31 9 How does widening work?
3 memory operations: loads A and B, and store C
A and B have stride 1 with themselves in the next iteration
3 floating-point operations
+ has a recurrence with itself in the next iteration
Let’s assume a latency of 2 cycles
[Figure: dependence graph of the loop with nodes A, B, D, *, *, + and C]

1-XII-98 Micro-31 10 Execution of several iterations
[Figure: schedule of Load A, Load B, *, *, + and Store C for iterations 1 to 4]
1 bus + 1 FPU: 3 cycles / iteration
2 buses + 2 FPUs: 1.5 cycles / iteration, but the recurrence requires 2 cycles, so 2 cycles / iteration
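The bound on this slide can be reproduced with a small sketch. It assumes, as the slides state, a loop with 3 memory operations and 3 floating-point operations and a recurrence of latency 2 that spans one iteration; the function name cycles_per_iteration is hypothetical and not taken from the presentation.

```python
from fractions import Fraction

def cycles_per_iteration(mem_ops, fp_ops, buses, fpus, recurrence_latency):
    """Lower bound on the cycles per iteration of a software pipelined loop.

    The resource bound is set by the busiest resource class; the recurrence
    bound is the latency of the loop-carried dependence (distance 1 assumed).
    """
    resource_bound = max(Fraction(mem_ops, buses), Fraction(fp_ops, fpus))
    return max(resource_bound, Fraction(recurrence_latency))

basic = cycles_per_iteration(3, 3, buses=1, fpus=1, recurrence_latency=2)
replicated = cycles_per_iteration(3, 3, buses=2, fpus=2, recurrence_latency=2)
print(basic, replicated)  # 3 2
```

With 2 buses and 2 FPUs the resource bound alone would be 1.5 cycles, so the recurrence is what keeps the loop at 2 cycles per iteration.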

1-XII-98 Micro-31 11 “Compactable” operations (width 2)
[Figure: Load A, Load B, *, *, + and Store C for iterations 1 to 4, each operation marked as compactable or not]
Reason for each operation (López et al. ICS97): no dependency and stride 1; dependency; no dependency, but no stride 1; no dependency
2 cycles / iteration
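A minimal sketch of how compaction changes the resource bound for width 2. Which operation gets which flag is our reading of this slide and of the later "Compacting" slide (loads A and B and the multiplications are compacted; the addition and the store are not); the data layout and the function name are hypothetical.

```python
from fractions import Fraction

# (name, resource class, compactable?) -- flags follow our reading of the slides
LOOP = [
    ("load A",  "mem", True),   # no dependency and stride 1
    ("load B",  "mem", True),   # no dependency and stride 1
    ("mul 1",   "fp",  True),   # no dependency
    ("mul 2",   "fp",  True),   # no dependency
    ("add",     "fp",  False),  # recurrence with itself
    ("store C", "mem", False),  # no dependency, but not stride 1
]

def widened_cycles_per_iteration(ops, units, width, recurrence_latency):
    """Cycles per iteration when compactable operations share one wide slot."""
    slots = {"mem": Fraction(0), "fp": Fraction(0)}
    for _name, kind, compactable in ops:
        # A wide operation serves `width` iterations with a single issue slot.
        slots[kind] += Fraction(1, width) if compactable else Fraction(1)
    resource_bound = max(slots[kind] / units[kind] for kind in slots)
    return max(resource_bound, Fraction(recurrence_latency))

result = widened_cycles_per_iteration(
    LOOP, {"mem": 1, "fp": 1}, width=2, recurrence_latency=2)
print(result)  # 2
```

Under these assumptions a single wide bus and a single wide FPU reach the same 2 cycles per iteration as 2 buses and 2 FPUs.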

1-XII-98 Micro-31 12 Limits on ILP
Baseline configuration (1w1): 1 bus and 2 FPUs
Configurations XwY:
– X: degree of replication: X buses, 2*X FPUs
– Y: degree of widening (width of the resources)
Characteristics of the architecture:
– a store is served in 1 cycle; division (19 cycles) and SQRT (27 cycles) are not pipelined
– the remaining operations are fully pipelined with a latency of 4 cycles

1-XII-98 Micro-31 13 Performance: Replication
Workbench: 1180 loops that account for 78% of the execution time of the Perfect Club

1-XII-98 Micro-31 14 Performance: R vs Widening

1-XII-98 Micro-31 15 Performance: R vs W vs Combined

1-XII-98 Micro-31 16 Scheduling and register assignment
Loops have been software pipelined using HRMS (Llosa et al. MICRO-28), a register-pressure-sensitive heuristic.
Register allocation has been performed using the wands-only strategy and end-fit with adjacency ordering (Rau et al. PLDI-92).
When a loop requires more registers than are available, spill code is added.

1-XII-98 Micro-31 17 Register pressure
Reducing the cycles required per iteration can increase the register requirements.
Widening is also applied to the register file
– more storage capacity (and less register pressure)
– not cheating! If there are no compactable operations, we do not benefit from this additional capacity

1-XII-98 Micro-31 18 Effects of adding spill code
Baseline: 1w1 with a 256-register RF

1-XII-98 Micro-31 19 Area cost
Cost of the FPU: widening and replication have the same cost
The area of the RF grows as the square of the number of ports

1-XII-98 Micro-31 20 Register file access time
Based on the CACTI model for cache memories (Wilton & Jouppi, J. of Solid-State Circuits 96).
Normalised to configuration 1w1 with a 32-register RF and a technology of λ=0.05.
Widening the RF is cheaper than adding ports
Increasing the number of registers is cheaper than adding ports

1-XII-98 Micro-31 21 Effect of the RF size Configuration 1w1

1-XII-98 Micro-31 22 Effect of the studied techniques

1-XII-98 Micro-31 23 Cost of widening and replication
Area:
– replication: quadratic increase
– widening: linear increase
Cycle time:
– replication increases the cycle time more than widening does
– the RF can be partitioned into several copies, reducing the access time but increasing the area
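The area trend can be illustrated with a deliberately simple model. It assumes that the number of RF ports scales linearly with the replication degree, that the register width scales linearly with the widening degree, and that the number of registers is fixed; these are assumptions made for the sketch, and the function relative_rf_area is hypothetical rather than the cost model used in the study.

```python
def relative_rf_area(replication, widening):
    """RF area relative to 1w1 under the simple assumptions stated above.

    Area grows with the square of the port count (replication) and
    linearly with the register width (widening).
    """
    return replication ** 2 * widening

for x, y in [(1, 1), (2, 1), (1, 2), (4, 1), (1, 4), (2, 2)]:
    print(f"{x}w{y}: {relative_rf_area(x, y)}x the area of 1w1")
```

In this model 4w1 costs 16 times the baseline area while 1w4 costs 4 times, which is the quadratic versus linear contrast the slide describes.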

1-XII-98 Micro-31 24 Performance/cost trade-off
Configurations XwY(Z:n) where:
– X is the replication degree
– Y is the widening degree
– Z is the RF size (32, 64, 128 or 256)
– n is the number of blocks into which the RF has been partitioned
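For readers who want to manipulate these labels programmatically, here is a small sketch that parses the XwY(Z:n) notation. The Config type, the parse_config function and the example label "2w4(128:2)" are hypothetical; only the meaning of the four fields comes from the slide.

```python
import re
from typing import NamedTuple

class Config(NamedTuple):
    replication: int  # X
    widening: int     # Y
    rf_size: int      # Z, number of registers
    rf_blocks: int    # n, number of RF partitions

_PATTERN = re.compile(r"(\d+)w(\d+)\((\d+):(\d+)\)")

def parse_config(text):
    """Parse a label such as '2w4(128:2)' into its four fields."""
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an XwY(Z:n) label: {text!r}")
    config = Config(*(int(group) for group in match.groups()))
    if config.rf_size not in (32, 64, 128, 256):
        raise ValueError(f"RF size not among those studied: {config.rf_size}")
    return config

print(parse_config("2w4(128:2)"))
# Config(replication=2, widening=4, rf_size=128, rf_blocks=2)
```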

1-XII-98 Micro-31 25 Configurations that can be implemented
We use the SIA (Semiconductor Industry Association) predictions
FPU + RF area cost must be smaller than 20% of the total chip area available
Year    λ (μm)    Size (mm²)    λ² per chip (×10⁶)
1998    0.25      300           4,800
2001    0.18      360           11,111
2004    0.13      430           25,443
2007    0.10      520           52,000
2010    0.07      620           126,530
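The last column of the SIA table appears to be the chip area expressed in units of λ². A quick check under that assumption, using integer arithmetic and truncating to whole millions, reproduces the figures; the 20% share is the FPU + RF budget stated on the slide. The function name is hypothetical.

```python
# (year, feature size in nanometres, chip size in mm²), taken from the slide
SIA = [
    (1998, 250, 300),
    (2001, 180, 360),
    (2004, 130, 430),
    (2007, 100, 520),
    (2010, 70, 620),
]

def lambda_squares_millions(feature_nm, size_mm2):
    """Chip area in millions of λ² (1 mm² = 10**12 nm²), truncated."""
    return size_mm2 * 10**6 // feature_nm**2

budgets = []
for year, feature_nm, size_mm2 in SIA:
    total = lambda_squares_millions(feature_nm, size_mm2)
    budgets.append(total)
    # total area and the 20% share available to FPU + RF, both in millions of λ²
    print(year, total, total // 5)
```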

1-XII-98 Micro-31 26 Implementable configurations (λ=0.25)

1-XII-98 Micro-31 31 FPU latency
We compare configurations adapting the latency of the FPU to the processor cycle time
A configuration with a relative cycle time Tc belongs to the z-cycles model where z = ⌈4/Tc⌉
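A minimal sketch of this mapping, assuming the quotient is rounded up to a whole number of cycles so that the FPU never gets less time than in the baseline. The name z_cycles is hypothetical, and exact fractions are used to avoid floating-point rounding near integer boundaries.

```python
import math
from fractions import Fraction

BASE_LATENCY = 4  # cycles: latency of the fully pipelined operations in the baseline

def z_cycles(relative_cycle_time):
    """FPU latency in cycles for a cycle time given relative to the baseline.

    Assumes z = ceil(4 / Tc). Accepts ints, Fractions or decimal strings.
    """
    tc = Fraction(relative_cycle_time)
    return math.ceil(Fraction(BASE_LATENCY) / tc)

for tc in ["1", "1.25", "1.5", "2"]:
    print(tc, z_cycles(tc))
```

A slower clock (larger Tc) therefore places a configuration in a model with fewer cycles of FPU latency.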

1-XII-98 Micro-31 32 Effect of the RF size
The same configuration, changing the RF size
A big RF needs less spill code, but pays a large penalty in access time

1-XII-98 Micro-31 33 Effect of the studied techniques
[Figure panels: only replication; only widening]

1-XII-98 Micro-31 34 XwY where X*Y=8
The same RF size and peak performance
Combining small degrees of replication and widening results in the best performance
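The family of configurations compared here can be enumerated with a short sketch. It relies on the earlier definition of XwY (X buses and 2*X FPUs, all of width Y) to compute the peak number of operations per cycle; the helper names are hypothetical.

```python
def configs_with_product(product):
    """All (X, Y) pairs with X * Y == product."""
    return [(x, product // x) for x in range(1, product + 1) if product % x == 0]

def peak_ops_per_cycle(x, y):
    """Peak memory and floating-point operations per cycle for XwY."""
    return {"mem": x * y, "fp": 2 * x * y}

labels = [f"{x}w{y}" for x, y in configs_with_product(8)]
print(labels)  # ['1w8', '2w4', '4w2', '8w1']
print(peak_ops_per_cycle(2, 4))  # {'mem': 8, 'fp': 16}
```

All four configurations share the same peak rate; they differ in how much of it depends on finding compactable operations and in how many RF ports they need.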

1-XII-98 Micro-31 35 Top five configurations (i)
The five configurations that achieve the best performance for λ=0.18 are shown.
Blue ones: the ones with the best performance/cost
In all the technology generations, the best ones use widening
[Figure: top five configurations for λ=0.18]

1-XII-98 Micro-31 36 Top five configurations (ii)
[Figure panels: λ=0.13; λ=0.10]

1-XII-98 Micro-31 37 Conclusions
Study of two techniques to extract ILP: replication and widening
Study of aggressive configurations in optimal conditions:
– replication achieves the best performance
– widening costs less
Study of the cost of both techniques

1-XII-98 Micro-31 38 Conclusions
Applying small degrees of replication and widening results in the best performance under a technology limit
– widening has more storage capacity, hence less spill code
– replication has larger area requirements, so some configurations cannot be implemented
– RF access time is shorter using widening than using replication

1-XII-98 Micro-31 39 RF area cost
[Figure: register cell with read data line, write data line, write select line and read select line]

1-XII-98 Micro-31 40 Unrolling the loop
[Figure: dependence graph of the original loop (A, B, D, *, *, +, C) and of the loop unrolled twice (A0, A1, B0, B1, *0, *1, +0, +1, C0, C1, D)]

1-XII-98 Micro-31 41 Compacting
[Figure: the unrolled graph and its compacted form, with wide operations A0,1, B0,1 and *0,1 and separate +0, +1, C0 and C1]

1-XII-98 Micro-31 42 Execution of a compacted loop
[Figure: compacted graph (A0,1, B0,1, *0,1, +0, +1, C0, C1, D) mapped onto the widened FPU, register file and bus]

1-XII-98 Micro-31 43 Limits
A loop is bounded by recurrences and resources.
Assume the basic architecture (1 bus and 1 FPU) with a latency of 2 cycles
[Figure: unrolled dependence graph]

1-XII-98 Micro-31 44 Limits: resources and recurrences
[Figure: unrolled dependence graph]

1-XII-98 Micro-31 45 Reducing the resource limits
[Figure: unrolled dependence graph]

1-XII-98 Micro-31 46 Effect of replication and widening
1w1: 3 cycles/it
2w1: 2 cycles/it
1w2: 2 cycles/it

1-XII-98 Micro-31 47 Taxonomy of loops
[Figure: loops classified by recurrences and by compactability (compactable, non-compactable, don’t care)]

1-XII-98 Micro-31 48 Top five configurations
[Figure: top five configurations for λ=0.25]

