Widening Resources: A Cost-effective Technique for Aggressive ILP Architectures. David López, Josep Llosa, Mateo Valero and Eduard Ayguadé. Micro-31, 1-XII-98.


1 Widening Resources: A Cost-effective Technique for Aggressive ILP Architectures
David López, Josep Llosa, Mateo Valero and Eduard Ayguadé
Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya
Micro-31, 1-XII-98

2 Goals
Modify resources to exploit ILP in
  – VLIW architectures
  – numerical code
  – innermost loops
A study of performance and cost
Technological projection

3 Outline
Replication and widening
Performance
  – maximum ILP achievable
  – effects of spill code
Design considerations
Performance under a technological limit
Conclusions

4 Basic architecture
1 bus between the register file (RF) and the first-level cache
1 general-purpose floating-point functional unit (FPU)
[Figure: FPU, register file and bus]

5 Basic architecture
2 operations can be issued per cycle:
  – 1 memory
  – 1 FPU
[Figure: FPU, register file and bus, with a VLIW instruction holding memory, FPU and other fields]

6 Replication
2 buses + 2 FPUs
4 operations can be issued per cycle:
  – 2 memory (independent)
  – 2 FPU (independent)
[Figure: FPUs, register file and buses, with a VLIW instruction holding memory, FPU and other fields]

7 An alternative: widening
The bus, the FPU and the RF are widened
4 operations can be issued per cycle:
  – 2 memory (at consecutive memory addresses)
  – 2 FPU (the same operation)
Less versatile
[Figure: widened bus, FPU and register file, with a VLIW instruction holding memory, FPU and other fields]

8 Software pipelined loops
Loop performance is limited by
  – recurrences
  – resources
Software pipelining overlaps the execution of several consecutive iterations
With a perfect schedule, at least one resource is 100% occupied (unless the loop is recurrence-bound)

9 How does widening work?
3 memory operations: loads of A and B, and a store of C
A and B have stride 1 with themselves in the next iteration
3 floating-point operations
+ has a recurrence with itself in the next iteration; let’s assume a latency of 2 cycles
[Figure: dependence graph of the loop body with nodes A, B, *, *, +, C and D]

10 Execution of several iterations
[Figure: schedule of Load A, Load B, *, *, + and Store C for iterations 1 to 4]
1 bus + 1 FPU: 3 cycles/iteration
2 buses + 2 FPUs: 1.5 cycles/iteration, but 2 cycles are required due to the recurrence, so 2 cycles/iteration
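The two figures on slide 10 follow from taking the larger of a resource bound and a recurrence bound. The sketch below reproduces that arithmetic; it is not the authors' scheduler. The function names are hypothetical, and it assumes that memory operations use only buses and floating-point operations use only FPUs.

```python
def resource_bound(mem_ops, fp_ops, buses, fpus):
    """Cycles per iteration imposed by the busiest resource class."""
    return max(mem_ops / buses, fp_ops / fpus)


def recurrence_bound(latency, distance):
    """Cycles per iteration imposed by a recurrence.

    `latency` is the total latency around the dependence cycle and
    `distance` is the number of iterations the dependence spans.
    """
    return latency / distance


def cycles_per_iteration(mem_ops, fp_ops, buses, fpus, latency, distance):
    """Lower bound on cycles per iteration: the stricter of the two limits."""
    return max(
        resource_bound(mem_ops, fp_ops, buses, fpus),
        recurrence_bound(latency, distance),
    )


# Example loop from the slides: 3 memory operations, 3 FP operations,
# and an addition that feeds itself in the next iteration (latency 2).
for buses, fpus in [(1, 1), (2, 2)]:
    bound = cycles_per_iteration(3, 3, buses, fpus, latency=2, distance=1)
    print(f"{buses} bus(es) + {fpus} FPU(s): {bound} cycles/iteration")
```

With 2 buses and 2 FPUs the resource bound alone is 1.5 cycles per iteration, but the recurrence raises the result to 2, as the slide states.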

11 “Compactable” operations (width 2)
[Figure: Load A, Load B, *, *, + and Store C for iterations 1 to 4, each annotated with a reason (López et al. ICS97)]
Reasons shown:
  – no dependency and stride 1
  – dependency
  – no dependency, but no stride 1
  – no dependency
2 cycles/iteration

12 Limits on ILP
Baseline configuration (1w1): 1 bus and 2 FPUs
Configurations XwY:
  – X: degree of replication (X buses, 2*X FPUs)
  – Y: degree of widening (width of the resources)
Characteristics of the architecture:
  – a store is served in 1 cycle; division (19 cycles) and SQRT (27 cycles) are not pipelined
  – the remaining operations are fully pipelined with a latency of 4 cycles

13 Performance: Replication
Workbench: 1180 loops that account for 78% of the execution time of the Perfect Club

14 Performance: Replication vs Widening

15 Performance: Replication vs Widening vs Combined

16 Scheduling and register assignment
Loops have been software pipelined using HRMS (Llosa et al. MICRO-28), a register-pressure-sensitive heuristic.
Register allocation has been performed using the wands-only strategy and end-fit with adjacency ordering (Rau et al. PLDI-92).
When a loop requires more registers than are available, spill code is added.

17 Register pressure
Reducing the cycles required per iteration can increase the register requirements.
Widening is also applied to the register file
  – more storage capacity (and less register pressure)
  – not cheating! If there are no compactable operations, we do not benefit from this additional capacity

18 Effects of adding spill code
Baseline: 1w1 with a 256-register RF

19 Area cost
Cost of the FPU: widening and replication have the same cost
The area of the RF grows as the square of the number of ports

20 Register file access time
Based on the CACTI model (Wilton & Jouppi, J. of Solid-State Circuits 96) for cache memory.
Normalised to configuration 1w1 with a 32-register RF and a technology of λ=0.05.
Widening the RF is cheaper than adding ports
Increasing the number of registers is cheaper than adding ports

21 Effect of the RF size
Configuration 1w1

22 Effect of the studied techniques

23 Cost of widening and replication
Area:
  – replication: quadratic increase
  – widening: linear increase
Cycle time:
  – replication increases the cycle time more than widening does
  – the RF can be partitioned into several copies, reducing the access time but increasing the area
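The quadratic and linear growth can be illustrated with a first-order model. This is only a sketch under stated assumptions: the number of RF ports scales linearly with the replication degree, the register width scales linearly with the widening degree, the number of registers is held constant, and RF partitioning is ignored. The function name `relative_rf_area` is hypothetical.

```python
def relative_rf_area(replication, widening):
    """RF area relative to 1w1 under a first-order model.

    Assumptions (not taken from the slides' detailed cost model):
    ports scale with `replication`, so area grows with its square;
    bits per register scale with `widening`, so area grows linearly.
    """
    return replication ** 2 * widening


# Configurations with the same peak performance (X * Y = 8).
for x, y in [(8, 1), (4, 2), (2, 4), (1, 8)]:
    print(f"{x}w{y}: {relative_rf_area(x, y)}x the 1w1 RF area")
```

Under these assumptions, trading replication for widening at constant X*Y shrinks the modelled RF area by a factor of two at each step.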

24 Performance/cost trade-off
Configurations XwY(Z:n), where:
  – X is the replication degree
  – Y is the widening degree
  – Z is the RF size (32, 64, 128 or 256)
  – n is the number of blocks into which the RF has been partitioned
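As an illustration of the notation, the sketch below parses a configuration string and derives the resources it implies, using the rule from slide 12 (X buses, 2*X FPUs, width Y). The `Config` class, `parse_config` function and the example string `2w4(128:2)` are hypothetical and chosen only for illustration. Counting a wide operation as Y operations is an assumption that follows the way slide 7 counts issue slots.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Config:
    replication: int                 # X
    widening: int                    # Y
    rf_size: Optional[int] = None    # Z, number of registers
    rf_blocks: Optional[int] = None  # n, number of RF partitions

    @property
    def buses(self) -> int:
        return self.replication

    @property
    def fpus(self) -> int:
        return 2 * self.replication

    @property
    def peak_ops_per_cycle(self) -> int:
        # Each unit issues one operation of width Y per cycle.
        return (self.buses + self.fpus) * self.widening


_PATTERN = re.compile(r"^(\d+)w(\d+)(?:\((\d+):(\d+)\))?$")


def parse_config(name: str) -> Config:
    """Parse 'XwY' or 'XwY(Z:n)' into a Config."""
    match = _PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"not an XwY(Z:n) configuration: {name!r}")
    x, y, z, n = match.groups()
    return Config(int(x), int(y),
                  int(z) if z else None,
                  int(n) if n else None)


cfg = parse_config("2w4(128:2)")
print(cfg.buses, cfg.fpus, cfg.peak_ops_per_cycle)
```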

25 Configurations that can be implemented
We use the SIA (Semiconductor Industry Association) predictions
The FPU + RF area cost must be smaller than 20% of the total chip area available

Year   λ (µm)   Size (mm²)   λ² per chip (×10^6)
1998   0.25     300            4,800
2001   0.18     360           11,111
2004   0.13     430           25,443
2007   0.10     520           52,000
2010   0.07     620          126,530
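The last column of the SIA table is the die area expressed in units of λ², and the 20% rule turns it into an area budget for the FPU and RF. The sketch below recomputes both; the names `SIA_ROADMAP`, `lambda_squares_per_chip` and `fpu_rf_budget` are hypothetical. The recomputed totals agree with the table to within rounding.

```python
# SIA projections as listed on slide 25:
# year -> (feature size lambda in micrometres, die size in mm^2)
SIA_ROADMAP = {
    1998: (0.25, 300),
    2001: (0.18, 360),
    2004: (0.13, 430),
    2007: (0.10, 520),
    2010: (0.07, 620),
}

AREA_BUDGET_FRACTION = 0.20  # FPU + RF must stay below 20% of the chip


def lambda_squares_per_chip(feature_um, die_mm2):
    """Die area in millions of lambda^2.

    1 mm^2 equals 1e6 um^2, so dividing mm^2 by um^2 already
    yields the result in millions.
    """
    return die_mm2 / feature_um ** 2


def fpu_rf_budget(feature_um, die_mm2):
    """Area available to FPU + RF, in millions of lambda^2."""
    return AREA_BUDGET_FRACTION * lambda_squares_per_chip(feature_um, die_mm2)


for year, (feature_um, die_mm2) in SIA_ROADMAP.items():
    total = lambda_squares_per_chip(feature_um, die_mm2)
    budget = fpu_rf_budget(feature_um, die_mm2)
    print(f"{year}: total {total:,.0f}, FPU + RF budget {budget:,.0f} "
          f"(x10^6 lambda^2)")
```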

26 Implementable configurations (λ=0.25)

27 Implementable configurations (λ=0.18)

28 Implementable configurations (λ=0.13)

29 Implementable configurations (λ=0.10)

30 Implementable configurations (λ=0.07)

31 FPU latency
We compare configurations by adapting the latency of the FPU to the processor cycle time
A configuration with a relative cycle time Tc belongs to the z-cycles model, where z = ⌈4/Tc⌉
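A small sketch of the z-cycles model follows. It reads the formula as a ceiling, which is an assumption: a latency in cycles must be an integer large enough to cover the same amount of time as 4 baseline cycles. The name `fpu_latency_cycles` is hypothetical, and the base latency of 4 cycles is taken from slide 12.

```python
import math

BASE_FPU_LATENCY = 4  # cycles at relative cycle time 1.0 (slide 12)


def fpu_latency_cycles(relative_cycle_time):
    """FPU latency z, in cycles, for a relative cycle time Tc.

    Assumes z = ceil(4 / Tc): a slower clock (Tc > 1) needs fewer
    cycles to cover the same absolute FPU delay.
    """
    if relative_cycle_time <= 0:
        raise ValueError("relative cycle time must be positive")
    return math.ceil(BASE_FPU_LATENCY / relative_cycle_time)


for tc in (1.0, 1.5, 2.0, 4.0):
    print(f"Tc = {tc}: {fpu_latency_cycles(tc)}-cycles model")
```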

32 Effect of the RF size
The same configuration, changing the RF size
A big RF needs less spill code, but pays a large penalty in access time

33 Effect of the studied techniques
[Figure panels: only replication; only widening]

34 XwY where X*Y=8
The same RF size and peak performance
Combining small degrees of replication and widening results in the best performance

35 Top five configurations (i)
The five configurations that achieve the best performance for λ=0.18 are shown.
Blue ones: those with the best performance/cost
In all the technology generations, the best ones use widening
[Figure: λ = 0.18]

36 Top five configurations (ii)
[Figure panels: λ = 0.13; λ = 0.10]

37 Conclusions
Study of two techniques to extract ILP: replication and widening
Study of aggressive configurations in optimal conditions:
  – replication achieves the best performance
  – widening costs less
Study of the cost of both techniques

38 Conclusions
Applying small degrees of replication and widening results in the best performance under a technology limit
  – widening has more storage capacity, hence less spill code
  – replication has larger area requirements, so some configurations cannot be implemented
  – RF access time is shorter using widening than using replication

39 RF area cost
[Figure: register cell with read data line, write data line, write select line and read select line]

40 Unrolling the loop
[Figure: the loop's dependence graph and its two-iteration unrolled form, with nodes A0, B0, *0, *0, +0, C0 and A1, B1, *1, *1, +1, C1, plus D]

41 Compacting
[Figure: the unrolled graph and its compacted form, with wide nodes A0,1, B0,1 and *0,1 and separate nodes +0, +1, C0 and C1]

42 Execution of a compacted loop
[Figure: the compacted graph mapped onto the widened FPU, register file and bus]

43 Limits
A loop is bounded by recurrences and resources.
Assume the basic architecture (1 bus and 1 FPU) with a latency of 2 cycles
[Figure: unrolled dependence graph with nodes A0, B0, *0, *0, +0, C0 and A1, B1, *1, *1, +1, C1, plus D]

44 Limits: resources and recurrences
[Figure: unrolled dependence graph]

45 Reducing the resource limits
[Figure: unrolled dependence graph]

46 Effect of replication and widening
1w1: 3 cycles/iteration
2w1: 2 cycles/iteration
1w2: 2 cycles/iteration
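The three figures on slide 46 can be reproduced with a simple model in which a compactable operation occupies one issue slot per Y iterations and a non-compactable one occupies a slot in every iteration. This is a sketch, not the authors' method. The `Op` class and function names are hypothetical. The compactability flags are a reading of slides 11 and 41 (loads and multiplications compactable; the addition and the store not). Following slide 43, this example uses 1 bus and 1 FPU per replication degree rather than the 2*X FPUs of slide 12.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str          # "bus" or "fpu"
    compactable: bool  # can Y consecutive iterations share one wide op?


# Example loop; flags are an interpretation of slides 11 and 41.
LOOP = [
    Op("load A", "bus", True),
    Op("load B", "bus", True),
    Op("store C", "bus", False),  # no stride 1
    Op("mul 1", "fpu", True),
    Op("mul 2", "fpu", True),
    Op("add", "fpu", False),      # recurrence with itself
]


def slots_per_iteration(ops, unit, width):
    """Issue slots one iteration needs on the given unit type."""
    return sum((1 / width if op.compactable else 1)
               for op in ops if op.unit == unit)


def cycles_per_iteration(ops, buses, fpus, width,
                         rec_latency=2, rec_distance=1):
    """Stricter of the resource bounds and the recurrence bound."""
    bus_bound = slots_per_iteration(ops, "bus", width) / buses
    fpu_bound = slots_per_iteration(ops, "fpu", width) / fpus
    return max(bus_bound, fpu_bound, rec_latency / rec_distance)


for name, (x, y) in {"1w1": (1, 1), "2w1": (2, 1), "1w2": (1, 2)}.items():
    print(name, cycles_per_iteration(LOOP, buses=x, fpus=x, width=y))
```

In this model 2w1 is limited by the recurrence, while 1w2 is limited equally by the recurrence and by the operations that could not be compacted.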

47 Taxonomy of loops
[Figure labels: recurrences; compactable; non-compactable; don’t care]

48 Top five configurations
[Figure: λ = 0.25]

49 007

