Application Domains for Fixed-Length Block Structured Architectures. ACSAC-2001, Gold Coast, January 30, 2001.


1 Application Domains for Fixed-Length Block Structured Architectures
ACSAC-2001, Gold Coast, January 30, 2001

2 Outline
– Introduction
– Block Structured Architecture
– Methodology
– Results
– Conclusions

3 Introduction
Out-of-order architecture
– dynamically schedules independent instructions
Higher ILP through
– a more powerful processor core
– fast instruction delivery
But … this increases the hardware complexity significantly!

4 Hardware complexity
[Figure: sources of hardware complexity]
Processor core: instruction window, bypass logic, register file
Fetching: fetch bandwidth
Figure labels: O(n²), long wires, many ports, multiple branches, cache access
[Palacharla et al. 1996] [Farkas et al. 1995]

5 Solutions
Processor core: decentralization
– trace processor [Rotenberg et al. '97]
– multiscalar architecture [Sohi et al. '95]
– clusters (Alpha 21264)
Fetching: bigger units of work
– trace in trace processors
– task in multiscalar architecture
– block in block-structured ISA [Melvin and Patt '95; Hao et al. '96]

6 Basic idea of BSA
Fixed-Length Block Structured Architecture (BSA) addresses
– the processor core problem
– the fetching problem
by appropriate microarchitectural and implementation design decisions.
BSA is a feasible architectural paradigm for future processors.

7 Block Structured Architecture: overcoming the fetch problem
BSA-block is the atomic unit of work
– no control flow
– predication
– static register renaming
– data-flow execution
– fixed-length
Advantages:
– predication: elimination of unbiased branches
– intra-block communication: fewer register file ports required
– fixed-length BSA-blocks: easier fetching
Disadvantages:
– BSA-block not always filled
– higher memory bandwidths
– bigger instruction caches
– BSA-block compression
[Figure: basic blocks predicated on (p1), (~p1), (p2) and (~p2)]

8 Block Structured Architecture: overcoming the processor core problem
[Figure: block diagram. Components: instruction cache, fetch unit, branch predictor, fixed-length BSA-block, block engines, instruction window, functional units FU1 and FU2, register file, data cache. Labels: fast intra-block communication, slow inter-block communication, speculative execution.]

9 Decentralization (1)
– out-of-order architectures with higher levels of ILP: complex design
– wiring delay will dominate in future technologies
Scaling out-of-order architectures to higher levels of ILP for future technologies is infeasible.
Decentralization: small, and thus very fast, block engines communicating through longer, and thus slower, interconnects.

10 Decentralization (2)
Decentralization leads to
– lower IPC: slower interconnections (1 cycle latency) and poor virtual instruction window utilization due to higher granularity
– higher clock frequency F
performance = IPC x F
Higher performance for large virtual window sizes.
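The trade-off on this slide (lower IPC against a higher clock frequency) can be made concrete with a small calculation. The sketch below is only an illustration: the function names are hypothetical and the IPC and frequency values are made up, not results from the presentation.

```python
def performance(ipc, freq_ghz):
    """Billions of instructions per second: IPC times clock frequency (GHz)."""
    return ipc * freq_ghz


def break_even_frequency(ipc_base, freq_base, ipc_new):
    """Clock frequency the new design needs to match the baseline's performance."""
    return ipc_base * freq_base / ipc_new


# Hypothetical numbers, chosen only for illustration (not from the slides).
centralized = performance(ipc=4.0, freq_ghz=1.0)
decentralized = performance(ipc=3.6, freq_ghz=1.5)
speedup = decentralized / centralized
needed = break_even_frequency(4.0, 1.0, 3.6)
print(f"speedup: {speedup:.2f}, break-even frequency: {needed:.2f} GHz")
```

Under these assumed values, the decentralized design comes out ahead as long as its clock frequency gain outweighs its IPC loss, which is the argument the slide makes.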

11 Outline
– Introduction
– Block Structured Architecture
– Methodology
– Results
– Conclusions

12 Statistical Modeling
[Figure: statistical modeling flow, with steps numbered 1 to 6]
benchmark trace (e.g. SPECint) → statistical profiler (extraction of distributions) → statistical profile (distributions) → synthetic trace generator (parameter: BSA-block size b) → synthetic trace → trace-driven simulator (input: microarchitectural parameters) → IPC

13 Synthetic BSA-trace Generation
Generate control flow
– determine basic block size
– add basic block to most likely execution path
– until b instructions in BSA-block
Generate data flow
– instruction type
– number of operands
– age of register operands
Determine actually executed control flow path
[Figure: tree of basic blocks forming a BSA-block, annotated with path probabilities 0.35, 0.65, 0.25, 0.40, 0.20, 0.05, 0.20 and 0.15; the actually executed path is marked; steps numbered 1 to 5]
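The control-flow part of this procedure can be sketched as follows. This is one possible reading of the slide, not the authors' generator: the function names, the two sampling callbacks standing in for the statistical profile, the truncation of the last basic block to fit, and the two-way branch at the end of every basic block are all assumptions made for illustration.

```python
import heapq
import random


def form_bsa_block(b, sample_size, sample_taken_prob, rng):
    """Grow a tree of basic blocks along the most likely paths until b slots are filled.

    Returns {path: (size, path_probability, p_taken)}, where path is the tuple
    of branch outcomes leading to the block and () is the entry block.
    sample_size must return at least 1, otherwise the loop cannot terminate.
    """
    # heapq is a min-heap, so probabilities are stored negated.
    frontier = [(-1.0, ())]
    tree = {}
    filled = 0
    while frontier and filled < b:
        neg_prob, path = heapq.heappop(frontier)  # most likely open path
        # Assumption: a basic block that does not fit is truncated.
        size = min(sample_size(rng), b - filled)
        p_taken = sample_taken_prob(rng)
        tree[path] = (size, -neg_prob, p_taken)
        filled += size
        # Assumption: every basic block ends in a two-way branch.
        heapq.heappush(frontier, (neg_prob * p_taken, path + (True,)))
        heapq.heappush(frontier, (neg_prob * (1.0 - p_taken), path + (False,)))
    return tree


def useful_instructions(tree, rng):
    """Sample the actually executed path and count the instructions on it."""
    path, useful = (), 0
    while path in tree:
        size, _, p_taken = tree[path]
        useful += size
        path += (rng.random() < p_taken,)
    return useful


rng = random.Random(0)
tree = form_bsa_block(
    b=16,
    sample_size=lambda r: r.randint(3, 8),            # hypothetical distribution
    sample_taken_prob=lambda r: r.uniform(0.5, 0.95),  # hypothetical distribution
    rng=rng,
)
total = sum(size for size, _, _ in tree.values())
print(total, useful_instructions(tree, rng))
```

Instructions on paths that are not executed fill slots in the BSA-block without doing useful work, so the ratio of the two printed numbers corresponds to a fraction of useful instructions for one block.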

14 Benchmarks
– SPECint95: integer
– SPECfp95: floating-point
– MediaBench: signal and multimedia processing
– MPEG-4 like algorithms
Measuring program characteristics through instrumentation (ATOM) on Alpha architecture

15 Outline
– Introduction
– Block Structured Architecture
– Methodology
– Results
– Conclusions

16 Instruction Mix
                          SPECint95   SPECfp95   multimedia
Load/store instructions   40.6%       37.7%      29.2%
Branch instructions       14.0%       3.6%       8.5%
Some multimedia applications have floating-point instructions

17 Control intensity
Good measure: "number of instructions between 2 mispredicted branches"
= (number of instructions between 2 branches) / (branch misprediction rate)
             instr. between 2 mispredicted branches   instr. between 2 branches   branch misprediction rate
SPECint95    80.1                                     7.3                         9.1%
SPECfp95     415.3                                    25.0                        6.0%
multimedia   156.9                                    14.3                        9.1%
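The measure on this slide is a simple ratio, and the figures it reports can be checked against each other. The sketch below recomputes the first column from the other two; the function name is hypothetical, and small differences from the reported values are presumably due to rounding or per-benchmark averaging.

```python
def instructions_between_mispredictions(instr_between_branches, misprediction_rate):
    """Average number of instructions between two mispredicted branches."""
    return instr_between_branches / misprediction_rate


# (instructions between 2 branches, misprediction rate, value reported on the slide)
table = {
    "SPECint95": (7.3, 0.091, 80.1),
    "SPECfp95": (25.0, 0.060, 415.3),
    "multimedia": (14.3, 0.091, 156.9),
}

for name, (between, rate, reported) in table.items():
    estimate = instructions_between_mispredictions(between, rate)
    print(f"{name}: computed {estimate:.1f}, slide reports {reported}")
```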

18 BSA-block formation: number of useful instructions
[Figure: fraction of useful instructions (y-axis, 50% to 100%) versus BSA-block size (x-axis: 16, 32, 64, 128); curves: avg media, avg SPECint95, avg SPECfp95]

19 BSA-block formation: predictability of multi-way branch
[Figure: multi-way branch predictability (0% to 100%) for 16-, 32- and 64-instruction blocks, grouped into multimedia, integer and floating-point benchmarks]
– 16-instruction block: 90% in most cases
– 32-instruction block: low for several integer applications
– 64-instruction block: only for floating-point applications

20 Conclusions
Multimedia applications are less control-intensive than integer applications
– due to larger basic block size under comparable branch predictability
Multimedia applications are more control-intensive than floating-point applications
– due to smaller basic block size and lower branch predictability
16 instructions per BSA-block is appropriate
– larger blocks result in higher (multi-way) branch misprediction rates

