Application Domains for Fixed-Length Block Structured Architectures. ACSAC-2001, Gold Coast, January 30, 2001.


Outline
– Introduction
– Block Structured Architecture
– Methodology
– Results
– Conclusions

Introduction
Out-of-order architecture
– dynamically schedules independent instructions
Higher ILP through
– a more powerful processor core
– fast instruction delivery
But this increases hardware complexity significantly!

Hardware complexity
Processor core: instruction window, bypass logic, register file
– O(n^2) complexity, long wires, many ports
Fetching: fetch bandwidth
– multiple branches, cache access
References on slide: [Palacharla et al. 1996], [Farkas et al. 1995]

Solutions
Processor core: decentralization
– trace processor [Rotenberg et al. '97]
– multiscalar architecture [Sohi et al. '95]
– clusters (Alpha 21264)
Fetching: bigger units of work
– trace in trace processors
– task in multiscalar architecture
– block in block-structured ISA [Melvin and Patt '95; Hao et al. '96]

Basic idea of BSA
Fixed-Length Block Structured Architecture (BSA) addresses
– the processor core problem
– the fetching problem
by appropriate microarchitectural and implementation design decisions.
BSA is a feasible architectural paradigm for future processors.

Block Structured Architecture: overcoming the fetch problem
A BSA-block is an atomic unit of work
– no control flow
– predication
– static register renaming
– data-flow execution
– fixed length
Advantages:
– predication: elimination of unbiased branches
– intra-block communication: fewer register file ports required
– fixed-length BSA-blocks: easier fetching
Disadvantages:
– BSA-block not always filled
– higher memory bandwidth
– bigger instruction caches
– BSA-block compression
[Figure: a BSA-block composed of basic blocks, some of them predicated on p1, ~p1, p2 and ~p2]

Block Structured Architecture: overcoming the processor core problem
[Figure: block diagram with fetch unit, branch predictor, instruction cache, fixed-length BSA-blocks, several block engines, instruction window, functional units FU1 and FU2, register file and data cache. Labels: fast intra-block communication, slow inter-block communication, speculative execution.]

Decentralization (1)
– out-of-order architectures with higher levels of ILP: complex design
– wiring delay will dominate in future technologies
– scaling out-of-order architectures to higher levels of ILP for future technologies is infeasible
– decentralization: small, and thus very fast, block engines communicating through longer, and thus slower, interconnects

Decentralization (2)
Lower IPC
– slower interconnections (1 cycle latency)
– bad virtual instruction window utilization due to higher granularity
Higher clock frequency F
– decentralization
performance = IPC x F
– higher performance for large virtual window sizes
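The trade-off on this slide can be illustrated with a small calculation. The sketch below only applies the relation performance = IPC x F; the IPC and frequency values are hypothetical placeholders, not results from the talk, and the function names are introduced here for illustration.

```python
def performance(ipc: float, freq_ghz: float) -> float:
    """Throughput in instructions per nanosecond: IPC x F (F in GHz)."""
    return ipc * freq_ghz


def breakeven_frequency_gain(ipc_centralized: float, ipc_decentralized: float) -> float:
    """Factor by which F must rise so the decentralized design matches the centralized one."""
    return ipc_centralized / ipc_decentralized


# Hypothetical numbers, chosen only to illustrate the trade-off.
centralized = performance(ipc=4.0, freq_ghz=1.0)    # wide centralized out-of-order core
decentralized = performance(ipc=3.4, freq_ghz=1.3)  # lower IPC, faster clock
needed_gain = breakeven_frequency_gain(4.0, 3.4)
```

With these assumed values the decentralized design wins as soon as its clock is more than about 18% faster, even though its IPC is lower.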

Statistical Modeling
[Figure: flow diagram with numbered steps 1 to 6]
benchmark trace (e.g. SPECint) -> statistical profiler (extraction of distributions) -> statistical profile (distributions) -> synthetic trace generator (parameter: BSA-block size b) -> synthetic trace -> trace-driven simulator (input: microarchitectural parameters) -> IPC

Synthetic BSA-trace Generation
Generate control flow:
– determine basic block size
– add basic block to most likely execution path
– until b instructions in BSA-block
Generate data flow:
– instruction type
– number of operands
– age of register operands
Determine the actually executed control flow path.
[Figure: tree of basic blocks inside a BSA-block, with path probabilities 0.35, 0.65, 0.25, 0.40, 0.20, 0.05, 0.20 and 0.15, labels 1 to 5, and the actually executed path marked]
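The control-flow part of the generation can be sketched as follows. This is a minimal illustration, not the authors' generator: it assumes that each basic block ends in a two-way branch, that a basic block which does not fit is not split, and that sizes and branch biases come from the hypothetical samplers `draw_size` and `draw_bias`. The data-flow step (instruction type, operands, register age) is left out.

```python
import heapq
import random


def form_bsa_block(b, draw_size, draw_bias, rng):
    """Grow a tree of basic blocks along the most likely paths until b instructions are used."""
    nodes = []
    # Frontier entries: (-path_probability, unique_tie_breaker, parent_index, side)
    frontier = [(-1.0, 0, None, None)]
    used, counter = 0, 1
    while frontier:
        neg_prob, _, parent, side = heapq.heappop(frontier)  # most likely open path
        size = draw_size(rng)
        if used + size > b:
            break  # assumption: the block stays partly empty rather than splitting
        index = len(nodes)
        bias = draw_bias(rng)  # probability of following successor 0
        nodes.append({"size": size, "bias": bias, "prob": -neg_prob,
                      "children": [None, None]})
        if parent is not None:
            nodes[parent]["children"][side] = index
        used += size
        for s, p in ((0, bias), (1, 1.0 - bias)):
            heapq.heappush(frontier, (neg_prob * p, counter, index, s))
            counter += 1
    return nodes


def executed_instructions(nodes, rng):
    """Pick one control-flow path through the block and count its instructions."""
    index, count = (0 if nodes else None), 0
    while index is not None:
        node = nodes[index]
        count += node["size"]
        side = 0 if rng.random() < node["bias"] else 1
        index = node["children"][side]
    return count


rng = random.Random(42)
draw_size = lambda r: r.choice([3, 5, 7, 9])        # hypothetical size distribution
draw_bias = lambda r: r.choice([0.65, 0.80, 0.95])  # hypothetical branch biases
block = form_bsa_block(16, draw_size, draw_bias, rng)
useful = executed_instructions(block, rng)
```

The ratio `useful / 16` then corresponds to the fraction of useful instructions for one synthetic 16-instruction BSA-block; averaging over many generated blocks would give one point of a useful-instruction curve.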

Benchmarks
– SPECint95: integer
– SPECfp95: floating-point
– MediaBench: signal and multimedia processing
– MPEG-4 like algorithms
Program characteristics were measured through instrumentation (ATOM) on the Alpha architecture.

Instruction Mix
Load/store instructions
– SPECint95: 40.6%
– SPECfp95: 37.7%
– multimedia: 29.2%
Branch instructions
– SPECint95: 14.0%
– SPECfp95: 3.6%
– multimedia: 8.5%
Some multimedia applications have floating-point instructions.

Control-intensity
Good measure: "number of instructions between 2 mispredicted branches"
= (number of instructions between 2 branches) / (branch misprediction rate)

             instructions between      instructions between   branch misprediction
             2 mispredicted branches   2 branches             rate
SPECint95    80.1                      7.3                    9.1%
SPECfp95     415.3                     25.0                   6.0%
multimedia   156.9                     14.3                   9.1%
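The measure can be recomputed from the other two columns as a quick check. The sketch below uses the rounded values from the slide, so the results differ slightly from the reported 80.1, 415.3 and 156.9; the function and variable names are introduced here for illustration.

```python
def instructions_between_mispredictions(instr_between_branches: float,
                                        misprediction_rate: float) -> float:
    """Average number of instructions between two mispredicted branches."""
    return instr_between_branches / misprediction_rate


# (instructions between 2 branches, branch misprediction rate), as given on the slide
suites = {
    "SPECint95": (7.3, 0.091),
    "SPECfp95": (25.0, 0.060),
    "multimedia": (14.3, 0.091),
}

control_intensity = {
    name: instructions_between_mispredictions(distance, rate)
    for name, (distance, rate) in suites.items()
}

for name, value in control_intensity.items():
    print(f"{name}: {value:.1f}")
```

A larger value means a less control-intensive workload, which gives the ordering integer, multimedia, floating-point from most to least control-intensive.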

BSA-block formation: number of useful instructions
[Figure: fraction of useful instructions (axis from 50% to 100%) versus BSA-block size (16, 32, 64, 128) for avg media, avg SPECint95 and avg SPECfp95]

BSA-block formation: predictability of multi-way branch
[Figure: multi-way branch predictability (0% to 100%) for 16-, 32- and 64-instruction blocks, grouped by multimedia, integer and floating-point applications]
– 16-instruction block: 90% in most cases
– 32-instruction block: low for several integer applications
– 64-instruction block: only for floating-point applications

Conclusions
Multimedia applications are less control-intensive than integer applications
– due to larger basic block size under comparable branch predictability
Multimedia applications are more control-intensive than floating-point applications
– due to smaller basic block size and lower branch predictability
16 instructions per BSA-block is appropriate
– larger blocks result in higher (multi-way) branch misprediction rates
