Published by Magnus Hardy. Modified over 8 years ago.
1
Application Domains for Fixed-Length Block Structured Architectures
ACSAC-2001, Gold Coast, January 30, 2001
2
Outline
– Introduction
– Block Structured Architecture
– Methodology
– Results
– Conclusions
3
Introduction
Out-of-order architecture
– dynamically schedules independent instructions
Higher ILP through
– more powerful processor core
– fast instruction delivery
But … this increases the hardware complexity significantly!
4
Hardware complexity
processor core
– instruction window, bypass logic, register file
– O(n²), long wires, many ports
fetching
– fetch bandwidth
– multiple branches, cache access
[Palacharla et al. 1996] [Farkas et al. 1995]
5
Solutions
processor core: decentralization
– trace processor [Rotenberg et al. '97]
– multiscalar architecture [Sohi et al. '95]
– clusters (Alpha 21264)
fetching: bigger units of work
– trace in trace processors
– task in multiscalar architecture
– block in block-structured ISA [Melvin and Patt '95; Hao et al. '96]
6
Basic idea of BSA
Fixed-Length Block Structured Architecture (BSA) addresses
– the processor core problem
– the fetching problem
by appropriate microarchitectural and implementation design decisions
BSA is a feasible architectural paradigm for future processors
7
Block Structured Architecture: overcoming the fetch problem
BSA-block
BSA-block is the atomic unit of work
– no control flow: predication
– static register renaming
– data-flow execution
– fixed-length
Advantages:
– predication: elimination of unbiased branches
– intra-block communication: fewer register file ports required
– fixed-length BSA-blocks: easier fetching
Disadvantages:
– BSA-block not always filled
– higher memory bandwidths
– bigger instruction caches
– BSA-block compression
[Figure: basic blocks guarded by the predicates p1, ~p1, p2, and ~p2, combined into one BSA-block]
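The "no control flow: predication" point can be illustrated with a minimal sketch. It is not the BSA instruction format: the tuple encoding, the register names, and the `run_block` helper are all hypothetical, chosen only to show a fixed-length block in which every instruction is visited in order and instructions whose predicate does not hold are nullified rather than jumped over.

```python
import operator

# Hypothetical encoding (an assumption, not from the slides):
# (guard, destination, operation, source registers)
# guard is None (always execute) or (predicate register, required value).
ABS_DIFF_BLOCK = [
    (None,          "p1", operator.gt,  ("a", "b")),  # p1 = a > b
    (("p1", True),  "r",  operator.sub, ("a", "b")),  # (p1)  r = a - b
    (("p1", False), "r",  operator.sub, ("b", "a")),  # (~p1) r = b - a
]

def run_block(block, regs):
    """Execute every instruction in order; no branches inside the block."""
    regs = dict(regs)  # work on a copy so the caller's registers are untouched
    for guard, dest, op, srcs in block:
        if guard is not None:
            pred_reg, required = guard
            if regs[pred_reg] != required:
                continue  # instruction is nullified by its predicate
        regs[dest] = op(*(regs[s] for s in srcs))
    return regs

result_gt = run_block(ABS_DIFF_BLOCK, {"a": 7, "b": 3})["r"]
result_lt = run_block(ABS_DIFF_BLOCK, {"a": 2, "b": 9})["r"]
```

Both sides of the original if/else occupy slots in the block, which is why a BSA-block is not always filled with useful instructions.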
8
Block Structured Architecture: overcoming the processor core problem
[Figure: block diagram with the labels instruction cache, fetch unit, branch predictor, fixed-length BSA-block, instruction window, block engines, FU1, FU2, register file, data cache, and speculative execution]
– fast intra-block communication
– slow inter-block communication
9
Decentralization (1)
out-of-order architectures with higher levels of ILP: complex design
wiring delay will dominate in future technologies
scaling out-of-order architectures to higher levels of ILP for future technologies is infeasible
decentralization
– small, and thus very fast, block engines communicating through longer, and thus slower, interconnects
10
Decentralization (2)
lower IPC
– slower interconnections (1 cycle latency)
– bad virtual instruction window utilization due to higher granularity
higher clock frequency F
– decentralization
performance = IPC × F
– higher performance for large virtual window sizes
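The trade-off performance = IPC × F can be made concrete with a small sketch. The IPC and frequency values below are hypothetical placeholders, not results from the presentation; they only show how a lower IPC can be outweighed by a higher clock frequency.

```python
def performance(ipc: float, freq_ghz: float) -> float:
    """Instructions per nanosecond: performance = IPC x F, with F in GHz."""
    return ipc * freq_ghz

def breakeven_frequency_ratio(ipc_central: float, ipc_decentral: float) -> float:
    """Clock-frequency gain the decentralized design needs to match the centralized one."""
    return ipc_central / ipc_decentral

# Hypothetical numbers (assumptions for illustration only).
centralized = performance(ipc=4.0, freq_ghz=1.0)
decentralized = performance(ipc=3.5, freq_ghz=1.3)

speedup = decentralized / centralized
needed_ratio = breakeven_frequency_ratio(4.0, 3.5)
```

With these assumed values the decentralized design needs a frequency gain of about 1.14× to break even, so a 1.3× faster clock yields a net gain.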
11
Outline
– Introduction
– Block Structured Architecture
– Methodology
– Results
– Conclusions
12
Statistical Modeling
[Figure: flow diagram with steps numbered 1 to 6]
benchmark trace (e.g. SPECint)
-> statistical profiler: extraction of distributions
-> statistical profile: distributions
-> synthetic trace generator (input: BSA-block size b)
-> synthetic trace
-> trace-driven simulator (input: microarchitectural parameters)
-> IPC
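The profile-then-generate flow can be sketched in a few lines. This is a heavily simplified illustration under stated assumptions: it profiles only the instruction-type distribution, whereas the methodology also extracts other distributions, and the toy trace and the function names `build_profile` and `generate_synthetic_trace` are hypothetical.

```python
import random
from collections import Counter

def build_profile(trace):
    """Statistical profile: empirical probability of each instruction type."""
    counts = Counter(trace)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}

def generate_synthetic_trace(profile, length, seed=0):
    """Draw a synthetic trace whose instruction mix follows the profile."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    kinds = list(profile)
    weights = [profile[k] for k in kinds]
    return rng.choices(kinds, weights=weights, k=length)

# Toy benchmark trace (an assumption; a real trace would come from instrumentation).
toy_trace = ["load", "add", "branch", "store", "add", "load", "add", "branch"]

profile = build_profile(toy_trace)
synthetic = generate_synthetic_trace(profile, length=1000)
```

The synthetic trace would then be fed to a trace-driven simulator; that step is outside the scope of this sketch.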
13
Synthetic BSA-trace Generation
generate control flow
– determine basic block size
– add basic block to most likely execution path until b instructions in BSA-block
generate data flow
– instruction type
– number of operands
– age of register operands
determine actually executed control flow path
[Figure: tree of basic blocks inside a BSA-block, labeled 1 to 5, with path probabilities 0.35, 0.65, 0.25, 0.40, 0.20, 0.05, 0.20, and 0.15; the actually executed path is marked]
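One possible reading of "add basic block to most likely execution path until b instructions in BSA-block" is a greedy growth along the most probable paths. The sketch below is an interpretation, not the authors' generator: the control-flow tree, the block sizes, the conditional probabilities, and the rule of stopping at the first block that does not fit are all assumptions made for illustration.

```python
import heapq
from itertools import count

def form_bsa_block(cfg, sizes, entry, b):
    """Greedily add the most probable reachable basic block until b is reached.

    cfg   maps a basic block to a list of (successor, branch probability)
    sizes maps a basic block to its instruction count
    Assumes a tree-shaped (acyclic) control-flow graph.
    """
    tie = count()  # tie-breaker so the heap never compares block names
    heap = [(-1.0, next(tie), entry)]  # max-heap on path probability via negation
    chosen, used = [], 0
    while heap:
        neg_prob, _, block = heapq.heappop(heap)
        if used + sizes[block] > b:
            break  # assumed policy: stop at the first block that does not fit
        chosen.append((block, -neg_prob))
        used += sizes[block]
        for succ, p in cfg.get(block, []):
            # path probability of succ = path probability of block * branch probability
            heapq.heappush(heap, (neg_prob * p, next(tie), succ))
    return chosen, used

# Hypothetical control-flow tree and basic block sizes.
cfg = {
    "A": [("B", 0.65), ("C", 0.35)],
    "B": [("D", 0.60), ("E", 0.40)],
    "C": [("F", 0.50), ("G", 0.50)],
}
sizes = {"A": 4, "B": 5, "C": 3, "D": 6, "E": 2, "F": 4, "G": 4}

chosen, used = form_bsa_block(cfg, sizes, entry="A", b=16)
```

With these assumed inputs the block grows along A, B, D and stops one slot short of b = 16, which illustrates why a fixed-length BSA-block is not always completely filled.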
14
Benchmarks
SPECint95: integer
SPECfp95: floating-point
MediaBench: signal and multimedia processing
MPEG-4 like algorithms
measuring program characteristics through instrumentation (ATOM) on Alpha architecture
15
Outline
– Introduction
– Block Structured Architecture
– Methodology
– Results
– Conclusions
16
Instruction Mix
Load/store instructions
– SPECint95: 40.6%
– SPECfp95: 37.7%
– multimedia: 29.2%
Branch instructions
– SPECint95: 14.0%
– SPECfp95: 3.6%
– multimedia: 8.5%
Some multimedia applications have floating-point instructions
17
Control-intensity
Good measure: “Number of instructions between 2 mispredicted branches”
= (number of instructions between 2 branches) / (branch misprediction rate)

             instructions between     instructions between   branch
             2 mispredicted branches  2 branches             misprediction rate
SPECint95     80.1                     7.3                   9.1%
SPECfp95     415.3                    25.0                   6.0%
multimedia   156.9                    14.3                   9.1%
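The measure on this slide can be checked against its own numbers with a short sketch. The values are taken from the slide; the column assignment is inferred from the formula, and the variable and function names are illustrative.

```python
def instructions_between_mispredictions(instr_between_branches, mispredict_rate):
    """Instructions between 2 mispredicted branches = branch distance / misprediction rate."""
    return instr_between_branches / mispredict_rate

# (instructions between 2 branches, misprediction rate, value reported on the slide)
slide_values = {
    "SPECint95":  (7.3,  0.091, 80.1),
    "SPECfp95":   (25.0, 0.060, 415.3),
    "multimedia": (14.3, 0.091, 156.9),
}

recomputed = {
    suite: instructions_between_mispredictions(distance, rate)
    for suite, (distance, rate, _) in slide_values.items()
}

# Relative difference between the recomputed and the reported value.
relative_error = {
    suite: abs(recomputed[suite] - reported) / reported
    for suite, (_, _, reported) in slide_values.items()
}
```

The recomputed values agree with the reported ones to within about half a percent; the small gaps are consistent with the slide showing rounded inputs.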
18
BSA-block formation: number of useful instructions
[Figure: fraction of useful instructions (y-axis, 50% to 100%) versus BSA-block size (x-axis: 16, 32, 64, 128), with curves for avg media, avg SPECint95, and avg SPECfp95]
19
BSA-block formation: predictability of multi-way branch
[Figure: multi-way branch predictability (y-axis, 0% to 100%) for 16-, 32-, and 64-instruction blocks, grouped by multimedia, integer, and floating-point benchmarks]
– 16-instruction block: 90% in most cases
– 32-instruction block: low for several integer applications
– 64-instruction block: only for floating-point applications
20
Conclusions
Multimedia applications are less control-intensive than integer applications
– due to larger basic block size under comparable branch predictability
Multimedia applications are more control-intensive than floating-point applications
– due to smaller basic block size and lower branch predictability
16 instructions per BSA-block is appropriate
– larger blocks result in higher (multi-way) branch misprediction rates