Exploring Core Designs for Chip Multiprocessors
Allison Holloway, Matthew Allen
Outline
- Motivation
- Hypotheses
- Methodology
- Results
- Conclusions
Motivation
- What should the core of a CMP look like?
- Workloads: commercial, scientific
- Out-of-order (OOO), wide-issue superscalar?
- Tradeoffs: Performance, Power, Area, Complexity
Hypotheses
1. Commercial workloads will not benefit much from OOO / wide-issue
2. Scientific workloads will benefit significantly from OOO / wide-issue
3. OOO & wide-issue will be less beneficial for larger scale systems
4. Augmenting an in-order processor with non-blocking caches will close OOO gap
Methodology
- Simulator: Multifacet, Ruby, Opal (OOO)
- In-order processor model
  - Looked at Simics functional (not comparable)
  - Restrict Opal to in-order issue
  - Register renaming not removed
- Limitations:
  - Can’t recompile code for scheduling
  - Does not model UltraSPARC issue rules
Methodology
- Workloads
  - Commercial: Apache, SPECjbb, OLTP, Zeus
  - Scientific: Barnes-Hut, Ocean
- Issues
  - No 4-processor simulation
  - No cache warmup files
Methodology
- Baseline configuration used
- ROB, instruction window, and # functional units halved for 2-wide processor
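The halving rule on this slide can be sketched as a small configuration helper. This is only an illustration: the `CoreConfig` type, its field names, and the numbers below are assumptions, since the slides do not list the actual baseline values.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CoreConfig:
    """Hypothetical core parameters; field names are illustrative only."""
    issue_width: int
    rob_entries: int
    window_entries: int
    functional_units: int


def halve_for_two_wide(base: CoreConfig) -> CoreConfig:
    """Derive a 2-wide variant by halving ROB, window, and functional units."""
    return replace(
        base,
        issue_width=2,
        rob_entries=base.rob_entries // 2,
        window_entries=base.window_entries // 2,
        functional_units=base.functional_units // 2,
    )


# Placeholder numbers, NOT the baseline used in the study.
baseline = CoreConfig(issue_width=4, rob_entries=64,
                      window_entries=32, functional_units=4)
two_wide = halve_for_two_wide(baseline)
print(two_wide)
```

Deriving the 2-wide core from the baseline, rather than writing it out separately, keeps the two configurations in step if the baseline changes.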
Results
- OOO vs. in-order provides more performance benefit than widening issue from 2 to 4
- Tolerating cache misses is the key
Results
- Hypothesis 1: Commercial workloads will not benefit much from OOO / wide-issue
  - ~30% speedup
- Hypothesis 2: Scientific workloads will benefit significantly from OOO / wide-issue
  - ~60% speedup
- Commercial workloads DO benefit from OOO, but not as much as scientific.
Results
- Hypothesis 3: OOO & wide-issue will be less beneficial for larger scale systems
  - True, BUT
  - Workloads don’t scale above 8 processors (except Apache)
(Non) Results
- Hypothesis 4: Augmenting an in-order processor with non-blocking caches will close OOO gap
- Simulations still running!
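The intuition behind Hypothesis 4 can be illustrated with a toy cycle count. This is a rough sketch under strong assumptions, not the Opal/Ruby model used in the study: the core issues one instruction per cycle, every miss has the same latency, and dependences on missing data are ignored, so the non-blocking numbers are optimistic. The function name and parameters are made up for this example.

```python
import heapq


def run_cycles(n_instructions, miss_positions, miss_latency, num_mshrs=None):
    """Total cycles for a toy in-order core issuing one instruction per cycle.

    num_mshrs=None models a blocking cache: each miss stalls the core for
    miss_latency cycles. Otherwise up to num_mshrs misses may be outstanding
    and the core stalls only when it needs an MSHR and none is free.
    """
    misses = set(miss_positions)
    cycle = 0
    outstanding = []  # min-heap of completion cycles for in-flight misses
    for i in range(n_instructions):
        cycle += 1  # issue instruction i
        if i not in misses:
            continue
        if num_mshrs is None:
            cycle += miss_latency  # blocking cache: wait for the data
            continue
        # Free MSHRs whose misses have already completed.
        while outstanding and outstanding[0] <= cycle:
            heapq.heappop(outstanding)
        # All MSHRs busy: stall until the earliest miss returns.
        if len(outstanding) == num_mshrs:
            cycle = max(cycle, heapq.heappop(outstanding))
        heapq.heappush(outstanding, cycle + miss_latency)
    # The program finishes only after every outstanding miss has returned.
    if outstanding:
        cycle = max(cycle, max(outstanding))
    return cycle


# Illustrative inputs only: 10 instructions, two back-to-back misses.
blocking = run_cycles(10, [2, 3], 100)
one_mshr = run_cycles(10, [2, 3], 100, num_mshrs=1)
two_mshrs = run_cycles(10, [2, 3], 100, num_mshrs=2)
print(blocking, one_mshr, two_mshrs)
```

In this toy setting the benefit comes entirely from overlapping the two misses, which is also why varying the number of MSHRs is a natural follow-up experiment.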
Future Work
- Analyze performance trade-offs
  - vs. power?
  - vs. area?
- 4-processor runs (if possible)
- Vary # of MSHRs
Conclusions
- Out-of-order provides substantial benefit over in-order, even for commercial workloads
- Other methods for tolerating/reducing cache misses may be effective
- Diminishing returns for larger systems, but workloads don’t scale well
- Need to consider power and area constraints