1
CS 7810 Lecture 23
Maximizing CMP Throughput with Mediocre Cores
J. Davis, J. Laudon, K. Olukotun
Proceedings of PACT-14, September 2005
2
Niagara
- Commercial servers require high thread-level throughput and suffer from frequent cache misses
- Sun's Niagara therefore focuses on:
  - simple cores (low power, low design complexity, can accommodate more cores per die)
  - fine-grain multi-threading (to tolerate long memory latencies)
3
Niagara Overview
4
SPARC Pipe
- No branch predictor
- Low clock speed (1.2 GHz)
- One FP unit shared by all cores
5
Thread Selection
- Round-robin among available threads
- Threads that are speculating on a load hit receive lower priority
- Threads become unavailable if they suffer cache misses or long-latency ops (see the sketch below)
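A minimal sketch of this selection policy, assuming a simple cycle-by-cycle model; the Thread class, field names, and tie-breaking details are hypothetical, not Sun's actual select logic:

```python
# Hypothetical model of Niagara-style thread select (not Sun's RTL).
# Each cycle: skip unavailable threads (stalled on a miss or long-latency
# op); among the rest, prefer non-speculating threads in round-robin order.

class Thread:
    def __init__(self, tid):
        self.tid = tid
        self.available = True      # False while waiting on a miss / long op
        self.speculating = False   # True when issuing past an unresolved load

def select_thread(threads, last_tid):
    n = len(threads)
    # Visit threads round-robin, starting just after the last issuer.
    order = [threads[(last_tid + 1 + i) % n] for i in range(n)]
    ready = [t for t in order if t.available]
    for t in ready:                # non-speculating threads go first
        if not t.speculating:
            return t
    return ready[0] if ready else None   # all threads stalled: bubble
```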
6
Register File
- Each procedure has eight local and eight in registers (plus eight out registers that serve as in registers for the callee); each thread has eight such windows
- Total register file size: 640 registers! (worked out below)
- 3 read ports and 2 write ports (1 write/cycle each for long- and short-latency ops)
- Implemented as a 2-level structure: the 1st level holds the current register windows
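One way to reconstruct the 640 figure; the four sets of eight global registers per thread are an assumption drawn from SPARC V9, which the slide does not mention:

```python
# Register-file count, assuming (beyond the slide) SPARC V9's four sets
# of 8 global registers per thread.
windows_per_thread = 8
regs_per_window = 8 + 8        # 8 locals + 8 ins (outs overlap the callee's ins)
globals_per_thread = 4 * 8     # assumed: four global sets of 8
threads = 4

per_thread = windows_per_thread * regs_per_window + globals_per_thread
print(per_thread, threads * per_thread)    # 160 640
```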
7
Cache Hierarchy
- 16KB L1I and 8KB L1D; write-through, read-allocate, write-no-allocate
- Invalidate-based directory protocol: the shared L2 cache (3MB, 4 banks) identifies sharers and sends out the invalidates
- Rather than store a sharer list per L2 line, the L1 tags are replicated at the L2 – such a structure is more efficient to search (see the sketch below)
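A hypothetical sketch of this directory organization; the data layout, the 64-byte line size, and the function names are illustrative, not Niagara's actual implementation:

```python
# Sketch: the L2 finds sharers by probing replicated copies of every
# core's L1 tags, instead of keeping a sharer vector per L2 line.
LINE = 64                                  # assumed 64B cache lines

def sharers(shadow_l1_tags, addr, l1_sets):
    """shadow_l1_tags[core][set] is the list of tags (one per way)."""
    set_idx = (addr // LINE) % l1_sets
    tag = addr // (LINE * l1_sets)
    return [core for core, tags in enumerate(shadow_l1_tags)
            if tag in tags[set_idx]]

def on_store(shadow_l1_tags, addr, writer, l1_sets, send_invalidate):
    # Invalidate every L1 copy except the writer's.
    for core in sharers(shadow_l1_tags, addr, l1_sets):
        if core != writer:
            send_invalidate(core, addr)
```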
8
Next Generation: Rock
- 4 cores; each core has 4 pipelines; each pipeline can execute two threads: 32 threads in all
9
Design Space Exploration: Methodology
- Workloads: SPEC-JBB (Java middleware), TPC-C (OLTP), TPC-W (transactional web), XML-Test (XML parsing) – all are thread-oriented
- Sun's chip design databases were examined to derive the area overheads of various features (primarily to evaluate the cost of threading and out-of-order execution)
10
Pipelines
- 8-stage pipelines
- The scalar processor is fine-grain multi-threaded; the superscalar processor is SMT
- Frequency is at most ½ of the max ITRS-projected frequency
- 400 mm² die: 25% devoted to off-chip interfaces (memory controllers, I/O, clocking), 11% to the inter-core crossbar
- Of the remaining area, 25-75% is allocated to cores vs. L2 cache (worked out below)
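Working through the slide's area budget (the die size and percentages are from the slide; the three sample split points are just illustrative):

```python
# Area budget on the 400 mm^2 die, per the slide.
die = 400.0
off_chip = 0.25 * die              # mem controllers, I/O, clocking -> 100 mm^2
xbar = 0.11 * die                  # inter-core crossbar            ->  44 mm^2
remaining = die - off_chip - xbar  # 256 mm^2 left for cores + L2

for core_frac in (0.25, 0.50, 0.75):      # sample points in the 25-75% sweep
    cores = core_frac * remaining
    print(f"cores {cores:.0f} mm^2, L2 {remaining - cores:.0f} mm^2")
# cores 64 mm^2, L2 192 mm^2
# cores 128 mm^2, L2 128 mm^2
# cores 192 mm^2, L2 64 mm^2
```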
11
Area Effect of Multi-Threading
- The area-vs-threads curve is linear for a while – the study is restricted to such designs
- Multi-threading adds a 5-8% area overhead per thread (primary caches are included in the baseline)
- A thread is statically assigned to an IDP – multiple threads can share an IDP (toy model below)
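A toy linear model consistent with the slide's numbers; the 6.5% midpoint and the example base area are assumptions, not data from the paper:

```python
# Toy linear-area model: each extra hardware thread costs 5-8% of the
# baseline single-thread core area (baseline already includes the L1s).
def core_area(base_area_mm2, threads, per_thread_overhead=0.065):
    # Valid only in the linear region the slide restricts itself to.
    return base_area_mm2 * (1 + per_thread_overhead * (threads - 1))

print(core_area(10.0, 4) / 10.0)   # ~1.20x for a 4-thread core at 6.5%/thread
```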
12
Design Space Exploration
13
Single Core IPC
- The 4 bars correspond to 4 different L2 sizes
- Each bar shows the IPC range across different L1 sizes
14
Aggregate IPC
- C1: 2p4t (2 pipelines, 4 threads per core) with 64KB L1 caches
- C2: 2p4t with 32KB L1 caches
- L1 latencies are always held constant
15
Maximal Aggregate IPCs
17
Observations
- Scalar cores are better than out-of-order superscalars
- Too many threads (> 8) can saturate the caches and memory buses
- Processor-centric design is often better (medium-sized L2s are good enough)
18
PACT 2001 Paper on CMP Designs
- Different workload: SPEC2k (multi-programmed)
- Private L2 caches (no cache coherence)
19
Effect of L2 Size
20
Effect of Memory Bandwidth
21
Optimal Configurations