SYNAR Systems Networking and Architecture Group
CMPT 886: Architecture of Niagara I Processor
Dr. Alexandra Fedorova
School of Computing Science, SFU
Overview
– 8 cores
– 4 threads per core
– 3MB L2 cache: 4 banks, 12-way set-associative, write-back
– One FPU per chip
© David Yen
Memory Latency Limits Performance
© David Yen
Hardware Multithreading
– While one thread is blocked on memory, the others continue computing, so the core completes more instructions per cycle
© David Yen
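A back-of-the-envelope sketch of why this helps, under an idealized model that is not from the slides: assume each thread alternates between `compute_cycles` of useful work and `stall_cycles` blocked on memory, and that a single-issue core can overlap one thread's stalls with other threads' work. The function name and the numbers in the usage note are hypothetical.

```python
def core_utilization(n_threads, compute_cycles, stall_cycles):
    """Fraction of cycles in which a single-issue core issues an instruction.

    Idealized assumption: each thread is busy compute_cycles out of every
    (compute_cycles + stall_cycles) cycles, and stalls of different threads
    overlap perfectly with other threads' compute phases.
    """
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    # A single-issue core cannot exceed one instruction per cycle.
    return min(1.0, n_threads * per_thread)
```

With, say, 10 compute cycles per 30 stall cycles, one thread keeps the core busy a quarter of the time, while four threads can in principle saturate it. Real workloads fall short of this bound because stalls do not overlap perfectly.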
Eight Multithreaded Cores
© David Yen
Niagara Chip
© Poonacha Kongetira
Niagara Core
– 4 threads per core
– Multithreading increases core area by 20%
– 6-stage, single-issue, in-order pipeline
– IFU: instruction fetch unit
– LSU: load/store unit
– EXU: execution unit
– L1 D-cache: 4-way, 8KB, 16-byte line
– L1 I-cache: 4-way, 16KB, 32-byte line
– Why a simple in-order core? Why small caches?
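To make the L1 cache parameters concrete, the sketch below derives the number of sets and the address bit split from size, associativity, and line size, using the standard set-associative relation sets = size / (ways × line size). The helper name `cache_geometry` is hypothetical, and the calculation assumes power-of-two sizes.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, associativity, and line size are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the number of sets
    return sets, offset_bits, index_bits


# L1 D-cache from the slide: 8KB, 4-way, 16-byte lines
d_cache = cache_geometry(8 * 1024, 4, 16)
# L1 I-cache from the slide: 16KB, 4-way, 32-byte lines
i_cache = cache_geometry(16 * 1024, 4, 32)
```

Under these parameters both caches come out to 128 sets; the I-cache is twice as large only because its lines are twice as long.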
Switching Threads
– Fine-grained multithreading: switch between available threads every cycle, giving priority to the least recently executed thread
– Threads become unavailable due to:
  – Long-latency operations such as loads, branches, multiplies, and divides
  – Pipeline stalls such as cache misses, traps, and resource conflicts
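The selection policy can be sketched as a small simulation. This is an illustrative model only, not the hardware's actual select logic: the `stalls` input (which threads are unavailable on which cycle), the tie-break by lowest thread id, and both function names are assumptions made for the example.

```python
def select_thread(available, last_issued):
    """Pick the available thread that issued least recently.

    available:   thread ids that are ready this cycle
    last_issued: dict mapping thread id -> cycle of its last issue (-1 if never)
    Ties are broken by lowest thread id (an assumption for determinism).
    """
    ready = list(available)
    if not ready:
        return None  # no thread can issue: the pipeline sits idle this cycle
    return min(ready, key=lambda t: (last_issued[t], t))


def simulate(cycles, stalls, n_threads=4):
    """Return the thread issued on each cycle (None for an idle cycle).

    stalls: dict mapping cycle -> set of thread ids unavailable on that cycle
            (hypothetical input standing in for loads, misses, traps, ...).
    """
    last_issued = {t: -1 for t in range(n_threads)}
    trace = []
    for cycle in range(cycles):
        blocked = stalls.get(cycle, set())
        available = [t for t in range(n_threads) if t not in blocked]
        chosen = select_thread(available, last_issued)
        if chosen is not None:
            last_issued[chosen] = cycle
        trace.append(chosen)
    return trace
```

With no stalls the policy degenerates to round-robin over the four threads. When a thread is blocked for a few cycles, the others fill its slots, and the blocked thread gets priority once it becomes available again because it is then the least recently executed.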