1
June 30th, 2006 ICS’06 -- Håkan Zeffer: zeffer@it.uu.se
TMA: A Trap-Based Memory Architecture
Håkan Zeffer, Zoran Radovic, Martin Karlsson, Erik Hagersten
Uppsala University, Sweden
2
Simultaneous Multithreading (SMT)
Diminishing performance returns from ILP
Increased chip parallelism from hardware threading (TLP)
IBM Power5, Intel Pentium4, Sun T1 (Niagara)
“No processor should come without multiple threads” [Dr. Tremblay]
[Figure: SMT pipeline with a fetch unit; decode, rename, etc.; integer, floating-point, memory, and branch pipes; L1I and L1D caches]
3
Chip Multiprocessors (CMPs)
Piranha, IBM Power4, IBM Power5, Sun UltraSPARC IV+, Sun T1, Intel Duo, AMD Dual-Core Opteron
[Figure: four processors (P), each with I and D caches, an interconnect, and an L2 cache]
4
Multi-CMP Systems
Larger systems are sometimes built from multiple CMPs
Piranha, IBM Power4 and IBM Power5
[Figure: CMP 1 to CMP 4 connected by an interconnect; each CMP contains four processors (P) with I and D caches, an interconnect, and an L2 cache]
5
Multi-CMP Coherence
Intra-CMP protocol for coherence within a CMP
Inter-CMP protocol for coherence between CMPs
Interactions between the protocols increase complexity
[Figure: CMP 1 to CMP 4 connected by an interconnect, labeled with intra-CMP coherence and inter-CMP coherence]
6
Shared-Memory Trends
Today’s chips = yesterday’s mid-range servers
  Sun T1 has 32 hardware threads on a single die
Is it worth implementing multi-CMP systems?
  Increased development cost
  Increased verification cost
  How big is the market?
7
Trap-Based Memory Architectures
TMA: Trap-based Memory Architecture
Basic idea
  Optimize for commercial single-chip performance
  Let simple HW and SW support enable scalability
Coherence violation detection in hardware
  Trap on inter-chip coherence violations
  Solve inter-chip coherence misses in software
8
Outline
Introduction
TMA and TMA Lite
Evaluation methodology
Results
Related work
Future work
Conclusions
9
TMA Lite
TMA Lite is a “minimal” TMA implementation
Runtime system
  Deadlock avoidance
  Coherence protocol
Per application “scalability”
Binary transparency
No memory system modifications
Simple processor core modifications
  An inter-node load coherence check
  An inter-node store coherence check
10
A TMA Lite System
TMA Lite nodes
  Single-chip system
  Load and store coherence check support
  HW maintains intra-chip coherence
TMA Lite cluster network
  “InfiniBand like”
  High-bandwidth
  Low-latency
  Remote memory access (put, get and atomic)
TMA Lite software
  Coherence and consistency between nodes
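The remote memory access operations listed for the cluster network (put, get and atomic) can be modeled in a few lines. This is only a sketch under stated assumptions: the class `RemoteNode`, its method names, and the choice of compare-and-swap as the atomic operation are hypothetical, since the slides do not say which atomic the network provides.

```python
import threading


class RemoteNode:
    """Toy model of one node's memory as seen over the cluster network.

    Hypothetical illustration: the slides only name the operation kinds
    (put, get, atomic); compare-and-swap is an assumed choice of atomic.
    """

    def __init__(self):
        self._mem = {}
        self._lock = threading.Lock()  # stands in for atomicity at the target node

    def put(self, addr, value):
        """Write a value into the remote node's memory."""
        with self._lock:
            self._mem[addr] = value

    def get(self, addr):
        """Read a value from the remote node's memory (0 if never written)."""
        with self._lock:
            return self._mem.get(addr, 0)

    def compare_and_swap(self, addr, expected, new):
        """Atomically replace the value at addr if it equals expected.

        Returns the value that was stored before the operation.
        """
        with self._lock:
            old = self._mem.get(addr, 0)
            if old == expected:
                self._mem[addr] = new
            return old
```

A software protocol could, for example, use such an atomic to take exclusive ownership of a remote directory entry before updating it; whether TMA Lite does so is not stated on the slides.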
11
The Load Check
Magic value convention
  Each cache line in state invalid contains a predefined value
Hardware
  Comparator at the load path detects this value
  Trap generated when the value is found
False misses
  When the magic value is used within an application
  Easy to detect and solve within the coherence protocol
  Rare
[Figure: the loaded data is compared (=?) with the magic value register; a match while the load check is enabled raises a load trap. Controlled by system software.]
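The load check described on this slide can be illustrated with a small software model. It is a minimal sketch, not the hardware design: the magic value, the `LoadTrap` exception, and the handler arguments (`locally_valid`, `fetch_remote`) are assumptions introduced for the example.

```python
# Hypothetical magic value; the slides do not give the actual constant.
MAGIC_VALUE = 0xDEADBEEFDEADBEEF


class LoadTrap(Exception):
    """Models the trap raised when a load returns the magic value."""


def checked_load(memory, addr, magic=MAGIC_VALUE, check_enabled=True):
    """Return memory[addr]; trap if the data equals the magic value."""
    data = memory[addr]
    if check_enabled and data == magic:
        raise LoadTrap(addr)
    return data


def load_with_protocol(memory, addr, locally_valid, fetch_remote):
    """Perform a load and resolve a trap in software.

    locally_valid: set of addresses the node holds valid copies of
                   (stand-in for the protocol's own state).
    fetch_remote:  callable returning the current value from another node.
    """
    try:
        return checked_load(memory, addr)
    except LoadTrap:
        if addr in locally_valid:
            # False miss: the application itself stored the magic value.
            return checked_load(memory, addr, check_enabled=False)
        # True miss: fetch the data and install it locally.
        memory[addr] = fetch_remote(addr)
        locally_valid.add(addr)
        return memory[addr]
```

In this model a false miss costs one trap and one lookup in the protocol state, which matches the slide's point that such cases are easy to detect inside the protocol.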
12
The Store Check
Write permission cache (WPC)
  Can be seen as a very small cache
  Operates on virtual addresses
  Accessed in parallel with the data TLB
  Write permission for lines in the WPC guaranteed by protocol
The write permission cache has to be filled
  A fill occurs at all WPC misses
  Even if the node already has write permission
  Overhead often severe
[Figure: the data TLB, the WPC, and the L1 data cache are accessed with the generated address. Pipeline stages: address generation; TLB access and WPC access at the start of the L1 access; tag compare, TLB trap? and WPC trap? at the end of the L1 access; outputs hit? and data]
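A small model makes the fill behavior of the write permission cache concrete. It is a sketch under assumptions: the 64-byte line size, the LRU replacement policy, and the names `WritePermissionCache` and `acquire_permission` are not taken from the slides.

```python
from collections import OrderedDict

LINE_SIZE = 64  # assumed coherence unit size in bytes


class WritePermissionCache:
    """Toy WPC: a tiny cache of virtual line numbers with write permission."""

    def __init__(self, entries=16):
        self.entries = entries
        self.lines = OrderedDict()  # line number -> True, kept in LRU order (assumed policy)
        self.traps = 0

    def store(self, vaddr, acquire_permission):
        """Check a store; return True if it trapped (WPC miss)."""
        line = vaddr // LINE_SIZE
        if line in self.lines:
            self.lines.move_to_end(line)  # hit: the store proceeds without a trap
            return False
        # Miss: trap to software, which guarantees write permission and
        # fills the WPC, even if the node already had permission.
        self.traps += 1
        acquire_permission(line)
        self.lines[line] = True
        if len(self.lines) > self.entries:
            self.lines.popitem(last=False)  # evict the least recently used line
        return True

    def invalidate(self, line):
        """Drop a line when the protocol gives up write permission for it."""
        self.lines.pop(line, None)
```

Because an evicted line traps again on its next store, a working set larger than the WPC produces the repeated fills that the slide describes as often severe overhead.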
13
Simulator and Benchmarks
Simics: full-system simulator
Vasa: timing- and memory-model extension
  Cycle accurate
  Power5-like SMT processor model
  Latency and bandwidth of caches, memory and network
SPLASH-2 benchmarks
14
System Parameters
Scaled-down Power5 chip
  1 or 2 processor cores per chip
  2 SMT threads per processor core
  Write-through L1
  Write-back L2 and L3
  L2 on-die, L3 tags on-die
The HW distributed shared memory system
  Directory: fully mapped bit vector, dedicated SRAM
  Coherence protocol: HW, highly optimized, non-blocking
The TMA Lite system
  Directory: fully mapped bit vector, in ordinary DRAM memory
  Coherence protocol: SW
  Binary patch to Solaris modifies the trap vector
  Coherence protocol runs on the hardware thread that caused the miss
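Both systems use a fully mapped bit-vector directory, which keeps one presence bit per node for every coherence unit. The sketch below shows the general data structure only; the class and method names are hypothetical and the actual directory layout used in the evaluation is not given on the slides.

```python
class BitVectorDirectory:
    """Fully mapped directory: one presence bit per node for each line."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.sharers = {}  # line number -> integer bitmask of nodes holding a copy

    def _check(self, node):
        if not 0 <= node < self.num_nodes:
            raise ValueError(f"node {node} out of range")

    def add_sharer(self, line, node):
        """Record that node obtained a readable copy of line."""
        self._check(node)
        self.sharers[line] = self.sharers.get(line, 0) | (1 << node)

    def holders(self, line):
        """List the nodes whose presence bit is set for line."""
        mask = self.sharers.get(line, 0)
        return [n for n in range(self.num_nodes) if (mask >> n) & 1]

    def make_exclusive(self, line, node):
        """Give node exclusive ownership; return the nodes to invalidate."""
        self._check(node)
        to_invalidate = [n for n in self.holders(line) if n != node]
        self.sharers[line] = 1 << node
        return to_invalidate
```

With 4 nodes, as in the evaluated configuration, each entry needs only 4 presence bits, so the same structure fits either dedicated SRAM or ordinary DRAM.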
15
Execution Time Breakdown
Execution time is normalized to the HW DSM. 4 nodes, load comparator + 16-entry WPC.
16
Coherence Protocol Breakdown
17
SW Flexibility: Coherence Unit Size
Execution time is normalized to the HW DSM. 4 nodes, load comparator + 16-entry WPC.
18
Related Work
SW only
  Page-based systems
    IVY, Munin, Cashmere, GeNIMA, TreadMarks + many more
    Virtual memory used for coherence detection
  Fine-grained systems
    Shasta, Blizzard, Sirocco, DSZOOM
    Coherence checks instrumented into applications
HW support + software protocol
  FLASH, Typhoon, S3.mp
    Coherence processor executes the coherence protocol
  SMTp
    SMT thread executes the coherence protocol
19
Future Work
More mature TMA implementations
  Coherence detection on physical addresses
  System (instead of application) scalability
(Proceedings figure text error: Internet pdf is OK!)
One proposal is already available as a tech. report
  Available at: http://www.it.uu.se/research/publications/reports/2006-031
  New coherence detection scheme
  No “false” load or store coherence misses
  A new way to decouple inter- and intra-chip coherence
  Remote access caching in DRAM memory
Commercial applications
Many more experiments
Very promising results
20
Conclusions
Shared memory trends
  SMT and CMP
  Mid-range servers on a single chip
Trap-based Memory Architecture
  Design for commercial single-chip performance
  Simple and small HW structures for scalable shared memory
TMA Lite
  “Minimal” TMA implementation
  Competitive with HW DSM when flexibility is used
  Promising for HPC when runtime system is under control
Given the right HW/SW tradeoff, simple and efficient scalable shared memory is possible
More mature TMA arch. in next paper (the tech. report)
21
Questions?
22
The Coherence Protocol