Download presentation
Presentation is loading. Please wait.
1
Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors http://iacoma.cs.uiuc.edu Karin StraussAMD Advanced Architecture and Technology Lab Xiaowei ShenIBM Research Josep TorrellasUniversity of Illinois at Urbana-Champaign
2
Karin Strauss - “Uncorq”2 Motivation CMPs are ubiquitous Shared memory + caches = cache coherence directory-based: indirection, storage shared bus-based: electrical, layout issues Traditional cache coherence solutions
3
Karin Strauss - “Uncorq”3 Embedded-ring cache coherence [ISCA 2006] logical ring is embedded in network control messages use ring Simple and inexpensive to implement Novel snoopy cache coherence for mid-sized machines data messages use any path Snoop requests can have long latencies
4
Karin Strauss - “Uncorq”4 Contributions Propose invariant for transaction serialization Propose performance enhancements reduces cache-to-cache transfer latency Uncorq: unconstrained snoop request delivery Simple hardware data prefetching technique reduces memory-to-cache transfer latency
5
Karin Strauss - “Uncorq”5 logical ring Embedded-ring terminology snoop request snoop response snoop request + response data Types of messages: + request response snoop op. outcome positive snoop op. outcome + positive response data AB control messages Snoopy, invalidate protocol response request Single supplier protocol request
6
Ordering invariant
7
Karin Strauss - “Uncorq”7 Transaction serialization inv ack read data S MSI S AB SIS time inv I old valuenew value
8
Karin Strauss - “Uncorq”8 Serialization enforcement with embedded-ring Logical unidirectional ring provides partial ordering Distributed algorithm establishes global order for same-address transactions one is declared the “winner” (first to reach supplier) others have to retry On simultaneous transactions to same address: A request response
9
Karin Strauss - “Uncorq”9 How to serialize transactions A B S A’s request and response B’s request and response No clear “first” transaction B’s request reaches S first A receives B’s positive response before its own A retries: B A Ring guarantees responses are forwarded in the order S performed snoop operations + +
10
Karin Strauss - “Uncorq”10 Enforcing transaction serialization Ordering Invariant: the order in which responses travel the ring after leaving the supplier must be the same as the order in which the supplier processed their corresponding requests. Node whose request arrives at supplier node first is the “winner” What we need to enforce transaction serialization: loser node sees other node’s positive response before its own + S + request response
11
Uncorq: Unconstrained snoop request delivery
12
Karin Strauss - “Uncorq”12 Uncorq idea Baseline request response Idea: requests do not have to follow the ring (but responses do) Uncorq
13
Karin Strauss - “Uncorq”13 Benefit of Uncorq Baseline Reduced cache-to-cache transfer latency Uncorq savings request snoopdata request reaches supplier node time
14
Karin Strauss - “Uncorq”14 Implications of Uncorq Uncorq no longer restricts order of requests Nodes may receive and process requests in any order Responses may also get reordered Problem: distributed algorithm relies on the fact that response order reflects order of requests at supplier
15
Karin Strauss - “Uncorq”15 Example: incorrect transaction ordering A B S S + + + S Ordering invariant A node cannot forward any other response if it has an outstanding positive snoop outcome request response
16
Karin Strauss - “Uncorq”16 How Uncorq stalls responses + + addr C requests responses + Local transaction table (per-node structure) AB…C records messages that node is currently processing + request response
17
Karin Strauss - “Uncorq”17 Optimization: prefetching from memory Predict when no node will supply data Access memory in parallel with ring snoop R memory (1) (2) R memory (1) Goal: reduce latency of memory-to-cache transfers unoptimized optimized
18
Evaluation
19
Karin Strauss - “Uncorq”19 64 nodes in a single CMP Experimental setup SESC simulator (sesc.sourceforge.net) SPLASH-2, SPECjbb and SPECweb workloads Interconnection network: 2D torus with embedded-ring
20
Karin Strauss - “Uncorq”20 Cache-to-cache transfer latency substantial reduction in latency Uncorq 0 2 4 6 8 10 0100200300400500600 distribution (%) 0 20 40 60 80 100 cumulative distribution (%) cache-to-cache transfer latency Baseline 0 2 4 6 8 10 0100200300400500600 distribution (%) 0 20 40 60 80 100 cumulative distribution (%) cache-to-cache transfer latency
21
Karin Strauss - “Uncorq”21 Execution Time 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 SPLASH-2SPECjbbSPECweb normalized execution time Baseline Uncorq Uncorq+Pref Uncorq + Pref performs the best (reduction: 13-26%) Uncorq significantly reduces execution time (reduction: 5-23%)
22
Karin Strauss - “Uncorq”22 Also in the paper Serialization mechanism for case with no supplier System and node forward progress Fences and memory consistency issues Characterization of prefetching mechanism Comparison against ccHyperTransport
23
Karin Strauss - “Uncorq”23 Conclusion Propose invariant for transaction serialization Propose performance enhancements Uncorq: unconstrained snoop request delivery Reduce execution time by 13-26% Simple hardware data prefetching technique
24
Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors http://iacoma.cs.uiuc.edu Karin StraussAMD Advanced Architecture and Technology Lab Xiaowei ShenIBM Research Josep TorrellasUniversity of Illinois at Urbana-Champaign
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.