Lecture 9 Outline MESI protocol Dragon update-based protocol


1 Lecture 9 Outline
MESI protocol
Dragon update-based protocol
Impact of protocol optimizations
Lecture 9 ECE/CSC Summer E. F. Gehringer, based on slides by Yan Solihin

2 Lower-Level Protocol Choices
BusRd observed in M state: what transition to make?
  Change to S: assumes I’ll read again soon; good for mostly-read data
  What about “migratory” data (I read and write, then you read and write, then X reads and writes...)? Thus:
  Change to I: assumes another processor will write to it (Synapse)
Sequent Symmetry and MIT Alewife use adaptive protocols

3 MESI (4-state) Invalidation Protocol
Problem with MSI protocol:
  A Rd, Wr sequence incurs 2 bus transactions even when no one is sharing (e.g., a serial program!)
  BusRd (I → S) followed by BusRdX or BusUpgr (S → M)
  In general, penalizing serial programs is unacceptable
Add an exclusive state:
  Invalid
  Modified (dirty)
  Shared (two or more caches may have copies)
  Exclusive (only this cache has a clean copy, same value as in memory)
How to decide I → E or I → S?
  Need to check whether someone else has a copy
  “Shared” signal on bus: wired-OR line asserted in response to BusRd
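The cost of the Rd, Wr sequence can be checked with a small counting sketch. It is an illustration only, not part of the original slides: the function name `bus_transactions` and the `others_have_copy` flag (standing in for the wired-OR shared signal) are assumptions made for this example, and it tracks one cache and one block.

```python
def bus_transactions(protocol, ops, others_have_copy=False):
    """Count bus transactions one cache issues for a string of 'R'/'W' ops.

    protocol: "MSI" or "MESI" (hypothetical labels for this sketch).
    others_have_copy: models the shared signal being asserted on BusRd.
    """
    state = "I"
    count = 0
    for op in ops:
        if op == "R":
            if state == "I":
                count += 1  # BusRd
                if protocol == "MESI" and not others_have_copy:
                    state = "E"  # no sharer responded: exclusive clean copy
                else:
                    state = "S"
        elif op == "W":
            if state in ("I", "S"):
                count += 1  # BusRdX (from I) or BusRdX/BusUpgr (from S)
            # E -> M and M -> M are silent: no bus transaction
            state = "M"
    return count


print(bus_transactions("MSI", "RW"))   # read then write, nobody sharing
print(bus_transactions("MESI", "RW"))  # same sequence with the E state
```

With no sharers, the MSI run needs two transactions and the MESI run one; once another cache holds the block, MESI falls back to the same two transactions as MSI.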

4 MESI: Processor-Initiated Transactions
[State diagram, processor-initiated edges]
M: PrRd/- (stay in M); PrWr/- (stay in M)
E: PrRd/- (stay in E); PrWr/- (E → M)
S: PrRd/- (stay in S); PrWr/BusRdX (S → M)
I: PrRd/BusRd(~S) (I → E); PrRd/BusRd(S) (I → S); PrWr/BusRdX (I → M)

5 MESI: Bus-Initiated Transactions
[State diagram, bus-initiated edges]
M: BusRd/Flush (M → S); BusRdX/Flush (M → I)
E: BusRd/Flush (E → S); BusRdX/Flush (E → I)
S: BusRd/Flush1 (stay in S); BusRdX/Flush1 (S → I)
I: BusRd/- (stay in I); BusRdX/- (stay in I)

6 MESI State Transition Diagram
BusRd(S) means shared line asserted on BusRd transaction

7 Flush vs. Flush1 (Flush' in textbook)
Flush: mandatory
Flush' (Flush1): happens only when cache-to-cache sharing is used, and only one cache flushes the data
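The difference between the two flush actions can be sketched as a snoop-side lookup. This is a minimal sketch under stated assumptions: the function name `mesi_snoop` and the flags `cache_to_cache` and `selected` (whether this cache is the one chosen to supply the data) are hypothetical, and the transitions follow the bus-initiated diagram above, where M and E respond with Flush and S responds with Flush1.

```python
def mesi_snoop(state, bus_event, cache_to_cache=True, selected=True):
    """Return (next_state, action) for a cache that snoops a bus event.

    state: 'M', 'E', 'S' or 'I'; bus_event: 'BusRd' or 'BusRdX'.
    """
    if bus_event not in ("BusRd", "BusRdX"):
        raise ValueError("unsupported bus event: " + bus_event)
    if state == "I":
        return "I", "-"  # nothing cached, nothing to do
    next_state = "S" if bus_event == "BusRd" else "I"
    if state in ("M", "E"):
        return next_state, "Flush"  # mandatory flush
    # state == 'S': optional flush, and only one sharer performs it
    action = "Flush1" if (cache_to_cache and selected) else "-"
    return next_state, action


print(mesi_snoop("M", "BusRd"))                  # mandatory flush
print(mesi_snoop("S", "BusRd", selected=False))  # another sharer supplies
```

Keeping the optional case behind two flags makes the point of the slide explicit: Flush1 disappears entirely when cache-to-cache sharing is off, and at most one sharer performs it when it is on.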

8 MESI Visualization
Setup: P1, P2, and P3 each have a cache with a snooper attached to a shared bus; the memory controller sits in front of main memory, which holds X=1.

9 MESI Visualization
P1 issues rd &X, which places BusRd on the bus. Memory: X=1.

10 MESI Visualization
P1 now caches X=1 in state E. Memory: X=1.

11 MESI Visualization
P1 issues wr &X (X=2). P1's copy goes from X=1 to X=2 and from E to M, with no bus transaction. Memory: X=1.
One fewer bus request due to Exclusive state, esp. for serial programs.

12 MESI Visualization
P3 issues rd &X, which places BusRd on the bus. P1 holds X=2 in M. Memory: X=1.

13 MESI Visualization
P1 flushes X=2 and goes from M to S. P3 caches X=2 in S. Memory is updated from X=1 to X=2.

14 MESI Visualization
P3 issues wr &X (X=3), which places BusUpgr on the bus. P1 goes from S to I. P3's copy goes from X=2 to X=3 and from S to M. Memory: X=2.
Note: BusUpgr instead of BusRdX.

15 MESI Visualization
P1 issues rd &X, which places BusRd on the bus. P3 flushes X=3 and goes from M to S. P1 goes from I to S with X=3. Memory is updated from X=2 to X=3.

16 MESI Visualization
rd &X by a processor that already holds X in S (P1 and P3 both do) hits in its own cache; no bus transaction. Memory: X=3.

17 MESI Visualization
P2 issues rd &X, which places BusRd on the bus. A sharer responds with Flush1, and P1, P2, and P3 all hold X=3 in S. Memory: X=3.
Referred to as cache-to-cache transfer in the Illinois MESI protocol.

18 MESI Example (Cache-to-Cache Transfer)
Columns: Proc Action, State P1, State P2, State P3, Bus Action, Data From
R1 E - BusRd Mem W1 M Own cache R3 S BusRd/Flush P1 cache W3 I BusRdX P3 cache R2 BusRd/Flush1 P1/P3 Cache*
* Data from memory if no cache-to-cache transfer, BusRd/-

19 MESI Example (Cache-to-Cache Transfer+BusUpgr)
Columns: Proc Action, State P1, State P2, State P3, Bus Action, Data From
R1 E - BusRd Mem W1 M Own cache R3 S BusRd/Flush P1 cache W3 I BusUpgr P3 cache R2 BusRd/Flush1 P1/P3 Cache*
* Data from memory if no cache-to-cache transfer, BusRd/-
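The example can be replayed with a small simulator. This is a sketch, not the authors' code: `mesi_step` is a hypothetical helper, it models one block with cache-to-cache transfer and BusUpgr enabled, and the access sequence R1, W1, R3, W3, R1, R3, R2 is taken from the visualization slides (the second R3 is an assumption about which sharer performs the read hit). Data sourcing on a write miss is simplified.

```python
def mesi_step(states, proc, op):
    """Apply one read ('R') or write ('W') by `proc` to a single block.

    states maps cache name -> 'M', 'E', 'S' or 'I' and is updated in place.
    Returns (bus_action, data_source).
    """
    me = states[proc]
    holders = [p for p in states if p != proc and states[p] != "I"]
    if op == "R":
        if me != "I":
            return "-", "own cache"           # read hit
        if not holders:
            states[proc] = "E"                # shared line not asserted
            return "BusRd", "memory"
        exclusive = [p for p in holders if states[p] in ("M", "E")]
        for p in holders:
            states[p] = "S"                   # every snooper ends in S
        states[proc] = "S"
        if exclusive:
            return "BusRd/Flush", exclusive[0] + " cache"
        return "BusRd/Flush1", "/".join(holders) + " cache"
    # op == "W"
    if me in ("M", "E"):
        states[proc] = "M"                    # silent upgrade
        return "-", "own cache"
    for p in holders:
        states[p] = "I"                       # invalidate all other copies
    states[proc] = "M"
    if me == "S":
        return "BusUpgr", "own cache"
    source = holders[0] + " cache" if holders else "memory"
    return "BusRdX", source


states = {"P1": "I", "P2": "I", "P3": "I"}
sequence = [("P1", "R"), ("P1", "W"), ("P3", "R"), ("P3", "W"),
            ("P1", "R"), ("P3", "R"), ("P2", "R")]
trace = []
for proc, op in sequence:
    bus, src = mesi_step(states, proc, op)
    trace.append((op + proc[1], states["P1"], states["P2"], states["P3"], bus, src))
for row in trace:
    print(*row, sep="\t")
```

Each printed row has the same columns as the table: action, the three cache states after the action, the bus action, and the data source. Here `I` is printed where the table shows `-` for a block that was never cached.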

20 Lower-Level Protocol Choices
Who supplies data on a miss when not in M state: memory or cache?
Original Illinois MESI: cache
  Assumes cache is faster than memory (cache-to-cache transfer)
  Not necessarily true
Adds complexity
  How does memory know it should supply data? (must wait for caches)
  Selection algorithm needed if multiple caches have valid data
Valuable for distributed memory
  May be cheaper to obtain from nearby cache than distant memory
  Especially when constructed out of SMP nodes (Stanford DASH)

21 Lecture 9 Outline
MESI protocol
Dragon update-based protocol
Impact of protocol optimizations

22 Dragon Writeback Update Protocol
Four states:
  Exclusive-clean (E): I and memory have it
  Shared-clean (Sc): I, others, and maybe memory, but I’m not owner
  Shared-modified (Sm): I and others but not memory, and I’m the owner
    Sm and Sc can coexist in different caches, with at most one Sm
  Modified or dirty (M): I and no one else
On replacement: Sc can silently drop, Sm has to flush
No invalid state
  If in cache, cannot be invalid
  If not present in cache, can view as being in not-present or invalid state
New processor events: PrRdMiss, PrWrMiss
  Introduced to specify actions when block not present in cache
New bus transaction: BusUpd
  Broadcasts single word written on bus; updates other relevant caches

23 Dragon State Transition Diagram
[State transition diagram over states E, Sc, Sm, M. Edge labels: PrRd/-, PrWr/-, PrRdMiss/BusRd(S), PrWrMiss/(BusRd(S); BusUpd), PrWr/BusUpd(S), BusRd/-, BusRd/Flush, BusUpd/Update]

24 Dragon: Processor-Initiated Transactions
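[State diagram, processor-initiated edges]
Not present: PrRdMiss/BusRd(~S) (to E); PrRdMiss/BusRd(S) (to Sc); PrWrMiss/(BusRd(S); BusUpd) (to Sm); PrWrMiss/BusRd(~S) (to M)
E: PrRd/- (stay in E); PrWr/- (E → M)
Sc: PrRd/- (stay in Sc); PrWr/BusUpd(S) (Sc → Sm); PrWr/BusUpd(~S) (Sc → M)
Sm: PrRd/- (stay in Sm); PrWr/BusUpd(S) (stay in Sm); PrWr/BusUpd(~S) (Sm → M)
M: PrRd/- (stay in M); PrWr/- (stay in M)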

25 Dragon: Bus-Initiated Transactions
[State diagram, bus-initiated edges]
E: BusRd/- (E → Sc)
Sc: BusRd/- (stay in Sc); BusUpd/Update (stay in Sc)
Sm: BusRd/Flush (stay in Sm); BusUpd/Update (Sm → Sc)
M: BusRd/Flush (M → Sm)

26 Dragon Visualization
Setup: P1, P2, and P3 each have a cache with a snooper attached to a shared bus; the memory controller sits in front of main memory, which holds X=1.

27 Dragon Visualization
P1 issues rd &X, which places BusRd on the bus. Memory: X=1.

28 Dragon Visualization
P1 now caches X=1 in state E.

29 Dragon Visualization
P1 issues wr &X (X=2). P1's copy goes from X=1 to X=2 and from E to M, with no bus transaction. Memory: X=1.
One fewer bus request due to Exclusive state, esp. for serial programs.

30 Dragon Visualization
P3 issues rd &X, which places BusRd on the bus. P1 holds X=2 in M. Memory: X=1.

31 Dragon Visualization
P1 goes from M to Sm and keeps X=2. P3 caches X=2 in Sc. Memory still holds X=1.

32 Dragon Visualization
P3 issues wr &X (X=3), which places BusUpd on the bus. P1's copy goes from X=2 to X=3 and from Sm to Sc. P3's copy goes from X=2 to X=3 and from Sc to Sm. Memory: X=1.
Note: BusUpd instead of BusUpgr (no invalidation is performed).

33 Dragon Visualization
P1 issues rd &X and hits in its own cache (P1: X=3 Sc, P3: X=3 Sm). Memory: X=1.
This is a miss in the MESI and MSI protocols.

34 Dragon Visualization
Another rd &X hits in cache; states are unchanged (P1: X=3 Sc, P3: X=3 Sm). Memory: X=1.

35 Dragon Visualization
P2 issues rd &X, which places BusRd on the bus. P2 caches X=3 in Sc; P1 stays in Sc and P3 stays in Sm. Memory: X=1.
Note: only the cache with Sm is responsible for cache-to-cache transfer.

36 Dragon Visualization
P1 replaces X. Its Sc copy is dropped without a bus transaction. Memory: X=1.

37 Dragon Visualization
P3 replaces X. As owner (Sm), it writes the block back, and memory is updated from X=1 to X=3.
Owner is responsible for writing back to memory, vs. MSI or MESI, where write-back happens only when the line is in M state.

38 Dragon Example
Columns: Proc Action, State P1, State P2, State P3, Bus Action, Data From
R1 E - BusRd Mem W1 M Own cache R3 Sm Sc BusRd/Flush P1 cache W3 BusUpd/Upd R2 P3 cache
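The Dragon example can be replayed the same way. The sketch below is an illustration under stated assumptions rather than the authors' code: `dragon_step` is a hypothetical helper, `None` stands for "not present in cache", replacements are not modeled, and the access sequence R1, W1, R3, W3, R1, R3, R2 follows the visualization slides (which processor performs the second read hit is an assumption).

```python
def dragon_step(states, proc, op):
    """Apply one read ('R') or write ('W') by `proc` to a single block.

    states maps cache name -> 'E', 'Sc', 'Sm', 'M' or None (not present)
    and is updated in place. Returns (bus_action, data_source).
    """
    me = states[proc]
    holders = [p for p in states if p != proc and states[p] is not None]
    bus, src = [], "own cache"
    if me is None:                            # PrRdMiss or PrWrMiss
        owners = [p for p in holders if states[p] in ("M", "Sm")]
        for p in holders:
            # snoopers on BusRd: E -> Sc, M -> Sm, Sc and Sm unchanged
            states[p] = {"E": "Sc", "M": "Sm"}.get(states[p], states[p])
        if owners:
            bus.append("BusRd/Flush")         # the owner supplies the block
            src = owners[0] + " cache"
        else:
            bus.append("BusRd")
            src = "memory"
        me = "Sc" if holders else "E"         # shared line asserted or not
    if op == "W":
        if me in ("Sc", "Sm"):
            bus.append("BusUpd")              # broadcast the written word
            for p in holders:
                states[p] = "Sc"              # other copies are updated
            me = "Sm" if holders else "M"
        else:
            me = "M"                          # E or M: silent write
    states[proc] = me
    return "; ".join(bus) or "-", src


states = {"P1": None, "P2": None, "P3": None}
sequence = [("P1", "R"), ("P1", "W"), ("P3", "R"), ("P3", "W"),
            ("P1", "R"), ("P3", "R"), ("P2", "R")]
trace = []
for proc, op in sequence:
    bus, src = dragon_step(states, proc, op)
    row = [op + proc[1]] + [states[p] or "-" for p in ("P1", "P2", "P3")]
    trace.append(tuple(row + [bus, src]))
for row in trace:
    print(*row, sep="\t")
```

Unlike the MESI replay, the reads by P1 and P3 after W3 both hit, because BusUpd updates the other copy instead of invalidating it.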

39 Lower-Level Protocol Choices
Can shared-modified state be eliminated?
  Yes, if memory is updated as well on BusUpd transactions (DEC Firefly)
  Dragon protocol doesn’t (assumes DRAM memory is slow to update)
Should replacement of an Sc block be broadcast?
  Would allow last copy to go to Exclusive state and not generate updates
  Replacement bus transaction is not in critical path; a later update may be
Shouldn’t update local copy on write hit before controller gets bus
  Can mess up serialization
Coherence, consistency considerations much like write-through case
In general, many subtle race conditions in protocols
But first, let’s illustrate quantitative assessment at logical level

40 Lecture 9 Outline
MESI protocol
Dragon update-based protocol
Impact of protocol optimizations

41 Assessing Protocol Tradeoffs
Methodology:
  Use simulator; choose parameters per earlier methodology (default 1MB, 4-way cache, 64-byte block, 16 processors; 64K cache for some)
  Focus on frequencies, not end performance, for now
    Transcends architectural details, but not what we’re really after
  Use idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters
    Cheap simulation: no need to model contention

42 Impact of Protocol Optimizations
MESI vs. MSI (w/ BusUpgr) vs. MSI (w/ BusRdX)
[Figure: bus traffic (MB/s), split into address bus and data bus, for Barnes, LU, Ocean, Radiosity, Radix, Raytrace, Appl-Code, Appl-Data, OS-Code, and OS-Data, each under three protocols: III (MESI), 3St (MSI with BusUpgr), and 3St-RdEx (MSI with BusRdX)]
MSI performs about the same as MESI
Upgrades instead of read-exclusive help
Same story when working sets don’t fit for Ocean, Radix, Raytrace

43 Impact of Cache-Block Size
Multiprocessors add a new kind of miss to cold, capacity, conflict
  Coherence misses: due to invalidations
    True sharing: write to same word
    False sharing: write to different words
Reducing misses architecturally in invalidation protocol
  Capacity: enlarge cache; increase block size (if spatial locality)
  Conflict: increase associativity
  Cold and coherence: only block size
Increasing block size has advantages and disadvantages
  Can reduce misses if spatial locality is good
  Can hurt too
    Increases misses due to false sharing if spatial locality is not good
    Increases misses due to conflicts in fixed-size cache
    Increases traffic due to fetching unnecessary data and due to false sharing
    Can increase miss penalty and perhaps hit cost
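The true/false sharing distinction, and why larger blocks produce more false sharing, can be illustrated with address arithmetic. This is a minimal sketch: the function name `sharing_kind`, the 4-byte word size, and the example addresses are assumptions chosen for illustration, not values from the slides.

```python
def sharing_kind(addr_a, addr_b, block_size=64, word_size=4):
    """Classify two byte addresses written by two different processors."""
    if addr_a // block_size != addr_b // block_size:
        return "no sharing"      # different cache blocks: no coherence miss
    if addr_a // word_size == addr_b // word_size:
        return "true sharing"    # same word in the same block
    return "false sharing"       # same block, different words


# Two writers touch neighboring words: a miss caused only by block granularity.
print(sharing_kind(0x100, 0x104, block_size=64))
# Two writers 64 bytes apart: independent at 64-byte blocks...
print(sharing_kind(0x100, 0x140, block_size=64))
# ...but falsely shared once the block grows to 128 bytes.
print(sharing_kind(0x100, 0x140, block_size=128))
```

The last two calls use the same pair of addresses; only the block size changes, which is the effect the slide attributes to larger blocks.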

44 Impact of Block Size on Miss Rate
For default problem size: vary block/line size from 8 to 256 bytes
[Figure: miss rate (%), broken down into cold, capacity, true sharing, false sharing, and upgrade misses, for Barnes, LU, Radiosity, Ocean, Radix, and Raytrace at block sizes of 8, 16, 32, 64, 128, and 256 bytes]
- Capacity misses decrease because of the ambiguous categorization of cache misses used by the authors
- True sharing misses go down because there are fewer lines that the cache can store, so there is less likelihood that a true-shared line is in the cache
Decreases with larger lines: cold, capacity (due to spatial locality), true sharing (due to spatial locality)
Increases with larger lines: false sharing
Working set doesn’t fit: impact of capacity misses is large (Ocean, Radix)

45 Impact of Block Size on Traffic
Traffic (bytes/inst) affects performance indirectly through contention
[Figure: traffic (bytes/instruction), split into data bus and address bus, for Barnes, Radiosity, and Raytrace at block sizes of 8, 16, 32, 64, 128, and 256 bytes]
Results different than for miss rate: traffic almost always increases
When working set fits, overall traffic still small, except for Radix
Fixed overhead is significant component
  So total traffic often minimized at byte block, not smaller
Working set doesn’t fit: even 128-byte good for Ocean due to capacity
Address bus traffic behaves in the opposite way from data bus traffic

