Performance of the Shasta distributed shared memory protocol
Daniel J. Scales, Kourosh Gharachorloo
Presented by Nguyen Toan Duc, Department of Creative Informatics (Master's course)
Agenda
Basic design of Shasta
Protocol optimizations
Performance results
Basic design of Shasta
Cache coherence protocol with 3 states: invalid, shared, exclusive
Shared miss: a read to an invalid line, or a write to an invalid or shared line
Block size can be different for different ranges of the shared address space
The shared address space is divided into fixed-size ranges, called lines
– Block = n contiguous lines
State information for each line is maintained in a state table
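The state-table lookup might look like the following minimal C sketch; the line size, table layout, and all names are illustrative assumptions, not taken from the paper.

```c
/* Minimal sketch of a per-line state table (line size, names, and
 * layout are illustrative assumptions). Every fixed-size line of the
 * shared address space has one entry recording its coherence state. */
#include <stdint.h>

#define LINE_SHIFT 6   /* assumed 64-byte lines */

typedef enum { INVALID, SHARED, EXCLUSIVE } line_state_t;

extern uint8_t   state_table[];   /* one entry per line */
extern uintptr_t shared_base;     /* start of the shared address space */

/* Map a shared address to the coherence state of its line. */
static inline line_state_t line_state(const void *addr) {
    return (line_state_t)
        state_table[((uintptr_t)addr - shared_base) >> LINE_SHIFT];
}
```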
Basic shared miss check
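A C-level sketch of the check Shasta inserts before each load and store, building on the state-table sketch above; the real system inlines the equivalent machine instructions, and shared_miss_handler is a hypothetical name for the protocol entry point.

```c
/* Sketch of the basic shared miss check, written in C for readability;
 * Shasta actually inserts the equivalent instruction sequence into the
 * compiled binary. A read requires the line to be SHARED or EXCLUSIVE;
 * a write requires EXCLUSIVE. shared_miss_handler is a hypothetical
 * name for the routine that fetches the line from its owner. */
extern void shared_miss_handler(void *addr, int is_write);

int32_t checked_load(int32_t *p) {
    if (line_state(p) == INVALID)          /* read miss on invalid line */
        shared_miss_handler(p, 0);
    return *p;
}

void checked_store(int32_t *p, int32_t v) {
    if (line_state(p) != EXCLUSIVE)        /* write miss unless exclusive */
        shared_miss_handler(p, 1);
    *p = v;
}
```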
Shared miss check optimizations
Invalid flag technique (sketched below):
– Set each long word (4 bytes) of an invalid line to a special flag value
– Compare the loaded word value with the flag value -> miss or no miss
Batching miss checks:
– Batch together the checks for multiple loads / stores
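A sketch of the invalid-flag load check; the flag constant is an illustrative placeholder, not a value from the paper.

```c
/* Sketch of the invalid-flag technique for load checks. When a line is
 * invalidated, every long word (4 bytes) in it is overwritten with a
 * special flag value, so the common-case check is just "load, then
 * compare against the flag" with no state-table lookup. */
#define INVALID_FLAG 0xDEADBEEF   /* illustrative sentinel value */

int32_t flag_checked_load(int32_t *p) {
    int32_t v = *p;                        /* load the data first */
    if ((uint32_t)v == INVALID_FLAG) {
        /* Rare case: either a real miss, or application data that
         * happens to equal the flag; consult the state table. */
        if (line_state(p) == INVALID) {
            shared_miss_handler(p, 0);
            v = *p;                        /* reload after the line fill */
        }
    }
    return v;
}
```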
Protocol optimizations
Minimizing protocol messages:
– The owner node guarantees to service any request forwarded to it
– No need to retry requests due to transient states or deadlock: pending requests are saved in a queue
Multiple coherence granularity (sketched below):
– Block size is chosen based on the data structure: a small object is kept as a single unit, a large object is divided into lines
– Different granularities are associated with different virtual pages
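One way the per-page granularity association could be laid out, as a hedged sketch; the page size and table name are assumptions.

```c
/* Sketch of variable coherence granularity (page size and names are
 * assumptions). Each virtual page of shared memory is associated with
 * a block size, expressed here as the number of lines per block. */
#define PAGE_SHIFT 13   /* assumed 8 KB pages */

extern uint16_t lines_per_block[];   /* indexed by shared page number */

/* Number of lines in the coherence block that contains addr. */
static inline unsigned block_lines(const void *addr) {
    return lines_per_block[((uintptr_t)addr - shared_base) >> PAGE_SHIFT];
}
```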
Protocol optimizations (2)
Exploiting the relaxed memory model:
– Non-blocking loads / stores
– Non-blocking release
– Eager exclusive replies: on a read-exclusive request, data is sent back immediately to the requesting processor, while requests from other processors are delayed
Batching
Detecting migratory sharing patterns (sketched below):
– Migratory sharing: data is read and then modified by a succession of processors, migrating from one processor to another
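A hedged sketch of how migratory-pattern detection could be structured; the counter, threshold, and trigger condition are illustrative, not the paper's exact heuristic.

```c
/* Illustrative sketch of migratory-sharing detection; the threshold and
 * trigger condition are assumptions. Once a line is flagged migratory,
 * a read miss is granted exclusive ownership up front, eliminating the
 * later upgrade message when the reader modifies the data. */
#define MIGRATORY_THRESHOLD 2

struct migratory_info {
    int last_reader;      /* processor whose read miss was last served */
    int streak;           /* consecutive read-then-write observations */
    int is_migratory;     /* if set, serve read misses exclusively */
};

/* Called when 'proc' writes a line it recently read; 'prev_writer' is
 * the processor that last modified the line. */
void observe_write(struct migratory_info *mi, int proc, int prev_writer) {
    if (mi->last_reader == proc && proc != prev_writer) {
        /* Read followed by write from a new processor: data is moving. */
        if (++mi->streak >= MIGRATORY_THRESHOLD)
            mi->is_migratory = 1;
    } else {
        mi->streak = 0;   /* pattern broken: revert to normal handling */
        mi->is_migratory = 0;
    }
}
```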
Performance
Effect of Release Consistency
Non-blocking release
Effect of upgrades & sharing writebacks
Support for upgrade messages is important for some applications (e.g., VolRend)
Sharing writeback messages hurt performance
Effect of the migratory optimization
Disappointing!
Summary of results
Support for variable-granularity communication is the most important optimization in Shasta
Support for upgrade messages and a dirty-sharing protocol are also important
Exploiting RC provides only small performance gains, because processors are busy handling protocol messages while they wait for their own requests to complete
The migratory optimization is not useful in Shasta
Conclusion
Shasta supports fine-grain access to shared memory by inserting code before load / store instructions to check the state of the shared data
Shasta implements shared memory entirely in software -> flexibility in granularity & optimizations
Variable granularity is the most important optimization