WormBench A Configurable Application for Evaluating Transactional Memory Systems MEDEA Workshop Ferad Zyulkyarov 1, 2, Sanja Cvijic 3, Osman Unsal 1, Adrian Cristal 1, Eduard Ayguade 1, 2, Tim Harris 4, Mateo Valero 1, 2 1 Barcelona Supercomputing Center, 2 Universitat Politecnica de Catalunya, 3 Belgrade University, 4 Microsoft Research Cambridge UK
Outline Transactional Memory Idea Motivation WormBench Features WormBench main components WormBench input – run configuration Analysis Modeling STAMP’s genome Conclusion
Transactional Memory atomic { }
Idea Inspired by the Snake game Worms are active objects Worms move in a BenchWorld On every move Worms do computation
Motivation - General We don’t yet know exactly how to write TM applications Converting lock-based applications to transactions 1:1 is not the correct approach –For example, would it make sense to convert a lock-based application 1:1 into message-passing synchronization?
Motivation - Existing TM Applications (1/2) STAMP [IISWC’2008] –specific to TL2 [ISCA’2007] –does not have lock based implementation –tm_write() and tm_read() carefully used – thus assuming perfect compiler STMBench7 [EuroSys’2007] –Suitable for STM –Too big data structures ( bytes); too long transactions (10 tx/s)
Motivation - Existing TM Applications (2/2) SPLASH-2 [ISCA’1995] –Embarrassingly parallel –Fine grain locking –Not suitable for the intended TM usage pattern (coarse grain locking) Haskell STM Benchmark [CF’2007] –Implemented in declarative language –Depends on language and type system enforced constraints (TVar, monads)
WormBench’s Goal Unify the features of existing TM applications A tool for instrumenting multi-threaded applications A set of run configurations that serves as a baseline for comparing TM systems against each other and against locks Specific run configurations that stress a particular design or implementation aspect of a TM system, such as the sizes of internally used buffers
WormBench Features (1/2) Implemented in an imperative language, C# –Compiled with Bartok Follows object-oriented programming concepts Critical sections are marked with atomic –Can be used to test the compiler infrastructure Represents a typical parallel application with shared data Highly configurable through run configurations
WormBench Features (2/2) Suitable for HTM, STM and Hybrid TM variants No assumptions about TM system design and implementation Lock based and transactional implementation for comparison purposes Sanity check verification for the underlying TM system
Main Objects in WormBench BenchWorld –BenchWorldNode Worm –Body –Head Message
Example Worm –Body length 8 –Head Size 4 Operations –Sum – ahead –Average – right –Min - ahead
WormBench Input – Run configuration Size of the BenchWorld; Number of worms (number of threads); Body length of each worm; Head size of each worm; The number and type of worm operations that each worm has to perform while moving
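The run configuration parameters listed above could be grouped into a small data structure. The following is only an illustrative sketch in Python (WormBench itself is written in C#); the names `RunConfig`, `WormConfig`, `world_size` and the validation rules are hypothetical and do not come from the WormBench sources.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WormConfig:
    """Per-worm parameters (hypothetical names)."""
    body_length: int
    head_size: int
    operations: List[str] = field(default_factory=list)


@dataclass
class RunConfig:
    """One run configuration: the world size plus one entry per worm."""
    world_size: int  # assumed square BenchWorld of world_size x world_size
    worms: List[WormConfig] = field(default_factory=list)

    @property
    def num_threads(self) -> int:
        # The slides equate the number of worms with the number of threads.
        return len(self.worms)

    def validate(self) -> None:
        # Assumed sanity rules: positive sizes, and a worm must fit in the world.
        if self.world_size <= 0:
            raise ValueError("world_size must be positive")
        for worm in self.worms:
            if worm.body_length < 1 or worm.head_size < 1:
                raise ValueError("body_length and head_size must be >= 1")
            if max(worm.body_length, worm.head_size) > self.world_size:
                raise ValueError("worm does not fit into the BenchWorld")


# Four identical worms shaped like the example worm (body length 8, head size 4).
config = RunConfig(
    world_size=128,
    worms=[WormConfig(8, 4, ["sum", "average", "min"]) for _ in range(4)],
)
config.validate()
```

Keeping the operation list per worm means each thread can be given a different workload within the same run.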
Instantiates Common Sync Scenarios (1/2) Object access serializability –Guarding a shared variable with locks Two phase locking and its derivatives –Locking protocol which attempts non-blocking fine grain locking avoiding dead-lock Multiple granularity locking –Fine-grain locking technique used to lock a region in a collection/hierarchical data structure
Instantiates Common Sync Scenarios (2/2) Dining Philosophers –Deadlock scenario Barrier synchronization –Worms wait until all worms in the group (or all worms) reach a certain point in execution
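The barrier scenario can be illustrated with ordinary threads. This is a minimal sketch in Python using the standard `threading.Barrier`, not WormBench code; the function name `run_worms_with_barrier` and the event log are assumptions made for the example.

```python
import threading


def run_worms_with_barrier(num_worms, moves_per_phase):
    """Each worm performs its moves, then waits until all worms have arrived."""
    barrier = threading.Barrier(num_worms)
    log = []
    log_lock = threading.Lock()

    def worm(worm_id):
        for _ in range(moves_per_phase):
            with log_lock:
                log.append(("move", worm_id))
        # No worm continues until every worm has finished its moves.
        barrier.wait()
        with log_lock:
            log.append(("after_barrier", worm_id))

    threads = [threading.Thread(target=worm, args=(i,)) for i in range(num_worms)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


events = run_worms_with_barrier(4, 3)
```

Because a worm only reaches the barrier after logging all of its moves, every "move" entry precedes every "after_barrier" entry in the log.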
Retry or Conditional Atomic Retry –The use of retry or conditional atomic is mostly neglected.
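As a rough illustration of what retry expresses, the sketch below emulates it in Python with a lock and a condition variable: a reader blocks until a message is present and then re-checks its condition. This is an emulation, not a TM implementation, and the `Mailbox` class and its methods are hypothetical.

```python
import threading


class Mailbox:
    """A node that holds messages; take() blocks until one is available."""

    def __init__(self):
        self._cond = threading.Condition()
        self._messages = []

    def leave(self, message):
        with self._cond:
            self._messages.append(message)
            self._cond.notify_all()  # wake up waiting readers

    def take(self):
        with self._cond:
            # Plays the role of "retry": give up, wait for a change, re-check.
            while not self._messages:
                self._cond.wait()
            return self._messages.pop(0)


box = Mailbox()
received = []
consumer = threading.Thread(target=lambda: received.append(box.take()))
consumer.start()
box.leave("goto node (3, 5)")
consumer.join()
```

In a TM system the waiting and re-execution would be handled by the runtime rather than by an explicit condition variable.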
Currently Available Worm Operations (1/2) Read-only –Sum –Average –Min –Max –Median I/O –Checkpoint –Undo
Currently Available Worm Operations (2/2) Read dominated –Replace min with average –Replace max with average –Replace median with average –Replace min and max Write dominated –Sort –Transpose Leave message – for complex synchronization scenarios –Goto node message
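To make the three operation classes concrete, here is a small Python sketch of one or two operations from each. It assumes the nodes under a worm's head are available as a list of integers (or a square grid for transpose) and that the average is an integer; neither detail is stated in the slides, and the function names are hypothetical.

```python
def op_sum(values):
    """Read-only: reads every node, writes none."""
    return sum(values)


def op_average(values):
    """Read-only: arithmetic mean of the node values."""
    return sum(values) / len(values)


def replace_min_with_average(values):
    """Read dominated: reads every node, writes exactly one."""
    avg = sum(values) // len(values)  # integer average is an assumption
    values[values.index(min(values))] = avg
    return values


def op_sort(values):
    """Write dominated: may rewrite every node, in place."""
    values.sort()
    return values


def op_transpose(grid):
    """Write dominated: transposes a square grid of node values."""
    return [list(row) for row in zip(*grid)]


head = [7, 2, 9, 4]
total = op_sum(head)
mean = op_average(head)
updated = replace_min_with_average(list(head))
ordered = op_sort(list(head))
flipped = op_transpose([[1, 2], [3, 4]])
```

The ratio of reads to writes per operation is what distinguishes the classes, which in turn determines the read and write set sizes a TM system would see.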
Worm Operations – Execution Distribution (table; columns: Op, B[1.1], H[1.1], B[4.4], H[4.4], B[8.8], H[8.8], B[1.8], H[1.8]; rows: Sum, Avg, Median, Min, Max, Rep Max with Avg, Rep Min with Avg, Rep Med with Avg, Rep Max and Min, Rep Med with Min, Rep Med and Max, Sort, Transpose, Checkpoint, Undo, Total)
Worm Operations – TM Characteristics (table; columns: Op, then R and W for each body length 1, 2, 4, 8) Worm head size is fixed to 1 and body length is 1, 2, 4, 8
Worm Operations – TM Characteristics (table; columns: Op, then R and W for each head size 1, 2, 4, 8) Body length is fixed to 1 and head size is 1, 2, 4, 8
Analyzing Sample Run Configurations Lock vs Transactions Change in BenchWorld size Change in worms’ body length and head size Initialization of worms for smaller BenchWorld
Lock vs Transactions
Throughput ~ Worms ~ BenchWorld Relationship between throughput, worms’ size and BenchWorld
Initializing Worms for Smaller BenchWorld How the conflict rate is affected when worms are initialized for a smaller BenchWorld. Averaged results: worms initialized for 128x128 and run in 128x, 256x, 512x and 1024x size BenchWorlds
Modeling Genome App. From STAMP To obtain the results shown in Table IV we used the following run configuration: –Worms body length = 1 –Worms head size = 4 –BenchWorld of size 52x52 –Worm operations: randomly generated stream of worm operations, where the ratio between the worm operations was Operations (1:2:3:4:5:6:7:8:9:10:11:12:13:14:15) = Ratio (1:1:1:0:0:2:1:1:1:1:1:1:2:0:0) (table; columns: T#, then Commit Rate, Read per TX, Write per TX and Speedup, each reported for Gen. and WB)
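A randomly generated operation stream with the ratio given above can be sketched as weighted sampling. The Python below is illustrative only: operations are referred to by their numbers 1 to 15 because the slide does not say which number maps to which operation, and `generate_op_stream` and its seed parameter are hypothetical.

```python
import random

# Operation numbers and their relative weights, as given on the slide.
OP_IDS = list(range(1, 16))
RATIO = [1, 1, 1, 0, 0, 2, 1, 1, 1, 1, 1, 1, 2, 0, 0]


def generate_op_stream(length, seed=0):
    """Draw `length` operation numbers with probability proportional to RATIO."""
    rng = random.Random(seed)  # fixed seed so a run configuration is repeatable
    return rng.choices(OP_IDS, weights=RATIO, k=length)


stream = generate_op_stream(1000)
```

Operations with weight 0 (numbers 4, 5, 14 and 15) never appear in the stream, and operations 6 and 13 are drawn twice as often as the others on average.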
Future Work Toolset that automatically generates a run configuration representing a user-defined transactional and runtime behavior, e.g.: –Commit rate = 80% –Reads per TX = 6 –Writes per TX = 2 –Runtime = 100 moves/ms Implement BenchWorld as –Linked list –Sparse matrix
Future Work Understand how the Messaging works in BenchWorld Prepare a baseline set of run configurations to benchmark TM systems (HTM, STM and hybrid TMs) Fine grain version using two-phase locking
Conclusion WormBench is a highly configurable workload for TM Independent of TM design and implementation Critical sections defined by language-level atomic blocks Coarse-grain lock-based version Sanity check for the overall TM system But it is still small and does not exercise language extensions for TM and their semantics
The End