WormBench A Configurable Application for Evaluating Transactional Memory Systems MEDEA Workshop 26.10.2008 Ferad Zyulkyarov 1, 2, Sanja Cvijic 3, Osman.


WormBench A Configurable Application for Evaluating Transactional Memory Systems MEDEA Workshop Ferad Zyulkyarov 1, 2, Sanja Cvijic 3, Osman Unsal 1, Adrian Cristal 1, Eduard Ayguade 1, 2, Tim Harris 4, Mateo Valero 1, 2 1 Barcelona Supercomputing Center, 2 Universitat Politecnica de Catalunya, 3 Belgrade University, 4 Microsoft Research Cambridge UK

Outline: Transactional Memory; Idea; Motivation; WormBench features; WormBench main components; WormBench input (run configuration); Analysis; Modeling STAMP’s genome; Conclusion

Transactional Memory atomic { }
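Here is a minimal Python sketch that makes the `atomic { }` notation concrete. It assumes the simplest possible semantics: every atomic block behaves as if it held one global re-entrant lock. The names `atomic` and `increment_many` are hypothetical. A real TM system would run non-conflicting blocks optimistically in parallel rather than serializing them all.

```python
import threading
from contextlib import contextmanager

# Hypothetical stand-in for a TM runtime: every atomic block is serialized
# through one global re-entrant lock, so nested atomic blocks do not deadlock.
_global_tm_lock = threading.RLock()

@contextmanager
def atomic():
    """Run the enclosed block as if it were one indivisible step."""
    with _global_tm_lock:
        yield

counter = [0]

def increment_many(n):
    for _ in range(n):
        with atomic():          # plays the role of: atomic { counter++; }
            counter[0] += 1

threads = [threading.Thread(target=increment_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])  # 40000
```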

Idea: Inspired by the Snake game; worms are active objects; worms move in a BenchWorld; on every move, worms perform a computation
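Below is a minimal sketch of the Snake-like movement, under stated assumptions:

- The BenchWorld is a square grid of integers that wraps at its edges.
- A worm is a queue of cells whose last element is its head.
- The per-move "computation" is just a sum over the cells the worm occupies.

`Worm` and `move_and_compute` are hypothetical names, not WormBench's C# classes.

```python
from collections import deque

class Worm:
    """Hypothetical worm: a queue of (row, col) cells on a wrapping square grid."""

    def __init__(self, cells, world_size):
        self.cells = deque(cells)      # cells[0] is the tail, cells[-1] the head
        self.world_size = world_size

    def move(self, d_row, d_col):
        """Advance one step: add a new head cell and drop the tail cell."""
        head_row, head_col = self.cells[-1]
        new_head = ((head_row + d_row) % self.world_size,
                    (head_col + d_col) % self.world_size)
        self.cells.append(new_head)
        self.cells.popleft()
        return new_head

def move_and_compute(worm, world, d_row, d_col):
    """Move once, then do some work on the world (here: sum the occupied cells)."""
    worm.move(d_row, d_col)
    return sum(world[r][c] for r, c in worm.cells)

world = [[r * 4 + c for c in range(4)] for r in range(4)]
w = Worm([(0, 0), (0, 1)], world_size=4)
print(move_and_compute(w, world, 0, 1))  # 3: the worm now covers (0, 1) and (0, 2)
```

The deque keeps each move O(1): the body length stays constant because every new head cell is matched by dropping the tail cell.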

Motivation - General: We do not yet know exactly how to write TM applications; converting applications from locks 1:1 is not the correct approach –for example, would converting a lock-based application 1:1 into message-passing synchronization give the same program?

Motivation - Existing TM Applications (1/2): STAMP [IISWC’2008] –specific to TL2 [ISCA’2007] –has no lock-based implementation –tm_write() and tm_read() are used carefully, thus assuming a perfect compiler; STMBench7 [EuroSys’2007] –suitable for STM –data structures too big ( bytes) and transactions too long (10 tx/s)

Motivation - Existing TM Applications (2/2): SPLASH-2 [ISCA’1995] –embarrassingly parallel –fine-grain locking –not suitable for the intended TM usage pattern (coarse-grain locking); Haskell STM Benchmark [CF’2007] –implemented in a declarative language –depends on constraints enforced by the language and type system (TVar, monads)

WormBench’s Goal: Unify the features of existing TM applications; a tool for instrumenting multi-threaded applications; a set of run configurations to serve as a baseline for evaluating TM systems against each other and against locks; specific run configurations that stress a particular design or implementation aspect of a TM system, such as the sizes of internally used buffers

WormBench Features (1/2): Implemented in the imperative language C# –compiled with Bartok; follows object-oriented programming concepts; critical sections are marked with atomic –can be used to test the compiler infrastructure; represents a typical parallel application with shared data; highly configurable through run configurations

WormBench Features (2/2): Suitable for HTM, STM and hybrid TM variants; no assumptions about TM system design and implementation; lock-based and transactional implementations for comparison purposes; sanity-check verification of the underlying TM system

Main Objects in WormBench: BenchWorld –BenchWorldNode; Worm –Body –Head; Message
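The slide names these classes but not their contents. The Python dataclasses below sketch one possible shape for them. Every field is an assumption made for illustration: a node value, an optional message per node, and a worm body stored as a list of cells.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical skeleton of the main WormBench objects; fields are assumptions.

@dataclass
class Message:
    sender_id: int
    text: str

@dataclass
class BenchWorldNode:
    value: int = 0
    message: Optional[Message] = None    # a worm may leave a message here

@dataclass
class BenchWorld:
    size: int
    nodes: List[List[BenchWorldNode]] = field(init=False)

    def __post_init__(self):
        # Square grid of nodes with distinct initial values.
        self.nodes = [[BenchWorldNode(value=r * self.size + c)
                       for c in range(self.size)] for r in range(self.size)]

    def node(self, row: int, col: int) -> BenchWorldNode:
        """Look up a node, wrapping coordinates at the edges."""
        return self.nodes[row % self.size][col % self.size]

@dataclass
class Worm:
    worm_id: int
    body: List[Tuple[int, int]]    # cells occupied by the body, head last
    head_size: int                 # extent of the area the head operates on

    def leave_message(self, world: BenchWorld, text: str) -> None:
        """Attach a message to the node under the worm's head cell."""
        row, col = self.body[-1]
        world.node(row, col).message = Message(self.worm_id, text)

world = BenchWorld(size=3)
worm = Worm(worm_id=7, body=[(0, 0), (0, 1)], head_size=1)
worm.leave_message(world, "hi")
print(world.node(0, 1).message)  # Message(sender_id=7, text='hi')
```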

Example: Worm –body length 8 –head size 4; Operations –Sum: ahead –Average: right –Min: ahead
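The Python sketch below shows one way this example worm could be encoded and executed. It rests on two assumptions that the slide does not state:

- Each entry pairs an operation with the direction of the worm's next move.
- A head of size h covers an h x h square of nodes.

`turn`, `head_values` and `run_example` are hypothetical helpers.

```python
import statistics

# Worm operations used by the example worm, applied to the values under the head.
OPERATIONS = {"Sum": sum, "Average": statistics.mean, "Min": min}

def turn(heading, direction):
    """Return the new (d_row, d_col) heading; 'right' rotates clockwise."""
    d_row, d_col = heading
    if direction == "ahead":
        return heading
    if direction == "right":
        return (d_col, -d_row)
    if direction == "left":
        return (-d_col, d_row)
    raise ValueError(f"unknown direction: {direction}")

def head_values(world, row, col, head_size):
    """Values in the head_size x head_size square anchored at (row, col), wrapping."""
    n = len(world)
    return [world[(row + i) % n][(col + j) % n]
            for i in range(head_size) for j in range(head_size)]

EXAMPLE_WORM = {
    "body_length": 8,   # kept for completeness; body movement is not modelled here
    "head_size": 4,
    "operations": [("Sum", "ahead"), ("Average", "right"), ("Min", "ahead")],
}

def run_example(world_size=8):
    world = [[r * world_size + c for c in range(world_size)] for r in range(world_size)]
    row, col, heading = 0, 0, (0, 1)     # start at the top-left corner, facing east
    results = []
    for op_name, direction in EXAMPLE_WORM["operations"]:
        values = head_values(world, row, col, EXAMPLE_WORM["head_size"])
        results.append((op_name, OPERATIONS[op_name](values)))
        heading = turn(heading, direction)
        row = (row + heading[0]) % world_size
        col = (col + heading[1]) % world_size
    return results

print(run_example())  # [('Sum', 216), ('Average', 14.5), ('Min', 9)]
```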

WormBench Input – Run Configuration: size of the BenchWorld; number of worms (number of threads); body length of each worm; head size of each worm; the number and type of worm operations that each worm performs while moving
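The inputs listed above could be captured in a small, validated record, as in the following Python sketch. The field names, the one-worm-per-thread mapping and the validation rules are assumptions, not WormBench's actual configuration format.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical run-configuration records mirroring the inputs on the slide.

@dataclass(frozen=True)
class WormConfig:
    body_length: int
    head_size: int
    operations: List[str]          # worm operations performed while moving

@dataclass(frozen=True)
class RunConfiguration:
    world_size: int                # BenchWorld is world_size x world_size
    worms: List[WormConfig]        # one worm per thread

    @property
    def num_threads(self) -> int:
        return len(self.worms)

    def validate(self) -> None:
        """Reject worms that cannot exist in this BenchWorld."""
        for w in self.worms:
            if w.body_length < 1 or w.head_size < 1:
                raise ValueError("body length and head size must be positive")
            if w.head_size > self.world_size:
                raise ValueError("head does not fit in the BenchWorld")

cfg = RunConfiguration(
    world_size=128,
    worms=[WormConfig(8, 4, ["Sum", "Average", "Min"]) for _ in range(4)],
)
cfg.validate()
print(cfg.num_threads)  # 4
```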

Instantiates Common Sync Scenarios (1/2): Object access serializability –guarding a shared variable with locks; Two-phase locking and its derivatives –a locking protocol that attempts non-blocking, fine-grain locking while avoiding deadlock; Multiple granularity locking –a fine-grain locking technique used to lock a region of a collection or hierarchical data structure
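The Python sketch below illustrates the deadlock-avoiding, fine-grain variant of two-phase locking described above. Every BenchWorld node gets its own lock.

- **Growing phase:** a worm acquires the locks of all nodes it touches with non-blocking try-locks.
- **Conflict handling:** if any lock is busy, the worm releases what it holds, backs off and retries. No thread ever waits while holding a lock.
- **Shrinking phase:** locks are released only after the update.

The node array, `try_lock_all`, `transfer` and the back-off policy are hypothetical.

```python
import random
import threading
import time

N = 16
node_values = [10] * N                       # hypothetical BenchWorld node values
node_locks = [threading.Lock() for _ in range(N)]

def try_lock_all(indices):
    """Growing phase: try-lock every node; on any failure release all and give up."""
    held = []
    for i in sorted(set(indices)):           # deterministic order; duplicates locked once
        if node_locks[i].acquire(blocking=False):
            held.append(i)
        else:
            for j in held:
                node_locks[j].release()
            return None
    return held

def transfer(src, dst, amount=1):
    """Move `amount` from node src to node dst while holding both node locks."""
    while True:
        held = try_lock_all([src, dst])
        if held is not None:
            break
        time.sleep(random.uniform(0, 0.0005))  # back off before retrying
    try:
        node_values[src] -= amount
        node_values[dst] += amount
    finally:
        for i in held:                        # shrinking phase: only releases from here on
            node_locks[i].release()

def worker(seed, moves):
    rng = random.Random(seed)
    for _ in range(moves):
        transfer(rng.randrange(N), rng.randrange(N))

threads = [threading.Thread(target=worker, args=(s, 2000)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(node_values))  # 160: transfers never create or destroy value
```

The sorted order is not what prevents deadlock here; the release-on-failure rule is. Sorting only makes acquisition deterministic.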

Instantiates Common Sync Scenarios (2/2): Dining Philosophers –deadlock scenario; Barrier synchronization –worms wait until the whole group (or all worms) reaches a certain point in execution
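The barrier scenario maps directly onto a standard barrier primitive. This Python sketch assumes each worm is a thread and a "phase" stands for some amount of worm work. `threading.Barrier` then guarantees that every phase-p entry is logged before any phase-(p+1) entry. The names `worm`, `PHASES` and `log` are hypothetical.

```python
import threading

NUM_WORMS = 4
PHASES = 3
barrier = threading.Barrier(NUM_WORMS)
log = []                       # (phase, worm_id) in the order work was finished
log_lock = threading.Lock()

def worm(worm_id):
    for phase in range(PHASES):
        with log_lock:         # stand-in for the worm's work in this phase
            log.append((phase, worm_id))
        barrier.wait()         # nobody starts phase p + 1 until all finished phase p

threads = [threading.Thread(target=worm, args=(i,)) for i in range(NUM_WORMS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

phases_in_order = [phase for phase, _ in log]
print(phases_in_order == sorted(phases_in_order))  # True
```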

Retry or Conditional Atomic: the use of retry or conditional atomic blocks is mostly neglected.
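A conditional atomic block (or an atomic block that calls retry) runs only once some condition on shared data holds; otherwise it waits until that data changes. With locks, the closest analogue is a condition variable, as in this minimal Python sketch. It assumes a single global lock, and `atomic_when`, `mailbox`, `consumer` and `producer` are hypothetical names. A TM implementation of retry typically re-executes the block only after a location it read has been updated.

```python
import threading

_cv = threading.Condition()   # one global lock plus condition variable (assumption)

def atomic_when(guard, body):
    """Run body() atomically once guard() holds; otherwise block and re-check."""
    with _cv:
        _cv.wait_for(guard)   # analogous to retry: wait until shared state changes
        result = body()
        _cv.notify_all()      # body may have changed state that other guards read
        return result

mailbox = []
received = []

def consumer():
    received.append(atomic_when(lambda: len(mailbox) > 0, lambda: mailbox.pop(0)))

def producer(msg):
    atomic_when(lambda: True, lambda: mailbox.append(msg))

c = threading.Thread(target=consumer)
c.start()                     # blocks until the mailbox is non-empty
producer("go")
c.join(timeout=5)
print(received)  # ['go']
```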

Currently Available Worm Operations (1/2): Read-only –Sum –Average –Min –Max –Median; I/O –Checkpoint –Undo

Currently Available Worm Operations (2/2): Read-dominated –Replace min with average –Replace max with average –Replace median with average –Replace min and max; Write-dominated –Sort –Transpose; Leave message (for complex synchronization scenarios) –Goto node message
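The operation classes above differ mainly in how many nodes they read and write. This Python sketch implements several of them over the node values under a worm's head. Those values are assumed to be a flat list, or a square list of lists for transpose. Checkpoint and undo are modelled as an in-memory save and restore, which is an assumption about what the slide's I/O operations do. All function names are hypothetical.

```python
import statistics

def read_only_ops(values):
    """Read-only operations: they only read the nodes under the head."""
    return {
        "Sum": sum(values),
        "Average": statistics.mean(values),
        "Min": min(values),
        "Max": max(values),
        "Median": statistics.median(values),
    }

def replace_min_with_average(values):
    """Read-dominated: reads every node, writes a single node."""
    avg = statistics.mean(values)
    values[values.index(min(values))] = avg

def replace_max_with_average(values):
    """Read-dominated: reads every node, writes a single node."""
    avg = statistics.mean(values)
    values[values.index(max(values))] = avg

def sort_op(values):
    """Write-dominated: may rewrite every node."""
    values.sort()

def transpose(square):
    """Write-dominated: transpose a head_size x head_size block in place."""
    n = len(square)
    for i in range(n):
        for j in range(i + 1, n):
            square[i][j], square[j][i] = square[j][i], square[i][j]

class Checkpoint:
    """Hypothetical in-memory checkpoint/undo: save a copy, later restore it."""

    def __init__(self):
        self._saved = None

    def checkpoint(self, values):
        self._saved = list(values)

    def undo(self, values):
        if self._saved is not None:
            values[:] = self._saved

vals = [4, 1, 7, 2]
print(read_only_ops(vals))  # {'Sum': 14, 'Average': 3.5, 'Min': 1, 'Max': 7, 'Median': 3.0}
```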

Worm Operations – Execution Distribution (Table: execution distribution of each worm operation under the worm configurations B[1.1] H[1.1], B[4.4] H[4.4], B[8.8] H[8.8] and B[1.8] H[1.8]; rows: Sum, Avg, Median, Min, Max, Rep Max with Avg, Rep Min with Avg, Rep Med with Avg, Rep Max and Min, Rep Med with Min, Rep Med and Max, Sort, Transpose, Checkpoint, Undo, Total)

Worm Operations – TM Characteristics (Table: reads (R) and writes (W) of each operation (Op) for body lengths 1, 2, 4 and 8, with the worm head size fixed to 1)


Worm Operations – TM Characteristics (Table: reads (R) and writes (W) of each operation (Op) for head sizes 1, 2, 4 and 8, with the body length fixed to 1)

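The R and W columns characterize how many nodes an operation reads and writes. One way to obtain such counts is to instrument node accesses. This Python sketch records the read set and write set of one operation as the head size grows. It assumes a square head of head_size x head_size nodes, and `TrackedNodes`, `tx_sum` and `tx_replace_min_with_average` are hypothetical. The counts it prints come from the sketch, not from the slides' tables.

```python
class TrackedNodes:
    """Wraps node values and records the read set and write set of one operation."""

    def __init__(self, values):
        self._values = values
        self.read_set = set()
        self.write_set = set()

    def read(self, i):
        self.read_set.add(i)
        return self._values[i]

    def write(self, i, value):
        self.write_set.add(i)
        self._values[i] = value

def tx_sum(nodes, indices):
    """Read-only: reads every node under the head."""
    return sum(nodes.read(i) for i in indices)

def tx_replace_min_with_average(nodes, indices):
    """Read-dominated: reads every node, writes the one holding the minimum."""
    seen = {i: nodes.read(i) for i in indices}
    average = sum(seen.values()) // len(seen)   # integer average (assumption)
    nodes.write(min(seen, key=seen.get), average)

def characterize(op, head_size):
    """Return (reads, writes) of op over a head_size x head_size head."""
    n = head_size * head_size
    nodes = TrackedNodes(list(range(n, 0, -1)))
    op(nodes, range(n))
    return len(nodes.read_set), len(nodes.write_set)

for h in (1, 2, 4, 8):
    print(h, characterize(tx_sum, h), characterize(tx_replace_min_with_average, h))
```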

Analyzing Sample Run Configurations: lock vs. transactions; change in BenchWorld size; change in worms’ body length and size; initialization of worms for a smaller BenchWorld

Lock vs Transactions

Throughput ~ Worms ~ BenchWorld: relationship between throughput, worm size and BenchWorld size

Initializing Worms for a Smaller BenchWorld: How is the conflict rate affected when worms are initialized for a smaller BenchWorld? Averaged results: worms initialized for a 128x128 BenchWorld and run in BenchWorlds of size 128x, 256x, 512x and 1024x

Modeling the Genome App. from STAMP: To obtain the results shown in Table IV we used the following run configuration: –worm body length = 1 –worm head size = 4 –BenchWorld of size 52x52 –worm operations: a randomly generated stream of worm operations, where the ratio between the worm operations was Operations (1:2:3:4:5:6:7:8:9:10:11:12:13:14:15) = Ratio (1:1:1:0:0:2:1:1:1:1:1:1:2:0:0)
(Table IV: for each number of threads (T#), the commit rate, reads per TX, writes per TX and speedup, each given for Genome (Gen.) and WormBench (WB))
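The operation stream in this configuration can be reproduced, at least in spirit, by weighted random sampling. This Python sketch draws operation ids 1 to 15 with `random.choices`, using the ratio above as weights. The sampling method, the stream length and the seed are assumptions, since the slide gives only the ratio. Operations with weight 0 never appear in the stream.

```python
import random
from collections import Counter

# Weights for worm operations 1..15, copied from the run configuration above.
RATIO = [1, 1, 1, 0, 0, 2, 1, 1, 1, 1, 1, 1, 2, 0, 0]

def generate_operation_stream(length, ratio=RATIO, seed=None):
    """Return `length` operation ids (1-based) sampled with the given weights."""
    rng = random.Random(seed)
    op_ids = list(range(1, len(ratio) + 1))
    return rng.choices(op_ids, weights=ratio, k=length)

stream = generate_operation_stream(14_000, seed=42)
counts = Counter(stream)
# Ids 4, 5, 14 and 15 never occur; ids 6 and 13 occur roughly twice as often as the rest.
print(sorted(counts.items()))
```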

Future Work: A toolset that automatically generates a run configuration representing a user-defined transactional and runtime behavior, e.g. –commit rate = 80% –reads per TX = 6 –writes per TX = 2 –runtime = 100 moves/ms; implement BenchWorld as –a linked list –a sparse matrix

Future Work: Understand how messaging works in BenchWorld; prepare a baseline set of run configurations to benchmark TM systems (HTM, STM and hybrid TMs); a fine-grain version using two-phase locking

Conclusion: WormBench is a highly configurable workload for TM; independent of TM design and implementation; critical sections defined by language-level atomic blocks; coarse-grain lock-based version; sanity check for the overall TM system; but still small, and does not exercise language extensions for TM and their semantics

The End