Presented by: Ofer Kiselov & Omer Kiselov Supervised by: Dmitri Perelman Final Presentation.

Overview
  * Recap of the midterm presentation:
      * Software Transactional Memory abstraction
      * STM implementation example: TL2 overview
      * Aborts in STM
      * Unnecessary aborts in STM
      * Project goal
      * Implementation overview
  * Online part: implementation
  * Online logging
  * Evaluation
  * Hardware
  * Deuce
  * Benchmarks
  * Results
  * Conclusion and analysis
  * Nice to have
  * Future work

Importance Of Parallel Programming
  * Frequency barrier: the performance of a single-core processor cannot keep improving.
  * The industry has switched to multi-cores.
  * Parallel programs make it possible to utilize multi-core processors.
  * Accessing shared data requires synchronization.

Transactional Memory: why?
  * Current synchronization: locks
      * Coarse-grained: limits parallelism
      * Fine-grained: high programming complexity
      * Error-prone (deadlocks / livelocks)
  * Transactional memory solution
      * Intuitive for a programmer
      * Provides a "transaction" abstraction for a critical section (operations executed atomically)
      * Implemented in both software and hardware

Why Do Aborts Happen?
  * T1, T2 and T3 read from O1 and write to O2.
  * T4 reads from O2 and writes to O1.
  * To maintain consistency, if T4 commits, T1, T2 and T3 must abort!
  [Figure: OBJECT1 and OBJECT2 accessed by transactions T1 to T4, each marked as aborted or committed.]
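
The scenario above can be sketched with a simplified, single-threaded model of TL2-style version validation. This is an illustration only, not the Deuce or TL2 code: the names `VersionedObject`, `Tl2Sketch`, `Tx` and `commitWrite` are hypothetical, and real TL2 also locks the write set and validates on every read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Shared object that remembers the clock value of its last committed write. */
class VersionedObject {
    volatile int version = 0;
    volatile int value = 0;
}

/** Simplified model of TL2-style validation (hypothetical names, no locking). */
class Tl2Sketch {
    /** Global version clock, advanced by every committing writer. */
    static final AtomicInteger clock = new AtomicInteger(0);

    static class Tx {
        /** Clock value sampled when the transaction starts. */
        final int readVersion = clock.get();
        final List<VersionedObject> readSet = new ArrayList<>();

        int read(VersionedObject o) {
            readSet.add(o);
            return o.value;
        }

        /** Tries to commit a single write; returns false if the transaction must abort. */
        boolean commitWrite(VersionedObject target, int newValue) {
            // Validate the read set: any object overwritten after this
            // transaction started makes its snapshot inconsistent.
            for (VersionedObject o : readSet) {
                if (o.version > readVersion) {
                    return false; // abort
                }
            }
            int writeVersion = clock.incrementAndGet();
            target.value = newValue;
            target.version = writeVersion;
            return true; // commit
        }
    }
}
```

If T1 reads O1, T4 reads O2, and T4 then commits a write to O1, T1's later attempt to commit a write to O2 fails validation because O1's version is newer than T1's start time.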

Unnecessary Aborts
  * Aborts are bad: work is lost, resources are wasted, throughput decreases.
  * Some aborts are necessary: continuing the run would violate correctness.
  * And some aborts are not.
      * Analyzing precisely whether the algorithm should abort is too expensive.
  * "Unnecessary" abort: one that could have been avoided, for example by keeping more versions or by checking transactional dependencies more precisely.
  [Figure: objects o1 and o2 with transactions T1, T2 and T3, marked committed (C) or aborted (A).]

Project Goals
  * Build a software analysis tool that:
      * measures abort statistics for a given run
      * evaluates how many of the aborts were unnecessary
      * evaluates the damage to performance
  * "Will it pay off to add designs to stop the unnecessary aborts?"

Project Formation
  * An offline part for analyzing the run:
      * reads the log of the run
      * gathers statistics
      * analyzes unnecessary aborts
  * An online part for logging the run:
      * is inserted into a specific STM algorithm
      * runs inside a benchmark
      * flushes the run information to an XML log file

Offline Part
  * Parser
      * Every log line represents a transactional action, represented by the LogLine abstract class.
      * Parser responsibility:
          * iterate over the XML
          * create the appropriate LogLine instances
      * LogLine factories for the different operation types:
          * transaction start
          * read operation
          * write operation
          * transaction commit
  * Analyzer
      * Gives basic statistics about the transactional run.
      * Counts aborts per reason.
      * Counts reads and writes.
      * Counts transactions.
      * Inserts the path into the Run Descriptor ADT.
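
The parser design can be sketched roughly as follows. The slides do not show the log schema, so the XML layout (`<line op="…" tx="…" obj="…"/>`), the subclass names and the `LogParser` class are all assumptions made for illustration; only the `LogLine` abstract class and the four operation types come from the slide.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** One transactional action taken from the log. */
abstract class LogLine {
    final int txId;
    LogLine(int txId) { this.txId = txId; }
    abstract String describe();
}

class StartLine extends LogLine {
    StartLine(int txId) { super(txId); }
    String describe() { return "T" + txId + " start"; }
}

class ReadLine extends LogLine {
    final String objectId;
    ReadLine(int txId, String objectId) { super(txId); this.objectId = objectId; }
    String describe() { return "T" + txId + " read " + objectId; }
}

class WriteLine extends LogLine {
    final String objectId;
    WriteLine(int txId, String objectId) { super(txId); this.objectId = objectId; }
    String describe() { return "T" + txId + " write " + objectId; }
}

class CommitLine extends LogLine {
    CommitLine(int txId) { super(txId); }
    String describe() { return "T" + txId + " commit"; }
}

/** Hypothetical parser: walks the XML and builds one LogLine per <line> element. */
class LogParser {
    /** Factory method: picks the LogLine subclass from the (assumed) "op" attribute. */
    static LogLine create(Element e) {
        int tx = Integer.parseInt(e.getAttribute("tx"));
        String obj = e.getAttribute("obj");
        switch (e.getAttribute("op")) {
            case "start":  return new StartLine(tx);
            case "read":   return new ReadLine(tx, obj);
            case "write":  return new WriteLine(tx, obj);
            case "commit": return new CommitLine(tx);
            default: throw new IllegalArgumentException("unknown op: " + e.getAttribute("op"));
        }
    }

    static List<LogLine> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("line"); // document order
            List<LogLine> lines = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                lines.add(create((Element) nodes.item(i)));
            }
            return lines;
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("could not parse log", ex);
        }
    }
}
```

An analyzer can then count transactions, reads and writes by iterating over the returned list and checking the concrete type of each entry.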

Transactional Dependencies
  * The Run Descriptor is a precedence graph!

RUN DESCRIPTOR
  * To create the graph, we needed to establish a way to turn the basic run into a graph.
  [Figure: transactions T1 and T4 as readers and writers of versions of OBJECT1 and OBJECT2, connected by dependency edges such as WaR (write after read).]

ABORTS ANALYZER
  * Searches for unnecessary aborts in the RUN DESCRIPTOR.
  * Speculatively adds the edges of the aborted transaction to the RUN DESCRIPTOR.
  * Uses DFS to find cycles in the precedence graph.
      * A cycle means the abort was necessary.
  * Removes the speculative edges at the end of the analysis.
  * Built with the visitor pattern.
      * Flexible enough for more complex analyses.
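
The analysis steps above can be sketched as a small graph class. This is a minimal illustration under stated assumptions, not the project's code: `Edge`, `PrecedenceGraph` and `abortWasNecessary` are hypothetical names, transactions are identified by plain strings, and the visitor structure is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Directed dependency "from must precede to" between two transactions. */
class Edge {
    final String from;
    final String to;
    Edge(String from, String to) { this.from = from; this.to = to; }
}

/** Hypothetical precedence graph over transaction ids. */
class PrecedenceGraph {
    private final Map<String, Set<String>> successors = new HashMap<>();

    /** Adds from -> to; returns true if the edge was not already present. */
    boolean addEdge(String from, String to) {
        return successors.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    void removeEdge(String from, String to) {
        Set<String> out = successors.get(from);
        if (out != null) {
            out.remove(to);
        }
    }

    /** Depth-first search for a cycle anywhere in the graph. */
    boolean hasCycle() {
        Set<String> visited = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String node : successors.keySet()) {
            if (dfs(node, visited, onPath)) {
                return true;
            }
        }
        return false;
    }

    private boolean dfs(String node, Set<String> visited, Set<String> onPath) {
        if (onPath.contains(node)) {
            return true;               // back edge: cycle found
        }
        if (!visited.add(node)) {
            return false;              // already fully explored
        }
        onPath.add(node);
        for (String next : successors.getOrDefault(node, Collections.<String>emptySet())) {
            if (dfs(next, visited, onPath)) {
                return true;
            }
        }
        onPath.remove(node);
        return false;
    }

    /**
     * Speculatively adds the aborted transaction's edges, checks for a cycle,
     * then removes only the edges it added so the graph is left unchanged.
     */
    boolean abortWasNecessary(List<Edge> abortedTxEdges) {
        List<Edge> added = new ArrayList<>();
        for (Edge e : abortedTxEdges) {
            if (addEdge(e.from, e.to)) {
                added.add(e);
            }
        }
        boolean cycle = hasCycle();
        for (Edge e : added) {
            removeEdge(e.from, e.to);
        }
        return cycle;
    }
}
```

For example, if an aborted T1 would have to both precede and follow a committed T4, the speculative edges T1 -> T4 and T4 -> T1 close a cycle and the abort counts as necessary; with only T2 -> T4 there is no cycle and the abort counts as unnecessary.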

Online part
  * Our goals:
      * Run benchmarks to prepare the statistics for the offline part.
      * Make sure that the measurements don't distort the scheduling picture.

Platform Supporting STM
  * Introducing Deuce STM, an open-source Java STM environment.
  * Created by: Guy Korland, Nir Shavit, Pascal Felber, Igor Berman.
  * With Deuce STM, a plain method public void doThing() {…} is not atomic, while the same method marked with Deuce's @Atomic annotation is.
  * Source code (excerpt):

      final public class Context implements org.deuce.transaction.Context {
          private static String objectId(Object reference, long field) {
              return Long.toString(System.identityHashCode(reference) + field);
          }
          final static AtomicInteger clock = new AtomicInteger(0);

How To Utilize Deuce for Logging
  * Modified the code to call the logging utilities.
  * Added more exception types to distinguish between the different abort types.
  [Figure labels: Logger; Deuce Framework; TL2 Algorithm; Transactions Code (Start, Read, Write, Commit); A Perfectly Scalable Code.]

Online Part Implementation: Version 1
  * Main problem: adding to the priority queue damages parallelism and lowers performance.

Online Part Implementation: Version 2
  * The Back End Collector
  * The threads don't do any extra actions to log the run.
  [Figure: three numbered steps, with the labels "The log lines have ended" and "The program has ended".]
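
The slides do not show how the back-end collector works, so the following is only one possible reading, sketched under explicit assumptions: each worker thread appends log events to its own buffer without touching a shared queue, and a collector merges the buffers once the workers have finished. `LogEvent`, `BufferedRunLogger` and the shared sequence counter used for ordering are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

/** One logged action with a global sequence number used only for ordering. */
class LogEvent {
    final long seq;
    final String text;
    LogEvent(long seq, String text) { this.seq = seq; this.text = text; }
}

/** Hypothetical logger: per-thread buffers, merged by a collector at the end. */
class BufferedRunLogger {
    private final AtomicLong sequence = new AtomicLong();
    // Registry of all per-thread buffers, touched once per thread.
    private final ConcurrentLinkedQueue<List<LogEvent>> buffers = new ConcurrentLinkedQueue<>();
    private final ThreadLocal<List<LogEvent>> local = ThreadLocal.withInitial(() -> {
        List<LogEvent> buffer = new ArrayList<>();
        buffers.add(buffer);
        return buffer;
    });

    /** Called by worker threads; appends to the caller's private buffer. */
    void log(String text) {
        local.get().add(new LogEvent(sequence.getAndIncrement(), text));
    }

    /** Called by the collector after all worker threads have been joined. */
    List<LogEvent> collect() {
        List<LogEvent> all = new ArrayList<>();
        for (List<LogEvent> buffer : buffers) {
            all.addAll(buffer);
        }
        all.sort(Comparator.comparingLong((LogEvent e) -> e.seq));
        return all;
    }
}
```

Compared with a shared priority queue, the sorting cost moves out of the transactional threads and into the collector, which runs only after the workload has ended.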

What Do We Check?
  * Commit rate
  * Unnecessary aborts (classified by type)
  * Wasted work
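
The slides do not give formulas for these metrics. The sketch below shows one plausible set of definitions, all of which are assumptions: commit ratio as commits over all transaction attempts, unnecessary aborts as a percentage of all aborts, and wasted reads as the percentage of reads performed by attempts that later aborted. The class name `RunStatistics` is hypothetical.

```java
/** Hypothetical summary of one logged run; all definitions are assumptions. */
class RunStatistics {
    final long commits;
    final long aborts;
    final long unnecessaryAborts;
    final long totalReads;
    final long wastedReads; // reads done by attempts that ended in an abort

    RunStatistics(long commits, long aborts, long unnecessaryAborts,
                  long totalReads, long wastedReads) {
        this.commits = commits;
        this.aborts = aborts;
        this.unnecessaryAborts = unnecessaryAborts;
        this.totalReads = totalReads;
        this.wastedReads = wastedReads;
    }

    /** Fraction of transaction attempts that committed. */
    double commitRatio() {
        return ratio(commits, commits + aborts);
    }

    /** Share of aborts that the analysis classified as unnecessary, in percent. */
    double unnecessaryAbortPercentage() {
        return 100.0 * ratio(unnecessaryAborts, aborts);
    }

    /** Share of read operations that belonged to aborted attempts, in percent. */
    double wastedReadPercentage() {
        return 100.0 * ratio(wastedReads, totalReads);
    }

    private static double ratio(long part, long whole) {
        return whole == 0 ? 0.0 : (double) part / whole;
    }
}
```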

Testbenches
  * SSCA2: short transactions, low contention, high memory utilization.
  * Vacation: high contention, medium-length transactions, mostly reads.
  * AVL tree: customizable contention, medium-length transactions.
      * Each operation is a random choice between adding, removing or searching for a random integer in the tree.
      * The integer range can be changed to customize contention.
      * Created by us.

Hardware
  * Benchmarks ran on Trinity:
      * 8 quad-cores
      * 132 GB RAM
  * The machine was idle apart from our runs.

Simulation Results – AVL tree
  [Figures: Commit Ratio; Percentage of Unnecessary Aborts; Amount of Aborts & Unnecessary Aborts; Percentage of Wasted Reads. All graphs are a function of the number of threads.]

Simulation Results – SSCA2
  [Figures: Commit Ratio; Percentage of Unnecessary Aborts; Amount of Aborts & Unnecessary Aborts; Percentage of Wasted Reads. All graphs are a function of the number of threads.]

Simulation Results – Vacation
  [Figures: Commit Ratio; Percentage of Unnecessary Aborts; Amount of Aborts & Unnecessary Aborts; Percentage of Wasted Reads. All graphs are a function of the number of threads.]

Simulation Results – AVL tree
  [Figures without captions. All graphs are a function of the number of threads.]

Simulation Results – SSCA2
  [Figures without captions. All graphs are a function of the number of threads.]

Simulation Results – Vacation
  [Figures without captions. All graphs are a function of the number of threads.]

Simulation Results – AVL tree
  [Figures: Percentage of Aborts by type; Amount of Aborts by type. All graphs are a function of the number of threads.]

Simulation Results – SSCA2
  [Figures: Percentage of Aborts by type; Amount of Aborts by type. All graphs are a function of the number of threads.]

Simulation Results – Vacation
  [Figures: Percentage of Aborts by type; Amount of Aborts by type. All graphs are a function of the number of threads.]

Logger impact on performance
  * Logger access obviously demands more from the Deuce framework:
      * more memory accesses
      * more exception types
      * on every read and write
  * How much distortion does the logger cause?
  [Figure: AVL test with logging, commit ratio.]

Conclusions
  * Parallelism increases → the abort rate, the unnecessary abort rate and the wasted work rate increase as well.
  * Parallelism increases → more aborts are caused by locked objects.
  * To improve STM performance over highly parallel workloads, algorithms may be improved to prevent unnecessary aborts.

Nice To Have
  * Automatically exporting the precedence graph as a Microsoft Visio drawing.
  * The ability to analyze according to abort types.
  * A GUI.
  * Expanding the simulation to more algorithms and test benches, which would make it possible to compare performance between algorithms.

Future Work
  * The abort rate drops beyond 128 threads due to a drop in concurrency; further analysis is required.
  * Unfit versions cause a lot of aborts.
      * The new SMV algorithm may solve this problem.

BIBLIOGRAPHY
  * I. Keidar and D. Perelman. On avoiding spare aborts in transactional memory. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, pages 59–68.
  * I. Keidar and D. Perelman. SMV: Selective Multi-Versioning STM.
  * D. Dice, O. Shalev, and N. Shavit. Transactional locking II. In Proceedings of the 20th International Symposium on Distributed Computing, pages 194–208.
  * M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer, III. Software transactional memory for dynamic-sized data structures. In Proceedings of the twenty-second annual symposium on Principles of distributed computing, pages 92–101, 2003.

QUESTIONS?