Behavior of Synchronization Methods in Commonly Used Languages and Systems Yiannis Nikolakopoulos Joint work with: D. Cederman, B.

Presentation transcript:

Behavior of Synchronization Methods in Commonly Used Languages and Systems Yiannis Nikolakopoulos Joint work with: D. Cederman, B. Chatterjee, N. Nguyen, M. Papatriantafilou, P. Tsigas Distributed Computing and Systems Chalmers University of Technology Gothenburg, Sweden

Developing a multithreaded application…
- The boss wants .NET
- The client wants speed… (C++?)
- Java is nice
- Multicores everywhere

Developing a multithreaded application…
- The worker threads need to access data: Concurrent Data Structures
- Then we need Synchronization.

Implementing Concurrent Data Structures
Implementation: Coarse Grain Locking, Fine Grain Locking, Test And Set, Array Locks, and more!
Performance Bottleneck

Implementing Concurrent Data Structures
Implementation: Coarse Grain Locking, Fine Grain Locking, Test And Set, Array Locks, Lock Free, and more!
Hardware platform
Which is the fastest/most scalable?
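To make two of the listed options concrete, here is a minimal C++ sketch of a test-and-set (TAS) lock and a test-and-test-and-set (TTAS) lock. It is an illustration under assumptions, not the implementation used in the study: the class names and the `count_with_lock` helper are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Test-and-set (TAS) lock: every acquisition attempt is an atomic write.
class TasLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin: each failed exchange still writes to the shared cache line
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

// Test-and-test-and-set (TTAS) lock: spin on a plain read first and only
// attempt the atomic exchange once the lock looks free.
class TtasLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        for (;;) {
            while (locked.load(std::memory_order_relaxed)) {
                // spin on the locally cached value
            }
            if (!locked.exchange(true, std::memory_order_acquire)) return;
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

// Hypothetical helper: `threads` workers each increment a shared counter
// `iters` times under the given lock type; returns the final count.
template <typename Lock>
long count_with_lock(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    Lock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // protected by the lock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Both locks give mutual exclusion; the difference is only in how waiting threads hit the shared cache line, which is one reason their behavior can diverge under high contention.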

Implementing concurrent data structures

Problem Statement
How does the interplay of the above parameters and the different synchronization methods affect the performance and behavior of concurrent data structures?

Outline
- Introduction
- Experiment Setup
- Highlights of Study and Results
- Conclusion

Which data structures to study?
Represent different levels of contention:
- Queue: 1 or 2 contention points
- Hash table: multiple contention points
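As an illustration of a queue with two contention points, the sketch below uses one lock for the head and one for the tail, in the style of the classic two-lock queue. This is an assumed example, not necessarily one of the queues measured in the study; `TwoLockQueue` and `two_lock_queue_demo` are hypothetical names, and `T` is assumed to be default-constructible for the dummy node.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <utility>

// FIFO queue with two contention points: enqueuers contend on tail_lock_,
// dequeuers contend on head_lock_. A dummy node keeps the two ends apart.
template <typename T>
class TwoLockQueue {
    struct Node {
        T value;
        // Atomic because, when the queue is empty, an enqueuer and a
        // dequeuer touch the same node's link under different locks.
        std::atomic<Node*> next{nullptr};
        Node() : value() {}
        explicit Node(T v) : value(std::move(v)) {}
    };
    Node* head_;  // dummy node, guarded by head_lock_
    Node* tail_;  // last node, guarded by tail_lock_
    std::mutex head_lock_;
    std::mutex tail_lock_;
public:
    TwoLockQueue() : head_(new Node), tail_(head_) {}
    TwoLockQueue(const TwoLockQueue&) = delete;
    TwoLockQueue& operator=(const TwoLockQueue&) = delete;
    ~TwoLockQueue() {
        while (head_ != nullptr) {
            Node* n = head_->next.load();
            delete head_;
            head_ = n;
        }
    }
    void enqueue(T v) {
        Node* n = new Node(std::move(v));
        std::lock_guard<std::mutex> g(tail_lock_);
        tail_->next.store(n, std::memory_order_release);
        tail_ = n;
    }
    // Returns false if the queue is empty; otherwise moves the front into out.
    bool dequeue(T& out) {
        std::lock_guard<std::mutex> g(head_lock_);
        Node* first = head_->next.load(std::memory_order_acquire);
        if (first == nullptr) return false;
        out = std::move(first->value);
        delete head_;   // retire the old dummy
        head_ = first;  // the dequeued node becomes the new dummy
        return true;
    }
};

// Single-threaded sanity check of FIFO order (illustrative only).
inline void two_lock_queue_demo() {
    TwoLockQueue<int> q;
    int v = 0;
    assert(!q.dequeue(v));
    q.enqueue(1);
    q.enqueue(2);
    assert(q.dequeue(v) && v == 1);
    assert(q.dequeue(v) && v == 2);
    assert(!q.dequeue(v));
}
```

Guarding both ends with a single mutex instead would collapse this to one contention point, which is the other case the slide mentions.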

How do we choose an implementation?
Possible criteria:
- Framework dependencies
- Programmability
- “Good” performance

Interpreting “good”
Throughput: the more operations completed per time unit, the better.
Is this enough?

Non-fairness

What to measure?
[Formula not recovered from the slide. Its labeled terms: operations by thread i; average operations per thread.]
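The slide's formula is not available in this transcript, so the following is only a sketch of one plausible way to combine the two labeled terms. It assumes a fairness value in [0, 1] computed as the smaller of (minimum per-thread operations / average) and (average / maximum per-thread operations), where 1 means every thread completed the same number of operations. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Throughput: operations completed by all threads per millisecond.
double throughput_per_ms(const std::vector<long>& ops_per_thread,
                         double interval_ms) {
    assert(interval_ms > 0.0);
    long total = std::accumulate(ops_per_thread.begin(),
                                 ops_per_thread.end(), 0L);
    return total / interval_ms;
}

// ASSUMED fairness metric (not taken from the slide):
//   min( min_i(n_i) / avg , avg / max_i(n_i) )
// where n_i is the number of operations completed by thread i in the
// measured interval. Counts are assumed to be non-negative.
double fairness(const std::vector<long>& ops_per_thread) {
    assert(!ops_per_thread.empty());
    long total = std::accumulate(ops_per_thread.begin(),
                                 ops_per_thread.end(), 0L);
    if (total == 0) return 1.0;  // nobody progressed: treated as even
    double avg = static_cast<double>(total) / ops_per_thread.size();
    long mn = *std::min_element(ops_per_thread.begin(), ops_per_thread.end());
    long mx = *std::max_element(ops_per_thread.begin(), ops_per_thread.end());
    return std::min(mn / avg, avg / mx);
}
```

Under this assumed definition, two threads completing 50 and 150 operations score 0.5, and a thread that is starved completely drives the value to 0, even if total throughput is unchanged.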

Implementation Parameters
Programming Environments: C++, Java, C# (.NET, Mono)
Synchronization Methods:
- TAS, TTAS, Lock-free, Array lock
- PMutex, Lock-free memory management
- Reentrant, synchronized
- lock construct, Mutex
NUMA Architectures:
- Intel Nehalem, 2 x 6 core (24 HW threads)
- AMD Bulldozer, 4 x 12 core (48 HW threads)
Do they influence fairness?
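Of the methods listed, the array lock is the one whose design bears most directly on fairness, since it grants the lock in arrival order. The sketch below is a generic array-based queue lock written for illustration, not the study's code. `ArrayLock`, `kSlots`, and `count_with_array_lock` are hypothetical names, and the sketch assumes that no more than `kSlots` threads ever contend at once.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Array-based queue lock: each waiter spins on its own slot, and the lock
// is handed over in ticket (FIFO) order. Default sequentially consistent
// atomics are used throughout for simplicity.
class ArrayLock {
public:
    // Assumption: at most kSlots threads contend at the same time.
    static constexpr std::size_t kSlots = 64;  // power of two

    ArrayLock() { slots_[0].go.store(true); }  // first ticket may enter

    // Takes the next ticket and spins until that slot is granted.
    // Returns the slot index, which must be passed to unlock().
    std::size_t lock() {
        std::size_t my = next_.fetch_add(1) % kSlots;
        while (!slots_[my].go.load()) {
            // spin on this thread's own cache line only
        }
        slots_[my].go.store(false);  // reset the slot for later reuse
        return my;
    }

    // Hands the lock to the holder of the following slot.
    void unlock(std::size_t my) {
        slots_[(my + 1) % kSlots].go.store(true);
    }

private:
    struct alignas(64) Slot {  // one slot per (assumed 64-byte) cache line
        std::atomic<bool> go{false};
    };
    Slot slots_[kSlots];
    std::atomic<std::size_t> next_{0};
};

// Hypothetical helper: counts under the array lock with several threads.
long count_with_array_lock(int threads, int iters) {
    assert(threads > 0 && threads <= static_cast<int>(ArrayLock::kSlots));
    ArrayLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                std::size_t slot = lock.lock();
                ++counter;  // protected by the lock
                lock.unlock(slot);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Because waiters are served strictly in ticket order, a lock of this kind trades some raw throughput for predictable per-thread progress.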

Experiment Parameters
- Different levels of contention
- Number of threads
- Measured time intervals

Outline
- Introduction
- Experiment Setup
- Highlights of Study and Results
  - Queue
    - Fairness
    - Intel vs AMD
    - Throughput vs Fairness
  - Hash Table
    - Intel vs AMD
    - Scalability
- Conclusion

Observations: Queue
Fairness can change along different time intervals.
24 Threads, High contention

Observations: Queue Fairness
Significantly different fairness behavior in different architectures.
24 Threads, High contention

Observations: Queue Fairness
Significantly different fairness behavior in different architectures.
24 Threads, High contention
Lock-free is less affected in this case.

Queue: Throughput vs Fairness
[Figure: two C++ panels comparing TTAS, Lock-free, and PMutex. Left: fairness (0.6 s interval, Intel) vs. number of threads. Right: throughput in operations per ms (thousands) vs. number of threads.]

Observations: Hash table
- Operations are distributed in different buckets
- Things get interesting when #threads > #buckets
- Tradeoff between throughput and fairness
  - Different winners and losers
  - Contention is lowered in the linked list components
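The bucket structure described above can be sketched as a hash set with one lock and one linked list per bucket. This is an assumed, simplified design for illustration; `BucketLockedSet` is a hypothetical name and the study's hash tables may be organized differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <vector>

// Hash set of ints with a fixed number of buckets. Each bucket has its
// own mutex, so operations on different buckets do not contend.
class BucketLockedSet {
    struct Bucket {
        std::mutex lock;
        std::list<int> items;  // the bucket's linked list
    };
    std::vector<Bucket> buckets_;

    Bucket& bucket_for(int key) {
        return buckets_[std::hash<int>{}(key) % buckets_.size()];
    }
public:
    explicit BucketLockedSet(std::size_t bucket_count)
        : buckets_(bucket_count) {
        assert(bucket_count > 0);
    }
    // Returns true if the key was newly inserted.
    bool insert(int key) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> g(b.lock);
        if (std::find(b.items.begin(), b.items.end(), key) != b.items.end())
            return false;
        b.items.push_back(key);
        return true;
    }
    bool contains(int key) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> g(b.lock);
        return std::find(b.items.begin(), b.items.end(), key) != b.items.end();
    }
    // Returns true if the key was present and has been removed.
    bool erase(int key) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> g(b.lock);
        auto it = std::find(b.items.begin(), b.items.end(), key);
        if (it == b.items.end()) return false;
        b.items.erase(it);
        return true;
    }
};
```

With more threads than buckets, at least two threads must map to the same bucket at any moment they are all active, so per-bucket contention cannot be avoided and the choice of lock per bucket starts to matter.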

Observations: Hash table
Fairness differences in Hash table across architectures.
24 Threads, High contention

Observations: Hash table
Fairness differences in Hash table across architectures.
24 Threads, High contention
Lock-free is again not affected.

In C++, custom memory management and lock-free implementations excel in scalability and performance.

Conclusion
- Complex synchronization mechanisms (PMutex, Reentrant lock) pay off in heavily contended hot spots
- Scalability via more complex, inherently parallel designs and implementations
- Tradeoff between throughput and fairness
  - LF Hash table
  - Reentrant lock vs Array Lock vs LF Queue
- Fairness can be heavily influenced by HW
  - Interesting exceptions
Which is the fastest/most scalable? Is fairness influenced by NUMA?