SALSA: Scalable and Low-synchronization NUMA-aware Algorithm for Producer-Consumer Pools
Elad Gidron, Idit Keidar, Dmitri Perelman, Yonathan Perez

New Architectures – New Software Development Challenges
- Increasing number of computing elements: need scalability.
- Memory latency is more pronounced: need cache-friendliness.
- Asymmetric memory access in NUMA multi-CPU systems: need local memory accesses to reduce contention.
- Large systems are less predictable: need robustness to unexpected thread stalls and load fluctuations.
(Speaker note: explain NUMA; it matters here because of bandwidth rather than latency.)

Producer-Consumer Task Pools
A ubiquitous programming pattern in parallel programs: producers insert tasks into a shared pool with Put(Task), and consumers retrieve them with Get().
[Figure: producers call Put(Task) on a task pool; consumers call Get().]
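As a point of reference, here is a minimal sketch of such a pool interface in Java; the type and method names (TaskPool, put, get) are illustrative and not taken from the paper.

// Minimal producer-consumer pool interface (illustrative names).
interface TaskPool<T> {
    void put(T task);  // called by producers to insert a task
    T get();           // called by consumers; returns null if no task is available
}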

Typical Implementations (1 of 2): FIFO Queue
[Figure: producers enqueue tasks Tn .. T2 T1 into a single queue; Consumer 1 performs Get(T1) and Exec(T1) while Consumer 2 performs Get(T2) and Exec(T2).]
- Inherently not scalable due to contention.
- Note that FIFO order applies to task retrieval, not execution: tasks retrieved in order may still complete out of order.

Typical Implementations (2 of 2): Multiple Queues with Work-Stealing
[Figure: each consumer has its own queue; producers insert into the queues and idle consumers steal from others.]
- Consumers always pay the overhead of synchronizing with potential stealers.
- Load balancing is not trivial.

And Now, to Our Approach
- Single-consumer pools as the building block
- A framework for multiple pools with stealing
- SALSA: a novel single-consumer pool
- Evaluation

Building Block: Single-Consumer Pool (SCPool)
[Figure: producers call Produce() on the SCPool, the owner consumer calls Consume(), and other consumers call Steal().]
Possible implementations: FIFO queues, or SALSA (coming soon).
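A rough Java sketch of the SCPool abstraction, based only on the operations named on this slide; the exact signatures are assumptions.

// Single-consumer pool: many producers, one owner consumer, other consumers may steal.
interface SCPool<T> {
    boolean produce(T task);    // producer inserts a task; may fail if this pool has no room
    T consume();                // called only by the owner consumer
    T steal(SCPool<T> victim);  // called by a non-owner: move work from 'victim' into this pool and return a task
}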

System Overview
[Figure: each consumer owns an SCPool; producers call Produce() on SCPools, each consumer calls Consume() on its own pool and Steal() on other pools.]

Our Management Policy
NUMA-aware: producers and consumers are paired.
- A producer tries to insert into the closest SCPool.
- A consumer tries to steal from the closest SCPool.
[Figure: two NUMA nodes, CPU1/Memory 1 and CPU2/Memory 2, connected by an interconnect; cons1, cons2, prod1, prod2 and SCPools 1-2 reside on node 1, cons3, cons4, prod3, prod4 and SCPools 3-4 on node 2. Example access lists: prod2: cons2, cons1, cons3, cons4; cons4: cons3, cons1, cons2.]
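A sketch of how a consumer's retrieval could follow this policy, building on the SCPool interface above and assuming each consumer keeps an access list of the other pools sorted by NUMA distance (all names here are illustrative):

import java.util.List;

// Consumer retrieval under the NUMA-aware policy.
class GetPolicy {
    static <T> T get(SCPool<T> myPool, List<SCPool<T>> accessList) {
        T task = myPool.consume();             // first try this consumer's own pool
        if (task != null) return task;
        for (SCPool<T> victim : accessList) {  // then steal, closest pool first
            task = myPool.steal(victim);
            if (task != null) return task;
        }
        return null;  // the real algorithm re-checks pools before returning null (see "Getting It Right")
    }
}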

SCPool Implementation Goals
- Goal 1: Fast path. Use synchronization and atomic operations only when stealing.
- Goal 2: Minimize stealing.
- Goal 3: Locality. Cache friendliness and low contention.
- Goal 4: Load balancing. Robustness to stalls and load fluctuations.

SALSA: Scalable and Low-Synchronization Algorithm
An SCPool implementation:
- Synchronization-free when no stealing occurs: low contention.
- Tasks are held in page-size chunks: cache-friendly.
- Consumers steal entire chunks of tasks: reduces the number of steals.
- Producer-based load balancing: robust to stalls and load fluctuations.

SALSA Overview
- Tasks are kept in chunks, organized in per-producer chunk lists, so there is no contention among producers.
- Each chunk is owned by a single consumer, which is the only one taking tasks from it; if another consumer wants a task from the chunk, it must first steal the chunk.
[Figure: a consumer's chunkLists array with one list per producer (prod0 ... prodn-1) plus a steal list; list nodes carry idx values such as 2, -1, 4, 0; a chunk with owner=c1 holds entries marked TAKEN, task, or ┴.]
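A rough Java sketch of the data layout described here; the field names, the TAKEN sentinel, and the use of atomic classes are assumptions made for illustration (the paper's code avoids some of the fences these classes imply).

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

// A chunk of task slots owned by a single consumer.
class Chunk {
    static final Object TAKEN = new Object();  // marks entries whose task has already been consumed
    static final int CHUNK_SIZE = 1000;        // roughly a page worth of entries (see the "Chunk size" slide)

    final AtomicInteger owner;                 // id of the consumer currently owning this chunk (CASed on steal)
    final AtomicReferenceArray<Object> tasks = new AtomicReferenceArray<>(CHUNK_SIZE); // null = empty slot

    Chunk(int ownerId) { owner = new AtomicInteger(ownerId); }
}

// A node in a per-producer chunk list.
class Node {
    volatile Chunk chunk;    // set to null once the chunk is exhausted or stolen away
    volatile int idx = -1;   // index of the last taken task; updated by the owner consumer only
    Node next;
}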

SALSA Fast Path (No Stealing)
- Producer: put the new value into the next free chunk entry and increment its local index.
- Consumer: increment idx, verify ownership of the chunk, then change the chunk entry to TAKEN.
No strong atomic operations; cache-friendly; extremely lightweight.
[Figure: a chunk with owner=c1; the producer's local index points past the last produced task while the consumer's idx advances from 0 to 1 over entries marked Task, TAKEN, or ┴.]
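A sketch of both fast paths on top of the Chunk and Node classes above, including the slow branch the consumer takes when it discovers its chunk has been stolen (covered on the later stealing slides). This is a simplification: the real code is careful about memory ordering, which this sketch does not model.

// Producer fast path: a plain write plus a producer-local index, no CAS.
class Producer {
    Chunk current;    // chunk this producer is currently filling
    int prodIdx = 0;  // next free slot in 'current'; touched only by this producer

    void produce(Object task) {
        // Obtaining a fresh chunk when 'current' is full is shown on the "Chunk Lists" slide.
        current.tasks.set(prodIdx, task);
        prodIdx++;
    }
}

// Consumer fast path: take the next task from a node of a chunk list it owns.
class Consumer {
    final int id;
    Consumer(int id) { this.id = id; }

    Object takeTask(Node node) {
        Chunk chunk = node.chunk;
        if (chunk == null) return null;                        // chunk already gone
        Object task = chunk.tasks.get(node.idx + 1);
        if (task == null || task == Chunk.TAKEN) return null;  // nothing new to take
        node.idx++;                                            // announce progress first
        if (chunk.owner.get() == id) {
            chunk.tasks.set(node.idx, Chunk.TAKEN);            // fast path: still the owner, no CAS
            return task;
        }
        // Chunk was stolen: take one last task with a CAS and leave the chunk.
        boolean won = chunk.tasks.compareAndSet(node.idx, task, Chunk.TAKEN);
        node.chunk = null;
        return won ? task : null;
    }
}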

Chunk Stealing
- Steal a whole chunk of tasks: reduces the number of steal operations.
- The stealing consumer changes the chunk's owner field.
- When a consumer sees that its chunk has been stolen, it takes one task using CAS and leaves the chunk.

Stealing
Stealing is complicated: there are data races and liveness issues, and the fast path uses no memory fences. See details in the paper.

Chunk Pools & Load Balancing
Where do chunks come from? Each consumer keeps a pool of free chunks, and stolen chunks move to their stealers. If a consumer's chunk pool is empty, producers go elsewhere; the same happens for slow consumers.
- Slow consumer: small chunk pool, its chunks get stolen, producers avoid inserting tasks into its SCPool.
- Fast consumer: large chunk pool, it steals chunks, producers can insert tasks into its SCPool.
The result is automatic load balancing.
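A sketch of the producer-side placement this enables, using the Chunk class from the SALSA overview: the producer asks the closest consumer's free-chunk pool for a chunk and moves on to the next consumer if that pool is empty. The freeChunkPoolsByDistance structure and the poll-based chunk pool are assumptions for illustration.

import java.util.List;
import java.util.Queue;

// Producer-based balancing: slow consumers run out of free chunks, so producers stop feeding them.
class Placement {
    static Chunk pickChunk(List<Queue<Chunk>> freeChunkPoolsByDistance) {
        for (Queue<Chunk> freeChunks : freeChunkPoolsByDistance) {  // closest consumer first
            Chunk c = freeChunks.poll();   // try to take a free chunk from this consumer's pool
            if (c != null) return c;       // success: future tasks go to this consumer's SCPool
        }
        return null;                       // every pool is exhausted; the caller waits or retries
    }
}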

Getting It Right
- Liveness: operations are lock-free; we ensure progress whenever operations fail due to steals.
- Safety: linearizability mandates that Get() return Null only if all pools were simultaneously empty at some point during the operation. Tricky.

Evaluation: Compared Algorithms
- SALSA
- SALSA+CAS: every consume operation uses a CAS (no fast-path optimization)
- ConcBag: the Concurrent Bags algorithm [Sundell et al. 2011]; uses per-producer chunk lists, but requires a CAS on every consume and steals at the granularity of a single task
- WS-MSQ: work-stealing based on Michael-Scott queues [Michael and Scott 1996]
- WS-LIFO: work-stealing based on Michael's LIFO stacks [Michael 2004]
(A plain Michael-Scott queue is not included because it is not scalable at all.)

System Throughput: Balanced Workload (N Producers, N Consumers)
[Figure: throughput as a function of the number of threads.]
- Linearly scalable.
- x5 faster than the state-of-the-art Concurrent Bags.
- x20 faster than work-stealing with Michael-Scott queues.

Highly Contended Workloads: 1 Producer, N Consumers
[Figure: throughput as a function of the number of consumers.]
- SALSA sustains its throughput thanks to effective load balancing.
- The other algorithms suffer throughput degradation: a CAS per task retrieval and high contention among stealers.

Producer-Based Balancing in a Highly Contended Workload
[Figure: throughput with and without producer-based balancing.]
50% faster with balancing.

NUMA Effects
[Figure: throughput under different thread and memory placements.]
- Affinity hardly matters as long as you are cache-effective.
- Memory allocations should be decentralized.
- Performance degradation is small as long as the interconnect / memory controller is not saturated.

Conclusions
Techniques for improving performance:
- Lightweight, synchronization-free fast path
- NUMA-aware memory management (most data accesses stay inside NUMA nodes)
- Chunk-based stealing amortizes stealing costs
- Elegant load balancing using per-consumer chunk pools
Great performance:
- Linear scalability
- x20 faster than other work-stealing techniques, x5 faster than state-of-the-art non-FIFO pools
- Highly robust to imbalances and unexpected thread stalls

Backup

Chunk Size
The optimal chunk size for SALSA is 1000 entries, which is about the size of a page. This may make it possible to migrate chunks from one node to another when stealing.

Chunk Stealing: Overview
The stealing consumer:
1. Points to the chunk from its special "steal list".
2. Updates the ownership via CAS.
3. Removes the chunk from the original list.
4. CASes the entry at idx + 1 from Task to TAKEN.
[Figure: consumer c1's and c2's chunk lists (prod0, prod1, steal); the stolen chunk's owner field changes from c1 to c2 and the chunk becomes reachable from c2's steal list; entries are marked TAKEN, task, or ┴.]
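A simplified sketch of these steps on top of the earlier Chunk and Node classes. The real algorithm contains extra checks for the data races and liveness issues mentioned on the "Stealing" slide, so treat this only as the skeleton of the steal path.

// Steal a chunk from a node in the victim's list into this consumer's steal list.
class Stealer {
    static Object stealChunk(Node victimNode, Node myStealNode, int myId) {
        Chunk chunk = victimNode.chunk;
        if (chunk == null) return null;                   // nothing to steal from this node
        myStealNode.chunk = chunk;                        // 1. point to the chunk from the steal list
        int oldOwner = chunk.owner.get();
        if (!chunk.owner.compareAndSet(oldOwner, myId))   // 2. take ownership via CAS
            return null;                                  //    lost the race to another stealer
        int i = victimNode.idx;                           // idx must be read from the owner's list node
        victimNode.chunk = null;                          // 3. remove the chunk from the original list
        myStealNode.idx = i + 1;
        Object task = chunk.tasks.get(i + 1);
        if (task == null || task == Chunk.TAKEN) return null;
        return chunk.tasks.compareAndSet(i + 1, task, Chunk.TAKEN)  // 4. CAS entry at idx+1 to TAKEN
                ? task : null;
    }
}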

Chunk Stealing: Case 1
Original consumer (c1): idx++; verify ownership; if still the owner, take the task at idx without a CAS; otherwise take the task at idx with a CAS and leave the chunk.
Stealing consumer (c2): change ownership with CAS; i ← original idx; take the task at i+1 with a CAS.
Here the stealer reads idx after c1's increment (i = 1), so c1 takes task 1 and c2 takes task 2. This shows why a CAS must be used by both consumers.
[Figure: a chunk whose owner changes from c1 to c2 while idx goes from 0 to 1; c1 takes task 1 and c2 takes task 2.]

Chunk Stealing: Case 2
Original consumer (c1): idx++; verify ownership; if still the owner, take the task at idx without a CAS; otherwise take the task at idx with a CAS and leave the chunk.
Stealing consumer (c2): change ownership with CAS; i ← original idx; take the task at i+1 with a CAS.
Here the stealer reads idx before c1's increment (i = 0), so both c1 and c2 target task 1, and the CAS guarantees that only one of them takes it. This shows why a CAS must be used by both consumers.
[Figure: a chunk whose owner changes from c1 to c2 while idx goes from 0 to 1; both c1 and c2 attempt to take task 1.]

Chunk Lists
- The lists are managed by the producers; empty nodes are lazily removed.
- When a producer fills a chunk, it takes a new chunk from the chunk pool and adds it to the list.
- List nodes are not stolen, so the idx field is updated by the owner only.
- Chunks must be stolen from the owner's list, to make sure the correct idx field is read.
[Figure: prod0's chunk list with nodes whose idx values are 4, 2, and -1, each pointing to a chunk with owner=c1; chunk entries are marked TAKEN, task, or ┴.]
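A sketch of the producer-side list handling described here, reusing the Producer, Node, and Chunk classes from earlier slides; taking the free chunk from a java.util.Queue and the exact list-append order are illustrative assumptions.

import java.util.Queue;

// Producer insertion into its own chunk list inside one consumer's SCPool.
class ProducerInsert {
    static boolean put(Object task, Producer me, Node listHead, Queue<Chunk> freeChunkPool) {
        if (me.current == null || me.prodIdx == Chunk.CHUNK_SIZE) {
            Chunk fresh = freeChunkPool.poll();  // take a new chunk from the consumer's chunk pool
            if (fresh == null) return false;     // no free chunks: the producer moves to another consumer
            Node node = new Node();
            node.chunk = fresh;
            node.next = listHead.next;           // only this producer modifies its own list
            listHead.next = node;
            me.current = fresh;
            me.prodIdx = 0;
        }
        me.current.tasks.set(me.prodIdx, task);  // same fast path as before
        me.prodIdx++;
        return true;
    }
}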

NUMA: Non-Uniform Memory Access
Systems with a large number of processors may have high contention on the memory bus. In NUMA systems, every processor has its own memory controller connected to a local memory bank, and accessing remote memory is more expensive.
[Figure: CPU1-CPU4, each attached to its own local memory, connected by an interconnect.]