Presentation transcript:

Lock-free Cache-friendly Software Queue for Decoupled Software Pipelining
Student: Chen Wen-Ren    Advisor: Wuu Yang

Abstract

Multicore has become a trend in server and client computers in recent years. Parallelization is one way to fully utilize the computing power provided by multicore architectures. Most applications of interest have complex data and control dependences, which make traditional parallelization techniques, such as DOALL and DOACROSS, inapplicable. Decoupled Software Pipelining (DSWP), a newer parallelization technique, shows potential for parallelizing general applications. Its success, however, relies on fast inter-core synchronization and communication. On commodity multicore platforms, the performance of current DSWP is disappointing: the overhead of its lock-based, cache-unfriendly software queue offsets the benefit of DSWP. We present a lock-free, cache-friendly software queue designed for DSWP. A lock-free, cache-friendly solution must take two different aspects of the memory system, memory coherence and memory consistency, into consideration. We show how inattention to these two aspects leads to incorrect or inefficient solutions, and we explain our approach to a correct and efficient solution in detail. Due to the nondeterministic nature of parallel programs, traditional testing techniques cannot fully verify the correctness of the implementation, so we also discuss the correctness of our implementation in both informal and formal ways.

Dekker's and Peterson's algorithms could be broken on multicore systems

As shown in Figure 1, mutual exclusion is guaranteed only if at least one of the two processes reads a nonzero value of the other's flag (flag1 or flag2); if both reads return zero, mutual exclusion is violated. In order to improve the performance of sequential programs, compilers, CPUs, and caches put much emphasis on optimizing memory reads and writes: they may reorder, insert, or remove reads and writes in order to avoid or delay memory accesses. Figure 2 gives a possible execution of Dekker's and Peterson's algorithms after such reordering by the compiler, CPU, or cache. In that execution, both processes read flag1 and flag2 as zero, which means P1 and P2 will enter the critical section at the same time. (A runnable litmus test illustrating this failure appears at the end of this transcript.)

Our Approach - Class QueueBuffer Data Members

We declare the shared, mutable variables m_front and m_back as ordered atomic variables, using the template class atomic provided by the Intel Threading Building Blocks (TBB) library. The atomic class supports atomic read, write, fetch-and-add, fetch-and-store, and compare-and-swap operations; by default, reads carry an acquire fence and writes a release fence. Since false sharing hurts performance, we also take false-sharing avoidance into account when laying out the data members of class QueueBuffer: according to their locality, we group the members into chunks that are multiples of the cache-line size and are aligned on cache-line boundaries, using alignment and padding.

Our Approach - Class QueueBuffer Member Functions

Since atomic variables support atomic reads and writes, it is safe for the member functions push and front to access m_front and m_back concurrently without a lock. In addition, atomic associates an acquire fence with each read and a release fence with each write; these fences ensure that push does not update m_back until the data has actually been inserted into m_buf. Finally, we use local variables (e.g., local_back) as much as possible, since they can be cached.
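To make the two "Our Approach" slides concrete, here is a minimal sketch of class QueueBuffer against the classic tbb::atomic interface, assuming a single-producer/single-consumer ring buffer. The element type int, the capacity QUEUE_SIZE, the 64-byte CACHE_LINE_SIZE, the pad members, and the pop function are our illustrative assumptions, not details taken from the thesis.

    #include "tbb/atomic.h"  // classic TBB: atomic reads are acquire, writes are release
    #include <cstddef>

    const std::size_t CACHE_LINE_SIZE = 64;    // assumption: 64-byte cache lines
    const std::size_t QUEUE_SIZE      = 1024;  // illustrative capacity

    class QueueBuffer {
    public:
        QueueBuffer() { m_front = 0; m_back = 0; }

        // Producer side: returns false when the queue is full.
        bool push(int value) {
            std::size_t local_back = m_back;          // read the shared index once into a local
            std::size_t next = (local_back + 1) % QUEUE_SIZE;
            if (next == m_front) return false;        // acquire read of m_front
            m_buf[local_back] = value;                // write the element first ...
            m_back = next;                            // ... then publish it with a release store
            return true;
        }

        // Consumer side: returns false when the queue is empty.
        bool front(int &value) {
            std::size_t local_front = m_front;        // read the shared index once into a local
            if (local_front == m_back) return false;  // acquire read of m_back
            value = m_buf[local_front];               // element already published by push
            return true;
        }

        void pop() { m_front = (m_front + 1) % QUEUE_SIZE; }  // release store frees the slot

    private:
        // Members written by different cores are placed on different cache
        // lines via padding, so the producer and consumer do not false-share.
        // (A real implementation would also align the object itself on a
        // cache-line boundary with a compiler-specific attribute.)
        tbb::atomic<std::size_t> m_back;   // written by the producer
        char pad1[CACHE_LINE_SIZE - sizeof(tbb::atomic<std::size_t>)];
        tbb::atomic<std::size_t> m_front;  // written by the consumer
        char pad2[CACHE_LINE_SIZE - sizeof(tbb::atomic<std::size_t>)];
        int m_buf[QUEUE_SIZE];             // element storage shared by both sides
    };

The key ordering property described above is visible in push: the element is written to m_buf before the release store to m_back, so front's acquire read of m_back guarantees the element is visible. Note also that each function reads the shared indices into locals (local_back, local_front) only once per call.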
Accessing ordered atomic variables, however, might involve expensive memory accesses.
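Finally, to make the reordering hazard from the Dekker/Peterson discussion concrete, here is a minimal, self-contained litmus test in the spirit of Figures 1 and 2 (our illustration; the thesis's figures are not reproduced here). Each thread raises its own flag and then reads the other's; because neither the compiler nor the hardware is required to keep a plain store ahead of a later load, a run may observe both reads as zero.

    #include <pthread.h>
    #include <cstdio>

    int flag1 = 0, flag2 = 0;  // "I want to enter the critical section"
    int r1 = 1, r2 = 1;        // value each process reads from the other's flag

    void *p1(void *) {
        flag1 = 1;   // this store can sit in a store buffer ...
        r1 = flag2;  // ... so the load may effectively execute first
        return 0;
    }

    void *p2(void *) {
        flag2 = 1;
        r2 = flag1;
        return 0;
    }

    int main() {
        pthread_t t1, t2;
        pthread_create(&t1, 0, p1, 0);
        pthread_create(&t2, 0, p2, 0);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        // Without fences, hardware or compiler reordering permits r1 == r2 == 0,
        // i.e., both processes would enter the critical section together.
        if (r1 == 0 && r2 == 0)
            std::printf("mutual exclusion violated\n");
        return 0;
    }

Declaring the flags as ordered atomic variables (or inserting explicit full fences between the store and the load) rules out the r1 == r2 == 0 outcome, which is exactly the guarantee the queue above relies on.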