Nikos Anastopoulos Nectarios Koziris


Facilitating Efficient Thread Synchronization of Asymmetric Threads on Hyper-Threaded Processors Nikos Anastopoulos Nectarios Koziris National Technical University of Athens School of Electrical and Computer Engineering Computing Systems Laboratory Presented by James Coleman

CSE 520 Advanced Computer Architecture

Agenda
- Introduction
- Hyper-Threading
  - Resource Sharing
  - Resource Partitioning
- Thread Synchronization
  - Spin-Locks: HT contention, the PAUSE instruction
  - Halt with IPI (Inter-Processor Interrupt): release of shared resources, power savings, overhead
  - Monitor/mWait: overview, hardware implementation, example code

Agenda (cont.)
- Framework
  - Provides user-level access to privileged instructions
  - Minimal overhead
  - Usage model
- Performance Results
  - Evaluate: resource consumption, responsiveness, call overhead
  - Compare: normal spin-locks, Halt/IPI spin-locks, pThreads, Monitor/mWait
- Conclusion

Introduction: Hyper-Threading (HT)
Hyper-Threading presents software with two logical processors even though only one physical processor is present, effectively doubling the number of CPUs seen by the OS. The OS sees two processors because an HT processor has two sets of architectural state resources. It is still technically only a single processor, because the compute resources (execution units) are not doubled.

Hyper-Threading (cont.)
Resource Sharing: The logical CPUs share the execution units. A speedup is realized only when one thread can take advantage of execution units left idle by the other thread. For example, if one thread is integer-intensive and the other is floating-point-intensive, together they ensure maximum utilization of the CPU's compute resources.
Resource Partitioning: The CPU has resources that are statically partitioned between the threads, such as the μ-op queues, load-store queues, re-order buffers, etc. These resources can all be allocated to a single logical CPU when HT is disabled, but they are split 50/50 when HT is enabled.
Intel Atom Processor: Atom represents a completely new, extremely low-power μ-architecture. One of the most notable changes, yielding a drastic power saving, is the removal of out-of-order execution. Out-of-order CPUs can keep their execution units busier than in-order CPUs, so Atom sees a much larger performance boost from HT than the P4 or the Core i7.

Thread Synchronization
When a task is broken up into parallelizable sub-tasks (threaded), the cooperating threads must synchronize their efforts, both to obtain the work for their sub-task and to return the results for consolidation. Efficient thread synchronization becomes a major limiting factor in the scaling of multi-threaded applications. Many mechanisms exist today that handle thread synchronization for the software engineer:
- Barriers
- Locks
- Semaphores
- Mutexes

Spin Locks
A spin lock is an extremely popular synchronization method due to its simplicity and low latency (responsiveness). A single lock variable is contended for by all interested threads; the first one to access it gains the lock, and all subsequent accesses spin waiting for the lock to be freed. Since the waiting threads never yield the CPU to the OS, one of the waiters will get the lock and proceed with execution shortly after the lock holder frees it.
Hyper-Threading considerations: If one thread is spinning on a lock, it floods the execution units with instructions and hampers the performance of the neighboring thread, even though the spinning does no useful work.
PAUSE instruction: Intel introduced the PAUSE instruction for use inside spin loops, to let the processor know that it is in a spin loop so it can relax execution.

Halt with IPI
Even with the PAUSE instruction, the CPU still wastes execution-unit cycles, and on a hyper-threaded system the statically partitioned resources (μ-op and load-store queues, and re-order buffers) are wasted.
Hyper-Threading and the HALT instruction: When a logical CPU is halted, it no longer feeds instructions into the execution units, and all statically partitioned resources are freed and can be fully utilized by the other thread. To wake a halted CPU, it must receive an IPI.
While a CPU is halted it is in an extremely low-power state, resulting in significant power savings over a spinning CPU (important today). Halt/IPI can provide the same functionality as a spin lock, but with significant power savings as well as reduced HT contention.
There is significant software overhead involved in generating an IPI, as well as hardware overhead for the transitions into and out of the halted state. The HALT instruction is a privileged ring-0 instruction.

Monitor/mWait
The Monitor/mWait instruction pair provides the same functionality as Halt/IPI with a significant reduction in overhead. The MONITOR instruction specifies a region of memory to watch, or "monitor". The MWAIT instruction "halts" the CPU until the monitored region is written to. The CPU can also be woken for other reasons, such as an interrupt, so software must check for the expected value before proceeding, to know why it was woken.
The region specified to monitor should be large enough to ensure that the write intended to trigger the wake-up falls within the region. The region should be no larger than needed, and no other writes should occur within it, or false wake-ups will hamper efficiency.
Because these instructions halt the CPU, they are privileged and can only be executed from ring 0 (the operating system).

Monitor/mWait Hardware Implementation
The MONITOR instruction:
- Informs the hardware which region of memory to monitor for writes.
- Arms the triggering hardware.
- Clears the flag.
The MWAIT instruction puts the processor into a low-power state until the flag is set.
Example code:

Proposed Framework
To use the privileged instructions at the user level, a framework is needed; any overhead introduced by the framework would erode the performance gains of the Monitor/mWait scheme. The authors propose an extension to the Linux kernel providing user-level system calls that access the privileged Monitor/mWait instructions.
The memory to monitor must be either in kernel space or in user space:
- In kernel space, the user-level application would have to make a system call for every update (to notify the waiter).
- In user space, the kernel would have to copy from user space for every check (during false wake-ups as well as the final wake-up).
The framework uses a workaround that eliminates the overhead in both cases: the memory of a kernel-space character driver is mapped to user space, allowing direct access from both kernel and user space with no additional overhead.

Usage Model of the Framework
Initialization: The user application opens the character device and maps its memory to user space (one-time initialization overhead).
Wait: The thread wanting to wait makes one system call; when the system call returns, the thread is safe to proceed.
Notify: The thread that wants to notify the waiter simply writes to the mapped region.
The per-synchronization overhead of this framework is zero for the lock holder, and only the cost of one system call for the waiter.

Performance Evaluation
The proposed framework is measured with a two-thread application that has one heavyweight worker thread and one lightweight helper thread. The worker runs 100% of the time; the helper is given a small task and, upon completion, waits for more work.
The evaluation looks at three aspects:
- Resource consumption: measured by the time taken by the worker to complete its job. In an HT situation with the helper spinning, the worker takes longer, since it must contend for the execution units with the spinning helper.
- Responsiveness: measured by the time from when the worker needs the helper to when the helper starts helping.
- Call overhead: the time the worker spends telling the helper it is needed.
The evaluation compares:
- Spin-locks (PAUSE)
- Spin-locks (Halt with IPI)
- futex-based synchronization in pThreads
- Monitor/mWait

Performance Results
The results show that traditional spin-locks have the lowest latency but the highest resource contention. Monitor/mWait has the lowest resource contention and the second-lowest latency, second only to spin-locks.
Anomalies: The work times of the spin-locks with HALT and of Monitor/mWait are expected to be the same, but they are not. The call latency of Monitor/mWait is expected to be the same as that of spin-locks, but it is not.

Performance Results (cont.)
Varying the ratio of work done by the helper, with 0 being none and 10 being as much as the worker, we see the following trends.

Conclusion
Monitor/mWait-based synchronization provides an excellent balance between resource waste, wake-up latency, and call overhead.
Future work: evaluate the proposed synchronization primitives for parallel programs with fine-grained synchronization. The authors argue: "With the advent of hybrid architectures that encompass multitude of hardware contexts within a single chip, architecture-aware hierarchical synchronization schemes will play a significant role in parallel application performance and thus seem to be worthwhile to investigate"