@gfraiteur a deep investigation on how.NET multithreading primitives map to hardware and Windows Kernel You Thought You Understood Multithreading Gaël.

Slides:



Advertisements
Similar presentations
Computer-System Structures Er.Harsimran Singh
Advertisements

More on Processes Chapter 3. Process image _the physical representation of a process in the OS _an address space consisting of code, data and stack segments.
CS492B Analysis of Concurrent Programs Lock Basics Jaehyuk Huh Computer Science, KAIST.
ECE 454 Computer Systems Programming Parallel Architectures and Performance Implications (II) Ding Yuan ECE Dept., University of Toronto
Ch. 7 Process Synchronization (1/2) I Background F Producer - Consumer process :  Compiler, Assembler, Loader, · · · · · · F Bounded buffer.
Chapter 6: Process Synchronization
Threads, SMP, and Microkernels Chapter 4. Process Resource ownership - process is allocated a virtual address space to hold the process image Scheduling/execution-
OS2-1 Chapter 2 Computer System Structures. OS2-2 Outlines Computer System Operation I/O Structure Storage Structure Storage Hierarchy Hardware Protection.
1 Lecture 2: Review of Computer Organization Operating System Spring 2007.
1 Computer System Overview OS-1 Course AA
Computer System Overview
Semaphores CSCI 444/544 Operating Systems Fall 2008.
Computer System Overview Chapter 1. Basic computer structure CPU Memory memory bus I/O bus diskNet interface.
CPS110: Implementing threads/locks on a uni-processor Landon Cox.
Race Conditions CS550 Operating Systems. Review So far, we have discussed Processes and Threads and talked about multithreading and MPI processes by example.
1 Threads Chapter 4 Reading: 4.1,4.4, Process Characteristics l Unit of resource ownership - process is allocated: n a virtual address space to.
Chapter 2: Computer-System Structures
Chapter 13: I/O Systems I/O Hardware Application I/O Interface
General System Architecture and I/O.  I/O devices and the CPU can execute concurrently.  Each device controller is in charge of a particular device.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 1: Introduction.
Objectives To provide a grand tour of the major operating systems components To provide coverage of basic computer system organization.
More on Locks: Case Studies
Computer System Overview Chapter 1. Operating System Exploits the hardware resources of one or more processors Provides a set of services to system users.
Object Oriented Analysis & Design SDL Threads. Contents 2  Processes  Thread Concepts  Creating threads  Critical sections  Synchronizing threads.
Contact Information Office: 225 Neville Hall Office Hours: Monday and Wednesday 12:00-1:00 and by appointment.
Chapter 2: Computer-System Structures
1 CSE Department MAITSandeep Tayal Computer-System Structures Computer System Operation I/O Structure Storage Structure Storage Hierarchy Hardware Protection.
2: Computer-System Structures
Operating Systems and Networks AE4B33OSS Introduction.
Implementing Synchronization. Synchronization 101 Synchronization constrains the set of possible interleavings: Threads “agree” to stay out of each other’s.
1 Chapter 2: Computer-System Structures  Computer System Operation  I/O Structure  Storage Structure  Storage Hierarchy  Hardware Protection  General.
Lecture 3 Process Concepts. What is a Process? A process is the dynamic execution context of an executing program. Several processes may run concurrently,
4.1 Introduction to Threads Overview Multithreading Models Thread Libraries Threading Issues Operating System Examples Windows XP Threads Linux Threads.
1 Announcements The fixing the bug part of Lab 4’s assignment 2 is now considered extra credit. Comments for the code should be on the parts you wrote.
Computers Operating System Essentials. Operating Systems PROGRAM HARDWARE OPERATING SYSTEM.
Java Thread and Memory Model
1 CS.217 Operating System By Ajarn..Sutapart Sappajak,METC,MSIT Chapter 2 Computer-System Structures Slide 1 Chapter 2 Computer-System Structures.
Silberschatz, Galvin and Gagne  Applied Operating System Concepts Chapter 2: Computer-System Structures Computer System Architecture and Operation.
Processes, Threads, and Process States. Programs and Processes  Program: an executable file (before/after compilation)  Process: an instance of a program.
Lecture 1: Review of Computer Organization
Review of Computer System Organization. Computer Startup For a computer to start running when it is first powered up, it needs to execute an initial program.
OSes: 2. Structs 1 Operating Systems v Objective –to give a (selective) overview of computer system architectures Certificate Program in Software Development.
Managing Processors Jeff Chase Duke University. The story so far: protected CPU mode user mode kernel mode kernel “top half” kernel “bottom half” (interrupt.
Interrupts and Exception Handling. Execution We are quite aware of the Fetch, Execute process of the control unit of the CPU –Fetch and instruction as.
Big Picture Lab 4 Operating Systems C Andras Moritz
Chapter 2: Computer-System Structures(Hardware)
Chapter 2: Computer-System Structures
Processes and threads.
CS703 – Advanced Operating Systems
Silberschatz, Galvin and Gagne  Operating System Concepts Chapter 2: Computer-System Structures Computer System Operation I/O Structure Storage.
CSCI 315 Operating Systems Design
Computer-System Architecture
Module 2: Computer-System Structures
I/O Systems I/O Hardware Application I/O Interface
Synchronization Issues
Lecture 4- Threads, SMP, and Microkernels
Threads Chapter 4.
Module 2: Computer-System Structures
Chapter 13: I/O Systems I/O Hardware Application I/O Interface
Chapter 1 Computer System Overview
CSE 451: Operating Systems Autumn 2003 Lecture 7 Synchronization
CSE 451: Operating Systems Autumn 2005 Lecture 7 Synchronization
CSE 451: Operating Systems Winter 2003 Lecture 7 Synchronization
CSE 153 Design of Operating Systems Winter 19
System Calls System calls are the user API to the OS
Chapter 2: Computer-System Structures
Chapter 2: Computer-System Structures
Module 2: Computer-System Structures
Module 2: Computer-System Structures
Sarah Diesburg Operating Systems CS 3430
Presentation transcript:

@gfraiteur a deep investigation on how.NET multithreading primitives map to hardware and Windows Kernel You Thought You Understood Multithreading Gaël Fraiteur PostSharp Technologies Founder & Principal Engineer

@gfraiteur Hello! My name is GAEL. No, I don’t think my accent is funny. my twitter

@gfraiteur Agenda 1. Hardware 2. Windows Kernel 3. Application

@gfraiteur Microarchitecture Hardware

@gfraiteur There was no RAM in Colossus, the world’s first electronic programmable computing device (1944).

@gfraiteur Hardware Processor microarchitecture Intel Core i7 980x

@gfraiteur Hardware Hardware latency Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI) with SiSoftware Sandra

@gfraiteur Hardware Cache Hierarchy Distributed into caches L1 (per core) L2 (per core) L3 (per die) Die 1 Core 4 Processor Core 5 Processor Core 7 Processor L2 Cache Core 6 Processor L2 Cache Memory Store (DRAM) Die 0 Core 1 Processor Core 0 Processor Core 2 Processor L2 Cache Core 3 Processor L2 Cache L3 Cache L2 Cache L3 Cache L2 Cache Data Transfer Core Interconnec t Processor Interconnect Synchronization Synchronized through an asynchronous “bus”: Request/response Message queues Buffers

@gfraiteur Hardware Memory Ordering Issues due to implementation details: Asynchronous bus Caching & Buffering Out-of-order execution

@gfraiteur Hardware Memory Ordering Processor 0Processor 1 Store A := 1Load A Store B := 1Load B Possible values of (A, B) for loads? Processor 0Processor 1 Load ALoad B Store B := 1Store A := 1 Possible values of (A, B) for loads? Processor 0Processor 1 Store A := 1Store B := 1 Load BLoad A Possible values of (A, B) for loads?

@gfraiteur Hardware Memory Models: Guarantees X84, AMD64: strong orderingARM: weak ordering

@gfraiteur Hardware Memory Models Compared Processor 0Processor 1 Store A := 1Load A Store B := 1Load B Processor 0Processor 1 Load ALoad B Store B := 1Store A := 1

@gfraiteur Hardware Compiler Memory Ordering The compiler (CLR) can: Cache (memory into register), Reorder Coalesce writes volatile keyword disables compiler optimizations. And that’s all!

@gfraiteur Hardware Thread.MemoryBarrier() Processor 0Processor 1 Store A := 1Store B := 1 Thread.MemoryBarrier() Load BLoad A Barriers serialize access to memory.

@gfraiteur Hardware Atomicity Processor 0Processor 1 Store A := (long) 0xFFFFFFFFLoad A Up to native word size (IntPtr) Properly aligned fields (attention to struct fields!)

@gfraiteur Hardware Interlocked access The only locking mechanism at hardware level. System.Threading.Interlocked Add Add(ref x, int a ) { x += a; return x; } Increment Inc(ref x ) { x += 1; return x; } Exchange Exc( ref x, int v ) { int t = x; x = a; return t; } CompareExchange CAS( ref x, int v, int c ) { int t = x; if ( t == c ) x = a; return t; } Read(ref long) Implemented by L3 cache and/or interconnect messages. Implicit memory barrier

@gfraiteur Hardware Cost of interlocked Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI) with C# code. Cache latencies measured with SiSoftware Sandra.

@gfraiteur Hardware Cost of cache coherency

@gfraiteur Multitasking No thread, no process at hardware level There no such thing as “wait” One core never does more than 1 “thing” at the same time (except HyperThreading) Task-State Segment: CPU State Permissions (I/O, IRQ) Memory Mapping A task runs until interrupted by hardware or OS

@gfraiteur Lab: Non-Blocking Algorithms Non-blocking stack (4.5 MT/s) Ring buffer (13.2 MT/s)

@gfraiteur Windows Kernel Operating System

@gfraiteur SideKick (1988).

@gfraiteur Program (Sequence of Instructions) CPU State Wait Dependencies (other things) Virtual Address Space Resources At least 1 thread (other things) Processes and Threads ProcessThread Windows Kernel

@gfraiteur Windows Kernel Processed and Threads 8 Logical Processors (4 hyper-threaded cores) 2384 threads 155 processes

@gfraiteur Windows Kernel Dispatching threads to CPU Executing CPU 0 CPU 1 CPU 2 CPU 3 Thread H Thread G Thread I Thread J Mutex Semaphore Event Timer Thread A Thread C Thread B Wait Thread D Thread E Thread F Ready

@gfraiteur Live in the kernel Sharable among processes Expensive! Event Mutex Semaphore Timer Thread Dispatcher Objects (WaitHandle) Kinds of dispatcher objectsCharacteristics Windows Kernel

@gfraiteur Windows Kernel Cost of kernel dispatching Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI)

@gfraiteur Windows Kernel Wait Methods Thread.Sleep Thread.Yield Thread.Join WaitHandle.Wait (Thread.SpinWait)

@gfraiteur Windows Kernel SpinWait: smart busy loops 1. Thread.SpinWait (Hardware awareness, e.g. Intel HyperThreading) 2. Thread.Sleep(0) (give up to threads of same priority) 3. Thread.Yield() (give up to threads of any priority) 4. Thread.Sleep(1) (sleeps for 1 ms) Processor 0Processor 1 A = 1; B = 1; SpinWait w = new SpinWait(); int localB; if ( A == 1 ) { while ( ((localB = B) == 0 ) w.SpinOnce(); }

@gfraiteur.NET Framework Application Programming

@gfraiteur Application Programming Design Objectives Ease of use High performance from applications!

@gfraiteur Application Programming Instruments from the platform Hardware Interlocked SpinWait Volatile Operating System Thread WaitHandle (mutex, event, semaphore, …) Timer

@gfraiteur Application Programming New Instruments Concurrency & Synchronization Monitor (Lock) , SpinLock ReaderWriterLock  Lazy, LazyInitializer Barrier, CountdownEvent Data Structures BlockingCollection  Concurrent(Queue  |Stack  |Bag  |Dictionary  ) ThreadLocal, ThreadStatic Thread Dispatching ThreadPool  Dispatcher , ISynchronizeInvoke BackgroundWorker Frameworks PLINQ Tasks Async/Await

@gfraiteur Concurrency & Synchronization Application Programming

@gfraiteur Concurrency & Synchronization Monitor: the lock keyword 1. Start with interlocked operations (no contention) 2. Continue with “spin wait”. 3. Create kernel event and wait  Good performance if low contention

@gfraiteur Concurrency & Synchronization ReaderWriterLock Concurrent reads allowed Writes require exclusive access

@gfraiteur Concurrency & Synchronization Lab: ReaderWriterLock

@gfraiteur Data Structures Application Programming

@gfraiteur Data Structures BlockingCollection Use case: pass messages between threads TryTake: wait for an item and take it Add: wait for free capacity (if limited) and add item to queue CompleteAdding: no more item

@gfraiteur Data Structures Lab: BlockingCollection

@gfraiteur Data Structures ConcurrentStack node top Producer A Producer B Producer C Consumer A Consumer B Consumer C Consumers and producers compete for the top of the stack node

@gfraiteur Data Structures Lab: Concurrent Stack

@gfraiteur Data Structures ConcurrentQueue node tail Producer A Producer B Producer C Consumer A Consumer B Consumer C head Consumers compete for the head. Producers compete for the tail. Both compete for the same node if the queue is empty.

@gfraiteur Data Structures ConcurrentBag Thread A Bag Thread Local Storage AThread Local Storage B node Producer AConsumer A Thread B Producer BConsumer B Threads preferentially consume nodes they produced themselves. No ordering

@gfraiteur Data Structures ConcurrentDictionary TryAdd TryUpdate (CompareExchange) TryRemove AddOrUpdate GetOrAdd

@gfraiteur Thread Dispatching Application Programming

@gfraiteur Thread Dispatching ThreadPool Problems solved: Execute a task asynchronously (QueueUserWorkItem) Waiting for WaitHandle (RegisterWaitForSingleObject) Creating a thread is expensive Good for I/O intensive operations Bad for CPU intensive operations

@gfraiteur Thread Dispatching Dispatcher Execute a task in the UI thread Synchronously: Dispatcher.Invoke Asynchonously: Dispatcher.BeginInvoke Under the hood, uses the HWND message loop

@gfraiteur Thread Dispatching

@gfraiteur Q&A Gael

Summary

@gfraiteur