Bart J.F. De Smet Software Development Engineer Microsoft Corporation Session Code: DTL206 Wishful thinking?
Agenda The concurrency landscape Language headaches .NET 4.0 facilities Task Parallel Library PLINQ Coordination Data Structures Asynchronous programming Incubation projects Summary
Moore’s law The number of transistors incorporated in a chip will approximately double every 24 months. Gordon Moore – Intel – 1965 Let’s sell processors
Moore’s law today It can't continue forever. The nature of exponentials is that you push them out and eventually disaster happens. Gordon Moore – Intel – 2005 Let’s sell even more processors
Hardware Paradigm Shift “… we see a very significant shift in what architectures will look like in the future... fundamentally the way we've begun to look at doing that is to move from instruction level concurrency to … multiple cores per die. But we're going to continue to go beyond there. And that just won't be in our server lines in the future; this will permeate every architecture that we build. All will have massively multicore implementations.” Pat Gelsinger, Chief Technology Officer, Senior Vice President, Intel Corporation – Intel Developer Forum, Spring 2004
[Charts: power density (W/cm²) of Pentium-class processors from the ’70s through the ’10s approaching hot plate, nuclear reactor, rocket nozzle, and Sun’s-surface levels – heat is becoming an unmanageable problem in today’s architecture; single-threaded performance grows ~10% per year while many-core peak parallel GOPS represent an ~80x parallelism opportunity. To grow, to keep up, we must embrace parallel computing.]
Problem statement Shared mutable state Needs synchronization primitives Locks are problematic Risk of contention Poor discoverability (SyncRoot anyone?) Not composable Difficult to get right (deadlocks, etc.) Coarse-grained concurrency Threads well-suited for large units of work Expensive context switching Asynchronous programming
What can go wrong? Races Deadlocks Livelocks Lock convoys Cache coherency Overheads Lost event notifications Broken serializability Priority inversion
Microsoft Parallel Computing Initiative Applications Domain libraries Programming models & languages (VB, C#, F#) Developer tooling Runtime, platform, OS, hypervisor Hardware – Constructing parallel applications Executing fine-grain parallel applications Coordinating system resources/services
Agenda The concurrency landscape Language headaches .NET 4.0 facilities Task Parallel Library PLINQ Coordination Data Structures Asynchronous programming Incubation projects Summary
Languages: two extremes LISP heritage (Haskell, ML): no mutable state – “fundamentalist” functional programming Fortran heritage (C, C++, C#, VB): mutable state F# sits in between
Mutability
Mutable by default (C# et al.) – synchronization required:
    int x = 5; // Share out x
    x++;
Immutable by default (F# et al.) – no locking required:
    let x = 5 // Share out x
    // Can’t mutate x
Mutation is an explicit opt-in:
    let mutable x = 5 // Share out x
    x <- x + 1
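A minimal C# sketch (mine, not from the deck) of why this distinction matters: a counter shared across parallel iterations must be updated atomically, while data that is never mutated after publication can be read from any thread without locks. All names below are illustrative.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class MutabilityDemo
    {
        static void Main()
        {
            int counter = 0;   // shared mutable state: writes must be synchronized
            int seed = 5;      // never mutated after this point: safe to read from any thread

            Parallel.For(0, 1000, i =>
            {
                int delta = seed > 0 ? 1 : 0;        // plain read of effectively immutable data
                Interlocked.Add(ref counter, delta); // atomic update instead of counter += delta
            });

            Console.WriteLine(counter); // 1000; with an unsynchronized counter, updates could be lost
        }
    }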
Side-effects will kill you Elimination of common sub-expressions? Is (DateTime.Now, DateTime.Now) equivalent to let now = DateTime.Now in (now, now)? No – the signature static DateTime Now { get; } doesn’t reveal the side-effect The runtime is out of control: it can’t optimize such code Types don’t reveal side-effects – hence the Haskell concept of the IO monad Did you know? LINQ is a monad!
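An illustrative C# snippet (my wording, not from the deck): because DateTime.Now observes the clock on every call, the “common sub-expression” cannot be eliminated without changing the program’s behavior.

    using System;
    using System.Threading;

    class CseDemo
    {
        static void Main()
        {
            DateTime a1 = DateTime.Now, a2 = a1;   // 'common sub-expression' evaluated once
            Console.WriteLine(a1 == a2);           // True – always

            DateTime b1 = DateTime.Now;
            Thread.Sleep(1);
            DateTime b2 = DateTime.Now;            // evaluated again: a different observation
            Console.WriteLine(b1 == b2);           // False – the clock moved on
        }
    }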
Monads for dummies IO – Promote (Return): lift a plain value into the monad, T → IO<T>
Monads for dummies IO – Combine (Bind): sequence a computation IO<T> with a function T → IO<R>, yielding IO<R> LINQ analogue: IEnumerable<R> SelectMany<T, R>(IEnumerable<T>, Func<T, IEnumerable<R>>)
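A small C# illustration (mine, not from the deck) of SelectMany playing the role of Bind for the sequence monad, with a hand-rolled Return:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class BindDemo
    {
        // Return for the sequence monad: lift a single value into IEnumerable<T>.
        static IEnumerable<T> Unit<T>(T value) { yield return value; }

        static void Main()
        {
            // Bind for the sequence monad is SelectMany: feed each produced value into
            // the next monadic computation and flatten the results.
            IEnumerable<int> xs = Unit(3);
            IEnumerable<int> result = xs.SelectMany(x => Enumerable.Range(0, x));
            Console.WriteLine(string.Join(", ", result)); // 0, 1, 2
        }
    }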
Languages: two roadmaps? Making C# better Add safety nets? Immutability Purity constructs Linear types Software Transactional Memory Kamikaze-style of concurrency Simplify common patterns Making Haskell mainstream Just right? Too academic? Not a smooth upgrade path? Two paths from C# and Haskell toward the same “nirvana”
Taming side-effects in F# Bart J.F. De Smet Software Development Engineer Microsoft Corporation
Agenda The concurrency landscape Language headaches .NET 4.0 facilities Task Parallel Library PLINQ Coordination Data Structures Asynchronous programming Incubation projects Summary
Parallel Extensions Architecture
.NET program (C#, VB, C++, F#, other .NET compilers → IL) uses PLINQ, TPL, or CDS
PLINQ Execution Engine: declarative queries; query analysis; data partitioning (chunk, range, hash, striped, repartitioning); operator types (map, scan, build, search, reduction); merging (async/pipeline, synch, order preserving, sorting, ForAll)
Task Parallel Library (TPL): task APIs, task parallelism, futures, scheduling; parallel algorithms
Coordination Data Structures: thread-safe collections, synchronization types, coordination types
OS scheduling primitives (also UMS in Windows 7 and up), running on Proc 1 … Proc p
Task Parallel Library – Tasks System.Threading.Tasks Task – parent-child relationships, explicit grouping, waiting and cancellation Task<TResult> – tasks that produce values, also known as futures A Parallel invocation fans out into Task 1, Task 2, …, Task N
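A minimal .NET 4-style sketch (illustrative, not from the deck) of creating, waiting on, and getting a value out of tasks:

    using System;
    using System.Threading.Tasks;

    class TaskDemo
    {
        static void Main()
        {
            // A plain task: a unit of work handed to the scheduler.
            Task t = Task.Factory.StartNew(() => Console.WriteLine("working..."));

            // A future: a task that produces a value.
            Task<int> sum = Task.Factory.StartNew(() =>
            {
                int acc = 0;
                for (int i = 1; i <= 100; i++) acc += i;
                return acc;
            });

            Task.WaitAll(t, sum);          // wait for both, observing any exceptions
            Console.WriteLine(sum.Result); // 5050
        }
    }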
Work Stealing Internally, the runtime uses work-stealing techniques and lock-free concurrent task queues Work stealing has provably good locality and work-distribution properties: each worker (p1, p2, …) keeps a local queue and steals from others when idle
Example code to parallelize
void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
{
    for (int i = 0; i < size; i++)
    {
        for (int j = 0; j < size; j++)
        {
            result[i, j] = 0;
            for (int k = 0; k < size; k++)
            {
                result[i, j] += m1[i, k] * m2[k, j];
            }
        }
    }
}
Solution today
int N = size;
int P = 2 * Environment.ProcessorCount;
int Chunk = N / P;                                     // size of a work chunk
ManualResetEvent signal = new ManualResetEvent(false);
int counter = P;                                       // counter limits kernel transitions
for (int c = 0; c < P; c++)                            // for each chunk
{
    ThreadPool.QueueUserWorkItem(o =>
    {
        int lc = (int)o;
        for (int i = lc * Chunk;                       // process one chunk
             i < (lc + 1 == P ? N : (lc + 1) * Chunk); // respect upper bound
             i++)
        {                                              // original loop body
            for (int j = 0; j < size; j++)
            {
                result[i, j] = 0;
                for (int k = 0; k < size; k++)
                {
                    result[i, j] += m1[i, k] * m2[k, j];
                }
            }
        }
        if (Interlocked.Decrement(ref counter) == 0)   // efficient interlocked ops
        {
            signal.Set();                              // and kernel transition only when done
        }
    }, c);
}
signal.WaitOne();
Error prone – High overhead – Tricks – Static work distribution – Knowledge of synchronization primitives – Heavy synchronization – Lack of thread reuse
Solution with Parallel Extensions
void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
{
    Parallel.For(0, size, i =>
    {
        for (int j = 0; j < size; j++)
        {
            result[i, j] = 0;
            for (int k = 0; k < size; k++)
            {
                result[i, j] += m1[i, k] * m2[k, j];
            }
        }
    });
}
Structured parallelism
Task Parallel Library – Loops
Loops are a common source of work in programs
Parallel class (System.Threading.Tasks)
Parallelism when iterations are independent: the body doesn’t depend on mutable state, e.g. static variables, or local variables written in one iteration and read in a subsequent one
Synchronous: all iterations finish, regularly or exceptionally
for (int i = 0; i < n; i++) work(i);  →  Parallel.For(0, n, i => work(i));
foreach (T e in data) work(e);        →  Parallel.ForEach(data, e => work(e));
Why immutability gains attention
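A sketch (mine, assuming the .NET 4 Parallel.For overload with thread-local state) of what “independent iterations” means: a shared running total creates a dependency across iterations, so either synchronize it or aggregate per thread and combine once at the end.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class LoopDemo
    {
        static void Main()
        {
            int[] data = new int[10000];
            for (int i = 0; i < data.Length; i++) data[i] = 1;

            // Wrong: total += data[i] inside the body races, because every iteration writes shared state.
            // Right: aggregate into a thread-local subtotal, combine under Interlocked once per thread.
            int total = 0;
            Parallel.For(0, data.Length,
                () => 0,                                      // per-thread initial subtotal
                (i, state, subtotal) => subtotal + data[i],   // independent iteration body
                subtotal => Interlocked.Add(ref total, subtotal));

            Console.WriteLine(total); // 10000
        }
    }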
Task Parallel Library Bart J.F. De Smet Software Development Engineer Microsoft Corporation
Amdahl’s law
Maximum speedup: S = 1 / Σk (Pk / Sk)
    Sk – speed-up factor for portion k
    Pk – percentage of instructions in part k that can be parallelized
Simplified: S = 1 / ((1 − P) + P / N)
    P – percentage of instructions that can be parallelized
    N – number of processors
The sky is not the limit
Amdahl’s law by example Theoretical maximum speedup is determined by the amount of sequential (non-parallelizable) code
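A worked example (my numbers, plugged into the simplified formula above): with 80% of the work parallelizable, even infinitely many cores cannot get past a 5x speedup.
    P = 0.8, N = 4:   S = 1 / (0.2 + 0.8/4) = 1 / 0.4 = 2.5
    P = 0.8, N = 8:   S = 1 / (0.2 + 0.8/8) = 1 / 0.3 ≈ 3.33
    P = 0.8, N → ∞:  S → 1 / 0.2 = 5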
Performance Tips Compute intensive and/or large data sets Work done should be at least 1,000s of cycles Do not be gratuitous in task creation Lightweight, but still requires object allocation, etc. Parallelize only outer loops where possible Unless N is insufficiently large to offer enough parallelism Prefer isolation & immutability over synchronization Synchronization == !Scalable Try to avoid shared data Have realistic expectations Amdahl’s Law Speedup will be fundamentally limited by the amount of sequential computation Gustafson’s Law But what if you add more data, thus increasing the parallelizable percentage of the application?
Parallel LINQ (PLINQ)
Enable LINQ developers to leverage parallel hardware
Fully supports all .NET Standard Query Operators
Abstracts away the hard work of using parallelism: partitions and merges data intelligently (classic data parallelism)
Minimal impact to the existing LINQ programming model: AsParallel extension method; optional preservation of input ordering (AsOrdered)
Query syntax enables the runtime to auto-parallelize – an automatic way to generate more tasks, like Parallel; graph analysis determines how to do it; very little synchronization internally: highly efficient
var q = from p in people
        where p.Name == queryInfo.Name && p.State == queryInfo.State &&
              p.Year >= yearStart && p.Year <= yearEnd
        orderby p.Year ascending
        select p;
The query fans out into Task 1 … Task N
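A self-contained sketch of the opt-in (mine: the Person data and literal filter values stand in for the slide’s people/queryInfo placeholders): adding AsParallel, and optionally AsOrdered, to the source is the only change to the query.

    using System;
    using System.Linq;

    class PlinqDemo
    {
        class Person { public string Name, State; public int Year; }

        static void Main()
        {
            Person[] people =
            {
                new Person { Name = "Ann", State = "WA", Year = 2001 },
                new Person { Name = "Ann", State = "WA", Year = 2005 },
                new Person { Name = "Bob", State = "OR", Year = 2003 },
            };

            // AsParallel opts the query into PLINQ; AsOrdered preserves input order across partitions.
            var q = from p in people.AsParallel().AsOrdered()
                    where p.Name == "Ann" && p.State == "WA" && p.Year >= 2000 && p.Year <= 2009
                    orderby p.Year ascending
                    select p;

            foreach (var p in q) Console.WriteLine("{0} {1}", p.Name, p.Year);
        }
    }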
PLINQ Bart J.F. De Smet Software Development Engineer Microsoft Corporation
Coordination Data Structures New synchronization primitives (System.Threading) Barrier – multi-phased algorithms; tasks signal and wait for phases CountdownEvent – has an initial counter value; gets signaled when the count reaches zero LazyInitializer – lazy initialization routines; a reference-type variable gets initialized lazily SemaphoreSlim – slim brother of Semaphore (which goes kernel mode) SpinLock, SpinWait – loop-based waiting (“spinning”); avoids a context switch or kernel-mode transition
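An illustrative sketch (not from the deck) of one of these primitives: a CountdownEvent initialized to the number of workers, signaled by each worker, and waited on by the coordinator.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class CdsDemo
    {
        static void Main()
        {
            const int workers = 4;
            using (var done = new CountdownEvent(workers))    // initial count = number of workers
            {
                for (int i = 0; i < workers; i++)
                {
                    int id = i;
                    Task.Factory.StartNew(() =>
                    {
                        Console.WriteLine("worker {0} finished", id);
                        done.Signal();                        // decrement the count
                    });
                }
                done.Wait();                                  // unblocks when the count reaches zero
            }
            Console.WriteLine("all workers done");
        }
    }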
Coordination Data Structures Concurrent collections (System.Collections.Concurrent) BlockingCollection<T> – producer/consumer scenarios; blocks when no data is available (consumer) and when no space is available (producer) ConcurrentBag<T> ConcurrentDictionary<TKey, TValue> ConcurrentQueue<T>, ConcurrentStack<T> – thread-safe and scalable collections, as lock-free as possible Partitioner – facilities to partition data in chunks, e.g. for PLINQ partitioning problems
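A minimal producer/consumer sketch (mine, not from the deck) using BlockingCollection<T>: the producer blocks when the bounded buffer is full, the consumer blocks when it is empty, and CompleteAdding ends the consuming loop.

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    class ProducerConsumer
    {
        static void Main()
        {
            using (var buffer = new BlockingCollection<int>(2))   // bounded capacity of 2
            {
                var producer = Task.Factory.StartNew(() =>
                {
                    for (int i = 0; i < 10; i++) buffer.Add(i);    // blocks when the buffer is full
                    buffer.CompleteAdding();                       // signal: no more items
                });

                foreach (int item in buffer.GetConsumingEnumerable())
                    Console.WriteLine(item);                       // blocks when the buffer is empty

                producer.Wait();
            }
        }
    }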
Coordination Data Structures Bart J.F. De Smet Software Development Engineer Microsoft Corporation
Asynchronous workflows in F# Language feature unique to F# Based on the theory of monads, but much more exhaustive compared to LINQ… Overloadable meaning for specific keywords Continuation passing style: not ‘a -> ‘b but ‘a -> (‘b -> unit) -> unit – in C# terms roughly Action<T, Action<R>>, where the extra function takes the computation result Core concept: async { /* code */ } Syntactic sugar for keywords inside the block, e.g. let!, do!, use!
Asynchronous workflows in F#
let processAsync i = async {
    use stream = File.OpenRead(sprintf "Image%d.tmp" i)
    let! pixels = stream.AsyncRead(numPixels)
    let pixels' = transform pixels i
    use out = File.OpenWrite(sprintf "Image%d.done" i)
    do! out.AsyncWrite(pixels') }

let processAsyncDemo =
    printfn "async demo..."
    let tasks = [ for i in 1 .. numImages -> processAsync i ]
    Async.RunSynchronously (Async.Parallel tasks) |> ignore
    printfn "Done!"

Run tasks in parallel
Callout: let! desugars to continuation passing – roughly stream.Read(numPixels, fun pixels -> let pixels' = transform pixels i in use out = File.OpenWrite(sprintf "Image%d.done" i) in out.AsyncWrite(pixels'))
Asynchronous workflows in F# Bart J.F. De Smet Software Development Engineer Microsoft Corporation
Reactive Fx First-class events in .NET IObservable<T> is the dual of the IEnumerable<T> interface Pull versus push Pull (active): IEnumerable and foreach Push (passive): raise events and event handlers Events based on functions Composition at its best Definition of operators: LINQ to Events Realization of the continuation monad
IObservable<T> and IObserver<T>
// Dual of IEnumerable<T>
public interface IObservable<out T>            // covariance
{
    // Returns a way to unsubscribe
    IDisposable Subscribe(IObserver<T> observer);
}

// Dual of IEnumerator<T>
public interface IObserver<in T>               // contravariance
{
    // IEnumerator.MoveNext return value – signals the last event
    void OnCompleted();
    // IEnumerator.MoveNext exceptional return
    void OnError(Exception error);
    // IEnumerator<T>.Current property
    void OnNext(T value);
}
Virtually two return types (completion and error)
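A minimal push-based sketch (mine, using only the two interfaces above rather than the Rx library surface): a source that pushes values at a subscriber and then signals completion.

    using System;

    // A source that pushes three values and then completes.
    class Numbers : IObservable<int>
    {
        public IDisposable Subscribe(IObserver<int> observer)
        {
            for (int i = 1; i <= 3; i++) observer.OnNext(i);  // push values at the observer
            observer.OnCompleted();                           // signal the last event
            return new Unsubscriber();
        }

        class Unsubscriber : IDisposable { public void Dispose() { /* nothing to tear down */ } }
    }

    class Printer : IObserver<int>
    {
        public void OnNext(int value) { Console.WriteLine(value); }
        public void OnError(Exception error) { Console.WriteLine("error: " + error.Message); }
        public void OnCompleted() { Console.WriteLine("done"); }
    }

    class Program
    {
        static void Main()
        {
            using (new Numbers().Subscribe(new Printer())) { } // prints 1, 2, 3, done
        }
    }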
ReactiveFx Bart J.F. De Smet Software Development Engineer Microsoft Corporation Visit channel9.msdn.com for info
Agenda The concurrency landscape Language headaches .NET 4.0 facilities Task Parallel Library PLINQ Coordination Data Structures Asynchronous programming Incubation projects Summary
Axum – DevLabs project (previously “Maestro”) Coordination between components “Disciplined sharing” Actor model Agents communicate via messages Channels to exchange data via ports Language features (based on C#) Declarative data pipelines and protocols Side-effect-free functions Asynchronous methods Isolated methods Also suitable in a distributed setting
Channels for message exchange
agent Program : channel Microsoft.Axum.Application
{
    public Program()
    {
        string[] args = receive(PrimaryChannel::CommandLine);
        PrimaryChannel::ExitCode <-- 0;
    }
}
Agents and channels
channel Adder
{
    input int Num1;
    input int Num2;
    output int Sum;
}

agent AdderAgent : channel Adder
{
    public AdderAgent()
    {
        int result = receive(PrimaryChannel::Num1) + receive(PrimaryChannel::Num2);
        PrimaryChannel::Sum <-- result;
    }
}
Send / receive primitives
Protocols
channel Adder
{
    input int Num1;
    input int Num2;
    output int Sum;

    Start:   { Num1 -> GotNum1; }
    GotNum1: { Num2 -> GotNum2; }
    GotNum2: { Sum  -> End; }
}
State transition diagram
Use of pipelines
agent MainAgent : channel Microsoft.Axum.Application
{
    function int Fibonacci(int n)
    {
        if (n <= 1) return n;
        return Fibonacci(n - 1) + Fibonacci(n - 2);
    }

    int c = 10;

    void ProcessResult(int n)
    {
        Console.WriteLine(n);
        if (--c == 0) PrimaryChannel::ExitCode <-- 0;
    }

    public MainAgent()
    {
        var nums = new OrderedInteractionPoint<int>();
        nums ==> Fibonacci ==> ProcessResult;
        for (int i = 0; i < c; i++)
            nums <-- i;
    }
}
Description of data flow – Fibonacci is a side-effect-free mathematical function
Domains
domain Chatroom
{
    private string m_Topic;
    private int m_UserCount;

    reader agent User : channel UserCommunication
    {
        //...
    }

    writer agent Administrator : channel AdminCommunication
    {
        //...
    }
}
Unit of sharing between agents
Asynchronous methods
private asynchronous void ReadFile(string path)
{
    Stream stream = new Stream(...);
    int numRead = stream.Read(...);
    while (numRead > 0)
    {
        ...
        numRead = stream.Read(...);
    }
}
Blocking operations inside
Axum in a nutshell Bart J.F. De Smet Software Development Engineer Microsoft Corporation
STM.NET – another DevLabs project Cutting edge, released 7/28 Specialized fork from .NET 4.0 Beta 1 CLR modifications required First-class transactions on memory As an alternative to locking “Optimistic” concurrency methodology Make modifications Roll back changes on conflict Core concept: atomic { /* code */ }
Transactional memory
Problems with locks: potential for deadlocks… …and more ugliness; granularity matters a lot; locks don’t compose well
Subtle difference:
atomic { m_x++; m_y--; throw new MyException(); }
versus
lock (GlobalStmLock) { m_x++; m_y--; throw new MyException(); }
Bank account sample
public static void Transfer(BankAccount from, BankAccount backup, BankAccount to, int amount)
{
    Atomic.Do(() =>
    {
        // Be optimistic, credit the beneficiary first
        to.ModifyBalance(amount);

        // Find the appropriate funds in source accounts
        try
        {
            from.ModifyBalance(-amount);
        }
        catch (OverdraftException)
        {
            backup.ModifyBalance(-amount);
        }
    });
}
Atomic cell update
public class SingleCellQueue<T> where T : class
{
    T m_item;

    public T Get()
    {
        atomic
        {
            T temp = m_item;
            if (temp == null) retry;
            m_item = null;
            return temp;
        }
    }

    public void Put(T item)
    {
        atomic
        {
            if (m_item != null) retry;
            m_item = item;
        }
    }
}
Don’t forget: retry makes the transaction wait and re-execute when the cell’s state changes
The hard truth about STM Great features ACID Optimistic concurrency Transparent rollback and re-execute System.Transactions (LTM) and DTC support Implementation Instrumentation of shared state access JIT compiler modification No hardware support currently Result: 2x to 7x serial slowdown (in alpha prototype) But improved parallel scalability
STM.NET Bart J.F. De Smet Software Development Engineer Microsoft Corporation Visit msdn.microsoft.com/devlabs
DryadLINQ Dryad Infrastructure for cluster computation Concept of job DryadLINQ LINQ over Dryad Decomposition of query Distribution over computation nodes Roughly similar to PLINQ A la “map-reduce” Declarative approach works
DryadLINQ = LINQ + Dryad
Data flow: data collection → C# query → query plan (Dryad job) → vertex code on cluster nodes → results
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key);

var results = from c in collection
              where IsLegal(c.key)
              select new { Hash(c.key), c.value };
DryadLINQ Bart J.F. De Smet Software Development Engineer Microsoft Corporation Visit research.microsoft.com/dryad
Agenda The concurrency landscape Language headaches .NET 4.0 facilities Task Parallel Library PLINQ Coordination Data Structures Asynchronous programming Incubation projects Summary
Parallel programming requires thinking Avoid side-effects Prefer immutability Act 1 = Library approach in .NET 4.0 Task Parallel Library Parallel LINQ Coordination Data Structures Asynchronous patterns (+ a bit of language sugar) Act 2 = Different approaches are lurking Software Transactional Memory Purification of languages
© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.