Task and Data Parallelism: Real-World Examples Sasha Goldshtein | SELA Group

Why This Talk?
- Multicore machines have been a cheap commodity for >10 years
- Adoption of concurrent programming is still slow
- Patterns and best practices are scarce
- We discuss the APIs first, and then turn to examples, best practices, and tips

Task Parallel Library Evolution
- Incubated for 3 years as "Parallel Extensions for .NET"
- Released in full glory with .NET 4.0
- 2012: Augmented with language support (await, async methods); DataFlow in .NET 4.5 (NuGet)
- The Future: …

Tasks
- A task is a unit of work
- May be executed in parallel with other tasks by a scheduler (e.g., the thread pool)
- Much more than threads, and yet much cheaper

    var t = Task.Factory.StartNew(() => DnaSimulation(…));
    t.ContinueWith(r => Show(r.Exception),
        TaskContinuationOptions.OnlyOnFaulted);
    t.ContinueWith(r => Show(r.Result),
        TaskContinuationOptions.OnlyOnRanToCompletion);
    DisplayProgress();

    try //The C# 5.0 version
    {
        var task = Task.Run(DnaSimulation);
        DisplayProgress();
        Show(await task);
    }
    catch (Exception ex)
    {
        Show(ex);
    }
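Tasks also compose. A minimal sketch, assuming a hypothetical inputs collection and a DnaSimulation overload that takes one input, fanning out the work and awaiting all results with Task.WhenAll:

    var tasks = inputs.Select(input => Task.Run(() => DnaSimulation(input)));
    var results = await Task.WhenAll(tasks); //array of individual results
    foreach (var result in results)
        Show(result);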

Parallel Loops
- Ideal for parallelizing work over a collection of data
- Easy porting of for and foreach loops
- Beware of inter-iteration dependencies!

    Parallel.For(0, 100, i => { ... });
    Parallel.ForEach(urls, url => { webClient.Post(url, options, data); });

Parallel LINQ
- Mind-bogglingly easy parallelization of LINQ queries
- Can introduce ordering into the pipeline, or preserve order of original elements

    var query = from monster in monsters.AsParallel()
                where monster.IsAttacking
                let newMonster = SimulateMovement(monster)
                orderby newMonster.XP
                select newMonster;
    query.ForAll(monster => Move(monster));
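Order preservation is opt-in. A small sketch, reusing the monsters collection from above, that keeps the source order of the surviving elements:

    var attackers = monsters.AsParallel()
                            .AsOrdered() //preserve original element order
                            .Where(m => m.IsAttacking)
                            .Select(m => SimulateMovement(m));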

Measuring Concurrency
- Parallelizing code is worthless if you can't measure the results
- Visual Studio Concurrency Visualizer to the rescue
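Even before reaching for the profiler, a crude wall-clock comparison catches gross regressions; a minimal sketch, assuming a hypothetical Work(i) method and item count n:

    var sw = System.Diagnostics.Stopwatch.StartNew();
    for (int i = 0; i < n; ++i) Work(i);              //sequential baseline
    Console.WriteLine("Sequential: {0}", sw.Elapsed);

    sw.Restart();
    Parallel.For(0, n, i => Work(i));                 //parallel version
    Console.WriteLine("Parallel:   {0}", sw.Elapsed);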

Recursive Parallelism Extraction
- Divide-and-conquer algorithms are often parallelized through the recursive call
- Need to be careful with the parallelization threshold and watch out for dependencies

    void FFT(float[] src, float[] dst, int n, int r, int s)
    {
        if (n == 1)
        {
            dst[r] = src[r];
        }
        else
        {
            FFT(src, dst, n/2, r,   s*2);
            FFT(src, dst, n/2, r+s, s*2);
            //Combine the two halves in O(n) time
        }
    }

    //Parallelized recursive calls:
    Parallel.Invoke(
        () => FFT(src, dst, n/2, r,   s*2),
        () => FFT(src, dst, n/2, r+s, s*2));

DEMO
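One common way to apply the parallelization threshold the slide warns about, sketched with an assumed cutoff constant: recurse in parallel only while the subproblem is large, and fall back to the sequential FFT below the threshold.

    const int ParallelThreshold = 4096; //assumed cutoff; tune by measurement

    void ParallelFFT(float[] src, float[] dst, int n, int r, int s)
    {
        if (n <= ParallelThreshold)
        {
            FFT(src, dst, n, r, s); //sequential version from the slide
            return;
        }
        Parallel.Invoke(
            () => ParallelFFT(src, dst, n/2, r,   s*2),
            () => ParallelFFT(src, dst, n/2, r+s, s*2));
        //Combine the two halves in O(n) time, as in the sequential version
    }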

Symmetric Data Processing
- For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution

    Parallel.For(0, image.Rows, i =>
    {
        for (int j = 0; j < image.Cols; ++j)
        {
            destImage.SetPixel(i, j, PixelBlur(image, i, j));
        }
    });

- Inter-iteration dependencies complicate things (think in-place blur)
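For the in-place blur, one standard fix, sketched here with the hypothetical image types and a passes count assumed for illustration: double-buffer, so every pass reads one buffer and writes the other, and no iteration observes another iteration's writes.

    var src = image; var dst = destImage;
    for (int pass = 0; pass < passes; ++pass)
    {
        var s = src; var d = dst;               //capture copies for the lambda
        Parallel.For(0, s.Rows, i =>
        {
            for (int j = 0; j < s.Cols; ++j)
                d.SetPixel(i, j, PixelBlur(s, i, j));
        });
        var tmp = src; src = dst; dst = tmp;    //swap buffers between passes
    }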

Uneven Work Distribution
- With non-uniform data items, use custom partitioning or manual distribution
- Checking primality: 7 is easier to check than 10,320,647

Not ideal (and complicated!):

    var work = Enumerable.Range(0, Environment.ProcessorCount)
        .Select(n => Task.Run(() => CountPrimes(start + chunk*n, start + chunk*(n+1))));
    Task.WaitAll(work.ToArray());

Faster (and simpler!):

    Parallel.ForEach(Partitioner.Create(Start, End, chunkSize),
        chunk => CountPrimes(chunk.Item1, chunk.Item2));

DEMO

Complex Dependency Management
- With dependent work units, parallelization is no longer automatic and simple
- Must extract all dependencies and incorporate them into the algorithm
- Typical scenarios: 1D loops, dynamic algorithms
- Levenshtein string edit distance: each task depends on two predecessors; computation proceeds as a wavefront from (0,0) to (m,n)

    C = x[i-1] == y[j-1] ? 0 : 1;
    D[i, j] = min(D[i-1, j] + 1,
                  D[i, j-1] + 1,
                  D[i-1, j-1] + C);

DEMO
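One way to realize the wavefront, sketched under the assumption of the D matrix and strings x, y above (with m = x.Length, n = y.Length): walk the anti-diagonals sequentially and compute the independent cells of each diagonal in parallel. In practice you would parallelize over blocks of cells rather than single cells to amortize the per-iteration overhead.

    for (int d = 2; d <= m + n; ++d)            //d = i + j indexes an anti-diagonal
    {
        int iMin = Math.Max(1, d - n), iMax = Math.Min(m, d - 1);
        Parallel.For(iMin, iMax + 1, i =>
        {
            int j = d - i;                      //all (i, j) on this diagonal are independent
            int c = x[i-1] == y[j-1] ? 0 : 1;
            D[i, j] = Math.Min(Math.Min(D[i-1, j] + 1, D[i, j-1] + 1),
                               D[i-1, j-1] + c);
        });
    }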

Synchronization => Aggregation
- Excessive synchronization brings parallel code to its knees
- Try to avoid shared state, or minimize access to it
- Aggregate thread- or task-local state and merge later

    Parallel.ForEach(
        Partitioner.Create(Start, End, ChunkSize),
        () => new List<int>(),              //initial local state
        (range, pls, localPrimes) =>        //aggregator
        {
            for (int i = range.Item1; i < range.Item2; ++i)
                if (IsPrime(i))
                    localPrimes.Add(i);
            return localPrimes;
        },
        localPrimes =>
        {
            lock (primes)                   //combiner
                primes.AddRange(localPrimes);
        });

DEMO
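For comparison, a sketch of the same prime aggregation expressed with PLINQ, which performs the thread-local aggregation internally; it assumes the IsPrime predicate from above but gives up the explicit chunking:

    var primes = ParallelEnumerable.Range(Start, End - Start)
                                   .Where(IsPrime)
                                   .ToList();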

Creative Synchronization
- We implement a collection of stock prices, initialized with 10^5 name/price pairs
- Access rates: 10^7 reads/s, 10^6 "update" writes/s, 10^3 "add" writes/day
- Many reader threads, many writer threads
- Store the data in two dictionaries: safe and unsafe

    GET(key):
        if safe contains key then return safe[key]
        lock { return unsafe[key] }

    PUT(key, value):
        if safe contains key then safe[key] = value
        else lock { unsafe[key] = value }
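A minimal C# sketch of this idea, with hypothetical names and float prices (a 32-bit float write is atomic, so the lock-free update path cannot produce torn values). The unlocked paths lean on the fact that assigning to an existing Dictionary key does not restructure the table; this is a deliberate, measured gamble to illustrate the slide, not a general-purpose pattern.

    using System.Collections.Generic;

    class StockPrices
    {
        //"safe": its key set never changes after initialization
        private readonly Dictionary<string, float> _safe;
        //"unsafe": holds the rare newly added keys; guarded by _lock
        private readonly Dictionary<string, float> _unsafe = new Dictionary<string, float>();
        private readonly object _lock = new object();

        public StockPrices(Dictionary<string, float> initialPrices)
        {
            _safe = initialPrices;
        }

        public float Get(string key)
        {
            float price;
            if (_safe.TryGetValue(key, out price)) return price; //no lock
            lock (_lock) return _unsafe[key];
        }

        public void Put(string key, float value)
        {
            if (_safe.ContainsKey(key)) { _safe[key] = value; return; } //no lock
            lock (_lock) _unsafe[key] = value;
        }
    }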

Lock-Free Patterns (1)
- Try to avoid Windows synchronization and use hardware synchronization
- Primitive operations such as Interlocked.Increment, Interlocked.Add, Interlocked.CompareExchange
- The retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms, e.g. ConcurrentQueue and ConcurrentStack

    int InterlockedMultiply(ref int x, int y)
    {
        int t, r;
        do
        {
            t = x;
            r = t * y;
            //CompareExchange(ref location, newValue, comparand)
        } while (Interlocked.CompareExchange(ref x, r, t) != t);
        return r;
    }

Lock-Free Patterns (2)
- User-mode spin locks (the SpinLock class) can replace locks that you acquire very often and that protect tiny computations

    //Sample implementation of SpinLock
    class __DontUseMe__SpinLock
    {
        private volatile int _lck;
        public void Enter()
        {
            while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0) ;
        }
        public void Exit()
        {
            Thread.MemoryBarrier();
            _lck = 0;
        }
    }

Miscellaneous Tips (1)
- Don't mix several concurrency frameworks in the same process
- Some parallel work is best organized in pipelines – TPL DataFlow (e.g., BroadcastBlock → TransformBlock → ActionBlock)
- Some parallel work can be offloaded to the GPU – C++ AMP

    void vadd_exp(float* x, float* y, float* z, int n)
    {
        array_view<const float, 1> avX(n, x), avY(n, y);
        array_view<float, 1> avZ(n, z);
        avZ.discard_data();
        parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp)
        {
            avZ[i] = avX[i] + fast_math::exp(avY[i]);
        });
        avZ.synchronize();
    }
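A minimal TPL DataFlow sketch of such a pipeline, wiring together the block types named on the slide; Parse, Save, and the Record type are hypothetical:

    using System.Threading.Tasks.Dataflow; //NuGet package in the .NET 4.5 era

    var broadcast = new BroadcastBlock<string>(s => s);
    var transform = new TransformBlock<string, Record>(s => Parse(s));
    var action    = new ActionBlock<Record>(r => Save(r));

    broadcast.LinkTo(transform, new DataflowLinkOptions { PropagateCompletion = true });
    transform.LinkTo(action,    new DataflowLinkOptions { PropagateCompletion = true });

    broadcast.Post("AAPL 523.4");   //items flow through the pipeline
    broadcast.Complete();
    action.Completion.Wait();       //wait for the pipeline to drain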

Miscellaneous Tips (2)
- Invest in SIMD parallelization of heavy math or data-parallel algorithms

    START:
        movups xmm0, [esi+4*ecx]
        addps  xmm0, [edi+4*ecx]
        movups [ebx+4*ecx], xmm0
        sub    ecx, 4
        jns    START

- Make sure to take cache effects into account, especially on multiprocessor systems (diagram: two caches hold the same cache line in Shared state; when one thread writes to it, the read-for-ownership (RFO) invalidates the line in the other cache)
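On the managed side, a comparable sketch using System.Numerics.Vector<float> (from the System.Numerics.Vectors package, which shipped after this talk) to vectorize the same float addition:

    using System.Numerics;

    static void VectorAdd(float[] x, float[] y, float[] z)
    {
        int i = 0, w = Vector<float>.Count;       //SIMD width, e.g. 4 or 8 floats
        for (; i <= x.Length - w; i += w)
        {
            var vx = new Vector<float>(x, i);
            var vy = new Vector<float>(y, i);
            (vx + vy).CopyTo(z, i);
        }
        for (; i < x.Length; ++i)                 //scalar remainder
            z[i] = x[i] + y[i];
    }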

Conclusions
- Avoid shared state and synchronization
- Parallelize judiciously and apply thresholds
- Measure and understand performance gains or losses
- Concurrency and parallelism are still hard
- A body of best practices, tips, patterns, and examples is being built:
  - Microsoft Parallel Computing Center
  - Concurrent Programming on Windows (Joe Duffy)
  - Parallel Programming with .NET

Thank You!
blog.sashag.net
Sasha Goldshtein | SELA Group

Two free Pro .NET Performance eBooks! Send an email with an answer to the following question: Which class can be used to generate markers in the Visual Studio Concurrency Visualizer reports?