Task and Data Parallelism: Real-World Examples
Sasha Goldshtein | SELA Group
Why This Talk?
- Multicore machines have been a cheap commodity for over 10 years
- Adoption of concurrent programming is still slow
- Patterns and best practices are scarce
- We discuss the APIs first, and then turn to examples, best practices, and tips
Task Parallel Library Evolution
- Incubated for 3 years as "Parallel Extensions for .NET"
- 2010: released in full glory with .NET 4.0
- 2012: augmented with language support (await, async methods) and DataFlow in .NET 4.5 (DataFlow ships via NuGet)
- The future: more to come
Tasks
- A task is a unit of work
- May be executed in parallel with other tasks by a scheduler (e.g. the thread pool)
- Much more than threads, and yet much cheaper

    Task t = Task.Factory.StartNew(() => DnaSimulation(…));
    t.ContinueWith(r => Show(r.Exception),
        TaskContinuationOptions.OnlyOnFaulted);
    t.ContinueWith(r => Show(r.Result),
        TaskContinuationOptions.OnlyOnRanToCompletion);
    DisplayProgress();

    //The C# 5.0 version
    try {
        var task = Task.Run(DnaSimulation);
        DisplayProgress();
        Show(await task);
    } catch (Exception ex) {
        Show(ex);
    }
Parallel Loops
- Ideal for parallelizing work over a collection of data
- Easy porting of for and foreach loops
- Beware of inter-iteration dependencies! (see the sketch below)

    Parallel.For(0, 100, i => { ... });

    Parallel.ForEach(urls, url =>
        webClient.Post(url, options, data));
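As a hedged illustration of why inter-iteration dependencies matter (the running-sum array below is a made-up example, not from the talk), a loop like this cannot be naively converted to Parallel.For:

    //Each iteration reads the previous iteration's result, so with
    //Parallel.For, running[i-1] may not be written yet when iteration i runs.
    var running = new int[data.Length];
    running[0] = data[0];
    for (int i = 1; i < data.Length; ++i)
        running[i] = running[i - 1] + data[i]; //dependency on iteration i-1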
Parallel LINQ
- Mind-bogglingly easy parallelization of LINQ queries
- Can introduce ordering into the pipeline, or preserve the order of the original elements

    var query = from monster in monsters.AsParallel()
                where monster.IsAttacking
                let newMonster = SimulateMovement(monster)
                orderby newMonster.XP
                select newMonster;
    query.ForAll(monster => Move(monster));
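For example, when results must come back in source order, AsOrdered can be added to the pipeline (a minimal sketch; the ProcessUrl helper is hypothetical):

    var results = urls.AsParallel()
                      .AsOrdered()                  //preserve the source sequence order
                      .Select(url => ProcessUrl(url))
                      .ToList();                    //results[i] corresponds to urls[i]

Note that AsOrdered constrains the merge step and can cost some throughput, so use it only when the consumer actually needs the original order.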
Measuring Concurrency
- Parallelizing code is worthless if you can't measure the results
- Visual Studio Concurrency Visualizer to the rescue
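Before reaching for the profiler, even a plain Stopwatch comparison of the sequential and parallel versions gives a first sanity check (a minimal sketch, not from the talk; Compute is a hypothetical work item and requires using System.Diagnostics):

    var sw = Stopwatch.StartNew();
    for (int i = 0; i < n; ++i) Compute(i);    //sequential baseline
    Console.WriteLine("Sequential: " + sw.Elapsed);

    sw.Restart();
    Parallel.For(0, n, i => Compute(i));       //parallel version
    Console.WriteLine("Parallel:   " + sw.Elapsed);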
Recursive Parallelism Extraction
- Divide-and-conquer algorithms are often parallelized through the recursive call
- Need to be careful with the parallelization threshold, and watch out for dependencies

    void FFT(float[] src, float[] dst, int n, int r, int s) {
        if (n == 1) {
            dst[r] = src[r];
            return;
        }
        if (n <= ParallelThreshold) { //don't parallelize tiny subproblems
            FFT(src, dst, n/2, r, s*2);
            FFT(src, dst, n/2, r+s, s*2);
        } else {
            Parallel.Invoke(
                () => FFT(src, dst, n/2, r, s*2),
                () => FFT(src, dst, n/2, r+s, s*2));
        }
        //Combine the two halves in O(n) time
    }

DEMO
Symmetric Data Processing
- For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution

    Parallel.For(0, image.Rows, i => {
        for (int j = 0; j < image.Cols; ++j) {
            destImage.SetPixel(i, j, PixelBlur(image, i, j));
        }
    });

- Inter-iteration dependencies complicate things (think in-place blur)
Uneven Work Distribution
- With non-uniform data items, use custom partitioning or manual distribution
- Checking primality: 7 is easier to check than 10,320,647

Not ideal (and complicated!):

    var work = Enumerable.Range(0, Environment.ProcessorCount)
        .Select(n => Task.Run(() =>
            CountPrimes(start + chunk*n, start + chunk*(n+1))));
    Task.WaitAll(work.ToArray());

Faster (and simpler!):

    Parallel.ForEach(Partitioner.Create(Start, End, chunkSize),
        chunk => CountPrimes(chunk.Item1, chunk.Item2));

DEMO
Complex Dependency Management
- With dependent work units, parallelization is no longer automatic and simple
- Must extract all dependencies and incorporate them into the algorithm
- Typical scenarios: 1D loops, dynamic algorithms
- Levenshtein string edit distance: each task depends on two predecessors, and the computation proceeds as a wavefront from (0,0) to (m,n) (see the sketch below)

    C = x[i-1] == y[j-1] ? 0 : 1;
    D[i, j] = min(D[i-1, j] + 1,
                  D[i, j-1] + 1,
                  D[i-1, j-1] + C);

DEMO
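A hedged sketch of one way to parallelize this wavefront (not necessarily the demo's implementation): all cells on the same anti-diagonal are mutually independent, so each diagonal can be a parallel loop:

    //Assumes D[i, 0] = i and D[0, j] = j have been initialized.
    //All cells (i, j) with i + j == d are independent of each other.
    for (int d = 2; d <= m + n; ++d) {
        int iMin = Math.Max(1, d - n), iMax = Math.Min(m, d - 1);
        Parallel.For(iMin, iMax + 1, i => {
            int j = d - i;
            int c = x[i - 1] == y[j - 1] ? 0 : 1;
            D[i, j] = Math.Min(Math.Min(D[i - 1, j] + 1, D[i, j - 1] + 1),
                               D[i - 1, j - 1] + c);
        });
    }

In practice you would process blocks of cells rather than single cells per iteration, to amortize the per-diagonal loop overhead.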
Synchronization => Aggregation
- Excessive synchronization brings parallel code to its knees
- Try to avoid shared state, or minimize access to it
- Aggregate thread- or task-local state and merge it later

    Parallel.ForEach(
        Partitioner.Create(Start, End, ChunkSize),
        () => new List<int>(),              //initial local state
        (range, pls, localPrimes) => {      //aggregator
            for (int i = range.Item1; i < range.Item2; ++i)
                if (IsPrime(i)) localPrimes.Add(i);
            return localPrimes;
        },
        localPrimes => {                    //combiner
            lock (primes)
                primes.AddRange(localPrimes);
        });

DEMO
Creative Synchronization
- We implement a collection of stock prices, initialized with 10^5 name/price pairs
- Access rates: 10^7 reads/s, 10^6 "update" writes/s, 10^3 "add" writes/day
- Many reader threads, many writer threads
- Store the data in two dictionaries: safe and unsafe (see the sketch below)

    GET(key):
        if safe contains key then return safe[key]
        lock { return unsafe[key] }

    PUT(key, value):
        if safe contains key then safe[key] = value
        else lock { unsafe[key] = value }
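A minimal C# sketch of this scheme under stated assumptions (prices are doubles; the class and member names are mine, not the talk's). Reads and updates of known keys stay lock-free because the safe dictionary's structure never changes after initialization; only the rare "add" path takes a lock:

    class StockPrices {
        //Structure fixed at startup; updating an existing key's value does not
        //mutate the dictionary's internal structure. Note: value tears are
        //possible for types wider than the machine word (e.g. double on x86).
        private readonly Dictionary<string, double> _safe;
        private readonly Dictionary<string, double> _unsafe =
            new Dictionary<string, double>();

        public StockPrices(IDictionary<string, double> initial) {
            _safe = new Dictionary<string, double>(initial);
        }
        public double Get(string key) {
            double value;
            if (_safe.TryGetValue(key, out value)) return value;
            lock (_unsafe) return _unsafe[key];
        }
        public void Put(string key, double value) {
            if (_safe.ContainsKey(key)) { _safe[key] = value; return; }
            lock (_unsafe) _unsafe[key] = value;
        }
    }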
Lock-Free Patterns (1)
- Try to avoid Windows synchronization and use hardware synchronization instead
- Primitive operations such as Interlocked.Increment, Interlocked.Add, Interlocked.CompareExchange
- The retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms, e.g. ConcurrentQueue and ConcurrentStack

    int InterlockedMultiply(ref int x, int y) {
        int t, r;
        do {
            t = x;      //snapshot the current value
            r = t * y;  //compute the new value
        } //CompareExchange(ref location, new value, comparand)
        while (Interlocked.CompareExchange(ref x, r, t) != t);
        return r;
    }
Lock-Free Patterns (2)
- User-mode spinlocks (the SpinLock class) can replace locks that you acquire very often and that protect tiny computations

    //Sample implementation of SpinLock
    class __DontUseMe__SpinLock {
        private volatile int _lck;
        public void Enter() {
            while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0);
        }
        public void Exit() {
            Thread.MemoryBarrier();
            _lck = 0;
        }
    }
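In real code you would use the framework's System.Threading.SpinLock rather than rolling your own; a minimal usage sketch (the counter is a made-up example):

    private static SpinLock _spinLock = new SpinLock(enableThreadOwnerTracking: false);
    private static long _counter;

    static void Increment() {
        bool taken = false;
        try {
            _spinLock.Enter(ref taken);
            _counter++;                //keep the protected region tiny
        } finally {
            if (taken) _spinLock.Exit();
        }
    }

Note that SpinLock is a mutable struct: store it in a field (never a readonly one) and do not copy it.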
Miscellaneous Tips (1)
- Don't mix several concurrency frameworks in the same process
- Some parallel work is best organized in pipelines: TPL DataFlow (see the sketch below)
- Some parallel work can be offloaded to the GPU: C++ AMP

    void vadd_exp(float* x, float* y, float* z, int n) {
        array_view<const float> avX(n, x), avY(n, y);
        array_view<float> avZ(n, z);
        avZ.discard_data();
        parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp) {
            avZ[i] = avX[i] + fast_math::exp(avY[i]);
        });
        avZ.synchronize();
    }

(Slide diagram: a DataFlow pipeline of BroadcastBlock -> TransformBlock -> ActionBlock)
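A minimal TPL DataFlow pipeline sketch matching the slide's diagram (requires using System.Threading.Tasks.Dataflow; DownloadData and Consume are hypothetical helpers):

    var broadcast = new BroadcastBlock<string>(url => url);          //fans out each input
    var download  = new TransformBlock<string, byte[]>(
        url => DownloadData(url));
    var process   = new ActionBlock<byte[]>(data => Consume(data));

    var linkOpts = new DataflowLinkOptions { PropagateCompletion = true };
    broadcast.LinkTo(download, linkOpts);
    download.LinkTo(process, linkOpts);

    broadcast.Post("http://example.org");
    broadcast.Complete();          //completion propagates down the pipeline
    process.Completion.Wait();

PropagateCompletion lets a single Complete() call at the head drain and shut down the whole pipeline in order.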
Miscellaneous Tips (2)
- Invest in SIMD parallelization of heavy math or data-parallel algorithms

    START: movups xmm0, [esi+4*ecx]   ;load 4 floats
           addps  xmm0, [edi+4*ecx]   ;add 4 floats at once
           movups [ebx+4*ecx], xmm0   ;store 4 floats
           sub    ecx, 4
           jns    START

- Make sure to take cache effects into account, especially on MP systems (see the sketch below)

(Slide diagram: caches A and B each hold the same cache line in Shared state; when a thread writes CL[0] = 15, a read-for-ownership (RFO) invalidates the line in the other cache)
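As a hedged illustration of the cache effect in the diagram, two hot counters that share a cache line force exactly this invalidation traffic (false sharing); padding them to separate lines avoids it. A sketch assuming a 64-byte cache line (requires using System.Runtime.InteropServices):

    //Both fields live on the same cache line, so two threads incrementing
    //them in parallel ping-pong the line between cores.
    class SharedCounters {
        public long A;
        public long B;
    }

    //Placing each counter on its own 64-byte cache line removes the contention.
    [StructLayout(LayoutKind.Explicit, Size = 128)]
    class PaddedCounters {
        [FieldOffset(0)]  public long A;
        [FieldOffset(64)] public long B;
    }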
Conclusions
- Avoid shared state and synchronization
- Parallelize judiciously and apply thresholds
- Measure and understand performance gains or losses
- Concurrency and parallelism are still hard
- A body of best practices, tips, patterns, and examples is being built:
  - Microsoft Parallel Computing Center
  - Concurrent Programming on Windows (Joe Duffy)
  - Parallel Programming with .NET
Thank You!
Sasha Goldshtein | SELA Group
blog.sashag.net

Two free Pro .NET Performance eBooks! Send an email to [address] with an answer to the following question: Which class can be used to generate markers in the Visual Studio Concurrency Visualizer reports?