Presented by David Cravey 10/15/2011
About Me – David Cravey
- Started programming in 4th grade: learned BASIC on a V-Tech “Precomputer 1000”, then GW-BASIC, and eventually QuickBasic
- Got bored with BASIC in 8th grade, so moved to C++
- Software Development Manager at Vivicom
- President of the Houston C++ User Group (meets at Microsoft’s Houston office, 1st Thursday of each month, 7 PM)
- Microsoft Visual C++ MVP
Agenda
- Why C++?
- Concurrency Runtime
- Tasks
- PPL
- Agents
- GPGPU / AMP
- Resources
- Summary
The language of power!
Why C++
- C++ provides speed: down-to-the-metal performance!
- Access to the latest hardware and drivers (example: GPGPU)
- Multi-paradigm programming: procedural, object-oriented, generic programming
- High-level programming (i.e. strong abstractions): classes AND templates
- But still allows you to step down to the low level as needed!
- Portable code?
Modern C++: Clean, Safe, Fast
*Used with permission from Herb Sutter’s “Writing modern C++ code: how C++ has evolved over the years”
Automatic Memory Management
Never type “delete” again!
- unique_ptr
- shared_ptr
- weak_ptr
What’s Different: At a Glance

Then:
```cpp
circle* p = new circle( 42 );
vector<circle*> vw = load_shapes();

for( vector<circle*>::iterator i = vw.begin(); i != vw.end(); ++i ) {
    if( *i && **i == *p )
        cout << **i << " is a match\n";
}

for( vector<circle*>::iterator i = vw.begin(); i != vw.end(); ++i ) {
    delete *i;
}
delete p;
```
Then: raw T* and new, explicit “delete” required, manual for/while/do loops, not exception-safe (missing try/catch, __try/__finally).

Now:
```cpp
auto p = make_shared<circle>( 42 );
vector<shared_ptr<circle>> vw = load_shapes();

for_each( begin(vw), end(vw), [&]( shared_ptr<circle>& s ) {
    if( s && *s == *p )
        cout << *s << " is a match\n";
} );
```
Now: shared_ptr and make_shared, no need for “delete” (automatic lifetime management), exception-safe, std:: algorithms, [&] lambda functions, auto type deduction.

*Used with permission from Herb Sutter’s “Writing modern C++ code: how C++ has evolved over the years”
Because processors will keep getting more cores … but not very many more GHz!
Why Concurrency? You can deal with problems faster if you have more threads (or “light sabers”)!!! My HERO!
Why a Concurrency Runtime?
According to MSDN: “A runtime for concurrency provides uniformity and predictability to applications and application components that run simultaneously.”
(i.e. without a single concurrency runtime, the various libraries and routines end up “competing” for processor resources instead of “cooperating”.)
Without a Concurrency Runtime OUCH! Threads will compete for system resources and the program will run slower instead of faster!!!!
With a Concurrency Runtime Success! Threads will cooperate to make maximum use of system resources and the program will run faster!!!!
What does ConcRT Provide?
- Improved use of processing resources: cooperative task scheduling, cooperative blocking, work-stealing task queues
- Low-level building blocks: synchronization primitives, task schedulers, resource managers
- 2 high-level libraries: PPL (Parallel Patterns Library) and Agents (Asynchronous Agents Library)
- Concurrent container and message-passing libraries
ConcRT Architecture Diagram (diagram taken from MSDN)
ConcRT Tasks (MSDN)
- The basic building block for concurrency under ConcRT
- A Task is a unit of work that performs a specific job
- Tasks can be further broken down into more fine-grained tasks (fork and join on “child” tasks)
- Tasks are kind of like very lightweight threads: threads normally reserve 1 MB of memory for their stacks, and thread context switches eat processing time, reducing throughput
Work Stealing
- When a running task creates additional tasks, it adds them to the bottom of the queue for the current processor.
- If another processor does not have any tasks in its queue, it will steal a task from the top of another processor’s queue (the top of the queue is the least likely to still be in the other processor’s cache).
Synchronization Data Structures
- Concurrency::critical_section: cooperative mutual exclusion object (yields to other tasks instead of preempting)
- Concurrency::reader_writer_lock: only allows a single writer; allows multiple readers if there are no writers
- Concurrency::scoped_lock and Concurrency::scoped_read_lock: RAII locking for critical_section and reader_writer_lock
- Concurrency::event: allows tasks to signal each other that an event has occurred
Potential Concurrency
- Potential concurrency is the concurrency your application could have if the computer could utilize it.
- Tasks are lightweight, so they are “cheap” to create. This allows you to create many tasks to express the potential concurrency of your program.
- In other words: expressing the potential concurrency of your application future-proofs your application!
Parallel Patterns Library Overview
- Task parallelism (tasks and task groups): Concurrency::task_group, Concurrency::structured_task_group
- Parallel algorithms: Concurrency::parallel_for, Concurrency::parallel_for_each, Concurrency::parallel_invoke
- Parallel containers and objects: Concurrency::concurrent_vector, Concurrency::concurrent_queue, Concurrency::combinable
PPL Task Groups
- Tasks are grouped by the task group they are created within.
- Tasks are cancelled as a group. This is useful for operations such as search: once the item being searched for is found, all tasks that are still searching should be cancelled.
- Note that if a task group is cancelled while waiting on another task group to complete, the waiting task group will also be cancelled.
PPL Algorithms Today
- Concurrency::parallel_for: performs parallel tasks using iteration values (much like a normal for loop)
- Concurrency::parallel_for_each: performs parallel tasks for each item in an iterator range (much like std::for_each)
- Concurrency::parallel_invoke: executes a set of tasks in parallel
PPL algorithms do not return until all the tasks within them complete or are cancelled.
ConcRT Extras and Sample Pack
- Microsoft has released the ConcRT Extras and Sample Pack to give early access to new enhancements to ConcRT before the next version of VC++.
- The ConcRT Extras and Sample Pack can be downloaded at:
- These are template libraries, so you only need to include the header files.
- Microsoft has stated they encourage users not only to use the libraries, but also to modify them to learn more.
Upcoming PPL Algorithms
Currently available as part of the ConcRT Sample Pack:
- Concurrency::parallel_transform
- Concurrency::parallel_reduce
- Concurrency::parallel_sort
- Concurrency::parallel_buffered_sort
- Concurrency::parallel_radixsort
- Parallel partitioners
These have been announced to be part of vNext:
cing-the-ppl-agents-and-concrt-efforts-for-v-next.aspx
PPL Containers and Objects
- Concurrency::concurrent_vector: provides concurrency-safe random access, element access, iterator access/traversal, and append; does not provide deletion of elements
- Concurrency::concurrent_queue: provides concurrency-safe enqueue and dequeue operations
- Concurrency::combinable: reusable thread-local storage; allows associative operations to be combined at the end of a parallel_for, parallel_for_each, etc.
Upcoming PPL Containers
Currently available as part of the ConcRT Sample Pack:
- concurrent_unordered_map
- concurrent_unordered_multimap
- concurrent_unordered_set
- concurrent_unordered_multiset
Like the new algorithms, these new containers have been announced to be part of vNext:
cing-the-ppl-agents-and-concrt-efforts-for-v-next.aspx
When To Use PPL
- When you have reasonably large tasks that can be processed in parallel. This often requires changing your algorithm to be parallelizable (for example, by using combinable).
- It is easy to change your existing code to use PPL to accomplish: parallel sorts; parallel sums/counts/averages (use combinable); parallel map/reduce.
PPL Best Practices (from MSDN)
- Do not parallelize small loop bodies
- Express parallelism at the highest possible level
- Use parallel_invoke to solve divide-and-conquer problems
- Use cancellation or exception handling to break from a parallel loop
- Understand how cancellation and exception handling affect object destruction
- Do not block repeatedly in a parallel loop
- Do not perform blocking operations when you cancel parallel work
- Do not write to shared data in a parallel loop
- When possible, avoid false sharing
- Make sure that variables are valid throughout the lifetime of a task
Using the PPL to parallelize loops
Asynchronous Agents Overview
According to MSDN: “An asynchronous agent (or just agent) is an application component that works asynchronously with other agents to solve larger computing tasks.”
Example pipeline: Read File From Disk → Decrypt Input Data → Decompress Input Data → Process File Data → Compress Output Data → Encrypt Output Data → Transmit Output Data
Agent Message Passing Programming Model
- Message-passing based “life cycle” pattern
- Asynchronous message blocks: Concurrency::unbounded_buffer, Concurrency::overwrite_buffer, Concurrency::single_assignment
- Message-passing functions: Concurrency::send, Concurrency::asend, Concurrency::receive, Concurrency::try_receive
Agent Message Passing Diagram (diagram taken from MSDN)
When to use Asynchronous Agents
- When you have multiple processing steps that can work in parallel to process data as a pipeline (i.e. when you can arrange your code to work as an assembly line so that you can achieve parallelism)
- Examples: image processing; large calculations that build upon previous calculations
Programming the GPU using AMP
The Power of Heterogeneous Computing
- 146X: interactive visualization of volumetric white matter connectivity
- 36X: ionic placement for molecular dynamics simulation on GPU
- 19X: transcoding HD video stream to H.264
- Simulation in MATLAB using a .mex file CUDA function
- 100X: astrophysics N-body simulation
- 149X: financial simulation of LIBOR model with swaptions
- 47X: an M-script API for linear algebra operations on GPU
- 20X: ultrasound medical imaging for cancer diagnostics
- 24X: highly optimized object-oriented molecular dynamics
- 30X: Cmatch exact string matching to find similar proteins and gene sequences
*Used with permission from Daniel Moth’s “Taming GPU compute with C++ Accelerated Massive Parallelism”
CPUs vs GPUs today

CPU:
- Low memory bandwidth
- Higher power consumption
- Medium level of parallelism
- Deep execution pipelines
- Random accesses
- Supports general code
- Mainstream programming

GPU:
- High memory bandwidth
- Lower power consumption
- High level of parallelism
- Shallow execution pipelines
- Sequential accesses
- Supports data-parallel code
- Niche programming

(images source: AMD)
*Used with permission from Daniel Moth’s “Taming GPU compute with C++ Accelerated Massive Parallelism”
C++ AMP: Accelerated Massive Parallelism
- Best for data parallelism
- Brings GPGPU to the masses: write C++ code that runs on the GPU
- Available as part of the Visual Studio 2011 Developer Preview (when running VS11 on Win8 there is even GPGPU debugging!)
- Microsoft is submitting it as an open specification; several other compiler vendors have committed to implementing AMP.
Hello World: Array Addition

Serial version:
```cpp
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}
```

C++ AMP version (Developer Preview syntax):
```cpp
#include <amp.h>
using namespace concurrency;

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int, 1> a(n, pA);
    array_view<int, 1> b(n, pB);
    array_view<int, 1> sum(n, pC);

    parallel_for_each(
        sum.grid,
        [=](index<1> i) restrict(direct3d)
        {
            sum[i] = a[i] + b[i];
        }
    );
}
```
*Used with permission from Daniel Moth’s “Taming GPU compute with C++ Accelerated Massive Parallelism”
For your reference
General C++ Links
- Microsoft’s MSDN C++ Developer Center
- CPlusPlus.com (great site for quick reference to C++ and the STL)
- Visual Studio Team Blog
- Herb Sutter’s Blog (ISO C++ chairman and Microsoft software architect)
Parallel Programming in Native Code Blog
The best way to stay up to date; great tutorials and more:
- How to pick your parallel sort? parallel-sort.aspx
- concurrent_vector and concurrent_queue explained: and-concurrent-queue-explained.aspx
- Synchronization with the Concurrency Runtime (2 parts): with-the-concurrency-runtime.aspx
- Resource Management in the Concurrency Runtime (3 parts): management-in-the-concurrency-runtime-part-1.aspx
ConcRT Written Resources
- MSDN: Concurrency Runtime
- ConcRT Extras
- Parallel Programming with Microsoft Visual C++ (free book online; the print book and e-book are not free)
- Introducing the Visual C++ Concurrency Runtime (59-page hands-on lab)
- Parallel Programming in Native Code Blog
ConcRT Video Resources
- Don McCrady: Parallelism in C++ Using the Concurrency Runtime (Concurrency-Runtime)
- The Concurrency Runtime: Fine-Grained Parallelism for C++ (Grained-Parallelism-for-C)
- Parallel Programming for C++ Developers: Tasks and Continuations, 2 parts (Native-Code-Tasks-and-Continuations-Part-1-of-2)
- Native Parallelism with the Parallel Patterns Library (parallel-patterns-library)
AMP Resources
- Herb Sutter: Heterogeneous Computing and C++ AMP, on the future of computing (Heterogeneous-Computing-and-C-AMP)
- Taming GPU compute with C++ AMP
- Walkthrough: Debugging an AMP Application
- Daniel Moth’s Blog (AMP project manager)
Conclusions
- C++ is a modern language
- C++ is the language of choice to: maximize speed, minimize power consumption, target the latest hardware, and have full control of your application
- Native concurrency using C++: PPL, Agents, and AMP provide a powerful set of tools to enable you to unlock your potential concurrency!!!
- C++ is AMPed!!!
Please fill out an evaluation form before you leave!
If you would like a copy of this slide deck, please email me at
If you would like more information, please contact me, or better yet, come to one of the local C++ user groups:
- Houston C++ User Group (1st Thursday each month)
- University of Houston C++ User Group (Wednesday before the 1st Thursday each month)