OpenMP China MCP
Agenda
Motivation: The Need
The OpenMP Solution
OpenMP Features
OpenMP Implementation
Getting Started with OpenMP on KeyStone & KeyStone II

Speaker notes: We're all aware of the capabilities of our C66x-based multicore DSPs: eight cores of raw performance with a host of efficient peripherals. It is crucial that along with these hardware enhancements comes a software implementation that is optimized for multiple cores and efficiently utilizes the parallelism inherent in them. From a code developer's perspective, this translates to identifying a parallel processing model that eases this development. For a long time, waiting for the next processor would give developers the performance boost they needed; never a good strategy to begin with, and less viable now than ever.
OpenMP is shared-memory parallel programming: you tell the compiler where the parallelization is, and it works on any system with a single address space. The memory model is a pool of threads; each thread has its own private memory, accessible only to that thread, plus globally shared memory. Once a thread puts data in the shared memory pool, every thread can read or write it. The details are handled by the runtime system and the compiler. The master thread runs from start to finish; in a parallel region, worker threads are spawned and work is distributed among them. Once the parallel region has completed, there is an implicit synchronization.
Motivation: TI Multicore Perspective
Mission Critical
Test and Automation
Medical Imaging
High Performance Compute
Emerging Broadband
Multichannel & Next Generation Video – H.265, SVC, etc.

Speaker notes: Some of the applications that are ideal for the C667x family of devices are test & automation, medical imaging, smart grids, emerging broadband technologies, and high-performance computing.
Motivation: Migrate SW from Single to Multicore
Earlier, with single core:
A new, faster processor would give the desired performance boost
Faster execution speed was a result of better hardware
Minimal effort was required from software developers
Porting sequential code was straightforward
Now, with multicore:
A boost in performance is no longer only a function of the hardware
Developers need to master software techniques that leverage the inherent parallelism of a multicore device
Every semiconductor vendor has its own software solution
Many developers are new to multicore software development and have existing sequential code to port
Motivation: The Need
An efficient way to program multicore that is:
Easy to use and quick to implement
Scalable
Sequential-coder friendly
Portable and widely adopted
Agenda
Motivation: The Need
The OpenMP Solution
OpenMP Features
OpenMP Implementation
Getting Started with OpenMP on KeyStone & KeyStone II
The OpenMP Solution: What is OpenMP?
An API for writing multi-threaded applications
The API includes compiler directives and library routines
C/C++ and Fortran are supported in general; C/C++ is supported on TI devices
Standardizes the last 20 years of Shared-Memory Programming (SMP) practice

Speaker notes: There is a need for a standard, industry-agreed-upon application programming interface that is portable and easy to use. At TI we are not tied exclusively to any programming model or architecture; however, the focus for this talk is OpenMP and SMP.
The OpenMP Solution: How does OpenMP address the needs?
Easy to use and quick to implement
Minimal modification to source code
The compiler figures out the details
Scalable
Minimal or no code changes to add cores to the implementation
Sequential-coder friendly
Allows incremental parallelization vs. an all-or-nothing approach
Allows a unified code base for the sequential and parallel versions
Portable and widely adopted
Ideal for shared-memory parallel (SMP) architectures
An open, community-driven standard; the Architecture Review Board includes TI, Cray, Intel, NVIDIA, AMD, IBM, HP, Microsoft, and others
Agenda
Motivation: The Need
The OpenMP Solution
OpenMP Features
OpenMP Implementation
Getting Started with OpenMP on KeyStone & KeyStone II
Features: OpenMP Solution Stack
[Diagram: the OpenMP solution stack. User layer: end user and application. Programming layer (OpenMP API): compiler directives, OpenMP library, environment variables. System layer: runtime library and OS/system.]
Features: OpenMP API consists of…
Compiler directives and clauses: specify the instructions to execute in parallel and their distribution across cores
Example: #pragma omp construct [clause [clause] ...]
Library routines:
Execution environment routines – configure and monitor threads, processors, and the parallel environment
Example: void omp_set_num_threads(int)
Lock routines – synchronization with OpenMP locks
Example: void omp_set_lock(omp_lock_t *)
Timing routines – support a portable wall-clock timer
Example: double omp_get_wtime(void)
Environment variables: alter execution features of applications, such as the default number of threads, loop iteration scheduling, etc.
Example: OMP_NUM_THREADS
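To make the three categories concrete, here is a minimal sketch (not from the deck; the thread count and messages are illustrative) combining a directive with execution-environment and timing routines:

    #include <stdio.h>
    #include <omp.h>                      /* on TI devices: <ti/omp/omp.h> */

    int main(void)
    {
        omp_set_num_threads(4);           /* execution environment routine */
        double t0 = omp_get_wtime();      /* timing routine */

        #pragma omp parallel              /* compiler directive: fork a team */
        {
            printf("thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }

        printf("elapsed: %f s\n", omp_get_wtime() - t0);
        return 0;
    }

Setting the environment variable OMP_NUM_THREADS=4 before launching the program has the same effect as the omp_set_num_threads() call.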
Agenda
Motivation: The Need
The OpenMP Solution
OpenMP Features
OpenMP Implementation
Create Teams of Threads
Share Work among Threads
Manage Data-Scoping
Synchronize Threads and Variables
Getting Started with OpenMP on KeyStone & KeyStone II
Implementation: Use OpenMP to…
Create Teams of Threads
Fork-Join model: execute code in a parallel region
Implemented using the compiler directive #pragma omp parallel
Nesting 'parallel' directives is possible, allowing multilevel parallelism
[Diagram: fork-join model. A sequential region runs on the master thread (ID:0). When the parallel region starts, a team of threads (ID:0 through ID:3) is created automatically (fork) and the threads execute simultaneously. When the parallel region ends, execution waits until all threads terminate (join).]
Fork-Join Model: Begin with an initial thread in a non-parallel region. When a parallel region starts (noted by the parallel pragma), extra threads are automatically created. In the parallel region, several threads execute simultaneously. Once the parallel region ends, the program waits for all threads to terminate and then reverts to single-threaded execution. (Source: TopCoder)
Implementation: Parallel Construct

    #include <ti/omp/omp.h>     /* include header: API definitions */
    #include <stdio.h>

    void main()
    {
        omp_set_num_threads(4);              /* library function: set # of threads
                                                (typically the # of cores) */
        #pragma omp parallel                 /* compiler directive: fork a team of threads */
        {
            int tid = omp_get_thread_num();  /* library function: get thread id */
            printf("Hello World from thread = %d\n", tid);
        }                                    /* implicit barrier at the closing brace */
    }
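The four "Hello World" lines can appear in any order: thread scheduling is nondeterministic, and only the implicit barrier at the end of the parallel region guarantees that all threads have finished before the master thread continues alone.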
Agenda
Motivation: The Need
The OpenMP Solution
OpenMP Features
OpenMP Implementation
Create Teams of Threads
Share Work among Threads
Manage Data-Scoping
Synchronize Threads and Variables
Getting Started with OpenMP on KeyStone & KeyStone II
Implementation: Use OpenMP to…
Share Work among Threads
By default, each thread redundantly executes all code in the parallel region
The programmer can insert work-sharing constructs to express how the computation should be distributed
Example: distribute a for loop – #pragma omp for
Applicable only to loops whose iterations are independent, i.e. where changing the order of execution won't affect the result
Example: distribute multiple tasks – #pragma omp sections (with #pragma omp section marking each task)
Implementation: Work-sharing Constructs

Sequential code:

    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

With the parallel construct only (manually splitting the iteration space):

    #pragma omp parallel
    {
        int id, i, Nthrds, istart, iend;
        id     = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;
        iend   = (id + 1) * N / Nthrds;
        for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
    }

With parallel and work-sharing constructs:

    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

Source: Barbara Chapman, University of Houston (Reference #3)
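The comparison shows the appeal of the work-sharing construct: the runtime computes each thread's share of the iteration space, so the loop body stays identical to the sequential code instead of being rewritten around thread ids.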
Implementation: Work-sharing Constructs

    #pragma omp parallel
    #pragma omp sections
    {
        #pragma omp section
        x_calculation();
        #pragma omp section
        y_calculation();
        #pragma omp section
        z_calculation();
    }

By default, there is a barrier at the end of "omp sections". Use the "nowait" clause to turn off the barrier.
Source: Barbara Chapman, University of Houston (Reference #5)
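As a minimal sketch of the nowait variant mentioned above (x_calculation and y_calculation are the tasks from the slide; unrelated_per_thread_work is a hypothetical placeholder):

    #pragma omp parallel
    {
        #pragma omp sections nowait      /* no barrier at the end of the sections */
        {
            #pragma omp section
            x_calculation();
            #pragma omp section
            y_calculation();
        }
        unrelated_per_thread_work();     /* may start before all sections have finished */
    }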
Agenda
Motivation: The Need
The OpenMP Solution
OpenMP Features
OpenMP Implementation
Create Teams of Threads
Share Work among Threads
Manage Data-Scoping
Synchronize Threads and Variables
Getting Started with OpenMP on KeyStone & KeyStone II
Implementation: Use OpenMP to…
Manage Data-Scoping using Clauses
Control how variables should be treated in a parallel region
Clauses:
private clause – each thread has a private copy of this variable, with a unique value, throughout the parallel construct
A variable declared inside the parallel region is automatically private
Stored in the thread stack; the default stack size is set by the compiler but can be overridden
shared clause – the same copy of this variable is seen by all threads
A variable declared outside the parallel region is automatically shared (placed in MSMC or DDR3)
default clause – overrides the default scope assigned to any variable
Set it to none to explicitly specify the scope of all variables used inside the parallel region
It is the programmer's responsibility to declare which variables are shared / private
For some variables, such as loop iteration counters, the compiler enforces the scoping automatically
Implementation: Data-Scoping Clauses

    #pragma omp parallel for default(none) private(i, j, sum) shared(A, B, C) if (flag)
    for (i = 0; i < 10; i++) {
        sum = 0;
        for (j = 0; j < 20; j++)
            sum += B[i][j] * C[j];
        A[i] = sum;
    }

Source: Barbara Chapman, University of Houston
if clause: a way to disable the parallel construct at runtime, based on the value of flag or whatever conditions are important.
Agenda
Motivation: The Need
The OpenMP Solution
OpenMP Features
OpenMP Implementation
Create Teams of Threads
Share Work among Threads
Manage Data-Scoping
Synchronize Threads and Variables
Getting Started with OpenMP on KeyStone & KeyStone II
Implementation: Use OpenMP to…
Synchronize Threads
Synchronization at the end of a work-sharing or parallel construct is automatic; synchronizing a subset of threads has to be handled manually.
Some synchronization directives:
#pragma omp critical <name>
Only one thread may enter at a time; applies to a block of code
Unnamed critical sections all share one lock, so a thread inside any of them blocks other threads from entering every unnamed critical section
#pragma omp atomic
Hardware provides an atomic operation for the expression
Applies to a single line of code (an expression like x += 5)
Less overhead, but less portable and limited to specific operations
#pragma omp barrier
Each thread waits until all threads arrive
#pragma omp flush [optional list]
Creates a sequence point for a consistent view of memory
Implicit barriers automatically ensure cache coherency
Barrier synchronization: no thread can progress until all other threads in the team have reached that point in the program

Speaker notes: By default, code executes redundantly across processors; the programmer needs to state how to distribute work among threads. Independent iterations means the order in which iterations occur has no bearing on the outcome.
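A minimal sketch combining critical and barrier (the list-handling helpers are hypothetical placeholders, not from the deck):

    #pragma omp parallel
    {
        #pragma omp critical (list_update)              /* one thread at a time in this block */
        {
            add_to_shared_list(omp_get_thread_num());   /* hypothetical helper */
        }

        #pragma omp barrier      /* wait until every thread has added its entry */

        process_shared_list();   /* hypothetical helper; the list is now complete */
    }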
Implementation: Synchronization Constructs

    int sum = 0, i;
    int A[100];
    populate(A);                 /* fill the array */
    #pragma omp parallel for shared(sum, A)
    for (i = 0; i < 100; i++) {
        #pragma omp atomic
        sum += A[i];
    }

Source: TopCoder article
Implementation: Reduction Construct

    int sum = 0, i;
    int A[100];
    populate(A);                 /* fill the array */
    #pragma omp parallel for shared(A) reduction(+:sum)
    for (i = 0; i < 100; i++) {
        sum += A[i];
    }

Source: TopCoder article
reduction creates a private copy of the shared variable for each thread. At the end of the parallel loop, the private copies are 'reduced' back into the original shared variable by applying the operator ('+').
Agenda
Motivation: The Need
The OpenMP Solution
OpenMP Features
OpenMP Implementation
Getting Started with OpenMP on KeyStone & KeyStone II
OpenMP on KeyStone: Solution Stack
Depends on SYS/BIOS; each core runs the RTOS.
OpenMP master and worker threads execute inside dedicated SYS/BIOS tasks.
IPC is used for communication and synchronization.
OpenMP run-time state and user data are allocated in shared memory.
Source: "OpenMP Runtime Implementation using Multi-core Navigator," Yogesh Siraswar, Texas Instruments
OpenMP on KeyStone: Availability
TI's OpenMP package 'OMP' is installed as part of the MCSDK installation, along with the OpenMP programming layer and runtime and the CodeGen 7.4.x compiler.
MCSDK v is the latest version available, and includes OMP v
Download MCSDK at
Once MCSDK is installed, install the update patch for OMP v from
OpenMP on KeyStone: Recent Updates
It is important to update to the latest OMP version, which includes:
Support for TI's C6657 device, and the nested parallelism feature
A significant performance boost over previous versions: inter-core synchronization is achieved through the Multicore Navigator instead of hardware semaphores, along with other optimizations
BIOS/IPC-related memory for each core moved to MSMC, making L2 entirely available to the developer
TIP: Assign OpenMP variable heap & .data memory to MSMC/DDR, not L2, to ensure variable sharing across cores
Refer to the OMP release notes and user guide for more details on improvements, bug fixes, and configuration tips related to this release
OpenMP on KeyStone II
On 66AK2H, OpenMP is available on the ARM cores using gcc and the corresponding runtime, libgomp.
OpenMP on the DSPs is available via the following programming models:
OpenMP Dispatch with OpenCL – a TI-specific extension to OpenCL that allows OpenCL kernels to dispatch OpenMP regions
OpenMP Accelerator Model – a subset of the OpenMP 4.0 specification that enables execution on heterogeneous devices
Source: "OpenMP Runtime Implementation using Multi-core Navigator," Yogesh Siraswar, Texas Instruments
OpenMP Dispatch With OpenCL on DSP
Depends on the TI OpenCL extension
The OpenCL kernel acts as a wrapper that invokes C functions containing OpenMP regions
Three components:
The host program
The OpenCL kernel
The OpenMP region
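A hedged sketch of the last two components (function names are hypothetical, and the host program that builds and enqueues the kernel through the TI OpenCL runtime is omitted):

    /* omp_region.c — the OpenMP region: plain C compiled for the DSP */
    void vec_scale_omp(int *a, int n)              /* hypothetical name */
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] *= 2;
    }

    /* kernel.cl — the OpenCL kernel: a thin wrapper dispatching the region */
    void vec_scale_omp(global int *a, int n);      /* prototype of the C function */
    kernel void vec_scale_wrapper(global int *a, int n)
    {
        vec_scale_omp(a, n);
    }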
OpenMP Accelerator Model on DSP
The OpenMP Accelerator Model is a subset of the OpenMP 4.0 specification.
The host is a quad-core ARM Cortex-A15 cluster running SMP Linux.
The target accelerator is a single cluster consisting of 8 C66x DSP cores.
The OpenMP Accelerator Model host runtime implementation uses the TI OpenCL runtime as a back-end.
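As an illustration, a minimal sketch using the standard OpenMP 4.0 target and map constructs that this model subsets (the function and array names are illustrative):

    /* Runs on the ARM host; the region inside 'target' is offloaded to the DSP cluster */
    void vec_double(int *a, int *b, int n)
    {
        #pragma omp target map(to: a[0:n]) map(from: b[0:n])
        {
            #pragma omp parallel for        /* iterations spread across the C66x cores */
            for (int i = 0; i < n; i++)
                b[i] = 2 * a[i];
        }
    }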
OpenMP on KeyStone: Spawning Threads
An event queue per core is used for task assignments.
The scheduler keeps track of the number of threads per core so it can distribute threads evenly across the cores.
[Diagram: the scheduler pushes Create_Thread / Create_Task events onto per-core event queues (Event Queue 0 through Event Queue N); each core (Core0 through CoreN) pops work from its own queue.]
Source: "OpenMP Runtime Implementation using Multicore Navigator," Yogesh Siraswar
OpenMP on KeyStone: CCS Demo
We will see how to:
Access example OpenMP projects from CCS v5.1
Include the OpenMP header file: #include <ti/omp/omp.h>
Specify the number of cores in the project configuration (.cfg): OpenMP.setNumProcessors(4);
Provide the --omp compiler option, available as a check box in the CCSv5.1 project settings: Build > C6000 Compiler > Advanced Options > Advanced Optimizations > Enable Support for OpenMP 3.0
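A minimal sketch of the .cfg step, assuming the ti.omp.utils.OpenMP module path used by the OMP package examples (verify the exact path against your OMP release notes):

    /* app.cfg — XDC configuration script */
    var OpenMP = xdc.useModule('ti.omp.utils.OpenMP');
    OpenMP.setNumProcessors(4);    /* number of cores participating in the OpenMP runtime */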
References
OpenMP on C6000 wiki article
MCSDK User Guide
OpenMP Runtime for SYS/BIOS User's Guide (included in the OMP/docs folder once the OMP package is installed)
OpenMP Specification
Using OpenMP, B. Chapman, G. Jost, R. Van Der Pas
Introduction to OpenMP
MCSDK-HPC 3.0 Getting Started