C66x KeyStone Training OpenMP: An Overview

Agenda
- Motivation: The Need
- The OpenMP Solution
- OpenMP Features
- OpenMP Implementation
- Getting Started with OpenMP on 6678

Motivation: TI Multicore Perspective
[Figure: application areas for TI multicore devices]
- Test and Automation
- Mission Critical
- Medical Imaging
- Emerging
- Emerging Broadband
- Multichannel & Next Generation Video (H.265, SVC, etc.)
- High Performance Compute

Motivation: Migrate SW from Single to Multicore
Earlier with Single Core
- A new, faster processor would give the desired performance boost
- Faster execution speed was a result of better hardware
- Minimal effort from software developers
- Porting sequential code was straightforward
Now with Multicore
- Boost in performance is not only a function of hardware
- Need to master software techniques that leverage the inherent parallelism of a multicore device
- Every semiconductor vendor has its own software solution
- Many developers are new to multicore software development and have existing sequential code to port

Motivation: The Need
An efficient way to program multicore that is:
- Easy to use and quick to implement
- Scalable
- Sequential-coder friendly
- Portable and widely adopted

The OpenMP Solution
What is OpenMP?
- An API for writing multi-threaded applications
- API includes compiler directives and library routines
- C, C++, and Fortran support
- Standardizes last 20 years of Shared-Memory Programming (SMP) practice

The OpenMP Solution
How does OpenMP address the needs?
- Easy to use and quick to implement
  - Minimal modification to source code
  - Compiler figures out details
- Scalable
  - Minimal or no code changes to add cores to implementation
- Sequential-coder friendly
  - Allows incremental parallelization versus an all-or-nothing approach
  - Allows a unified code base for sequential and parallel versions
- Portable and widely adopted
  - Ideal for shared-memory parallel (SMP) architectures
  - Open-source and community-driven effort
  - Architecture Review Board includes: TI, Cray, Intel, NVidia, AMD, IBM, HP, Microsoft and others

Features: OpenMP Solution Stack
- User layer: End User, Application
- Prog. layer (OpenMP API): Directives/Compiler, OpenMP library, Environment variables
- System layer: Runtime library, OS/system

Features: OpenMP API Consists of…
- Compiler Directives and Clauses:
  - Specify which instructions execute in parallel and how they are distributed across cores
  - Example: #pragma omp construct [clause [clause] ...]
- Library Routines:
  - Execution Environment Routines
    - Configure and monitor threads, processors, and the parallel environment
    - Example: void omp_set_num_threads(int)
  - Lock Routines
    - Synchronization with OpenMP locks
    - Example: void omp_set_lock(omp_lock_t *)
  - Timing Routines
    - Support a portable wall clock timer
    - Example: double omp_get_wtime(void)
- Environment Variables:
  - Alter execution features of applications, such as the default number of threads, loop iteration scheduling, etc.
  - Example: OMP_NUM_THREADS

Agenda
- Motivation: The Need
- The OpenMP Solution
- OpenMP Features
- OpenMP Implementation
  - Create Teams of Threads
  - Share Work among Threads
  - Manage Data-Scoping
  - Synchronize Threads and Variables
- Getting Started with OpenMP on 6678

Implementation: Use OpenMP to…
- Create Teams of Threads
  - Fork-Join Model
  - Execute code in a parallel region
  - Implemented by using compiler directive #pragma omp parallel
  - Nesting 'parallel' directives is possible, allowing multilevel parallelism
[Figure: fork-join model. The master thread runs the sequential region. At the fork, where the parallel region starts, a team of threads (ID:0 to ID:3) is created automatically and the threads execute simultaneously. At the join, where the parallel region ends, execution waits until all threads terminate, then the master thread continues with the sequential region.]

Implementation: Parallel Construct

    #include <stdio.h>
    #include <omp.h>   /* API definitions; standard header name, a vendor package may use its own path */

    int main(void)
    {
        omp_set_num_threads(4);   /* library function: set # of threads (typically # of cores) */

        #pragma omp parallel      /* compiler directive: fork team of threads */
        {
            int tid = omp_get_thread_num();   /* library function: get thread id */
            printf("Hello World from thread = %d\n", tid);
        }                         /* implicit barrier */
        return 0;
    }

Implementation: Use OpenMP to…
- Share Work among Threads
  - By default, each thread redundantly executes all code in the parallel region
  - The programmer can insert work-sharing constructs to express how computation should be distributed
  - Example: Distribute a for loop
    - Applicable only to loops whose iterations are independent, i.e. changing the order of execution does not change the result
    - #pragma omp for
  - Example: Distribute multiple tasks
    - #pragma omp section

Implementation: Work-sharing Constructs

Sequential Code

    for (i = 0; i < N; i++) {
        a[i] = a[i] + b[i];
    }

Only with Parallel Construct

    #pragma omp parallel
    {
        int id, i, Nthrds, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;
        iend = (id + 1) * N / Nthrds;
        for (i = istart; i < iend; i++) {
            a[i] = a[i] + b[i];
        }
    }

Parallel and Work-sharing Constructs

    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < N; i++) {
        a[i] = a[i] + b[i];
    }

Source: Reference #3

Implementation: Work-sharing Constructs

    #pragma omp parallel
    #pragma omp sections
    {
        #pragma omp section
        x_calculation();
        #pragma omp section
        y_calculation();
        #pragma omp section
        z_calculation();
    }

By default, there is a barrier at the end of the "omp sections" construct. Use the "nowait" clause to turn off the barrier.
Source: Reference #5

Implementation: Use OpenMP to…
- Manage Data-scoping using Clauses
  - Control how variables should be treated in a parallel region
  - Clauses
    - private clause
      - Each thread has a private copy of this variable and a unique value throughout the parallel construct
      - A variable declared inside the parallel region is automatically private
      - Stored on the thread's stack; the default stack size is set by the compiler but can be overridden
    - shared clause
      - The same copy of this variable is seen by all threads
      - A variable declared outside the parallel region is automatically shared (part of MSMC or DDR3)
    - default clause
      - Overrides the default scope assigned to any variable
      - Set to none to require an explicit scope for every variable used inside the parallel region
  - It is the programmer's responsibility to declare which variables are shared and which are private
  - For some variables, such as loop iteration counters, the compiler enforces the scope automatically

Implementation: Data-Scoping Clauses

    #pragma omp parallel for default(none) private(i, j, sum) shared(A, B, C) if (flag)
    for (i = 0; i < 10; i++) {
        sum = 0;
        for (j = 0; j < 20; j++)
            sum += B[i][j] * C[j];
        A[i] = sum;
    }

Implementation: Use OpenMP to…
- Synchronize Threads
  - Synchronization at the end of a work-sharing or parallel construct is automatic
  - Synchronizing a subset of threads has to be handled manually
  - Some synchronization directives:
    - #pragma omp critical
      - Only one thread may enter at a time
      - Applies to a block of code
      - All unnamed critical sections are treated as one: while a thread is inside any unnamed critical section, no other thread can enter any of them
    - #pragma omp atomic
      - Hardware provides an atomic operation for the expression
      - Applies to a single line of code (an expression like X += 5)
      - Less overhead, but less portable and limited to specific operations
    - #pragma omp barrier
      - Each thread waits until all threads arrive
    - #pragma omp flush [optional list]
      - The user can create a sequence point for a consistent view of memory
      - Implicit barriers automatically ensure cache coherency

Implementation: Synchronization Constructs

    int sum = 0, i;
    int A[100];
    populate(A);   /* placeholder: fills A */

    #pragma omp parallel for shared(sum, A)
    for (i = 0; i < 100; i++) {
        #pragma omp atomic
        sum += A[i];
    }

Implementation: Reduction Construct

    int sum = 0, i;
    int A[100];
    populate(A);   /* placeholder: fills A */

    #pragma omp parallel for shared(A) reduction(+:sum)
    for (i = 0; i < 100; i++) {
        sum += A[i];
    }

- Reduction creates a private copy of the shared variable for each thread
- At the end of the parallel loop, the private copies are 'reduced' back into the original shared variable by applying the operator ('+')

OpenMP on 6678: Solution Stack
- Each core runs the SYS/BIOS RTOS
- OpenMP master and worker threads execute inside dedicated SYS/BIOS tasks
- IPC is used for communication and synchronization
- OpenMP run-time state and user data are allocated in shared memory
Source: Reference #3

OpenMP on 6678: Availability
- OpenMP Specification 3.0 support is available as part of the upcoming MCSDK 2.1
- Compiler support from version 7.4 or higher
- Currently available: MCSDK v2.1 with OMP 1.1
- MCSDK 2.1 includes an "OMP" package with the OpenMP programming layer and runtime, and the CodeGen 7.4.x compiler

OpenMP on 6678: CCS Demo
We will see how to:
- Access example OpenMP projects from CCS v5.1.1
- Include the OpenMP header file: #include
- Specify the number of cores in the project configuration (.cfg) file: OpenMP.setNumProcessors(4);
- Provide the --omp compiler option, available as a check box in the project settings in CCS v5.1.1: Build > C6000 Compiler > Advanced Options > Advanced Optimizations > Enable Support for OpenMP 3.0

OpenMP on 6678: Spawning Threads
- An event queue per core is used for task assignments
- The scheduler keeps track of the number of threads per core to distribute threads evenly across the cores
[Figure: Create_Thread requests reach the scheduler, which pushes them onto per-core event queues (Event Queue 0 to Event Queue N); each core (Core0 to CoreN) pops its own queue and runs Create_Task.]
Source: Reference #3

OpenMP on 6678: Creating a Parallel Region
A compiler extension translates the directives into calls to runtime library functions.

Source code:

    #pragma omp parallel
    {
        Structured Code;
    }

Compiler translation (pseudocode):

    {
        Setup data;
        GOMP_parallel_start(&subfunction, &data, num_threads);
        subfunction(&data);
        GOMP_parallel_end();
    }

    void subfunction(void *data)
    {
        use data;
        Structured Code;
    }

Source: Reference #3

References
1. Using OpenMP, B. Chapman, G. Jost, R. Van Der Pas. OpenMP-Programming-Engineering-Computation/dp/ /
2. Introduction to OpenMP
3. Multiple presentations: Eric Stotzer (TI), Barbara Chapman (UH), Yogesh Siraswar (TI)
4. OpenMP Specification
5. OpenMP Runtime for SYS/BIOS User's Guide. Included in the OMP/docs folder when you install MCSDK
- MCSDK 2.1 Addendum. Included in the MCSDK/docs folder when you install MCSDK
- TI internal MCSDK alpha download link: MCSDK/02_01_00_00/index_FDS.html

Backup Slides

OpenMP on 6678: Starting Points
- Code Composer Studio
- Latest TI MCSDK
  - Alpha-3 releases today, April 27
  - Externally available with the MCSDK v2.1 release, May 14
  - MCSDK will include an OMP package with the OpenMP programming layer and runtime, and the CodeGen 7.4.x compiler
We will now switch to the Code Composer Studio environment to explore what the OMP package contains and how to start building and running an OpenMP example on 6678.

Heterogeneous OpenMP
- Two orthogonal problems
  - OpenMP assumes a single shared address space (and currently has no support for a distributed address space)
  - OpenMP has no mechanism for allocating work to specific processors on an architecture, or for working across two different operating systems
- Key Points
  - Memory subsystems of GPUs
  - More than one accelerator
  - Each accelerator using its own separate memory
  - Need to extend OpenMP to support this
- Why not use OpenCL?
  - Because its abstraction is low-level: a better solution than CUDA, but not as high-level as OpenMP
- Current approaches under investigation
  - StarPU, Barcelona Super Computing, Multicore Association including MCAPI, MRAPI, etc.

TI & OpenMP: History
- Former TI-er Shreyas (CA) worked on the initial version of the package
  - Took LibGOMP (under the gcc compiler) and implemented it on top of SYS/BIOS
- TI-er Yogesh (MD) added a QMSS-based approach and continues to work on speeding up the runtime

Backup
- SMP (symmetric multiprocessing): the processing engines are identical and communicate through shared memory (the two criteria that OpenMP needs)
- Well-known programming model for symmetric multiprocessing: OpenMP
- Distributed memory systems: MPI
- OpenCL: a host talking to a general-purpose computing GPU
- At TI we do not tie you exclusively to any of them, but we focus here on symmetric multiprocessing

Features: OpenMP API consists of…
- Compiler directives and clauses:
  - Specify which instructions execute in parallel and how they are distributed across cores
  - Example:

        #pragma omp parallel for reduction(+:sum)
        for (i = 1; i <= n; i++) {
            sum = sum + a[i] * b[i];
        }

- Library routines:
  - Execution Environment Routines
    - Affect and monitor threads, processors, and the parallel environment
    - Example: void omp_set_num_threads(int)
  - Lock Routines
    - Synchronization with OpenMP locks
    - Example: void omp_set_lock(omp_lock_t *)
  - Timing Routines
    - Support a portable wall clock timer
    - Example: double omp_get_wtime(void)
- Environment variables:
  - Alter execution features of applications, such as the default number of threads, loop iteration scheduling, etc.
  - Example: OMP_NUM_THREADS

Motivation: Parallelism
- Instruction-level: Multiple functional units
- Architectural: Multiple processors, shared memory
- Multithreading: Interleaving instructions from multiple application threads
- Multicore: Replicate substantial parts of a processor's logic on a single chip
- Software techniques to leverage hardware parallelism

Matrix Multiply Example

    #pragma omp parallel for private(j, sum)
    for (i = 0; i < 10; i++) {
        sum = 0;
        for (j = 0; j < 20; j++)
            sum += B[i][j] * C[j];
        A[i] = sum;
    }

Implementation: Memory Model
- Cache coherency
  - Typically, consistency of shared variables in cacheable memory is handled automatically at synchronization points
  - However, the current TI implementation makes it the programmer's responsibility
  - Can use #pragma omp flush or Cache_wbInvAll() within

OpenMP on TI Multicore
- SYS/BIOS RTOS runs on each core and enables real-time multithreading
- Navigator module allows synchronization between threads using QMSS
- Inter-Processor Communication (IPC) module allows dynamic management of shared memory and communication