Dieter an Mey
Center for Computing and Communication, RWTH Aachen University, Germany
anMey@rz.rwth-aachen.de
V 3.0


Setting the scene
My perspective: HPC support in a technical university (engineering, natural sciences, …)
4- to 144-way compute servers
Thousands of jobs per day
8-16 processors per job on average
MPI – OpenMP – autoparallelization – hybrid

The Time Scale
Today: OpenMP runs on all big SMPs (Cray X, Fujitsu PrimePower, HP Superdome, IBM Regatta, NEC SX, SGI Altix, Sun Fire) => OpenMP is expensive.
2003: dual-processor Intel boxes with Hyper-Threading give 4 threads; not scalable, but cheap.
2004: 4-way Opteron machines become attractive (NUMA!).
2005: 16-way Opteron boxes at a very competitive price, with many OpenMP compilers available on Linux/Windows/Solaris.
2006: dual-core processors everywhere (including laptops).
2007: low-latency networks become commodity: OpenMP on InfiniBand clusters.
2008: 4- to 8-core processors => 8- to 32-way systems become affordable.
2009: OpenMP V3.0 available to the end user.
Are we keeping pace?

Is OpenMP easy?
Yes, and you can just as easily run into data races. We need data race detection / prevention tools in the development cycle.
The default shared strategy for global data may be inadequate:
!$OMP DEFAULT(SHARED|PRIVATE|NONE)
!$OMP THREADSHARED(list)
!$OMP THREADPRIVATE(list)
Is there a need for a compiler switch? (Fujitsu: frt -Kprivate …)
Autoscoping = cooperating with the compiler:
!$OMP PARALLEL DEFAULT(AUTO)
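By way of illustration (not from the talk), a minimal C sketch of the kind of race that default shared scoping invites, and how default(none) plus explicit scoping prevents it. The array and variable names (a, b, tmp) are made up for the example.

#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], b[N];
    double tmp;

    for (int i = 0; i < N; i++)
        b[i] = i;

    /* Data race: tmp is shared by default, so threads overwrite each
     * other's value between the two statements below. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        tmp = 2.0 * b[i];
        a[i] = tmp * tmp;
    }

    /* Correct version: default(none) forces every variable to be scoped
     * explicitly, and tmp becomes private to each thread. */
    #pragma omp parallel for default(none) private(tmp) shared(a, b)
    for (int i = 0; i < N; i++) {
        tmp = 2.0 * b[i];
        a[i] = tmp * tmp;
    }

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}

Compiled with an OpenMP-enabled compiler (e.g. gcc -fopenmp), the first loop gives nondeterministic results while the second is correct; a race detector would flag only the first.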

Automatic Scoping – Sometimes it would really help!

Extract from the PANTA CFD code:

!$omp parallel do &
!$omp & lastprivate(INE,IEE,ISE,ISW,IWW,INW,IUE,IUW,IDE,IDW,INU,IND,ISU,ISD) &
!$omp & private( &
!$omp & aece,aee,aeebe,aewbe,akck,akebk,akk,akwbk, &
!$omp & amueec,amuegc,amuekc,amuelc,amuetc,asonsq,atce,ate, &
!$omp & atebe,atwbe,auce,aucx,aucy,aucz,aue,auebe, &
!$omp & auebx,aueby,auebz,auwbe,auwbx,auwby,auwbz,aux, &
!$omp & auy,auz,avce,avcx,avcy,avcz,ave,avebe, &
!$omp & avebx,aveby,avebz,avwbe,avwbx,avwby,avwbz,avx, &
!$omp & avy,avz,awce,awcx,awcy,awcz,awe,awebe, &
!$omp & awebx,aweby,awebz,awwbe,awwbx,awwby,awwbz,awx, &
!$omp & awy,awz,bedbe,benbe,besbe,beube,bkdbk,bknbk, &
!$omp & bksbk,bkubk,btdbe,btnbe,btsbe,btube,budbe,budbx, &
!$omp & budby,budbz,bunbe,bunbx,bunby,bunbz,busbe,busbx, &
!$omp & busby,busbz,buube,buubx,buuby,buubz,bvdbe,bvdbx, &
!$omp & bvdby,bvdbz,bvnbe,bvnbx,bvnby,bvnbz,bvsbe,bvsbx, &
!$omp & bvsby,bvsbz,bvube,bvubx,bvuby,bvubz,bwdbe,bwdbx, &
!$omp & bwdby,bwdbz,bwnbe,bwnbx,bwnby,bwnbz,bwsbe,bwsbx, &
!$omp & bwsby,bwsbz,bwube,bwubx,bwuby,bwubz,caqden,dedq1, &
!$omp & dedq7,diffe,diffre,diffrk,diffru,diffrv,diffrw,dkdq1, &
!$omp & dkdq6,dmtdq6,dmtdq7,dtdq1,dtdq2,dtdq3,dtdq4,dtdq5, &
!$omp & dtdq6,dudq1,dudq2,dvdq1,dvdq3,dwdq1,dwdq4,ec, &
!$omp & edb,edbm,edbp,eeb,enb,enbm,enbp,esb, &
!$omp & esbm,esbp,eub,eubm,eubp,ewb,gedb,genb, &
!$omp & gesb,geub,gkdb,gknb,gksb,gkub,gtdb,gtnb, &
!$omp & gtsb,gtub,gudb,gunb,gusb,guub,gvdb,gvnb, &
!$omp & gvsb,gvub,gwdb,gwnb,gwsb,gwub,lijk,omegay, &
!$omp & omegaz,prode,qdens,qjc,qmqc,redbme,redbpe,renbme, &
!$omp & renbpe,resbme,resbpe,reubme,reubpe,rkdbmk,rkdbpk,rknbmk, &
!$omp & rknbpk,rksbmk,rksbpk,rkubmk,rkubpk,rtdbme,rtdbpe,rtnbme, &
!$omp & rtnbpe,rtsbme,rtsbpe,rtubme,rtubpe,rudbme,rudbmx,rudbmy, &
!$omp & rudbmz,rudbpe,rudbpx,rudbpy,rudbpz,runbme,runbmx,runbmy, &
!$omp & runbmz,runbpe,runbpx,runbpy,runbpz,rusbme,rusbmx,rusbmy, &
!$omp & rusbmz,rusbpe,rusbpx,rusbpy,rusbpz,ruubme,ruubmx,ruubmy, &
!$omp & ruubmz,ruubpe,ruubpx,ruubpy,ruubpz,rvdbme,rvdbmx,rvdbmy, &
!$omp & rvdbmz,rvdbpe,rvdbpx,rvdbpy,rvdbpz,rvnbme,rvnbmx,rvnbmy, &
!$omp & rvnbmz,rvnbpe,rvnbpx,rvnbpy,rvnbpz,rvsbme,rvsbmx,rvsbmy, &
!$omp & rvsbmz,rvsbpe,rvsbpx,rvsbpy,rvsbpz,rvubme,rvubmx,rvubmy, &
!$omp & rvubmz,rvubpe,rvubpx,rvubpy,rvubpz,rwdbme,rwdbmx,rwdbmy, &
!$omp & rwdbmz,rwdbpe,rwdbpx,rwdbpy,rwdbpz,rwnbme,rwnbmx,rwnbmy, &
!$omp & rwnbmz,rwnbpe,rwnbpx,rwnbpy,rwnbpz,rwsbme,rwsbmx,rwsbmy, &
!$omp & rwsbmz,rwsbpe,rwsbpx,rwsbpy,rwsbpz,rwubme,rwubmx,rwubmy, &
!$omp & rwubmz,rwubpe,rwubpx,rwubpy,rwubpz,tc,tdb,tdbm, &
!$omp & tdbp,teb,tkc,tkdb,tkdbm,tkdbp,tkeb,tknb, &
!$omp & tknbm,tknbp,tksb,tksbm,tksbp,tkub,tkubm,tkubp, &
!$omp & tkwb,tnb,tnbm,tnbp,tsb,tsbm,tsbp,tub, &
!$omp & tubm,tubp,twb,uc,udb,udbm,udbp,ueb, &
!$omp & unb,unbm,unbp,usb,usbm,usbp,uub,uubm, &
!$omp & uubp,uwb,vc,vdb,vdbm,vdbp,veb,velosq, &
!$omp & vnb,vnbm,vnbp,vsb,vsbm,vsbp,vub,vubm, &
!$omp & vubp,vwb,wc,wdb,wdbm,wdbp,web,wnb, &
!$omp & wnbm,wnbp,wsb,wsbm,wsbp,wub,wubm,wubp, &
!$omp & wwb,xiaxc,xiaxeb,xiaxwb,xiayc,xiayeb,xiaywb,xiazc, &
!$omp & xiazeb,xiazwb,xibxc,xibxnb,xibxsb,xibynb,xibysb,xibznb, &
!$omp & xibzsb,xicxc,xicxdb,xicxub,xicydb,xicyub,xiczdb,xiczub)
do i = is,ie
---- 1600 lines omitted ----
end do

The same loop with automatic scoping:

!$omp parallel do default(__auto)
do i = is,ie
---- 1600 lines omitted ----
end do

How about NUMA, clusters, memory hierarchies?
Frequently first-touch placement is not the way to go: if data is initialized by the master thread (e.g. by reading from files), it has to be migrated afterwards (automatically or manually).
A proposed directive:
!$OMP NEXTTOUCH(*|var_list)
It is easy to understand, and nothing breaks if it is not supported; we just have to wait for the migration.
How about
!$OMP PREFETCH(var_list, [urgency])
(Stef Salvini, EWOMP 2003, http://www.rz.rwth-aachen.de/ewomp03/omptalks/Tuesday/Session8/ewomp_salvini.mpg)
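As an illustration of the underlying problem (not from the talk): on a first-touch NUMA system, the thread that first writes a page determines where the page is allocated, so initializing the data in parallel with the same schedule as the compute loop places each part next to the thread that will use it. A minimal C sketch with made-up arrays x and y:

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)

int main(void)
{
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);
    const double a = 2.5;

    /* Parallel first-touch initialization: each thread touches the pages
     * it will later compute on, so they are allocated in that thread's
     * local NUMA memory. Initializing serially on the master thread would
     * put all pages on one node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    /* Compute loop with the same static schedule: threads mostly access
     * their locally placed pages. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    free(x);
    free(y);
    return 0;
}

A NEXTTOUCH-style directive would cover the cases where such parallel initialization is impossible (e.g. data read from files) by migrating pages to the next thread that touches them.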

NEXTTOUCH works
[Figure: stream saxpy in C/C++ on a 4-way Opteron system running Solaris]

Does OpenMP scale well?
Well, occasionally … rarely. How are we going to use all these nice 8- to 32-way boxes in the near future?
Parallelizing while loops: the proposal has been on the table since the very beginning:
#pragma omp taskq (KAI guidec/guidec++, Intel icc)
Better support for nested parallelism.
Can we efficiently use OpenMP encapsulated in libraries? User code calling a library function (e.g. a Newton algorithm) calling a user function.
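The taskq proposal eventually fed into the task construct adopted by OpenMP 3.0. As a rough illustration (not from the talk), a pointer-chasing while-style loop parallelized with OpenMP 3.0 tasks; the list type node_t and the per-element work process() are hypothetical:

#include <stdio.h>
#include <stdlib.h>

typedef struct node {
    int value;
    struct node *next;
} node_t;

/* Hypothetical per-element work. */
static void process(node_t *n)
{
    n->value *= 2;
}

static void process_list(node_t *head)
{
    #pragma omp parallel
    {
        /* One thread walks the list and creates a task per element;
         * the whole team executes the tasks. All tasks are finished
         * at the implicit barrier at the end of the single construct. */
        #pragma omp single
        {
            for (node_t *p = head; p != NULL; p = p->next) {
                #pragma omp task firstprivate(p)
                process(p);
            }
        }
    }
}

int main(void)
{
    node_t *head = NULL;
    for (int i = 0; i < 10; i++) {
        node_t *n = malloc(sizeof *n);
        n->value = i;
        n->next = head;
        head = n;
    }
    process_list(head);
    for (node_t *p = head; p != NULL; p = p->next)
        printf("%d ", p->value);
    printf("\n");
    return 0;
}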

OpenMP Nested - orientation
Proposed ways for a thread to find out where it lives within a nest of parallel regions:
!$OMP PARALLEL (name)
omp_get_thread_num(name)
level = omp_get_parallel_level()
omp_get_thread_num(level)
! Get thread id of ancestors
do level = 0, omp_get_parallel_level()
   print *, omp_get_thread_num(level)
end do
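OpenMP 3.0 ended up providing this orientation information through omp_get_level() and omp_get_ancestor_thread_num(). A minimal C sketch (not from the talk) that prints the thread id of every ancestor in a two-level nest:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1);   /* enable nested parallelism */

    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(2)
        {
            /* Walk from the initial thread (level 0) down to the
             * current nesting level. */
            int current = omp_get_level();            /* OpenMP 3.0 */
            #pragma omp critical
            {
                for (int level = 0; level <= current; level++)
                    printf("level %d: thread %d\n",
                           level,
                           omp_get_ancestor_thread_num(level)); /* OpenMP 3.0 */
                printf("--\n");
            }
        }
    }
    return 0;
}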

OpenMP Nested - threadprivate
The threadprivate directive specifies that named global-lifetime objects are replicated, with each thread having its own copy. There needs to be a way to suppress additional copying at a lower level of nested parallelism:

SUBROUTINE SUB()
  COMMON /T/ A(1000000000)
!$OMP THREADPRIVATE(/T/)
!$OMP PARALLEL COPYIN(/T/)
  ...
!$OMP PARALLEL WORKSHARE SHARED(/T/)
  A = ...
!$OMP END PARALLEL WORKSHARE
  ...
!$OMP END PARALLEL
END SUBROUTINE SUB
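A minimal C sketch (not from the talk, with a made-up array table) of the replication that threadprivate and copyin give at the outer level; in standard OpenMP the data would be replicated again for every inner team, which is exactly the copying the slide wants to be able to suppress:

#include <stdio.h>
#include <omp.h>

/* Global-lifetime object replicated once per thread. */
static double table[4];
#pragma omp threadprivate(table)

int main(void)
{
    /* Initialize the initial thread's copy. */
    for (int i = 0; i < 4; i++)
        table[i] = i * 1.5;

    /* copyin broadcasts the initial thread's values into every thread's
     * private copy when the parallel region starts. */
    #pragma omp parallel copyin(table) num_threads(4)
    {
        int tid = omp_get_thread_num();
        table[0] += tid;   /* each thread modifies its own copy */
        #pragma omp critical
        printf("thread %d: table[0] = %f\n", tid, table[0]);
    }
    return 0;
}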

Summary
We need to hurry …
Data race detection / prevention tools
Autoscoping
Task queues
Nexttouch / prefetch
Better nested support: orientation, threadprivate