Lecture 4: Timing & Programming Models Lecturer: Simon Winberg Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)


• Timing in C
• Review of homework (scalar product)
• Important terms
• Data parallel model
• Message passing model
• Shared memory model
• Hybrid model

• Makefile: for compiling the program.
• blur.c: the file that has the pthreads radial blur code.
• serial.c: this file just has the file IO and timing code written for you. You need to implement the median filter.
• Develop a para.c version of serial.c for your parallel solution. Report issues…

• Write a short report (it can be one or two pages) discussing your observations and results.
• Show some of the images produced, demonstrating your understanding of the purpose and method of the median filter.
• Give a breakdown of the different timing results you got from the method.
• The timing data needs to be shown in a table, with the number of threads on the y-axis and the different processing types on the x-axis.
• Also include a graph plotting the observed speedup factor versus the ideal speedup.
• The ideal speedup is perfectly proportional to the processing resources allocated: if two processors are being used, the ideal speedup factor would be 200%.
• Comment on the results you got, and why you think you got them. Please reference technical factors in justifying your results.
• Hand in: submit the report using the Prac01 VULA assignment.
• Working in a team of 2? Submit only one report / zip file, and name the file according to both student numbers.

Thus far… Looks like everyone has signed up to present a seminar!

Meeting Title       Date              Status
Seminar Group 2     Tue, 2014/03/04   Full
Seminar Group 3     Tue, 2014/03/11   Full
Seminar Group 4     Tue, 2014/03/18   Full
Seminar Group 5     Tue, 2014/03/25   Available
Seminar Group 6     Tue, 2014/04/01   Full
Seminar Group 7     Tue, 2014/04/15   Available
Seminar Group 8     Tue, 2014/04/22   Full
Seminar Group 9     Tue, 2014/05/06   Full
Seminar Group 10    Tue, 2014/05/13   Full

Chapters covered:
CH1: A Retrospective on High Performance Embedded Computing
CH2: Representative Example of a High Performance Embedded Computing System

Next Seminar (4-Mar) – Seminar Group: Arnab, Anurag; Kammies, Lloyd; Mtintsilana, Masande; Munbodh, Mrinal

• Sample implementations:
  • scalarprod.dev – scalar product sequential implementation
  • matmul.dev – matrix multiply sequential implementation
See: TimingCode\TimingExamples
See: ParallelProd

Scalarprod.c – golden measure / sequential solution

t0 = CPU_ticks(); // get initial tick value

// Do processing...
// first initialize the vectors
for (i = 0; i < VECTOR_LEN; i++) {
    a[i] = random_f();
    b[i] = random_f();
}
sum = 0;
for (i = 0; i < VECTOR_LEN; i++) {
    sum = sum + (a[i] * b[i]);
}

// get the time elapsed
t1 = CPU_ticks(); // get final tick value

Parallelizing: the products pair up element by element (a1×b1, a2×b2, a3×b3, a4×b4, …). There's obvious concurrency here. Ideally, it would be nice to do all these multiplies simultaneously.

Parallel code program…
Example runs on an Intel Core2 Duo, each CPU 2.93 GHz:
Serial solution: 6.31
Parallel solution: 3.88
Speedup = 6.31/3.88 = 1.63

Scalar product parallel solutions & introduction to some important terms…

Pthreads (partitioned) – vectors a and b in global / shared memory, assuming a 2-core machine:
Thread 1 computes sum1 over [a1 a2 … a(m-1)] and [b1 b2 … b(m-1)]
Thread 2 computes sum2 over [am a(m+1) a(m+2) … an] and [bm b(m+1) b(m+2) … bn]
main() starts the threads and then computes sum = sum1 + sum2 (a code sketch follows)
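A minimal Pthreads sketch of this partitioned scheme (my own illustrative code with hypothetical names, not the course's ParallelProd sample): each thread sums a contiguous chunk of the vectors into its own slot, and main() adds the partial sums. Compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>

#define VECTOR_LEN  1000000
#define NUM_THREADS 2

static float  a[VECTOR_LEN], b[VECTOR_LEN];
static double partial[NUM_THREADS];

/* Each thread handles one contiguous chunk of the vectors. */
static void *partial_dot(void *arg) {
    long id    = (long)arg;
    long chunk = VECTOR_LEN / NUM_THREADS;
    long start = id * chunk;
    long end   = (id == NUM_THREADS - 1) ? VECTOR_LEN : start + chunk;
    double sum = 0.0;
    for (long i = start; i < end; i++)
        sum += a[i] * b[i];
    partial[id] = sum;          /* each thread writes only its own slot */
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long i = 0; i < VECTOR_LEN; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, partial_dot, (void *)t);

    double sum = 0.0;
    for (long t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
        sum += partial[t];      /* sum = sum1 + sum2 */
    }
    printf("dot product = %f\n", sum);
    return 0;
}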

Pthreads (interlaced*) – vectors a and b in global / shared memory, assuming a 2-core machine:
Thread 1 computes sum1 over [a1 a3 a5 a7 … a(n-2)] and [b1 b3 b5 b7 … b(n-2)]
Thread 2 computes sum2 over [a2 a4 a6 … a(n-1)] and [b2 b4 b6 … b(n-1)]
main() starts the threads and then computes sum = sum1 + sum2
* Data striping, interleaving and interlacing can mean the same or quite different approaches depending on your data.
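For the interlaced layout, only the loop inside partial_dot() in the previous sketch changes (again an illustrative fragment, not the course code): each thread starts at its own index and strides by the number of threads.

/* Interlaced: thread id handles elements id, id+NUM_THREADS, id+2*NUM_THREADS, ... */
double sum = 0.0;
for (long i = id; i < VECTOR_LEN; i += NUM_THREADS)
    sum += a[i] * b[i];
partial[id] = sum;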

• Timing methods have been included in the sample files (alternatively, see the next slides for other options)
• Avoid having many programs running (this may interfere slightly with the timing results)

• Additional sample code has been provided for high-performance timing on Windows and on Linux – I tried to find something fairly portable that doesn't need MSVC++
• I've only been able to get the Intel high-performance timer working on Cygwin (can't seem to get better than ms resolution on Cygwin)
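As one portable alternative (my own sketch, an assumption rather than the provided sample code): clock_gettime() with CLOCK_MONOTONIC on Linux/Cygwin and QueryPerformanceCounter() on Windows both give well under a millisecond of resolution. On older glibc versions you may need to link with -lrt.

#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
/* Elapsed wall-clock time in seconds using the Windows high-resolution counter. */
static double wall_seconds(void) {
    LARGE_INTEGER freq, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&now);
    return (double)now.QuadPart / (double)freq.QuadPart;
}
#else
#include <time.h>
/* Elapsed wall-clock time in seconds using a monotonic POSIX clock. */
static double wall_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}
#endif

int main(void) {
    double t0 = wall_seconds();
    /* ... work being timed goes here ... */
    double t1 = wall_seconds();
    printf("elapsed: %.6f s\n", t1 - t0);
    return 0;
}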

Some important terms:
• Contiguous
• Partitioned (or separated or split)
• Interleaved (or alternating)
• Interlaced

Programming Models (EEE4084F) – some more terms:
• Data parallel model
• Message passing model
• Shared memory model
• Hybrid model

• A set of tasks works in parallel on separate parts of a data structure
• The sequential version is usually a loop, iterating through arrays
• Example: multiply each element by 2 (see the sketch below)
Possible parallel implementation:
1. Initialize: prepare the division of data (may include programming application co-processors)
2. Communicate: engage threads and/or co-processors, give instructions
3. Perform parallel computations on 'own' data
4. Combine results
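A small sketch of the "multiply each element by 2" example, written here with OpenMP for brevity (my choice of illustration; the same partitioning could equally be coded by hand with Pthreads). Compile with gcc -fopenmp.

#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void) {
    float *x = malloc(N * sizeof *x);
    for (int i = 0; i < N; i++) x[i] = (float)i;

    /* Data parallel step: the index range is divided among the threads,
       and every thread applies the same operation to its own partition. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = x[i] * 2.0f;

    printf("x[10] = %f\n", x[10]);   /* expect 20.0 */
    free(x);
    return 0;
}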

• Most of the parallel work focuses on doing operations on a data set.
• The data set is organized into a 'common structure', e.g., a multidimensional array.
• A group of tasks (threads / co-processors) works concurrently on the same data structure, but each works on a different partition of the data structure.
• Tasks perform the same operation on 'their part' of the data structure.

• For shared memory architectures, all tasks often have access to the same data structure through global memory (you probably did this with Pthreads in Prac1!)
• For distributed memory architectures, the data structure is split up and resides as 'chunks' in the local memory of each task/machine (message-passing programs, e.g., MPI, tend to use this method, together with messages indicating which data to use)

• Can be difficult to write: you need to identify how to divide up the input and later join up the results…
• Consider: for i = 1..n: x[i] = x[i] + 1
• Simple for a shared memory machine, difficult (if not pointless) for a distributed machine
• But can also be very easy to write – even embarrassingly parallel
• Can be easy to debug (all threads running the same code)

• The communication aspect is often easier to deal with, as each process has the same input and output interfaces
• Each thread/task usually runs independently, without having to wait on answers (synchronization) from others

• Involves a set of tasks that use their own local memory during computations
• Multiple tasks can run on the same physical machine, or they could run across multiple machines
• Tasks exchange data by sending and receiving communication messages
• Transfer of data usually needs cooperative or synchronization operations to be performed by each process, i.e., each send operation needs a corresponding receive operation

• Parallelized software applications using MP usually make use of a specialized library of subroutines, linked with the application code.
• The programmer is entirely responsible for deciding on and implementing all of the parallelism.
• Historically, a variety of message passing libraries have been available since the 1980s. These implementations differed substantially from each other, making it difficult for programmers to develop portable applications.

• The MPI Forum was formed in 1992, with the goal of establishing a standard interface for message passing implementations, called the "Message Passing Interface" (MPI) – first released in 1994

• MPI is now the most common industry standard for message passing, and has replaced many of the custom-developed and obsolete standards used in legacy systems.
• Most manufacturers of (microprocessor-based) parallel computing platforms offer an implementation of MPI.
• For shared memory architectures, MPI implementations usually do not use the network for inter-task communications, but rather use shared memory (or memory copies) for better performance.
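A minimal MPI sketch of the scalar product in this style (illustrative only; the data are placeholders and it assumes VECTOR_LEN divides evenly among the processes): each rank computes a partial sum over its own chunk, and MPI_Reduce combines the results on rank 0. Build with mpicc and run with, e.g., mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

#define VECTOR_LEN (1 << 20)

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank works on its own chunk (assumes VECTOR_LEN % size == 0). */
    long chunk = VECTOR_LEN / size;
    double local = 0.0;
    for (long i = rank * chunk; i < (rank + 1) * chunk; i++) {
        double a = 1.0, b = 2.0;   /* stand-in data; a real program would distribute the vectors */
        local += a * b;
    }

    /* Message passing combines the partial sums on rank 0. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("dot product = %f\n", total);

    MPI_Finalize();
    return 0;
}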

• Tasks share a common address space, which they read from and write to asynchronously.
• Memory access control is needed (e.g., to cater for data dependencies), such as:
  • locks and semaphores to control access to the shared memory (see the sketch below)
• In this model, the notion of tasks 'owning' data is lacking, so there is no need to specify explicitly the communication of data between tasks.
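A small illustrative sketch (my own example, not course code) of lock-based access control with a Pthreads mutex: several threads increment one shared counter, and the mutex serialises each update so no increments are lost. Compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define INCREMENTS  100000

static long shared_count = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);    /* only one thread may update at a time */
        shared_count++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(t[i], NULL);
    printf("count = %ld (expected %d)\n", shared_count, NUM_THREADS * INCREMENTS);
    return 0;
}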

• Advantage: often allows program development to be simplified (e.g., compared to MP).
• Advantage: keeping data local to the processor working on it conserves memory accesses (reducing the cache refreshes, bus traffic and the like that occur when multiple processors use the same data).
• Disadvantage: it tends to become more difficult to understand and manage data locality, which may detract from overall system performance.

• This simply involves a combination of any two or more of the parallel programming models described.
• A commonly used hybrid model is the combination of message passing with a threading model (e.g., POSIX threads or OpenMP); a sketch follows after the next slide. This hybrid model is well suited to the increasingly popular clustering of SMP machines.

• Another increasingly used hybrid model is the combination of data parallelism and message passing, e.g., using a cluster of SMPs, each with their own FPGA-based application accelerator or co-processor.
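A minimal sketch of the message passing + threads combination mentioned above (an illustration under my own assumptions, combining MPI across nodes with OpenMP threads within each node; it is not the only way to structure a hybrid program). Build with mpicc -fopenmp.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    /* Request thread support so the OpenMP threads can coexist with MPI. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    /* Shared-memory parallelism inside each MPI process (one SMP node). */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 * 2.0;            /* stand-in for this rank's share of the work */

    /* Message passing combines the per-node results across the cluster. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}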

• Single Program Multiple Data (SPMD): consider it as running the same program, on different data inputs, on different computers, (possibly) at the same time
• Multiple Program Multiple Data (MPMD): consider this one as running the same program with different parameter settings, or recompiling the same code with different sections of code included (e.g., using #ifdef and #endif to do this – see the sketch below)
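A tiny illustration of the #ifdef approach (the file name and macro are hypothetical, made up for this example): the same source file is compiled into two different program variants.

#include <stdio.h>

/* One source file, two program variants (MPMD-style), e.g.:
     gcc -DROLE_SENDER variants.c -o sender
     gcc              variants.c -o receiver            */
int main(void) {
#ifdef ROLE_SENDER
    printf("Built as the sender variant\n");    /* compiled in only with -DROLE_SENDER */
#else
    printf("Built as the receiver variant\n");  /* default build */
#endif
    return 0;
}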

• Observed speedup = (wall-clock time of the initial version) / (wall-clock time of the refined version), e.g., (wall-clock time of the sequential/gold version) / (wall-clock time of the parallel version)
• Parallel overhead: amount of time to coordinate parallel tasks (excludes time doing useful work). Parallel overhead includes operations such as: task/co-processor start-up time, synchronizations, communications, parallelization libraries (e.g., OpenMP, Pthreads), tools, operating system, task termination and clean-up time

• Embarrassingly parallel: simultaneously performing many similar, independent tasks, with little to no coordination between tasks.
• Massively parallel: hardware that has very many processors (for execution of parallel tasks). Can consider this a classification of parallel tasks.

• Parallel architectures
Reminder – Quiz 1 next Thursday

Image sources: stopwatch (slides 1 & 14), gold bar: Wikipedia (open commons); books clipart: (open commons)

Disclaimers and copyright/licensing details: I have tried to follow the correct practices concerning copyright and licensing of material, particularly the image sources that have been used in this presentation. I have put much effort into trying to make this material open access so that it can be of benefit to others in their teaching and learning practice. Any mistakes or omissions with regard to these issues I will correct when notified. To the best of my understanding, the material in these slides can be shared according to the Creative Commons "Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)" license, and that is why I selected that license to apply to this presentation (it's not because I particularly want my slides referenced, but more to acknowledge the sources and generosity of others who have provided free material, such as the images I have used).