Computer Organization, David Monismith, CS345. Notes to help with the in-class assignment.

Flynn’s Taxonomy

SISD = Single Instruction Single Data = Serial Programming
SIMD = Single Instruction Multiple Data = Implicit Parallelism (Instruction/Architecture Level)
MISD = Multiple Instruction Single Data (Rarely implemented)
MIMD = Multiple Instruction Multiple Data = Multiprocessor

                          Single Data    Multiple Data
    Single Instruction    SISD           SIMD
    Multiple Instruction  MISD           MIMD

Flynn’s Taxonomy

SIMD instructions and architectures allow for implicit parallelism when writing programs. To provide a sense of how these work, examples are shown in the following slides. Our focus on MIMD is through the use of processes and threads, and examples of these are shown in later slides.

Understanding SIMD Instructions

Implicit parallelism occurs via AVX (Advanced Vector Extensions) or SSE (Streaming SIMD Extensions) instructions.

Example: without SIMD, the following loop might be executed with four add instructions per iteration:

//Serial Loop
for(int i = 0; i < n; i += 4) {
    c[i]   = a[i]   + b[i];   //add c[i],   a[i],   b[i]
    c[i+1] = a[i+1] + b[i+1]; //add c[i+1], a[i+1], b[i+1]
    c[i+2] = a[i+2] + b[i+2]; //add c[i+2], a[i+2], b[i+2]
    c[i+3] = a[i+3] + b[i+3]; //add c[i+3], a[i+3], b[i+3]
}

Understanding SIMD Instructions

With SIMD, the following loop might be executed with one add instruction per iteration:

//SIMD Loop
for(int i = 0; i < n; i += 4) {
    c[i]   = a[i]   + b[i];   //add c[i to i+3], a[i to i+3], b[i to i+3]
    c[i+1] = a[i+1] + b[i+1];
    c[i+2] = a[i+2] + b[i+2];
    c[i+3] = a[i+3] + b[i+3];
}

Understanding SIMD Instructions

Note that the add instructions above are pseudo-assembly instructions. The serial loop is implemented as follows:

| a[i]   | + | b[i]   | -> | c[i]   |
| a[i+1] | + | b[i+1] | -> | c[i+1] |
| a[i+2] | + | b[i+2] | -> | c[i+2] |
| a[i+3] | + | b[i+3] | -> | c[i+3] |

Understanding SIMD Instructions

Versus SIMD:

| a[i]   |     | b[i]   |      | c[i]   |
| a[i+1] |     | b[i+1] |      | c[i+1] |
| a[i+2] |  +  | b[i+2] |  ->  | c[i+2] |
| a[i+3] |     | b[i+3] |      | c[i+3] |

Understanding SIMD Instructions

In the previous example, a 4x speedup was achieved by using SIMD instructions. Note that SIMD registers are often 128, 256, or 512 bits wide, allowing for addition, subtraction, multiplication, etc., of 2, 4, or 8 double precision variables at once.

Reference: "Performance of SSE and AVX Instruction Sets," Hwancheol Jeong, Weonjong Lee, Sunghoon Kim, and Seok-Ho Myung, Proceedings of Science, 2012.
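
As a concrete illustration (not from the original slides), the four-at-a-time add can also be written explicitly with 128-bit SSE intrinsics operating on single-precision floats. The function name vector_add is mine, and the sketch assumes n is a multiple of 4:

#include <xmmintrin.h>  // SSE intrinsics: 128-bit registers, four floats each

// Minimal sketch: add n floats from a and b into c, four elements at a time.
// A remainder loop would be needed if n were not a multiple of 4.
void vector_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   // load a[i..i+3]
        __m128 vb = _mm_loadu_ps(&b[i]);   // load b[i..i+3]
        __m128 vc = _mm_add_ps(va, vb);    // one SIMD add covers all four elements
        _mm_storeu_ps(&c[i], vc);          // store c[i..i+3]
    }
}

In practice, compilers often generate this kind of code automatically (auto-vectorization) from the plain serial loop shown earlier.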

Processes and Threads

These exist only at execution time. They have fast state changes -> in memory and waiting.

A process:
- is a fundamental computation unit
- can have one or more threads
- is handled by the process management module
- requires system resources

Process

Process (job) - a program in execution, ready to execute, or waiting for execution. A program is static, whereas a process (running program) is dynamic.

In Operating Systems (CS550) we will implement processes using an API called the Message Passing Interface (MPI). MPI will provide us with an abstraction layer that will allow us to create and identify processes without worrying about the creation of data structures for sockets or shared memory.
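
To make the MPI idea concrete, here is a minimal sketch (mine, not from the original slides) of creating and identifying processes by rank:

#include <stdio.h>
#include <mpi.h>

// Minimal sketch: each MPI process reports its rank within the set of processes.
int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);                 // join the set of MPI processes
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's id (0..size-1)
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes

    printf("Process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

Such a program is typically compiled with mpicc and launched with something like mpirun -np 4, which starts four processes without the programmer ever touching sockets or shared memory directly.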

Threads

Threads - lightweight processes:
- a dynamic component of processes
- often, many threads are part of a process

Current OSes and hardware support multithreading:
- multiple threads (tasks) per process
- one or more threads per CPU core

Execution of threads is handled more efficiently than that of full-weight processes (although there are other costs). At process creation, one thread is created, the "main" thread. Other threads are created from the "main" thread.
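
As an illustration of the "main" thread spawning additional threads (this sketch is mine, not from the slides; the worker function and thread count of four are arbitrary), using POSIX threads:

#include <stdio.h>
#include <pthread.h>

// Hypothetical worker: each spawned thread just reports its id.
void *worker(void *arg) {
    long id = (long) arg;
    printf("worker thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t tids[4];

    // The "main" thread creates four additional threads...
    for (long i = 0; i < 4; i++)
        pthread_create(&tids[i], NULL, worker, (void *) i);

    // ...and waits for each of them to finish.
    for (int i = 0; i < 4; i++)
        pthread_join(tids[i], NULL);

    return 0;
}

OpenMP, used in the following slides, creates and manages this kind of thread team for us.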

Embarrassingly Parallel (Map)

Processes and threads are MIMD. Performing array (or matrix) addition is a straightforward example that is easily parallelized. The serial version of this follows:

for(i = 0; i < N; i++)
    C[i] = A[i] + B[i];

OpenMP allows you to write a #pragma to parallelize code that you write in a serial (normal) fashion. Three OpenMP parallel versions follow on the next slides.

OpenMP First Try

We could parallelize the loop on the last slide directly as follows:

#pragma omp parallel private(i) shared(A,B,C)
{
    int start = omp_get_thread_num() * (N / omp_get_num_threads());
    int end   = start + (N / omp_get_num_threads());
    for(i = start; i < end; i++)
        C[i] = A[i] + B[i];
}

Notice that i is declared private because it is not shared between threads; each thread gets its own copy of i. Arrays A, B, and C are declared shared because they are shared between threads. (Note that if N is not evenly divisible by the number of threads, this manual partitioning skips the leftover iterations, which is one reason the next approach is preferred.)

OpenMP for Clause

It is preferred to allow OpenMP to directly parallelize loops using the for clause, as follows:

#pragma omp parallel private(i) shared(A,B,C)
{
    #pragma omp for
    for(i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

Notice that the loop can be written in a serial fashion; its iterations will be automatically partitioned and assigned to threads.
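
To see the automatic partitioning in action, here is a small sketch (my illustration, not from the slides; the array size of 8 is arbitrary) that prints which thread handles each iteration:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i, N = 8;

    #pragma omp parallel private(i)
    {
        #pragma omp for
        for(i = 0; i < N; i++)
            printf("iteration %d handled by thread %d\n",
                   i, omp_get_thread_num());
    }
    return 0;
}

With the default (typically static) schedule, each thread usually receives a contiguous block of iterations, much like the manual start/end partitioning on the previous slide.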

Shortened OpenMP for

When using a single for loop, the parallel and for directives may be combined:

#pragma omp parallel for private(i) \
    shared(A,B,C)
for(i = 0; i < N; i++)
    C[i] = A[i] + B[i];
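
For reference, a complete compilable version of the combined form (a sketch of mine, not from the slides; the array size and initial values are arbitrary, and it assumes compilation with an OpenMP flag such as gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

#define N 1000000

static double A[N], B[N], C[N];

int main(void) {
    int i;

    // Arbitrary initialization so the addition has something to do.
    for(i = 0; i < N; i++) {
        A[i] = i;
        B[i] = 2.0 * i;
    }

    // Combined parallel for: the iterations are split among the threads.
    #pragma omp parallel for private(i) shared(A,B,C)
    for(i = 0; i < N; i++)
        C[i] = A[i] + B[i];

    printf("C[N-1] = %f\n", C[N-1]);
    return 0;
}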