Concurrent and Distributed Programming
Lecture 1: Introduction

References:
● Slides by Mark Silberstein, 2011
● "Intro to parallel computing" by Blaise Barney
● Lecture 1, CS149 Stanford, by Aiken & Olukotun
● "Intro to parallel programming", Gupta
Administration
● Lecturer: Assaf Schuster
● TA in charge: Nadav Amit
● TAs: Adi Omari, Ben Halperin
● EXE checkers: Shir Yadid, Meital Ben-Sinai
● Syllabus on the site – possible changes
Grading
● 70% exam, 30% home assignments
● Pass conditioned on exam grade >= 55
● And an assignment grade >= 55 for each assignment
● 5 assignments, check deadlines in the syllabus (each assignment worth 6 final-grade points)
  – Threads (Java)
  – OpenMP (C)
  – Hadoop/Map-Reduce (new)
  – BSP (C)
  – MPI (C)
● Assignments may overlap
● We will be checking for assignment copying
● No postponements (unless force majeure)
Prerequisites
● Formal
  ● Operating Systems
  ● Digital Computer Structure – or EE equivalent
● Informal
  ● Programming skills (makefiles, Linux, ssh)
  ● Positive attitude
Question
● I write a "hello world" program in plain C
● I have the latest and greatest computer with the latest and greatest OS and compiler
Does my program run in parallel?
Serial vs. parallel program
● Serial: one instruction at a time
● Parallel: multiple instructions in parallel
Why parallel programming?
● In general, most software will have to be written as parallel software
● Why?
Free lunch...
● Early '90s:
  ● Wait 18 months – get a new CPU with 2x the speed
● Moore's law: the number of transistors per chip doubles roughly every 18–24 months
(figure: transistor count over time; source: Wikipedia)
...is over
● More transistors = more execution units
● BUT performance per unit does not improve
● Bad news: serial programs will not run faster
● Good news: parallelization has high performance potential
  ● May get faster on newer architectures
Parallel programming is hard
● Need to optimize for performance
  ● Understand management of resources
  ● Identify bottlenecks
● No one technology fits all needs
  ● Zoo of programming models, languages, run-times
  ● Hardware architecture is a moving target
● Parallel thinking is NOT intuitive
● Parallel debugging is not fun
But there is no detour
Concurrent and Distributed Programming Course
● Learn
  ● Basic principles
  ● Parallel and distributed architectures
  ● Parallel programming techniques
  ● Basics of programming for performance
● This is a practical course – you will fail unless you complete all home assignments
Parallelizing Game of Life (cellular automaton)
● Given a 2D grid
● v_t(i,j) = F(v_{t-1} of all its neighbors)
(figure: cell (i,j) and its neighboring cells on the grid)
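As a reference point before parallelizing, here is a minimal serial sketch of one time step, assuming a Boolean N x N grid, the classic Game of Life rule as F, and boundary cells left untouched (the grid size and helper names are illustrative, not from the slides):

```c
#define N 1024

/* One serial time step: next[i][j] = F(neighbors of cur[i][j]). */
void step(const unsigned char cur[N][N], unsigned char next[N][N]) {
    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            int live = 0;                       /* count the 8 neighbors */
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++)
                    if (di || dj)
                        live += cur[i + di][j + dj];
            /* F: the classic Game of Life rule */
            next[i][j] = (live == 3) || (cur[i][j] && live == 2);
        }
    }
}
```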
Problem partitioning
● Domain decomposition (SPMD)
  ● Input domain
  ● Output domain
  ● Both
● Functional decomposition (MPMD)
  ● Independent tasks
  ● Pipelining
We choose: domain decomposition
● The field is split between processors
(figure: the grid around cell (i,j) split between CPU 0 and CPU 1)
Issue 1: Memory
● Can we access v(i+1,j) from CPU 0 as in the serial program?
(figure: the grid split between CPU 0 and CPU 1)
It depends...
● NO: distributed memory space architecture – disjoint memory spaces
● YES: shared memory space architecture – same memory space
(figures: two CPUs with private memories connected by a network vs. two CPUs sharing one memory)
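On a distributed-memory machine the remote value has to be communicated explicitly. A minimal MPI sketch of exchanging one boundary row between two ranks that each own half of the grid (buffer names and the two-rank layout are assumptions for illustration):

```c
#include <mpi.h>

#define N 1024

/* Exchange one boundary row with the other rank (assume exactly two
   ranks, 0 and 1). After the call, neighbor_row holds the other rank's
   boundary values, playing the role of v(i+1,j) in the serial code. */
void exchange_boundary(double my_boundary_row[N], double neighbor_row[N],
                       int my_rank) {
    int other = 1 - my_rank;
    MPI_Sendrecv(my_boundary_row, N, MPI_DOUBLE, other, 0,
                 neighbor_row,    N, MPI_DOUBLE, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```

On a shared-memory machine the same value can simply be read through a pointer, as in the serial program.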
Someone has to pay
● Distributed memory: harder to program, easier to build hardware
● Shared memory: easier to program, harder to build hardware
● Tradeoff: programmability vs. scalability
Flynn's HARDWARE taxonomy
SIMD
● Lock-step execution
● Example: vector operations (SSE)
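As an illustration (not from the slides), a sketch of a SIMD vector add using SSE intrinsics: one instruction applies the same operation to four floats in lock step.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Add two arrays of 4 floats with a single SSE add instruction:
   all four lanes execute the same operation in lock step. */
void add4(const float a[4], const float b[4], float out[4]) {
    __m128 va = _mm_loadu_ps(a);     /* load 4 floats (unaligned) */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* one instruction, 4 additions */
    _mm_storeu_ps(out, vc);
}
```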
MIMD
● Example: multicores
Memory architecture
● For CPU 0: is the time to access v(i+1,j) equal to the time to access v(i-1,j)?
(figure: the grid split between CPU 0 and CPU 1)
Hardware shared memory flavors 1
● Uniform memory access: UMA
● Same cost of accessing any data by all processors
Hardware shared memory flavors 2
● Non-uniform memory access: NUMA
● Tradeoff: scalability vs. latency
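On NUMA machines, which node a memory page lands on typically follows a first-touch policy. A common idiom (an illustrative sketch, assuming OpenMP and a Linux-style first-touch allocator, not something prescribed by the slides) is to initialize data with the same thread layout that will later compute on it:

```c
#include <stdlib.h>
#include <omp.h>

/* NUMA-aware initialization: each thread first touches the part of the
   array it will later work on, so under a first-touch policy those
   pages are allocated on that thread's local memory node. */
double *alloc_and_init(long n) {
    double *a = malloc((size_t)n * sizeof *a);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;                 /* first touch happens here */
    return a;
}
```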
Software Distributed Shared Memory (SDSM)
● Software DSM: emulation of NUMA in software over a distributed memory space
(figure: two machines, each with a process and a DSM daemon, communicating over a network)
Memory-optimized programming
● Most modern systems are NUMA or distributed
● Architecture is easier to scale up by scaling out
● Access time difference: local vs. remote data: x
● Memory accesses: the main source of optimization in parallel and distributed programs
● Locality: the most important parameter in program speed (serial or parallel)
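A standard illustration of why locality matters even in serial code (the array size here is arbitrary): traversing a row-major C array row by row uses each cache line fully, while traversing it column by column touches a new line on almost every access.

```c
#define N 1024
static double a[N][N];

/* Row-major traversal: consecutive elements of a row are adjacent in
   memory, so each fetched cache line is fully used. */
double sum_rowwise(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-wise traversal of the same data: each access jumps N*8 bytes,
   so nearly every access misses in cache. Typically much slower. */
double sum_colwise(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```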
Issue 2: Control
● Can we assign one v per CPU?
● Can we assign one v per process/logical task?
(figure: cell (i,j) and its neighbors)
Task management overhead
● Each task has state that must be managed
● More tasks – more state to manage
● Who manages the tasks?
● How many tasks should be run by a CPU?
● Does that depend on F? (See the sketch below.)
  – Reminder: v(i,j) = F(all of v's neighbors)
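One common answer in practice (an illustrative sketch, not the slides' prescribed solution) is to let a runtime such as OpenMP map many grid cells onto a few threads instead of creating one task per cell, so task-management state grows with the thread count rather than the cell count. The helper step_cell is hypothetical and stands for F applied to one cell.

```c
#include <omp.h>

#define N 1024

/* Hypothetical per-cell update implementing F. */
extern unsigned char step_cell(const unsigned char cur[N][N], int i, int j);

/* Many cells, few threads: OpenMP distributes the rows among the
   available threads; there is no per-cell task to manage. */
void parallel_step(const unsigned char cur[N][N], unsigned char next[N][N]) {
    #pragma omp parallel for
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            next[i][j] = step_cell(cur, i, j);
}
```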
Question
● Every process reads the data from its neighbors
● Will it produce correct results?
(figure: cell (i,j) and its neighbors)
Synchronization
● The order of reads and writes made in different tasks is non-deterministic
● Synchronization is required to enforce the order
● Locks, semaphores, barriers, condition variables
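For the grid example, the usual pattern (a sketch, assuming OpenMP and the hypothetical step_cell helper from above) is to compute each step into a separate buffer and rely on a barrier between steps, so no task reads a neighbor's cell before it has been fully written:

```c
#include <omp.h>

#define N     1024
#define STEPS 100

/* Hypothetical per-cell update implementing F. */
extern unsigned char step_cell(unsigned char (*cur)[N], int i, int j);

void run(unsigned char (*cur)[N], unsigned char (*next)[N]) {
    for (int t = 0; t < STEPS; t++) {
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                next[i][j] = step_cell(cur, i, j);
        /* The implicit barrier at the end of the parallel for guarantees
           every cell of 'next' is written before the buffers are swapped. */
        unsigned char (*tmp)[N] = cur;
        cur = next;
        next = tmp;
    }
}
```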
Checkpoint: fundamental hardware-related issues
● Memory accesses
  ● Optimizing locality of accesses
● Control
  ● Overhead
  ● Synchronization
● Fundamental issues: correctness, scalability
Parallel programming issues
● We decide to split this 3x3 grid like this:
(figure: the 3x3 grid around (i,j) divided among CPU 0, CPU 1, CPU 2, CPU 4)
Problems?
Issue 1: Load balancing
● Always waiting for the slowest task
● Solutions?
(figure: the unevenly split grid among CPU 0, CPU 1, CPU 2, CPU 4)
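One standard remedy (an illustrative sketch, not the slides' prescribed answer) is dynamic work distribution: split the domain into many more chunks than CPUs and let idle workers grab the next chunk, for example with OpenMP's dynamic schedule. The per-row worker process_row is hypothetical.

```c
#include <omp.h>

#define N 1024

/* Hypothetical per-row work; the amount of work may vary from row to row. */
extern double process_row(int i);

double total_work(void) {
    double sum = 0.0;
    /* schedule(dynamic, 8): rows are handed out in chunks of 8 to whichever
       thread is free, so one slow chunk does not stall the others. */
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += process_row(i);
    return sum;
}
```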
Issue 2: Granularity
● G = computation / communication
● Fine-grain parallelism
  ● G is small
  ● Good load balancing
  ● Potentially high overhead
● Coarse-grain parallelism
  ● G is large
  ● Potentially bad load balancing
  ● Low overhead
What granularity works for you?
It depends...
● On each combination of computing platform and parallel application
● The goal is to minimize overheads and maximize CPU utilization
● High performance requires
  ● Enough parallelism to keep CPUs busy
  ● Low relative overheads (communication, synchronization, task control, memory accesses versus computation)
  ● Good load balancing
Summary
● Parallelism and scalability do not come for free
  ● Overhead
  ● Memory-aware access
  ● Synchronization
  ● Granularity
● But... let's assume for one slide that we did not have these issues...
Can we estimate an upper bound on speedup? Amdahl's law
● The sequential component limits the speedup
● Split the program's serial time T_serial = 1 into
  ● An ideally parallelizable fraction A: ideal load balancing, identical speeds, no overheads
  ● A fraction 1-A that cannot be parallelized
● Parallel time: T_parallel = A/#CPUs + (1-A)
● Scalability = Speedup(#CPUs) = T_serial / T_parallel = 1 / (A/#CPUs + (1-A))
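A quick worked example (the value A = 0.95 is chosen here only for illustration): even if 95% of the program parallelizes perfectly, the speedup is bounded by 1/(1-A) = 20 no matter how many CPUs are added.

```c
#include <stdio.h>

/* Amdahl's law: speedup = 1 / (A/n + (1 - A)),
   A = parallelizable fraction, n = number of CPUs. */
static double speedup(double A, double n) {
    return 1.0 / (A / n + (1.0 - A));
}

int main(void) {
    double A = 0.95;                              /* 95% of the work parallelizes */
    printf("n = 10:  %.1f\n", speedup(A, 10));    /* ~6.9  */
    printf("n = 100: %.1f\n", speedup(A, 100));   /* ~16.8 */
    printf("n = 1e6: %.1f\n", speedup(A, 1e6));   /* -> 20, the 1/(1-A) limit */
    return 0;
}
```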
Bad news
(figure: Amdahl's law speedup curves; source: Wikipedia)
● So why do we need machines with 1000s of CPUs?