Parallelism: Can we make it faster?
25-Apr-17

The RAM model
The RAM (Random Access Machine) model of computation assumes:
  There is a single processing unit
  There is an arbitrarily large amount of memory
  Accessing any arbitrarily chosen (i.e. random) memory location takes unit time
This simple model is a very useful guide for algorithm design
  For maximum efficiency, “tuning” to the particular hardware is required
The RAM model breaks down when the assumptions are violated
  If an array is so large that only a portion of it fits in memory (the rest is on disk), very different sorting algorithms should be used
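
The consequences of that breakdown are easy to demonstrate. The sketch below (not from the original slides; the class name and size are invented) sums a two-dimensional array twice; under the RAM model both loops perform the same number of unit-cost accesses, but on real hardware the column-major loop is usually much slower because it defeats the caches discussed later in these slides.

    // Hypothetical sketch: the RAM model predicts identical cost for both loops,
    // but on real machines caches usually make the column-major loop much slower.
    public class RamModelLimits {
        static final int SIZE = 4096;                  // arbitrary size for the sketch
        static final int[][] a = new int[SIZE][SIZE];

        static long rowMajorSum() {                    // walks memory sequentially
            long sum = 0;
            for (int i = 0; i < SIZE; i++)
                for (int j = 0; j < SIZE; j++)
                    sum += a[i][j];
            return sum;
        }

        static long columnMajorSum() {                 // jumps SIZE ints between accesses
            long sum = 0;
            for (int j = 0; j < SIZE; j++)
                for (int i = 0; i < SIZE; i++)
                    sum += a[i][j];
            return sum;
        }

        public static void main(String[] args) {
            long t0 = System.nanoTime();
            long s1 = rowMajorSum();
            long t1 = System.nanoTime();
            long s2 = columnMajorSum();
            long t2 = System.nanoTime();
            System.out.println("row-major:    " + (t1 - t0) / 1_000_000 + " ms (sum " + s1 + ")");
            System.out.println("column-major: " + (t2 - t1) / 1_000_000 + " ms (sum " + s2 + ")");
        }
    }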

Approaches to parallelism
The basic question is, do the processing units share memory, or do they send messages to one another?
  A thread consists of a single flow of control, a program counter, a call stack, and a small amount of thread-specific data
    Threads share memory, and communicate by reading and writing to that memory
    This is thread-based or shared-memory parallel processing
    Java “out of the box” is thread-based
  A process is a thread that has its own private memory
    Processes (sometimes called actors) send messages to one another
    This is message-passing parallel processing
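
A minimal sketch of the shared-memory style in plain Java (not from the original slides; the class and field names are invented): the worker thread and the main thread communicate simply by writing and reading the same field.

    // Hypothetical sketch: two threads communicating through shared memory.
    public class SharedMemoryDemo {
        private static int result;      // shared field, read only after join()

        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(() -> {
                result = 6 * 7;         // the worker writes to shared memory
            });
            worker.start();
            worker.join();              // wait for the worker; join also makes its write visible
            System.out.println("result = " + result);   // the main thread reads it
        }
    }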

The PRAM model
An obvious extension to the RAM model is the Parallel Random Access Machine (PRAM) model, which assumes:
  There are multiple processing units
  There is an arbitrarily large amount of memory
  Accessing any memory location takes unit time
The third assumption is “good enough” for many in-memory sequential programs, but not good enough for parallel programs
  If the processing units share memory, then complicated and expensive synchronization mechanisms must be used
  If the processing units do not share memory, then each has its own (fast) local memory, and communicates with other processes by sending messages to them (much slower--especially if over a network!)
Bottom line: Because there seems to be no way to meet the unit-time assumption, the PRAM model is seriously broken!

The CTA model
The Candidate Type Architecture (CTA) model makes these assumptions:
  There are P standard sequential processors, each with its own local memory
  One of the processors may be acting as “controller,” doing things like initialization and synchronization
  Processors can access non-local memory over a communication network
  Non-local memory is between 100 and 10,000 times slower to access than local memory (based on common architectures)
  A processor can make only a very small number (maybe 1 or 2) of simultaneous non-local memory accesses

Consequences of CTA
The CTA model does not specify how many processors are available
  The programmer does not need to plan for some specific number of processors
  More processors may cause the code to execute somewhat more quickly
The CTA model does specify a huge discrepancy between local and non-local memory access
  The programmer should minimize the number of non-local memory accesses

Costs of parallelism
It would be great if having N processors meant our programs would run N times as fast, but...
  There is overhead involved in setting up the parallelism, which we don’t need to pay for a sequential program
  There are parts of any program that cannot be parallelized
  Some processors will be idle because there is nothing for them to do
  Processors have to contend for the same resources, such as memory, and may have to wait for one another

Overhead
Overhead is any cost incurred by the parallel algorithm but not by the corresponding sequential algorithm
  Communication among threads and processes (a single thread has no other threads with which to communicate)
  Synchronization: one thread or process has to wait for results or events from another thread or process
  Contention for a shared resource, such as memory
    Java’s synchronized is used to wait for a lock to become free
  Extra computation to combine the results of the various threads or processes
  Extra memory may be needed to give each thread or process the memory required to do its job
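
Two of these overheads are easy to see in a small sketch (not from the slides; the class and variable names are invented): the main thread must synchronize with its workers by joining them, and then must do extra work to combine their partial results.

    // Hypothetical sketch: splitting a sum across two threads.
    // The joins are synchronization overhead; adding the partial sums is combining overhead.
    public class PartialSums {
        public static void main(String[] args) throws InterruptedException {
            int[] data = new int[1_000_000];
            java.util.Arrays.fill(data, 1);

            long[] partial = new long[2];            // one slot per worker, no sharing of slots
            Thread left = new Thread(() -> {
                for (int i = 0; i < data.length / 2; i++) partial[0] += data[i];
            });
            Thread right = new Thread(() -> {
                for (int i = data.length / 2; i < data.length; i++) partial[1] += data[i];
            });
            left.start(); right.start();
            left.join();  right.join();              // synchronization: wait for both workers

            long total = partial[0] + partial[1];    // extra computation: combine the results
            System.out.println("total = " + total);  // prints total = 1000000
        }
    }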

Amdahl’s law
Some proportion P of a program can be made to run in parallel, while the remaining (1 - P) must remain sequential
If there are N processors, then the computation can be done in (1 - P) + P/N time
The maximum speedup is then 1 / ((1 - P) + P/N)
As N goes to infinity, the maximum speedup is 1 / (1 - P)
For example, if P = 0.75, the maximum speedup is 1 / 0.25, or four times
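
A small sketch (not part of the slides) that simply evaluates the formula; the values it prints match the worked examples on the next slide.

    // Hypothetical sketch: evaluating Amdahl's law, speedup = 1 / ((1 - P) + P/N).
    public class Amdahl {
        static double speedup(double p, int n) {
            return 1.0 / ((1.0 - p) + p / n);
        }

        public static void main(String[] args) {
            System.out.println(speedup(0.75, 4));    // about 2.286
            System.out.println(speedup(0.75, 40));   // about 3.721
            System.out.println(1.0 / (1.0 - 0.75));  // the limit as N grows: 4.0
        }
    }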

Consequences of Amdahl’s law
If 75% of a process can be parallelized, and there are four processors, then the possible speedup is 1 / ((1 - 0.75) + 0.75/4) = 2.286
But with 40 processors--ten times as many--the speedup is only 1 / ((1 - 0.75) + 0.75/40) = 3.721
This has led many people (including Amdahl) to conclude that having lots of processors won’t help very much
However.... for many problems, as the data set gets larger:
  The inherently sequential part of the program remains (fairly) constant
  Thus, the sequential proportion (1 - P) becomes smaller
So: the greater the volume of data, the more speedup we can get

Idle time
Idle time results when:
  There is a load imbalance--one process may have much less work to do than another
  A process must wait for access to memory or some other shared resource
    Data in registers is most quickly accessed
    Data in a cache is next most quickly accessed
      A level 1 cache is the fastest, but also the smallest
      A level 2 cache is larger, but slower
    Memory--RAM--is much slower
    Disk access is very much slower

Dependencies
A dependency is when one thread or process requires the result of another thread or process
Example: (a + b) * (c + d)
  The additions can be done in parallel
  The multiplication must wait for the results of the additions
  Of course, at this level, the hardware itself handles the parallelism
Threads or processors that depend on results from other threads or processors must wait for those results
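
At a coarser grain than the hardware, the same dependency structure can be written out explicitly. A sketch (not from the slides) using Java’s CompletableFuture: the two additions run as independent tasks, and the multiplication is forced to wait for both results.

    import java.util.concurrent.CompletableFuture;

    // Hypothetical sketch: (a + b) * (c + d), with the additions done as parallel tasks
    // and the multiplication depending on both of their results.
    public class Dependencies {
        public static void main(String[] args) {
            int a = 1, b = 2, c = 3, d = 4;

            CompletableFuture<Integer> sum1 = CompletableFuture.supplyAsync(() -> a + b);
            CompletableFuture<Integer> sum2 = CompletableFuture.supplyAsync(() -> c + d);

            // thenCombine waits for both additions before multiplying
            int product = sum1.thenCombine(sum2, (x, y) -> x * y).join();
            System.out.println(product);   // 3 * 7 = 21
        }
    }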

Parallelism in Java
Java uses the shared memory model
  There are various competing Java packages (such as Akka and Kilim) to support message passing, but nothing yet in the official Java release
The programming language Erlang has developed the message passing approach
Scala is a Java competitor that supports both approaches
  Scala’s message passing is based on Erlang

Concurrency in Java, I
Java Concurrency in Practice, by Brian Goetz, is the book to have if you need to do much concurrent programming in Java
The following 11 points are from his summary of basic principles:
  It’s the mutable state, stupid!
  Make fields final unless they need to be mutable.
  Immutable objects are automatically thread-safe.
  Encapsulation makes it practical to manage the complexity.
  Guard each mutable variable with a lock.

Concurrency in Java, II
  Guard all variables in an invariant with the same lock.
  Hold locks for the duration of compound actions.
  A program that accesses a mutable variable from multiple threads without synchronization is a broken program.
  Don’t rely on clever reasoning about why you don’t need to synchronize.
  Include thread safety in the design process—or explicitly document that your class is not thread-safe.
  Document your synchronization policy.
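
A minimal sketch (not from Goetz’s book; the class name is invented) of the “guard each mutable variable with a lock” principle: all access to the mutable count goes through synchronized methods, so the write and the read always hold the same lock.

    // Hypothetical sketch: a thread-safe counter whose only mutable state
    // is guarded by the object's intrinsic lock (synchronized).
    public class GuardedCounter {
        private long count = 0;                 // mutable state, guarded by "this"

        public synchronized void increment() {  // hold the lock for the whole compound action
            count = count + 1;
        }

        public synchronized long get() {        // reads are guarded by the same lock
            return count;
        }

        public static void main(String[] args) throws InterruptedException {
            GuardedCounter counter = new GuardedCounter();
            Runnable work = () -> { for (int i = 0; i < 10_000; i++) counter.increment(); };
            Thread t1 = new Thread(work), t2 = new Thread(work);
            t1.start(); t2.start();
            t1.join();  t2.join();
            System.out.println(counter.get());  // always 20000 with the locks in place
        }
    }

Without the synchronized keywords, the two threads could interleave their read-modify-write steps and lose updates, which is exactly the “broken program” the list above warns about.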

Models of concurrency
Shared memory model—as in Java
Actors
  An actor is a lightweight thread with its own private memory
  Actors can send messages to one another
  Messages are received in an actor’s “mailbox” (a queue)
  Actors can monitor the health of other actors
STM, Software Transactional Memory
  A copy is made of the input data to a function
  The function is computed
  The input data is compared to the current values of that data
  If nothing has changed, the function result is stored
  If the input data has changed, the result is discarded and the function is tried again with the new data
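
Java has no built-in STM, but the compare-and-retry idea just described can be sketched with an AtomicReference: read a snapshot, compute a result from it, and store the result only if the input has not changed in the meantime; otherwise retry. This is a hedged sketch of the idea, not a real STM implementation, and the names are invented.

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.function.UnaryOperator;

    // Hypothetical sketch of the STM-style retry loop described above,
    // using compare-and-set on a single reference (not a full STM).
    public class RetryUpdate {
        static <T> T update(AtomicReference<T> ref, UnaryOperator<T> f) {
            while (true) {
                T snapshot = ref.get();          // copy of the input data
                T result = f.apply(snapshot);    // compute the function
                if (ref.compareAndSet(snapshot, result)) {
                    return result;               // input unchanged: keep the result
                }
                // the input changed while we were computing: discard and try again
            }
        }

        public static void main(String[] args) {
            AtomicReference<Integer> balance = new AtomicReference<>(100);
            update(balance, b -> b + 50);
            System.out.println(balance.get());   // 150
        }
    }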

Functional programming
In functional programming (FP):
  A function is a value
    It can be assigned to variables
    It can be passed as an argument to another function
    It can be returned as the result of a function call
    There are much briefer ways of writing a literal function
      Scala example: a => 101 * a
  A function acts like a function in mathematics
    If you call it with the same arguments, you will get the same result. Every time. Guaranteed.
    Functions have no side effects
  Immutable values are strongly emphasized over mutable values
    Some languages, such as Haskell, don’t allow mutable values at all
  Computation proceeds by the application of functions, not by changing the state of mutable variables
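
The same ideas show up in Java 8’s lambdas, mentioned on a later slide. A brief sketch (not from the original deck), mirroring the Scala literal a => 101 * a:

    import java.util.function.IntUnaryOperator;

    // Hypothetical sketch: functions as values in Java 8.
    public class FunctionsAsValues {
        // a function returned as the result of a function call
        static IntUnaryOperator times(int factor) {
            return a -> factor * a;
        }

        public static void main(String[] args) {
            IntUnaryOperator f = a -> 101 * a;        // a function assigned to a variable
            IntUnaryOperator g = times(3);            // a function built by another function
            System.out.println(f.applyAsInt(2));      // 202
            System.out.println(g.applyAsInt(f.applyAsInt(1)));   // 3 * 101 = 303
        }
    }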

Why functional programming?
Here are the three most important reasons that functional programming is better for concurrency than imperative programming:
  Immutable values are automatically thread safe
  Functions have no side effects, so they cannot interfere with other threads
  A function called with the same arguments always gives the same result, so computations can safely be reordered or run in parallel
Why not functional programming?
  Functional languages—Lisp, Haskell, ML, OCaml—have long been regarded as only for ivory-tower academics
  Functional languages are “weird” (meaning: unfamiliar)

What’s happening now?
Moore’s law no longer gives us faster processors: instead of faster processors, we’re now getting more of them
Consequently, parallelism and concurrency have become much more important
After about ten years, CIS 120 is once again starting with OCaml
Python has gotten more functional
Other languages are getting more functional
  Microsoft is starting to promote F# (based on ML)
  Java 8 will have some functional features
  Scala is a hybrid object/functional language based on Java, and is freely available now

The End …for now