University of Washington: What is parallel processing? (Winter 2015 Wrap-up)

Presentation transcript:

What is parallel processing? When can we execute things in parallel?
Parallelism: Use extra resources to solve a problem faster.
Concurrency: Correctly and efficiently manage access to shared resources.

What is parallel processing?
A brief introduction to key ideas of parallel processing:
  - instruction-level parallelism
  - data-level parallelism
  - thread-level parallelism

Exploiting Parallelism
Of the computing problems for which performance is important, many have inherent parallelism.
Computer games:
  - Graphics, physics, sound, AI, etc. can be done separately.
  - Furthermore, there is often parallelism within each of these:
    - The color of each pixel on the screen can be computed independently.
    - Non-contacting objects can be updated/simulated independently.
    - The artificial intelligence of non-human entities can be computed independently.
Search engine queries:
  - Every query is independent.
  - Searches are (pretty much) read-only!

Instruction-Level Parallelism

    add  %r2 <- %r3, %r4
    or   %r2 <- %r2, %r4
    lw   %r6 <- 0(%r4)
    addi %r7 <- %r6, 0x5
    sub  %r8 <- %r8, %r4

Dependencies?
  - RAW: read after write
  - WAW: write after write
  - WAR: write after read

When can we reorder instructions? For example, with the or writing %r5 instead of %r2 and independent instructions reordered:

    add  %r2 <- %r3, %r4
    or   %r5 <- %r2, %r4
    lw   %r6 <- 0(%r4)
    sub  %r8 <- %r8, %r4
    addi %r7 <- %r6, 0x5

When should we reorder instructions?
Superscalar processors: multiple instructions executing in parallel at the *same* stage.

Data Parallelism
Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

Operating on one element at a time.

Data Parallelism with SIMD
Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

Operate on MULTIPLE elements at a time: Single Instruction, Multiple Data (SIMD).
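
To make the idea concrete, below is a minimal sketch of an explicitly vectorized array_add using x86 SSE2 intrinsics. The slides do not show this version; the function name array_add_simd, the 4-elements-per-iteration blocking, and the scalar cleanup loop are illustrative choices, not part of the course material.

    #include <emmintrin.h>   // SSE2 intrinsics (illustrative; assumes an x86 target)

    void array_add_simd(int A[], int B[], int C[], int length) {
      int i;
      // One 128-bit add processes 4 ints at a time: Single Instruction, Multiple Data.
      for (i = 0; i + 4 <= length; i += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)&A[i]);
        __m128i b = _mm_loadu_si128((const __m128i *)&B[i]);
        _mm_storeu_si128((__m128i *)&C[i], _mm_add_epi32(a, b));
      }
      // Scalar cleanup for any leftover elements when length is not a multiple of 4.
      for (; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

In practice, an optimizing compiler can often auto-vectorize a simple loop like array_add into roughly this shape.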

Is it always that easy?
Not always... a more challenging example:

    unsigned sum_array(unsigned *array, int length) {
      unsigned total = 0;
      for (int i = 0; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

Is there parallelism here? Each loop iteration uses data from the previous iteration.

Restructure the code for SIMD...

    // one option...
    unsigned sum_array2(unsigned *array, int length) {
      unsigned total, i;
      unsigned temp[4] = {0, 0, 0, 0};
      // chunks of 4 at a time (note the parentheses: & binds more loosely than <)
      for (i = 0; i < (length & ~0x3); i += 4) {
        temp[0] += array[i];
        temp[1] += array[i+1];
        temp[2] += array[i+2];
        temp[3] += array[i+3];
      }
      // add the 4 sub-totals
      total = temp[0] + temp[1] + temp[2] + temp[3];
      // add the non-4-aligned parts
      for (; i < length; ++i) {
        total += array[i];
      }
      return total;
    }
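
The four sub-totals in sum_array2 are exactly what a SIMD register would hold; as a hedged illustration (again not from the slides), here is the same accumulation written with SSE2 intrinsics:

    #include <emmintrin.h>   // SSE2 intrinsics (illustrative)

    unsigned sum_array_simd(unsigned *array, int length) {
      __m128i acc = _mm_setzero_si128();   // four parallel partial sums
      unsigned temp[4];
      int i;
      for (i = 0; i + 4 <= length; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)&array[i]);
        acc = _mm_add_epi32(acc, v);       // add 4 elements in one instruction
      }
      _mm_storeu_si128((__m128i *)temp, acc);
      unsigned total = temp[0] + temp[1] + temp[2] + temp[3];
      for (; i < length; ++i) {            // non-4-aligned leftovers
        total += array[i];
      }
      return total;
    }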

Mar 13 (Friday the 13th, boo!!): Last lecture... or?

What are threads?
An independent "thread of control" within a process. Like multiple processes within one process, but sharing the same virtual address space.
  - logical control flow
  - program counter
  - stack
  - shared virtual address space (all threads in a process use the same virtual address space)
Lighter-weight than processes:
  - faster context switching
  - the system can support more threads
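
As a concrete sketch of these ideas (not from the slides; the names worker and shared_value are made up for illustration), here is a minimal C program using POSIX threads. Each thread has its own stack and control flow, yet both read the same global variable because they share one virtual address space:

    #include <pthread.h>
    #include <stdio.h>

    int shared_value = 42;   // one copy, visible to every thread in the process

    void *worker(void *arg) {
      long id = (long)arg;   // per-thread argument
      // Each thread has its own stack and program counter, but both threads
      // see the same shared_value at the same address.
      printf("thread %ld sees shared_value = %d at %p\n",
             id, shared_value, (void *)&shared_value);
      return NULL;
    }

    int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, worker, (void *)1L);
      pthread_create(&t2, NULL, worker, (void *)2L);
      pthread_join(t1, NULL);   // wait for both threads to finish
      pthread_join(t2, NULL);
      return 0;
    }

Compile with something like gcc -pthread.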

Thread-level parallelism: multi-core processors
Two (or more) complete processors, fabricated on the same silicon chip.
Execute instructions from two (or more) programs/threads at the same time.
(Figure: IBM Power5 die photo, cores #1 and #2.)

Multi-cores are everywhere (circa 2015)
Laptops, desktops, servers: most any machine from the past few years has at least 2 cores.
Game consoles:
  - Xbox 360: 3 PowerPC cores; Xbox One: 8 AMD cores
  - PS3: 9 Cell cores (1 master; 8 special SIMD cores); PS4: 8 custom AMD x86-64 cores
  - Wii U: 2 Power cores
Smartphones:
  - iPhone 4S, 5: dual-core ARM CPUs
  - Galaxy S II, III, IV: dual-core ARM or Snapdragon
  - ...
Your home automation nodes...

Why Multicores Now?
The number of transistors we can put on a chip keeps growing exponentially...
But single-core performance is no longer growing along with transistor count.
So let's use those transistors to add more cores and do more at once...

What happens if we run this program on a multicore?

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

As programmers, do we care?
(Figure: the loop's work split across cores #1 and #2.)

What if we want one program to run on multiple processors (cores)?
We have to explicitly tell the machine exactly how to do this.
  - This is called parallel programming or concurrent programming.
There are many parallel/concurrent programming models.
  - We will look at a relatively simple one: fork-join parallelism (see the sketch below).
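
One widely used realization of fork-join parallelism in C is OpenMP; the sketch below is an illustrative assumption (the slides do not mandate OpenMP). The pragma forks a team of threads, divides the loop iterations among them, and joins them at an implicit barrier when the loop ends:

    #include <omp.h>   // OpenMP (illustrative; compile with -fopenmp)

    void array_add_parallel(int A[], int B[], int C[], int length) {
      // fork: threads each take a chunk of iterations; join: implicit barrier at loop end
      #pragma omp parallel for
      for (int i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];   // iterations are independent, so this is safe
      }
    }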

How does this help performance?
Parallel speedup measures the improvement from parallelization:

    speedup(p) = time for best serial version / time for version with p processors

What can we realistically expect?

Reason #1: Amdahl's Law
In general, the whole computation is not (easily) parallelizable.
Serial regions limit the potential parallel speedup.
(Figure: execution timeline alternating between serial regions and parallel regions.)

Reason #1: Amdahl's Law
Suppose a program takes 1 unit of time to execute serially.
A fraction of the program, s, is inherently serial (unparallelizable):

    New Execution Time = (1 - s)/p + s

For example, consider a program that, when executing on one processor, spends 10% of its time in a non-parallelizable region. How much faster will this program run on a 3-processor system?

    New Execution Time = 0.9T/3 + 0.1T = 0.4T, so Speedup = T / 0.4T = 2.5

What is the maximum speedup from parallelization? As p grows, the parallel term vanishes, so the speedup is limited to 1/s = 10.
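
A tiny helper (purely illustrative, not part of the slides) that evaluates Amdahl's Law and reproduces the numbers above for s = 0.1:

    #include <stdio.h>

    // Amdahl's Law: speedup(p) = 1 / (s + (1 - s) / p)
    double amdahl_speedup(double s, double p) {
      return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
      printf("p = 3:       %.2f\n", amdahl_speedup(0.1, 3.0));        // 2.50
      printf("p = 1000000: %.2f\n", amdahl_speedup(0.1, 1000000.0));  // approaches 1/s = 10
      return 0;
    }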

Reason #2: Overhead
Forking and joining is not instantaneous:
  - it involves communicating between processors
  - it may involve calls into the operating system, depending on the implementation

    New Execution Time = (1 - s)/P + s + overhead(P)

Multicore: what should worry us?
Concurrency: what if we're sharing resources, memory, etc.?
Cache Coherence
  - What if two cores have the same data in their own caches? How do we keep those copies in sync?
Memory Consistency, Ordering, Interleaving, Synchronization...
  - With multiple cores, we can have truly concurrent execution of threads. In what order do their memory accesses appear to happen? Do the orders seen by different cores/threads agree?
Concurrency Bugs
  - When it all goes wrong... hard to reproduce, hard to debug (see the data-race sketch below).
  - See: ...about-shared-variables-or-memory-models/fulltext
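
To make the "concurrency bugs" bullet concrete, here is a classic example, sketched for illustration rather than taken from the slides: two threads increment a shared counter. Without the mutex, each counter++ is really a load, an add, and a store, so increments from the two threads can interleave and be lost:

    #include <pthread.h>
    #include <stdio.h>

    long counter = 0;   // shared between both threads
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *increment(void *arg) {
      (void)arg;
      for (int i = 0; i < 1000000; ++i) {
        // counter++;                  // DATA RACE if done without the lock
        pthread_mutex_lock(&lock);     // fix: make each update atomic w.r.t. the other thread
        counter++;
        pthread_mutex_unlock(&lock);
      }
      return NULL;
    }

    int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, increment, NULL);
      pthread_create(&t2, NULL, increment, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("counter = %ld (expect 2000000)\n", counter);
      return 0;
    }

Remove the lock and the final count will usually come out short, and differently on every run, which is exactly why such bugs are hard to reproduce and debug.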

Summary
Multicore: more than one processor on the same chip.
  - Almost all devices now have multicore processors.
  - Results from Moore's Law and the power constraint.
Exploiting multicore requires parallel programming.
  - Automatically extracting parallelism is too hard for the compiler, in general.
  - But we can have the compiler do much of the bookkeeping for us.
Fork-Join model of parallelism:
  - At a parallel region, fork a bunch of threads, do the work in parallel, and then join, continuing with just one thread.
  - Expect a speedup of less than P on P processors.
  - Amdahl's Law: speedup is limited by the serial portion of the program.
  - Overhead: forking and joining are not free.
Take 332, 451, 471 to learn more!

"Innovation is most likely to occur at the intersection of multiple fields or areas of interest."
(Figure: overlapping circles for OS, Biology, ML, PL, Hardware, Architecture.)

My current research themes:
  - Approximate computing: trading off output quality for better energy efficiency and performance.
  - Data- and I/O-centric systems: systems for large-scale data analysis (graphs, images, etc.).
  - New applications and technologies: non-volatile memory, DNA-based architectures, energy-harvesting systems, etc.

Image, sound, and video processing; image rendering; sensor data analysis and computer vision; simulations, games, search, machine learning:
  - These applications consume a lot (most?) of our cycles/bytes/bandwidth!
  - Often input data is inexact by nature (from sensors).
  - They have multiple acceptable outputs.
  - They do not require "perfect execution".

A big idea: Let computers be sloppy!

"Approximate computing": trade accuracy for performance and resource usage (e.g., energy).
Sources of systematic accuracy loss:
  - Unsound code transformations: ~2X
  - Unreliable, probabilistic hardware (near/sub-threshold, etc.): ~5X
  - Fundamentally different, inherently inaccurate execution models (e.g., neural networks, analog computing): ~10-100X

But approximation needs to be done carefully... or...

“Disciplined” approximate programming PreciseApproximate ✗ ✗ ✓ ✓ references jump targets JPEG header pixel data neuron weights audio samples video frames

Disciplined Approximate Programming (EnerJ, EnerC, ...)
Goal: support a wide range of approximation techniques with a single unified abstraction.
(Figure: a code example with some variables precise and others approximate, layered over the system stack:)
  - Relaxed algorithms
  - Aggressive compilation
  - Approximate data storage
  - Variable-accuracy ISA / ALU
  - Approximate logic/circuits
  - Variable-quality wireless communication

Collaboration with DNA Nanotech
Applying architecture concepts to nano-scale molecular circuits. Really new and exciting!
Lots of applications: diagnostics, intelligent drugs, art, etc. And now storage!
