CS61C L20 Thread Level Parallelism I (1) Garcia, Spring 2013 © UCB Senior Lecturer SOE Dan Garcia inst.eecs.berkeley.edu/~cs61c
CS61C : Machine Structures – Lecture 20: Thread Level Parallelism I
Wireless "Matrix" device: A team at Brown University has developed a subdermal implant of a "battery, copper coil for recharging, wireless radio, infrared transmitters, and custom ICs in a small, leak-proof, body-friendly container 2 inches long." A 100-electrode neuron-reading chip is implanted directly in the brain.

CS61C L20 Thread Level Parallelism I (2) Garcia, Spring 2013 © UCB
Review
Flynn Taxonomy of Parallel Architectures
– SIMD: Single Instruction Multiple Data
– MIMD: Multiple Instruction Multiple Data
– SISD: Single Instruction Single Data
– MISD: Multiple Instruction Single Data (unused)
Intel SSE SIMD Instructions (SSE = Streaming SIMD Extensions)
– One instruction fetch that operates on multiple operands simultaneously
– 128-bit XMM registers

CS61C L20 Thread Level Parallelism I (3) Garcia, Spring 2013 © UCB
Background: Threads
A thread ("thread of execution") is a single stream of instructions.
– A program / process can split, or fork, itself into separate threads, which can (in theory) execute simultaneously.
– An easy way to describe/think about parallelism.
A single CPU can execute many threads by time-division multiplexing.
Multithreading is running multiple threads through the same hardware.
(Figure: CPU time sliced among Thread 0, Thread 1, and Thread 2.)
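As an aside (not from the original slides), here is a minimal fork/join sketch with POSIX pthreads; the worker function name and NTHREADS count are hypothetical, and the program is built with gcc -pthread:

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 3   // hypothetical count, matching the 3 threads in the diagram

// each thread runs its own stream of instructions
void *worker(void *arg) {
    long id = (long) arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    // fork: the process splits into separate threads
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *) i);
    // join: wait for all threads to finish
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}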

CS61C L20 Thread Level Parallelism I (4) Garcia, Spring 2013 © UCB
Agenda
SSE Instructions in C
Multiprocessor
"Although threads seem to be a small step from sequential computation, in fact, they represent a huge step. They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism. Threads, as a model of computation, are wildly non-deterministic, and the job of the programmer becomes one of pruning that nondeterminism." – The Problem with Threads, Edward A. Lee, UC Berkeley, 2006

CS61C L20 Thread Level Parallelism I (5) Garcia, Spring 2013 © UCB
Intel SSE Intrinsics
Intrinsics are C functions and procedures for inserting assembly language, including SSE instructions.
– With intrinsics, you can program using these instructions indirectly.
– There is a one-to-one correspondence between SSE instructions and intrinsics.

CS61C L20 Thread Level Parallelism I (6) Garcia, Spring 2013 © UCB
Example SSE Intrinsics
Vector data type: __m128d
Intrinsics:               Corresponding SSE instructions:
Load and store operations:
_mm_load_pd     MOVAPD / aligned, packed double
_mm_store_pd    MOVAPD / aligned, packed double
_mm_loadu_pd    MOVUPD / unaligned, packed double
_mm_storeu_pd   MOVUPD / unaligned, packed double
Load and broadcast across vector:
_mm_load1_pd    MOVSD + shuffling/duplicating
Arithmetic:
_mm_add_pd      ADDPD / add, packed double
_mm_mul_pd      MULPD / multiply, packed double
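A minimal usage sketch of the intrinsics above (not from the slides; it assumes an SSE2-capable x86 machine and the <emmintrin.h> header):

#include <stdio.h>
#include <emmintrin.h>   // SSE2 intrinsics: __m128d, _mm_*_pd

int main(void) {
    double x[2] __attribute__ ((aligned (16))) = {1.0, 2.0};
    double y[2] __attribute__ ((aligned (16))) = {3.0, 4.0};
    double r[2] __attribute__ ((aligned (16)));
    __m128d a = _mm_load_pd(x);       // a = [1.0 | 2.0]  (MOVAPD, aligned)
    __m128d b = _mm_load_pd(y);       // b = [3.0 | 4.0]
    __m128d s = _mm_load1_pd(&y[0]);  // s = [3.0 | 3.0]  (broadcast)
    __m128d c = _mm_add_pd(_mm_mul_pd(a, s), b);  // c = [1*3+3 | 2*3+4] = [6 | 10]
    _mm_store_pd(r, c);               // store back (MOVAPD, aligned)
    printf("%g %g\n", r[0], r[1]);    // prints: 6 10
    return 0;
}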

CS61C L20 Thread Level Parallelism I (7) Garcia, Spring 2013 © UCB
Example: 2 x 2 Matrix Multiply
Definition of Matrix Multiply: C(i,j) = (A×B)(i,j) = Σ (k = 1 to 2) A(i,k) × B(k,j)
[A(1,1) A(1,2); A(2,1) A(2,2)] x [B(1,1) B(1,2); B(2,1) B(2,2)] =
  C(1,1) = A(1,1)B(1,1) + A(1,2)B(2,1)    C(1,2) = A(1,1)B(1,2) + A(1,2)B(2,2)
  C(2,1) = A(2,1)B(1,1) + A(2,2)B(2,1)    C(2,2) = A(2,1)B(1,2) + A(2,2)B(2,2)
Numeric example: [1 0; 0 1] x [1 3; 2 4]:
  C(1,1) = 1*1 + 0*2 = 1    C(1,2) = 1*3 + 0*4 = 3
  C(2,1) = 0*1 + 1*2 = 2    C(2,2) = 0*3 + 1*4 = 4
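For reference, a plain scalar C version of this column-major 2 x 2 multiply (a sketch, not from the slides; the function name dgemm_2x2 is made up) that the SSE version on the following slides can be checked against:

// A, B, C are 2x2 matrices stored column-major in arrays of 4 doubles,
// so element (i,j) (1-based) lives at index (i-1) + (j-1)*2
void dgemm_2x2(const double *A, const double *B, double *C) {
    for (int j = 0; j < 2; j++)            // column of C
        for (int i = 0; i < 2; i++) {      // row of C
            double cij = 0.0;
            for (int k = 0; k < 2; k++)    // C(i,j) = sum over k of A(i,k)*B(k,j)
                cij += A[i + k*2] * B[k + j*2];
            C[i + j*2] = cij;
        }
}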

CS61C L20 Thread Level Parallelism I (8) Garcia, Spring 2013 © UCB
Example: 2 x 2 Matrix Multiply
Using the XMM registers – 64-bit double precision, two doubles per XMM reg:
C1 = [C(1,1) | C(2,1)],  C2 = [C(1,2) | C(2,2)]   (C stored in memory in column order)
B1 = [B(i,1) | B(i,1)],  B2 = [B(i,2) | B(i,2)]   (each B element duplicated in both halves)
A  = [A(1,i) | A(2,i)]

CS61C L20 Thread Level Parallelism I (9-10) Garcia, Spring 2013 © UCB
Example: 2 x 2 Matrix Multiply – Initialization, i = 1
C1 = [0 | 0],  C2 = [0 | 0]
_mm_load_pd: load 2 doubles into an XMM reg; operands stored in memory in column order
A  = [A(1,1) | A(2,1)]
_mm_load1_pd: SSE instruction that loads a double word and stores it in the high and low double words of the XMM register (duplicates the value in both halves)
B1 = [B(1,1) | B(1,1)],  B2 = [B(1,2) | B(1,2)]

CS61C L20 Thread Level Parallelism I (11) Garcia, Spring 2013 © UCB
Example: 2 x 2 Matrix Multiply – First iteration intermediate result, i = 1
c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
The SSE instructions first do parallel multiplies and then parallel adds in the XMM registers:
C1 = [0 + A(1,1)B(1,1) | 0 + A(2,1)B(1,1)],  C2 = [0 + A(1,1)B(1,2) | 0 + A(2,1)B(1,2)]
with A = [A(1,1) | A(2,1)], B1 = [B(1,1) | B(1,1)], B2 = [B(1,2) | B(1,2)] loaded as above.

CS61C L20 Thread Level Parallelism I (12) Garcia, Spring 2013 © UCB
Example: 2 x 2 Matrix Multiply – First iteration result, operands for i = 2 loaded
C1 = [0 + A(1,1)B(1,1) | 0 + A(2,1)B(1,1)],  C2 = [0 + A(1,1)B(1,2) | 0 + A(2,1)B(1,2)]
_mm_load_pd and _mm_load1_pd now bring in the i = 2 operands (column order in memory, B duplicated in both halves):
A = [A(1,2) | A(2,2)],  B1 = [B(2,1) | B(2,1)],  B2 = [B(2,2) | B(2,2)]

CS61C L20 Thread Level Parallelism I (13) Garcia, Spring 2013 © UCB
Example: 2 x 2 Matrix Multiply – Second iteration intermediate result, i = 2
After the second c1/c2 update (parallel multiplies, then parallel adds), the registers hold the finished products:
C1 = [A(1,1)B(1,1) + A(1,2)B(2,1) | A(2,1)B(1,1) + A(2,2)B(2,1)] = [C(1,1) | C(2,1)]
C2 = [A(1,1)B(1,2) + A(1,2)B(2,2) | A(2,1)B(1,2) + A(2,2)B(2,2)] = [C(1,2) | C(2,2)]

CS61C L20 Thread Level Parallelism I (14) Garcia, Spring 2013 © UCB
Example: 2 x 2 Matrix Multiply (recap of the definition and numeric example from slide 7)

CS61C L20 Thread Level Parallelism I (15) Garcia, Spring 2013 © UCB
Example: 2 x 2 Matrix Multiply (Part 1 of 2)

#include <stdio.h>      // for printf (header name restored; it was lost in transcription)
#include <emmintrin.h>  // header file for SSE compiler intrinsics (name restored likewise)

// NOTE: vector registers will be represented in comments as v1 = [a | b]
// where v1 is a variable of type __m128d and a, b are doubles

int main(void) {
  // allocate A, B, C aligned on 16-byte boundaries
  double A[4] __attribute__ ((aligned (16)));
  double B[4] __attribute__ ((aligned (16)));
  double C[4] __attribute__ ((aligned (16)));
  int lda = 2;
  int i = 0;
  // declare several 128-bit vector variables
  __m128d c1, c2, a, b1, b2;

  // Initialize A, B, C for the example (values per slide 7)
  /* A = 1 0    (note column order!)
         0 1 */
  A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;
  /* B = 1 3    (note column order!)
         2 4 */
  B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;
  /* C = 0 0    (note column order!)
         0 0 */
  C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;

CS61C L20 Thread Level Parallelism I (16) Garcia, Spring 2013 © UCB
Example: 2 x 2 Matrix Multiply (Part 2 of 2)

  // use aligned loads to set
  // c1 = [c_11 | c_21]
  c1 = _mm_load_pd(C+0*lda);
  // c2 = [c_12 | c_22]
  c2 = _mm_load_pd(C+1*lda);

  for (i = 0; i < 2; i++) {
    /* a =
       i = 0: [a_11 | a_21]
       i = 1: [a_12 | a_22] */
    a = _mm_load_pd(A+i*lda);
    /* b1 =
       i = 0: [b_11 | b_11]
       i = 1: [b_21 | b_21] */
    b1 = _mm_load1_pd(B+i+0*lda);
    /* b2 =
       i = 0: [b_12 | b_12]
       i = 1: [b_22 | b_22] */
    b2 = _mm_load1_pd(B+i+1*lda);
    /* c1 =
       i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
       i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21] */
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    /* c2 =
       i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
       i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22] */
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
  }

  // store c1, c2 back into C for completion
  _mm_store_pd(C+0*lda, c1);
  _mm_store_pd(C+1*lda, c2);

  // print C
  printf("%g,%g\n%g,%g\n", C[0], C[2], C[1], C[3]);
  return 0;
}
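A hedged usage note (not from the slides): assuming the two parts are saved as sseMM.c (a hypothetical file name), the program can be built and run on an SSE2-capable x86 machine roughly as follows; since A is the identity matrix, the printed C simply reproduces B:

$ gcc -O2 -msse2 -o sseMM sseMM.c
$ ./sseMM
1,3
2,4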

CS61C L20 Thread Level Parallelism I (17) Garcia, Spring 2013 © UCB
Inner loop from gcc –O -S

L2: movapd  (%rax,%rsi), %xmm1   // Load aligned A[i,i+1] -> m1
    movddup (%rdx), %xmm0        // Load B[j], duplicate -> m0
    mulpd   %xmm1, %xmm0         // Multiply m0*m1 -> m0
    addpd   %xmm0, %xmm3         // Add m0+m3 -> m3
    movddup 16(%rdx), %xmm0      // Load B[j+1], duplicate -> m0
    mulpd   %xmm0, %xmm1         // Multiply m0*m1 -> m1
    addpd   %xmm1, %xmm2         // Add m1+m2 -> m2
    addq    $16, %rax            // rax+16 -> rax (i+=2)
    addq    $8, %rdx             // rdx+8 -> rdx (j+=1)
    cmpq    $32, %rax            // rax == 32?
    jne     L2                   // jump to L2 if not equal
    movapd  %xmm3, (%rcx)        // store aligned m3 into C[k,k+1]
    movapd  %xmm2, (%rdi)        // store aligned m2 into C[l,l+1]

CS61C L20 Thread Level Parallelism I (18) Garcia, Spring 2013 © UCB
You Are Here!
Levels at which software and hardware harness parallelism to achieve high performance:
– Parallel Requests: assigned to a computer, e.g., search "Katz"
– Parallel Threads: assigned to a core, e.g., lookup, ads
– Parallel Instructions: >1 instruction at one time, e.g., 5 pipelined instructions
– Parallel Data: >1 data item at one time, e.g., add of 4 pairs of words (A3+B3, A2+B2, A1+B1, A0+B0)
– Hardware descriptions: all gates functioning in parallel at the same time
(Figure: hierarchy from smart phone / warehouse-scale computer down through computer, core, memory (cache), input/output, main memory, instruction unit(s), functional unit(s), and logic gates, with Project 3 highlighted.)

CS61C L20 Thread Level Parallelism I (19) Garcia, Spring 2013 © UCB
Review
Intel SSE SIMD Instructions
– One instruction fetch that operates on multiple operands simultaneously
SSE Instructions in C
– Can embed the SSE machine instructions directly into C programs through the use of intrinsics

CS61C L20 Thread Level Parallelism I (20) Garcia, Spring 2013 © UCB
Parallel Processing: Multiprocessor Systems (MIMD)
Multiprocessor (MIMD): a computer system with at least 2 processors.
1. Deliver high throughput for independent jobs via job-level parallelism.
2. Improve the run time of a single program that has been specially crafted to run on a multiprocessor – a parallel processing program.
We now use the term core for processor ("multicore"), because "multiprocessor microprocessor" is too redundant.
(Figure: processors with caches connected through an interconnection network to memory and I/O.)

CS61C L20 Thread Level Parallelism I (21) Garcia, Spring 2013 © UCB
Transition to Multicore
(Figure: sequential application performance over time.)

CS61C L20 Thread Level Parallelism I (22) Garcia, Spring 2013 © UCB
Multiprocessors and You
Only path to performance is parallelism
– Clock rates flat or declining
– SIMD: 2X width every 3-4 years; 128b wide now, 256b in 2011, 512b in 2014?, 1024b in 2018? Advanced Vector Extensions (AVX) are 256 bits wide!
– MIMD: add 2 cores every 2 years: 2, 4, 6, 8, 10, …
A key challenge is to craft parallel programs that have high performance on multiprocessors as the number of processors increases – i.e., that scale
– Scheduling, load balancing, time for synchronization, overhead for communication
Will explore this further in labs and projects.
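As a sketch of what those 256-bit Advanced Vector Extensions look like from C (not from the slides; it assumes an AVX-capable CPU, the <immintrin.h> header, and gcc -mavx), the SSE pattern carries over with four doubles per register:

#include <stdio.h>
#include <immintrin.h>   // AVX intrinsics: __m256d, _mm256_*_pd

int main(void) {
    double x[4] __attribute__ ((aligned (32))) = {1, 2, 3, 4};
    double y[4] __attribute__ ((aligned (32))) = {10, 20, 30, 40};
    double r[4] __attribute__ ((aligned (32)));
    __m256d a = _mm256_load_pd(x);             // four doubles per 256-bit register
    __m256d b = _mm256_load_pd(y);
    _mm256_store_pd(r, _mm256_add_pd(a, b));   // r = [11 | 22 | 33 | 44]
    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
    return 0;
}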

CS61C L20 Thread Level Parallelism I (23) Garcia, Spring 2013 © UCB
Parallel Performance Over Time
(Table with columns Year | Cores | SIMD bits/Core | Cores × SIMD bits | Peak DP FLOPs; the row data did not survive transcription.)

CS61C L20 Thread Level Parallelism I (24) Garcia, Spring 2013 © UCB
Multiprocessor Key Questions
Q1 – How do they share data?
Q2 – How do they coordinate?
Q3 – How many processors can be supported?

CS61C L20 Thread Level Parallelism I (25) Garcia, Spring 2013 © UCB
Shared Memory Multiprocessor (SMP)
Q1 – Single address space shared by all processors/cores.
Q2 – Processors coordinate/communicate through shared variables in memory (via loads and stores).
– Use of shared data must be coordinated via synchronization primitives (locks) that allow access to data to only one processor at a time.
All multicore computers today are SMP.
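A minimal sketch of such a lock (not from the slides), using a POSIX mutex so that only one core at a time updates a shared variable; build with gcc -pthread:

#include <stdio.h>
#include <pthread.h>

static long shared_count = 0;   // shared variable in the single address space
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

// each thread bumps the shared counter 100000 times
void *worker(void *arg) {
    (void) arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    // only one core at a time past this point
        shared_count++;               // the load/add/store sequence is now race-free
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld\n", shared_count);    // 200000: every increment preserved
    return 0;
}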

CS61C L20 Thread Level Parallelism I (26) Garcia, Spring 2013 © UCB
Example: Sum Reduction
Sum 100,000 numbers on a 100-processor SMP
– Each processor has an ID: 0 ≤ Pn ≤ 99
– Partition: 1000 numbers per processor
– Initial summation on each processor [Phase I]:
    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
      sum[Pn] = sum[Pn] + A[i];
Now need to add these partial sums [Phase II]
– Reduction: divide and conquer
– Half the processors add pairs, then a quarter, …
– Need to synchronize between reduction steps

CS61C L20 Thread Level Parallelism I (27) Garcia, Spring 2013 © UCB
Example: Sum Reduction – Second Phase
After each processor has computed its "local" sum (remember, all processors are sharing the same memory):
half = 100;
repeat
  synch();
  if (half%2 != 0 && Pn == 0)
    sum[0] = sum[0] + sum[half-1];
    /* Conditional sum needed when half is odd;
       Processor 0 gets the missing element */
  half = half/2;   /* dividing line on who sums */
  if (Pn < half)
    sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
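Translated into compilable C with POSIX threads (a sketch, not the slides' code: it assumes sum[] already holds the Phase I partial sums and that the barrier has been initialized elsewhere with pthread_barrier_init(&barrier, NULL, P)), each thread Pn would run:

#include <pthread.h>

#define P 100                       // number of processors/threads, as on the slide

double sum[P];                      // Phase I partial sums (filled in elsewhere)
pthread_barrier_t barrier;          // assumed initialized for P participants

// Phase II, executed by every thread Pn in parallel; mirrors the slide's pseudocode
void reduce(int Pn) {
    int half = P;
    do {
        pthread_barrier_wait(&barrier);    // synch(): wait until the previous step is done
        if (half % 2 != 0 && Pn == 0)
            sum[0] += sum[half - 1];       // when half is odd, P0 picks up the missing element
        half = half / 2;                   // dividing line on who sums
        if (Pn < half)
            sum[Pn] += sum[Pn + half];
    } while (half != 1);                   // until (half == 1)
}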

CS61C L20 Thread Level Parallelism I (28-29) Garcia, Spring 2013 © UCB
An Example with 10 Processors
(Figure: reduction tree over sum[P0] … sum[P9].)
half = 10: all ten partial sums sum[P0] … sum[P9] exist, one per processor.
half = 5: P0–P4 each add sum[Pn+5] into sum[Pn]; since 5 is odd, P0 then also absorbs sum[4].
half = 2: P0 and P1 each add sum[Pn+2] into sum[Pn].
half = 1: P0 adds sum[1] into sum[0], producing the final total.

CS61C L20 Thread Level Parallelism I (30) Garcia, Spring 2013 © UCB
So, In Conclusion…
Sequential software is slow software
– SIMD and MIMD are the only path to higher performance
SSE intrinsics allow SIMD instructions to be invoked from C programs
Multiprocessors (multicore) use shared memory (a single address space)