IT 251 Computer Organization and Architecture


Optimization and Parallelism
Modified by R. Helps from original notes by Chia-Chi Teng

Pipelining vs. Parallel processing
In both cases, multiple "things" are processed by multiple "functional units":
- Pipelining: each thing is broken into a sequence of pieces, where each piece is handled by a different (specialized) functional unit.
- Parallel processing: each thing is processed entirely by a single functional unit.
We will briefly introduce the key ideas behind parallel processing:
- instruction-level parallelism
- thread-level parallelism

Parallelism depends on scale
- Multiple computers working in concert
  - Modern supercomputers ("Beowulf" architecture), multiple network paths; chunk size is a complete program module
  - Capstone project to use idle cycles on campus (install a virtual machine)
  - SETI@home and its offspring
- Parallelism within a computer, multiple cores: chunk size is a subroutine or small app
- Parallelism within a computer, single core: multiple execution units within the core; chunk size is at the instruction level
- Parallelism in a GPU: optimized for many similar math operations (linear algebra and DSP); chunk size is a single math operation

Exploiting Parallelism
Of the computing problems for which performance is important, many have inherent parallelism.
- Best example: computer games
  - Graphics, physics, sound, AI etc. can be done separately
  - Furthermore, there is often parallelism within each of these:
    - The color of each pixel on the screen can be computed independently
    - Non-contacting objects can be updated/simulated independently
    - Artificial intelligence of non-human entities can be done independently
- Another example: Google queries
  - Every query is independent
  - Google search is read-only!! Read-only means no synchronization problems

Parallelism at the Instruction Level

    add  eax <- eax, edx
    or   eax <- eax, ebx
    mov  edx <- 12(esp)
    addi edx <- edx, eax
    sub  ecx <- ecx, 1

Dependences? RAW, WAW, WAR; the pipeline stalls on them.
When can we reorder instructions? When should we reorder instructions?
Solution: multiple instructions executing in parallel at the *same* stage:

    add  eax <- eax, edx
    or   eax <- eax, ebx
    mov  edx <- 12(esp)
    sub  ecx <- ecx, 1
    addi edx <- edx, eax

Modern CPU Design
(figure: block diagram of a modern out-of-order CPU. An instruction-control section with fetch control, instruction cache, instruction decode, a retirement unit, and a register file feeds an execution section with functional units for integer/branch, general integer, FP add, FP mult/div, load, and store, all connected to a data cache.)
Note the separate execution units for integer and floating point. Also note that some modern processors have multiple parallel units for some operations (see textbook ch 5).

Superscalar Processor
Definition: a superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.
Benefit: without programming effort, a superscalar processor can take advantage of the instruction-level parallelism present in most programs.
Most CPUs since about 1998 are superscalar; for Intel, since the Pentium Pro.

Superscalar Hardware

Multicycle FP Operations
One approach is multiple functional units: special-purpose hardware for FP operations. The increased latency causes more frequent stalls.
(figure: pipeline with F/D/M/W stages, showing a single-cycle integer unit, a fully pipelined multiplier, a fully pipelined FP adder, and a non-pipelined divider)
The non-pipelined divider can cause "structural" hazards.

Dynamic scheduling
Out-of-order execution engine: one view (Pentium 4).
Fetching, decoding, and translating x86 instructions into uops supports precise exceptions and recovery from mispredicted branches.
Image from www.xbitlabs.com

Exploiting Parallelism at the Data Level
Consider adding together two arrays (linear algebra):

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

Operating on one element at a time.

Exploiting Parallelism at the Data Level (SIMD)
Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

Now operate on MULTIPLE elements per instruction: Single Instruction, Multiple Data (SIMD).

Intel SSE/SSE2 as an example of SIMD
Added new 128-bit registers (XMM0 through XMM7), each of which can store:
- 4 single-precision FP values (SSE): 4 * 32b
- 2 double-precision FP values (SSE2): 2 * 64b
- 16 byte values (SSE2): 16 * 8b
- 8 word values (SSE2): 8 * 16b
- 4 double-word values (SSE2): 4 * 32b
- 1 128-bit integer value (SSE2): 1 * 128b
(figure: four 32-bit ADDs performed in parallel across the lanes of two XMM registers)
SIMD = Single Instruction, Multiple Data. SSE = Streaming SIMD Extensions (Intel, 1999).
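As a hypothetical illustration (not from the original slides), the four parallel single-precision adds pictured above can be expressed with Intel's SSE intrinsics. This is a minimal sketch, assuming a compiler that provides xmmintrin.h; the function name add4 is our own:

    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

    /* Add two groups of four floats with one SSE add instruction. */
    void add4(const float *a, const float *b, float *out) {
        __m128 va = _mm_loadu_ps(a);      /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* 4 single-precision adds at once */
        _mm_storeu_ps(out, vc);           /* store the 4 results */
    }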

Is it always that easy? Not always... a more challenging example:

    unsigned sum_array(unsigned *array, int length) {
      unsigned total = 0;
      for (int i = 0; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

Is there parallelism here?

We first need to restructure the code:

    unsigned sum_array2(unsigned *array, int length) {
      unsigned total, i;
      unsigned temp[4] = {0, 0, 0, 0};
      for (i = 0; i < (length & ~0x3); i += 4) {
        temp[0] += array[i];
        temp[1] += array[i+1];
        temp[2] += array[i+2];
        temp[3] += array[i+3];
      }
      total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

- length & ~0x3 masks off the lower two bits (count by fours). The parentheses are needed because & binds more loosely than < in C.
- All four temp[] accumulators can add in parallel.
- The last loop handles the up-to-3 leftover elements.

Then we can write SIMD code for the hot part
The first loop in sum_array2 above is the hot part: its four independent additions map directly onto a single SIMD add (see the sketch below).
What if there are 8 processors available?
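One possible SSE2 version of the hot loop, as a sketch: sum_array_sse2 is our own name, and it assumes SSE2 with the standard emmintrin.h intrinsics. Four partial sums live in one XMM register, playing the role of temp[0..3]:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    unsigned sum_array_sse2(const unsigned *array, int length) {
      __m128i vsum = _mm_setzero_si128();   /* four 32-bit partial sums */
      int i;
      for (i = 0; i + 4 <= length; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)&array[i]);
        vsum = _mm_add_epi32(vsum, v);      /* 4 adds per instruction */
      }
      unsigned temp[4];
      _mm_storeu_si128((__m128i *)temp, vsum);
      unsigned total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i)              /* leftovers, as before */
        total += array[i];
      return total;
    }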

Thread-level parallelism: Multi-Core Processors
Two (or more) complete processors, fabricated on the same silicon chip.
They execute instructions from two (or more) programs/threads at the same time.
(figure: IBM Power5 die with two cores, labeled #1 and #2)

Multi-Cores are Everywhere
- Intel Core i7: 2, 4, or 6 x86 processors on the same chip
- Xbox 360: 3 PowerPC cores
- Sony PlayStation 3: Cell processor, an asymmetric multi-core with 9 cores (1 general-purpose, 8 special-purpose SIMD processors)
Exercise: how many parallel processors are in a GPU card?

Why Multi-cores Now? The number of transistors we can put on a chip is growing exponentially…

… and performance is growing too… But power is growing even faster!! Power has become the limiting factor in current chips.

As programmers, do we care? What happens if we run this program on a multi-core?

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

We can't easily split the work between cores at the instruction level.

What if we want a program to run on both processors?
- We have to explicitly tell the machine exactly how to do this. This is called parallel programming or concurrent programming.
- There are many parallel/concurrent programming models.
- We will look at a relatively simple one: fork-join parallelism (POSIX threads and explicit synchronization).

Review: Amdahl's Law
Execution time after improvement = (time affected by improvement / amount of improvement) + time unaffected by improvement
Example: many computing tasks are I/O-bound, and the speed of the input and output devices limits the overall system performance. Improved CPU performance alone has a limited effect on overall system speed.
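In symbols, with a worked example (the 10 s / 4 s / 6 s numbers are hypothetical, chosen to echo the I/O-bound case above: a program spends 4 s on CPU and 6 s on I/O, and the CPU gets 2x faster):

\[
T_{\text{new}} = \frac{T_{\text{affected}}}{\text{improvement}} + T_{\text{unaffected}}
= \frac{4\,\text{s}}{2} + 6\,\text{s} = 8\,\text{s},
\qquad
\text{speedup} = \frac{10\,\text{s}}{8\,\text{s}} = 1.25\times
\]

A 2x faster CPU buys only a 1.25x faster program, because I/O dominates.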

Fork/Join Logical Example
- Fork N-1 threads
- Break the work into N pieces (and do it)
- Join the N-1 threads

    void array_add(int A[], int B[], int C[], int length) {
      cpu_num = fork(N-1);
      int i;
      for (i = cpu_num; i < length; i += N) {
        C[i] = A[i] + B[i];
      }
      join();
    }

This is simplified pseudocode for fork(): fork() creates a separate thread, which can run on a separate core (a real POSIX-threads version is sketched below).
How good is this with caches?
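A minimal runnable sketch of the same fork/join pattern using POSIX threads (the model named a few slides back). The names worker and arg_t and the fixed N = 4 are our own choices, error checking is omitted; compile with -pthread:

    #include <pthread.h>

    #define N 4   /* number of threads; assumed fixed for this sketch */

    typedef struct { int *A, *B, *C; int length; int id; } arg_t;

    static void *worker(void *p) {
        arg_t *a = (arg_t *)p;
        for (int i = a->id; i < a->length; i += N)   /* interleaved split, as on the slide */
            a->C[i] = a->A[i] + a->B[i];
        return NULL;
    }

    void array_add(int A[], int B[], int C[], int length) {
        pthread_t tid[N - 1];
        arg_t args[N];
        for (int t = 0; t < N; ++t) {
            arg_t a = { A, B, C, length, t };
            args[t] = a;
        }
        for (int t = 1; t < N; ++t)          /* "fork" N-1 threads */
            pthread_create(&tid[t - 1], NULL, worker, &args[t]);
        worker(&args[0]);                    /* the main thread does piece 0 */
        for (int t = 1; t < N; ++t)          /* "join" the N-1 threads */
            pthread_join(tid[t - 1], NULL);
    }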

How does this help performance?
Parallel speedup measures the improvement from parallelization:

    speedup(p) = time for best serial version / time for version with p processors

What can we realistically expect?

Reason #1: Amdahl's Law
In general, the whole computation is not (easily) parallelizable.
(figure: execution timeline with serial regions between the parallel regions)

Reason #1: Amdahl's Law
Suppose a program takes 1 unit of time to execute serially, and a fraction s of the program is inherently serial (unparallelizable). Then on P processors:

    New execution time = (1-s)/P + s

Class exercise: consider a program that, when executing on one processor, spends 10% of its time in a non-parallelizable region. How much faster will this program run on a 3-processor system?

    New execution time = 0.9T/3 + 0.1T = 0.3T + 0.1T = 0.4T
    Speedup = old_T / new_T = 1/0.4 = 2.5x

What if you had 8 cores? What is the maximum speedup from parallelization?
The theoretical best speedup for a perfectly parallelizable program is N times on N processors. But what about overhead?
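Working the remaining questions through the formula (straightforward arithmetic, with T the serial execution time and s = 0.1 from the exercise):

\[
T_{\text{new}} = \frac{(1-s)}{P}\,T + sT
\]
\[
P = 8:\quad \frac{0.9T}{8} + 0.1T = 0.2125T \;\Rightarrow\; \text{speedup} \approx 4.7\times
\]
\[
P \to \infty:\quad T_{\text{new}} \to sT = 0.1T \;\Rightarrow\; \text{maximum speedup} = \frac{1}{s} = 10\times
\]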

Reason #2: Overhead
Forking and joining (as in the array_add code above) is not instantaneous:
- It involves communicating between processors.
- It may involve calls into the operating system.
- It depends on the implementation.

    New execution time = (1-s)/P + s + overhead(P)

Programming Explicit Thread-level Parallelism
- As noted previously, the programmer must specify how to parallelize. But we want the path of least effort.
- Division of labor between the human and the compiler:
  - Humans: good at expressing parallelism, bad at bookkeeping.
  - Compilers: bad at finding parallelism, good at bookkeeping.
- We want a way to take serial code and say "Do this in parallel!" without:
  - having to manage the synchronization between processors,
  - having to know a priori how many processors the system has,
  - deciding exactly which processor does what, or
  - replicating the private state of each thread.
- OpenMP: an industry-standard set of compiler extensions. Works very well for programs with structured parallelism. (Also: Visual Studio with the Concurrency Runtime.) A sketch follows below.
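As a sketch of the OpenMP style (the pragma shown here is standard OpenMP; everything else is the array_add example from earlier):

    #include <omp.h>   /* optional here; the pragma alone suffices */

    void array_add(int A[], int B[], int C[], int length) {
        int i;
        /* The compiler forks a team of threads, divides the loop
           iterations among them, and joins at the end of the loop. */
        #pragma omp parallel for
        for (i = 0; i < length; ++i) {
            C[i] = A[i] + B[i];
        }
    }

Built with gcc -fopenmp (or /openmp in Visual Studio), the loop runs across the available cores; built without the flag, the pragma is ignored and the same source runs serially.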

New Programming Model
- Scalable
- Task-oriented
- Stateless

Performance Optimization
Until you are an expert, first write a working version of the program. Then, and only then, begin tuning: first collect data, then iterate. Otherwise, you will likely optimize what doesn't matter.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." -- Sir Tony Hoare
(Discuss Pareto chart. Expect this in the exam.)

Profiling
Profiling is data collection that records the performance of each part of the program, so that it can be optimised. It requires special tools:
- Visual Studio: MSDN tools link
- GCC: use GPROF or similar (link)
- Other tools exist for other environments
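For the GCC route, a typical GPROF session looks roughly like this (exact flags and output vary by platform):

    gcc -pg -O2 -o prog prog.c     compile with profiling instrumentation
    ./prog                         run normally; writes gmon.out
    gprof prog gmon.out            print the flat profile and call graph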

Fix only the big problems
- Use Pareto analysis to identify the most significant problem.
- Remember Amdahl's law.
(Pareto diagram from Wikipedia)

Conventional Wisdom (CW) in Computer Architecture
- Old CW: power is free, but transistors are expensive.
  New CW: the Power Wall. Power is expensive, transistors are "free": we can put more transistors on a chip than we can afford to turn on.
- Old CW: multiplies are slow, but loads are fast.
  New CW: the Memory Wall. Loads are slow, multiplies are fast: 200 clocks to DRAM, but even FP multiplies take only 4 clocks.
- Old CW: more ILP (instruction-level parallelism) via compiler/architecture innovation: branch prediction, speculation, out-of-order execution, VLIW, ...
  New CW: the ILP Wall. Diminishing returns on more ILP.
- Old CW: 2X CPU performance every 18 months.
  New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall. The doubling of uniprocessor performance will take longer.
(Power wall aside: Turbo mode in the i7 works by turning off all but one core, for power reasons.)

Conventional Wisdom “The promise of parallelism has fascinated researchers for at least three decades. In the past, parallel computing efforts have shown promise and gathered investment, but in the end uniprocessor computing always prevailed. Nevertheless, we argue general-purpose computing is taking an irreversible step toward parallel architectures.” – Berkeley View, December 2006

Parallelism again? What's different this time?
"This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs in novel software and architectures for parallelism; instead, this plunge into parallelism is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional uniprocessor architectures." – Berkeley View, December 2006
The HW/SW industry has bet its future that breakthroughs will appear before it's too late. (view.eecs.berkeley.edu)

Number of Cores/Socket
- We need revolution, not evolution. Software or architecture alone can't fix the parallel programming problem; we need innovations in both.
- "Multicore": 2X cores per generation: 2, 4, 8, ...
- "Manycore": 100s of cores gives the highest performance per unit area and per watt, then 2X per generation: 64, 128, 256, 512, 1024, ...
- Multicore architectures and programming models that are good for 2 to 32 cores won't evolve into manycore systems of 1000s of processors.
- We desperately need HW/SW models that work for manycore, or we will run out of steam (as ILP ran out of steam at 4 instructions).

FIGURE 7.17 Four recent multiprocessors, each using two sockets for the processors. Starting from the upper left-hand corner, the computers are: (a) Intel Xeon e5345 (Clovertown), (b) AMD Opteron X4 2356 (Barcelona), (c) Sun UltraSPARC T2 5140 (Niagara 2), and (d) IBM Cell QS20. Note that the Intel Xeon e5345 (Clovertown) has a separate north bridge chip not found in the other microprocessors. Copyright © 2009 Elsevier, Inc. All rights reserved. (Chapter 7: Multicores, Multiprocessors, and Clusters)

FIGURE 7.21 Performance of LBMHD on the four multicores. Copyright © 2009 Elsevier, Inc. All rights reserved. (Chapter 7: Multicores, Multiprocessors, and Clusters)

Summary
- Multi-core is having more than one processor on the same chip.
  - Most PCs/servers and game consoles are already multi-core.
  - It results from Moore's law plus the power constraint.
- Exploiting multi-core requires parallel programming.
  - Automatically extracting parallelism is, in general, too hard for the compiler.
  - But we can have the compiler do much of the bookkeeping for us: OpenMP.
- Fork-Join model of parallelism:
  - At a parallel region, fork a bunch of threads, do the work in parallel, and then join, continuing with just one thread.
  - Expect a speedup of less than P on P processors:
    - Amdahl's Law: speedup is limited by the serial portion of the program.
    - Overhead: forking and joining are not free.
- Everything is changing; the old conventional wisdom is out.
  - We desperately need a new approach to HW and SW based on parallelism, since industry has bet its future that parallelism will work.

Administrivia
- Course rating: BYU course outcome survey (Blackboard)
- Professionalism
- Lab 10 & HW 11: due tomorrow
- Final exam: available right after class, due 5 PM next Monday, Dec 14th
  - Covers everything in Block 2: pipelining, x86 stack structure, memory structure (cache, VM), interrupts, I/O, disk
- Late work deadline: next Monday, Dec 14th
- Quiz 8