ESE532: System-on-a-Chip Architecture
Day 4: January 25, 2017
Data-Level Parallelism
Penn ESE532 Spring 2017 -- DeHon

Today
- Data-level Parallelism
- For Parallel Decomposition
- Architectures
- Concepts
- NEON

Message
- Data parallelism is an easy basis for decomposition
- Data-parallel architectures can be compact -- pack more computations onto a die

Preclass 1
- 300 news articles
- Count total occurrences of a string
- How can we exploit data-level parallelism on this task?
- How much parallelism can we exploit?

Parallel Decomposition

Data Parallel
- Data-level parallelism can serve as an organizing principle for parallel task decomposition
- Run computation on independent data in parallel

Exploit
Can exploit data-level parallelism with:
- Threads
- Instruction-level parallelism
- Fine-grained data-level parallelism

Thread Exploit DP
- How do we exploit threads for data-parallel text search?

SPMD
- Single Program Multiple Data
- Only need to write the code once (see the sketch below)
- Get to use it many times

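A minimal SPMD sketch of the preclass search, assuming OpenMP is available, articles[] holds the 300 NUL-terminated article strings, and count_occurrences is our own hypothetical helper (not a library call):

    #include <string.h>

    /* Count occurrences of pat in one article (overlapping matches counted). */
    static long count_occurrences(const char *text, const char *pat) {
        long n = 0;
        for (const char *p = text; (p = strstr(p, pat)) != NULL; p++)
            n++;
        return n;
    }

    /* The same program runs in every thread; each thread takes its own
       slice of the articles, and the partial counts are summed. */
    long total_count(const char *articles[], int narticles, const char *pat) {
        long total = 0;
        #pragma omp parallel for reduction(+:total)
        for (int i = 0; i < narticles; i++)
            total += count_occurrences(articles[i], pat);
        return total;
    }

At article granularity, the available parallelism is bounded by the 300 independent articles.
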
ILP Exploit DP
- How do we exploit ILP for data-level parallelism?
[Figure: datapath with ALU, instruction memory, address]

Pipeline Exploit
- How do we exploit a hardware pipeline for text search?

Common Examples
What are common examples of DLP?
- Signal Processing
- Simulation
- Numerical Linear Algebra
- Graphics
- Image Processing
- Optimization
- Other?

Hardware Architectures

Idea
- If we're going to perform the same operations on different data, exploit that to reduce area and energy
- Reduced area means we can fit more computation on a fixed-size die

SIMD
- Single Instruction Multiple Data

W-bit ALU as SIMD
- Familiar idea: a W-bit ALU (W = 8, 16, 32, 64, ...) is SIMD
- Each bit of the ALU works on a separate bit, performing the same operation on it
- Trivial to see for bitwise AND, OR, XOR
- Also true for ADD (each bit position performing a full adder)
- Share one instruction across all ALU bits (see the sketch below)

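A minimal C illustration of the point above -- a single word-wide bitwise instruction is already a 64-lane SIMD operation (nothing here beyond standard C):

    #include <stdint.h>

    /* One 64b AND = 64 independent 1-bit AND lanes sharing one instruction. */
    uint64_t and64(uint64_t x, uint64_t y) { return x & y; }

    /* ADD also shares one instruction across all bit positions, but the
       full-adder carries couple adjacent "lanes" -- hence segmentation below. */
    uint64_t add64(uint64_t x, uint64_t y) { return x + y; }
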
ALU Bit Slice

Register File
- Small memory, usually with multiple ports
  - Multiple ports: the ability to perform multiple reads and writes simultaneously
  - Small to make it fast (small memories are fast)
- Multiple ports are expensive

Preclass 2
- Area at W=16?
- Area at W=128?
- Number in 10^8?
- Perfect pack ratio?

Preclass 2 (continued)
- W for a single datapath in 10^8?
- Perfect 16b pack ratio?
- Compare W=128 perfect pack ratio?

ALU vs. SIMD?
What's different between:
- a 128b-wide ALU
- a SIMD datapath supporting eight 16b ALU operations?

Segmented Datapath
- Relatively easy (few additional gates) to convert a wide datapath into one supporting a set of smaller operations
- Just need to squash the carry at lane boundaries (see the sketch below)
- But need to keep instructions (descriptions) small, so typically only a limited set of homogeneous widths is supported

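A software sketch of carry squashing (the standard SWAR masking trick, not from the slides): eight 8b adds inside one 64b word, with lane boundaries enforced by masks -- the same thing a segmented datapath does in gates:

    #include <stdint.h>

    /* Add eight packed 8b lanes in a 64b word without letting carries
       cross lane boundaries. */
    uint64_t add8x8(uint64_t a, uint64_t b) {
        const uint64_t LO7 = 0x7f7f7f7f7f7f7f7full; /* low 7 bits of each lane */
        const uint64_t HI1 = 0x8080808080808080ull; /* top bit of each lane */
        uint64_t sum = (a & LO7) + (b & LO7);  /* carries stop at each bit 7 */
        return sum ^ ((a ^ b) & HI1);          /* fix top bits, squash carry-out */
    }
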
Segmented 128b Datapath
- 1x128b, 2x64b, 4x32b, 8x16b

Terminology: Vector Lane
- Each of the separate segments is called a vector lane
- For 16b data, a 128b datapath provides 8 vector lanes

Opportunity
- Don't need 64b variables for lots of things
- Natural data sizes?
  - Audio samples?
  - Input from an A/D?
  - Video pixels?
  - X, Y coordinates for a 4K x 4K image?

Vector Computation
- Easy to map to SIMD flow if we can express the computation as operations on vectors
  - Vector add
  - Vector multiply
  - Dot product

Concepts

Vector Register File
- Need to be able to feed the SIMD compute units -- don't want to be bottlenecked on data movement to the SIMD ALU
- Wide register file to supply operands, with a wide path to memory

Point-wise Vector Operations
- Easy -- just like wide-word operations (now with segmentation); see the sketch below

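A concrete sketch in ARM NEON intrinsics (16b lanes; assumes n is a multiple of 8 -- leftovers are handled under Vector Length below):

    #include <arm_neon.h>

    /* c[i] = a[i] + b[i], eight 16b lanes per vector add. */
    void vadd16(const int16_t *a, const int16_t *b, int16_t *c, int n) {
        for (int i = 0; i < n; i += 8) {
            int16x8_t va = vld1q_s16(a + i);
            int16x8_t vb = vld1q_s16(b + i);
            vst1q_s16(c + i, vaddq_s16(va, vb));
        }
    }
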
Point-wise Vector Operations
- ...but alignment matters
- If operands are not aligned, need to perform data-movement operations to get them aligned

Vector Length
- The vector length may not match the physical hardware length
- What happens when:
  - Vector length > hardware SIMD operators?
  - Vector length < hardware SIMD operators?
  - Vector length % hardware operators != 0, e.g. vector length 20 on 8 hardware operators? (see the strip-mining sketch below)

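One common answer for the length-20-on-8-lanes case is strip mining -- a sketch building on the vadd16 example above: full 8-lane chunks first, then a scalar cleanup loop for the n % 8 remainder:

    #include <arm_neon.h>

    /* Vector length 20 on 8 lanes: two full vector iterations cover 16
       elements; the remaining 4 finish in the scalar loop. */
    void vadd16_any(const int16_t *a, const int16_t *b, int16_t *c, int n) {
        int i = 0;
        for (; i + 8 <= n; i += 8)   /* full 8-lane chunks */
            vst1q_s16(c + i, vaddq_s16(vld1q_s16(a + i), vld1q_s16(b + i)));
        for (; i < n; i++)           /* n % 8 leftover elements */
            c[i] = a[i] + b[i];
    }
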
Skipping Elements?
How does this work with the datapath?
    for (i = 0; i < 64; i = i + 2)
        c[i] = a[i] + b[i];

Stride
- Stride: the distance between the vector elements used
    for (i = 0; i < 64; i = i + 2)
        c[i] = a[i] + b[i];
- This loop accesses data with stride 2

Load/Store
- Strided load/stores: some architectures provide strided memory access that compacts the data when read into the register file (see the deinterleaving-load sketch below)
- Scatter/gather: some architectures provide memory operations to grab data from different places to construct a dense vector

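On NEON, the stride-2 loop above maps onto the deinterleaving loads (VLDn) listed in the NEON section -- a sketch, assuming the arrays hold 64 int16_t elements:

    #include <arm_neon.h>

    /* c[i] = a[i] + b[i] for even i in [0,64): stride-2 on all arrays. */
    void strided_add64(const int16_t *a, const int16_t *b, int16_t *c) {
        for (int i = 0; i < 64; i += 16) {      /* 16 elements = 8 even + 8 odd */
            int16x8x2_t va = vld2q_s16(a + i);  /* val[0] = even lanes, val[1] = odd */
            int16x8x2_t vb = vld2q_s16(b + i);
            int16x8x2_t vc = vld2q_s16(c + i);  /* keep the odd lanes of c intact */
            vc.val[0] = vaddq_s16(va.val[0], vb.val[0]);
            vst2q_s16(c + i, vc);               /* re-interleave on store */
        }
    }
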
Conditionals?
What happens if we want to do something different per element?
    for (i = 0; i < 8; i++)
        if (a[i] < b[i])
            d[i] = a[i] + c[i];
        else
            d[i] = b[i] + c[i];

Conditionals
- Only have one program counter: cannot implement the conditional via branching
- Only have one instruction: cannot perform separate operations on each ALU in the datapath
- Must perform an invariant operation sequence

Invariant Operation
    if (a[i] < b[i]) then d[i] = a[i] + c[i]
                     else d[i] = b[i] + c[i]
What's in each register as we go through the sequence?
    T1[i] = a[i] < b[i]     // 1 where true, 0 where false
    T2[i] = -T1[i]          // all-1s mask where true
    T3[i] = ~T2[i]          // all-1s mask where false
    T2[i] = a[i] & T2[i]
    T3[i] = b[i] & T3[i]
    d[i]  = c[i] + T2[i]
    d[i]  = d[i] + T3[i]

IfMux Conversion
- Can always transform a conditional into a data-independent sequence
- Often convenient to think of ifs as multiplexers (see the NEON sketch below):
    if (cond) o = a; else o = b;

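The mux view in NEON intrinsics -- a sketch of the preceding conditional (the compare produces an all-0s/all-1s mask per lane; bitwise select is the mux; arrays assumed to hold 8 elements):

    #include <arm_neon.h>

    /* d[i] = (a[i] < b[i]) ? a[i]+c[i] : b[i]+c[i], 8 lanes at once. */
    void ifmux8(const int16_t *a, const int16_t *b, const int16_t *c, int16_t *d) {
        int16x8_t va = vld1q_s16(a), vb = vld1q_s16(b), vc = vld1q_s16(c);
        uint16x8_t p = vcltq_s16(va, vb);    /* all-1s lanes where a < b */
        int16x8_t  t = vaddq_s16(va, vc);    /* then-branch, every lane */
        int16x8_t  e = vaddq_s16(vb, vc);    /* else-branch, every lane */
        vst1q_s16(d, vbslq_s16(p, t, e));    /* per-lane mux on the mask */
    }
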
Predicated Operation
- Many architectures provide a predicated operation
- Only perform the operation when the predicate matches the instruction:
    p[i] = a[i] < b[i]
    p[i]:  d[i] = c[i] + a[i]
    ~p[i]: d[i] = c[i] + b[i]

Predicated Operation
- What does this do to the number of instructions that must be issued?
- What does this do to efficiency (useful operations performed per cycle)?
    p[i] = a[i] < b[i]
    p[i]:  d[i] = c[i] + a[i]
    ~p[i]: d[i] = c[i] + b[i]

Nested Conditionals
- What happens with nested conditionals?

Dot Product
What happens when we need a dot product?
    res = 0;
    for (i = 0; i < N; i++)
        res += a[i] * b[i];

Reduction
- Common pattern: perform a combining operation to reduce a vector to a scalar
  - Sum the values in a vector
  - AND, OR
- Called a reduce operation

Reduce Tree
- Efficiently handled with a reduce tree

Reduce in Pipeline
- Comes almost for free in a pipeline

Vector Reduce Instruction
- Architectures usually include support for a vector reduce operation (see the dot-product sketch below)
- Doesn't need to add much delay
- May even be faster than performing the larger operation
  - Eight 16x16 multiplies with a sum reduce are less complex than one 128x128 multiply
  - ...and can exploit the datapath of the larger operation

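A dot-product sketch in NEON intrinsics: a widening multiply-accumulate (VMLAL) keeps four 32b running sums, and since NEON has no single full-vector reduce (see Neon Notes below), the last tree levels are spelled out with pairwise adds. Assumes n is a multiple of 4:

    #include <arm_neon.h>

    /* res = sum(a[i]*b[i]): 16b x 16b products accumulate into four
       32b lanes; two reduce-tree levels collapse the lanes to a scalar. */
    int32_t dot16(const int16_t *a, const int16_t *b, int n) {
        int32x4_t acc = vdupq_n_s32(0);
        for (int i = 0; i < n; i += 4)
            acc = vmlal_s16(acc, vld1_s16(a + i), vld1_s16(b + i));
        int32x2_t s = vadd_s32(vget_low_s32(acc), vget_high_s32(acc));
        return vget_lane_s32(vpadd_s32(s, s), 0);   /* final tree levels */
    }
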
Dot Product Revisited
    for (i = 0; i < N; i++)
        res += a[i] * b[i];
- a in R0-R3, b in R4-R7
- With a 3-cycle pipelined multiply, what happens if we try to implement the dot product as:
    MPY R0, R4, R14
    ADD R14, R15, R15
    MPY R1, R5, R14
    ...

Dot Product Revisited
    for (i = 0; i < N; i++)
        res += a[i] * b[i];
- a in R0-R3, b in R4-R7
- How should we order (reformulate) the instructions to exploit the data-level parallelism? (one possible answer sketched below)

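One possible reordering, a sketch in the slide's MPY src, src, dst notation (the temporary registers R8-R13 are our own choice): issue the four independent multiplies back-to-back so they fill the 3-cycle multiplier pipeline, then combine the products with a reduce tree instead of a serial accumulate chain:

    MPY R0, R4, R8     ; four independent multiplies issue on
    MPY R1, R5, R9     ; consecutive cycles, filling the 3-cycle
    MPY R2, R6, R10    ; multiplier pipeline instead of stalling
    MPY R3, R7, R11
    ADD R8, R9, R12    ; early products are ready by the time the adds issue
    ADD R10, R11, R13
    ADD R12, R13, R15  ; res
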
Pipelined Vector Units
- Real vector units give us both pipelining and parallel vector lanes
- Exploit data-level parallelism for both

Neon

Neon Vector
- 128b-wide register file, 16 registers
- Supported element widths:
  - 2x64b
  - 4x32b (also single-precision float)
  - 8x16b
  - 16x8b

Sample Instructions
- VADD -- basic vector add
- VCEQ -- compare equal; sets lanes to all 0s or all 1s, useful for masking
- VMIN -- avoid using ifs (see the sketch below)
- VMLA -- accumulating multiply
- VPADAL -- maybe useful for reduce
- VEXT -- for "shifting" vector alignment
- VLDn -- deinterleaving load

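For instance, VMIN via intrinsics -- a sketch of replacing a per-element if with a branch-free per-lane minimum (assumes n is a multiple of 8):

    #include <arm_neon.h>

    /* d[i] = (a[i] < b[i]) ? a[i] : b[i] with no branches:
       one VMIN handles 8 lanes. */
    void vmin16(const int16_t *a, const int16_t *b, int16_t *d, int n) {
        for (int i = 0; i < n; i += 8)
            vst1q_s16(d + i, vminq_s16(vld1q_s16(a + i), vld1q_s16(b + i)));
    }
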
Neon Notes
- Didn't see:
  - a vector-wide reduce operation
  - conditionals within vector lanes
- Do need to think about operations being pipelined within lanes

Big Ideas
- Data parallelism is an easy basis for decomposition
- Data-parallel architectures can be compact -- pack more computations onto a chip
  - SIMD, pipelined
  - Benefit by sharing (instructions)
- Performance can be brittle -- drops from peak as the problem mismatches the hardware

Admin
- Reading for Day 5 on web
- Talk on Thursday by Ed Lee (UCB), 3pm in Wu and Chen
- HW2 due Friday
- HW3 out (soon...): different partners