Prof. Zhang Gang School of Computer Sci. & Tech.

Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Topic 11 Gather-Scatter Prof. Zhang Gang gzhang@tju.edu.cn School of Computer Sci. & Tech. Tianjin University, Tianjin, P. R. China

Gather-Scatter In a sparse matrix, the elements of a vector are usually stored in some compacted form and then accessed indirectly. The primary mechanism for supporting sparse matrices is gather-scatter operations using index vectors. The goal is to support moving between a compressed representation (i.e., zeros are not included) and normal representation (i.e., the zeros are included) of a sparse matrix.
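As a rough illustration of moving between the two representations, the C sketch below compresses a dense array into a values array plus an index vector and expands it back. The function names `compress` and `expand` and the array layout are assumptions made for this example; they do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Compress: copy the non-zeros of dense[0..n) into values[] and record
   the position of each one in index[]. Returns the number of non-zeros. */
size_t compress(const double *dense, size_t n, double *values, size_t *index)
{
    size_t nnz = 0;
    for (size_t i = 0; i < n; i++) {
        if (dense[i] != 0.0) {
            values[nnz] = dense[i];
            index[nnz] = i;
            nnz++;
        }
    }
    return nnz;
}

/* Expand: rebuild the normal representation, zeros included. */
void expand(const double *values, const size_t *index, size_t nnz,
            double *dense, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dense[i] = 0.0;
    for (size_t j = 0; j < nnz; j++) {
        assert(index[j] < n);   /* every index must fall inside the dense array */
        dense[index[j]] = values[j];
    }
}
```

The index vector produced by `compress` plays the same role as the index vectors used by gather and scatter: it records where each stored element lives in the normal representation.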

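Gather-Scatter A gather operation takes an index vector and fetches the vector whose elements are at the addresses given by adding a base address to the offsets given in the index vector. The result is a dense vector in a vector register. A scatter operation is the reverse: after the elements have been operated on in dense form, it stores them back to their expanded locations using the same index vector.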
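In scalar C, a gather and its reverse, a scatter, might be sketched as follows. This is a simplified model, not the hardware behavior: the index vector holds element indices rather than byte offsets, and the names `gather` and `scatter` are chosen for this example.

```c
#include <stddef.h>

/* Gather: read base[index[i]] for each i and pack the results densely
   into dst[0..n). dst stands in for a vector register. */
void gather(double *dst, const double *base, const size_t *index, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = base[index[i]];
}

/* Scatter: write the dense elements src[0..n) back to the locations
   base[index[i]] named by the same index vector. */
void scatter(double *base, const size_t *index, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        base[index[i]] = src[i];
}
```

Elements of `base` that the index vector does not name are left untouched by `scatter`.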

Gather-Scatter Consider sparse vectors A and C and index vectors K and M, where A and C have the same number (n) of non-zeros:

    for (i = 0; i < n; i=i+1)
        A[K[i]] = A[K[i]] + C[M[i]];

Ra, Rc, Rk and Rm hold the starting addresses of the vectors. Using index vectors:

    LV      Vk, Rk          ;load K
    LVI     Va, (Ra+Vk)     ;load A[K[]]
    LV      Vm, Rm          ;load M
    LVI     Vc, (Rc+Vm)     ;load C[M[]]
    ADDVV.D Va, Va, Vc      ;add them
    SVI     (Ra+Vk), Va     ;store A[K[]]
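To see how the vector sequence relates to the scalar loop, the C sketch below writes the same update twice: once as the original loop and once in four phases that mirror the LVI / ADDVV.D / SVI instructions. The scratch buffers `va` and `vc` are an assumption of this sketch; they stand in for the vector registers Va and Vc, whereas real hardware would process at most one vector register's worth of elements at a time.

```c
#include <stddef.h>

/* Scalar reference: the loop from the slide. */
void sparse_add_scalar(double *A, const double *C,
                       const size_t *K, const size_t *M, size_t n)
{
    for (size_t i = 0; i < n; i++)
        A[K[i]] = A[K[i]] + C[M[i]];
}

/* The same update in phases that mirror the vector code.
   va and vc are caller-provided scratch buffers of n elements each. */
void sparse_add_phased(double *A, const double *C,
                       const size_t *K, const size_t *M, size_t n,
                       double *va, double *vc)
{
    for (size_t i = 0; i < n; i++) va[i] = A[K[i]];  /* LVI Va, (Ra+Vk)   gather  */
    for (size_t i = 0; i < n; i++) vc[i] = C[M[i]];  /* LVI Vc, (Rc+Vm)   gather  */
    for (size_t i = 0; i < n; i++) va[i] += vc[i];   /* ADDVV.D Va, Va, Vc        */
    for (size_t i = 0; i < n; i++) A[K[i]] = va[i];  /* SVI (Ra+Vk), Va   scatter */
}
```

The two versions agree only when K contains no repeated index. With a duplicate, the scalar loop accumulates both contributions, while the phased version gathers the old value twice and the last scattered element wins.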

Gather-Scatter This technique allows code with sparse matrices to run in vector mode. Although indexed loads and stores (gather and scatter) can be pipelined, they typically run much more slowly than non-indexed loads or stores, since the memory banks are not known at the start of the instruction.

Gather-Scatter Each element has an individual address, so they can’t be handled in groups, and there can be conflicts at many places throughout the memory system. Thus, each individual access incurs significant latency.
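One way to picture the bank problem is a toy model in which memory is word-interleaved across a fixed number of banks, so that element i lives in bank i mod NUM_BANKS. Both the model and the bank count of 8 are assumptions made for illustration; real memory systems are more complicated. The function below reports the largest number of accesses that fall on a single bank for a given index vector.

```c
#include <stddef.h>

#define NUM_BANKS 8  /* assumed bank count, for illustration only */

/* Returns the largest number of accesses in index[0..n) that land in
   the same bank, under the toy mapping bank = index % NUM_BANKS. */
size_t max_bank_load(const size_t *index, size_t n)
{
    size_t load[NUM_BANKS] = {0};
    size_t worst = 0;
    for (size_t i = 0; i < n; i++) {
        size_t bank = index[i] % NUM_BANKS;
        load[bank]++;
        if (load[bank] > worst)
            worst = load[bank];
    }
    return worst;
}
```

Under this model, a unit-stride access to elements 0 through 7 touches every bank exactly once, so the function returns 1. An index vector such as {0, 8, 16, 24, 3, 11, 5, 40} sends five of its eight accesses to bank 0, so the function returns 5. Since the index values are only known once the index vector has been loaded, the hardware cannot plan around such collisions when the instruction starts.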

Exercises
1. What is the meaning of gather?
2. What is the meaning of scatter?
3. Where are gather-scatter operations needed?
4. What is stored in the index vector?