Prof. Zhang Gang School of Computer Sci. & Tech.


Chapter 4: Data-Level Parallelism in Vector, SIMD, and GPU Architectures
Topic 11: Gather-Scatter
Prof. Zhang Gang, gzhang@tju.edu.cn
School of Computer Sci. & Tech., Tianjin University, Tianjin, P. R. China

Gather-Scatter

In a sparse matrix, the elements of a vector are usually stored in some compacted form and then accessed indirectly. The primary mechanism for supporting sparse matrices is gather-scatter operations using index vectors. The goal is to support moving between a compressed representation (i.e., the zeros are not included) and the normal representation (i.e., the zeros are included) of a sparse matrix.
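As a rough illustration of the two representations, the C sketch below compresses a dense vector into its non-zero values plus an index vector, and then expands it again. The function names compress_sparse and expand_sparse are hypothetical and do not come from the slides, and element indices stand in for the address offsets that a hardware index vector would hold.

```c
#include <assert.h>
#include <stddef.h>

/* Compress: copy the non-zeros of dense[0..n) into values[] and record
   the position of each one in index[]. Returns the number of non-zeros.
   values[] and index[] must have room for n entries. */
size_t compress_sparse(const double *dense, size_t n,
                       double *values, size_t *index)
{
    assert(dense && values && index);
    size_t nnz = 0;
    for (size_t i = 0; i < n; i++) {
        if (dense[i] != 0.0) {
            values[nnz] = dense[i];
            index[nnz] = i;
            nnz++;
        }
    }
    return nnz;
}

/* Expand: rebuild the normal representation, zeros included. */
void expand_sparse(const double *values, const size_t *index, size_t nnz,
                   double *dense, size_t n)
{
    assert(values && index && dense);
    for (size_t i = 0; i < n; i++)
        dense[i] = 0.0;
    for (size_t j = 0; j < nnz; j++)
        dense[index[j]] = values[j];
}
```

Calling compress_sparse on a vector and passing its outputs to expand_sparse should reproduce the original vector.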

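A gather operation takes an index vector and fetches the vector whose elements are at the addresses given by adding a base address to the offsets given in the index vector. The result is a dense vector in a vector register. A scatter store does the reverse: using the same index vector, it writes the elements of a vector register back to their sparse locations in memory.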
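In scalar terms, a gather and its reverse, a scatter store through the same index vector, can be sketched as two short C loops. This is a minimal sketch: the function names gather and scatter are illustrative, the array vreg only models a vector register, and the index vector holds element indices here rather than the address offsets a real machine would add to the base address.

```c
#include <assert.h>
#include <stddef.h>

/* Gather: vreg[i] = base[index[i]]. The result in vreg is dense. */
void gather(double *vreg, const double *base,
            const size_t *index, size_t n)
{
    assert(vreg && base && index);
    for (size_t i = 0; i < n; i++)
        vreg[i] = base[index[i]];
}

/* Scatter: base[index[i]] = vreg[i]. Elements of base that are not
   named in index[] are left unchanged. */
void scatter(double *base, const size_t *index,
             const double *vreg, size_t n)
{
    assert(base && index && vreg);
    for (size_t i = 0; i < n; i++)
        base[index[i]] = vreg[i];
}
```

A typical use gathers the non-zeros, operates on the dense copy, and scatters the results back with the same index vector.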

Consider sparse vectors A and C and index vectors K and M, where A and C have the same number (n) of non-zeros:

    for (i = 0; i < n; i=i+1)
        A[K[i]] = A[K[i]] + C[M[i]];

Ra, Rc, Rk and Rm are the starting addresses of the vectors. Using the index vectors:

    LV      Vk, Rk          ;load K
    LVI     Va, (Ra+Vk)     ;load A[K[]]
    LV      Vm, Rm          ;load M
    LVI     Vc, (Rc+Vm)     ;load C[M[]]
    ADDVV.D Va, Va, Vc      ;add them
    SVI     (Ra+Vk), Va     ;store A[K[]]
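To make the correspondence between the loop and the vector sequence concrete, the C sketch below emulates each instruction with a short loop over arrays that stand in for the vector registers. Several things here are assumptions rather than content from the slides: the function name sparse_add, the maximum vector length MVL of 64, the strip-mining loop that handles n larger than MVL, and the use of element indices instead of address offsets. It also assumes the entries of K are distinct; with duplicate indices the vector sequence and the scalar loop can give different results.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length, not stated in the slides */

void sparse_add(double *A, const double *C,
                const size_t *K, const size_t *M, size_t n)
{
    assert(A && C && K && M);
    double Va[MVL], Vc[MVL];   /* model vector registers Va, Vc */
    size_t Vk[MVL], Vm[MVL];   /* model index registers Vk, Vm  */

    /* Strip-mine: process at most MVL elements per pass. */
    for (size_t start = 0; start < n; start += MVL) {
        size_t vl = (n - start < MVL) ? (n - start) : MVL;

        for (size_t i = 0; i < vl; i++) Vk[i] = K[start + i];  /* LV  Vk, Rk      */
        for (size_t i = 0; i < vl; i++) Va[i] = A[Vk[i]];      /* LVI Va, (Ra+Vk) */
        for (size_t i = 0; i < vl; i++) Vm[i] = M[start + i];  /* LV  Vm, Rm      */
        for (size_t i = 0; i < vl; i++) Vc[i] = C[Vm[i]];      /* LVI Vc, (Rc+Vm) */
        for (size_t i = 0; i < vl; i++) Va[i] = Va[i] + Vc[i]; /* ADDVV.D         */
        for (size_t i = 0; i < vl; i++) A[Vk[i]] = Va[i];      /* SVI (Ra+Vk), Va */
    }
}
```

Each inner loop corresponds to one vector instruction, so the two indexed loads are gathers and the final indexed store is a scatter.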

This technique allows code with sparse matrices to run in vector mode. Although indexed loads and stores (gather and scatter) can be pipelined, they typically run much more slowly than non-indexed loads or stores, because the memory banks to be accessed are not known at the start of the instruction.

Each element has an individual address, so the elements cannot be handled in groups, and there can be conflicts at many places throughout the memory system. Thus, each individual access incurs significant latency.
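One way to see why the index values matter is a toy model of bank conflicts, sketched below in C. It is only an illustration under stated assumptions: memory is taken to be word-interleaved across NBANKS banks, so that element j lives in bank j mod NBANKS, and both the bank count of 8 and the function name max_bank_load are hypothetical. The model counts how many accesses of one indexed load fall on the busiest bank and ignores timing details such as bank busy time.

```c
#include <assert.h>
#include <stddef.h>

#define NBANKS 8  /* hypothetical bank count, chosen for illustration */

/* Returns the largest number of accesses that one bank receives when
   the elements named by index[0..n) are loaded. A value of 1 means
   every access goes to a different bank; larger values mean some
   accesses must wait on the same bank. */
size_t max_bank_load(const size_t *index, size_t n)
{
    assert(index || n == 0);
    size_t load[NBANKS] = {0};
    size_t worst = 0;
    for (size_t i = 0; i < n; i++) {
        size_t bank = index[i] % NBANKS;  /* assumed interleaving */
        load[bank]++;
        if (load[bank] > worst)
            worst = load[bank];
    }
    return worst;
}
```

Under this model a unit-stride pattern spreads evenly over the banks, while an index vector such as {0, 8, 16, 24} sends every access to the same bank. The hardware cannot tell which case it has until the index values are read.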

Exercises

What is the meaning of gather?
What is the meaning of scatter?
Where are gather-scatter operations needed?
What is stored in the index vector?