Tan Hongbing, Liu Sheng†, Chen Haiyan School of National University of

Presentation transcript:

Modeling and Evaluation for Gather/Scatter Operations in Vector-SIMD Architectures
Tan Hongbing, Liu Sheng†, Chen Haiyan
School of National University of Defense Technology
Good morning, everyone. Today I would like to present the paper "Modeling and Evaluation for Gather/Scatter Operations in Vector-SIMD Architectures". I am the corresponding author, Liu Sheng.

Presentation Outline
1. Introduction
2. Models and Verification
3. Evaluation and Results
4. Conclusion
The report consists of four parts. The first one is the Introduction.

Gather-Scatter in Vector-SIMD architectures
Gathers: vector of addrs → vector register
Scatters: vector register → vector of addrs
- Reads and writes to different sub-banks are performed in parallel
- Multiple reads or writes to the same sub-bank address are combined into a single access
- Reads are overlapped across different gathers; writes are overlapped across different scatters
In the picture, we can see …

Gather-Scatter in applications
At present, vector-SIMD has become a popular architecture in modern processors, in which the vector memory provides sufficient data bandwidth through multiple memory banks. However, many applications, such as sparse-matrix computation, image warping, and image histograms, access locations that are random and cannot be predicted in advance. A vector memory that supports only the common access patterns will suffer long, unhidden memory latency on them. Gather/scatter instructions directly realize random vector access modes, so they can satisfy the data requirements of these applications.

Definition of Gather-Scatter
Gather and scatter are dual operations. A scatter performs indexed writes to an array, and a gather performs indexed reads from an array. In the slide's equations, the array L contains the distinct write locations (for the scatter) or read locations (for the gather).
Scatter: [equation not reproduced in the transcript]
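
As a minimal sketch of the definition above, the two operations can be written with plain Python lists. The function names `gather` and `scatter`, the `fill` parameter, and the sample data are illustrative assumptions; the slide's own equations are not reproduced in the transcript.

```python
def gather(src, locations):
    """Indexed reads: out[i] = src[locations[i]]."""
    return [src[loc] for loc in locations]


def scatter(values, locations, size, fill=0):
    """Indexed writes: out[locations[i]] = values[i]."""
    out = [fill] * size
    for value, loc in zip(values, locations):
        out[loc] = value
    return out


memory = [10, 20, 30, 40]
L = [3, 0, 2, 1]              # distinct read/write locations

register = gather(memory, L)  # read memory[3], memory[0], memory[2], memory[1]
restored = scatter(register, L, len(memory))
print(register, restored)
```

When L is a permutation of the array indices, scattering the gathered values through the same L restores the original array, which is one way to see the duality.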

Gather/scatter behavior is stochastic and complicated, and the hardware design of gather/scatter operations lacks theoretical analysis and modeling. Three questions arise:
- What are the possible distributions of access locations for given PE and memory bank counts?
- What is the probability of each distribution?
- How can the hardware implementation be optimized in detail?
The proposed model gives the answers.

Presentation Outline
1. Introduction
2. Models and Verification
3. Evaluation and Results
4. Conclusion
Next, Models and Verification.

Example
- Both the SIMD width and the number of (single-port) memory banks are 4.
- The Maximum Conflicts Per Cycle (MCPC) is equal to the maximum element of the Distribution of Access Location, which we call DAL for short.
(1) MCPC=4, {4,0,0,0}
(2) MCPC=3, {3,1,0,0}
(3) MCPC=2, {2,2,0,0}
(4) MCPC=2, {2,1,1,0}
(5) MCPC=1, {1,1,1,1}
To facilitate understanding, we take a simple example to explain our model. We assume that both the SIMD width and the number of memory banks are 4. The MCPC is equal to the maximum element of the DAL. For this case, we list all 5 DALs. For instance, the 4 access locations may divide into 2 groups that fall into two different memory banks.
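
The five DALs of this example can be checked by brute force. The sketch below assumes that each of the 4 PEs may access any of the 4 banks, and that a DAL is the list of per-bank access counts sorted in descending order; the helper names `dal_of` and `all_dals` are made up for illustration.

```python
from collections import Counter
from itertools import product


def dal_of(banks_accessed, n_banks):
    """Per-bank access counts, sorted in descending order."""
    counts = Counter(banks_accessed)
    per_bank = [counts.get(bank, 0) for bank in range(n_banks)]
    return tuple(sorted(per_bank, reverse=True))


def all_dals(n_pe, n_banks):
    """Distinct DALs over every possible assignment of PEs to banks."""
    assignments = product(range(n_banks), repeat=n_pe)
    return sorted({dal_of(a, n_banks) for a in assignments}, reverse=True)


for dal in all_dals(4, 4):
    print("MCPC =", max(dal), "DAL =", dal)
```

Enumerating all 4^4 = 256 assignments is only practical for small cases, which is the motivation for a recursive model.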

Relation among Distributions of Access Location (DAL)
f(4,2): f(4,1); {2,1,1,0}; {2,2,0,0}
f(4,3): f(4,2); {3,1,0,0}
f(4,4): f(4,3); {4,0,0,0}
f(7,7): f(7,1); {2,f(5,2)}; {3,f(4,3)}; {4,f(3,3)}; {5,f(2,2)}; {6,f(1,1)}; g(7)
The case of 4 memory banks is simple, so we can list all DALs easily, but it is hard to list all DALs for a large number of memory banks. Luckily, we find some recursive relations among DALs. We define a function f: f(a,b) is the set of all DALs with a PEs whose maximum element (MCPC) is less than or equal to b. It is easy to see that f(7,7) can be represented by f at smaller scales.
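
The recursion above can be sketched in a few lines. This reads {k, f(a-k, ...)} as "largest element k, followed by a DAL of the remaining a-k accesses", and g(a) as the single DAL {a,0,...,0}; both readings are assumptions, since the slide does not spell them out. Zero entries are omitted from the tuples and can be restored with the hypothetical helper `pad`.

```python
def f(a, b):
    """All DALs of `a` accesses whose maximum element is <= b.

    Each DAL is a tuple of positive counts in non-increasing order.
    """
    if a == 0:
        return [()]
    dals = []
    for k in range(min(a, b), 0, -1):      # k is the largest element
        for rest in f(a - k, k):           # remaining accesses, each <= k
            dals.append((k,) + rest)
    return dals


def pad(dal, n_banks):
    """Append zeros so the DAL has one entry per memory bank."""
    return dal + (0,) * (n_banks - len(dal))


print([pad(d, 4) for d in f(4, 4)])
print(len(f(7, 7)), "DALs for 7 PEs")
```

Calling `f(a - k, k)` gives the same set as the slide's `f(a-k, min(k, a-k))`, because no element can exceed the number of remaining accesses.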

Modeling the DAL
According to the relations among different DALs, we can deduce an equation to calculate the DALs for different numbers of memory banks. f(a,b) is the set of all DALs with a PEs whose maximum element is less than or equal to b. The equation also uses a symbol for the integer portion of the quotient of A divided by B.

Modeling the Probability of Access Conflict (PAC)
(1) MCPC=4, {4,0,0,0}
(2) MCPC=3, {3,1,0,0}
(3) MCPC=2, {2,2,0,0}
(4) MCPC=2, {2,1,1,0}
(5) MCPC=1, {1,1,1,1}
Through the DAL model we can obtain all the DALs, but we do not yet know the probability of each one. The PAC is the probability of each DAL. It can be derived from basic permutation and combination arguments. The slide gives two formulas: the number of all possible permutations of the j-th DAL, and the probability of the j-th DAL.

Modeling the PAC
The data used in this equation come from D, the matrix whose rows are the DALs:
- D[i,j] is the i-th element in the j-th row;
- O[j] is the number of non-zero elements in the j-th row;
- G(i,j) is the sum of the first i elements in the j-th row;
- M(j) is an intermediate variable for the calculation;
- F(m,j) is the number of elements equal to m in the j-th row.
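
The slide's PAC formula is not reproduced in the transcript, so the following is only a sketch based on a standard counting argument, under the assumption that every PE picks a bank independently and uniformly at random. The function name `pac` is hypothetical. The count of arrangements is the number of ways to split the accesses into groups of the given sizes, times the number of ways to place those group sizes on distinct banks.

```python
from collections import Counter
from fractions import Fraction
from math import factorial


def pac(dal, n_banks):
    """Probability of one DAL under uniform random bank selection.

    `dal` holds the per-bank access counts (zeros may be omitted).
    """
    n_pe = sum(dal)
    padded = list(dal) + [0] * (n_banks - len(dal))

    # Ways to split n_pe distinct accesses into groups of these sizes.
    split = factorial(n_pe)
    for size in padded:
        split //= factorial(size)

    # Ways to place the group sizes on distinct banks; equal sizes are
    # interchangeable, so divide by the factorial of each multiplicity.
    place = factorial(n_banks)
    for multiplicity in Counter(padded).values():
        place //= factorial(multiplicity)

    return Fraction(split * place, n_banks ** n_pe)


for dal in [(4, 0, 0, 0), (3, 1, 0, 0), (2, 2, 0, 0), (2, 1, 1, 0), (1, 1, 1, 1)]:
    print(dal, pac(dal, 4))
```

For the 4-PE, 4-bank example the five probabilities sum to 1, which is a quick sanity check on the counting.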

Model verification (by MATLAB)
Number of DALs (the slide's Measured and Estimated columns):
4 PEs: 5
8 PEs: 22
16 PEs (total): 231
MCPC<=2 (16 PEs): 9
MCPC=3 (16 PEs): 21
MCPC=4 (16 PEs): 34
MCPC=5 (16 PEs): 37
MCPC=6 (16 PEs): 35
MCPC=7 (16 PEs): 28
MCPC=8 (16 PEs): (value missing in the transcript)
MCPC=9 (16 PEs): 15
MCPC=10 (16 PEs): 11
MCPC>=11 (16 PEs): 19
32 PEs: 8349
64 PEs: 1741630
We validate our model with a MATLAB program.
Validating the DAL model: the measured and estimated results are identical in every case.
Validating the PAC model: the average accuracy of our model on gather/scatter is over 98% (min: 97.3%, max: 100%) when the read/write locations are totally random.
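
The DAL counts in this table can be cross-checked without MATLAB. The sketch below assumes that the number of banks equals the number of PEs, so that counting DALs amounts to counting the ways to write the PE count as a sum of positive integers; the function name `count_dals` is illustrative, and this is not the authors' verification program.

```python
def count_dals(n_pe, max_mcpc=None):
    """Number of DALs of n_pe accesses whose MCPC is <= max_mcpc."""
    if max_mcpc is None:
        max_mcpc = n_pe
    ways = [1] + [0] * n_pe            # ways[t]: DALs of t accesses so far
    for part in range(1, max_mcpc + 1):
        for total in range(part, n_pe + 1):
            ways[total] += ways[total - part]
    return ways[n_pe]


for n in (4, 8, 16, 32, 64):
    print(n, "PEs:", count_dals(n))

# Breakdown by exact MCPC for 16 PEs.
for k in range(3, 11):
    print("MCPC =", k, ":", count_dals(16, k) - count_dals(16, k - 1))
```

Under this assumption the computed count for MCPC=8 with 16 PEs is 22, which is consistent with the other rows summing to the total of 231.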

Presentation Outline 1. Introduction 2. Models and Verification 3. Evaluation and Results 4. Conclusion

Evaluation and Results
For hardware designers, two common methods can improve gather/scatter performance:
(1) Organizing each memory bank into separate sub-banks
(2) Adding buffers to cache memory requests

Evaluation and Results
Analysis of MCPC as the PE:Bank ratio varies
Because the performance of gather/scatter is closely related to the depth of the buffer array and the ratio of PEs to memory banks, we use the proposed model to analyze the MCPC as the ratio of PE count to bank count varies. From the picture we can see:
- When PE:Bank = 1:1, more than 90% of DALs have MCPC<=4.
- When PE:Bank = 1:2, more than 90% of DALs have MCPC<=3.
- When PE:Bank = 1:4, more than 90% of DALs have MCPC<=2.
Figure annotations: more than 80% of DALs have MCPC<=4; more than 90% of DALs have MCPC<=3; more than 90% of DALs have MCPC<=2.
The performance of gather/scatter is closely related to the ratio of PEs to memory banks.
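
A reader who wants to explore how the MCPC distribution shifts with the PE:Bank ratio could use an exact calculation like the sketch below. It assumes 16 PEs, independent and uniformly random bank selection, and probability-weighted DALs; none of these settings is confirmed by the slide, so the printed percentages need not match the figure. The function name `prob_mcpc_at_most` is hypothetical.

```python
from fractions import Fraction
from math import factorial


def prob_mcpc_at_most(n_pe, n_banks, k):
    """Exact P(MCPC <= k) when each PE picks a bank uniformly at random."""
    # Per-bank generating function: sum over j <= k of x^j / j!
    per_bank = [Fraction(1, factorial(j)) for j in range(min(k, n_pe) + 1)]
    poly = [Fraction(1)] + [Fraction(0)] * n_pe     # the polynomial 1
    for _ in range(n_banks):
        nxt = [Fraction(0)] * (n_pe + 1)
        for i, coeff in enumerate(poly):
            if coeff == 0:
                continue
            for j, term in enumerate(per_bank):
                if i + j > n_pe:
                    break                            # truncate at x^n_pe
                nxt[i + j] += coeff * term
        poly = nxt
    # n_pe! times the x^n_pe coefficient counts the allowed assignments.
    return poly[n_pe] * factorial(n_pe) / Fraction(n_banks) ** n_pe


for n_banks in (16, 32, 64):                         # PE:Bank = 1:1, 1:2, 1:4
    row = [float(prob_mcpc_at_most(16, n_banks, k)) for k in range(1, 6)]
    print(n_banks, "banks:", [round(p, 3) for p in row])
```

Each printed row is the cumulative probability for MCPC<=1 through MCPC<=5, so the smallest buffer depth that covers a target fraction of accesses can be read off directly.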

Evaluation and Results
Analysis for selecting the proper number of memory banks
Figure annotations: NAC=2.12, NAC=1.64, NAC=1.32; Average Number of Access Conflict (NAC): 2.12, 2.59, 3.05, 3.45, 3.76.
φ(k) denotes the set of DALs whose MCPC is k.
The picture shows the Average Number of Access Conflict (NAC) as the ratio of the number of PEs to the number of memory banks varies. From the picture we can draw two conclusions. First, if the ratio of PEs to memory banks is fixed, a larger number of memory banks results in more conflicts. Second, when the number of PEs is fixed, conflicts decrease as the PE:Bank ratio decreases.
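
The NAC can also be estimated by simulation. This sketch treats the NAC as the expected MCPC under independent, uniformly random bank selection, which is an assumption about the slide's definition; the function name `estimate_nac`, the trial count, and the seed are arbitrary choices. For 4 PEs and 4 banks the exact expectation under this assumption is 544/256 = 2.125, in line with the 2.12 shown on the slide.

```python
import random
from collections import Counter


def estimate_nac(n_pe, n_banks, trials=20000, seed=0):
    """Monte Carlo estimate of the average MCPC."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Each PE picks one bank; the MCPC is the busiest bank's load.
        hits = Counter(rng.randrange(n_banks) for _ in range(n_pe))
        total += max(hits.values())
    return total / trials


print("4 PEs, 4 banks:", round(estimate_nac(4, 4), 2))
for n_banks in (16, 32, 64):
    print("16 PEs,", n_banks, "banks:", round(estimate_nac(16, n_banks), 2))
```

A simulation like this is a cheap cross-check on an analytical model, at the cost of sampling noise that shrinks as the trial count grows.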

Evaluation and Results
Figure annotations: 3.05, 1.98, 1.34.
This picture shows how the runtime relates to the buffer depth and the number of memory banks. The runtime drops considerably after a buffer array is added: the deeper the buffer array, the shorter the runtime. The runtime also falls as the ratio of PEs to memory banks decreases.

Evaluation and Results
Performance improvement as the depth of the buffer array varies
The picture shows the improvement in gather/scatter performance after adding buffers. The effect of buffer depth on the improvement becomes smaller as the ratio of PEs to memory banks decreases. When PE:Bank = 1:1, the three curves are far apart (labeled "Dispersive" in the figure), which means the improvement differs greatly when the buffer array depth is 2, 4, and 8. When PE:Bank = 1:4, the three curves are very close (labeled "Very close"), because a buffer array depth of 2 is already enough.

Presentation Outline 1. Introduction 2. Models and Verification 3. Evaluation and Results 4. Conclusion

Conclusion
- This model can give all the possible DALs, their PACs, and related quantities for gather/scatter operations in various situations.
- This model can help users select the optimum number of memory banks and guide designers in selecting the proper number of buffers. (For example, if the SIMD width is 16, organizing each bank into 2 sub-banks and setting the buffer depth to 4 is recommended.)

Thank you!