Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam

Presentation transcript:

Efficient Execution of Memory Access Phases Using Dataflow Specialization. Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam

Memory Access Phase: a dynamic portion of a program whose instruction stream is predominantly memory accesses and address generation. Examples: aggregation, matrix multiply, image processing…

SIMD (SSE) kernel:

    for (f = 0; f < FSIZE; f += 4) {
        __m128 xmm_in_r  = _mm_loadu_ps(in_r + p + f);
        __m128 xmm_in_i  = _mm_loadu_ps(in_i + p + f);
        __m128 xmm_mul_r = _mm_mul_ps(xmm_in_r, xmm_coef);
        __m128 xmm_mul_i = _mm_mul_ps(xmm_in_i, xmm_coef);
        xmm_accum_r = _mm_add_ps(xmm_accum_r, _mm_sub_ps(xmm_mul_r, xmm_mul_i));
        xmm_accum_i = _mm_add_ps(xmm_accum_i, _mm_add_ps(xmm_mul_r, xmm_mul_i));
    }

Aggregation:

    for (i = 0; i < v_size; ++i) {
        A[K[i]] += V[i];
    }

Matrix multiply:

    for (i = 0; i < 8; ++i) {
        for (j = 0; j < 8; ++j) {
            float sum = 0;
            for (k = 0; k < 8; ++k) {
                sum += matAT[i*matAcol + k] * matB[j*matBrow + k];
            }
            matCT[i*matBcol + j] += sum;
        }
    }

Image processing (NPU):

    for (int y = 0; y < srcImg.height; ++y)
        for (int x = 0; x < srcImg.width; ++x) {
            p = srcImg.build3x3Window(x, y);
            NPU_SEND(p[0][0]); NPU_SEND(p[0][1]); NPU_SEND(p[0][2]);
            NPU_SEND(p[1][0]); NPU_SEND(p[1][1]); NPU_SEND(p[1][2]);
            NPU_SEND(p[2][0]); NPU_SEND(p[2][1]); NPU_SEND(p[2][2]);
            NPU_RECEIVE(pixel);
            dstImg.setPixel(x, y, pixel);
        }

Execution Model

Read D$, little comp., write D$ (speedup):

               In-order   OOO2   OOO4
    Natural    1.0        1.5    2.2

Read D$, send to accel, write D$ (speedup):

               In-order   OOO2   OOO4
    DySER      1.0        1.5    2.7
    SSE                   1.7    2.9
    NPU                   1.6    2.2

Core becomes bottleneck (power). [Figure: total watts; address computation + data access < 40%]

Goal: access memory more efficiently, obtaining OOO's performance without its power overheads

Memory Access Dataflow (MAD): a specialized dataflow architecture to access memory (processor pipeline turned off). Big idea: exposing the concept of triggering events and actions. [Figure: Cache, MAD, Processor Core (off), Accelerator]

What does the core do? The core follows a few computation patterns to compute addresses and control behavior. The outcomes create recurring events, such as a value returned from the cache or an address becoming ready in the processor. Based on these events, the core performs actions that move data between the accelerator and memory. The three steps below recur and occur concurrently:
- Compute patterns: compute the address and control variables
- Create events: address ready, control variable resolved, value returned from cache
- React: access memory with loads and stores

MAD ISA Primitives
- Dataflow graph nodes: analogous to compute instructions and register state
- Actions: analogous to ld/st and move instructions
- Events: analogous to program counter sequencing
Arch. primer: a conventional RISC/CISC ISA has register state, compute instructions, ld/st instructions, and a program counter with control flow.

Transforming ISA: RISC ISA
[Figure: dataflow graph with + and < nodes over BaseA, BaseB, 1, n, i]

Pseudo program:

    for (i = 0; i < n; ++i) {
        a[i] = accel(a[i], b[i]);
    }

RISC ISA (slide callouts: named registers, computation ports, data movement, branch and PC):

    .L0
    ld, $r0+$r1 -> $acc0
    ld, $r2+$r1 -> $acc1
    st, $acc2 -> $r0+$r1
    addi, $r1, 1 -> $r1
    ble, $r4, $r1, .L0
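To make the roles in the RISC listing above easier to see, here is a minimal C sketch of the same loop with one statement per RISC instruction. The function name `risc_style_loop`, the body of `accel` (a plain multiply), and the do-while reading of the closing branch are assumptions made for illustration; the slides do not define any of them.

```c
#include <assert.h>

/* Hypothetical accelerator function; the slides do not define accel(). */
static int accel(int x, int y) { return x * y; }

/* One C statement per RISC instruction in the listing.
 * Assumes n >= 1, because the branch sits at the bottom of the loop. */
void risc_style_loop(int *a, const int *b, int n) {
    assert(n >= 1);
    int i = 0;                          /* $r1: induction variable          */
    do {
        int acc0 = *(a + i);            /* ld, $r0+$r1 -> $acc0 (addr + ld) */
        int acc1 = *(b + i);            /* ld, $r2+$r1 -> $acc1 (addr + ld) */
        int acc2 = accel(acc0, acc1);   /* computation behind the ports     */
        *(a + i) = acc2;                /* st, $acc2 -> $r0+$r1             */
        i = i + 1;                      /* addi, $r1, 1 -> $r1              */
    } while (i < n);                    /* branch back to .L0               */
}
```

Only the `accel` call is real computation; every other statement is address generation, data movement, or control, which is the work MAD takes over from the core.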

Transforming ISA: Named Event Queues
[Figure: dataflow graph with + and < nodes over BaseA, BaseB, 1, n, i]

    # Dataflow Graph Nodes
    N0: $eq7 + base A -> $eq0,$eq2   #Addr A
    N1: $eq7 + base B -> $eq4        #Addr B
    N2: $eq7 + 1 -> $eq6             #i++
    N3: $eq7 < n -> $eq8             #i<n

MAD ISA (cont'd)
- Static dataflow: computations
- Dynamic dataflow: Event-Condition-Action (ECA) rules of the form "on Event if Condition do Action"
  - Event: a combination of primitive dataflow events (the arrival of data)
  - Condition: data states
  - Action: load, store, or moves

Transforming ISA: MAD ISA (named event queues)

    # ECA Rules (data movement; conditions are EQ states)
    On $eq0      , if            , do A0:ld,$eq0->$eq1
    On $eq2∧$eq3 , if            , do A1:st,$eq3->$eq2
    On $eq4      , if            , do A2:ld,$eq4->$eq5
    On $eq8∧$eq6 , if $eq8(true) , do A3:mv,$eq6->$eq7, $eq8->

    # Dataflow-Graph Nodes (computation)
    N0: $eq7 + base A -> $eq0,$eq2   #Addr A
    N1: $eq7 + base B -> $eq4        #Addr B
    N2: $eq7 + 1 -> $eq6             #i++
    N3: $eq7 < n -> $eq8             #i<n
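The nodes and ECA rules above can be exercised with a small software model. The following is a minimal sketch, not the MAD hardware: the `EventQueue` type, the queue depth, `mad_run`, and the body of `mad_accel` (a plain add) are all hypothetical. Two further assumptions are labeled in the code: the loop test is evaluated on the incremented index so the body runs exactly n times (the slide writes N3 as `$eq7 < n`), and when `$eq8` is false the rule simply drops both entries and the phase ends, which the slide does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

#define EQ_CAP 4 /* queue depth: arbitrary for this sketch */
#define NUM_EQ 9 /* $eq0 .. $eq8, as named on the slide    */

typedef struct { int data[EQ_CAP]; int head, count; } EventQueue;

static void eq_push(EventQueue *q, int v) {
    assert(q->count < EQ_CAP);
    q->data[(q->head + q->count) % EQ_CAP] = v;
    q->count++;
}

static int eq_pop(EventQueue *q) {
    assert(q->count > 0);
    int v = q->data[q->head];
    q->head = (q->head + 1) % EQ_CAP;
    q->count--;
    return v;
}

static bool eq_ready(const EventQueue *q) { return q->count > 0; }

/* Hypothetical accelerator body; the slides do not define accel(). */
static int mad_accel(int x, int y) { return x + y; }

/* Models a[i] = accel(a[i], b[i]) for i in [0, n) over a flat memory.
 * Returns the number of st actions (rule A1) performed. */
int mad_run(int *mem, int base_a, int base_b, int n) {
    assert(n >= 1); /* assumption: the body runs at least once */
    EventQueue eq[NUM_EQ] = {0};
    int stores = 0;
    bool fired = true;

    eq_push(&eq[7], 0); /* seed the index queue with i = 0 */
    while (fired) {     /* stop once no node or rule can fire */
        fired = false;
        if (eq_ready(&eq[7])) {              /* static nodes N0..N3 */
            int i = eq_pop(&eq[7]);
            eq_push(&eq[0], i + base_a);     /* N0: Addr A for the load  */
            eq_push(&eq[2], i + base_a);     /* N0: Addr A for the store */
            eq_push(&eq[4], i + base_b);     /* N1: Addr B               */
            eq_push(&eq[6], i + 1);          /* N2: i++                  */
            eq_push(&eq[8], i + 1 < n);      /* N3: test on i + 1 (assumption) */
            fired = true;
        }
        if (eq_ready(&eq[0])) {              /* A0: ld, $eq0 -> $eq1 */
            eq_push(&eq[1], mem[eq_pop(&eq[0])]);
            fired = true;
        }
        if (eq_ready(&eq[4])) {              /* A2: ld, $eq4 -> $eq5 */
            eq_push(&eq[5], mem[eq_pop(&eq[4])]);
            fired = true;
        }
        if (eq_ready(&eq[1]) && eq_ready(&eq[5])) { /* accelerator: $eq1, $eq5 -> $eq3 */
            int x = eq_pop(&eq[1]);
            int y = eq_pop(&eq[5]);
            eq_push(&eq[3], mad_accel(x, y));
            fired = true;
        }
        if (eq_ready(&eq[2]) && eq_ready(&eq[3])) { /* A1: st, $eq3 -> $eq2 */
            int addr = eq_pop(&eq[2]);
            mem[addr] = eq_pop(&eq[3]);
            stores++;
            fired = true;
        }
        if (eq_ready(&eq[8]) && eq_ready(&eq[6])) { /* A3: mv, $eq6 -> $eq7 */
            int cond = eq_pop(&eq[8]);
            int next = eq_pop(&eq[6]);
            if (cond) eq_push(&eq[7], next); /* false: drop both (assumption) */
            fired = true;
        }
    }
    return stores;
}
```

No program counter appears anywhere in the model: each step is triggered only by data arriving in a queue, which is the point of expressing the loop as events and actions.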

Microarchitecture [Figure: matching events, move data, data-driven computation]

MAD Execution [Figure labels: Code Gen., MAD ISA, MAD (Access), Accelerator, Processor Off]

Evaluation Methodology
- Baseline: In-order, OOO2 and OOO4
- MAD integration: 256 dataflow nodes, 64 event queues; integrated into OOO2/OOO4's LSU
- Natural and induced memory access phases
- Accelerators: DySER, SIMD, NPU, C-Cores
- Reproduce/reuse benchmarks relevant to each accelerator

Evaluation & Analysis
- Performance: can MAD match 2/4-OOO? Explicit static and dynamic dataflow, larger instruction window, less speculative
- Energy/power: MAD should consume less energy/power

Summary: Performance
- MAD's performance is similar to OOO4
- MAD can utilize OOO2's LSU better: MAD+OOO2 > OOO2; with OOO4, MAD can be better than OOO4
- In DySER programs, there are more opportunities for OOO4 to speculatively execute memory instructions

Summary: Energy
- ~Half energy compared to OOO2
- Compared to In-Order, OOO2 delivers better performance but does not save energy
- ~30% energy compared to OOO4

Power: Natural Phases [Figure: OOO2, OOO4, MAD2, MAD4]
- MAD < sum(Fetch, Decode, Dispatch, Issue, Execute, WriteBack)
- LSU: more than 2-OOO, similar to 4-OOO

Summary
- MAD is a novel and useful customization for memory access phases
- Performance improvement and power reduction
- Flexible and effective for accelerators

Questions