Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam

Efficient Execution of Memory Access Phases Using Dataflow Specialization

Aggregation, matrix multiply, image processing…
Memory Access Phase: a dynamic portion of a program whose instruction stream is predominantly memory accesses and address generation.
SIMD (SSE) kernel:
    for (f=0; f<FSIZE; f+=4) {
        __m128 xmm_in_r = _mm_loadu_ps(in_r+p+f);
        __m128 xmm_in_i = _mm_loadu_ps(in_i+p+f);
        __m128 xmm_mul_r = _mm_mul_ps(xmm_in_r, xmm_coef);
        __m128 xmm_mul_i = _mm_mul_ps(xmm_in_i, xmm_coef);
        accum_r = _mm_add_ps(xmm_accum_r, _mm_sub_ps(xmm_mul_r, xmm_mul_i));
        accum_i = _mm_add_ps(xmm_accum_i, _mm_add_ps(xmm_mul_r, xmm_mul_i));
    }
Aggregation:
    for (i=0; i<v_size; ++i) {
        A[K[i]] += V[i];
    }
Matrix multiply:
    for (i=0; i<8; ++i) {
        for (j=0; j<8; ++j) {
            float sum=0;
            for (k=0; k<8; ++k) {
                sum += matAT[i*matAcol+k] * matB[j*matBrow+k];
            }
            matCT[i*matBcol+j] += sum;
        }
    }
Image processing (NPU):
    for (int y = 0; y < srcImg.height; ++y)
        for (int x = 0; x < srcImg.width; ++x) {
            p = srcImg.build3x3Window(x, y);
            NPU_SEND(p[0][0]); NPU_SEND(p[0][1]); NPU_SEND(p[0][2]);
            NPU_SEND(p[1][0]); NPU_SEND(p[1][1]); NPU_SEND(p[1][2]);
            NPU_SEND(p[2][0]); NPU_SEND(p[2][1]); NPU_SEND(p[2][2]);
            NPU_RECEIVE(pixel);
            dstImg.setPixel(x, y, pixel);
        }
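To make the definition concrete, the aggregation loop can be instrumented to tally what each iteration spends on memory access and address generation versus computation. This is only a sketch: the `op_counts` type, the `aggregate_counted` name, and the way operations are bucketed are assumptions made for illustration, not something taken from the talk, and loop control (the increment and compare on `i`) is deliberately left out of the tally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tally of dynamic operations; the categories are an
 * illustrative assumption, not a measurement from the talk. */
typedef struct {
    size_t loads;     /* reads of K[i], V[i], A[K[i]] */
    size_t stores;    /* write-back of A[K[i]] */
    size_t addr_ops;  /* address computations: K+i, V+i, A+k */
    size_t compute;   /* the single floating-point add */
} op_counts;

/* Performs A[K[i]] += V[i] for i in [0, n) and counts the operations. */
op_counts aggregate_counted(float *A, const int *K, const float *V, size_t n)
{
    op_counts c = {0, 0, 0, 0};
    assert(A != NULL && K != NULL && V != NULL);
    for (size_t i = 0; i < n; ++i) {
        int k = K[i];    /* load, address K+i */
        float v = V[i];  /* load, address V+i */
        float a = A[k];  /* load, address A+k */
        A[k] = a + v;    /* one add, one store (reuses address A+k) */
        c.loads += 3;
        c.addr_ops += 3;
        c.compute += 1;
        c.stores += 1;
    }
    return c;
}
```

Under this bucketing, seven of the eight counted operations per iteration are memory accesses or address generation, which is the kind of instruction mix the definition describes.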

Execution Model
Read D$, little comp., write D$:
    Speedup   In-order  OOO2  OOO4
    Natural   1.0       1.5   2.2
Read D$, send to accel, write D$:
    Speedup   In-order  OOO2  OOO4
    DySER     1.0       1.5   2.7
    SSE                 1.7   2.9
    NPU                 1.6   2.2

Core becomes bottleneck (Power)
Address computation + data access account for < 40% of total watts

Goal: access memory more efficiently, obtaining OOO’s performance without the power overheads

Memory Access Dataflow
A specialized dataflow architecture to access memory (processor pipeline turned off)
Big idea: exposing the concept of triggering events & actions
[Figure: Cache, MAD, Processor Core (off), Accelerator]

What does the core do?
Compute patterns: computes the address and control variables
Create events: address ready, control variable resolved, value returned from cache
React: access memory with loads and stores
The core follows a few computation patterns to compute addresses and control behavior. The outcomes create recurring events, such as a value returned from the cache or an address becoming ready in the processor. Based on these events, the core performs actions that move data between the accelerator and memory. All three activities are recurring and occur concurrently.

MAD ISA Primitives
Dataflow Graph Nodes: analogous to compute instructions & register state
Actions: analogous to ld/st and move instructions
Events: analogous to program counter sequencing
Arch. primer: a conventional RISC/CISC ISA has register state, compute instructions, ld/st instructions, and a program counter with control flow

Transforming ISA
[Figure: dataflow graph with + and < nodes over BaseA, BaseB, 1, n, i]
Pseudo program:
    for (i=0; i<n; ++i) { a[i] = accel(a[i], b[i]); }
RISC ISA:
    .L0
    ld,   $r0+$r1 -> $acc0
    ld,   $r2+$r1 -> $acc1
    st,   $acc2 -> $r0+$r1
    addi, $r1, 1 -> $r1
    ble,  $r4, $r1, .L0
Callouts: named registers; computation; ports; data movement; branch, PC

Transforming ISA: Named Event Queues
[Figure: dataflow graph with + and < nodes over BaseA, BaseB, 1, n, i]
# Dataflow Graph Nodes
N0: $eq7 + base A -> $eq0,$eq2   #Addr A
N1: $eq7 + base B -> $eq4        #Addr B
N2: $eq7 + 1 -> $eq6             #i++
N3: $eq7 < n -> $eq8             #i<n

MAD ISA (cont’d)
Static dataflow: computations
Dynamic dataflow: Event-Condition-Action (ECA) rules
    on Event if Condition do Action
Event: a combination of primitive dataflow events (the arrival of data)
Condition: data states
Action: load, store, or moves
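One way to picture the "on Event if Condition do Action" form is as a small record plus a trigger check. The sketch below is an assumption made for illustration, not the encoding used by MAD: the `eca_rule` fields, the bitmask representation of events, and the `eca_triggers` helper are all hypothetical names. The event part is modelled as a set of queues that must all hold data, and the condition as an optional queue whose head value must be true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { ACT_LD, ACT_ST, ACT_MV } action_kind;

/* Hypothetical encoding of one ECA rule. */
typedef struct {
    unsigned    event_mask;  /* bit q set: rule waits for data in queue q */
    int         cond_queue;  /* queue whose head must be true, or -1 for none */
    action_kind action;      /* load, store, or move */
    int         src, dst;    /* source and destination queues of the action */
} eca_rule;

/* Returns true when every awaited queue has data (the event) and the
 * optional data-state check holds (the condition).
 * ready_mask: bit q set means queue q currently holds data.
 * heads[q]:   value at the head of queue q (only read for cond_queue). */
bool eca_triggers(const eca_rule *r, unsigned ready_mask, const long *heads)
{
    assert(r != NULL && heads != NULL);
    if ((ready_mask & r->event_mask) != r->event_mask)
        return false;  /* some awaited data has not arrived yet */
    if (r->cond_queue >= 0 && heads[r->cond_queue] == 0)
        return false;  /* event arrived, but the condition is false */
    return true;
}
```

Keeping the event (arrival) separate from the condition (data state) mirrors the split in the rule form: a rule that waits on two queues stays idle until both have data, and only then is the condition consulted.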

Transforming ISA: MAD ISA
# ECA Rules (events on named event queues; conditions on EQ states; actions are data movement)
On $eq0,       if ,            do A0: ld, $eq0 -> $eq1
On $eq2∧$eq3,  if ,            do A1: st, $eq3 -> $eq2
On $eq4,       if ,            do A2: ld, $eq4 -> $eq5
On $eq8∧$eq6,  if $eq8(true),  do A3: mv, $eq6 -> $eq7, $eq8 ->
# Dataflow-Graph Nodes (computation)
N0: $eq7 + base A -> $eq0,$eq2   #Addr A
N1: $eq7 + base B -> $eq4        #Addr B
N2: $eq7 + 1 -> $eq6             #i++
N3: $eq7 < n -> $eq8             #i<n
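The interplay of nodes N0..N3 and rules A0..A3 can be traced with a small software model. This is a rough sketch under several assumptions, not the MAD hardware: event queues are modelled as small FIFOs, `accel` is a stand-in that simply adds its inputs (the slides do not define it), memory is a flat array indexed in elements, N2 is taken to compute i + 1 following its `#i++` comment, and N3 compares the incremented index against n so that exactly n elements are processed. The names `event_queue` and `mad_run` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_EQ 9
#define EQ_CAP 8

typedef struct { long buf[EQ_CAP]; int head, count; } event_queue;

static bool eq_ready(const event_queue *q) { return q->count > 0; }
static void eq_push(event_queue *q, long v) {
    assert(q->count < EQ_CAP);
    q->buf[(q->head + q->count) % EQ_CAP] = v;
    q->count++;
}
static long eq_pop(event_queue *q) {
    assert(q->count > 0);
    long v = q->buf[q->head];
    q->head = (q->head + 1) % EQ_CAP;
    q->count--;
    return v;
}

/* Stand-in accelerator (assumption): the slides do not define accel(). */
static long accel(long a, long b) { return a + b; }

/* Models a[i] = accel(a[i], b[i]) for i in [0, n), n >= 1.
 * Returns the number of stores performed. */
int mad_run(long *mem, long base_a, long base_b, long n)
{
    event_queue eq[NUM_EQ] = {0};
    int stores = 0;
    bool progress = true;
    eq_push(&eq[7], 0);                      /* seed i = 0 */
    while (progress) {
        progress = false;
        if (eq_ready(&eq[7])) {              /* static dataflow nodes */
            long i = eq_pop(&eq[7]);
            eq_push(&eq[0], i + base_a);     /* N0: load address of a[i] */
            eq_push(&eq[2], i + base_a);     /* N0: store address of a[i] */
            eq_push(&eq[4], i + base_b);     /* N1: address of b[i] */
            eq_push(&eq[6], i + 1);          /* N2: next i */
            eq_push(&eq[8], (i + 1) < n);    /* N3: continue? (assumption) */
            progress = true;
        }
        if (eq_ready(&eq[0])) {              /* A0: ld $eq0 -> $eq1 */
            eq_push(&eq[1], mem[eq_pop(&eq[0])]); progress = true;
        }
        if (eq_ready(&eq[4])) {              /* A2: ld $eq4 -> $eq5 */
            eq_push(&eq[5], mem[eq_pop(&eq[4])]); progress = true;
        }
        if (eq_ready(&eq[1]) && eq_ready(&eq[5])) {  /* accelerator */
            long a = eq_pop(&eq[1]), b = eq_pop(&eq[5]);
            eq_push(&eq[3], accel(a, b)); progress = true;
        }
        if (eq_ready(&eq[2]) && eq_ready(&eq[3])) {  /* A1: st $eq3 -> $eq2 */
            long addr = eq_pop(&eq[2]);
            mem[addr] = eq_pop(&eq[3]); stores++; progress = true;
        }
        if (eq_ready(&eq[8]) && eq_ready(&eq[6])) {  /* A3: mv $eq6 -> $eq7 */
            long cond = eq_pop(&eq[8]), next = eq_pop(&eq[6]);
            if (cond) eq_push(&eq[7], next); /* false: drain so the run ends */
            progress = true;
        }
    }
    return stores;
}
```

With `mem = {1, 2, 3, 10, 20, 30}`, `base_a = 0`, `base_b = 3` and `n = 3`, the first three elements become 11, 22 and 33 after three stores. No program counter appears anywhere: the arrival of i in `$eq7` is what starts each iteration, and rule A3 is what feeds the next one.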

Microarchitecture
[Figure labels: Matching Events, Move data, Data-driven computation]

MAD Execution
[Figure labels: Accelerator, Code Gen., Processor Off, MAD ISA, MAD (Access)]

Evaluation Methodology
Baseline: In-order, OOO2 and OOO4
MAD integration: 256 dataflow nodes, 64 event queues, integrated into OOO2/OOO4’s LSU
Natural and induced memory access phases
Accelerators: DySER, SIMD, NPU, C-Cores
Reproduce/reuse benchmarks relevant to each accelerator

Evaluation & Analysis
Performance: can MAD match 2/4-OOO? Explicit static & dynamic dataflow, larger instruction window, less speculative
Energy/power: MAD should consume less energy/power

Summary - Performance
MAD’s performance is similar to OOO4
MAD can utilize OOO2’s LSU better: MAD+OOO2 > OOO2; with OOO4, MAD can be better than OOO4
In DySER programs, there are more opportunities for OOO4 to speculatively execute memory instructions

Summary - Energy
~Half energy compared to OOO2
Compared to In-Order, OOO2 delivers better performance but does not save energy
~30% energy compared to OOO4

Power: Natural Phases
[Figure: power for OOO2, OOO4, MAD2, MAD4]
MAD < sum(Fetch, Decode, Dispatch, Issue, Execute, WriteBack)
LSU: more than 2-OOO, similar to 4-OOO

Summary
MAD is a novel and useful customization for memory access phases
Performance improvement and power reduction
Flexible & effective for accelerators

Questions
