Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam


1 Efficient Execution of Memory Access Phases Using Dataflow Specialization
Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam

2 Memory Access Phase
A dynamic portion of a program where its instruction stream is predominantly for memory accesses and address generation.

    for (f=0; f<FSIZE; f+=4) {
      __m128 xmm_in_r = _mm_loadu_ps(in_r+p+f);
      __m128 xmm_in_i = _mm_loadu_ps(in_i+p+f);
      __m128 xmm_mul_r = _mm_mul_ps(xmm_in_r, xmm_coef);
      __m128 xmm_mul_i = _mm_mul_ps(xmm_in_i, xmm_coef);
      accum_r = _mm_add_ps(xmm_accum_r, _mm_sub_ps(xmm_mul_r, xmm_mul_i));
      accum_i = _mm_add_ps(xmm_accum_i, _mm_add_ps(xmm_mul_r, xmm_mul_i));
    }

Aggregation, matrix multiply, image processing…

    for (i=0; i<v_size; ++i) {
      A[K[i]] += V[i];
    }

    for (i=0; i<8; ++i) {
      for (j=0; j<8; ++j) {
        float sum=0;
        for (k=0; k<8; ++k) {
          sum += matAT[i*matAcol+k] * matB[j*matBrow+k];
        }
        matCT[i*matBcol+j] += sum;
      }
    }

    for (int y = 0; y < srcImg.height; ++y)
      for (int x = 0; x < srcImg.width; ++x) {
        p = srcImg.build3x3Window(x, y);
        NPU_SEND(p[0][0]); NPU_SEND(p[0][1]); NPU_SEND(p[0][2]);
        NPU_SEND(p[1][0]); NPU_SEND(p[1][1]); NPU_SEND(p[1][2]);
        NPU_SEND(p[2][0]); NPU_SEND(p[2][1]); NPU_SEND(p[2][2]);
        NPU_RECEIVE(pixel);
        dstImg.setPixel(x, y, pixel);
      }

3 Execution Model
Read D$, little comp., write D$

    Speedup   In-order   OOO2   OOO4
    Natural   1.0        1.5    2.2

Read D$, send to accel, write D$

    Speedup   In-order   OOO2   OOO4
    DySER     1.0        1.5    2.7
    SSE                  1.7    2.9
    NPU                  1.6    2.2

4 Core becomes bottleneck (Power)
Figure: total watts. Address Computation + Data Access < 40%

5 Goal: To more efficiently access memory, obtain OOO’s performance without power overheads

6 Memory Access Dataflow
A specialized dataflow architecture to access memory (processor pipeline turned off).
Big idea: exposing the concept of triggering events & actions.
Figure: Cache, MAD, Processor Core (off), Accelerator

7 What does the core do?
Compute patterns: computes the address and control variables
Create events: address ready, control variable resolved, value returned from cache
React: access memory with loads and stores
The core follows a few computation patterns to compute the address and control behavior. The outcomes create recurring events, such as a value returned from the cache or an address becoming ready in the processor. Based on these events, the core performs actions that move data between the accelerator and memory. These three activities are recurring and occur concurrently.

8 MAD ISA Primitives
Dataflow Graph Nodes: analogous to compute instructions & register state
Actions: analogous to ld/st and move instructions
Events: analogous to program counter sequencing
Arch. primer! Conventional RISC/CISC ISA: register state, compute instructions, ld/st instructions, program counter and control flow

9 Transforming ISA
Figure: dataflow graph with operators +, < and inputs BaseA, BaseB, 1, n, i

Pseudo Program

    for (i=0; i<n; ++i) { a[i] = accel(a[i], b[i]); }

RISC ISA

    .L0
    ld, $r0+$r1 -> $acc0
    ld, $r2+$r1 -> $acc1
    st, $acc2 -> $r0+$r1
    addi, $r1, 1 -> $r1
    ble, $r4, $r1, .L0

Annotations: Computation Ports, Named registers, Data Movement, Branch, PC..
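The pseudo program can be written out as plain C to make the split between address computation, data movement, and control explicit. This is a minimal sketch, not the authors' code: the slide does not say what accel() computes, so an element-wise add is assumed, and the function name access_phase is hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for the accelerator: the slide does not define
   accel(), so an element-wise add is assumed for illustration. */
static float accel(float x, float y) { return x + y; }

/* The pseudo program, with each statement labeled by the role it plays
   in a memory access phase. */
void access_phase(float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; ++i) {      /* control: i < n, i++          */
        const float *addr_a = a + i;      /* address computation: BaseA+i */
        const float *addr_b = b + i;      /* address computation: BaseB+i */
        float in0 = *addr_a;              /* data movement: load to port  */
        float in1 = *addr_b;              /* data movement: load to port  */
        float out = accel(in0, in1);      /* work done by the accelerator */
        a[i] = out;                       /* data movement: store result  */
    }
}
```

Everything in the loop body except the accel() call is work the core does only to feed the accelerator, which is the part MAD takes over.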

10 Transforming ISA
Figure: the same dataflow graph (+, <, BaseA, BaseB, 1, n, i) with its edges mapped to Named Event Queues

    # Dataflow Graph Nodes
    N0: $eq7 + base A -> $eq0,$eq2   #Addr A
    N1: $eq7 + base B -> $eq4        #Addr B
    N2: $eq7 + 1 -> $eq6             #i++
    N3: $eq7 < n -> $eq8             #i<n

11 on Event if Condition do Action
MAD ISA (cont’d)
Static Dataflow: Computations
Dynamic Dataflow: Event-Condition-Actions (ECA) rules
on Event if Condition do Action
Event: a combination of primitive dataflow events (the arrival of data)
Condition: data states
Action: load, store, or moves
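One way to picture an ECA rule is as a small record plus a predicate that decides whether the rule fires. The sketch below is an assumption made for illustration, not the MAD encoding: the type eca_rule, its fields, the function rule_fires, and the queue count NUM_EQ are all hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_EQ 16  /* assumed number of event queues for this sketch */

typedef enum { ACT_LD, ACT_ST, ACT_MV } action_kind;

/* Hypothetical in-memory form of "on Event if Condition do Action". */
typedef struct {
    uint32_t event_mask;  /* on: bit q set means queue q must hold data  */
    int cond_eq;          /* if: queue whose head is tested, -1 for none */
    bool cond_value;      /* truth value the tested head must have       */
    action_kind action;   /* do: load, store, or move                    */
    int src_eq, dst_eq;   /* queues the action reads from and writes to  */
} eca_rule;

/* ready_mask has bit q set when queue q holds data; head[q] is the value
   at the front of queue q. Returns true when the rule may fire. */
bool rule_fires(const eca_rule *r, uint32_t ready_mask, const int32_t *head) {
    if ((ready_mask & r->event_mask) != r->event_mask)
        return false;                    /* some required event is missing */
    if (r->cond_eq < 0)
        return true;                     /* unconditional rule */
    return (head[r->cond_eq] != 0) == r->cond_value;
}
```

A rule that waits on two queues and tests one of them would set two bits in event_mask and point cond_eq at the tested queue.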

12 Transforming ISA
MAD ISA: Named Event Queues, Data Movement (ECA rules), Computation (dataflow-graph nodes)

    # ECA Rules                          (if: EQ States, i.e. Conditions)
    On $eq0     , if            , do A0:ld,$eq0->$eq1
    On $eq2∧eq3 , if            , do A1:st,$eq3->$eq2
    On $eq4     , if            , do A2:ld,$eq4->$eq5
    On $eq8∧eq6 , if $eq8(true) , do A3:mv,$eq6->$eq7, $eq8->

    # Dataflow-Graph Nodes
    N0: $eq7 + base A -> $eq0,$eq2   #Addr A
    N1: $eq7 + base B -> $eq4        #Addr B
    N2: $eq7 + 1 -> $eq6             #i++
    N3: $eq7 < n -> $eq8             #i<n
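To see how the rules and nodes above cooperate, the following is a toy software model of the event queues running the a[i] = accel(a[i], b[i]) loop. It is a sketch under stated assumptions, not the MAD microarchitecture: memory is one flat array indexed by "address", accel() is assumed to be an add, the queues are tiny ring buffers, and the names mad_run, eq_t, push, pop, and peek are hypothetical. One deliberate deviation: in this firing model, testing the current index as N3 is transcribed would issue one access past the end after the last increment, so the sketch tests the next index (i + 1 < n) instead.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_EQ 9   /* $eq0 .. $eq8, as named on the slide */
#define EQ_CAP 4   /* assumed queue depth for this sketch */

typedef struct { long buf[EQ_CAP]; int head, count; } eq_t;
static eq_t eq[NUM_EQ];

static bool ready(int q) { return eq[q].count > 0; }
static long peek(int q) { assert(ready(q)); return eq[q].buf[eq[q].head]; }
static void push(int q, long v) {
    assert(eq[q].count < EQ_CAP);
    eq[q].buf[(eq[q].head + eq[q].count) % EQ_CAP] = v;
    eq[q].count++;
}
static long pop(int q) {
    long v = peek(q);
    eq[q].head = (eq[q].head + 1) % EQ_CAP;
    eq[q].count--;
    return v;
}

/* Assumed accelerator behavior: consumes $eq1 and $eq5, produces $eq3. */
static long accel(long x, long y) { return x + y; }

/* Runs the loop for n elements; baseA and baseB are indices into mem. */
void mad_run(long *mem, long baseA, long baseB, long n) {
    for (int q = 0; q < NUM_EQ; ++q) eq[q].head = eq[q].count = 0;
    if (n <= 0) return;
    push(7, 0);                          /* seed the induction variable i */
    bool progress = true;
    while (progress) {
        progress = false;
        if (ready(7)) {                  /* nodes fire when i arrives */
            long i = pop(7);
            push(0, i + baseA);          /* N0: address of A, for the load  */
            push(2, i + baseA);          /* N0: address of A, for the store */
            push(4, i + baseB);          /* N1: address of B */
            push(6, i + 1);              /* N2: i++ */
            push(8, i + 1 < n);          /* N3, adjusted: tests next index */
            progress = true;
        }
        if (ready(0)) {                  /* A0: ld, $eq0 -> $eq1 */
            push(1, mem[pop(0)]);
            progress = true;
        }
        if (ready(4)) {                  /* A2: ld, $eq4 -> $eq5 */
            push(5, mem[pop(4)]);
            progress = true;
        }
        if (ready(1) && ready(5)) {      /* accelerator consumes its inputs */
            long x = pop(1);
            long y = pop(5);
            push(3, accel(x, y));
            progress = true;
        }
        if (ready(2) && ready(3)) {      /* A1: st, $eq3 -> $eq2 */
            long addr = pop(2);
            mem[addr] = pop(3);
            progress = true;
        }
        if (ready(8) && ready(6) && peek(8)) {  /* A3: if $eq8(true) */
            pop(8);                      /* $eq8 -> (discarded) */
            push(7, pop(6));             /* mv, $eq6 -> $eq7 */
            progress = true;
        }
    }
}
```

The loop ends when no rule can fire: after the last element, $eq8 holds false, so A3 never moves the next index into $eq7 and the nodes stop producing addresses.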

13 Microarchitecture
Figure labels: Matching Events, Move data, Data-driven computation

14 MAD Execution
Figure labels: Accelerator, Code Gen., Processor Off, MAD ISA, MAD (Access)

15 Evaluation Methodology
Baseline: In-order, OOO2 and OOO4
MAD integration: 256 Dataflow Nodes, 64 Event Queues; integrated to OOO2/OOO4’s LSU
Natural and Induced Memory Access Phases
Accelerators: DySER, SIMD, NPU, C-Cores
Reproduce/reuse benchmarks relevant to each accelerator

16 Evaluation & Analysis
Performance: explicit static & dynamic dataflow, larger instruction window, less speculative. Can MAD match 2/4-OOO?
MAD should consume less energy/power

17 Summary - Performance
MAD’s performance is similar to OOO4
MAD can utilize OOO2’s LSU better: MAD+OOO2 > OOO2; with OOO4, MAD can be better than OOO4
In DySER programs, there are more opportunities for OOO4 to speculatively execute memory instructions

18 Summary - Energy
~Half energy compared to OOO2
Compared to In-Order, OOO2 delivers better performance but does not save energy
~30% energy compared to OOO4

19 Power: Natural Phases
Configurations: OOO2, OOO4, MAD2, MAD4
MAD < sum(Fetch, Decode, Dispatch, Issue, Execute, WriteBack)
LSU: more than 2-OOO, similar to 4-OOO

20 Summary
MAD is a novel and useful customization for memory access phases
Performance improvement and power reduction
Flexible & effective for accelerators

21 Questions

