Efficient Execution of Memory Access Phases Using Dataflow Specialization
Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam
Aggregation, matrix multiply, image processing…

Memory Access Phase: a dynamic portion of a program in which the instruction stream is predominantly for memory accesses and address generation.

    for (f=0; f<FSIZE; f+=4) {
      __m128 xmm_in_r = _mm_loadu_ps(in_r+p+f);
      __m128 xmm_in_i = _mm_loadu_ps(in_i+p+f);
      __m128 xmm_mul_r = _mm_mul_ps(xmm_in_r, xmm_coef);
      __m128 xmm_mul_i = _mm_mul_ps(xmm_in_i, xmm_coef);
      accum_r = _mm_add_ps(xmm_accum_r, _mm_sub_ps(xmm_mul_r, xmm_mul_i));
      accum_i = _mm_add_ps(xmm_accum_i, _mm_add_ps(xmm_mul_r, xmm_mul_i));
    }

    for (i=0; i<v_size; ++i) {
      A[K[i]] += V[i];
    }

    for (i=0; i<8; ++i) {
      for (j=0; j<8; ++j) {
        float sum=0;
        for (k=0; k<8; ++k) {
          sum += matAT[i*matAcol+k] * matB[j*matBrow+k];
        }
        matCT[i*matBcol+j] += sum;
      }
    }

    for (int y = 0; y < srcImg.height; ++y)
      for (int x = 0; x < srcImg.width; ++x) {
        p = srcImg.build3x3Window(x, y);
        NPU_SEND(p[0][0]); NPU_SEND(p[0][1]); NPU_SEND(p[0][2]);
        NPU_SEND(p[1][0]); NPU_SEND(p[1][1]); NPU_SEND(p[1][2]);
        NPU_SEND(p[2][0]); NPU_SEND(p[2][1]); NPU_SEND(p[2][2]);
        NPU_RECEIVE(pixel);
        dstImg.setPixel(x, y, pixel);
      }
Execution Model

Read D$, little comp., write D$

    Speedup    In-order  OOO2  OOO4
    Natural    1.0       1.5   2.2

Read D$, send to accel, write D$

    Speedup    In-order  OOO2  OOO4
    DySER      1.0       1.5   2.7
    SSE                  1.7   2.9
    NPU                  1.6   2.2
Core becomes bottleneck (Power)
Of total watts, address computation + data access account for < 40%
Goal: access memory more efficiently, obtaining OOO’s performance without its power overheads
Memory Access Dataflow
A specialized dataflow architecture to access memory (processor pipeline turned off)
Big idea: exposing the concept of triggering events & actions
[Figure: Cache, MAD, Processor Core (off), Accelerator]
What does the core do?
Compute patterns: computes the address and control variables
Create events: address ready, control variable resolved, value returned from cache
React: accesses memory with loads and stores
The core follows a few computation patterns to compute the address and control behavior. The outcomes create recurring events, such as a value returned from the cache or an address becoming ready in the processor. Based on these events, the core performs actions that move data between the accelerator and memory. These three activities recur and occur concurrently.
MAD ISA
Primitives: Dataflow Graph Nodes, Actions, Events
Dataflow Graph Nodes: analogous to compute instructions & register state
Actions: analogous to ld/st and move instructions
Events: analogous to program counter sequencing
Arch. Primer! Conventional RISC/CISC ISA: register state, compute instructions, LD/ST instructions, program counter and control flow
Transforming ISA
[Figure: dataflow graph with + and < nodes over BaseA, BaseB, 1, n, i]

Pseudo Program

    for (i=0; i<n; ++i) {
      a[i] = accel(a[i], b[i]);
    }

RISC ISA (named registers, computation ports, data movement, branch/PC)

    .L0
    ld,   $r0+$r1 -> $acc0
    ld,   $r2+$r1 -> $acc1
    st,   $acc2 -> $r0+$r1
    addi, $r1, 1 -> $r1
    ble,  $r4, $r1, .L0
Transforming ISA
[Figure: dataflow graph with + and < nodes over BaseA, BaseB, 1, n, i]

Named Event Queues

    # Dataflow Graph Nodes
    N0: $eq7 + base A -> $eq0,$eq2   #Addr A
    N1: $eq7 + base B -> $eq4        #Addr B
    N2: $eq7 + 1 -> $eq6             #i++
    N3: $eq7 < n -> $eq8             #i<n
MAD ISA (cont’d)
Static Dataflow: Computations
Dynamic Dataflow: Event-Condition-Action (ECA) rules
    on Event if Condition do Action
Event: a combination of primitive dataflow events (the arrival of data)
Condition: data states
Action: load, store, or moves
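The "on Event if Condition do Action" form can be sketched in C as a rule record plus a readiness check. This is only an illustrative sketch, not the MAD hardware encoding: the names `EventQueue`, `EcaRule`, `eca_rule_ready`, the bit-mask encoding of events, the one-entry queue depth, and `NUM_EQ` are all assumptions made here.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_EQ 16 /* hypothetical number of event queues */

/* Hypothetical one-entry event queue: 'full' models the arrival of data. */
typedef struct {
    bool full;
    int32_t value;
} EventQueue;

typedef enum { ACT_LOAD, ACT_STORE, ACT_MOVE } ActionKind;

/* Hypothetical encoding of one ECA rule. */
typedef struct {
    uint32_t event_mask; /* bit q set: the rule waits for data in queue q */
    int cond_eq;         /* queue whose data state is tested, -1 for none */
    bool cond_value;     /* truth value required of that queue's data */
    ActionKind action;   /* what to do once the rule fires */
} EcaRule;

/* A rule is ready when every queue named in its event mask holds data
 * (the Event) and the tested queue's data matches (the Condition). */
bool eca_rule_ready(const EcaRule *rule, const EventQueue eq[NUM_EQ]) {
    for (int q = 0; q < NUM_EQ; ++q) {
        if (((rule->event_mask >> q) & 1u) && !eq[q].full) {
            return false; /* an awaited event has not arrived yet */
        }
    }
    if (rule->cond_eq < 0) {
        return true; /* unconditional rule */
    }
    if (!eq[rule->cond_eq].full) {
        return false; /* no data state to test */
    }
    return (eq[rule->cond_eq].value != 0) == rule->cond_value;
}
```

Under this encoding, a rule that waits on queues 6 and 8 and requires queue 8 to be true would use `event_mask = (1u << 8) | (1u << 6)`, `cond_eq = 8`, `cond_value = true`.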
Transforming ISA
MAD ISA: Named Event Queues, Computation, Data Movement, EQ States (Conditions)

    # Dataflow-Graph Nodes
    N0: $eq7 + base A -> $eq0,$eq2   #Addr A
    N1: $eq7 + base B -> $eq4        #Addr B
    N2: $eq7 + 1 -> $eq6             #i++
    N3: $eq7 < n -> $eq8             #i<n

    # ECA Rules
    On $eq0      , if            , do A0:ld,$eq0->$eq1
    On $eq2∧$eq3 , if            , do A1:st,$eq3->$eq2
    On $eq4      , if            , do A2:ld,$eq4->$eq5
    On $eq8∧$eq6 , if $eq8(true) , do A3:mv,$eq6->$eq7, $eq8->
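To see how the nodes N0 to N3 and rules A0 to A3 cooperate, the following C sketch simulates them in software for the loop a[i] = accel(a[i], b[i]). Several things here are assumptions rather than part of the slides: queues are one entry deep, memory is a flat `int` array indexed by address, `accel` is a stand-in that adds its inputs, the accelerator is taken to read `$eq1` and `$eq5` and write `$eq3`, and the address and increment nodes are gated on i < n so the final token does not start an out-of-range access. The function name `mad_run` is hypothetical.

```c
#include <stdbool.h>

enum { NUM_EQ = 9 }; /* $eq0 .. $eq8 */

/* Assumed one-entry event queue. */
typedef struct { bool full; int value; } Eq;

static bool has(const Eq *eq, int q) { return eq[q].full; }
static void push(Eq *eq, int q, int v) { eq[q].full = true; eq[q].value = v; }
static int pop(Eq *eq, int q) { eq[q].full = false; return eq[q].value; }

/* Hypothetical stand-in for the accelerator's computation. */
static int accel(int a, int b) { return a + b; }

/* Runs the loop over mem[base_a .. base_a+n) and mem[base_b .. base_b+n).
 * Returns the number of stores performed. */
int mad_run(int *mem, int base_a, int base_b, int n) {
    Eq eq[NUM_EQ] = {0};
    int stores = 0;
    bool progress = true;
    push(eq, 7, 0); /* seed the index queue with i = 0 */
    while (progress) {
        progress = false;
        /* N0..N3: fire when i is available and all outputs are free. */
        if (has(eq, 7) && !has(eq, 0) && !has(eq, 2) && !has(eq, 4) &&
            !has(eq, 6) && !has(eq, 8)) {
            int i = pop(eq, 7);
            bool in_range = i < n;
            push(eq, 8, in_range);       /* N3: i < n */
            if (in_range) {              /* assumption: gate on i < n */
                push(eq, 0, i + base_a); /* N0: address of A (load) */
                push(eq, 2, i + base_a); /* N0: address of A (store) */
                push(eq, 4, i + base_b); /* N1: address of B */
                push(eq, 6, i + 1);      /* N2: i++ */
            }
            progress = true;
        }
        /* A0: on $eq0, load into $eq1. */
        if (has(eq, 0) && !has(eq, 1)) {
            push(eq, 1, mem[pop(eq, 0)]);
            progress = true;
        }
        /* A2: on $eq4, load into $eq5. */
        if (has(eq, 4) && !has(eq, 5)) {
            push(eq, 5, mem[pop(eq, 4)]);
            progress = true;
        }
        /* Accelerator: consumes $eq1 and $eq5, produces $eq3 (assumed). */
        if (has(eq, 1) && has(eq, 5) && !has(eq, 3)) {
            int a = pop(eq, 1);
            int b = pop(eq, 5);
            push(eq, 3, accel(a, b));
            progress = true;
        }
        /* A1: on $eq2 and $eq3, store the value in $eq3 to the address in $eq2. */
        if (has(eq, 2) && has(eq, 3)) {
            int addr = pop(eq, 2);
            mem[addr] = pop(eq, 3);
            ++stores;
            progress = true;
        }
        /* A3: on $eq8 and $eq6, if $eq8 is true, move $eq6 to $eq7 and drop $eq8. */
        if (has(eq, 8) && has(eq, 6) && eq[8].value && !has(eq, 7)) {
            push(eq, 7, pop(eq, 6));
            pop(eq, 8);
            progress = true;
        }
    }
    return stores;
}
```

For example, with `int mem[4] = {1, 2, 10, 20}`, calling `mad_run(mem, 0, 2, 2)` treats the first two entries as a and the last two as b, and leaves a updated in place. The rules are polled in a fixed order here purely for simplicity; nothing in the sketch depends on that order, which mirrors the idea that sequencing comes from data arrival rather than a program counter.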
Microarchitecture
Matching events, moving data, data-driven computation
MAD Execution
[Figure labels: Code Gen., MAD ISA, MAD (Access), Accelerator, Processor Off]
Evaluation Methodology
Baseline: In-order, OOO2 and OOO4
MAD integration: 256 Dataflow Nodes, 64 Event Queues, integrated into OOO2/OOO4’s LSU
Natural and Induced Memory Access Phases
Accelerators: DySER, SIMD, NPU, C-Cores
Reproduce/reuse benchmarks relevant to each accelerator
Evaluation & Analysis
Performance: can MAD match 2/4-OOO?
Energy/power: MAD should consume less energy/power
Explicit static & dynamic dataflow, larger instruction window, less speculative
Summary - Performance
MAD’s performance is similar to OOO4
MAD can utilize OOO2’s LSU better: MAD+OOO2 > OOO2; with OOO4’s LSU, MAD can be better than OOO4
In DySER programs, there are more opportunities for OOO4 to speculatively execute memory instructions
Summary - Energy
~Half energy compared to OOO2
Compared to In-Order, OOO2 delivers better performance but does not save energy
~30% energy compared to OOO4
Power: Natural Phases
Configurations: OOO2, OOO4, MAD2, MAD4
MAD < sum(Fetch, Decode, Dispatch, Issue, Execute, WriteBack)
LSU: more than 2-OOO, similar to 4-OOO
Summary
MAD is a novel and useful customization for memory access phases
Performance improvement and power reduction
Flexible & effective for accelerators
Questions