Jason Cong, Guoling Han, Zhiru Zhang VLSI CAD Lab

Architecture and Compilation for Data Bandwidth Improvement in Configurable Embedded Processors
Jason Cong, Guoling Han, Zhiru Zhang
VLSI CAD Lab, Computer Science Department, University of California, Los Angeles
Supported by NSF, GSRC, Altera, and Xilinx.

Outline
Motivation: limited data bandwidth has become the performance bottleneck of instruction-set extensible processors.
Architectural extension: hash-mapped shadow registers.
Associated compilation techniques: shadow register binding and hash function generation.
Experimental results.
Conclusions.
2018/11/13 UCLA VLSICAD LAB

Target Reconfigurable Platform
General-purpose processor (GPP) core plus programmable fabric.
Loosely coupled: the fabric acts as a coprocessor (Xilinx MicroBlaze, etc.).
Tightly integrated: the fabric provides extra function units in an application-specific instruction-set processor.
The GPP has the capability to extend its basic instruction set.
The programmable fabric implements the customized instructions.
Examples: Altera Nios / Nios II, Tensilica Xtensa, etc.
Figure: custom instruction logic for Nios II [source: …]

Target Core Processor Model
Classic single-issue pipelined RISC core (fetch / decode / execute / write-back).
The number of input and output operands of an instruction is predetermined.
A custom instruction cannot execute until all of its input operands are available.
A custom instruction reads the core register file during the execute stage and commits its result during the write-back stage.

Data Bandwidth Problem
Fact: about 60% of the speedup comes from clusters with more than two inputs [P. Ienne et al].
Architecture problem: limited register file bandwidth (two read ports, one write port).
One solution: introduce state registers and move instructions to load the extra operands [F. Sun et al, ICCAD’02].
With the extra move instructions, an average speedup drop of 36% was observed in our previous study [Cong et al, FPGA’05].
Example (*: 2 clock cycles, +: 1 clock cycle):
Original code: t1 = a * b; t2 = b * c; t3 = d * e; t4 = t1 + t2; t5 = t2 + t3; t6 = t5 + t4;
With custom instructions (speedup: 1.8): t1 = extop1(a, b, c); t2 = extop2(b, c, d, e); t3 = t1 + t2;
With custom instructions and extra moves (speedup: 1.125): mov(c); t1 = extop1(a, b, c); mov(d); mov(e); t2 = extop2(b, c, d, e); t3 = t1 + t2;
Figure: data-flow graph over the inputs a, b, c, d, e, with the * and + nodes grouped into extop1 and extop2.
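The two speedup figures in this example can be reproduced by simple cycle counting. The sketch below is only an illustration: the slide gives the latencies of * and +, while the 2-cycle latency of each custom instruction and the 1-cycle latency of each mov are assumptions chosen because they reproduce the reported numbers.

```python
# Latencies from the slide: multiply = 2 cycles, add = 1 cycle.
# ASSUMPTION (not stated in the source): each custom instruction takes
# 2 cycles and each mov takes 1 cycle on the single-issue core.
LATENCY = {"mul": 2, "add": 1, "extop": 2, "mov": 1}


def cycles(ops):
    """Total cycles when instructions execute one after another."""
    return sum(LATENCY[op] for op in ops)


baseline = ["mul", "mul", "mul", "add", "add", "add"]
custom = ["extop", "extop", "add"]
custom_with_moves = ["mov", "extop", "mov", "mov", "extop", "add"]

speedup_ideal = cycles(baseline) / cycles(custom)             # 9 / 5
speedup_moves = cycles(baseline) / cycles(custom_with_moves)  # 9 / 8
print(speedup_ideal, speedup_moves)
```

Under these assumptions the three extra moves alone account for the drop from 1.8 to 1.125.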

Existing Architecture Solutions
Multiport register file:
Low utilization when executing basic instructions.
Extra address encoding space in the instruction word.
Area and power grow cubically [S. Rixner et al, HPCA’00].
Register file replication:
Complete or partial register file copy [Chimaera: S. Hauck et al, TVLSI’04].
Power inefficient.
Predetermined one-to-one correspondence.
Limited compiler optimization.

Previous Approach – Shadow Registers (1)
Core registers are augmented by an extra set of shadow registers [Cong et al, 2005].
Shadow registers are conditionally written.
Shadow registers are read only by the custom logic.

Previous Approach – Shadow Registers (2)
Example: controlling three shadow registers requires two bits to be added to or encoded in the instruction format.
Advantage: provides opportunities for compiler optimization.
Open question: how to bind the shadow registers effectively to maximize the performance gain?
Limitation: log2(K+1) bits are required for K shadow registers, so only a small number of shadow registers is allowed.
Table: each value of the two-bit instruction subword (00, 01, 10, 11) selects an operation, either copy to a target shadow register or skip.

Proposed Approach: Hash-Mapped Shadow Registers Scheme
Shadow registers are controlled by a single control bit: 1 means copy the data, 0 means skip.
A hashing unit determines the mapping between core registers and shadow registers.
Namely, the execution result written to register R[i] in the core register file is conditionally copied to register SR[j] in the shadow register set, where j = hash(i).
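The write-back behavior described here can be modeled in a few lines. This is a minimal behavioral sketch, not the hardware design: the class name, the method name, and the default modulo hash are all hypothetical choices made for illustration.

```python
class ShadowedRegisterFile:
    """Toy model of a core register file with hash-mapped shadow registers."""

    def __init__(self, num_core, num_shadow, hash_fn=None):
        self.core = [0] * num_core
        self.shadow = [0] * num_shadow
        # ASSUMPTION: a modulo hash is used when none is supplied; the real
        # hashing unit is configurable and generated by the compiler.
        self.hash_fn = hash_fn or (lambda i: i % num_shadow)

    def write_back(self, i, value, copy_bit):
        """Commit a result to R[i]; if copy_bit is 1, also copy it to SR[hash(i)]."""
        self.core[i] = value
        if copy_bit:
            self.shadow[self.hash_fn(i)] = value


rf = ShadowedRegisterFile(num_core=8, num_shadow=2)
rf.write_back(3, 42, copy_bit=1)  # R[3] = 42 and SR[3 % 2] = SR[1] = 42
rf.write_back(5, 7, copy_bit=0)   # R[5] = 7, SR[1] is left untouched
print(rf.core, rf.shadow)
```

Because R[3] and R[5] hash to the same shadow register in this toy configuration, setting the control bit on the second write would have overwritten the first value, which is the conflict the binding step has to manage.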

Shadow Registers: Single Control Bit vs. Multiple Control Bits
Hash-mapped shadow registers.
Advantages:
Only one additional control bit is needed, which is much easier to encode in the 32-bit instruction format.
More shadow registers are allowed: the control bit count is always 1, independent of the actual number of shadow registers.
The hashing unit is configurable, so the hashing scheme can be retargeted to different applications.
Limitation: less flexibility.
Each core register can be mapped to only one shadow register.
There is less room for compiler optimizations.
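The encoding cost of the two schemes can be compared numerically. The sketch below assumes that the multi-bit scheme needs one code per shadow register plus one code for skip, that is K + 1 codes in total; the function names are hypothetical.

```python
def control_bits_multibit(k):
    """Bits needed to encode k copy targets plus 'skip' (k + 1 codes).

    For k >= 1, k.bit_length() equals ceil(log2(k + 1)).
    """
    return k.bit_length()


def control_bits_hashed(k):
    """The hash-mapped scheme always uses a single copy/skip bit."""
    return 1


for k in (1, 3, 5, 8):
    print(k, control_bits_multibit(k), control_bits_hashed(k))
```

For three shadow registers this gives the two bits mentioned for the previous approach, while the hash-mapped scheme stays at one bit for any K.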

ASIP Compilation Flow with Shadow Register Binding
C code Arch constraint SUIF / CDFG generator 1. Pattern generation CDFG 2. Pattern selection Pattern library 3. Application mapping & Code replacement Optimized code Backend compilation 4. Shadow register binding & hash function generation Implementation 2018/11/13 UCLA VLSICAD LAB

An Example Control Data Flow Graph
Each node represents an instruction.
Each edge represents a data transfer and is associated with a live interval.
In the CDFG, a live interval [s, t] spans from the time a data transfer is initiated to the time it is terminated.
One variable might correspond to multiple live intervals.
Example code (instructions 1 to 6): r1 = …; r2 = ext1 (…, r1, …); r3 = …; r4 = ext2 (…, r1, …); r5 = ext3 (…, r3, …); r6 = ext4 (…, r3, …);
Figure: CDFG with nodes 1 to 6 and edges e1 to e4, showing the lifetime of variable r1 and its live intervals l1 and l2.
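The live intervals of the example can be derived mechanically from definitions and uses. The sketch below is one possible reading of the slide, under the assumption that each interval runs from the instruction that defines a register to one instruction that reads it; the `program` encoding and the function name are hypothetical, and only the operands named on the slide are listed (the elided "…" operands are left out).

```python
# Each entry is (destination register, named source registers).
# Instruction numbers are 1-based, matching nodes 1 to 6 of the CDFG.
program = [
    ("r1", []),       # 1: r1 = ...
    ("r2", ["r1"]),   # 2: r2 = ext1(..., r1, ...)
    ("r3", []),       # 3: r3 = ...
    ("r4", ["r1"]),   # 4: r4 = ext2(..., r1, ...)
    ("r5", ["r3"]),   # 5: r5 = ext3(..., r3, ...)
    ("r6", ["r3"]),   # 6: r6 = ext4(..., r3, ...)
]


def live_intervals(program):
    """Return (register, s, t) for every definition-to-use data transfer."""
    last_def = {}
    intervals = []
    for t, (dest, sources) in enumerate(program, start=1):
        for src in sources:
            if src in last_def:
                intervals.append((src, last_def[src], t))
        last_def[dest] = t
    return intervals


for reg, s, t in live_intervals(program):
    print(f"{reg}: [{s}, {t}]")
```

Here r1 yields the two intervals [1, 2] and [1, 4], which illustrates how one variable corresponds to multiple live intervals.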

Shadow Register Binding – Motivation
It is not necessary to keep a variable in the shadow register for its entire lifetime.
Setting: 2-read-port register file, 3-input extended instructions.
Without shadow registers: 4 additional moves.
Binding for one shadow register, assuming r1 and r3 are hash-mapped to the same shadow register:
Binding 1: keeping either r1 or r3 in the shadow register saves 2 moves.
Binding 2: keeping l1 and l4 in the shadow register saves 3 moves.
Example code (instructions 1 to 6): r1 = …; r2 = ext1 (…, r1, …); r3 = …; r4 = ext2 (…, r1, …); r5 = ext3 (…, r3, …); r6 = ext4 (…, r3, …);
Figure: CDFG with nodes 1 to 6 and edges e1 to e4, showing the live intervals l1 and l4 for the variables r1 and r3.

Binding for One Shadow Register – Problem Formulation
Binding problem for one shadow register with a predetermined hash function.
Given: (i) a shadow register sr, (ii) a hash function h, and (iii) an interval set S in which each interval is hash-mapped to sr.
Goal: select a subset of non-overlapping live intervals in S and bind them to sr so that the maximum number of move operations is saved.

Shadow Register Binding – Algorithm
Construct a weighted interval graph G(V’, E’), which serves as the compatibility graph.
Create a vertex v for each live interval [s, t].
The weight on each vertex is the number of moves saved if the interval is bound to the shadow register.
Create an edge e(v, v’) iff t < s’, where v = [s, t] and v’ = [s’, t’].
Theorem: the binding problem is equivalent to finding a maximum weighted chain in the compatibility graph, which can be solved optimally in O(|V’|^2) time.
Extension to K shadow registers:
Each live interval can be mapped to only one shadow register.
The algorithm handles K shadow registers by independently solving a series of one-shadow-register binding problems.
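The maximum weighted chain can be found with a quadratic dynamic program over the intervals sorted by start time. The sketch below is an illustrative implementation rather than the authors' code; the function names are hypothetical, and the interval weights in the example are assumptions inferred from the move counts quoted on the motivation slides (an interval saves one move per custom-instruction read it covers).

```python
def bind_one_shadow_register(intervals):
    """intervals: list of (s, t, weight). Returns (saved moves, chosen chain)."""
    order = sorted(intervals, key=lambda iv: iv[0])  # predecessors come first
    n = len(order)
    if n == 0:
        return 0, []
    best = [0] * n    # best[j]: weight of the heaviest chain ending at order[j]
    prev = [-1] * n   # predecessor index in that chain, -1 if none
    for j, (s_j, _, w_j) in enumerate(order):
        best[j] = w_j
        for i in range(j):
            # Compatibility edge i -> j exists iff t_i < s_j.
            if order[i][1] < s_j and best[i] + w_j > best[j]:
                best[j] = best[i] + w_j
                prev[j] = i
    end = max(range(n), key=best.__getitem__)
    chain = []
    k = end
    while k != -1:
        chain.append(order[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain


def bind_k_shadow_registers(intervals_by_sr):
    """Solve one independent binding problem per shadow register."""
    return {sr: bind_one_shadow_register(ivs) for sr, ivs in intervals_by_sr.items()}


# ASSUMED weights: l1 = [1, 2] saves 1, l2 = [1, 4] saves 2 (both for r1);
# l3 = [3, 5] saves 1, l4 = [3, 6] saves 2 (both for r3).
r1_intervals = [(1, 2, 1), (1, 4, 2)]
r3_intervals = [(3, 5, 1), (3, 6, 2)]

shared = bind_one_shadow_register(r1_intervals + r3_intervals)
split = bind_k_shadow_registers({"sr1": r1_intervals, "sr2": r3_intervals})
print(shared)
print(split)
```

With r1 and r3 sharing one shadow register the chain is l1 followed by l4 and saves 3 moves; with one shadow register each, the two independent problems save 2 moves apiece.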

Hash Function Generation – Motivation
The hash function also affects the performance speedup.
Setting: 2-read-port register file, 3-input extended instructions.
No shadow registers: four additional moves.
Two shadow registers available:
If r1 and r3 are hash-mapped to the same shadow register, three moves can be saved.
If r1 and r3 are hash-mapped to different shadow registers, all four moves can be saved.
Example code (instructions 1 to 6): r1 = …; r2 = ext1 (…, r1, …); r3 = …; r4 = ext2 (…, r1, …); r5 = ext3 (…, r3, …); r6 = ext4 (…, r3, …);
Figure: CDFG with nodes 1 to 6 and edges e1 to e4.

Hash Function Generation – Problem Formulation
Given: (i) a set of core registers R = {r1, …, rN} and (ii) a set of shadow registers SR = {sr1, …, srK}.
Goal: find a many-to-one function h: R → SR so that the maximum number of move operations is saved when h is used as the hash function.

Hash Function Generation – Algorithm
The hash function generation problem is equivalent to a multi-way set partitioning problem.
A two-step approach is used to solve the problem.
Step 1: reorder the core register indices to obtain a linear permutation.
One simple heuristic: use a mod function to derive the permutation. If N = 6 and K = 2, the sequence r1, r2, r3, r4, r5, r6 becomes r1, r3, r5, r2, r4, r6.
Step 2: given the permutation, solve a one-dimensional K-way partitioning problem.
This step adopts the algorithm in [Alpert, DAC’94] and is optimally solvable by dynamic programming.
Figure: core registers r1 to r6 mapped to shadow registers sr1 and sr2.
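The two steps can be sketched as follows. This is an illustration in the spirit of the slide, not the algorithm of [Alpert, DAC’94] itself: the sort key is one mod-based ordering that reproduces the N = 6, K = 2 example, the function names are hypothetical, and the `toy_gain` function simply encodes the move counts quoted on the motivation slide instead of running a real binding step.

```python
def mod_permutation(n, k):
    """Order register indices 1..n by (i - 1) mod k, breaking ties by i."""
    return sorted(range(1, n + 1), key=lambda i: ((i - 1) % k, i))


def partition(perm, k, gain):
    """Split perm into exactly k contiguous, non-empty blocks maximizing total gain."""
    n = len(perm)
    neg = float("-inf")
    # best[b][j]: best total gain for perm[:j] split into b blocks.
    best = [[neg] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0
    for b in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(b - 1, j):
                if best[b - 1][i] == neg:
                    continue
                cand = best[b - 1][i] + gain(perm[i:j])
                if cand > best[b][j]:
                    best[b][j] = cand
                    cut[b][j] = i
    blocks, j = [], n
    for b in range(k, 0, -1):  # walk the recorded cut points backwards
        i = cut[b][j]
        blocks.append(perm[i:j])
        j = i
    blocks.reverse()
    return best[k][n], blocks


def toy_gain(block):
    """ASSUMED gain: r1 and r3 alone save 2 moves each, together only 3."""
    hits = len({1, 3} & set(block))
    return {0: 0, 1: 2, 2: 3}[hits]


perm = mod_permutation(6, 2)
saved, blocks = partition(perm, 2, toy_gain)
hash_map = {reg: sr for sr, block in enumerate(blocks, start=1) for reg in block}
print(perm, saved, blocks, hash_map)
```

In a full flow the gain of a block would come from solving the one-shadow-register binding problem on the live intervals of the registers in that block; the lookup table above only stands in for that step.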

Simulation-Based Performance Evaluation Flow
We adopt a SimpleScalar-based simulation flow to estimate the performance speedup, because it is difficult to make architectural and compiler extensions on commercial processors.
Flow (figure):
Inputs: binary code and architecture constraints.
CDFG extractor, producing the CDFG.
1. Pattern generation.
2. Pattern selection (pattern library).
3. Application mapping and code replacement, producing optimized code.
Backend compilation.
4. Shadow register binding and hash function generation.
SimpleScalar, producing the estimated performance.

Experimental Setting
Simulator: SimpleScalar v3.0.
Benchmarks: MediaBench and MiBench.
Entire programs, rather than small pieces of code, are used for instruction set generation and simulation.
Machine configuration:
Single-issue in-order processor.
DL1: 8KB, 4-way, 1 cycle.
IL1: 8KB, direct mapped, 1 cycle.
Unified L2: 256KB, 4-way, 8 cycles.
Functional units: 2 IntALU, 1 IntMult, 1 FPALU, 1 FPMult.
Reconfigurable units: use the critical path latencies of the collapsed instructions.

Speedup under Different Shadow Register Architectures (1)
Under the 3-input constraint, over 90% of the performance gap can be closed with 5 hash-mapped shadow registers.

Speedup under Different Shadow Register Architectures (2)
Under the 4-input constraint, over 95% of the performance gap can be closed with 8 hash-mapped shadow registers.

Speedup Comparison: Shadow registers vs. Register Replication (1)
Under the 3-input constraint and with the same number of registers, the shadow register architecture consistently outperforms partial register replication.

Speedup Comparison: Shadow registers vs. Register Replication (2)
Under the 4-input constraint and with the same number of registers, the shadow register architecture consistently outperforms partial register replication.

Conclusions
A novel low-cost hash-mapped shadow register architecture is proposed.
A global shadow register binding and hash function generation problem is solved.
Experiments show encouraging speedup.
