Mapping DSP algorithms to a general-purpose out-of-order processor
ECE 734
Ilhyun Kim, Donghyun Baik
Outline
- Introduction
- Out-of-order execution overview
- Dependence graph changes when mapping to a GPP
- To-do list
- Expected results
Introduction
- Why DSP applications are implemented on a GPP
  - Lower development cost: commodity parts, lower maintenance cost, faster turnaround
  - GPPs now meet the performance requirements; in the past, only DSP chips could
- Problems with algorithm transformations
  - Transformations target faster operation or more efficient hardware, but on a GPP software has no control over the hardware configuration
  - Some transformations are effective while others are not
- Problems with extracting parallelism
  - Duplicated effort at each layer (source code, compiler, processor)
  - The machine searches for independent operations only within a narrow scope
- What are efficient ways to map an algorithm to a GPP?
  - Understand how a GPP executes instructions
  - How can software improve performance?
Out-of-order execution overview
- Dynamic parallelism extraction: instruction reordering
- The processor dynamically searches for independent operations within a limited scope (the instruction window)
- It tries to keep all available functional units busy

Example loop (from the figure):
  for (i = 1..m)
    for (j = 1..n)
      c(i,j) = c(i,j-1) + c(i-1,j)
Understanding DG change for mapping to GPP
[Figure: the loop for (i = 1..m) for (j = 1..n) c(i,j) = c(i,j-1) + c(i-1,j) is compiled into instructions that flow through the instruction window; the instructions fall into three classes: computation, memory access, and control-related]
To-do list
- Infrastructure
  - Build a perfect-machine-model simulator that tracks only the computations in the algorithm, measuring the ideal execution time of a compiled binary assuming perfect parallelism
  - Build a profiling tool that locates an instruction of interest among the instructions in the binary
- Characterization on various machine configurations
  - The effect of single assignment
  - The effect of unfolding
  - The effect of SIMD parallelism
- Optimization techniques for the Alpha architecture, based on the characterization data
- Optimize an existing DSP application: the MPEG-2 decoder

Machine configuration: 20-instruction integer window, 15-instruction FP window; 2 integer units, 2 address-generation/memory units, 2 FP units
Expected Results
- Single-assignment transformation doesn't help
  - Rather, recycle storage space whenever possible
  - The hardware already performs single assignment (register renaming)
- Unfolding transformation
  - Works on iteration-independent loops with trivial computations
  - Reduces loop-index overhead (even on iteration-dependent loops)
  - Do not unfold loops with non-trivial computations
- SIMD parallelism to reduce memory communication
  - Alpha has no SIMD instruction set, but it has a 64-bit datapath, and instructions can read/write 64 bits at a time by splitting/merging narrow words
  - There are more computing units (4) than memory units (2), so reducing memory operations helps performance
- Performance improvement of the MPEG-2 decoder from the optimizations we apply