Stream Scheduling
MaxAcademy Lecture Series – V1.0, September 2011

Overview
- Latencies in stream computing
- Scheduling algorithms
- Stream offsets

Latencies in Stream Computing
Consider a simple arithmetic pipeline: (A + B) + C
- Each operation has a latency
  - Number of cycles from input to output
  - May be zero
  - Throughput is still 1 value per cycle; L values can be in-flight in the pipeline

[Figure sequence: two chained adders with Input A, Input B, Input C and Output]
- Basic hardware implementation
- Data propagates through the circuit in “lock step”
- Data arrives at the wrong time due to pipeline latency
- Insert buffering to correct
- Now with buffering, the data arrives together: success!

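
The misalignment in the figure sequence above can be reproduced with a small cycle-level model. This is only a sketch under stated assumptions: each adder is assumed to have a latency of 1 cycle, list index stands for the clock cycle, and empty pipeline registers are assumed to hold 0. The helper names `delay` and `pipelined_add` are invented for this illustration.

```python
def delay(stream, cycles, fill=0):
    """Model a buffer: each value appears `cycles` cycles later (index = cycle)."""
    return [fill] * cycles + list(stream)


def pipelined_add(x, y, latency=1):
    """Add the values present on the same cycle; results emerge `latency` cycles later."""
    sums = [xv + yv for xv, yv in zip(x, y)]
    return delay(sums, latency)


a = [1, 2, 3]
b = [10, 20, 30]
c = [100, 200, 300]
expected = [av + bv + cv for av, bv, cv in zip(a, b, c)]

ab = pipelined_add(a, b)                    # a[i] + b[i] appears on cycle i + 1
unbuffered = pipelined_add(ab, c)           # c[i] meets a[i-1] + b[i-1]: wrong pairing
buffered = pipelined_add(ab, delay(c, 1))   # c delayed by one cycle to line up with ab

total_latency = 2  # assumed: two adders of latency 1 between inputs and output
print(unbuffered[total_latency:])  # [211, 322]
print(buffered[total_latency:])    # [111, 222, 333]
```

Without the buffer, each value of C is added to the sum from the previous cycle; delaying C by the latency of the first adder restores the intended pairing.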
Stream Scheduling Algorithms
- A stream scheduling algorithm transforms an abstract dataflow graph into one that produces the correct results given the latencies of the operations
- Can be applied automatically to a large dataflow graph (many thousands of nodes)
- Can try to optimize for various metrics
  - Latency from inputs to outputs
  - Amount of buffering inserted (generally the most interesting)
  - Area (resource sharing)

ASAP: As Soon As Possible
[Figure sequence: the circuit is built up from the inputs towards the output; the inputs are labeled 0 and the first adder output is labeled 1]
- Build up the circuit incrementally, keeping track of latencies
- Input latencies are mismatched
- Insert buffering

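
One way to sketch the ASAP procedure is as a forward pass over the graph in topological order. The code below is a minimal illustration, not MaxCompiler's implementation: the graph encoding (`LATENCY`, `INPUTS`), the function name `asap`, and the adder latency of 1 are all assumptions, and the graph used is the (A + B) + C example rather than the exact circuit in the figures.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical encoding: node -> latency, and node -> list of producer nodes.
LATENCY = {"A": 0, "B": 0, "C": 0, "add1": 1, "add2": 1, "out": 0}
INPUTS = {
    "A": [], "B": [], "C": [],
    "add1": ["A", "B"],
    "add2": ["add1", "C"],
    "out": ["add2"],
}


def asap(latency, inputs):
    """Schedule every node as early as its producers allow; inputs start at cycle 0."""
    start = {}
    for node in TopologicalSorter(inputs).static_order():  # producers come first
        start[node] = max((start[p] + latency[p] for p in inputs[node]), default=0)
    # Buffering on an edge is the gap between data being ready and being consumed.
    buffers = {
        (p, node): start[node] - (start[p] + latency[p])
        for node, preds in inputs.items()
        for p in preds
    }
    return start, buffers


start, buffers = asap(LATENCY, INPUTS)
print(start["add2"], start["out"])  # 1 2
print(buffers[("C", "add2")])       # 1
print(sum(buffers.values()))        # 1
```

Under these assumptions the only buffer is one cycle on the edge from C into the second adder, which is the mismatch the figures describe.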
ALAP: As Late As Possible
[Figure sequence: the circuit is built up backwards from the output, which is labeled 0; node labels include -2]
- Start at the output
- Latencies are negative relative to the end of the circuit
- Buffering is saved
[Figure sequence: the same circuit with a second output, Output 2, added]
- Sometimes this is suboptimal: what if we add an extra output?
- Unnecessary buffering is added
- Neither ASAP nor ALAP can schedule this design optimally

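
ALAP can be sketched as the mirror image of ASAP: a backward pass that pins every output at cycle 0 and places each node as late as its consumers allow. The example below is an assumption-laden illustration: the adder latency is taken to be 1, and the two-output graph (a second output wired directly to input A) is a made-up case chosen to show the effect, not necessarily the circuit in the figures.

```python
from graphlib import TopologicalSorter  # Python 3.9+

LATENCY = {"A": 0, "B": 0, "C": 0, "add1": 1, "add2": 1, "out1": 0, "out2": 0}
ONE_OUTPUT = {
    "A": [], "B": [], "C": [],
    "add1": ["A", "B"],
    "add2": ["add1", "C"],
    "out1": ["add2"],
}
# Hypothetical extra output that taps input A directly.
TWO_OUTPUTS = dict(ONE_OUTPUT, out2=["A"])


def alap(latency, inputs):
    """Schedule every node as late as its consumers allow; outputs sit at cycle 0."""
    consumers = {node: [] for node in inputs}
    for node, preds in inputs.items():
        for p in preds:
            consumers[p].append(node)
    start = {}
    order = list(TopologicalSorter(inputs).static_order())
    for node in reversed(order):  # consumers are visited before their producers
        needed_by = min((start[c] for c in consumers[node]), default=0)
        start[node] = needed_by - latency[node]
    buffers = {
        (p, node): start[node] - (start[p] + latency[p])
        for node, preds in inputs.items()
        for p in preds
    }
    return start, buffers


start1, buffers1 = alap(LATENCY, ONE_OUTPUT)
start2, buffers2 = alap(LATENCY, TWO_OUTPUTS)
print(start1["A"], start1["C"])   # -2 -1
print(sum(buffers1.values()))     # 0
print(buffers2[("A", "out2")])    # 2
```

With a single output, the inputs simply start at different (negative) times and no buffering is needed. Once the second output is pinned at cycle 0 as well, input A has to start at -2 to feed the adders, so its direct path to the extra output is buffered for two cycles.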
Optimal Scheduling
- ASAP and ALAP both fix either the inputs or the outputs in place
- More complex scheduling algorithms may be able to find a better schedule, e.g. using integer linear programming (ILP)

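
To illustrate what freeing both inputs and outputs buys, the sketch below searches all small integer schedules for the one with the least total buffering. An exhaustive search stands in for an ILP solver here purely to keep the example dependency-free; it only works for tiny graphs. The graph (a hypothetical second output tapping input A), the adder latency of 1, and the cost model (sum of per-edge delays, ignoring bit widths and buffer sharing) are all assumptions.

```python
from itertools import product

LATENCY = {"A": 0, "B": 0, "C": 0, "add1": 1, "add2": 1, "out1": 0, "out2": 0}
INPUTS = {
    "A": [], "B": [], "C": [],
    "add1": ["A", "B"],
    "add2": ["add1", "C"],
    "out1": ["add2"],
    "out2": ["A"],  # hypothetical extra output wired straight to input A
}


def total_buffering(start):
    """Sum of edge delays, or None if some consumer would run before its data is ready."""
    total = 0
    for node, preds in INPUTS.items():
        for p in preds:
            slack = start[node] - (start[p] + LATENCY[p])
            if slack < 0:
                return None
            total += slack
    return total


def best_schedule(max_time):
    """Try every assignment of start cycles in 0..max_time and keep the cheapest."""
    nodes = list(INPUTS)
    best = None
    for times in product(range(max_time + 1), repeat=len(nodes)):
        start = dict(zip(nodes, times))
        cost = total_buffering(start)
        if cost is not None and (best is None or cost < best[0]):
            best = (cost, start)
    return best


cost, start = best_schedule(max_time=2)
print(cost)                        # 0
print(start["C"], start["out2"])   # 1 0
```

For this graph the search finds a schedule with no buffering at all: input C starts one cycle after A and B, and the extra output is produced two cycles before the main one. Neither of those choices is available when all inputs, or all outputs, are pinned to the same cycle.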
Buffering data on-chip
Consider:
  a[i] = a[i] + (a[i - 1] + b[i - 1])
- We can see that we might need some explicit buffering to hold more than one data element on-chip
- We could do this explicitly, with buffering elements:
  a = a + (buffer(a, 1) + buffer(b, 1))

[Figure sequence: circuit with Input A, Input B, Buffer(1) elements and Output]
- The buffer has zero latency in the schedule
- This will schedule thus: Buffering =

Buffers and Latency
- Accessing previous values with buffers is looking backwards in the stream
- This is equivalent to having a wire with negative latency
  - Cannot be implemented directly, but can affect the schedule

[Figure sequence: the same circuit with the buffers replaced by Offset(-1) wires]
- Offset wires can have negative latency
- This is scheduled with Buffering = 0

Stream Offsets
- A stream offset is just a wire with a positive or negative latency
  - Negative latencies look backwards in the stream
  - Positive latencies look forwards in the stream
- The entire dataflow graph will re-schedule to make sure the right data value is present when needed
- Buffering could be placed anywhere, or pushed into inputs or outputs
  - More optimal than manual instantiation

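
The meaning of an offset can be shown on plain lists before thinking about hardware. In this sketch, `offset` is a hypothetical helper that mimics the indexing described above; how values beyond either end of the stream are treated is not covered in these slides, so the fill value of 0 is purely an assumption.

```python
def offset(stream, k, fill=0):
    """Element i of the result is stream[i + k]; out-of-range reads give `fill` (assumed)."""
    n = len(stream)
    return [stream[i + k] if 0 <= i + k < n else fill for i in range(n)]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

print(offset(a, -1))  # [0, 1, 2, 3]  looks backwards in the stream
print(offset(a, +1))  # [2, 3, 4, 0]  looks forwards in the stream

# a[i] + (a[i - 1] + b[i - 1]), written with offsets instead of explicit buffers
result = [x + (pa + pb) for x, pa, pb in zip(a, offset(a, -1), offset(b, -1))]
print(result)         # [1, 13, 25, 37]
```

The list version needs the whole stream in memory; in a streaming circuit the scheduler instead arranges for just enough values to be held on-chip.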
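Example: a forward offset
  a = a + stream.offset(a, +1)
  a[i] = a[i] + a[i + 1]
[Figure: one adder fed by Input A directly and through Offset(1), driving Output; the nodes are labeled with schedule times 0, 1, 1 and 2]
- Scheduling produces a circuit with 1 buffer
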
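
Treating an offset as extra (possibly negative) latency on a wire lets an ordinary ASAP pass handle it. The sketch below is an illustration under assumptions, not MaxCompiler's scheduler: the edge encoding `(producer, offset)`, the function name `asap_with_offsets`, and the adder latency of 1 are all invented for the example.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def asap_with_offsets(latency, edges):
    """ASAP where each edge (producer, k) behaves like a wire with latency k."""
    preds = {node: {p for p, _ in ins} for node, ins in edges.items()}
    start = {}
    for node in TopologicalSorter(preds).static_order():
        start[node] = max(
            (start[p] + latency[p] + k for p, k in edges[node]), default=0
        )
    buffers = {
        (p, node, k): start[node] - (start[p] + latency[p] + k)
        for node, ins in edges.items()
        for p, k in ins
    }
    return start, buffers


# a[i] + a[i + 1]: one adder fed by A directly and through a +1 offset
FWD_LAT = {"A": 0, "add": 1, "out": 0}
FWD = {"A": [], "add": [("A", 0), ("A", 1)], "out": [("add", 0)]}

# a[i] + (a[i - 1] + b[i - 1]): the inner adder reads A and B through -1 offsets
BWD_LAT = {"A": 0, "B": 0, "add1": 1, "add2": 1, "out": 0}
BWD = {
    "A": [], "B": [],
    "add1": [("A", -1), ("B", -1)],
    "add2": [("A", 0), ("add1", 0)],
    "out": [("add2", 0)],
}

fwd_start, fwd_buffers = asap_with_offsets(FWD_LAT, FWD)
bwd_start, bwd_buffers = asap_with_offsets(BWD_LAT, BWD)
print(fwd_start)                   # {'A': 0, 'add': 1, 'out': 2}
print(sum(fwd_buffers.values()))   # 1
print(sum(bwd_buffers.values()))   # 0
```

In the forward case the direct path from A waits one cycle for the value the offset looks ahead to, which gives the single buffer. In the backward case the one-cycle adder latency already supplies the delay that the -1 offsets ask for, so nothing extra is inserted.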
Exercises
For the questions below, assume that the latency of an addition operation is 10 cycles and a multiply takes 5 cycles, while inputs/outputs take 0 cycles.
1. Write pseudo-code algorithms for ASAP and ALAP scheduling of a dataflow graph.
2. Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and draw the buffering introduced by ASAP scheduling to:
   a) c = ((a1 + a2) + a3) + a4
   b) c = (a1 + a2) + (a3 + a4)
3. Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and write out the inequalities that must be satisfied to schedule:
   a) c = ((a1 * a2) + (a3 * a4)) + a1
   b) c = stream.offset(a1, -10)*a2 + stream.offset(a1, -5)*a3 + stream.offset(a1, +15)*a4
   How many values of stream a1 will be buffered on-chip for (b)?