Drinking from the Firehose

Number nine of a series
Everything in its due time – software pipelining in the Mill™ CPU Architecture

Talks in this series
Encoding; The Belt; Memory; Prediction; Metadata and speculation; Execution; Security; Specification; Software pipelining (you are here); …
Slides and videos of other talks are at: http://millcomputing.com/docs

The Mill CPU
The Mill is a new general-purpose commercial CPU family.
The Mill has a 10x single-thread power/performance gain over conventional out-of-order superscalar architectures, yet runs the same programs, without rewrite.
This talk will explain how the Mill:
  pipelines without prologues and epilogues
  pipelines loop-carried data
  pipelines mixed-latency operations
  pipelines outer loops
  tail-recursion induction

Caution! Gross over-simplification!
This talk tries to convey an intuitive understanding to the non-specialist. The reality is more complicated. (We try not to over-simplify, but sometimes…)

What’s a software pipeline?
A review

An example
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
Loop body: load, add, store.
Time per iteration: 3 cycles (assuming all ops take one cycle, and ignoring the control variable update and test).

An example
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
[Figure: load, add and store for iterations 0 to 2 along a time axis; each loaded value passes from memory through reg 1 to the add and back to memory.]

An example
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
Unpipelined schedule over time: load0 add0 store0 load1 add1 store1 load2 add2 store2
Subscripts indicate the iteration number of the operation.

What’s under the hood?
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
[Figure: the same schedule shown per functional unit; the load unit, the adder and the store unit are each busy one cycle in three and idle the rest of the time.]

Run the units in parallel, every cycle
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
Schedule (one column per machine cycle):
  load0   load1   load2   load3   load4   load5
          add0    add1    add2    add3    add4
                  store0  store1  store2  store3
Requires wide issue: superscalar, VLIW, Mill.
One iteration is spread over three cycles.

An example
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
In steady state, each cycle executes one third of each of three iterations: for example, load2, add1 and store0 issue together.
Time per iteration: one cycle.
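The pipelined schedule can be mimicked in plain C. The following is only a sketch: the function names, the two local variables standing in for registers, and the guard conditions that play the part of the ramp-up and drain phases are assumptions made for illustration, not anything shown in the talk.

```c
#include <assert.h>

/* Unpipelined reference: load, add and store run back to back. */
void add3_plain(int *a, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) a[i] = a[i] + 3;
}

/* Pipelined schedule: "cycle" t issues store(t-2), add(t-1) and load(t). */
void add3_pipelined(int *a, int n) {
    assert(n >= 0);
    int loaded = 0;  /* stands in for reg 1: result of the latest load */
    int summed = 0;  /* stands in for reg 2: result of the latest add  */
    for (int t = 0; t < n + 2; ++t) {
        /* Ops are written oldest-first so each reads the value produced
           in the previous cycle before it is overwritten. */
        if (t >= 2) a[t - 2] = summed;                 /* store of iteration t-2 */
        if (t >= 1 && t - 1 < n) summed = loaded + 3;  /* add of iteration t-1   */
        if (t < n) loaded = a[t];                      /* load of iteration t    */
    }
}
```

The guards are what a conventional compiler would turn into separate prologue and epilogue code; the middle cycles, where all three guards hold, are the steady state.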

But – what happens to the data?
Data produced in one iteration must be passed to the consuming operation of the same iteration, not to the consuming operation of a different iteration. On a conventional machine, data is passed from operation to operation in general registers.

An example
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
[Figure: the pipelined schedule for iterations 0 to 4; each load result passes through reg 1 to the add of the same iteration, and each add result passes through reg 2 to the store.]

Loop-carried variables
Change the loop to use a single value in several iterations.
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
A[i+1] is a loop-carried variable: the value loaded as A[i+1] in one iteration is needed again as A[i] in the next.
The number of iterations that a value must be carried over is called the distance of the carried variable. The largest carried distance is the loop distance.

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
Does the pipeline code still work? With load0 in reg 1, where is the second argument to add0?
Try doing two loads to start the loop? Both results target reg 1, so the question remains: where is the second argument to add?
Add another register, so that loads land in reg 1 and reg 2 and the add result goes to reg 3? BUT…

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
[Figure: one load result went to reg 1 and the next load result went to reg 2.]
But the loads are the SAME operation. How can they have different result registers? Only by duplicating the code. Must unroll the loop distance times!
So use two instructions: one loads to one register and one to the other.

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
These loads (load0, load2, load4) go to reg 1, and the others (load1, load3, load5) go to reg 2. In effect, the loop is unrolled 2X.
So this is the steady state: two instructions, not one.

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+10];
Now change the loop distance from A[i+1] to A[i+10], and you have to unroll 10X.

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
Or you can use a copy: each load lands in reg 1, a copy op moves the previous load result to reg 2, and the add writes reg 3.
The steady state is one instruction.
DRAWBACK: one copy operation, and one register, per LCV times distance.
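The copy approach can be written out in C. This is a minimal sketch under stated assumptions: the function names are made up, the loop bounds are tightened so that a[i + 1] stays inside an n-element array, and local variables stand in for the three registers.

```c
#include <assert.h>

/* Reference form: two loads per iteration (a[i] and a[i + 1]). */
void sum_next_plain(int *a, int n) {
    assert(n >= 0);
    for (int i = 0; i + 1 < n; ++i) a[i] = a[i] + a[i + 1];
}

/* One load per iteration; the carried value is handed on with a copy. */
void sum_next_copy(int *a, int n) {
    assert(n >= 0);
    if (n < 2) return;
    int carried = a[0];            /* plays the role of reg 2 */
    for (int i = 0; i + 1 < n; ++i) {
        int loaded = a[i + 1];     /* reg 1: the only load in the loop body */
        a[i] = carried + loaded;   /* reg 3: the add result, then the store */
        carried = loaded;          /* the copy that keeps the value alive   */
    }
}
```

With a carried distance of 10 instead of 1, this form would need ten such variables and ten copies per iteration, which is the drawback the slide names.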

The Belt: a review
The Mill has no general registers. All operation results are entered in a FIFO, the Belt.

The Belt
We call it the Belt: like a conveyor belt, it is a fixed-length FIFO.
New results drop on the front, pushing the last off the end.
Functional units can read any position.
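A toy software model may help fix the idea. This is only a sketch: the type and function names are invented here, and the length of 8 is an assumption chosen for the example rather than a stated property of any Mill family member.

```c
#include <assert.h>

#define BELT_LEN 8  /* assumed length, for illustration only */

typedef struct { int slot[BELT_LEN]; } Belt;  /* slot[0] is the front */

/* Drop a new result on the front; the oldest value falls off the end. */
void belt_drop(Belt *b, int value) {
    for (int k = BELT_LEN - 1; k > 0; --k) b->slot[k] = b->slot[k - 1];
    b->slot[0] = value;
}

/* Any position can be read; position 0 is the most recent drop. */
int belt_read(const Belt *b, int pos) {
    assert(pos >= 0 && pos < BELT_LEN);
    return b->slot[pos];
}

/* An adder reads two belt positions and drops its result on the front. */
void belt_add(Belt *b, int lhs_pos, int rhs_pos) {
    int sum = belt_read(b, lhs_pos) + belt_read(b, rhs_pos);
    belt_drop(b, sum);
}
```

Note that operands are named by position, not by a register number, so the same instruction sees fresh values each time it executes.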

Multiple reads
Functional units can read any mix of belt positions.

Multiple drops
All results retiring in a cycle drop together.

The simple example – using the Belt
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
[Figure: animation of the pipelined schedule, one instruction per cycle: load0, then load1 add0, then load2 add1 store0, and so on; each result drops onto the front of the Belt and later operations read it from there.]
This is the steady state.

The loop-carried example, on the Belt
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+A[i+1];
[Figure: animation of the same pipeline for the loop-carried example; successive load results remain on the Belt, so each add can read the two loads it needs.]
This is the steady state: no unrolling, no copies, one instruction.

Prologues
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+3;
[Figure: the instructions load0, then load1 add0, marked as the prologue, followed by the steady state starting at load2 add1 store0.]
Prologue instructions: unpipelined latency + loop distance - 1
Prologue operations: (N*(N-1))/2, where N is the number of steady-state ops
With 3 ops: 2 instructions, 3 ops
With 6 ops: 6 instructions, 15 ops
With 20 ops: 20 instructions, 190 ops
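The operation counts quoted on this slide follow from the closed form. A small sketch, assuming one reading of the formula in which the k-th prologue instruction holds k operations, so that the total is 1 + 2 + … + (N-1); the function name is made up for illustration.

```c
#include <assert.h>

/* Prologue operation count for a steady state of n one-cycle ops. */
unsigned long prologue_ops(unsigned long n) {
    assert(n >= 1);            /* a loop body has at least one op */
    return n * (n - 1) / 2;    /* 1 + 2 + ... + (n - 1)           */
}
```

For n = 3, 6 and 20 this gives 3, 15 and 190 operations, matching the figures above.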

The retire operation
retire supplies values that the loop hasn’t calculated yet.
retire(4); tells the hardware that four result operands are supposed to retire to the belt in the current cycle. If fewer will retire, retire invents the missing results. Metadata marks invented operands as None.

NaR bits
Every data element has a NaR (Not A Result) bit in the element metadata. The bit is set whenever a detected error precludes producing a valid value.
When the operation is OK, the payload is the value. When it fails, the payload records the kind of error and the location of the failing operation.
A debugger displays the fault detection point.

An error – or just missing data?
A None is a kind of NaR that identifies a missing value. Most operations are speculable: they have no side effects. NaRs and Nones just pass through a speculable operation unchanged.

An error – or just missing data?
A non-speculable operation has side effects. A store to memory is the most common example.
Normal data, Nones and NaRs differ in their response to non-speculable operations: a normal value is acted on, a NaR causes a FAULT!, and a None is discarded and nothing happens.
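The three responses can be modeled in a few lines of C. This is a sketch only: the type names, the add_imm and store helpers, and the use of a return code in place of a hardware fault are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { VALUE, NONE, NAR } Kind;
typedef struct { Kind kind; int payload; } Operand;
typedef enum { STORED, DISCARDED, FAULTED } StoreOutcome;

/* Speculable op: a None or NaR input passes through unchanged. */
Operand add_imm(Operand x, int k) {
    if (x.kind == VALUE) x.payload += k;
    return x;
}

/* Non-speculable op: the three kinds of operand behave differently. */
StoreOutcome store(int *addr, Operand x) {
    assert(addr != NULL);
    switch (x.kind) {
    case VALUE: *addr = x.payload; return STORED;
    case NONE:  return DISCARDED;   /* nothing happens            */
    default:    return FAULTED;     /* stands in for a real fault */
    }
}
```

Only single-operand operations are modeled here, so the sketch takes no position on what happens when a None and a NaR meet in the same operation.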

The simple example – using Nones
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
First, fill the Belt with Nones.
Next, execute the steady-state instruction: load0 add-1 store-2. Negative subscripts denote iterations that do not exist, so their operands are Nones.
Supply data from the belt, retire results and side effects, and advance the belt.
Rinse, repeat: load1 add0 store-1, then load2 add1 store0.
The first two stores see Nones and do nothing. With store0, the first real value goes to memory.
No prologue code required: the steady-state loop body is the prologue.
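The prologue-free start-up can be simulated in C. The following is a sketch under several assumptions that are not in the talk: a four-entry belt, a fixed drop order (add result first, then load result), loads past the end of the array modeled as Nones so the pipeline drains, and invented names throughout.

```c
#include <assert.h>

#define BELT_LEN 4  /* assumed length, enough for this example */

typedef struct { int is_none; int value; } Operand;

/* Drop a result on the front of the belt; the oldest entry falls off. */
static void drop(Operand belt[BELT_LEN], Operand x) {
    for (int k = BELT_LEN - 1; k > 0; --k) belt[k] = belt[k - 1];
    belt[0] = x;
}

/* Runs the single instruction "load, add, store" once per cycle. */
void add3_on_belt(int *a, int n) {
    const Operand none = { 1, 0 };
    Operand belt[BELT_LEN];
    assert(n >= 0);
    for (int k = 0; k < BELT_LEN; ++k) belt[k] = none;  /* fill with Nones */

    for (int t = 0; t < n + 2; ++t) {
        /* All ops read their operands before anything drops. */
        Operand add_in = belt[0];    /* load result dropped last cycle */
        Operand store_in = belt[1];  /* add result dropped last cycle  */

        /* store of iteration t-2: a None operand is silently discarded */
        if (!store_in.is_none) a[t - 2] = store_in.value;

        /* add of iteration t-1: speculable, so a None passes through */
        Operand sum = add_in;
        if (!sum.is_none) sum.value += 3;

        /* load of iteration t; past the end it is modeled as a None */
        Operand loaded = none;
        if (t < n) { loaded.is_none = 0; loaded.value = a[t]; }

        drop(belt, sum);     /* ends up at position 1 */
        drop(belt, loaded);  /* ends up at position 0 */
    }
}
```

The loop body contains no special case for the first two cycles; the Nones on the belt make the early add and store harmless.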

Pipelining in-flight values
Source: for (int i = 0; i < N; ++i) A[i] = A[i]*3;
All example operations so far executed in one cycle. What about operations that take longer? Assume multiply takes three cycles.
Unpipelined: load0 mul0 noop0 noop0 store0. The store must wait for the mul result.
Will the previous code work?
First instruction (load0 mul-1 store-4): so far so good.
Second instruction (load1 mul0 store-3): still OK.
Third instruction (load2 mul1 store-2): the store finds load0 where it expects a mul result, and load0 goes to memory. OOPS!

The retire operation
Source: for (int i = 0; i < N; ++i) A[i] = A[i]*3;
The retire op forces the drop count of the cycle in which it executes, by dropping Nones if necessary. This example drops two results per cycle in steady state, so the instruction contains retire(2).
Instruction sequence:
  load0 mul-1 store-4 retire(2)
  load1 mul0  store-3 retire(2)
  load2 mul1  store-2 retire(2)
  load3 mul2  store-1 retire(2)
  load4 mul3  store0  retire(2)
One trip through the loop to reach steady state; with store0 the first real result goes to memory. Thereafter one iteration per cycle.
ILP is limited only by hardware compute capacity.
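The effect of retire(2) can be checked with a small simulation. This is a sketch, not the Mill's actual mechanism: the belt length, the drop order (mul result first, then load result), the two-stage array that models a three-cycle multiplier, the drain cycles at the end, and all names are assumptions. The use_retire flag switches the padding with Nones on and off.

```c
#include <assert.h>
#include <stdbool.h>

#define BELT_LEN 8  /* assumed length */

typedef enum { NONE, LOADED, PRODUCT } Kind;
typedef struct { Kind kind; int value; } Operand;
typedef struct { bool busy; Operand result; } Stage;

static void drop(Operand belt[BELT_LEN], Operand x) {
    for (int k = BELT_LEN - 1; k > 0; --k) belt[k] = belt[k - 1];
    belt[0] = x;
}

/* Runs "load, mul, store" once per cycle over a[0..n-1]. Returns how many
   stores were handed a raw load result instead of a product. */
int mul3_on_belt(int *a, int n, bool use_retire) {
    const Operand none = { NONE, 0 };
    Operand belt[BELT_LEN];
    Stage pipe[2];               /* muls issued one and two cycles ago */
    int wrong_stores = 0;
    for (int k = 0; k < BELT_LEN; ++k) belt[k] = none;
    for (int k = 0; k < 2; ++k) { pipe[k].busy = false; pipe[k].result = none; }

    for (int t = 0; t < n + 4; ++t) {
        Operand mul_in = belt[0];    /* where the previous load is expected */
        Operand store_in = belt[1];  /* where the retired mul is expected   */

        if (store_in.kind == PRODUCT) {
            assert(t - 4 >= 0 && t - 4 < n);
            a[t - 4] = store_in.value;           /* store of iteration t-4 */
        } else if (store_in.kind == LOADED) {
            ++wrong_stores;                      /* the OOPS case */
        }                                        /* a None is discarded */

        Stage retiring = pipe[1];    /* issued two cycles ago, finishes now */
        pipe[1] = pipe[0];
        pipe[0].busy = true;
        pipe[0].result = none;       /* mul of a None stays a None */
        if (mul_in.kind != NONE) {
            pipe[0].result.kind = PRODUCT;
            pipe[0].result.value = mul_in.value * 3;
        }

        Operand loaded = none;       /* loads past the end modeled as None */
        if (t < n) { loaded.kind = LOADED; loaded.value = a[t]; }

        /* retire(2): pad with a None when the multiplier has nothing yet */
        if (retiring.busy) drop(belt, retiring.result);
        else if (use_retire) drop(belt, none);
        drop(belt, loaded);
    }
    return wrong_stores;
}
```

In this model, running without the padding leaves the belt one position short during the first cycles, so one early store is handed load0; with the padding every store sees either a None or the product it expects.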

Use it or lose it
The Belt has a fixed size, suitable for common usage. If there are too many loop-carried variables, they will fall off the end before they can be used. Excess belt data that will be needed later can be spilled to the scratchpad, a special hardware buffer.

Review - the scratchpad
Frame local: each function has a new scratchpad
Fixed max size, must explicitly allocate
Static byte addressing, must be aligned
Three-cycle spill-to-fill latency
[Figure: a belt value is spilled to the scratchpad and later filled back onto the belt.]

The rotator
Each scratchpad allocation makes a new portion of scratchpad, between a base and a fence, available to spill and fill: scratchf(…) … scratchf(…)
Each allocation has a corresponding rotator; nested allocations give an outer rotator and an inner rotator.
A rotator is a circular address remapper. Scratchpad addresses in spill and fill are biased by the cursor, with wrap-around, before conversion to physical addresses.
rotate(20);
The rotate operation advances the cursor, with wrap-around. This gives belt-like renaming to the rotator’s scratchpad.
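The remapping can be sketched as modular arithmetic. Everything below is an assumption made for illustration: the struct layout, the function names, the byte units, and in particular the direction of the bias (the cursor is added to the address here; the talk does not say which way it goes).

```c
#include <assert.h>

typedef struct {
    unsigned base;    /* physical start of this allocation */
    unsigned size;    /* bytes between base and fence       */
    unsigned cursor;  /* current rotation, always < size    */
} Rotator;

/* Bias a spill/fill address by the cursor, wrap, then make it physical. */
unsigned rotator_map(const Rotator *r, unsigned addr) {
    assert(r->size > 0 && addr < r->size);
    return r->base + (addr + r->cursor) % r->size;
}

/* rotate(n): advance the cursor, with wrap-around. */
void rotator_rotate(Rotator *r, unsigned n) {
    assert(r->size > 0);
    r->cursor = (r->cursor + n) % r->size;
}
```

After a rotate, the same static address in a spill or fill names a different physical location, which is how one loop-body instruction can address a fresh slot on every iteration.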

The inner operation
The Mill treats loops as if they were functions. The inner operation starts a new, nested loop.
inner takes arguments just like call: inner(b5, b3); starts the loop with a new belt that holds only two values copied from the outer environment. Typically the arguments are initial values for control variables.
inner does not change the stack frame or protection.
Like call, operations in flight at inner are completed after loop exit.

The leave operation
The leave operation exits the innermost loop.
leave takes arguments just like return: leave(b4); restores the belt to its state when the corresponding inner was executed, and drops the leave arguments at the front. leave arguments are used for searches.
leave discards any computation in flight in the loop. Most pipelines can be broken off by leave, with no epilogue.
leave discards the innermost rotator if one was allocated in the loop.

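Loop entrance
[Figure: inner head,b1,b5,b3,b3 enters the loop at head with a new inner belt initialized from the named outer-belt positions; leave b4 returns to the outer belt and drops the value from inner-belt position b4 at its front.]
A loop has the same belt effects as an op like add. A loop can drop multiple results.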

Belt save/restore
The Spiller is a background save/restore engine.
Values are marked with the owning frame.
Belt access is to the values of the current frame.
Change the current frame id, and the belt is empty!
Data is still there, and can be spilled at leisure.
Arguments are passed by copy and get the new frame id.

In-flight over a loop
[Figure: a multi-cycle muls issued in the outer loop is still in flight when inner starts.]
Should its result drop in the middle of the inner loop? NO!
Loops are atomic: as with a call, the whole inner loop behaves like one operation. In-flights retire after the loop exits.

inner/leave at 30,000 feet
Source:
  for (int i = 0; i < N; ++i) {
      int x = A[i];
      for (int j = 0; j < M; ++j) {
          if (S[j].f == x) {
              B[i] = S[j].g;
              break;
          }
      }
  }
Mill operations shown on the slide: load <A[i]>, inner
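Since the Mill treats a loop like a function, the nested search can be pictured in C with the inner loop pulled out into a helper whose arguments play the role of the inner operands and whose result plays the role of the leave argument. This is an analogy in ordinary C, not Mill code; the Entry type and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { int f; int g; } Entry;

/* The inner loop as a function: x comes in like an inner argument,
   and the found value goes out like a leave argument. */
static bool find_g(const Entry *s, int m, int x, int *g_out) {
    for (int j = 0; j < m; ++j) {
        if (s[j].f == x) {
            *g_out = s[j].g;   /* early exit, like leave from a search */
            return true;
        }
    }
    return false;
}

/* The outer loop: one search per element of a; misses leave b[i] untouched. */
void lookup_all(const int *a, int *b, int n, const Entry *s, int m) {
    assert(n >= 0 && m >= 0);
    for (int i = 0; i < n; ++i) {
        int g;
        if (find_g(s, m, a[i], &g)) b[i] = g;
    }
}
```

The early return needs no cleanup code after it, which mirrors the point that a pipeline broken off by leave needs no epilogue.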

Summary #1
The Mill:
  Can pipeline essentially any loop
  Embedded calls, control flow no problem
  No prologue needed
  Steady-state loop body works as prologue
  Treats loop like a function
  Permits pipelining nested loops at all levels

Summary #2
The Mill:
  Unlimited loop distance
  Near data on belt, far data on scratchpad
  Unlimited loop-carried variables
  No unrolling, no copies
  Loop exit automatically cleans up
  In-flight computation is discarded

Summary #3
The Mill:
  inner/leave operations support modularity
  Loops get arguments, return results

Shameless plug
For technical info about the Mill CPU architecture: millcomputing.com/docs
For future announcements, white papers etc.: millcomputing.com/mailing-list
For investor information: millcomputing.com/investor-list
