
1. Drinking from the Firehose: Compilation for a Belt Architecture
One of a series. 10 June 2015. Mill Computing, Inc. Patents pending.

2. Talks in this series
1. Encoding
2. The Belt
3. Memory
4. Prediction
5. Metadata
6. Execution
7. Security
8. Specification
9. Pipelining
10. Compiling (you are here)
11. …
Slides and videos of the other talks are at: MillComputing.com/docs

3. Caution! Gross over-simplification!
This talk tries to convey an intuitive understanding to the non-specialist. The reality is more complicated. (We try not to over-simplify, but sometimes…)

4. Specification
The Mill is a family of member CPUs (Tin, Copper, Silver, Gold, …) sharing an abstract operation set and micro-architecture. Members differ in concrete operation set and micro-architecture. The architecture is specification driven: a designer describes a concrete member by writing a specification.

5. Specification
Toolchain software (compiler, asm, debugger, HWgen, sim) automatically creates system software, verification tests, documentation, and a hardware framework for the new member from the specification. The tools are data driven by the member specification.

6. Late binding to family member
Mill compiles to the abstract target, the universal superset, and specializes to the concrete target, the executing family member. The toolchain runs clang and the LLVM middle and back ends to produce genForm and genAsm for the abstract target; the specializer (run at pre-link or post-link time) turns these into conForm and conAsm for the concrete CPU target. This talk is mostly about the specializer.

7. Specializer inputs: member specification
Micro-architecture attributes: functional unit population, supported data sizes, resource constraints.
Operation attributes (1000+ operations): op latency (issue to retire), argument/result count and size, bit encoding. Example latencies: +: 1, *: 3, -: 1, &: 1, retn: 0.
The specification is a large static data structure, dynamically linked, mechanically generated from a roughly two-page spec.
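As a sketch, the operation-attribute side of the specification can be pictured as a table keyed by op name. The OpSpec and MEMBER_SPEC names are illustrative, not Mill Computing's actual data structures; the latencies are the slide's example values.

```python
# Hypothetical sketch of the specializer's member-specification input.
from dataclasses import dataclass

@dataclass(frozen=True)
class OpSpec:
    latency: int   # issue-to-retire latency in cycles
    args: int      # argument count
    results: int   # result count

# Per-member operation attributes, using the slide's example latencies.
MEMBER_SPEC = {
    "add":  OpSpec(latency=1, args=2, results=1),
    "mul":  OpSpec(latency=3, args=2, results=1),
    "sub":  OpSpec(latency=1, args=2, results=1),
    "and":  OpSpec(latency=1, args=2, results=1),
    "retn": OpSpec(latency=0, args=1, results=0),
}

def latency_of(op: str) -> int:
    """Look up an op's issue-to-retire latency in the member spec."""
    return MEMBER_SPEC[op].latency
```

The real table covers 1000+ operations and is generated mechanically, but the lookup shape is the same.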

8. Specializer inputs: code
int foo(int a, int b, int c, int d) {
  return (a - (b + c)) & ((b + c) * d);
}
Static Single Assignment dataflow:
define i32 @foo(i32 %a, i32 %b, i32 %c, i32 %d) {
entry:
  %1 = add i32 %b, %c
  %2 = sub i32 %a, %1
  %3 = mul i32 %1, %d
  %4 = and i32 %2, %3
  ret i32 %4
}
The function arguments a, b, c, and d feed a dataflow graph of +, -, *, &, and retn.

9. Substitution pass
Goal: replace unsupported ops with emulation code. Only a subset of operations exists in hardware; few members have native decimal or quad arithmetic. Walk the graph; for each op, check the spec for support; replace each unsupported op with an inline function, which may in turn call out-of-line code.
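A minimal sketch of the substitution pass, assuming a flat list-of-ops encoding of the graph; the SUPPORTED set and the __emul_ naming convention are illustrative, not the specializer's actual representation.

```python
# Ops the member implements in hardware, as read from the member spec.
SUPPORTED = {"add", "mul", "sub", "and", "retn"}

def substitute(ops):
    """ops: list of (opname, args) tuples.
    Each unsupported op is replaced by a call to emulation code;
    the inline emulation may itself call out-of-line code."""
    out = []
    for name, args in ops:
        if name in SUPPORTED:
            out.append((name, args))
        else:
            # Hypothetical emulation entry point for the missing op.
            out.append(("call", ("__emul_" + name,) + tuple(args)))
    return out
```

For example, a decimal or quad op absent from the member spec would come out as a call, while native ops pass through unchanged.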

10. Wide issue
The Mill is wide-issue, like a VLIW or EPIC. An instruction contains slots (#0, #1, #2, …), and each slot corresponds to a function pipeline with its own functional units (multiplier, shifter, adder, …). Decode routes the ops in an instruction to the matching pipes.

11. Exposed pipeline
Every operation has a fixed latency. Example: a+b - c*d, computed by an add, a mul, and a sub.

12. Exposed pipeline
Every operation has a fixed latency. In a+b - c*d, the add result a+b retires before the mul result c*d is ready. Who holds it in the meantime?

13. Exposed pipeline
Every operation has a fixed latency. Code is best when producers feed directly to consumers.

14. Latency pass
Goal: compute minimal dataflow latency as if the hardware had infinite functional-unit resources. Walk the graph; look up each op's latency in the spec; mark each op's issue cycle as the maximum retire cycle of its arguments, and each result's retire cycle as issue plus op latency. Giving schedule priority to longer latencies reduces overall schedule latency, so execution is faster. In the example (+: 1, *: 3, -: 1, &: 1, retn: 0), the function arguments retire at cycle 0; + issues at 0 and retires at 1; * and - issue at 1 and retire at 4 and 2; & issues at 4 and retires at 5; retn issues at 5.
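The latency pass described above can be sketched as a single walk over the dataflow graph in dependency order. The graph encoding and the t1..t4 value names are illustrative; the latencies are the slide's example values.

```python
LATENCY = {"add": 1, "mul": 3, "sub": 1, "and": 1, "retn": 0}

# foo(a,b,c,d) = (a - (b+c)) & ((b+c) * d), listed in dependency order:
# op -> (kind, producer ops it consumes); function args retire at cycle 0.
GRAPH = {
    "t1": ("add", []),            # b + c
    "t2": ("sub", ["t1"]),        # a - t1
    "t3": ("mul", ["t1"]),        # t1 * d
    "t4": ("and", ["t2", "t3"]),  # t2 & t3
    "r":  ("retn", ["t4"]),
}

def latency_pass(graph):
    """Mark each op with its issue cycle (max argument retire cycle)
    and each result with its retire cycle (issue + op latency),
    assuming infinite functional units."""
    issue, retire = {}, {}
    for op, (kind, deps) in graph.items():  # producers precede consumers
        issue[op] = max((retire[d] for d in deps), default=0)
        retire[op] = issue[op] + LATENCY[kind]
    return issue, retire
```

Running it reproduces the cycles quoted above: the mul issues at 1 and retires at 4, the and issues at 4 and retires at 5, and the return issues at 5.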

15. Dependency count pass
Goal: count outstanding dependencies. The scheduler needs to know how many consumers must be placed before a producer op can be placed. Mark each op with its number of consumers, and enter the ops with no consumers on a worklist.
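A sketch of the dependency-count pass under the same illustrative graph encoding: count how many ops consume each producer, then seed the worklist with the consumer-free ops (in this example, only the return).

```python
# op -> producer ops it consumes (illustrative encoding of the example).
GRAPH = {
    "t1": [],             # b + c
    "t2": ["t1"],         # a - t1
    "t3": ["t1"],         # t1 * d
    "t4": ["t2", "t3"],   # t2 & t3
    "r":  ["t4"],         # retn
}

def count_consumers(graph):
    """Mark each op with its number of consumers and seed the
    worklist with ops that have no consumers."""
    consumers = {op: 0 for op in graph}
    for deps in graph.values():
        for d in deps:
            consumers[d] += 1
    worklist = [op for op, n in consumers.items() if n == 0]
    return consumers, worklist
```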

16. Schedule pass
Goal: schedule producers so their results retire just before their consumers want them.
Take the latest-retiring op from the worklist.
Schedule it ahead of its consumers.
Decrement the consumer count of the producers of its arguments.
If a producer's consumer count reaches zero, enter it on the worklist.
First step: retn (the only op with no consumers) is scheduled, and its producer & joins the worklist.

17-21. Schedule pass (continued)
The loop repeats: & is scheduled next, releasing its producers - and * onto the worklist; * (the longest latency, so the latest retiring) is scheduled, then -, then +, and finally the function arguments. The finished schedule, latest first, is: retn, &, *, -, +, function args.
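The whole schedule-pass loop can be sketched as backward list scheduling over the example graph. The op names and the retire cycles (taken from the latency-pass example) are illustrative.

```python
# Retire cycles from the latency pass for the example graph.
RETIRE = {"t1": 1, "t2": 2, "t3": 4, "t4": 5, "r": 5}
# op -> producer ops it consumes.
DEPS = {"t1": [], "t2": ["t1"], "t3": ["t1"], "t4": ["t2", "t3"], "r": ["t4"]}

def schedule(deps, retire):
    """Repeatedly take the latest-retiring ready op from the worklist,
    schedule it ahead of its consumers, and release a producer onto the
    worklist once all of its consumers are placed."""
    consumers = {op: 0 for op in deps}
    for ds in deps.values():
        for d in ds:
            consumers[d] += 1
    worklist = [op for op, n in consumers.items() if n == 0]
    order = []
    while worklist:
        worklist.sort(key=lambda op: retire[op])
        op = worklist.pop()          # latest-retiring ready op
        order.append(op)
        for d in deps[op]:
            consumers[d] -= 1
            if consumers[d] == 0:
                worklist.append(d)
    return order
```

On the example this yields the slide's order: retn first, then &, *, -, + and (implicitly) the function arguments.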

22. Placement pass
Goal: place ops in instructions using the member's limited functional units. The tableau has one row per cycle (0-6) and one column per functional unit (branch, load, ALU, mult). Ops are taken from the schedule (retn, &, *, -, +, function args) and dropped into slots.

23-27. Placement pass (continued)
Working through the schedule: retn lands on the branch unit, & on the ALU, * on the multiplier, then - and + on the ALU, and finally the function arguments arrive at cycle 0. Each op keeps its scheduled cycle here, since this small example has no functional-unit conflicts; with conflicts, ops are shifted to other cycles.
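A sketch of the placement pass under simplifying assumptions: one slot per functional unit per cycle, and a conflict is resolved by sliding the op to an earlier cycle. The FU mapping and tableau shape are illustrative, not any member's real population.

```python
# Which functional unit handles each op kind (illustrative).
FU_OF = {"add": "alu", "sub": "alu", "and": "alu",
         "mul": "mult", "retn": "branch"}

def place(schedule, want_cycle, n_cycles=7):
    """schedule: (op, kind) pairs in schedule-pass order.
    want_cycle: the cycle each op would like to issue in.
    Returns tableau[cycle][fu] -> op, one slot per FU per cycle."""
    tableau = [dict() for _ in range(n_cycles)]
    for op, kind in schedule:
        fu = FU_OF[kind]
        c = want_cycle[op]
        while fu in tableau[c]:   # slot taken: try an earlier cycle
            c -= 1
        tableau[c][fu] = op
    return tableau
```

On the example, retn goes to the branch unit at cycle 5, & to the ALU at 4, * to the multiplier at 1, - to the ALU at 1, and + to the ALU at 0.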

28. Symex pass
After the instructions have been populated and issue and retire cycles determined, producer results must still be passed to consumer arguments. On a general-register machine they would be passed in registers. The Mill doesn't have registers; it has its own way to pass data between functional units.

29. We call it the Belt
Like a conveyor belt: a fixed-length FIFO. Functional units can read any position.

30. We call it the Belt
Like a conveyor belt: a fixed-length FIFO. New results drop on the front, pushing the last value off the end.

31. Multiple reads
Functional units can read any mix of belt positions.

32. Multiple drops
All results retiring in a cycle drop together.

33. Belt addressing
Belt operands are addressed by relative position: in add b3, b5, "b3" is the fourth most recent value to drop to the belt and "b5" is the sixth most recent. This is temporal addressing. There is no result address!

34. Temporal addressing
The temporal address of a datum changes with more drops: after three more results drop, the value that was b3 is addressed as b6.
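The belt's drop-at-front, age-toward-the-end behavior can be sketched as a bounded deque; the eight-position length is just an example member size.

```python
from collections import deque

class Belt:
    """Fixed-length FIFO with temporal addressing: new results drop at
    the front (b0), every older value's position grows by one per drop,
    and the oldest value falls off the end."""
    def __init__(self, length=8):
        self.slots = deque(maxlen=length)
    def drop(self, value):
        self.slots.appendleft(value)   # new result becomes b0
    def read(self, pos):
        return self.slots[pos]         # e.g. b3 = fourth most recent drop

belt = Belt()
for v in [10, 20, 30, 40]:
    belt.drop(v)
# 40 is now b0 (most recent); 10 has aged to b3.
```

After one more drop, the same values shift: 40 becomes b1 and 10 becomes b4, which is exactly the temporal-address change the slide describes.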

35. Symex pass
The issue schedule and op latencies give the retire order, and retire order is belt order. Conceptually the results drop onto an infinite belt: + retires at cycle 1, - at 2, * at 4, and & at 5.

36-41. Symex pass (continued)
Walking the placed ops in issue order, each argument is rewritten as a belt position: with the function arguments a, b, c, d on the belt, the add of b and c becomes add b2 b1; its result drops, so the mul becomes mul b0 b1 and the sub becomes sub b4 b0; after the sub and mul results drop, the and becomes and b1 b0, and the return consumes b0.

42. Symex pass
The resulting code:
add b2 b1
mul b0 b1
nop
sub b4 b0
and b1 b0
retn b0
But what if a value has aged beyond the end of the belt, and there isn't a b23?
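A sketch of how the symex pass could derive those belt numbers: results drop to an (ideally infinite) belt in retire order, and each argument is rewritten as its distance from the front of the belt when its consumer issues. The issue/retire cycles follow the latency-pass example; the encoding is illustrative, not the specializer's actual representation.

```python
def symex(ops, fargs):
    """ops: (name, result, srcs, issue_cycle, retire_cycle) in issue order.
    fargs: function arguments, which drop to the belt first, in order.
    Returns the ops with args rewritten as bN belt positions."""
    belt = []                 # index 0 is the front of the belt (b0)
    for a in fargs:
        belt.insert(0, a)     # last argument dropped sits at b0
    pending = []              # (retire_cycle, seq, result) not yet dropped
    seq = 0
    out = []
    for name, result, srcs, issue, retire in ops:
        # Drop every result that has retired by this op's issue cycle.
        pending.sort()
        while pending and pending[0][0] <= issue:
            belt.insert(0, pending.pop(0)[2])
        out.append((name, ["b%d" % belt.index(s) for s in srcs]))
        if result is not None:
            pending.append((retire, seq, result))
            seq += 1
    return out

# The example function's placed ops, with latency-pass issue/retire cycles.
OPS = [
    ("add",  "t1", ["b", "c"],   0, 1),
    ("mul",  "t3", ["t1", "d"],  1, 4),
    ("sub",  "t2", ["a", "t1"],  1, 2),
    ("and",  "t4", ["t2", "t3"], 4, 5),
    ("retn", None, ["t4"],       5, 5),
]
```

Running symex(OPS, ["a", "b", "c", "d"]) reproduces the slide's numbering: add b2 b1, mul b0 b1, sub b4 b0, and b1 b0, retn b0.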

43. Use it or lose it
The compiler schedules producers near their consumers, so nearly all single-use values are consumed while still on the belt. The belt is single-assignment: no hazards and no renames, so some 300 rename registers become 8, 16, or 32 belt positions. But long-lived values must be saved.

44. The scratchpad
Long-lived values are spilled from the belt to the scratchpad and filled back when needed. The scratchpad is frame-local (each function gets a new one), has a fixed maximum size that must be explicitly allocated, uses static byte addressing with aligned accesses, and has a three-cycle spill-to-fill latency.

45. Symex pass
Insert spill and fill ops for any value that would fall off the belt: a spill after the producer's result drops, and a fill just before the consumer; then reschedule.
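The spill decision can be sketched as a distance check: a value needs the scratchpad when enough later drops have pushed it past the last belt position. The encoding and the eight-position belt length are illustrative.

```python
BELT_LEN = 8   # example member belt length

def spill_candidates(drop_seq, uses):
    """drop_seq: values in the order they drop to the belt.
    uses: (value, drops_so_far) pairs, where drops_so_far is how many
    drops have happened by the time the consumer issues.
    Returns the set of values that must go through the scratchpad
    (spill after the producer, fill before the consumer)."""
    dropped_at = {v: i for i, v in enumerate(drop_seq)}
    spilled = set()
    for value, drops_so_far in uses:
        position = drops_so_far - dropped_at[value] - 1   # current bN
        if position >= BELT_LEN:
            spilled.add(value)
    return spilled
```

The inserted spill/fill ops are themselves scheduled, which is why the pass may have to iterate, as the next slide explains.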

46. Symex pass
Added spill/fill ops may change the schedule, so other results may then need spill/fill too: add more spills and fills, and reschedule again. The iteration is guaranteed to stop with a feasible schedule, because at the iteration limit every producer is spilled and every consumer has a fill, which is always feasible. In practice, most functions need no spills at all, and more than one reschedule is very rare.

47. The load problem
You write: load; add; shift; store. You get: load; stall; add; shift; store, because the machine waits for memory before the load's consumer can issue. Every architecture must deal with this problem.

48. Every CPU's goal: hide memory latency
General strategy:
Issue loads as early as possible, as soon as the address is known, or even earlier (prefetch).
Find something else to do while waiting for the data: the hardware approach is dynamic scheduling (the Tomasulo algorithm on the IBM 360/91); the software approach is static scheduling (exposed pipelines, delay slots).
Ignore program order: issue operations as soon as their data is ready.

49. Mill "deferred loads"
The generic Mill load operation is load(address, width, delay):
address: 64-bit base, offset, and optional scaled index
width: scalar of 1/2/4/8/16 bytes, or a vector of the same
delay: number of issue cycles before retire
With load(…, …, 4), the load issues at its instruction, retire is deferred for four instructions, and the data is available to the consumer when the load retires.

50. Mill "deferred loads"
int foo(int a, int b, int* p) {
  return a*b + *p;
}
Scheduled naively (assuming a nominal load latency of one), the load's consumer stalls waiting for memory.

51. Mill "deferred loads"
Hoisting the load so it issues as early as possible fills the waiting cycles with useful work (here, the multiply).

52. Mill "deferred loads"
The load is split into an "issue" op and a "retire": issue starts the memory access, retire delivers the data. What is the latency of "issue"?

53. Mill "deferred loads"
Is it maxLatency? Scheduling every "issue" at maximum latency pushes the load needlessly early and can still leave a stall.

54. Mill "deferred loads"
What we want is the needed latency: the highest non-load cycle minus the retire cycle.

55. Mill "deferred loads"
The algorithm:
Temporarily assign every "issue" a latency of maxLatency.
Perform the latency pass normally.
Schedule all ops except the "issue" ops normally.

56. Mill "deferred loads"
With maxLatency = 8, the latency pass and the schedule of the non-issue ops proceed as usual; the tableau is filled with every op except the load's "issue".

57. Mill "deferred loads"
When scheduling an "issue", adjust its latency to: the cycle of the highest placed op, minus the cycle of the corresponding "retire", minus the predicted cycle of the "issue"; or to one, whichever is larger.
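That adjustment rule can be written directly as a clamped subtraction; the function name and argument names are illustrative, not the specializer's real interface.

```python
def issue_latency(highest_placed_cycle, retire_cycle, predicted_issue_cycle):
    """Deferred-load delay finally encoded in the load's "issue" op:
    the cycle of the highest placed op, minus the cycle of the
    corresponding "retire", minus the predicted cycle of the "issue",
    clamped to at least one."""
    return max(1, highest_placed_cycle - retire_cycle - predicted_issue_cycle)
```

The clamp to one matters when the load's result is needed almost immediately: the deferral can never be shorter than a single issue cycle.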

58. Want more?
Sign up for technical announcements, white papers, etc.: MillComputing.com/mailing-list

