Drinking from the Firehose


Drinking from the Firehose – number nine of a series
Everything in its due time – software pipelining in the Mill™ CPU Architecture

Talks in this series: Encoding, The Belt, Memory, Prediction, Metadata and speculation, Execution, Security, Specification, Software pipelining (you are here), …
Slides and videos of other talks are at: http://millcomputing.com/docs

The Mill CPU
The Mill is a new general-purpose commercial CPU family. It has a 10x single-thread power/performance gain over conventional out-of-order superscalar architectures, yet runs the same programs, without rewrite. This talk explains how the Mill:
pipelines without prologues and epilogues
pipelines loop-carried data
pipelines mixed-latency operations
pipelines outer loops
tail-recursion induction

Caution! Gross over-simplification!
This talk tries to convey an intuitive understanding to the non-specialist. The reality is more complicated. (We try not to over-simplify, but sometimes…)

A review: What’s a software pipeline?

An example
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
Loop body: load, add, store – time per iteration: 3 cycles (assuming all ops are one cycle, and ignoring the control-variable update and test)

An example
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
(Diagram: each iteration loads from, adds to, and stores back to memory in turn: load0 add0 store0, load1 add1 store1, load2 add2 store2.)

An example
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
Over time: load0 add0 store0, load1 add1 store1, load2 add2 store2
Subscripts indicate the iteration number of the operation.

What’s under the hood?
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
While each operation executes, the other functional units sit idle: the load unit, the adder, and the store unit are each busy only one cycle in three.

Run the units in parallel, every cycle
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
Machine cycles: load0 / add0 load1 / store0 add1 load2 / store1 add2 load3 / …
Requires wide issue: superscalar, VLIW, Mill

Run the units in parallel, every cycle
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
One iteration – load, add, store – is spread over three cycles.

An example
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
In steady state, each cycle executes one third of each of three iterations: loadN, addN-1, storeN-2. This is the steady state of the pipeline. Time per iteration: one cycle.
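
The cycle arithmetic above can be sketched directly. This is a toy model of our own (not Mill code): with S one-cycle stages, a pipelined loop pays S-1 fill cycles and then one cycle per iteration.

```c
#include <assert.h>

/* Toy model (not Mill code): cycle counts for a loop whose body
   splits into n_stages one-cycle stages (load, add, store => 3). */
static int unpipelined_cycles(int n_iters, int n_stages) {
    return n_iters * n_stages;            /* one iteration at a time */
}

static int pipelined_cycles(int n_iters, int n_stages) {
    if (n_iters == 0) return 0;
    return (n_stages - 1) + n_iters;      /* fill, then 1 cycle/iter */
}
```

For 100 iterations of the three-stage body, the unpipelined loop takes 300 cycles; the pipelined one takes 102, approaching one cycle per iteration.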

But – what happens to the data?
Data produced in one iteration must be passed to the consuming operation of the same iteration – not to the consuming operation of a different iteration. On a conventional machine, data is passed from operation to operation in general registers.

An example
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
(Diagram: each load result passes through reg 1 to its add; each add result passes through reg 2 to its store.)

Loop-carried variables
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
Change the loop to use a single value in several iterations: A[i+1] is a loop-carried variable. The number of iterations that a value must be carried over is called the distance of the carried variable. The largest carried distance is the loop distance.
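
As a concrete scalar illustration of a distance-1 carry: the A[i+1] loaded in one iteration is exactly the A[i] of the next, so it can be kept in a variable instead of being reloaded. The function name and the 0-based bounds below are ours, chosen to keep the sketch in-bounds:

```c
#include <assert.h>

/* a[i] += a[i+1] for i in [0, n-2], loading each element once:
   the value loaded as a[i+1] is carried (distance 1) into the
   next iteration, where it serves as the old a[i]. */
static void add_next(int *a, int n) {
    if (n < 2) return;
    int carried = a[0];
    for (int i = 0; i + 1 < n; ++i) {
        int next = a[i + 1];   /* the only load in the iteration */
        a[i] = carried + next;
        carried = next;        /* the loop-carried variable */
    }
}
```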

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
Does the pipelined code still work? Where is the second argument to add?

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
Try doing two loads to start the loop? Still – where is the second argument to add?

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
Add another register? That works for the first iterations, BUT…

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
(Diagram: successive loads must land in different registers – this load went to reg 1, the next went to reg 2.)

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
But the loads are the SAME operation – how can they have different result registers? Only by duplicating the code: you must unroll the loop distance times!

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
So use two instructions – one loading to one register and one to the other.

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
These loads (load0, load2, load4) go to reg 1…

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
And the others (load1, load3, load5) go to reg 2… in effect, the loop is unrolled 2X.

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
So this is the steady state… two instructions, not one.

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+10];
Now change the loop distance… and you have to unroll 10X.

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
Or – you can use a copy operation to carry the value.

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
(Diagram: each loadN drops into reg 1, copyN carries it to reg 2, and addN reads both; the stores follow.)

An example
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+A[i+1];
One-instruction steady state.
DRAWBACK: One copy operation, and one register, per loop-carried variable times distance.

The Belt: a review
The Mill has no general registers. All operation results are entered into a FIFO, the Belt.

The Belt
Like a conveyor belt – a fixed-length FIFO. Functional units can read any position.

We call it the Belt
Like a conveyor belt – a fixed-length FIFO. New results drop on the front, pushing the last off the end. Functional units can read any position.

Multiple reads
Functional units can read any mix of belt positions.

Multiple drops
All results retiring in a cycle drop together.
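
The belt can be modeled as a fixed-length FIFO in a few lines. This toy model (names are ours) shows the two properties the slides rely on: new results drop at the front, and any position can be read:

```c
#include <assert.h>
#include <string.h>

#define BELT_LEN 8

/* Toy belt model: slot[0] is the front (newest result). */
typedef struct { int slot[BELT_LEN]; } Belt;

/* Drop a new result on the front; the oldest falls off the end. */
static void belt_drop(Belt *b, int value) {
    memmove(&b->slot[1], &b->slot[0], (BELT_LEN - 1) * sizeof(int));
    b->slot[0] = value;
}

/* Functional units can read any position. */
static int belt_read(const Belt *b, int pos) {
    return b->slot[pos];
}
```

A real Mill belt also renames positions logically rather than moving data, but the programmer-visible behavior matches this FIFO picture.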

The simple example – using the Belt
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
First cycle: load0 drops onto the Belt.

The simple example – using the Belt
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
Second cycle: load1 and add0 drop; load0 moves down the Belt.

The simple example – using the Belt
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
Third cycle: store0 takes add0 from the Belt; load2 and add1 drop.

The simple example – using the Belt
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
Fourth cycle: store1, add2 and load3 – the pattern repeats.

The simple example – using the Belt
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
This is the steady state: each instruction issues a load, an add and a store, and the Belt carries the intermediate results.

The loop-carried example, on the Belt
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+A[i+1];
First cycle: load0 drops onto the Belt.

The loop-carried example, on the Belt
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+A[i+1];
Second cycle: load1 drops; load0 moves down.

The loop-carried example, on the Belt
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+A[i+1];
Third cycle: load2 drops, and add0 reads load0 and load1 from the Belt.

The loop-carried example, on the Belt
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+A[i+1];
Fourth cycle: store0 consumes add0; load3 and add1 drop.

The loop-carried example, on the Belt
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+A[i+1];
This is the steady state. No unrolling, no copies, one instruction – both arguments of each add are simply still on the Belt.

Prologues
Source: for (int i = 1; i < N; ++i) A[i] = A[i]+3
Prologue: load0; add0 load1; add1 load2 – then the steady state: store0 …
Prologue instructions: unpipelined latency + loop distance - 1
Prologue operations: (N*(N-1))/2, for N steady-state ops
With three ops: 2 instructions, 3 ops. With 6 ops: 6 instructions, 15 ops. With 20 ops: 20 instructions, 190 ops.
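
The quadratic growth of conventional prologue code quoted above is easy to check against the slide's own data points. A one-function sketch:

```c
#include <assert.h>

/* Prologue operations for a conventional pipelined loop with n
   steady-state ops, per the slide's formula: n*(n-1)/2. */
static int prologue_ops(int n) {
    return n * (n - 1) / 2;
}
```

This reproduces the slide's figures: 3 ops for a 3-op body, 15 for 6, 190 for 20 – which is why prologue code dominates for wide loops.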

The retire operation
retire supplies values that the loop hasn’t calculated yet. retire(4); tells the hardware that four result operands are supposed to retire to the belt in the current cycle. If fewer will retire, retire invents the missing results. Metadata marks invented operands as None.

NaR bits
Every data element has a NaR (Not A Result) bit in the element metadata. The bit is set whenever a detected error precludes producing a valid value. The payload of a NaR records the error kind and the location of the failing operation, so a debugger can display the fault detection point.

An error – or just missing data?
A None is a kind of NaR that identifies a missing value. Most operations are speculable – they have no side effects. NaRs and Nones just pass through them unchanged.

An error – or just missing data?
A non-speculable operation has side effects; a store to memory is the most common example. Normal data, Nones and NaRs differ in their response to non-speculable operations: a NaR faults, a None is discarded (nothing happens), and normal data takes effect.
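
The three-way behavior can be captured in a small model. This is our sketch of the semantics just described, not the hardware encoding; the enum and function names are ours:

```c
#include <assert.h>

typedef enum { VAL, NAR, NONE } Kind;       /* modeled element metadata */
typedef struct { Kind kind; int payload; } Elem;

/* Speculable op (no side effects): NaRs and Nones pass through. */
static Elem spec_add(Elem a, Elem b) {
    if (a.kind != VAL) return a;
    if (b.kind != VAL) return b;
    return (Elem){ VAL, a.payload + b.payload };
}

/* Non-speculable store: 1 = stored, 0 = None discarded, -1 = NaR fault. */
static int store_elem(int *mem, Elem e) {
    if (e.kind == NAR)  return -1;          /* FAULT! */
    if (e.kind == NONE) return  0;          /* nothing happens */
    *mem = e.payload;
    return 1;
}
```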

The simple example – using Nones
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
First, fill the Belt with Nones.

An example – using Nones
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
Next, execute the steady-state instruction: load0 add-1 store-2. The negative subscripts operate on the Nones pre-filled onto the Belt.

An example – using Nones
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
Supply data from the Belt, retire results and side effects, and advance the Belt.

An example – using Nones
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
Rinse, repeat: load1 add0 store-1.

An example – using Nones
Source: for (int i = 0; i < N; ++i) A[i] = A[i]+3;
By the third issue (load2 add1 store0), load0’s data reaches memory. No prologue code required – the steady-state loop body is the prologue.

Pipelining in-flight values
Source: for (int i = 0; i < N; ++i) A[i] = A[i]*3;
All example operations so far executed in one cycle. What about operations that take longer? Assume multiply takes three cycles: load0 mul0 noop noop store0 – the store must wait for the mul result.

Pipelining in-flight values
Source: for (int i = 0; i < N; ++i) A[i] = A[i]*3;
Will the previous code work? First instruction: load0 mul-1 store-4. So far so good.

Pipelining in-flight values
Source: for (int i = 0; i < N; ++i) A[i] = A[i]*3;
Second instruction: load1 mul0 store-3. Still OK.

Pipelining in-flight values
Source: for (int i = 0; i < N; ++i) A[i] = A[i]*3;
Third instruction: load2 mul1 store-1 – OOPS! load0 goes to memory: the three-cycle muls retire out of step with the one-cycle loads, so belt positions no longer line up.

The retire operation
Source: for (int i = 0; i < N; ++i) A[i] = A[i]*3;
The retire op forces the drop count of the cycle in which it executes, by dropping Nones if necessary. This example drops two results in steady state, so the instruction contains retire(2): load0 mul-1 store-4 retire(2).

The retire operation
Source: for (int i = 0; i < N; ++i) A[i] = A[i]*3;
load1 mul0 store-3 retire(2)

The retire operation
Source: for (int i = 0; i < N; ++i) A[i] = A[i]*3;
load2 mul1 store-2 retire(2)

The retire operation
Source: for (int i = 0; i < N; ++i) A[i] = A[i]*3;
load3 mul2 store-1 retire(2) – mul0’s result retires now, three cycles after issue.

The retire operation
Source: for (int i = 0; i < N; ++i) A[i] = A[i]*3;
load4 mul3 store0 retire(2) – load0’s value (now multiplied) finally reaches memory. One trip through the loop reaches steady state; thereafter, one iteration per cycle. ILP is limited only by hardware compute capacity.

Use it or lose it
The Belt has a fixed size, suitable for common usage. If there are too many loop-carried variables, values will fall off the end before they can be used. Excess belt data that will be needed later can be spilled to the scratchpad, a special hardware buffer.

Review – the scratchpad
Frame local – each function has a new scratchpad
Fixed max size, must explicitly allocate
Static byte addressing, must be aligned
Three-cycle spill-to-fill latency

The rotator
Each scratchpad allocation (scratchf(…)) makes a new portion of scratchpad, between base and fence, available to spill and fill. Allocations nest: scratchf(…) … scratchf(…). Each allocation has a corresponding rotator – an outer rotator for the outer allocation, an inner rotator for the inner one.

The rotator
A rotator is a circular address remapper. Scratchpad addresses in spill and fill are biased by the cursor, with wrap-around, before conversion to physical addresses. The rotate operation – rotate(20); – advances the cursor, with wrap-around. This gives belt-like renaming to the rotator’s scratchpad.
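
The cursor-biased, wrap-around addressing can be sketched as follows. The struct layout and function names are ours, used only to illustrate the remapping rule:

```c
#include <assert.h>

/* Toy rotator: remaps a scratchpad address by the cursor, with
   wrap-around inside the allocation [base, base+size). */
typedef struct { int base, size, cursor; } Rotator;

static int remap(const Rotator *r, int addr) {
    return r->base + (addr + r->cursor) % r->size;
}

/* rotate advances the cursor, with wrap-around: belt-like renaming. */
static void rotate_by(Rotator *r, int amount) {
    r->cursor = (r->cursor + amount) % r->size;
}
```

After a rotate, the same static address in a spill or fill names a different physical slot, so pipelined code can reuse one instruction for successive iterations’ carried values.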

The inner operation
The Mill treats loops as if they were functions. The inner operation starts a new, nested loop. inner takes arguments just like call: inner(b5, b3); starts the loop with an empty belt, initialized with two values from the outer environment. Typically the arguments are initial values for control variables. inner does not change the stack frame or protection. Like call, operations in-flight at inner are completed after loop exit.

The leave operation
The leave operation exits the innermost loop. leave takes arguments just like return: leave(b4); restores the belt to its state when the corresponding inner was executed, and drops the leave arguments at the front. leave arguments are used for searches. leave discards any computation in-flight in the loop. Most pipelines can be broken off by leave, with no epilogue. leave discards the innermost rotator if one was allocated in the loop.

Loop entrance
(Diagram: inner head,b1,b5,b3,b3 starts an inner belt from values on the outer belt; leave b4 restores the outer belt and drops the result.)
A loop has the same belt effects as an op like add. A loop can drop multiple results.

Belt save/restore
The Spiller is a background save/restore engine.
Values are marked with the owning frame.
Belt access is to the values of the current frame.
Change the current frame id – the belt is empty!
Data is still there, and can be spilled at leisure.
Arguments are passed by copy and get the new frame id.

In-flight over a loop
(Diagram: a three-cycle mul is still in flight when an inner loop starts.) Should its result drop in the middle of the inner loop? NO!

In-flight over a loop
Like a call, the whole inner loop runs while the mul is in flight. Loops are atomic: in-flights retire after the loop exits.

inner/leave at 30,000 feet
for (int i = 0; i < N; ++i) {
    int x = A[i];
    for (int j = 0; j < M; ++j) {
        if (S[j].f == x) { B[i] = S[j].g; break; }
    }
}
The outer loop body compiles to a load of A[i] followed by inner to start the search loop.

Summary #1
The Mill:
Can pipeline essentially any loop – embedded calls and control flow are no problem
Needs no prologue – the steady-state loop body works as the prologue
Treats a loop like a function – permitting pipelining of nested loops at all levels

Summary #2
The Mill:
Has unlimited loop distance – near data on the belt, far data in the scratchpad
Supports unlimited loop-carried variables – no unrolling, no copies
Cleans up automatically at loop exit – in-flight computation is discarded

Summary #3
The Mill:
inner/leave operations support modularity
Loops get arguments and return results

Shameless plug
For technical info about the Mill CPU architecture: millcomputing.com/docs
For future announcements, white papers etc.: millcomputing.com/mailing-list
For investor information: millcomputing.com/investor-list