SCORE - Stream Computations Organized for Reconfigurable Execution. Eylon Caspi, Michael Chu, Randy Huang, Joseph Yeh, Yury Markovskiy, Andre DeHon, John Wawrzynek. U.C. Berkeley BRASS group
Outline Lecture 1 – Introduction – Related Work – SCORE Computational Model – Hardware Requirements – Language Instantiation Lecture 2 – Execution Example – SCORE Run-Time Environment – Example: JPEG – Results and Conclusion
Introduction Problem: Lack of a unifying computational model that allows application portability and longevity without sacrificing a substantial fraction of raw capability. Solution: Stream-based compute model. Divide computation into fixed-size “pages.” Time-multiplex “pages” onto hardware.
Introduction SCORE – Eases development and deployment of reconfigurable computing (RC) applications and broadens their range – Efficient implementation that makes full use of available hardware resources
Introduction Current Issues – Existing targets are not portable: software for RC hardware is tied to a particular device – Existing targets expose fixed resource limitations: impaired expressiveness; algorithm choice restricted by available hardware; no dynamic resource allocation. Addressing the Issues – Virtualize computation, communication, and memory resources – Provide a convenient and efficient model
Introduction SCORE – The programming model is a natural abstraction of communication between spatial hardware blocks. A data flow communication graph captures the blocks of computation (operators) and the communication (streams) between them. This graph is then mapped efficiently onto hardware.
Related Work Villasenor et al., circa 1995 – Motion-wavelet video coder – Hand-partitioned the design into “pages” and manually reconfigured each device: ran on 1/3 as many machines; experienced only 10% overhead. SCORE builds on: – Instruction Set Architecture, data flow, distributed, and streaming computation models – PRISC, DISC, GARP
SCORE Computational Model Compute Model – Abstract model capturing essential semantics of computation Programming Model – Programming constructs providing convenient way to express computations in the compute model Execution Model – Low-level description of the computation and the semantics which the hardware is expected to provide when interpreting this description
Compute Model Graph of computation operators and memory blocks linked together by streams. Streams – Provide node-to-node communication – Single-source, single-sink FIFO queues. Operators – Finite State Machine (FSM) node: interacts via stream links – Turing Complete (TM) node: supports resource allocation and stream operations
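A minimal Python sketch of the compute model above, under stated assumptions: the names `Stream`, `Doubler`, and their methods are hypothetical illustrations and are not part of SCORE or TDF. It shows a stream as a FIFO that accepts exactly one source and one sink, and an FSM-style operator that fires only when an input token is present.

```python
from collections import deque


class Stream:
    """Hypothetical single-source, single-sink FIFO of data tokens."""

    def __init__(self):
        self._tokens = deque()
        self._source = None
        self._sink = None

    def attach_source(self, op):
        # A stream may have only one producer.
        if self._source is not None:
            raise ValueError("stream already has a source")
        self._source = op

    def attach_sink(self, op):
        # A stream may have only one consumer.
        if self._sink is not None:
            raise ValueError("stream already has a sink")
        self._sink = op

    def write(self, token):
        self._tokens.append(token)

    def read(self):
        return self._tokens.popleft()

    def has_data(self):
        return len(self._tokens) > 0


class Doubler:
    """Hypothetical one-state operator: consume one token, emit twice its value."""

    def __init__(self, inp, out):
        inp.attach_sink(self)
        out.attach_source(self)
        self.inp, self.out = inp, out

    def step(self):
        # Fire only when an input token is present; otherwise stall.
        if not self.inp.has_data():
            return False
        self.out.write(2 * self.inp.read())
        return True


def demo():
    a, b = Stream(), Stream()
    op = Doubler(a, b)
    for x in (1, 2, 3):
        a.write(x)
    while op.step():
        pass
    return [b.read() for _ in range(3)]
```

Calling `demo()` pushes three tokens through the operator; attaching a second sink to the same stream raises `ValueError`, which is how this sketch enforces the single-sink rule.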
Compute Model Operations are fully deterministic – Determinism of individual operators – Timing-independent communication – Operators cannot side-effect each other’s state, because: 1. They communicate only through streams, which guarantee a timing-independent order of execution 2. Each memory segment has a single unique owner (no multiple read-write hazards)
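The determinism claim can be illustrated with a small Python sketch. Everything here is an assumption made for illustration (the two operators `add_one` and `square`, the random scheduler, and the function name `run`); it is not SCORE's scheduler. Two operators communicate only through a FIFO, and the order in which they are given a chance to fire is randomized. Because an operator only fires when its input is present and each stream preserves token order, the output does not depend on the schedule.

```python
import random
from collections import deque


def run(schedule_seed):
    """Run a two-operator pipeline under a randomly chosen firing order."""
    rng = random.Random(schedule_seed)
    src = deque([3, 1, 4, 1, 5])  # input stream
    mid = deque()                 # stream between the two operators
    out = []                      # output stream

    def add_one():
        # Fires only if an input token is present.
        if src:
            mid.append(src.popleft() + 1)

    def square():
        # Fires only if an intermediate token is present.
        if mid:
            value = mid.popleft()
            out.append(value * value)

    operators = [add_one, square]
    while src or mid:
        # The scheduler picks any operator; a pick without input is a no-op.
        rng.choice(operators)()
    return out
```

Every seed produces the same output list, `[16, 4, 25, 4, 36]`, even though the interleaving of firings differs from run to run.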
Programming Model Framework independent of device limits Guidelines for efficient execution on any hardware implementation Key Abstractions for Programming model – Operators – Streams – Memory Segments
Programming Model Operators – Represent an algorithmic transformation of input data to produce output data – Building blocks of computation (multiplier, FIR, FFT) – Size of an operator in hardware is implementation dependent and is not limited by the programming model – Partitioning is an integral part of the automated compilation process
Programming Model Streams – Communication uses streaming data flow – Producer connected to consumer via streams – Defines where data is logically routed – Each stream acts as an unbounded-length queue of data tokens – Data presence signals: operators signal when they produce and consume data
Programming Model Memory Segments – Contiguous block of memory – Serves as the basic unit for memory management – Used by assigning it a specific operating mode, then linking it into a data flow graph
Programming Model Dynamic Features – Dynamic-rate operators: consume / produce tokens at data-dependent rates. Allow efficient operators for tasks such as data compression (JPEG), decompression, searching, and filtering. Scheduling decisions should be made at run time – Dynamic graph composition and instantiation: computational graphs can be created, extended, or modified during execution – Dynamic handling of uncommon events (exception handling)
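As an illustration of a dynamic-rate operator, here is a minimal Python sketch of a run-length encoder. The choice of run-length encoding and the name `rle_operator` are assumptions made for this example, not something the slides specify. The operator consumes one token per step but emits an output token only when a run ends, so its output rate depends on the data.

```python
def rle_operator(tokens):
    """Consume one token at a time; emit (value, count) only when a run ends."""
    prev, count = None, 0
    for token in tokens:
        if count and token == prev:
            # Same run continues: consume input, produce nothing.
            count += 1
        else:
            # Run ended: produce one output token for the finished run.
            if count:
                yield (prev, count)
            prev, count = token, 1
    if count:
        # Flush the final run at end of stream.
        yield (prev, count)


encoded = list(rle_operator("aaabccd"))
```

Seven input tokens yield four output tokens here; a different input of the same length could yield anywhere from one to seven, which is why a static schedule cannot be fixed at compile time for such operators.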
Execution Model 3 Key Components – Compute Page (CP): fixed-size block of RC logic which is the basic unit of virtualization and scheduling – Memory Segment: contiguous block of memory which is the basic unit for data page management – Stream Link: logical connection between the output of one page and the input of another page
Hardware Virtualization Compute pages, segments, and streams are the fundamental units for – allocation – virtualization – management of hardware resources
Example of Stream Buffer Execution
Model Implications Advice for Programmers – Describe computations as spatial pipelines with multiple, independent computational paths – Avoid or minimize feedback cycles – Expose large data streams to SCORE operators
Hardware Requirements – Sequential processor and RC device – RC device divided into a number of equivalent and independent compute pages – Multiple distributed memory blocks to store intermediate data – High-bandwidth, low-latency communication among compute pages and memory, allowing memory pages to be used concurrently
Language Instantiation One could define – subsets of conventional HDLs – subsets of conventional programming languages (C++, Java). Instead they define – an RTL language to describe SCORE operators: TDF, an intermediate language
Language Requirements SCORE operators are synchronous, single-clock entities with their own state – Communicate only through designated I/O streams – Operation is gated by data presence on the I/O streams – Each operator is viewed as an FSM with an associated data path. SCORE does not have a global shared memory abstraction among operators – Recall that each memory segment has a single owner (no two operators can share memory at the same time)
TDF RTL description with special syntax for handling input and output data streams from the operator – Data path operators similar to C. To allow dynamic operators, the basic form is an FSM – Each state specifies the inputs which must be present before it can “fire” – When the inputs arrive, the operator consumes them and the FSM may choose to change states
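The firing rule above can be sketched in Python. This is not TDF syntax; the class `SelectOperator`, its state names, and its stream names are hypothetical and chosen only to illustrate the idea. Each state lists the inputs that must be present before it can fire, and firing consumes those inputs and may change state.

```python
from collections import deque


class SelectOperator:
    """Hypothetical FSM operator: read a selector, then forward a token
    from input 'a' (selector 0) or input 'b' (any other selector)."""

    # For each state, the input streams that must hold a token before firing.
    REQUIRED = {"ctl": ("sel",), "a": ("a",), "b": ("b",)}

    def __init__(self):
        self.state = "ctl"
        self.inputs = {"sel": deque(), "a": deque(), "b": deque()}
        self.out = deque()

    def can_fire(self):
        return all(self.inputs[name] for name in self.REQUIRED[self.state])

    def fire(self):
        if not self.can_fire():
            return False  # stall until the required inputs arrive
        if self.state == "ctl":
            choice = self.inputs["sel"].popleft()
            self.state = "a" if choice == 0 else "b"
        else:
            self.out.append(self.inputs[self.state].popleft())
            self.state = "ctl"
        return True


op = SelectOperator()
op.inputs["sel"].extend([0, 1, 1])
op.inputs["a"].append(10)
op.inputs["b"].extend([20, 30])
while op.fire():
    pass
forwarded = list(op.out)
```

Because the set of required inputs depends on the current state, the operator consumes from `a` and `b` at data-dependent rates, which is what makes the FSM form suitable for dynamic operators.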
END PART 1 Tune in next week for exciting examples
Execution Example Reference Figure 16 – Shows an example of a C++ program which uses the merge and uniq operators * SCORE operator instantiation and composition can be performed from C++ code
Example - Assumptions Design consists of 3 behavioral operators – Full implementation of each operator requires only one compute page. The RC array contains one compute page and three configurable memory blocks (CMBs) – Each CMB is partitioned into 4 segments (s0 - s3): s0 and s1 buffer computation data; s2 and s3 store state / configuration for a compute page
Example - Assumptions CMB state is maintained by a controller – Details are not shown in this example. Each compute page has 2 input and 2 output FIFO buffers. Scheduling and array reconfiguration are performed at the beginning of each timeslice
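To make the time-multiplexing idea concrete, here is a rough Python sketch of three operators sharing one compute page. It rests on several assumptions that the slides do not state: a round-robin scheduling policy, a fixed per-timeslice firing budget (`slice_budget`), and three generic stand-in operators (add one, double, collect) rather than the operators of the actual example. Intermediate tokens wait in buffers (`s0`, `s1`) while their consumer is not resident on the page.

```python
from collections import deque


def make_pipeline(tokens):
    """Build three stand-in operators A -> B -> C joined by buffers."""
    src, s0, s1, out = deque(tokens), deque(), deque(), []

    def op_a():
        if not src:
            return False
        s0.append(src.popleft() + 1)
        return True

    def op_b():
        if not s0:
            return False
        s1.append(s0.popleft() * 2)
        return True

    def op_c():
        if not s1:
            return False
        out.append(s1.popleft())
        return True

    return [("A", op_a), ("B", op_b), ("C", op_c)], out


def run_timeslices(operators, slice_budget=4):
    """Assumed round-robin policy: one operator occupies the page per timeslice."""
    timeline, idle, i = [], 0, 0
    while idle < len(operators):
        # "Reconfigure" at the start of the timeslice: pick the next operator.
        name, step = operators[i % len(operators)]
        fired = 0
        while fired < slice_budget and step():
            fired += 1
        timeline.append((name, fired))
        # Stop once every operator has had a timeslice with nothing to do.
        idle = idle + 1 if fired == 0 else 0
        i += 1
    return timeline


operators, out = make_pipeline(range(6))
timeline = run_timeslices(operators)
```

With six input tokens and a budget of four firings per timeslice, each operator needs two turns on the page; the sketch then runs one idle round to detect that no work remains.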
Execution Example Physical view of array at each point in timeline Single Letter identifiers assigned – A: merge (inputs i0, i1) – B: merge (inputs t1, t2) – C: uniq – Segments: S0, S1
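The operator graph named on this slide can be sketched in Python. The semantics here are assumptions, since the slides do not define them: `merge` is taken to be a merge of two sorted streams, `uniq` is taken to drop adjacent duplicates, and `t2` is treated as an external input stream. The function names are hypothetical, and the standard-library calls `heapq.merge` and `itertools.groupby` stand in for the hardware operators.

```python
import heapq
from itertools import groupby


def merge(x, y):
    """Assumed semantics: merge two sorted streams into one sorted stream."""
    return heapq.merge(x, y)


def uniq(stream):
    """Assumed semantics: drop adjacent duplicate tokens."""
    return (key for key, _ in groupby(stream))


def merge_uniq_graph(i0, i1, t2):
    t1 = merge(i0, i1)      # operator A
    merged = merge(t1, t2)  # operator B
    return uniq(merged)     # operator C


result = list(merge_uniq_graph([1, 3, 5], [1, 2, 5], [2, 4, 5]))
```

All three stages are lazy iterators, so tokens flow from A through B to C one at a time, loosely mirroring streaming data flow between operators.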
Timeline for Execution Example
Step-by-Step Execution Example
SCORE Run-Time Environment Building Applications Run-Time Environment
Example: JPEG
Conclusion
Figure 18
Figure 19
Figure 20
Table 2
Figure 21
Figure 4