Statically Bounding Memory Usage for SCORE Process Networks
Eylon Caspi, EE290N, 5/15/02
University of California, Berkeley

Overview
- Motivation
- SCORE, "Page Packing"
- Bounding Memory using Automata Composition
- Preliminary Results
- Open Questions / Future Work

Reconfigurable Computing
- Programmable logic + programmable interconnect (e.g. FPGA)
- Fills the gap:
  - 10x-100x better than microprocessor
  - 10x-100x worse than ASIC
  - In performance (MOPS), in density (MOPS/mm^2, MOPS/mW)
- Hardware scales by tiling / duplicating
  - High parallelism; spatial data paths
- But no abstraction for software survival
  - No binary compatibility
  - No performance scaling
  - Designer targets a specific device, specific resource constraints

Virtual Hardware
- Compute model has unbounded resources
  - Programmer no longer targets particular device size
- Paging
  - "Compute pages" swapped in/out (like VM)
  - Page context = thread (FSM to access streams, block)
- Efficient virtualization
  - Amortize reconfiguration cost over an entire input buffer
[Figure: compute pages Transform, Quantize, RLE, Encode connected through buffers]

SCORE Model
SCORE = Stream Computations Organized for Reconfigurable Execution
- Program = data-flow graph of stream-connected SFSMs
  - Kahn process network: blocking read, non-blocking write (almost)
- Compute: SFSM (Streaming Finite State Machine)
  - Concretely: page + FSM to implement token-flow semantics
  - Abstractly: task with local control
- Communication: Stream
  - FIFO channel, unbounded buffering
- Storage: Memory Segment
  - Memory block with streaming interface
- Dynamics:
  - Dynamic local behavior in SFSM
  - Unbounded resource usage: stream buffer expansion
  - Dynamic graph allocation in STM (Streaming Turing Machine)
- Model admits parallelism at multiple levels: ILP, pipeline, data

SCORE Programming Model: TDF
- TDF = intermediate, behavioral language for:
  - EFSM operators
  - Static operator graphs
- State machine for:
  - Firing signatures
  - Control flow (branching)
- Firing semantics: when in state X, wait for X's inputs, then fire (consume, act)

    select ( input boolean s,
             input unsigned[8] t,
             input unsigned[8] f,
             output unsigned[8] o )
    {
      state S (s) : if (s) goto T;
                    else goto F;
      state T (t) : o=t; goto S;
      state F (f) : o=f; goto S;
    }

[Figure: select operator with inputs s, t, f and output o]

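The firing semantics of the select operator can be made concrete with a small simulation. This is a minimal Python sketch, not part of TDF or the SCORE tools: the name `run_select` and the use of Python lists as finite streams are assumptions made for illustration. Each state waits for the input named in its signature; when that input is empty the operator stalls there.

```python
from collections import deque

def run_select(s, t, f):
    """Simulate the select operator until the current state's input runs dry.

    Hypothetical helper: s, t, f are finite token lists standing in for streams.
    Returns the emitted output tokens and the state the operator stalled in.
    """
    s, t, f = deque(s), deque(t), deque(f)
    out, state = [], "S"
    while True:
        if state == "S":            # signature (s): wait for a token on s
            if not s:
                break
            state = "T" if s.popleft() else "F"
        elif state == "T":          # signature (t): forward one token from t
            if not t:
                break
            out.append(t.popleft())
            state = "S"
        else:                       # state F, signature (f): forward one token from f
            if not f:
                break
            out.append(f.popleft())
            state = "S"
    return out, state

result = run_select([True, False, True], [1, 2], [9])
```

Note that tokens left on the unselected input stay queued, which is exactly why stream depth matters for the rest of the analysis.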
The Compilation Problem
- Programming Model: communicating EFSM operators
  - Unrestricted size, # IOs, timing
- Execution Model: communicating page configs
  - Fixed size, # IOs, timing
  - Paged virtual hardware
- Compilation is a resource-binding transform on state machines + data-paths
[Figure: "Compile" arrow from a graph of TDF operators, streams, and memory segments to a graph of compute pages, streams, and memory segments]

A Problem with Page Packing
- Intent: Pack multiple, small operators onto 1 page to reduce fragmentation
- Problem: Streams within a page are registers; must guarantee bounded stream depth
- Theorem: Memory bound is undecidable (for a Turing complete process network model)
[Figure: operators packed onto Page 1 and Page 2]

Possible Solutions
- Handle unbounded streams
  - External buffering
  - Registers (guess at depth) + external buffer as fall-back
  - Not practical for many small ops
- Guarantee depth bound for some cases
  - Only one FSM per page + I/O pipelines
  - Identify compatible FSMs, balanced schedules

Bounded Buffer Example: Single Stream
[Figure: producer/consumer FSM pairs that write (x=) and read (=x) a single stream x; panels: Static Rate, Dynamic Rate]
- A pair composition with a single stream needs no buffer

Unbounded Buffer Example: Multi Stream
[Figure: producer/consumer FSM pairs that write (x=, y=) and read (=x, =y) two streams x and y; panels: Bounded buffer, Unbounded buffer]
- Ad-hoc analysis gets complicated quickly
- What about >2 SFSMs?

Interface Automata
(de Alfaro + Henzinger, Symp. Found. SW Eng. (FSE) 2001)
- A finite state machine that transitions on I/O actions
  - Not input-enabled (not every I/O on every cycle)
- $G = (V, E, A^i, A^o, A^h, V_{start})$
  - $A^i$ = input actions, written x? (in CSP notation)
  - $A^o$ = output actions, written y!
  - $A^h$ = internal actions, written z;
  - $E \subseteq V \times (A^i \cup A^o \cup A^h) \times V$ (transition on action)
- Execution trace = (v, a, v, a, …) (non-deterministic branching)
[Figure: the select operator (inputs s, t, f; output o) as an interface automaton with states S, S', T, T', F, F' and actions s?, st;, sf;, t?, f?, o!]

Automata Composition
- Composition ~ product FSM with synchronization (rendezvous) on common actions
- Composition edges:
  (i) step A on unshared action
  (ii) step B on unshared action
  (iii) step both on shared action
- Compatible composition => bounded memory
[Figure: automaton A (states A, A'; actions x?, y!) and automaton B (states B, B'; actions z!, y?) connected by stream y. The direct product over states AB, A'B, AB', A'B' has edges x?, y!, z!, y?; the composition has edges x?, z!, and the internal action y;]

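The three kinds of composition edge can be sketched in a few lines of Python. This is an illustrative sketch only: the dictionary representation, the function name `compose`, and the exact shape of the example automata (A loops x? then y!, B loops z! then y?) are assumptions read off the slide's figure, not code from the SCORE tools. Actions are strings whose last character gives the kind (`?` input, `!` output, `;` internal).

```python
from collections import deque

def compose(p, q):
    """Product of two automata with rendezvous on actions whose names they share."""
    shared = ({a[:-1] for _, a, _ in p["edges"]} &
              {a[:-1] for _, a, _ in q["edges"]})
    start = (p["start"], q["start"])
    edges, seen, work = [], {start}, deque([start])
    while work:
        sp, sq = work.popleft()
        out_p = [(a, d) for s, a, d in p["edges"] if s == sp]
        out_q = [(a, d) for s, a, d in q["edges"] if s == sq]
        succ = []
        # (i) step P alone on an unshared action
        succ += [(a, (d, sq)) for a, d in out_p if a[:-1] not in shared]
        # (ii) step Q alone on an unshared action
        succ += [(a, (sp, d)) for a, d in out_q if a[:-1] not in shared]
        # (iii) step both on a shared action; the result is internal
        for a, dp in out_p:
            for b, dq in out_q:
                if a[:-1] == b[:-1] and {a[-1], b[-1]} == {"!", "?"}:
                    succ.append((a[:-1] + ";", (dp, dq)))
        for action, nxt in succ:
            edges.append(((sp, sq), action, nxt))
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return {"start": start, "states": seen, "edges": edges}

# Assumed reading of the slide's example automata.
A = {"start": "A", "edges": [("A", "x?", "A'"), ("A'", "y!", "A")]}
B = {"start": "B", "edges": [("B", "z!", "B'"), ("B'", "y?", "B")]}
C = compose(A, B)
```

Only reachable product states are built, so the composite can be smaller than the full direct product.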
Compatibility
- $Illegal(P,Q)$ = { reachable product states $(p,q) \in V_P \times V_Q$ s.t. p produces a shared action that q does not accept, or vice versa }
- Interface automata P, Q are compatible if:
  - CSP: always
  - I/O Automata: $Illegal(P,Q) = \emptyset$
  - Interface Automata: $Illegal(P,Q,Env) = \emptyset$
    - Least restrictive environment Env accepts all outputs, provides no inputs
    - Compatible states are those that never reach illegal states via internal, output transitions
- This is overly restrictive for SCORE
  - Can enter an illegal state, stall the illegal producer, and step the consumer on a different action!
- $Illegal_{SCORE}(P,Q)$ = { reachable product states $(p,q) \in V_P \times V_Q$ s.t. (p produces a shared action that q does not accept, and q has no alternative non-shared actions), or vice versa }
  = reachable $deadlock(V_P \times V_Q) \setminus (deadlock(V_P) \times deadlock(V_Q))$
[Figure: producer A with action x! against consumer B with action x?, and producer A with action x! against B with internal action t;]

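The deadlock form of the SCORE illegal-state set can be sketched as a reachability search. This Python sketch is an assumption-laden illustration, not the author's implementation: it takes the deadlock reading of the definition (a reachable product state with no enabled move, where at least one component still has an outgoing edge), and the names `illegal_score`, `A`, `B_yx`, and `B_xy` are hypothetical. Actions are strings ending in `?` (input), `!` (output), or `;` (internal).

```python
from collections import deque

def illegal_score(p, q):
    """Reachable product states that deadlock although a component could still act."""
    shared = ({a[:-1] for _, a, _ in p["edges"]} &
              {a[:-1] for _, a, _ in q["edges"]})
    start = (p["start"], q["start"])
    seen, work, illegal = {start}, deque([start]), set()
    while work:
        sp, sq = work.popleft()
        out_p = [(a, d) for s, a, d in p["edges"] if s == sp]
        out_q = [(a, d) for s, a, d in q["edges"] if s == sq]
        # Unshared actions step one side; shared actions need a matching pair.
        succ = [(d, sq) for a, d in out_p if a[:-1] not in shared]
        succ += [(sp, d) for a, d in out_q if a[:-1] not in shared]
        succ += [(dp, dq) for a, dp in out_p for b, dq in out_q
                 if a[:-1] == b[:-1] and {a[-1], b[-1]} == {"!", "?"}]
        if not succ and (out_p or out_q):
            illegal.add((sp, sq))
        for nxt in succ:
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return illegal

# A writes x then y. One consumer reads y before x, the other reads x before y.
A = {"start": "A", "edges": [("A", "i?", "A'"), ("A'", "x!", "A''"), ("A''", "y!", "A")]}
B_yx = {"start": "B", "edges": [("B", "y?", "B'"), ("B'", "x?", "B''"), ("B''", "o!", "B")]}
B_xy = {"start": "B", "edges": [("B", "x?", "B'"), ("B'", "y?", "B''"), ("B''", "o!", "B")]}
bad = illegal_score(A, B_yx)
good = illegal_score(A, B_xy)
```

With no buffering, the mismatched read order deadlocks as soon as A tries to write x, while the matching order never does.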
Alternate Composition Semantics
- Automata Composition of P, Q (the rest of this talk)
  - Compatibility = no reachable deadlock
  - Pessimistic; correct in any environment
  - Any state can stall => state explosion
- Interface Composition (P || Q)
  - Compatibility = no reachable illegal states for given env.
  - Optimistic; correct in environment that provides no inputs
  - Outputs cannot stall => smaller composition
- SCORE Composition of P, Q?
  - How to get smaller compositions, correct in any environment?
  - Strategic use of output stall
  - Compatibility by construction? (disallow transitions to illegal paths)
  - Minimal stutter (stall) transitions?

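An Incompatible SCORE Composition
[Figure: A (states A, A', A''; actions i?, x!, y!) and B (states B, B', B''; actions y?, x?, o!) connected by streams x and y, with input i and output o. The product automaton over states AB through A''B'' has edges i?, x;, y;, o!]
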
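Adding a Buffer
[Figure: a queue Q (states Q, Q'; actions x?, x!) is inserted on stream x between A and B. A (states A, A', A''; actions i?, x!, y!) composed with Q gives states AQ, A'Q, A''Q, AQ', A'Q', A''Q' with edges i?, x;, x!, y!]
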
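The one-place queue Q on this slide generalizes to any finite depth. The following Python sketch builds such a queue automaton; the function name `queue_automaton` and the choice to give the read side a distinct action name (so that producer and consumer no longer rendezvous directly) are assumptions for illustration, not notation from the slides.

```python
def queue_automaton(write, read, depth):
    """Interface automaton for a FIFO of the given depth.

    State k is the number of buffered tokens. The queue accepts `write?`
    while it is not full and offers `read!` while it is not empty.
    """
    edges = []
    for k in range(depth):
        edges.append((k, write + "?", k + 1))   # accept a token: k -> k+1
        edges.append((k + 1, read + "!", k))    # emit a token:   k+1 -> k
    return {"start": 0, "edges": edges}

# Depth 1 reproduces the two-state queue (Q, Q') of the slide.
Q1 = queue_automaton("x", "x_out", 1)
Q3 = queue_automaton("x", "x_out", 3)
```

A queue of depth d has d+1 states and 2d edges, so each extra place of buffering multiplies the size of the composite state space.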
Buffered Composition
[Figure: the composition of A and Q (states AQ through A''Q') is composed with B (states B, B', B''; actions y?, x?, o!). The resulting product has states AQB through A''Q'B'' and edges i?, x;, y;, o!]

Adding a Buffer, Alternate Order
[Figure: the queue Q (states Q, Q'; actions x?, x!) is composed with B first (states B, B', B''; actions y?, x?, o!), giving states QB, Q'B, QB', Q'B', QB'', Q'B'' with edges x?, y?, x;, o!]

Buffered Composition, Alternate Order
[Figure: A (states A, A', A''; actions i?, x!, y!) is composed with the composition of Q and B (states QB through Q'B''). The resulting product has states AQB through A''Q'B'' and edges i?, x;, y;, o!]

Composition is Associative*
[Figure: composing A with (Q composed with B) and composing (A composed with Q) with B yield the same product automaton over states AQB through A''Q'B'', with edges i?, x;, y;, o!]

Static Stream Depth Bound Analysis
- Basic idea: try to compose A, B with increasingly large queues
- Given: graph of TDF operators
- Output: stream depth bound ($\in \mathbb{N} \cup \{\infty\}$) for each stream

    Initialize: depth[e_i] <- 0 for all streams (edges) e_i
    For each pair (A,B) of connected operators
      Let {e_i} be the set of streams connecting A, B
      Construct interface automata for A, B
        each e_i induces actions:
          shared action a_ei                   if depth[e_i] < ∞
          non-shared actions a^i_ei, a^o_ei    if depth[e_i] = ∞
      While not Done
        Construct composition C of A, B, and the queues
          { Q(depth[e_i]) for all e_i s.t. depth[e_i] < ∞ }
        Compute illegal states: Illegal_SCORE(C)
        If Illegal_SCORE(C) is empty
          Done with pair A, B
        Else
          For each shared action a_ei that is output but not input
              in some illegal state s in Illegal_SCORE(C)
            depth[e_i]++
            If (depth[e_i] ≥ depth_threshold) then depth[e_i] <- ∞
          If depth[e_i] = ∞ for all e_i then Done with pair A, B
    Return depth[]

[Figure: operators A and B connected by streams e1 and e2, with input i and output o]

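The loop above can be sketched for a single operator pair in Python. This is a simplified illustration under stated assumptions, not the author's tool: automata are dictionaries of edges, actions are strings ending in `?`, `!`, or `;`, a stream with depth 0 is a direct rendezvous, a stream with depth above 0 is split into hypothetical `e.w` / `e.r` actions around a queue, illegal states are taken to be reachable deadlocks, and stream names must not contain a dot. The names `depth_bound`, `step`, `rename`, and `queue` are all invented here.

```python
from collections import deque

INF = float("inf")

def queue(e, depth):
    """Finite FIFO on stream e: accepts e.w? when not full, emits e.r! when not empty."""
    edges = [(k, e + ".w?", k + 1) for k in range(depth)]
    edges += [(k + 1, e + ".r!", k) for k in range(depth)]
    return {"start": 0, "edges": edges}

def rename(m, buffered):
    """Split each buffered stream e into a write side e.w and a read side e.r."""
    def fix(a):
        name, kind = a[:-1], a[-1]
        if name in buffered and kind in "!?":
            name += ".w" if kind == "!" else ".r"
        return name + kind
    return {"start": m["start"], "edges": [(s, fix(a), d) for s, a, d in m["edges"]]}

def step(machines, state):
    """Return (successor states, streams whose enabled output nobody accepts)."""
    out = [[(a, d) for s, a, d in m["edges"] if s == st]
           for m, st in zip(machines, state)]
    alphabet = [{a[:-1] for _, a, _ in m["edges"]} for m in machines]
    succ, blocked = [], set()
    for i, edges_i in enumerate(out):
        for a, d in edges_i:
            partners = [j for j in range(len(machines))
                        if j != i and a[:-1] in alphabet[j]]
            if not partners:                 # unshared action: step alone
                succ.append(state[:i] + (d,) + state[i + 1:])
            elif a[-1] == "!":               # shared output: needs a matching input
                j = partners[0]
                targets = [dj for b, dj in out[j] if b == a[:-1] + "?"]
                for dj in targets:
                    nxt = list(state)
                    nxt[i], nxt[j] = d, dj
                    succ.append(tuple(nxt))
                if not targets:
                    blocked.add(a[:-1].split(".")[0])
    return succ, blocked

def depth_bound(a, b, streams, threshold):
    """Grow queue depths until the composition of a, b, and the queues has no deadlock."""
    depth = {e: 0 for e in streams}
    while True:
        buffered = {e for e in streams if depth[e] > 0}
        machines = [rename(a, buffered), rename(b, buffered)]
        machines += [queue(e, depth[e]) for e in sorted(buffered) if depth[e] < INF]
        start = tuple(m["start"] for m in machines)
        seen, work, grow, deadlocked = {start}, deque([start]), set(), False
        while work:
            succ, blocked = step(machines, work.popleft())
            if not succ:                     # reachable deadlock = illegal state
                deadlocked = True
                grow |= blocked
            for nxt in succ:
                if nxt not in seen:
                    seen.add(nxt)
                    work.append(nxt)
        grow = {e for e in grow if e in depth and depth[e] < INF}
        if not deadlocked or not grow:       # second case: not fixable by buffering
            return depth
        for e in grow:
            depth[e] += 1
            if depth[e] >= threshold:
                depth[e] = INF

# A writes x then y; B reads y then x, so x needs one place of buffering.
A = {"start": "A", "edges": [("A", "i?", "A'"), ("A'", "x!", "A''"), ("A''", "y!", "A")]}
B = {"start": "B", "edges": [("B", "y?", "B'"), ("B'", "x?", "B''"), ("B''", "o!", "B")]}
bound = depth_bound(A, B, ["x", "y"], threshold=4)
```

The early return when a deadlock has no blocked output is an addition of this sketch; the slide's pseudocode leaves that case open.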
Results: Pair Composition*
Stream columns: with streams to mem. SFSM columns: without streams to mem.

App               #streams  Trivially  Non-triv.  Unbounded  #SFSMs  #SFSM  Trivially   Non-triv.   Not
                            bounded    bounded                       pairs  composable  composable  composable
IIR               9         9          0          0          8       7      7           0           0
Wavelet Encode    58        35         0          23         30      24     15          0           9
Wavelet Decode    57        34         12         11         27      31     26          0           5
JPEG Encode       62        25         13         24         13      11     6           1           4
JPEG Decode       61        -          -          -          12      -      -           -           -
MPEG Encode IP    421       351        43         70         92      154    144         5           5
MPEG Encode IPB   488       402        58         22         114     211    192         8           5

* Max stream depth = 2

Maximum Depth Parameter

App               #streams  Non-trivially bounded,     #SFSM  Non-trivially composable,
                            for given max stream depth pairs  for given max stream depth
                            0    1    2    3    4             0    1    2    3    4
IIR               9         0    0    0    0    0      7      0    0    0    0    0
Wavelet Encode    58        0    0    0    0    0      24     0    0    0    0    0
Wavelet Decode    57        12   12   12   12   12     31     0    0    0    0    0
JPEG Encode       62        10   12   13   15   16     11     0    1    1    1    1
JPEG Decode       61        17   18   -    -    -      9      2    2    -    -    -
MPEG Encode IP    421       39   42   43   -    -      154    4    5    5    -    -
MPEG Encode IPB   488       53   57   58   -    -      211    6    8    8    -    -

Composite Automaton Size

App               # nodes in largest composition, for given max stream depth   Run time (seconds),
                  0        1         2          3          4                   max stream depth 2
IIR               12       12        12         12         12                  0.2
Wavelet Encode    1587     1587      3094       11,417     30,630              7.5
Wavelet Decode    961      1922      11,798     37,712     84,784              7.2
JPEG Encode       3785     6175      8086       16,244     27,138              9.5
JPEG Decode       3785     196,576   196,576*   196,576*   196,576*            245*
MPEG Encode IP    7887     89,541    334,757    334,757*   334,757*            125
MPEG Encode IPB   8478     100,909   385,334    385,334*   385,334*            150

* Crashed; partial results

Composing More than 2 SFSMs
- Page Packing by incrementally growing a cluster?
- Larger composition should improve stream depth bound
  - Restricts environment around a pair of SFSMs
  - Fewer transitions, fewer reachable deadlocks
- BUT larger composition can expose deadlocked feedback loop
[Figure: composing 2 SFSMs across Page 1 and Page 2 (stream depths 1, 2, 4, ∞) versus composing 3 SFSMs on Page 1 (stream depths 1, 2, ?, 4, ∞, ?)]

Synthesizing a Composite SFSM?
- How to turn a composite automaton into TDF or page logic?
- TDF does not support all non-deterministic branches
  - Multiple inputs: ok (state with multiple signatures / cases)
  - Multiple outputs: must sequentialize (how?)
  - Input + output: ???
    - Input before output: may cause deadlock if output feeds back to input
    - Output before input: may stall composite on output back-pressure
- Conjecture: It is always safe to sequentialize outputs before inputs
- Heavier-weight automata can check input availability / output space before blocking on I/O
  - "System-Level Types for Component-Based Design," Lee + Xiong, EMSOFT 2001 (used in Ptolemy)
[Figure: IA composition A || B over states AB, A'B, AB', A'B' with edges x?, y;, z!]

Summary
- SCORE
  - Process network model to support software longevity, scalability on massively parallel HW
- Automata composition with finite queues
  - Compatibility => bounded memory
- Initial results: pair composition
- Future work:
  - Faster run time (semantics for smaller composite size)
  - Compose more than 2 SFSMs
  - Page Packing

Supplemental

TDF as a Dataflow Process Network
- Dataflow Process Network [Parks+Lee, IEEE May '95]
  - Process enabled by set of firing rules: $R = \{R_1, R_2, \ldots, R_N\}$
  - Firing rule = set of patterns: $R_i = \{R_{i,1}, R_{i,2}, \ldots, R_{i,p}\}$
- DF process for a TDF operator:
  - Feedback arc for state
  - One firing rule per state
  - Patterns match state value + presence of desired inputs
  - E.g. for state i: $R_i = \{R_{i,1}, R_{i,2}, \ldots, [i]\}$
  - Patterns:
    - $R_{i,j} = [*]$ if input j is in state i's input signature
    - $R_{i,j} = \bot$ if input j is not in state i's input signature
    - $R_{i,p} = [i]$ for the final input, representing the state arc
- These are sequential firing rules
- Partitioned SFSM adds "wait" state
[Figure: a process with a state feedback arc]

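The construction of one firing rule per state can be illustrated with a short Python sketch. The representation is an assumption made here, not taken from the Parks and Lee paper or from TDF: `"*"` stands for the wildcard pattern [*], `None` stands for the empty pattern on inputs outside the state's signature, and the last pattern is the state value expected on the feedback arc. The names `firing_rules`, `enabled_rule`, and `SELECT_RULES` are hypothetical.

```python
WILDCARD = "*"   # pattern [*]: one token of any value must be present
EMPTY = None     # empty pattern: this rule does not examine the input

def firing_rules(signatures, inputs):
    """Build one firing rule per state: input patterns followed by the state pattern."""
    rules = {}
    for state, signature in signatures.items():
        patterns = [WILDCARD if name in signature else EMPTY for name in inputs]
        rules[state] = patterns + [state]
    return rules

def enabled_rule(rules, available, state_token):
    """Return the state whose rule matches, or None if the process must wait.

    `available` holds the number of queued tokens per input, in input order;
    `state_token` is the value currently on the state feedback arc.
    """
    for state, patterns in rules.items():
        *input_patterns, state_pattern = patterns
        if state_pattern != state_token:
            continue
        if all(p is EMPTY or n > 0 for p, n in zip(input_patterns, available)):
            return state
    return None

# Signatures of the select operator: state S reads s, T reads t, F reads f.
SELECT_RULES = firing_rules({"S": {"s"}, "T": {"t"}, "F": {"f"}}, ["s", "t", "f"])
```

Because every rule ends in a distinct state pattern, at most one rule can match at a time, which is what makes the rules sequential.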
SFSM Partitioning Transform
- Only 1 partition active at a time
  - Transform to activate via streams
- New state in each partition: "wait"
  - Used when not active
  - Waits for activation from other partition(s)
  - Has one input signature (firing rule) per activator
- Firing rules are not sequential, but determinism guaranteed
  - Only 1 possible activator
- Activation streams from given source to given dest. partitions can be merged + binary-encoded
[Figure: an SFSM with states A, B, C, D split into partition {A,B} (states A, B, Wait AB) and partition {C,D} (states C, D, Wait CD)]

SCORE Hardware Model
- Paged FPGA
  - Compute Page (CP)
    - Fixed-size slice of RC hardware
    - Fixed number of I/O ports
  - Distributed, on-chip memory
    - Configurable Memory Block (CMB)
    - Stream access
  - High-level interconnect
- Microprocessor
  - Run-time support + user code

Functional Simulation
- FPGA based on HSRA [Berkeley, FPGA '99]
  - CP: 512 4-LUTs
  - CMB: 2 Mbit DRAM
  - Area for CP-CMB pair:
    - 0.25 µm: 12.9 mm^2 (1/9 of PII-450)
    - 0.18 µm: 6.7 mm^2 (1/16 of PIII-600)
  - Page reconfiguration: 5000 cycles (from CMB)
  - Synchronous operation (same clock speed as processor)
- x86 microprocessor
- Page Scheduler task
  - Swap on timer interrupt (every 250,000 cycles)
  - Fully dynamic scheduling

Application: JPEG Encode
[Figure: JPEG Encode application graph]

Execution Results
[Four plots; axis label: Hardware Size (CP-CMB Pairs)]

Page Hardware Model
- Page = fixed-size slice of resources + stream interface
- FSM for:
  - Firing
  - Output emission
  - Data-path control
  - Branching
[Figure: page block diagram with labels FSM, Reconfigurable, Fixed logic]

Page Firing Logic
- Sample firing logic
  - 3 inputs (A,B,C)
  - 3 outputs (X,Y,Z)
  - Single signature