Statically Bounding Memory Usage for SCORE Process Networks. Eylon Caspi, EE290N, 5/15/02, University of California, Berkeley.


1 Statically Bounding Memory Usage for SCORE Process Networks
Eylon Caspi, EE290N, 5/15/02, University of California, Berkeley
[Figure labels: I_A, I_B, O_A, O_B]

2 Overview
- Motivation
- SCORE, “Page Packing”
- Bounding Memory using Automata Composition
- Preliminary Results
- Open Questions / Future Work

3 Reconfigurable Computing
- Programmable logic + programmable interconnect (e.g. FPGA)
- Fills the gap
  - 10x-100x better than microprocessor
  - 10x-100x worse than ASIC
  - In performance (MOPS), in density (MOPS/mm², MOPS/mW)
- Hardware scales by tiling / duplicating
  - High parallelism; spatial data paths
- But no abstraction for software survival
  - No binary compatibility
  - No performance scaling
  - Designer targets a specific device, specific resource constraints
Graphics copyright by their respective company

4 Virtual Hardware
- Compute model has unbounded resources
  - Programmer no longer targets particular device size
- Paging
  - “Compute pages” swapped in/out (like VM)
  - Page context = thread (FSM to access streams, block)
- Efficient virtualization
  - Amortize reconfiguration cost over an entire input buffer
[Figure: compute pages (Transform, Quantize, RLE, Encode) connected by buffers]

5 SCORE Model
SCORE = Stream Computations Organized for Reconfigurable Execution
- Program = data-flow graph of stream-connected SFSMs
  - Kahn process network
  - Blocking read, non-blocking write (almost)
- Compute: SFSM (Streaming Finite State Machine)
  - Concretely: page + FSM to implement token-flow semantics
  - Abstractly: task with local control
- Communication: Stream
  - FIFO channel, unbounded buffering
- Storage: Memory Segment
  - Memory block with streaming interface
- Dynamics:
  - Dynamic local behavior in SFSM
  - Unbounded resource usage: stream buffer expansion
  - Dynamic graph allocation in STM (Streaming Turing Machine)
- Model admits parallelism at multiple levels: ILP, pipeline, data

6 SCORE Programming Model: TDF
- TDF = intermediate, behavioral language for:
  - EFSM operators
  - Static operator graphs
- State machine for:
  - Firing signatures
  - Control flow (branching)
- Firing semantics:
  - When in state X, wait for X’s inputs, then fire (consume, act)

    select ( input boolean s, input unsigned[8] t,
             input unsigned[8] f, output unsigned[8] o )
    {
      state S (s) : if (s) goto T; else goto F;
      state T (t) : o=t; goto S;
      state F (f) : o=f; goto S;
    }

[Figure: select operator with inputs s, t, f and output o]
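The firing semantics of the `select` operator can be illustrated with a small Python simulation. This is only a sketch, not TDF tooling: the function name `run_select` and the use of Python deques as streams are assumptions made for illustration, and the simulation stops at the first state whose input is missing instead of blocking.

```python
from collections import deque


def run_select(s_tokens, t_tokens, f_tokens):
    """Simulate the select operator: each state waits only for its own input."""
    s, t, f = deque(s_tokens), deque(t_tokens), deque(f_tokens)
    out = []
    state = "S"
    while True:
        if state == "S":
            if not s:          # state S fires only when a token is present on s
                break
            state = "T" if s.popleft() else "F"
        elif state == "T":
            if not t:          # state T fires only when a token is present on t
                break
            out.append(t.popleft())
            state = "S"
        else:  # state F
            if not f:          # state F fires only when a token is present on f
                break
            out.append(f.popleft())
            state = "S"
    return out


result = run_select([True, False, True], [1, 2], [10, 20])
print(result)
```

Note that the token 20 on `f` is never consumed: a state reads only the streams named in its own signature, which is why stream occupancy depends on the control flow of both producer and consumer.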

7 The Compilation Problem
- Programming model: communicating EFSM operators
  - Unrestricted size, # IOs, timing
- Execution model: communicating page configs
  - Fixed size, # IOs, timing
  - Paged virtual hardware
- Compilation is a resource-binding transform on state machines + data-paths
[Figure: “Compile” maps a graph of TDF operators, streams, and memory segments to a graph of compute pages, streams, and memory segments]

8 A Problem with Page Packing
- Intent: Pack multiple, small operators onto 1 page to reduce fragmentation
- Problem: Streams within a page are registers; must guarantee bounded stream depth
- Theorem: Memory bound is undecidable (for a Turing complete process network model)
[Figure: operators packed onto Page 1 and Page 2]

9 Possible Solutions
- Handle unbounded streams
  - External buffering
  - Registers (guess at depth) + external buffer as fall-back
  - ⇒ not practical for many small ops
- Guarantee depth bound for some cases
  - Only one FSM per page + I/O pipelines
  - Identify compatible FSMs, balanced schedules

10 Bounded Buffer Example: Single Stream
[Figure: producer (x=) and consumer (=x) state machines sharing a single stream x; panels: Static Rate, Dynamic Rate]
- A pair composition with a single stream needs no buffer

11 Unbounded Buffer Example: Multi Stream
[Figure: producer (x=, y=) and consumer (=x, =y) state machines sharing two streams x and y; panels: Bounded buffer, Unbounded buffer]
- Ad-hoc analysis gets complicated quickly
- What about >2 SFSMs?

12 Interface Automata
- A finite state machine that transitions on I/O actions
  - Not input-enabled (not every I/O on every cycle)
- G = (V, E, A_i, A_o, A_h, V_start)
  - A_i = input actions: x? (in CSP notation)
  - A_o = output actions: y! (in CSP notation)
  - A_h = internal actions: z; (in CSP notation)
  - E ⊆ V × (A_i ∪ A_o ∪ A_h) × V (transition on action)
- Execution trace = (v, a, v, a, …) (non-deterministic branching)
[Figure: interface automaton for the select operator (inputs s, t, f; output o) with states S, S’, T, F, T’, F’ and actions s?, st;, sf;, t?, f?, o!]
de Alfaro + Henzinger, Symp. Found. SW Eng. (FSE) 2001
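The tuple G = (V, E, A_i, A_o, A_h, V_start) maps directly onto a small data structure. The sketch below is illustrative only: the class name `InterfaceAutomaton` and its methods are assumptions, and the edge set for `select` is one reading of the slide's figure (states S, S’, T, F, T’, F’), not a verified transcription.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceAutomaton:
    states: frozenset      # V
    start: str             # V_start (a single start state in this sketch)
    inputs: frozenset      # A_i, written x? on the slide
    outputs: frozenset     # A_o, written y!
    internal: frozenset    # A_h, written z;
    edges: frozenset       # E, a set of (source, action, target) triples

    def enabled(self, state):
        """Actions the automaton can take from `state` (not input-enabled)."""
        return {a for (v, a, _) in self.edges if v == state}

    def is_well_formed(self):
        """Check that E is a subset of V x (A_i | A_o | A_h) x V."""
        actions = self.inputs | self.outputs | self.internal
        return all(v in self.states and a in actions and w in self.states
                   for (v, a, w) in self.edges)


# Assumed reading of the figure: read s, branch internally, read t or f, emit o.
SELECT = InterfaceAutomaton(
    states=frozenset({"S", "S'", "T", "F", "T'", "F'"}),
    start="S",
    inputs=frozenset({"s", "t", "f"}),
    outputs=frozenset({"o"}),
    internal=frozenset({"st", "sf"}),
    edges=frozenset({
        ("S", "s", "S'"), ("S'", "st", "T"), ("S'", "sf", "F"),
        ("T", "t", "T'"), ("F", "f", "F'"),
        ("T'", "o", "S"), ("F'", "o", "S"),
    }),
)

print(SELECT.enabled("S"), SELECT.enabled("S'"))
```

In state S only `s` is enabled, which is what "not input-enabled" means here: a token arriving on `t` or `f` is simply not accepted until the automaton reaches T or F.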

13 Automata Composition
- Composition ~ product FSM with synchronization (rendezvous) on common actions
- Composition edges:
  - (i) step A on unshared action
  - (ii) step B on unshared action
  - (iii) step both on shared action
- Compatible composition ⇒ bounded memory
[Figure: automaton A (states A, A’; actions x?, y!) and automaton B (states B, B’; actions y?, z!); their direct product over states AB, A’B, AB’, A’B’; and the composition with edges x?, y;, z!, x?, z!]
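The three composition edge rules can be sketched as a reachability search over product states. This is a minimal illustration under stated assumptions: automata are plain dictionaries with hypothetical keys `start`, `actions`, and `edges`, action directions (?, !, ;) are dropped, and the example automata A and B follow one reading of the slide's figure.

```python
from collections import deque


def compose(p, q):
    """Product of p and q, synchronizing (rendezvous) on shared action names."""
    shared = p["actions"] & q["actions"]
    start = (p["start"], q["start"])
    seen, edges, work = {start}, [], deque([start])
    while work:
        u, v = work.popleft()
        steps = []
        # (i) step p alone on an unshared action
        steps += [(a, (d, v)) for (s, a, d) in p["edges"]
                  if s == u and a not in shared]
        # (ii) step q alone on an unshared action
        steps += [(a, (u, d)) for (s, a, d) in q["edges"]
                  if s == v and a not in shared]
        # (iii) step both together on a shared action
        steps += [(a, (d1, d2))
                  for (s1, a, d1) in p["edges"] if s1 == u and a in shared
                  for (s2, a2, d2) in q["edges"] if s2 == v and a2 == a]
        for action, nxt in steps:
            edges.append(((u, v), action, nxt))
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return seen, edges


# Assumed reading of the figure: A reads x then writes y; B reads y then writes z.
A = {"start": "A", "actions": {"x", "y"},
     "edges": [("A", "x", "A'"), ("A'", "y", "A")]}
B = {"start": "B", "actions": {"y", "z"},
     "edges": [("B", "y", "B'"), ("B'", "z", "B")]}

states, edges = compose(A, B)
print(len(states), len(edges))
```

Under this reading the search reaches all four product states and produces five edges, matching the edge labels x?, y;, z!, x?, z! listed for the composition figure.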

14 Compatibility
- Illegal(P,Q) = {reachable product states (p,q) ∈ V_P × V_Q s.t. p produces a shared action that q does not accept, or vice versa}
- Interface automata P, Q are compatible if:
  - CSP: Always
  - I/O Automata: Illegal(P,Q) = ∅
  - Interface Automata: Illegal(P,Q,Env) = ∅
    - Least restrictive environment Env accepts all outputs, provides no inputs
    - Compatible states are those that never reach illegal states via internal, output transitions
- This is overly restrictive for SCORE
  - Can enter an illegal state, stall the illegal producer, and step the consumer on a different action!
- Illegal_SCORE(P,Q) = {reachable product states (p,q) ∈ V_P × V_Q s.t. (p produces a shared action that q does not accept, and q has no alternative non-shared actions), or vice versa}
  = reachable deadlock(V_P × V_Q) \ (deadlock(V_P) × deadlock(V_Q))
[Figure: two examples of A offering x!, one with B accepting x? and one with B offering the internal action t;]
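The deadlock form of Illegal_SCORE can be sketched as a search for reachable product states that have no outgoing step even though at least one component still has something to do. The code below is an illustration under assumptions: the dictionary layout, the function names, and the example automata (taken from one reading of the "Incompatible SCORE Composition" and "Automata Composition" figures) are not from the original tool.

```python
def successors(p, q, state):
    """Next product states: unshared steps of p or q, or a joint shared step."""
    shared = ({a for _, a, _ in p["edges"]} & {a for _, a, _ in q["edges"]})
    u, v = state
    p_moves = [(a, d) for (s, a, d) in p["edges"] if s == u]
    q_moves = [(a, d) for (s, a, d) in q["edges"] if s == v]
    out = [(d, v) for a, d in p_moves if a not in shared]
    out += [(u, d) for a, d in q_moves if a not in shared]
    out += [(d1, d2) for a1, d1 in p_moves for a2, d2 in q_moves
            if a1 == a2 and a1 in shared]
    return out


def illegal_score(p, q):
    """Reachable deadlocks of the product, excluding states where both
    components are already deadlocked on their own."""
    start = (p["start"], q["start"])
    seen, stack, illegal = {start}, [start], set()
    while stack:
        state = stack.pop()
        nxt = successors(p, q, state)
        if not nxt:
            u, v = state
            p_stuck = not any(s == u for s, _, _ in p["edges"])
            q_stuck = not any(s == v for s, _, _ in q["edges"])
            if not (p_stuck and q_stuck):
                illegal.add(state)
        for n in nxt:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return illegal


# Assumed reading: A reads i, writes x, then y; B reads y first, then x, writes o.
A = {"start": "A", "edges": [("A", "i", "A'"), ("A'", "x", "A''"), ("A''", "y", "A")]}
B = {"start": "B", "edges": [("B", "y", "B'"), ("B'", "x", "B''"), ("B''", "o", "B")]}
print(illegal_score(A, B))
```

Here A offers x while B waits for y, and neither has an unshared alternative, so the product stalls after A reads its first input. A buffer on x would let A run ahead, which is the motivation for composing with queues.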

15 Alternate Composition Semantics
- Automata Composition (P ⊗ Q)
  - (the rest of this talk)
  - Compatibility = no reachable deadlock
  - Pessimistic; correct in any environment
  - Any state can stall ⇒ state explosion
- Interface Composition (P || Q)
  - Compatibility = no reachable illegal states for given env.
  - Optimistic; correct in environment that provides no inputs
  - Outputs cannot stall ⇒ smaller composition
- SCORE Composition? (P Q)
  - How to get smaller compositions, correct in any environment?
  - Strategic use of output stall
  - Compatibility by construction? (disallow transitions to illegal paths)
  - Minimal stutter (stall) transitions?

16 An Incompatible SCORE Composition
[Figure: A (states A, A’, A’’; actions i?, x!, y!) feeds B (states B, B’, B’’; actions y?, x?, o!) over streams x and y; composition over the nine product states AB … A’’B’’ with edges i?, x;, y;, o!]

17 Adding a Buffer
[Figure: queue Q (states Q, Q’; actions x?, x!) inserted on stream x between A and B; composition of A with Q over states AQ, A’Q, A’’Q, AQ’, A’Q’, A’’Q’]

18 Buffered Composition
[Figure: composition of (A with Q) and B (states B, B’, B’’; actions y?, x?, o!) over the 18 product states AQB … A’’Q’B’’]

19 Adding a Buffer, Alternate Order
[Figure: queue Q (states Q, Q’; actions x?, x!) composed first with B (states B, B’, B’’; actions y?, x?, o!); composition over states QB, Q’B, QB’, Q’B’, QB’’, Q’B’’]

20 Buffered Composition, Alternate Order
[Figure: composition of A (states A, A’, A’’; actions i?, x!, y!) with (Q composed with B) over the 18 product states AQB … A’’Q’B’’]

21 Composition is Associative*
[Figure: the compositions A ⊗ (Q ⊗ B) and (A ⊗ Q) ⊗ B shown side by side over the same 18 product states]

22 Static Stream Depth Bound Analysis
- Basic idea: try to compose A, B with increasingly large queues
- Given: Graph of TDF operators
- Output: Stream depth bound ∈ (N ∪ {∞}) for each stream
- Initialize: depth[e_i] ← 0 for all streams (edges) e_i
- For each pair (A,B) of connected operators
  - Let {e_i} be set of streams connecting A, B
  - Construct interface automata for A, B
    - Each e_i induces actions:
      - shared action a_ei if depth[e_i] < ∞
      - non-shared actions a^i_ei, a^o_ei if depth[e_i] = ∞
  - While not Done
    - Construct composition: C ← A ⊗ B ⊗ {Q(depth[e_i]) ∀ e_i s.t. depth[e_i] < ∞}
    - Compute illegal states: Illegal_SCORE(C)
    - If Illegal_SCORE(C) = ∅
      - Done with pair A, B
    - Else
      - For each shared action a_ei that is output but not input in some illegal state s ∈ Illegal_SCORE(C)
        - depth[e_i]++
        - If (depth[e_i] ≥ depth_threshold) then depth[e_i] ← ∞
      - If depth[e_i] = ∞ ∀ e_i then Done with pair A, B
- Return depth[]
[Figure: operators A and B with input i and output o, connected by streams e_1, e_2]
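The loop above can be sketched in Python for a single pair (A, B). This is a simplified illustration, not the original analysis code: all names (`queue`, `rename`, `deadlocks`, `bound_depths`, the `_in`/`_out` suffixes) are assumptions, automata are assumed deterministic per (state, action), every stream is assumed to flow from A to B, and a depth-0 stream is modeled as a direct rendezvous with no queue automaton.

```python
INF = float("inf")


def queue(stream, depth):
    """FIFO occupancy automaton: states 0..depth, fill on _in, drain on _out."""
    edges = []
    for n in range(depth):
        edges.append((n, stream + "_in", n + 1))
        edges.append((n + 1, stream + "_out", n))
    return {"start": 0, "edges": edges}


def rename(auto, mapping):
    edges = [(s, mapping.get(a, a), d) for s, a, d in auto["edges"]]
    return {"start": auto["start"], "edges": edges}


def deadlocks(autos):
    """Reachable product states with no step although some component can move."""
    alphabets = [{a for _, a, _ in m["edges"]} for m in autos]
    every_action = set().union(*alphabets)
    start = tuple(m["start"] for m in autos)
    seen, stack, dead = {start}, [start], set()
    while stack:
        state = stack.pop()
        moves = [{a: d for s, a, d in m["edges"] if s == state[i]}
                 for i, m in enumerate(autos)]
        succ = []
        for action in every_action:
            owners = [i for i, al in enumerate(alphabets) if action in al]
            if all(action in moves[i] for i in owners):  # rendezvous of all owners
                nxt = list(state)
                for i in owners:
                    nxt[i] = moves[i][action]
                succ.append(tuple(nxt))
        if not succ and any(moves):
            dead.add(state)
        for nxt in succ:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return dead


def bound_depths(a, b, streams, threshold=4):
    depth = {e: 0 for e in streams}
    while True:
        pa = rename(a, {e: e + "_in" for e in streams if depth[e] != 0})
        pb = rename(b, {e: e + "_out" for e in streams if depth[e] != 0})
        qs = [queue(e, depth[e]) for e in streams if 0 < depth[e] < INF]
        dead = deadlocks([pa, pb] + qs)
        grow = set()
        for state in dead:
            offered = {act for s, act, _ in pa["edges"] if s == state[0]}
            for e in streams:
                name = e if depth[e] == 0 else e + "_in"
                if depth[e] < INF and name in offered:  # output stalled here
                    grow.add(e)
        if not grow:          # compatible, or not fixable by deeper queues
            return depth
        for e in grow:
            depth[e] += 1
            if depth[e] >= threshold:
                depth[e] = INF


# Assumed example: A writes x then y, B reads y then x (streams x, y from A to B).
A = {"start": "A", "edges": [("A", "i", "A'"), ("A'", "x", "A''"), ("A''", "y", "A")]}
B = {"start": "B", "edges": [("B", "y", "B'"), ("B'", "x", "B''"), ("B''", "o", "B")]}
print(bound_depths(A, B, ["x", "y"]))
```

For this pair, one slot of buffering on x removes the deadlock while y needs none. Lowering `threshold` to 1 makes the sketch give up on x and report it as unbounded, mirroring the role of the maximum depth parameter.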

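23 Results – Pair Composition*
App | #streams | Trivially Bounded | Non-Triv. Bounded | Unbounded | #SFSMs | #SFSM pairs | Trivially Composable | Non-Triv. Composable | Not Composable
IIR | 9 | 9 | 0 | 0 | 8 | 7 | 7 | 0 | 0
Wavelet Encode | 58 | 35 | 0 | 23 | 30 | 24 | 15 | 0 | 9
Wavelet Decode | 57 | 34 | 12 | 11 | 27 | 31 | 26 | 0 | 5
JPEG Encode | 62 | 25 | 13 | 24 | 13 | 11 | 6 | 1 | 4
JPEG Decode | 61 | - | - | - | 12 | - | - | - | -
MPEG Encode IP | 421 | 351 | 43 | 70 | 92 | 154 | 144 | 5 | 5
MPEG Encode IPB | 488 | 402 | 58 | 22 | 114 | 211 | 192 | 8 | 5
* Max stream depth = 2
Column group labels on the slide: (with streams to mem), (without streams to mem)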

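24 Maximum Depth Parameter
Non-Trivially Bounded streams and Non-Trivially Composable SFSM pairs, for given max stream depth (0 to 4)
App | #streams | Bounded @0 | @1 | @2 | @3 | @4 | #SFSM pairs | Composable @0 | @1 | @2 | @3 | @4
IIR | 9 | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0
Wavelet Encode | 58 | 0 | 0 | 0 | 0 | 0 | 24 | 0 | 0 | 0 | 0 | 0
Wavelet Decode | 57 | 12 | 12 | 12 | 12 | 12 | 31 | 0 | 0 | 0 | 0 | 0
JPEG Encode | 62 | 10 | 12 | 13 | 15 | 16 | 11 | 0 | 1 | 1 | 1 | 1
JPEG Decode | 61 | 17 | 18 | - | - | - | 9 | 2 | 2 | - | - | -
MPEG Encode IP | 421 | 39 | 42 | 43 | - | - | 154 | 4 | 5 | 5 | - | -
MPEG Encode IPB | 488 | 53 | 57 | 58 | - | - | 211 | 6 | 8 | 8 | - | -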

25 Composite Automaton Size
# Nodes in Largest Composition for given max stream depth (0 to 4), and run time in seconds at max stream depth 2
App | Nodes @0 | @1 | @2 | @3 | @4 | Run time (s) @2
IIR | 12 | 12 | 12 | 12 | 12 | 0.2
Wavelet Encode | 1587 | 1587 | 3094 | 11,417 | 30,630 | 7.5
Wavelet Decode | 961 | 1922 | 11,798 | 37,712 | 84,784 | 7.2
JPEG Encode | 3785 | 6175 | 8086 | 16,244 | 27,138 | 9.5
JPEG Decode | 3785 | 196,576 | 196,576* | 196,576* | 196,576* | 245*
MPEG Encode IP | 7887 | 89,541 | 334,757 | 334,757* | 334,757* | 125
MPEG Encode IPB | 8478 | 100,909 | 385,334 | 385,334* | 385,334* | 150
* Crashed; Partial Results

26 Composing More than 2 SFSMs
- Page Packing by incrementally growing a cluster?
- Larger composition should improve stream depth bound
  - Restricts environment around a pair of SFSMs
  - Fewer transitions ⇒ fewer reachable deadlocks
- BUT larger composition can expose deadlocked feedback loop
[Figure: “Compose 2 SFSMs” across Page 1 and Page 2 with stream depth bounds 1, 2, 4, ∞; “Compose 3 SFSMs” on Page 1 with bounds 1, 2, ?, 4, ∞, ?]

27 Synthesizing a Composite SFSM?
- How to turn a composite automaton into TDF or page logic?
- TDF does not support all non-deterministic branches
  - Multiple inputs: ok (state with multiple signatures / cases)
  - Multiple outputs: must sequentialize (how?)
  - Input + output: ???
    - Input before output: may cause deadlock if output feeds back to input
    - Output before input: may stall composite on output back-pressure
- Conjecture:
  - It is always safe to sequentialize outputs before inputs
- Heavier-weight automata can check input availability / output space before blocking on I/O
  - “System-Level Types for Component-Based Design,” Lee + Xiong, EMSOFT 2001 (used in Ptolemy)
[Figure: IA composition A || B over states AB, A’B, AB’, A’B’ with edges x?, y;, z!, x?, z!]

28 Summary
- SCORE
  - Process network model to support software longevity, scalability on massively parallel HW
- Automata composition with finite queues
  - Compatibility ⇒ bounded memory
  - Initial results: pair composition
- Future work:
  - Faster run time (semantics for smaller composite size)
  - Compose more than 2 SFSMs
  - Page Packing

29 Supplemental

30 TDF ⊆ Dataflow Process Network
- Dataflow Process Network [Parks+Lee, IEEE May ‘95]
  - Process enabled by set of firing rules: R = {R_1, R_2, …, R_N}
  - Firing rule = set of patterns: R_i = {R_i,1, R_i,2, …, R_i,p}
- DF process for a TDF operator:
  - Feedback arc for state
  - One firing rule per state
  - Patterns match state value + presence of desired inputs
  - E.g. for state i: R_i = {R_i,1, R_i,2, …, [i]}
  - Patterns:
    - R_i,j = [*] if input j is in state i’s input signature
    - R_i,j = ⊥ if input j is not in state i’s input signature
    - R_i,p = [i] for final input, representing state arc
  - These are sequential firing rules
- Partitioned SFSM adds “wait” state
[Figure: process with a feedback arc carrying its state]
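The construction of one firing rule per state can be sketched as follows. This is an illustration under assumptions: the names `firing_rules` and `enabled`, the encoding of the pattern [*] as the string `"*"`, and the encoding of the empty pattern as `None` are all choices made here, and the example signatures are those of the `select` operator.

```python
ANY = "*"      # pattern [*]: one token of any value must be present
EMPTY = None   # empty pattern: this input is not examined in this state


def firing_rules(signatures, inputs):
    """Build one rule per state: a pattern per input, plus [state] for the state arc."""
    rules = {}
    for state, signature in signatures.items():
        patterns = [ANY if name in signature else EMPTY for name in inputs]
        rules[state] = patterns + [state]   # final pattern matches the state token
    return rules


def enabled(rule, streams, inputs, state_arc):
    """A rule is enabled when every [*] input has a token and the state arc matches."""
    for pattern, name in zip(rule, inputs):
        if pattern == ANY and not streams[name]:
            return False
    return bool(state_arc) and state_arc[0] == rule[-1]


INPUTS = ["s", "t", "f"]
RULES = firing_rules({"S": {"s"}, "T": {"t"}, "F": {"f"}}, INPUTS)

streams = {"s": [True], "t": [], "f": [7]}
print(RULES["T"], enabled(RULES["S"], streams, INPUTS, ["S"]))
```

Because every rule ends with a distinct state pattern, at most one rule can match the token on the state arc at a time, so the process only ever has to test the inputs named by the current state.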

31 SFSM Partitioning Transform
- Only 1 partition active at a time
  - Transform to activate via streams
- New state in each partition: “wait”
  - Used when not active
  - Waits for activation from other partition(s)
  - Has one input signature (firing rule) per activator
- Firing rules are not sequential, but determinism guaranteed
  - Only 1 possible activator
- Activation streams from given source to given dest. partitions can be merged + binary-encoded
[Figure: SFSM with states A, B, C, D partitioned into {A,B} (states A, B, Wait AB) and {C,D} (states C, D, Wait CD)]

32 SCORE Hardware Model
- Paged FPGA
  - Compute Page (CP)
    - Fixed-size slice of RC hardware
    - Fixed number of I/O ports
  - Distributed, on-chip memory
    - Configurable Memory Block (CMB)
    - Stream access
  - High-level interconnect
- Microprocessor
  - Run-time support + user code

33 Functional Simulation
- FPGA based on HSRA [Berkeley, FPGA ’99]
  - CP: 512 4-LUTs
  - CMB: 2 Mbit DRAM
  - Area for CP-CMB pair:
    - .25µ: 12.9 mm² (1/9 of PII-450)
    - .18µ: 6.7 mm² (1/16 of PIII-600)
  - Page reconfiguration: 5000 cycles (from CMB)
  - Synchronous operation (same clock speed as processor)
- x86 microprocessor
- Page Scheduler task
  - Swap on timer interrupt (every 250,000 cycles)
  - Fully dynamic scheduling

34 Application: JPEG Encode

35 Execution Results
[Figure: plot; axis: Hardware Size (CP-CMB Pairs)]

36 Execution Results
[Figure: plot; axis: Hardware Size (CP-CMB Pairs)]

37 Execution Results
[Figure: plot; axis: Hardware Size (CP-CMB Pairs)]

38 Execution Results
[Figure: plot; axis: Hardware Size (CP-CMB Pairs)]

39 Page Hardware Model
- Page = fixed-size slice of resources + stream interface
- FSM for:
  - Firing
  - Output emission
  - Data-path control
  - Branching
[Figure labels: FSM, Reconfigurable, Fixed logic]

40 Page Firing Logic
- Sample firing logic
  - 3 inputs (A,B,C)
  - 3 outputs (X,Y,Z)
  - Single signature

