Compilation for Scalable, Paged Virtual Hardware
  Eylon Caspi
  Qualifying Exam, 3/6/01
  University of California, Berkeley
The Compilation Problem
  Programming model: communicating EFSM operators
    Unrestricted size, # IOs, timing
  Execution model: communicating page configs
    Fixed size, # IOs, timing
    Paged virtual hardware
  Compile: TDF operator, stream, memory segment -> compute page, stream, memory segment
  Compilation is a resource-binding transform on state machines + data paths
Overview
  Motivation
    Paged virtual hardware: software survival + scalability
    SCORE programming model
  Compilation methodology
    New page partitioning techniques
    Automatic synthesis & partitioning of communicating FSMs
  Evaluation + architectural studies
  Timeline
Reconfigurable Computing
  Programmable logic + programmable interconnect (e.g. FPGA)
  10x-100x gain vs. microprocessors in:
    Performance
    Functional density (work per area-time)
  Spatial computing
    Parallelism; custom data paths
  Programmability
    Custom execution sequence; specialization
  BUT current models expose resource constraints to the programmer
    Programmer has to target a specific device
    Limits software longevity
Solution: Virtual Hardware
  Compute model with unbounded resources
    Programmer no longer targets a specific device
    Enables software longevity, scalability
  Requires efficient hardware virtualization
    Large device: concurrent spatial execution
    Small device: time multiplexing
  Paging model
Previous Approaches to Paging
  WASMII: Register IO [Ling+Amano, FCCM '93]
    Page IO via registers
    Evaluate each page for a cycle, then reconfigure
    Reconfiguration time dominates execution
  DPGA: Configuration Cache [DeHon, FPGA '94], TM-FPGA [Xilinx, FCCM '97]
    Fast reconfiguration, at a cost in area and power
    Reconfiguration power dominates execution
  PipeRench: Stripes [CMU, FPGA '98]
    Pipelined reconfiguration
    Feed-forward computation only
Paging + Streaming
  Streaming allows efficient, useful virtualization
    Amortizes reconfiguration cost over a larger epoch
    Exploits program structure
    Less restrictive communication topology
  Compiler and scheduler's joint responsibility
SCORE Compute Model
  Program = DFG of compute nodes
    Kahn process network: blocking read, non-blocking write
  Compute: SFSM (Streaming Finite State Machine)
    Concretely: page + FSM to implement token-flow semantics
    Abstractly: task with local control
  Communication: Stream
    Abstraction of wire, with buffering
  Storage: Memory Segment
  Dynamics:
    Dynamic local behavior in SFSM
    Unbounded resource usage: stream buffer expansion
    Dynamic graph allocation in STM (Streaming Turing Machine)
SCORE Programming Model: TDF
  TDF = intermediate, behavioral language for:
    EFSM operators
    Static operator graphs
  State machine for:
    Firing signatures
    Control flow (branching)
  Firing semantics:
    When in state X, wait for X's inputs, then fire (consume, act)
  Example operator:
    select ( input boolean s,
             input unsigned[8] t,
             input unsigned[8] f,
             output unsigned[8] o )
    {
      state S (s) : if (s) goto T; else goto F;
      state T (t) : o=t; goto S;
      state F (f) : o=f; goto S;
    }
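The firing semantics of the `select` operator above can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: the function name `run_select` and the use of finite token lists in place of streams are hypothetical, and an empty queue stands in for a blocking read.

```python
from collections import deque

def run_select(s_tokens, t_tokens, f_tokens):
    """Simulate the TDF `select` operator on finite input streams.

    Returns the emitted output tokens and the state in which the
    operator stalled (the awaited input had no token left).
    """
    queues = {"S": deque(s_tokens), "T": deque(t_tokens), "F": deque(f_tokens)}
    out = []
    state = "S"
    # Each state waits on exactly one input stream (its firing signature).
    while queues[state]:
        token = queues[state].popleft()      # fire: consume the awaited token
        if state == "S":
            state = "T" if token else "F"    # branch on the boolean selector
        else:
            out.append(token)                # o = t (state T) or o = f (state F)
            state = "S"
    return out, state

outputs, stalled_in = run_select([True, False, True], [10, 11], [20])
print(outputs, stalled_in)
```

Because each state names only the inputs it needs, the operator never consumes a token from `t` or `f` that the selector did not ask for.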
SCORE Hardware Model
  Paged FPGA
    Compute Page (CP)
      Fixed-size slice of RC hardware
      Fixed number of I/O ports
    Distributed, on-chip memory
      Configurable Memory Block (CMB)
      Stream access
    High-level interconnect
  Microprocessor
    Run-time support + user code
SCORE Software Infrastructure
  Device Simulator
    Cycle-accurate behavioral simulation
    Parameterized (e.g. #pages)
    Interacts with concurrent user processes (STMs) via stream API
  Page Scheduler
    Version 1: dynamic, list-based scheduling (by input availability)
    Version 2: static, precedence-based
  TDF Compiler
    Compiles to working C++ simulation code
    No partitioning (page = 1 TDF operator)
  Applications
    Wavelet, JPEG, MPEG, IIR
  [Figure: run time vs. device size]
Communication is King
  With virtualization, inter-page delay is unknown, sensitive to:
    Placement
    Interconnect implementation
    Page schedule
    Technology: wire delay is growing
  Inter-page feedback is SLOW
    Partition to contain FB loops in page
    Schedule to contain FB loops on device
Structural Partitioning is Not Enough
  Structural partitioning does not address feedback loops
    Wire min-cut: FM, flow-based
    Minimum wire length: spectral
    Delay-optimal DAG mapping: DAGON, FlowMap, Wong
  Structural partitioning does not address communication rates, dynamics
  All loops are NOT created equal
FSM Decomposition is Not Enough
  Ashar+Devadas+Newton (ICCAD '89): minimize logic
  Kuo+Liu+Cheng (ISCAS '95): minimize wires
  Benini+DeMicheli+Vermeulen (ISCAS '98): minimize power
  None consider inter-page delay
  None consider cutting / scheduling data-path separately from FSM
Outline
  Motivation
  Compilation Methodology
  Evaluation + Architectural Studies
  Time Line
Compilation: Scope
  Synthesis + partitioning of SFSMs
    TDF -> pages
    Resource binding
  Target
    Parameterized hardware model / simulation
  Constrained optimization problem
    Constraints: page area, IO, timing
    Optimality criteria
      Primary: communication delay
      Secondary: communication bandwidth, area
Compilation Flow Overview
  (1) Optimizations
  (2) Data path timing + scheduling
  (3) Partitioning
  Ignore:
    Place / route / retime in page
      Known solutions in the community
    Page scheduling
      Responsibility of separate scheduler
Synthesis + Partitioning Flow
  [Figure: flow diagram with the steps Pipeline Extraction, Data Path Mapping, Partition Large States, Schedule DF into States, Cluster States, Page Packing, Synthesize Page FSMs, Compiler Optimizations; group labels: Optimization, Data-path, Partitioning; marker "p": Preliminary Code]
How Big is an Operator?
  [Figure: operator sizes by application; labels: Wavelet Decode, Wavelet Encode, JPEG Encode, JPEG Decode, MPEG Encode, MPEG (I), MPEG (P), IIR]
Partitioning Tasks
  (1) Decompose / shrink SFSMs
  (2) Pack SFSMs onto page
  [Figure: flow diagram with the steps Pipeline Extraction, Data Path Mapping, Partition Large States, Schedule DF into States, Cluster States, Page Packing, Synthesize Page FSMs, Compiler Optimizations]
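Pipeline Extraction
  Hoist uncontrolled feed-forward (FF) data-flow out of FSMD
  Benefits:
    Shrink FSM cyclic core
    Extracted pipeline has more freedom for scheduling and partitioning
  Example:
    Before extraction: state foo(x): if (x==0)...
    After extraction:  state foo(xz): if (xz)...
    The comparison xz = (x==0) moves into a pipeline that feeds the state
  [Figure: state with data flow (DF) and control flow (CF); x==0 extracted into a pipeline from x to xz]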
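As a rough illustration of the extraction idea, the Python sketch below hoists the comparison on `x` into a stateless stage and leaves a residual state machine that only sees the boolean. The slide elides the body of state `foo`, so the zero-counting action used here is a hypothetical stand-in, as are all function names.

```python
def original_fsm(xs):
    """Before extraction: the state reads x and evaluates x == 0 itself."""
    zeros = 0
    for x in xs:
        if x == 0:            # data-path comparison inside the state
            zeros += 1        # hypothetical state action
    return zeros

def extracted_pipeline(xs):
    """Feed-forward data flow hoisted out of the FSM: computes xz = (x == 0)."""
    for x in xs:
        yield x == 0

def residual_fsm(xzs):
    """After extraction: the state reads the precomputed boolean xz only."""
    zeros = 0
    for xz in xzs:
        if xz:
            zeros += 1        # same hypothetical state action
    return zeros

data = [0, 3, 0, 7]
print(original_fsm(data), residual_fsm(extracted_pipeline(data)))
```

The pipeline stage holds no state and has no feedback, which is what allows it to be scheduled and partitioned independently of the residual machine.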
Pipeline Extraction: Extractable Area
  [Figure: extractable area by application; labels: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]
Pipeline Extraction: Residual SFSM
  [Figure: residual SFSM by application; labels: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]
Data-path Mapping / Scheduling
  Task:
    Bind technology-specific area/time to data-path primitives
    Schedule data-path primitives in state machine
  Fixed-frequency target
    Decompose primitives into multi-cycle operations
    Data-path module library / tree matching
    Pipeline linearized sequences / loops
  DAG mapping state logic is insufficient
  Compiler technology
    Code motion
    Software pipelining
Delay-Oriented State Clustering
  Indivisible unit: state (CF+DF)
    Spatial locality in state logic
  Cluster states into page-size sub-machines
    Inter-page communication for data flow, state flow
  Sequential delay is in inter-page state transfer
    Cluster to maintain local control
    Cluster to contain state loops
  Similar to:
    VLIW trace scheduling [Fisher '81]
    FSM decomposition for low power [Benini/DeMicheli ISCAS '98]
    VM/cache code placement
    GarpCC HW/SW partitioning [Callahan '00]
State Clustering Formulation
  Min-cut transition probabilities in state flow graph
    Probabilities from profiling
  Area-constrained
    Balanced min-cut partitioning [Yang+Wong, ACM '94]
    Iterate to desired partition area: $(1-\epsilon)A \le a(X) \le (1+\epsilon)A$
  IO-constrained
    Add wire edges
    Mix edge weights: $c\,w_{\mathrm{wire}} + (1-c)\,w_{\mathrm{SF}}$
    Use smallest IO-feasible $c$
  Requires all states to be smaller than page
  [Figure: example graph with labels p1-p5, w1-w9, a1-a4]
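The formulation above can be made concrete with a small brute-force sketch in Python. It is not the balanced min-cut heuristic cited on the slide: it simply enumerates candidate clusters of a tiny state graph, keeps those within the area tolerance (called `eps` here), scores them with the mixed edge weight, and raises `c` until the wire cut fits an assumed page IO limit. All names, the example graph, and the use of total cut wire weight as the IO count are assumptions for illustration.

```python
from itertools import combinations

def cut_weight(edges, part):
    """Sum the weights of edges with exactly one endpoint inside `part`."""
    return sum(w for (u, v), w in edges.items() if (u in part) != (v in part))

def min_cut_cluster(areas, sf_edges, wire_edges, c, target_area, eps):
    """Brute-force one cluster X with (1-eps)A <= a(X) <= (1+eps)A that
    minimizes the mixed cut weight c*w_wire + (1-c)*w_SF."""
    lo, hi = (1 - eps) * target_area, (1 + eps) * target_area
    states = sorted(areas)
    best = None
    for k in range(1, len(states)):
        for combo in combinations(states, k):
            part = set(combo)
            if not lo <= sum(areas[s] for s in part) <= hi:
                continue                       # violates the area constraint
            cost = (c * cut_weight(wire_edges, part)
                    + (1 - c) * cut_weight(sf_edges, part))
            if best is None or cost < best[0]:
                best = (cost, part)
    return best

def smallest_io_feasible_c(areas, sf_edges, wire_edges,
                           target_area, eps, io_limit, steps=10):
    """Sweep c upward from 0 and return the first (c, cluster) whose
    cut wire weight fits within io_limit, or None if none does."""
    for i in range(steps + 1):
        c = i / steps
        found = min_cut_cluster(areas, sf_edges, wire_edges, c, target_area, eps)
        if found is not None and cut_weight(wire_edges, found[1]) <= io_limit:
            return c, found[1]
    return None

# Hypothetical 4-state machine: SF weights are profiled transition
# probabilities, wire weights are data-path wires between states.
areas = {"A": 1, "B": 1, "C": 1, "D": 1}
sf_edges = {("A", "B"): 0.9, ("B", "C"): 0.1, ("C", "D"): 0.9, ("D", "A"): 0.1}
wire_edges = {("A", "C"): 3, ("B", "D"): 3, ("A", "B"): 1, ("C", "D"): 1}
print(smallest_io_feasible_c(areas, sf_edges, wire_edges,
                             target_area=2, eps=0.0, io_limit=4))
```

In this example the pure state-flow cut (`c = 0`) groups A with B but severs six wires, so the sweep keeps raising `c` until the cluster {A, C} becomes cheaper at `c = 0.4`. Keeping `c` as small as possible preserves as much of the delay-oriented objective as the IO constraint allows.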
Page Packing
  Cluster SFSMs + pipelines
    Avoid page fragmentation
  Min-cut streams of top-level DFG
    Allow cutting pipelines, not SFSMs
    Area and IO constrained (Wong balanced min-cut partition)
  Disallow certain topologies
    No dynamic-rate streams in page
    Data-flow feedback?
Outline
  Motivation
  Compilation Methodology
  Evaluation + Architectural Studies
  Time Line
Evaluating Paging Overhead
  Applications
    Must be rewritten in TDF
    Existing: Wavelet, JPEG, MPEG, IIR
    To do: ADPCM, BABAR particle detector
  Metrics
    Circuit area (#pages x page-size)
    Page delay (LUT depth per firing)
    Performance (total run-time, "makespan")
  Baseline comparison
    "Unpartitioned": page = 1 TDF operator
    Ideal virtualization with zero partitioning cost; cannot do better
Page Size Studies
  Paging overhead varies with:
    Application
    Page size, IO
    Match thereof
  Is paging overhead robust to a mismatch?
  Vary page parameters, measure:
    (1) Pure area overhead
    (2) Pure performance overhead
      Execute spatially in expanded hardware
    (3) Virtualized performance overhead
      Execute in fixed device size
Outline
  Motivation
  Compilation Methodology
  Evaluation + Architectural Studies
  Time Line
Status
  SCORE compiler / simulator / scheduler
    Compile + execute unpartitioned (page = 1 TDF op)
  Preliminary synthesis + partitioning work
    Pipeline extraction
    FSM synthesis to SIS
    Area-constrained state clustering
  To do
    Complete initial implementation
    Evaluate
    Improve: secondary implementation
To Complete Initial Implementation
  IO-constrained state clustering
  Decompose large states
  Page packing
  Data-path scheduling in states
  Synthesize partitioned SFSMs
Secondary Implementation: Possibilities
  Optimizations
    SW pipelining
    Use SUIF
  State clustering with replication
  Unified state clustering + page packing
    Cluster states of all operators simultaneously
  Finer-grained clustering
    Recast as BDF, min-cut stream rates
Time Line
  [Figure: timeline by month and year with the phases Impl. 1, Eval, Impl. 2, Eval, Thesis writing]
Summary
  Partitioning and paging enable
    Software survival / scaling
    Efficient use of small HW for dynamic apps
  My contributions
    Methodology for page synthesis + partitioning
      Necessary for efficient virtualization
    Evaluation framework
      Verify that paging can be efficient
    Architectural studies
Supplemental Material
  SFSMs + transforms
  SCORE simulation + scaling results
  Page hardware model
  Synthesis observations
  Architectural studies
TDF Dataflow Process Network
  Dataflow Process Network [Parks+Lee, IEEE May '95]
    Process enabled by set of firing rules: $R = \{R_1, R_2, \ldots, R_N\}$
    Firing rule = set of patterns: $R_i = \{R_{i,1}, R_{i,2}, \ldots, R_{i,p}\}$
  DF process for a TDF operator:
    Feedback arc for state
    One firing rule per state
    Patterns match state value + presence of desired inputs
    E.g. for state $i$: $R_i = \{R_{i,1}, R_{i,2}, \ldots, [i]\}$
    Patterns:
      $R_{i,j} = [*]$ if input $j$ is in state $i$'s input signature
      $R_{i,j} = \bot$ if input $j$ is not in state $i$'s input signature
      $R_{i,p} = [i]$ for final input, representing state arc
  These are sequential firing rules
  Partitioned SFSM adds "wait" state
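One way to read the firing-rule construction above is as a table with one row per state. The Python sketch below builds such rows and tests whether a rule is enabled. The names `STAR`, `BOTTOM`, `firing_rule`, and `enabled` are hypothetical: `STAR` stands for the pattern that needs one token of any value, and `BOTTOM` for the pattern used when an input is not in the state's signature, so nothing is required or consumed there.

```python
STAR = "*"      # pattern [*]: one token of any value must be present
BOTTOM = None   # pattern for an input outside the state's signature

def firing_rule(state, signature, inputs):
    """Build R_i: one pattern per data input, then [state] for the state arc."""
    return [STAR if name in signature else BOTTOM for name in inputs] + [state]

def enabled(rule, queues, inputs, current_state):
    """A rule matches when the state arc carries its state value and
    every [*] input has at least one token available."""
    *data_patterns, state_pattern = rule
    if state_pattern != current_state:
        return False
    return all(pattern is BOTTOM or len(queues[name]) > 0
               for pattern, name in zip(data_patterns, inputs))

# Rules for a three-state operator with one input per state
# (the same shape as a select operator with inputs s, t, f).
inputs = ["s", "t", "f"]
signatures = {"S": {"s"}, "T": {"t"}, "F": {"f"}}
rules = {st: firing_rule(st, sig, inputs) for st, sig in signatures.items()}

queues = {"s": [True], "t": [], "f": []}
print(rules["T"], enabled(rules["S"], queues, inputs, "S"))
```

Because every rule ends in a distinct state pattern, at most one rule can match at a time, so the rules can be tested in a fixed order without ambiguity.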
SFSM Partitioning Transform
  Only 1 partition active at a time
    Transform to activate via streams
  New state in each partition: "wait"
    Used when not active
    Waits for activation from other partition(s)
    Has one input signature (firing rule) per activator
  Firing rules are not sequential, but determinism guaranteed
    Only 1 possible activator
  Activation streams from given source to given dest. partitions can be merged + binary-encoded
  [Figure: SFSM with states A, B, C, D split into partition {A,B} with state "Wait AB" and partition {C,D} with state "Wait CD"]
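The transform can be sketched as a toy Python simulation in which each partition idles in a "wait" state until an activation token arrives. This is only an illustration under simplifying assumptions: each state has a single successor, activation takes no time, and the function and variable names are hypothetical.

```python
def run_partitioned(transitions, partition_of, start, steps):
    """Simulate a partitioned state machine for `steps` firings.

    transitions:  state -> next state (one successor per state, for brevity)
    partition_of: state -> partition id
    Returns the trace of (partition, state) pairs that fired.
    """
    parts = sorted(set(partition_of.values()))
    local = {p: "wait" for p in parts}          # every partition starts idle
    local[partition_of[start]] = start          # except the one holding `start`
    activation = {p: [] for p in parts}         # activation stream per partition
    trace = []
    for _ in range(steps):
        # A waiting partition wakes up when an activation token arrives.
        for p in parts:
            if local[p] == "wait" and activation[p]:
                local[p] = activation[p].pop(0)
        active = [p for p in parts if local[p] != "wait"]
        assert len(active) == 1, "only one partition may be active at a time"
        p = active[0]
        state = local[p]
        trace.append((p, state))
        nxt = transitions[state]
        if partition_of[nxt] == p:
            local[p] = nxt                      # local transition
        else:
            activation[partition_of[nxt]].append(nxt)  # activate other partition
            local[p] = "wait"                   # and go idle
    return trace

transitions = {"A": "B", "B": "C", "C": "D", "D": "A"}
partition_of = {"A": 0, "B": 0, "C": 1, "D": 1}
print(run_partitioned(transitions, partition_of, "A", 5))
```

The activation token carries the target state, which corresponds to the encoded activation stream mentioned above. In hardware the cross-partition hop is where the sequential delay appears.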
Distributing/Collecting Shared Streams
  Requires inter-page synchronization for ordering
  Two schemes for input distribution
    (1) Send token to all pages
      Inactive pages must discard tokens, must know how many to discard
    (2) Send token only to active page
      Distributor must know state
      (a) Present state requests token, OR
      (b) Previous state pre-fetches token
  One scheme for output collection
    Collector must know state
  How to cluster distributors / collectors?
    Distributor scheme (1) and collector incur no sequential delay (wire min-cut ok)
    Distributor scheme (2)(a) can be cast into delay-optimal state clustering:
      Decompose reading states into sequences of single-read states
      Pre-cluster states that read same stream; this forms distributors
      Sequential delay of read request is now modeled as state transfer to distributor
Decomposing Large States
  A state may be larger than a page
  Decomposing into a sequence of page-size states leads to excessive inter-page transfer
  Better: delay-optimal DAG mapping into parallel pages
SFSM Optimizations
  Many traditional compiler optimization techniques apply to TDF
    State flow ~ basic block flow
  Different cost model
    "Unlimited" registers and functional units
  E.g. work-reducing optimizations
    Constant folding / propagation
    Common subexpression elimination
    Hoist loop invariants
    Strength reduction
SCORE Functional Simulation
  FPGA based on HSRA [Berkeley, FPGA '99]
    CP: 512 4-LUTs
    CMB: 2 Mbit DRAM
    Area for CP-CMB pair:
      0.25 µm: 12.9 mm² (1/9 of PII-450)
      0.18 µm: 6.7 mm² (1/16 of PIII-600)
    Page reconfiguration: 5000 cycles (from CMB)
    Synchronous operation (same clock speed as processor)
  x86 microprocessor
  Page Scheduler task
    Swap on timer interrupt (every 250,000 cycles)
    Fully dynamic scheduling
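The two timing parameters on this slide give a feel for how reconfiguration cost is amortized. The sketch below assumes, purely for illustration, that a page is reconfigured at most once per scheduler time slice and that reconfiguration is not overlapped with computation; the function name is hypothetical.

```python
def reconfig_overhead(reconfig_cycles, timeslice_cycles):
    """Fraction of a time slice spent reconfiguring a page, assuming one
    non-overlapped reconfiguration per slice (an illustrative assumption)."""
    if timeslice_cycles <= 0:
        raise ValueError("time slice must be positive")
    return reconfig_cycles / timeslice_cycles

# Parameters from the simulation setup: 5000-cycle page reconfiguration,
# swap on a timer interrupt every 250,000 cycles.
print(reconfig_overhead(5000, 250_000))

# For contrast, a much shorter hypothetical slice amortizes far less well.
print(reconfig_overhead(5000, 10_000))
```

Under these assumptions the overhead at the simulated settings is 5000 / 250,000, or 2% of each time slice, which is the sense in which a long epoch amortizes the reconfiguration cost.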
Application: JPEG Encode
Scaling Results: JPEG Encode
  [Figure: total time (makespan, in millions of cycles) vs. physical compute pages]
Page Hardware Model
  Page = fixed-size slice of resources + stream interface
  FSM for:
    Firing
    Output emission
    Data-path control
    Branching
  [Figure: page containing the FSM; legend: reconfigurable, fixed logic]
Page Firing Logic
  Sample firing logic
    3 inputs (A, B, C)
    3 outputs (X, Y, Z)
    Single signature
How Large is a State?
  [Figure: state sizes by application; labels: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), IIR]
SFSM Firing Delay
  Complex SFSM may require ≥1 cycle just for control
    Evaluate firing rule, generate control signals, compute next state
  Should we partition SFSM to minimize FSM logic?
    No: incurring inter-page communication latency is worse!
  [Figure: histogram of FSM delay (4-LUT depth) for 47 operators (unpartitioned); labels: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]
  [Figure: histogram of FSM inputs for 47 operators (unpartitioned); labels: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]
Scaling the Hardware Resources
  A simplified scaling model for architectural studies
  Scaling page size (LUTs) induces scaling of other resources, e.g.:
    Scaling memory
      Constant CP-to-CMB ratio
    Scaling page IO
      Rent's Rule: $IO = C A^{p}$, $(0 \le p \le 1)$
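Rent's Rule as stated above is easy to evaluate numerically. The sketch below is illustrative only: the function name is hypothetical, and the values of `C` and `p` are made-up examples, since the slides do not give the constants used in the studies.

```python
def rent_io(area_luts, c, p):
    """Page IO count predicted by Rent's Rule, IO = C * A**p, with 0 <= p <= 1."""
    if not 0 <= p <= 1:
        raise ValueError("Rent exponent p must lie in [0, 1]")
    return c * area_luts ** p

# Hypothetical constants: C = 4, p = 0.5.
for luts in (256, 512, 1024):
    print(luts, rent_io(luts, c=4, p=0.5))
```

Whatever the constants, doubling the page size multiplies the IO count by a factor of 2 raised to `p`, so for `p < 1` the IO grows more slowly than the page area.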