Compilation for Scalable, Paged Virtual Hardware. Eylon Caspi, Qualifying Exam, 3/6/01, University of California, Berkeley.


Slide 2: The Compilation Problem
- Programming model: communicating EFSM operators (unrestricted size, # IOs, timing)
- Execution model: communicating page configs (fixed size, # IOs, timing) on paged virtual hardware
- Compilation is a resource-binding transform on state machines + data-paths
[Figure: "Compile" maps TDF operators, streams, and memory segments to compute pages, streams, and memory segments]

Slide 3: Overview
- Motivation
  - Paged virtual hardware – software survival + scalability
  - SCORE programming model
- Compilation methodology
  - New page partitioning techniques
  - Automatic synthesis & partitioning of communicating FSMs
- Evaluation + Architectural Studies
- Timeline

Slide 4: Reconfigurable Computing
- Programmable logic + programmable interconnect (e.g. FPGA)
- 10x-100x gain vs. microprocessors in:
  - Performance
  - Functional density (work per area-time)
- Spatial computing
  - Parallelism; custom data paths
- Programmability
  - Custom execution sequence; specialization
- BUT current models expose resource constraints to the programmer
  - Programmer has to target a specific device
  - Limits software longevity
(Graphics copyright by their respective companies.)

Slide 5: Solution: Virtual Hardware
- Compute model with unbounded resources
  - Programmer no longer targets a specific device
  - Enables software longevity, scalability
- Requires efficient hardware virtualization
  - Large device → concurrent spatial execution
  - Small device → time multiplexing
- Paging model

Slide 6: Previous Approaches to Paging
- WASMII: Register IO [Ling+Amano, FCCM ‘93]
  - Page IO via registers
  - Evaluate each page for a cycle, then reconfigure
  - Reconfiguration time dominates execution
- DPGA: Configuration Cache [DeHon, FPGA ‘94], TM-FPGA [Xilinx, FCCM ‘97]
  - Fast reconfiguration → area, power
  - Reconfiguration power dominates execution
- PipeRench: Stripes [CMU, FPGA ‘98]
  - Pipelined reconfiguration
  - Feed-forward computation only

Slide 7: Paging + Streaming
- Streaming allows efficient, useful virtualization
  - Amortizes reconfiguration cost over a larger epoch
  - Exploits program structure
  - Less restrictive communication topology
- Compiler and scheduler’s joint responsibility
[Figure labels: buffers, swap]

Slide 8: SCORE Compute Model
- Program = DFG of compute nodes
  - Kahn process network: blocking read, non-blocking write
- Compute: SFSM (Streaming Finite State Machine)
  - Concretely: page + FSM to implement token-flow semantics
  - Abstractly: task with local control
- Communication: Stream
  - Abstraction of wire, with buffering
- Storage: Memory Segment
- Dynamics:
  - Dynamic local behavior in SFSM
  - Unbounded resource usage: stream buffer expansion
  - Dynamic graph allocation in STM (Streaming Turing Machine)

Slide 9: SCORE Programming Model: TDF
- TDF = intermediate, behavioral language for:
  - EFSM operators
  - Static operator graphs
- State machine for:
  - Firing signatures
  - Control flow (branching)
- Firing semantics:
  - When in state X, wait for X’s inputs, then fire (consume, act)

Example operator:

    select ( input boolean s,
             input unsigned[8] t,
             input unsigned[8] f,
             output unsigned[8] o )
    {
      state S (s) : if (s) goto T; else goto F;
      state T (t) : o=t; goto S;
      state F (f) : o=f; goto S;
    }

[Figure: select operator with inputs s, t, f and output o]
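The firing semantics of the `select` operator on this slide can be sketched in Python. This is only an illustrative simulation, not the SCORE compiler's output: the function name `run_select` and the use of `deque` objects as streams are assumptions. Each state waits for the inputs named in its signature, consumes them, acts, and moves to the next state; the simulation stops when the current state's input is empty, which stands in for a blocking read.

```python
from collections import deque


def run_select(s_tokens, t_tokens, f_tokens):
    """Simulate the TDF `select` operator until it blocks on an empty input."""
    s, t, f = deque(s_tokens), deque(t_tokens), deque(f_tokens)
    out = []
    state = "S"
    while True:
        if state == "S":
            if not s:              # blocking read: state S waits for input s
                break
            state = "T" if s.popleft() else "F"
        elif state == "T":
            if not t:              # state T waits for input t
                break
            out.append(t.popleft())  # o = t
            state = "S"
        else:                      # state F waits for input f
            if not f:
                break
            out.append(f.popleft())  # o = f
            state = "S"
    return out, state


outputs, blocked_in = run_select([True, False, True], [10, 11], [20])
```

Here `outputs` holds the tokens emitted on `o` and `blocked_in` names the state in which the operator is waiting for more input.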

Slide 10: SCORE Hardware Model
- Paged FPGA
  - Compute Page (CP): fixed-size slice of RC hardware; fixed number of I/O ports
  - Distributed, on-chip memory: Configurable Memory Block (CMB); stream access
  - High-level interconnect
- Microprocessor
  - Run-time support + user code

Slide 11: SCORE Software Infrastructure
- Device Simulator
  - Cycle-accurate behavioral simulation
  - Parameterized (e.g. #pages)
  - Interacts with concurrent user processes (STMs) via stream API
- Page Scheduler
  - Version 1: dynamic, list-based scheduling (by input availability)
  - Version 2: static, precedence-based
- TDF Compiler
  - Compiles to working C++ simulation code
  - No partitioning (page = 1 TDF operator)
- Applications
  - Wavelet, JPEG, MPEG, IIR
[Plot: run time vs. device size]

Slide 12: Communication is King
- With virtualization, inter-page delay is unknown, sensitive to:
  - Placement
  - Interconnect implementation
  - Page schedule
  - Technology – wire delay is growing
- Inter-page feedback is SLOW
  - Partition to contain FB loops in page
  - Schedule to contain FB loops on device

Slide 13: Structural Partitioning is Not Enough
- Structural partitioning does not address feedback loops
  - Wire min-cut: FM, flow-based
  - Minimum wire length: spectral
  - Delay-optimal DAG mapping: DAGON, FlowMap, Wong
- Structural partitioning does not address communication rates, dynamics
  - All loops are NOT created equal

Slide 14: FSM Decomposition is Not Enough
- Ashar+Devadas+Newton (ICCAD ‘89): minimize logic
- Kuo+Liu+Cheng (ISCAS ‘95): minimize wires
- Benini+DeMicheli+Vermeulen (ISCAS ‘98): minimize power
- None consider inter-page delay
- None consider cutting / scheduling data-path separately from FSM
[Figure labels: Ma, Mb, Fa, Fb]

Slide 15: Outline
- Motivation
- Compilation Methodology
- Evaluation + Architectural Studies
- Time Line

Slide 16: Compilation – Scope
- Synthesis + partitioning of SFSMs
  - TDF → pages
  - Resource binding
- Target
  - Parameterized hardware model / simulation
- Constrained optimization problem
  - Constraints: page area, IO, timing
  - Optimality criteria
    - Primary: communication delay
    - Secondary: communication bandwidth, area
[Figure: "Compile" maps TDF operators, streams, and memory segments to compute pages, streams, and memory segments]

Slide 17: Compilation Flow Overview
- (1) Optimizations
- (2) Data path timing + scheduling
- (3) Partitioning
- Ignore:
  - Place / route / retime in page: known solutions in the community
  - Page scheduling: responsibility of separate scheduler

Slide 18: Synthesis + Partitioning Flow
[Flow diagram. Stage labels: Pipeline Extraction, Data Path Mapping, Partition Large States, Schedule DF into States, Cluster States, Page Packing, Synthesize Page FSMs, Compiler Optimizations. Group labels: Optimization, Preliminary Code, Data-path, Partitioning.]

Slide 19: How Big is an Operator?
[Chart; legend: Wavelet Decode, Wavelet Encode, JPEG Encode, MPEG Encode, JPEG Decode, MPEG (I), MPEG (P), IIR]

Slide 20: Partitioning Tasks
- (1) Decompose / shrink SFSMs
- (2) Pack SFSMs onto page
[Flow diagram. Stage labels: Pipeline Extraction, Data Path Mapping, Partition Large States, Schedule DF into States, Cluster States, Page Packing, Synthesize Page FSMs, Compiler Optimizations.]

Slide 21: Pipeline Extraction
- Hoist uncontrolled FF data-flow out of FSMD
- Benefits:
  - Shrink FSM cyclic core
  - Extracted pipeline has more freedom for scheduling and partitioning
[Figure: extraction example. Before: "state foo(x): if (x==0) ...". After: "state foo(xz): if (xz) ...", with x==0 computed in a pipeline that turns stream x into stream xz. Labels: state, DF, CF, pipeline.]
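A minimal Python sketch of the extraction in this slide's example, under stated assumptions: the slide elides the state body, so the branch actions ("zero" / "nonzero") and all three function names are hypothetical placeholders. The point is only that the comparison `x == 0` does not depend on FSM state, so it can run as a separate feed-forward stage whose output stream `xz` feeds the residual state machine.

```python
def foo_original(xs):
    """Original state: reads x and evaluates x == 0 inside the FSM."""
    return ["zero" if x == 0 else "nonzero" for x in xs]


def extracted_pipeline(xs):
    """Hoisted feed-forward data flow: turns stream x into boolean stream xz."""
    return [x == 0 for x in xs]


def foo_residual(xzs):
    """Residual state: reads the precomputed flag xz and only branches on it."""
    return ["zero" if xz else "nonzero" for xz in xzs]


xs = [0, 3, 0, 7]
same_behavior = foo_residual(extracted_pipeline(xs)) == foo_original(xs)
```

The composed pipeline plus residual machine produces the same outputs as the original, while the residual machine no longer contains the comparator.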

Slide 22: Pipeline Extraction – Extractable Area
[Chart; legend: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]

Slide 23: Pipeline Extraction – Residual SFSM
[Chart; legend: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]

Slide 24: Data-path Mapping / Scheduling
- Task:
  - Bind technology-specific area/time to data-path primitives
  - Schedule data-path primitives in state machine
- Fixed-frequency target
  - Decompose primitives into multi-cycle operations
  - Data-path module library / tree matching
  - Pipeline linearized sequences / loops
- DAG mapping state logic is insufficient
- Compiler technology
  - Code motion
  - Software pipelining

Slide 25: Delay-Oriented State Clustering
- Indivisible unit: state (CF+DF)
  - Spatial locality in state logic
- Cluster states into page-size sub-machines
  - Inter-page communication for data flow, state flow
- Sequential delay is in inter-page state transfer
  - Cluster to maintain local control
  - Cluster to contain state loops
- Similar to:
  - VLIW trace scheduling [Fisher ‘81]
  - FSM decomp. for low power [Benini/DeMicheli ISCAS ‘98]
  - VM/cache code placement
  - GarpCC HW/SW partitioning [Callahan ‘00]

Slide 26: State Clustering Formulation
- Min-cut transition probabilities in state flow graph
  - Probabilities from profiling
- Area-constrained
  - Balanced min-cut partitioning [Yang+Wong, ACM ‘94]
  - Iterate to desired partition area: $(1-\epsilon)A \le a(X) \le (1+\epsilon)A$
- IO-constrained
  - Add wire edges
  - Mix edge weights: $c\,w_{wire} + (1-c)\,w_{SF}$
  - Use smallest IO-feasible $c$
- Requires all states to be smaller than page
[Figure: state flow graph with node labels p1–p5, edge weights w1–w9, and areas a1–a4]
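The cost and constraint on this slide can be written out as a short Python sketch. It only evaluates a given partition; it does not implement the balanced min-cut algorithm itself. The function names, the tolerance name `eps`, and the 4-state example data are all assumptions made for illustration.

```python
def cut_weight(edges, part):
    """Total weight of edges whose two endpoints sit in different partitions."""
    return sum(w for (u, v), w in edges.items() if part[u] != part[v])


def mixed_cut_cost(sf_edges, wire_edges, part, c):
    """Blend wire cut and state-flow cut: c * w_wire + (1 - c) * w_SF."""
    return c * cut_weight(wire_edges, part) + (1 - c) * cut_weight(sf_edges, part)


def area_balanced(areas, part, target, eps):
    """Check (1 - eps) * target <= a(X) <= (1 + eps) * target for every partition X."""
    totals = {}
    for state, p in part.items():
        totals[p] = totals.get(p, 0) + areas[state]
    return all((1 - eps) * target <= a <= (1 + eps) * target for a in totals.values())


# Hypothetical 4-state machine: profiled transition probabilities, wire counts, areas.
sf_edges = {("A", "B"): 0.9, ("B", "C"): 0.1, ("C", "D"): 0.8, ("D", "A"): 0.1}
wire_edges = {("A", "C"): 8, ("B", "D"): 8, ("A", "B"): 2}
areas = {"A": 100, "B": 120, "C": 110, "D": 90}

by_flow = {"A": 0, "B": 0, "C": 1, "D": 1}  # keeps the hot transitions A-B and C-D local
by_wire = {"A": 0, "C": 0, "B": 1, "D": 1}  # keeps the wide wire bundles local
```

With `c = 0` the cost is the pure state-flow cut, so `by_flow` wins; with `c = 1` it is the pure wire cut, so `by_wire` wins. Raising `c` only as far as needed to become IO-feasible keeps the result as close as possible to the delay-oriented cut.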

Slide 27: Page Packing
- Cluster SFSMs + pipelines
  - Avoid page fragmentation
- Min-cut streams of top-level DFG
  - Allow cutting pipelines, not SFSMs
  - Area and IO constrained (Wong balanced min-cut partition)
- Disallow certain topologies
  - No dynamic-rate streams in page
  - Data-flow feedback?

Slide 28: Outline
- Motivation
- Compilation Methodology
- Evaluation + Architectural Studies
- Time Line

Slide 29: Evaluating Paging Overhead
- Applications
  - Must be rewritten in TDF
  - Existing: Wavelet, JPEG, MPEG, IIR
  - To do: ADPCM, BABAR particle detector
- Metrics
  - Circuit area (#pages x page-size)
  - Page delay (LUT depth per firing)
  - Performance (total run-time, “makespan”)
- Baseline comparison
  - “Unpartitioned”: page = 1 TDF operator
  - Ideal virtualization with zero partitioning cost – cannot do better
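As a small worked example of the area metric and the baseline comparison, here is a hedged Python sketch. The helper names and every number in it are hypothetical; the slides report no measured values at this point.

```python
def circuit_area(num_pages, page_size_luts):
    """Circuit area metric from the slide: #pages x page-size."""
    return num_pages * page_size_luts


def relative_overhead(partitioned, unpartitioned):
    """Overhead of a partitioned result relative to the unpartitioned baseline."""
    return partitioned / unpartitioned - 1.0


# Hypothetical numbers: 12 pages of 512 LUTs vs. an unpartitioned area of 5120 LUTs.
paged_area = circuit_area(12, 512)
area_overhead = relative_overhead(paged_area, 5120)
```

The same `relative_overhead` ratio can be applied to page delay or makespan, since each metric is compared against the same unpartitioned baseline.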

Slide 30: Page Size Studies
- Paging overhead varies with:
  - Application
  - Page size, IO
  - Match thereof
- Is paging overhead robust to a mismatch?
- Vary page parameters, measure:
  - (1) Pure area overhead
  - (2) Pure performance overhead: execute spatially in expanded hardware
  - (3) Virtualized performance overhead: execute in fixed device size

Slide 31: Outline
- Motivation
- Compilation Methodology
- Evaluation + Architectural Studies
- Time Line

Slide 32: Status
- SCORE compiler / simulator / scheduler
  - Compile + execute unpartitioned (page = 1 TDF op)
- Preliminary synthesis + partitioning work
  - Pipeline extraction
  - FSM synthesis to SIS
  - Area-constrained state clustering
- To do
  - Complete initial implementation
  - Evaluate
  - Improve – secondary implementation

Slide 33: To Complete Initial Implementation
- IO-constrained state clustering
- Decompose large states
- Page packing
- Data-path scheduling in states
- Synthesize partitioned SFSMs

Slide 34: Secondary Implementation – Possibilities
- Optimizations
  - SW pipelining
  - Use SUIF
- State clustering with replication
- Unified state clustering + page packing
  - Cluster states of all operators simultaneously
- Finer-grained clustering
  - Recast as BDF, min-cut stream rates

Slide 35: Time Line
[Timeline chart; phases: Impl. 1, Eval, Impl. 2, Eval, Thesis writing; axes: month, year]

Slide 36: Summary
- Partitioning and paging enable
  - Software survival / scaling
  - Efficient use of small HW for dynamic apps
- My contributions
  - Methodology for page synthesis + partitioning
    - Necessary for efficient virtualization
  - Evaluation framework
    - Verify that paging can be efficient
  - Architectural studies

Slide 37: Supplemental Material
- SFSMs + transforms
- SCORE simulation + scaling results
- Page hardware model
- Synthesis observations
- Architectural studies

Slide 38: TDF → Dataflow Process Network
- Dataflow Process Network [Parks+Lee, IEEE May ‘95]
  - Process enabled by set of firing rules: $R = \{R_1, R_2, \ldots, R_N\}$
  - Firing rule = set of patterns: $R_i = \{R_{i,1}, R_{i,2}, \ldots, R_{i,p}\}$
- DF process for a TDF operator:
  - Feedback arc for state
  - One firing rule per state
  - Patterns match state value + presence of desired inputs
  - E.g. for state $i$: $R_i = \{R_{i,1}, R_{i,2}, \ldots, [i]\}$
  - Patterns:
    - $R_{i,j} = [*]$ if input $j$ is in state $i$’s input signature
    - $R_{i,j} = \bot$ if input $j$ is not in state $i$’s input signature
    - $R_{i,p} = [i]$ for final input, representing state arc
- These are sequential firing rules
- Partitioned SFSM adds “wait” state
[Figure labels: process, state]
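The construction of firing rules from state input signatures can be sketched in Python. This is an illustration under assumptions, not the compiler's data structure: `STAR`, `BOTTOM`, `firing_rules`, and `is_enabled` are names chosen here, with `BOTTOM` standing for the pattern that places no requirement on an input. The example reuses the `select` operator's signatures (state S reads `s`, T reads `t`, F reads `f`).

```python
STAR = "[*]"    # pattern: one token of any value must be present on this input
BOTTOM = "⊥"    # pattern: this input is not examined in this state


def firing_rules(inputs, signatures):
    """Build one firing rule per state: one pattern per input plus the state arc."""
    rules = {}
    for state, needed in signatures.items():
        patterns = [STAR if name in needed else BOTTOM for name in inputs]
        patterns.append(f"[{state}]")  # final pattern matches the state token
        rules[state] = patterns
    return rules


def is_enabled(rule, inputs, available, current_state):
    """A rule fires when its state pattern matches and every [*] input has a token."""
    if rule[-1] != f"[{current_state}]":
        return False
    return all(p == BOTTOM or name in available for name, p in zip(inputs, rule))


inputs = ["s", "t", "f"]
rules = firing_rules(inputs, {"S": {"s"}, "T": {"t"}, "F": {"f"}})
```

Because exactly one rule carries the pattern for the current state token, at most one rule can match at a time, which is why rules built this way can be tested sequentially.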

Slide 39: SFSM Partitioning Transform
- Only 1 partition active at a time
  - Transform to activate via streams
- New state in each partition: “wait”
  - Used when not active
  - Waits for activation from other partition(s)
  - Has one input signature (firing rule) per activator
- Firing rules are not sequential, but determinism guaranteed
  - Only 1 possible activator
- Activation streams from given source to given dest. partitions can be merged + binary-encoded
[Figure: states A, B, C, D split into partition {A,B} with added state “Wait AB” and partition {C,D} with added state “Wait CD”]
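A rough Python sketch of the transform, assuming a state machine given only as a transition graph. The function name, the tuple encoding of activations, and the example transitions among A, B, C, D are hypothetical; the slide gives only the partitions {A,B} and {C,D}. Each cross-partition transition becomes a move to the local "wait" state plus an activation token sent to the destination partition, and the destination's "wait" state gains an entry for that activator.

```python
def partition_sfsm(transitions, partition_of):
    """Split a state graph into per-partition sub-machines, each with a 'wait' state.

    transitions: {state: [next states]}; partition_of: {state: partition id}.
    Result: {partition: {state: [(next_state, activation)]}}, where activation is
    None for a local move or (dest_partition, dest_state) for a hand-off, and the
    'wait' entry lists (entry_state, activator_partition) pairs.
    """
    machines = {p: {"wait": []} for p in sorted(set(partition_of.values()))}
    for src, dsts in transitions.items():
        p = partition_of[src]
        out = machines[p].setdefault(src, [])
        for dst in dsts:
            q = partition_of[dst]
            if q == p:
                out.append((dst, None))           # transition stays inside the page
            else:
                out.append(("wait", (q, dst)))    # go idle and activate partition q
                if (dst, p) not in machines[q]["wait"]:
                    machines[q]["wait"].append((dst, p))
    return machines


transitions = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["C", "A"]}
machines = partition_sfsm(transitions, {"A": 0, "B": 0, "C": 1, "D": 1})
```

In this example each "wait" state has a single activator, matching the slide's observation that determinism holds because only one partition can be active and sending activations at any time.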

Slide 40: Distributing/Collecting Shared Streams
- Requires inter-page synchronization for ordering
- Two schemes for input distribution
  - (1) Send token to all pages
    - Inactive pages must discard tokens, must know how many to discard
  - (2) Send token only to active page
    - Distributor must know state
    - (a) present state requests token, OR
    - (b) previous state pre-fetches token
- One scheme for output collection
  - Collector must know state
- How to cluster distributors / collectors?
  - Distributor scheme (1) and collector incur no sequential delay (wire min-cut ok)
  - Distributor scheme (2)(a) can be cast into delay-optimal state clustering:
    - Decompose reading states into sequences of single-read states
    - Pre-cluster states that read same stream – this forms distributors
    - Sequential delay of read request is now modeled as state transfer to distributor
[Figure labels: states A, B, C, D; input i; output o]

Slide 41: Decomposing Large States
- A state may be larger than a page
- Decomposing into a sequence of page-size states leads to excessive inter-page transfer
- Better: delay-optimal DAG mapping into parallel pages

Slide 42: SFSM Optimizations
- Many traditional compiler optimization techniques apply to TDF
  - State flow ~ basic block flow
- Different cost model
  - “Unlimited” registers and functional units
- E.g. work-reducing optimizations
  - Constant folding / propagation
  - Common subexpression elimination
  - Hoist loop invariants
  - Strength reduction

Slide 43: SCORE Functional Simulation
- FPGA based on HSRA [Berkeley, FPGA ’99]
  - CP: 512 4-LUTs
  - CMB: 2 Mbit DRAM
  - Area for CP-CMB pair:
    - .25µ: 12.9 mm² (1/9 of PII-450)
    - .18µ: 6.7 mm² (1/16 of PIII-600)
  - Page reconfiguration: 5000 cycles (from CMB)
  - Synchronous operation (same clock speed as processor)
- x86 microprocessor
- Page Scheduler task
  - Swap on timer interrupt (every 250,000 cycles)
  - Fully dynamic scheduling
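The two timing parameters on this slide allow a back-of-the-envelope estimate of reconfiguration overhead. The Python sketch below is only that: it assumes one 5000-cycle reconfiguration per 250,000-cycle epoch and treats the epoch as pure compute time, which the slides do not state. The function name is hypothetical.

```python
def reconfig_fraction(reconfig_cycles, epoch_cycles):
    """Fraction of total time spent reconfiguring, one reconfiguration per epoch."""
    return reconfig_cycles / (reconfig_cycles + epoch_cycles)


# Simulator parameters from the slide: 5000-cycle reconfiguration, 250,000-cycle timer.
streaming = reconfig_fraction(5000, 250_000)

# Contrast: reconfiguring after every single cycle of evaluation.
per_cycle = reconfig_fraction(5000, 1)
```

Under these assumptions `streaming` comes to a little under 2% of run time, while `per_cycle` is above 99%, which illustrates why amortizing reconfiguration over a long epoch matters.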

Slide 44: Application: JPEG Encode

Slide 45: Scaling Results: JPEG Encode
[Plot: total time (makespan, in millions of cycles) vs. physical compute pages]

Slide 46: Page Hardware Model
- Page = fixed-size slice of resources + stream interface
- FSM for:
  - Firing
  - Output emission
  - Data-path control
  - Branching
[Figure labels: FSM, Reconfigurable, Fixed logic]

Slide 47: Page Firing Logic
- Sample firing logic
  - 3 inputs (A,B,C)
  - 3 outputs (X,Y,Z)
  - Single signature

Slide 48: How Large is a State?
[Chart; legend: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), IIR]

Slide 49: SFSM Firing Delay
- Complex SFSM may require ≥1 cycle just for control
  - Evaluate firing rule, generate control signals, compute next state
- Should we partition SFSM to minimize FSM logic?
  - No – incurring inter-page communication latency is worse!
[Chart: Histogram of FSM Delay for 47 Operators (unpartitioned); x-axis: 4-LUT depth; legend: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]
[Chart: Histogram of FSM Inputs for 47 Operators (unpartitioned); legend: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]

Slide 50: Scaling the Hardware Resources
- A simplified scaling model for architectural studies
- Scaling page size (LUTs) induces scaling of other resources, e.g.:
  - Scaling memory
    - Constant CP-to-CMB ratio
  - Scaling page IO
    - Rent’s Rule: $IO = C A^p$, $(0 \le p \le 1)$
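Rent's Rule as used on this slide can be evaluated with a few lines of Python. The constants below (`c = 1.0`, `p = 0.5`) are hypothetical values picked to make the arithmetic easy to follow; the slides do not give the values used in the study. Only the 512-LUT page size comes from the presentation.

```python
def rent_io(area_luts, c, p):
    """Rent's Rule: IO = C * A**p, with Rent exponent 0 <= p <= 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("Rent exponent p must lie in [0, 1]")
    return c * area_luts ** p


# Hypothetical constants C = 1.0, p = 0.5; page sizes in 4-LUTs.
io_512 = rent_io(512, 1.0, 0.5)
io_2048 = rent_io(2048, 1.0, 0.5)
```

With `p = 0.5`, quadrupling the page area from 512 to 2048 LUTs only doubles the IO count, so larger pages get fewer IOs per LUT. At the extremes, `p = 1` scales IO linearly with area and `p = 0` keeps it constant.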