CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #21 – HW/SW Codesign

CprE 583 – Reconfigurable Computing, November 2, 2006

Quick Points
- Midterm graded and returned
- Average: 84.5
- Median: 85.0
- Maximum: 95.0
- Minimum: 72.0
- Standard deviation: 6.65

HW #4 Discussion
- Problem 1: did just a simple adder work?
- Problem 2: how did you implement the permutation table?
- Problem 3: did you use a counter?

AES-128E Algorithm
[Figure: flowchart. Inputs are the 128-bit plaintext and the 128-bit key (through KeyExpansion). The Round Transformation consists of SubBytes, ShiftRows, MixColumns, and AddRoundKey. The loop increments round (round++) and tests round = 10?; on Yes the 128-bit ciphertext is output, on No another round runs.]

Overview of AES (cont.)
- The 128-bit input is copied into a two-dimensional (4x4) byte array referred to as the state
- Round transformations operate on the state array
- The final state is copied back into the 128-bit output
- AES makes use of a non-linear substitution function that operates on a single byte
- This function can be simplified as a look-up table (S-box)

[Figure: S-box lookup table, 16 x 16 byte values indexed by the hex digits x and y (0 through f). The table entries are not recoverable from the transcript.]
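The copy between the 128-bit block and the 4x4 state can be sketched in a few lines of Python. The slide does not give the byte order; the column-by-column fill used here is the one defined in FIPS-197, and the function names `bytes_to_state` and `state_to_bytes` are illustrative only.

```python
def bytes_to_state(block):
    """Copy a 16-byte block into a 4x4 state, filling column by column."""
    if len(block) != 16:
        raise ValueError("AES blocks are 16 bytes (128 bits)")
    # state[r][c] holds input byte r + 4*c (FIPS-197 ordering).
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def state_to_bytes(state):
    """Copy the 4x4 state back into a 16-byte block, column by column."""
    return bytes(state[r][c] for c in range(4) for r in range(4))


block = bytes(range(16))
state = bytes_to_state(block)
print(state[0])  # first row holds the first byte of each column
```

The round transformations then operate on `state` as a list of four rows, which matches the S r,c indexing used on the following slides.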

AES-128E Modules: SubBytes
- S-box transformation performed independently on each byte of the state

[Figure: AES-128E flowchart, plus a diagram of the S-box mapping each byte S r,c of state[i] to the byte S' r,c of state'[i].]
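As a software reference for SubBytes, the sketch below builds the S-box table from its FIPS-197 definition (multiplicative inverse in GF(2^8) followed by an affine transformation) instead of typing in 256 constants, then applies it byte by byte. The helper names are assumptions for illustration; a hardware design would normally store the finished table in a ROM.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return product


def gf_inv(a):
    """Multiplicative inverse computed as a^254; maps 0 to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def make_sbox():
    """Build the 256-entry S-box: field inverse, then affine transform."""
    table = []
    for value in range(256):
        inv = gf_inv(value)
        table.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                     ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return table


SBOX = make_sbox()


def sub_bytes(state):
    """Apply the S-box independently to every byte of the state."""
    return [[SBOX[b] for b in row] for row in state]
```

Because each byte is substituted independently, all 16 lookups can run in parallel in hardware, which is why the S-box mapping choice matters so much for area.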

AES-128E Modules: ShiftRows
- Bytes in the last three rows of the state are shifted cyclically over variable offsets

[Figure: AES-128E flowchart, plus state[i] and state'[i] side by side. Row 0 is unchanged; rows 1, 2, and 3 are rotated left by 1, 2, and 3 bytes.]
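A minimal Python sketch of ShiftRows, assuming the state is a list of four rows as in the figure; the name `shift_rows` is illustrative.

```python
def shift_rows(state):
    """Rotate row r of the state left by r byte positions (row 0 unchanged)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


# Label each cell with its (row, column) position to make the movement visible.
labels = [[(r, c) for c in range(4)] for r in range(4)]
shifted = shift_rows(labels)
print(shifted[1])  # row 1 now starts with the byte that was at column 1
```

In hardware this transformation is pure wiring and uses no logic resources.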

AES-128E Modules: MixColumns
- Modulo polynomial-basis multiplication performed on each column of the state
- Can be simplified as a series of AND and XOR operations

[Figure: AES-128E flowchart, plus one column of state[i] multiplied by constants including {02h} and {03h} to produce the matching column of state'[i].]

MixColumns Implementation

    -- STATEtype, RNUMtype, and Nb are declared elsewhere (not shown on the slide).
    entity MixColumns is
      port (STATE_IN  : in  STATEtype;
            RNUM_IN   : in  RNUMtype;
            STATE_OUT : out STATEtype);
    end MixColumns;

    architecture behavior of MixColumns is
      signal tSTATE : STATEtype;
    begin
      process(STATE_IN)
        variable t1, t2 : std_logic_vector(7 downto 0);
      begin
        for i in 0 to 3 loop
          for j in 0 to Nb-1 loop
            -- Multiply by 2: shift left, reduce with x"1b" if bit 7 was set
            t1 := STATE_IN(i mod 4)(j)(6 downto 0) & '0';
            if (STATE_IN(i mod 4)(j)(7) = '1') then
              t1 := t1 xor x"1b";
            end if;
            -- Multiply by 3: multiply by 2, then xor with the original byte
            t2 := STATE_IN((i+1) mod 4)(j)(6 downto 0) & '0';
            if (STATE_IN((i+1) mod 4)(j)(7) = '1') then
              t2 := t2 xor x"1b";
            end if;
            t2 := t2 xor STATE_IN((i+1) mod 4)(j);
            tSTATE(i)(j) <= t1 xor t2 xor STATE_IN((i+2) mod 4)(j)
                            xor STATE_IN((i+3) mod 4)(j);
          end loop;
        end loop;  -- closes the outer loop; missing on the slide
      end process;
      -- The rest of the architecture is not shown on the slide.
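The same arithmetic can be checked against a small Python reference model. This is a sketch that mirrors the slide's per-byte formula (2 times byte i, 3 times byte i+1, plus bytes i+2 and i+3); the function names are illustrative, and the test column is a commonly published MixColumns example.

```python
def xtime(b):
    """Multiply a byte by {02}: shift left, xor 0x1b if bit 7 was set."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def mix_column(col):
    """Mix one 4-byte column using the same terms as the VHDL loop body."""
    out = []
    for i in range(4):
        t1 = xtime(col[i])                                   # multiply by 2
        t2 = xtime(col[(i + 1) % 4]) ^ col[(i + 1) % 4]      # multiply by 3
        out.append(t1 ^ t2 ^ col[(i + 2) % 4] ^ col[(i + 3) % 4])
    return out


def mix_columns(state):
    """Apply mix_column to every column of a row-major 4x4 state."""
    cols = [mix_column([state[r][c] for r in range(4)]) for c in range(4)]
    return [[cols[c][r] for c in range(4)] for r in range(4)]


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

A model like this is useful as a golden reference when writing a VHDL test bench for the module.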

AES-128E Modules: AddRoundKey
- Words from the round-specific key are XORed into columns of the state

[Figure: AES-128E flowchart, plus the four words w[0], w[1], w[2], w[3] of Rkey[i] XORed into the four columns of state[i] to produce state'[i].]

AddRoundKey Implementation

    entity AddRoundKey is
      port(STATE_IN  : in  STATEtype;
           KEY_IN    : in  KEYtype;
           STATE_OUT : out STATEtype);
    end AddRoundKey;

    architecture behavior of AddRoundKey is
    begin
      process(STATE_IN, KEY_IN)
      begin
        for j in 0 to (Nb-1) loop
          STATE_OUT(0)(j) <= STATE_IN(0)(j) xor KEY_IN(j)(31 downto 24);
          STATE_OUT(1)(j) <= STATE_IN(1)(j) xor KEY_IN(j)(23 downto 16);
          STATE_OUT(2)(j) <= STATE_IN(2)(j) xor KEY_IN(j)(15 downto 8);
          STATE_OUT(3)(j) <= STATE_IN(3)(j) xor KEY_IN(j)(7 downto 0);
        end loop;
      end process;
    end behavior;
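A Python counterpart of the VHDL above, assuming the round key is given as four 32-bit words and that the most significant byte of word j goes to row 0 of column j, as in the `31 downto 24` slice. The name `add_round_key` is illustrative.

```python
def add_round_key(state, key_words):
    """XOR key word j into column j of the state, most significant byte first."""
    out = [[0] * 4 for _ in range(4)]
    for j in range(4):
        for r in range(4):
            key_byte = (key_words[j] >> (24 - 8 * r)) & 0xFF
            out[r][j] = state[r][j] ^ key_byte
    return out


zero_state = [[0] * 4 for _ in range(4)]
round_key = [0x01020304, 0x05060708, 0x090A0B0C, 0x0D0E0F10]
mixed = add_round_key(zero_state, round_key)
print(mixed[0])  # top byte of each key word
```

Since XOR is its own inverse, applying the same round key twice restores the original state, which is also how decryption undoes this step.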

AES-128E Modules: KeyExpansion
- Initial 128-bit key is converted into separate keys for each of the 10 required rounds
- Consists of S-box transformations and some XORs

[Figure: AES-128E flowchart, plus the 128-bit key expanding into Rkey[1] through Rkey[10]. Words w[0] to w[3] feed four S-boxes (S) and rcon to produce words w[4] to w[7].]
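The key schedule can be sketched in Python following the FIPS-197 description for a 128-bit key. The S-box is generated inside the block so the example is self-contained; all function names are illustrative, and the test key is the worked example from the FIPS-197 appendix.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return product


def sbox_entry(value):
    """S-box output for one byte: field inverse (a^254), then affine transform."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, value)
    result = inv ^ 0x63
    for n in range(1, 5):
        result ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
    return result


SBOX = [sbox_entry(v) for v in range(256)]


def sub_word(word):
    """Apply the S-box to each of the four bytes of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def rot_word(word):
    """Rotate a 32-bit word left by one byte."""
    return ((word << 8) | (word >> 24)) & 0xFFFFFFFF


def expand_key(key):
    """Return 11 round keys (initial key plus 10 rounds), four words each."""
    if len(key) != 16:
        raise ValueError("this sketch handles 128-bit keys only")
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 0x02)   # next round constant
        w.append(w[i - 4] ^ temp)
    return [w[4 * r:4 * r + 4] for r in range(11)]


round_keys = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print([format(word, "08x") for word in round_keys[1]])
```

Each new word depends on the previous one, so the schedule is inherently sequential; this is what makes the online versus offline key generation choice a real design decision.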

Design Decisions
- Online/offline key generation
- Inter-round layout decisions
  - Round unrolling
  - Round pipelining
- Intra-round layout decisions
  - Transformation pipelining
  - Transformation partitioning
- Technology mapping decisions
  - S-box synthesis as Block SelectRAM, distributed ROM primitives, or logic gates

Round Unrolling / Pipelining
- Unrolling replaces a loop body (round) with N copies of that loop body
- AES-128E algorithm is a loop that iterates 10 times, so N ∈ [1, 10]
  - N = 1 corresponds to the original looping case
  - N = 10 is a fully unrolled implementation
- Pipelining is a technique that increases the number of blocks of data that can be processed concurrently
  - Pipelining in hardware can be implemented by inserting registers
  - Unrolled rounds can be split into a certain number of pipeline stages
- These transformations will increase throughput but increase area and latency
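The trade-off can be made concrete with a rough back-of-the-envelope model in Python. Everything numeric here is an assumption for illustration only: each round is taken to cost 1000 ps of logic delay, each register adds 100 ps of overhead, the unroll factor divides the round count, and blocks are independent of each other so that several can be in flight at once. The function name `aes_round_layout` is hypothetical.

```python
def aes_round_layout(unroll, pipelined, rounds=10,
                     round_delay_ps=1000, reg_overhead_ps=100):
    """Estimate clock, latency, and throughput for one round layout."""
    if unroll < 1 or rounds % unroll != 0:
        raise ValueError("unroll factor must divide the round count")
    passes = rounds // unroll  # trips a block makes through the unrolled rounds
    if pipelined:
        # A register after every round: short clock, one block per stage.
        clock_ps = round_delay_ps + reg_overhead_ps
        cycles_per_block = rounds
        blocks_in_flight = unroll
    else:
        # All unrolled rounds form one combinational path between registers.
        clock_ps = unroll * round_delay_ps + reg_overhead_ps
        cycles_per_block = passes
        blocks_in_flight = 1
    latency_ps = cycles_per_block * clock_ps
    return {
        "round_units": unroll,  # crude area proxy: number of round copies
        "clock_ps": clock_ps,
        "latency_ps": latency_ps,
        "blocks_per_us": blocks_in_flight * 1_000_000 / latency_ps,
    }


for unroll, pipelined in [(1, False), (10, False), (10, True)]:
    print(unroll, pipelined, aes_round_layout(unroll, pipelined))
```

Under these assumptions, unrolling alone barely changes throughput, while unrolling combined with pipelining multiplies it by roughly the unroll factor at the cost of ten round copies and a longer latency than the unpipelined unrolled design.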

Round Unrolling / Pipelining (cont.)
[Figure: round layouts for unrolling factors 1, 2, 5, and 10 with round pipelining = ON. The input plaintext passes through rounds R1 to R10 to produce the output ciphertext.]

Transformation Partitioning/Pipelining
- FPGA maximum clock frequency depends on the critical logic path
  - Inter-round transformations can't improve the critical path
- Individual transformations can be pipelined with registers, similar to the rounds
- Transformations that are part of the maximum delay path can be partitioned and pipelined as well
- Can result in large gains in throughput with only minimal area increases

Partitioning / Pipelining (cont.)
[Figure: one round drawn twice. With transformation pipelining = ON, the blocks are SubBytes, ShiftRows, MixColumns, AddRoundKey, and KeyExpansion. With transformation partitioning = ON, KeyExpansion is split into KeyExpansion A, KeyExpansion B, and KeyExpansion C.]

S-box Technology Mapping
- With synthesis primitives, can map the S-box lookup tables to different hardware components
- Two S-boxes can fit on a single Block SelectRAM

Sample VHDL code:

    constant SSYNROMSTYLE : string := "select_rom"; -- {logic, select_rom}

    entity Sbox is
      port(BYTE_IN  : in  std_logic_vector(7 downto 0);
           BYTE_OUT : out std_logic_vector(7 downto 0));
      attribute syn_romstyle : string;
      attribute syn_romstyle of BYTE_OUT : signal is SSYNROMSTYLE;
    end Sbox;
    ...

Recap – Retiming

Recap – Retiming (cont.)
- Retimed edge weight: weight(e) = weight(e) + lag(head(e)) - lag(tail(e))
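The update rule can be tried on a tiny graph in Python. This is a sketch under assumed conventions: each edge is a `(tail, head, weight)` tuple running from tail to head, the weight is the number of registers on the edge, and a retiming is legal only if no weight goes negative. The graph and lag values are made up for illustration.

```python
def retime(edges, lag):
    """Apply weight(e) + lag(head(e)) - lag(tail(e)) to every edge."""
    retimed = []
    for tail, head, weight in edges:
        new_weight = weight + lag[head] - lag[tail]
        if new_weight < 0:
            raise ValueError(f"illegal retiming: edge {tail}->{head} goes negative")
        retimed.append((tail, head, new_weight))
    return retimed


# A three-node cycle a -> b -> c -> a with two registers in total.
edges = [("a", "b", 1), ("b", "c", 0), ("c", "a", 1)]
lag = {"a": 0, "b": -1, "c": 0}
result = retime(edges, lag)
print(result)
```

The register on edge a->b moves across node b onto edge b->c, and the total register count around the cycle is unchanged, since the lag terms cancel around any cycle.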

Retiming and Pipelining
- Can use this retiming to pipeline
- Assume an unlimited supply of registers at the edge of the circuit
- Retime them into the circuit
- See [WeaMar03A] for details

Recap – Retiming and Covering

Outline
- HW #4 Discussion
- Recap
- HW/SW Codesign
  - Motivation
  - Specification
  - Partitioning
  - Automation

Hardware/Software Codesign
- Definition 1: the concurrent and co-operative design of hardware and software components of an embedded system
- Definition 2: a design methodology supporting the cooperative and concurrent development of hardware and software (co-specification, co-development, and co-verification) in order to achieve shared functionality and performance goals for a combined system [MicGup97A]

Motivation
- Not possible to put everything in hardware due to limited resources
- Some code more appropriate for sequential implementation
- Desirable to allow for parallelization, serialization
- Possible to modify existing compilers to perform the task

Why put CPUs on FPGAs?
- Shrink a board to a chip
- What CPUs do best:
  - Irregular code
  - Code that takes advantage of a highly optimized datapath
- What FPGAs do best:
  - Data-oriented computations
  - Computations with local control

Computational Model
- Most recent work addressing this problem assumes a relatively slow bus interface
- FPGA has a direct interface to memory in this model

[Figure: General-Purpose Processor, Memory, and FPGA connected by a Memory bus.]

Hardware/Software Partitioning
[Figure: the code below mapped onto a CPU and a HW Accelerator.]

    if (foo < 8) {
      for (i=0; i<N; i++)
        x[i] = y[i]*z[i];
    }

Methodology
- Separation between function and communication
- Unified refinable formal specification model
  - Facilitates system specification
  - Implementation independent
  - Eases HW/SW trade-off evaluation and partitioning
- From a more practical perspective:
  - Measure the application
  - Identify what to put onto the accelerator
  - Build interfaces

System-Level Methodology
[Figure: design flow with stages labeled Informal Specification, Constraints; System model; Architecture design; HW/SW implementation; Prototype; Test; Implementation. Test has Fail and Success outcomes. Component profiling and Performance evaluation also appear in the flow.]

Concurrency
- Concurrent applications provide the most speedup

[Figure: the CPU runs if (a > b)... while the accelerator runs x[i] = y[i] * z[i]. No data dependencies between them.]

Partitioning
- Can divide the application into several processes that run concurrently
- Process partitioning exposes opportunities for parallelism

[Figure: Process 1, Process 2, and Process 3, containing the fragments if (i>b) …, for (i=0; i<N; i++) …, and for (j=0; j<N; j++)...]

Automating System Partitioning
Good partitioning mechanism:
1) Minimize communication across bus
2) Allows parallelism: both hardware (FPGA) and processor operating concurrently
3) Near peak processor utilization at all times (performing useful work)

[Figure: design flow with stages labeled Specification, Capture, Model, Partition, Synthesize, and Interface, targeting a Processor and an FPGA. Two specification fragments are shown:]

    process (a, b, c)
      in port a, b;
      out port c;
    {
      read(a);
      …
      write(c);
    }

    Line ()
    {
      a = …
      …
      detach
    }

Partitioning Algorithms
- Assume everything initially in software
- Select task for swapping
- Migrate to hardware and evaluate cost
  - Timing, hardware resources, program and data storage, synchronization overhead
- Cost evaluation and move evaluation similar to what we've seen regarding mincut and simulated annealing

[Figure: a list of tasks, with one task moving between the Software and Hardware partitions.]
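The loop described above can be sketched as a simple greedy heuristic in Python. This is not the mincut or simulated-annealing formulation mentioned on the slide, only the simplest move-and-evaluate scheme. The task names, timing numbers, area figures, and the `partition` function are all hypothetical, and the cost model assumes tasks run one after another.

```python
# Hypothetical task table: software time, hardware time, synchronization
# overhead when the task runs in hardware, and hardware area cost.
TASKS = {
    "sbox":         {"sw_time": 40, "hw_time": 4, "sync": 2, "hw_area": 30},
    "mixcolumns":   {"sw_time": 30, "hw_time": 3, "sync": 2, "hw_area": 40},
    "key_schedule": {"sw_time": 10, "hw_time": 8, "sync": 5, "hw_area": 50},
    "control":      {"sw_time": 5,  "hw_time": 5, "sync": 3, "hw_area": 10},
}


def total_time(tasks, hardware):
    """Sequential execution time for a given hardware set."""
    return sum(t["hw_time"] + t["sync"] if name in hardware else t["sw_time"]
               for name, t in tasks.items())


def partition(tasks, area_budget):
    """Start all-software, then migrate the best-gain task until none helps."""
    hardware, area_used = set(), 0
    while True:
        best_name, best_gain = None, 0
        for name, t in tasks.items():
            if name in hardware or area_used + t["hw_area"] > area_budget:
                continue
            gain = t["sw_time"] - (t["hw_time"] + t["sync"])
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            return hardware
        hardware.add(best_name)
        area_used += tasks[best_name]["hw_area"]


chosen = partition(TASKS, area_budget=80)
print(sorted(chosen), total_time(TASKS, chosen))
```

A greedy scheme never undoes a move, so it can get stuck in a local optimum; that limitation is what techniques such as simulated annealing are meant to address.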

Multi-threaded Systems
[Figure: execution diagrams for the single-thread case and the multi-thread case.]

Performance Analysis
- Single threaded:
  - Find the longest possible execution path
- Multi-threaded with no synchronization:
  - Find the longest of several execution paths
- Multi-threaded with synchronization:
  - Find the worst-case synchronization conditions

Multi-threaded Performance Analysis
- Synchronization causes the delay along one path to affect the delay along another
- Delay = max(t_a, t_b) + t_d

[Figure: two threads with segment delays t_a, t_b, t_c, and t_d around a synchronization point.]
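A worked instance of the delay formula in Python, with made-up segment delays. The interpretation assumed here is that the segments taking t_a and t_b must both finish before the synchronization point, after which the segment taking t_d runs.

```python
def synchronized_delay(t_a, t_b, t_d):
    """Delay through a synchronization point: wait for the slower side, then t_d."""
    return max(t_a, t_b) + t_d


# Hypothetical delays in clock cycles.
print(synchronized_delay(t_a=3, t_b=5, t_d=2))
```

Even though the path through t_a is short on its own, it inherits the longer t_b delay because it cannot proceed past the synchronization point early.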

Control
- Need to signal between CPU and accelerator
  - Data ready
  - Complete
- Implementations:
  - Shared memory
  - Handshake
- If computation time is very predictable, a simpler communication scheme may be possible

Communication Levels
- Easier to program at application level (send, receive, wait) but difficult to predict
- More difficult to specify at low level
  - Difficult to extract from program, but timing and resources easier to predict

[Figure: a software stack (Application Program, Operating System, I/O driver, I/O bus) beside a hardware stack (Application hardware (custom), I/O driver, I/O bus). The communication levels between them are labeled Send, Receive, Wait; Register reads/writes and Interrupt service; Bus transactions and Interrupts.]

Other Interface Models
- Synchronization through a FIFO
- FIFO can be implemented either in hardware or in software
- Effectively reconfigure hardware (FPGA) to allocate buffer space as needed
- Interrupts used for software version of FIFO

[Figure: data items d1, d2, d3, processes p1, p2, p3, and resources r2, r3 connected to the FPGA through a Control/Data FIFO.]
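FIFO-based synchronization can be illustrated in software with Python's standard `queue` and `threading` modules. In this sketch the main thread plays the CPU and a worker thread stands in for the accelerator; the squaring operation, the `None` end marker, and the function name `run_through_fifo` are placeholders chosen for the example.

```python
import queue
import threading


def run_through_fifo(items, depth=2):
    """Push items through a bounded FIFO to a worker and collect its results."""
    fifo = queue.Queue(maxsize=depth)   # bounded buffer between the two sides
    results = []

    def accelerator():
        while True:
            item = fifo.get()           # blocks while the FIFO is empty
            if item is None:            # end-of-stream marker
                break
            results.append(item * item)

    worker = threading.Thread(target=accelerator)
    worker.start()
    for item in items:
        fifo.put(item)                  # blocks while the FIFO is full
    fifo.put(None)
    worker.join()
    return results


print(run_through_fifo([1, 2, 3, 4]))
```

The blocking `put` and `get` calls provide the synchronization: neither side needs to know the other's timing, only that the buffer is neither full nor empty.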

Debugging
- Hard to test a CPU/accelerator system:
  - Hard to control and observe the accelerator without the CPU
  - Software on CPU may have bugs
- Build separate test benches for CPU code, accelerator
- Test integrated system after components have been tested

POLIS Codesign Methodology
[Figure: POLIS flow with blocks labeled Graphical EFSM, ESTEREL, Compilers, CFSMs, Formal Verification, Partitioning, Sw Synthesis, Hw Synthesis, Intfc + RTOS Synthesis, Sw Code + RTOS, Logic Netlist, Simulation, and Rapid prototyping.]

Codesign Finite State Machines
- POLIS uses an FSM model for
  - Uncommitted
  - Synthesizable
  - Verifiable
  - Control-dominated HW/SW specification
- Translators from
  - State diagrams
  - Esterel, ECL, ReactiveJava
  - HDLs
- Into a single FSM-based language

CFSM behavior
- Four-phase cycle:
  1. Idle
  2. Detect input events
  3. Execute one transition
  4. Emit output events
- Software response could take a long time:
  - Unbounded delay assumption
- Need efficient hw/sw communication primitive:
  - Event-based point-to-point communication
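The four-phase cycle can be sketched as a toy Python class. This is a heavily simplified illustration, not the POLIS implementation: events are plain strings, the transition table is a dictionary, detected events that match no transition are simply dropped, and the class name `Cfsm` and the example transitions are made up.

```python
class Cfsm:
    """Toy codesign FSM: idle, detect inputs, run one transition, emit outputs."""

    def __init__(self, transitions, state):
        # transitions maps (state, input_event) -> (next_state, output_event)
        self.transitions = transitions
        self.state = state
        self.pending = []

    def post(self, event):
        """Deliver an input event; it waits until the next step."""
        self.pending.append(event)

    def step(self):
        # Phase 1: idle while there are no input events.
        if not self.pending:
            return []
        # Phase 2: detect (snapshot and clear) the input events.
        detected, self.pending = self.pending, []
        # Phase 3: execute at most one enabled transition.
        for event in detected:
            key = (self.state, event)
            if key in self.transitions:
                self.state, output = self.transitions[key]
                # Phase 4: emit the output event.
                return [output]
        return []


machine = Cfsm({("idle", "C"): ("busy", "G"), ("busy", "B"): ("idle", "A")}, "idle")
machine.post("C")
print(machine.step(), machine.state)
```

Because `step` may be called at any later time after `post`, the sender never relies on how long the receiver takes to react, which mirrors the unbounded delay assumption.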

Network of CFSMs
- Globally Asynchronous, Locally Synchronous (GALS) model

[Figure: three machines CFSM1, CFSM2, and CFSM3 exchanging events A, B, C, F, and G, with transitions labeled C=>F, C=>G, C=>A, C=>B, B=>C, (A==0)=>B, and F^(G==1).]

Summary
- Hardware/software codesign is complicated and limited by performance estimates
- Algorithms are generally not as good as human partitioning
- Other interesting issues include dual processors and special memory interfaces
- Will likely evolve at a faster rate as compilers evolve