CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #21 – HW/SW Codesign
Lect-21.2, CprE 583 – Reconfigurable Computing, November 2, 2006
Quick Points: Midterm graded and returned. Average: 84.5. Median: 85.0. Maximum: 95.0. Minimum: 72.0. Standard deviation: 6.65.
HW #4 Discussion: Problem 1: did just a simple adder work? Problem 2: how did you implement the permutation table? Problem 3: did you use a counter?
AES-128E Algorithm [Figure: flowchart. The 128-bit key feeds KeyExpansion. The 128-bit plaintext enters the Round Transformation (SubBytes, ShiftRows, MixColumns, AddRoundKey). The round counter is incremented (round++) and tested (round = 10?): No repeats the round, Yes produces the 128-bit ciphertext.]
Overview of AES (cont.): The 128-bit input is copied into a two-dimensional (4x4) byte array referred to as the state. Round transformations operate on the state array. The final state is copied back into the 128-bit output. AES makes use of a non-linear substitution function that operates on a single byte; it can be simplified as a look-up table (S-box). [Figure: 16x16 S-box table of hex bytes, indexed by hex digits x (rows) and y (columns), each 0 to f; the individual entries are not recoverable from the extraction.]
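The S-box table does not have to be typed in by hand. A minimal sketch, assuming the standard AES definition of the S-box (multiplicative inverse in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1, followed by the affine transform with constant 0x63). The names gf_mul, gf_inv, and build_sbox are illustrative and do not come from the lecture.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial (0x11b)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B  # reduce by the low byte of the AES polynomial
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8); 0 maps to 0 by convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):  # a^254 == a^-1 because the group order is 255
        result = gf_mul(result, a)
    return result


def build_sbox():
    """Return the 256-entry S-box as a list indexed by the input byte."""
    sbox = []
    for x in range(256):
        inv = gf_inv(x)
        s = inv
        for shift in range(1, 5):
            # XOR in the inverse rotated left by 1, 2, 3, and 4 bits
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()
```

A table built this way can be emitted as a VHDL constant array, which is the form a ROM-style S-box would normally take in hardware.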
AES-128E Modules: SubBytes. The S-box transformation is performed independently on each byte of the state. [Figure: state[i] to state'[i]; each byte S r,c of the 4x4 state passes through the S-box to give S' r,c.]
AES-128E Modules: ShiftRows. Bytes in the last three rows of the state are shifted cyclically over different offsets: row 0 is unchanged, and rows 1, 2, and 3 are rotated left by 1, 2, and 3 bytes. [Figure: state[i] to state'[i]; row 1 becomes S 1,1 S 1,2 S 1,3 S 1,0, row 2 becomes S 2,2 S 2,3 S 2,0 S 2,1, row 3 becomes S 3,3 S 3,0 S 3,1 S 3,2.]
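The row rotation shown in the ShiftRows figure can be sketched in a few lines. This is an illustrative software model, assuming the state is stored as a list of four rows of four entries; shift_rows is a hypothetical name.

```python
def shift_rows(state):
    """Rotate row r of a 4x4 state left by r positions (row 0 is unchanged)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


# Label each cell with its (row, column) so the movement is easy to see.
example = [[(r, c) for c in range(4)] for r in range(4)]
shifted = shift_rows(example)
```

In hardware this step costs no logic at all, since it is only a fixed rewiring of the state bytes.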
AES-128E Modules: MixColumns. Modular polynomial-basis multiplication is performed on each column of the state. It can be simplified as a series of AND and XOR operations. [Figure: state[i] to state'[i]; one column S 0,1 S 1,1 S 2,1 S 3,1 is multiplied by the constants {02h} and {03h} to produce S' 0,1 S' 1,1 S' 2,1 S' 3,1.]
MixColumns Implementation:

    library ieee;
    use ieee.std_logic_1164.all;
    -- STATEtype, RNUMtype, and Nb come from a project package that the slide does not show.

    entity MixColumns is
      port (STATE_IN  : in  STATEtype;
            RNUM_IN   : in  RNUMtype;
            STATE_OUT : out STATEtype);
    end MixColumns;

    architecture behavior of MixColumns is
      signal tSTATE : STATEtype;
    begin
      process(STATE_IN)
        variable t1, t2 : std_logic_vector(7 downto 0);
      begin
        for i in 0 to 3 loop
          for j in 0 to Nb-1 loop
            -- Multiply by 2: shift left, then reduce if the top bit was set
            t1 := STATE_IN(i mod 4)(j)(6 downto 0) & '0';
            if (STATE_IN(i mod 4)(j)(7) = '1') then
              t1 := t1 xor x"1b";
            end if;
            -- Multiply by 3: multiply by 2, then XOR with the original byte
            t2 := STATE_IN((i+1) mod 4)(j)(6 downto 0) & '0';
            if (STATE_IN((i+1) mod 4)(j)(7) = '1') then
              t2 := t2 xor x"1b";
            end if;
            t2 := t2 xor STATE_IN((i+1) mod 4)(j);
            tSTATE(i)(j) <= t1 xor t2 xor STATE_IN((i+2) mod 4)(j)
                            xor STATE_IN((i+3) mod 4)(j);
          end loop;
        end loop;
      end process;
      -- Assumed completion: the slide is cut off before STATE_OUT is driven
      -- and does not show how RNUM_IN is used.
      STATE_OUT <= tSTATE;
    end behavior;
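A software reference model is useful for checking the VHDL in simulation. The sketch below mirrors the multiply-by-2 and multiply-by-3 steps of the listing for a single column; xtime and mix_column are illustrative names, and the test column is a commonly used MixColumns check value rather than something from the lecture.

```python
def xtime(b):
    """Multiply a byte by {02} in GF(2^8): shift left, reduce if bit 8 is set."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF


def mix_column(col):
    """Apply MixColumns to one 4-byte column, using the same indexing as the VHDL."""
    out = []
    for i in range(4):
        t1 = xtime(col[i])                               # {02} * s[i]
        t2 = xtime(col[(i + 1) % 4]) ^ col[(i + 1) % 4]  # {03} * s[i+1]
        out.append(t1 ^ t2 ^ col[(i + 2) % 4] ^ col[(i + 3) % 4])
    return out


result = mix_column([0xDB, 0x13, 0x53, 0x45])
```

Driving the same columns through a testbench and comparing against this model is one way to catch indexing mistakes in the nested loops.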
AES-128E Modules: AddRoundKey. Words from the round-specific key are XORed into columns of the state. [Figure: state[i] to state'[i]; the round key Rkey[i] consists of words w[0], w[1], w[2], w[3], and word w[1] is XORed into column S 0,1 S 1,1 S 2,1 S 3,1 to give S' 0,1 S' 1,1 S' 2,1 S' 3,1.]
AddRoundKey Implementation:

    library ieee;
    use ieee.std_logic_1164.all;
    -- STATEtype, KEYtype, and Nb come from a project package that the slide does not show.

    entity AddRoundKey is
      port (STATE_IN  : in  STATEtype;
            KEY_IN    : in  KEYtype;
            STATE_OUT : out STATEtype);
    end AddRoundKey;

    architecture behavior of AddRoundKey is
    begin
      process(STATE_IN, KEY_IN)
      begin
        -- Column j of the state is XORed with the four bytes of key word j
        for j in 0 to (Nb-1) loop
          STATE_OUT(0)(j) <= STATE_IN(0)(j) xor KEY_IN(j)(31 downto 24);
          STATE_OUT(1)(j) <= STATE_IN(1)(j) xor KEY_IN(j)(23 downto 16);
          STATE_OUT(2)(j) <= STATE_IN(2)(j) xor KEY_IN(j)(15 downto 8);
          STATE_OUT(3)(j) <= STATE_IN(3)(j) xor KEY_IN(j)(7 downto 0);
        end loop;
      end process;
    end behavior;
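The byte slicing in the AddRoundKey listing can be modeled in software as follows. This is a sketch, assuming the state is a list of four rows and each key word is a 32-bit integer whose most significant byte belongs to row 0, as in the VHDL; add_round_key is an illustrative name and the key words are arbitrary example values.

```python
def add_round_key(state, key_words):
    """XOR byte r of key word c into state[r][c], matching the VHDL slices."""
    return [
        [state[r][c] ^ ((key_words[c] >> (24 - 8 * r)) & 0xFF) for c in range(4)]
        for r in range(4)
    ]


key_words = [0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C]
zero_state = [[0] * 4 for _ in range(4)]
keyed = add_round_key(zero_state, key_words)
```

Because XOR is its own inverse, applying the same round key twice returns the original state, which makes a convenient sanity check.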
AES-128E Modules: KeyExpansion. The initial 128-bit key is converted into separate keys for each of the 10 required rounds. The conversion consists of S-box transformations and some XORs. [Figure: the 128-bit key expands into Rkey[1] through Rkey[10]; words w[0] to w[3] produce w[4] to w[7] through four S-box lookups (S), the round constant rcon, and XORs.]
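The key schedule can be sketched in software as a reference for the hardware module. This is a minimal model assuming the standard AES-128 key expansion (RotWord, SubWord, and a round constant applied to every fourth word); the function names are illustrative, and the S-box is rebuilt here so that the block stands alone.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial (0x11b)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return result


def build_sbox():
    """Inverse in GF(2^8) followed by the AES affine transform."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):  # x^254 == x^-1
                inv = gf_mul(inv, x)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def sub_word(word):
    """Apply the S-box to each of the four bytes of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def rot_word(word):
    """Rotate a 32-bit word left by one byte."""
    return ((word << 8) | (word >> 24)) & 0xFFFFFFFF


def expand_key_128(key):
    """Expand a 16-byte key into 44 words: w[0..3] is the key, w[4..43] the round keys."""
    if len(key) != 16:
        raise ValueError("AES-128 needs a 16-byte key")
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 0x02)  # next round constant
        w.append(w[i - 4] ^ temp)
    return w


words = expand_key_128(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
```

Round key i is then words[4*i] through words[4*i + 3]. Whether these are computed once and stored or generated on the fly alongside the rounds is a separate hardware design choice.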
Design Decisions: Online/offline key generation. Inter-round layout decisions: round unrolling, round pipelining. Intra-round layout decisions: transformation pipelining, transformation partitioning. Technology mapping decisions: S-box synthesis as Block SelectRAM, distributed ROM primitives, or logic gates.
Round Unrolling / Pipelining: Unrolling replaces a loop body (round) with N copies of that loop body. The AES-128E algorithm is a loop that iterates 10 times, so N ∈ [1, 10]. N = 1 corresponds to the original looping case; N = 10 is a fully unrolled implementation. Pipelining is a technique that increases the number of blocks of data that can be processed concurrently. Pipelining in hardware can be implemented by inserting registers. Unrolled rounds can be split into a certain number of pipeline stages. These transformations will increase throughput but will also increase area and latency.
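A back-of-the-envelope model helps compare unrolling factors. The sketch below is idealized and rests on assumptions that are not in the lecture: every round has the same combinational delay, pipeline registers and loop control cost nothing, the unrolling factor divides the round count, and blocks are independent so the pipeline can be kept full. Under those assumptions unrolling alone gives no gain, and the speedup comes from pipelining the unrolled rounds. The name throughput_gbps is illustrative.

```python
def throughput_gbps(unroll, pipelined, round_delay_ns, block_bits=128, rounds=10):
    """Idealized throughput in Gbit/s (bits per nanosecond)."""
    if unroll < 1 or rounds % unroll != 0:
        raise ValueError("unroll factor must divide the round count")
    # With one register stage per unrolled round, `unroll` blocks are in flight.
    blocks_in_flight = unroll if pipelined else 1
    # Every block still passes through `rounds` round delays in total.
    latency_ns = rounds * round_delay_ns
    return block_bits * blocks_in_flight / latency_ns


looped = throughput_gbps(1, False, 5.0)
half_pipelined = throughput_gbps(5, True, 5.0)
fully_pipelined = throughput_gbps(10, True, 5.0)
```

Real designs deviate from this model: register setup time and routing lengthen the clock period, which is part of the latency cost of these transformations.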
Round Unrolling / Pipelining (cont.) [Figure: the input plaintext passes through rounds R1 to R10 to the output ciphertext, drawn for unrolling factors 1, 2, 5, and 10 with round pipelining on.]
Transformation Partitioning/Pipelining: The FPGA maximum clock frequency depends on the critical logic path. Inter-round transformations can’t improve the critical path. Individual transformations can be pipelined with registers, similar to the rounds. Transformations that are part of the maximum delay path can be partitioned and pipelined as well. This can result in large gains in throughput with only minimal area increases.
Partitioning / Pipelining (cont.) [Figure: with transformation pipelining on, registers separate SubBytes, ShiftRows, MixColumns, AddRoundKey, and KeyExpansion; with transformation partitioning on, KeyExpansion is split into KeyExpansion A, KeyExpansion B, and KeyExpansion C.]
S-box Technology Mapping: With synthesis primitives, the S-box lookup tables can be mapped to different hardware components. Two S-boxes can fit on a single Block SelectRAM. Sample VHDL code:

    constant SSYNROMSTYLE : string := "select_rom";  -- {logic, select_rom}

    entity Sbox is
      port (BYTE_IN  : in  std_logic_vector(7 downto 0);
            BYTE_OUT : out std_logic_vector(7 downto 0));
      attribute syn_romstyle : string;
      attribute syn_romstyle of BYTE_OUT : signal is SSYNROMSTYLE;
    end Sbox;
    ...
Recap – Retiming
Recap – Retiming (cont.): weight'(e) = weight(e) + lag(head(e)) - lag(tail(e)), where weight(e) is the number of registers on edge e before retiming and weight'(e) is the number after retiming.
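The retiming rule can be applied mechanically to a small graph. In this sketch each edge is a (tail, head) pair that maps to its register count, and lag assigns an integer to every node; the data layout and the name retime are assumptions made for illustration.

```python
def retime(edges, lag):
    """Apply weight'(e) = weight(e) + lag(head(e)) - lag(tail(e)) to every edge."""
    retimed = {}
    for (tail, head), weight in edges.items():
        new_weight = weight + lag[head] - lag[tail]
        if new_weight < 0:
            raise ValueError(f"edge {tail}->{head} would need a negative register count")
        retimed[(tail, head)] = new_weight
    return retimed


# A three-node loop with both registers bunched on the edge c -> a.
edges = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 2}
lag = {"a": 0, "b": 1, "c": 1}
retimed = retime(edges, lag)
```

Here one register moves from edge c -> a to edge a -> b. The total number of registers around the loop is unchanged, since the lag terms cancel along any cycle.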
Retiming and Pipelining: Retiming can be used to pipeline a circuit. Assume there is a sufficient (effectively infinite) supply of registers at the edge of the circuit, then retime them into the circuit. See [WeaMar03A] for details.
Recap – Retiming and Covering
Outline: HW #4 Discussion. Recap. HW/SW Codesign: Motivation, Specification, Partitioning, Automation.
Hardware/Software Codesign: Definition 1: the concurrent and co-operative design of hardware and software components of an embedded system. Definition 2: a design methodology supporting the cooperative and concurrent development of hardware and software (co-specification, co-development, and co-verification) in order to achieve shared functionality and performance goals for a combined system [MicGup97A].
Motivation: It is not possible to put everything in hardware due to limited resources. Some code is more appropriate for sequential implementation. It is desirable to allow for both parallelization and serialization. It is possible to modify existing compilers to perform the task.
Why put CPUs on FPGAs? Shrink a board to a chip. What CPUs do best: irregular code; code that takes advantage of a highly optimized datapath. What FPGAs do best: data-oriented computations; computations with local control.
Computational Model: Most recent work addressing this problem assumes a relatively slow bus interface. In this model, the FPGA has a direct interface to memory. [Figure: general-purpose processor, FPGA, and memory connected by a memory bus.]
Hardware/Software Partitioning [Figure: the code fragment if (foo < 8) { for (i=0; i<N; i++) x[i] = y[i]*z[i]; } is divided between the CPU and the HW accelerator.]
Methodology: Separation between function and communication. A unified, refinable formal specification model: facilitates system specification; is implementation independent; eases HW/SW trade-off evaluation and partitioning. From a more practical perspective: measure the application; identify what to put onto the accelerator; build interfaces.
System-Level Methodology [Figure: flow with stages Informal Specification and Constraints, System model, Architecture design, HW/SW implementation, Prototype, Test, and Implementation; the Test step branches on Fail and Success; Component profiling and Performance evaluation are also part of the flow.]
Concurrency: Concurrent applications provide the most speedup. [Figure: the CPU runs if (a > b)... while the accelerator runs x[i] = y[i] * z[i]; there are no data dependencies between the two.]
Partitioning: The application can be divided into several processes that run concurrently. Process partitioning exposes opportunities for parallelism. [Figure: Process 1, Process 2, and Process 3, with the code fragments if (i>b) …, for (i=0; i<N; i++) …, and for (j=0; j<N; j++)...]
Automating System Partitioning: A good partitioning mechanism: 1) minimizes communication across the bus; 2) allows parallelism, with both the hardware (FPGA) and the processor operating concurrently; 3) keeps the processor near peak utilization at all times (performing useful work). [Figure: flow diagram with stages Capture, Model, Partition, and Synthesize and targets Processor, FPGA, and Interface. The Specification example reads: process (a, b, c) in port a, b; out port c; { read(a); … write(c); }. A second code fragment reads: Line () { a = … … detach }.]
Partitioning Algorithms: Assume everything is initially in software. Select a task for swapping. Migrate it to hardware and evaluate the cost: timing, hardware resources, program and data storage, synchronization overhead. Cost evaluation and move evaluation are similar to what we’ve seen regarding mincut and simulated annealing. [Figure: a list of tasks, each assigned to either Software or Hardware.]
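The migrate-and-evaluate loop can be illustrated with a deliberately simple greedy pass, which stands in for the mincut or simulated-annealing style move evaluation named on the slide. Everything in this sketch is an assumption made for illustration: the Task fields, the cost model (tasks run one after another, and a hardware task pays a fixed communication cost), and the single area budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    sw_time: float    # execution time if the task stays in software
    hw_time: float    # execution time if the task moves to hardware
    hw_area: int      # hardware resources the task would occupy
    comm_time: float  # synchronization and data-transfer overhead in hardware


def total_time(tasks, in_hw):
    """Total time under the assumed sequential execution model."""
    return sum(
        (t.hw_time + t.comm_time) if t.name in in_hw else t.sw_time for t in tasks
    )


def greedy_partition(tasks, area_budget):
    """Start with everything in software; move the best-gain task until none helps."""
    in_hw, used_area = set(), 0
    while True:
        best, best_gain = None, 0.0
        for t in tasks:
            if t.name in in_hw or used_area + t.hw_area > area_budget:
                continue
            gain = t.sw_time - (t.hw_time + t.comm_time)
            if gain > best_gain:
                best, best_gain = t, gain
        if best is None:
            return in_hw
        in_hw.add(best.name)
        used_area += best.hw_area


tasks = [
    Task("fir", sw_time=100, hw_time=10, hw_area=60, comm_time=5),
    Task("ctrl", sw_time=20, hw_time=18, hw_area=30, comm_time=10),
    Task("fft", sw_time=80, hw_time=20, hw_area=50, comm_time=10),
]
small = greedy_partition(tasks, area_budget=100)
large = greedy_partition(tasks, area_budget=120)
```

The control task never migrates because its communication overhead outweighs its speedup. A greedy pass like this can get stuck in local minima, which is one reason annealing-style move evaluation is used instead.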
Multi-threaded Systems [Figure: single-thread execution compared with multi-thread execution.]
Performance Analysis: Single-threaded: find the longest possible execution path. Multi-threaded with no synchronization: find the longest of several execution paths. Multi-threaded with synchronization: find the worst-case synchronization conditions.
Multi-threaded Performance Analysis: Synchronization causes the delay along one path to affect the delay along another. [Figure: execution paths with segment delays t_a, t_b, t_c, and t_d and a synchronization point.] Delay = max(t_a, t_b) + t_d
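A short worked example makes the synchronization formula concrete. The sketch assumes that the segments with delays t_a and t_b must both finish before the segment with delay t_d can start; the function name and the numbers are illustrative.

```python
def delay_after_sync(t_a, t_b, t_d):
    """Delay = max(t_a, t_b) + t_d: the later arrival at the sync point dominates."""
    return max(t_a, t_b) + t_d


fast_partner = delay_after_sync(t_a=4, t_b=3, t_d=2)  # the t_a segment arrives last
slow_partner = delay_after_sync(t_a=4, t_b=9, t_d=2)  # the t_b segment arrives last
```

In the second case the t_d segment is unchanged, yet the total grows from 6 to 11 because the other path reaches the synchronization point later.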
Control: The CPU and accelerator need to signal each other: data ready, complete. Implementations: shared memory, handshake. If computation time is very predictable, a simpler communication scheme may be possible.
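The data-ready and complete signals can be mimicked in software with two events and a shared buffer. This is only an analogy, not a hardware interface: a thread stands in for the accelerator, a dictionary stands in for shared memory, and the names Accelerator and offload are hypothetical.

```python
import threading


class Accelerator:
    """Software stand-in for an accelerator with a two-signal handshake."""

    def __init__(self):
        self.data_ready = threading.Event()  # CPU -> accelerator
        self.complete = threading.Event()    # accelerator -> CPU
        self.shared = {}                     # stands in for shared memory

    def run(self):
        self.data_ready.wait()               # block until the CPU posts inputs
        y, z = self.shared["y"], self.shared["z"]
        self.shared["x"] = [a * b for a, b in zip(y, z)]
        self.complete.set()                  # signal that the result is valid


def offload(y, z):
    """Write inputs, raise data_ready, wait for complete, then read the result."""
    acc = Accelerator()
    worker = threading.Thread(target=acc.run)
    worker.start()
    acc.shared["y"], acc.shared["z"] = y, z
    acc.data_ready.set()
    acc.complete.wait()
    worker.join()
    return acc.shared["x"]


product = offload([1, 2, 3], [4, 5, 6])
```

If the accelerator's latency were fixed and known, the CPU could skip the complete signal and read the result after a fixed delay, which is the simpler scheme the slide alludes to.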
Communication Levels: It is easier to program at the application level (send, receive, wait), but behavior is difficult to predict. It is more difficult to specify at the low level; it is difficult to extract from the program, but timing and resources are easier to predict. [Figure: the software stack (Application Program, Operating System, I/O driver, I/O bus) faces the hardware stack (custom Application hardware, I/O driver, I/O bus). The levels communicate through Send, Receive, Wait; Register reads/writes and Interrupt service; Bus transactions and Interrupts.]
Other Interface Models: Synchronization through a FIFO. The FIFO can be implemented either in hardware or in software. Effectively reconfigure the hardware (FPGA) to allocate buffer space as needed. Interrupts are used for the software version of the FIFO. [Figure: data items d1, d2, d3, processes p1, p2, p3, and r2, r3, connected to the FPGA through a Control/Data FIFO.]
Debugging: It is hard to test a CPU/accelerator system: it is hard to control and observe the accelerator without the CPU, and the software on the CPU may have bugs. Build separate test benches for the CPU code and the accelerator. Test the integrated system after the components have been tested.
POLIS Codesign Methodology [Figure: flow with Graphical EFSM and ESTEREL inputs, Compilers, CFSMs, Formal Verification, Partitioning, Sw Synthesis (Sw Code + RTOS), Hw Synthesis (Logic Netlist), Intfc + RTOS Synthesis, Simulation, and Rapid prototyping.]
Codesign Finite State Machines: POLIS uses an FSM model for uncommitted, synthesizable, verifiable, control-dominated HW/SW specification. Translators from state diagrams, Esterel, ECL, ReactiveJava, and HDLs convert into a single FSM-based language.
CFSM behavior: Four-phase cycle: 1. Idle. 2. Detect input events. 3. Execute one transition. 4. Emit output events. A software response could take a long time: unbounded delay assumption. An efficient hw/sw communication primitive is needed: event-based point-to-point communication.
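The four-phase cycle can be illustrated with a toy model. This is a simplified sketch and not the POLIS semantics: each machine keeps a set of pending input events, a step takes a snapshot of them, fires at most one matching transition, and forwards its output events over point-to-point links. The class name, the transition-table layout, and the choice to drop unmatched events are all assumptions.

```python
class CFSM:
    """Toy codesign FSM with event-based, point-to-point communication."""

    def __init__(self, name, state, transitions):
        # transitions maps (state, event) -> (next_state, [output events])
        self.name = name
        self.state = state
        self.transitions = transitions
        self.pending = set()  # input events that have arrived but are not yet seen
        self.links = []       # (event, receiver) point-to-point connections

    def connect(self, event, receiver):
        self.links.append((event, receiver))

    def post(self, event):
        self.pending.add(event)

    def step(self):
        # Phase 1: idle until at least one input event is present.
        if not self.pending:
            return False
        # Phase 2: detect input events (snapshot and clear the buffer).
        events = sorted(self.pending)
        self.pending.clear()
        # Phase 3: execute one transition enabled by the snapshot.
        for event in events:
            key = (self.state, event)
            if key in self.transitions:
                self.state, outputs = self.transitions[key]
                break
        else:
            return False  # no transition enabled; events are dropped in this sketch
        # Phase 4: emit output events to the connected receivers.
        for out in outputs:
            for event, receiver in self.links:
                if event == out:
                    receiver.post(out)
        return True


cfsm1 = CFSM("CFSM1", "s0", {("s0", "A"): ("s1", ["B"])})
cfsm2 = CFSM("CFSM2", "t0", {("t0", "B"): ("t1", [])})
cfsm1.connect("B", cfsm2)

idle_result = cfsm2.step()   # nothing pending yet, so CFSM2 stays idle
cfsm1.post("A")
fired_1 = cfsm1.step()       # CFSM1 reacts to A and emits B
fired_2 = cfsm2.step()       # CFSM2 reacts to B at its own pace
```

Each machine is stepped independently, so the receiver may react arbitrarily later than the sender, which is the unbounded delay assumption in miniature.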
Network of CFSMs: Globally Asynchronous, Locally Synchronous (GALS) model. [Figure: CFSM1, CFSM2, and CFSM3 exchange events A, B, C, F, and G; transition labels include C=>G, C=>F, B=>C, F^(G==1), (A==0)=>B, C=>A, and C=>B.]
Summary: Hardware/software codesign is complicated and limited by performance estimates. Partitioning algorithms are generally not as good as human partitioning. Other interesting issues include dual processors and special memory interfaces. The field will likely evolve at a faster rate as compilers evolve.