Reconfigurable Computing
Architecture, Algorithms and Applications
Lecture 01: Introduction
Prof. Sherief Reda

Methods for executing algorithms
Hardware (Application Specific Integrated Circuits)
  Advantages: very high performance and efficiency
  Disadvantages: not flexible (can't be altered after fabrication); expensive
Reconfigurable computing
  Advantages: fills the gap between hardware and software; much higher performance than software; higher level of flexibility than hardware
Software-programmed processors
  Advantages: software is very flexible to change
  Disadvantages: performance can suffer if the clock is not fast; the instruction set is fixed by the hardware
S. Reda EN2911X FALL’07

Temporal vs. spatial based computing
Temporal-based execution (software)
Spatial-based execution (reconfigurable computing)
The ability to extract parallelism (or concurrency) from algorithm descriptions is the key to acceleration using reconfigurable computing.

Reconfigurable devices
Field-Programmable Gate Arrays (FPGAs) are one example of reconfigurable devices.
An FPGA consists of an array of programmable logic blocks whose functionality is determined by programmable configuration bits.
The logic blocks are connected by a set of routing resources that are also programmable.
Custom logic circuits can be mapped to the reconfigurable fabric.

Configuring FPGAs [Maxfield’04]
FPGAs can be dynamically reprogrammed before runtime or during runtime (virtual hardware).
Reconfiguration can be full or partial.

Uses of reconfigurable devices
Low/medium volume IC production
Early prototyping and logic emulation
Accelerating algorithms in reconfigurable computing environments [Compton’02]
  Reconfigurable functional units within a host processor (custom instructions)
  Reconfigurable units used as coprocessors
  Reconfigurable units that are accessed through external I/O or a network

Current problems with conventional computing
Intel VP Patrick Gelsinger (ISSCC 2001): “If scaling continues at present pace, by 2005, high speed processors would have power density of nuclear reactor, by 2010, a rocket nozzle, and by 2015, surface of sun.”
Technology scaling doubled the number of devices in an IC (processors, FPGAs, etc.) every 2-3 years.
Scaling also provided devices with reduced delay → frequency doubling (with aggressive pipelining) → increased power density.
Increases in clock frequency have slowed down (or stopped); the available devices are instead used to build multi-processor (multi-core) chips.

Why is reconfigurable computing more relevant these days?
Demand for high-performance computation is soaring: large-scale optimization problems, physics and earth simulation, bioinformatics, signal processing (e.g. HDTV), etc.
Why are software-programmed processors no longer attractive?
  Temporal execution of instructions is no longer getting faster
  General-purpose multi-core processors require coarse-grain thread-level parallelism
Why are reconfigurable fabrics currently attractive?
  Increased integration densities allow large FPGAs that can implement substantial functions
  They provide the spatial computational resources required to implement massively parallel computations directly in hardware

Topics that will be covered in this class…
(entry survey time)

Topic 01: Programmable logic technology overview
Programming information can be stored in SRAM.
A 4-input Look-Up Table (LUT) is the typical size.

Topic 01: Programmable logic technology overview
Switch box

Topic 02: Reconfigurable computing methodologies
System specification → partitioning into software and hardware
  Software: compile for target processor
  Hardware: synthesis (compilation) → mapping (placement & routing) → configuration data

Topic 03: Hardware programming languages (Verilog)
Verilog is a hardware description language used to model digital systems.
Similar in syntax to C.
Differs from conventional programming languages in that the execution of statements is not strictly linear: it is possible to have both sequential and concurrent statements.
The language can be synthesized into logic circuits.
    module mux(a, b, select, y);
      input a, b, select;
      output reg y;   // y is assigned in an always block, so it must be a reg
      // re-evaluate whenever any input changes
      always @(a or b or select)
        if (select) y = a;
        else y = b;
    endmodule

Topic 04: Rapid prototyping with Altera DE2 board
No need to design our own board; we will use Altera’s DE2 board and Quartus II software.
Features:
  Cyclone II FPGA (35K LUTs)
  10/100 Ethernet
  RS232
  Video out (VGA 10-bit DAC)
  Video in (NTSC/PAL/multi-format)
  USB 2.0 (type A and type B)
  PS/2 mouse or keyboard port
  Line in/out, microphone in (24-bit Audio CODEC)
  Expansion headers (76 signal pins)
  Infrared port
  Memory: 8-MBytes SDRAM, 512K SRAM, 4-MBytes flash
  SD memory card slot
  Displays: 16 x 2 LCD display, eight 7-segment displays
  Switches and LEDs

Topic 05: High-level synthesis languages (SystemC)
    #include "systemc.h"
    SC_MODULE(adder) {
      sc_in<int> a, b;
      sc_out<int> sum;
      void do_add() { sum = a + b; }
      SC_CTOR(adder) {
        SC_METHOD(do_add);
        sensitive << a << b;   // re-run do_add whenever a or b changes
      }
    };
SystemC is a system description language for hardware/software systems.
SystemC is a set of libraries and macros implemented in C++ that allow specification and simulation of concurrent processes.
Allows high-level description of hardware modules.
A subset of the language can be synthesized into logic circuits. We will use the Celoxica Agility compiler as our synthesis tool.

Topic 06: Algorithm acceleration using reconfigurable computing
Learn how to use FPGAs and reconfigurable computing principles to accelerate algorithms: sorting, dynamic programming, NP-hard problems, etc.
Accelerating applications in various fields:
  Signal and image processing
  Cryptology
  Bioinformatics
  Pattern recognition
  etc.

Topic 07: Soft multi-core computing environments
(Diagram: Nios processor core 1 and Nios processor core 2 connected through a bus to an accelerator and memory)
Learn about hard and soft processors
Design multi-core-based reconfigurable computing systems
Design of on-chip networks for multi-core systems
Design of custom instructions
Design of pluggable acceleration function units

Goals of this class
Learn principles of reconfigurable computing with minimum hardware background
Acquire hands-on experience and useful implementation skills: Verilog / SystemC / Quartus II
Develop/strengthen research skills

Class organization
HW assignments (paper reviews + mini labs): 20%
Class participation: 10%
Midterm: 20%
Class project (progress/final reports and presentation): 50%
Sources: papers, lecture slides, manuals and book chapters.
Class website:

Reconfigurable Computing (EN2911X, Fall07)
Lecture 02: RC Principles: Programmable Logic Technology (1/3) Prof. Sherief Reda Division of Engineering, Brown University

FPGA architecture
Programmable logic element [Maxfield’04]
Objective of this lecture: study the organization of programmable logic blocks and interconnects

Basic logic element (BLE)
[Rose’04] [Maxfield’04]
How many configuration bits does a K-input table need?
How many Boolean functions can a K-input LUT implement?
What is the best LUT size?
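The first two questions have closed-form answers that follow directly from the LUT being a truth table; the worked numbers for K = 4 are added here only as an illustration. The third question has no closed form and is left to the assigned readings.

```latex
\text{configuration bits}(K) = 2^{K},
\qquad
\text{implementable functions}(K) = 2^{2^{K}}
\\[4pt]
K = 4:\quad 2^{4} = 16 \text{ bits},\qquad 2^{16} = 65{,}536 \text{ functions}
```

Each of the 2^K input combinations selects one stored bit, and every distinct setting of those bits is a distinct Boolean function.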

Example [from J. Zambreno]
F = A0A1A3 + A1A2Ā3 + Ā0Ā1Ā2
  Implemented with a 4-input LUT: 16 bits
  Implemented with 3-input LUTs: 24 bits
  Implemented with 2-input LUTs: 28 bits
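As a rough sketch of the 4-input case, a LUT can be modeled as a 16-bit truth table indexed by the inputs. The module names lut4 and lut4_tb, the INIT parameter, and the input ordering {A3, A2, A1, A0} are assumptions made for this illustration and do not come from the slides; the constant 16'h89C1 has ones at addresses 0, 6, 7, 8, 11 and 15, the minterms of F.

```verilog
// Hypothetical model of a 4-input LUT: the 16 configuration bits are the truth table.
module lut4 #(parameter [15:0] INIT = 16'h0000)
             (input [3:0] a, output f);
  wire [15:0] cfg = INIT;   // configuration bits
  assign f = cfg[a];        // the inputs address one configuration bit
endmodule

// Simulation-only testbench comparing the LUT against the Boolean expression.
module lut4_tb;
  reg  [3:0] a;             // a = {A3, A2, A1, A0} (assumed ordering)
  wire f;
  integer i, errors;

  lut4 #(.INIT(16'h89C1)) dut (.a(a), .f(f));

  // Reference: F = A0A1A3 + A1A2A3' + A0'A1'A2'
  function ref_f(input [3:0] x);
    ref_f = (x[0] & x[1] & x[3]) | (x[1] & x[2] & ~x[3]) | (~x[0] & ~x[1] & ~x[2]);
  endfunction

  initial begin
    errors = 0;
    for (i = 0; i < 16; i = i + 1) begin
      a = i;
      #1;                   // let the continuous assignment settle
      if (f !== ref_f(a)) errors = errors + 1;
    end
    $display("mismatches: %0d", errors);
    $finish;
  end
endmodule
```

Changing INIT is the software analogue of reprogramming the LUT: the structure stays the same and only the stored bits differ.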

Logic block clusters (logic array block LAB, configurable logic block CLB)
Assume K-input LUT in each BLE and assume N BLEs per logic cluster The BLEs in each logic clusters are fully connected or “nearly-fully” connected What are the best values for I, K, and N? [Betz-Rose 97]

To implement in FPGAs, designs need to be decomposed and mapped to LBs
Map to a LUT in a CLB [Figure from Cong FPGA’01]

Programmable routing
Wires provide the necessary communication fabric to route the output of one computational node to the inputs of another computational node.
Why is routing more crucial than logic?
  Routing resources occupy a much larger area than logic resources in an FPGA, because circuits typically consume more routing resources for communication
  Wire delay grows quadratically as a function of wire length → avoid using long wires unless necessary
  Technology scaling reduces device delay but increases wire delay
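The quadratic growth can be seen from a first-order distributed RC model of a wire. The symbols below (r and c for resistance and capacitance per unit length, L for wire length, k for the number of buffered segments, and t_buf for the delay of one buffer or switch) are assumptions introduced for this sketch, not notation from the slides.

```latex
t_{\text{wire}} \approx \tfrac{1}{2}\, r\, c\, L^{2}
\\[4pt]
t_{\text{segmented}} \approx k\left(\tfrac{1}{2}\, r\, c \left(\tfrac{L}{k}\right)^{2} + t_{\text{buf}}\right)
 = \frac{r\, c\, L^{2}}{2k} + k\, t_{\text{buf}}
```

Under this model, doubling the length of an unbroken wire roughly quadruples its delay, while breaking it into buffered segments trades the quadratic term for a per-segment buffer cost.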

General routing definitions
(Figure labels: channel, segment, track, CLB)
A wire segment is a wire unbroken by programmable switches.
A track is a sequence of one or more wire segments in a line. The segments can be connected by switches at their ends.
A routing channel is a group of parallel tracks. The channel width is the number of tracks in the channel.

Connection blocks: formed where CLB input or output pins connect to the routing channels
Life would have been easy if only logic blocks within the same column or row need to communicate!

Segment-segment switch design for bidirectional wires
(Figure labels: channel, segment, track, CLB) [Lemieux’04]

Switch blocks: formed wherever horizontal and vertical channels intersect
Switch box size grows quadratically as a function of the number of its input wires

Bidirectional switch details
[Lemieux’04, Tessier]

Segmented and hierarchical routing
Segmented routing
  Short wires accommodate local traffic
  Short wires can be connected together using switch boxes to emulate longer wires
  Long wires are also included to allow efficient communication without passing through switches
Hierarchical routing
  Routing within a “group” of logic blocks occurs at the local level
  Longer hierarchical wires connect different groups

Heterogeneous reconfigurable environments
The reconfigurable fabric might contain non-reconfigurable elements that interface to the logic blocks through the programmable interconnect fabric.
Examples:
  Embedded memory
  Embedded multipliers, adders, MAC
  Embedded processors

Embedded memory blocks
It is costly to implement memory with configurable logic blocks → add hard chunks of RAM blocks.
Position and size vary depending on the FPGA device. Size ranges from a few thousand to tens of thousands of bits per RAM block.
Each block can be used independently or combined with others to form larger RAM blocks.
Blocks can be single-port or dual-port RAMs. [Maxfield’04]

Embedded multipliers and adders
Multipliers are inherently slow if implemented by connecting a large number of programmable logic blocks → add hard-wired multiplier blocks.
Typically located close to the embedded RAM blocks.
Some FPGAs use Multiply-And-Accumulate (MAC) blocks (useful in DSP applications).

Programming the FPGA
Configuration memory determines the programmed behavior of the logic blocks and interconnects

Programmable switch technology
Anti-fuse
  Switch by default is OFF; when programmed it is ON.
  Advantages: negligible delay; small area overhead
  Disadvantages: not really reconfigurable; one-time programmable
SRAM
  An SRAM bit cell stores the programmed state of the device.
  Advantages: can be reconfigured quickly and as often as required; no special fabrication steps
  Disadvantages: takes more area; volatile (the configuration is lost when the device is turned off)
Flash
  Switch by default is ON; when programmed it is OFF.
  Advantages: programming is not lost when the device is turned off
  Disadvantages: requires more manufacturing steps

Assigned readings
Tuesday Sept 18
  Reconfigurable computing: A survey of systems and software. K. Compton & S. Hauck (Sections 1-3)
  The effect of LUT and cluster size on deep-submicron FPGA performance and density. E. Ahmed and J. Rose
Thursday Sept 20
  Altera’s Stratix II vs. Xilinx’s Virtex 4 architecture comparison: logic block organization, interconnect organization, non-reconfigurable components
You have to submit a summary of one or more pages for each paper (main ideas + critique) before midnight of the lecture day. Only use the submission form on the class website. Any summaries submitted after that time will not be looked at!

Next time Case study: Altera’s Cyclone II architecture
Discussion of assigned readings

Lecture 04: Programmable Logic Technology (2/3) Prof. Sherief Reda Division of Engineering, Brown University

Case study: Altera’s Cyclone II device
Two-dimensional array of Logic Array Blocks (LABs), with 16 Logic Elements (LEs) in each LAB.
Embedded memory blocks (M4K) and multipliers (18x18).
Phase-Locked Loops (PLLs) are used to generate clock signals over a range of frequencies.
EP2C35 (in DE2 board) has 60 columns and 45 rows for a total of LEs. 105 M4K blocks and 35 embedded multipliers.

Logic element organization (normal mode)
The LE has two operating modes: normal and arithmetic.
Normal mode is suitable for general logic implementation.
  4-input LUT
  6 input connections
  3 output connections
  LAB-wide synchronous/asynchronous clear and load signals
  Clock signal

Logic element organization (arithmetic mode)
Arithmetic mode is suitable for implementing adders, counters, accumulators and comparators The LUT is split into two 3-input LUTs (ideal for implementing 2-bit full adders) and basic carry chain

Logic array block organization
Each LAB consists of the following: 16 LEs, LAB control signals, LE carry chains, register chains and local interconnects.
Local interconnects transfer signals between LEs in the same LAB and are driven by column and row interconnects and by LE outputs within the same LAB.
Neighboring LABs, PLLs, M4K RAM and multipliers from the left and right can also drive a LAB’s local interconnect.
Each LE can drive 48 LEs through fast local and direct interconnects.

Register/carry chain connections with a LAB

Multi-track interconnects
Multitrack interconnect consists of row (direct link, R4, R24) and column (register chain, C4, C16) interconnects.
R4/C4 interconnects span 4 blocks (right, left / up, down).
R24/C16 interconnects span 24/16 blocks and connect to R4/C4 interconnects.
R4/C4 interconnects can drive each other to extend their range.

C4 interconnections
C4 interconnects drive local and R4 interconnects for up to 4 rows.
C16 column interconnects span 16 LABs and provide long column connections.
C16 column interconnects indirectly drive LAB local interconnects via C4 and R4 interconnects.

Embedded RAMs and multipliers
Embedded multipliers
  Ideal for DSP applications
  250 MHz performance
  Each is configured either as one 18-bit multiplier or as two independent 9-bit multipliers
Embedded RAMs (M4K)
  4608 RAM bits (with or without parity)
  250 MHz performance
  Either single-port or dual-port memory
  Can also be configured as a FIFO

IO Element (IOE) structure
The IOE allows bidirectional signals.
5 IOEs per row I/O block.
Row I/O blocks drive C4, R4, R24 or direct link interconnects.
Column I/O blocks drive C4 and C16 interconnects.

Reconfigurable Computing (EN2911X, Fall07)
Lecture 05: Verilog (1/3) Prof. Sherief Reda Division of Engineering, Brown University

Introduction to Verilog
What are the advantages of Hardware Description Languages (HDLs)?
Verilog is an HDL similar in syntax to C.
Verilog is case sensitive.
Many online textbooks are available from the Brown library:
  Verilog digital system design
  Verilog quickstart
  The Verilog hardware description language
Lecture examples are from “Verilog HDL” by S. Palnitkar.
Reconfigurable Computing S. Reda, Brown University

Verilog modules
    module toggle(q, clk, reset);
      …
      <functionality of module>
    endmodule
(Block diagram: module toggle with ports q, clk and reset)
The internals of each module can be defined at four levels of abstraction:
  Behavioral or algorithmic level
  Dataflow level
  Gate level
  Switch level
Verilog allows different levels of abstraction to be mixed in the same module.

Basic concepts
Comments are designated by // to the end of a line or by /* to */ across several lines.
Number specification: <size>'<base format><number>
  <size> specifies the number of bits in the number
  <base format>: d or D for decimal, h or H for hexadecimal, b or B for binary, o or O for octal
  <number>: the legal digits depend on the base
Examples: 4'b1111  12'habc  16'd235  12'h13x  -6'd3  12'b1111_0000_1010
  X or x: unknown (don't care) value
  Z or z: high impedance
  _ : used for readability
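A small simulation-only sketch that prints the example literals makes the notation concrete. The module name literals_tb and the variable v are hypothetical names chosen for this illustration.

```verilog
// Simulation-only sketch: print the example literals in different bases.
module literals_tb;
  reg [11:0] v;
  initial begin
    v = 12'habc;                  // 12-bit value written in hexadecimal
    $display("%b", v);            // same value shown in binary
    v = 12'b1111_0000_1010;       // underscores are ignored by the tools
    $display("%h", v);            // same value shown in hexadecimal
    v = 12'h13x;                  // the lowest hex digit (four bits) is unknown
    $display("%b", v);
    $display("%0d", 16'd235);     // 16-bit value written in decimal
    $display("%b", -6'd3);        // two's complement of 3 in 6 bits
    $finish;
  end
endmodule
```

Printing the same value in a second base is a quick way to check that the size and base prefix mean what was intended.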

Data types
Nets represent connections between hardware elements. They are continuously driven by the outputs of the devices connected to them, and they are declared using the keyword wire.
    wire a;
    wire b, c;
    wire d = 1'b0;
Registers represent data storage elements. They retain their value until another value is placed onto them. In Verilog, a register is merely a variable that can hold a value; unlike a hardware register, it does not need a clock.
    reg reset;
    initial
    begin
      reset = 1'b1;
      #100 reset = 1'b0;
    end

Data types
A net or register can be declared as a vector. Examples of declarations:
    wire a;
    wire [7:0] bus;
    wire [31:0] busA, busB, busC;
    reg clock;
    reg [0:40] virt_address;
It is possible to address bits or parts of vectors:
    busA[7]
    bus[2:0]
    virt_address[0:2]
Use integer for counting. Example:
    integer counter;
    initial counter = -1;

Data types
Reals
    real delta;
    initial
    begin
      delta = 4e10;
      delta = 2.13;
    end
    integer i;
    initial i = delta;   // the real value is rounded when assigned to an integer
Arrays. It is possible to have arrays of type reg, integer, real.
    integer count[0:7];
    reg [4:0] port_id[0:7];
    integer matrix[4:0][0:255];

Data types
Memories. Used to model register files, RAMs and ROMs. Modeled in Verilog as a one-dimensional array of registers. Examples:
    reg mem1bit[0:1023];        // 1K words of 1 bit each
    reg [7:0] membyte[0:1023];  // 1K words of 8 bits each
    membyte[511]                // the byte stored at address 511
Parameters. Define constants; they cannot be used as variables.
    parameter port_id = 5;
Strings can be stored in a reg. The width of the register variable must be large enough to hold the string.
    reg [8*19:1] string_value;
    initial string_value = "Hello Verilog World";

Modules and ports
    module fulladd4(sum, c_out, a, b, c_in);
      output [3:0] sum;
      output c_out;
      input [3:0] a, b;
      input c_in;
      …
    endmodule
All port declarations (input, output, inout) are implicitly declared as wire. If outputs hold their value, they must be declared as reg.
    module DFF(q, d, clk, reset);
      output reg q;
      input d, clk, reset;
      …
    endmodule

Module declaration (ANSI C style)
    module fulladd4(output reg [3:0] sum,
                    output reg c_out,
                    input [3:0] a, b,
                    input c_in);
      …
    endmodule

Module instantiation
    module Top;
      reg [3:0] A, B;
      reg C_IN;
      wire [3:0] SUM;
      wire C_OUT;
      // one way: connect ports by position
      fulladd4 FA1(SUM, C_OUT, A, B, C_IN);
      // another possible way: connect ports by name
      fulladd4 FA2(.c_out(C_OUT), .sum(SUM), .b(B), .c_in(C_IN), .a(A));
      …
    endmodule
Port connection rules:
  Inputs: externally they can be connected to a reg or a wire; internally they must be wires.
  Outputs: externally they must be connected to wires.
    module fulladd4(sum, c_out, a, b, c_in);
      output [3:0] sum;
      output c_out;
      input [3:0] a, b;
      input c_in;
      …
    endmodule
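To see the instantiation run end to end, the sketch below fills in the elided body of fulladd4 with a one-line adder. That body is an assumption made for illustration only (the slides leave it out), and the stimulus values are arbitrary.

```verilog
// fulladd4 with an assumed body; the slides elide the implementation.
module fulladd4(sum, c_out, a, b, c_in);
  output [3:0] sum;
  output c_out;
  input [3:0] a, b;
  input c_in;
  assign {c_out, sum} = a + b + c_in;   // 5-bit result: carry plus 4-bit sum
endmodule

// Top-level stimulus: regs drive the inputs, wires receive the outputs.
module Top;
  reg [3:0] A, B;
  reg C_IN;
  wire [3:0] SUM;
  wire C_OUT;

  // connect ports by name, so the order does not matter
  fulladd4 FA1(.sum(SUM), .c_out(C_OUT), .a(A), .b(B), .c_in(C_IN));

  initial begin
    A = 4'd9; B = 4'd8; C_IN = 1'b1;    // 9 + 8 + 1 = 18 = 5'b1_0010
    #1 $display("C_OUT=%b SUM=%d", C_OUT, SUM);
    $finish;
  end
endmodule
```

Connecting by name is usually the more robust choice, since a positional connection silently breaks if the port order of the module ever changes.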

Gate level modeling (structural)
    wire Z, Z1, OUT, OUT1, OUT2, IN1, IN2;
    and a1(OUT1, IN1, IN2);
    nand na1(OUT2, IN1, IN2);
    xor x1(OUT, OUT1, OUT2);
    not (Z, OUT);
    buf final (Z1, Z);
All instances are executed concurrently, just as in hardware.
The instance name is not necessary.
The first terminal in the list of terminals is an output and the other terminals are inputs.
Not the most interesting modeling technique for our class.

Array of gate instances
    wire [7:0] OUT, IN1, IN2;
    // array of gate instantiations
    nand n_gate [7:0] (OUT, IN1, IN2);
    // which is equivalent to the following
    nand n_gate0 (OUT[0], IN1[0], IN2[0]);
    nand n_gate1 (OUT[1], IN1[1], IN2[1]);
    nand n_gate2 (OUT[2], IN1[2], IN2[2]);
    nand n_gate3 (OUT[3], IN1[3], IN2[3]);
    nand n_gate4 (OUT[4], IN1[4], IN2[4]);
    nand n_gate5 (OUT[5], IN1[5], IN2[5]);
    nand n_gate6 (OUT[6], IN1[6], IN2[6]);
    nand n_gate7 (OUT[7], IN1[7], IN2[7]);

Dataflow modeling
A module is designed by specifying the data flow: the designer is aware of how data flows between hardware registers and how the data is processed in the design.
The continuous assignment is one of the main constructs used in dataflow modeling.
    assign out = i1 & i2;
    assign addr[15:0] = addr1[15:0] ^ addr2[15:0];
    assign {c_out, sum[3:0]} = a[3:0] + b[3:0] + c_in;
A continuous assignment is always active, and the assignment expression is evaluated as soon as one of the right-hand-side variables changes.
The left-hand side must be a scalar or vector net. Right-hand-side operands can be registers, nets, integers, reals, …

Operator types in dataflow expressions
Operators are similar to C except that there are no ++ or -- operators.
  Arithmetic: *, /, +, -, % and **
  Logical: !, && and ||
  Relational: >, <, >= and <=
  Equality: ==, !=, === and !==
  Bitwise: ~, &, |, ^ and ^~
  Reduction: &, ~&, |, ~|, ^ and ^~
  Shift: <<, >>, >>> and <<<
  Concatenation: { }
  Replication: {{}}
  Conditional: ?:
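The operators that have no direct C counterpart (reduction, concatenation, replication and case equality) are the ones most worth trying in a simulator. The following is a minimal simulation-only sketch; the module name operators_tb and the variables a and y are hypothetical.

```verilog
// Simulation-only sketch of the operators that differ most from C.
module operators_tb;
  reg [3:0] a;
  reg [7:0] y;
  initial begin
    a = 4'b1010;
    $display("%b", &a);                   // reduction AND over all bits of a
    $display("%b", ^a);                   // reduction XOR (parity of a)
    y = {a, 4'b0011};                     // concatenation of two 4-bit values
    $display("%b", y);
    y = {2{a}};                           // replication: a repeated twice
    $display("%b", y);
    $display("%b", 4'b10x1 == 4'b10x1);   // logical equality: x bits make the result x
    $display("%b", 4'b10x1 === 4'b10x1);  // case equality: x bits are compared literally
    $finish;
  end
endmodule
```

The last two lines show why === and !== exist: they are mainly useful in testbenches, where unknown values need to be detected rather than propagated.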

Example
    module mux4(out, i0, i1, i2, i3, s1, s0);
      output out;
      input i0, i1, i2, i3;
      input s1, s0;
      assign out = (~s1 & ~s0 & i0) |
                   (~s1 & s0 & i1) |
                   (s1 & ~s0 & i2) |
                   (s1 & s0 & i3);
      // OR THIS WAY (use one of the two assignments, not both)
      // assign out = s1 ? (s0 ? i3 : i2) : (s0 ? i1 : i0);
    endmodule

Summary Covered an introduction to Verilog
Next time behavioral modeling Lab 0 is ready to warm up I will distribute lab 1 (game implementation) next time

Behavioral or algorithmic modeling
The design is expressed at the algorithmic level, which frees designers from thinking in terms of logic gates or data flow.
Designing at this level is very similar to programming in C.
All algorithmic statements in Verilog can appear only inside two statements: always and initial.
Each always and initial statement represents a separate activity flow in Verilog. Remember that activity flows in Verilog run in parallel.
You can have multiple initial and always statements but you can’t nest them.
    reg a, b, c;
    initial a = 1'b0;
    always @(a)      // event control added so the block waits for a to change
    begin
      b = a ^ 1'b1;
      c = a + b;
    end

initial statements
An initial block starts at time 0, executes exactly once and then never again.
If there are multiple initial blocks, each block starts to execute concurrently at time 0 and each block finishes execution independently of the others.
Multiple behavioral statements must be grouped using begin and end. If there is one statement then grouping is not necessary.
    reg x, y, m;
    initial m = 1'b0;
    initial
    begin
      x = 1'b0;
      y = 1'b1;
    end

always statement
The always statement starts at time 0 and executes the statements in the always block continuously in a looping fashion.
It models a block of activity that is repeated continuously in a digital circuit.
Multiple behavioral statements must be grouped using begin and end. If there is one statement then grouping is not necessary.
    integer count;
    initial count = 0;
    always
    begin
      #1 count = count + 1;   // a delay (or an event control) is needed so that simulation time advances
    end

Events-based timing control
An event is a change in the value of a register or a net. Events can be used to trigger the execution of a statement or a block of statements.
The @ symbol is used to specify an event control.
Statements can be executed on changes in signal value or at a positive (posedge) or negative (negedge) transition of the signal.
    // triggered on any change of clock
    input clock;
    integer count;
    initial count = 0;
    always @(clock)
    begin
      count = count + 1;
    end

    // triggered on the positive edge of clock
    input clock;
    integer count;
    initial count = 0;
    always @(posedge clock)
    begin
      count = count + 1;
    end

    // triggered on a change of either clock
    input clock1, clock2;
    integer count;
    initial count = 0;
    always @(clock1 or clock2)
    begin
      count = count + 1;
    end

Procedural assignments
Procedural assignments update values of reg, integer, or real variables. The value remains unchanged until another procedural assignment updates the variable with a different value → different from dataflow continuous assignments.
Two types of procedural assignments: blocking and nonblocking.
Blocking statements, specified using the = operator, are executed in the order they are specified in a sequential block.
Nonblocking statements, specified using the <= operator, are executed without blocking the statements that follow in a sequential block.
    // nonblocking
    reg x, y;
    initial
    begin
      x <= 1'b1;
      y <= 1'b0;
    end

    // blocking
    reg x, y;
    initial
    begin
      x = 1'b1;
      y = 1'b0;
    end

Uses of nonblocking assignments
If the intention is to swap the contents of a and b, which one of these will work?
    always @(posedge clock)
    begin
      a = b;
      b = a;
    end

    always @(posedge clock)
    begin
      a <= b;
      b <= a;
    end
Nonblocking assignments eliminate the race condition. At the positive edge of clock, the values of all the RHS variables are “read”, the expressions are evaluated, and only then are the results assigned to the LHS.
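The two versions can be compared side by side in a short simulation. This is a minimal sketch; the module name swap_tb and the second register pair c and d are assumptions added so that both styles can run in the same testbench.

```verilog
// Simulation-only sketch comparing the two swap attempts.
module swap_tb;
  reg clock;
  reg a, b;    // exchanged with nonblocking assignments
  reg c, d;    // exchange attempted with blocking assignments

  always @(posedge clock) begin
    a <= b;    // both right-hand sides are sampled first,
    b <= a;    // then both registers update together
  end

  always @(posedge clock) begin
    c = d;     // c is overwritten immediately
    d = c;     // so d just receives its own old value back
  end

  initial begin
    clock = 0; a = 0; b = 1; c = 0; d = 1;
    #5 clock = 1;                        // one rising edge
    #1 $display("nonblocking: a=%b b=%b", a, b);
       $display("blocking:    c=%b d=%b", c, d);
    $finish;
  end
endmodule
```

After the single clock edge the nonblocking pair has exchanged values, while the blocking pair ends up holding two copies of the same value.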

Conditional statements
Very similar to C.
Can appear only inside always and initial blocks.
    if (alu_control == 0)
      y = x + z;
    else if (alu_control == 1)
      y = x - z;
    else if (alu_control == 2)
      y = x * z;
    else
      y = x;

    if (x)
    begin
      y = 1'b1;
      z = 1'b0;
    end

    if (count < 10)
      count = count + 1;
    else
      count = 0;

    reg [1:0] alu_control;
    ..
    case (alu_control)   // case (expression)
      2'd0 : y = x + z;
      2'd1 : y = x - z;
      2'd2 : y = x * z;
      default: y = x;
    endcase

Loops
    integer count;
    integer y = 1;
    integer x = 2;
    initial
      for (count = 0; count < 128; count = count + 1)
      begin
        x <= x + y;
        y <= x;
      end

    initial
    begin
      count = 0;
      while (count < 128)
      begin
        …
        count = count + 1;
      end
    end

    initial
    begin
      count = 0;
      repeat (128)
      begin
        …
        count = count + 1;
      end
    end
The repeat count must be a number or a signal value; it is evaluated only once, at the beginning.

Example: Mux4x1
    module mux4x1(out, i0, i1, i2, i3, s1, s0);
      output out;
      input i0, i1, i2, i3;
      input s1, s0;
      reg out;
      always @(s1 or s0 or i0 or i1 or i2 or i3)
      begin
        case ({s1, s0})
          2'd0: out = i0;
          2'd1: out = i1;
          2'd2: out = i2;
          2'd3: out = i3;
        endcase
      end
    endmodule

DE2 board overview
(Board figure; labeled signals:)
  CLOCK_50
  HEX0[6:0] … HEX7[6:0]
  LEDG[0] … LEDG[8]
  LEDR[0] … LEDR[17]
  SW[0] … SW[17]
  KEY[0] … KEY[3]
Import the given pin assignment file to make things easy for you!

DE2 example: A 1 second blinking light
    module sec (input CLOCK_50, output reg [8:0] LEDG);
      integer count = 0;
      initial LEDG[0] = 1'b0;
      always @(posedge CLOCK_50)
      begin
        count = count + 1;
        // 50,000,000 cycles of the 50 MHz clock take one second
        if (count == 50_000_000)
        begin
          count = 0;
          if (LEDG[0]) LEDG[0] = 1'b0;
          else LEDG[0] = 1'b1;
        end
      end
    endmodule

Lab 1
Please go through the lab0 tutorial to get familiar with the tool and the synthesis environment.
Please check the class webpage for helpful resources.
You are required to form teams (2 students per team). Since there are 11 students enrolled in the class, one team has to be composed of either 3 students or just 1 student.
Deliverables (1st game Oct 4th and 2nd game Oct 9th) include:
  A working design, which will be tested
  Quartus II project files
  Written documentation, which includes:
    Verilog source code with comments
    A report of the amount of logic and routing area utilized in the FPGA
    A snapshot of the final layout of the FPGA as produced by the synthesis tool
    Simple documentation of any additions you volunteered to add to the game

Game 1: Secret Code Grabber AKA Simon
The objective of this game is to memorize a “random” pattern of lights that is displayed on the DE2 board LEDs, and to input it back using the available push buttons or switches. At the beginning, the board should show the user a pattern by lighting one LED at a time for a “short” period, and then the gamer should input the pattern back in the same sequence. After that, the board should display some sign on the 7-segment display to tell the gamer whether his/her input is correct or not, and replay with another “random” pattern. There are two knobs that you can use to make the game harder: the period for which each LED is ON and the length of the pattern. You can either fix those in advance, or change them as the user progresses in playing.
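The slides do not say how the “random” pattern should be produced. One common option, offered here only as a hedged suggestion, is a linear feedback shift register (LFSR). The module name lfsr16, its ports, and the seed value are all hypothetical; the taps (16, 14, 13, 11) are a standard maximal-length choice for 16 bits.

```verilog
// Hypothetical pseudo-random source: 16-bit Fibonacci LFSR, taps 16, 14, 13, 11.
module lfsr16(input clk, input reset, output reg [15:0] q);
  // XOR of the tapped bits becomes the next top bit
  wire feedback = q[0] ^ q[2] ^ q[3] ^ q[5];

  always @(posedge clk) begin
    if (reset)
      q <= 16'hACE1;              // any nonzero seed works; all zeros would lock up
    else
      q <= {feedback, q[15:1]};   // shift right, insert feedback at the top
  end
endmodule
```

One possible way to use it is to let the register free-run on the 50 MHz clock and sample a few bits whenever the player presses a key, so that the human timing adds unpredictability to the sequence.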

Game 2: Catch the ant
In this game we have an ant that continuously traverses the board from left to right and then from right to left. The position of the ant is indicated by the LED that is lit. The ant is quick and stops at each position for a “short” period. The ant also sometimes “randomly” changes its direction, which makes it hard to predict its next location. Your objective is to catch the ant as many times as you can. Each position corresponds to a push button, and you want to press the push button that corresponds to the ant’s position. Every time you correctly catch the ant you score 1 point, and every time you miss you lose 1 point. The score should be displayed on the seven-segment displays.

Game 3: Match the alien symbol
In this game the DE2 board is possessed by some alien. It displays an alien symbol on one of the 7-segment displays and then displays four symbols on four other 7-segment displays. Your objective is to choose (via the push buttons) the number (or location) of the symbol that matches the alien symbol. You have to be quick because the board will allow you only a very “short” time to make your choice. A green LED should light up if you match successfully; otherwise, a red LED should light up.

Behavioral modeling Last lecture we started behavioral modeling
Focus on synthesizable subset of Verilog Understood the difference between always and initial Understood the difference between blocking and nonblocking assignments Explained event-based control timing Explained conditional statements: if and else Explained multi-way branching: case Explained looping statements: while, for and repeat

Hierarchical naming As described, every module instance, signal, or variable is identified with an identifier. Each identifier has a unique place in the design hierarchy. Hierarchical name referencing allows us to denote every identifier in the design hierarchy with a unique name. A hierarchical name is a list of identifiers separated by dots “.” for each level of hierarchy Examples: stimulus.q, stimulus.m1.Q, stimulus.m1.n2

Named blocks
Blocks can be given names.
Local variables can be declared inside named blocks.
Variables in a named block can be accessed by using hierarchical name referencing.
Named blocks can be disabled.
    …
    always
    begin : block1
      integer i;
      i = 1;
    end
    …
    begin : block2
      integer j;
      j = block1.i ^ 1;
    end

Disabling named blocks
The keyword disable provides a way to terminate the execution of a named block.
Disable can be used to get out of loops, handle error conditions, or control execution of pieces of code based on a control signal.
Disabling a block causes execution control to be passed to the statement immediately following the block.
    initial
    begin
      i = 0;
      flag = 8'b0010_0101;
      begin : block1
        while (i < 16)
        begin
          if (flag[i]) disable block1;
          i = i + 1;
        end
      end
    end

Tasks and functions
Often the same functionality is needed in many places in a behavioral design.
Verilog provides tasks and functions to break up large behavioral code into smaller pieces.
Tasks and functions are included in the design hierarchy. Like named blocks, tasks and functions can be addressed by means of hierarchical names.
Tasks have input, output and inout arguments.
Functions have input arguments only.

Tasks
Tasks are declared with the keywords task and endtask.
Tasks can have input, inout, and output arguments to pass values (different than in modules).
Tasks or functions can have local variables but cannot have wires.
Tasks and functions can only contain behavioral statements.
Tasks and functions do not contain always or initial statements but are called from always blocks, initial blocks, or other tasks and functions.
Tasks can operate directly on reg variables defined in the module.
module …
…
always @(A or B or C or D)
begin
  BOP (AB_AND, AB_OR, AB_XOR, A, B);
  BOP (CD_AND, CD_OR, CD_XOR, C, D);
end
task BOP;
  output [15:0] ab_and, ab_or, ab_xor;
  input [15:0] a, b;
  begin
    ab_and = a & b;
    ab_or = a | b;
    ab_xor = a ^ b;
  end
endtask
endmodule

Difference between module and task instantiation
Instantiated modules create duplicate copies in hardware. In contrast, tasks are static in nature: all declared items are statically allocated and shared across all uses of the task.
If a task is called from within a single behavioral block, only one copy is needed.
However, a task or function might be called concurrently from different behavioral blocks, which can lead to incorrect operation → avoid this by using the automatic keyword to make the task re-entrant.
…
always @(A or B or C or D)
begin
  BOP (AB_AND, AB_OR, AB_XOR, A, B);
  BOP (CD_AND, CD_OR, CD_XOR, C, D);
end
task automatic BOP;
  output [15:0] ab_and, ab_or, ab_xor;
  input [15:0] a, b;
  begin
    ab_and = a & b;
    ab_or = a | b;
    ab_xor = a ^ b;
  end
endtask
endmodule

Functions
Functions are typically used for combinational modeling (use them for conversions and commonly used calculations).
A function needs at least one input argument and cannot have output or inout arguments.
A function is invoked by specifying the function name and input arguments; at the end of execution, the return value is placed where the function was invoked.
Functions cannot invoke tasks; they can only invoke other functions.
Recursive functions are not synthesizable.
module …
…
reg [31:0] parity;
always @(addr)
begin
  parity = calc_parity(addr);
end
// the function can also be declared C style
function calc_parity;
  input [31:0] address;
  calc_parity = ^address;
endfunction
endmodule

Example: clock display on DE2
Last lecture we had a simple example of a 1 second blinking LED Let’s generalize it to a clock display minutes HEX3 HEX2 seconds HEX1 HEX0

Task to display digits task digit2sev(input integer digit, output [6:0] disp); begin if (digit == 0) disp = 7'b ; else if (digit == 1) disp = 7'b ; else if (digit == 2) disp = 7'b ; else if (digit == 3) disp = 7'b ; else if (digit == 4) disp = 7'b ; else if (digit == 5) disp = 7'b ; else if (digit == 6) disp = 7'b ; else if (digit == 7) disp = 7'b ; else if (digit == 8) disp = 7'b ; else if (digit == 9) disp = 7'b ; end endtask

One way
module clock(CLOCK_50, HEX0, HEX1, HEX2, HEX3);
  output reg [6:0] HEX0, HEX1, HEX2, HEX3;
  input CLOCK_50;
  integer count=0;
  reg [3:0] d1=0, d2=0, d3=0, d4=0;
  always @(posedge CLOCK_50)
  begin
    count = count + 1;
    if (count == 50_000_000)
    begin
      // one second has elapsed: advance the digits with carry
      count = 0;
      d1 = d1 + 1;
      if (d1 == 10)
      begin
        d1 = 0; d2 = d2 + 1;
        if (d2 == 6)
        begin
          d2 = 0; d3 = d3 + 1;
          if (d3 == 10)
          begin
            d3 = 0; d4 = d4 + 1;
            if (d4 == 6) d4 = 0;
          end
        end
      end
    end
    digit2sev(d1, HEX0);
    digit2sev(d2, HEX1);
    digit2sev(d3, HEX2);
    digit2sev(d4, HEX3);
  end
  task digit2sev; … endtask
endmodule

Resource utilization is 152 LEs

Second code
module clock(CLOCK_50, HEX0, HEX1, HEX2, HEX3);
  input CLOCK_50;
  output reg [6:0] HEX0, HEX1, HEX2, HEX3;
  integer count=0;
  reg [15:0] ticks=16'd0;
  reg [5:0] seconds=6'd0, minutes=6'd0;
  initial display_time(seconds, minutes);
  always @(posedge CLOCK_50)
  begin
    count = count + 1;
    if (count == 50_000_000)
    begin
      count = 0;
      ticks = ticks + 1;
      seconds = ticks % 60;
      minutes = ticks / 60;
      display_time (seconds, minutes);
    end
  end

  task display_time(input [5:0] s, input [5:0] m);
    begin
      digit2sev(s%10, HEX0);
      digit2sev(s/10, HEX1);
      digit2sev(m%10, HEX2);
      digit2sev(m/10, HEX3);
    end
  endtask
  task digit2sev(input integer digit, output [6:0] disp);
    begin
      if (digit == 0) disp = 7'b ;
      else if (digit == 1) disp = 7'b ;
      else if (digit == 2) disp = 7'b ;
      else if (digit == 3) disp = 7'b ;
      else if (digit == 4) disp = 7'b ;
      else if (digit == 5) disp = 7'b ;
      else if (digit == 6) disp = 7'b ;
      else if (digit == 7) disp = 7'b ;
      else if (digit == 8) disp = 7'b ;
      else if (digit == 9) disp = 7'b ;
    end
  endtask
endmodule
[Figure: seven-segment displays HEX3 HEX2 HEX1 HEX0]

Resource utilization for 2nd code
The circuit consumes 611 LEs (2% of the chip logic resources). You have to be careful! Changing ticks, seconds and minutes to integer (at least 32 bits wide, so the divide and modulo logic becomes much wider) increases the area to about 2500 LEs (8% of the chip logic resources)

Lecture 08: RC Principles: Software (1/4) Prof. Sherief Reda Division of Engineering, Brown University

Summary of current status
Past lectures
Understood the principles of the hardware part of reconfigurable computing: programmable logic technology.
Learned how to program reconfigurable fabrics using hardware description languages (Verilog).
Next lectures
Understand the principles of the software part (which we have partly used) of reconfigurable computing.
Learn how to program reconfigurable fabrics using system software languages (SystemC).

Reconfigurable computing design flow
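[Figure: design flow. System specification → partitioning into SW and HW. SW path: compile → link → executable image. HW path: compiling to Verilog → synthesis → mapping → place & route → configuration file. Both results are downloaded to the board. An annotation marks the portion of the flow experienced so far.]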

System specification
Use High-Level Languages (HLLs) (C, C++, Java, MATLAB).
Advantages:
  Since systems consist of both SW and HW, we can describe the entire system with the same specification
  Fast to code, debug and verify that the system is working
Disadvantages:
  No concurrency support
  No notion of time (clock or delay)
  Different communication model than HW (which uses signals)
  Missing data types (e.g., bit vectors, logic values)
How can we overcome these disadvantages?

Using HLL for hardware/software specification
[from G. De Micheli]
Augment the HLL (e.g., C++) with a new library that supports additional hardware-like functionality (e.g., SystemC)
  Unified language across all stages of platform design
  Fast simulation
  There are already lots of tools for C++
  → we will come to this part later in detail
Enable compilers to optimize code and extract concurrency from sequential code to map onto FPGAs

Hardware-Software partitioning
Given a system specification, decompose or partition the specification into tasks (functional objects) and label each task as HW or SW such that the system cost / performance is optimized and all the constraints on resources / cost are satisfied.
The exact performance depends on the computational model at hand.
Given the same application, a system with an FPGA on a slow bus results in a model with different performance parameters than a system with an FPGA as a coprocessor.

HW/SW partitioning
[Figure: a SW model (int main() { … }) decomposed into tasks, each labeled SW or HW]
Good partitioning criteria:
  Minimize communication (traffic) between HW and SW and on the bus
  Maximize concurrency (reduce stalling), where both the HW and SW run in parallel
  Maximize the utilization of the HW resources
→ Minimize total execution runtime

Profiling is a key step in HW/SW partitioning
Determine the candidate HW partitions by first profiling the specification tasks, taking typical data sets into account
Given a candidate SW/HW partition:
  Estimate the HW implementation
  Determine the system performance and the speedup over software
How can we generate candidate SW/HW partitions?
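The "determine the speedup over software" step can be sketched with an Amdahl-style estimate. This is only an illustration under stated assumptions, not a formula from the lecture: the names `Candidate` and `estimated_speedup` are hypothetical, all times come from profiling or from HW estimates, and HW and SW are assumed not to overlap in time.

```cpp
#include <cassert>

// Hypothetical profile of one candidate partition (all times in seconds).
struct Candidate {
    double total_sw;   // profiled runtime of the whole program in software
    double kernel_sw;  // profiled software time of the tasks moved to HW
    double kernel_hw;  // estimated time of those tasks on the FPGA
    double comm;       // estimated HW/SW communication overhead
};

// Speedup over pure software, assuming HW and SW do not run concurrently.
double estimated_speedup(const Candidate& c) {
    assert(c.total_sw > 0.0 && c.kernel_sw <= c.total_sw);
    double partitioned = (c.total_sw - c.kernel_sw) + c.kernel_hw + c.comm;
    return c.total_sw / partitioned;
}
```

For instance, moving a kernel that takes 80 s of a 100 s program onto hardware that needs 8 s plus 2 s of communication gives 100 / 30, a speedup of about 3.3. If HW and SW can overlap, this estimate is pessimistic.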

HW/SW partitioning algorithms
Total size is constrained by the number and size of the available FPGA(s)
[Figure: tasks divided between SW tasks and HW tasks; plot of execution time versus moves, showing a local optimum and the global optimum]
Kernighan/Lin – Fiduccia/Mattheyses algorithm
  Start with all task vertices free to swap/move (unlocked)
  Label each possible swap/move with the immediate change in execution time that it causes (gain)
  Iteratively select and execute the swap/move with the highest gain (whether positive or negative); lock the moving vertex (i.e., it cannot move again during the pass)
  The best solution seen during the pass is adopted as the starting solution for the next pass
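A minimal sketch of one pass of this move-based heuristic follows. The cost model is an assumption made for illustration (tasks run sequentially, and an edge costs extra time only when its two tasks sit on different sides); the names `Task`, `Edge`, `exec_time`, `used_area` and `fm_pass` are hypothetical. Costs are recomputed from scratch for clarity rather than updated incrementally as a real implementation would.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Task { double sw_time; double hw_time; int hw_area; };
struct Edge { int u; int v; double comm; };  // comm is paid only across the HW/SW boundary

// Assumed cost model: sequential execution plus cross-boundary communication.
double exec_time(const std::vector<Task>& tasks, const std::vector<Edge>& edges,
                 const std::vector<int>& in_hw) {
    double total = 0.0;
    for (std::size_t i = 0; i < tasks.size(); ++i)
        total += in_hw[i] ? tasks[i].hw_time : tasks[i].sw_time;
    for (const Edge& e : edges)
        if (in_hw[e.u] != in_hw[e.v]) total += e.comm;
    return total;
}

int used_area(const std::vector<Task>& tasks, const std::vector<int>& in_hw) {
    int area = 0;
    for (std::size_t i = 0; i < tasks.size(); ++i)
        if (in_hw[i]) area += tasks[i].hw_area;
    return area;
}

// One pass: repeatedly execute the best legal move (even if it makes things
// worse), lock the moved task, and return the best assignment seen.
std::vector<int> fm_pass(const std::vector<Task>& tasks, const std::vector<Edge>& edges,
                         std::vector<int> in_hw, int area_limit) {
    assert(in_hw.size() == tasks.size());
    std::vector<int> locked(tasks.size(), 0);
    std::vector<int> best = in_hw;
    double best_cost = exec_time(tasks, edges, in_hw);
    for (std::size_t step = 0; step < tasks.size(); ++step) {
        int pick = -1;
        double pick_cost = 0.0;
        for (std::size_t i = 0; i < tasks.size(); ++i) {
            if (locked[i]) continue;
            in_hw[i] ^= 1;                                   // tentative move
            bool fits = used_area(tasks, in_hw) <= area_limit;
            double cost = exec_time(tasks, edges, in_hw);
            in_hw[i] ^= 1;                                   // undo
            if (fits && (pick < 0 || cost < pick_cost)) {
                pick = static_cast<int>(i);
                pick_cost = cost;
            }
        }
        if (pick < 0) break;                                 // no legal move left
        in_hw[pick] ^= 1;
        locked[pick] = 1;
        if (pick_cost < best_cost) { best_cost = pick_cost; best = in_hw; }
    }
    return best;
}
```

Calling `fm_pass` again with its own result as the starting assignment gives the next pass; the passes stop when a pass no longer improves the cost.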

Low-level partitioning from software binaries
Rather than partition from the high-level description, it is possible to compile the program as SW and then partition the resultant executable binary into SW and HW parts. Advantages: No need to worry about which language is being used Can be used to develop dynamic runtime partitioners and synthesizers Main steps: Decompilation of binary to recover high-level information Partitioning and synthesis Binary updating to account for the SW parts that migrated to HW

Compilation
Reconfigurable computing can execute multiple operations in parallel through the spatial distribution of the computing resources.
When compiling a sequential SW language like C into a concurrent language like Verilog, parallelism must be introduced either
  manually, by instructing the compiler through special instructions or compiler directives, or
  automatically, by the compiler
How can the compiler automatically extract parallelism?

Data-flow graphs (DFG)
A data-flow graph (DFG) is a graph that represents the data dependencies between a number of operations.
Dependencies arise for various reasons:
  An input to an operation can be the output of another operation
  Serialization constraints, e.g., loading data on a bus and then raising a flag
  Sharing of resources
A data-flow graph represents operations and data dependencies:
  The vertex set is in one-to-one correspondence with the tasks
  A directed edge corresponds to the transfer of data from one operation to another
[Figure: a small DFG with a + vertex and the values a, b, c]

Consider the following example
[De Micheli’94]
Design a circuit to numerically solve the following differential equation in the interval [0, a] with step-size dx:
y'' + 3xy' + 3y = 0
read (x, y, u, dx, a);
do {
  xl = x + dx;
  ul = u - (3*x*u*dx) - (3*y*dx);
  yl = y + u*dx;
  c = xl < a;
  x = xl; u = ul; y = yl;
} while (c);
write(y);

Data-flow graph example
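xl = x + dx;
ul = u - (3*x*u*dx) - (3*y*dx);
yl = y + u*dx;
c = xl < a;
[Figure: data-flow graph of the loop body. Inputs: 3, x, u, dx, y, a. Six multiplications, two additions, two subtractions and one comparison produce xl, yl, ul and c.]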

Detecting concurrency from DFGs
Extended DFGs: vertices can also represent links to other DFGs, forming a hierarchy of graphs
[Figure: the data-flow graph of the example with source and sink NOP vertices added]
Paths in the graph represent concurrent streams of operations

Control / data-flow graphs (CDFG)
Control-flow information (branching and iteration) can also be represented graphically
Data-flow graphs can be extended by introducing branching vertices that represent operations that evaluate conditional clauses
Iteration can be modeled as a branch based on the iteration exit condition
Vertices can also represent model calls

CDFG example
x = a * b;
y = x * c;
z = a + b;
if (z ≥ 0) {
  p = m + n;
  q = m * n;
}
[Figure: control/data-flow graph of the code above, with NOP source and sink vertices, a branching vertex (BR) for the conditional, and a nested graph for the branch body]

Next lecture Parallelism extraction and optimization from DFG

Behavioral code optimizing
Tree-height reduction applies to arithmetic expression trees: it splits an expression into two-operand expressions so that parallelism can be exploited
The idea is to balance the expression tree as much as possible
If we have n operations, what is the best height that can be achieved?
Example: x = a + b * c + d
[Figure: an unbalanced and a balanced expression tree for x]
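One hedged way to reason about the question above: n two-operand operations combine n + 1 operands, and a binary tree of height h has at most 2^h leaves, so the height cannot drop below ⌈log2(n + 1)⌉. Reaching that bound assumes the operators may be freely reordered (associative and commutative), which mixed expressions do not always allow. The function name `min_tree_height` is hypothetical.

```cpp
#include <cassert>

// Lower bound on the height of a binary expression tree that contains
// n_ops two-operand operations (n_ops + 1 leaf operands): ceil(log2(n_ops + 1)).
int min_tree_height(int n_ops) {
    assert(n_ops >= 0);
    int height = 0;
    long long capacity = 1;  // number of leaves a tree of this height can hold
    while (capacity < n_ops + 1LL) {
        capacity *= 2;
        ++height;
    }
    return height;
}
```

For the example x = a + b * c + d there are three operations, so the bound is 2, which the balanced tree (a + d) + (b * c) achieves.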

Tree-height reduction
x = a*(b*c*d + e) → x = a*b*c*d + a*e
Exploiting the distributive property at the expense of adding an operation
[Figure: expression trees before and after applying the distributive property]

Constant and variable propagation
Constant propagation consists of detecting constant operands and pre-computing the value of the operation with those operands. The result might be a constant, which can be propagated to other operations as input
Example: a = 0; b = a + 1; c = 2 * b
Replaced by → a = 0; b = 1; c = 2
Variable propagation consists of detecting the copies of a variable and using the right-hand side in the following references in place of the left-hand side
Example: a = x; b = a + 1; c = 2 * a
Replaced by → a = x; b = x + 1; c = 2 * x

CSE and DCE
Common Sub-expression Elimination (CSE) avoids unnecessary computations.
Example: a = x + y; b = a + 1; c = x + y
Can be replaced by → a = x + y; b = a + 1; c = a
Dead Code Elimination (DCE). Dead code consists of all operations that cannot be reached, or whose result is never referenced elsewhere.
Example: a = x; b = x + 1; c = 2 * x;
The first assignment can be removed if a is never subsequently referenced

Operator strength reduction & code motion
Operator strength reduction means reducing the cost of implementing an operator by using a “weaker” one (one that uses less hardware, or is simpler and faster)
Example: a = x^2; b = 3 * x
Replaced by → a = x * x; t = x << 1; b = x + t
Code motion often applies to loop invariants, i.e., quantities that are computed inside an iterative construct but whose values do not change from iteration to iteration.
Example: for (i = 1; i <= a * b; i++) {…}
Replaced by → t = a * b; for (i = 1; i <= t; i++) {…}
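Both rewrites can be checked with a small C++ sketch. The function names are hypothetical, and the shift-based version assumes a non-negative x so that the left shift is well defined.

```cpp
#include <cassert>

// Strength reduction: 3 * x computed with one shift and one add.
int times3_reduced(int x) {
    assert(x >= 0);
    int t = x << 1;  // 2 * x
    return x + t;    // x + 2 * x
}

// Code motion: the loop-invariant bound a * b is computed once, outside the loop.
int count_iterations_hoisted(int a, int b) {
    int t = a * b;
    int count = 0;
    for (int i = 1; i <= t; ++i) ++count;
    return count;
}
```

In hardware the shift by a constant is just wiring, so the multiplier for 3 * x is replaced by a single adder.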

Control-flow-based transformations
Control-flow transformations are typically utilized to create more opportunities for data-flow transformations to be exercised
Model expansion consists of locally flattening the model call hierarchy. The called model therefore disappears, being swallowed by the calling one. A possible benefit is that the scope of application of some optimization techniques is enlarged, potentially yielding a better circuit
Example: x = a + b; y = a * b; z = func(a, b), where func(p, q) = {t = p - q + p*q; return t;}
→ By expanding func, we get x = a + b; y = a * b; z = a - b + a*b;
→ After CSE: x = a + b; y = a * b; z = a - b + y;

Conditional expansion
A conditional construct can always be transformed into a parallel construct with a test at the end.
Conditional expansion can increase the performance of the circuit when the conditional clause depends on some late-arriving signal. However, it can preclude the possibility of hardware sharing
if (C) then x = A else x = B → compute A and B in parallel, then x = C ? A : B

Loop expansion
In loop expansion, or unrolling, a loop is replaced by as many instances of its body as the number of iterations; in partial unrolling, several iterations are merged into one larger body. The benefit is in expanding the scope of other transformations
Example:
x = 0;
for (i = 1; i <= 12; i++) { x = x + a[i]; }
Partially unrolled by a factor of 3:
x = 0;
for (i = 1; i <= 12; i = i+3) { x = x + a[i] + a[i+1] + a[i+2]; }
[Figure: adder tree combining x, a[i], a[i+1] and a[i+2]]
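The equivalence of the two loops above can be checked directly. The sketch keeps the 1-based indexing of the slide by leaving `a[0]` unused; the function names are hypothetical.

```cpp
#include <array>
#include <cassert>

// Original loop: one addition per iteration, 12 iterations.
int sum_rolled(const std::array<int, 13>& a) {
    int x = 0;
    for (int i = 1; i <= 12; i++) x = x + a[i];
    return x;
}

// Unrolled by 3: three additions per iteration, 4 iterations.
int sum_unrolled_by_3(const std::array<int, 13>& a) {
    int x = 0;
    for (int i = 1; i <= 12; i = i + 3)
        x = x + a[i] + a[i + 1] + a[i + 2];
    return x;
}
```

The unrolled body exposes three additions at once, which tree-height reduction can then balance into a shallower adder tree.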

Putting concepts into work: Hardware acceleration using custom instructions
We studied the concepts HW/SW partitioning and code optimizations for high-level synthesis We will apply these concepts with the help of the Nios-II soft core processor Difference between Soft and Hard processors A hard processor is one that is implemented as a dedicated, predefined (hardwired) block As opposed to physically embedding a processor into the FPGA fabric, it is possible to configure a group of logic blocks to act as a soft processor What are the advantages and disadvantages of each?

The Nios II soft processor
32 bit soft processor from Altera 82 instructions Up to 256 custom instructions Optional multiply and divide depending on the flavor Comes in three flavors (number for Cyclone II implementations): Economy: emphasizes minimum size ~700 L.E and ~17 DMIPS. Standard: performance/size balance ~1400 L.E and ~54 DMIPS Fast: best performance ~1800 L. E and ~ 92 DMIPS

Creating Nios based systems using SOPC and program it using IDE
SOPC builder Nios II IDE

Accelerating applications within the Nios II environment
[Figure: Nios II system. The Nios II processor (with custom instructions), memory (SRAM or on-chip), peripherals and an accelerator Avalon component, connected through the Avalon bus]

Using custom instructions to accelerate applications
Combinatorial Sequential/multi-cycle

HW assignment: Accelerate C code to accelerate palindrome detection
A palindrome is a sequence of units (“a string”) that reads the same in either direction
Examples: Racecar, Dennis sinned, 425524
The HW is to write a C routine that detects whether a number is a palindrome, and then use it to write a C program that counts the numeric palindromes between 0 and 1 billion.
The count can be computed statically, but the HW asks you to write a C program for the Nios II processor that computes the count using the routine you coded

HW assignment: Accelerate C code to accelerate palindrome detection
After you write your program, report the runtime and then accelerate the program using custom instructions designed using the reconfigurable logic You are required to report the runtime before and after the acceleration. It will be also good to try your program on a general purpose workstation and report the runtime. You have to report the count of palindromes you found together with the runtimes. Here are my runtimes. Optional: 2.4 GHz Xeon workstation: 355 seconds Required: Nios II (just software): seconds Required: Nios II (software + custom instructions): 105 seconds Grades: 15/20 if you get all parts working correctly. 16/20 if your runtime is between seconds, 17/20 if your runtime is between and 18/20 if runtime is and 20/20 if runtime is < 100

Lecture 10: RC Principles: Software (3/4) Prof. Sherief Reda Division of Engineering, Brown University [Some examples are based on G. De Micheli textbook and lectures]

Behavioral synthesis Given:
a sequencing graph (data/control-flow graph) constructed from the behavioral circuit specification after code optimizations
a set of functional resources (multipliers, adders, etc.), each characterized in terms of area, delay and power
a set of constraints (on circuit delay, area and power)
Synthesizing the output circuit consists of two stages:
  Place operations in time (scheduling) and space (binding them to resources)
  Determine the detailed connections of the data path and the control unit

Scheduling (temporal assignment)
Scheduling is the task of determining the start times of all operations, subject to the precedence constraints specified by the sequencing graph The latency of the sequencing graph is the difference between the start time of the sink and the start time of the source

Scheduling to minimize the latency
Consider the following differential equation integrator
read (x, y, u, dx, a);
do {
  xl = x + dx;
  ul = u - (3*x*u*dx) - (3*y*dx);
  yl = y + u*dx;
  c = xl < a;
  x = xl; u = ul; y = yl;
} while (c);
write(y);

ASAP scheduling for minimum latency
Assuming all operations to have 1 unit delay, what is the latency here?

ASAP scheduling algorithm

ALAP scheduling to meet latency constraint

ALAP scheduling algorithm

Operation mobility
The mobility of an operation is the difference between its start times computed by the ALAP and ASAP algorithms
Mobility measures the freedom we have in scheduling an operation while still meeting the timing schedule
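Since the algorithm figures are not reproduced here, the following is a minimal sketch of ASAP, ALAP and mobility under the unit-delay assumption used above. The graph encoding (a list of dependency edges) and the names `Edges`, `asap`, `alap` and `mobility` are assumptions for illustration; the repeated relaxation is a simple stand-in for a topological traversal and requires an acyclic graph.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Edge (u, v): operation v consumes the result of operation u.
using Edges = std::vector<std::pair<int, int>>;

// ASAP: every operation starts one step after its latest predecessor.
// Steps are numbered from 1 and every operation takes one step.
std::vector<int> asap(int n, const Edges& edges) {
    std::vector<int> start(n, 1);
    for (int pass = 0; pass < n; ++pass)
        for (const auto& e : edges)
            start[e.second] = std::max(start[e.second], start[e.first] + 1);
    return start;
}

// ALAP: every operation starts as late as the latency bound allows.
std::vector<int> alap(int n, const Edges& edges, int latency) {
    std::vector<int> start(n, latency);
    for (int pass = 0; pass < n; ++pass)
        for (const auto& e : edges)
            start[e.first] = std::min(start[e.first], start[e.second] - 1);
    return start;
}

// Mobility: ALAP start time minus ASAP start time, per operation.
std::vector<int> mobility(const std::vector<int>& asap_start,
                          const std::vector<int>& alap_start) {
    assert(asap_start.size() == alap_start.size());
    std::vector<int> m(asap_start.size());
    for (std::size_t i = 0; i < m.size(); ++i)
        m[i] = alap_start[i] - asap_start[i];
    return m;
}
```

Operations with zero mobility lie on the critical path; the latency bound passed to `alap` must be at least the largest ASAP start time, otherwise some ALAP start times fall below step 1.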

Resource binding (spatial assignment)
Binding determines the resource type and instance assigned for each operation
How many multipliers do we need here? How many ALUs (+, -, <)?

Resource sharing and binding
Bind a resource to two operations as long as they do not execute concurrently
How many instances of the multiplier and the ALU do we need now?

Can we do better? Can we get the same latency with fewer resources?
Resources sharing the same instance are colored with the same color. How many instances are now needed? How can we find the solution?

Finding the minimal number of resources for a given latency (T) using list scheduling
Initialize all resource instances to 1.
for t = 1 to T:
  For each resource type:
    Calculate the slack (ALAP time - t) of each candidate operation
    Schedule candidate operations with zero slack and update the number of resource instances used if needed
    Schedule any candidate operations that require no additional resource instances
What is the intuition behind this heuristic?

Scheduling and sharing necessitates a control unit that orchestrates the sequencing of operations

Scheduling under resource constraint
Assume we have just one instance of a multiplier and one instance of an ALU (+, -, <). How can we schedule all operations? What is the latency?

Finding the minimal latency for a given resource constraint (C) using list scheduling
Label all operations by the length of their longest path to the sink and rank them in decreasing order
Repeat until all operations are scheduled:
  For each resource type:
    Determine the set U of candidate operations that can be scheduled
    Select a subset of U by priority such that the resource constraint (C) is not exceeded
  Increment time
What is the intuition behind this heuristic?
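The heuristic above can be sketched as follows, assuming unit-delay operations and an acyclic graph. The encoding (an `Op` with a resource type and a list of predecessors) and the name `list_schedule` are hypothetical; `limit[type]` plays the role of the constraint C for each resource type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Op {
    int type;                // index of the resource type, e.g. 0 = multiplier, 1 = ALU
    std::vector<int> preds;  // operations whose results this one needs
};

// Returns the start step (1-based) of every operation.
std::vector<int> list_schedule(const std::vector<Op>& ops, const std::vector<int>& limit) {
    const int n = static_cast<int>(ops.size());
    for (const Op& op : ops)
        assert(op.type >= 0 && op.type < static_cast<int>(limit.size()) && limit[op.type] >= 1);

    // Priority: length of the longest path to the sink, counting the operation itself.
    std::vector<int> prio(n, 1);
    for (int pass = 0; pass < n; ++pass)
        for (int v = 0; v < n; ++v)
            for (int u : ops[v].preds)
                prio[u] = std::max(prio[u], prio[v] + 1);

    std::vector<int> start(n, 0);  // 0 means "not scheduled yet"
    int scheduled = 0;
    for (int t = 1; scheduled < n; ++t) {
        for (std::size_t type = 0; type < limit.size(); ++type) {
            // Candidates U: unscheduled operations of this type whose
            // predecessors all finished before step t.
            std::vector<int> ready;
            for (int v = 0; v < n; ++v) {
                if (start[v] != 0 || ops[v].type != static_cast<int>(type)) continue;
                bool ok = true;
                for (int u : ops[v].preds)
                    if (start[u] == 0 || start[u] >= t) ok = false;
                if (ok) ready.push_back(v);
            }
            std::stable_sort(ready.begin(), ready.end(),
                             [&](int a, int b) { return prio[a] > prio[b]; });
            for (int k = 0; k < static_cast<int>(ready.size()) && k < limit[type]; ++k) {
                start[ready[k]] = t;
                ++scheduled;
            }
        }
    }
    return start;
}
```

The latency is the largest start step. Operations far from the sink are served first because delaying them delays everything behind them, which is the intuition the slide asks about.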

There is an inherent tradeoff between area and latency
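[Figure: area (vertical axis) versus latency (horizontal axis), with design points given as (latency, area): (4, 6), (4, 4) and (7, 2). The point (4, 6) is dominated.]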

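Control unit example
for (i = 0; i < 10; i = i + 1)
begin
  x = a[i] + b[i];
  z = z + x;
end
[Figure: data path with one adder, a multiplexer (MUX) and enable signals for the registers x and z, driven by a control unit built from a counter i, a comparator (CMP) against 10 and a table of control bits (101001, 111001, ...)]
How many control signals are produced by the control unit? How can we design the control unit?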

Summary
So far, we covered
  SW/HW partitioning
  Behavioral code optimization
  Behavioral synthesis techniques
Next time, I give an overview of
  Technology mapping
  Placement and routing

Summary of the last 3 lectures
[Figure: design flow. System specification → partitioning into SW and HW. SW path: compile → link → executable image. HW path: compiling to Verilog → synthesis → mapping & packing → place & route → configuration file. Both results are downloaded to the board. Annotations: previous lectures; this lecture; traditional compiler class.]

Embedding a digital circuit into the FPGA fabric [Maxfield’04]
[Figure: programmable logic element]
Mapping decomposes the circuit into logic sections and flip-flops such that each section fits into a K-LUT LE.
Packing groups LEs into clusters so that each cluster fits into a LAB
Placement determines the position of each cluster in the LABs of the island-style FPGA
Routing determines the exact routes between the communicating LEs/LABs
What are the objectives/metrics that these algorithms should pursue?

1. Mapping finds a covering for a given circuit using K-LUT
Map to a LUT in a LB [Figure from Cong FPGA’01]

A covering example [From Ling et al. DAC’05]
There can be many possible coverings. Which one should be picked?

2. Packing
How can we decide which LEs should go together in the same logic cluster?
Possible method (VPACK):
  Construct each cluster sequentially
  Start by choosing a seed LE for the cluster
  Then greedily select the LE that shares the most inputs and outputs with the cluster being constructed
  Repeat the procedure greedily until the cluster is full or the number of inputs exceeds the limit I
Can adding an LE to a cluster reduce the number of distinct inputs?

3. Placement
Placement assigns an exact position or LAB for each cluster in the input netlist
Suppose you start with a random placement; how can you improve it?
Possible algorithm:
  - Pick a pair of cells and swap their locations if this leads to a reduction in wirelength (WL)
What’s wrong with this greedy algorithm?
[Figure: WL over the possible placements, showing a local optimum and the global optimum]
→ It can simply get stuck in a local optimum

Simulated annealing allows us to avoid getting trapped in a local minimum
Modified algorithm:
  Generate a random move (say, a swap of two cells)
  Calculate the change in WL (ΔL) due to the move
  If the move results in a reduction (ΔL < 0), accept it; otherwise reject it with probability 1 - e^(-ΔL/T)
[Figure: WL over the possible placements, showing a local optimum and the global optimum]
T (temperature) controls the rejection probability
Initially, T is high (thus avoiding getting trapped early in a local minimum); then the temperature cools down in a scheduled manner; at the end, the rejection probability of a WL-increasing move approaches 1
With the right “slow-enough” cooling schedule, simulated annealing is guaranteed, in the limit, to reach the global optimum
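The acceptance rule can be sketched in a few lines of C++. The function names are hypothetical; the rule itself is the one stated above, written as an acceptance probability e^(-ΔL/T) instead of a rejection probability.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Probability of accepting a move that changes the wirelength by delta_wl.
double accept_probability(double delta_wl, double temperature) {
    assert(temperature > 0.0);
    if (delta_wl < 0.0) return 1.0;             // improvements are always accepted
    return std::exp(-delta_wl / temperature);   // equals 1 - rejection probability
}

// Draws a uniform number in [0, 1) and accepts if it falls below the probability.
bool accept_move(double delta_wl, double temperature, std::mt19937& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    return uniform(rng) < accept_probability(delta_wl, temperature);
}
```

At a high temperature almost every uphill move is accepted, so the search wanders freely; as T shrinks, the same ΔL is accepted less and less often and the algorithm behaves like the greedy one.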

What do the cooling schedule and the corresponding cost function look like?
[source: I. Markov]

Placement before & after simulated annealing
[using VPR tool]

4. Routing Assign exact routes for each wire in the given circuit in the FPGA fabric such that no two wires overlap General idea: Order the wires according to some criteria Sequentially route each wire using shortest path algorithms (after removing the resources consumed from preceding routed wires)

Maze routing
Problem: find the shortest path for a 2-pin wire from s to t
[Figure: routing grid in which cells are labeled with their distance from s (1, 2, 3, ...) as the wavefront expands until it reaches t. Legend: grid cell capacity is full; grid cell still has available tracks.]
Speed-ups are possible using A* search algorithms and other AI search techniques
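The wavefront expansion can be sketched as a breadth-first search on the grid. This is a simplified illustration: the grid encoding (1 for a full cell, 0 for a cell with free tracks) and the name `maze_route_length` are assumptions, and only the path length is returned, not the path itself.

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Length (in grid steps) of a shortest path from s to t, or -1 if t cannot
// be reached. Cells are addressed as (row, column).
int maze_route_length(const std::vector<std::vector<int>>& blocked,
                      std::pair<int, int> s, std::pair<int, int> t) {
    const int rows = static_cast<int>(blocked.size());
    assert(rows > 0);
    const int cols = static_cast<int>(blocked[0].size());
    std::vector<std::vector<int>> dist(rows, std::vector<int>(cols, -1));
    std::queue<std::pair<int, int>> frontier;
    dist[s.first][s.second] = 0;
    frontier.push(s);
    const int dr[4] = {1, -1, 0, 0};
    const int dc[4] = {0, 0, 1, -1};
    while (!frontier.empty()) {
        std::pair<int, int> cur = frontier.front();
        frontier.pop();
        if (cur == t) return dist[cur.first][cur.second];
        for (int k = 0; k < 4; ++k) {
            int nr = cur.first + dr[k];
            int nc = cur.second + dc[k];
            if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) continue;
            if (blocked[nr][nc] || dist[nr][nc] != -1) continue;
            dist[nr][nc] = dist[cur.first][cur.second] + 1;  // wavefront label
            frontier.push(std::make_pair(nr, nc));
        }
    }
    return -1;
}
```

The `dist` labels correspond to the numbers written in the grid cells of the figure; tracing back from t through decreasing labels recovers the route.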

Impact of net ordering
A bad net ordering may unnecessarily increase the total wirelength or even render the chip unroutable!
Example: two nets A and B
[Figure: the same two nets routed in both orders. B first, then A (good order); A first, then B (bad order).]
Possible ordering criteria: length in placement, timing criticality

When a route for a net can’t be found then rip up and re-route
[Figure sequence: with nets A and B already routed, C cannot be routed. So rip up B and route C first. Finally, route B.]
[Example from Prof. D. Pan Lecture]

VPR. After routing After placement and routing After placement
You probably saw similar layouts from the Quartus II tool

Finally programming the FPGA

Summary Done with software part for reconfigurable computing
Next lecture: project overview
The one after is the midterm
Afterwards, we will start looking at SystemC as a higher-level method to synthesize systems

(EN2911X, Fall07) Lecture 13: SystemC (1/3) Prof. Sherief Reda Division of Engineering, Brown University

Introduction to SystemC
SystemC is not a language, but rather a class library within a well established language C++. The primary motivation for using SystemC is to attain the productivity increases required to design modern electronic systems with their ever increasing complexity. [SystemC: From Ground Up]

SystemC resources
SystemC: From the Ground Up / by David C. Black and Jack Donovan
SystemC: Methodologies and Applications / edited by Wolfgang Müller, Wolfgang Rosenstiel, and Jürgen Ruf
System Design with SystemC / by Thorsten Grötker, Stan Liao, Grant Martin, Stuart Swan

C++ mini refresher C++ is an object oriented language
A key concept in C++ is the class. A class is an expanded concept of a data structure: instead of holding only data, it can hold both data and functions.
class CRectangle {
  private:
    int x, y;
  public:
    void set_values (int,int);
    int area (void);
} rect;
rect.set_values (3,4);
myarea = rect.area();
Examples from

CRectangle example
#include <iostream>
using namespace std;
class CRectangle {
  private:
    int x, y;
  public:
    void set_values (int,int);              // declaration
    int area () { return (x*y); }           // declaration & definition
};
void CRectangle::set_values (int a, int b) { // definition
  x = a;
  y = b;
}
int main () {
  CRectangle rect;
  rect.set_values (3,4);
  cout << "area: " << rect.area();
  return 0;
}

Constructors
#include <iostream>
class CRectangle {
    int width, height;
  public:
    CRectangle (int,int);
    int area () { return (width*height); }
};
CRectangle::CRectangle (int a, int b) {
  width = a;
  height = b;
}
int main () {
  CRectangle rect (3,4);
  CRectangle rectb (5,6);
  return 0;
}
A constructor is automatically called whenever a new object of this class is created. The constructor function must have the same name as the class, and cannot have any return type; not even void.

(Destructors)
class CRectangle {
    int *width, *height;
  public:
    CRectangle (int,int);
    ~CRectangle ();
    int area () { return (*width * *height); }
};
// constructor definition: allocating memory
CRectangle::CRectangle (int a, int b) {
  width = new int;
  height = new int;
  *width = a;
  *height = b;
}
// destructor definition: freeing memory
CRectangle::~CRectangle () {
  delete width;
  delete height;
}

Pointer to classes It is perfectly valid to create pointers that point to classes. We simply have to consider that once declared, a class becomes a valid type, so we can use the class name as the type for the pointer. For example: CRectangle *prect; prect = new CRectangle; prect->set_values(1, 2); As with data structures, in order to refer directly to a member of an object pointed by a pointer we can use the arrow operator (->) of indirection.

Inheritance between classes
A key feature of C++ classes is inheritance. Inheritance allows to create classes which are derived from other classes, so that they automatically include some of its "parent's" members, plus its own. polygon rectangle triangle The class CPolygon contain members that are common for both types of polygon. In our case: width and height. And CRectangle and CTriangle would be its derived classes, with specific features that are different from one type of polygon to the other. Classes that are derived from others inherit all the accessible members of the base class.

Inheritance example
class CPolygon {
  protected:
    int width, height;
  public:
    void set_values (int a, int b) { width=a; height=b; }
};
class CRectangle: public CPolygon {
  public:
    int area () { return (width * height); }
};
class CTriangle: public CPolygon {
  public:
    int area () { return (width * height / 2); }
};
int main () {
  CRectangle rect;
  CTriangle trgl;
  rect.set_values (4,5);
  trgl.set_values (4,5);
  return 0;
}
The protected access specifier is similar to private. The only difference occurs with inheritance: when a class inherits from another one, the members of the derived class can access the protected members inherited from the base class, but not its private members.

Class templates motivation
class mypairi {
    int a, b;
  public:
    mypairi (int first, int second) { a=first; b=second; }
    int getmax () {
      int retval;
      retval = a>b? a : b;
      return retval;
    }
};
int main () {
  mypairi myobject (100, 75);
  cout << myobject.getmax();
  return 0;
}
class mypairf {
    float a, b;
  public:
    mypairf (float first, float second) { a=first; b=second; }
    float getmax () {
      float retval;
      retval = a>b? a : b;
      return retval;
    }
};
int main () {
  mypairf myobject (100, 75);
  cout << myobject.getmax();
  return 0;
}
Can we have an automatic way to avoid writing multiple versions of the same class for different data types?

Class templates
template <class T>
class mypair {
    T a, b;
  public:
    mypair (T first, T second) { a=first; b=second; }
    T getmax () {
      T retval;
      retval = a>b? a : b;
      return retval;
    }
};
int main () {
  mypair <int> myobject (100, 75);
  cout << myobject.getmax();
  return 0;
}
C++ class templates are used where we would otherwise have multiple copies of code with the same logic for different data types. If a set of functions or classes has the same functionality for different data types, it is a good candidate for being written as a template.

What is wrong with plain C++?
Concurrency: hardware systems are inherently parallel; software is not.
Time: C++ has no notion of time
Hardware-style communication: signals, protocols
Reactivity: hardware is inherently reactive
Hardware data types: bit type, multi-valued logic
Some of the functionalities in C++ are simply not applicable in hardware systems (e.g., destructors)

Extending C++ with SystemC library
SystemC library of C++ classes: Processes (for concurrency) Clocks (for time) Hardware data types (bit vectors, 4-valued logic, fixed-point types, arbitrary precision integers) Waiting, watching, and sensitivity (for reactivity) Modules, ports, signals (for hierarchy)

Synthesizable SystemC
Our discussions will focus on SystemC synthesis as afforded by the Celoxica SystemC agility compiler

SC_MODULE
SC_MODULE creates a class inherited from sc_module
SC_MODULE(module_name) {
  ... // port declarations to connect modules together
  ... // variable declarations
  ... // function declarations/definitions
  SC_CTOR(module_name) {
    ... // body of constructor
    ... // process declarations, sensitivities
  }
};
The arguments to SC_MODULE and SC_CTOR must be the same.

Module ports Ports are used to communicate with the external modules or channels. Input ports (defined using sc_in and sc_in_clk) Output ports (defined using sc_out and sc_out_clk) Input/output ports (defined using sc_inout, sc_inout_clk and sc_inout_rv ) SC_MODULE (module_name) { //Module port declarations sc_in <port_data_type> port_name; sc_out <port_data_type> port_name; sc_inout <port_data_type> port_name; … };

Datatypes
Supported native C++ synthesizable data types: long long, long, int, short, char, bool
SystemC also allows further refined storage types: sc_bit, sc_bv <width>, sc_int <width>, sc_uint <width>, sc_bigint <width>, sc_biguint <width>

Module process
Processes describe the parallel behavior of hardware systems. Processes execute concurrently. The code within a process, however, executes sequentially. Defining a process is similar to defining a C++ function. A process is declared as a member function of a module class and registered as a process in the module's constructor.

    SC_MODULE (my_module) {
        sc_in <bool> clk;
        sc_in <bool> in1;
        sc_out <bool> out1;
        void my_method();
        SC_CTOR(my_module) {
            SC_METHOD(my_method);
            sensitive << in1 << clk.pos();
        }
    };

[Figure: a module containing one process, with input ports in1 and clk and an output port.]

Definition of process body
The process body contains the implementation of the process. Like C++ functions, it may be defined:
• within the module definition, typically in a .h header file
• outside the module, typically in a .cpp file

A method process body within a module definition (my_module.h):

    SC_MODULE (my_module) {
        void my_method() { ... }
    };

A method process body outside a module definition (my_module.h and my_module.cpp):

    // my_module.h
    SC_MODULE (my_module) {
        void my_method();
        ...
    };

    // my_module.cpp
    void my_module::my_method() { ... }

SC_CTOR construct
The SC_CTOR constructor is used to:
• Initialize variables declared in the module.
• Specify the functionality of the module in terms of SC_METHOD, SC_THREAD and SC_CTHREAD. The threads and methods must be defined using synthesizable code.
The constructor should also contain:
• The sensitivity lists describing the inputs that each process is sensitive to.
• Instantiation of sub-modules.
• Port mapping code for hierarchical modules.

OR gate example

    #include "systemc.h"

    SC_MODULE(or_gate) {
        sc_in<sc_bit> a;
        sc_in<sc_bit> b;
        sc_out<sc_bit> c;
        void prc_or_gate() { c = a | b; }
        SC_CTOR(or_gate) {
            SC_METHOD(prc_or_gate);
            sensitive << a << b;
        }
    };

[Figure: OR_GATE block with inputs a, b and output c.]

Full adder example: the half-adder building block

    SC_MODULE( half_adder ) {
        sc_in<bool> a, b;
        sc_out<bool> sum, carry;
        void half_adder_defn();
        SC_CTOR( half_adder ) {
            SC_METHOD ( half_adder_defn );
            sensitive << a << b;
        }
    };

    void half_adder::half_adder_defn() {
        sum = a ^ b;
        carry = a & b;
    }

[Figure: FA block with inputs a, b and outputs sum, carry.]

Ports
Ports are the means through which modules communicate with other modules. There are three basic port types that inherit from sc_port:
• Input ports for receiving data
• Output ports for sending out data
• Input/output ports, which combine the two
An input port must be of type sc_in, which is a SystemC class template.
sc_in < T > portname declares an input port of type T.
Example: sc_in < sc_uint<5> > myinport
Special case: sc_in_clk clkname declares an input port of type bool.
[Figure: a module containing one process, with input ports in1 and clk and an output port.]

Port types
An output port must be of type sc_out, which is a SystemC class template.
sc_out < T > portname declares an output port of type T.
Example: sc_out < sc_uint<5> > myoutport
Special case: sc_out_clk clkname declares an output port of type bool.
An input/output port must be of type sc_inout, which is a templatized primitive SystemC class.
sc_inout < T > portname declares an in/out port of type T.
Example: sc_inout < sc_uint<5> > myinoutport

Synthesizable operations associated with ports
There are two operations associated with ports: read and write.
• sc_inout ports can be written to and read from
• sc_in ports can only be read from
• sc_out ports can only be written to
Examples:

    sc_in < T > portname;
    T thedata = portname.read();

    sc_out < T > portname;
    portname.write(data);

Port mapping
Ports can be mapped in any order. If myModuleA is an object of class ModuleA and myModuleB is an object of class ModuleB, their ports are bound by name:

    myModuleA.a1(in1);
    myModuleA.a2(in2);
    myModuleB.b2(in3);
    myModuleB.b3(out1);

Hierarchical design A module may contain sub-module instances in its body to form hierarchy There are no restrictions on the level of the hierarchy

Hierarchical design

    SC_MODULE(Top) {
        sc_in<sc_uint<8> > in1;
        sc_in<sc_uint<8> > in2;
        sc_in<sc_uint<8> > in3;
        sc_out<sc_uint<8> > out;
        // module instances
        ModuleA *myModuleA;
        ModuleB *myModuleB;
        // signal declarations
        sc_signal<sc_uint<8> > sig;
        SC_CTOR(Top) { … }
    };

The constructor instantiates the sub-modules and maps their ports:

    SC_CTOR(Top) {
        myModuleA = new ModuleA ("mA");
        myModuleA->a1(in1);
        myModuleA->a2(in2);
        myModuleA->a3(sig);
        myModuleB = new ModuleB ("mB");
        myModuleB->b1(sig);
        myModuleB->b2(in3);
        myModuleB->b3(out);
    }

Processes
Processes describe the parallel behavior of hardware systems. Processes execute concurrently. The code within a process, however, executes sequentially.
Three types of SystemC processes:
• SC_THREAD
• SC_CTHREAD (a special case of SC_THREAD)
• SC_METHOD
A process must be registered within its module's constructor. The process's sensitivity to the clock, resets and other signal ports is specified when the process is registered.

    SC_MODULE(my_module) {
        sc_in_clk clock;
        void my_thread();
        …
        SC_CTOR(my_module) {
            SC_THREAD (my_thread);
            sensitive << clock.pos();
        }
    };

Process body
The process body contains the implementation of the process. Like C++ functions, it may be defined:
• within the module definition, typically in a .h header file
• outside the module, typically in a .cpp file

A thread process body within a module definition (my_module.h):

    SC_MODULE (my_module) {
        void my_thread() { ... }
    };

A thread process body outside a module definition (my_module.h and my_module.cpp):

    // my_module.h
    SC_MODULE (my_module) {
        void my_thread();
        ...
    };

    // my_module.cpp
    void my_module::my_thread() { ... }

SC_THREAD processes
• A thread is a process that is called only once and never gets called again after termination (unless there is a global reset).
• The thread process body is composed of two stages, initialization and synthesis, separated by the first wait() statement.
• The synthesis stage of a thread process should typically be written as a non-terminating loop.
• wait() can be used in the synthesis stage; it suspends the thread, which resumes upon an event from the thread's associated clock edge.

    void my_thread() {
        ….     // compile-time initialization stage
        wait();
        …..    // run-time hardware synthesis stage
    }

Thread mechanisms
Any statements between two wait() statements will be constructed as combinational logic. These two examples create the same logic:

    wait();
    e = ((a & 0xF0) >> 4) | ((b & 0x0F) << 4);

    wait();
    c = (a & 0xF0);
    c = c >> 4;
    d = (b & 0x0F);
    d = d << 4;
    e = c | d;

The synthesis stage runs when it receives the signal to which the process is sensitive. The thread may be sensitive to a positive edge or a negative edge, but not both.
All the values assigned to variables in the initialization stage must be resolvable at compile time. They cannot contain signal or port reads.

Thread example

    void my_module::run() {
        int a, b;
        // end of compile-time initialization stage
        wait();
        // start of runtime synthesized hardware stage
        a = 1;      // clock cycle 1
        wait();
        a = a+1;    // clock cycle 2
        b = 5;      // clock cycle 2
        wait();
        a = b;      // clock cycle 3
        b = b+1;    // clock cycle 3
        …..
    }

SC_METHOD processes
• A method process can be used to model either synchronous or combinational hardware.
• An SC_METHOD process must not contain wait() statements and must always terminate.
• A method must be sensitive to all the ports and signals it reads.
• It is executed every time a trigger or temporal event occurs.
• Each signal and port written to must be written to on every execution of the SC_METHOD.

    SC_MODULE(adder) {
        sc_in <sc_uint<32> > in1;
        sc_in <sc_uint<32> > in2;
        sc_out <sc_uint<32> > out;
        void add() { out = in1.read() + in2.read(); }
        SC_CTOR(adder) {
            SC_METHOD(add);
            sensitive << in1 << in2;
        }
    };

Process sensitivity list
A sensitivity list identifies which input ports and signals trigger execution of the code within a process. A process can read from and write to ports, internal signals, and internal variables. Processes use signals to communicate with each other. One process can cause another process to execute by assigning a new value to a signal that interconnects them.

    SC_MODULE (my_module) {
        void my_thread();
        sc_in <bool> clock;
        ...
        SC_CTOR (my_module) {
            SC_THREAD (my_thread);
            sensitive << clock.pos();
        }
    };

Synthesis and compilation flow
[Figure: the synthesizable subset of the code goes through the Celoxica Agility synthesizer to Verilog/EDIF and then to Quartus II; the synthesizable subset together with the rest of the code (testbenches, SW code) is compiled with Visual C++ and the SystemC library into an executable.]

Using signals to communicate between processes and modules
incr.h

    SC_MODULE(incr) {
      private:
        sc_signal <int> x;
        int d[8];
      public:
        sc_in <sc_int<18> > SW;
        sc_out <sc_uint<7> > HEX0;
        void runadd();
        void display();
        SC_CTOR(incr) {
            SC_METHOD(runadd);
            sensitive << SW;
            SC_METHOD(display);
            sensitive << x;
        }
    };

[Figure: module incr; process runadd reads port SW and writes signal x; process display reads x and drives port HEX0.]

Implementing the processes

    void incr::runadd() {
        int y;
        y = SW.read();
        x = y + 1;
    }

    // Drives the seven-segment display with the last decimal digit of x.
    void incr::display() {
        int digit, i = 0;
        int t = x;
        sc_uint<7> hex;
        digit = t % 10;
        if (digit == 0) hex = 64;
        else if (digit == 1) hex = 121;
        else if (digit == 2) hex = 36;
        else if (digit == 3) hex = 48;
        else if (digit == 4) hex = 25;
        else if (digit == 5) hex = 18;
        else if (digit == 6) hex = 3;
        else if (digit == 7) hex = 120;
        else if (digit == 8) hex = 0;
        else if (digit == 9) hex = 24;
        HEX0 = hex;
    }

Synthesis point: ag_main

    #include <systemc.h>
    #include "incr.h"

    void ag_main() {
        incr incr1("incr");
    }

Use the Verilog file produced by Celoxica's Agility compiler with the Quartus II software.

Testing and verifying your code in a C++ development environment

    SC_MODULE(tester) {
        int x;
      public:
        sc_out <sc_int<18> > SW;
        sc_in <sc_uint<7> > HEX0;
        void run(void) {
            wait();
            while (1) {
                cout << "enter a number" << endl;
                cin >> x;
                SW.write(x);
                cout << "answer " << HEX0.read();
            }
        }
        SC_CTOR(tester) {
            SC_THREAD(run);
            sensitive << HEX0;
        }
    };

[Figure: a testbench module connected to the SW and HEX0 ports of module incr, which contains the processes runadd and display communicating through signal x.]

Simulation entry point sc_main

    #include <iostream>
    #include <systemc.h>
    #include "incr.h"   // plus the header that defines the tester module
    using namespace std;

    int sc_main(int argc, char *argv[]) {
        // elaboration
        sc_signal <sc_int<18> > SW;
        sc_signal <sc_uint<7> > HEX0;
        incr incr1("incr1");
        tester test1("test");
        incr1.SW(SW);
        incr1.HEX0(HEX0);
        test1.SW(SW);
        test1.HEX0(HEX0);
        // execution
        sc_start();
        return 0;
    }

Launch your executable (simulator)
This time you are using the Visual C++ compiler together with the SystemC library Simulate your system by executing it on the command prompt

Integer data types
Supported native C++ synthesizable data types:
• long long (64 bits)
• long (32 bits)
• int (32 bits)
• short (16 bits)
• char (8 bits)
• bool (1 bit)
SystemC also allows further refined storage types: sc_bit, sc_bv <width>, sc_int <width>, sc_uint <width>, sc_bigint <width>, sc_biguint <width>.

Floating-point data types
• Fully supported for compilation with Visual C++ and the SystemC library.
• Supported for synthesis by Celoxica's Agility compiler only if they can be evaluated at compile time: any calculation involving floating point must evaluate to a constant.

Synthesizable example:

    sc_int<16> SineTable[128];
    for (int i = 0; i < 128; i++) {
        double index = (double) i;
        double angle = 2.0 * PI * (index / 128.0);
        double sineangle = sin(angle);
        SineTable[i] = sc_int<16>(sineangle * … /* scale factor not given */);
    }

Not synthesizable example:

    sc_in<float> in;
    float x;
    x = in.read();

Arrays
• Fully supported for compilation with Visual C++ and the SystemC library.
• For synthesis with the Agility compiler, an array is synthesizable if its elements are of a synthesizable type and its size is determinable at compile time.

Synthesizable example:

    int temp, array[100];
    sc_in <int> in;
    for (int y = 0; y < 100; y++) {
        temp = in.read();
        array[y] = temp;
        wait();
    }

In the next example, the float array itself is not synthesizable, but its members are only used at compile time and are cast to type int, so the code is accepted:

    #include <math.h>
    float l[10];
    for (int y = 0; y < 10; y++)
        l[y] = log10((y + 1) * 10);
    //…
    sc_out <int> out;
    for (int y = 0; y < 10; y++)
        out.write((int) l[y]);

Pointers
• Fully supported for compilation with Visual C++ and the SystemC library.
• Agility supports pointers subject to the restriction that Agility can always determine the target of the pointer.
• A pointer is synthesizable if it is a pointer to a synthesizable type and the value to which the pointer points is determinable at compile time.

Resolvable pointer:

    void clear(char *a, char *b) {
        *a = 255;
        *b = 255;
    }

    sc_out <unsigned char> out;
    unsigned char x, y;
    clear(&x, &y);
    out.write(x);

Other considerations for synthesis
• All of the following are fully supported for compilation with Visual C++ and the SystemC library; the restrictions apply to synthesis only.
• Operator new is supported at compile time but not at runtime. The delete operator is not supported.
• Each action in a switch must have a break statement. Fall-through is not allowed.
• If a function is to be synthesized, its body must only contain code that is within the Agility synthesizable subset.
• General recursion is not supported for synthesis.

Example using synthesis and compilation flow
[Figure: the synthesizable subset of the code goes through the Celoxica Agility synthesizer to Verilog/EDIF and then to Quartus II; the synthesizable subset together with the rest of the code (testbenches, SW code) is compiled with Visual C++ and the SystemC library into an executable.]

Celoxica’s Agility compiler tutorial

Start adding files to your project

Adjust the project settings to use the Cyclone II devices

Add your file and write your class declaration

Add the main synthesis entry point
Not the most direct implementation

Build your project

Check the CDFG and Verilog output

Copy the Verilog file into Quartus II
Sometimes the Celoxica compiler changes the names of input/output ports when it exports to Verilog, so make sure to fix this in the Quartus II assignment editor. Then build and download to the FPGA.

What if we want to verify and simulate before downloading to the FPGA?
Choose Visual C++ as the output.
[Figure: a tester module connected to the orgate module through KEY[0], KEY[1], LEDG and clk.]
Add a tester.h for the tester module.

Add the main body

Hit Build and then run the executable
Build indirectly invokes the command line compiler of VC (cl) which links your compiled code with SystemC.lib

If you want to synthesize again, make sure to mark the files you want to synthesize
Choose Verilog as your desired output again Exclude tester.h and orgate_exe.cpp

HW/Lab 3
Objective: learn SystemC using both the synthesis and compilation flows. This time it is a simple example: we will design an 8-bit ALU. Use the 18 switches on the DE2 board as inputs:
• 8 switches give the binary value of the first unsigned integer
• 8 switches give the binary value of the second unsigned integer
• 2 switches select the ALU operation (addition, subtraction, multiplication or XOR)
In your report, make sure to include the SystemC code, the printed output of the simulation executable, and the FPGA resource utilization. Send me your archived projects for both the SystemC design and the Quartus II files.
Lab due before the Thanksgiving holiday (Thur 22nd).
Tutorials and the Celoxica manual are uploaded to the class webpage and are also available for download from the Engineering website.

Lecture 16: Application-Driven Hardware Acceleration (1/4) Prof. Sherief Reda Division of Engineering, Brown University

Fast Fourier transform
One of the most important subroutines in scientific computing Used in many applications including: signal and image processing, solution of differential equations, multiplication of polynomial functions, data compression, …, etc One of the most widely implemented hardware accelerators

Discrete Fourier transform
The DFT maps a set of N input points to another set of N output points. The operation is reversible.

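Roots of unity
What are the Nth roots of unity? They are the N complex solutions of z^N = 1, i.e. e^(j2πk/N) for k = 0, 1, …, N-1.
[Figure: for N = 8, the eight roots on the unit circle in the complex plane (real and imaginary axes), including (1, 0), (0, j), (-1, 0) and (0, -j).]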

Calculating the DFT How many arithmetic (+ and *) operations do we need to calculate the DFT?
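To make the operation count concrete, here is a minimal sketch of the direct computation. It assumes the common convention X[k] = sum over n of x[n] * e^(-j*2*pi*k*n/N); the sign of the exponent and the name naive_dft are assumptions, not taken from the slides. The two nested loops show why the direct method needs on the order of N^2 complex multiplications and additions.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Direct DFT: X[k] = sum_{n=0}^{N-1} x[n] * exp(-j*2*pi*k*n/N).
// The outer loop runs N times and the inner loop runs N times,
// so the cost is N*N complex multiply-accumulate steps.
std::vector<cd> naive_dft(const std::vector<cd>& x) {
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> X(N, cd(0.0, 0.0));
    for (std::size_t k = 0; k < N; ++k) {
        for (std::size_t n = 0; n < N; ++n) {
            const double angle =
                -2.0 * pi * static_cast<double>(k * n) / static_cast<double>(N);
            X[k] += x[n] * std::polar(1.0, angle);  // x[n] times a root of unity
        }
    }
    return X;
}
```

For N = 8 this is 64 multiply-accumulate steps, which is the baseline that the FFT improves on.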

Computing the DFT using the FFT
How can we do better? With the Fast Fourier Transform (FFT): the sum of the N-point DFT is broken into two N/2-point DFTs, one over the even indices and one over the odd indices.
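The even/odd split can be sketched as a recursive radix-2 FFT. This is an illustrative sketch, not the lecture's implementation: it assumes N is a power of two and the negative-exponent convention, and the function name fft is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Recursive radix-2 decimation-in-time FFT (assumes size is a power of two).
std::vector<cd> fft(const std::vector<cd>& x) {
    const std::size_t N = x.size();
    assert(N > 0 && (N & (N - 1)) == 0);  // power-of-two length only
    if (N == 1) return x;

    // Split into even-indexed and odd-indexed samples.
    std::vector<cd> even(N / 2), odd(N / 2);
    for (std::size_t i = 0; i < N / 2; ++i) {
        even[i] = x[2 * i];
        odd[i] = x[2 * i + 1];
    }
    const std::vector<cd> E = fft(even);  // N/2-point DFT of even indices
    const std::vector<cd> O = fft(odd);   // N/2-point DFT of odd indices

    // Combine: one butterfly per output pair (k, k + N/2).
    const double pi = std::acos(-1.0);
    std::vector<cd> X(N);
    for (std::size_t k = 0; k < N / 2; ++k) {
        const cd w = std::polar(
            1.0, -2.0 * pi * static_cast<double>(k) / static_cast<double>(N));
        X[k] = E[k] + w * O[k];
        X[k + N / 2] = E[k] - w * O[k];
    }
    return X;
}
```

Each level of recursion does O(N) work and there are log2(N) levels, which gives the O(N log N) total.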

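Example when N=8
Objective: compute X0, X1, …, X7 given x0, x1, …, x7.
[Figure: one "magic box" takes the even-indexed inputs x0, x2, x4, x6 and a second "magic box" takes the odd-indexed inputs x1, x3, x5, x7; their results are combined into the outputs X0 … X7.]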

Now let's apply the idea recursively
[Figure: the two boxes are each split again; the inputs now appear in the order x0, x4, x2, x6, x1, x5, x3, x7 and the outputs are X0 … X7.]

One more time
[Figure: the fully decomposed network, with inputs in the order x0, x4, x2, x6, x1, x5, x3, x7 and outputs X0 … X7.]
How many operations do we need now? What is the execution time on a general-purpose CPU? What is the execution time on an FPGA? How many resources do you need?

Another way to visualize FFT computations
How can we determine the order of the first inputs?
[Figure: three stages of butterfly units, four per stage, connecting the inputs x0, x4, x2, x6, x1, x5, x3, x7 to the outputs.]
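One way to answer the input-ordering question: the position of each input is its index with the binary digits reversed. The sketch below assumes N is a power of two; the function name bit_reversed_order is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the input index that feeds each position of the first FFT stage:
// position i receives x[reverse of the log2(N)-bit pattern of i].
std::vector<std::size_t> bit_reversed_order(std::size_t N) {
    assert(N > 0 && (N & (N - 1)) == 0);  // power-of-two length only
    std::size_t bits = 0;
    while ((std::size_t{1} << bits) < N) ++bits;

    std::vector<std::size_t> order(N);
    for (std::size_t i = 0; i < N; ++i) {
        std::size_t r = 0;
        for (std::size_t b = 0; b < bits; ++b) {
            // Move bit b of i to the mirrored position (bits - 1 - b).
            if (i & (std::size_t{1} << b)) r |= std::size_t{1} << (bits - 1 - b);
        }
        order[i] = r;
    }
    return order;
}
```

For N = 8 this yields 0, 4, 2, 6, 1, 5, 3, 7, which matches the input order x0, x4, x2, x6, x1, x5, x3, x7 in the diagrams.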

Application of FFT: faster multiplication of two polynomials
Suppose we want to evaluate A(x) at x0. How many operations do we need? Use Horner's rule.
Suppose you have two polynomials represented by their coefficient vectors. How many operations does it take to add these two polynomials? How many operations does it take to multiply them?
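Horner's rule rewrites A(x) = a0 + a1*x + … + a(N-1)*x^(N-1) as a0 + x*(a1 + x*(a2 + …)), so one evaluation takes N-1 multiplications and N-1 additions. A minimal sketch, with the coefficient ordering (lowest degree first) and the name horner chosen here for illustration:

```cpp
#include <cstddef>
#include <vector>

// Evaluates a[0] + a[1]*x + ... + a[N-1]*x^(N-1) with Horner's rule.
// Coefficients are stored lowest degree first (an assumption of this sketch).
double horner(const std::vector<double>& a, double x) {
    double result = 0.0;
    // Walk from the highest-degree coefficient down to a[0].
    for (std::size_t i = a.size(); i-- > 0;) {
        result = result * x + a[i];
    }
    return result;
}
```

For example, horner({1, 2, 3}, 2) evaluates 1 + 2*2 + 3*4 = 17.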

Point value representation
A point-value representation of a polynomial A(x) of degree-bound N is a set of N point-value pairs {(x0, y0), (x1, y1), …, (xN-1, yN-1)} such that all of the xk are distinct and yk = A(xk) for k = 0, 1, …, N-1.
How many operations do we need to compute the point-value representation of a polynomial? How can we do better?

Interpolation of polynomials from point-value representations
Given the point-value representation of a polynomial, how can we invert the evaluation, i.e., determine the coefficient form of the polynomial from its point-value representation? How can we find the a's?

Adding and multiplying polynomials in point representation
[Polynomial A and polynomial B are given in point-value form at the same evaluation points.]
If polynomial C(x) = A(x) + B(x), then we can get the point-value representation of C easily. How many operations do we need? How about C(x) = A(x)*B(x)?

How can we convert a polynomial quickly from coefficient form to point-value form and back?
• Ordinary multiplication: O(N^2)
• Evaluate: O(N^2)
• Point-wise multiplication: O(N)
• Interpolate: O(N^2)
It does not make sense yet. How can we evaluate and interpolate faster than O(N^2)? Can we choose the evaluation points smartly?

Choosing the evaluation points smartly
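Choose the N complex Nth roots of unity as the evaluation points. Evaluating A(x) at these points is exactly the DFT of the coefficient vector, which the FFT computes in O(N log N).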

Finally: multiplying polynomials in O(N log N)
• Ordinary multiplication: O(N^2)
• FFT (evaluate): O(N log N)
• Point-wise multiplication: O(N)
• Inverse FFT (interpolate): O(N log N)

Back to signal processing
[Figure: a linear system with impulse response (b0, b1, …, bN-1) driven by the input signal (a0, a1, …, aN-1).]
T=0: a0b0
T=1: a0b1+a1b0
T=2: a0b2+a1b1+a2b0
….
The response of the system to the input signal at different times is equal to the coefficients of the polynomial produced by multiplying the input signal polynomial with the impulse response polynomial. This is commonly known as the convolution of the input and the system's impulse response. How can we find the output response faster than O(N^2)?
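The O(N^2) baseline can be sketched directly: the output at time T is the sum of all products a[i]*b[j] with i + j = T, which is also coefficient T of the product polynomial. The function name convolve and the use of integer samples are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Direct convolution of input a with impulse response b.
// c[T] = sum of a[i] * b[j] over all i + j = T, e.g. c[1] = a0*b1 + a1*b0.
std::vector<long long> convolve(const std::vector<long long>& a,
                                const std::vector<long long>& b) {
    if (a.empty() || b.empty()) return {};
    std::vector<long long> c(a.size() + b.size() - 1, 0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        for (std::size_t j = 0; j < b.size(); ++j) {
            c[i + j] += a[i] * b[j];  // one multiply-accumulate per (i, j) pair
        }
    }
    return c;
}
```

With a = (1, 2) and b = (3, 4) the result is (3, 10, 8), the coefficients of (1 + 2x)(3 + 4x). The FFT route replaces the double loop with transform, point-wise multiply and inverse transform.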

Summary
• The lecture covered one of the most important hardware accelerators: the FFT.
• We have seen how it can be parallelized and sped up.
• We examined some of its applications.

Viterbi algorithm A dynamic programming algorithm for finding the most likely sequence of hidden states, the Viterbi path, that results in a sequence of observed events. Originally devised by Andrew Viterbi in 1967 as an error-correction scheme for noisy digital communication links. Widely used in decoding the convolutional codes for both CDMA and GSM digital cellular, dial-up modems, satellite, deep-space communications and wireless LANs. Also used in speech recognition, computational linguistics, and bioinformatics.

Viterbi decoders in digital communication systems

1. Encoding using convolution codes
[Figure: a shift register holding u1, u0, u-1, u-2 feeds two modulo-2 adders that produce the outputs O1 and O2.]
Each input bit is coded onto 2 output bits. The 2 output bits are produced by modulo-2 adders. The selection of which bits are added to produce an output bit is called the generator polynomial.
O1 = (u1+u0+u-1+u-2) mod 2
O2 = (u1+u0+u-2) mod 2

Example Assume the input sequence is 1011 What is the output?
Example by C. Langton

Truth table presentation

State transition graph representation
[Figure: state transition graph (a de Bruijn graph) whose edges are labeled with the output pair O1O2, e.g. 00, 11, 01, 10. Not all outputs are shown.]

Tree representation

Trellis diagram Not all transitions are shown

Output of the encoder for various inputs
[Table: the encoder output sequence for each of the 16 possible 4-bit inputs, 0000 through 1111.]
How can we devise a good generator polynomial?
Let's say we receive 01 11 01 11 01 01 11. It is not one of the 16 possible sequences. How do we decode it?

2. Decoding received sequences using the Viterbi algorithm
Let's decode the received sequence 01 11 01 11 01 01 11.
[Figure: trellis over the states 000, 001, 010, 011, 100, 101, 110, 111 after the 1st step, labeled with the received bits, the branch outputs and the accumulated cost of each path.]

2nd step
[Figure: the trellis extended by the second received pair, with updated path costs.]

3rd step
[Figure: the trellis extended by the third received pair, with updated path costs.]

4th step
[Figure: the trellis extended by the fourth received pair. Where two paths converge on a state, e.g. with costs 3 and 6, the state keeps min(3, 6).]
At any step, there is only one surviving path from the initial state to any state. If more than one path converges on a node, always pick the one with the minimum cost.

5th step
[Figure: the trellis extended by the fifth received pair, with updated path costs.]

6th step
[Figure: the trellis extended by the sixth received pair, with updated path costs.]

Finally
[Figure: the completed trellis with the final path costs.]
The winning path is 000, 100, 010, 101, 110, 011, 001, 000, which corresponds to the input sequence 1011 followed by the three flush bits 000.
What is the runtime using SW on a general-purpose CPU? What is the runtime using an FPGA?
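The step-by-step procedure can be sketched as a hard-decision Viterbi decoder for this encoder. The sketch assumes the Hamming distance between received and expected bit pairs as the branch cost, a start in state 000, and a flushed encoder so that the traceback can begin at state 000; the function name viterbi_decode and the string-based interface are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

// States are the register contents (u0, u-1, u-2) packed as a 3-bit integer.
// received holds one two-character string ("01", "11", ...) per time step.
std::string viterbi_decode(const std::vector<std::string>& received) {
    const int kStates = 8;
    const int kInf = INT_MAX / 2;
    std::vector<int> cost(kStates, kInf);
    cost[0] = 0;  // the encoder starts in state 000
    std::vector<std::array<int, 8>> prev(received.size());

    for (std::size_t t = 0; t < received.size(); ++t) {
        assert(received[t].size() == 2);
        std::vector<int> next(kStates, kInf);
        prev[t].fill(-1);
        for (int s = 0; s < kStates; ++s) {
            if (cost[s] >= kInf) continue;  // state not reachable yet
            const int u0 = (s >> 2) & 1, um1 = (s >> 1) & 1, um2 = s & 1;
            for (int u1 = 0; u1 <= 1; ++u1) {
                const int o1 = u1 ^ u0 ^ um1 ^ um2;
                const int o2 = u1 ^ u0 ^ um2;
                // Branch cost: number of bits that differ from what was received.
                const int dist = (o1 != received[t][0] - '0') +
                                 (o2 != received[t][1] - '0');
                const int ns = (u1 << 2) | (s >> 1);  // shift the new bit in
                if (cost[s] + dist < next[ns]) {      // keep the cheaper path
                    next[ns] = cost[s] + dist;
                    prev[t][ns] = s;
                }
            }
        }
        cost = next;
    }

    // Trace back from state 000; the top bit of each state is the input bit.
    std::string bits(received.size(), '0');
    int s = 0;
    for (std::size_t t = received.size(); t-- > 0;) {
        assert(s >= 0);
        bits[t] = static_cast<char>('0' + ((s >> 2) & 1));
        s = prev[t][s];
    }
    return bits;
}
```

Decoding 01 11 01 11 01 01 11 with this sketch returns 1011000, i.e. the message 1011 followed by the flush bits. In software the eight state updates per step run one after another; in hardware they can be evaluated in parallel.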

Summary
So far we have covered popular application-driven algorithms to accelerate on FPGAs:
• FFT for signal and image processing, as an example of divide-and-conquer algorithms
• Speech recognition applications
• Viterbi algorithm for digital communication, as an example of dynamic programming algorithms
Next time, we cover some popular algorithms for bioinformatics.

Project updates
• 2nd project report extended until Sunday Dec 2nd. Make sure to add the new material to the content of the 1st report.
• The new report is worth 10 points. The main evaluation criterion is your progress on the project plan you outlined in the first report. How thoroughly and creatively have your ideas developed? How meticulous is the experimental setup? How do the experiments you carried out serve the project goals?
• Make sure to also send me a couple of slides by Monday Dec 3rd to present on Tuesday Dec 4th (last lecture).

Status We have covered popular application-driven hardware acceleration using reconfigurable computing FFT for signal and image processing as an example of divide and conquer algorithms Speech recognition applications Viterbi algorithm for digital communication as an example of dynamic programming algorithms This lecture we overview some of the algorithms for bioinformatics

Quick introduction to molecular biology & bioinformatics

DNA Can be thought of as the “blueprint” for an organism
Composed of small molecules called nucleotides four different nucleotides distinguished by the four bases: adenine (A), cytosine (C), guanine (G) and thymine (T) DNA is digital information A single strand of DNA can be thought of as a string composed of the four letters: A, C, G, T ACGTTCTA DNA molecules usually consist of two strands arranged in a double helix structure where A bonds to T and C bonds to G

Genes Genes are the basic units of heredity
A gene is a sequence of bases that carries the information required for constructing a particular protein. Such a gene is said to encode a protein The human genome comprises ~ 20K-25K genes Those genes encode > 100,000 proteins

Proteins a folded protein structure amino acids Proteins perform most life functions and even make up the majority of cellular structures. Proteins are large, complex molecules made up of smaller subunits called amino acids. Chemical properties that distinguish the 20 different amino acids cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell. Proteins can be thought of as a string composed from a 20- character alphabet

Central dogma of molecular biology
RNA is like DNA except that it is usually single stranded and the base uracil (U) is used in place of thymine (T). A strand of RNA can be thought of as a string composed of the four letters A, C, G, U.

Translation

Translation
There are 6 possible reading frames in translating DNA sequences into proteins. In many cases, FPGAs are used to translate a DNA sequence into the 6 frames in parallel and then concurrently apply any subsequent processing.
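The six frames are the three codon offsets on the given strand plus the three offsets on its reverse complement. The sketch below only splits a DNA string into the codons of each frame and leaves out the codon-to-amino-acid table; the function names reverse_complement and six_frames are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Reverse the strand and swap each base with its partner (A<->T, C<->G).
std::string reverse_complement(const std::string& dna) {
    std::string rc(dna.rbegin(), dna.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'T': c = 'A'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            default: break;  // leave unknown symbols untouched
        }
    }
    return rc;
}

// Frames 0-2: offsets 0, 1, 2 on the forward strand.
// Frames 3-5: offsets 0, 1, 2 on the reverse complement.
std::vector<std::vector<std::string>> six_frames(const std::string& dna) {
    std::vector<std::vector<std::string>> frames;
    const std::string strands[2] = {dna, reverse_complement(dna)};
    for (const std::string& strand : strands) {
        for (std::size_t offset = 0; offset < 3; ++offset) {
            std::vector<std::string> codons;
            for (std::size_t i = offset; i + 3 <= strand.size(); i += 3) {
                codons.push_back(strand.substr(i, 3));  // one codon = 3 bases
            }
            frames.push_back(codons);
        }
    }
    return frames;
}
```

The six inner loops are independent of each other, which is what makes the frames easy to process in parallel in hardware.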

DNA string alignment
A sequence alignment is a way of arranging the primary sequences of DNA (or RNA or protein) to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as insertion or deletion mutations introduced in one or both lineages in the time since they diverged from one another.
At each position, one of three cases can occur:
• A match, when the same character is present in both strings
• A mismatch, or substitution, when there are two different characters
• A gap, where there is an insertion of one character in only one string, or symmetrically a deletion in the other string
How can we find the best alignment between two DNA strings?

Finding the best global alignment
[Figures from slides from Bioinformatics Applications by D. Lavenier and M. Giraud]
Costs: +4 for a match, -2 for a mismatch, -3 for a gap.
Needleman and Wunsch (NW) dynamic programming algorithm.
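A minimal sketch of the Needleman-Wunsch recurrence with the costs above (+4 match, -2 mismatch, -3 gap). It returns only the best global score, not the alignment itself; the function name global_alignment_score and the linear gap penalty applied to the borders are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// H[i][j] = best score for aligning the first i characters of a
// with the first j characters of b.
int global_alignment_score(const std::string& a, const std::string& b) {
    const int kMatch = 4, kMismatch = -2, kGap = -3;
    const std::size_t M = a.size(), N = b.size();
    std::vector<std::vector<int>> H(M + 1, std::vector<int>(N + 1, 0));

    // Borders: aligning a prefix against nothing costs one gap per character.
    for (std::size_t i = 1; i <= M; ++i) H[i][0] = static_cast<int>(i) * kGap;
    for (std::size_t j = 1; j <= N; ++j) H[0][j] = static_cast<int>(j) * kGap;

    for (std::size_t i = 1; i <= M; ++i) {
        for (std::size_t j = 1; j <= N; ++j) {
            const int s = (a[i - 1] == b[j - 1]) ? kMatch : kMismatch;
            H[i][j] = std::max({H[i - 1][j - 1] + s,   // match or mismatch
                                H[i - 1][j] + kGap,    // gap in b
                                H[i][j - 1] + kGap});  // gap in a
        }
    }
    return H[M][N];
}
```

For example, aligning ACGT with AGT scores 9: three matches (+12) and one gap (-3).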

Local alignment: finding the most similar subsequences
Costs: +4 for a match, -2 for a mismatch, -3 for a gap.
Smith and Waterman (SW) algorithm.
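For local alignment the recurrence is the same except that a cell never drops below zero and the answer is the largest cell anywhere in the matrix. A minimal sketch with the costs above; the function name local_alignment_score is hypothetical and only the score is returned.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Smith-Waterman: best score over all pairs of substrings of a and b.
int local_alignment_score(const std::string& a, const std::string& b) {
    const int kMatch = 4, kMismatch = -2, kGap = -3;
    const std::size_t M = a.size(), N = b.size();
    // Row 0 and column 0 stay at 0: a local alignment may start anywhere.
    std::vector<std::vector<int>> H(M + 1, std::vector<int>(N + 1, 0));
    int best = 0;
    for (std::size_t i = 1; i <= M; ++i) {
        for (std::size_t j = 1; j <= N; ++j) {
            const int s = (a[i - 1] == b[j - 1]) ? kMatch : kMismatch;
            H[i][j] = std::max({0,                     // restart the alignment
                                H[i - 1][j - 1] + s,   // match or mismatch
                                H[i - 1][j] + kGap,    // gap in b
                                H[i][j - 1] + kGap});  // gap in a
            best = std::max(best, H[i][j]);
        }
    }
    return best;
}
```

For TTACGTTT and GGACGGG the best local alignment is the shared ACG, with a score of 12.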

Dynamic programming advantage on FPGAs
All cells on the same anti-diagonal can be computed simultaneously. What is the runtime on a general-purpose CPU? What is the runtime on an FPGA?
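The anti-diagonal order can be sketched in software by sweeping the matrix one diagonal (i + j = d) at a time. Every cell on a diagonal depends only on the two previous diagonals, so the inner loop below is what a hardware array would evaluate in a single step. The sketch reuses the global alignment costs (+4, -2, -3); the names WavefrontResult and wavefront_global_score are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct WavefrontResult {
    int score;          // global alignment score
    std::size_t waves;  // number of anti-diagonal steps that were needed
};

WavefrontResult wavefront_global_score(const std::string& a, const std::string& b) {
    const int kMatch = 4, kMismatch = -2, kGap = -3;
    const std::size_t M = a.size(), N = b.size();
    std::vector<std::vector<int>> H(M + 1, std::vector<int>(N + 1, 0));
    for (std::size_t i = 1; i <= M; ++i) H[i][0] = static_cast<int>(i) * kGap;
    for (std::size_t j = 1; j <= N; ++j) H[0][j] = static_cast<int>(j) * kGap;

    std::size_t waves = 0;
    for (std::size_t d = 2; d <= M + N; ++d) {  // d = i + j
        const std::size_t i_lo = (d > N) ? d - N : 1;
        const std::size_t i_hi = std::min(M, d - 1);
        if (i_lo > i_hi) continue;              // happens only if a or b is empty
        // All cells of this anti-diagonal are independent of each other.
        for (std::size_t i = i_lo; i <= i_hi; ++i) {
            const std::size_t j = d - i;
            const int s = (a[i - 1] == b[j - 1]) ? kMatch : kMismatch;
            H[i][j] = std::max({H[i - 1][j - 1] + s,
                                H[i - 1][j] + kGap,
                                H[i][j - 1] + kGap});
        }
        ++waves;
    }
    return {H[M][N], waves};
}
```

A sequential CPU visits all M*N cells one by one, while a design with one computational cell per diagonal position finishes in M+N-1 steps.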

Required number of computational cells

Examples of commercial products
Bioceleration Ltd.
• Each BioXL/H board contains eight FPGA modules and 128MB of global memory.
• Each of the modules is programmed to calculate four matrix cells per clock cycle (for the Smith-Waterman algorithm).
• An eight-board BioXL/H executes these applications at a speed of 6 billion matrix cells per second.
• The clock rate of the system is 25-33MHz (programmable).
Examples of applications supported:
• Smith-Waterman algorithm
• Translation of nucleic acid sequences to 6 reading frames and searching of the frames against an amino acid database

More examples: TimeLogic
“CodeQuest is a biocomputing workstation that processes large genomics searches and sophisticated informatics workflows. Using its FPGA-based DeCypher Engines, the quad-core CodeQuest workstation speeds Tera-BLAST, Smith-Waterman, Hidden Markov Model (HMM) and gene modeling searches at the speed of a mid-sized cluster.” “It brings several fold the performance of a 64-CPU cluster, yet costs less than 10 CPUs”
