Edge Detection
A PC sends a 256x256-byte image to the FPGA over a UART interface, 1 byte every few hundred FPGA clock cycles. The Sobel circuit on the FPGA outputs the edge and direction.
Block Diagram of Sobel Edge Detector
Input is a stream of pixels.
Output is a stream of:
  Edge: (0/1) whether the pixel at the center of the table is an edge
  Direction: horizontal, vertical, left-diagonal, right-diagonal
[Figure: block diagram. Blocks: Write Addr producer, Memory (3x256 Bytes), Read Addr producer, 3x3 Table, Derivative, Absolute, Magnitude. Signal labels: input, data & address, address, edge, direction.]
Hints for the design
Try to understand the algorithm.
Your implementation should match the Reference Model at each stage.
Create a High Level Model (HLM):
  The HLM should include a 3x256 memory array instead of 256x256.
  The HLM does not need to be synthesizable.
  Verify the functionality of the HLM by comparing its results with the Reference Model.
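The 3-row memory idea above can be sketched as a small high-level model. This is a minimal sketch, not the course's Reference Model: the function name `stream_windows`, the row-major pixel order, and the choice to store image row r in buffer row r mod 3 are assumptions made for illustration.

```python
def stream_windows(pixels, width):
    """Yield (row, col, window) for each interior pixel of a row-major stream.

    Only a 3 x width buffer is kept (assumption: image row r is stored in
    buffer row r % 3), instead of the whole image.
    """
    buf = [[0] * width for _ in range(3)]
    for index, value in enumerate(pixels):
        row, col = divmod(index, width)
        buf[row % 3][col] = value
        # A full 3x3 window exists once two earlier rows and two earlier
        # columns have been stored.
        if row >= 2 and col >= 2:
            window = [
                [buf[(row - 2 + dr) % 3][col - 2 + dc] for dc in range(3)]
                for dr in range(3)
            ]
            # The window is centered on the pixel one row up, one column left.
            yield row - 1, col - 1, window


if __name__ == "__main__":
    # Small 4x4 test image, streamed pixel by pixel.
    image = [[10 * r + c for c in range(4)] for r in range(4)]
    flat = [p for line in image for p in line]
    for r, c, w in stream_windows(flat, 4):
        print(r, c, w)
```

Each window can then be passed to whatever per-stage function is being checked against the Reference Model.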
Hints for the design
Create synthesizable code.
Verify the functionality of the synthesizable code.
Optimize the synthesizable code:
  Divide the algorithm into pipeline stages.
  Verify the functionality of each stage.
  Optimize each stage using the RTL design optimization techniques.
  Try to use multiple processes instead of one single huge process.
  Explicit state machines are easier to optimize.
  Try to have a clear idea of which signals in your code are registered and which ones are combinational.
Verification
Goal: compare the results of your circuit with the Reference Model results.
For debugging purposes, use smaller input data, e.g. 8x8.
For an 8x8 input, you can debug your code without automatic file comparison: it is reasonable to debug by exploring the waveforms.
Once you are confident of the functionality for small inputs, use the real 256x256 data to produce the results for edge and direction.
Verification
To compare the real results with the Reference Model results, choose one of the following:
  Modify the testbench, or write a new one, so that it can read text files and compare them.
  Use whatever programming language you are comfortable with to read the files and do the comparison.
  Use Matlab, which is available on the SunEE lab machines, to read the files and compare them easily.
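A file comparison of this kind might look like the sketch below. The file format is an assumption (whitespace-separated integers, one value per pixel), and the names `read_values` and `compare_results` are hypothetical; adapt both to whatever your testbench and the Reference Model actually write.

```python
def read_values(path):
    """Read whitespace-separated integers from a text file (assumed format)."""
    with open(path) as handle:
        return [int(token) for token in handle.read().split()]


def compare_results(result_path, reference_path):
    """Return a list of (index, result, reference) for every mismatching value."""
    result = read_values(result_path)
    reference = read_values(reference_path)
    if len(result) != len(reference):
        raise ValueError(
            f"length mismatch: {len(result)} values vs {len(reference)} values"
        )
    return [
        (i, got, want)
        for i, (got, want) in enumerate(zip(result, reference))
        if got != want
    ]
```

An empty list means the two files agree. Otherwise the index of the first mismatch tells you which pixel to look at in the waveforms.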
Optimizing pipeline stages: Derivative
For example, assume that the Derivative stage calculates:
  P = (a1 + 2b1 + c1) - (a2 + 2b2 + c2)
Inputs are 8 bits and the output is 9 bits.
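As a quick software check of this formula, the sketch below evaluates P with plain integers. The function name `derivative` is hypothetical, and writing the multiplication by 2 as a left shift is an illustrative choice, not necessarily how the stage is built in the course design.

```python
def derivative(a1, b1, c1, a2, b2, c2):
    """Compute P = (a1 + 2*b1 + c1) - (a2 + 2*b2 + c2) for 8-bit inputs."""
    for value in (a1, b1, c1, a2, b2, c2):
        if not 0 <= value <= 255:
            raise ValueError("inputs are expected to be 8-bit values")
    # Multiplying by 2 is a one-bit left shift.
    return (a1 + (b1 << 1) + c1) - (a2 + (b2 << 1) + c2)


if __name__ == "__main__":
    # (10 + 40 + 30) - (5 + 10 + 5) = 80 - 20 = 60
    print(derivative(10, 20, 30, 5, 5, 5))
```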
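Optimizing pipeline stages:
[Figure: dataflow graph of the non-optimized Derivative stage, with inputs a1, b1, c1, a2, b2, c2, adder (+), multiplier (x) and subtractor (-) nodes, and output P.]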
Optimize pipeline stages:
If you just synthesize this stage without any optimization:
  Latency: 2
  Max frequency: 86.64 MHz
  Longest path: 11.54 ns
  Total Logic Elements: 114
  Total pins: 61
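If you optimize the stage:
[Figure: dataflow graph of the optimized Derivative stage, with inputs a1, b1, c1, a2, b2, c2, the constant 2, adder (+), multiplier (x) and subtractor (-) nodes, and output P.]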
Optimize pipeline stages:
                  optimized     non-optimized
Latency           4             2
Fmax              147.69 MHz    86.64 MHz
Longest path      6.77 ns       11.54 ns
Logic Elements
Total pins        45            61
A simple Adder example

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity adder is
  port(
    clock : in  std_logic;
    reset : in  std_logic;
    a     : in  unsigned(15 downto 0);
    b     : in  unsigned(15 downto 0);
    z     : out unsigned(16 downto 0)
  );
end adder;

architecture main of adder is
  signal tmp    : unsigned(16 downto 0);
  signal i1, i2 : unsigned(15 downto 0);
begin

  -- Register the inputs.
  process begin
    wait until rising_edge(clock);
    i1 <= a;
    i2 <= b;
  end process;

  -- Register the 17-bit sum, with synchronous reset.
  process begin
    wait until rising_edge(clock);
    if reset = '1' then
      tmp <= to_unsigned(0, 17);
    else
      tmp <= ('0' & i1) + ('0' & i2);
    end if;
  end process;

  z <= tmp;

end main;
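To see which signals in the adder are registered, and why its latency is two clock cycles, the VHDL can be mirrored by a small cycle-level model. This is only a sketch: the class name `AdderModel` is hypothetical, and the registers start at 0 here, whereas in VHDL simulation they would be uninitialized until written.

```python
class AdderModel:
    """Cycle-level model of the adder: input registers i1/i2, sum register tmp."""

    def __init__(self):
        # Assumption: registers start at 0 for this model.
        self.i1 = 0
        self.i2 = 0
        self.tmp = 0

    def clock(self, a, b, reset=0):
        """Apply one rising clock edge and return the new value of z."""
        # Every register is updated from the values held before the edge,
        # so tmp sees the old i1 and i2, not the new a and b.
        next_tmp = 0 if reset else (self.i1 + self.i2) & 0x1FFFF
        self.i1 = a & 0xFFFF
        self.i2 = b & 0xFFFF
        self.tmp = next_tmp
        return self.tmp  # z is driven directly by tmp


if __name__ == "__main__":
    model = AdderModel()
    print(model.clock(3, 4))    # inputs captured; z still shows the old sum
    print(model.clock(10, 20))  # z now shows 3 + 4
    print(model.clock(0, 0))    # z now shows 10 + 20
```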
A simple Adder example: .csf.rpt file
Chip name: | adder |
Device for compilation: | EP20K200EFC484-2X |
Total logic elements | 66 / 8320 ( < 1 % ) |
Total pins | 51 / 379 ( 13 % ) |
Total ESB bits | 0 / ( 0 % ) |
Info: Clock has Internal fmax of MHz between source register i21 and destination register N_z16 (period= ns)
MOD 3
You may need a circuit that keeps the pixels in the correct order while writing into the 3x3 table:
[Figure: a 16x16 memory holding image rows A through P (pixels 0 to 15 in each row), and a 3x16 memory holding rows A and B in full plus the first pixels C0, C1, C2 of row C; the remaining entries of that row are marked X. Caption: Do edge detection for B1.]
[Figure: the same 16x16 memory and 3x16 memory at a later time. The 3x16 memory holds rows B and C, and D0 has been written over A0. A column of the 3x3 table holds D0, C0, B0. Labels: Row, Mod.]
Reorder the pixels when writing to the 3x3 table.
Row count
[Figure: rows A, B, C, D, E, F, G, H, ... of the 256x256 memory mapped through mod 3 onto the 3x256 memory and then onto the 3x3 table. Labels: Row count in 256x256 memory, Row count in 3x256 memory, Row count in 3x3 table.]
Row count for E0 = 4 (E0 is the last pixel written to the 3x256 mem).
To fill the third row of the 3x3 table, find the location of E0 in the 3x256 mem: 4 mod 3 = 1, so E0 is in row 1 of the 3x256 mem.
Grab E0 and copy it to the 3rd row of the 3x3 table.
The 2nd row of the 3x3 table: D0 = (4-1) mod 3 = 0, so D0 is in row 0 of the 3x256 mem.
Take D0 and copy it to the 2nd row of the 3x3 table.
And so on:
  Row# of X in 3x3 = (row# of X in 3x256) mod 3
  Row# of X-1 in 3x3 = (row# of X-1 in 3x256) mod 3
  Row# of X-2 in 3x3 = (row# of X-2 in 3x256) mod …
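The E0/D0 walk-through can be checked with a few lines of code. This sketch assumes that image row r is stored in row r mod 3 of the 3x256 memory and that row counts start at 0 for row A; the function name `buffer_rows_for_table` is hypothetical.

```python
def buffer_rows_for_table(row_count):
    """Return the 3x256-memory rows feeding 3x3 table rows 1, 2, 3 (top to bottom).

    row_count is the image row of the pixel written most recently.
    """
    if row_count < 2:
        raise ValueError("three image rows are needed to fill the 3x3 table")
    return [(row_count - 2) % 3, (row_count - 1) % 3, row_count % 3]


if __name__ == "__main__":
    # Row E has row count 4: C is in memory row 2, D in row 0, E in row 1.
    print(buffer_rows_for_table(4))
```

For row count 4 this returns `[2, 0, 1]`, matching the example above: D0 comes from row 0 and E0 from row 1 of the 3x256 memory.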
How to implement mod 3
From number theory: d2d1d0 mod 3 is equal to (d2+d1+d0) mod 3.
Example: 136 mod 3 = (1+3+6) mod 3 = 10 mod 3 = (1+0) mod 3 = 1
Row count is an 8-bit number (0-255).
Design the mod 3 circuit for a 4-bit input.
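Mod 3
[Figure: mod 3 of an 8-bit number. The MSB 4 bits and the LSB 4 bits each go through a 4-bit Mod 3 block that produces 2 bits; the two results go to an Add block. Labels: MSB 4-bits, LSB 4-bits, Add, 2-bits, 4-bits, 2-bits.]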
Mod 3
Example: xE5 mod 3 = ?
[Figure: the value xE5 passed through the Mod and Add blocks.]
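The decimal digit-sum rule works because 10 mod 3 = 1. The same holds for 4-bit groups, because 16 mod 3 = 1, so an 8-bit value can be reduced by taking mod 3 of each 4-bit half, adding the results, and taking mod 3 once more. The sketch below follows that structure; the names `MOD3_TABLE`, `mod3_4bit` and `mod3_8bit` are hypothetical, and using a lookup table for the 4-bit block is just one possible implementation.

```python
# 16-entry table: mod 3 of every 4-bit value (one way to build the 4-bit block).
MOD3_TABLE = [n % 3 for n in range(16)]


def mod3_4bit(nibble):
    """Mod 3 of a 4-bit value, returned as a 2-bit value (0, 1 or 2)."""
    return MOD3_TABLE[nibble]


def mod3_8bit(value):
    """Mod 3 of an 8-bit value built from two 4-bit mod 3 blocks and an adder."""
    high = mod3_4bit(value >> 4)    # MSB 4 bits
    low = mod3_4bit(value & 0xF)    # LSB 4 bits
    # high + low is at most 4, so the 4-bit block can be reused for the sum.
    return mod3_4bit(high + low)


if __name__ == "__main__":
    # 0xE5: E = 14 gives 2, 5 gives 2, and (2 + 2) mod 3 = 1.
    print(mod3_8bit(0xE5))
```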
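Mod 3
[Figure: the 8-bit image row count feeds the Mod 3 block; its 2-bit result drives a 2-to-4 decoder that generates the write enables (we) for mem0, mem1 and mem2. Other labels: i_valid, FF.]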
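Alternative to mod 3
[Figure: a counter that counts up to 2 replaces the Mod 3 block; its 2-bit value drives the 2-to-4 decoder that generates the write enables (we) for mem0, mem1 and mem2. Other labels: i_valid, FF.]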
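The counter alternative can be modeled to confirm that it selects the same memory as the mod 3 circuit. This is a sketch under assumptions: the class name `RowSelect` is hypothetical, the counter is assumed to advance once per completed image row, and `i_valid` is assumed to gate the write enables.

```python
class RowSelect:
    """Counter that counts 0, 1, 2, 0, ... and selects one of mem0, mem1, mem2."""

    def __init__(self):
        self.count = 0

    def write_enables(self, i_valid):
        """Return one-hot write enables (mem0, mem1, mem2) for the current row."""
        # Decoder on the 2-bit count, gated by i_valid (assumed gating).
        return tuple(int(bool(i_valid) and self.count == k) for k in range(3))

    def end_of_row(self):
        """Advance the counter after a full image row has been written."""
        self.count = 0 if self.count == 2 else self.count + 1


if __name__ == "__main__":
    select = RowSelect()
    for row in range(5):
        print(row, select.write_enables(1))
        select.end_of_row()
```

Because the counter wraps after 2, its value always equals the image row count mod 3, without needing the 8-bit mod 3 logic.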