Reconfigurable Computing - Verifying Circuit Performance!
John Morris
Chung-Ang University
The University of Auckland
[Photo: ‘Iolanthe II’ in a good breeze on the Bay of Islands]
Measuring Circuit Performance
  Don’t believe the simulators!
    Although experience has shown that their predictions can be reasonably accurate …
    The potential for gross error is very large
      A large number of small values must be summed
      Large statistical errors are possible
  Professional engineers always check
    That’s what makes them professional!
  Scientists always want to be able to repeat an experiment
    That’s a basic principle of the scientific method
    Don’t accept anything as fact unless you can repeat it!
  Whatever your background or reason …
    Measurement on an actual device is needed
    You can use the simulator’s numbers for guidance though!
Measuring Circuit Performance
  Use the simulator’s results as a guide
  But what do they tell you?
    The simulator calculates propagation delays from inputs to outputs along various circuit paths
    It tries to identify the longest (in time) path for you
  In a simple combinatorial block that’s fine, eg
    a one-stage (no registers) adder
      It should identify the carry chain in a ripple carry adder, or its equivalent in a more complex adder
    a single-stage parallel array multiplier
      Again, in all types of multipliers, there’s a carry chain that limits performance
  In a pipelined circuit, you want the longest path between two clocked flip-flops
    In principle, easy for the simulator to find!
    In practice, you may need to spend more time checking that it selected the right path!
Measuring Circuit Performance
  Checking the simulator’s predictions
    Do a sanity check!
    Using the manufacturer’s published propagation delays for individual circuit elements, estimate the path delay yourself
      Count the number of logic blocks needed for the computation
      Will additional multiplexers be needed for steering or selection logic?
      Are I/O buffers needed?
        These typically have a considerable delay (relative to other circuit elements)
Measuring Circuit Performance
  Using the manufacturer’s published propagation delays for individual circuit elements, estimate the path delay yourself …
    You can use the synthesizer to help you here
      Its count of the total number of logic blocks will be 100% accurate
      From this, you infer the number of logic blocks in a path
        eg For a 32-bit adder, you can obviously start by dividing the total number of logic blocks by 32
      Then try to estimate how many logic blocks are needed for overheads, eg
        Multiplexers needed in a carry select adder
    For FPGAs, remember …
Measuring Circuit Performance
  Using the manufacturer’s published propagation delays for individual circuit elements, estimate the path delay yourself
  For FPGAs, remember …
    1. Look up tables (LUTs) are usually used for boolean logic
       This means that
         Using Xilinx’s 9-input CLBs
           y <= a AND b
           probably takes about the same time as
           y <= a AND b AND c AND d AND …   (up to 9 inputs)
           Beyond 9 inputs, add a considerable delay to connect to a neighbouring CLB
         Using Altera’s 4-input logic elements
           y <= a AND b
           probably takes about the same time as
           y <= a AND b AND c AND d   (up to 4 inputs)
           Beyond 4 inputs, add a small delay to use the fast cascade chain logic
Measuring Circuit Performance
  Using the manufacturer’s published propagation delays for individual circuit elements, estimate the path delay yourself
  For FPGAs, remember …
    2. Paths between logic blocks may have large numbers of transmission gates on them!
       As noted before, there’s a considerable advantage in keeping critical logic on one logic block
       But
         Altera’s cascade chains attempt to mitigate the penalty for not fitting critical logic into a single logic element
         And all manufacturers now provide fast adder carry chains!
       This makes estimation of path delays difficult
       Nevertheless, you should make a rough estimate!!
Measuring Circuit Performance
  Estimate the path delay yourself
    If your estimate matches that from the synthesizer, then you’re in good shape
    ‘Matches’ here can be interpreted liberally
      If the synthesizer reports 50ns and you calculate 30ns, then this is a reasonable match
      You probably didn’t count enough transmission gates, etc, on the connections between logic blocks!
    You don’t need to do a very precise calculation
      The synthesizer has done that for you!
      Your aim is to ensure that you are reading the correct number from the synthesizer’s report!
    With a reasonable match (say within 50%, either way), believe the synthesizer and continue …
    With a serious mismatch
      1. Read the synthesizer’s report more carefully
         You may be looking at the wrong figure!
      2. Check your estimate more carefully
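The sanity check can be written as a rough formula. This is only a sketch: the symbols N_LB, t_LB, N_conn, t_conn, N_IO and t_IO (counts and per-element delays for logic blocks, inter-block connections and I/O buffers) are labels introduced here, and the real values must come from your device's data sheet. The second line applies the 'within 50%' test to the 50ns / 30ns example above.

```latex
t_{\mathrm{est}} \approx N_{LB}\, t_{LB} + N_{conn}\, t_{conn} + N_{IO}\, t_{IO}

\frac{\lvert t_{\mathrm{syn}} - t_{\mathrm{est}} \rvert}{t_{\mathrm{syn}}}
  = \frac{\lvert 50 - 30 \rvert}{50} = 0.4 \le 0.5
  \quad\Rightarrow\quad \text{reasonable match}
```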
Now we believe we know how fast the circuit is …
  What does this speed mean in practice?
    You have a longest delay of x ns
    A synchronous (clocked) circuit can run at 1/x GHz? Almost!
  Don’t forget to allow for
    1. Propagation delay in the registers
    2. Temperature
       Circuits run slower at high T
       Make sure that your estimate of t_pd is a good one for the highest temperature your circuit will need to withstand
       Don’t think that this will be low!
         Try touching a modern high performance processor! (Make sure you have some burn cream nearby!)
         or simply work out that all those fans hiding that chip aren’t there for decoration!
    3. Chip-to-chip variations in fabrication …
  32-bit adder: inputs a, b, c
    Naïve approach: test all possibilities
      a: 4 × 10^9 (all possible 32-bit numbers)
      b: 4 × 10^9 (ditto)
      c: 2 (0 or 1)
      Total: 4 × 4 × 2 × … = 1.6 × …
    1 GHz machine: 10^9 cases/sec (optimistic!)
      1.6 × … seconds: about 6 months will do it!
    What about the rest of the machine? -, x, /, ^, v, >, …
      We should be finished in about 5 years
      Hmmmm … our 4 GHz machine should be about 30 GHz now!
    Clearly we need to be more efficient about testing!
Now we believe we know how fast the circuit is …
  What does this speed mean in practice?
    You have a longest delay of x ns
    A synchronous (clocked) circuit can run at 1/x GHz? Almost!
  Don’t forget to allow for
    1. Propagation delay in the registers
       More on pipelines later!
    2. Temperature
    3. Chip-to-chip variations in fabrication
       The gates will only be nominally 0.18 μm!
       Some may actually be 0.15 μm and others 0.25 μm …
  A maximum clock frequency of 1/(x + δ) GHz
    δ, the allowance for these effects, may be quite large!
  Now you’re ready to design an experiment to verify that the circuit does actually run as predicted!
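As a worked illustration of the derating, write δ for the combined allowance (register delay, temperature, fabrication spread), with x and δ both in ns so that the reciprocal comes out in GHz. The split of δ into three terms and the numbers x = 5 and δ = 1 are assumptions chosen only to show the arithmetic, not measured values.

```latex
f_{\max} = \frac{1}{x + \delta}\ \mathrm{GHz},
\qquad \delta = \delta_{\mathrm{reg}} + \delta_{T} + \delta_{\mathrm{fab}}

x = 5,\ \delta = 1:\quad
f_{\max} = \frac{1}{5 + 1}\ \mathrm{GHz} \approx 167\ \mathrm{MHz}
\quad \text{(rather than } \tfrac{1}{5}\ \mathrm{GHz} = 200\ \mathrm{MHz}\text{)}
```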
A word of warning!
  Experimental design!
    If you don’t make an estimate of what you expect to measure before starting, you will waste a lot of time doing the experiment!
    Working out the expected delay time is formally equivalent to setting out a hypothesis for the experiment
      The simulator says the delay will be x ns, so I hypothesise (predict) that we will measure a delay of about x ns
    This (simple) hypothesis guides your experimental design and set up!
    For example, assume you have a 150MHz oscilloscope available …
Experimental hypothesis
  The simulator says the delay will be x ns, so I hypothesise (predict) that we will measure a delay of about x ns
  This (simple) hypothesis guides your experimental design and set up!
  For example, assume you have a 150MHz oscilloscope available
    You try to make measurements of the delay, but are surprised to find that there appears to be no delay at all!
    Somebody then remembers to go back and read the synthesis report ..
    which tells you to expect a 5ns delay, one that will be difficult to measure on a slow ’scope!
Experimental Hypothesis
  The simulator says the delay will be x ns, so I hypothesise (predict) that we will measure a delay of about x ns
  This (simple) hypothesis guides your experimental design and set up!
  You now know that you have to design your experiment differently, eg
    1. Build a wider adder
       So that the delay is long enough to measure easily
    2. Work out how to measure n repeats of the calculation
       So that 5n > 20ns (or some time that you can be certain to measure accurately!)
    3. Devise an entirely new technique
       Which doesn’t require direct measurement of such a small delay
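A rough calculation shows why a 5ns delay is awkward on a 150MHz oscilloscope and how many repeats option 2 needs. The 0.35/bandwidth rise-time relation is a common rule of thumb for oscilloscopes and is not taken from the slides; treat the result as an order-of-magnitude guide only.

```latex
t_{r} \approx \frac{0.35}{BW} = \frac{0.35}{150\ \mathrm{MHz}} \approx 2.3\ \mathrm{ns}
\quad \text{(about half of the 5 ns delay to be measured)}

5n > 20\ \mathrm{ns} \;\Rightarrow\; n > 4 \;\Rightarrow\; n \ge 5\ \text{repeats}
```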
Measuring the delay
  Usual strategy
    Design a test bench that will drive the component you are testing: the Component Under Test (CUT)
      This test bench will be different from the one used to verify the component’s correctness!
    First task: What will the test bench do?
      What is the worst case input(s)?
        ie the ones that will take the longest time to produce a result!
      For an adder, there are several possibilities: ?
    Set up the test bench to produce one of these inputs
      Make a small state machine
Measuring the delay
  You’ve identified a suitable worst case …
    Set up the test bench to produce these inputs
      Make a small state machine
        2 states may be adequate:
          State 1: Clear the outputs
          State 2: Apply the test case
    How will you know that the worst case has completed computation?
      In the case of an adder, it’s easy
      For other circuits, you may need to add some ‘probe’ circuitry
        For example, if the worst case is when TWO outputs go high
        Add an AND gate to your driver and route the output of this gate to an external pin
    Set up your scope to measure the delay from the start of the clock cycle to the output signalling completion
Measuring the delay
  You’ve identified a suitable worst case …
    Set up the test bench to produce these inputs
      Make a small state machine
        2 states may be adequate:
          State 1: Clear the outputs
          State 2: Apply the test case

    PROCESS( clk )
    BEGIN
      IF clk'EVENT AND clk = '1' THEN
        CASE state IS
          WHEN state1 =>          -- State 1: clear the outputs (zero inputs assumed)
            a <= zero;
            b <= zero;
            start <= '0';
            state <= state2;
          WHEN state2 =>          -- State 2: apply the test case, 1 + (-1),
            a <= one;             -- so the carry ripples through every bit
            b <= minus_one;
            start <= '1';
            state <= state1;
        END CASE;
      END IF;
    END PROCESS;

    Set up your scope to measure the delay from the start of the clock cycle (or the start signal) to the output signalling completion (carry out for an adder)
Measuring a delay
  Second strategy
    Use the FPGA to do everything!
      Drive a fast counter with the fastest clock available
      Stop the counter when the operation is complete
    Requires a fast input clock
      The resolution of this clock determines the timing accuracy
      Can use the FPGA’s PLLs to multiply the clock

    PROCESS( clk )
    BEGIN
      IF clk'EVENT AND clk = '1' THEN
        CASE state IS
          WHEN state1 =>          -- clear the counter and the adder inputs
            counter <= zero;
            a <= zero;
            b <= zero;
            IF c_out = '0' THEN
              state <= state2;
            END IF;
          WHEN state2 =>          -- apply the worst case and start counting
            a <= one;
            b <= minus_one;
            counter_enable <= '0';  -- as written, '0' lets the counter run (active-low)
            state <= state3;
          WHEN state3 =>          -- wait for the carry to arrive
            IF c_out = '1' THEN
              counter_enable <= '1';  -- stop the counter
              --- transfer counter output to
              --- LED display, etc
              state <= state1;
            END IF;
        END CASE;
      END IF;
    END PROCESS;
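The counter value converts to a time as sketched below. N is the final count and f_clk the counting clock; the 50 MHz board clock and the x4 PLL setting are hypothetical figures used only to show how the clock period limits the resolution.

```latex
t_{\mathrm{meas}} \approx N \cdot T_{clk}, \qquad T_{clk} = \frac{1}{f_{clk}},
\qquad \text{uncertainty} \sim \pm T_{clk}

f_{clk} = 4 \times 50\ \mathrm{MHz} = 200\ \mathrm{MHz}
\;\Rightarrow\; T_{clk} = 5\ \mathrm{ns}
```

With these figures a 5ns delay is only about one count, so this strategy suits delays that span many counting-clock periods.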
Measuring Delay
  Strategy three
    If the circuit is clocked too fast, it won’t operate correctly
      ie it won’t complete the computation before the next clock edge arrives
      eg in the adder example, the next clock edge arrives before the carry has rippled through to carry out, so it never becomes ‘1’
    Set up the test circuit as in the first case, but gradually increase the clock speed until carry_out never becomes ‘1’
    Use a secondary clock derived from a (fast) master clock, whose frequency is gradually increased until the circuit stops operating correctly (ie never produces a completion signal)
      Count pulses of the master clock with a loadable counter
      The secondary clock is derived from the counter completion signal
      Reduce the loaded count value to reduce the secondary clock cycle time (reciprocal scale)
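One way the loadable counter could be arranged is sketched below. It is a minimal illustration, not the lecture's circuit: the entity name, the port names (master_clk, load_value, sec_clk) and the 8-bit default width are all hypothetical. The counter reloads each time it runs down, and the reload pulse serves as the secondary clock.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

-- Hypothetical entity: all names and the counter width are illustrative.
ENTITY secondary_clock IS
  GENERIC ( width : POSITIVE := 8 );
  PORT (
    master_clk : IN  STD_LOGIC;                      -- fast master clock
    load_value : IN  UNSIGNED( width-1 DOWNTO 0 );   -- master cycles per secondary cycle (use >= 2)
    sec_clk    : OUT STD_LOGIC                       -- one master-cycle-wide pulse per secondary cycle
  );
END ENTITY secondary_clock;

ARCHITECTURE rtl OF secondary_clock IS
  SIGNAL count : UNSIGNED( width-1 DOWNTO 0 ) := (OTHERS => '0');
BEGIN
  PROCESS( master_clk )
  BEGIN
    IF master_clk'EVENT AND master_clk = '1' THEN
      IF count <= 1 THEN
        -- counter has run down: reload it and emit the completion pulse
        count   <= load_value;
        sec_clk <= '1';
      ELSE
        count   <= count - 1;
        sec_clk <= '0';
      END IF;
    END IF;
  END PROCESS;
END ARCHITECTURE rtl;
```

In this sketch the secondary period is load_value master-clock periods, so each reduction of load_value by one shortens the cycle by one master period. The frequency steps therefore get coarser as the count gets small, which is the reciprocal scale noted above.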
Measuring the delay
  Timing accuracy
    Don’t forget the I/O buffer delay!
      A signal from either your driver or the CUT has to go through a pin, which implies passing through an I/O buffer
    Don’t
      Put one probe on the clock input and one on an output
      Your result will be increased by the I/O buffer delay!
    Instead
      Feed the clock back off the FPGA through an I/O buffer
      Now both the clock and the completion signal will be delayed by one I/O buffer
    Still some possibility of error
      The two I/O buffers will not have exactly the same delay
      But this error is likely to be of the same magnitude as other unavoidable errors, so …
    In general, make sure that both your ‘start’ and ‘stop’ signals have very similar delays
Measuring the delay
  Timing accuracy
    Don’t forget the I/O buffer delay!
    In general, make sure that both your ‘start’ and ‘stop’ signals have very similar delays
      If you had to combine your ‘stop’ signals in some logic, eg one logic block to AND two signals together
      Then pass the start signal through an artificial similar delay
        Your completion signal is the AND of two outputs
          complete <= a_out AND b_out
        So your start signal should be similarly delayed
          start_delayed <= start AND start
        ‘start’ will often be the clock driving your test circuit
    Be careful: Some compilers are clever enough to realize that this is a ‘do nothing’ piece of logic and will remove it!!
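The matched start and stop paths might be written as below. This is a sketch under assumptions: the entity and pin names are hypothetical, and the 'keep' attribute is a vendor-specific synthesis hint (the string form shown is the style used by Xilinx tools; other tools use different names or types). It may preserve only the net and not the logic block, so check the fitted netlist to confirm that the dummy AND really survived.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical probe wrapper: names are illustrative only.
ENTITY matched_probe IS
  PORT (
    start     : IN  STD_LOGIC;   -- often the clock driving the test circuit
    a_out     : IN  STD_LOGIC;   -- the two CUT outputs that signal completion
    b_out     : IN  STD_LOGIC;
    start_pin : OUT STD_LOGIC;   -- scope probe 1
    stop_pin  : OUT STD_LOGIC    -- scope probe 2
  );
END ENTITY matched_probe;

ARCHITECTURE rtl OF matched_probe IS
  SIGNAL complete, start_delayed : STD_LOGIC;
  -- Assumption: tool-specific hint asking synthesis not to optimise the signal away.
  ATTRIBUTE keep : STRING;
  ATTRIBUTE keep OF start_delayed : SIGNAL IS "true";
BEGIN
  complete      <= a_out AND b_out;   -- real completion logic: one logic block
  start_delayed <= start AND start;   -- 'do nothing' logic intended to add a similar delay
  start_pin     <= start_delayed;     -- both signals then leave through one I/O buffer each
  stop_pin      <= complete;
END ARCHITECTURE rtl;
```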
[Figure: the CUT inside the FPGA, with inputs a, b, c_in, outputs sum, c_out, and the clock]