Mar. 2009 Wu Jinyuan, Fermilab 1 FPGA: From Flashing LED to Reconfigurable Computing Wu, Jinyuan Fermilab IIT Mar, 2009.

Slide 2: Outline. Electronic Aspect of FPGA: LED Flashing; Logic Elements in a Nutshell; TDC and ADC. FPGA as a Computing Fabric: Moore's Law Forever?; Space Charge Computing with FPGA Cores; Doublet Matching & Hash Sorter; Triplet Matching & Tiny Triplet Finder; Enclosed Loop Micro-Sequencer (ELMS).

Slide 3: Flashing LED, The First Thing First. (Diagram: counter Q[23..0].) Design at least one LED into any FPGA board. When a board is first powered up, test the LED flashing function first. Many things have to be right for the LED to flash: all power pins must be connected; the configuration devices must be in the correct mode; the design software must work correctly.

Slide 4: LED Brightness Variation. (Diagram: counter Q[23..0] and a comparator with inputs A and B and output A<B; a second version adds a LUT at input A.) The LED brightness is varied by changing the duty cycle of the output pulse. Comparator input A is the brightness and B is the clock cycle count. A look-up table can be added at input A to obtain a different brightness variation curve.

Slide 5: Duty-Cycle Based Single-Pin DAC (1). (Diagram: counter Q at comparator input B, DAC input at A, output A>B.) The duty cycle, or pulse width, of the comparator output is proportional to the DAC input at port A. An external RC network serves as the low-pass filter. The output voltage of an ideal low-pass filter is proportional to the DAC input.
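
The proportionality between DAC input and duty cycle can be checked with a minimal Python sketch. It assumes an 8-bit free-running counter; the width and the name `pwm_duty` are illustrative choices, not taken from the slides.

```python
def pwm_duty(dac_input: int, counter_bits: int = 8) -> float:
    """Duty cycle of the comparator output A > B over one counter period.

    dac_input plays the role of port A; the free-running counter value is B.
    The 8-bit default width is an assumption made for this sketch.
    """
    period = 1 << counter_bits
    # Count the clock cycles in one period during which the output is high.
    high_cycles = sum(1 for count in range(period) if dac_input > count)
    return high_cycles / period


if __name__ == "__main__":
    for a in (0, 64, 128, 255):
        print(a, pwm_duty(a))
```

With this comparator polarity the output is high for exactly `dac_input` cycles out of 256, so an ideal low-pass filter would see an average proportional to the input.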

Slide 6: LED Brightness Exponential Drop. (Diagram: counter Q, comparator A<B, and an accumulator register Q with carry-out CO and a SET input, updated as: if (CO==1) {Q = Q - Q/32;}) Narrow pulses are typically stretched for LED display with fixed brightness. The circuit here instead dims the LED gradually for a better visual effect.

Slide 7: Exponential Sequence Generator. (Diagram: accumulator register Q with SET input, updated as: if (CO==1) {Q = Q - Q/32;}) An exponential sequence is generated using the accumulator shown above. Note that not even one multiplier is used. Other function sequences (sine, cosine, tangent, cotangent, etc.) can be generated similarly. Possible student lab.
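
The update rule Q = Q - Q/32 can be tried out in a few lines of Python. This is only a sketch: the division is modelled as a right shift by 5 bits (an assumption about how the hardware would implement it), and the function name and starting value are illustrative.

```python
def exponential_sequence(q0: int, steps: int, shift: int = 5) -> list[int]:
    """Iterate Q = Q - Q/32 using only a shift and a subtraction."""
    q = q0
    sequence = [q]
    for _ in range(steps):
        q -= q >> shift  # Q/32 is a 5-bit right shift; no multiplier needed
        sequence.append(q)
    return sequence


if __name__ == "__main__":
    print(exponential_sequence(1024, 5))
```

Each step scales Q by roughly 31/32, which gives the exponential decay. With integer truncation the sequence stops changing once Q drops below 32, because Q >> 5 is then zero.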

Slide 8: Duty-Cycle Based Single-Pin DAC (2). (Diagram: accumulator with input D = DAC input, register Q and carry-out CO.) The carry-out of the accumulator is used as the output. The number of pulses is proportional to the DAC input. The rounding error is carried to later cycles, so the output is smoother. Possible student lab.
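
A short Python sketch of the accumulator scheme, assuming an 8-bit accumulator (the width and the name `carry_out_pulses` are illustrative, not from the slides). Every clock cycle the DAC input is added to the accumulator, and the carry-out bit is the output.

```python
def carry_out_pulses(dac_input: int, cycles: int, bits: int = 8) -> list[int]:
    """Carry-out stream of an accumulator that adds dac_input every cycle."""
    mask = (1 << bits) - 1
    acc = 0
    output = []
    for _ in range(cycles):
        acc += dac_input
        output.append(acc >> bits)  # carry-out: 1 when the sum overflows
        acc &= mask                 # the remainder carries to later cycles
    return output


if __name__ == "__main__":
    print(carry_out_pulses(64, 8))
```

Over 256 cycles the number of output pulses equals the DAC input, and the pulses are spread across the period rather than bunched into one block as in the comparator version.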

Slide 10: Logic Elements. LUT = Look-Up Table. Normal mode: LUT4 (16 RAM cells; inputs A, B, C, D) + DFF (D, Q, ENA, CLRN). Arithmetic mode: 2 x LUT3 (8 cells each; inputs A, B and carry-in CI, with carry-out CO) + DFF.

Slide 11: What Can Be Done With a Lookup Table: "any" 4-input function of A, B, C, D.
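
To illustrate why a 16-cell table covers any 4-input function, here is a small Python model of a LUT4. The bit ordering of the 16-bit contents (`init`) and the function name are assumptions of this sketch; real devices define their own ordering.

```python
def lut4(init: int, a: int, b: int, c: int, d: int) -> int:
    """Read one of 16 stored bits, addressed by the four inputs."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (init >> index) & 1


NAND4_INIT = 0x7FFF  # output is 0 only when all four inputs are 1
NOR4_INIT = 0x0001   # output is 1 only when all four inputs are 0

if __name__ == "__main__":
    print(lut4(NAND4_INIT, 1, 1, 1, 1), lut4(NOR4_INIT, 0, 0, 0, 0))
```

Changing the function only changes the 16 stored bits, so all 2^16 possible 4-input functions cost the same cell.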

Slide 12: Xilinx Look-Up Table. The same cell can serve as a 4-input look-up table (LUT4), a 16-bit shift register (SRL16) or a 16-bit distributed RAM (RAM16), followed by a DFF (D, Q, ENA, CLRN).

Slide 13: Pipeline Structure. (Diagram: chain of LUT4 (16 RAM cells) + DFF stages.) Logic cells are usually designed in pipeline structures.

Slide 14: Logic Element as a Full Adder Bit. (Diagram: logic elements in arithmetic mode, each with two LUT3s of 8 cells, inputs A, B and CI, output CO, and a DFF.) A logic cell resembles a full adder bit.
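
The resemblance can be made concrete with a Python sketch in which two 8-cell tables produce the sum and carry bits of one adder stage. The table contents 0x96 and 0xE8 follow from the index convention chosen here (A in bit 0, B in bit 1, CI in bit 2); that convention and the function names are assumptions of the sketch.

```python
SUM_LUT3 = 0x96    # stored bit i equals A xor B xor CI
CARRY_LUT3 = 0xE8  # stored bit i equals majority(A, B, CI)


def lut3(init: int, a: int, b: int, ci: int) -> int:
    """Read one of 8 stored bits, addressed by A, B and carry-in."""
    return (init >> (a | (b << 1) | (ci << 2))) & 1


def ripple_add(x: int, y: int, bits: int = 4) -> tuple[int, int]:
    """Add two numbers one bit per logic element; returns (sum, carry-out)."""
    carry = 0
    total = 0
    for i in range(bits):
        a = (x >> i) & 1
        b = (y >> i) & 1
        total |= lut3(SUM_LUT3, a, b, carry) << i
        carry = lut3(CARRY_LUT3, a, b, carry)  # feeds the next element's CI
    return total, carry


if __name__ == "__main__":
    print(ripple_add(5, 3), ripple_add(15, 1))
```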

Slide 15: Myths on FPGA. We commonly hear about FPGA: FPGA is cheap; FPGA is fast; FPGA is large; FPGA can do anything. Not really; at least it is not always the case. The reality is: FPGA is ultra-flexible; as the cost of that flexibility, the transistor usage in FPGA is NOT efficient. Good design tricks are needed.

Slide 16: 4-Input NAND, 4-Input NOR, 4-Input NAOR. In ASIC: each gate (inputs A, B, C, D; output Y) takes 8 transistors.

Slide 17: Transistor Usage of Logic Element. In FPGA: a 16-bit LUT built from 6-transistor RAM bits (6 x 16), plus a DFF (D, Q, ENA, CLRN): at least 96 transistors.

Slide 18: The Mirror Adder (Weste93). In ASIC. (Schematic: inputs A, B, Ci; outputs Cob, Sb; the transistor count on the slide was lost in extraction.)

Slide 19: Full Adder. In FPGA: two 8-bit LUTs plus a DFF form one full adder bit (inputs CI, A, B; outputs S, CO): at least 96 transistors.

Slide 20: Other FPGA Resources. Other resources are available in FPGA devices: RAM blocks; multipliers; serial data receivers, Power PC, etc. (Picture labels: Multipliers, RAM Blocks, 16 Logic Elements.)

Slide 22: TDC Using FPGA Logic Chain Delay. (Diagram: IN and CLK driving a logic delay chain.) This scheme uses current FPGA technology. A low-cost chip family can be used (e.g. EP2C8T144C6, $31.68). Fine TDC precision can be implemented in slow devices (e.g., 20 ps in a 400 MHz chip).

Slide 23: Two Major Issues in a Free Operating FPGA. 1. Bin widths differ from bin to bin and vary with supply voltage and temperature. 2. Some bins are ultra-wide due to LAB boundary crossing.

Slide 24: Auto Calibration Using Histogram Method. It provides a bin-by-bin calibration at a given temperature. It is a turn-key solution (bin in, ps out). It is semi-continuous (the LUT is updated automatically every 16K events). (Diagram: 16K events fill a DNL histogram, which builds the LUT; In (bin) -> LUT -> Out (ps).)
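
One way to turn a histogram into a bin-to-picosecond table is sketched below in Python. It assumes the calibration hits arrive uniformly in time over one clock period, so each bin's count is proportional to its width; assigning each bin its center time is also an assumption of this sketch, not something the slide specifies.

```python
def build_calibration_lut(histogram: list[int], clock_period_ps: float) -> list[float]:
    """Map each TDC bin to a time in ps from per-bin hit counts."""
    total = sum(histogram)
    lut = []
    edge = 0.0  # start time of the current bin
    for count in histogram:
        width = clock_period_ps * count / total  # wider bins collect more hits
        lut.append(edge + width / 2)             # bin center (assumed convention)
        edge += width
    return lut


if __name__ == "__main__":
    print(build_calibration_lut([1, 3], 400.0))
```

Rebuilding the table after every batch of events (16K in the slide) lets it follow slow drifts in bin widths.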

Slide 25: The Test Module. Two NIM inputs; FPGA with 8-channel TDC; data output via Ethernet; BNC adapter to add a 150 ps step.

Slide 26: Test Result. (Setup: NIM inputs -> LeCroy 429A NIM fan-out -> NIM/LVDS converters -> Wave Union TDC B; BNC adapters add a 140 ps step.) RMS 10 ps. As good as an ASIC TDC.

Slide 27: Multi-Sampling TDC. (Block diagram labels: clock phases c0, c90, c180, c270; Multiple Sampling (Q0 to Q3); Clock Domain Changing (QD, QE, QF); Transition Detection & Encode; Coarse Time Counter; outputs DV, T0, T1, TS.) Ultra low-cost: 48 channels in a $18.27 EP2C5Q208C7. Sampling rate: 360 MHz x 4 phases = 1.44 GHz. LSB = 0.69 ns. Logic elements with non-critical timing are freely placed by the fitter of the compiler. The picture shows a placement of 4 channels in a Cyclone FPGA.

Slide 28: ADC Using FPGA. (Diagram labels: eight AMP & Shaper channels; ADC; FPGA; TDC; R1, R2, C; V_REF; waveforms marking voltages V1 to V4 and times T1 to T4.) Analog signals from the AMP & Shapers are fed directly to FPGA pins. FPGA outputs and a passive RC network generate a ramping reference voltage V_REF. The input voltages and V_REF are compared using the FPGA differential input receivers. The transition times, which represent the input voltage values, are digitized by TDC blocks in the FPGA.
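
The conversion from a digitized crossing time back to a voltage depends on the shape of the V_REF ramp. The Python sketch below assumes the simplest case, a single RC stage charged from a constant drive voltage; the actual R1, R2, C network and its values are not given in the transcript, so every parameter here is hypothetical.

```python
import math


def vref_at(t_s: float, v_drive: float, r_ohm: float, c_farad: float) -> float:
    """Reference voltage of an ideal RC stage t_s seconds after the ramp starts.

    Assumes V_REF = V_drive * (1 - exp(-t / RC)), i.e. one resistor charging
    one capacitor; the real network on the board may differ.
    """
    return v_drive * (1.0 - math.exp(-t_s / (r_ohm * c_farad)))


if __name__ == "__main__":
    rc = 1000.0 * 100e-12                    # hypothetical 1 kOhm and 100 pF
    t_cross = rc * math.log(2.0)             # a crossing time as a TDC might report it
    print(vref_at(t_cross, 3.3, 1000.0, 100e-12))
```

The input voltage at the moment of crossing equals V_REF at that time, so the TDC reading plus the known ramp shape recovers the voltage.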

Slide 29: ADC Test: Waveform Digitization on BD3_19. (Figure panels: Raw Data; Input Waveform, Overlap Trigger & Reference Voltage; Converted. Schematic labels: FPGA, TDC, V_REF, with surviving component labels "pF" and "100".) Possible student lab.

Slide 31: Moore's Law. Number of transistors in a package: x2 per 18 months.

Slide 32: Status of Moore's Law: an Inconvenient Truth. Number of transistors: yes, via multi-core. Clock speed: ?

Slide 33: The Fever of Moore's Law vs. Maxwell's Equations. (Figure: Op/sec, MIT, 2002.) During the hot days of Moore's Law, the rules of thumb were: BRB, Buy Rather than Build; URU, Use Rather than Understand; WRW, Wait Rather than Work. From fundamental principles like Maxwell's Equations, it is known that limits to Moore's Law exist. Technology advances come from hard work.

Slide 34: The Execution & Non-Execution Cycles. In current microprocessors: each instruction takes one clock cycle to execute; it takes many clock cycles to prepare for executing an instruction; pipelined? Yes, but the non-execution pipeline stages consume silicon area, power, etc.; executing an instruction is not the same as doing useful calculation. Can we do something different? (Figure from MIT Open Course Site.)

Slide 36: The Space Charge Computing. Each electron sees the sum of Coulomb forces from the other N-1 electrons. The total number of calculations is about N^2, and each Coulomb force calculation requires a square root, a division and several multiplications. Regular sequential computers are not fast enough. (Table of number of electrons, number of calculations per iteration and computing time; the numeric values were mostly lost in extraction: 10^3 electrons, ~ s; 10^4, ~ hours; 10^5, ~ days; 10^6, ~ years.)
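
The N^2 scaling is easy to see in a direct software version. The Python sketch below sums pairwise Coulomb forces in normalized units (charge and the Coulomb constant set to 1, an assumption of the sketch) and counts the force evaluations; it is a plain sequential reference, not the FPGA core.

```python
import math


def coulomb_forces(positions: list[tuple[float, float, float]]):
    """Return (forces, evaluations) for like charges in normalized units."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    evaluations = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]  # multiplications
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))           # square root, division
            for k in range(3):
                forces[i][k] += d[k] * inv_r3             # repulsive: along d
            evaluations += 1
    return forces, evaluations


if __name__ == "__main__":
    print(coulomb_forces([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]))
```

The evaluation count is N(N-1), so ten times more electrons cost about a hundred times more work per iteration.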

Slide 37: The FPGA Board. Up to 16 FPGA devices ($32 each) can be installed on each board. Each FPGA hosts one core.

Slide 38: The 16-bit Demo Core.

Slide 39: The Lookup Table. (Diagram labels: three x^2 blocks, an adder (+), and a LUT with 10 bits in, 16 bits out.)

Slide 40: Two Electrons with Natural Scales. 256 nm, 28 ps.

Slides 41-49: Charged Particles, Iterations 0, 5, 10, 15, 20, 25, 30, 35 and 40 (one plot per slide).

Slide 50: Speed Comparison with Regular CPU. The FPGA core is x10 faster than a typical 2.2 GHz CPU core. The FPGA core runs at 200 MHz, i.e. 200 M Coulomb force calculations/s. It seems the CPU core needs many clock cycles for each Coulomb force calculation (the number on the slide was lost; the figures above imply roughly 2.2 GHz / 20 M calculations/s, or about 110 cycles).

Slide 51: One Board: 8 FPGA Cores. One core/FPGA = 5 dual-core CPUs; 8 cores/board = 40 dual-core CPUs. One board has the calculation capacity of 40 dual-core CPUs. The power consumption of one board is < 4.5 W. Newer FPGAs capable of hosting 4 cores/FPGA are available.

Slide 53: Example of Doublet Match, PET. Positrons and electrons annihilate to produce pairs of photons. The back-to-back photons hit the detector at nearly the same time. Detector hits are digitized, and hits at nearly the same time are to be matched together. The process takes O(n^2) clock cycles. (Diagram: each Group 1 entry (T, D) is compared with each Group 2 entry (T, D) by subtracting the times and testing ΔT < A? and ΔT > (-A)?)

Slide 54: Hash Sorter. Pass 1: data in Group 1 are stored in the hash sorter bins based on key number K. Pass 2: data in Group 2 are fetched through and paired up with the corresponding Group 1 data having the same key number K.

Slide 55: Link List Structure of Hash Sorter. (Diagram labels: DIN, DOUT, key K, Index RAM, Pointer RAM, DATA RAM.)

Slide 56: Hash Sorter. Using the hash sorter, matching pairs can be grouped together in 2n, rather than n^2, clock cycles.
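
The two passes can be mimicked in software with a dictionary of bins. This Python sketch pairs items whose keys are exactly equal; the `(key, data)` tuple layout and the function name are illustrative, and a PET time-window match would additionally need a key derived from the hit time, which is not shown here.

```python
from collections import defaultdict


def hash_sorter_match(group1, group2):
    """Pair Group 2 items with Group 1 items that share the same key K."""
    bins = defaultdict(list)
    # Pass 1: store Group 1 data in bins addressed by key number K.
    for key, data in group1:
        bins[key].append(data)
    # Pass 2: fetch each Group 2 item and read out the bin with its key.
    pairs = []
    for key, data in group2:
        for stored in bins.get(key, []):
            pairs.append((key, stored, data))
    return pairs


if __name__ == "__main__":
    g1 = [(3, "a"), (7, "b"), (3, "c")]
    g2 = [(7, "x"), (3, "y"), (5, "z")]
    print(hash_sorter_match(g1, g2))
```

Each item is touched once per pass, which is where the 2n figure comes from, as opposed to comparing every Group 1 item with every Group 2 item.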

Slide 58: Hits, Hit Data & Triplets. Hit data come out of the detector planes in random order. Hit data from 3 planes generated by the same particle tracks are organized together to form triplets.

Slide 59: Triplet Finding. (Diagram: Plane A, Plane B, Plane C.) Three data items must satisfy the condition x_A + x_C = 2 x_B. A total of n^3 combinations must be checked (e.g. 5x5x5 = 125). In software the process takes three nested loops. Without careful planning, a large silicon resource may be needed: O(N^2).

Slide 60: Tiny Triplet Finder Operations, Pass I: Filling Bit Arrays. (Diagram: physical planes and their bit arrays/shifters.) For any hit, fill the corresponding logic cell. Note the flipped bit order: since x_A + x_C = 2 x_B, x_A = -x_C + constant.

Slide 61: Tiny Triplet Finder Operations, Pass II: Making Match. (Diagram: physical planes and their bit arrays/shifters.) For any center-plane hit, logically shift the bit array and perform a bit-wise AND in this range. Wherever the AND leaves a bit set, a triplet is found.
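
The shift-and-AND idea of the two passes can be sketched in Python, using integers as bit arrays. The details are assumptions made for illustration: hits are integer bin numbers in 0..n_bins-1, plane C is stored in flipped bit order, and the loop that reads out set bits stands in for a hardware encoder.

```python
def tiny_triplet_finder(hits_a, hits_b, hits_c, n_bins):
    """Find (x_A, x_B, x_C) with x_A + x_C = 2 * x_B via shift and AND."""
    # Pass I: fill the bit arrays; plane C uses flipped bit order.
    array_a = 0
    for x in hits_a:
        array_a |= 1 << x
    array_c = 0
    for x in hits_c:
        array_c |= 1 << (n_bins - 1 - x)
    mask = (1 << n_bins) - 1

    # Pass II: for each center-plane hit, shift and AND.
    triplets = []
    for xb in hits_b:
        shift = n_bins - 1 - 2 * xb
        if shift >= 0:
            aligned = array_c >> shift
        else:
            aligned = (array_c << -shift) & mask
        # After the shift, bit x_A of `aligned` is set when x_C = 2*x_B - x_A.
        match = array_a & aligned
        for xa in range(n_bins):
            if (match >> xa) & 1:
                triplets.append((xa, xb, 2 * xb - xa))
    return triplets


if __name__ == "__main__":
    print(tiny_triplet_finder([1, 4], [3], [5, 2, 7], 8))
```

Because of the flipped order, one shift amount per center-plane hit lines up every valid A and C pair at once, so the AND tests all candidates in parallel.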

Slide 62: Tiny? Yes, Tiny! Logic cell usage: AM, CAM, Hough Transform, etc.: O(N^2); Tiny Triplet Finder: O(N*logN).

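Slide 63: Hit Matching: Software vs. FPGA. (Table columns: hit matching; software; typical FPGA; FPGA resource-saving approaches.) Doublet matching: software O(n^2), for(){ for(){…} }; typical FPGA O(n)*O(N), comparator array; resource-saving approach: Hash Sorter, O(n)*O(N) in RAM. Triplet matching: software O(n^3), three nested for loops; typical FPGA O(n)*O(N^2), CAM, Hough transform; resource-saving approach: Tiny Triplet Finder, O(n)*O(N*logN). Higher-order matching: software O(n^4), four nested for loops; no FPGA entries are given.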

Slide 64: The Winning Line of FPGA Computing. We commonly hear: FPGA devices contain millions of gates; high parallelism can be implemented in FPGA; FPGA cost drops by half every 18 months. We want to emphasize, especially to our young students: 1. Creativity, 2. Creativity, 3. Creativity, on Arithmetic ops, on Algorithms, on Architectures & on All Aspects. O Freunde, nicht diese Töne! ("Oh friends, not these tones!")

Slide 66: The End. Thanks.

Slide 67: Micro-computing vs. Reconfigurable Computing. In a microprocessor, the users specify a program on fixed logic circuits. In an FPGA, the users specify the logic circuits (as well as the program). FPGA computing need not follow microprocessor architectures (but useful experience can be borrowed). The usefulness of FPGA reconfigurable computing is still to be fully appreciated. (Example: ( )*5+7 = ?; data: 100, 3, 4, 5, 7; control: LD, (-), (+), (*), (+). Diagram: the CPU takes data and program; the FPGA takes data, program and configuration.)

Slide 68: FPGA Process Sequencing Options. (Table columns: program type; program length in CLK cycles; reprogram; resource usage.) Finite State Machine (FSM): fixed wired; 10; hard; small. Enclosed Loop Micro-Sequencer (ELMS): memory stored program; (length lost in extraction); easy; small. Microprocessor (MP): memory stored program; >1000; easy; large.

Slide 69: The Between Counter. (Diagram: a counter with SLOAD, D[], SCLR and Q[], parameters N and M-1, and a comparator (==) on A[] and B[] producing T; the counter addresses a ROM holding instr0 at PC0 through instrD at PCD, which outputs the control signals. Example count sequences: 0,1,2,...,9,A; then 5,6,...,A repeated; then 5,6,...,A,B,C,D,E,F…)

Slide 70: ELMS, Enclosed Loop Micro-Sequencer. (Block diagram: Program Counter; ROM, 128 x 36 bits; Conditional Branch Logic; Loop & Return Logic + Stack; inputs Reset and CLK; output Control Signals.) PC + ROM is a good sequencer in FPGA. Adding Conditional Branch Logic allows the program to loop back, i.e. to jump back as in microprocessors. Loop & Return Logic + Stack is a special feature of ELMS that supports FOR loops at machine code level. Example program (columns: PC, control signals, operation): LD R1, #n; LD R2, #addr_a; LD R3, #addr_X; LD R7, #0; BckA1: LD R4, (R2); INC R2; LD R5, (R3); INC R3; MUL R6, R4, R5; 0a EndA1: ADD R7, R7, R6; 0b DEC R1; 0c BRNZ BckA1.

Slide 71: ELMS, Detailed Block Diagram. (Output: user control signals.) Example program: FOR BckA1 EndA1 #n; LD R2, #addr_a; LD R3, #addr_X; LD R7, #0; BckA1: LD R4, (R2); INC R2; LD R5, (R3); INC R3; MUL R6, R4, R5; EndA1: ADD R7, R7, R6; LD R8, R7. The stack supports nested loops and subroutine calls up to 128 layers.

Slide 72: Software: Using a Spreadsheet as Compiler.

Slide 73: What's Good About ELMS: FOR Loops at Machine Code Level with Zero Overhead. The looping sequence is known in this example before entering the loop. A regular microprocessor treats the sequence as unknown. ELMS supports FOR loops with a pre-defined number of iterations at machine code level. Execution time is saved, and the micro-complexities (branch penalty, pipeline bubble, etc.) associated with conditional branches are avoided. Microprocessor, with conditional branch: LD R1, #n; LD R2, #addr_a; LD R3, #addr_X; LD R7, #0; BckA1: LD R4, (R2); INC R2; LD R5, (R3); INC R3; MUL R6, R4, R5; EndA1: ADD R7, R7, R6; DEC R1; BRNZ BckA1. The ELMS: FOR BckA1 EndA1 #n; LD R2, #addr_a; LD R3, #addr_X; LD R7, #0; BckA1: LD R4, (R2); INC R2; LD R5, (R3); INC R3; MUL R6, R4, R5; EndA1: ADD R7, R7, R6. (Slide label: 25%, which matches the two loop-control instructions, DEC and BRNZ, out of the eight in the microprocessor loop.)
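
The saving can be tallied with a small Python sketch that counts instruction slots for the two listings above. It assumes one clock cycle per instruction and ignores branch penalties, which would only widen the gap; the function names and the split into setup, body and loop-control counts are illustrative.

```python
def microprocessor_cycles(n: int, setup: int = 4, body: int = 6, loop_control: int = 2) -> int:
    """Cycles for the conditional-branch version: DEC and BRNZ run every pass."""
    return setup + n * (body + loop_control)


def elms_cycles(n: int, setup: int = 4, body: int = 6) -> int:
    """Cycles for the FOR version: only the six body instructions repeat."""
    return setup + n * body


if __name__ == "__main__":
    n = 100
    mp, elms = microprocessor_cycles(n), elms_cycles(n)
    print(mp, elms, (mp - elms) / mp)
```

Per iteration the microprocessor loop spends 2 of its 8 instructions on loop control, which is the 25% that the FOR instruction removes.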

Slide 74: ELMS as a Hardware Loop Sequencer. (Block diagram: Program Counter; ROM, 128 x 36 bits; Conditional Branch Logic; Loop & Return Logic + Stack; inputs Reset and CLK; output Control Signals.) There are DSP devices that support hardware loops for zero-overhead loop implementation. The emphasis of ELMS is that FOR loops and subroutine calls/returns are treated the same. Any program passage can be used as a subroutine without needing a return instruction. The ELMS uses as few resources as possible for FPGA implementation.

Slide 75: No ALU => Small Resource Usage. (Diagrams: the von Neumann (Princeton) Architecture, with program control, ALU and one program/data memory; the Harvard Architecture, with program control, ALU, program memory and data memory; the Fermilab (?) Architecture, with sequencer (ELMS), program memory, data processor and data memory.) The Princeton Architecture is more suitable at system level, while the Harvard Architecture is better suited at micro-structure level. Regular microprocessors cannot run looped programs without an ALU. The ALU takes a large amount of resources yet may not be efficiently utilized for data processing tasks in FPGA. The ELMS can run nested-loop programs without an ALU. Further separation of program and data is therefore possible. The ELMS is kept small.

Slide 76: The Frequency Spectrum of DAC (2). (Diagram: accumulator with input D = DAC input, register Q and carry-out CO.) The first harmonic may be suppressed. Works better with regular low-pass filters. Possible student lab.

Slide 77: The Frequency Spectrum of DAC (1). (Diagram: counter Q at comparator input B, DAC input at A, output A>B.) The first harmonic is dominant. Works better with a notch filter.

Slide 78: Digital Calibration Using Twice-Recording Method. (Diagram: IN and CLK driving a delay line.) Use a longer delay line. Some signals may be registered twice, at two consecutive clock edges, giving N_2 - N_1 = (1/f)/Δt, where 1/f is the clock period and Δt is the average bin width. The two measurements can be used to calibrate the delay and to reduce digitization errors.
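
Solving the relation above for the average bin width gives Δt = (1/f)/(N_2 - N_1). The Python sketch below applies it; converting the first reading to a time as N_1 times Δt is an assumption of this sketch, and the function names and example numbers are hypothetical.

```python
def average_bin_width_ps(n1: int, n2: int, clock_period_ps: float) -> float:
    """Average bin width from two recordings of the same signal one clock apart."""
    if n2 <= n1:
        raise ValueError("the second recording must lie further along the delay line")
    return clock_period_ps / (n2 - n1)


def time_from_bin_ps(n1: int, n2: int, clock_period_ps: float) -> float:
    """Scale the first bin number by the measured average bin width (assumed convention)."""
    return n1 * average_bin_width_ps(n1, n2, clock_period_ps)


if __name__ == "__main__":
    print(average_bin_width_ps(10, 60, 2500.0), time_from_bin_ps(10, 60, 2500.0))
```

Because the bin width is measured with every doubly-registered hit, the conversion tracks changes in delay speed without a separate calibration run.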

Slide 79: Digital Calibration Result. (Plot traces: 1st TDC, 2nd TDC, corrected time.) The power supply voltage changes from 2.5 V to 1.8 V (about the same as 100 °C to 0 °C). The delay speed changes by 30%. The difference of the two TDC numbers reflects the delay speed. Warning: the calibration is based on the average bin width, not bin-by-bin widths.

Slide 80: Indirect Cost of Complexity. If something like this can do the job… why do these?

Slide 81: Tiny Triplet Finder: Reuse Coincident Logic via Shifting Hit Patterns. (Diagram: C1, C2, C3.) One set of coincident logic is implemented. For an arbitrary hit on C3, rotate, i.e., shift, the hit patterns for C1 and C2 to search for a coincidence.

Slide 82: Tiny Triplet Finder for Circular Tracks. (Block diagram labels: *R1/R3, *R2/R3, two Bit Array Shifters, Bit-wise Coincident Logic, Triplet Map Output To Decoder.) 1. Fill the C1 and C2 bit arrays (n1 clock cycles). 2. Loop over C3 hits, shift the bit arrays and check for coincidence (n3 clock cycles). Also works with more than 3 layers.