Future evolution of the Fast TracKer (FTK) processing unit
C. Gentsos, Aristotle University of Thessaloniki
FTK 324318 FP7-PEOPLE-2012-IAPP
FTK executive board, 21/7/2014
Presentation Overview
- Key Fast TracKer (FTK) components
- Goal of my work
- FPGA firmware
- FPGA device utilization and power
- System latency and processing speed
- Progress
Project goals: integration of many FTK functions in a small form factor
- The main goal is to integrate the FTK system in a more compact form.
- The first step will be to connect an AMChip, an FPGA and a RAM on a prototype board.
- In the future the devices could be merged into a single package (AMSiP).
- That AMSiP will be the building block of the new processing unit, to be assembled on an ATCA board with new mezzanines.
Project goals: flexibility
- The main target will be the ATLAS detector L1 trigger (for the Phase-II upgrade), but for this development phase we will keep the FTK detector layout, i.e. 5 silicon strip detector layers and 3 pixel detector layers.
- The latency requirements are very demanding, leaving just 8 μs for the L1 tracking.
- The final architecture should be flexible, and the resulting system easy to reprogram to target other applications (to be studied in the IAPP project), such as:
  - machine vision for embedded systems: low-power edge detection, object detection
  - medical imaging applications
  - a coprocessor for various High Performance Computing applications
FPGA firmware - overview
- Full-resolution hits are stored in a smart DB, while the SuperStrip ID of each hit is sent to the AM.
- The AM performs pattern recognition and returns the ID value of each matched road (RoadID).
- The RoadIDs are decoded to SuperStrip IDs for each layer, using an external RAM.
- The database retrieves all hits for each SuperStrip ID of the detected roads.
- The combiner unit computes all possible permutations of the hits to form track candidates.
- A very fast full-resolution fit is done for each possible track; fits are accepted or rejected according to the χ² value.
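As a rough software model of this read-path dataflow (a minimal sketch only: the stub functions, type names and the χ² cut parameter are hypothetical, not the firmware interface):

```cpp
// Toy C++ model of the read path described above; the stubs stand in
// for the external-RAM road decoder, the Data Organizer and the fitter.
#include <array>
#include <cstdint>
#include <vector>

constexpr int kLayers = 8;                 // 5 strip + 3 pixel layers
using SuperStripID = uint32_t;
using RoadID       = uint32_t;
struct Hit   { float coord; };             // full-resolution hit (simplified)
struct Track { std::array<float, 5> helix; float chi2; };

// Stubs for the real blocks (hypothetical signatures).
std::array<SuperStripID, kLayers> decode_road(RoadID) { return {}; }
std::vector<Hit> read_superstrip(SuperStripID) { return {}; }
std::vector<Track> fit_all_combinations(
    const std::array<std::vector<Hit>, kLayers>&) { return {}; }

std::vector<Track> process_roads(const std::vector<RoadID>& roads,
                                 float chi2_cut) {
    std::vector<Track> accepted;
    for (RoadID road : roads) {
        auto ssids = decode_road(road);             // external RAM lookup
        std::array<std::vector<Hit>, kLayers> hits;
        for (int l = 0; l < kLayers; ++l)           // smart-DB retrieval
            hits[l] = read_superstrip(ssids[l]);
        for (const Track& t : fit_all_combinations(hits))  // combiner + fit
            if (t.chi2 < chi2_cut) accepted.push_back(t);  // chi2 cut
    }
    return accepted;
}
```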
Data Organizer
- The hits are stored sequentially in the hitmem, regardless of the order in which they arrive.
- Hits are cloned into two dual-port memories, so reading uses 4 memory ports.
- Apart from a maximum number of hits per event, there are no other restrictions on the input.
- The nextmem holds, for each hit address, the location of the next hit with the same SuperStrip ID.
- The HLP keeps track of the address of the first hit of each SuperStrip ID.
(Diagram: Data Organizer write and read paths)
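Put together, hitmem, nextmem and the HLP form one linked list of hits per SuperStrip. A minimal sketch of this structure before the improvements below (the lastmem tail pointer is the one eliminated later; the firmware of course uses BRAM ports, not std:: containers):

```cpp
// Toy model of the Data Organizer's linked-list hit storage.
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kNil = 0xFFFFFFFFu;  // "no hit" marker

struct DataOrganizer {
    std::vector<uint32_t> hitmem;   // hits, stored in arrival order
    std::vector<uint32_t> nextmem;  // nextmem[a] = next hit with the same SSID
    std::vector<uint32_t> hlp;      // hlp[ssid] = first hit of that SuperStrip
    std::vector<uint32_t> lastmem;  // lastmem[ssid] = tail, needed for appending

    explicit DataOrganizer(std::size_t n_ssids)
        : hlp(n_ssids, kNil), lastmem(n_ssids, kNil) {}

    // Write path: store sequentially, chain onto the SuperStrip's list.
    void write(uint32_t ssid, uint32_t hit) {
        uint32_t addr = hitmem.size();
        hitmem.push_back(hit);
        nextmem.push_back(kNil);
        if (hlp[ssid] == kNil) hlp[ssid] = addr;  // first hit of this SSID
        else nextmem[lastmem[ssid]] = addr;       // append after the tail
        lastmem[ssid] = addr;
    }

    // Read path: walk the chain to collect all hits of one SuperStrip.
    std::vector<uint32_t> read(uint32_t ssid) const {
        std::vector<uint32_t> hits;
        for (uint32_t a = hlp[ssid]; a != kNil; a = nextmem[a])
            hits.push_back(hitmem[a]);
        return hits;
    }
};
```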
Data Organizer – latest improvements
- The HLP width is increased to 320 bits, giving access to ranges of 32 memory locations.
- In this way we can check for data on groups of 8 SuperStrips in parallel, as needed for the DC (don't care) bits.
- At the same time, the BRAM organization is more compact, requiring 10% fewer resources.
Data Organizer – latest improvements
- The HLP function also changes: it now keeps the location of the last hit in the hitmem, eliminating the need for a lastmem.
- The freed-up BRAMs are eventually put to use to make the reading rate data-independent.
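One way the lastmem can be eliminated, continuing the toy model above (an assumption about the exact scheme: the HLP tracks the newest hit and nextmem chains backwards, so appends need no tail pointer and hits read out newest-first; this write() replaces the one in the earlier sketch, with the lastmem member dropped):

```cpp
// Variant without lastmem: hlp[ssid] holds the LAST hit written, and
// nextmem links each hit back to the previous one of the same SSID.
void write(uint32_t ssid, uint32_t hit) {
    uint32_t addr = hitmem.size();
    hitmem.push_back(hit);
    nextmem.push_back(hlp[ssid]);  // back-link to the previous hit (or kNil)
    hlp[ssid] = addr;              // HLP now points at the newest hit
}
// read() is unchanged: it still walks the chain from hlp[ssid] via nextmem.
```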
Track Fitting
Track Fitting
- The track helix parameters and χ² can be extracted from linear equations in the local silicon hit coordinates:
  p_i = \sum_{j=1}^{11} a_{ij} x_j + b_i
- The p_i are the helix parameters and the χ² components; the x_j are the hit coordinates in the silicon layers (5 SCT + 3 pixel × 2 = 11 coordinates); the a_{ij} and b_i are stored constants derived from full simulation or from real data tracks.
- The resolution of the linear fit is close to that of the full helical fit within a narrow region (sector) of the detector; a sector consists of a single silicon module in each detector layer.
- Geometrically, the fit maps the 11-dimensional coordinate space onto a 5-dimensional surface.
- Using FPGA DSPs, very good performance can be achieved.
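A minimal sketch of this linear fit as software (the constants passed as arguments stand in for the stored sector constants, and the 5 + 6 split between helix parameters and χ² components is inferred from the 11-to-5 mapping above):

```cpp
// Linear track fit: a constant matrix-vector product gives the helix
// parameters plus chi2 from the 11 local hit coordinates.
#include <array>

constexpr int kCoords = 11;       // 5 SCT + 3 pixel x 2
constexpr int kHelix  = 5;        // helix parameters
constexpr int kRows   = kCoords;  // 5 helix rows + 6 chi2 rows

struct FitResult { std::array<float, kHelix> helix; float chi2; };

FitResult linear_fit(const std::array<float, kCoords>& x,
                     const float a[kRows][kCoords],  // stored constants
                     const float b[kRows]) {         // stored constants
    FitResult r{};
    for (int i = 0; i < kRows; ++i) {
        float p = b[i];
        for (int j = 0; j < kCoords; ++j)  // on the FPGA all 11 multiplies
            p += a[i][j] * x[j];           // run in parallel on DSP blocks
        if (i < kHelix) r.helix[i] = p;    // p_1..p_5: helix parameters
        else            r.chi2 += p * p;   // remaining rows: chi2 components
    }
    return r;
}
```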
Track Fitting - Combiner
Visualization of the combiner function on an example road (see diagram): every combination of hits in the road is fitted, the best fit is selected and the others are discarded.
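A sketch of the combiner as a Cartesian product over layers (assuming, for simplicity, at least one hit in every layer of the road; the real firmware works on hit streams, not in-memory vectors):

```cpp
// Combiner: enumerate every hit combination (one hit per layer) of a road.
#include <array>
#include <vector>

constexpr int kLayers = 8;
using Hit = float;  // stand-in for a full-resolution hit

std::vector<std::array<Hit, kLayers>>
combine(const std::array<std::vector<Hit>, kLayers>& road_hits) {
    std::vector<std::array<Hit, kLayers>> combos(1);  // one empty combination
    for (int l = 0; l < kLayers; ++l) {
        std::vector<std::array<Hit, kLayers>> extended;
        for (const auto& c : combos)
            for (const Hit& h : road_hits[l]) {
                auto e = c;
                e[l] = h;  // fix this layer's hit
                extended.push_back(e);
            }
        combos = std::move(extended);
    }
    return combos;  // each entry feeds one linear fit; best chi2 is kept
}
```

For the sample event considered later in the deck (4 layers with 1 hit/road and 4 layers with 2 hits/road) this yields 2^4 = 16 combinations per road.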
Track Fitting: Scalar Product Calculation Pipeline
- A very fast FPGA implementation was developed for the fitter.
- All multiplications are executed in parallel, giving 1 fit per clock cycle.
- Using dedicated DSP resources, the fitter runs at 550 MHz.
- 4 such fitters run in parallel in the device, for 2200 Mfits/s in total.
FPGA implementation
- The main components of the design have already been implemented.
- Placement on the device has been done to estimate the truly achievable clock rates and the power dissipation.
- The target device is a 28 nm Xilinx 7-series part, XC7K325T-900ffg-3.
- The clock frequencies we are targeting are close to the actual limits of the device.
- To achieve such clock rates, many coding guidelines and advanced design techniques must be followed.
FPGA implementation (device floorplanning view)
(Floorplan: Track Fitter instances and the local clock routing for the Track Fitters)
Device utilization & power
- The power estimate of 15.5 W is an absolute worst-case figure.
- Simple power optimization moves and migration to a newer 20 nm device family are expected to reduce this by 30% or more.
Speed and latency
(Block diagram with interface and clock speeds: 100 MHz input bus, 50 MHz output bus, blocks running at 550 MHz and 450 MHz, 800 MHz DDR external memory)
Speed and latency
(Diagram annotations, rates per second: 225 MHits per layer, 50 MRoads, 2200 Mfits, 450-1800 MHits per layer; external memory: 576 Mbit RLDRAM3, 57.6 Gb/s, 10 ns tRC)
The total speed will be data-dependent; further system simulation is needed for precision.
Speed and minimum latency
(Diagram: per-stage latencies of ~40 ns, ~10 ns, ~50 ns and ~70 ns from the last incoming hit to the first output track; together ~0.17 μs)
Minimum system latency (from last hit to first computed parameters): <0.3 μs
There are ideas to further improve performance, if needed.
Sample event processing time
- In a typical event there are 500 hits/layer; the AM is assumed to produce 50 roads; 4 layers are assumed to have 1 hit/road and the other 4 to have 2 hits/road each, i.e. 2^4 = 16 fits per road, 800 fits in total.
Event processing time with the current AMChip (100 MHz input):
- 5 μs (SSIDs to AM)
- 1 μs (roads from AM), which is more than the 0.36 μs of processing time for all the roads, so the fitting hides under the road readout
- 0.17 μs (latency of RAM, DO, Combiner, TF)
- Total: 6.17 μs, less than the 8 μs considered to be the limit for L1 tracking.
Event processing time after a reasonable AMChip upgrade (200 MHz input):
- 2.5 μs (SSIDs to AM)
- 0.25 μs (roads from AM), now less than the 0.36 μs of processing time for all the roads, which therefore dominates
- 0.17 μs (latency of RAM, DO, Combiner, TF)
- Total: ~3 μs, less than half the L1 limit, leaving lots of headroom for bigger events.
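A back-of-envelope check of these numbers (a sketch under stated assumptions: the 2.2 Gfit/s throughput is the 4 × 550 MHz fitter figure quoted earlier, the overlap of fitting with road readout is modeled by taking the larger of the two times, and the upgraded AM's 200 MHz road output rate is inferred from the quoted 0.25 μs):

```cpp
// Reproduce the event-time estimate above for a given pair of AM clocks.
#include <algorithm>
#include <cstdio>

double event_time_us(double ssid_hz, double road_hz) {
    double t_ssid  = 500.0 / ssid_hz;  // 500 hits/layer into the AM
    double t_roads = 50.0 / road_hz;   // 50 roads out of the AM
    double t_fit   = 800.0 / 2.2e9;    // 800 fits at 2.2 Gfit/s
    double t_lat   = 0.17e-6;          // RAM + DO + Combiner + TF latency
    // Fitting overlaps road readout, so only the longer of the two counts.
    return (t_ssid + std::max(t_roads, t_fit) + t_lat) * 1e6;
}

int main() {
    std::printf("current AMChip : %.2f us\n", event_time_us(100e6, 50e6));   // 6.17
    std::printf("upgraded AMChip: %.2f us\n", event_time_us(200e6, 200e6));  // 3.03
}
```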
Progress
- The Data Organizer and Track Fitter are already implemented in the FPGA device.
- Once the latest improvements to the Data Organizer are finished, the combiner and the external memory interface will be implemented.
- The speed and latency figures of the system are promising.
- After design completion, testing on the prototype board will follow.
Thank you!
Backup
Backup slides; many more to be added.
FPGA implementation
The Data Organizers are more spread out in the device.
FPGA implementation
A shortlist of design techniques necessary for such a high-speed implementation:
- Pipelining of control signals and memory buses
- Careful fan-out control for many signals
- Manual device resource instantiation
- Manual floorplanning of key components
- Utilization of dedicated routing wherever possible
- Local clock buffers for the 550 MHz areas
FPGA firmware
(Diagram: more detailed overview of the design, showing the serial transceivers and more details of the DO)
DO function
Latency, processing speed
- Writing 1k hits to the DO takes 4.5 μs.
- Forwarding 100 roads' worth of hits to the Track Fitter takes another ~0.5-1.5 μs (data-dependent).
- Because of its high operating frequency, the TF itself does not add any noticeable latency; its maximum processing speed would be close to 2.4 Gfits/s if it were utilized 100% of the event time.
- There are ideas on how to improve all of these numbers if the need arises.
- Simulations will be done to show whether the current performance is enough.
Latency, processing speed
- There are ideas on how to improve all of these numbers if the need arises.
- Simply migrating to an UltraScale device can increase performance, by exploiting even more parallelism and higher clock-rate capabilities.
- Simulations still have to show whether the current performance is enough, but the figures look good for now.