SAMPA and CRU simulations and further ideas Johan Alme – 4th February
Read-out Simulations
The presented work is done by Damian K Wejnerowski and Håvard Rustad Olsen (M.Sc. students, Bergen University College)
›Supervisors: Håvard Helstrup, Johan Alme
The idea: Make a simulation model for the complete read-out chain in ALICE TPC Run3
›Based on TDR and ongoing development
Range:
›From digital part of SAMPA chip
›To output from CRU (DDL3?)
Tool: SystemC
Johan Alme – CRU planning meeting 4th February 2015
The Run3 Simulation Model – Basic Ideas
Simulate the data throughput from SAMPA to CRU output
Software model = Hardware component
›However: Not detailed design of interfaces
Easily configurable
›Easy to add more functionality
Customizable input
›Changing data source should be easy
›Current data source: Random generator giving configurable occupancy
Note: Little/no variation from channel to channel
Customizable output
[Figure annotation: "Data generator inputs data here"; figure source: …tion/3/material/slides/0.pdf]
Simulation Model
Data Generator:
›Randomly generated in space/time with selectable occupancy, fed to the digital part of the SAMPA – no clusters
SAMPA:
›Digital part with all buffers in place
GBTx w/GBT protocol:
›Only forwards data – assumes no extra delay
CRU:
›Model assumes no sorting on FEC
Note:
›More FECs can be easily added
›More CRUs can be easily added
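A minimal sketch of such a data generator, written in plain Python rather than SystemC. It only illustrates the idea of uncorrelated samples at a selectable occupancy; the function names, the fixed seed and the choice of 1000 time bins are assumptions for illustration, not taken from the actual model.

```python
import random

def generate_samples(n_channels, n_timebins, occupancy, seed=0):
    """Return hits[channel] = list of time bins that carry a sample.

    Every (channel, time bin) cell is filled independently with
    probability `occupancy`, so no clusters are produced.
    """
    rng = random.Random(seed)  # fixed seed: reproducible runs (assumption)
    hits = []
    for _ in range(n_channels):
        hits.append([t for t in range(n_timebins) if rng.random() < occupancy])
    return hits

def measured_occupancy(hits, n_timebins):
    """Fraction of all (channel, time bin) cells that hold a sample."""
    total = sum(len(channel_hits) for channel_hits in hits)
    return total / (len(hits) * n_timebins)

# 1 FEC = 5 SAMPA x 32 channels = 160 channels; 1000 time bins is an assumed window length
hits = generate_samples(n_channels=160, n_timebins=1000, occupancy=0.30)
print(round(measured_occupancy(hits, 1000), 3))
```

Because every cell is drawn independently, the measured occupancy converges to the requested one, with little variation from channel to channel.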
SAMPA simulations
Setup
›1 FEC (5 SAMPA)
›Expected buffer sizes: Data: 4k * 10bit / 8k * 10bit; Header: 256 * 10bit / 1k * 10bit
›Randomly generated samples depend only on occupancy
Scenarios
›1. Increasing occupancy with simple fluctuation: 8 time windows / 4k * 10bit buffers
›2. Globally distributed fluctuation: 26 time windows / 4k * 10bit buffers
›3. When do the buffers stabilize? 8k * 10bit buffers / 30 time windows
1. Increasing Occupancy
Peak: 30% - 369 * 10bit; 60% - … * 10bit; 90% - … * 10bit
[Plots: spot check of 1 channel (SAMPA 0; Channel 0); average over all channels (after 100 us, 200 us, 500 us and 800 us); annotation: buffer overflow]
Should have been run for a longer time period, but enough to see the trends and verify the model
2. Some variation on input
[Plot: input pattern used (occupancy per 100 us)]
[Plot: result]
3. Will the buffer usage stabilize?
›Neither test stabilized after a period of 30 x 100 us
›50 % peaked at 3615 * 10bit
›70 % peaked at 9126 * 10bit
›This is as expected, and acts as a check of the simulation model
CRU Model
Simple Model:
›One FIFO per channel: 1920/2560 FIFOs (12/16 FECs)
›Assumes unsorted input data in time and geometry!
›Does: geometrical sorting, time sorting
Setup for simulation
›Samples generated by random generator, for each channel and for each time window
›30% occupancy = 30% probability
›On average 340 samples per 100 us
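The FIFO counts above follow from one FIFO per SAMPA channel. A small sketch of that bookkeeping, assuming 5 SAMPAs per FEC and 32 channels per SAMPA as used elsewhere in these slides; the class `UnsortedCru` and its methods are hypothetical names for illustration, not the SystemC model itself.

```python
from collections import deque

CHANNELS_PER_SAMPA = 32
SAMPAS_PER_FEC = 5

def fifo_count(n_fecs):
    """One FIFO per channel: FECs x SAMPAs per FEC x channels per SAMPA."""
    return n_fecs * SAMPAS_PER_FEC * CHANNELS_PER_SAMPA

class UnsortedCru:
    """Toy CRU buffer stage: unsorted input, one FIFO per channel."""

    def __init__(self, n_fecs):
        self.fifos = [deque() for _ in range(fifo_count(n_fecs))]

    def push(self, fec, sampa, channel, timebin, adc):
        # Geometrical sorting: the flat channel index selects the FIFO
        idx = (fec * SAMPAS_PER_FEC + sampa) * CHANNELS_PER_SAMPA + channel
        self.fifos[idx].append((timebin, adc))

    def total_words(self):
        """Total number of samples currently buffered in all FIFOs."""
        return sum(len(fifo) for fifo in self.fifos)

print(fifo_count(12), fifo_count(16))  # 1920 2560
```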
CRU – Buffer Estimation
These plots show the total memory usage on the CRU under these two conditions
Further work on Simulation
Include a more realistic data source:
›real data/black events
›zero-suppressed/Huffman-encoded
Include a better model for CRU
CRU Mockup test – is it possible to build 2560 FIFOs in one FPGA?
Assumes no sorting on FEC level
Target device: Xilinx Virtex UltraScale XCVU095 - largest available
Structure:
›2600 x 18 Kb FIFOs
›Some «stupid» logic
Took ~40 hours to build (on a weak laptop)
Successful build!
Could not inspect the design afterwards due to limited memory on the laptop
We push the limit of memory resources even on state-of-the-art FPGAs…
Conclusion:
›If some level of sorting can be done at the FEC level it would be better
How can sorting on FEC level be done?
1. The pads and the FECs are distributed such that full pad rows belong to one CRU
›Figure shows new IROC padplane: 63 pad rows, 5440 pads (- 64 pads), 34 FEC (2 partitions, 3 sectors), dxmin = 8.7 mm, dxmax = 13.5 mm
›For rp1, where we have 20 FECs:
We can split the data from the 40 GBT links into 2 CRUs (upper half of FEC to one CRU, lower half to 2nd CRU)
Use one CRU with less than 40 GBT inputs
How can sorting on FEC level be done?
2. Design the FECs such that the pads are connected to the inputs of the SAMPAs in a logical order. This might not give the easiest physical routing of signals…
The routing from pad plane to transfer points may not be symmetrical for all FECs, so a FEC may become dependent on its geographic position.
›The routing from pad plane to transfer points (Kapton connectors!) may not be the same for all FECs
›A FEC whose routing logically follows the pad rows can end up with a geographic constraint
›Can there be an intermediate connector? (with fully characterized parasitics: R, L, C, high/low frequency, etc.)
Idea of Attiq ur Rehman
How can sorting on FEC level be done?
3. By adding a programmable sorting matrix on the SAMPA
This will need a few memory resources for the programmable routing table
Will give us full freedom to match pads/pad row no matter which partition
[Figure: Channel 1, Channel 2, Channel 3, Channel 4 … Channel 32 into Sorting Matrix, out on 4 x e-links]
Idea of Attiq ur Rehman
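One way to picture the programmable sorting matrix is as a routing table that reorders the 32 channel samples of one time bin before they go out on the e-links. The sketch below is illustrative only: the helper names, the reversed pad wiring and the even split of 8 slots per e-link are assumptions, not part of the SAMPA design.

```python
N_CHANNELS = 32
N_ELINKS = 4

def make_routing_table(pad_of_channel):
    """Programmable table: output slot -> input channel, ordered by pad index."""
    return sorted(range(len(pad_of_channel)), key=lambda ch: pad_of_channel[ch])

def apply_sorting_matrix(samples, routing_table):
    """Reorder one time bin of channel samples into pad order."""
    return [samples[ch] for ch in routing_table]

def split_over_elinks(sorted_samples, n_elinks=N_ELINKS):
    """Assumed distribution: consecutive output slots share an e-link."""
    per_link = len(sorted_samples) // n_elinks
    return [sorted_samples[i * per_link:(i + 1) * per_link] for i in range(n_elinks)]

# Hypothetical wiring: channel ch is connected to pad 31 - ch
pad_of_channel = [N_CHANNELS - 1 - ch for ch in range(N_CHANNELS)]
routing_table = make_routing_table(pad_of_channel)

samples = [100 + ch for ch in range(N_CHANNELS)]  # sample value encodes its channel
in_pad_order = apply_sorting_matrix(samples, routing_table)
elinks = split_over_elinks(in_pad_order)
```

The routing table is the only state that has to be stored, which matches the remark that a few memory resources are needed.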
What does this mean for the CRU?
All data on the GBT link will arrive sorted, but time-multiplexed.
We decide the readout order so that all data for one pad row comes first, then the next, etc.
I.e. we would need the same number of FIFOs as there are pads in a row.
›150 FIFOs
The depth of the FIFO depends on the time-frame. In order to do parallel cluster finding you need the data from all pads to be present.
›So if you multiplex M channels, each with a time-frame of N, you need to store N x M 10-bit words.
Idea of Torsten Alt/Attiq ur Rehman
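The N x M storage requirement can be written out as a short calculation. The values M = 32 and N = 1000, 250, 100 below are example inputs chosen for illustration, not design figures.

```python
WORD_BITS = 10  # 10-bit samples

def buffer_words(m_channels, n_frame):
    """Words to hold one full time-frame for all multiplexed channels."""
    return n_frame * m_channels

def buffer_bits(m_channels, n_frame):
    """Same requirement expressed in bits."""
    return buffer_words(m_channels, n_frame) * WORD_BITS

for n in (1000, 250, 100):
    print(n, buffer_words(32, n), buffer_bits(32, n))
```

A shorter time-frame N reduces the required storage proportionally.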
What does this mean for the CRU?
The search for clusters is done on all pads of a pad row in parallel in a Search Matrix
Some key numbers from Torsten Alt's calculations (I don't go through them in detail here):
›Number of FIFO buffers = number of pads in a row = 150
›Depth of FIFO: min 3 full events, 3K 10-bit words
One BRAM block in Xilinx is 36 kb, so one FIFO = 1 BRAM block, giving 150 BRAM blocks
›Min clock speed 230 MHz
›Min 7500 registers for the search matrix
›+ extra logic for the calculations
This has not been simulated, and the simulation models must be adjusted for this!
Idea of Torsten Alt
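The BRAM estimate can be checked with a few lines of arithmetic. This sketch counts raw bits only; it takes "3K" as 3 x 1024 words and a 36 kb block as 36 x 1024 bits, and it does not model the actual BRAM port-width configurations. All of these are assumptions for illustration.

```python
N_PADS_PER_ROW = 150          # one FIFO per pad in a row
FIFO_DEPTH_WORDS = 3 * 1024   # "3K" words, taken as 3 x 1024 (assumption)
WORD_BITS = 10
BRAM_BITS = 36 * 1024         # one 36 kb block, raw capacity (assumption)

fifo_bits = FIFO_DEPTH_WORDS * WORD_BITS
brams_per_fifo = -(-fifo_bits // BRAM_BITS)   # ceiling division
total_brams = N_PADS_PER_ROW * brams_per_fifo

print(fifo_bits, brams_per_fifo, total_brams)  # 30720 1 150
```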
Further ideas
The SAMPA needs to time-multiplex the data from the channels onto the GBT.
›However, how it does it should be up to us. Ideally we could make it configurable.
Instead of having a fixed time-frame size, this could be programmable.
›So we can decide if we want to have 1000 samples from Channel 0, then 1000 samples from Channel 1, etc.
›We could decide to just get 250. Or 100. Or whatever number.
›This would allow us to fine-tune the required buffers in the CRU by quite some degree.
›For the SAMPA not much would change. The data is sampled and then stored in an internal buffer/memory.
With minimal overhead we could solve the problem of "time-frame" borders.
If the continuous data stream is chopped up into time-frames, then we'll create artificial borders for the cluster finder.
›One easy solution would be that the time-frames overlap.
Idea of Torsten Alt
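The programmable readout order can be sketched as a generator: for each time-frame, send frame_size samples from channel 0, then from channel 1, and so on. The function name and the tiny example buffers are hypothetical and only illustrate the ordering.

```python
def multiplexed_readout(buffers, frame_size):
    """Yield (channel, timebin, value) in time-multiplexed order.

    buffers[ch] holds the stored samples of channel ch. Within one
    time-frame all samples of channel 0 are sent first, then channel 1, etc.
    """
    n_samples = min(len(buf) for buf in buffers)
    for start in range(0, n_samples, frame_size):
        stop = min(start + frame_size, n_samples)
        for ch, buf in enumerate(buffers):
            for t in range(start, stop):
                yield ch, t, buf[t]

# 3 channels with 6 samples each; the value encodes channel and time bin
buffers = [[10 * ch + t for t in range(6)] for ch in range(3)]
stream = list(multiplexed_readout(buffers, frame_size=2))
print(stream[:6])
```

Changing frame_size changes how much the receiver must buffer before all channels of a frame are present, without changing what the SAMPA stores.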
Further ideas cont…
Time-frames overlap.
›This will give a bit of overhead but not that much.
›Example: We choose a time-frame of 250 samples. The SAMPA sends out samples … for all pads, time-multiplexed. Instead of sending the samples … for the next time-frame, we send …; …% of the data is sent double.
›This allows the cluster finder to avoid any border issues. It would create double clusters, but they can be identified by software easily and filtered out. But we would have the overlap, so we wouldn't miss clusters.
This is easily realised in HW, and it would give an enormous flexibility at a very low price.
Idea of Torsten Alt
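A sketch of overlapping time-frames and the software filtering of double clusters. The overlap of 10 samples is an assumed value chosen only to make the example concrete; the function names and the (pad, time) cluster representation are likewise hypothetical.

```python
def overlapping_frames(n_samples, frame_size, overlap):
    """Return (start, stop) sample ranges; each frame repeats `overlap` samples."""
    if not 0 <= overlap < frame_size:
        raise ValueError("overlap must be in [0, frame_size)")
    step = frame_size - overlap
    frames = []
    start = 0
    while start < n_samples:
        frames.append((start, min(start + frame_size, n_samples)))
        if start + frame_size >= n_samples:
            break
        start += step
    return frames

def duplicated_fraction(frame_size, overlap):
    """Share of each transmitted frame that repeats the previous frame."""
    return overlap / frame_size

def merge_clusters(per_frame_clusters):
    """Drop clusters reported twice because they sit in an overlap region."""
    seen = set()
    merged = []
    for clusters in per_frame_clusters:
        for cluster in clusters:
            if cluster not in seen:
                seen.add(cluster)
                merged.append(cluster)
    return merged

frames = overlapping_frames(n_samples=1000, frame_size=250, overlap=10)
print(frames)
print(duplicated_fraction(250, 10))
```

In this toy version a cluster is identified by (pad, time); a cluster found in the overlap region appears in two frames and is kept once.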
Conclusion
Simulations have so far shown that the current scheme works given 30% occupancy and zero-suppressed data.
However, if no sorting is done on the FEC, the CRU implementation might be large and clumsy.
›Maybe impossible to realize in commercially available solutions
›i.e. a custom HW solution will be needed
Torsten Alt has invented an elegant scheme for buffering and 2D cluster finding that uses a minimum of resources on the CRU
›This relies on the data from the FECs being sorted correctly
›Following Attiq ur Rehman's scheme and by doing careful design this should be straightforward
›This solution does not add extra latency, and only adds very few resources.
However, this idea should be simulated to perform a sanity check of the estimations.
Answers to Questionnaire
FEC partitioning compatible with cluster finder?
›Ordering of pads inside SAMPA by programmable routing; the problem of sending incomplete pad rows to the 2 CRUs is still there, might be solvable on FEC level (under discussion)
Plans for the data compression in the SAMPA chip?
›Zero-suppression and (alternatively) Huffman coding
Number of CRUs needed if using the PCIe form factor?
›As described in the TDR. This means 3 partitions per OROC.
›Table 6.1 here: …
›Assuming 24 GBTs per CRU, the O2 counting is 324. (IROC1 = 2, IROC2 = 2, OROC1 = 2, OROC2 = 2, OROC3 = 1, Sum = 9, 9 x 36 = 324)
›Assuming 32 GBTs per CRU, the O2 counting is 288. (IROC1 = 1, IROC2 = 2, OROC1 = 2, OROC2 = 2, OROC3 = 1, Sum = 8, 8 x 36 = 288)
What is the size of the CRU internal buffer needed for each GBT?
›Worst case: ~18 kb x 2560 = 46 Mb
›With SAMPA pad ordering and with 2D cluster finder: 36 kb x 150 = 5.4 Mbit
Data size needed to represent a cluster with the upgraded TPC (pad, row, time, charge, etc.)?
›To represent a cluster for the current ROCs, 7 parameters and in total 77 bit are needed in an uncompressed format. For the upgrade that will eventually change if there are more pads per pad row and/or if a better precision is required.
›Compression factor w/cluster finder is minimum …; can be increased to about 3 by optimizing the design
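The CRU counts and buffer sizes quoted above can be reproduced with a short calculation. The per-partition numbers and the factor 36 come from the slide; the variable names, the reading of 36 as the number of TPC sectors, and the use of 1 Mb = 1000 kb are assumptions of this sketch.

```python
N_SECTORS = 36  # factor 36 from the slide, read here as the number of TPC sectors

crus_24_gbt = {"IROC1": 2, "IROC2": 2, "OROC1": 2, "OROC2": 2, "OROC3": 1}
crus_32_gbt = {"IROC1": 1, "IROC2": 2, "OROC1": 2, "OROC2": 2, "OROC3": 1}

def total_crus(crus_per_partition):
    """CRUs per sector summed over partitions, times the number of sectors."""
    return sum(crus_per_partition.values()) * N_SECTORS

def buffer_mbit(kb_per_fifo, n_fifos):
    """Total FIFO memory in Mbit, using 1 Mb = 1000 kb."""
    return kb_per_fifo * n_fifos / 1000

print(total_crus(crus_24_gbt), total_crus(crus_32_gbt))  # 324 288
print(buffer_mbit(18, 2560), buffer_mbit(36, 150))       # 46.08 5.4
```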