CA+KF Track Reconstruction in the STS
I. Kisel, GSI / KIP
CBM Collaboration Meeting, GSI, February 28, 2008

Track Finder: What Is the Next Step?
- Optimize the STS geometry (strips, sector navigation)
- Mathematical and computational optimization
- SIMDization of the algorithm (from scalars to vectors)
- MIMDization (multi-threads, multi-cores)
- High track density
- Non-homogeneous magnetic field
- Fake space points dominate
- Single-sided strip detectors
- Detector inefficiency
- Not perfectly aligned system
- On-line event selection
- Large PC farm
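The step "from scalars to vectors" can be illustrated with a minimal sketch. It assumes a hypothetical packed type `Vec4f` with overloaded operators, so that one templated routine serves both scalar and vector arithmetic. The names `Vec4f`, `lane` and `extrapolateX` are illustrative only and are not taken from the L1 code; a production version would wrap hardware SIMD intrinsics instead of looping over lanes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 4-wide packed float. The lane loops stand in for
// single SIMD instructions in a real implementation.
struct Vec4f {
    std::array<float, 4> v;
};

inline Vec4f operator+(const Vec4f& a, const Vec4f& b) {
    Vec4f r{};
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

inline Vec4f operator*(const Vec4f& a, const Vec4f& b) {
    Vec4f r{};
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

// Read one lane of a packed value.
inline float lane(const Vec4f& a, std::size_t i) {
    assert(i < 4);
    return a.v[i];
}

// Straight-line extrapolation x' = x + tx * dz, written once.
// With T = float it handles one track, with T = Vec4f four tracks at a time.
template <typename T>
T extrapolateX(const T& x, const T& tx, const T& dz) {
    return x + tx * dz;
}
```

Calling `extrapolateX` with `float` arguments processes one track; calling it with `Vec4f` arguments processes four tracks with the same source code, which is the point of writing the algorithm against an abstract arithmetic type.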

Data Acquisition System
[Diagram labels: Detector, Event Builder Network (N x M), Scheduler, Farm Control System, PC Farm, Sub-Farm (SF n, available), RU, and the detector systems MAPS, STS, RICH, TRD, ECAL]
- Event rate: 10^7 ev/s, 50 kB/ev
- Slice rate: 10^5 slices/s, 100 ev/slice, 5 MB/slice
- PC farm: 10^? PCs
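The rates quoted for the data acquisition system are mutually consistent, which a few lines of arithmetic can confirm. The sketch below is only a cross-check: the function names are hypothetical, 1 kB is assumed to be 1000 bytes, and the aggregate bandwidth is a derived figure that does not appear on the slide.

```cpp
#include <cassert>
#include <cstdint>

// Numbers quoted on the slide (1 kB taken as 1000 bytes, an assumption).
const std::uint64_t kEventsPerSecond = 10000000ULL;  // 10^7 ev/s
const std::uint64_t kEventsPerSlice  = 100ULL;       // 100 ev/slice
const std::uint64_t kBytesPerEvent   = 50000ULL;     // 50 kB/ev

// Slices per second produced by the event builder network.
inline std::uint64_t slicesPerSecond(std::uint64_t eventsPerSecond,
                                     std::uint64_t eventsPerSlice) {
    assert(eventsPerSlice > 0);
    return eventsPerSecond / eventsPerSlice;
}

// Size of one slice in bytes.
inline std::uint64_t bytesPerSlice(std::uint64_t eventsPerSlice,
                                   std::uint64_t bytesPerEvent) {
    return eventsPerSlice * bytesPerEvent;
}

// Aggregate input bandwidth in bytes per second (derived, not on the slide).
inline std::uint64_t bytesPerSecond(std::uint64_t eventsPerSecond,
                                    std::uint64_t bytesPerEvent) {
    return eventsPerSecond * bytesPerEvent;
}
```

With the slide's inputs this reproduces 10^5 slices/s and 5 MB/slice, and implies roughly 500 GB/s of aggregate input under the stated assumptions.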

Cell Blade – a Sub-Farm with (2+16) Cores
[Diagram labels: Tracking and Vertexing Units, Sub-Farm Management Unit, Sub-Farm Decision/Selection Unit, FPGA, PCs, Sub-Farm]

Welcome to the Era of Multicore HPC
- Gaming: STI Cell
- GP GPU: Nvidia Tesla
- GP CPU: Intel Larrabee
- CPU/GPU: AMD Fusion
- ??

- High performance computing (HPC)
- Highest clock rate has been reached
- Performance/power optimization
- Heterogeneous systems of many (>8) cores
- Similar programming languages (Ct and CUDA), but standards are unlikely
- We need a uniform approach to all CPU/GPU families
- How to take advantage of the additional cores?

NVIDIA GeForce 9600 GT GPU: 64 Cores
- 64 processors
- GHz frequency
- double precision (?)
- 170 EUR price

Intel Polaris: 80 Cores
- 3.16 GHz, 0.95 Volt, 62 Watt -> 1.01 Teraflops

Cell Processor: 1+8 Cores

Computer Physics Communications 178 (2008)

Speed-up of the Kalman Filter Track Fit

Structure and Data: a Bottleneck
cbmroot/L1: L1Algo, L1Geometry, L1Event (L1Strips, L1Hits), L1Tracks
L1Algo: a standalone module, about 300 kB per central event

Input:
  Strips:
    float vStripValues[NStrips];        // strip coordinates (32b)
    unsigned char vStripFlags[NStrips]; // strip iStation (6b) + used (1b) + used_by_dublets (1b)
  Hits:
    struct L1StsHit {
      unsigned short int f, b;          // front (16b) and back (16b) strip indices
    };
    L1StsHit vHits[NHits];

Output:
    unsigned short int vRecoHits[NRecoHits]; // hit index (16b)
    unsigned char vRecoTracks[NRecoTracks];  // N hits on track (8b)

Internal:
    class L1Triplet {
      unsigned short int w0; // left hit (16b)
      unsigned short int w1; // first neighbour (16b) or middle hit (16b)
      unsigned short int w2; // N neighbours (16b) or right hit (16b)
      unsigned char b0;      // chi2 (5b) + level (3b)
      unsigned char b1;      // qp (8b)
      unsigned char b2;      // qp error (8b)
    };
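The bit budgets in the comments (6b + 1b + 1b for a strip flag, 5b + 3b for the triplet byte `b0`) imply that several fields share one byte. A minimal sketch of such packing follows. The slide does not say which bits hold which field, so the layout chosen here (station or chi2 in the low bits) is an assumption, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout of a strip flag byte (not confirmed by the slide):
// bits 0-5 station index, bit 6 "used", bit 7 "used by doublets".
inline std::uint8_t packStripFlags(unsigned station, bool used, bool usedByDoublets) {
    assert(station < 64);  // the station index must fit in 6 bits
    return static_cast<std::uint8_t>(station | (used ? 0x40u : 0u) |
                                     (usedByDoublets ? 0x80u : 0u));
}

inline unsigned stripStation(std::uint8_t flags) { return flags & 0x3Fu; }
inline bool stripUsed(std::uint8_t flags) { return (flags & 0x40u) != 0; }
inline bool stripUsedByDoublets(std::uint8_t flags) { return (flags & 0x80u) != 0; }

// Assumed layout of the triplet byte b0 (not confirmed by the slide):
// bits 0-4 quantized chi2, bits 5-7 level.
inline std::uint8_t packTripletB0(unsigned chi2Bin, unsigned level) {
    assert(chi2Bin < 32 && level < 8);  // 5 bits and 3 bits respectively
    return static_cast<std::uint8_t>((level << 5) | chi2Bin);
}

inline unsigned tripletChi2Bin(std::uint8_t b0) { return b0 & 0x1Fu; }
inline unsigned tripletLevel(std::uint8_t b0) { return (b0 >> 5) & 0x07u; }
```

Packing of this kind keeps each strip flag at one byte and the triplet payload at nine bytes, which is how the event data can stay within a few hundred kilobytes.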

Parallelization of the CA Track Finder
1 Create tracklets
2 Collect tracks
GSI, KIP, CERN

Kalman Filter Track Fit on Multicore Systems: Multithreading
[Plot: real fit time per track (µs) vs. number of tasks; logarithmic scale. Credit: Håvard Bjerke]
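Because each track is fitted independently, the fit can be split over tasks with no shared mutable state. The sketch below shows that idea with standard C++ threads. It is not the implementation behind the measurement: `Track` and `fitTrack` are hypothetical placeholders for the real track data and Kalman filter fit, and the even block partitioning is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical track record; the real one would hold state vector and covariance.
struct Track {
    float chi2 = -1.f;
    bool fitted = false;
};

// Placeholder for the Kalman filter fit of a single track.
inline void fitTrack(Track& t) {
    t.chi2 = 1.f;
    t.fitted = true;
}

// Fit all tracks using nTasks threads, each owning a contiguous block of tracks.
inline void fitTracksParallel(std::vector<Track>& tracks, unsigned nTasks) {
    assert(nTasks > 0);
    const std::size_t n = tracks.size();
    std::vector<std::thread> workers;
    workers.reserve(nTasks);
    for (unsigned task = 0; task < nTasks; ++task) {
        // Block [begin, end) for this task; the blocks tile [0, n) without overlap.
        const std::size_t begin = n * task / nTasks;
        const std::size_t end = n * (task + 1) / nTasks;
        workers.emplace_back([&tracks, begin, end] {
            for (std::size_t i = begin; i < end; ++i) fitTrack(tracks[i]);
        });
    }
    for (std::thread& w : workers) w.join();  // wait for all blocks to finish
}
```

Since no two threads touch the same track, no locking is needed; on some platforms the program must be linked with the threading library (for example `-pthread`).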

Summary and Plans
- SIMDized CA track finder works well
- Work on single-sided strip detectors started
- Multithreaded Kalman filter track fit
- Learn Ct (Intel) and CUDA (Nvidia) programming languages
- Investigate large multi-core systems (CPU and GPU)
- Parallelize the CA track finder
- Parallel hardware -> parallel languages -> parallel algorithms

Double-Sided vs. Single-Sided Strip Detectors: Tracking Efficiency
[Table: efficiency in % for double-sided (D-S) and single-sided (S-S) detectors by track category. Rows: Reference set (>1 GeV/c), All set (>=4 hits, >100 MeV/c), Extra set (<1 GeV/c), Clone, Ghost, MC tracks/event found, Time/event (s). Recoverable values: 96.0 (adjacent to the Reference set row), 25.6 (Time/event, s)]
- Standard geometry with all strips
- Thickness is the same for D-S and S-S strip stations
- Front stations positioned as in sts_allstrips.geo, back stations shifted in Z by +1 cm
- Fake space points are produced as in the double-sided scenario (within the same sector)
- True space points taken from MC (different sectors possible)
- No SIMDization
- No sorting of strips
- No sector navigation
- No memory optimization
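The track categories in the efficiency table are defined by momentum and hit-count cuts, which can be written down as a small classifier. The sketch below assumes that the reference and extra sets are the high- and low-momentum parts of the all set, and it assigns a track at exactly 1 GeV/c to the extra set; neither point is stated on the slide. The names `TrackCategory`, `classifyTrack` and `efficiencyPercent` are hypothetical.

```cpp
#include <cassert>

// Categories following the cuts quoted in the table.
enum class TrackCategory { NotInAllSet, Extra, Reference };

// Classify a Monte Carlo track by momentum (GeV/c) and number of hits.
// All set: >= 4 hits and p > 0.1 GeV/c. Reference: p > 1 GeV/c. Extra: the rest.
inline TrackCategory classifyTrack(double pGeV, int nHits) {
    assert(pGeV >= 0.0 && nHits >= 0);
    if (nHits < 4 || pGeV <= 0.1) return TrackCategory::NotInAllSet;
    return pGeV > 1.0 ? TrackCategory::Reference : TrackCategory::Extra;
}

// Efficiency in percent: reconstructed tracks over Monte Carlo tracks in a category.
inline double efficiencyPercent(unsigned nReconstructed, unsigned nMonteCarlo) {
    assert(nMonteCarlo > 0 && nReconstructed <= nMonteCarlo);
    return 100.0 * nReconstructed / nMonteCarlo;
}
```

For example, 96 reconstructed tracks out of 100 Monte Carlo tracks in a category gives an efficiency of 96%.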