L1 Event Reconstruction in the STS
I. Kisel, GSI / KIP
CBM Collaboration Meeting, Dubna, October 16, 2008

Many-core HPC

High performance computing (HPC)
- Highest clock rate is reached
- Performance/power optimization
- Heterogeneous systems of many (>8) cores
- Similar programming languages (Ct and CUDA)
- We need a uniform approach to all CPU/GPU families

On-line event selection
- Mathematical and computational optimization
- SIMDization of the algorithm (from scalars to vectors)
- MIMDization (multi-threads, multi-cores)
- Optimize the STS geometry (strips, sector navigation)
- Smooth magnetic field

Gaming: STI Cell
GP GPU: Nvidia Tesla
GP CPU: Intel Larrabee
CPU/GPU: AMD Fusion
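
The step "from scalars to vectors" can be illustrated with a minimal sketch. It assumes a hypothetical packed type Float4 and a hypothetical straight-line extrapolation x + tx * dz; neither is taken from the talk. The point is only that a formula written once as a template runs unchanged on one value or on four values packed together, provided the packed type overloads the arithmetic operators.

```cpp
#include <cstddef>

// Hypothetical packed type: four floats that are always processed together.
// A production version would wrap a hardware vector register instead.
struct Float4 {
    float v[4];
};

// Element-wise addition of two packed values.
inline Float4 operator+(const Float4& a, const Float4& b) {
    Float4 r{};
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

// Element-wise multiplication of two packed values.
inline Float4 operator*(const Float4& a, const Float4& b) {
    Float4 r{};
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

// Illustrative formula (an assumption, not the fitter's code):
// straight-line extrapolation of a coordinate x with slope tx over dz.
// T may be float (one track) or Float4 (four tracks at once).
template <typename T>
T extrapolate(const T& x, const T& tx, const T& dz) {
    return x + tx * dz;
}
```

Calling extrapolate with three float arguments handles one track; calling it with three Float4 arguments handles four tracks with the same source line.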

NVIDIA GeForce GTX 280

CUDA (Compute Unified Device Architecture)
- NVIDIA GT200 GeForce GTX MB.
- 933 GFlops single precision (240 FPUs).
- Double precision is finally supported, but at only ~90 GFlops (8-core Xeon: ~80 GFlops).

Currently under investigation:
- Tracking
- Linpack
- Image processing

Sebastian Kalcher

Intel Larrabee: 32 Cores

L. Seiler et al., Larrabee: A Many-Core x86 Architecture for Visual Computing, ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August

Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 series and the Radeon 4000 series, in three major ways. It will:
- use the x86 instruction set with Larrabee-specific extensions;
- feature cache coherency across all its cores;
- include very little specialized graphics hardware.

The x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
- LRB's x86 cores will be based on the much simpler Pentium design;
- each core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time;
- LRB includes one fixed-function graphics hardware unit;
- LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
- LRB includes explicit cache control instructions;
- each core supports 4-way simultaneous multithreading, with 4 copies of each processor register.

Intel Ct Language

Ct: Throughput Programming in C++. Tutorial. Intel.

Extend C++ for Throughput-Oriented Computing
- Ct adds new data types (parallel vectors) and operators to C++
  - Library-like interface, fully ANSI/ISO-compliant
- Ct abstracts away architectural details
  - Vector ISA width / core count / memory model / cache sizes
- Ct forward-scales software written today
  - The Ct platform-level API, Virtual Intel Platform (VIP), is designed to be dynamically retargetable to SSE, SSEx, LRB, etc.
- Ct is fully deterministic
  - No data races
- Nested data parallelism and deterministic task parallelism differentiate Ct when parallelizing irregular data and algorithms

Dot Product Using C Loops

    for (i = 0; i < n; i++) {
        dst += src1[i] * src2[i];
    }

Dot Product Using Ct

    TVEC Dst, Src1(src1, n), Src2(src2, n);
    Dst = addReduce(Src1*Src2);

- Src1*Src2 is an element-wise multiply
- addReduce is a reduction (a global sum)
- Vector operations subsume the loop

The basic type in Ct is a TVEC.
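
For readers without access to Ct, the same two-step structure (element-wise multiply, then a global sum) can be sketched in standard C++. This is not Ct code: the function names dot_loop and dot_reduce are hypothetical, and std::transform and std::accumulate stand in for the Ct vector multiply and addReduce.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Loop form: multiply and accumulate one element at a time.
float dot_loop(const std::vector<float>& a, const std::vector<float>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    float dst = 0.0f;  // the accumulator has to start at zero
    for (std::size_t i = 0; i < n; ++i) {
        dst += a[i] * b[i];
    }
    return dst;
}

// Whole-vector form: the loop is replaced by two vector operations.
float dot_reduce(const std::vector<float>& a, const std::vector<float>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    std::vector<float> prod(n);
    // Step 1: element-wise multiply (the role of Src1*Src2).
    std::transform(a.begin(), a.begin() + n, b.begin(), prod.begin(),
                   std::multiplies<float>());
    // Step 2: reduction to a global sum (the role of addReduce).
    return std::accumulate(prod.begin(), prod.end(), 0.0f);
}
```

Both functions return the same value for the same input. The second form names the two data-parallel operations explicitly, which is what lets a runtime map them to whatever vector width or core count is available.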

Ct vs. CUDA

Matthias Bach

Multi/Many-Core Investigations
- CA: Game of Life
- L1/HLT CA Track Finder
- SIMD KF Track Fitter
- LINPACK
- MIMDization (multi-threads, multi-cores)

GSI, KIP, CERN, Intel

SIMD KF Track Fit on Multicore Systems: Scalability

Using Intel Threading Building Blocks: linear scaling on multiple cores.

[Plot: real fit time/track (µs) vs. #threads]

Håvard Bjerke

Parallelization of the L1 CA Track Finder
1. Create tracklets
2. Collect tracks

GSI, KIP, CERN, Intel, ITEP, Uni-Kiev
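
The two stages can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: one coordinate per hit, tracklets formed from pairs of hits on neighbouring stations within a fixed window, and greedy chaining of tracklets that share a hit. The names Hit, Tracklet, create_tracklets and collect_tracks are hypothetical, and the real track finder is considerably more elaborate.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Hit {
    int station;  // station index along the beam
    double x;     // single measured coordinate (toy model)
};

struct Tracklet {
    std::size_t first;   // index of the upstream hit
    std::size_t second;  // index of the hit on the next station
};

// Stage 1: create tracklets. Each candidate pair is tested independently
// of all others, so this stage is a natural target for parallelization.
std::vector<Tracklet> create_tracklets(const std::vector<Hit>& hits,
                                       double window) {
    std::vector<Tracklet> out;
    for (std::size_t i = 0; i < hits.size(); ++i) {
        for (std::size_t j = 0; j < hits.size(); ++j) {
            if (hits[j].station == hits[i].station + 1 &&
                std::fabs(hits[j].x - hits[i].x) <= window) {
                out.push_back({i, j});
            }
        }
    }
    return out;
}

// Stage 2: collect tracks. Start from tracklets whose first hit does not
// end another tracklet, then follow shared hits downstream (greedily,
// taking the first matching tracklet).
std::vector<std::vector<std::size_t>> collect_tracks(
    const std::vector<Tracklet>& tracklets, std::size_t n_hits) {
    std::vector<bool> has_parent(n_hits, false);
    for (const Tracklet& t : tracklets) has_parent[t.second] = true;

    std::vector<std::vector<std::size_t>> tracks;
    for (const Tracklet& t : tracklets) {
        if (has_parent[t.first]) continue;  // not the start of a chain
        std::vector<std::size_t> track{t.first, t.second};
        bool extended = true;
        while (extended) {
            extended = false;
            for (const Tracklet& next : tracklets) {
                if (next.first == track.back()) {
                    track.push_back(next.second);
                    extended = true;
                    break;
                }
            }
        }
        tracks.push_back(track);
    }
    return tracks;
}
```

The chaining loop always terminates because every tracklet moves one station downstream, so a chain cannot revisit a hit.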

L1 Standalone Package for Event Selection

Igor Kulakov

KFParticle: Primary Vertex Finder

The algorithm is implemented and has passed its first tests.

Ruben Moor

L1 Standalone Package for Event Selection

Igor Kulakov, Iouri Vassiliev

                  Efficiency
Reference set     97.1%
All set           91.9%
Extra set         81.9%
Clone             3.5%
Ghost             3.2%
Tracks/event      691

Efficiency of D+ selection: 48.9%

Magnetic Field: Smooth in the Acceptance
1. Approximate with a polynomial in the plane of each station
2. Approximate with a parabolic function between each 3 stations

We need a smooth magnetic field in the acceptance.
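
Step 2 can be sketched as the parabola passing through the field values at three stations, written here in Lagrange form. This is a minimal illustration under assumptions: one field component, three distinct station positions z0, z1, z2, and a hypothetical function name parabola_through. The talk does not say how the parabola is actually computed.

```cpp
// Value at position z of the parabola through the three points
// (z0, b0), (z1, b1), (z2, b2), where b is one field component
// sampled at a station. The three z positions must be distinct.
double parabola_through(double z0, double b0,
                        double z1, double b1,
                        double z2, double b2,
                        double z) {
    // Lagrange basis polynomials: each equals 1 at its own station
    // and 0 at the other two.
    const double l0 = (z - z1) * (z - z2) / ((z0 - z1) * (z0 - z2));
    const double l1 = (z - z0) * (z - z2) / ((z1 - z0) * (z1 - z2));
    const double l2 = (z - z0) * (z - z1) / ((z2 - z0) * (z2 - z1));
    return b0 * l0 + b1 * l1 + b2 * l2;
}
```

The result reproduces the sampled values exactly at the three stations and varies smoothly between them, which is the property a track fit needs when it propagates from one station to the next.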

CA on the STS Geometry with Overlapping Sensors

UrQMD MC, central Au+Au, 25 AGeV

Efficiency and fraction of killed tracks are OK up to ΔZ = Z_hit - Z_station < ~0.2 cm.

Irina Rostovtseva

Summary and Plans
- Learn Ct (Intel) and CUDA (Nvidia) programming languages
- Develop the L1 standalone package for event selection
- Parallelize the CA track finder
- Investigate large multi-core systems (CPU and GPU)

Parallel hardware -> parallel languages -> parallel algorithms