Porting Reconstruction Algorithms to the Cell Broadband Engine

S. Gorbunov 1, U. Kebschull 1, I. Kisel 1,2, V. Lindenstruth 1, W.F.J. Müller 2
1 Kirchhoff Institute for Physics, University of Heidelberg, Germany
2 Gesellschaft für Schwerionenforschung mbH, Darmstadt, Germany

ACAT08, Erice, Sicily, November 3-7, 2008

CBM Experiment (FAIR/GSI): Tracking Challenge
- future heavy-ion experiment
- ~1000 charged particles per collision
- 10^7 Au+Au collisions per second
- inhomogeneous magnetic field
- double-sided strip detectors
Processing time per event increased from 1 s to 5 min for double-sided strip detectors with 85% fake space points!
Modern CPUs (Pentium, AMD, PowerPC, …) have vector units operating on 2 double-precision or 4 single-precision scalars in one go: SIMD = Single Instruction Multiple Data.
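To make this concrete, a minimal illustration using Intel's SSE intrinsics (an aside, not from the slide): a single instruction adds four single-precision numbers at once.

    #include <xmmintrin.h>   // SSE intrinsics

    int main() {
        __m128 a = _mm_set_ps(4.f, 3.f, 2.f, 1.f);    // four s.p. lanes {1, 2, 3, 4}
        __m128 b = _mm_set_ps(40.f, 30.f, 20.f, 10.f); // four s.p. lanes {10, 20, 30, 40}
        __m128 c = _mm_add_ps(a, b);                   // one instruction: {11, 22, 33, 44}
        float out[4];
        _mm_storeu_ps(out, c);                         // out = {11, 22, 33, 44}
        return 0;
    }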

Cell Processor: Supercomputer-on-a-Chip
- Sony PlayStation-3 -> cheap
- 32 (8x4) times faster! (8 SPEs, each with 4-wide single-precision SIMD)

Enhanced Cell with Double Precision
- 128 (32x4) times faster!
- future: 512 (32x16)?

Core of Reconstruction: Kalman Filter based Track Fit
detectors -> measurements -> (r, C): track parameters and their errors.
State vector: $r = \{ x, y, z, p_x, p_y, p_z \}$ (position and momentum).
Covariance matrix: $C$, a symmetric 6x6 matrix whose diagonal elements $\sigma_x^2, \sigma_y^2, \sigma_z^2, \sigma_{p_x}^2, \sigma_{p_y}^2, \sigma_{p_z}^2$ are the squared errors of the state components (e.g. $\sigma_x^2$ is the squared error of $x$).
The fit starts from initial estimates for $r$ and $C$ and alternates a prediction step (state estimate, error covariance) with a filtering step at each detector.
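The slide names only the two steps; written out in a standard textbook form (notation assumed here: $F$ the transport matrix, $Q$ the process noise, $H$ the measurement model, $V$ the measurement error, $m$ the measurement), they read:

Prediction step:
$$ \tilde r = F\,r, \qquad \tilde C = F\,C\,F^T + Q $$

Filtering step:
$$ K = \tilde C H^T \big( H \tilde C H^T + V \big)^{-1}, \qquad r = \tilde r + K\,(m - H \tilde r), \qquad C = (I - K H)\,\tilde C $$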

Kalman Filter for Track Fit 1/2
The fit has to cope with:
- arbitrarily large initial errors
- a non-homogeneous magnetic field, available as a large field map
- multiple scattering in material
- small measurement errors acting as weights in the update
First problem: the field map is far larger than the 256 KB of Local Store.

"Local" Approximation of the Magnetic Field
Problem: the full reconstruction must work within the 256 kB of the Local Store, but the magnetic field map is too large for that (70 MB). The position (x,y) to which the track is propagated is unknown in advance, therefore access to the magnetic field map is a blocking process.
Solution:
- use a polynomial approximation (4th order) of the field in the XY planes of the stations;
- assuming a parabolic behavior of the field between stations, calculate the magnetic field along the track from 3 consecutive measurements.
[Figure: P4 approximation in a station plane (e.g. Z = 50) and its difference from the full map; P2 interpolation between stations 1, 2 and 3.]
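A minimal sketch of this two-level scheme (the structure is assumed for illustration, not the authors' code): a 4th-order polynomial in (x, y) per station plane, plus a parabola in z through three consecutive station values.

    struct FieldValue { float Bx, By, Bz; };

    struct FieldSlice {
        float cx[15], cy[15], cz[15]; // coefficients of all monomials x^i y^j with i + j <= 4
        float z;                      // z position of the station plane

        FieldValue Get( float x, float y ) const {
            float m[15];              // monomials 1, x, y, x^2, xy, y^2, ...
            int k = 0;
            for( int n = 0; n <= 4; ++n )
                for( int j = 0; j <= n; ++j ){
                    float v = 1.f;
                    for( int p = 0; p < n - j; ++p ) v *= x;
                    for( int p = 0; p < j;     ++p ) v *= y;
                    m[k++] = v;
                }
            FieldValue B = { 0.f, 0.f, 0.f };
            for( int i = 0; i < 15; ++i ){
                B.Bx += cx[i]*m[i];  B.By += cy[i]*m[i];  B.Bz += cz[i]*m[i];
            }
            return B;
        }
    };

    // Parabolic (P2) interpolation of one field component between stations,
    // in Lagrange form through the station values f0, f1, f2 at z0, z1, z2.
    inline float InterpolateP2( float z, float z0, float f0,
                                float z1, float f1, float z2, float f2 )
    {
        return f0*(z - z1)*(z - z2)/((z0 - z1)*(z0 - z2))
             + f1*(z - z0)*(z - z2)/((z1 - z0)*(z1 - z2))
             + f2*(z - z0)*(z - z1)/((z2 - z0)*(z2 - z1));
    }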

Kalman Filter for Track Fit 2/2
The same chain (arbitrarily large initial errors, non-homogeneous field, multiple scattering in material, small measurement errors as update weights) exposes the second problem: not enough accuracy in single precision.

Kalman Filter: Conventional and Square-Root
The square-root KF provides the same precision as the conventional KF, but is roughly 30% slower.
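The slide does not spell out the square-root form; for reference, one standard scalar-measurement formulation (Potter's algorithm, given here as background rather than taken from the talk) propagates a factor $S$ with $C = S S^T$ instead of $C$ itself, which keeps the covariance positive definite:

$$ \phi = S^T H^T, \qquad a = \frac{1}{\phi^T \phi + V}, \qquad K = a\, S \phi, \qquad \gamma = \frac{a}{1 + \sqrt{aV}} $$
$$ r^+ = r + K\,(m - H r), \qquad S^+ = S - \gamma\,(S\phi)\,\phi^T $$

Squaring $S^+$ reproduces the conventional update $C^+ = (I - K H)\,C$ without ever forming the ill-conditioned difference explicitly.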

Kalman Filter Instability in Single Precision
Illustrative example: 2D fit of a straight line, no multiple scattering, 15 detectors.
[Figure: residuals of the d.p. conventional KF, the s.p. conventional KF and the s.p. square-root KF.]
Conclusion for single precision:
- use the square-root version of the KF, or
- use double precision for the critical part of the KF, or
- use a proper initialization in the conventional KF.
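As an illustration of the last option, a hypothetical sketch (the covariance layout and all numbers here are assumptions, not from the slide): starting the conventional filter with large but physically motivated variances, instead of quasi-infinite ones, avoids catastrophic cancellation in the single-precision update.

    // Hypothetical "proper initialization": large but realistic variances instead
    // of sigma^2 = 1e8, so the s.p. update (I - KH)C keeps significant digits.
    struct Cov6 { float C[6][6]; };

    void InitCovariance( Cov6 &cov )
    {
        for( int i = 0; i < 6; ++i )
            for( int j = 0; j < 6; ++j ) cov.C[i][j] = 0.f;
        cov.C[0][0] = cov.C[1][1] = cov.C[2][2] = 10.f*10.f; // sigma_pos ~ 10 cm (illustrative)
        cov.C[3][3] = cov.C[4][4] = cov.C[5][5] = 1.f*1.f;   // sigma_p ~ 1 GeV/c (illustrative)
    }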

Porting Algorithms to Cell

Platform:
1. Linux
2. Virtual machine: Red Hat (Fedora Core 4) with the Cell Simulator (PPE, SPE)
3. Cell Blade

SIMD instruction sets: SSE2, AltiVec.
Data types: scalar double, scalar float, pseudo-vector (array), vector (4 float).

Approach:
- Universality (any multi-core architecture)
- Vectorization (SIMDization): tracks are packed four at a time, |Tr|Tr|Tr|Tr|
- Run SPEs independently (one collision per SPE)

Use headers to overload the +, -, *, / operators -> the source code is unchanged: c = a + b. A sketch of such a header follows below.
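For platforms without SIMD, the pseudo-vector (array) type can be a plain struct; a minimal sketch of the idea (illustrative, not the project's actual header) so that the same c = a + b source still compiles:

    struct Fvec_t {                           // scalar fallback: an array of 4 floats
        float v[4];
        friend Fvec_t operator+( const Fvec_t &a, const Fvec_t &b ){
            Fvec_t r;
            for( int i = 0; i < 4; ++i ) r.v[i] = a.v[i] + b.v[i];
            return r;
        }
        // operator-, operator*, operator/, sqrt(), min(), max() follow the same pattern
    };

On SSE the header instead does typedef F32vec4 Fvec_t; (see the header slide below), and the identical algorithm code compiles to real vector instructions.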

Code (Part of the Kalman Filter)

inline void AddMaterial( TrackV &track, Station &st, Fvec_t &qp0 )
{
  cnst mass2 = 0.1396*0.1396;   // pion mass squared [GeV^2]
  Fvec_t tx = track.T[2];       // track slopes
  Fvec_t ty = track.T[3];
  Fvec_t txtx = tx*tx;
  Fvec_t tyty = ty*ty;
  Fvec_t txtx1 = txtx + ONE;
  Fvec_t h = txtx + tyty;
  Fvec_t t = sqrt(txtx1 + tyty);
  Fvec_t h2 = h*h;
  Fvec_t qp0t = qp0*t;
  cnst c1=0.0136, c2=c1*0.038, c3=c2*0.5, c4=-c3/2.0, c5=c3/3.0, c6=-c3/4.0; // Highland formula coefficients
  Fvec_t s0 = (c1+c2*st.logRadThick + c3*h + h2*(c4 + c5*h + c6*h2))*qp0t;   // multiple-scattering angle
  Fvec_t a = (ONE+mass2*qp0*qp0t)*st.RadThick*s0*s0;
  CovV &C = track.C;            // add the scattering contribution to the slope covariances
  C.C22 += txtx1*a;
  C.C32 += tx*ty*a;
  C.C33 += (ONE+tyty)*a;
}
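Since Fvec_t carries four single-precision values, one call propagates the material correction for four tracks at once; a hypothetical call site (the variable contents are assumptions for illustration):

    TrackV track;   // parameters/covariances of 4 tracks, one per SIMD lane
    Station st;     // material description of the station (RadThick, logRadThick)
    Fvec_t qp0;     // q/p linearization point for each of the 4 tracks
    AddMaterial( track, st, qp0 );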

Header (Intel's SSE)

typedef F32vec4 Fvec_t;

/* Arithmetic Operators */
friend F32vec4 operator +(const F32vec4 &a, const F32vec4 &b) { return _mm_add_ps(a,b); }
friend F32vec4 operator -(const F32vec4 &a, const F32vec4 &b) { return _mm_sub_ps(a,b); }
friend F32vec4 operator *(const F32vec4 &a, const F32vec4 &b) { return _mm_mul_ps(a,b); }
friend F32vec4 operator /(const F32vec4 &a, const F32vec4 &b) { return _mm_div_ps(a,b); }

/* Functions */
friend F32vec4 min( const F32vec4 &a, const F32vec4 &b ){ return _mm_min_ps(a, b); }
friend F32vec4 max( const F32vec4 &a, const F32vec4 &b ){ return _mm_max_ps(a, b); }

/* Square Root */
friend F32vec4 sqrt( const F32vec4 &a ){ return _mm_sqrt_ps(a); }

/* Absolute value */
friend F32vec4 fabs( const F32vec4 &a ){ return _mm_and_ps(a, _f32vec4_abs_mask); }

/* Logical (mask returned) */
friend F32vec4 operator&( const F32vec4 &a, const F32vec4 &b ){ return _mm_and_ps(a, b); }
friend F32vec4 operator|( const F32vec4 &a, const F32vec4 &b ){ return _mm_or_ps(a, b); }
friend F32vec4 operator^( const F32vec4 &a, const F32vec4 &b ){ return _mm_xor_ps(a, b); }
friend F32vec4 operator!( const F32vec4 &a ){ return _mm_xor_ps(a, _f32vec4_true); }
friend F32vec4 operator||( const F32vec4 &a, const F32vec4 &b ){ return _mm_or_ps(a, b); }

/* Comparison (mask returned) */
friend F32vec4 operator<( const F32vec4 &a, const F32vec4 &b ){ return _mm_cmplt_ps(a, b); }

Every operator maps directly to a single SIMD instruction.
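The logical and comparison operators return lane-wise bit masks rather than booleans, so branches become mask arithmetic; a small usage sketch built from the operators above (illustrative):

    F32vec4 mask = ( a < b );                         // all-ones in lanes where a < b
    F32vec4 smaller = ( mask & a ) | ( !mask & b );   // per-lane minimum without branching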

SPE Statistics
A timing profile is available for the SPE code: no need to check the assembler code!

Kalman Filter Track Fit on Intel Xeon, AMD Opteron and Cell

Motivated, but not restricted, by Cell! Benchmark systems:
- 2 Intel Xeon processors with Hyper-Threading enabled and 512 kB cache at 2.66 GHz
- 2 dual-core AMD Opteron 265 processors with 1024 kB cache at 1.8 GHz
- 2 Cell Broadband Engines with 256 kB Local Store at 2.4 GHz
[Figure: fit time per track on Intel P4 and Cell; Cell faster!]

SIMDized Cellular Automaton Track Finder: Pentium 4

Track category              Efficiency, %
Reference set (>1 GeV/c)    97.04
All set
Extra set (<1 GeV/c)
Clone
Ghost

MC tracks/event found: 93
CA time/event: 78 ms -> 5 ms
1000 faster!

06 November 2008, ACAT08, EriceIvan Kisel, GSI/KIP17/18 SIMD KF: Code and Article

Summary
- Think about using SIMD units in the nearest future (many-cores, TF/s, …)
- Use single-precision floating point if possible
- In critical parts use double precision if necessary
- Avoid accessing main memory: no maps, no look-up tables
- New parallel languages appear: Ct, CUDA, …
- Keep portability of the code (Intel, AMD, Cell, GPGPU, …)
- Try the auto-vectorization option of the compilers