Event Reconstruction in STS. I. Kisel, GSI. CBM-RF-JINR Meeting, Dubna, May 21, 2009.


Slide 1/24: Event Reconstruction in STS

I. Kisel, GSI
CBM-RF-JINR Meeting, Dubna, May 21, 2009

Slide 2/24: Many-core HPC

- Heterogeneous systems of many cores
- Uniform approach to all CPU/GPU families
- Similar programming languages (CUDA, Ct, OpenCL)
- Parallelization of the algorithm (vectors, multi-threads, many-cores)
- On-line event selection
- Mathematical and computational optimization
- Optimization of the detector

Hardware families under consideration:
- CPU: Intel XX-cores
- GP CPU: Intel Larrabee
- GP GPU: Nvidia Tesla
- CPU/GPU: AMD Fusion
- Gaming: STI Cell
- FPGA: Xilinx Virtex

? OpenCL ?

Slide 3/24: Current and Expected Eras of Intel Processor Architectures

From S. Borkar et al. (Intel Corp.), "Platform 2015: Intel Platform Evolution for the Next Decade".

- Future programming is 3-dimensional: cores × HW threads × SIMD width
- The amount of data is doubling every month
- Massive data streams
- The RMS (Recognition, Mining, Synthesis) workload in real time
- Supercomputer-level performance in ordinary servers and PCs
- Applications like real-time decision-making analysis

Slide 4/24: Cores and HW Threads

- CPU architecture in 19XX: 1 process per CPU
- CPU architecture in 2009: several threads per process per CPU (diagram: one process with Thread1 and Thread2 interleaving execute and read/write phases)
- CPU of your laptop in 2015: many cores, each with several HW threads

Cores and HW threads are seen by an operating system as CPUs:
> cat /proc/cpuinfo

At most half of the threads are executed at any one time.
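Since HW threads simply appear to the operating system as CPUs, the count can also be queried programmatically. A minimal C++ sketch (using the C++11 std::thread facility, which postdates this 2009 talk):

    #include <iostream>
    #include <thread>

    int main() {
        // Number of logical CPUs (cores × HW threads) visible to the OS;
        // on Linux this matches the processor entries in /proc/cpuinfo.
        unsigned n = std::thread::hardware_concurrency();
        std::cout << "Logical CPUs: " << n << "\n";
        return 0;
    }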

Slide 5/24: SIMD Width

SIMD = Single Instruction Multiple Data:
- SIMD uses vector registers
- SIMD exploits data-level parallelism

Register widths (one instruction processes all lanes at once):
- Scalar double precision (64 bits): 1 value
- Vector (SIMD) double precision (128 bits): 2 values -> 2x or 1/2x
- Vector (SIMD) single precision (128 bits): 4 values -> 4x or 1/4x
- Intel AVX (2010) vector single precision (256 bits): 8 values -> 8x or 1/8x
- Intel LRB (2010) vector single precision (512 bits): 16 values -> 16x or 1/16x

Faster or slower? Vectorized code gains the factor; code that cannot fill the lanes loses it.
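As an illustration of the 128-bit single-precision case, a minimal C++ sketch with SSE intrinsics (illustrative values, not code from the talk): one _mm_add_ps performs four additions.

    #include <xmmintrin.h>  // SSE intrinsics
    #include <cstdio>

    int main() {
        // Four single-precision lanes in one 128-bit vector register.
        alignas(16) float a[4] = {1.f, 2.f, 3.f, 4.f};
        alignas(16) float b[4] = {10.f, 20.f, 30.f, 40.f};
        alignas(16) float c[4];

        __m128 va = _mm_load_ps(a);
        __m128 vb = _mm_load_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  // 4 additions, 1 instruction
        _mm_store_ps(c, vc);

        std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }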

Slide 6/24: SIMD KF Track Fit on Intel Multicore Systems: Scalability

H. Bjerke, S. Gorbunov, I. Kisel, V. Lindenstruth, P. Post, R. Ratering

- Real-time performance on different CPU architectures: speed-up of 100 with 32 threads
- Speed-up of 3.7 on the Xeon 5140 (Woodcrest)
- Real-time performance on different Intel CPU platforms

[Plot: fit time, µs/track, vs. number of threads on Cell SPE (16), Woodcrest (2), Clovertown (4), Dunnington (6); curves for the scalar, vector double and vector single versions]
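The idea behind the SIMD KF fit is to keep track parameters in structure-of-arrays form so that each Kalman-filter step advances several tracks at once. Below is a deliberately simplified, hypothetical sketch with a one-component state per track (the real fit propagates a 5-parameter state vector and its covariance matrix; names are illustrative, this is not the actual L1 code):

    #include <xmmintrin.h>

    // Simplified 1-D Kalman filter update for 4 tracks in parallel:
    // one SIMD lane per track, identical arithmetic on all lanes.
    struct TrackPack4 {
        __m128 x;  // state estimates of 4 tracks
        __m128 C;  // state variances of 4 tracks
    };

    inline void kalmanUpdate4(TrackPack4& t, __m128 meas, __m128 measVar) {
        // Gain K = C / (C + V)
        __m128 K = _mm_div_ps(t.C, _mm_add_ps(t.C, measVar));
        // x += K * (meas - x)
        t.x = _mm_add_ps(t.x, _mm_mul_ps(K, _mm_sub_ps(meas, t.x)));
        // C *= (1 - K)
        t.C = _mm_mul_ps(t.C, _mm_sub_ps(_mm_set1_ps(1.f), K));
    }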

Slide 7/24: Intel Larrabee: 32 Cores

L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.

LRB vs. GPU: Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 series and the Radeon 4000 series, in three major ways:
- use of the x86 instruction set with Larrabee-specific extensions;
- cache coherency across all its cores;
- very little specialized graphics hardware.

LRB vs. CPU: the x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
- LRB's 32 x86 cores will be based on the much simpler Pentium design;
- each core supports 4-way simultaneous multithreading, with 4 copies of each processor register;
- each core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time;
- LRB includes explicit cache control instructions;
- LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
- LRB includes one fixed-function graphics hardware unit.

Slide 8/24: General Purpose Graphics Processing Units (GPGPU)

- Substantial evolution of graphics hardware over the past years
- Remarkable programmability and flexibility
- Reasonably cheap
- New branch of research: GPGPU

Slide 9/24: NVIDIA Hardware

S. Kalcher, M. Bach

- Streaming multiprocessors
- Thread switching without overhead
- FPUs instead of cache/control
- Complex memory hierarchy
- SIMT: Single Instruction Multiple Threads

GT200:
- 30 multiprocessors, 30 DP units
- 8 SP FPUs per MP, 240 SP units in total
- registers and 16 kB shared memory per MP
- >= 1 GB main memory
- 1.4 GHz clock
- 933 GFlops SP

Slide 10/24: SIMD/SIMT Kalman Filter on the CSC-Scout Cluster

M. Bach, S. Gorbunov, S. Kalcher, U. Kebschull, I. Kisel, V. Lindenstruth

Cluster: 18 × (2 × (Quad-Xeon, 3.0 GHz, 2×6 MB L2), 16 GB) + 27 × Tesla S1070 (4 × (GT200, 4 GB))

[Chart: Kalman-filter throughput, CPU 1600 vs. GPU 9100; units lost in the transcript]

Slide 11/24: CPU/GPU Programming Frameworks

Cg, OpenGL Shading Language, DirectX:
- designed to write shaders;
- require the problem to be expressed graphically.

AMD Brook:
- pure stream computing;
- not hardware specific.

AMD CAL (Compute Abstraction Layer):
- generic usage of the hardware on assembler level.

NVIDIA CUDA (Compute Unified Device Architecture):
- defines the hardware platform;
- generic programming;
- extension to the C language;
- explicit memory management;
- programming on thread level.

Intel Ct (C for throughput):
- extension to the C language;
- Intel CPU/GPU specific;
- SIMD exploitation for automatic parallelism.

OpenCL (Open Computing Language):
- open standard for generic programming;
- extension to the C language;
- supposed to work on any hardware;
- specific hardware capabilities used via extensions (a minimal host-code sketch follows below).
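To make the "extension to the C language" and "explicit memory management" points concrete, here is a minimal OpenCL vector-add driven from C++ host code. This is a sketch only: error handling is omitted and an installed OpenCL 1.x runtime is assumed.

    #include <CL/cl.h>
    #include <cstdio>

    // OpenCL C kernel, shipped as a string and compiled at run time.
    static const char* kSrc =
        "__kernel void vadd(__global const float* a,\n"
        "                   __global const float* b,\n"
        "                   __global float* c) {\n"
        "    size_t i = get_global_id(0);\n"
        "    c[i] = a[i] + b[i];\n"
        "}\n";

    int main() {
        const size_t N = 1024;
        float a[N], b[N], c[N];
        for (size_t i = 0; i < N; ++i) { a[i] = float(i); b[i] = 2.f * i; }

        cl_platform_id plat; clGetPlatformIDs(1, &plat, nullptr);
        cl_device_id dev;    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, nullptr);
        cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, nullptr);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
        clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
        cl_kernel k = clCreateKernel(prog, "vadd", nullptr);

        // Explicit memory management: device buffers created and filled by hand.
        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(a), a, nullptr);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, nullptr);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), nullptr, nullptr);

        clSetKernelArg(k, 0, sizeof(da), &da);
        clSetKernelArg(k, 1, sizeof(db), &db);
        clSetKernelArg(k, 2, sizeof(dc), &dc);

        size_t global = N;  // one work-item per element
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof(c), c, 0, nullptr, nullptr);

        std::printf("c[7] = %g\n", c[7]);  // expected: 21
        return 0;
    }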

Slide 12/24: Cellular Automaton Track Finder
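The slide carries only its title; for orientation, here is a hedged sketch of the general cellular-automaton tracking idea (hypothetical types and names, not the actual L1 code). Short segments between hits on neighbouring stations act as cells; each cell's counter grows along chains of geometrically compatible neighbours; track candidates are read off the longest chains.

    #include <vector>
    #include <algorithm>

    // One cell = a track segment connecting hits on neighbouring stations.
    struct Cell {
        int station;                 // station of the segment's outer hit
        int counter = 1;             // length of the best chain ending here
        std::vector<int> neighbors;  // compatible cells one station earlier
    };

    // One CA evolution sweep: a cell's counter becomes 1 + the best counter
    // among its neighbours; sweeping station by station propagates chain
    // lengths through the whole detector.
    void evolve(std::vector<Cell>& cells, int nStations) {
        for (int s = 1; s < nStations; ++s)
            for (Cell& c : cells)
                if (c.station == s)
                    for (int n : c.neighbors)
                        c.counter = std::max(c.counter, cells[n].counter + 1);
    }
    // Track candidates are then collected by following neighbours with
    // strictly decreasing counters, starting from the highest counters.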

Slide 13/24: L1 CA Track Finder: Efficiency

I. Rostovtseva

Track category                    Efficiency, %
Reference set (>1 GeV/c)          95.2
All set (>=4 hits, >100 MeV/c)    89.8
Extra set (<1 GeV/c)              78.6
Clone                              2.8
Ghost                              6.6

MC tracks/ev found: 672
Speed: 0.8 s/ev

Possible explanations under discussion:
- Fluctuating magnetic field?
- Too large STS acceptance?
- Too large distance between STS stations?

Slide 14/24: L1 CA Track Finder: Changes

I. Kulakov

Slide 15/24: L1 CA Track Finder: Timing

I. Kulakov

old = old version (from CBMRoot DEC08); new = new parallelized version.
Statistics: 100 central events. Processor: Pentium D, 3.0 GHz, 2 MB.

[Table: CPU time [ms] and real time [ms] for the old version and the new version with 1, 2 and 3 threads; numeric values lost in the transcript]
[Table: reference, all and extra set efficiencies, clone and ghost rates, tracks/event, CPU and real time [ms] vs. R [cm]; numeric values lost in the transcript]

Slide 16/24: On-line = Off-line Reconstruction?

- Off-line and on-line reconstruction will and should be parallelized
- Both versions will run on similar many-core systems, or even on the same PC farm
- Both versions will (probably) use the same parallel language(s), such as OpenCL
- Can we use the same code, but with some physics cuts applied when running on-line, like L1 CA?
- If the final code is fast, can we think about a global on-line event reconstruction and selection?

Parallelization status (package vs. technology; only the STS row is filled in):

       Intel SIMD   Intel MIMD   Intel Ct   NVIDIA CUDA   OpenCL
STS        +            +            +           +           –

MuCh, RICH, TRD, Your Reco, Open Charm Analysis, Your Analysis: open

Slide 17/24: Summary

- Think parallel!
- Parallel programming is the key to the full potential of Tera-scale platforms
- Data parallelism vs. parallelism of the algorithm
- Stream processing: no branches (see the sketch after this list)
- Avoid direct access to main memory: no maps, no look-up tables
- Use the SIMD unit in the nearest future (many-cores, TF/s, …)
- Use single-precision floating point where possible; in critical parts use double precision if necessary
- Keep the code portable across heterogeneous systems (Intel, AMD, Cell, GPGPU, …)
- New parallel languages appear: OpenCL, Ct, CUDA
- GPGPU is a personal supercomputer with 1 TFlops for 300 EUR!!!
- Should we start buying them for testing the algorithms now?
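To illustrate the "no branches" advice, a small C++/SSE sketch (illustrative, not from the talk) that replaces a per-element if/else with a compare mask and a bitwise select, keeping all four lanes on a single instruction stream:

    #include <xmmintrin.h>

    // Branchless per-lane select: out[i] = (a[i] > b[i]) ? a[i] : b[i].
    // A scalar if/else would serialize the lanes; a comparison mask
    // combined with and/andnot keeps the computation purely data-parallel.
    inline __m128 maxSelect(__m128 a, __m128 b) {
        __m128 mask = _mm_cmpgt_ps(a, b);          // all-ones where a > b
        return _mm_or_ps(_mm_and_ps(mask, a),      // take a where mask is set
                         _mm_andnot_ps(mask, b));  // take b elsewhere
    }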

Slide 18/24: Back-up Slides (1-5)

Slide 19/24: Back-up Slide (1/5)

Slide 20/24: Back-up Slide (2/5)

Slide 21/24: Back-up Slide (3/5)

Slide 22/24: Back-up Slide (4/5). Note: SIMD is out of consideration (I.K.)

Slide 23/24: Back-up Slide (5/5)

Slide 24/24: Tracking Workshop

Please be invited to the Tracking Workshop, June 2009, at GSI.