Status of the L1 STS Tracking. I. Kisel, GSI / KIP. CBM Collaboration Meeting, GSI, March 12, 2009.



12 March 2009, GSI / Ivan Kisel, GSI

Slide 2/14: L1 CA Track Finder Efficiency (I. Rostovtseva)

  Track category                   Efficiency, %
  Reference set (>1 GeV/c)         95.2
  All set (>=4 hits, >100 MeV/c)   89.8
  Extra set (<1 GeV/c)             78.6
  Clone                             2.8
  Ghost                             6.6

  MC tracks/event found: 672
  Speed: 0.8 s/event

Open questions:
- Fluctuated magnetic field?
- Too large STS acceptance?
- Too large distance between STS stations?
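The quality figures in the table follow the usual track-finder definitions: efficiency is the fraction of MC tracks of a category that were found, while clone and ghost rates are counted relative to all reconstructed candidates. A minimal sketch of one common convention (the counter names are hypothetical, not the actual L1 bookkeeping, and CBM's exact matching criteria may differ):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical per-category counters; the real L1 package keeps its own
// bookkeeping. A reco candidate is "matched" if most of its hits come from
// one MC track; repeated matches are clones; unmatched candidates are ghosts.
struct TrackFinderStats {
    int mcTracks  = 0;  // MC tracks in the category (e.g. the reference set)
    int matched   = 0;  // MC tracks found at least once
    int recoTotal = 0;  // all reconstructed candidates
    int clones    = 0;  // repeated matches to an already-found MC track
    int ghosts    = 0;  // candidates matching no MC track
};

// Efficiency of a category, in percent (e.g. 95.2 for the reference set).
double efficiencyPct(const TrackFinderStats& s) {
    return s.mcTracks ? 100.0 * s.matched / s.mcTracks : 0.0;
}

// Ghost and clone rates, in percent of all reconstructed candidates.
double ghostPct(const TrackFinderStats& s) {
    return s.recoTotal ? 100.0 * s.ghosts / s.recoTotal : 0.0;
}
double clonePct(const TrackFinderStats& s) {
    return s.recoTotal ? 100.0 * s.clones / s.recoTotal : 0.0;
}
```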

Slide 3/14: Many-core HPC

High performance computing (HPC):
- The highest clock rate has been reached
- Performance/power optimization
- Heterogeneous systems of many (>8) cores
- Similar programming languages (OpenCL, Ct and CUDA)
- We need a uniform approach to all CPU/GPU families

On-line event selection:
- Mathematical and computational optimization
- SIMDization of the algorithm (from scalars to vectors)
- MIMDization (multi-threads, many-cores)

Candidate platforms (all still open questions):
- Gaming: STI Cell
- GP CPU: Intel Larrabee
- GP GPU: Nvidia Tesla
- CPU: Intel XX-cores
- FPGA: Xilinx
- CPU/GPU: AMD Fusion
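One way to get the "uniform approach to all CPU/GPU families" asked for above is to write the algorithm once against an overloaded vector type and let a per-platform header map it to SSE, Cell or plain scalar code. A minimal sketch with only the scalar fallback (the class name, width and `update` formula are illustrative, not the actual CBM headers):

```cpp
#include <cassert>
#include <cstddef>

// Portable SIMD vector sketch: algorithms are written in terms of float_v;
// a platform header would implement the same interface with SSE/AVX/Cell
// intrinsics. This fallback emulates a 4-wide vector with plain floats.
struct float_v {
    static const std::size_t Size = 4;
    float v[Size];
    explicit float_v(float x = 0.f) {
        for (std::size_t i = 0; i < Size; ++i) v[i] = x;
    }
    float operator[](std::size_t i) const { return v[i]; }
};

inline float_v operator+(const float_v& a, const float_v& b) {
    float_v r;
    for (std::size_t i = 0; i < float_v::Size; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}
inline float_v operator*(const float_v& a, const float_v& b) {
    float_v r;
    for (std::size_t i = 0; i < float_v::Size; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

// Any formula written once, e.g. a weighted update x + w*dx, now processes
// 4 tracks at a time regardless of the underlying instruction set.
inline float_v update(const float_v& x, const float_v& w, const float_v& dx) {
    return x + w * dx;
}
```

The payoff is that "SIMDization of the algorithm" becomes a type change rather than a rewrite per platform.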

Slide 4/14: Current and Expected Eras of Intel Processor Architectures

From S. Borkar et al. (Intel Corp.), "Platform 2015: Intel Platform Evolution for the Next Decade":
- Future programming is 3-dimensional: cores, threads, SIMD width
- The amount of data is doubling every month
- Massive data streams
- The RMS (Recognition, Mining, Synthesis) workload in real time
- Supercomputer-level performance in ordinary servers and PCs
- Applications like real-time decision-making analysis

Slide 5/14: Cores and Threads

Diagram: in a CPU architecture of the 19XX era there is 1 process per CPU; in the CPU architecture of 2009, and even more so in the CPU of your laptop in 2015, there are many threads per process per CPU. A process owns the address space; its threads (Thread1, Thread2, ...) each execute code and read/write the same shared memory.
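The many-threads-per-process picture can be made concrete with a short example, assuming a C++11 compiler (the 2009 slide predates std::thread, so this is purely illustrative): two threads of one process fill disjoint halves of the same array, with no copying or message passing between them.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Two threads in one process write to disjoint halves of a shared buffer.
// Unlike separate processes, both threads see the same address space, so
// the "r/w" arrows of the slide's diagram need no data transfer at all.
std::vector<int> fillShared(std::size_t n) {
    std::vector<int> data(n, 0);
    auto fill = [&data](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            data[i] = static_cast<int>(i);
    };
    std::thread t1(fill, std::size_t{0}, n / 2);  // first half
    std::thread t2(fill, n / 2, n);               // second half
    t1.join();
    t2.join();
    return data;
}
```

Writing to disjoint ranges keeps the example free of data races; shared writes to the same elements would need synchronization.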

Slide 6/14: SIMD Width

SIMD = Single Instruction Multiple Data:
- SIMD uses vector registers
- SIMD exploits data-level parallelism

Faster or slower? Register widths and values processed per instruction:
- Scalar double precision (64 bits): 1 value
- Vector (SIMD) double precision (128 bits): 2 values, so 2x or 1/2
- Vector (SIMD) single precision (128 bits): 4 values, so 4x or 1/4
- Intel AVX (2010), vector single precision (256 bits): 8 values, so 8x or 1/8
- Intel LRB (2010), vector single precision (512 bits): 16 values, so 16x or 1/16
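The 128-bit single-precision case above maps directly onto SSE intrinsics: one instruction adds four packed floats where scalar code would need four additions. A minimal sketch, assuming an x86 compiler with SSE support:

```cpp
#include <cassert>
#include <immintrin.h>

// Add 4 single-precision values with one SSE instruction (_mm_add_ps);
// scalar code would issue 4 separate additions for the same work.
void add4(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);             // load 4 floats (unaligned ok)
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  // 4 adds in one instruction
}
```

The "or 1/2, or 1/4" on the slide is the flip side: code that cannot be vectorized wastes the unused lanes of the register.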

Slide 7/14: SIMD KF Track Fit on Intel Multicore Systems: Scalability

H. Bjerke, S. Gorbunov, I. Kisel, V. Lindenstruth, P. Post, R. Ratering

- Real-time performance on different Intel CPU platforms
- Speed-up 3.7 on the Xeon 5140 (Woodcrest) at 2.4 GHz using icc 9.1
- Real-time performance on the quad-core Xeon 5345 (Clovertown) at 2.4 GHz: speed-up 30 with 16 threads
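Whether such near-linear scaling is attainable is governed by the serial fraction of the code: by Amdahl's law a fit must be almost entirely parallel before many cores and threads pay off. A small worked check (illustrative only; the measured speed-ups above come from the actual fitter):

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: with parallel fraction p of the run time and n processing
// units, the overall speed-up is S(n) = 1 / ((1 - p) + p / n).
double amdahl(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

Even 5% of serial code caps 16 threads at roughly a 9x speed-up, which is why the SIMDization/MIMDization effort targets the whole algorithm rather than a few hot loops.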

Slide 8/14: Intel Larrabee: 32 Cores

L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.

LRB vs. GPU: Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 series and the Radeon 4000 series, in three major ways:
- it will use the x86 instruction set with Larrabee-specific extensions;
- it will feature cache coherency across all its cores;
- it will include very little specialized graphics hardware.

LRB vs. CPU: the x86 processor cores in Larrabee will be different in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
- LRB's 32 x86 cores will be based on the much simpler Pentium design;
- each core supports 4-way simultaneous multithreading, with 4 copies of each processor register;
- each core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time;
- LRB includes explicit cache control instructions;
- LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
- LRB includes one fixed-function graphics hardware unit.

Slide 9/14: General Purpose Graphics Processing Units (GPGPU)

- Substantial evolution of graphics hardware over the past years
- Remarkable programmability and flexibility
- Reasonably cheap
- A new branch of research: GPGPU

Slide 10/14: NVIDIA Hardware (S. Kalcher, M. Bach)

- Streaming multiprocessors
- No-overhead thread switching
- FPUs instead of cache/control
- Complex memory hierarchy
- SIMT: Single Instruction Multiple Threads

GT200:
- 30 multiprocessors
- 30 DP units
- 8 SP FPUs per MP, 240 SP units in total
- Register file per MP
- 16 kB shared memory per MP
- >= 1 GB main memory
- 1.4 GHz clock
- 933 GFlops SP

Slide 11/14: SIMD/SIMT Kalman Filter on the CSC-Scout Cluster

M. Bach, S. Gorbunov, S. Kalcher, U. Kebschull, I. Kisel, V. Lindenstruth

Cluster: 18 x (2 x (Quad-Xeon, 3.0 GHz, 2x6 MB L2), 16 GB) + 27 x Tesla S1070 (4 x (GT200, 4 GB))

Performance plot: CPU 1600, GPU 9100

Slide 12/14: CPU/GPU Programming Frameworks

Cg, OpenGL Shading Language, Direct X:
- Designed to write shaders
- Require the problem to be expressed graphically

AMD Brook:
- Pure stream computing
- Not hardware-specific

AMD CAL (Compute Abstraction Layer):
- Generic usage of the hardware at assembler level

NVIDIA CUDA (Compute Unified Device Architecture):
- Defines the hardware platform
- Generic programming
- Extension to the C language
- Explicit memory management
- Programming on the thread level

Intel Ct (C for throughput):
- Extension to the C language
- Intel CPU/GPU specific
- SIMD exploitation for automatic parallelism

OpenCL (Open Computing Language):
- Open standard for generic programming
- Extension to the C language
- Supposed to work on any hardware
- Specific hardware capabilities used via extensions

Slide 13/14: On-line = Off-line Reconstruction?

- Off-line and on-line reconstruction will and should be parallelized
- Both versions will be run on similar many-core systems or even on the same PC farm
- Both versions will (probably) use the same parallel language(s), such as OpenCL
- Can we use the same code, but with some physics cuts applied when running on-line, like L1 CA?
- If the final code is fast, can we think about a global on-line event reconstruction and selection?

Status table (only the STS row is filled in):

                       Intel SIMD  Intel MIMD  Intel Ct  NVIDIA CUDA  OpenCL
  STS                  +           +           +         +            -
  MuCh
  RICH
  TRD
  Your Reco
  Open Charm Analysis
  Your Analysis

Slide 14/14: Summary

Think parallel!
- Parallel programming is the key to the full potential of the Tera-scale platforms
- Data parallelism vs. parallelism of the algorithm
- Stream processing: no branches
- Avoid direct access to main memory: no maps, no look-up tables
- Use the SIMD unit; in the nearest future: many-cores, TF/s, ...
- Use single-precision floating point where possible
- In critical parts use double precision if necessary
- Keep the code portable across heterogeneous systems (Intel, AMD, Cell, GPGPU, ...)
- New parallel languages appear: OpenCL, Ct, CUDA
- A GPGPU is a personal supercomputer with 1 TFlops for 300 EUR!!!
- Should we start buying them for testing?

Candidate platforms: Gaming (STI Cell), GP CPU (Intel Larrabee), GP GPU (Nvidia Tesla), CPU (Intel XXX-cores), FPGA (Xilinx), CPU/GPU (AMD Fusion, OpenCL?)
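The "no branches" advice can be illustrated with the standard mask-and-blend trick: instead of an if/else, a comparison produces an all-ones or all-zeros mask and both candidate results are combined bitwise, so every SIMD lane executes the same instruction stream. A scalar sketch of the pattern (vector versions do the same per lane; the function names are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Branch-free select(mask, a, b): mask is all-ones to pick a, all-zeros to
// pick b, exactly the masks that SIMD compare instructions produce per lane.
float selectByMask(std::uint32_t mask, float a, float b) {
    std::uint32_t ia, ib;
    std::memcpy(&ia, &a, sizeof ia);  // reinterpret the float bits safely
    std::memcpy(&ib, &b, sizeof ib);
    std::uint32_t ir = (ia & mask) | (ib & ~mask);
    float r;
    std::memcpy(&r, &ir, sizeof r);
    return r;
}

// A branch-free "if (x > limit) x = limit" clamp built from the mask.
// In real SIMD code the comparison is an instruction, not a jump.
float clampAbove(float x, float limit) {
    std::uint32_t mask = (x > limit) ? 0xFFFFFFFFu : 0u;
    return selectByMask(mask, limit, x);
}
```

Because both branches are always computed, this pays off only when the branch bodies are cheap, which is exactly the stream-processing style the summary advocates.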