Accelerating particle identification for high-speed data-filtering using OpenCL on FPGAs and other architectures for FPL 2016 Srikanth Sridharan CERN 8/31/2016.

Presentation transcript:

Accelerating particle identification for high-speed data-filtering using OpenCL on FPGAs and other architectures for FPL 2016 Srikanth Sridharan CERN 8/31/2016

Srikanth Sridharan – ICE-DIP Project

Background 1/2
› Upgrades planned for the LHCb experiment at CERN
  - Capture 40 million particle collisions per second at LHCb
  - Or a new collision every 25 ns
  - Generates [value missing in transcript]
  - Need to capture 100% of the data
  - Currently capturing only 2.5%
* Projections from High Throughput Computing Project Status Report, 22/6/16, by Niko Neufeld
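The rates quoted on this slide can be checked with a few lines of arithmetic. The sketch below only restates the slide's numbers (40 million collisions per second, 2.5% captured); the function names are illustrative and not taken from the talk.

```python
# Figures quoted on the slide.
COLLISION_RATE_HZ = 40e6      # 40 million collisions per second
CAPTURED_FRACTION = 0.025     # 2.5% currently captured


def collision_spacing_ns(rate_hz):
    """Time between consecutive collisions, in nanoseconds."""
    return 1e9 / rate_hz


def captured_rate_hz(rate_hz, fraction):
    """Collision rate that is actually read out, in Hz."""
    return rate_hz * fraction


if __name__ == "__main__":
    print("spacing (ns):", collision_spacing_ns(COLLISION_RATE_HZ))
    print("captured rate (Hz):", captured_rate_hz(COLLISION_RATE_HZ, CAPTURED_FRACTION))
```

A 40 MHz collision rate corresponds to the 25 ns spacing on the slide, and capturing 2.5% of it amounts to roughly one million collisions per second.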

Background 2/2
› Need to process all data in a large computing farm to reconstruct events
› Impossible with just CPUs
› Hence the need for accelerators
› FPGAs are attractive due to performance and power efficiency
› OpenCL makes FPGAs more accessible

Ring Imaging Cherenkov Detector (RICH) concept
- One of the sub-detectors of LHCb
- Charged particles travelling faster than the speed of light in the medium produce a cone of Cherenkov photons
- Like the light equivalent of a sonic boom
- Photons are reflected on spherical mirrors and captured on photodetectors
- Ring size ∝ Cherenkov angle
* Detector images from "RICH pattern recognition", LHCb Note, by Forty, R. and O. Schneider
  Sonic boom image from content/uploads/2013/08/sonic-boom.jpg

RICH - Ring Imaging Cherenkov Detector 2/2
- Calculate the spherical reflection point
- Calculate Cherenkov angles
- cos(Cherenkov angle) ∝ (particle velocity)^-1
- Momentum, and in turn mass, inferred from trajectory
- Particle type inferred from velocity and mass
* Images from "RICH pattern recognition", LHCb Note, by Forty, R. and O. Schneider
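The chain from Cherenkov angle to particle type can be sketched with the standard relations cos(θc) = 1/(nβ) and m = p·sqrt(1/β² − 1), in units where c = 1. This is a minimal illustration, not the LHCb reconstruction code. The refractive index, momentum and mass used in the example are assumed values chosen only for the demonstration.

```python
import math


def cherenkov_angle(p, m, n):
    """Cherenkov angle (rad) for momentum p and mass m (c = 1) in a medium of index n."""
    beta = p / math.hypot(p, m)          # beta = p / E
    cos_theta = 1.0 / (n * beta)
    if cos_theta > 1.0:
        raise ValueError("particle is below the Cherenkov threshold")
    return math.acos(cos_theta)


def beta_from_angle(theta_c, n):
    """Invert cos(theta_c) = 1 / (n * beta) to get the velocity beta."""
    return 1.0 / (n * math.cos(theta_c))


def mass_from_momentum_and_beta(p, beta):
    """Mass from p = gamma * m * beta, i.e. m = p * sqrt(1 / beta^2 - 1)."""
    return p * math.sqrt(1.0 / beta**2 - 1.0)


if __name__ == "__main__":
    n = 1.0014            # assumed refractive index of a gas radiator
    p = 10.0              # assumed momentum in GeV/c
    m_true = 0.13957      # charged pion mass in GeV/c^2, used as test input
    theta = cherenkov_angle(p, m_true, n)
    m_rec = mass_from_momentum_and_beta(p, beta_from_angle(theta, n))
    print("angle (mrad):", theta * 1e3, "reconstructed mass (GeV/c^2):", m_rec)
```

Given a momentum from tracking and an angle from the ring, the reconstructed mass can then be compared against the known particle masses to choose the most likely type.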

RICH Particle Identification Algorithm
- 50% of the overall event reconstruction process
- Cherenkov angle reconstruction: 20% of RICH
  - Like inverse ray tracing in 3D space
  - Spherical mirrors make it more complicated
  - Solve systems of quadratic equations, cubic roots, matrix rotations, cross products, Eigen functions, etc. in floating point
- Parallelizable for n particles
- Pipelinable

OpenCL implementation
› Start from an existing single-threaded C++ implementation as reference
› Create an OpenCL kernel from scratch
  - Remove framework dependencies (e.g. Gaudi, ROOT)
  - Recreate library function equivalents
  - Datatype modifications (e.g. Gaudi::XYZPoint, Eigen3Vector, etc.)
  - Implement, parallelize and optimize the algorithm

Results of OpenCL implementation
Reference: Intel Core i7-4770

Performance per Watt

Metric | Stratix V FPGA | Nallatech 385 card with 1x Stratix V FPGA | Nvidia GeForce GTX 690 card with 2x GPU | 2x Intel Xeon E CPU
Avg. runtime for 10 million photons (μs) | (missing) | (missing) | (missing) | (missing)
No. of photons per microsecond (μs^-1) | (missing) | (missing) | (missing) | (missing)
Power consumption (W) | (missing) (Profiled) | 25 (Typ. Max) | 300 (Max) | 190 (TDP)
Estimated no. of photons per microsecond per Watt (μs^-1 W^-1) | (missing) | (missing) | (missing) | (missing)
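The two derived rows of this table follow directly from the measured runtime and the power figure. The sketch below shows the calculation. The runtime in the example is a made-up placeholder, since the measured values are not present in the transcript, and only the 25 W power figure comes from the slide.

```python
def photons_per_us(n_photons, runtime_us):
    """Throughput: photons processed per microsecond."""
    return n_photons / runtime_us


def photons_per_us_per_watt(n_photons, runtime_us, power_w):
    """Throughput normalized by power consumption."""
    return photons_per_us(n_photons, runtime_us) / power_w


if __name__ == "__main__":
    n_photons = 10_000_000        # batch size used on the slide
    runtime_us = 1_000_000.0      # HYPOTHETICAL runtime, not a measured value
    power_w = 25.0                # typical maximum quoted for the Nallatech 385 card
    print(photons_per_us(n_photons, runtime_us))
    print(photons_per_us_per_watt(n_photons, runtime_us, power_w))
```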

But… when factoring in data transfer
Reference: Intel Core i7-4770

AVENUES FOR IMPROVING OVERALL RUNTIME ON FPGA
› 1. Overlapping computation and data transfer
  - Default case: transfer time and kernel runtime for a single blocking command queue
  - Opt 1: 10 command queues with 1/10th of the data each, non-blocking with event wait list
  - Opt 2: Opt 1 + flush after kernel
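Why splitting the data across several queues can help, and why it can also hurt, can be seen in a toy timing model. This is not the OpenCL host code from the talk. It is an idealized three-stage pipeline (write, kernel, read) in which each stage runs on its own resource and every transfer pays a fixed latency. All timings in the example are hypothetical and in arbitrary units.

```python
def serial_time(t_in, t_kernel, t_out, latency):
    """Single blocking queue: write, run and read strictly one after another."""
    return (t_in + latency) + t_kernel + (t_out + latency)


def overlapped_time(t_in, t_kernel, t_out, latency, n_chunks):
    """Idealized pipeline over n_chunks equal slices of the data.

    Each chunk passes through write -> kernel -> read, and the three stages
    of different chunks are assumed to overlap perfectly.
    """
    stages = [
        t_in / n_chunks + latency,    # host-to-device transfer of one chunk
        t_kernel / n_chunks,          # kernel run on one chunk
        t_out / n_chunks + latency,   # device-to-host transfer of one chunk
    ]
    # First chunk fills the pipeline, the rest are paced by the slowest stage.
    return sum(stages) + (n_chunks - 1) * max(stages)


if __name__ == "__main__":
    # HYPOTHETICAL timings in arbitrary units.
    for latency in (0.0, 0.1, 1.0):
        base = serial_time(3.0, 3.0, 3.0, latency)
        split = overlapped_time(3.0, 3.0, 3.0, latency, n_chunks=10)
        print(f"latency={latency}: serial={base:.2f} overlapped={split:.2f}")
```

In this model the gain from overlap shrinks as the per-transfer latency grows, because ten chunks pay that latency ten times. With a large enough latency the split version becomes slower than the single blocking queue.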

Improvement in numbers
› Comparison of speedup for overlapping and non-overlapping transfers
› Small improvement, but 10x transfers add 10x latency

Type of commands in command queue | No. of photons | Global work size | Speedup
Blocking (Default case) | 1e | (missing) | (missing)
Nonblocking, no flush (Opt 1) | 1e7 | 10x 1e6 | 6.02
Nonblocking, flush after kernel (Opt 2) | 1e7 | 10x 1e6 | 5.99
Blocking (Default case) | 1e6 | (missing) | 5.43

2. Increasing the compute units on the FPGA / moving to a different FPGA
› FPGA resource usage
  - DSPs: the only resource heavily used
  - FPGA on Nallatech card: Stratix V GX variant with only 256 DSP blocks
  - Stratix V GS variants have [count missing in transcript] DSP blocks
  - Potential to 2X throughput by increasing compute units
› Move to next-gen FPGAs
  - Hard floating-point units
  - Potentially could match GPU in raw performance

[Table: % of resource used by the OpenCL design (columns: Kernel, Total) for LEs, FFs, RAMs, DSPs and average utilization; the values are missing in the transcript]
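Since DSP blocks are the limiting resource, the number of kernel compute units that fit on a device can be estimated by integer division. The sketch below assumes a per-unit DSP cost and a larger device size purely for illustration. Only the 256-block figure comes from the slide.

```python
def max_compute_units(dsp_available, dsp_per_unit):
    """Upper bound on compute units when DSP blocks are the only constraint."""
    if dsp_per_unit <= 0:
        raise ValueError("dsp_per_unit must be positive")
    return dsp_available // dsp_per_unit


if __name__ == "__main__":
    dsp_per_unit = 128                              # HYPOTHETICAL cost of one compute unit
    print(max_compute_units(256, dsp_per_unit))     # Stratix V GX on the Nallatech card
    print(max_compute_units(512, dsp_per_unit))     # HYPOTHETICAL device with twice the DSPs
```

Under these assumptions, doubling the DSP budget doubles the number of compute units, which is the source of the potential 2X throughput mentioned above. Other resources and memory bandwidth would also have to keep up.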

Moving to a faster memory / low latency coherent memory access
› 3. Faster memory
  - GeForce GTX 690 card: GDDR5 with bandwidth of 384 GB/s
  - Nallatech FPGA card: DDR3 with bandwidth of 25.6 GB/s
› 4. Low latency coherent memory access
  - Intel QPI, IBM CAPI
  - ~100x lower latency than PCIe
  - View of main memory
  - Already see significant improvements
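The gap between the two on-card memories can be put in terms of the time needed to stream a buffer once at peak bandwidth. The sketch below uses the two bandwidth figures from the slide. The buffer size is an arbitrary example, and real kernels rarely reach peak bandwidth.

```python
def stream_time_ms(n_bytes, bandwidth_gb_per_s):
    """Time to read n_bytes once at the given peak bandwidth, in milliseconds."""
    return n_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    buffer_bytes = 1e9                                 # ARBITRARY 1 GB example buffer
    t_gddr5 = stream_time_ms(buffer_bytes, 384.0)      # GeForce GTX 690, GDDR5
    t_ddr3 = stream_time_ms(buffer_bytes, 25.6)        # Nallatech FPGA card, DDR3
    print(f"GDDR5: {t_gddr5:.2f} ms, DDR3: {t_ddr3:.2f} ms, ratio: {t_ddr3 / t_gddr5:.1f}x")
```

At peak rates the DDR3 on the FPGA card takes about 15 times longer than the GDDR5 on the GPU card to move the same amount of data.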

5. Relaxing the host-centric OpenCL model
› Host CPU at the center of the OpenCL model
  - Manages data transfers, kernel execution…
  - Device cannot start execution unless the host issues commands
  - Adds control latency on top of data latency
  - Unavoidable with opaque interfaces like PCIe
› Can be replaced under SVM (shared virtual memory) context
  - Device can schedule reads and writes by itself
  - Command queues and event wait lists exist, but are managed by the host
  - Potential for device-based command queues
  - More autonomy for devices

Conclusion
› GPU has 1.8X better kernel performance than FPGA (normalized)
› FPGA has 1.4X better overall performance than GPU (normalized)
› FPGA has an order of magnitude better power efficiency and performance/Watt
› OpenCL: a game changer for FPGAs
  - Accelerated development time
  - ~2 weeks for OpenCL vs ~2 months for RTL

Questions?