A High-End Reconfigurable Computation Platform for Particle Physics Experiments
Licentiate Thesis Presentation, ICT/ECS, KTH
Under the collaboration between KTH & JLU
by Ming Liu
Supervisors: Prof. Axel Jantsch (KTH), Dr. Zhonghai Lu (KTH), Prof. Wolfgang Kuehn (JLU, Germany)
Contributions

The thesis is mainly based on the following publications:

1. Ming Liu, Johannes Lang, Shuo Yang, Tiago Perez, Wolfgang Kuehn, Hao Xu, Dapeng Jin, Qiang Wang, Lu Li, Zhenan Liu, Zhonghai Lu, and Axel Jantsch, "ATCA-based Computation Platform for Data Acquisition and Triggering in Particle Physics Experiments", in Proc. of the International Conference on Field Programmable Logic and Applications (FPL'08), Sep. 2008. (System architecture)
2. Ming Liu, Wolfgang Kuehn, Zhonghai Lu, and Axel Jantsch, "System-on-an-FPGA Design for Real-time Particle Track Recognition and Reconstruction in Physics Experiments", in Proc. of the 11th EUROMICRO Conference on Digital System Design (DSD'08), Sep. 2008. (Algorithm implementation and evaluation)
3. Ming Liu, Wolfgang Kuehn, Zhonghai Lu, Axel Jantsch, Shuo Yang, Tiago Perez, and Zhenan Liu, "Hardware/Software Co-design of a General-Purpose Computation Platform in Particle Physics", in Proc. of the 2007 IEEE International Conference on Field Programmable Technology (ICFPT'07), Dec. 2007. (HW/SW co-design)
Overview

- Background in Physics Experiments
- Computation Platform for DAQ and Triggering
  - Network architecture
  - Compute Node (CN) architecture
- HW/SW Co-design of the System-on-an-FPGA
  - Partitioning strategy
  - HW design
  - SW design
- Algorithm Implementation and Evaluation
- Conclusion and Future Work
Background

- Nuclear & particle physics: a branch of physics that studies the constituents and interactions of atomic nuclei and particles.
- Some elementary particles do not occur under normal circumstances in nature; many can be created and detected during energetic collisions of others (Beam → Target, or Beam ↔ Beam).
- Produced particles are studied with huge and complex detector systems. Examples:
  - HADES at GSI, Germany
  - ATLAS, CMS, LHCb, ALICE at CERN, Switzerland & France
  - BES at IHEP, China
  - FZ-Juelich, Germany
  - ...
Detector Systems

HADES subdetectors:
- RICH (Ring Imaging CHerenkov)
- MDC (Mini Drift Chamber)
- TOF (Time-Of-Flight)
- TOFino (small TOF)
- Shower (electromagnetic shower detector)
- RPC (Resistive Plate Chamber, to be added as a substitute for TOFino)

(Figures: the HADES, BES III, WASA, and PANDA detector systems)
HADES Detector System
Challenge & Motivation
Challenge: high reaction rates and high data rates (for PANDA, event rates in the MHz range and data rates up to 200 GB/s!).
- It is not possible to store all the data, due to storage capacity limitations.
- Only a rare fraction (e.g. 1/10^6) is of interest for extensive offline analysis.
- The background can be discarded on the fly, using pattern recognition algorithms to identify interesting events (sketched below).

Motivation: a reconfigurable and scalable computation platform for high-data-rate processing.
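To make the on-the-fly selection concrete, here is a minimal C sketch of such a trigger loop. The event record, the predicate is_interesting, and the I/O callbacks are hypothetical placeholders, not structures from the thesis; the real selection logic would be one of the pattern recognition algorithms discussed later.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical raw event record; the real event format is
 * detector-specific and not given in the presentation. */
typedef struct {
    uint32_t id;
    size_t   size;            /* payload size in bytes */
    uint8_t  payload[4096];
} event_t;

/* Placeholder for a pattern recognition algorithm (ring finding,
 * track reconstruction, ...) that flags interesting events. */
static bool is_interesting(const event_t *ev)
{
    return ev->size > 0 && (ev->payload[0] & 0x01);   /* dummy test */
}

/* On-the-fly trigger loop: the rare interesting events (~1 in 10^6)
 * go to mass storage, the background is dropped immediately. */
void trigger_loop(event_t *(*next_event)(void),
                  void (*store)(const event_t *))
{
    event_t *ev;
    while ((ev = next_event()) != NULL) {
        if (is_interesting(ev))
            store(ev);        /* forward to mass storage */
        /* otherwise the event is discarded, freeing bandwidth */
    }
}
```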
Data Flow

- Pattern recognition algorithms
- Data correlation
- Largely reduced data rate for storage
Related Work

Previously, commercial bus systems such as VMEbus, FASTBUS, and CAMAC were used for DAQ and triggering:
- Time-multiplexing of the system bus degrades data exchange efficiency and cannot meet high-performance requirements.

Existing reconfigurable computers sound attractive, but do not suit physics experiment applications:
- Some are augmented computer clusters with FPGAs attached to the system bus as accelerators (bandwidth bottleneck between the microprocessor and the accelerator).
- Some are standalone boards (not straightforward to scale to a large system, due to the lack of efficient inter-board connectivity).
- Flexible and massive communication channels are required to interface with detectors and the PC farm.
- An all-board-switched or tree-like topology may incur communication penalties between algorithm steps (direct P2P links are preferred).
DAQ and Trigger Systems
- Detectors detect particles and generate signals
- Signals are digitized by ADCs
- Data are buffered in concentrators/buffers
- Pattern recognition algorithms extract features from events
- Interesting events are stored in mass storage; background is discarded on the fly
Network Topology

- Compute Nodes (CNs) interconnected for parallel/pipelined processing
- Hierarchical network topology:
  - External channels: optical links, Gigabit Ethernet
  - Internal interconnections: on-board I/O connections, inter-board backplane, inter-chassis switching
ATCA Backplane

- Advanced Telecommunications Computing Architecture (ATCA)
- Full-mesh, direct point-to-point (P2P) backplane
- High flexibility for correlating results from different algorithms
- High performance compared to shared buses
Compute Node

- Prototype board with 5 Xilinx Virtex-4 FX60 FPGAs: 4 FPGAs as algorithm processors, 1 FPGA as a switch
- 2 GB DDR2 per FPGA, IPMC, Flash, CPLD, ...
- Full-mesh on-board communication via GPIOs & RocketIOs
- RocketIO-based backplane channels
- External channels: optical links & Gigabit Ethernet
Compute Node PCB

- 14-layer PCB design
- Standard ATCA form factor, 280 x 322 mm
Performance Summary (1 ATCA chassis with 14 CNs)

- 1890 Gbps on-board connections
- 1456 Gbps inter-board backplane connections
- 728 Gbps full-duplex optical bandwidth
- 70 Gbps Ethernet
- 140 GB DDR2 SDRAM
- Computing resources: 70 Virtex-4 FX60 FPGAs (140 PowerPC 405 microprocessors + programmable fabric)
- Power consumption: max. 170 W per CN (ATCA slot budget: 200 W)
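For orientation, the chassis totals follow directly from the per-board figures on the Compute Node slide (the count of 5 GbE links per CN, one per FPGA, is inferred from the totals rather than stated explicitly):

\[
\begin{aligned}
\text{FPGAs:}\; & 14 \times 5 = 70,\\
\text{PowerPC 405:}\; & 70 \times 2 = 140,\\
\text{DDR2:}\; & 70 \times 2\,\text{GB} = 140\,\text{GB},\\
\text{GbE:}\; & 14 \times 5 \times 1\,\text{Gbps} = 70\,\text{Gbps}.
\end{aligned}
\]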
Partitioning Strategy
- Multiple tasks during experiment operation (data processing, control tasks, ...)
- Partitioned between the FPGA HW fabric and the embedded PowerPC CPUs
Concrete strategy:
- All pattern recognition algorithms customized in the FPGA fabric, as parallel/pipelined HW processing modules
- Slow control tasks (e.g. monitoring the system status, modifying experimental parameters, ...) implemented in SW (applications + OS)
- Soft TCP/IP stack in the Linux OS
HW Design

Old bus-based architecture (PLB & OPB):
- CPU & fast peripherals on the PLB
- Slow peripherals on the OPB
- Customized processing modules (e.g. TPU) on the PLB

Improved MPMC LocalLink-based architecture:
- Multi-Port Memory Controller (8 ports)
- Direct memory access from the device
- Customized processing units interfaced to the MPMC directly
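As an illustration of how the PowerPC interacts with a customized module on the bus, here is a C sketch of memory-mapped register access. The base address and the two-register control/status layout are invented for the example; the actual TPU register map is not part of the presentation.

```c
#include <stdint.h>

/* Hypothetical register map of a custom processing module attached
 * as a PLB slave; base address and offsets are illustrative only. */
#define TPU_BASE      0x80000000u  /* assumed PLB address window */
#define TPU_REG_CTRL  0x00u        /* control: bit 0 = start */
#define TPU_REG_STAT  0x04u        /* status:  bit 0 = done  */

static inline void reg_write(uint32_t offset, uint32_t value)
{
    *(volatile uint32_t *)(uintptr_t)(TPU_BASE + offset) = value;
}

static inline uint32_t reg_read(uint32_t offset)
{
    return *(volatile uint32_t *)(uintptr_t)(TPU_BASE + offset);
}

/* Start the module and busy-wait for completion. */
void tpu_run_once(void)
{
    reg_write(TPU_REG_CTRL, 1u);           /* kick off processing */
    while ((reg_read(TPU_REG_STAT) & 1u) == 0)
        ;                                  /* poll the done flag */
}
```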
SW Design

- Open-source embedded Linux on the embedded PowerPC CPUs
- Device drivers:
  - Standard devices (Ethernet, RS232, Flash memory, etc.)
  - Customized modules
- Applications for slow control:
  - High-level scripts
  - C/C++ programs
  - Apache webserver
  - Java programs on the JVM
- Software cost: low budget!
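To give a flavour of the "customized modules" driver work, here is a minimal 2.6-era character-device skeleton in C. The device name, address window, and status-register offset are assumptions for illustration; the presentation does not show the real driver sources.

```c
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/io.h>
#include <linux/uaccess.h>

#define DRV_NAME  "tpu"          /* hypothetical device name */
#define TPU_PHYS  0x80000000UL   /* assumed PLB address window */
#define TPU_SIZE  0x1000

static int major;
static void __iomem *regs;

/* Reading the device returns the (hypothetical) status register. */
static ssize_t tpu_read(struct file *f, char __user *buf,
                        size_t len, loff_t *off)
{
    u32 status = ioread32(regs + 0x4);
    if (len < sizeof(status))
        return -EINVAL;
    if (copy_to_user(buf, &status, sizeof(status)))
        return -EFAULT;
    return sizeof(status);
}

static const struct file_operations tpu_fops = {
    .owner = THIS_MODULE,
    .read  = tpu_read,
};

static int __init tpu_init(void)
{
    regs = ioremap(TPU_PHYS, TPU_SIZE);   /* map module registers */
    if (!regs)
        return -ENOMEM;
    major = register_chrdev(0, DRV_NAME, &tpu_fops);
    return (major < 0) ? major : 0;
}

static void __exit tpu_exit(void)
{
    unregister_chrdev(major, DRV_NAME);
    iounmap(regs);
}

module_init(tpu_init);
module_exit(tpu_exit);
MODULE_LICENSE("GPL");
```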
Remote Reconfigurability
- Remote reconfigurability (HW & SW) is desired due to spatial constraints in experiments.
- Both the OS and the FPGA bitstream are stored in the NOR flash memories.
- With the support of the MTD driver, the bitstream and the OS kernel can be overwritten/upgraded from within Linux. Reboot the system and the updated system will function.
- A backup mechanism guarantees the system stays alive if an upgrade fails.
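The MTD-based upgrade path can be sketched in user-space C roughly as follows. The partition path and layout are hypothetical, and a production updater would also verify the image and keep the backup copy intact.

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

/* Write an image (FPGA bitstream or kernel) into a NOR flash
 * partition via the Linux MTD character device. */
int flash_image(const char *mtd_dev, const void *img, size_t len)
{
    int fd = open(mtd_dev, O_RDWR);
    if (fd < 0)
        return -1;

    mtd_info_t info;
    if (ioctl(fd, MEMGETINFO, &info) < 0)
        goto fail;

    /* Erase whole blocks covering the image before writing. */
    erase_info_t erase;
    for (erase.start = 0; erase.start < len;
         erase.start += info.erasesize) {
        erase.length = info.erasesize;
        if (ioctl(fd, MEMERASE, &erase) < 0)
            goto fail;
    }

    if (write(fd, img, len) != (ssize_t)len)
        goto fail;

    close(fd);
    return 0;
fail:
    close(fd);
    return -1;
}

/* Usage (hypothetical partition layout):
 *   flash_image("/dev/mtd1", bitstream_buf, bitstream_len);
 * followed by a reboot to bring the updated system up. */
```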
Pattern Recognition in HADES
- New DAQ & trigger system for the HADES upgrade (up to 10 GB/s)
- Pattern recognition algorithms:
  - Cherenkov ring recognition (RICH)
  - MDC particle track reconstruction (MDCs)
  - Time-of-flight processing (TOF & RPC)
  - Electromagnetic shower recognition (Shower)
- Partitioned and distributed on FPGA nodes
- Algorithms correlated via hierarchical connections
(Figure: algorithm correlation, event building & storage)
Particle Track Reconstruction
- Particle tracks are bent in the magnetic field between the coils; before and after the field region they are approximately straight lines.
- The inner and outer track segments point to the RICH and TOF detectors respectively, helping them find their patterns (correlation).
- The principle is similar for inner and outer segments; only the inner part is discussed here.
- Due to its complexity, the HADES particle track reconstruction algorithm was previously implemented in SW. It is now implemented and investigated in HW as a case study.
Basic Principle

- Wires are fired by traversing particles
- Fired wires are projected onto a projection plane
- The overlap area is recognized and tracks from the target are reconstructed (illustrated below)
- 6 sectors, 2110 wires per sector (inner), 6 wire orientations
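The projection-and-overlap idea maps naturally onto an accumulator over a discretized projection plane. The following C sketch mirrors the principle; the LUT format and plane dimensions are illustrative, not the thesis parameters.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define N_WIRES 2110   /* inner wires per sector (from the slides) */
#define PLANE_W 256    /* illustrative plane granularity */
#define PLANE_H 128

/* Hypothetical per-wire projection LUT: the list of plane cells
 * covered by each wire's projected region, filled offline from the
 * detector geometry. */
typedef struct {
    uint16_t n_cells;
    uint16_t cell[64];        /* cell index = y * PLANE_W + x */
} wire_proj_t;

extern const wire_proj_t proj_lut[N_WIRES];

/* Accumulate: add 1 to every plane cell covered by each fired wire.
 * Track candidates appear as local maxima where many wires overlap. */
void accumulate(const uint16_t *fired, size_t n_fired,
                uint16_t plane[PLANE_W * PLANE_H])
{
    memset(plane, 0, PLANE_W * PLANE_H * sizeof(plane[0]));
    for (size_t i = 0; i < n_fired; i++) {
        const wire_proj_t *w = &proj_lut[fired[i]];
        for (uint16_t c = 0; c < w->n_cells; c++)
            plane[w->cell[c]]++;
    }
}
```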
Hardware Implementation
- PLB interface (slave)
- MPMC interface (master)
- Algorithm processor: Tracking Processing Unit (TPU)
Modular Design

- TPU performs the track reconstruction computation
  - Input: fired wire numbers
  - Output: positions of track candidates on the projection plane
- Sub-modules: wire-number write FIFO, projection LUT & address LUT, bus master, accumulation unit, peak finder (see the sketch below)
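Of the sub-modules, the peak finder is the easiest to express in software terms. Here is a plausible reading of its job as a C sketch; the threshold rule and 3x3 neighbourhood test are assumptions, since the slides do not specify the TPU's exact criterion.

```c
#include <stdint.h>
#include <stddef.h>

#define PLANE_W 256
#define PLANE_H 128

/* Scan the accumulated projection plane and report cells whose count
 * reaches a threshold and is a local maximum in its 3x3 neighbourhood. */
size_t find_peaks(const uint16_t plane[PLANE_W * PLANE_H],
                  uint16_t threshold,
                  uint32_t *peaks, size_t max_peaks)
{
    size_t n = 0;
    for (int y = 1; y < PLANE_H - 1 && n < max_peaks; y++) {
        for (int x = 1; x < PLANE_W - 1 && n < max_peaks; x++) {
            uint16_t v = plane[y * PLANE_W + x];
            if (v < threshold)
                continue;
            int is_peak = 1;
            for (int dy = -1; dy <= 1 && is_peak; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    if ((dy || dx) &&
                        plane[(y + dy) * PLANE_W + (x + dx)] > v) {
                        is_peak = 0;
                        break;
                    }
            if (is_peak)
                peaks[n++] = (uint32_t)(y * PLANE_W + x);
        }
    }
    return n;   /* number of track candidates found */
}
```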
Implementation Results
- Resource utilization on the Virtex-4 FX60: acceptable
- Timing: up to 125 MHz without optimization effort
- Clock frequency fixed at 100 MHz, to match the PLB speed
Performance Evaluation
Experimental setup:
- The MPMC-based structure was used for measurements
- Measurement points at different wire multiplicities (10, 30, 50, 200, 400 fired wires out of 2110)
- Projection LUT: 5.7 Kbit per wire on average (1.5 MB for 2110 wires)
- A C program running on a 2.4 GHz Xeon computer served as the reference

Results:
- Speedups of 10.8 to 24.3 times per module compared to the software solution
- Given the FPGA resource consumption, multiple TPU modules may be integrated in the system for parallel processing and higher performance
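As a consistency check of the quoted LUT figure, using decimal units (1 MB = 10^6 bytes, 1 Kbit = 1000 bits):

\[
\frac{1.5\times10^{6}\,\text{B}\times 8\,\tfrac{\text{bit}}{\text{B}}}{2110\ \text{wires}} \approx 5687\ \tfrac{\text{bit}}{\text{wire}} \approx 5.7\ \tfrac{\text{Kbit}}{\text{wire}}.
\]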
Performance Analysis

Non-TPU factors introduce overhead and restrict performance:
- Complex DDR2 addressing mechanism (large latency)
- Data transfer bursts of only 8 beats (clock cycles wasted)
- MPMC arbitration of memory accesses among multiple ports (clock cycles wasted)
- ...

The TPU module itself is powerful, but its memory accesses (LUT lookups) are slow.

Solution: add SRAM to enhance memory bandwidth and reduce access latency; speedups of 20 to 50 times per module over software are expected.
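The burst-length point can be quantified with a simple model: if every 8-beat burst pays a fixed overhead of c idle cycles (arbitration plus DDR2 row activation; c is illustrative, not a value measured in the thesis), the usable fraction of memory cycles is

\[
\eta = \frac{8}{8+c}, \qquad \text{e.g. } c = 24 \;\Rightarrow\; \eta = 0.25,
\]

so longer bursts, or a low-latency SRAM for the LUTs, raise the usable fraction directly.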
Conclusion

- An FPGA- and ATCA-based computation platform is being constructed for the DAQ and trigger systems of modern nuclear and particle physics experiments.
- The platform features high performance, scalability, reconfigurability, and the potential to be used for different application projects (physics & non-physics).
- A co-design approach is proposed for developing applications on the platform:
  - HW: system design + customized processing modules
  - SW: Linux OS + device drivers + application programs
- As a case study, the particle track reconstruction algorithm has been implemented and evaluated on the system. A speedup of one order of magnitude per module has been observed compared to the software solution (multiple modules can be integrated for parallel processing).
Future Work

- Investigate network communication with multiple CN PCBs
- Implement and parallelize all pattern recognition algorithms
- Study more efficient memory addressing mechanisms for multiple modules
- Explore more advanced features, e.g. dynamic partial reconfiguration for adaptive computing
Thank you!