Presentation transcript:

The challenge of adapting HEP physics software applications to run on many-core CPUs
CHEP, March '09 -- Vincenzo Innocente, CERN
High Performance Computing for High Energy Physics

Computing in the years zero
– Moore's law (plot): the transistors once used to increase raw power now increase global power.

The 'three walls'
The perceived exponential growth of "effective" computing power, driven by Moore's law, faded away on hitting three "walls":
– The memory wall
– The power wall
– The instruction-level parallelism (micro-architecture) wall
A turning point was reached and a new paradigm emerged: multicore.

The 'memory wall'
– Processor clock rates have been increasing faster than memory clock rates.
– Larger and faster "on chip" cache memories help alleviate the problem but do not solve it.
– Latency in memory access is often the major performance issue in modern software applications.

The 'power wall'
– Processors consume more and more power the faster they go, and not linearly:
  » a 73% increase in power buys just a 13% improvement in performance
  » (downclocking a processor by about 13% gives roughly half the power consumption)
– Many computing centres are today limited by the total electrical power installed and the corresponding cooling/extraction capacity.
– How else to increase the number of instructions per unit time? Go parallel!

The 'architecture walls'
– Longer and fatter parallel instruction pipelines were a main architectural trend of the '90s.
– Hardware branch prediction, hardware speculative execution, instruction re-ordering (a.k.a. out-of-order execution) and just-in-time compilation are some notable techniques used to boost ILP.
– In practice, inter-instruction data dependencies and run-time branching limit the amount of achievable ILP.

Bringing IA Programmability and Parallelism to High Performance & Throughput Computing (Intel slide, © 2008 Intel Corporation)
– Highly parallel, IA-programmable architecture in development
– Ease of scaling for the software ecosystem
– Array of enhanced IA cores
– New cache architecture
– New vector processing unit
– Scalable to TFLOPS performance
(Diagram: an array of IA++ cores with cache and special function & I/O blocks. "Future options subject to change without notice.")

The Challenge of Parallelization
Exploit all 7 "parallel" dimensions of modern computing architecture for HPC:
– Inside a core
  1. Superscalar: fill the ports (maximize instructions per cycle)
  2. Pipelined: fill the stages (avoid stalls)
  3. SIMD (vector): fill the register width (exploit SSE) -- see the sketch below
– Inside a box
  4. HW threads: fill up a core (share core & caches)
  5. Processor cores: fill up a processor (share low-level resources)
  6. Sockets: fill up a box (share high-level resources)
– LAN & WAN
  7. Optimize scheduling and resource sharing on the Grid
HEP has traditionally been good (only) in the latter.
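The SIMD dimension (item 3) is the one HEP code exploits least. Below is a minimal, illustrative C++ sketch (not from the slides; function and variable names are invented) of what "filling the register width" means with SSE intrinsics: four single-precision operations per instruction instead of one.

#include <xmmintrin.h>   // SSE intrinsics (x86)
#include <cstddef>

// Scalar version: one float multiply-add per iteration.
void scale_add_scalar(const float* x, const float* y, float* out,
                      std::size_t n, float a) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a * x[i] + y[i];
}

// SSE version: four floats per iteration. For brevity the sketch assumes
// n is a multiple of 4 and the pointers are 16-byte aligned.
void scale_add_sse(const float* x, const float* y, float* out,
                   std::size_t n, float a) {
    const __m128 va = _mm_set1_ps(a);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(x + i);
        __m128 vy = _mm_load_ps(y + i);
        _mm_store_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
}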

Where are we?
– HEP code does not exploit the power of current processors:
  » one instruction per cycle at best
  » little or no use of SIMD
  » poor code locality
  » abuse of the heap
– Running N jobs on N=8 cores is still efficient, but:
  » memory (and to a lesser extent CPU cycles) is wasted by not sharing "static" condition and geometry data, I/O buffers, network and disk resources
  » caches (memory on the CPU chip) are wasted and thrashed: no locality of code and data
– This situation is already bad today and will only get worse on future architectures (whether multi-thread or multi-process).

HEP software on multicore: an R&D effort
– Collaboration among experiments, IT departments, and projects such as OpenLab, Geant4, ROOT, and the Grid
– Target multi-core (8-24 cores/box) in the short term, many-core (96+ cores/box) in the near future
– Optimize use of the CPU/memory architecture
– Exploit modern OS and compiler features
  » copy-on-write
  » MPI, OpenMP
  » SSE/AltiVec, OpenCL
– Prototype solutions
  » adapt legacy software
  » look for innovative solutions for the future

Code optimization
– There are ample opportunities for improving code performance:
  » measure and analyze the performance of current LHC physics application software on multi-core architectures
  » improve data and code locality (avoid thrashing the caches) -- see the layout sketch below
  » make effective use of vector instructions (improve ILP)
  » exploit modern compilers' features (they do the work for you!)
– See Paolo Calafiura's talk.
– All this is absolutely necessary, yet still not sufficient.
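As an illustration of the data-locality point, here is a hedged sketch (names are invented, not taken from any LHC framework) of the classic array-of-structs to struct-of-arrays transformation, which keeps the data touched by a loop contiguous in the cache and lets the compiler auto-vectorize it.

#include <vector>
#include <cstddef>

// Array-of-structs: the pt values are strided, so a loop over pt drags the
// unused fields (eta, phi, chi2) through the cache as well.
struct TrackAoS { double pt, eta, phi, chi2; };

double sumPtAoS(const std::vector<TrackAoS>& tracks) {
    double sum = 0.0;
    for (const TrackAoS& t : tracks) sum += t.pt;
    return sum;
}

// Struct-of-arrays: the pt values are contiguous, cache lines are fully
// used and the loop is trivially auto-vectorizable.
struct TracksSoA { std::vector<double> pt, eta, phi, chi2; };

double sumPtSoA(const TracksSoA& tracks) {
    double sum = 0.0;
    for (std::size_t i = 0; i < tracks.pt.size(); ++i) sum += tracks.pt[i];
    return sum;
}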

Opportunity: the reconstruction memory footprint shows large condition data.
How can common data be shared between different processes?
– Multi-process vs multi-threaded
– Read-only: copy-on-write, shared libraries
– Read-write: shared memory or files (a shared-memory sketch follows below)
– Event parallelism
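For the read-write case, a minimal sketch of POSIX shared memory between a parent and a forked worker is shown below. The segment name, its size and the setup are illustrative assumptions, not the experiments' actual code (on Linux, older glibc needs -lrt to link shm_open).

#include <fcntl.h>      // O_CREAT, O_RDWR
#include <sys/mman.h>   // shm_open, mmap
#include <unistd.h>     // ftruncate, fork
#include <cstddef>
#include <cstring>

int main() {
    const char* name = "/hep_conditions";        // hypothetical segment name
    const std::size_t size = 64 * 1024 * 1024;   // e.g. 64 MB of condition data

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return 1;
    if (ftruncate(fd, size) != 0) return 1;

    void* shared = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (shared == MAP_FAILED) return 1;

    std::memset(shared, 0, size);   // parent fills the segment once

    if (fork() == 0) {
        // Child: maps the same physical pages, so reads and writes are
        // genuinely shared (unlike the private copy-on-write pages it gets
        // for ordinary heap memory inherited from the parent).
        static_cast<char*>(shared)[0] = 1;
        return 0;
    }
    // A real job would wait for the workers and shm_unlink(name) at the end.
    return 0;
}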

Experience and requirements
– Complex and dispersed "legacy" software:
  » difficult to manage/share/tune resources (memory, I/O) without support from the OS and compiler
  » coding and maintaining thread-safe software at user level is difficult
  » automatic tools are needed to identify code to be made thread-aware
    (Geant4: 10K lines modified! Not enough -- many hidden (optimization) details such as state caches)
– Multi-process seems more promising:
  » ATLAS: fork() (exploits copy-on-write), shmem (needs library support)
  » LHCb: python
  » PROOF-lite
– Other limitations are at the door (I/O, communication, memory):
  » PROOF: client-server communication overhead within a single box
  » PROOF-lite: I/O bound beyond 2 processes per disk
  » Online (ATLAS, CMS): limit on in/out-bound connections to one box

Exploit copy-on-write
– Modern OSes share read-only pages among processes dynamically:
  » a memory page is copied and made private to a process only when modified
– Prototyped in ATLAS and LHCb (a fork() sketch follows below):
  » encouraging results as far as memory sharing is concerned (50% shared)
  » concerns about I/O (need to merge the output from multiple processes)
Memory (ATLAS): one process uses 700 MB VMem and 420 MB RSS; with COW:
(before) evt 0:  private: 004 MB | shared: 310 MB
(before) evt 1:  private: 235 MB | shared: 265 MB
...
(before) evt 50: private: 250 MB | shared: 263 MB
See Sebastien Binet's talk.
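A minimal sketch of the fork()-based pattern (illustrative sizes and names, not the ATLAS or LHCb prototype itself): the parent builds the large read-only condition data once, then forks workers that keep sharing those pages through copy-on-write for as long as they only read them.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
    // Large "static" condition/geometry data, built once in the parent
    // (~400 MB here, an arbitrary illustrative size).
    std::vector<double> conditions(50000000, 1.0);

    const int nWorkers = 8;
    for (int w = 0; w < nWorkers; ++w) {
        pid_t pid = fork();
        if (pid == 0) {
            // Child: 'conditions' is not copied; its pages stay shared with
            // the parent for as long as the worker only reads them. Each
            // worker would process its own slice of events (loop elided).
            double checksum = 0.0;
            for (double c : conditions) checksum += c;
            std::printf("worker %d done (checksum %.0f)\n", w, checksum);
            _exit(0);
        }
    }
    while (wait(nullptr) > 0) {}   // parent reaps the workers
    return 0;
}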

Exploit "Kernel Shared Memory"
– KSM is a Linux driver that allows identical memory pages to be shared dynamically between one or more processes:
  » it was developed as a backend of KVM to help share memory between virtual machines running on the same host
  » KSM scans only memory that has been registered with it; essentially, each memory allocation that is a candidate for sharing needs to be followed by a call to a registration function (see the sketch below)
– Tests were performed by "retrofitting" TCMalloc with KSM:
  » just one single line of code added!
– CMS reconstruction of real data (cosmics with the full detector):
  » no code change
  » 400 MB private data; 250 MB shared data; 130 MB shared code
– ATLAS:
  » no code change
  » in a reconstruction job of 1.6 GB VM, up to 1 GB can be shared with KSM
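A hedged sketch of what "registering memory with KSM" amounts to on Linux: advising an anonymous mapping as MADV_MERGEABLE so the KSM daemon may merge identical pages across processes. The allocator wrapper is illustrative; the actual test patched TCMalloc.

#include <sys/mman.h>   // mmap, madvise, MADV_MERGEABLE (Linux-specific)
#include <cstddef>

// Allocate an anonymous region and register it with KSM. The KSM daemon
// (CONFIG_KSM, /sys/kernel/mm/ksm/run set to 1) then scans it and merges
// pages that are byte-identical across processes into a single
// copy-on-write page.
void* allocate_mergeable(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, bytes, MADV_MERGEABLE);   // the one extra call per allocation
    return p;
}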

Handling event input/output
(Diagram: a sub-process with its own transient event store runs the algorithms and pulls input from a work queue; processed events are serialized, passed through an output queue, deserialized into the parent process' transient event store and written by its OutputStream. A sketch of this output path follows below.)
See talk by Eoin Smith.
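A possible minimal sketch of the output path in this scheme (the framing, buffer sizes and the serialize() stand-in are assumptions, not the framework's actual protocol): a forked worker length-prefixes each serialized event and ships it over a pipe, and the parent merges everything into a single output stream.

#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-in for the real event serializer: returns the event as a byte buffer.
std::vector<char> serialize(int eventId) {
    return std::vector<char>(128, static_cast<char>(eventId));
}

int main() {
    int fd[2];
    if (pipe(fd) != 0) return 1;

    if (fork() == 0) {                  // worker: process and ship events
        close(fd[0]);
        for (int evt = 0; evt < 3; ++evt) {
            std::vector<char> buf = serialize(evt);
            std::uint32_t len = buf.size();
            write(fd[1], &len, sizeof(len));        // length-prefixed frame
            write(fd[1], buf.data(), buf.size());
        }
        close(fd[1]);
        _exit(0);
    }

    close(fd[1]);                       // parent: merge into one output stream
    std::uint32_t len = 0;
    while (read(fd[0], &len, sizeof(len)) == static_cast<ssize_t>(sizeof(len))) {
        std::vector<char> buf(len);
        read(fd[0], buf.data(), len);
        std::printf("parent wrote an event of %u bytes to the output\n", len);
    }
    close(fd[0]);
    return 0;
}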

PROOF Lite (G. Ganis, Parallelization & MultiCore Workshop)
– PROOF Lite is a realization of PROOF in 2 tiers:
  » the client starts and controls the workers directly
  » communication goes via UNIX sockets
– No need for daemons:
  » workers are started via a call to 'system' and call back the client to establish the connection
– Starts as many workers as there are CPUs by default.
See Fons Rademakers' talk.

Scaling when processing a tree, example (G. Ganis, Parallelization & MultiCore Workshop)
– Data sets: 2 GB (fits in memory) and 22 GB
– 22 GB case: I/O bound; 2 GB case (no memory refresh): CPU bound
– Machine: 4 cores, 8 GB RAM, single HDD

Solid-state disk tests at BNL (courtesy of S. Panitkin)
Model: Mtron MSP-SATA
– Capacity: 64 GB
– Average access time: ~0.1 ms (typical HDD: ~10 ms)
– Sustained read: ~120 MB/s
– Sustained write: ~80 MB/s
– IOPS (sequential/random): 81,000 / 18,000
– Write endurance: >140 years at 50 GB written per day
– MTBF: 1,000,000 hours
– 7-bit error correction code

SSD vs HDD on an 8-node cluster (courtesy of S. Panitkin, BNL)
– Aggregate (8-node farm) analysis rate as a function of the number of workers per node
– Almost linear scaling with the number of nodes
– Solid-state disk: 120 GB for 400 euro

Progress toward a thread-parallel Geant4 (Gene Cooperman and Xin Dong, NEU Boston)
– Master/worker paradigm
– Event-level parallelism: separate events on different threads
  » only one copy in RAM: increase sharing of memory between threads
– Phase I: code sharing, but no data sharing -- done
– Phase II: sharing of geometry, materials, particles, production cuts -- done, undergoing validation
– Phase III: sharing of data for EM physics processes -- in progress
  » physics tables are read-only, but small caches and a different API
– Phase IV: other physics processes -- to do
– Phase V: general software schema: new releases of sequential Geant4 drive corresponding multi-threaded releases -- in progress
  » patch parser.c of gcc to identify static and global declarations in G4
  » currently 10,000 lines of G4 modified automatically, plus additional lines by hand
(An illustrative event-level threading sketch follows below.)
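For illustration only, a generic C++ sketch of the event-level master/worker pattern described above: one read-only geometry object in RAM, worker threads pulling event numbers from a shared atomic counter. This is not the Geant4 multi-threading code or API, just the pattern (build with -pthread).

#include <atomic>
#include <thread>
#include <vector>

struct Geometry { /* large, read-only after construction */ };

void processEvent(const Geometry& geo, int eventId) {
    (void)geo; (void)eventId;   // one event would be simulated here
}

int main() {
    Geometry geometry;                   // built once, shared by all threads
    std::atomic<int> nextEvent{0};
    const int nEvents = 1000;
    unsigned nThreads = std::thread::hardware_concurrency();
    if (nThreads == 0) nThreads = 4;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t)
        workers.emplace_back([&] {
            // Each worker claims the next free event number; per-event state
            // lives on the worker's stack, the geometry is only read.
            for (int evt; (evt = nextEvent.fetch_add(1)) < nEvents; )
                processEvent(geometry, evt);
        });
    for (std::thread& w : workers) w.join();
    return 0;
}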

Algorithm parallelization
– The final performance gain will come from parallelizing the algorithms used in current LHC physics application software:
  » prototypes using POSIX threads, OpenMP and parallel gcclib
  » work has started to provide basic thread-safe/multi-threaded library components (see the RNG sketch below):
    random number generators; parallelized minimization/fitting algorithms; parallelized/vectorized linear algebra
– Positive and interesting experience with MINUIT.
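As an example of such a basic thread-safe component, here is a minimal sketch of a per-thread random number engine; the seeding scheme is an illustrative choice, not any particular library's design.

#include <functional>
#include <random>
#include <thread>

// Each thread gets its own engine, so concurrent callers never share (or
// have to lock) generator state.
double uniformRandom() {
    thread_local std::mt19937_64 engine(
        std::random_device{}() ^
        std::hash<std::thread::id>{}(std::this_thread::get_id()));
    thread_local std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(engine);
}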

Parallel MINUIT (see poster by Alfio Lazzaro, Lorenzo Moneta)
– MINUIT2 is a C++ OO version of the original MINUIT (F. James, "Minuit, Function Minimization and Error Analysis", CERN long write-up D506, 1970):
  » the most used algorithm in the HEP community for maximum-likelihood and χ² fits
  » based on the gradient method for the minimization (MIGRAD), with other algorithms implemented (SIMPLEX, SCAN)
  » integrated into ROOT
– Two strategies for the parallelization of the minimization and NLL calculation (a sketch of the first follows below):
  1. Minimization or NLL calculation on the same multi-core node (OpenMP)
  2. Minimization process on different nodes (MPI) and NLL calculation on each multi-core node (OpenMP): a hybrid solution
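A hedged sketch of strategy 1: since the NLL is a sum over independent events, the event loop can be split across the cores of one node with an OpenMP reduction. The Gaussian pdf() and the flat data layout are illustrative stand-ins for the user's model, not the MINUIT2 code; compile with -fopenmp (without it the code still runs sequentially).

#include <cmath>
#include <vector>

// Illustrative model: a unit-width Gaussian whose mean is params[0].
double pdf(double x, const std::vector<double>& params) {
    const double d = x - params[0];
    const double invSqrt2Pi = 0.3989422804014327;
    return invSqrt2Pi * std::exp(-0.5 * d * d);
}

// NLL = -sum_i log pdf(x_i); the sum over events is split across the cores
// of one node with an OpenMP reduction. MIGRAD would call this at every
// parameter point it probes.
double negativeLogLikelihood(const std::vector<double>& events,
                             const std::vector<double>& params) {
    double nll = 0.0;
    #pragma omp parallel for reduction(+ : nll)
    for (long i = 0; i < static_cast<long>(events.size()); ++i)
        nll -= std::log(pdf(events[i], params));
    return nll;
}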

Parallelization -- example
(Plot: scaling results for 15, 30 and 60 cores.)

Outlook
– Recent progress shows that we shall be able to exploit the next generation of multicore processors with "small" changes to HEP code:
  » exploit copy-on-write (COW) in multi-processing (MP)
  » develop an affordable solution for sharing the output file
  » leverage the Geant4 experience to explore multi-threaded (MT) solutions
– Continue optimization of memory-hierarchy usage:
  » study data and code "locality", including "core affinity"
– Expand the MINUIT experience to other areas of "final" data analysis.

Explore the new frontier of parallel computing
– "Constant-data" sharing via COW or MT will allow scaling current applications to use 8-24 cores per job.
– Scaling to many-core processors (96-core processors foreseen for next year) will require innovative solutions:
  » MP and MT beyond the event level
  » fine-grained parallelism (OpenCL, custom solutions?)
  » parallel I/O
  » use of "spare" cycles/cores for ancillary and speculative computations
– Algorithm concepts have to change: think parallel!