Multicore Computing in ATLAS

Similar presentations
CSCI 4717/5717 Computer Architecture
Khaled A. Al-Utaibi  Computers are Every Where  What is Computer Engineering?  Design Levels  Computer Engineering Fields  What.
IT Systems Operating System EN230-1 Justin Champion C208 –
Chapter 1 and 2 Computer System and Operating System Overview
Software Performance Tuning Project – Final Presentation Prepared By: Eyal Segal Koren Shoval Advisors: Liat Atsmon Koby Gottlieb.
PROOF: the Parallel ROOT Facility Scheduling and Load-balancing ACAT 2007 Jan Iwaszkiewicz ¹ ² Gerardo Ganis ¹ Fons Rademakers ¹ ¹ CERN PH/SFT ² University.
Roger Jones, Lancaster University1 Experiment Requirements from Evolving Architectures RWL Jones, Lancaster University Ambleside 26 August 2010.
Implementing Processes and Process Management Brian Bershad.
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
Requirements for a Next Generation Framework: ATLAS Experience S. Kama, J. Baines, T. Bold, P. Calafiura, W. Lampl, C. Leggett, D. Malon, G. Stewart, B.
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
LATA: A Latency and Throughput- Aware Packet Processing System Author: Jilong Kuang and Laxmi Bhuyan Publisher: DAC 2010 Presenter: Chun-Sheng Hsueh Date:
The Alternative Larry Moore. 5 Nodes and Variant Input File Sizes Hadoop Alternative.
ROOT and Federated Data Stores What Features We Would Like Fons Rademakers CERN CC-IN2P3, Nov, 2011, Lyon, France.
UNIVERSITY OF SOUTH FLORIDA Hadoop Alternative The Hadoop Alternative Larry Moore 1, Zach Fadika 2, Dr. Madhusudhan Govindaraju 2 1.
1 ATLAS experience running ATHENA and TDAQ software Werner Wiedenmann University of Wisconsin Workshop on Virtualization and Multi-Core Technologies for.
I/O Strategies for Multicore Processing in ATLAS P van Gemmeren 1, S Binet 2, P Calafiura 3, W Lavrijsen 3, D Malon 1 and V Tsulaia 3 on behalf of the.
Update on G5 prototype Andrei Gheata Computing Upgrade Weekly Meeting 26 June 2012.
I/O aspects for parallel event processing frameworks Workshop on Concurrency in the many-Cores Era Peter van Gemmeren (Argonne/ATLAS)
Report on Vector Prototype J.Apostolakis, R.Brun, F.Carminati, A. Gheata 10 September 2012.
Parallelizing Atlas Reconstruction and Simulation: Issues and Optimization Solutions for Scaling on Multi- and Many-CPU Platforms Charles Leggett 1 Sebastien.
Mini-Workshop on multi-core joint project Peter van Gemmeren (ANL) I/O challenges for HEP applications on multi-core processors An ATLAS Perspective.
Multi Process I/O Peter Van Gemmeren (Argonne National Laboratory (US))
Operating Systems c. define and explain the purpose of scheduling, job queues, priorities and how they are used to manage job throughput; d. explain how.
Introduction to Operating Systems Concepts
Chapter Overview General Concepts IA-32 Processor Architecture
SPIDAL Java Optimized February 2017 Software: MIDAS HPC-ABDS
Getting the Most out of Scientific Computing Resources
NFV Compute Acceleration APIs and Evaluation
Atlas IO improvements and Future prospects
Introduction to Operating Systems
Getting the Most out of Scientific Computing Resources
University of Technology
Process Management Process Concept Why only the global variables?
Chapter 3: Process Concept
Distributed Processors
Conception of parallel algorithms
CS 425 / ECE 428 Distributed Systems Fall 2016 Nov 10, 2016
Lecture Topics: 11/1 Processes Process Management
RT2003, Montreal Niko Neufeld, CERN-EP & Univ. de Lausanne
Sharing Memory: A Kernel Approach AA meeting, March ‘09 High Performance Computing for High Energy Physics Vincenzo Innocente July 20, 2018 V.I. --
Kay Ousterhout, Christopher Canel, Sylvia Ratnasamy, Scott Shenker
Lecture 21 Concurrency Introduction
Task Scheduling for Multicore CPUs and NUMA Systems
How can a detector saturate a 10Gb link through a remote file system
Parallel Algorithm Design
CS 425 / ECE 428 Distributed Systems Fall 2017 Nov 16, 2017
Report on Vector Prototype
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
TYPES OF OPERATING SYSTEM
Hunan University, College of Information Science and Engineering, Department of Computer Science
Lecture 2: Processes Part 1
Hardware Multithreading
External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.
CS703 - Advanced Operating Systems
Outline Module 1 and 2 dealt with processes, scheduling and synchronization Next two modules will deal with memory and storage Processes require data to.
Lecture Topics: 11/1 General Operating System Concepts Processes
Introduction to Operating Systems
Introduction to Operating Systems
Processes and Process Management
Chapter 13: I/O Systems I/O Hardware Application I/O Interface
Chapter 3: Processes.
Hardware Multithreading
CENG 351 Data Management and File Structures
Chapter-1 Computer is an advanced electronic device that takes raw data as an input from the user and processes it under the control of a set of instructions.
Operating System Overview
MapReduce: Simplified Data Processing on Large Clusters
Lecture Topics: 11/1 Hand back midterms
CSC Multiprocessor Programming, Spring, 2011
Presentation transcript:

Multicore Computing in ATLAS
Paolo Calafiura
Future Computing, Edinburgh, 16 June 2011

Why Embarrassingly Parallel?
[Diagram: a single athena job reads a complex data source and writes multiple outputs (AOD, dESD1, dESD2)]
Millions of LOCs, complex data source, multiple outputs.
Our computational units are many, small, and very regular (not nearly enough black holes in our billions of events).
Low data bandwidth (0.1-10 MB/s).

Today's multi-job approach: athenaMJ
  for i in 0..3:  $ Athena.py -c "EvtMax=25; SkipEvents=25*$i" Jobo.py
[Diagram: four independent jobs, one per core, each with its own start/init/event-loop/end]
  core-0  JOB 0: events [0, …, 24]
  core-1  JOB 1: events [25, …, 49]
  core-2  JOB 2: events [50, …, 74]
  core-3  JOB 3: events [75, …, 99]
PARALLEL: independent jobs
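In effect, athenaMJ launches N fully independent athena jobs, each reading its own slice of the input. A minimal launcher sketch, illustrative only (it assumes Athena.py and Jobo.py from the slide are on the PATH and that the EvtMax/SkipEvents options are honoured as shown):

    import subprocess

    N_JOBS, EVENTS_PER_JOB = 4, 25
    procs = []
    for i in range(N_JOBS):
        # Job i processes its own contiguous slice of 25 events.
        opts = "EvtMax=%d; SkipEvents=%d" % (EVENTS_PER_JOB, i * EVENTS_PER_JOB)
        procs.append(subprocess.Popen(["Athena.py", "-c", opts, "Jobo.py"]))
    for p in procs:
        p.wait()  # the jobs share nothing: no common initialization, no merging step

Each job pays the full initialization cost and holds its own copy of everything in memory, which is exactly what AthenaMP (next slide) is designed to avoid.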

Event-Level Parallelism with AthenaMP
  $ Athena.py --nprocs=4 -c EvtMax=100 Jobo.py
[Diagram: the parent reads the input files, initializes and processes the first events, then OS-forks the workers; each worker runs the event loop on its own core and writes temporary output files, which the parent merges into the output files at the end]
  core-0  WORKER 0: events [0, 5, 8, …, 96]
  core-1  WORKER 1: events [1, 7, 10, …, 99]
  core-2  WORKER 2: events [3, 6, 9, …, 98]
  core-3  WORKER 3: events [2, 4, 12, …, 97]
Maximize the shared memory!
SERIAL: parent init and fork
PARALLEL: workers' event loop
SERIAL: parent merge and finalize
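The core trick is ordinary Unix fork(): initialize once in the parent, then fork workers that inherit the parent's memory copy-on-write and loop over disjoint event subsets. A bare-bones sketch in plain Python (not the AthenaMP code; a simple round-robin assignment stands in for AthenaMP's event distribution, and process_event is a placeholder):

    import os

    NPROCS, EVTMAX = 4, 100
    big_geometry = list(range(10**6))   # stands in for state built once during parent init

    def process_event(evt):
        pass                            # placeholder for the per-event reconstruction work

    pids = []
    for w in range(NPROCS):
        pid = os.fork()
        if pid == 0:                                  # worker: shares parent pages copy-on-write
            for evt in range(w, EVTMAX, NPROCS):      # round-robin slice of the 100 events
                process_event(evt)
            os._exit(0)                               # leave temporary outputs for the parent to merge
        pids.append(pid)

    for pid in pids:
        os.waitpid(pid, 0)              # parent: wait for workers, then merge and finalize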

~Same Event Throughput
[Plot: event throughput for RDOtoESD on a 2x4-core Nehalem, 16 GB RAM, Hyper-Threading ON; an annotation marks where the memory limit is hit and swapping begins]

Why AthenaMP?
The main goal is to reduce the overall memory footprint.
Use Linux fork() to share memory automatically.
[Plot: memory use on an 8-core HT machine; AthenaMP saves ~0.5 GB of physical memory per process]
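On Linux the effect of copy-on-write sharing can be checked directly from /proc/<pid>/smaps, which reports per-mapping Shared_* and Private_* sizes in kB. A small helper sketch (a generic measurement, not an ATLAS tool):

    def memory_breakdown(pid):
        """Return (shared_kB, private_kB) for a process, summed over /proc/<pid>/smaps."""
        shared = private = 0
        with open("/proc/%d/smaps" % pid) as smaps:
            for line in smaps:
                field, _, value = line.partition(":")
                if field in ("Shared_Clean", "Shared_Dirty"):
                    shared += int(value.split()[0])
                elif field in ("Private_Clean", "Private_Dirty"):
                    private += int(value.split()[0])
        return shared, private

    # e.g. compare the parent with one forked worker:
    # print(memory_breakdown(parent_pid), memory_breakdown(worker_pid))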

Why Multiprocessing?
Forking has annoying issues (memory unsharing, file merging), so why not go multi-threaded?
ATLAS HLT started with GaudiMT in ~2002: the same event-farming approach, with the threading code well hidden. It never got beyond proof-of-concept: too much code to port, too many developers to educate.
Hardly an ATLAS-specific issue: examples of multi-MLOC applications running reliably multi-threaded are few and far between...
Even with multiprocessing it took us four years to go from Scott's prototype to (almost) production quality.

Multicore Tweaks: CPU Affinity
The Linux scheduler will move processes from core to core during the course of a job.
We can prevent this by pinning each process to a core via its affinity, and gain ~20%.
[Plot: 100-event test]
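On Linux a process can be pinned from Python with os.sched_setaffinity (available since Python 3.3) or from the shell with taskset. A hedged sketch of pinning each forked worker to its own core (AthenaMP's actual mechanism may differ):

    import os

    NPROCS = 4
    children = []
    for core in range(NPROCS):
        pid = os.fork()
        if pid == 0:
            os.sched_setaffinity(0, {core})   # 0 = the calling process; keep this worker on one core
            # ... this worker's event loop would run here ...
            os._exit(0)
        children.append(pid)
    for pid in children:
        os.waitpid(pid, 0)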

Multicore Tweaks: Hyper-Threading
The nature of the instruction pipeline allows instructions from two hardware threads to be interleaved; the OS sees each "virtual core" as just another processor.
Only effective until one of the threads saturates a shared resource and stalls.
[Plot: AthenaMP, 8 cores, no affinity pinning; HT increases event throughput by ~25%]

AthenaMP in Production
Six months of focused effort, not quite there yet.
Great improvements since Vakho started working on this full time last month; details in Andy's talk.
We need a file-merging framework: generalize what we are doing with POOL fastmerge. This is of interest beyond AthenaMP.
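The shape of such a framework is simple: collect each worker's temporary output and hand the list to a format-aware merger. A hedged sketch (ROOT's generic hadd is used here only as a stand-in for POOL fastmerge or another fast merger; the file pattern is made up):

    import glob
    import subprocess

    def merge_outputs(worker_pattern, merged_name, merge_cmd=("hadd", "-f")):
        """Merge per-worker output files matching worker_pattern into merged_name."""
        parts = sorted(glob.glob(worker_pattern))
        if not parts:
            raise RuntimeError("no worker outputs match %r" % worker_pattern)
        subprocess.check_call(list(merge_cmd) + [merged_name] + parts)

    # e.g. merge_outputs("worker_*/AOD.pool.root", "AOD.merged.pool.root")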

Beyond Event Parallel
Many-core is pushing us to go task-parallel:
Smaller processes, improved memory locality, potentially better caching of asynchronous I/O and other stalls.
[Diagram: tasks such as Tracking (many instances), PID, Monitoring, and Event I/O running as separate processes]

Dedicated I/O Processes
Event source: reads data centrally from disk; deflate once, do not duplicate buffers.
Event sink: merges events on the fly, in the same memory, with no fastmerge post-processing.
Event worker: as before, minus all I/O dependencies (e.g. dictionaries).
Predicated on the ability to exchange events or data objects in a bytestream(-like) format.
The most promising sub-event parallelization; work will start this summer (next week!).
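A toy sketch of the source/worker/sink layout using Python's multiprocessing queues (illustrative only: plain Python objects and a trivial transform stand in for bytestream events and real reconstruction):

    import multiprocessing as mp

    def source(to_workers, n_workers, n_events=100):
        for evt in range(n_events):        # stands in for reading and deflating events from disk
            to_workers.put(evt)
        for _ in range(n_workers):
            to_workers.put(None)           # one end-of-stream marker per worker

    def worker(to_workers, to_sink):
        while True:
            evt = to_workers.get()
            if evt is None:
                break
            to_sink.put(("processed", evt))  # stands in for reconstruction, with no I/O code at all
        to_sink.put(None)

    def sink(to_sink, n_workers):
        finished = 0
        while finished < n_workers:
            item = to_sink.get()
            if item is None:
                finished += 1
            # else: write the event into the merged output on the fly

    if __name__ == "__main__":
        n = 4
        to_workers, to_sink = mp.Queue(), mp.Queue()
        procs = [mp.Process(target=source, args=(to_workers, n))]
        procs += [mp.Process(target=worker, args=(to_workers, to_sink)) for _ in range(n)]
        procs += [mp.Process(target=sink, args=(to_sink, n))]
        for p in procs:
            p.start()
        for p in procs:
            p.join()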

Multi-staged, Pipelined Processing

A Lot of Pipes!

Automated Sub-event Pipelining through Dataflow Analysis
[Figure: a toy athena dataflow graph partitioned across Core 0 and Core 1, with data objects (D, M, J, K, I, N, P) and weights on the edges]
One partitioning technique we are starting to experiment with is dataflow analysis. The top-left corner of the figure shows a toy dataflow graph for athena (the real graph has 20 times more nodes). At run time we monitor which data (grey boxes) is read and written by every athena Algorithm (the processing modules, red ovals). We then turn this dataflow graph into a "precedence graph" (center) that shows the order in which Algorithms must run to satisfy their data dependencies. We will then use the precedence graph and the dataflow graph itself to minimize the flow of data between Algorithms and maximize the locality of the code they run. The challenge is to do this while keeping AthenaMP load-balanced, in other words keeping all cores busy as much as possible and maximizing event throughput through the entire CPU. We are getting some seed money in FY11 from the DOE Office of HEP for initial studies, but we were also told that further R&D will have to be financed via R&D channels, including the detector-upgrade ones.
Challenging optimization: minimize memory traffic, load-balance the cores.
Prototype results are promising (2.5x parallel speedup).
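A hedged sketch of the precedence-graph step (the Algorithm names and their read/write lists below are made up for illustration): from each Algorithm's recorded inputs and outputs, add a producer-to-consumer edge for every data object read, then derive a valid execution order.

    from graphlib import TopologicalSorter   # Python 3.9+

    # Hypothetical per-Algorithm dataflow recorded at run time: algorithm -> (reads, writes)
    dataflow = {
        "EventIO":    (set(),                   {"RawData"}),
        "Tracking":   ({"RawData"},             {"Tracks"}),
        "PID":        ({"Tracks"},              {"Particles"}),
        "Monitoring": ({"Tracks", "Particles"}, set()),
    }

    # Map every data object to the Algorithm that produces it ...
    producer = {obj: alg for alg, (_, writes) in dataflow.items() for obj in writes}

    # ... then add a producer -> consumer edge for every object an Algorithm reads.
    precedence = {alg: set() for alg in dataflow}
    for alg, (reads, _) in dataflow.items():
        for obj in reads:
            if obj in producer:
                precedence[alg].add(producer[obj])

    # A valid serial schedule; the partitioning step would then split it across cores.
    print(list(TopologicalSorter(precedence).static_order()))
    # e.g. ['EventIO', 'Tracking', 'PID', 'Monitoring']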

Summary
Go parallel to save memory resources.
The basic event task farm will be used in production this summer. Famous last words...
Sub-event parallelization should improve memory locality and disk access patterns.

Extra

Communication among sub-tasks
[Diagram: a Tracking process and a PID process, each running physics Algorithms on its own transient event; the Tracks are microstreamed into a persistent event and exchanged via a pipe or shared-memory access, backed by the persistent store]
Implementation under discussion by Peter Van G, Sebastien, and PC.
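A rough illustration of the pipe option between two sub-task processes (plain pickling of a made-up track list stands in for the microstreaming of Tracks into persistent form):

    import multiprocessing as mp
    import pickle

    def tracking_process(conn):
        tracks = [{"pt": 12.3, "eta": 0.7}, {"pt": 45.6, "eta": -1.2}]  # made-up track records
        conn.send_bytes(pickle.dumps(tracks))   # "persistify" the Tracks and ship them down the pipe
        conn.close()

    def pid_process(conn):
        tracks = pickle.loads(conn.recv_bytes())  # rebuild a transient view on the receiving side
        print("PID process received %d tracks" % len(tracks))

    if __name__ == "__main__":
        pid_end, tracking_end = mp.Pipe()
        producer = mp.Process(target=tracking_process, args=(tracking_end,))
        consumer = mp.Process(target=pid_process, args=(pid_end,))
        producer.start()
        consumer.start()
        producer.join()
        consumer.join()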