Multicore Computing in ATLAS
Paolo Calafiura
Future Computing, Edinburgh, 16/6/'11
Why Embarrassingly Parallel?
[Diagram: athena (millions of LOCs) reads a complex data source and produces multiple outputs: AOD, dESD1, dESD2]
- Our computational units are many, small, and very regular (not nearly enough black holes in our billions of events)
- Low data bandwidth (0.1-10 MB/s)
Today's multi-job approach: athenaMJ

for i in range(4):
    $> Athena.py -c "EvtMax=25; SkipEvents=$((i*25))" Jobo.py

core-0  JOB 0: events [0,…,24]
core-1  JOB 1: events [25,…,49]
core-2  JOB 2: events [50,…,74]
core-3  JOB 3: events [75,…,99]
(each job independently runs start → init → event loop → end)

PARALLEL: independent jobs
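A minimal runnable sketch of the multi-job pattern, assuming Athena.py and a generic job-options file Jobo.py are available in the run directory (names as on the slide; everything else is illustrative):

# Sketch: launch 4 independent athena jobs, each processing its own 25-event slice.
import subprocess

NJOBS, EVT_PER_JOB = 4, 25
procs = []
for i in range(NJOBS):
    opts = "EvtMax=%d; SkipEvents=%d" % (EVT_PER_JOB, i * EVT_PER_JOB)
    procs.append(subprocess.Popen(["Athena.py", "-c", opts, "Jobo.py"]))

for p in procs:
    p.wait()   # each job runs its own start/init/event loop/end, fully independently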
Event-Level Parallelism with AthenaMP

> Athena.py --nprocs=4 -c EvtMax=100 Jobo.py

SERIAL:   parent init, first events, OS fork
PARALLEL: workers' event loops
core-0  WORKER 0: events [0, 5, 8,…,96]  → output tmp files
core-1  WORKER 1: events [1, 7, 10,…,99] → output tmp files
core-2  WORKER 2: events [3, 6, 9,…,98]  → output tmp files
core-3  WORKER 3: events [2, 4, 12,…,97] → output tmp files
SERIAL:   parent merge (input files → output files) and finalize

Maximize the shared memory!
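A minimal sketch of the fork-based event farm described above, with hypothetical process_event and merge_outputs helpers standing in for the athena event loop and the output merge step:

# Sketch of the AthenaMP pattern: initialize once, fork workers that share the
# parent's memory copy-on-write, farm events out round-robin, then merge.
import os

NPROCS, EVTMAX = 4, 100

def process_event(evt, out):          # hypothetical: stands in for the athena event loop
    out.write("event %d\n" % evt)

def merge_outputs(paths):             # hypothetical: stands in for the merge/finalize step
    with open("output.txt", "w") as final:
        for p in paths:
            final.write(open(p).read())

# --- serial: parent initialization happens here, before the fork ---
tmp_files = ["worker-%d.tmp" % w for w in range(NPROCS)]
pids = []
for w in range(NPROCS):
    pid = os.fork()
    if pid == 0:                      # child: worker w
        with open(tmp_files[w], "w") as out:
            for evt in range(w, EVTMAX, NPROCS):   # round-robin event assignment
                process_event(evt, out)
        os._exit(0)
    pids.append(pid)

for pid in pids:
    os.waitpid(pid, 0)
# --- serial: parent merges the workers' temporary outputs and finalizes ---
merge_outputs(tmp_files)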
~Same Event Throughput
[Plot: event throughput, RDOtoESD, 2x4-core Nehalem, 16 GB RAM, Hyper-Threading ON; annotation: "memory limit hit, swapping begins"]
Why AthenaMP?
- Main goal is to reduce the overall memory footprint
- Use Linux fork() to share memory automatically
[Plot: 8-core HT machine; AthenaMP saves ~0.5 GB of physical memory per process]
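One way to quantify what fork() buys is sketched below: read a process's /proc/<pid>/smaps (Linux only) and compare shared versus private resident memory. The helper name is illustrative; the field names are those of the kernel's smaps format.

# Sketch: sum Shared_* and Private_* resident sizes for a process from /proc/<pid>/smaps.
# The shared part is what copy-on-write is saving relative to independent jobs.
import os

def memory_breakdown(pid):
    shared_kb = private_kb = 0
    with open("/proc/%d/smaps" % pid) as f:
        for line in f:
            if line.startswith("Shared_"):
                shared_kb += int(line.split()[1])
            elif line.startswith("Private_"):
                private_kb += int(line.split()[1])
    return shared_kb, private_kb

shared, private = memory_breakdown(os.getpid())
print("shared: %.1f MB  private: %.1f MB" % (shared / 1024.0, private / 1024.0))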
Why Multiprocessing?
- Forking has annoying issues (memory unsharing, file merging), so why not go multi-threaded?
- ATLAS HLT started with GaudiMT in ~2002
  - Same event-farming approach, threading code well hidden
  - Never got beyond proof-of-concept: too much code to port, too many developers to educate
- Hardly an ATLAS-specific issue: examples of MLOC applications running reliably multi-threaded are few and far between...
- Even with multi-processing it took us four years to go from Scott's prototype to (almost) production quality
Multicore Tweaks: CPU Affinity
- The Linux scheduler will move processes from core to core during the course of a job
- We can prevent this by pinning each process to a core via its CPU affinity, and gain ~20% in event throughput (100-event test)
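A minimal sketch of affinity pinning from Python (os.sched_setaffinity, Linux, Python 3.3+); the same effect can be had with taskset or sched_setaffinity(2) from C. The worker body is a placeholder.

# Sketch: pin each forked worker to its own core so the scheduler cannot migrate it.
import os

NPROCS = 4
for core in range(NPROCS):
    pid = os.fork()
    if pid == 0:
        os.sched_setaffinity(0, {core})   # 0 = "this process"; keep this worker on one core
        # ... worker event loop would run here ...
        os._exit(0)
for _ in range(NPROCS):
    os.wait()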
Multicore Tweaks: Hyper-Threading
- The nature of the instruction pipeline allows instructions from two threads to be interleaved
- The OS sees the "virtual cores" as just more processors
- Only effective until one of the threads saturates a shared resource and stalls
[Plot: AthenaMP, 8 cores, no affinity pinning; HT increases event throughput by ~25%]
athenaMP in Production
- Six months of focused effort, not quite there yet
- Great improvements since Vakho started working on this full time last month; details in Andy's talk
- Need for a file-merging framework
  - Generalize what we are doing with POOL fastmerge
  - Of interest beyond athenaMP
Beyond Event-Parallel
- Many-core is pushing us to go task-parallel
  - Smaller processes
  - Improved memory locality
  - Potentially better caching of asynchronous I/O and other stalls
[Diagram: many Tracking processes alongside PID, Monitoring, and Event I/O processes]
Dedicated I/O Processes
- Event source: reads data centrally from disk; deflate once, do not duplicate buffers
- Event sink: merges events on the fly, in the same memory, no fastmerge post-processing
- Event worker: as before, minus all I/O dependencies (e.g. dictionaries)
- Predicated on the ability to exchange events or data objects in a bytestream(-like) format
- Most promising sub-event parallelization; work will start this summer (next week!)
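A minimal sketch of the source/worker/sink layout using multiprocessing queues; the byte-stream event format, counts, and file names are stand-ins, not the real implementation.

# Sketch: one reader (event source), N workers, one writer (event sink),
# connected by queues carrying events in a bytestream-like form.
from multiprocessing import Process, Queue

NWORKERS = 4

def source(in_q, nevents):
    for evt in range(nevents):
        in_q.put(("raw-event-%d" % evt).encode())   # stands in for reading/deflating one event
    for _ in range(NWORKERS):
        in_q.put(None)                              # one end-of-stream sentinel per worker

def worker(in_q, out_q):
    while True:
        raw = in_q.get()
        if raw is None:
            break
        out_q.put(raw + b" processed")              # stands in for the event loop, no I/O here
    out_q.put(None)

def sink(out_q):
    done = 0
    with open("merged.out", "wb") as f:
        while done < NWORKERS:
            item = out_q.get()
            if item is None:
                done += 1
            else:
                f.write(item + b"\n")               # events merged on the fly, no post-merge step

if __name__ == "__main__":
    in_q, out_q = Queue(), Queue()
    procs = [Process(target=source, args=(in_q, 100))]
    procs += [Process(target=worker, args=(in_q, out_q)) for _ in range(NWORKERS)]
    procs += [Process(target=sink, args=(out_q,))]
    for p in procs: p.start()
    for p in procs: p.join()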
Multi-staged, Pipelined Processing
A Lot of Pipes!
Automated Sub-event Pipelining through Dataflow Analysis
[Diagram: toy athena data-flow graph partitioned across Core 0 and Core 1; the real graph has ~20x more nodes]
- At run time we monitor which data objects (grey boxes) are read and written by every athena Algorithm (the processing modules, red ovals)
- We turn this data-flow graph into a "precedence graph" showing the order in which Algorithms must run to satisfy their data dependencies
- We then use the precedence graph and the data-flow graph itself to minimize the flow of data between Algorithms and maximize the locality of the code they run
- The challenge is to do this while keeping athenaMP load-balanced: all cores busy as much as possible, maximizing event throughput through the entire CPU
- Seed money in FY11 from the DOE Office of HEP for initial studies; further R&D will have to be financed via other R&D channels, including the detector-upgrade ones
- Challenging optimization: minimize memory traffic, load-balance cores
- Prototype results promising (2.5x parallel speedup)
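A small sketch of the idea: given a recorded data-flow (which objects each Algorithm reads and writes), derive the precedence graph and one legal execution order. The Algorithm and data names below are toy examples, not the real athena graph.

# Sketch: build precedence edges from per-Algorithm read/write sets,
# then topologically sort them (Kahn's algorithm) to get a legal schedule.
from collections import defaultdict, deque

# Toy data-flow: Algorithm -> (data objects read, data objects written)
dataflow = {
    "TrackFinder":  (set(),                       {"Tracks"}),
    "VertexFinder": ({"Tracks"},                  {"Vertices"}),
    "PID":          ({"Tracks"},                  {"Particles"}),
    "Monitoring":   ({"Vertices", "Particles"},   set()),
}

# Edge producer -> consumer whenever a consumer reads something the producer writes.
edges = defaultdict(set)
for consumer, (reads, _) in dataflow.items():
    for producer, (_, writes) in dataflow.items():
        if producer != consumer and reads & writes:
            edges[producer].add(consumer)

# Kahn's algorithm: repeatedly schedule Algorithms whose inputs have all been produced.
indeg = {alg: 0 for alg in dataflow}
for deps in edges.values():
    for alg in deps:
        indeg[alg] += 1
order, ready = [], deque(a for a, d in indeg.items() if d == 0)
while ready:
    alg = ready.popleft()
    order.append(alg)
    for nxt in edges[alg]:
        indeg[nxt] -= 1
        if indeg[nxt] == 0:
            ready.append(nxt)

print(order)   # e.g. ['TrackFinder', 'VertexFinder', 'PID', 'Monitoring']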
Summary
- Go parallel to save memory resources
- Basic event task farm will be used in production this summer (famous last words...)
- Sub-event parallelization should improve memory locality and disk access patterns
Extra
Communication among Sub-tasks
[Diagram: a Tracking process and a PID process, each with Physics Algorithms and a Transient Event store; the Tracks are microstreamed to a Persistent Event and exchanged via a pipe or shared-memory access to the Persistent Store]
- Implementation under discussion by Peter Van G, Sebastien, and PC
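A minimal sketch of the pipe option: one process "microstreams" (serializes) a transient object into a persistent byte form and ships it over a multiprocessing Pipe to another process. The Track class and the pickle-based serialization are placeholders for the real persistency machinery.

# Sketch: ship a serialized data object between two worker processes over a pipe;
# shared memory would avoid the copy but requires agreed-upon fixed layouts.
import pickle
from multiprocessing import Pipe, Process

class Track(object):                          # placeholder for a transient event object
    def __init__(self, pt, eta):
        self.pt, self.eta = pt, eta

def tracking(conn):
    tracks = [Track(25.0, 0.4), Track(60.0, -1.2)]
    conn.send_bytes(pickle.dumps(tracks))     # "persistent", bytestream-like form
    conn.close()

def pid(conn):
    tracks = pickle.loads(conn.recv_bytes())  # rebuilt as transient objects on the other side
    print("PID received %d tracks" % len(tracks))

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    a = Process(target=tracking, args=(child_end,))
    b = Process(target=pid, args=(parent_end,))
    a.start(); b.start()
    a.join(); b.join()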