Unified Model I/O Server: Recent Developments and Lessons

Unified Model I/O Server: Recent Developments and Lessons
Paul Selwood
© Crown copyright Met Office

Recent Developments – Paul Selwood

Increasing Server Parallelism
The original server parallelism was over units/files: coarse-grained load spreading.
Units are allowed to 'hop' between servers based on the current load balance of the servers (sketched below).
A second 'priority' protocol queue is used for fast enquiry operations.
Hints are provided with file_open() calls to signal whether future writes to the unit depend on prior ones.
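A minimal C sketch of the unit 'hopping' decision just described: a newly opened unit goes to the least-loaded server, unless its file_open() hint says later writes depend on earlier ones, in which case it stays where it is. All names here (io_server_load, unit_owner, the hint values) are illustrative assumptions, not the UM I/O server's actual data structures.

    #include <stddef.h>

    #define N_IO_SERVERS 8
    #define MAX_UNITS    1024

    typedef enum { UNIT_HINT_INDEPENDENT, UNIT_HINT_DEPENDENT } unit_hint_t;

    static size_t io_server_load[N_IO_SERVERS];  /* e.g. queued bytes per server */
    static int    unit_owner[MAX_UNITS];         /* current server per unit; assume initialised to -1 at start-up */

    int choose_server_for_unit(int unit, unit_hint_t hint)
    {
        /* Dependent writes must stay ordered, so keep the unit on its current server. */
        if (hint == UNIT_HINT_DEPENDENT && unit_owner[unit] >= 0)
            return unit_owner[unit];

        /* Otherwise the unit may 'hop': pick the least-loaded server right now. */
        int best = 0;
        for (int s = 1; s < N_IO_SERVERS; s++)
            if (io_server_load[s] < io_server_load[best])
                best = s;

        unit_owner[unit] = best;
        return best;
    }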

Parallelism Within Server
North-South domain decomposition (not the model decomposition): N/S model subdomains are mapped to servers.
Pro – reduced message count. Con – load imbalance.
Allows packing parallelism, multiplies the effective server FIFO queue size and provides a communication hierarchy.
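For illustration only, a small C sketch of this kind of North-South mapping: the global latitude rows are split into contiguous bands, one per server task. The row counts and function names are assumptions, not the UM's decomposition code.

    #include <stdio.h>

    /* Which server sub-rank owns a given global row, for n_rows rows split
     * as evenly as possible across n_server_tasks? */
    int server_for_row(int row, int n_rows, int n_server_tasks)
    {
        int base  = n_rows / n_server_tasks;    /* rows per server (minimum) */
        int extra = n_rows % n_server_tasks;    /* first 'extra' servers get one more */
        int boundary = extra * (base + 1);
        if (row < boundary)
            return row / (base + 1);
        return extra + (row - boundary) / base;
    }

    int main(void)
    {
        /* Example: 768 latitude rows shared by 4 server tasks within one server. */
        for (int row = 0; row < 768; row += 192)
            printf("row %3d -> server task %d\n", row, server_for_row(row, 768, 4));
        return 0;
    }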

Subtlety: Temporal Reliability
The operational environment needs to know what's going on, but I/O is running in arrears: messages to trigger archiving and pre-processing, model state consistency (which is my last known good state?).
Introduce a new generic 'epoch' pseudo-operation: the server owning the epoch must wait until all other servers have processed the epoch – effectively a barrier.
Provides limited temporal ordering between independent operations where needed, at the expense of stalling one server.
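A hedged sketch of how such an epoch barrier could look in MPI: every server that is not the epoch owner sends a tiny notification once its queue has drained past the epoch marker, and the owner waits for all of them before acting on the temporally critical operation. The communicator, tag and message layout are assumptions for illustration, not the UM I/O server protocol.

    #include <mpi.h>

    #define EPOCH_TAG 42

    void epoch_barrier(MPI_Comm server_comm, int epoch_owner, long epoch_id)
    {
        int rank, nservers;
        MPI_Comm_rank(server_comm, &rank);
        MPI_Comm_size(server_comm, &nservers);

        if (rank == epoch_owner) {
            /* Owner stalls until every other server has processed this epoch. */
            for (int peer = 0; peer < nservers; peer++) {
                if (peer == epoch_owner) continue;
                long peer_epoch = -1;
                MPI_Recv(&peer_epoch, 1, MPI_LONG, peer, EPOCH_TAG,
                         server_comm, MPI_STATUS_IGNORE);
            }
            /* ... now safe to trigger archiving / report the checkpoint complete ... */
        } else {
            /* Non-owners notify the owner as soon as they reach the marker. */
            MPI_Send(&epoch_id, 1, MPI_LONG, epoch_owner, EPOCH_TAG, server_comm);
        }
    }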

Subtlety: Temporal Reliability
[Diagram: the queues of servers 1-4, showing existing work, an epoch marker owned by server 4, new work and a temporally critical 'Checkpoint Complete!' operation; server 4 is waiting for the epoch on servers 1 and 2.]

Now Use Quadruple Buffering!
Buffer 1: aggregation over repeated requests to the same unit. Evolved from aggregation over levels in a 3D field. Trade-off: message counts, bandwidth, latency, utilisation.
Buffer 2: client-side dispatch queue. Allows multiple in-flight transactions with the server; insulates against comms latency and a busy server.
Buffer 3: I/O server FIFO. Used to facilitate hardware comms offload.
Buffer 4: aggregation of write data into well-formed I/O blocks. Tuneable for the filesystem.
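To make the fourth buffer concrete, a small C sketch of aggregating writes into filesystem-sized blocks, so the filesystem sees a few large well-formed writes rather than many small ones. BLOCK_SIZE, block_writer_t and the use of plain stdio are illustrative assumptions, not the UM implementation.

    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE (4 * 1024 * 1024)   /* tune to the filesystem, e.g. 4 MiB */

    typedef struct {        /* caller opens fp and zeroes used before first write */
        FILE  *fp;
        char   buf[BLOCK_SIZE];
        size_t used;
    } block_writer_t;

    void block_write(block_writer_t *w, const void *data, size_t len)
    {
        const char *src = data;
        while (len > 0) {
            size_t space = BLOCK_SIZE - w->used;
            size_t chunk = len < space ? len : space;
            memcpy(w->buf + w->used, src, chunk);
            w->used += chunk;
            src     += chunk;
            len     -= chunk;
            if (w->used == BLOCK_SIZE) {          /* flush only full blocks */
                fwrite(w->buf, 1, BLOCK_SIZE, w->fp);
                w->used = 0;
            }
        }
    }

    void block_flush(block_writer_t *w)           /* final partial block at close */
    {
        if (w->used > 0) {
            fwrite(w->buf, 1, w->used, w->fp);
            w->used = 0;
        }
    }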

Lots of implementation …
All deterministic global configurations.
UKV – 1.5 km model (24 nodes); EURO – 4 km model (32 nodes).
All ensemble configurations using > 4 nodes.
Atmosphere-only climate runs on > 4 nodes.

Reducing Messaging Pressure

I/O in ESM
[Component diagram: UM Atmosphere with JULES Land Surface and UKCA Chemistry, NEMO Ocean and CICE Sea Ice, coupled through the OASIS Coupler; I/O is handled by the UM I/O Server and XIOS.]

Deployment Nightmare
Component model decomposition tunings and restrictions.
I/O server tuning.
Coupled system load-balancing.
Mapping onto nodes.
An iterative process, potentially using lots of compute time.

Input “helper threads”
Can we use a spare thread to help hide read costs (initial conditions, boundary conditions, forcings, …)?
The thread reads blocks ahead of requests, with tuneable buffer sizes and number of blocks to read ahead.
Easier to implement on a new read layer with an intrinsic buffering design.
Initial attempt implemented on the I/O server.
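A minimal pthreads sketch of the read-ahead idea, assuming a simple double buffer rather than the tuneable ring of buffers described above: while the model consumes one block, the spare thread is already reading the next. prefetch_t, helper_main and prefetch_next are hypothetical names, not the UM read layer.

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE (8 * 1024 * 1024)   /* tunable read block size */

    typedef struct {   /* allocate statically or on the heap; init lock/cond and zero
                        * the flags (pthread_mutex_init / pthread_cond_init) first */
        FILE           *fp;
        char            buf[2][BLOCK_SIZE];
        size_t          len[2];            /* bytes read into each buffer; 0 = EOF */
        int             filled[2];         /* helper sets, consumer clears */
        pthread_mutex_t lock;
        pthread_cond_t  cond;
    } prefetch_t;

    static void *helper_main(void *arg)    /* runs on the spare thread */
    {
        prefetch_t *p = arg;
        for (int i = 0; ; i = 1 - i) {
            pthread_mutex_lock(&p->lock);
            while (p->filled[i])           /* wait until the consumer releases buffer i */
                pthread_cond_wait(&p->cond, &p->lock);
            pthread_mutex_unlock(&p->lock);

            size_t n = fread(p->buf[i], 1, BLOCK_SIZE, p->fp);   /* read ahead */

            pthread_mutex_lock(&p->lock);
            p->len[i] = n;
            p->filled[i] = 1;
            pthread_cond_signal(&p->cond);
            pthread_mutex_unlock(&p->lock);
            if (n == 0) break;             /* end of file: leave a zero-length block */
        }
        return NULL;
    }

    /* Consumer side: blocks only if the helper has fallen behind.
     * Call with i alternating 0, 1, 0, 1, ... */
    size_t prefetch_next(prefetch_t *p, int i, void *dst)
    {
        pthread_mutex_lock(&p->lock);
        while (!p->filled[i])
            pthread_cond_wait(&p->cond, &p->lock);
        size_t n = p->len[i];
        memcpy(dst, p->buf[i], n);
        p->filled[i] = 0;                  /* hand the buffer back to the helper */
        pthread_cond_signal(&p->cond);
        pthread_mutex_unlock(&p->lock);
        return n;
    }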

[Timeline diagram: CPU 0, the I/O server and its helper thread. The helper reads blocks 1, 2 and 3 ahead of the model's read requests, while the I/O server delivers and scatters already-read blocks back to CPU 0.]

Read Helper Results
N512 forecast: model start + timestep 0 incremental analysis update, on 16 nodes of p775, 4 threads/task.

Configuration                                        Read and Distribute Time (s)   Apparent Read Time (s)   Apparent Read Rate (MB/s)
Original I/O layer                                   21.4                           13.7                      927
New layer, input on I/O server, high polling delay   19.6                            3.9                     3143
New layer, input on I/O server, low polling delay    16.6                            4.5                     2710
New layer, no I/O server or helper thread            13.9                            5.3                     2288

So what is happening?
The helper thread is working – read rates are good.
Latencies between the I/O server and the model are killing performance.
Is it the same for larger data sets? Can we aggregate levels to reduce latencies? Helper thread on model cores?
The new I/O layer is an improvement anyway!

Lessons – Paul Selwood

MPI Issues and Opportunities
Threading support: need MPI_THREAD_SERIALIZED, prefer MPI_THREAD_MULTIPLE (see the sketch below).
MPI_Igatherv: data transfer is currently via MPI_Isend/MPI_Irecv; a non-blocking gather would reduce message-rate pressure.
Message priority: want QoS features to prioritise messages; available at system level for many InfiniBand implementations. MPI 3 communicator hints?
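The threading-support point as a short C sketch: request MPI_THREAD_MULTIPLE at start-up and fall back to serialised MPI access if only MPI_THREAD_SERIALIZED is granted. This is generic MPI usage, not code from the UM.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided >= MPI_THREAD_MULTIPLE) {
            /* Any thread may make MPI calls at any time. */
        } else if (provided >= MPI_THREAD_SERIALIZED) {
            /* Usable, but all MPI calls must be protected by a lock so that
             * only one thread is inside the library at a time. */
        } else {
            fprintf(stderr, "Insufficient MPI threading support (%d)\n", provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        MPI_Finalize();
        return 0;
    }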

OpenMP Issues
Subtle failures with FLUSH.
Thread polling load with SMT / HT – needs backoff (see the sketch after this list).
How to use > 2 threads when the atmosphere does packing?
Nested OpenMP support is uneven. Don't.
Awkward deployment. Input helper thread?
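For the polling-load point, a small C sketch of a backoff loop: after a run of empty polls the waiting thread yields, then sleeps for progressively longer, so it stops starving its SMT/HT sibling. queue_try_pop and the thresholds are illustrative assumptions.

    #include <sched.h>
    #include <time.h>
    #include <stdbool.h>

    extern bool queue_try_pop(void **item);   /* hypothetical: non-blocking dequeue */

    void *poll_for_work(void)
    {
        int idle_polls = 0;
        void *item;

        for (;;) {
            if (queue_try_pop(&item))
                return item;                   /* got work */

            idle_polls++;
            if (idle_polls < 100) {
                sched_yield();                 /* cheap: give the SMT/HT sibling a slot */
            } else {
                /* Back off harder: sleep grows with idleness, capped at ~1 ms. */
                long ns = 10000L * (idle_polls < 200 ? idle_polls - 99 : 100);
                struct timespec ts = { 0, ns };
                nanosleep(&ts, NULL);
            }
        }
    }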

Other Issues
Bottlenecks aren't fully appreciated until a partial system is in place and the scale is pushed.
Memory accounting is important and easy to get wrong.
Cost overheads exist: message contention, NUMA effects.
Fully asynchronous processes are difficult to debug.
Complex systems get (wrongly) blamed for any failure!

Lots of tuneable parameters…
Number and spacing of I/O servers; core capability and mapping to hardware.
Buffering choices – where to consume memory: FIFO for I/O servers, client queue size, aggregation level; depends on platform characteristics, MPI characteristics and the output pattern.
Load balancing options.
Timing tunings: thread back-offs, to avoid spin loops on SMT/HT chips.
Standard disk I/O tunings (write block size) etc.
Profiling: lock metering, loading logs, transaction logs, comms timers.

… but how to choose?
[Plots of time vs. time step and backlog vs. time for a fast (4x14) and a slow (4x2) server configuration. Fast: a straight line, indicating no impact on model evolution from I/O-intensive steps. Slow: the model stalls on I/O-intensive steps, servers trail by up to 600 seconds and the server FIFO queues frequently hit maximum capacity.]

Results

What to measure?
In practice we measure “stall time” and overall model runtimes; we tend not to worry about apparent I/O rates etc.
Minimising stall time ensures the process is as asynchronous as we can get it.
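A minimal sketch of what “stall time” accumulation could look like on the client side: only the time the model spends blocked handing a request to the I/O server is counted. enqueue_for_io_server() is a hypothetical call standing in for the real client dispatch routine.

    #include <mpi.h>
    #include <stddef.h>

    extern void enqueue_for_io_server(const void *data, size_t len);  /* hypothetical */

    static double total_stall_time = 0.0;   /* reported at end of run */

    void timed_dispatch(const void *data, size_t len)
    {
        double t0 = MPI_Wtime();
        enqueue_for_io_server(data, len);   /* blocks only if the dispatch queue is full */
        total_stall_time += MPI_Wtime() - t0;
    }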

Results (PS27)
First operational configuration: 8 serial I/O servers, 760 atmosphere tasks (20x38), QU06 global model (PS27 code level). 120 GB output in total: 4 x 10.6 GB restart files, 25 files in total.
Good performance on POWER6; migration to POWER7 slows things down – the atmospheric model is faster and exposes previously completely hidden packing time.

Config                     Performance (relative to No-IOS)
                           P575    P775
I/O disabled               42%
No I/O server              100%
Proxy I/O server           90%
Asynchronous I/O server    73%     154%

Results (Parallel Servers)
Same QU06 model at 24 nodes on P775. Parallel servers remove the packing blockage.
For the same floor space, more CPUs allocated to I/O gives better performance. Moving to 26 nodes is a super-linear speedup.
Insignificant observable I/O time mid-run; 20 seconds at run termination whilst the final I/O discharges.

Config    Performance (relative to No-IOS)
Original 20x38 atmosphere grid
  8x1     154%
  4x2     114%
  2x4     77%
  1x8     84%
Smaller 22x34 atmosphere grid
  10x2    85%
  5x4     58%
  4x5     63%
  2x10    55%

Results (N768 ENDGame)
Potential future global model: 96 nodes, 6144 threads, 2.25x more grid points. Untuned first run!
Uses 4 servers to reflect the rough breakdown of files.

Config    Performance (relative to No-IOS)
4x1       <timeout>
4x2       130%
4x4       75%
4x8       62%
4x14      60%

Questions and answers