
1 Unified Model I/O Server : Recent Developments and Lessons
Paul Selwood © Crown copyright Met Office

2 Recent Developments

3 Increasing Server Parallelism
Original server parallelism:
- Over units/files
- Coarse-grained load spreading
- Units are allowed to 'hop' servers based on the servers' current load balance
- Uses a second 'priority' protocol queue for fast enquiry operations
- Hints are provided with file_open() calls to signal whether future writes to the unit depend on prior ones (see the sketch below)
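A minimal sketch of how such load-based hopping might look, assuming the open hint marks writes as dependent on prior ones and using queued bytes as the load metric. All names here are illustrative assumptions, not UM code:

```c
/* Hypothetical sketch of the unit 'hopping' decision: route a unit to
 * the least-loaded server, unless its file_open() hint says future
 * writes depend on data already queued on its current server. */
int choose_server(const long queued_bytes[], int nservers,
                  int depends_on_prior, int current_server)
{
    if (depends_on_prior)
        return current_server;      /* ordering: must not hop */

    int best = 0;                   /* otherwise pick the least load */
    for (int s = 1; s < nservers; s++)
        if (queued_bytes[s] < queued_bytes[best])
            best = s;
    return best;
}
```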

4 Parallelism Within Server
- North-South domain decomposition (not the model decomposition)
- Maps N/S model subdomains to servers
- Pro: reduced message count. Con: load imbalance
- Allows packing parallelism
- Multiplies effective server FIFO queue size
- Provides a communication hierarchy

5 Subtlety: Temporal Reliability
- The operational environment needs to know what's going on, but I/O is running in arrears
- Messages to trigger archiving and pre-processing
- Model state consistency (which is my last known good state?)
- Introduce a new generic 'epoch' pseudo-operation: the server owning the epoch must wait until all other servers have processed the epoch – effectively a barrier (see the sketch below)
- Provides limited temporal ordering between independent operations where needed, at the expense of stalling one server
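A rough sketch of the epoch pseudo-operation as described above. The tag, communicator layout, and message contents are assumptions rather than the UM protocol:

```c
#include <mpi.h>

#define TAG_EPOCH_DONE 42   /* hypothetical protocol tag */

/* Called by each I/O server when the epoch marker reaches the head
 * of its FIFO. 'owner' is the rank (within the server communicator)
 * that owns this epoch and must stall until everyone has passed it. */
void process_epoch(MPI_Comm servers, int owner, int epoch_id)
{
    int rank, nservers;
    MPI_Comm_rank(servers, &rank);
    MPI_Comm_size(servers, &nservers);

    if (rank == owner) {
        /* Stall until every other server reports the epoch processed. */
        for (int seen = 0; seen < nservers - 1; seen++) {
            int id;
            MPI_Recv(&id, 1, MPI_INT, MPI_ANY_SOURCE, TAG_EPOCH_DONE,
                     servers, MPI_STATUS_IGNORE);
        }
        /* Now safe to emit the temporally critical operation,
         * e.g. the trigger that the checkpoint is complete. */
    } else {
        /* Report passage of the epoch and carry on with new work. */
        MPI_Send(&epoch_id, 1, MPI_INT, owner, TAG_EPOCH_DONE, servers);
    }
}
```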

6 Subtlety: Temporal Reliability
[Diagram: FIFO work queues for servers 1–4. Server 4, the epoch owner, has reached its epoch marker and a temporally critical operation ('Checkpoint complete!'), but is waiting for servers 1 and 2 to process their epoch markers; existing work drains ahead of the markers while new work queues behind them.]

7 Now Use Quadruple Buffering!
- Buffer 1: Aggregation over repeated requests to the same unit. Evolved from aggregation over levels in a 3D field. Trade-off: message counts, bandwidth, latency, utilisation.
- Buffer 2: Client-side dispatch queue. Allows multiple in-flight transactions with the server; insulates against comms latency and a busy server.
- Buffer 3: I/O server FIFO. Used to facilitate hardware comms offload.
- Buffer 4: Aggregation of write data into well-formed I/O blocks. Filesystem tuneable (see the sketch below).
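As an illustration of Buffer 4, here is a minimal sketch, with assumed names and sizes, of aggregating writes into well-formed blocks before touching the filesystem. It is not the UM implementation:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define IOBLOCK (4u * 1024 * 1024)   /* filesystem-tuneable block size */

static char   blk[IOBLOCK];
static size_t fill = 0;

/* Copy into the block buffer; hit the filesystem only on full blocks. */
void buffered_write(int fd, const char *src, size_t n)
{
    while (n > 0) {
        size_t take = IOBLOCK - fill;
        if (take > n)
            take = n;
        memcpy(blk + fill, src, take);
        fill += take;
        src  += take;
        n    -= take;
        if (fill == IOBLOCK) {           /* well-formed block ready */
            if (write(fd, blk, IOBLOCK) < 0)
                perror("write");
            fill = 0;
        }
    }
}

/* At unit close, flush whatever ragged tail remains. */
void buffered_flush(int fd)
{
    if (fill > 0) {
        if (write(fd, blk, fill) < 0)
            perror("write");
        fill = 0;
    }
}
```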

8 Lots of implementation …
- All deterministic global configurations
- UKV – 1.5 km model (24 nodes)
- EURO 4 km model (32 nodes)
- All ensemble configurations using > 4 nodes
- Atmosphere-only climate runs on > 4 nodes

9 Reducing Messaging Pressure

10 I/O in the ESM
[Architecture diagram: the OASIS coupler connects the UM Atmosphere (with JULES Land Surface and UKCA Chemistry) to the NEMO Ocean and CICE Sea Ice components; the UM I/O Server handles the atmosphere side while XIOS serves NEMO and CICE.]

11 Deployment Nightmare
- Component model decomposition tunings and restrictions
- I/O server tuning
- Coupled system load-balancing
- Mapping onto nodes
- An iterative process, potentially using lots of compute time

12 Input “helper threads”
- Can we use a spare thread to help with read costs? Initial conditions, boundary conditions, forcings, …
- Thread reads blocks ahead of requests
- Tuneable buffer sizes and number of blocks to read ahead
- Easier to implement on a new read layer with an intrinsic buffering design
- Initial attempt implemented on the I/O server (see the sketch below)
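A minimal, self-contained sketch of the read-ahead idea using POSIX threads. The ring of blocks, block size, and read-ahead depth correspond to the tuneables above, but all names are hypothetical rather than UM code:

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS     4            /* blocks to read ahead (tuneable) */
#define BLOCK_BYTES (1 << 20)    /* block size (tuneable) */

typedef struct {
    char   data[BLOCK_BYTES];
    size_t nbytes;               /* bytes actually read; 0 marks EOF */
} block_t;

static block_t ring[NBLOCKS];    /* ring of read-ahead buffers */
static int head = 0, count = 0, eof = 0;
static pthread_mutex_t lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  notfull  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  notempty = PTHREAD_COND_INITIALIZER;

/* Helper thread: keep the ring topped up ahead of demand. */
static void *read_ahead(void *arg)
{
    FILE *fp = arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == NBLOCKS)             /* ring full: wait */
            pthread_cond_wait(&notfull, &lock);
        block_t *b = &ring[(head + count) % NBLOCKS];
        pthread_mutex_unlock(&lock);

        b->nbytes = fread(b->data, 1, BLOCK_BYTES, fp);

        pthread_mutex_lock(&lock);
        count++;
        if (b->nbytes == 0)
            eof = 1;                         /* queue the EOF marker */
        pthread_cond_signal(&notempty);
        pthread_mutex_unlock(&lock);
        if (b->nbytes == 0)
            return NULL;
    }
}

/* Consumer: blocks only if the helper has fallen behind. */
static size_t next_block(char *dst)
{
    pthread_mutex_lock(&lock);
    while (count == 0 && !eof)
        pthread_cond_wait(&notempty, &lock);
    if (count == 0) {                        /* drained past EOF */
        pthread_mutex_unlock(&lock);
        return 0;
    }
    size_t n = ring[head].nbytes;
    if (n > 0)
        memcpy(dst, ring[head].data, n);
    head = (head + 1) % NBLOCKS;
    count--;
    pthread_cond_signal(&notfull);
    pthread_mutex_unlock(&lock);
    return n;                                /* 0 signals EOF */
}
```

The model side would spawn read_ahead with pthread_create() on the spare thread and then drain blocks with next_block(), stalling only when the helper has fallen behind.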

13 [Timeline diagram: CPU 0, the I/O server, and the helper thread. The helper reads blocks 1–3 ahead of demand; when CPU 0 issues a read request, the server delivers already-buffered blocks for scatter while the helper continues reading ahead.]

14 Read Helper Results
N512 forecast model start + timestep 0 incremental analysis update, on 16 nodes of p775, 4 threads/task.

| Configuration | Read and Distribute Time (s) | Apparent Read Time (s) | Apparent Read Rate (MB/s) |
|---|---|---|---|
| Original I/O layer | 21.4 | 13.7 | 927 |
| New layer, input on I/O server, high polling delay | 19.6 | 3.9 | 3143 |
| New layer, input on I/O server, low polling delay | 16.6 | 4.5 | 2710 |
| New layer, no I/O server or helper thread | 13.9 | 5.3 | 2288 |

15 So what is happening?
- The helper thread is working – read rates are good
- Latencies between the I/O server and the model are killing performance
- Is it the same for larger data sets?
- Can we aggregate levels to reduce latencies?
- Helper thread on model cores?
- The new I/O layer is an improvement anyway!

16 Lessons

17 MPI Issues and Opportunities
- Threading support: need MPI_THREAD_SERIALIZED, prefer MPI_THREAD_MULTIPLE (see the sketch below)
- MPI_Igatherv: data transfer is currently via MPI_Isend/MPI_Irecv; a non-blocking gather would reduce message-rate pressure
- Message priority: want QoS features to prioritise messages; available at the system level for many InfiniBand implementations; MPI 3 communicator hints?
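For the threading-support point, a minimal sketch of requesting and checking the MPI threading level at start-up; the error-handling policy is an assumption:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for the strongest level; MPI may grant something weaker. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_SERIALIZED) {
        fprintf(stderr,
                "I/O server needs MPI_THREAD_SERIALIZED or better "
                "(got level %d); aborting.\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    } else if (provided < MPI_THREAD_MULTIPLE) {
        /* Usable, but every MPI call must be serialised across threads. */
        fprintf(stderr, "Warning: running with MPI_THREAD_SERIALIZED; "
                        "MPI calls will be mutex-protected.\n");
    }

    /* ... server runs here ... */

    MPI_Finalize();
    return EXIT_SUCCESS;
}
```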

18 OpenMP Issues
- Subtle failures with FLUSH
- Thread polling load with SMT/HT (back-off; see the sketch below)
- How to use > 2 threads when the atmosphere does packing?
- Nested OpenMP support is uneven. Don't.
- Awkward deployment. Input helper thread?
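A small sketch of the polling back-off idea: a server thread probing for protocol messages yields the hardware thread between probes rather than spinning, so it does not starve a sibling SMT/HT thread. The delay value and function name are illustrative:

```c
#include <mpi.h>
#include <unistd.h>

/* Wait for the next protocol message without spinning: probe, and if
 * nothing has arrived, sleep briefly so a sibling hardware thread can
 * make progress. 'backoff_usec' is the tuneable polling delay. */
static void wait_for_work(MPI_Comm comm, unsigned backoff_usec)
{
    int flag = 0;
    MPI_Status status;

    while (!flag) {
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
        if (!flag)
            usleep(backoff_usec);   /* back off instead of spinning */
    }
    /* ... message available: receive it and enqueue on the FIFO ... */
}
```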

19 Other Issues
- Bottlenecks aren't fully appreciated until a partial system is in place and the scale is pushed
- Memory accounting is important and easy to get wrong
- Cost overheads exist: message contention, NUMA effects
- Fully asynchronous processes are difficult to debug
- Complex systems get (wrongly) blamed for any failures!

20 Lots of tuneable parameters…
- Number and spacing of I/O servers: core capability and mapping to hardware
- Buffering choices (where to consume memory): FIFO for I/O servers, client queue size, aggregation level
- Platform characteristics, MPI characteristics, output pattern
- Load-balancing options
- Timing tunings: thread back-offs, to avoid spin loops on SMT/HT chips
- Standard disk I/O tunings (write block size) etc.
- Profiling: lock metering, loading logs, transaction logs, comms timers

21 … but how to choose?
[Plot: time vs. timestep for a fast (4x14) and a slow (4x2) server configuration. A straight line indicates no impact on model evolution from I/O-intensive steps; in the slow configuration the model stalls on I/O-intensive steps, with servers trailing by up to 600 seconds.]
[Plot: backlog vs. time. Server FIFO queues frequently hit maximum capacity.]

22 Results

23 What to measure?
In practice we measure:
- "Stall time"
- Overall model runtimes
We tend not to worry about apparent I/O rates etc.; minimising stall time ensures the process is as asynchronous as we can get it.

24 Results (PS27)
- 1st operational configuration: 8 serial I/O servers, 760 atmosphere tasks (20x38)
- QU06 global model (PS27 code level)
- 120 GB output in total: 4 x 10.6 GB restart files, 25 files in total
- Good performance on POWER6; migration to POWER7 slows it down – the atmospheric model is faster, exposing previously completely hidden packing time

Performance (relative to the no-I/O-server case):

| Config | P575 | P775 |
|---|---|---|
| I/O disabled | 42% | – |
| No I/O server | 100% | – |
| Proxy I/O server | 90% | – |
| Asynchronous I/O server | 73% | 154% |

25 Results (Parallel Servers)
- Same QU06 model, on P775 nodes
- Parallel servers remove the packing blockage
- For the same floor space, more CPUs allocated to I/O gives better performance
- Moving to 26 nodes is a super-linear speedup
- Insignificant observable I/O time mid-run; 20 seconds at run termination whilst the final I/O discharges

Performance (relative to the no-I/O-server case):

| Config (original 20x38 atmosphere grid) | Performance |
|---|---|
| 8x1 | 154% |
| 4x2 | 114% |
| 2x4 | 77% |
| 1x8 | 84% |

| Config (smaller 22x34 atmosphere grid) | Performance |
|---|---|
| 10x2 | 85% |
| 5x4 | 58% |
| 4x5 | 63% |
| 2x10 | 55% |

26 Results (N768 ENDGame)
- Potential future global model: 96 nodes, 6144 threads, 2.25x more grid points
- Untuned first run!
- Uses 4 servers to reflect the rough breakdown of files

Performance (relative to the no-I/O-server case):

| Config | Performance |
|---|---|
| 4x1 | <timeout> |
| 4x2 | 130% |
| 4x4 | 75% |
| 4x8 | 62% |
| 4x14 | 60% |

27 Questions and answers

