Published by Bryan Wilkerson. Modified over 9 years ago.
0
WRF Performance Optimization Targeting Intel Multicore and Manycore Architectures
Samm Elliott
Mentor – Davide Del Vento
1
The WRF Model
The Weather Research and Forecasting (WRF) Model is a mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs.
It is used by over 30,000 scientists around the world, so any optimizations will have a significant impact on the WRF community as a whole.
2
Stampede Supercomputer
6400 Nodes (Dell PowerEdge C8220)
Two CPUs Per Node (Intel Xeon E Processors Per Node)
32 GB Memory Per Node
1-2 Coprocessors Per Node (Intel Xeon Phi SE10P)
3
Intel Xeon vs Xeon Phi Architecture
Xeon E CPU: 8 Cores; 2-Way Hyperthreading (2 Hardware Threads Per Core); 256 Bit Vector Registers; 2.7 GHz; 32 KB L1; 256 KB L2; 20 MB Shared L3; 32 GB Main Memory
Xeon Phi SE10P Coprocessor: 61 Cores; 4-Way Hyperthreading (4 Hardware Threads Per Core); 512 Bit Vector Registers; 1.1 GHz; 512 KB L2; No L3 Cache; 8 GB Main Memory
4
Intro: WRF Gets Good Performance on Xeon Phi!
Memory Limitations
5
Standard MPI Implementation Strong Scaling
Plot annotations: Low Efficiency, Fast Time to Solution; High Efficiency, Slow Time to Solution; Compute Bound; MPI Bound
6
MPI Tile Decomposition
7
Hybrid Tile Decomposition
WRF does NOT use loop-level OpenMP parallelization.
Figure labels: MPI Tiles, OpenMP Tiles
8
MPI vs. Hybrid Parallelization What do we expect?
Less Halo Layer Communication (White Arrows)
Less MPI Overhead and Therefore Faster!
No loop-level OpenMP parallelization
Figure panels: MPI Only, Hybrid
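The expectation of less halo communication can be checked with simple arithmetic. The sketch below is an illustration only, not WRF code: it assumes a rectangular grid cut into px by py MPI patches, a halo one cell wide, and an exchange in both directions across every internal cut. The function name and the 400 x 400 grid on 16 cores are hypothetical.

```python
def halo_cells(nx, ny, px, py, halo_width=1):
    """Cells exchanged across internal patch boundaries of a px-by-py MPI decomposition."""
    # (px - 1) internal cuts of length ny, plus (py - 1) internal cuts of length nx.
    internal_edge_cells = (px - 1) * ny + (py - 1) * nx
    # Each cut is exchanged in both directions.
    return 2 * halo_width * internal_edge_cells


# Hypothetical case: a 400 x 400 grid on 16 cores.
mpi_only = halo_cells(400, 400, px=4, py=4)  # 16 MPI tasks, 1 thread each
hybrid = halo_cells(400, 400, px=2, py=1)    # 2 MPI tasks, 8 OpenMP threads each
print(mpi_only, hybrid)
```

Under these assumptions the hybrid layout exchanges far fewer halo cells, because OpenMP tiles inside one MPI patch share memory and need no message passing.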
9
MPI vs. Hybrid Strong Scaling
Hybrid parallelization of WRF is consistently better than strict MPI and is significantly better in the MPI bound regions.
10
Core Binding
Using processor domain binding allows all OpenMP threads within an MPI task to share L3 cache.
Figure labels: Socket 0, Socket 1, Rank 0, Rank 1
11
pNetcdf
Output methods compared: Serial, pNetcdf, Separate Output Files
Using separate output files (io_form_history=102) makes output write times negligible – very worthwhile!
12
Host Node Optimization Summary
Hybrid parallelization of WRF consistently gives better performance than strict MPI.
It is much faster than strict MPI in the MPI bound region.
Using separate output files requires post-processing but makes the time spent writing history negligible.
Process/thread binding is critical for hybrid WRF.
Take memory limitations into consideration for hybrid WRF and set the environment variable OMP_STACKSIZE to avoid memory issues.
13
Xeon Phi IO
Output methods compared: Serial, pNetcdf, Separate Output Files
14
How Does Varying The Number of MPI Tasks Per Xeon Phi Affect The Performance?
Total Number of Cores = 240
Plot annotations: Less MPI Overhead, More MPI Overhead
Why are we seeing better performance with more MPI overhead?
These results suggest that there are issues with WRF’s OpenMP strategies.
15
Hybrid Tile Decomposition
WRF does NOT use loop-level OpenMP parallelization.
Figure labels: MPI Tiles, OpenMP Tiles
16
OpenMP Imbalancing
Two types of imbalance:
1. Number of OpenMP Tiles > Number of OpenMP Threads
2. OpenMP Tiles are Different Sizes
Default Tiling Issue: occurs when the number of threads is equal to any multiple of the number of MPI tile rows
17
OpenMP Imbalance Type 1:
Example – 10 x 10 grid run with 2 MPI tasks and 8 OpenMP threads per MPI task
Figure labels: MPI Rank 0, MPI Rank 1
Threads #1 and #2 compute 2 OpenMP tiles each (twice as much work as the other threads, plus context switch overhead)
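The cost of this first imbalance type can be estimated with a short calculation. The sketch below is not taken from WRF; it assumes every tile costs the same amount of work and that a step finishes only when the busiest thread is done. The count of 10 tiles per MPI task is inferred from the example (6 threads with one tile, 2 threads with two), and the function name is hypothetical.

```python
import math


def tile_imbalance(num_tiles, num_threads):
    """Return (max tiles on any thread, thread efficiency) for equal-cost tiles."""
    # The busiest thread processes ceil(tiles / threads) tiles.
    max_tiles = math.ceil(num_tiles / num_threads)
    # Efficiency: useful work divided by the work capacity held until the busiest thread ends.
    efficiency = num_tiles / (num_threads * max_tiles)
    return max_tiles, efficiency


max_tiles, efficiency = tile_imbalance(num_tiles=10, num_threads=8)
print(max_tiles, efficiency)  # 2 tiles on the busiest threads, efficiency 0.625
```

With 8 tiles on 8 threads the same function returns an efficiency of 1.0, which is the balanced case.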
18
OpenMP Imbalance Type 2:
Example – 4 MPI Tasks, 4 OpenMP Threads Per Task
Figure labels: MPI Rank 0, MPI Rank 1, MPI Rank 2, MPI Rank 3
Thread #4 computes twice as many grid cells as all other threads!
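One way a tile can end up twice the size of its neighbours is integer division with a leftover row. The sketch below is a hypothetical illustration, not WRF's actual tiling routine: it assumes an MPI patch of 5 rows split among 4 threads, with the remainder given to the last tile.

```python
def split_rows(num_rows, num_tiles):
    """Split rows into tiles, giving any remainder to the last tile (assumed rule)."""
    if num_rows < num_tiles:
        raise ValueError("need at least one row per tile")
    base, extra = divmod(num_rows, num_tiles)
    sizes = [base] * num_tiles
    sizes[-1] += extra  # the last tile absorbs the leftover rows
    return sizes


sizes = split_rows(num_rows=5, num_tiles=4)
ratio = max(sizes) / min(sizes)
print(sizes, ratio)  # [1, 1, 1, 2] and a ratio of 2.0
```

In this made-up case the fourth thread does twice the work of the others, so the other three sit idle for half of each step.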
19
OpenMP Good Balancing
Example: 8 by 8 grid run with 2 MPI Tasks and 8 OpenMP threads per task
Figure labels: MPI Rank 0, MPI Rank 1
The logical tiling would be a 2 by 4 OpenMP tiling, but WRF does a 3 by 4, creating an unnecessary imbalance.
I was able to resolve this issue, and the fix will be in the next version of WRF.
20
WRF OpenMP Tile Strategy
21
What is an Optimal WRF Case For Xeon Phi?
Xeon Phi – Initial Scaling
Xeon Phi Approaches Xeon Performance for Large Workload/Core
22
Scaling Gridsize
Xeon Phi Hits Memory Limits
Consistent Balancing for Symmetric Runs
Xeon Hits Performance Limit Far Before Xeon Phi
Xeon Phi exceeds Xeon Performance for > 30,000 Horizontal Gridpoints/CPU-Coprocessor
23
Xeon Phi Optimization Summary: The Good
Xeon Phi can be more efficient than host CPUs in the extreme high efficiency/slow time to solution region.
For highly efficient workloads, low MPI overhead and constant efficiency make it possible to have well balanced symmetric CPU-Coprocessor WRF runs that are significantly more efficient than running on either homogeneous architecture.
24
Xeon Phi Optimization Summary: The Bad
WRF strong scaling performance on Xeon Phi is significantly worse than on host node CPU runs.
WRF tiling strategies are not well optimized for manycore architectures.
WRF/Fortran’s array allocation strategies result in much larger memory requirements and limit workloads that otherwise would have high efficiency on Xeon Phi.
25
Xeon Phi Optimization Summary: The Ugly
Although Xeon Phi could be used for highly efficient WRF simulations, finding the correct:
problem sizes
task-thread ratios
tile decompositions
and workload per core
while taking into consideration memory limitations makes Xeon Phi extremely impractical for the vast majority of WRF users.
26
Why is it Important to Continue Researching WRF on Xeon Phi?
Core counts will continue to increase for future HPC architectures – This will require better hybrid strategies for our models. Xeon Phi is very representative of these architectures and is a tool for exposing various issues that otherwise may not be noticed until further down the road.
27
Future Work
Create better MPI+OpenMP tile strategies for low workload per core simulations
Better understand performance issues with heap allocation and overall memory access patterns
Assess any other concerns that may hinder WRF performance on future multicore/manycore architectures
28
Acknowledgements
Davide Del Vento – Mentor
Dave Gill – WRF Help
Srinath Vadlamani – Xeon Phi/Profiling
Mimi Hughes – WRF
XSEDE/Stampede – Project Supercomputer Usage
SIParCS/Rich Loft – Internship Opportunity
Thank you all for giving me so much help throughout this process. I am extremely thankful for my experience here at NCAR!
29
Questions?
30
Memory Issues For Small Core Counts
Memory use by WRF variables is multiplied by the number of threads per MPI task.
Not much of an issue for a small number of threads per task
Ensure OMP_STACKSIZE is set correctly (default may be set much lower than necessary)
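The thread multiplication described above can be turned into a rough capacity check. The sketch below is an assumption-laden estimate, not a measurement: it supposes each MPI task holds some shared data plus one private copy of its working arrays per OpenMP thread. All sizes and function names are hypothetical; only the 8 GB limit comes from the Xeon Phi specification quoted earlier.

```python
def per_task_memory_mb(shared_mb, per_thread_mb, threads_per_task):
    """Estimated footprint of one MPI task: shared data plus one private copy per thread."""
    return shared_mb + per_thread_mb * threads_per_task


def fits_in_memory(tasks, threads_per_task, shared_mb, per_thread_mb, limit_mb):
    """True if all tasks on one device fit under the memory limit."""
    total_mb = tasks * per_task_memory_mb(shared_mb, per_thread_mb, threads_per_task)
    return total_mb <= limit_mb


PHI_LIMIT_MB = 8 * 1024  # 8 GB main memory on the Xeon Phi SE10P

# Hypothetical sizes: 500 MB shared per task, 4 tasks of 60 threads each.
print(fits_in_memory(4, 60, shared_mb=500, per_thread_mb=30, limit_mb=PHI_LIMIT_MB))
print(fits_in_memory(4, 60, shared_mb=500, per_thread_mb=20, limit_mb=PHI_LIMIT_MB))
```

If the per-thread arrays live on the thread stack, OMP_STACKSIZE has to be at least as large as the per-thread figure used here, which is why a default set too low causes failures.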
31
Memory Issues For Small Core Counts
Memory use by WRF variables is multiplied by the number of threads per MPI task.
Temporary Solution: Force heap array allocation (-heap-arrays)
Extremely slow in compute-bound regions (“speeddown” proportional to number of threads per task)
Potential Culprits: Cache Coherency? Repeated temporary array allocation?
32
Forcing Heap Array Allocation