WRF Performance Optimization Targeting Intel Multicore and Manycore Architectures
Samm Elliott
Mentor: Davide Del Vento

The WRF Model
The Weather Research and Forecasting (WRF) Model is a mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. It is used by over 30,000 scientists around the world, so any optimization that can be made has a significant impact on the WRF community as a whole.

Stampede Supercomputer
- 6,400 nodes (Dell PowerEdge C8220)
- Two CPUs per node (Intel Xeon E5-2680)
- 32 GB memory per node
- 1-2 coprocessors per node (Intel Xeon Phi SE10P)

Intel Xeon vs. Xeon Phi Architecture
Xeon E5-2680 CPU:
- 8 cores, 2-way hyperthreading (2 hardware threads per core)
- 256-bit vector registers
- 2.7 GHz
- 32 KB L1 and 256 KB L2 per core, 20 MB shared L3
- 32 GB main memory
Xeon Phi SE10P coprocessor:
- 61 cores, 4-way hyperthreading (4 hardware threads per core)
- 512-bit vector registers
- 1.1 GHz
- 512 KB L2 per core, no L3 cache
- 8 GB main memory

Intro: WRF gets good performance on Xeon Phi, but memory limitations are a concern (figures).

Standard MPI Implementation: Strong Scaling
At low core counts the runs are compute bound: high efficiency but slow time to solution. At high core counts they become MPI bound: fast time to solution but low efficiency.
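For reference, speedup and parallel efficiency in a strong-scaling study are computed from wall-clock times as sketched below. The timings in this C snippet are invented for illustration; they are not measurements from these runs.

```c
#include <stdio.h>

/* Strong scaling: the same total problem is run on more and more cores.
 *   speedup    S(p) = T(1) / T(p)
 *   efficiency E(p) = S(p) / p      (1.0 = perfect scaling)
 * The timings below are invented, purely to illustrate the trade-off
 * between time to solution and efficiency. */
int main(void)
{
    const int    nodes[]  = {1, 2, 4, 8, 16};
    const double time_s[] = {3600.0, 1850.0, 980.0, 560.0, 390.0};

    for (int i = 0; i < 5; ++i) {
        double speedup    = time_s[0] / time_s[i];
        double efficiency = speedup / nodes[i];
        printf("%2d nodes: speedup %5.2f, efficiency %3.0f%%\n",
               nodes[i], speedup, 100.0 * efficiency);
    }
    return 0;
}
```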

MPI Tile Decomposition

Hybrid Tile Decomposition
WRF does NOT use loop-level OpenMP parallelization: each MPI tile (the patch owned by one MPI task) is subdivided into OpenMP tiles that are handed to the OpenMP threads.
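To make the distinction concrete, here is a minimal C sketch of tile-level OpenMP parallelism. WRF itself is Fortran; the tile count and the solve_tile routine below are placeholders, not WRF code.

```c
#include <omp.h>
#include <stdio.h>

#define NTILES 8                 /* OpenMP tiles per MPI patch (illustrative) */

/* Placeholder for all the dynamics/physics computed on one tile. */
static void solve_tile(int tile_id) { (void)tile_id; /* ... */ }

int main(void)
{
    /* Tile-level parallelism: each thread takes whole tiles of the MPI
     * patch, instead of the threads sharing the iterations of every
     * inner loop (loop-level parallelism). */
    #pragma omp parallel for schedule(static)
    for (int tile = 0; tile < NTILES; ++tile)
        solve_tile(tile);

    printf("processed %d tiles with up to %d threads\n",
           NTILES, omp_get_max_threads());
    return 0;
}
```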

MPI vs. Hybrid Parallelization: What Do We Expect?
With hybrid parallelization there is less halo-layer communication (the white arrows in the figure), hence less MPI overhead and therefore faster runs. Remember that WRF has no loop-level OpenMP parallelism. (Figure: MPI-only vs. hybrid decomposition.)
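A rough way to see why: halo traffic scales with the total perimeter of the MPI patches, so fewer, larger patches exchange less data. The C sketch below assumes a square 1024 x 1024 grid, a one-cell halo, and a square task layout; all of the numbers are illustrative, not the configurations on the slide.

```c
#include <math.h>
#include <stdio.h>

/* Toy halo estimate: each MPI patch exchanges a one-cell-wide halo along
 * its perimeter, so total halo volume ~ total patch perimeter. */
static long halo_cells(int nx, int ny, int ntasks)
{
    int px = (int)lround(sqrt((double)ntasks));   /* assume a square task grid */
    int py = ntasks / px;
    long per_patch = 2L * (nx / px) + 2L * (ny / py);
    return per_patch * ntasks;
}

int main(void)
{
    int nx = 1024, ny = 1024, cores = 256;
    printf("pure MPI, %3d ranks:           %ld halo cells\n",
           cores, halo_cells(nx, ny, cores));
    printf("hybrid, %2d ranks x 16 threads: %ld halo cells\n",
           cores / 16, halo_cells(nx, ny, cores / 16));
    return 0;
}
```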

MPI vs. Hybrid Strong Scaling Hybrid parallelization of WRF is consistently better than strict MPI and is significantly better in the MPI bound regions.

Core Binding
Using processor domain binding allows all OpenMP threads within an MPI task to share that socket's L3 cache (figure: one MPI rank bound to each of socket 0 and socket 1).
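The binding itself is requested through the MPI launcher and the OpenMP affinity environment variables, but it is worth verifying. The Linux-only diagnostic sketch below has each OpenMP thread report the core it is running on; if binding is working, all threads of a rank should report cores on the same socket.

```c
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

/* Diagnostic only: print which core each OpenMP thread landed on.
 * sched_getcpu() is Linux-specific (glibc). */
int main(void)
{
    #pragma omp parallel
    {
        printf("thread %2d is running on core %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}
```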

Host I/O comparison (figure): pNetCDF, serial NetCDF, and separate output files. Using separate output files (io_form_history=102) makes output write times negligible – very worthwhile!

Host Node Optimization Summary
- Hybrid parallelization of WRF consistently gives better performance than strict MPI, and is much faster in the MPI-bound region.
- Using separate output files requires post-processing but eliminates virtually all time spent writing history.
- Process/thread binding is critical for hybrid WRF.
- Take memory limitations into consideration for hybrid WRF, and set the environment variable OMP_STACKSIZE to avoid memory issues.

Xeon Phi I/O (figure): serial NetCDF, pNetCDF, and separate output files.

How Does Varying the Number of MPI Tasks per Xeon Phi Affect Performance?
Total number of cores used = 240. Fewer MPI tasks per coprocessor (more OpenMP threads each) means less MPI overhead; more tasks means more MPI overhead. Why are we seeing better performance with more MPI overhead? These results suggest that there are issues with WRF's OpenMP strategies.

Hybrid Tile Decomposition (recap)
WRF does NOT use loop-level OpenMP parallelization: each MPI tile is subdivided into OpenMP tiles.

OpenMP Imbalance
The default tiling causes two types of imbalance:
1. The number of OpenMP tiles exceeds the number of OpenMP threads.
2. The OpenMP tiles have different sizes.
The default tiling issue arises when the number of threads is equal to any multiple of the number of MPI tile rows.

OpenMP Imbalance Type 1
Example: a 10 x 10 grid run with 2 MPI tasks and 8 OpenMP threads per MPI task (figure shows MPI rank 0 and MPI rank 1). Threads #1 and #2 compute 2 OpenMP tiles each: twice as much work as the other threads, plus context-switch overhead.

OpenMP Imbalance Type 2
Example: 4 MPI tasks with 4 OpenMP threads per task (figure shows MPI ranks 0-3). Thread #4 computes twice as many grid cells as the other threads!

OpenMP Good Balancing
Example: an 8 by 8 grid run with 2 MPI tasks and 8 OpenMP threads per task (figure shows MPI ranks 0 and 1). The logical tiling would be a 2 by 4 OpenMP tiling, but WRF does a 3 by 4, creating an unnecessary imbalance. I was able to resolve this issue, and the fix will be included in the next version of WRF (see the sketch below).
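The principle behind the fix can be pictured with a toy tile-chooser that picks a factorization of the thread count so that the number of OpenMP tiles equals the number of threads. This is only a sketch of the idea, not the actual change made to WRF.

```c
#include <stdio.h>

/* Toy balancing rule: choose a rows x cols tile grid whose tile count
 * equals the OpenMP thread count, preferring the most "square"
 * factorization.  Illustration only, not WRF's patched routine. */
static void choose_tiles(int nthreads, int *rows, int *cols)
{
    *rows = 1;
    *cols = nthreads;
    for (int r = 2; r * r <= nthreads; ++r)
        if (nthreads % r == 0) { *rows = r; *cols = nthreads / r; }
}

int main(void)
{
    int rows, cols;
    choose_tiles(8, &rows, &cols);
    /* 8 threads on an 8 x 8 patch -> a 2 x 4 tiling (one 4 x 2 tile per
     * thread) instead of an uneven 3 x 4 tiling. */
    printf("8 threads -> %d x %d OpenMP tiles\n", rows, cols);
    return 0;
}
```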

WRF OpenMP Tile Strategy

What Is an Optimal WRF Case for Xeon Phi?
Xeon Phi initial scaling (figure): Xeon Phi approaches Xeon performance for large workloads per core.

Scaling Grid Size
The Xeon Phi hits memory limits, while balancing stays consistent for symmetric runs; the Xeon hits its performance limit far before the Xeon Phi. Xeon Phi exceeds Xeon performance for more than 30,000 horizontal grid points per CPU/coprocessor.

Xeon Phi Optimization Summary: The Good
- Xeon Phi can be more efficient than the host CPUs in the extreme high-efficiency / slow-time-to-solution region.
- For highly efficient workloads, the low MPI overhead and constant efficiency make it possible to run well-balanced symmetric CPU-coprocessor WRF jobs that are significantly more efficient than running on either homogeneous architecture alone.

Xeon Phi Optimization Summary: The Bad
- WRF strong-scaling performance is significantly worse than on the host-node CPUs.
- WRF's tiling strategies are not well optimized for manycore architectures.
- WRF's (Fortran) array allocation strategies result in much larger memory requirements and limit workloads that would otherwise have high efficiency on Xeon Phi.

Xeon Phi Optimization Summary: The Ugly
Although Xeon Phi could be used for highly efficient WRF simulations, finding the correct:
- problem sizes
- task-to-thread ratios
- tile decompositions
- workload per core
while taking memory limitations into consideration makes Xeon Phi extremely impractical for the vast majority of WRF users.

Why Is It Important to Continue Researching WRF on Xeon Phi?
Core counts will continue to increase in future HPC architectures, which will require better hybrid strategies for our models. Xeon Phi is very representative of these architectures and is a tool for exposing issues that otherwise might not be noticed until further down the road.

Future Work
- Create better MPI+OpenMP tile strategies for low-workload-per-core simulations.
- Better understand the performance issues with heap allocation and overall memory access patterns.
- Assess any other concerns that may hinder WRF performance on future multicore/manycore architectures.

Acknowledgements
- Davide Del Vento (mentor)
- Dave Gill (WRF help)
- Srinath Vadlamani (Xeon Phi / profiling)
- Mimi Hughes (WRF)
- XSEDE / Stampede (project supercomputer usage)
- SIParCS / Rich Loft (internship opportunity)
Thank you all for giving me so much help throughout this process. I am extremely thankful for my experience here at NCAR!

Questions?

Memory Issues for Small Core Counts
- Memory use by WRF variables is multiplied by the number of threads per MPI task (see the sketch below).
- This is not much of an issue for a small number of threads per task.
- Ensure OMP_STACKSIZE is set correctly (the default may be much lower than necessary).
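A hedged C illustration of why the footprint multiplies with the thread count: each OpenMP thread needs its own private scratch storage for the tile it is working on. In WRF these are Fortran automatic arrays that live on the thread stacks, which is why OMP_STACKSIZE matters; the scratch size below is invented.

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define TILE_WORDS (4 * 1024 * 1024)   /* per-thread scratch size, invented */

int main(void)
{
    size_t per_thread = TILE_WORDS * sizeof(double);

    /* Each thread allocates its own copy, so the footprint of one MPI
     * task grows linearly with the number of OpenMP threads.  (WRF's
     * thread-private arrays live on the stack rather than the heap, but
     * the multiplication is the same.) */
    #pragma omp parallel
    {
        double *scratch = malloc(per_thread);
        if (scratch) { scratch[0] = 0.0; free(scratch); }
    }

    printf("approx. scratch footprint: %zu MB x %d threads\n",
           per_thread >> 20, omp_get_max_threads());
    return 0;
}
```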

Memory Issues for Small Core Counts (continued)
- Memory use by WRF variables is multiplied by the number of threads per MPI task.
- Temporary solution: force heap array allocation (-heap-arrays).
- This is extremely slow in compute-bound regions (the slowdown is proportional to the number of threads per task).
- Potential culprits: cache coherency? Repeated temporary array allocation? (See the sketch below.)
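The "repeated temporary array allocation" hypothesis can be pictured with the toy C comparison below: one routine pays a malloc/free on every call inside a hot loop, roughly what forcing heap allocation does to what were cheap automatic arrays, while the other reuses a preallocated buffer. This illustrates the suspected effect only; it is not a measurement of WRF.

```c
#include <stdio.h>
#include <stdlib.h>

#define N     (1 << 16)
#define ITERS 2000

/* Pays a heap allocation and free on every call (analogous to what
 * -heap-arrays does to automatic arrays inside hot routines). */
static double step_heap(void)
{
    double *tmp = malloc(N * sizeof(double));
    double s = 0.0;
    for (int i = 0; i < N; ++i) { tmp[i] = (double)i; s += tmp[i]; }
    free(tmp);
    return s;
}

/* Reuses a buffer supplied by the caller (analogous to stack/static scratch). */
static double step_prealloc(double *tmp)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i) { tmp[i] = (double)i; s += tmp[i]; }
    return s;
}

int main(void)
{
    static double buf[N];
    double a = 0.0, b = 0.0;

    for (int k = 0; k < ITERS; ++k) {
        a += step_heap();
        b += step_prealloc(buf);
    }

    /* Checksums keep the compiler from optimizing the loops away. */
    printf("checksums: %.3e %.3e\n", a, b);
    return 0;
}
```

With many threads per task, contention inside the allocator would compound the per-call cost, which would be consistent with a slowdown that grows with the thread count.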

Forcing Heap Array Allocation