Towards a Computationally Bound Numerical Weather Prediction Model
Daniel B. Weber, Ph.D.
Software Group - 76 SMXG, Tinker Air Force Base
October 6, 2008

Definitions
- Computationally bound: a significant portion of processing time is spent doing floating point operations (flops).
- Memory bound: a significant portion of processing time is spent waiting for data from memory.
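
A compact way to formalize the distinction (standard roofline-style definitions, not from the original slides):

\[ I = \frac{\text{flops performed}}{\text{bytes moved to/from memory}}, \qquad B = \frac{\text{peak flop rate}}{\text{peak memory bandwidth}} \]

A kernel can only be computationally bound when I > B; when I < B it is memory bound no matter how well the arithmetic is scheduled.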

Why should you care about Weather Forecasting and Computational Efficiency?

Because weather forecasts are time-critical!

Benchmarks

The Problem
- Poor efficiency of Numerical Weather Prediction (NWP) models on modern supercomputers degrades the quality of the forecasts delivered to the public.

The Future
- Multicore technology: many cores (individual CPUs) access main memory via one common pipeline.
- This reduces the memory bandwidth available to each core.
- The result will be memory-bound code whose performance is tied to memory speed, not processing speed (yikes!).

Forecast Quality
- Forecast quality is a function of grid spacing/feature resolution (more grid points are better).
- A forecast using 2 times more grid points in each direction requires 16 times more processing power (see the breakdown below)!
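
The factor of 16 decomposes as follows (assuming the time step must halve along with the grid spacing, per the CFL condition):

\[ \underbrace{2 \times 2 \times 2}_{\text{points in } x,\, y,\, z} \ \times\ \underbrace{2}_{\text{time steps}} \ =\ 16 \]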

The Goal
- Use the maximum number of grid points.
- Obtain a computationally bound model.
- Result: produce better forecasts faster!

Tools
- Code analysis:
  - Count arrays to assess memory requirements
  - Count calculations
  - Assess data reuse, etc.
  - Examine solution techniques (spatial and time differencing methods)
- Use PAPI (Performance Application Programming Interface) to track flops, cache misses, etc.
- Define metrics for evaluating solution techniques and predicting results.

Metrics
- Single precision flop-to-memory-bandwidth ratio: peak flop rating / peak main memory bandwidth.
- Actual bandwidth needed to achieve the peak flop rate (simple multiply: a = b*c): 4 bytes/variable * 3 variables/flop * flops/clock * clocks/sec (worked example below).
- Flops needed to cover the time required to load data from memory: # of 3-D arrays * 4 bytes/array * required peak flop bandwidth.
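
As a worked instance of the second metric (the hardware numbers here are illustrative assumptions, not figures from the talk), a core retiring 2 flops per clock at 2.4 GHz would need

\[ \frac{4\ \text{bytes}}{\text{variable}} \times \frac{3\ \text{variables}}{\text{flop}} \times \frac{2\ \text{flops}}{\text{clock}} \times \frac{2.4 \times 10^{9}\ \text{clocks}}{\text{s}} \approx 57.6\ \text{GB/s}, \]

roughly an order of magnitude more than the main memory bandwidth of a typical 2008-era node, which is why a streaming multiply by itself cannot run anywhere near peak.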

Research Weather Model
- 61 3-D arrays, including 11 temporary arrays (ARPS/WRF has ~150 3-D arrays).
- 1200 flops per cell per iteration (1 big/small step).
- 3 time levels required for time dependent variables.
- Split time steps:
  - Big time step (temperature, advection, mixing)
  - Small time step (winds, pressure)
- Result: ~5% of peak performance...
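
A rough consistency check on the ~5% figure (an illustrative estimate, not a calculation from the slides): even if every array were touched only once per cell per iteration, the arithmetic intensity would be

\[ I \approx \frac{1200\ \text{flops}}{61\ \text{arrays} \times 4\ \text{bytes}} \approx 4.9\ \text{flops/byte}, \]

and because the term-by-term solution process sweeps the grid repeatedly, re-loading arrays each time, the effective intensity lands far below this and far below the machine balance defined above.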

Solution Approach
- Compute computational and turbulent mixing terms for all variables except pressure.
- Compute advection forcing for all variables.
- Compute pressure gradient and update variables.

Weather Model Equations (PDEs)
- U, V, W represent winds
- Theta represents temperature
- Pi represents pressure
- T – time
- X – east-west direction
- Y – north-south direction
- Z – vertical direction
- Turb – turbulence terms (what can't be measured/predicted)
- S – source terms: condensation, evaporation, heating, cooling
- D – numerical smoothing
- f – Coriolis force (earth's rotation)
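
The equation images themselves did not survive the transcript; as a hedged sketch in terms of the symbols above (the standard compressible form used by this class of model, not necessarily the exact slide), the u-momentum equation looks roughly like

\[ \frac{\partial u}{\partial t} = -\Big( u \frac{\partial u}{\partial x} + v \frac{\partial u}{\partial y} + w \frac{\partial u}{\partial z} \Big) - c_p\, \theta\, \frac{\partial \pi}{\partial x} + f v + \mathrm{Turb}(u) + D(u), \]

with analogous equations for v and w (w adding buoyancy), a theta equation adding S, and a pressure (pi) equation.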

Code Analysis Results
- Memory usage:
  - 3 time levels for each predicted variable
  - 11 temporary arrays (1/5 of the memory)
- The solution process breaks the calculations into several sections: one term is computed through the entire grid, then the next term.
- Tiling can help improve cache reuse, but it did not make a big difference here (a sketch of the idea follows).
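
For concreteness, a minimal, self-contained sketch of the tiling idea (array names, sizes, and the tile width are illustrative assumptions, not the model's actual code):

      program tiledemo
      integer nx, ny, nz, jblk
      parameter (nx=128, ny=128, nz=64, jblk=16)
      real a(nx,ny,nz), b(nx,ny,nz)
      integer i, j, k, j0
c initialize
      do k=1,nz
        do j=1,ny
          do i=1,nx
            a(i,j,k) = real(i+j+k)
            b(i,j,k) = 0.0
          end do
        end do
      end do
c tiled sweep: both "terms" below revisit the same j-block while it
c is still resident in cache, instead of streaming the whole grid
c through memory twice
      do j0 = 2, ny-1, jblk
        do k = 2, nz-1
          do j = j0, min(j0+jblk-1, ny-1)
            do i = 2, nx-1
c term 1: x-difference
              b(i,j,k) = b(i,j,k) + a(i+1,j,k) - a(i-1,j,k)
            end do
          end do
        end do
        do k = 2, nz-1
          do j = j0, min(j0+jblk-1, ny-1)
            do i = 2, nx-1
c term 2: z-difference, reusing the cached block of a
              b(i,j,k) = b(i,j,k) + a(i,j,k+1) - a(i,j,k-1)
            end do
          end do
        end do
      end do
      print *, 'checksum =', b(2,2,2)
      end

Each j-block of a is still cache-resident when the second sweep revisits it; in the full model, the large number of arrays touched per term limited how much this reuse helped.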

Previous Results
- Cache misses were significant.
- Need to reduce cache misses via:
  - Reduction in overall memory requirements
  - More operations per memory reference
  - Simplifying the code (if possible)

Think Outside the Box
- Recipe:
  - Not getting acceptable results? (~5% of peak)
  - Develop useful metrics.
  - Check the compiler options.
  - Consider other numerical solution methods.
  - Use simple loops to achieve peak performance on an instrumented platform.
  - Then apply the results to the full-scale model.

Revised Code
- New time scheme to reduce the memory footprint (RK3, no time splitting!):
  - Reduces memory requirements by one 3-D array per time dependent variable (shrinks the footprint by 8 arrays)
  - More accurate (3rd order vs. 1st order)
- Combine ALL computations into one loop (or directional loops):
  - Removes the need for the 11 temporary arrays
(A minimal sketch of the resulting update pattern follows.)
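
A minimal sketch of the 2N-storage RK3 update pattern, demonstrated on du/dt = -u (the Williamson (1980) coefficients here are an assumption; the model's rk_constant1/rk_constant2 in the sample loops below play the same two roles). One array holds the state, one holds accumulated forcing, so no third time level is needed:

      program rk3demo
      real u, q, dt, a(3), b(3)
      integer n, step
      data a /0.0, -0.5555556, -1.1953125/
      data b /0.3333333, 0.9375, 0.5333333/
      dt = 0.1
      u  = 1.0
      q  = 0.0
c integrate du/dt = -u from t=0 to t=1
      do step = 1, 10
        do n = 1, 3
c "forcing" loop: rescale the accumulator, add new forcing
c (cf. rk_constant1 in the fused loops)
          q = a(n)*q + dt*(-u)
c "final" loop: one multiply-add per cell
c (cf. rk_constant2 in the final U loop)
          u = u + b(n)*q
        end do
      end do
      print *, 'u(1.0) =', u, '  exact =', exp(-1.0)
      end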

Revised Solution Technique
- Reuses data.
- Reduces intermediate results and loads to/from memory.
- Sample loops:

2nd Order U-Velocity Update

      call PAPIF_flops(real_time, cpu_time, fp_ins, mflops, ierr)
      DO k=2,nz-2       ! scalar limits; u(..,2) is the q's/forcing
      DO j=2,ny-1       ! scalar limits; u(..,1) is the updated/previous u
      DO i=2,nx-1       ! vector limits
      u(i,j,k,2)=-u(i,j,k,2)*rk_constant
c e-w adv
     :  -tema*((u(i+1,j,k,1)+u(i,j,k,1))*(u(i+1,j,k,1)-u(i,j,k,1))
     :       + (u(i,j,k,1)+u(i-1,j,k,1))*(u(i,j,k,1)-u(i-1,j,k,1)))
c n-s adv
     :  -temb*((v(i,j+1,k,1)+v(i-1,j+1,k,1))*(u(i,j+1,k,1)-u(i,j,k,1))
     :       + (v(i,j,k,1)+v(i-1,j,k,1))*(u(i,j,k,1)-u(i,j-1,k,1)))
c vert adv
     :  -temc*((w(i,j,k+1,1)+w(i-1,j,k+1,1))*(u(i,j,k+1,1)-u(i,j,k,1))
     :       + (w(i,j,k,1)+w(i-1,j,k,1))*(u(i,j,k,1)-u(i,j,k-1,1)))
c pressure gradient
     :  -temd*(ptrho(i,j,k)+ptrho(i-1,j,k))*
     :        (pprt(i,j,k,1)-pprt(i-1,j,k,1))
c compute the second order cmix x terms
     :  +temg*(((u(i+1,j,k,1)-ubar(i+1,j,k))-(u(i,j,k,1)-ubar(i,j,k)))
     :       - ((u(i,j,k,1)-ubar(i,j,k))-(u(i-1,j,k,1)-ubar(i-1,j,k))))
c compute the second order cmix y terms
     :  +temh*(((u(i,j+1,k,1)-ubar(i,j+1,k))-(u(i,j,k,1)-ubar(i,j,k)))
     :       - ((u(i,j,k,1)-ubar(i,j,k))-(u(i,j-1,k,1)-ubar(i,j-1,k))))
c compute the second order cmix z terms
     :  +temi*(((u(i,j,k+1,1)-ubar(i,j,k+1))-(u(i,j,k,1)-ubar(i,j,k)))
     :       - ((u(i,j,k,1)-ubar(i,j,k))-(u(i,j,k-1,1)-ubar(i,j,k-1))))
      END DO            ! 60 calculations...
      END DO
      END DO
      call PAPIF_flops(real_time, cpu_time, fp_ins, mflops, ierr)
      print *,'2nd order u'
      write (*,101) nx, ny, nz,
     +   real_time, cpu_time, fp_ins, mflops

60 flops / 7 arrays

4th Order U-Velocity uadv/mix

      call PAPIF_flops(real_time, cpu_time, fp_ins, mflops, ierr)
      DO k=2,nz-2       ! scalar limits; u(..,2) is the q's/forcing
      DO j=2,ny-2       ! scalar limits; u(..,1) is the updated/previous u
      DO i=3,nx-2
      u(i,j,k,2)=-u(i,j,k,2)*rk_constant1(n)
c e-w adv
     :  -tema*((u(i,j,k,1)+u(i+2,j,k,1))*(u(i+2,j,k,1)-u(i,j,k,1))
     :       + (u(i,j,k,1)+u(i-2,j,k,1))*(u(i,j,k,1)-u(i-2,j,k,1)))
     :  +temb*((u(i+1,j,k,1)+u(i,j,k,1))*(u(i+1,j,k,1)-u(i,j,k,1))
     :       + (u(i,j,k,1)+u(i-1,j,k,1))*(u(i,j,k,1)-u(i-1,j,k,1)))
c compute the fourth order cmix x terms
     :  -tema*(((((u(i+2,j,k,1)-ubar(i+2,j,k))
     :          - (u(i+1,j,k,1)-ubar(i+1,j,k)))
     :       -   ((u(i+1,j,k,1)-ubar(i+1,j,k))
     :          - (u(i,j,k,1)-ubar(i,j,k))))
     :       -  (((u(i+1,j,k,1)-ubar(i+1,j,k))
     :          - (u(i,j,k,1)-ubar(i,j,k)))
     :       -   ((u(i,j,k,1)-ubar(i,j,k))
     :          - (u(i-1,j,k,1)-ubar(i-1,j,k)))))
     :       - ((((u(i+1,j,k,1)-ubar(i+1,j,k))
     :          - (u(i,j,k,1)-ubar(i,j,k)))
     :       -   ((u(i,j,k,1)-ubar(i,j,k))
     :          - (u(i-1,j,k,1)-ubar(i-1,j,k))))
     :       -  (((u(i,j,k,1)-ubar(i,j,k))
     :          - (u(i-1,j,k,1)-ubar(i-1,j,k)))
     :       -   ((u(i-1,j,k,1)-ubar(i-1,j,k))
     :          - (u(i-2,j,k,1)-ubar(i-2,j,k))))))
      END DO
      END DO
      END DO
c ... print PAPI results ...

52 flops / 3 arrays

4th Order W wadv/mix Computation

      call PAPIF_flops(real_time, cpu_time, fp_ins, mflops, ierr)
      DO k=3,nz-2       ! limits 3,nz-2
      DO j=1,ny-1
      DO i=1,nx-1
      w(i,j,k,2)=w(i,j,k,2)
c vert adv, fourth order
     :  +tema*((w(i,j,k,1)+w(i,j,k+2,1))*(w(i,j,k+2,1)-w(i,j,k,1))
     :       + (w(i,j,k-2,1)+w(i,j,k,1))*(w(i,j,k,1)-w(i,j,k-2,1)))
     :  -temb*((w(i,j,k-1,1)+w(i,j,k,1))*(w(i,j,k,1)-w(i,j,k-1,1))
     :       + (w(i,j,k+1,1)+w(i,j,k,1))*(w(i,j,k+1,1)-w(i,j,k,1)))
c compute the fourth order cmix z terms
     :  -tema*((((w(i,j,k+2,1)-w(i,j,k+1,1))
     :         - (w(i,j,k+1,1)-w(i,j,k,1)))
     :       -  ((w(i,j,k+1,1)-w(i,j,k,1))
     :         - (w(i,j,k,1)-w(i,j,k-1,1))))
     :       - (((w(i,j,k+1,1)-w(i,j,k,1))
     :         - (w(i,j,k,1)-w(i,j,k-1,1)))
     :       -  ((w(i,j,k,1)-w(i,j,k-1,1))
     :         - (w(i,j,k-1,1)-w(i,j,k-2,1)))))
      END DO            ! 35 calculations...
      END DO
      END DO
      call PAPIF_flops(real_time, cpu_time, fp_ins, mflops, ierr)
      print *,'wadvz'
      write (*,101) nx, ny, nz,
     +   real_time, cpu_time, fp_ins, mflops

35 flops / 2 arrays

Final U Loop

      call PAPIF_flops(real_time, cpu_time, fp_ins, mflops, ierr)
      DO k=2,nz-2       ! complete the u computations
      DO j=2,ny-2
      DO i=2,nx-1
      u(i,j,k,1) = u(i,j,k,1) + u(i,j,k,2)*rk_constant2(n)
      END DO
      END DO
      END DO
      call PAPIF_flops(real_time, cpu_time, fp_ins, mflops, ierr)
      print *,'ufinal'
      write (*,101) nx, ny, nz,
     +   real_time, cpu_time, fp_ins, mflops

2 flops / 2 arrays
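
Taken together, the three instrumented loops show why the higher-order formulation helps. Computed from the flop/array counts above, the flops-per-array ratio rises sharply with order:

\[ \frac{60}{7} \approx 8.6 \ \text{(2nd order)}, \qquad \frac{52}{3} \approx 17.3 \ \text{(4th order)}, \qquad \frac{2}{2} = 1 \ \text{(final update)} \]

The 4th order loop does roughly twice the arithmetic per array touched, foreshadowing the "higher order numerics are almost free" recommendation in the Summary.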

Individual Loop Tests
- Hardwired array bounds (the PGI compiler, version 3.2, did not optimize when using dynamic array allocation).
- Prefetching must be specified.
- Varied array sizes/memory footprint.
- Used 3 loops from the 2nd and 4th order (spatial) solution techniques.
- Compared flops/timings/metrics.

[Benchmark chart] Memory size = 5 arrays * 4 bytes * nx*ny*nz; "Chicago" = Pentium III laptop.

Model Tests
- Current scheme (Klemp-Wilhelmson method) with 2nd and 4th order spatial differencing.
- RK3 scheme: all computations (except temperature) are computed on the small time step (6x more work is performed in this case than in the current scheme).
- Results from various platforms are shown as a function of Mflops and percent of peak.

Test Setup
- 5 sec dtbig, 0.5 sec dtsmall
- 1000 x 1000 x 250 m grid spacing
- 600 second warm bubble simulation
- No turbulence (OK for building-scale flow predictions!)
- Dry dynamics only

Flop Count per Iteration
- 4th order:
  - Current code: 1200 flops (all terms); ~600 flops for these tests
  - Revised code: ~535 flops (without terrain, moisture)
- 2nd order:
  - 260 flops (without terrain, moisture)

Summary
- Notable improvement in % of peak from the reduced memory footprint.
- Longer vector lengths are better.
- BUT: the revised RK3 method still requires more wall clock time (>50%) on a single core; tests are underway to see whether this holds when using multiple cores.
- Apply this method to the advection/mixing part of the existing code to improve performance (e.g., the single-loop results above).
- Recommendation: apply higher order numerics to achieve a higher % of peak (almost free).

Multi-Core Tests
- Compared the current and revised (reduced memory requirements and revised order of computations) weather models.
- MPI versions.
- Timings for 1, 2, 4, and 8 cores per node on Sooner (OU's Xeon-based supercomputer).
- Sooner has two chips/node with 4 cores/chip.
- A zero-slope line is perfect scaling.

Multi-Core Benchmarks

Multi-Core Results Discussion
- Contention for the memory bus extends application run time.
- 2 cores/node is approximately 90% efficient (2-10% overhead from two cores accessing memory).
- 4 cores/node produces 25-75% overhead.
- 8 cores/node produces still larger overhead (2-4x slower than the single-core test) – but it is doing 8x more work.

Multi-Core Summary
- Multi-core performance scales very well at 2 cores/node, but scalability is drastically reduced when using 8 cores/node.
- Contention for memory becomes significant for memory intensive codes at 8 cores/node (OU Sooner HPC system).

Credits:
- Dr. Henry Neeman (OSCER)
- Scott Hill (OU-CAPS, PAPI)
- PSC (David O'Neal)
- NCSA
- Tinker AFB

Thank You!