MM5 Optimization Experiences and Numerical Sensitivities Found in Convective/Non-Convective Cloud Interactions Carlie J. Coats, Jr., MCNC

MM5 Optimization Experiences and Numerical Sensitivities Found in Convective/Non-Convective Cloud Interactions
Carlie J. Coats, Jr., MCNC
John N. McHenry, MCNC
Elizabeth Hayes, SGI

Introduction
MM5 optimization for microprocessor/parallel systems
Started from MM5V2.[7,12]-GSPBL
Speedups so far: 1.4 on SGI, 1.9 on Linux/X86, 2.36 on IBM SP
Tiny numerical changes cause gross changes in the output
–but these changes seem to be unbiased
Causative mechanisms include convective triggering
–an inherent problem; it is ill-conditioned in nature
Need to be careful with algorithmic formulations and optimizations
–this will not be fixed simply by improved compiler technology

Optimization for Microprocessor/Parallel
Processor characteristics:
–Pipelining and superscalarity: need lots of independent work
–Hierarchical memory organization with registers and caches
Solutions:
–Data structure transformations
–Logic and loop re-factoring
–Expensive-operation avoidance
–Minimize and optimize memory traffic

Pipelining and Superscalarity
Modern microprocessors try to have multiple instructions in different stages of execution on each FPU or ALU at the same time.
Dependencies between instructions (where one must complete before another can start) stall the system.
Current technology keeps many instructions "in flight" at one time; future processors will keep even more (50+?) in flight.
Standard solutions: need lots of "independent work" to fill the pipelines (sketch below)
–Loop unrolling for vectorizable loops (some compilers can do this)
–Loop jamming, so that there are long loop bodies with lots of independent work (some compilers can do some of this)
–Logic refactoring, so that IFs are outside the loops, not inside (compilers can NOT do this)
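As a minimal illustration of loop jamming (the arrays A, B, C, D and the bound KL are hypothetical, not MM5 code), two loops over the same range are fused so that each iteration carries more independent work:

      ! Before: two short loop bodies; little independent work per iteration
      DO K = 1, KL
         A(K) = B(K) + C(K)
      END DO
      DO K = 1, KL
         D(K) = B(K) * C(K)
      END DO

      ! After jamming: one loop with a longer body; the two statements are
      ! independent of each other, so the pipelines stay fuller
      DO K = 1, KL
         A(K) = B(K) + C(K)
         D(K) = B(K) * C(K)
      END DO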

Caches and Memory Traffic
Memory traffic is a prime predictor of performance
–McCalpin's "STREAM" benchmarks
Want stride-1 data access, especially for "store" sequences (sketch below)
Want small data structures that "live in cache" or (where possible) even scalars that "live in registers"
Parallel cache-line conflicts "can cost 100X performance" (SGI)
Standard solutions:
–Loop unrolling and loop jamming lead to value re-use (some compilers can do some of this)
–Loop refactoring and data structure reorganization (some compilers can do loop refactoring, but none do major data structure reorganization)
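A minimal sketch of the stride-1 point (the array T and the bounds NI, NK are hypothetical): Fortran stores arrays column-major, so the first subscript should vary fastest in the innermost loop.

      ! Stride-1 (cache-friendly): consecutive iterations touch adjacent memory
      DO K = 1, NK
         DO I = 1, NI
            T(I,K) = T(I,K) + DT*TEND(I,K)
         END DO
      END DO

      ! Strided (cache-hostile): each reference jumps NI elements,
      ! so almost every access can miss in cache
      DO I = 1, NI
         DO K = 1, NK
            T(I,K) = T(I,K) + DT*TEND(I,K)
         END DO
      END DO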

Expensive Operations
Use of X**0.5 instead of SQRT(X) (the power form is also less accurate)
Use of divides and reciprocals (examples below)
–we even see examples of X=A/B/C/D in the code, instead of X=A/(B*C*D)
–use the RPS* (reciprocal) variables
–rationalize fractions
EXP(A)*EXP(B) vs. EXP(A+B) (happens in LWRAD)
Repeated calculations of the same trig or log functions (happens in SOUND)
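One-line illustrations of the rewrites listed above (A, B, C, D, X, Y, Z are hypothetical variables, not MM5 names):

      X = A/B/C/D          ! three divides
      X = A/(B*C*D)        ! one divide -- same result up to round-off

      Y = EXP(A)*EXP(B)    ! two EXP calls and a multiply
      Y = EXP(A+B)         ! one EXP call

      Z = X**0.5           ! slower, and less accurate
      Z = SQRT(X)          ! use the library square root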

Logic Re-Factoring
Simplified example adapted from MRFPBL. Original version, with redundant stores and a test inside the inner loop:

      DO K=1,KL
      DO I=1,ILX
          QX(I,K) =QVB(I,J,K)*RPSB(I,J)
          QCX(I,K)=0.
          QIX(I,K)=0.
      END DO
      END DO
      IF ( IMOIST(IN).NE.1 ) THEN
          DO K=1,KL
          DO I=1,ILX
              QCX(I,K)=QCB(I,J,K)*RPSB(I,J)
              IF ( IICE.EQ.1 ) QIX(I,K)=QIB(I,J,K)*RPSB(I,J)
          END DO
          END DO
      END IF

Refactored version: the IF tests are hoisted outside the loops, and each array element is stored exactly once:

      IF ( IMOIST(IN).EQ.1 ) THEN
          DO K=1,KL
          DO I=1,ILX
              QX(I,K) =QVB(I,J,K)*RPSB(I,J)
              QCX(I,K)=0.
              QIX(I,K)=0.
          END DO
          END DO
      ELSE IF ( IICE.NE.1 ) THEN        ! where imoist.ne.1:
          DO K=1,KL
          DO I=1,ILX
              QX(I,K) =QVB(I,J,K)*RPSB(I,J)
              QCX(I,K)=QCB(I,J,K)*RPSB(I,J)
              QIX(I,K)=0.
          END DO
          END DO
      ELSE                              ! imoist.ne.1 and iice.eq.1
          DO K=1,KL
          DO I=1,ILX
              QX(I,K) =QVB(I,J,K)*RPSB(I,J)
              QCX(I,K)=QCB(I,J,K)*RPSB(I,J)
              QIX(I,K)=QIB(I,J,K)*RPSB(I,J)
          END DO
          END DO
      END IF

EXMOISS Optimizations
Inside the (innermost) miter loop:

      RGV(K) =AMAX1( RGV(K)/DSIGMA(K),  RGV(K-1)/DSIGMA(K-1) )*DSIGMA(K)
      RGVC(K)=AMAX1( RGVC(K)/DSIGMA(K), RGVC(K-1)/DSIGMA(K-1) )*DSIGMA(K)

is equivalent to

      DSRAT(K)=DSIGMA(K)/DSIGMA(K-1)    !! K-only pre-calculation
      .....
      RGV(K) =AMAX1( RGV(K),  RGV(K-1)*DSRAT(K) )
      RGVC(K)=AMAX1( RGVC(K), RGVC(K-1)*DSRAT(K) )

Rewrite loop structure and arrays as follows (skeleton below):
–outermost I-loop, enclosing
–sequence of K-loops, then
–miter loop, enclosing internal K-loop
–working arrays subscripted by K only (or scalars, when possible)
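A schematic skeleton of that restructuring (not actual EXMOISS code; NMITER and the loop bounds are illustrative):

      DO K = 2, KL                      ! K-only pre-calculation, done once
         DSRAT(K) = DSIGMA(K)/DSIGMA(K-1)
      END DO
      DO I = 1, ILX                     ! outermost I-loop
         !  ... sequence of K-loops loading column I into K-subscripted
         !      work arrays (or scalars) ...
         DO N = 1, NMITER               ! miter (sub-step) loop
            DO K = 2, KL                ! internal K-loop
               RGV(K)  = AMAX1( RGV(K),  RGV(K-1)*DSRAT(K) )
               RGVC(K) = AMAX1( RGVC(K), RGVC(K-1)*DSRAT(K) )
            END DO
         END DO
         !  ... store column I results back to the 3-D arrays ...
      END DO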

EXMOISS Optimizations, cont'd
Rain-accumulation numerics (sketch below):
–original adds one miter-step of one layer of rain to the 2-D array of cumulative rain totals; serious truncation error for long runs
–optimized version adds up the vertical-column advection-step total in a scalar, then adds that scalar to the cumulative total: better round-off, less memory traffic
New version is twice as fast, with greatly reduced round-off errors
Generates noticeably different MM5 model results
–no evident bias in the changed results
–caused/amplified by interaction with the convective cloud parameterizations?
–see the plots to come
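A sketch of the two accumulation styles (RAINTOT, RAINFALL, and COLRAIN are hypothetical names, not the actual EXMOISS variables):

      ! Original style: many tiny additions directly into the large
      ! 2-D cumulative total -- each one loses low-order bits
      DO K = 1, KL
         RAINTOT(I,J) = RAINTOT(I,J) + RAINFALL(K)
      END DO

      ! Optimized style: sum the column in a scalar, add once --
      ! better round-off behavior and less memory traffic
      COLRAIN = 0.0
      DO K = 1, KL
         COLRAIN = COLRAIN + RAINFALL(K)
      END DO
      RAINTOT(I,J) = RAINTOT(I,J) + COLRAIN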

Other Routines
Routines: SOUND, SOLVE3, EXMOISS, GSPBL, LWRAD, MRFPBL, HADV, VADV
Typical speedup factors for these routines:
–on SGI
–(but 2.54 for GSPBL) on IBM SP
Frequently, the optimized versions have reduced round-off
Some optimizations will improve both vector and microprocessor performance
Side effects: reduced cache footprint in EXMOISS, MRFPBL caused a 5-8% speedup in SOUND, SOLVE3 on the SGI Octane! (less effect on O-2000)

Food for Thought
What does all this, especially the numerical sensitivities, say for future model formulations such as WRF?
–Double-precision-only model? (and best-available values for physics constants!)
–Ensemble forecasts? (These are very easy to achieve with the current MM5: just multiply some state variable by PSA, then by RPSA; see the sketch below.)
–(Most radically) stochastic models that predict cell means and variances instead of deterministic point-values? (By theorems in integral-operator theory, these have better stability and continuity properties than today's deterministic models, but sub-gridscale processes will be a challenge to formulate!)
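A minimal sketch of the PSA/RPSA ensemble trick (the indexing and the choice of QVB as the perturbed variable are illustrative assumptions): multiplying by PSA and then by its reciprocal RPSA is mathematically a no-op, but it perturbs the field at round-off level, which the sensitivities discussed above then amplify into a distinct ensemble member.

      DO K = 1, KL
         DO J = 1, JLX
            DO I = 1, ILX
               QVB(I,J,K) = QVB(I,J,K) * PSA(I,J)    ! multiply ...
               QVB(I,J,K) = QVB(I,J,K) * RPSA(I,J)   ! ... then un-multiply:
            END DO                                   ! same value except for
         END DO                                      ! the last bit or two
      END DO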