1 Performance Issues: An Application Programmer's View
John Cownie, HPC Benchmark Engineer
2 Agenda
– Running 64-bit and 32-bit codes under AMD64 (SuSE Linux)
– FPU and memory performance issues
– OS and memory layout (1P, 2P, 4P)
– Spectrum of application needs: memory vs. CPU
– Benchmark examples: STREAM, HPL
– Some real applications
– Conclusions
3 64-bit OS & Application Interaction
32-bit Compatibility Mode
– 64-bit OS runs existing 32-bit applications with leading-edge performance
– No recompile required; 32-bit code is executed directly by the CPU
– The 64-bit OS provides 32-bit libraries and a "thunking" translation layer for 32-bit system calls
64-bit Mode
– Migrate only where warranted, and at the user's pace, to fully exploit AMD64
– A 64-bit OS requires all kernel-level programs & drivers to be ported
– Any program linked or plugged in to a 64-bit program (at the ABI level) must be ported to 64 bits
[Diagram: user/kernel split. A 32-bit application (4GB expanded address space) runs as a 32-bit thread through the translation layer; a 64-bit application (512GB or 8TB address space) runs as a native 64-bit thread on the 64-bit OS with 64-bit device drivers.]
4 Increased Memory for 32-bit Applications
32-bit server, 4 GB DRAM:
– OS & applications share the small 32-bit virtual address space
– The 32-bit OS & applications all share 4GB DRAM
– Leads to small dataset sizes & lots of paging
64-bit server, 12 GB DRAM:
– Each 32-bit app has exclusive use of its own 32-bit VM space (not shared)
– The 64-bit OS can allocate each application a large dedicated portion of the 12GB DRAM
– The OS lives in VM space far above 32 bits (256 TB virtual address space)
– Leads to larger dataset sizes & reduced paging
[Diagram comparing the two memory layouts]
5 AMD64 Developer Tools
GNU compilers
– GCC 3.2.2: 32-bit and 64-bit
– GCC 3.3: optimized 64-bit
PGI Workstation 5.0 beta
– Windows and 64-bit Linux compilers
– Optimized Fortran 77/90, C, C++
– Good flags: -O2 -fastsse

SPECint2000 on a 1.8 GHz AMD Opteron processor (source: http://www.aceshardware.com/):

  Compiler            OS                    Base   Peak
  Intel C/C++ 7.0     Windows Server 2003   1095   1170
  Intel C/C++ 7.0     Linux/x86-64          1081   1108
  Intel C/C++ 7.0     Linux (32-bit)        1062   1100
  GCC 3.3 (64-bit)    Linux/x86-64          1045
  GCC 3.3 (32-bit)    Linux/x86-64           980
  GCC 3.3 (32-bit)    Linux (32-bit)         960

Optimized compilers are reaching production quality.
6 Running 32-bit and 64-bit codes
With 64-bit addressing, memory can now be big (>4 GBytes...):

  boris@quartet4:~> more /proc/meminfo
  total: used: free: shared: buffers: cached:
  Mem: 7209127936 285351936 6923776000 0 45547520 140636160
  Swap: 1077469184 0 1077469184
  MemTotal: 7040164 kB

The OS has both 32-bit and 64-bit libraries:

  boris@quartet4:/usr> ls
  X11 bin games lib local sbin src x86_64-suse-linux
  X11R6 include lib64 pgi share tmp

For gcc, 64-bit addressing is the default; use -m32 for 32-bit. (Don't confuse 64-bit floating-point data operations with address and pointer widths.)
7 Running a 32-bit program

  boris@quartet4:~/c> gcc -m32 -o test_32bit sizeof_test.c
  boris@quartet4:~/c> ./test_32bit
  char is 1
  short is 2
  int is 4
  long is 4
  long long is 8
  unsigned long long is 8
  float is 4
  double is 8
  int * is 4
8 Running a 64-bit program
Pointers are now 64 bits long:

  boris@quartet4:~/c> gcc -o test_64bit sizeof_test.c
  boris@quartet4:~/c> ./test_64bit
  char is 1
  short is 2
  int is 4
  long is 8
  long long is 8
  unsigned long long is 8
  float is 4
  double is 8
  int * is 8
9 Compilers and flags
– Intel icc/ifc: for 32-bit code compiled on a 64-bit OS, pass -Wl,-melf_i386 so the linker uses the 32-bit libraries
– Intel icc/ifc: avoid -xaW (it tests the CPU ID); use -xW to enable SSE
– PGI pgcc/pgf90: the vectorizer generates prefetch instructions, and -Mnontemporal generates streaming-store instructions
– Absoft f90 looks promising
– GNU g77: the front end is limited, but the GNU back end is good
– GNU gcc 3.3 is best (does the gcc33 perf rpm have faster libraries?)
– GNU g++: good common code generator
– GNU gcc 3.2: good
The more compilers the better!
10 PGI Compiler: Additional Features
The Portland Group Compiler Technology plans to bundle useful libraries:
– MPICH: pre-configured libraries and utilities for Ethernet-based x86 and AMD64/Linux clusters
– PBS: Portable Batch System batch-queuing from NASA Ames and MRJ Technologies
– ScaLAPACK: pre-compiled distributed-memory parallel math library
– ACML: the AMD Core Math Library is planned to be included
– Training: tutorials (OSC), exercises, examples and benchmarks for MPI, OpenMP and HPF programming
See www.spec.org for SPEC numbers.
11 Open Source Tools (64-bit)
– ATLAS 3.5.0 developer release (library): optimized BLAS (Basic Linear Algebra Subroutines); http://math-atlas.sourceforge.net/
– Blackdown Java 2 Platform 1.4.2 (Java): SUN Java products ported to Linux by the Blackdown group; http://www.blackdown.com/java-linux/java2-status/jdk1.4-status.html
– GNU binutils (utilities): GNU collection of binary tools, including the GNU linker and GNU assembler; http://www.gnu.org/software/binutils/
– GNU C++ (g++) 3.2, GNU C (gcc) 3.2, GNU C (gcc) 3.3 optimized (compilers): the GNU Compiler Collection is a full-featured ANSI C compiler; http://gcc.gnu.org/
– GNU Debugger (GDB): analysis tool for debugging programs; included with SuSE SLES 8
– GNU glibc 2.2.5, GNU glibc 2.3.2 optimized (C library): the GNU C Library; http://www.gnu.org/software/libc/libc.html
– Other GNU tools: bash, csh, ksh, strace, libtool; included with SuSE SLES 8
– MPICH (library): open-source message-passing interface for Linux clusters
– Perl, Python, Ruby, Tcl/Tk (languages): scripting languages; included with SuSE SLES 8
GNU means "GNU's Not UNIX" and is the primary project of the Free Software Foundation (FSF), a non-profit organization committed to the creation of a large body of useful, free, source-code-available software.
12 32-bit vs 64-bit App Performance
13 BLAS libraries
Three different BLAS libraries support 32-bit and 64-bit code:
1. ACML (includes FFTs)
2. ATLAS
3. Goto
Currently Goto has the fastest DGEMM: ~88% of peak on 1P HPL.
Compare them with BLASBENCH and pick the best one for your application. For FFTs, also consider FFTW.
14 Optimized Numerical Libraries: ACML
AMD and The Numerical Algorithms Group (NAG) jointly developed the AMD Core Math Library (ACML).
ACML includes:
– Basic Linear Algebra Subroutines (BLAS), levels 1, 2 and 3
– A wide variety of Fast Fourier Transforms (FFTs)
– The Linear Algebra Package (LAPACK)
ACML has:
– Fortran and C interfaces
– Highly optimized routines for the AMD64 instruction set
– The ability to address single-, double-, single-complex and double-complex data types
– Availability planned for commercially available OSs
ACML is freely downloadable from www.developwithamd.com/acml
15 DGEMM relative performance
[Chart: K. Goto's DGEMM reaches 88% of peak FPU performance]
Floating Point and Memory Performance
17 Register Differences: x86-64 vs. x86-32
x86-64:
– 64-bit integer registers
– 48-bit virtual addresses
– 40-bit physical addresses
REX (register extensions):
– Sixteen 64-bit integer registers
– Sixteen 128-bit SSE registers
SSE2 instruction set:
– Double-precision scalar and vector operations
– 16x8 and 8x16-way vector MMX operations
– SSE1 was already added with the AMD Athlon MP
[Diagram: the existing x86 register file (EAX..EDI, the 80-bit x87/MMX stack, XMM0-7, EIP) alongside the registers added by x86-64 (RAX and friends widened to 64 bits, R8-R15, XMM8-15)]
18 Floating-point hardware
– 4-cycle-deep pipeline
– Separate multiply and add paths
– 64-bit (double precision): 2 flops/cycle (1 mul + 1 add)
– 32-bit (single precision): 4 flops/cycle (2 muls + 2 adds)
– 2.0GHz clock (1 cycle = 0.5ns) gives a theoretical peak of 4 Gflops double precision
– SSE registers are 128 bits wide, but packed instructions only help single precision
– The pipeline depth and the separate mul and add paths mean that even register-to-register code is helped by loop unrolling
19 AMD Opteron™ Processor Architecture
[Diagram: CPU core with System Request Queue (SRQ) and crossbar (XBAR), integrated memory controller (MCT) to DRAM, and three HyperTransport™ links]
– Coherent HyperTransport™: 3.2 GB/s per direction @ 800 MHz dual data rate, i.e. 6.4 GB/s @ 1600 MT/s data rate
– 128-bit DDR333 memory interface: 5.3 GB/s
20 Main memory hardware
– Dual on-chip memory controllers
– Remote memory is accessed via HyperTransport
– Bandwidth scales with more processors
– Latency is very good at 1P, 2P and 4P (cache probes on 2P and 4P)
– Local memory latency is lower than remote memory latency
– 2P machine: 1 hop (worst case); 4P machine: 2 hops (worst case)
– Memory DIMMs: 333MHz (PC2700) or 266MHz (PC2100)
– Memory banks can be interleaved (BIOS setting)
– Processor memory can be interleaved (BIOS setting)
21 Integrated Memory Controller
Performance:
– Peak bandwidth and latency
– Performance improves by almost 20% compared to the AMD Athlon™ topology

Peak memory bandwidth:

  Memory           64-bit DCT   128-bit DCT
  DDR200 PC1600    1.6 GB/s     3.2 GB/s
  DDR266 PC2100    2.1 GB/s     4.2 GB/s
  DDR333 PC2700    2.7 GB/s     5.33 GB/s

Idle latencies to first data:
– 1P system: <80ns
– 0-hop in DP system: <80ns
– 0-hop in 4P system: ~100ns
– 1-hop in MP system: <115ns
– 2-hop in MP system: <150ns
– 3-hop in MP system: <190ns
22 Integrated Memory Controller: Local versus Remote Memory Access
[Diagram: four processors P0-P3 illustrating 0-hop, 1-hop and 2-hop accesses]
– 0 hop: local memory access
– 1 hop: remote-1 memory access
– 2 hops: remote-2 memory access
– Probe: coherency request between nodes
– Diameter: maximum hop count between any pair of nodes
– Average distance: average hop count between nodes
23 MP Memory Bandwidth Scalability
[Chart]
24 How should the OS allocate memory?
– To maximize local accesses? To get the best bandwidth across all processors?
– Different applications have different needs:
– Scientific MPI codes already have a model of networked 1P machines
– Enterprise codes (databases, web servers) have lots of threads; maybe throughput with scrambled memory is best?
– The choice: SMP kernel plus processor interleaving, or NUMA kernel plus memory-bank interleaving
– The SuSE numactl utility allows a policy choice per process
25 MPI libraries (where do shmem buffers go?)
– Argonne MPICH: compile with gcc -O2 -funroll-all-loops
– Myrinet, Quadrics and Dolphin also have MPI libraries based on MPICH
– On an MP machine, MPI uses shared memory within the box
– Where do MPI message buffers go? Currently one chunk of space is malloc'd for all processors
– OK for 2P, but not so good for a 4P machine (2 hops worst case, plus contention)
– There is scope to improve MPI performance on a 4P NUMA machine with better buffer placement
Synthetic Benchmarks
27 Spectrum of application needs
– Some codes are memory limited (BIG data); others are CPU bound (kernels and SMALL data)
– ALL memory-bound codes like low latency to memory!
– Example from the BLAS libraries:
– BLAS1 vector-vector: memory intensive (STREAM)
– BLAS2 matrix-vector
– BLAS3 matrix-matrix: CPU intensive (DGEMM, HPL)
– A faster CPU will not help memory-bound problems
– PREFETCH can sometimes hide memory latency (on predictable memory access patterns)
28 LMbench (memory latency)
– LMbench has published comparative numbers
– The fastest latency is in a 1P Opteron machine: no HyperTransport cache probes
– Note that HP sometimes reports a "within cache line stride" latency number of 110.6. Opteron's "within cache line stride" latencies are 25ns for 16 bytes and 53ns for 32 bytes.
29 Measured memory latency
– Physical limit: light almost crosses the board in 1ns; at 2.0GHz a clock tick is 0.5ns
– L1 cache: 1.5ns
– L2 cache: 7.7ns
– Main memory: 89ns
Try to hide the big hit of main memory access. Codes with predictable access patterns use PREFETCH (in various flavours) to hide latency.
30 High Performance Linpack (HPL)
– A CPU-bound problem (the memory system is not stressed)
– Solves A x = b using LU factorization
– Results: peak Gflops rate achieved for an NxN matrix
– Almost all time is spent in DGEMM
– Uses MPI message passing
– The larger N the better: fill the memory on each node
– Used in the www.top500.org ranking of supercomputers
– N(half) is the problem size at which half the Gflops rate of Nmax is achieved; a measure of overhead
– The current number-one machine is the Earth Simulator (40 Tflops); Cray Red Storm (10,000 Opterons) has a comparable peak
31 Quartet: 4U 4P AMD Opteron™ MP Processor Platform
32 MPI vs. threaded BLAS?
– BLAS libraries can use thread-level parallelism to exploit an MP node
– MPI can treat the processors of an MP node as separate machines talking via shmem
– Which is best? A NUMA kernel allocates memory locally for each process, but in-the-box MPI on 4P has memory-placement issues
– MPI with single-threaded BLAS performs best with the NUMA kernel
– Mixing OpenMP and MPI is possible and maybe sensible
– Generally a static, upfront decomposition feels better
33 High Performance Linpack: Goto library results

  AMD Opteron™ system                        #P   Rmax (GFlops)   Nmax    N1/2   Rpeak (GFlops)   GFlop/proc   Rmax/Rpeak
  4P 1.8GHz, 2GB/proc PC2700, 8GB total       4   12.06           28000   1008   14.4             3.02         83.8%
  2P 1.8GHz, 2GB/proc PC2700, 4GB total       2    6.22           20617    672    7.2             3.11         86.4%
  1P 1.8GHz, 2GB PC2700                       1    3.14           15400    336    3.6             3.14         87.1%

High-performance BLAS by Kazushige Goto: http://www.cs.utexas.edu/users/flame/goto
GOTO results were obtained with 64-bit SuSE 8.1 Linux Professional Edition with the NUMA kernel and the Myrinet MPIch-gm-1.2.5..10 message-passing library.
34 HPL on a 16P (4x4P) Opteron cluster
Machine: 4x4P 1.8GHz, 2GB/processor, with a single Myrinet and gigabit Ethernet link per box. Goto DGEMM (single-threaded) and MPICH-GM.
– Myrinet, 8 processors: N=40000, N(half)=2252, 81.3% of peak
– Myrinet, 16 processors: N=41288, N(half)=3584, 80.5% of peak
– Ethernet, 16 processors: N=54616, N(half)=5768, 78.1% of peak
A big run to show >4 GBytes/processor working, on a 4P 1.8GHz with 8GB/processor (32 GBytes in all, 266MHz memory):
– 4 processors: N=60144, N(half)=1123, 80.56% of peak
35 STREAM
– Measures sustainable memory bandwidth (in MB/s)
– Simple kernels on large vectors (~50 MBytes)
– The vectors must be much bigger than cache
– Machine balance is defined as: (peak floating-point ops/cycle) / (sustained memory ops/cycle)
36 Compiling STREAM
The PGI compiler recognises OpenMP thread directives and generates prefetch instructions and streaming stores:

  pgf77 -Mnontemporal -O3 -fastsse -Minfo=all -Bstatic -mp -c -o stream_d_f.o stream_d.f
  180, Parallel loop activated; static block iteration allocation
       Generating sse code for inner loop
       Generated prefetch instructions for 2 loads
37 STREAM results
– 4P Opteron 2.0GHz with 333MHz memory
– Compiled with pgf77 -O3 -mp -fastsse -Mnontemporal (64-bit word flops)
– Rates in MBytes/sec
– Triad: ~310 Mflops/processor (about the same as a Cray Y-MP?)
Application Benchmarks
39 Greedy: Travelling Salesman Problem
– Solves the travelling salesman problem; sensitive to memory latency and bandwidth
– An example of the increasing importance of memory performance as the problem grows from small to large
40 OPA: Parallel Ocean Model
– A large scientific code from France
– It uses LAM-MPI
– Compiled in France with ifc 7.1; the binary was run in the USA using the same 32-bit version of LAM-MPI, the ifc runtime and LD_LIBRARY_PATH settings
41 PARAPYR: Direct Numerical Simulation of Turbulent Combustion
– Single-processor performance on a 1P system, SuSE AMD64 Linux NUMA kernel
– Problem size (small): 340x340
– Mflops are double precision (64-bit)
– The identical statically linked binary was run on an Intel P4 2.8GHz (Dell)
– Compiled with ifc 7.1 -r8 -ip
– Opteron 1.8GHz: 420 Mflops; P4 2.8GHz: 351 Mflops
Best Opteron results (vs. beta Absoft and PGI compilers):
– 435 Mflops: ifc -r8 -xW -static
– 437 Mflops: ifc -r8 -O2 -xW -ipo -static -FDO (-FDO pass 1 = -prof_gen, pass 2 = -prof_use)
43 Conclusions
– Use an AMD64 64-bit OS with NUMA support
– 32-bit compiled applications run well
– Know your application's memory or CPU needs, so you know what to expect
– 64-bit compilers need work (as ever)
Competitive processors:
– Itanium 1.5GHz and Xeon 3.0GHz have higher peak FLOPs but relatively poor memory scaling
– Highly tuned benchmarks that make heavy use of floating point and either fit in cache or make little use of the memory system perform better on Itanium
– The Xeon memory system does not scale well
Opteron:
– Excellent memory system performance and scalability, in both bandwidth and latency
– Codes that depend on memory latency or bandwidth perform better on Opteron
– Codes with a mix of integer and floating point perform better on Opteron
– Code that is not highly tuned will likely perform better on Opteron
More upside to come from the MPI 4P memory layout (MPICH2?) and 64-bit compilers.
44
AMD, the AMD Arrow logo, AMD Athlon, AMD Opteron, 3DNow! and combinations thereof, and AMD-8100, AMD-8111, AMD-8131 and AMD-8151 are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Technology Consortium. Microsoft and Windows are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdictions. Pentium and MMX are registered trademarks of Intel Corporation in the U.S. and/or other jurisdictions. SPEC and SPECfp are registered trademarks of Standard Performance Evaluation Corporation in the U.S. and/or other jurisdictions. Other product and company names used in this presentation are for identification purposes only and may be trademarks of their respective companies.