Scientific codes on a cluster of Itanium-based nodes
Joseph Pareti, HPTC Consultant, HP Germany
page 2 Overview
- Itanium 2 features for performance
- HP XC 6000, the HP cluster solution based on Linux
- Software development tools and a case study (pfmon)
- hptc/ClusterPack for HP-UX
page 3 HP and Intel together developed Itanium®
- EPIC, the winning approach to 64-bit enterprise computing
- Legacy-free platform, multi-OS support
- Massive resources: 11 execution units, 2 bundles/clock, large register files, 3-tiered cache, TLB, ALAT, RSE, PMU, …
- ILP means the compiler has to extract parallelism through predication, control and data speculation, software pipelining, register rotation, and branch prediction
- Efficient support for modular, object-oriented code
- 95% of peak on DGEMM
- Last but not least: design is only ONE part of the equation (the other part is manufacturing/process prowess)
page 4 Predication
source code        non-optimized code          Itanium code
if (r1)            cmp.eq p1,p2=r1,r0          cmp.ne p1,p2=r1,0;;
                   (p1) br.cond else_clause
  r2 = r3 + r4     add r2=r3,r4                (p1) add r2=r3,r4
                   br end_if
else               else_clause:
  r7 = r6 - r5     sub r7=r6,r5                (p2) sub r7=r6,r5
end if             end_if:
5 cycles, including 2 cycles for a potential branch misprediction (30% of 10 cycles)
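
The effect of the predicated column can be modelled in a few lines of Python. This is only an illustrative sketch, not Itanium code: the function names `branchy` and `predicated` are made up for this example, and the conditional expressions stand in for the hardware gating of the write-back. It shows that both arms are issued without a branch and that the complementary predicates p1/p2 decide which result is kept.

```python
def branchy(r1, r3, r4, r5, r6, r2=0, r7=0):
    """Source-level semantics: exactly one arm of the if/else is executed."""
    if r1:
        r2 = r3 + r4
    else:
        r7 = r6 - r5
    return r2, r7


def predicated(r1, r3, r4, r5, r6, r2=0, r7=0):
    """If-converted form: both arms are issued, predicates gate the result."""
    # cmp.ne p1,p2 = r1,0 produces two complementary predicates.
    p1 = r1 != 0
    p2 = not p1
    # Both operations are computed unconditionally, so no branch is needed ...
    add_result = r3 + r4
    sub_result = r6 - r5
    # ... but only the operation whose predicate is true updates its target.
    r2 = add_result if p1 else r2   # models (p1) add r2=r3,r4
    r7 = sub_result if p2 else r7   # models (p2) sub r7=r6,r5
    return r2, r7
```

For any value of r1 the two functions return the same register pair; the predicated form simply has no branch left to mispredict.
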
page 5 Speculation
Original code:
- control or data dependency (e.g. store)
- original load
- use of load
Optimized code (speculation):
- speculative load
- control or data dependency (e.g. store)
- check for exceptions or memory conflict
- use of load
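
The data-speculation half of this slide can be sketched as a toy model. The class `ToyALAT` and its method names are hypothetical and greatly simplified; they only mimic the idea that a load hoisted above a store is recorded in a table, that a later store to the same address removes the record, and that the check at the original load position redoes the load when the record is gone. Control speculation (deferred exceptions) is not modelled here.

```python
class ToyALAT:
    """Hypothetical, simplified model of an advanced load address table."""

    def __init__(self, memory):
        self.memory = memory      # dict: address -> value
        self.entries = {}         # target register name -> tracked address

    def advanced_load(self, reg, addr):
        """Speculative load hoisted above a possibly conflicting store."""
        self.entries[reg] = addr
        return self.memory[addr]

    def store(self, addr, value):
        """A store drops every table entry that tracks the same address."""
        self.memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check(self, reg, addr, speculative_value):
        """Check at the original load position; reload on a memory conflict."""
        if self.entries.get(reg) == addr:
            return speculative_value   # no conflict: keep the early value
        return self.memory[addr]       # conflict: redo the load


alat = ToyALAT({0x10: 1, 0x20: 2})

early = alat.advanced_load("r8", 0x10)
alat.store(0x20, 99)                       # different address, no conflict
no_conflict = alat.check("r8", 0x10, early)

early = alat.advanced_load("r9", 0x20)
alat.store(0x20, 7)                        # same address, conflict
after_conflict = alat.check("r9", 0x20, early)
```

In the first case the early value is used as is; in the second the check notices the conflict and returns the value written by the store.
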
page 6 Software Pipelining
cycle   iteration 1   iteration 2   iteration 3   iteration 4   iteration 5
X       ld4                                                                   Prolog
X+1                   ld4                                                     Prolog
X+2     add                         ld4                                       Prolog
X+3     st4           add                         ld4
X+4                   st4           add                         ld4
X+5                                 st4           add                         Epilog
X+6                                               st4           add           Epilog
X+7                                                             st4           Epilog
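
The overlap on this slide can be reproduced with a short schedule generator. This is a sketch under stated assumptions: the stage offsets (ld4 at cycle 0, add at cycle 2, st4 at cycle 3 of each iteration) and the one-cycle initiation interval are read off the diagram, and the function names are made up for illustration.

```python
# (operation, cycle offset inside one iteration); offsets are an assumption
# read off the slide: the add waits two cycles for the load result.
STAGES = (("ld4", 0), ("add", 2), ("st4", 3))


def pipelined_schedule(iterations, stages=STAGES, interval=1):
    """Return {cycle: [(iteration, op), ...]} for a software-pipelined loop."""
    schedule = {}
    for it in range(iterations):
        start = it * interval   # a new iteration starts every `interval` cycles
        for op, offset in stages:
            schedule.setdefault(start + offset, []).append((it + 1, op))
    return schedule


def pipelined_cycles(iterations, stages=STAGES, interval=1):
    """Total cycles when iterations overlap."""
    depth = max(offset for _, offset in stages) + 1
    return (iterations - 1) * interval + depth


def sequential_cycles(iterations, stages=STAGES):
    """Total cycles when each iteration finishes before the next one starts."""
    depth = max(offset for _, offset in stages) + 1
    return iterations * depth
```

With these assumptions, `pipelined_cycles(5)` gives the 8 cycles (X to X+7) of the slide, against 20 cycles for `sequential_cycles(5)`, and `pipelined_schedule(5)[3]` lists the three operations that share cycle X+3.
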
page 7 Memory hierarchy
- L1I/L1D cache: 16 KB, 4-way associative, 64-byte line size, 1-cycle latency. L1I (L1D) is dual (quad) ported.
- L2 cache: 256 KB unified, 8-way associative, 128-byte line size, 16 banks; latency is 5 (integer load) to 11 cycles.
  - Itanium supports 4 loads, 2 stores, 2 FMAs per cycle
  - 32 bytes/cycle, 2 cache lines per 8 cycles
  - daxpy can run 32 flops per 9 cycles (the 9th cycle is for lfetch)
  - That means 3555 out of 4000 MFLOPS peak (1 GHz clock)
- L3 cache: 3 to 6 MB, unified, 12-way associative, 128-byte line size, one bank; latency is 12 (integer miss) to 18 (instruction miss) cycles.
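
The daxpy figures on this slide follow from simple arithmetic, which the sketch below spells out. All constants are taken from the slide; the assumption added here is that every daxpy element (y = a*x + y on doubles) costs one FMA, two 8-byte loads and one store. The function and constant names are illustrative only.

```python
CLOCK_MHZ = 1000         # 1 GHz clock, as on the slide
FMA_PER_CYCLE = 2
FLOPS_PER_FMA = 2        # one multiply plus one add
BYTES_PER_DOUBLE = 8
L2_LINE_BYTES = 128


def peak_mflops():
    """Peak rate when both FMA units are busy every cycle."""
    return FMA_PER_CYCLE * FLOPS_PER_FMA * CLOCK_MHZ


def daxpy_block(cycles=8, lfetch_cycles=1):
    """Resource use of one daxpy block: `cycles` of compute plus the lfetch."""
    fmas = FMA_PER_CYCLE * cycles           # one FMA per element
    flops = fmas * FLOPS_PER_FMA
    loads = 2 * fmas                        # x[i] and y[i]
    stores = fmas                           # y[i]
    load_bytes = loads * BYTES_PER_DOUBLE
    return {
        "flops": flops,
        "loads_per_cycle": loads / cycles,
        "stores_per_cycle": stores / cycles,
        "load_bytes_per_cycle": load_bytes / cycles,
        "cache_lines": load_bytes / L2_LINE_BYTES,
        "mflops": flops / (cycles + lfetch_cycles) * CLOCK_MHZ,
    }


block = daxpy_block()
efficiency = block["mflops"] / peak_mflops()
```

The block needs exactly 4 loads and 2 stores per cycle, 32 bytes per cycle and 2 cache lines per 8 cycles, so the load/store limits quoted above are fully used; the extra lfetch cycle is what lowers the rate from the 4000 MFLOPS peak to about 3555 MFLOPS, roughly 89% of peak.
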
page 8 HP XC 6000: scalable HPTC Linux product
- HP XC 6000 is based on rx2600 nodes
- HP-engineered, fully integrated, scalable cluster architecture
- Best-of-breed interconnects (Quadrics for HP XC 6000)
  - Turn-key/factory-integrated
  - HP-engineered scalable system software, with single system view
  - Best-of-breed development software
  - Grid enabled
- Sold with HP standard warranties and service
  - HP product qualification and support
- Extensive regression testing, debugging and field testing
page 9 HP XC node specialization and interconnect(s)
[Figure labels: QsNet interconnect; compute nodes; login & management nodes; I/O and storage nodes]
page 10 HP XC Software Stack
- Red Hat™ Linux Advanced Server (V2.1 today)
- Node specialization: application and service (e.g. I/O) nodes
- Platform LSF (V5.*) integrated with resource management
- Shared user file systems via NFS; Lustre file system (version 2)
- Centralized system installation and management
  - Single shared root file system (system files), which eliminates duplication and propagation
  - Including remote console support
  - Tree-structured boot for very fast scalable booting
- Grid enabled
page 11 File Access in XC V1
[Figure labels: System Files, Local Files, Remote Files; Application Node, Admin Master, Admin Service, External NFS server; access paths: NFS redirect (cache), NFS redirect, NFS, App Local]
page 12 User / application environment: File service
[Figure labels: Service Node, App. Node, File Service, NFS served]
page 13 XC Design Focus: RAS
- Wire once manual failover of service nodes
- Some OS services provide automatic failover
- Application node failure and isolation
- Service operations on administrative network
- Soon: further automation of service node failover
page 14 Development tools: Compilers
- gnu
- Intel C/C++ v6.0
  - good compatibility with gnu
  - OpenMP 1.0
- Fortran: Intel Fortran 95 compiler
  - OpenMP 1.1
  - mixed language support, gnu compatibility
- Cross-platform code generation for IPF, so the development system can be an IA-32 platform, even for an IPF cluster
- Intel compilers benefited from collaboration with HP and Compaq, particularly for IPF
page 15 Development environment
- Debuggers
  - Etnus TotalView
  - Intel ldb
- Performance tools
  - Gnu gprof/prof
  - Intel VTune Performance
  - hp pfmon hardware counter monitor
  - Pallas Vampir event trace analyzer
  - Intel KAI Software Lab tools for OpenMP
- Intel Math Kernel Library (MKL)
- HP's Math Library (MLIB): extremely useful for IPF
- MPI/MPICH
page 16 pfmon "drill down"
- Configurable profiler, reads the PMU, 4 counters active
- All Itanium 2 stalls can be attributed to higher-level mechanisms:
  - I-cache stalls, D-cache stalls, branch mispredictions, RSE save/restore, GR scoreboarding, etc.
- As an example, in most SPECINT jobs I see:
  - Problem no. 1: D-cache stalls
  - Problem no. 2: Branch mispredictions
- Next question:
  - How can we learn more about D-cache stalls?
page 17 pfmon (value)
- IA-64 performance counters
  - Extremely useful to drill down into "problem areas"
  - Issues: few h/w counters may make it time-consuming
  - Coherent information obtained from CPU counters, stall counters, and memory counters
  - Nevertheless: not always immediately obvious how to IMPROVE performance
- Coupling with "code inspection" (source and compiler output) seems natural to make rapid progress
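
The bookkeeping behind such a drill-down can be sketched in Python. This is not pfmon itself and does not use its command line: the helper names, the event labels and the cycle counts below are all placeholders invented for illustration, not measurements. The sketch shows two things from the slides: with only 4 active counters a longer event list needs several runs, and once stall cycles per mechanism are collected they can be ranked by their share of total cycles.

```python
def counter_batches(events, counters=4):
    """Split an event list into groups that fit the active counters (one run each)."""
    return [events[i:i + counters] for i in range(0, len(events), counters)]


def rank_stalls(stall_cycles, total_cycles):
    """Return (mechanism, share of total cycles) pairs, largest share first."""
    shares = {name: cycles / total_cycles for name, cycles in stall_cycles.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Placeholder event labels: nine events need three runs with four counters.
events = [f"event_{n}" for n in range(9)]
runs = counter_batches(events)

# Made-up stall cycle counts per mechanism (NOT measured data).
sample = {
    "I-cache": 50,
    "D-cache": 300,
    "branch misprediction": 150,
    "RSE save/restore": 20,
    "GR scoreboarding": 80,
}
ranking = rank_stalls(sample, total_cycles=1000)
```

With these made-up inputs the ranking starts with D-cache stalls followed by branch mispredictions, which is the pattern the previous slide reports for most SPECINT jobs; the next step would be a second set of runs using D-cache related events only.
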
page 18 hptc/ClusterPack Core Components
- HP ServiceControl Manager (SCM): framework for single point of cluster management
- Platform Computing Clusterware Pro™: cluster workload manager and monitor
- HP Systems Inventory Manager (SIM): check the system consistency of the cluster
- HP Application Restart (AppRS): drive application checkpoint restart in the cluster
- HP Cluster Management Utility Zone: provide simple commands to manage the cluster
- IPFilter/9000
- IP Aliasing Enabler
For further info, please visit
page 19 For more information
- For more information on the hptc Itanium2 hp-ux cluster solution, please visit www.hp.com/go/hptc and click on Itanium cluster technology
- For hptc/ClusterPack information and on-line tutorial, please visit
page 20 Summary
- HP has a strategic focus on IA-64-based systems and software
- HP XC is designed to leverage Itanium application performance and leading-edge technologies, while retaining the ease and comfort of a fully supported HP solution (RAS)
- World-wide service and support for the HPTC market