Scientific codes on a cluster of Itanium-based nodes Joseph Pareti HPTC Consultant HP Germany

page 2 Overview
- Itanium 2 features for performance
- HP XC 6000, the HP cluster solution based on Linux
- Software development tools and a case study (pfmon)
- hptc/ClusterPack for HP-UX

page 3 HP and Intel together developed Itanium® EPIC, the winning approach to 64-bit enterprise computing
- Legacy-free platform, multi-OS support
- Massive resources: 11 execution units, 2 bundles/clock, large register files, 3-tiered cache, TLB, ALAT, RSE, PMU, …
- ILP means the compiler has to extract parallelism, through predication, control and data speculation, software pipelining, register rotation, branch prediction
- Efficient support for modular, object-oriented code
- 95% of peak on DGEMM
- Last but not least: design is only ONE part of the equation (the other part is manufacturing/process prowess)

page 4 Predication

source code:
    if (r1)
        r2 = r3 + r4
    else
        r7 = r6 - r5
    end if

non-optimized code:
    cmp.eq p1,p2=r1,r0
    (p1) br.cond else_clause
    add r2=r3,r4
    br end_if
    else_clause:
    sub r7=r6,r5
    end_if:
    5 cycles, including potential branch misprediction (30% of 10 cycles)

ITANIUM code:
    cmp.ne p1,p2=r1,0;;
    (p1) add r2=r3,r4
    (p2) sub r7=r6,r5
    2 cycles
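The predication slide can be illustrated with a small Python sketch. It is not Itanium code: it only mimics the idea that both arms are issued and a pair of complementary predicates decides which result is kept. The function names are invented for illustration. The 2-cycle base cost for the branching version is an assumption, chosen so that 2 + 30% of 10 cycles reproduces the 5 cycles quoted on the slide.

```python
def branchy(r1, r3, r4, r5, r6, r2=0, r7=0):
    """Branching version: only one arm is executed."""
    if r1:
        r2 = r3 + r4
    else:
        r7 = r6 - r5
    return r2, r7


def predicated(r1, r3, r4, r5, r6, r2=0, r7=0):
    """Predicated version: no branch, both arms are computed."""
    p1 = r1 != 0          # cmp.ne p1,p2 = r1,0
    p2 = not p1           # p2 is the complement of p1
    total = r3 + r4       # (p1) add r2 = r3,r4
    diff = r6 - r5        # (p2) sub r7 = r6,r5
    r2 = total if p1 else r2   # result kept only when p1 is true
    r7 = diff if p2 else r7    # result kept only when p2 is true
    return r2, r7


def expected_cycles(base, mispredict_rate=0.0, penalty=0):
    """Average cost: base cycles plus the expected misprediction penalty."""
    return base + mispredict_rate * penalty


# Assumed base cost of 2 cycles for both versions (see lead-in).
branch_cost = expected_cycles(2, mispredict_rate=0.30, penalty=10)
predicated_cost = expected_cycles(2)
print(branch_cost, predicated_cost)
```

Both functions return the same register values for any input; only the cost model differs, because the predicated form has no branch to mispredict.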

page 5 Speculation

Original code:
    control or data dependency (e.g. store)
    original load
    use of load

Optimized code (speculation):
    speculative load
    control or data dependency (e.g. store)
    check for exceptions or memory conflict
    use of load
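A toy model of data speculation may help here. The sketch below assumes a heavily simplified ALAT: a dictionary from register name to the address it was loaded from, with no size limit and exact address matching. The class, method names, addresses, and values are all hypothetical. The comments name the IA-64 instructions (`ld.a`, `ld.c`) that each step loosely corresponds to.

```python
class Alat:
    """Simplified advanced load address table: register -> address."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, memory, reg, addr):
        """Like ld.a: load early and remember the address for this register."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, memory, addr, value):
        """A store removes every entry that tracks the same address."""
        memory[addr] = value
        stale = [reg for reg, a in self.entries.items() if a == addr]
        for reg in stale:
            del self.entries[reg]

    def check_load(self, memory, reg, addr, speculative_value):
        """Like ld.c: keep the early value if the entry survived, else reload."""
        if self.entries.get(reg) == addr:
            return speculative_value
        return memory[addr]


def run(store_addr):
    """Hoist a load of address 0x10 above a store to store_addr."""
    memory = {0x10: 7, 0x20: 0}
    alat = Alat()
    value = alat.advanced_load(memory, "r8", 0x10)  # speculative load
    alat.store(memory, store_addr, 99)              # the possible conflict
    return alat.check_load(memory, "r8", 0x10, value)  # check before use


no_conflict = run(0x20)  # store goes elsewhere, early value is kept
conflict = run(0x10)     # store hits the same address, value is reloaded
print(no_conflict, conflict)
```

The point of the pattern is that the load latency is hidden in the common case where the store does not alias, while the check keeps the result correct when it does.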

page 6 Software Pipelining

cycle   iteration 1   iteration 2   iteration 3   iteration 4   iteration 5
X       ld4
X+1                   ld4
X+2     add                         ld4
X+3     st4           add                         ld4
X+4                   st4           add                         ld4
X+5                                 st4           add
X+6                                               st4           add
X+7                                                             st4

Prolog: the first cycles, while the pipeline fills. Epilog: the last cycles, while it drains.
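The overlapped schedule on this slide can be regenerated with a few lines of Python. This is a sketch under assumptions read off the diagram: a new iteration starts every cycle, and within one iteration `ld4` issues at offset 0, `add` at offset 2, and `st4` at offset 3. The function and constant names are invented for illustration.

```python
# Issue offset of each operation within one iteration (assumed from the slide).
STAGE_OFFSETS = {"ld4": 0, "add": 2, "st4": 3}


def pipeline_schedule(iterations, offsets=STAGE_OFFSETS, interval=1):
    """Map each cycle (relative to X) to the (operation, iteration) pairs issued."""
    schedule = {}
    for it in range(iterations):
        start = it * interval  # a new iteration starts every `interval` cycles
        for op, offset in offsets.items():
            schedule.setdefault(start + offset, []).append((op, it + 1))
    return dict(sorted(schedule.items()))


schedule = pipeline_schedule(5)

# Kernel cycles are those in which every operation type is issued at once.
kernel = [c for c, ops in schedule.items() if len(ops) == len(STAGE_OFFSETS)]

for cycle, ops in schedule.items():
    label = "kernel" if cycle in kernel else "prolog/epilog"
    print(f"X+{cycle}: {ops} ({label})")
```

With five iterations the loop finishes in 8 cycles instead of the 20 a strictly sequential schedule of four-cycle iterations would take.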

page 7 Memory hierarchy
L1I/L1D cache: 16 KB, 4-way associative, 64-byte line size, 1-cycle latency. L1I is dual-ported, L1D is quad-ported.
L2 cache: 256 KB unified, 8-way associative, 128-byte line size, 16 banks; latency is 5 (integer load) to 11 cycles.
- Itanium supports 4 loads, 2 stores, 2 FMAs per cycle
- 32 bytes/cycle, 2 cache lines per 8 cycles
- daxpy can run 32 flops per 9 cycles (the 9th cycle is for lfetch)
- That means 3555 out of 4000 MFLOPS peak (1 GHz clock)
L3 cache: 3 to 6 MB, unified, 12-way associative, 128-byte line size, one bank; latency is 12 (integer miss) to 18 (instruction miss) cycles.
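The daxpy figures on this slide follow from simple arithmetic, sketched below in Python. It assumes each FMA counts as 2 floating-point operations, so 2 FMAs per cycle at 1 GHz gives the 4000 MFLOPS peak, and 32 flops need 8 cycles of FMA issue plus the one extra cycle the slide attributes to lfetch. The variable and function names are illustrative.

```python
CLOCK_MHZ = 1000        # 1 GHz clock, as stated on the slide
FMA_PER_CYCLE = 2       # 2 FMAs per cycle
FLOPS_PER_FMA = 2       # one multiply plus one add (assumed convention)

# Peak rate: every cycle issues both FMAs.
peak_mflops = CLOCK_MHZ * FMA_PER_CYCLE * FLOPS_PER_FMA


def sustained_mflops(flops, cycles, clock_mhz=CLOCK_MHZ):
    """MFLOPS achieved when `flops` operations complete in `cycles` cycles."""
    return flops * clock_mhz / cycles


# 32 flops would take 8 cycles at peak; the 9th cycle is the lfetch.
fma_cycles = 32 // (FMA_PER_CYCLE * FLOPS_PER_FMA)
daxpy_mflops = sustained_mflops(32, fma_cycles + 1)
efficiency = daxpy_mflops / peak_mflops

print(peak_mflops, int(daxpy_mflops), round(efficiency, 3))
```

The single prefetch cycle per 8 compute cycles is what limits daxpy to 8/9 of peak.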

page 8 HP XC 6000 Scalable HPTC Linux product
- HP XC 6000 based on rx2600 nodes
- HP engineered, fully integrated scalable cluster architecture
- Best of breed interconnects (Quadrics for HP XC 6000)
  - Turn-key/factory-integrated
  - HP engineered scalable system software, with single system view
  - Best of breed development software
  - Grid enabled
- Sold with HP standard warranties and service
  - HP product qualification and support
- Extensive regression testing, debugging and field testing

page 9 HP XC node specialization and interconnect(s)
[Diagram labels: QsNet interconnect; Compute nodes; Login & Mgt nodes; I/O and storage nodes]

page 10 HP XC Software Stack
- Red Hat™ Linux Advanced Server (V2.1 today)
- Node specialization: application and service (e.g. I/O) nodes
- Platform LSF (V5.*) integrated with resource mgmt.
- Shared user file systems via NFS; Lustre file system (version 2)
- Centralized system installation and management
  - Single shared root file system (system files): eliminates duplication and propagation
  - Including remote console support
  - Tree-structured boot for very fast scalable booting
- Grid enabled

page 11 File Access in XC V1
[Diagram labels: System Files, Local Files, Remote Files; Application Node, Admin Master, Admin Service, External NFS server; NFS redirect (cache), NFS redirect, NFS, App Local]

User / application environment: File service
[Diagram labels: Service Node, App. Node, File Service, NFS served]

XC Design Focus: RAS
- Wire-once manual failover of service nodes
- Some OS services provide automatic failover
- Application node failure and isolation
- Service operations on administrative network
- Soon: further automation of service node failover
Service

page 14 development tools: Compilers
GNU
Intel C/C++ v6.0
- good compatibility with GNU
- OpenMP 1.0, Fortran
Intel Fortran 95 compiler
- OpenMP 1.1
- mixed language support, GNU compatibility
- cross-platform code generation for IPF, so the development system can be an IA-32 platform, even for an IPF cluster
- Intel compilers benefited from collaboration with HP and Compaq, particularly for IPF

page 15 development environment
Debuggers
- Etnus TotalView
- Intel ldb
Performance Tools
- GNU gprof/prof
- Intel VTune Performance
- hp pfmon hardware counter monitor
- Pallas Vampir event trace analyzer
- Intel KAI Software Lab tools for OpenMP
Intel Math Kernel Library (MKL)
HP’s Math Library (MLIB): extremely useful for IPF
MPI/MPICH

page 16 pfmon “drill down”
- Configurable profiler, reads PMU, 4 counters active
- All Itanium 2 stalls can be attributed to higher-level mechanisms:
  - I-cache stalls, D-cache stalls, branch mispredictions, RSE save/restore, GR scoreboarding, etc.
- As an example, in most SPECINT jobs I see:
  - Problem no. 1: D-cache stalls
  - Problem no. 2: Branch mispredictions
- Next question:
  - How can we learn more about D-cache stalls?

page 17 pfmon (value)
- IA-64 performance counters
  - Extremely useful to drill down into “problem areas”
  - Issues: the small number of h/w counters may make it time-consuming
  - Coherent information obtained from CPU counters, stall counters, and memory counters
  - Nevertheless: not always immediately obvious how to IMPROVE performance
- Coupling with “code inspection” (source and compiler output) seems natural to make rapid progress
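The remark that few hardware counters make profiling time-consuming can be made concrete: with 4 counters active at a time, a list of events has to be split across several runs of the same job. The Python sketch below only plans those runs and does not invoke pfmon. The event names are placeholders, not real PMU event names, and `plan_runs` is a hypothetical helper.

```python
import math

ACTIVE_COUNTERS = 4  # counters available per run, as stated for pfmon above


def plan_runs(events, counters=ACTIVE_COUNTERS):
    """Split the event list into groups that fit the available counters."""
    return [events[i:i + counters] for i in range(0, len(events), counters)]


# Placeholder event names, for illustration only.
events = [f"EVENT_{n}" for n in range(1, 11)]
runs = plan_runs(events)

# Every event is measured exactly once, so the job is repeated len(runs) times.
assert len(runs) == math.ceil(len(events) / ACTIVE_COUNTERS)

for number, group in enumerate(runs, start=1):
    print(f"run {number}: {', '.join(group)}")
```

Ten events therefore cost three full runs of the application, which is why starting from a few high-level stall categories and drilling down only where needed saves time.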

page 18 hptc/ClusterPack Core Components
Components:
- HP ServiceControl Manager (SCM)
- Platform Computing Clusterware Pro™
- HP Systems Inventory Manager (SIM)
- HP Application Restart (AppRS)
- HP Cluster Management Utility Zone
- IPFilter/9000 IP Aliasing Enabler
Functions:
- Provide simple commands to manage the cluster
- Cluster workload manager and monitor
- Check the system consistency of the cluster
- Framework for single point of cluster management
- Drive application checkpoint restart in the cluster
For further info, please visit

page 19 For more information
- For more information, please visit
- For more information on the hptc Itanium 2 HP-UX cluster solution, please visit www.hp.com/go/hptc and click on Itanium cluster technology
- For hptc/ClusterPack information and on-line tutorial, please visit

page 20 Summary
- HP has a strategic focus on IA-64-based systems and software
- HP XC is designed to leverage Itanium application performance and leading-edge technologies, while retaining the ease and comfort of a fully supported HP solution (RAS)
- World-wide service and support for the HPTC market